Interpreting the Data: Parallel Analysis with Sawzall
Someone on the loganalysis mailing list posted a link to a Google Labs paper: Interpreting the Data: Parallel Analysis with Sawzall.
It describes a distributed filtering and aggregation method built on Sawzall, Google’s interpreted language. It is a very interesting paper, although the concept of applying distributed computing resources to do work in parallel is not new. LogLogic has implemented this concept to achieve massive parallelism and performance in log analysis for quite some time now.
The interesting part of the paper is Sawzall itself, a new language designed specifically for simplicity and parallelism.
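To make the filtering and aggregation model concrete, here is a rough Python sketch of the idea as I read it. It is not Sawzall and not Google's implementation: the record layout, the table names (requests_by_status, bytes_ok) and the helper functions are all made up for illustration, and a thread pool stands in for a cluster of machines. Each record is examined in isolation and emits values to named tables; the tables are then merged with an operation (summing) that does not care about order.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log records: (status_code, bytes_sent). Not taken from the paper.
RECORDS = [(200, 512), (404, 0), (200, 2048), (500, 128), (200, 1024), (404, 0)]


def process_record(record):
    """Filter phase: look at ONE record in isolation and emit (table, key, value)."""
    status, size = record
    emitted = [("requests_by_status", status, 1)]
    if status == 200:  # filter: only successful responses count toward bytes
        emitted.append(("bytes_ok", "total", size))
    return emitted


def process_shard(shard):
    """Run the filter over one shard of records and pre-aggregate locally."""
    tables = {"requests_by_status": Counter(), "bytes_ok": Counter()}
    for record in shard:
        for table, key, value in process_record(record):
            tables[table][key] += value
    return tables


def merge(partials):
    """Aggregation phase: sums are commutative and associative,
    so the order in which shard results arrive does not matter."""
    merged = {"requests_by_status": Counter(), "bytes_ok": Counter()}
    for partial in partials:
        for table, counter in partial.items():
            merged[table].update(counter)  # Counter.update adds counts
    return merged


def run(records, num_shards=3):
    # Split the input into shards; threads stand in for separate machines here.
    shards = [records[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return merge(pool.map(process_shard, shards))


result = run(RECORDS)
print(dict(result["requests_by_status"]), result["bytes_ok"]["total"])
```

Because the per-record function never sees another record and the merge is order-independent, the number of shards can change without changing the answer, which is the property that makes this style of analysis easy to spread across many machines.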
At first I didn’t understand why they couldn’t have built Sawzall as a library for an existing language such as Perl or Python. After some discussion with a Googler, I am somewhat convinced that there might be a good reason for a new language, the main one being parallelism: most languages aren’t designed from the ground up for writing and executing programs in parallel.
However, I have to nitpick the performance examples given in the paper. The benchmark test cases are all CPU-bound, yet earlier in the paper the authors say that the applications for this language are mostly I/O-bound. It would make more sense to include some I/O-bound examples and show that Sawzall’s performance advantage still holds.
Another question I have is how much Sawzall relies on GFS. I am assuming that the parallel execution of Sawzall depends on many of the GFS features, but I have no basis for that.

