Monday, March 26, 2012

Hive, Pig, Scalding, Scoobi, Scrunch and Spark

Comparison of Hadoop Frameworks


I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. There are several high level Hadoop frameworks that make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

  • Pig
  • Scalding
  • Scoobi
  • Hive
  • Spark
  • Scrunch
  • Cascalog

The task was to read log files, join them with other data, and do some statistics on arrays of doubles. Programming this without Hadoop is simple, but it caused me some grief with Hadoop.

This blog post is not a full review, but my first impression of these Hadoop frameworks.


Pig


http://pig.apache.org/

Created by Yahoo!
Language Pig Latin.

Pig is a data flow language / ETL system. It works at a much higher level than programming Hadoop directly in Java.
You work with named tuples. Pig is mildly typed, meaning you can define a type for each field in a tuple, or it will default to byte array.

  • Pig is well documented
  • Pig scripts are concise
  • You can use it both for scripts and for interactive sessions
  • You can embed Pig scripts in Python and run them under Jython
  • Pig is a popular part of the Hadoop ecosystem

Issues

  • Pig Latin is a new language to learn
  • Pig Latin is not a full programming language, only a data flow language
  • You have to write a User Defined Function (UDF) in Java if you want to do something that the language does not support directly
  • Pig Latin does not feel uniform


Scalding


https://github.com/twitter/scalding

Created by Twitter.
Language Scala.

Scalding looks very promising. It had just been open sourced when I looked at it.

What sets Scalding apart from other Scala-based frameworks is that you work with tuples with named fields.
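
As an illustration, here is roughly what a fields-based Scalding job looks like, in the style of the word-count example from the Scalding tutorials (the input path, field names, and splitting logic are just for the example):

    import com.twitter.scalding._

    // Every stage refers to fields by name: 'line, 'word, 'size.
    class LogWordCount(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }          // adds a 'size field with the count
        .write(Tsv(args("output")))
    }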

There is a blog with code example:
http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
It contains numerical code for Hadoop that did not look much harder than the equivalent non-Hadoop code.

Issues

You call Scalding by running a Ruby build script, scald.rb.

This did not work on my Mac, OS X Lion, but it ran under Ubuntu with no problems.

Scalding has little documentation, but it is built on Cascading, which does have good documentation.

Note on aggregate functions in Scalding

Scalding has some predefined aggregate functions such as Sum and Count. Unfortunately, Sum only works on numerical types, and I needed it to work on arrays of doubles.

You can build your own aggregator function using GroupBuilder with a scanLeft or foldLeft operation.
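
A rough sketch of that pattern for summing arrays of doubles element-wise (the field names, input format, and fixed vector length are made up for the example):

    import com.twitter.scalding._

    // Hypothetical input: a TSV with a key column and a comma-separated vector column.
    class VectorSumJob(args: Args) extends Job(args) {
      Tsv(args("input"), ('key, 'vecText))
        .map('vecText -> 'vec) { s: String => s.split(",").map(_.toDouble) }
        .groupBy('key) {
          // Start from a zero vector and add each row's vector element-wise.
          _.foldLeft('vec -> 'vecSum)(Array.fill(3)(0.0)) {
            (acc: Array[Double], v: Array[Double]) => (acc, v).zipped.map(_ + _)
          }
        }
        .write(Tsv(args("output")))
    }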

Workaround to get Scalding to run on OS X Lion


After spending some time I found a workaround for the problems with running Scalding under OS X Lion:


In the build script scripts/scald.rb, set:

COMPILE_CMD="scalac"

In build.sbt, set scalaVersion to the version of Scala you are using:

scalaVersion := "2.9.1"

After that I was able to run the five tutorials that come with Scalding.


Scoobi


https://github.com/NICTA/scoobi
http://www.opennicta.com/home/scoobi

Created by OpenNICTA.
Language Scala.

Scoobi looked powerful and simple.


Scoobi was easy to build with the Scala build system SBT.

Issues

I was having problems running the examples. It turned out that Scoobi has a dependency on Cloudera's Hadoop distribution, version 0.20.2. This is a popular Hadoop distribution.

I could not get it to run on my Mac, which has the Apache Hadoop distribution, so I gave up. I have not revisited it yet.


Hive


http://hive.apache.org/

Created by Facebook.
Language HiveQL, a SQL dialect.

Hive works on tables made of named, typed tuples. It does not check types at write time; you just copy files into the directory that represents a table. Writing to Hive is therefore very fast, but it does check types at read time.

You can run JDBC against Hive.
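
A query over JDBC is ordinary JDBC code. This sketch uses the HiveServer driver class and connection URL from the Hive releases of that time; the logs table and host column are hypothetical:

    import java.sql.DriverManager

    object HiveJdbcQuery {
      def main(args: Array[String]) {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "")
        val stmt = conn.createStatement()
        // Hypothetical table and column names.
        val rs = stmt.executeQuery("SELECT host, count(1) FROM logs GROUP BY host")
        while (rs.next()) {
          println(rs.getString(1) + "\t" + rs.getLong(2))
        }
        rs.close(); stmt.close(); conn.close()
      }
    }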

It was easy to get Hive running and I really liked it.

Issues

A problem was that Hive only understands a few file formats:

  • Text format
  • Tab delimited format
  • Hadoop SequenceFile format

Starting from Hive version 0.9, it has support for the Avro file format, which can be used from different languages.

In order to do sum by group I would have to create a User Defined Aggregation Function (UDAF). It turns out that UDFs and UDAFs are badly documented, and I did not find any examples of how to write them for arrays of doubles.


Spark


http://www.spark-project.org/

Created by Matei Zaharia from UC Berkeley AMP Lab.
Language Scala.

I was very impressed by Spark. It was easy to build with SBT, and it was very simple to write my program. It was trivial to define group-by sum for arrays of doubles, just by defining a vector class with addition, as sketched below.
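
Roughly what that looked like, using the package names Spark had at the time; the log format (a key, then a comma-separated list of doubles) is made up for the example:

    import spark.SparkContext
    import spark.SparkContext._   // brings in reduceByKey and friends for pair RDDs

    // A tiny vector type with addition, so reduceByKey can sum arrays element-wise.
    case class Vec(values: Array[Double]) {
      def +(other: Vec) = Vec((values, other.values).zipped.map(_ + _))
    }

    object LogStats {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "LogStats")
        val pairs = sc.textFile(args(0)).map { line =>
          val cols = line.split("\t")
          (cols(0), Vec(cols(1).split(",").map(_.toDouble)))
        }
        val sums = pairs.reduceByKey(_ + _)   // group-by-key sum of the vectors
        sums.collect().foreach { case (k, v) => println(k + "\t" + v.values.mkString(",")) }
      }
    }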

Issues

I tested my program in local mode and was very happy that I had a workable solution. Then I investigated moving it to a Hadoop cluster. For this, Spark had a dependency on the Apache Mesos cluster manager. Mesos is a thin virtualization layer that Hadoop can run on top of.

It turned out that Spark does not actually run on Hadoop MapReduce. It runs on HDFS and is an alternative to Hadoop MapReduce. Spark can run side by side with Hadoop if you have Apache Mesos installed.

Spark is an interesting framework that can outperform Hadoop for certain calculations. You can use the same code for in-memory calculations and for runs on a big HDFS cluster.

If you have full control over your cluster Spark could be a good option, but if you have to run on an established Hadoop cluster it is very invasive.


Scrunch


https://github.com/cloudera/crunch/tree/master/scrunch

Created by Cloudera.
Language Scala.


Scrunch looked powerful and simple. It is easy to build with SBT.

Issues

Scrunch is dependent on Cloudera's Hadoop 0.20.2 distribution.

The web site describes it as alpha quality, so I did not do much with Scrunch.


Cascalog


https://github.com/nathanmarz/cascalog

Created by Nathan Marz from Twitter.
Language Clojure, a modern Lisp dialect.

Like Scalding, Cascalog is built on Cascading.

Easy to build with the Clojure build system Leiningen.

It can be used as a replacement for Pig. You can run it from the Clojure REPL or run scripts, and you get a full and consistent language.

I tried Cascalog and was impressed. It is a good option if you are working in Clojure.


Hadoop vs. Storm


I had to solve the same log file statistics problem in real-time using the Storm framework and this was much simpler.

Why is Hadoop so hard?

  • Hadoop is solving a hard problem
  • Hadoop is a big software stack with a lot of dependencies
  • Libraries only work with specific versions of Hadoop
  • Serialization is adding complexity, see next section
  • The technology is still not mature

It looks like some of these problems are being addressed. Hadoop should be more stable now that Hadoop 1.0 has been released.


Serialization in Hadoop


Java has a built-in serialization format, but it is not memory efficient. A serialization format for Hadoop has to:
  • Be memory efficient
  • Support compression
  • Support self-healing
  • Support splitting a file into several parts

The Hadoop SequenceFile format has these properties, but unfortunately it does not speak so well with the rest of the world.
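
As an illustration, writing a SequenceFile with the plain Hadoop 1.x API looks roughly like this (the path and the key/value types are just for the example):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.{DoubleWritable, SequenceFile, Text}

    object SequenceFileDemo {
      def main(args: Array[String]) {
        val conf = new Configuration()
        val fs = FileSystem.get(conf)
        // Keys and values must be Writable types, here Text and DoubleWritable.
        val writer = SequenceFile.createWriter(fs, conf, new Path("demo.seq"),
          classOf[Text], classOf[DoubleWritable])
        try {
          writer.append(new Text("latency"), new DoubleWritable(42.0))
        } finally {
          writer.close()
        }
      }
    }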

Serialization adds complexity to Hadoop. One reason Storm is simpler is that it just uses Kryo serialization to send Java objects over the wire.

Apache Avro is a newer file format that does everything Hadoop needs, but it also speaks well with other languages. Avro is supported in Pig 0.9 and should be supported in Hive 0.9.


High level Hadoop frameworks in Java


You do not have to use Scala or Clojure to do high level Hadoop programming; you can stay in Java. Cascading and Crunch are two Java based high level Hadoop frameworks. They are both based on the idea that you set up a Hadoop data flow with pipes.

Functional constructs are clumsy in Java; it is a nuisance but doable. When you deploy code to a Hadoop cluster you have to pack all your library dependencies into one super jar. When you are using Scala or Clojure you also need to package the whole language runtime into this super jar, which adds complexity. So using Java is a perfectly reasonable choice.

Here are two high level Java Hadoop libraries:


Cascading


Both Scalding and Cascalog are built on top of Cascading.

Cascading is well documented. You can write concise Java code for Hadoop.


Crunch


Scrunch is built on top of Crunch.


Conclusion


I liked all of the Hadoop frameworks I tried, but there is a learning curve and I found problems with all of them.

Extract Transform Load

For ETL, Hive and Pig are my top picks. They are easy to use, well supported, and part of the Hadoop ecosystem. In both it is simple to integrate prebuilt MapReduce classes into a data flow, and it is trivial to join data sources. This is hard to do in plain Hadoop.

Cascalog is a serious contender for ETL if you like Lisp / Clojure.

Hive vs. Pig

I prefer Hive. It is based on SQL, so you can use your database intuition, and you can access it through JDBC.

Scala based Hadoop frameworks

They all made Hadoop programming look remarkably close to normal Scala programming.

For programming Hadoop, Scalding is my top pick, since I like the named fields.

Both Scrunch and Scoobi are simple and powerful Scala based Hadoop frameworks. They require Cloudera's Hadoop distribution, which is a very popular distribution.


7 comments:

blogger_sanyer said...

Thank you for taking the time to write up your experience. I am also looking at various tools and Cascading looks very promising.

pvillela said...

Thanks for sharing your broad survey of frameworks that facilitate data analysis on top of Hadoop. This is very helpful.

Unknown said...

Very good comparison! It will definitely help me in my presentation on Hadoop and Big Data. Nice work, thank you!

rmetzger said...

Great comparison. I did not know Scoobi before.

You should also have a look at Stratosphere (stratosphere.eu)
It has both a Java and a native Scala API.
It is probably comparable to Spark but has an optimizer, better support for iterative algorithms and (probably) better support for out-of-core execution.

Since Stratosphere runs with Hadoop YARN, you don't need full control of the cluster.

Unknown said...

Great Summary Sami. Any policy related to re-blogging your content? Let me know and I can address it.

freegnu said...

You have introduced me to some additional tools in the Hadoop toolkit.

Here are some corrections/my two cents.

In pig and hive you can add fields terminated by to the create or load to use csv or whatever.
Hadoop Map Reduce and HDFS are a performance package. Using one without the other dosen't make much sense performance wise. otherwise you are limited to streaming all of your input and output. Cloudera's Spark distribution combines all the layers so you don't have to do the integration work yourself. Cloudera's Manager will do the installs for you and their Hue package will put a web interface bow on it. Cloudera's Impala's performance can take those Hive queries and speed them up by avoiding the map reduce overhead and keeping the data in memory between intermediary steps on the distributed nodes.

Hortonworks' YARN is an alternative to the above, and they push SparkR and MapR as R rather than SQL alternatives to Cloudera's SQL-based Impala and R-based Spark / Apache Mahout. Yes, Spark is a distribution and a package.

I'm just getting started with Hadoop and have only begun to discover the tools available and how to use them.

Some useful tips:
The rank statement combined with the filter statement in Pig is good for getting rid of headers, uniquely identifying rows, and subsetting loaded data.
The latest versions of the Hive and Impala create table statements can have a skip.initial.lines or skip.header.lines property, saving you from having to throw away valuable information and preprocess input.