Monday, April 1, 2013

Akka vs. Finagle vs. Storm

Akka, Finagle and Storm are three newer open source frameworks for distributed, parallel and concurrent programming. They all run on the JVM and work well with Java and Scala.

They are very useful for many common problems:

  • Real-time analytics
  • Complex websites with different inputs and outputs
  • Finance
  • Multiplayer games
  • Big data


Akka, Finagle and Storm are all elegant solutions optimized for different problems. It can be confusing which framework you should use for which problem; I hope I can clarify this.


The 30-second history of parallel programming

Parallel / concurrent programming is hard. It had rudimentary support in C and C++ in the 1990s.

In 1995 Java made basic concurrent programming on one machine much simpler by adopting the monitor primitive. Still, if you had more than a few threads you could easily get deadlocks, and it did not solve the bigger problem of spreading a computation over many machines.

The growth of big data created a strong demand for better frameworks.


MapReduce and Hadoop 

Hadoop, the open source version of Google's MapReduce, is the best known framework for distributed parallel programming.
It does batch computation by translating an algorithm into a sequence of map and reduce steps.
The map step is fully parallel.
The reduce step collects the results from the map step.
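
To make the two steps concrete, here is word counting written as a map step and a reduce step over plain Scala collections. This is only a conceptual sketch of the model, not Hadoop API code.

    // Word count as a map step and a reduce step, in plain Scala.
    val lines = Seq("to be or not to be", "to do is to be")

    // Map step: fully parallel, each line becomes (word, 1) pairs.
    val mapped = lines.flatMap(_.split("\\s+").map(word => (word, 1)))

    // Reduce step: collect the mapped pairs by key and sum the counts.
    val counts = mapped.groupBy(_._1).map { case (word, pairs) =>
      (word, pairs.map(_._2).sum)
    }

    counts.foreach { case (word, n) => println(s"$word -> $n") }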

Hadoop has a steep learning curve. You should only use it when you have no other choice. Hadoop has a heavy stack with a lot of dependencies.

My post Hive, Pig, Scalding, Scoobi, Scrunch and Spark describes simpler frameworks built on top of Hadoop, but they are still complex.

Hadoop has long response times and is not suited for real-time responses.

The Akka, Finagle and Storm frameworks are all easier to use than Hadoop and are suited for real-time responses.


Storm

Storm was created by Twitter and open sourced in 2011. It is written in Clojure and Java, but it works well with Scala. It is well suited for doing statistics and analytics on massive streams of data.

Storm describes streaming computation very simply: you build a graph of your computation with input data sources called spouts at the top, and below them computation nodes called bolts that can depend on any spout or bolt above them; cycles are not allowed. The graph is called a topology.

Features


  • Storm deals with communication between machines and bolts
  • Consistent hashing to spread computation to the right instance of a bolt
  • Error recovery after hardware or network failures
  • Storm does not handle computations that fail due to inherent errors in the code well
  • Can do analytics on Twitter scale input, literally
  • You can create a bolt in a non-JVM language as long as it talks Thrift
  • Built-in support for Ruby, Python, and Fancy

Word counter example in Storm

The hello world example for Hadoop is to count word frequencies in a big expanding text corpus.

This is a word-counting topology in Storm (a code sketch of the wiring follows the list):

  • Input spout: a stream of texts
  • Tokenization bolt: you tell Storm how many instances of this you want; they can run either in local mode on one machine or in distributed mode on several machines.
  • Counting bolt: this too can be split over several machines.
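
A minimal sketch of how this topology could be wired up with Storm's TopologyBuilder follows. The spout and bolt classes (TextSpout, TokenizerBolt, CounterBolt) are hypothetical placeholders for the components in the list; only the wiring calls are Storm API.

    import backtype.storm.topology.TopologyBuilder
    import backtype.storm.tuple.Fields

    val builder = new TopologyBuilder

    // Input spout: a stream of texts.
    builder.setSpout("texts", new TextSpout)

    // Tokenization bolt: four instances, tuples spread randomly over them.
    builder.setBolt("tokenizer", new TokenizerBolt, 4)
      .shuffleGrouping("texts")

    // Counting bolt: fieldsGrouping hashes on the "word" field, so the same
    // word always goes to the same instance and can be counted locally.
    builder.setBolt("counter", new CounterBolt, 4)
      .fieldsGrouping("tokenizer", new Fields("word"))

    val topology = builder.createTopology()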


Spam filter topology in Storm

Let me give an outline of how to make a statistical spam filter using simple natural language processing. This can be used on emails, postings or comments.

There are two data paths:

Training path

This path takes messages that have already been classified as spam or ham.

  1. Input spout: a tuple of the text and its status: spam or ham
  2. Tokenizer bolt. Input: spout 1
  3. Word frequency bolt. Input: bolt 2
  4. Bigram finder bolt. Input: bolts 2 and 3
  5. Most important word and bigram selector bolt. Input: bolts 2, 3 and 4
  6. Feature extractor bolt. Input: bolt 5
  7. Bayesian model bolt. Input: bolts 3, 4 and 6

Categorizer path

The unprocessed messages are sent to the Bayesian model bolt so they can be categorized as spam or ham.

  • Input spout: a tuple of the text and an id for the message
  • Tokenizer bolt
  • Feature extractor bolt: extracts the optimal feature set of words and bigrams. Bolt 6 from above.
  • Bayesian model bolt. Bolt 7 from above.
  • Result bolt: a tuple of the message id and the probability that it is spam


This spam filter is quite a complex system that fits well with Storm.
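
As a sketch, the training path could be wired with Storm's TopologyBuilder as below. All spout and bolt classes are hypothetical placeholders for the numbered components above; the point is that a bolt can subscribe to several upstream components.

    import backtype.storm.topology.TopologyBuilder

    val builder = new TopologyBuilder
    builder.setSpout("labeled", new LabeledMessageSpout)    // 1: (text, spam or ham)
    builder.setBolt("tokens", new TokenizerBolt)            // 2
      .shuffleGrouping("labeled")
    builder.setBolt("wordFreq", new WordFrequencyBolt)      // 3
      .shuffleGrouping("tokens")
    builder.setBolt("bigrams", new BigramFinderBolt)        // 4: inputs 2 and 3
      .shuffleGrouping("tokens")
      .shuffleGrouping("wordFreq")
    builder.setBolt("selector", new FeatureSelectorBolt)    // 5: inputs 2, 3 and 4
      .shuffleGrouping("tokens")
      .shuffleGrouping("wordFreq")
      .shuffleGrouping("bigrams")
    builder.setBolt("features", new FeatureExtractorBolt)   // 6
      .shuffleGrouping("selector")
    builder.setBolt("bayes", new BayesianModelBolt)         // 7: inputs 3, 4 and 6
      .shuffleGrouping("wordFreq")
      .shuffleGrouping("bigrams")
      .shuffleGrouping("features")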


Akka

Akka v0.6 was released in 2010. Akka is now at v2.2 and is maintained by Typesafe, the company that maintains Scala, and it is part of their stack. It is written in Scala and Java.

Features


  • Akka's actor model is inspired by Erlang, a language developed for telecom equipment that had to be highly parallel and fault tolerant.
  • Akka gives you very fine grained parallel programming: you split your task into smaller sub tasks and have many actors compute these tasks and communicate the results.
  • Works well for two-way communication.
  • Actors communicate through mailboxes, and each actor processes its messages single-threaded.
  • The motto of the Akka team is "let it crash": fail early and let a supervisor recover.
  • If an actor fails, Akka has different supervision strategies for recovery.
  • Actors on the same machine can share access to an object, e.g. a cache.
  • Communication between actors is simple: sending a request reads like a method call and gives you a Future of the reply value.
  • Actors are very lightweight; you can have 2.7 million actors per GB of heap.
  • There is an execution context with a shared thread pool.
  • This works well for IO-bound computations.
  • Turns computations into Futures: monadic, composable, asynchronous values.
  • Incorporates dataflow variables from the research language Oz.
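
To make this concrete, here is a minimal sketch of an actor and the ask pattern, assuming the Akka 2.x actor API; the WordCounter actor and its messages are made up for illustration.

    import akka.actor._
    import akka.pattern.ask
    import akka.util.Timeout
    import scala.concurrent.Future
    import scala.concurrent.duration._

    // An actor processes one message at a time from its mailbox,
    // so no locking is needed inside it.
    class WordCounter extends Actor {
      def receive = {
        case text: String => sender ! text.split("\\s+").length
      }
    }

    object Demo extends App {
      val system = ActorSystem("demo")
      val counter = system.actorOf(Props[WordCounter], "counter")

      // Fire-and-forget message send.
      counter ! "hello akka"

      // Request/reply with ask: reads like a call but returns a Future.
      implicit val timeout = Timeout(5.seconds)
      val reply: Future[Int] = (counter ? "count these four words").mapTo[Int]

      import system.dispatcher // the execution context with a shared thread pool
      reply.foreach(n => println(s"word count: $n"))
    }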

Use cases


  • Agents
  • Simulations
  • Social media 
  • Multiplayer game 
  • Finance
  • Online betting


Finagle

Finagle was created by Twitter and open sourced in 2011. It is written in Scala and Java.

Finagle lets internal and external heterogeneous services communicate. It implements asynchronous Remote Procedure Call (RPC) clients and servers.
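
As a minimal sketch, a Finagle server is just a Service, a function from a request to a Future of a response, and a client has the same type pointed at a remote address. This assumes Finagle's HTTP API; the exact builder and package names have shifted between Finagle versions.

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    object EchoDemo extends App {
      // An echo service: Request => Future[Response].
      val echo = new Service[Request, Response] {
        def apply(req: Request): Future[Response] = {
          val rep = Response()
          rep.contentString = req.getParam("msg", "hello")
          Future.value(rep)
        }
      }

      // Serve it over HTTP and call it with an asynchronous HTTP client.
      val server = Http.serve(":8080", echo)
      val client: Service[Request, Response] = Http.newService("localhost:8080")

      val rep: Future[Response] = client(Request("/", "msg" -> "ping")) // returns immediately
      rep.onSuccess(r => println(r.contentString))

      Await.ready(server)
    }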

Features


  • It is simple to create servers and clients for your application
  • Remote services calling each other
  • Speaks different protocols: HTTP, Thrift, Memcached
  • Failover between servers
  • Is good for having one service query other services 
  • Uniform error policy
  • Uniform limits on retry
  • Two way communication
  • Load balancing
  • Turns computations into Futures: monadic, composable, asynchronous values.

Use cases

  • Complex websites using different services and supporting many different protocols
  • Web crawler


Side by side comparisons


Akka vs. Finagle

Akka and Finagle can both do two-way communication.

Akka is great as long as everything lives in one actor system or in multiple remote actor systems. It is very simple to have them communicate.

Finagle is much more flexible if you have heterogeneous services and you need different protocols and fallbacks between servers.


Akka vs. Storm

Akka is better for actors that talk back and forth, but you have to keep track of the actors, make strategies for setting up different actor systems on different servers, and make asynchronous requests to those actor systems. Akka is more flexible than Storm, but there is also more to keep track of.

Storm is for computations that move from upstream sources to different downstream sinks. It is very simple to set this up in Storm so it runs computations over many distributed servers.

Finagle vs. Storm

Finagle and Storm both handle failover between machines well.

Finagle does heterogeneous two-way communication.

Storm does homogeneous one-way communication, but does it in a simple way.



Serialization of objects

Serialization of objects is a problem for all of these frameworks, since you have to send objects between different machines, and the default Java serialization has problems:

  • It is very verbose
  • It does not work well with Scala's singleton objects
Here are a few Scala open source libraries:





My experience

It has been hard for me to decide what framework is the better fit in a given situation, despite a lot of reading and experimenting.

I started working on an analytics program and Storm was a great fit and much simpler than Hadoop.

I moved to Akka since I needed to:
  • Incorporate more independent sub systems all written in Scala
  • Make asynchronous calls to external services that could fail
  • Set up on-demand services

Now I have to integrate this analytics program with other internal and external services, some in Scala and some in other languages, so I am considering whether Finagle might be a better fit, despite Akka being easier in a homogeneous environment.


Afterthought

When I went to school, parallel programming was a buzzword. The reasoning was:

Parallel programming works like the brain and it will solve our computational problems

We did not really know how you would program it. Now it is finally coming of age.

Akka, Finagle, Storm and MapReduce are different elegant solutions to distributed parallel programming. They all use ideas from functional programming.