
Monday, September 23, 2013

Big Data: What Worked?

"Big data" created an explosion of new technologies and hype: NoSQL, Hadoop, cloud computing, highly parallel systems and analytics.

I have worked with big data technologies for several years. It has been a steep learning curve, but lately I have had more success stories.

This post is about the big data technologies I like and continue to use. Big data is a big topic. These are some highlights from my experience.

I will relate big data technologies to modern web architecture with predictive analytics and raise the question:

What big data technologies should I use for my web startup?

Classic Three Tier Architecture

For a long time software development was dominated by the three tier architecture / client server architecture. It is well described and conceptually simple:
  • Client
  • Server for business logic
  • Database
It is straightforward to figure out what computations should go where.

Modern Web Architecture With Analytics

The modern web architecture is not nearly as well established. It is more like an 8-tier architecture with the following components:
  • Web client
  • Caching
  • Stateless web server
  • Real-time services
  • Database
  • Hadoop for log file processing
  • Text search
  • Ad hoc analytics system
I was hoping that I could do something closer to the 3-tier architecture, but the components have very different features. Kicking off a Hadoop job from a web request could adversely affect your time to first byte.

A problem with the modern web architecture is that a given calculation can be done in many of those components.

Architecture for Predictive Analytics

It is not at all clear in which component predictive analytics should be done.

First you need to collect user metrics. In which components can you do this?
  • Web servers, storing the metric in Redis / caching (sketched below)
  • Web servers, storing the metric in the database
  • Real-time services aggregating the user metric
  • Hadoop jobs run on the log files
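A minimal sketch of the first option, using the Jedis client for Redis; the server address and the per-user page view counter are just examples:

    import redis.clients.jedis.Jedis

    // Assumes a Redis server on localhost; the key name is hypothetical
    val jedis = new Jedis("localhost")
    jedis.incr("user:42:pageviews")  // atomic increment, safe under concurrent web requests
    jedis.disconnect()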

User metrics are needed for predictive analytics and machine learning. Here are some scenarios:
  • If you are doing simple popularity-based predictive analytics, this can be done in the web server or a real-time service (a small sketch follows this list).
  • If you use a Bayesian bandit algorithm, you will need a real-time service for that.
  • If you recommend based on user similarity or item similarity, you will need Hadoop.
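Here is a sketch of the first scenario in plain Scala; the View class and the item ids are made up, and the counting could live in a web server or a real-time service:

    // Hypothetical popularity-based recommendation: the n most viewed items
    case class View(userId: Long, itemId: Long)

    def topItems(views: Seq[View], n: Int): Seq[Long] =
      views.groupBy(_.itemId)
           .mapValues(_.size)
           .toSeq
           .sortBy { case (_, count) => -count }
           .take(n)
           .map { case (itemId, _) => itemId }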

Hadoop

Hadoop is a very complex piece of software for handling data that is too big to fit on one computer and therefore cannot be processed by conventional software.

I compared different Hadoop libraries in my post: Hive, Pig, Scalding, Scoobi, Scrunch and Spark.

Most developers that have used Hadoop complain about it. I am no exception. I still have problems with Hadoop jobs failing due to errors that are hard to diagnose. Still, I have generally been a lot happier with Hadoop lately. I only use Hadoop for big custom extractions or calculations from log files stored in HDFS, and I do my Hadoop work in Scalding or HIVE.

The Hadoop library Mahout can calculate user recommendations based on user similarity or item similarity.

Scalding

Hadoop code in Scalding looks a lot like normal Scala code. The scripts I write are often just 10 lines and read like my other Scala code. The catch is that you need to be able to write idiomatic functional Scala code.
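As an example, here is the classic word count job, close to the Scalding tutorial; input and output paths are passed as command line arguments:

    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }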

HIVE

HIVE makes it easy to extract and combine data from HDFS. You just write SQL, after setting up a directory in HDFS with a table structure.

Real-time Services

Libraries like Akka, Finagle and Storm are good for long-running stateful computations.
It is hard to write correct, highly parallel code that scales to multiple machines using normal multithreaded programming. For more details see my blog post: Akka vs. Finagle vs. Storm.

Akka and Spray

Akka is a simple implementation of the actor model, taken from the Erlang language. In Akka you have a lot of very lightweight actors that can share a thread pool. They do not block on shared state but communicate by sending immutable messages.

One reason that Akka is a good fit for real-time services is that you can do varying degrees of loose coupling and all services can talk with each other.

It is hard to change from traditional multithreaded programming to using the actor model. There are just a lot of new actor idioms and design patterns that you have to learn. At first the actor model seems like working with a sack of fleas. You have much less control over the flow due to the distributed computation.

Spray makes it easy to put a web or RESTful interface on top of your service. This makes it easy to connect your service with the rest of the world. Spray also has the best Scala serialization system I have found.
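A minimal sketch of a Spray route, based on the spray-routing quick-start; the service name and port are just examples:

    import akka.actor.ActorSystem
    import spray.routing.SimpleRoutingApp

    object PingService extends App with SimpleRoutingApp {
      implicit val system = ActorSystem("ping-service")

      // GET /ping answers with "pong"
      startServer(interface = "localhost", port = 8080) {
        path("ping") {
          get {
            complete("pong")
          }
        }
      }
    }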

Akka is well suited for: E-commerce, high frequency trading in finance, online advertising and simulations.

Akka in Online Advertising

Millions of users are interacting with fast changing ad campaigns. You could have actors for:
  • Each available user
  • Each ad campaign
  • Each ad
  • Clustering algorithms
  • Each cluster of users
  • Each cluster of ads
Each actor evolves over time and can notify and query the other actors.
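A small sketch of one of these actors; the message types and the counting logic are hypothetical:

    import akka.actor.{Actor, ActorSystem, Props}

    // Immutable messages
    case class AdShown(userId: Long)
    case class AdClicked(userId: Long)

    class AdCampaignActor extends Actor {
      var impressions = 0L
      var clicks = 0L

      def receive = {
        case AdShown(_)   => impressions += 1
        case AdClicked(_) => clicks += 1
      }
    }

    object AdSystem extends App {
      val system = ActorSystem("ads")
      val campaign = system.actorOf(Props[AdCampaignActor], "campaign-1")
      campaign ! AdShown(42L)  // fire and forget, no blocking
    }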

NoSQL

There are a lot of options, with no query standard:
  • Cassandra
  • CouchDB
  • HBase
  • Memcached
  • MongoDB
  • Redis
  • SOLR
I will describe my initial excitement about NoSQL, the comeback of SQL databases and my current view on where to use NoSQL and where to use SQL.

MongoDB

MongoDB was my first NoSQL technology. I used it to store structured medical documents.

Creating a normalized SQL database that represents a structured data format is a sizable task, and you can easily end up with 20 tables. It is hard to insert a structured document into the database in the right order so that foreign key constraints are satisfied. LINQ to SQL helped with this, but it was slow.

I was amazed by MongoDB's simplicity:
  • It was trivial to install
  • It could insert 1 million documents very fast
  • I could use the same Python NLP tools for many different types of documents
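To make the contrast with a normalized schema concrete, here is a sketch of inserting a nested document with Casbah; the collection and field names are made up:

    import com.mongodb.casbah.Imports._

    val coll = MongoClient()("medical")("documents")

    // The whole structured document goes in as one nested object,
    // with no foreign keys or insertion order to worry about
    val doc = MongoDBObject(
      "title"    -> "Discharge summary",
      "sections" -> MongoDBList(
        MongoDBObject("heading" -> "Diagnosis",  "text" -> "..."),
        MongoDBObject("heading" -> "Medication", "text" -> "...")
      )
    )

    coll.insert(doc)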

I felt that SQL databases were so 20th century.

After some use I realized that interacting with MongoDB from Scala was not as easy. I tried the libraries Subset and Casbah.
I also realized that it is a lot harder to query data from MongoDB than from a SQL database, both in syntax and expressiveness.
Recently SQL databases have added JSON as a data type, taking away some of MongoDB's advantage.

Today I use SQL databases for curated data, but MongoDB for ad hoc structured document data.

Redis

Redis is an advanced key value store that mainly lives in memory but is backed up to disk. Redis is a good fit for caching. It has some specialized operations (sketched after the list):
  • Simple to age out data
  • Simulates pub sub
  • Atomic update increments
  • Atomic list append
  • Set operations
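A short sketch of these operations with the Jedis client; the key names are made up:

    import redis.clients.jedis.Jedis

    val jedis = new Jedis("localhost")

    jedis.incr("page:home:hits")              // atomic update increment
    jedis.expire("page:home:hits", 3600)      // age out the key after an hour
    jedis.rpush("events:recent", "login:42")  // atomic list append
    jedis.sadd("users:online", "42", "43")    // set operations
    jedis.sinter("users:online", "users:paying")
    jedis.disconnect()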

Redis also supports sharding well: in the driver you just give a list of Redis servers and it will send the data to the right server. Redistributing data after adding more sharded servers to Redis is cumbersome.

I first thought that Redis had an odd array of features but it fits the niche of real-time caching.

SOLR

SOLR is the most used enterprise text search technology. It is built on top of Lucene.
It can store and search documents with many fields using an advanced query language.
It has an ecosystem of plugins doing a lot of the things you would want. It is also very useful for natural language processing. You can even use SOLR as a presentation system for your NLP algorithms.
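A small SolrJ query sketch; the Solr URL and the field names are assumptions:

    import org.apache.solr.client.solrj.SolrQuery
    import org.apache.solr.client.solrj.impl.HttpSolrServer
    import scala.collection.JavaConverters._

    val solr = new HttpSolrServer("http://localhost:8983/solr/documents")

    val query = new SolrQuery("title:hadoop")
    query.setRows(10)

    val results = solr.query(query).getResults
    for (doc <- results.asScala)
      println(doc.getFieldValue("title"))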

To Cloud or not to Cloud

A few years back I thought that I would soon be doing all my work using cloud computing services like Amazon's AWS. This did not happen, but virtualization did. When I request a new server the OPS team usually spins up a virtual machine.
A problem with cloud services is that storage is expensive. Especially Hadoop sized storage.

If I were in a startup I would probably consider the cloud.

Big and Simple

My first rule for software engineering is: Keep it simple.

This is particularly important in big data since size creates inherent complexity.
I made the mistake of being too ambitious too early and thinking out too many scenarios.

Startup Software Stack

Back to the question:

What big data technologies should I use for my web startup?

A common startup technology stack is Ruby on Rails for your web server and Python for your analytics, with the hope that a lot of beefy Amazon EC2 servers will scale your application when your product takes off.
It is fast to get started and the cloud will save you. What could possibly go wrong?

The big data approach I am describing here is more stable and scalable, but before you learn all these technologies you might run out of money.

My answer is: It depends on how much data and how much money you have.

Big Data Not Just Hype

"Big data" is misused and hyped. Still there is a real problem, we are generating an astounding amount of data and sometimes you have to work with it. You need new technologies to wrangle this data.

Whenever I see a reference to Hadoop in a library I get very uneasy. These complex big data technologies are often used where much simpler technologies would have sufficed. Make sure you really need them before you start. This could be the difference between your project succeeding or failing.

It has been humbling to learn these technologies but after much despair I now enjoy working with them and find them essential for those truly big problems.

Monday, April 1, 2013

Akka vs. Finagle vs. Storm

Akka, Finagle and Storm are 3 new open source frameworks for distributed parallel and concurrent programming. They all run on the JVM and work well with Java and Scala.

They are very useful for many common problems:

  • Real-time analytics
  • Complex websites with different inputs and outputs
  • Finance
  • Multiplayer games
  • Big data


Akka, Finagle and Storm are all very elegant solutions optimized for different problems. It can be confusing which framework you should use for which problem. I hope that I can clarify this.


The 30 seconds history of parallel programming

Parallel / concurrent programming is hard. It had rudimentary support in C and C++ in the 1990s.

In 1995 Java made it much simpler to do simple concurrent programming on one machine by adopting the monitor primitive. Still if you had more than a few threads you could easily get deadlocks, and it did not solve the bigger problem of spreading a computation over many machines.

The growth of big data created a strong demand for better frameworks.


MapReduce and Hadoop 

Hadoop, the open source version of Google's MapReduce, is the most well known framework for distributed parallel programming.
It does streaming batch computation by translating an algorithm into a sequence of map and reduce steps.
Map is fully parallel.
Reduce collects the results from the map step.
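The shape of a MapReduce computation can be illustrated with plain Scala collections; this is just word count on a tiny in-memory corpus:

    val lines = Seq("big data is big", "data is data")

    val mapped  = lines.flatMap(_.split("\\s+")).map(word => (word, 1))  // map step
    val reduced = mapped.groupBy(_._1).mapValues(_.map(_._2).sum)        // reduce step
    // reduced: Map(big -> 2, data -> 3, is -> 2)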

Hadoop has a steep learning curve. You should only use it when you have no other choice. Hadoop has a heavy stack with a lot of dependencies.

My post Hive, Pig, Scalding, Scoobi, Scrunch and Spark describes simpler frameworks built on top of Hadoop, but they are still complex.

Hadoop has long response times and is not suited for real-time responses.

The Akka, Finagle and Storm frameworks are all easier to use than Hadoop and are suited for real-time responses.


Storm

Storm was created by Twitter and open sourced in 2011. It is written in Clojure and Java, but it works well with Scala. It is well suited for doing statistics and analytics on massive streams of data.

Storm describes streaming computation very simply: you make a graph of your computation with input data sources called spouts at the top, and below them computation nodes called bolts, each of which can depend on any spout or bolt above it, but you cannot have cycles. The graph is called a topology.

Features


  • Storm deals with communication between machines and bolts
  • Consistent hashing to spread computation to the right instance of a bolt
  • Error recovery after hardware or network failure
  • Storm does not handle computations that fail due to inherent errors well
  • Can do analytics on Twitter scale input, literally
  • You can create a bolt in a non-JVM language as long as it talks Thrift
  • Built-in support for Ruby, Python, and Fancy

Word counter example in Storm

The hello world example for Hadoop is to count word frequencies in a big expanding text corpus.

This is a word counting topology in Storm, wired up in the sketch after the list:

  • Input spout: a stream of texts
  • Tokenization bolt: you tell Storm how many instances of this you want; they can run either in local mode on one machine or in distributed mode on several machines
  • Counting bolt: this too can be split over several machines
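A sketch of wiring this up with Storm's TopologyBuilder; TextSpout, TokenizerBolt and CountBolt stand in for spout and bolt classes you would write yourself:

    import backtype.storm.topology.TopologyBuilder
    import backtype.storm.tuple.Fields

    val builder = new TopologyBuilder

    builder.setSpout("texts", new TextSpout)
    builder.setBolt("tokenizer", new TokenizerBolt, 4)       // 4 parallel instances
           .shuffleGrouping("texts")
    builder.setBolt("counter", new CountBolt, 4)
           .fieldsGrouping("tokenizer", new Fields("word"))  // same word always hits the same counter
    val topology = builder.createTopology()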


Spam filter topology in Storm

Let me give an outline of how to make a statistical spam filter using simple natural language processing. This can be used on emails, postings or comments.

There are 2 data paths:

Training path

Containing messages that have been classified as spam or ham.

  1. Input spout: a tuple of the text and its status: spam or ham
  2. Tokenizer bolt. Input: 1
  3. Word frequency bolt. Input bolt: 2
  4. Bigram finder bolt. Input bolts: 2 and 3
  5. Most important word and bigram selector bolt. Input bolts: 2, 3 and 4
  6. Feature extractor bolt. Input bolt: 5
  7. Bayesian model bolt. Input bolts: 3, 4 and 6

Categorizer path

The unprocessed messages are sent to the Bayesian model bolt so they can be categorized as spam or ham.

  • Input spout: a tuple of the text and an id for the message
  • Tokenizer
  • Feature extractor, extracting the optimal feature set of words and bigrams. Bolt 6 from above
  • Bayesian model. Bolt 7 from above
  • Result bolt: a tuple of the message id and the probability of spam


This spam filter is quite a complex system that fits well with Storm.


Akka

Akka v0.6 was released in 2010. Akka is now at v2.2 and is maintained by Typesafe, the company that maintains Scala, and is part of their stack. It is written in Scala and Java.

Features


  • Akka is based on the actor model from the Erlang language, which was developed to handle telecom equipment and be highly parallel and fault tolerant.
  • Akka gives you very fine grained parallel programming. You split your tasks up into smaller sub tasks and have a lot of actors compute these tasks and communicate the results.
  • Works well for two-way communication.
  • Actors communicate through mailboxes, and each actor is single threaded.
  • The motto of Akka's team is: Let it crash.
  • If an actor fails, Akka has different strategies for recovery.
  • Actors can share access to an object on one machine, e.g. a cache.
  • Communication between different actors is very simple; it is just a method call and it gives you a future to a value (see the sketch after this list).
  • Actors are very lightweight; you can have 2.7 million actors per GB of heap.
  • There is an execution context with a shared thread pool.
  • This works well for IO-bound computations.
  • Turns computations into Futures, monadic composable asynchronous computations.
  • Incorporates dataflow variables from a research language called Oz.
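A sketch of the ask pattern that turns an actor message into a Future; GetClickCount is a hypothetical message that the actor answers with an Int:

    import akka.actor.ActorRef
    import akka.pattern.ask
    import akka.util.Timeout
    import scala.concurrent.Future
    import scala.concurrent.duration._

    case object GetClickCount

    def clickCount(campaign: ActorRef): Future[Int] = {
      implicit val timeout = Timeout(5.seconds)
      (campaign ? GetClickCount).mapTo[Int]  // non-blocking, composable
    }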

Use cases


  • Agents
  • Simulations
  • Social media 
  • Multiplayer game 
  • Finance
  • Online betting


Finagle

Finagle was created by Twitter and open sourced in 2011. It is written in Scala and Java.

Finagle lets internal and external heterogeneous services communicate. It implements asynchronous Remote Procedure Call (RPC) clients and servers.
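A minimal HTTP server and client, close to the Finagle quick-start; details vary between Finagle versions:

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    // A service is just an asynchronous function from Request to Future[Response]
    val service = new Service[Request, Response] {
      def apply(req: Request): Future[Response] = {
        val rep = Response()
        rep.contentString = "pong"
        Future.value(rep)
      }
    }

    val server = Http.serve(":8080", service)

    // A client for the same protocol is another Service
    val client: Service[Request, Response] = Http.newService("localhost:8080")
    val reply: Future[Response] = client(Request("/ping"))

    Await.ready(server)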

Features


  • It is simple to create servers and clients for your application
  • Remote services calling each other
  • Speaks different protocols: HTTP, Thrift, Memcached
  • Failover between servers
  • Is good for having one service query other services 
  • Uniform error policy
  • Uniform limits on retry
  • Two way communication
  • Load balancing
  • Turns computation into a Future, a monadic composable asynchronous computation.

Use cases

  • Complex website using different services and supporting many different protocols
  • Web crawler


Side by side comparisons


Akka vs. Finagle

Akka and Finagle can both do two-way communication.

Akka is great as long as everything lives in one actor system or multiple remote actor systems. It is very simple to have them communicate.

Finagle is much more flexible if you have heterogeneous services and you need different protocols and fallbacks between servers.


Akka vs. Storm

Akka is better for actors that talk back and forth, but you have to keep track of the actors, make strategies for setting up different actor systems on different servers, and make asynchronous requests to those actor systems. Akka is more flexible than Storm, but there is also more to keep track of.

Storm is for computations that move from upstream sources to different downstream sinks. It is very simple to set this up in Storm so it runs computations over many distributed servers.

Finagle vs. Storm

Finagle and Storm both handle failover between machines well.

Finagle does heterogeneous two-way communication.

Storm does homogeneous one-way communication, but does it in a simple way.



Serialization of objects

Serialization of objects is a problem for all of these, since you have to send objects between different machines, and the default Java serialization has problems:

  • It is very verbose
  • It does not work well with Scala's singleton objects
Here are a few Scala open source libraries:





My experience

It has been hard for me to decide what framework is the better fit in a given situation, despite a lot of reading and experimenting.

I started working on an analytics program and Storm was a great fit and much simpler than Hadoop.

I moved to Akka since I needed to:
  • Incorporate more independent sub systems all written in Scala
  • Make asynchronous calls to external services that could fail
  • Set up on-demand services

Now I have to integrate this analytics program with other internal and external services, some in Scala and some in other languages. I am now considering whether Finagle might be a better fit for this, despite Akka being easier in a homogeneous environment.


Afterthought

When I went to school, parallel programming was a buzzword. The reasoning was:

Parallel programming works like the brain and it will solve our computational problems

We did not really know how you would program it. Now it is finally coming of age.

Akka, Finagle, Storm and MapReduce are different elegant solutions to distributed parallel programming. They all use ideas from functional programming.

Friday, June 17, 2011

Cloud Computing For Data Mining Part 1

The first half of this blog post is about selecting a cloud provider for a data mining and natural language processing system. I will compare 3 leading cloud computing providers: Amazon Web Services, Windows Azure and OpenStack.
To help me choose a cloud provider I have been looking for users with experience running cloud computing for applications similar to data mining. I found them at CloudCamp New York June 2011. It was an unconference, so the attendees were split into user discussion groups. In the last half of the post I will mention the highlights from these discussions.


The Hype

"If you are not in the cloud you are not going to be in business!"

This is the message many programmers, software architects and project managers face today. You do not want to go out of business because you could not keep up with the latest technologies; but looking back, many companies have gone out of business because they invested in the latest must-have technology that turned out to be expensive and over-engineered.


Reason For Moving To Cloud

I have a good business case for using cloud computing: namely, scaling a data mining system to handle a lot of data. To begin with it could be a moderate amount of data, but it could change to Big Data on short notice.


Horror Cloud Scenario

I am trying to minimize the risk of this scenario:
  1. I port to a cloud solution that is tied closely to one cloud provider
  2. Move the applications over
  3. After a few months I find that there are unforeseen problems
  4. No easy path back
  5. Angry customers are calling


Goals

Here are my cloud computing goals in a little more details:
  • Port data mining system and ASP.NET web applications to the cloud
  • Choose a cloud compatible with a code base in .NET and Python
  • Initially the data volume is moderate but it could possibly scale to Big Data
  • Keep cost and complexity under control
  • No downtime during transition
  • Minimize risk
  • Minimize vendor lock in
  • Run the same code in house and in the cloud
  • Make rollback to in house application possible


Amazon Web Services vs. Windows Azure vs. OpenStack

Choosing the right cloud computing provider has been time consuming, but also very important.

I took a quick stroll through Cloud Expo 2011, and most big computer companies were there presenting their cloud solutions.

Google App Engine is a big cloud service well suited for front end web applications, but not good for data mining, so I will not cover it here.

The other 3 providers that have generated the most momentum are: EC2, Azure and OpenStack.

Let me start by listing their similarities:
  • Virtual computers that can be started with short notice
  • Redundant robust storage
  • NoSQL structured data
  • Message queue for communication
  • Mountable hard disk
  • Local non persistent hard disk
Now I will write a little more about where they differ, and their good and bad parts:


Amazon Web Services, AWS, EC2, S3

Good:
  • This is the oldest cloud provider dating back to 2004
  • Very mature provider
  • Other providers are catching up with AWS's features
  • Well documented
  • Work well with open source, LAMP and Java
  • Integrated with Hadoop: Elastic MapReduce
  • A little cheaper than Windows Azure
  • Runs Linux, OpenSolaris and Windows servers
  • You can run code on your local machine and just save the result into S3 storage

Bad:
  • You cannot run the same code in house and in the cloud
  • Vendor lock in


Windows Azure

Good:
  • Works well with the .NET framework and all Microsoft's tools
  • It is very simple to port an ASP.NET application to Azure
  • You can run the same code on your development machine and in the cloud
  • Very good development and debugging tools
  • F# is a great language for data mining in cloud computing
  • Great series of video screen casts

Bad:
  • Only run Windows
  • You need a Windows 7, Windows Server 2008 or Windows Vista to develop
  • Preferably you should have Visual Studio 2010
  • Vendor lock in


OpenStack

OpenStack is a new open source collaboration that is making a software stack that can be run both in house and in the cloud.

Good:
  • Open source
  • Generating a lot of buzz
  • Main participants NASA and Rackspace
  • Backed by 70 companies
  • You can run your application either in house or in the cloud

Bad:
  • Not yet mature enough for production use
  • Windows support is immature


Java, .NET Or Mixed Platform

For data mining selecting the right platform is a hard choice. Both Java and .NET are very attractive options.

Java only
For data mining and NLP there are a lot of great open source projects written in Java. E.g. Mahout is a system for collaborative filtering and clustering of Big Data, with distributed machine learning. It is integrated with Hadoop.
There are many more open source projects: OpenNLP, Solr, ManifoldCF.

.NET only
The development tools in .NET are great. It works well with Microsoft Office.
Visual Studio 2010 comes with F#, which is a great language for writing worker roles. It is very well suited for lightweight threads or async, for highly parallel reactive programs.

Mix Java and .NET
You can mix Java and .NET. Cloud computing makes it easier than ever to integrate different platforms. You already have abstract, language agnostic services for communication: message queues, blob storage and structured data. If you have an ASP.NET front end on top of collaborative filtering of Big Data, this would be a very attractive option.

I still think that combining 2 big platforms like Java and .NET introduces complexity, compared to staying within one platform. You need an organization with good resources and coordination to do this.


Choice Of Cloud Provider

I still have a lot of unanswered questions at this point.

At the time of writing, June 2011, OpenStack is not ready for production use. So that is out for now.

I have run some tests on AWS. It was very easy to deploy my Python code to EC2 under Linux. Programming C# that used AWS services was simple.

I am stuck waiting to get a Windows 7 machine so I can test Windows Azure.

Both EC2 and Azure seem like viable options for what I need. I will get back to this in part 2 of the blog post.


Highlights from Cloud Camp 2011

A lot of people are trying to sell you cloud computing solutions, and I have heard plenty of cloud computing hype. I have been seeking advice from people who were not trying to sell me anything and had some real experience, trying to find some of the failures and problems in cloud computing.

I went to Cloud Camp June 2011 during Cloud Expo 2011 in New York, where cloud computing users shared their experience. It was an unconference, meaning spontaneous user discussion breakout groups were formed. The rest of this post is highlights from these discussions.


Hadoop Is Great But Hard

Hadoop is a Java open source implementation of Google's MapReduce. You can set up a workflow of operations and Hadoop will distribute them over multiple computers, aggregate the results and rerun operations that fail. This sounds fantastic, but Hadoop is a pretty complex system, with a lot of new terminology and a steep learning curve.


Security Is Your Responsibility

Security is a big issue. You might assume that the cloud will take care of security, but you should not. E.g. you should clean up the hard disks that you have used, so the next user cannot see your data.


Cloud Does Not Automatically Scale To Big Data

The assumption is that you put massive amounts of data in the cloud and the cloud takes care of the scaling problems.
If you have a lot of data that needs little processing, then cloud computing becomes expensive: you store all data in 3 different locations, and it is expensive and slow to move it down to different compute nodes. This was mentioned as the reason why NASA could not use S3, but built its own Nebula platform.


You Accumulate Cost During Development

An entrepreneur building a startup ended up paying $2000 / month for EC2. He used a lot of different servers and they had to be running with multiple instances, even though he was not using a lot of resources. This might be cheap compared to going out and buying your own servers, but it was more expensive than he expected.


Applications Written In .NET Run Fine Under EC2 Windows

An entrepreneur said that he was running his company's .NET code under EC2. He thought that Amazon was more mature than Azure, and that Azure was catching up. He preferred to make his own framework.


Simpler To Run .NET Application On Azure Than On EC2

A cloud computing consultant with lots of experience in both Azure and EC2 said: EC2 gives you a raw machine; you have to do more to get your application running than if you plop it into Windows Azure.
It is very easy to port an ASP.NET application to Windows Azure.


Cash Flow, Operational Expenses And Capital Expenses

An often cited reason why cloud computing is great is that a company can replace big upfront capital expenses with smaller operational expenses. A few people mentioned that companies live by their cash flow and do not like unpredictable operational expenses; they are more comfortable with predictable capital expenses.