Friday, June 17, 2011

Cloud Computing For Data Mining Part 1

The first half of this blog post is about selecting a cloud provider for a data mining and natural language processing system. I will compare 3 leading cloud computing providers: Amazon Web Services, Windows Azure and OpenStack.
To help me choose a cloud provider I have been looking for users with experience running cloud computing for applications similar to data mining. I found them at CloudCamp New York June 2011. It was an unconference, so the attendees were split into user discussion groups. In the second half of the post I will mention the highlights from these discussions.

The Hype

"If you are not in the cloud you are not going to be in business!"

This is the message many programmers, software architects and project managers face today. You do not want to go out of business because you could not keep up with the latest technologies; but looking back, many companies have gone out of business because they invested in the latest must-have technology that turned out to be expensive and over-engineered.

Reason For Moving To Cloud

I have a good business case for using cloud computing: namely, scaling a data mining system to handle a lot of data. To begin with it could be a moderate amount of data, but it could grow to Big Data on short notice.

Horror Cloud Scenario

I am trying to minimize the risk of this scenario:
  1. I port to a cloud solution that is tied closely to one cloud provider
  2. Move the applications over
  3. After a few months I find that there are unforeseen problems
  4. No easy path back
  5. Angry customers are calling


Here are my cloud computing goals in a little more detail:
  • Port data mining system and ASP.NET web applications to the cloud
  • Choose a cloud compatible with a code base in .NET and Python
  • Initially the data volume is moderate but it could possibly scale to Big Data
  • Keep cost and complexity under control
  • No downtime during transition
  • Minimize risk
  • Minimize vendor lock in
  • Run the same code in house and in the cloud
  • Make rollback to in house application possible
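
One way to approach the last three goals is to keep storage behind a small interface, so application code never calls a vendor SDK directly. The following is only a minimal Python sketch of that idea; the names BlobStore, LocalBlobStore and save_result are hypothetical, and a cloud-backed class with the same two methods would still have to be written against whichever provider is chosen.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Hypothetical storage interface that the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalBlobStore(BlobStore):
    """In-house implementation that keeps each blob as a file in a directory."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        # Flatten the key so it maps to a single file name.
        return os.path.join(self.root, key.replace("/", "_"))

    def put(self, key: str, data: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(data)

    def get(self, key: str) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()


def save_result(store: BlobStore, name: str, text: str) -> None:
    """Application code sees only the interface, not the provider."""
    store.put(name, text.encode("utf-8"))


# Usage: the in-house store here; a cloud store would be swapped in the same way.
store = LocalBlobStore(tempfile.mkdtemp())
save_result(store, "run1/output.txt", "hello")
```

Rolling back to in house then means changing which store is constructed, not rewriting the data mining code.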

Amazon Web Services vs. Windows Azure vs. OpenStack

Choosing the right cloud computing provider has been time consuming, but also very important.

I took a quick stroll through Cloud Expo 2011, and most big computer companies were there presenting their cloud solutions.

Google App Engine is a big cloud service well suited for front end web applications, but not good for data mining, so I will not cover that here.

The other 3 providers that have generated the most momentum are: EC2, Azure and OpenStack.

Let me start by listing their similarities:
  • Virtual computers that can be started with short notice
  • Redundant robust storage
  • NoSQL structured data
  • Message queue for communication
  • Mountable hard disk
  • Local non persistent hard disk

Now I will write a little more about where they differ, and about the good and bad parts of each:

Amazon Web Services, AWS, EC2, S3

  • This is the oldest cloud provider dating back to 2004
  • Very mature provider
  • Other providers are catching up with AWS's features
  • Well documented
  • Works well with open source, LAMP and Java
  • Integrated with Hadoop: Elastic MapReduce
  • A little cheaper than Windows Azure
  • Runs Linux, OpenSolaris and Windows servers
  • You can run code on your local machine and just save the result into S3 storage

  • You cannot run the same code in house and in the cloud
  • Vendor lock in

Windows Azure

  • Works well with the .NET framework and all Microsoft's tools
  • It is very simple to port an ASP.NET application to Azure
  • You can run the same code on your development machine and in the cloud
  • Very good development and debugging tools
  • F# is a great language for data mining in cloud computing
  • Great series of video screen casts

  • Only runs Windows
  • You need Windows 7, Windows Server 2008 or Windows Vista to develop
  • Preferably you should have Visual Studio 2010
  • Vendor lock in


OpenStack is a new open source collaboration that is making a software stack that can be run both in house and in the cloud.

  • Open source
  • Generating a lot of buzz
  • Main participants NASA and Rackspace
  • Backed by 70 companies
  • You can run your application either in house or in the cloud

  • Not yet mature enough for production use
  • Windows support is immature

Java, .NET Or Mixed Platform

For data mining selecting the right platform is a hard choice. Both Java and .NET are very attractive options.

Java only
For data mining and NLP there are a lot of great open source projects written in Java. E.g. Mahout is a system for collaborative filtering and clustering of Big Data, with distributed machine learning. It is integrated with Hadoop.
There are many more open source projects: OpenNLP, Solr, ManifoldCF.

.NET only
The development tools in .NET are great. It works well with Microsoft Office.
Visual Studio 2010 comes with F#, which is a great language for writing worker roles. It is very well suited for light weight threads or async, for highly parallel reactive programs.

Mix Java and .NET
You can mix Java and .NET. Cloud computing makes it easier than ever to integrate different platforms. You already have abstract, language agnostic services for communication: message queues, blob storage and structured data. If you have an ASP.NET front end on top of collaborative filtering of Big Data this would be a very attractive option.
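
What makes the mix workable is that messages on a queue are plain data rather than platform objects. As a rough illustration, the sketch below uses Python's in-process queue.Queue as a stand-in for a cloud message queue and JSON as the wire format. The field names task and blob_key are made up for this example and are not part of any provider's API.

```python
import json
import queue

# Stand-in for a cloud message queue; a real one would be a network service.
work_queue = queue.Queue()


def enqueue_task(q, task, blob_key):
    """Producer side, e.g. the web front end: send a JSON text message."""
    q.put(json.dumps({"task": task, "blob_key": blob_key}))


def dequeue_task(q):
    """Consumer side, e.g. a worker on another platform: parse the JSON."""
    return json.loads(q.get_nowait())


enqueue_task(work_queue, "cluster", "input/docs-001")
message = dequeue_task(work_queue)
print(message["task"], message["blob_key"])
```

Because the payload is only text, the producer could just as well be C# and the consumer Java, as long as both agree on the field names.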

I still think that combining 2 big platforms like Java and .NET is introducing complexity, compared to staying within one platform. You need an organization with good resources and coordination to do this.

Choice Of Cloud Provider

I still have a lot of unanswered questions at this point.

At the time of writing, June 2011, OpenStack is not ready for production use. So that is out for now.

I have run some tests on AWS. It was very easy to deploy my Python code to EC2 under Linux. Programming C# that used AWS services was simple.

I am stuck waiting to get a Windows 7 machine so I can test Windows Azure.

Both EC2 and Azure seem like viable options for what I need. I will get back to this in part 2 of the blog post.

Highlights from Cloud Camp 2011

A lot of people are trying to sell you cloud computing solutions. I have heard plenty of cloud computing hype. I have been seeking advice from people who were not trying to sell me anything and had some real experience, and trying to find some of the failures and problems in cloud computing.

I went to Cloud Camp June 2011 during Cloud Expo 2011 in New York. Cloud computing users shared their experience. It was an unconference, meaning spontaneous user discussion breakout groups were formed. The rest of this post is highlights from these discussions.

Hadoop Is Great But Hard

Hadoop is a Java open source implementation of Google's MapReduce. You can set up a workflow of operations and Hadoop will distribute them over multiple computers, aggregate the results and rerun operations that fail. This sounds fantastic, but Hadoop is a pretty complex system, with a lot of new terminology and a steep learning curve.
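
To make the terminology a little more concrete, here is a single-process Python sketch of the map, shuffle and reduce steps for a word count. It is not Hadoop's API and it does no distribution or retrying; the function names are my own and only illustrate the shape of the computation that Hadoop spreads over many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, then shuffle, then reduce per key."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data in the cloud", "data mining in the cloud"])
```

In a real cluster each call to map_phase and reduce_phase could run on a different machine, which is where most of the complexity comes from.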

Security Is Your Responsibility

Security is a big issue. You might assume that the cloud will take care of security, but you should not. E.g. you should wipe the hard disks that you have used, so the next user cannot see your data.

Cloud Does Not Automatically Scale To Big Data

The assumption is that you put massive amounts of data in the cloud and the cloud takes care of the scaling problems.
But if you have a lot of data that needs little processing, cloud computing becomes expensive: all the data is stored in 3 different locations, and it is expensive and slow to move it to different compute nodes. This was mentioned as the reason why NASA could not use S3 and built its own Nebula platform instead.

You Accumulate Cost During Development

An entrepreneur building a startup ended up paying $2000 / month for EC2. He used a lot of different servers and they had to be running with multiple instances, even though he was not using a lot of resources. This might be cheap compared to going out and buying your own servers, but it was more expensive than he expected.
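
The arithmetic behind a bill like this is simple. Here is a small Python sketch, assuming instances are billed per hour for as long as they are running, whether or not they are busy. The instance count and hourly rate are made-up numbers for illustration, not actual EC2 prices and not the entrepreneur's real setup.

```python
HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def monthly_cost(instance_count, hourly_rate, hours=HOURS_PER_MONTH):
    """Cost of keeping instances running all month, idle or not."""
    return instance_count * hourly_rate * hours


# Hypothetical setup: 6 always-on instances at $0.50 per hour each.
estimate = monthly_cost(6, 0.50)
print(f"${estimate:.2f} per month")
```

The point is that the cost follows the number of running instances and the hours they are up, not how much work they actually do.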

Applications Written In .NET Run Fine Under EC2 Windows

An entrepreneur said that he was running his company's .NET code under EC2. He thought that Amazon was more mature than Azure, and Azure was catching up. He preferred to make his own framework.

Simpler To Run .NET Application On Azure Than On EC2

A cloud computing consultant with lots of experience in both Azure and EC2 said: EC2 gives you a raw machine, so you have to do more to get your application running than if you plop it into Windows Azure.
It is very easy to port an ASP.NET application to Windows Azure.

Cash Flow, Operational Expenses And Capital Expenses

An often cited reason why cloud computing is great is that a company can replace big upfront capital expenses with smaller operational expenses. A few people mentioned that companies live by their cash flow and they do not like to have unpredictable operational expenses, but are more comfortable with predictable capital expenses.


Unknown said...

Nice pragmatic article, cheers.

Anonymous said...

Hi Sami! Great post, I was pretty much sitting on the edge of my seat by the end...but, you never posted a part 2. :( What did you choose in the end? What influenced your decision (if you don't mind my asking)? Thanks!

Sami Badawi said...

Hi Bruce,

I have not started using any cloud solutions yet. I have read up on Hadoop, and I think that this is a good solution for doing analytics on Big Data. A possible outcome is to start using it in house and move it to EC2 if more computation power is needed.

Anonymous said...

During my introduction to the cloud architecture, I have always wondered on how they will efficiently utilize data mining for architectures that has autonomous servers in bridge. I came across Dell's Hypervisor that can mine data at a very impressive rate of bandwidth.