
Monday, September 23, 2013

Big Data: What Worked?

"Big data" created an explosion of new technologies and hype: NoSQL, Hadoop, cloud computing, highly parallel systems and analytics.

I have worked with big data technologies for several years. It has been a steep learning curve, but lately I have had more successes.

This post is about the big data technologies I like and continue to use. Big data is a big topic. These are some highlights from my experience.

I will relate big data technologies to modern web architecture with predictive analytics and raise the question:

What big data technologies should I use for my web startup?

Classic Three Tier Architecture

For a long time software development was dominated by the three tier architecture / client server architecture. It is well described and conceptually simple:
  • Client
  • Server for business logic
  • Database
It is straightforward to figure out what computations should go where.

Modern Web Architecture With Analytics

The modern web architecture is not nearly as well established. It is more like an eight-tier architecture with the following components:
  • Web client
  • Caching
  • Stateless web server
  • Real-time services
  • Database
  • Hadoop for log file processing
  • Text search
  • Ad hoc analytics system
I was hoping that I could do something closer to the 3-tier architecture, but the components have very different features. Kicking off a Hadoop job from a web request could adversely affect your time to first byte.

A problem with the modern web architecture is that a given calculation can be done in many of those components.

Architecture for Predictive Analytics

It is not at all clear which component predictive analytics should be done in.

First you need to collect user metrics. In which components can you do this?
  • The web servers store the metric in Redis / the caching layer
  • The web servers store the metric in the database
  • A real-time service aggregates the user metrics
  • Hadoop jobs run over the log files

User metrics feed predictive analytics and machine learning. Here are some scenarios:
  • If you are doing simple popularity-based predictive analytics, this can be done in the web server or a real-time service.
  • If you use a Bayesian bandit algorithm, you will need a real-time service for it (see the sketch after this list).
  • If you recommend based on user similarity or item similarity, you will need Hadoop.
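
To make the bandit scenario concrete, here is a minimal Thompson sampling sketch in Python. This is my own illustration rather than code from any of the components above; the arm names and reward model are invented:

    import random

    # One Beta-Bernoulli arm per ad, starting from a Beta(1, 1) prior
    class BayesianBandit:
        def __init__(self, arms):
            # [successes, failures] per arm
            self.stats = {arm: [1, 1] for arm in arms}

        def choose(self):
            # Sample a conversion-rate estimate per arm; play the best draw
            draws = {arm: random.betavariate(wins, losses)
                     for arm, (wins, losses) in self.stats.items()}
            return max(draws, key=draws.get)

        def update(self, arm, converted):
            # Fold the observed click / no-click into the posterior
            self.stats[arm][0 if converted else 1] += 1

    bandit = BayesianBandit(["ad_a", "ad_b", "ad_c"])
    arm = bandit.choose()
    bandit.update(arm, converted=True)

The sampling step is what keeps the algorithm exploring: arms with little data have wide posteriors and occasionally win the draw.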

Hadoop

Hadoop is a very complex piece of software for handling amounts of data too big to fit on one computer, which puts them out of reach of conventional software.

I compared different Hadoop libraries in my post: Hive, Pig, Scalding, Scoobi, Scrunch and Spark.

Most developers who have used Hadoop complain about it, and I am no exception: I still have problems with Hadoop jobs failing due to errors that are hard to diagnose. Lately, though, I have been a lot happier with Hadoop. I only use it for big custom extractions or calculations over log files stored in HDFS, and I do my Hadoop work in Scalding or HIVE.

The Hadoop library Mahout can calculate user recommendations based on user similarity or item similarity.

Scalding

Hadoop code in Scalding looks a lot like normal Scala code. The scripts I am writing are often just 10 lines of code and look a lot like my other Scala code. The catch is that you need to be able to write idiomatic functional Scala code.

HIVE

HIVE makes it easy to extract and combine data from HDFS. After a one-time setup that maps a table structure onto a directory in HDFS, you just write SQL, as in the sketch below.
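
Here is a hedged sketch of that flow, driven from Python with the PyHive client. The host, table, columns and HDFS path are all invented:

    from pyhive import hive

    cursor = hive.connect(host="localhost", port=10000).cursor()

    # One-time setup: map a table structure onto a log directory in HDFS
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS access_log (
            ts STRING, user_id STRING, url STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION '/logs/access'
    """)

    # After that, extraction is plain SQL
    cursor.execute("""
        SELECT url, COUNT(*) AS hits
        FROM access_log
        GROUP BY url ORDER BY hits DESC LIMIT 10
    """)
    print(cursor.fetchall())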

Real-time Services

Libraries like Akka, Finagle and Storm are good for long-running stateful computations.
It is hard to write correct, highly parallel code that scales to multiple machines using normal multithreaded programming. For more details see my blog post: Akka vs. Finagle vs. Storm.

Akka and Spray

Akka is an implementation of the actor model, borrowed from the Erlang language. In Akka you have many very lightweight actors that can share a thread pool. They do not block on shared state but communicate by sending immutable messages.

One reason that Akka is a good fit for real-time services is that you can do varying degrees of loose coupling and all services can talk with each other.

It is hard to change from traditional multithreaded programming to the actor model; there are a lot of new actor idioms and design patterns to learn. At first the actor model feels like working with a sack of fleas: you have much less control over the flow, because the computation is distributed.
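
The idiom is easier to see in code than in prose. My real-time work is against Akka in Scala; the sketch below uses Pykka, a Python actor library modeled on Akka, purely to illustrate the message-passing style:

    import pykka

    # The actor owns its state; the outside world only sends it messages
    class CounterActor(pykka.ThreadingActor):
        def __init__(self):
            super().__init__()
            self.count = 0

        def on_receive(self, message):
            if message.get("command") == "increment":
                self.count += 1
            elif message.get("command") == "get":
                return self.count  # becomes the reply to ask()

    counter = CounterActor.start()          # runs on a shared thread pool
    counter.tell({"command": "increment"})  # fire-and-forget message
    print(counter.ask({"command": "get"}))  # blocking request/reply -> 1
    counter.stop()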

Spray makes it easy to put a web or RESTful interface on your service, which connects it to the rest of the world. Spray also has the best Scala serialization system I have found.

Akka is well suited for: E-commerce, high frequency trading in finance, online advertising and simulations.

Akka in Online Advertising

Millions of users are interacting with fast changing ad campaigns. You could have actors for:
  • Each available user
  • Each ad campaign
  • Each ad
  • Clustering algorithms
  • Each cluster of users
  • Each cluster of ads
Each actor evolves over time and can notify and query the other actors.

NoSQL

There are a lot of options, with no query standard:
  • Cassandra
  • CouchDB
  • HBase
  • Memcached
  • MongoDB
  • Redis
  • SOLR
I will describe my initial excitement about NoSQL, the comeback of SQL databases and my current view on where to use NoSQL and where to use SQL.

MongoDB

MongoDB was my first NoSQL technology. I used it to store structured medical documents.

Creating a normalized SQL database that represents a structured data format is a sizable task, and you easily end up with 20 tables. It is hard to insert a structured document into the database in the right order so that foreign key constraints are satisfied. LINQ to SQL helped with this, but it was slow.

I was amazed by MongoDB's simplicity:
  • It was trivial to install
  • It could insert 1 million documents very fast
  • I could use the same Python NLP tools for many different types of documents
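
For contrast with the 20-table normalized schema, here is a hedged pymongo sketch; the collection and fields are invented. The point is that a nested document goes in with a single call, with no foreign key ordering to worry about:

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    documents = client.medical.documents

    # The nested record is stored as-is
    documents.insert_one({
        "patient_id": 1234,
        "diagnoses": [
            {"code": "J45", "description": "asthma"},
            {"code": "E11", "description": "type 2 diabetes"},
        ],
        "notes": "structured document, no schema required",
    })

    # Simple lookups on nested fields are straightforward
    print(documents.find_one({"diagnoses.code": "J45"}))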

I felt that SQL databases were so 20th century.

After some use I realized that interacting with MongoDB from Scala was not as easy; I tried the libraries Subset and Casbah.
I also realized that it is a lot harder to query data from MongoDB than from a SQL database, in both syntax and expressiveness.
Recently SQL databases have added JSON as a data type, taking away some of MongoDB's advantage.

Today I use SQL databases for curated data, and MongoDB for ad hoc structured document data.

Redis

Redis is an advanced key-value store that lives mainly in memory, with backup to disk. Redis is a good fit for caching. It has some specialized operations, sketched in code after this list:
  • Simple to age out data
  • Pub/sub messaging
  • Atomic update increments
  • Atomic list append
  • Set operations
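
A minimal redis-py sketch of those operations; the key names are invented:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    r.incr("page:home:views")           # atomic counter increment
    r.expire("page:home:views", 3600)   # age the counter out after an hour
    r.rpush("recent:user:42", "item7")  # atomic list append
    r.sadd("tags:item7", "sale", "new")             # set operations,
    overlap = r.sinter("tags:item7", "tags:item9")  # e.g. intersection
    r.publish("events", "item7-viewed")             # pub/sub messaging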

Redis also supports sharding well: you just give the driver a list of Redis servers, and it sends each key to the right server. Redistributing data after adding more sharded servers is cumbersome, though.

I first thought that Redis had an odd array of features but it fits the niche of real-time caching.

SOLR

SOLR is the most used enterprise text search technology. It is built on top of Lucene.
It can store and search documents with many fields, using an advanced query language.
It has an ecosystem of plugins covering a lot of what you would want. It is also very useful for natural language processing; you can even use SOLR as a presentation system for your NLP algorithms.
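
A small hedged sketch with the pysolr client; the URL, core name and fields are invented:

    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/articles")

    # Index a document with several fields
    solr.add([{"id": "doc-1",
               "title": "Big Data: What Worked?",
               "body": "Hadoop, Scalding, HIVE, Akka, Redis, SOLR"}])
    solr.commit()

    # Fielded query using the Lucene query syntax
    for hit in solr.search("body:hadoop AND title:big", rows=5):
        print(hit["id"], hit["title"])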

To Cloud or not to Cloud

A few years back I thought that I would soon be doing all my work using cloud computing services like Amazon's AWS. This did not happen, but virtualization did. When I request a new server the OPS team usually spins up a virtual machine.
A problem with cloud services is that storage is expensive, especially Hadoop-sized storage.

If I were in a startup I would probably consider the cloud.

Big and Simple

My first rule of software engineering is: keep it simple.

This is particularly important in big data, since size creates inherent complexity.
I made the mistake of being too ambitious too early and thinking through too many scenarios.

Startup Software Stack

Back to the question:

What big data technologies should I use for my web startup?

A common startup technology stack is:
Ruby on Rails for your web server and Python for your analytics, with the hope that a lot of beefy Amazon EC2 servers will scale your application when your product takes off.
It is fast to get started, and the cloud will save you. What could possibly go wrong?

The big data approach I am describing here is more stable and scalable, but before you learn all these technologies you might run out of money.

My answer is: It depends on how much data and how much money you have.

Big Data Not Just Hype

"Big data" is misused and hyped. Still there is a real problem, we are generating an astounding amount of data and sometimes you have to work with it. You need new technologies to wrangle this data.

Whenever I see a reference to Hadoop in a library I get very uneasy. These complex big data technologies are often used where much simpler technologies would have sufficed. Make sure you really need them before you start; this could be the difference between your project succeeding or failing.

It has been humbling to learn these technologies but after much despair I now enjoy working with them and find them essential for those truly big problems.

Wednesday, June 23, 2010

Orange, R, RapidMiner, Statistica and WEKA

Review of open source and cheap software packages for Data Mining

This blog post compares the following tools after working with them for 2 months and using them to solve a real data mining problem:
  • Orange
  • R
  • RapidMiner
  • Statistica 8 with Data Miner module
  • WEKA
Statistica is commercial; all the others are open source. There is also a brief mention of the following Python libraries: mlpy, ffnet, NLTK.

Summary of first impression

This is a follow-up to my previous post R, RapidMiner, Statistica, SSAS or WEKA, describing my impressions of the following software packages after using them for a couple of days each:
  • R
  • RapidMiner
  • SciPy
  • SQL Server Analysis Services, Business Intelligence Development Studio
  • SQL Server Analysis Services, Table Analysis Tool for Excel
  • Statistica 8 with Data Miner module
  • WEKA
Let me summarize what I found:

SciPy did not have what I needed. However I found a few other good Python-based solutions: Orange, mlpy, ffnet and NLTK.

The SSAS-based solutions held promise due to their close integration with Microsoft products, but I found them to be too closely tied to data warehouses so I postponed exploring them.

Statistica and RapidMiner had a lot of functionality and were polished, but the many features were overwhelming.

R was harder to get started with and WEKA was less polished, so I did not spend too much time on them.

Comparison matrix

To compress my current findings I am summarizing them in the matrix below. The scores are based on limited work with the different software packages, so they are not very accurate; each category is scored from 1 (worst) to 3 (best). The categories are: documentation; GUI and graphics; how polished the package is; ease of learning; controlling the package from a script or program; and how many machine learning algorithms are available:

             Doc  GUI  Polished  Ease  Scripting  Algorithms
Orange        2    3      2       3       3          2
Python libs   1    1      1       3       3          2
R             3    2      2       1       3          2
RapidMiner    2    3      2       2       2          3
Statistica    3    3      3       2       2          3
WEKA          2    2      2       3       2          3

Criteria for software package comparison

The comparison is based on a real data mining task that is relatively simple:
  • Supervised learning for categorization.
  • Over 200 attributes, mainly numeric, but 2 categorical / text.
  • One of the categorical attributes is the most important predictor.
  • Data is clean, so no need to clean outliers and missing data.
  • Accuracy is a good metric.
  • GUI with good graphic to explore the data is a plus.

General observations

The most popular data mining packages in the industry are SAS and SPSS, but they are quite expensive. Orange, R, RapidMiner, Statistica and WEKA can all be used for real data mining work, though some of them are unpolished.

The learning curve was similar for most of the programs: between the documentation and experimenting, most took me a few days to get working.

I had to reformulate my original problem. Neural network models did not work well on my categorical / text attributes. Statistica produced an accuracy of 90%, while RapidMiner produced an accuracy of 82%.
I replaced the 2 categorical attributes with a numeric attribute, and the accuracy of the best model increased to around 97% and became much more uniform across the different tools.

Orange

Orange is an open source data mining package built on Python, NumPy, wrapped C/C++ and Qt.
  • Works both as a script and with an ETL work flow GUI.
  • Shortest script for doing training, cross validation, algorithm comparison and prediction (see the sketch after this list).
  • I found Orange the easiest tool to learn.
  • Cross platform GUI.
Issues:
  • Not super polished.
  • The install is big, since you need to install Qt.
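
A hedged sketch of that scripting style, using the Orange 2.x-era modules; the data file name is invented:

    import orange, orngTest, orngStat

    # Load data, cross-validate two learners, compare their accuracy
    data = orange.ExampleTable("my_problem.tab")

    bayes = orange.BayesLearner()
    bayes.name = "naive bayes"
    knn = orange.kNNLearner()
    knn.name = "knn"
    learners = [bayes, knn]

    results = orngTest.crossValidation(learners, data, folds=10)
    for learner, accuracy in zip(learners, orngStat.CA(results)):
        print("%s: %.3f" % (learner.name, accuracy))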

Python libs: ffnet, NumPy, mlpy, NLTK

A few Python libs deserve to be mentioned here: ffnet, NumPy, mlpy and NLTK.
  • If you do not care about graphic exploration, you can set up an ffnet neural network in a few lines of code.
  • There are several machine learning algorithms in mlpy.
  • The machine learning in NLTK is very elegant if you have a text mining or NLP problem (see the sketch after this list).
  • The libraries are self contained.
Issues:
  • Limited list of machine learning algorithms.
  • Machine learning is not handled uniformly across the different libraries.
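
As a taste of the NLTK style, here is a hedged sketch of a tiny text categorization problem; the training data is invented:

    from nltk import NaiveBayesClassifier

    # Bag-of-words features: every word present maps to True
    def features(text):
        return {word: True for word in text.lower().split()}

    train = [(features("cheap pills buy now"), "spam"),
             (features("meeting moved to noon"), "ham")]

    classifier = NaiveBayesClassifier.train(train)
    print(classifier.classify(features("buy cheap pills")))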

R

R is an open source statistical and data mining package and programming language.
  • Very extensive statistical library.
  • It is a powerful elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
  • I was able to make a working machine learning program in just 40 lines of code.
Issues:
  • Less specialized towards data mining.
  • There is a steep learning curve, unless you are familiar with array languages.

R vs. Orange written in Python

Python and R have a lot in common: they are both elegant, minimal, interpreted languages with good numeric libraries. Still they have a different feel. So I was interested in seeing how they compared.
Orange / Python advantages
  • R is quite different from common programming languages.
  • Python is easier for most programmers to learn.
  • Python has better debugger.
  • Scripting data mining categorization problems is simpler in Orange.
  • Orange also has an ETL work flow GUI.
R advantages
  • R is even more minimal than Python.
  • Numerical programming is better integrated in R; in Python you have to use the external packages NumPy and SciPy.
  • R has better graphics.
  • R is more transparent, since Orange's internals are wrapped C++ classes.
  • Easier to combine with other statistical calculations.
I made a small script to solve my data mining problem in both Orange and R. This was my impression:

If all you want to do is solve a categorization problem, I found Orange to be simpler. You do have to become very familiar with how Orange reads the spreadsheet: the different attribute types, notably the Meta attribute.

Import and export of spreadsheet data is easier in R: spreadsheets are stored in data frames that the different machine learning algorithms operate on. Programming in R really is very different; you work at a higher abstraction level, but you do lose control over the details.

RapidMiner

RapidMiner is an open source statistical and data mining package written in Java.
  • Solid and complete package.
  • It easily reads and writes Excel files and different databases.
  • You program by piping components together in graphic ETL work flows.
  • If you set up an illegal work flow, RapidMiner suggests Quick Fixes to make it legal.
Issues:
  • I only got it to work under Windows, but others have gotten it to work in other environments; see the comment below.
  • There are a lot of different ETL modules; it took a while to understand how to use them.
  • At first I had a hard time comparing different models. Eventually I found a way: you choose a cross validation and select different models one by one. When you run the models they will all be stored on the result page, and you can do the comparison there.

Statistica 8

Statistica is a commercial statistics and data mining software package for Windows.
There is a 90 day trial for Statistica 8 with data miner module in the textbook:
Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
  • Generally very polished and good at everything, but it is also the only non open source program.
  • High accuracy even when I gave it bad input.
  • You can script everything in Statistica in VB.
  • Cheap compared to SPSS and SAS.
Issues:
  • So many options that it was hard to navigate the program.
  • The most important video about Data Miner Recipes is the very last out of 36.
  • Cost of Statistica is not available on their website.
  • It is cheap in a corporate setting, but not for private use.

WEKA

WEKA is an open source statistical and data mining library written in Java.
  • A lot of machine learning algorithms.
  • Easy to learn and use.
  • Good GUI.
  • Platform independent.
Issues:
  • Worse connectivity to Excel spreadsheet and non Java based databases.
  • CSV reader not as robust as in RapidMiner.
  • Not as polished.

RapidMiner vs. WEKA

The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
  • Written in Java.
  • Free / open source software with GPL license.
  • RapidMiner includes many learning algorithms from WEKA.
My first thought was that RapidMiner has everything WEKA has, plus a lot of other functionality, and is more polished. Therefore I did not spend much time on WEKA. For the sake of completeness I took a second look at WEKA, and I have to say that it was a lot easier to get WEKA to work. Sometimes less is more; it depends on whether functionality or ease of use is more important to you.

Conclusion

There are several good and very different solutions. Let me finish by listing the strongest aspect of each tool:

Orange has elegant and concise scripting and can also be run in an ETL GUI mode.
R has elegant and concise scripting integrated with a vast statistical library.
RapidMiner has a lot of functionality, is polished and has good connectivity.
Statistica is the most polished product and generally performed well in all categories. It gave good results even when I gave it bad input.
WEKA is the easiest GUI to learn and use.

-Sami Badawi

Thursday, April 29, 2010

R, RapidMiner, Statistica, SSAS or WEKA

Choosing cheap software packages to get started with Data Mining

You have a data mining problem and you want to try to solve it with a data mining software package. The most popular packages in the industry are SAS and SPSS, but they are quite expensive, so you might have a hard time convincing your boss to purchase them before you have produced impressive results.

When I needed data mining or machine learning algorithms in the past, I would program them from scratch and integrate them into my Java or C# code. But recently I needed a more interactive graphics environment to help with what CRISP-DM calls the Data Understanding phase. I also wanted a way to compare the predictive accuracy of a broad array of algorithms, so I tried out several packages:
  • R
  • RapidMiner
  • SciPy
  • SQL Server Analysis Services, Business Intelligence Development Studio
  • SQL Server Analysis Services, Table Analysis Tool for Excel
  • Statistica 8 with Data Miner module
  • WEKA

Disclaimer for review

Here is a review of my first impressions of these packages. A first impression is not the best indicator of what is going to work for you in the long run, and I am sure that I have missed many features. Still, I hope this can save you some time finding a solution that will work for your problem.

R

R is an open source statistical and data mining package and programming language.
  • Very extensive statistical library.
  • Very concise for solving statistical problems.
  • It is a powerful elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
  • In a few lines you can set up an R program that does data mining and machine learning.
  • You have full control.
  • It is easier to integrate this into a work flow with your other programs. You just spawn an R program and pass input in and read output from a pipe.
  • Good plotting functionality.
Issues:
  • Less interactive GUI.
  • Less specialized towards data mining.
  • Language is pretty different from current mainstream languages like C, C#, C++, Java, PHP and VB.
  • There is a learning curve, unless you are familiar with array languages.
  • R was created in the early 1990s.
Link: Screencast showing how a trained R user can generate a PMML neural network model in 60 seconds.

RapidMiner

RapidMiner is an open source statistical and data mining package written in Java.
  • Lot of data mining algorithms.
  • Feels polished.
  • Good graphics.
  • It easily reads and writes Excel files and different databases.
  • You program by piping components together in graphic ETL workflows.
  • If you set up an illegal workflow, RapidMiner suggests Quick Fixes to make it legal.
  • Good video tutorials / European dance parties. *:o)
Issues:
  • I only got it to work under Windows, but others have gotten it to work in other environments.
  • Harder to compare different algorithms than in WEKA.

SciPy

SciPy is an open source Python wrapper around numerical libraries.
  • Good for mathematics.
  • Python is a simple, elegant and mature language.
Issues:
  • Data mining part is too immature.
  • Too much duct tape.

SQL Server Business Intelligence Development Studio

Microsoft SQL Server Analysis Services comes with a data mining service.
If you have access to SQL Server 2005 or later with SSAS installed, you can use some of the data mining algorithms for free. If you want to scale, it can become quite expensive.
  • If you are working with the Microsoft stack, this integrates well.
  • Good data mining functionality.
  • Organized well.
  • Comes with some graphics.
Issues:
  • The machine learning is closely tied to data warehouses and cubes. This makes the learning curve steeper and deployment harder.
  • Documentation about using the BIDS GUI was hard to find. I looked in several books and several videos.
  • I need to do my data mining from within a web server or a command line program. For this you need to access the models using: Analysis Management Objects (AMO). Documentation for this was also hard to find.
  • You need good cooperation from your DBA, unless you have your own instance of SQL Server.
  • If you want to evaluate the performance of your predictive model, cross-validation is available only in SQL Server 2008 Enterprise.
Link: Good screencast about data mining with SSAS.

SQL Server Analysis Services, Table Analysis Tool Excel

Microsoft Excel data mining plug-in is dependent on SQL Server 2008 and Excel 2007.
  • This takes less interaction with the database and DBA than the Development Studio.
  • A lot of users have their data in Excel.
  • There is an Analysis ribbon / menu that is very simple to use, even for users with a very limited understanding of data mining.
  • The Machine Learning ribbon has more control over internals of the algorithms.
  • You can run with huge amounts of data, since the number crunching is done on the server.
Issues:
  • This also needs a connection to a SQL Server 2008 with Analysis Services running, even though the data mining algorithms are relatively simple.
  • You need a special database inside Analysis Services that you have write permissions to.

Link: Excel Table Analysis Tool video

Statistica 8

Statistica is a commercial statistics and data mining software package.
There is a 90 day trial for Statistica 8 with data miner module in the textbook:
Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
  • Statistica is cheaper than SAS and SPSS.
  • Six hours of instructional videos.
  • Data Miner Recipes wizard is the easiest tool for a beginner.
  • Lot of data mining algorithms.
  • GUI with a lot of functionality.
  • You program using menus and wizards.
  • Good graphics.
  • Easy to find and clean up outliers and missing data attributes.
Issues:
  • Overwhelming number of menu items.
  • The most important video about Data Miner Recipes is the very last.
  • Cost of Statistica is not available on their website.
  • It is cheap in a corporate setting, but not for private use.

WEKA

WEKA is an open source statistical and data mining library written in Java.
  • Many machine learning packages.
  • Good graphics.
  • Specialized for data mining.
  • Easy to work with.
  • Written in pure Java so it is multi platform.
  • Good for text mining.
  • You can train different learning algorithms at the same time and compare their result.
RapidMiner vs WEKA:
The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
  • Written in Java.
  • Free / open source software with GPL license.
  • RapidMiner includes many learning algorithms from WEKA.
Therefore the issues with WEKA are really about how it compares to RapidMiner.
Issues compared to RapidMiner:
  • Worse connectivity to Excel spreadsheet and non Java based databases.
  • CSV reader not as robust.
  • Not as polished.

Criteria for software package comparison

My current data mining needs are relatively simple. I do not need the most sophisticated software packages. This is what I need now:
  • Supervised learning for categorization.
  • Over 200 features, mainly numeric, but 2 categorical.
  • Data is clean so no need to clean outliers and missing data.
  • It is not critical to avoid every mistake.
  • Equal cost for type 1 and type 2 errors.
  • Accuracy is a good metric.
  • Easy to export model to production environment.
  • Good GUI with good graphic to explore the data.
  • Easy to compare a few different models, e.g. boosted trees, naive Bayes, neural network, random forest and support vector machine.

Summary

I did not have time to test all the tools enough for a real review. I was only trying to determine what data mining software packages to try first.
Try first list
  1. Statistica: Most polished, easiest to get started with. Good graphics and documentation.
  2. RapidMiner: Polished. Simplest and most uniform GUI. Good graphics. Open source.
  3. WEKA: A little unpolished. Good functionality for comparing different data mining algorithms.
  4. SSAS Table Analysis Tool, Data Mining ribbon: Showed promise, but I did not get it to do what I need.
  5. SSAS BIDS: Close tie to cube and data warehouse. Hard to find documentation about AMO programming. Could possibly give best integration with C# and VB.NET.
  6. SSAS Table Analysis Tool, Analysis ribbon: Simple to use but does not have the functionality I need.
  7. R: Not specialized towards data mining. Elegant but different programming paradigm.
  8. SciPy: Data mining library too immature.
Both RapidMiner and Statistica 8 do what I need now. So far I have found it easier to find functions using Statistica's menus and wizards than RapidMiner's ETL workflows, but RapidMiner is open source. Still, I would not be surprised if I ended up using more than one package.

Preliminary quality comparison of Statistica and RapidMiner

I ran my predictive modeling task in both Statistica and RapidMiner. In the first match the model that performed best in Statistica was a neural network, with an error rate of approximately 10%.

When I ran the neural network in RapidMiner, the error rate was approximately 18%. I was surprised by the big difference. The reason is probably that one of my most important attributes is categorical with many values, and neural networks do not work well with that. Statistica might have performed better due to more hidden layers.

The second time I ran my predictive model, Statistica had a numeric overflow in the neural network and some prediction values were missing. This also surprised me: I would have expected problems with training the neural network, but not with calculating the output of a trained model on new input.

These problems could easily be the result of my unfamiliarity with the software packages, but this was my first impression.

Link to my follow up post that is based on solving an actual data mining problem in Orange, R, RapidMiner, Statistica and WEKA after working with them for 2 months.

-Sami Badawi