Wednesday, June 23, 2010

Orange, R, RapidMiner, Statistica and WEKA

Review of open source and cheap software packages for Data Mining

This blog post compares the following tools, after I worked with them for 2 months and used them to solve a real data mining problem:
  • Orange
  • R
  • RapidMiner
  • Statistica 8 with Data Miner module
  • WEKA
Statistica is commercial; all the others are open source. There is also a brief mention of the following Python libraries: mlpy, ffnet, NLTK.

Summary of first impression

This is a follow-up to my previous post, "R, RapidMiner, Statistica, SSAS or WEKA", which described my impressions of the following software packages after using each of them for a couple of days:
  • R
  • RapidMiner
  • SciPy
  • SQL Server Analysis Services, Business Intelligence Development Studio
  • SQL Server Analysis Services, Table Analysis Tool for Excel
  • Statistica 8 with Data Miner module
  • WEKA
Let me summarize what I found:

SciPy did not have what I needed. However, I found a few other good Python-based solutions: Orange, mlpy, ffnet and NLTK.

The SSAS-based solutions held promise due to their close integration with Microsoft products, but I found them to be too closely tied to data warehouses so I postponed exploring them.

Statistica and RapidMiner had a lot of functionality and were polished, but the many features were overwhelming.

R was harder to get started with and WEKA was less polished, so I did not spend too much time on them.

Comparison matrix

To condense my current findings, I am summarizing them in this matrix. The matrix is based only on limited work with the different software packages and is not very accurate. The categories are:
documentation; GUI and graphics; how polished the package is; ease of learning; controlling the package from a script or program; and how many machine learning algorithms are available:

Package      Doc  GUI  Polished  Ease  Scripting  Algorithms
Orange       2    3    2         3     3          2
Python libs  1    1    1         3     3          2
R            3    2    2         1     3          2
RapidMiner   2    3    2         2     2          3
Statistica   3    3    3         2     2          3
WEKA         2    2    2         3     2          3
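
One way to work with the matrix is to put it in a small data structure and query it. The sketch below copies the scores from the matrix and assumes that a higher score is better, which the post does not state explicitly. The unweighted totals are only one possible reading, and the helper name best_in is made up for this example.

```python
# Scores copied from the comparison matrix.
# Assumption: a higher score is better in every category.
CATEGORIES = ["Doc", "GUI", "Polished", "Ease", "Scripting", "Algorithms"]
MATRIX = {
    "Orange":      [2, 3, 2, 3, 3, 2],
    "Python libs": [1, 1, 1, 3, 3, 2],
    "R":           [3, 2, 2, 1, 3, 2],
    "RapidMiner":  [2, 3, 2, 2, 2, 3],
    "Statistica":  [3, 3, 3, 2, 2, 3],
    "WEKA":        [2, 2, 2, 3, 2, 3],
}


def best_in(category):
    """Return the packages that share the top score in one category."""
    col = CATEGORIES.index(category)
    top = max(row[col] for row in MATRIX.values())
    return sorted(name for name, row in MATRIX.items() if row[col] == top)


# Unweighted total per package; every category counts the same.
totals = {name: sum(row) for name, row in MATRIX.items()}

for name, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {total}")
print("Best documentation:", best_in("Doc"))
```

If some categories matter more to you than others, multiply each column by a weight before summing.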

Criteria for software package comparison

The comparison is based on a real data mining task that is relatively simple:
  • Supervised learning for categorization.
  • Over 200 attributes, mainly numeric, but 2 are categorical / text.
  • One of the categorical attributes is the most important predictor.
  • Data is clean, so no need to clean outliers and missing data.
  • Accuracy is a good metric.
  • A GUI with good graphics for exploring the data is a plus.

General observations

The most popular data mining packages in the industry are SAS and SPSS, but they are quite expensive. Orange, R, RapidMiner, Statistica and WEKA can all be used for real data mining work, although some of them are unpolished.

There was a similar learning curve for most of the programs. Most programs took me a few days to get working, between the documentation and experimenting.

I had to reformulate my original problem. Neural network models did not work well on my categorical / text attributes. Statistica produced an accuracy of 90%, while RapidMiner produced an accuracy of 82%.
I replaced the 2 categorical attributes with a numeric attribute, and the accuracy of the best model increased to around 97% and was much more uniform across the different tools.
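
The post does not say how the categorical attributes were turned into a number, so the following is only a sketch of one common option: replace each category with the share of training rows in that category that belong to the positive class. The function names, the toy data and the default value for unseen categories are all made up for illustration.

```python
from collections import defaultdict


def fit_target_encoding(categories, labels, positive):
    """Map each category to the fraction of its rows labeled `positive`."""
    counts = defaultdict(lambda: [0, 0])  # category -> [positives, total]
    for cat, label in zip(categories, labels):
        counts[cat][0] += int(label == positive)
        counts[cat][1] += 1
    return {cat: pos / total for cat, (pos, total) in counts.items()}


def encode(categories, mapping, default=0.5):
    """Replace each category by its number; unseen categories get `default`."""
    return [mapping.get(cat, default) for cat in categories]


# Hypothetical toy data: one categorical attribute and a class label.
colors = ["red", "red", "blue", "blue", "blue", "green"]
labels = ["yes", "no", "no", "no", "yes", "yes"]

mapping = fit_target_encoding(colors, labels, positive="yes")
encoded = encode(colors + ["purple"], mapping)
print(mapping)
print(encoded)
```

The mapping should be fitted on the training rows only; fitting it on the test rows as well would leak label information into the accuracy estimate.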

Orange

Orange is an open source data mining package built on Python, NumPy, wrapped C, C++ and Qt.
  • Works both from a script and through an ETL workflow GUI.
  • Shortest script for training, cross-validation, algorithm comparison and prediction.
  • I found Orange the easiest tool to learn.
  • Cross platform GUI.
Issues:
  • Not super polished.
  • The install is big since you need to install Qt.
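
To make the workflow of training, cross-validation, algorithm comparison and prediction concrete, here is a minimal sketch in plain Python. It does not use the Orange API. The two learners, the helper names and the toy data are assumptions chosen only to keep the example self-contained.

```python
import random
from collections import Counter


def majority_learner(train):
    """Return a classifier that always predicts the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbor_learner(train):
    """Return a 1-nearest-neighbor classifier (squared Euclidean distance)."""
    def classify(x):
        def dist(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], x))
        return min(train, key=dist)[1]
    return classify


def cross_val_accuracy(learner, data, folds=5, seed=0):
    """Estimate accuracy with k-fold cross-validation."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    correct = 0
    for k in range(folds):
        test = rows[k::folds]  # every folds-th row is held out
        train = [r for i, r in enumerate(rows) if i % folds != k]
        classify = learner(train)
        correct += sum(classify(x) == y for x, y in test)
    return correct / len(rows)


# Hypothetical toy data: two well separated clusters with labels "a" and "b".
data = [((i * 0.1, 0.0), "a") for i in range(10)]
data += [((5 + i * 0.1, 1.0), "b") for i in range(10)]

# Compare the learners on the same folds.
learners = {"majority": majority_learner, "1-NN": nearest_neighbor_learner}
scores = {name: cross_val_accuracy(l, data) for name, l in learners.items()}
print(scores)

# Train the chosen learner on all data and predict a new row.
model = nearest_neighbor_learner(data)
prediction = model((4.8, 1.0))
print(prediction)
```

The majority learner is a baseline: a real model is only interesting if its cross-validated accuracy is clearly higher.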

Python libs: ffnet, NumPy, mlpy, NLTK

A few Python libs deserve to be mentioned here: ffnet, NumPy, mlpy and NLTK.
  • If you do not care about graphical exploration, you can set up an ffnet neural network in a few lines of code.
  • There are several machine learning algorithms in mlpy.
  • The machine learning in NLTK is very elegant if you have a text mining or NLP problem.
  • The libraries are self-contained.
Issues:
  • Limited list of machine learning algorithms.
  • Machine learning is not handled uniformly between the different libraries.

R

R is an open source statistical and data mining package and programming language.
  • Very extensive statistical library.
  • It is a powerful, elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
  • I was able to make a working machine learning program in just 40 lines of code.
Issues:
  • Less specialized towards data mining.
  • There is a steep learning curve, unless you are familiar with array languages.

R vs. Orange written in Python

Python and R have a lot in common: they are both elegant, minimal, interpreted languages with good numeric libraries. Still, they have a different feel, so I was interested in seeing how they compare.
Orange / Python advantages
  • R is quite different from common programming languages.
  • Python is easier for most programmers to learn.
  • Python has a better debugger.
  • Scripting data mining categorization problems is simpler in Orange.
  • Orange also has an ETL workflow GUI.
R advantages
  • R is even more minimal than Python.
  • Numerical programming is better integrated in R; in Python you have to use the external packages NumPy and SciPy.
  • R has better graphics.
  • R is more transparent, since the Orange classes are wrapped C++ classes.
  • Easier to combine with other statistical calculations.
I wrote a small script to solve my data mining problem in both Orange and R. This was my impression:

If all you want to do is solve a categorization problem, I found Orange to be simpler. However, you have to become very familiar with how Orange reads the spreadsheet and with the different attribute types, notably the Meta attribute.

Import and export of data from a spreadsheet is easier in R: spreadsheets are stored in data frames that the different machine learning algorithms operate on. Programming in R really is very different: you work at a higher level of abstraction, but you lose control over the details.

RapidMiner

RapidMiner is an open source statistical and data mining package written in Java.
  • Solid and complete package.
  • It easily reads and writes Excel files and different databases.
  • You program by piping components together in a graphical ETL workflow.
  • If you set up an illegal workflow, RapidMiner suggests Quick Fixes to make it legal.
Issues:
  • I only got it to work under Windows, but others have gotten it to work in other environments; see the comments below.
  • There are a lot of different ETL modules; it took a while to understand how to use them.
  • At first I had a hard time comparing different models. Eventually I found a way: you choose a cross-validation and select the different models one by one. When you run the models, they are all stored on the result page, and you can compare them there.

Statistica 8

Statistica is a commercial statistics and data mining software package for Windows.
There is a 90 day trial for Statistica 8 with the Data Miner module in the textbook Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
  • Generally very polished and good at everything, but it is also the only program that is not open source.
  • High accuracy even when I gave it bad input.
  • You can script everything in Statistica in VB.
  • Cheap compared to SPSS and SAS.
Issues:
  • So many options that it was hard to navigate the program.
  • The most important video about Data Miner Recipes is the very last out of 36.
  • The cost of Statistica is not available on their website.
  • It is cheap in a corporate setting, but not for private use.

WEKA

WEKA is an open source statistical and data mining library written in Java.
  • A lot of machine learning algorithms.
  • Easy to learn and use.
  • Good GUI.
  • Platform independent.
Issues:
  • Worse connectivity to Excel spreadsheets and non-Java-based databases.
  • CSV reader not as robust as in RapidMiner.
  • Not as polished.

RapidMiner vs. WEKA

The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
  • Written in Java.
  • Free / open source software with GPL license.
  • RapidMiner includes many learning algorithms from WEKA.
My first thought was that RapidMiner has everything that WEKA has, plus a lot of other functionality, and is more polished. Therefore I did not spend too much time on WEKA. For the sake of completeness I took a second look at WEKA, and I have to say that it was a lot easier to get WEKA to work. Sometimes less is more; which tool is better depends on what matters more to you, functionality or ease of use.

Conclusion

There are several good and very different solutions. Let me finish by listing the strongest aspect of each tool:

Orange has elegant and concise scripting and can also be run in an ETL GUI mode.
R has elegant and concise scripting integrated with a vast statistical library.
RapidMiner has a lot of functionality, is polished and has good connectivity.
Statistica is the most polished product and generally performed well in all categories. It gave good results even when I gave it bad input.
WEKA is the easiest GUI to learn and use.

-Sami Badawi