Review of open source and cheap software packages for Data Mining
This post compares the following tools, after working with them for 2 months and using them to solve a real data mining problem:
- Orange
- R
- RapidMiner
- Statistica 8 with Data Miner module
- WEKA
Summary of first impression
This is a follow-up to my previous post, R, RapidMiner, Statistica, SSAS or WEKA, which described my impressions of the following software packages after using them for a couple of days each:
- R
- RapidMiner
- SciPy
- SQL Server Analysis Services, Business Intelligence Development Studio
- SQL Server Analysis Services, Table Analysis Tool for Excel
- Statistica 8 with Data Miner module
- WEKA
SciPy did not have what I needed. However, I found a few other good Python-based solutions: Orange, mlpy, ffnet and NLTK.
The SSAS-based solutions held promise due to their close integration with Microsoft products, but I found them to be too closely tied to data warehouses so I postponed exploring them.
Statistica and RapidMiner had a lot of functionality and were polished, but the many features were overwhelming.
R was harder to get started with and WEKA was less polished, so I did not spend too much time on them.
Comparison matrix
To condense my current findings, I am summarizing them in this matrix. The matrix is based on only limited work with the different software packages and is not very accurate. The categories are: documentation; GUI and graphics; how polished the package is; ease of learning; controlling the package from a script or program; and how many machine learning algorithms are available:
Package | Doc | GUI | Polished | Ease | Scripting | Algorithms |
---|---|---|---|---|---|---|
Orange | 2 | 3 | 2 | 3 | 3 | 2 |
Python libs | 1 | 1 | 1 | 3 | 3 | 2 |
R | 3 | 2 | 2 | 1 | 3 | 2 |
RapidMiner | 2 | 3 | 2 | 2 | 2 | 3 |
Statistica | 3 | 3 | 3 | 2 | 2 | 3 |
WEKA | 2 | 2 | 2 | 3 | 2 | 3 |
Criteria for software package comparison
The comparison is based on a real data mining task that is relatively simple:
- Supervised learning for categorization.
- Over 200 attributes, mainly numeric, but 2 are categorical / text.
- One of the categorical attributes is the most important predictor.
- Data is clean, so no need to clean outliers and missing data.
- Accuracy is a good metric.
- A GUI with good graphics to explore the data is a plus.
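Since accuracy is the metric used throughout this comparison, here is a minimal Python sketch of how it is computed: the fraction of predictions that match the true category. The function name and the toy labels are my own illustration and do not come from any of the reviewed packages.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that equal the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty data set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical toy labels, only for illustration.
predicted = ["a", "b", "a", "a"]
actual = ["a", "b", "b", "a"]
score = accuracy(predicted, actual)  # 3 of 4 predictions are correct
```

Accuracy is a reasonable metric here because the data is clean; on heavily unbalanced categories it can be misleading.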
General observations
The most popular data mining packages in the industry are SAS and SPSS, but they are quite expensive. Orange, R, RapidMiner, Statistica and WEKA can all be used for real data mining work, although some of them are unpolished.
The learning curve was similar for most of the programs. Most programs took me a few days to get working, between reading the documentation and experimenting.
I had to reformulate my original problem. Neural network models did not work well on my categorical / text attributes. Statistica produced an accuracy of 90%, while RapidMiner produced an accuracy of 82%.
I replaced the 2 categorical attributes with a numeric attribute, and the accuracy of the best model increased to around 97% and was much more uniform between the different tools.
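As an illustration of replacing categorical attributes with a numeric one, here is a small Python sketch. It uses frequency encoding, which is only one possible technique and is an assumption on my part, not necessarily the transformation used for the numbers above. The function name, the column names and the toy rows are all hypothetical.

```python
from collections import Counter


def frequency_encode(rows, cat_columns, new_column="cat_freq"):
    """Replace the given categorical columns with one numeric column.

    The new value is the relative frequency of the row's combination
    of categories within the data set.
    """
    keys = [tuple(row[c] for c in cat_columns) for row in rows]
    counts = Counter(keys)
    total = len(rows)
    encoded = []
    for row, key in zip(rows, keys):
        # Keep every attribute except the categorical ones.
        new_row = {k: v for k, v in row.items() if k not in cat_columns}
        new_row[new_column] = counts[key] / total
        encoded.append(new_row)
    return encoded


# Hypothetical toy data: two categorical attributes and one numeric.
rows = [
    {"color": "red", "shape": "round", "size": 1.0},
    {"color": "red", "shape": "round", "size": 2.0},
    {"color": "blue", "shape": "square", "size": 3.0},
    {"color": "red", "shape": "square", "size": 4.0},
]
encoded = frequency_encode(rows, ["color", "shape"])
```

After encoding, every attribute is numeric, so the same table can be fed to neural network models in any of the tools.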
Orange
Orange is an open source data mining package built on Python, NumPy, wrapped C, C++ and Qt.
- Works both as a script and with an ETL workflow GUI.
- Shortest script for doing training, cross validation, algorithm comparison and prediction.
- I found Orange the easiest tool to learn.
- Cross platform GUI.
Issues:
- Not super polished.
- The install is big since you need to install Qt.
Python libs: ffnet, NumPy, mlpy, NLTK
A few Python libs deserve to be mentioned here: ffnet, NumPy, mlpy and NLTK.
- If you do not care about graphic exploration, you can set up an ffnet neural network in a few lines of code.
- There are several machine learning algorithms in mlpy.
- The machine learning in NLTK is very elegant if you have a text mining or NLP problem.
- The libraries are self-contained.
Issues:
- Limited list of machine learning algorithms.
- Machine learning is not handled uniformly between the different libraries.
R
R is an open source statistical and data mining package and programming language.
- Very extensive statistical library.
- It is a powerful, elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
- I was able to make a working machine learning program in just 40 lines of code.
Issues:
- Less specialized towards data mining.
- There is a steep learning curve, unless you are familiar with array languages.
R vs. Orange written in Python
Python and R have a lot in common: they are both elegant, minimal, interpreted languages with good numeric libraries. Still, they have a different feel, so I was interested in seeing how they compared.
Orange / Python advantages
- R is quite different from common programming languages.
- Python is easier for most programmers to learn.
- Python has better debugger.
- Scripting data mining categorization problems is simpler in Orange.
- Orange also has an ETL workflow GUI.
R advantages
- R is even more minimal than Python.
- Numerical programming is better integrated in R; in Python you have to use the external packages NumPy and SciPy.
- R has better graphics.
- R is more transparent, since the Orange classes are wrapped C++ classes.
- Easier to combine with other statistical calculations.
If all you want to do is solve a categorization problem, I found Orange to be simpler. However, you have to become very familiar with how Orange reads the spreadsheet and with the different attribute types, notably the Meta attribute.
Import and export of data from spreadsheets is easier in R: spreadsheets are stored in data frames that the different machine learning algorithms operate on. Programming in R really is very different: you are working at a higher abstraction level, but you do lose control over the details.
RapidMiner
RapidMiner is an open source statistical and data mining package written in Java.
- Solid and complete package.
- It easily reads and writes Excel files and different databases.
- You program by piping components together in a graphic ETL workflow.
- If you set up an illegal workflow, RapidMiner suggests Quick Fixes to make it legal.
Issues:
- I only got it to work under Windows, but others have gotten it to work in other environments; see the comments below.
- There are a lot of different ETL modules; it took a while to understand how to use them.
- At first I had a hard time comparing different models. Eventually I found a way: you choose a cross validation and select the different models one by one. When you run the models, they will all be stored on the result page, and you can do the comparison there.
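The idea behind that comparison, scoring several models with the same cross validation, can be sketched in plain Python. This is not RapidMiner code. The two toy learners, all function names and the data are my own assumptions, chosen only to keep the example self-contained.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_learner(train_x, train_y):
    """Baseline: always predict the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common


def nearest_neighbour_learner(train_x, train_y):
    """Predict the label of the closest training example."""
    def predict(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
        return train_y[distances.index(min(distances))]
    return predict


def cross_validate(learner, xs, ys, k=5):
    """Return accuracy over k folds; each example is tested exactly once."""
    correct = 0
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = learner(train_x, train_y)
        correct += sum(model(xs[i]) == ys[i] for i in test_idx)
    return correct / len(xs)


# Hypothetical toy data: one numeric attribute, two categories,
# interleaved so that every fold contains both categories.
xs = [[0.0], [10.0], [0.5], [10.5], [1.0], [11.0], [1.5], [11.5], [2.0], [12.0]]
ys = ["low", "high"] * 5

learners = {"majority": majority_learner, "nearest": nearest_neighbour_learner}
results = {name: cross_validate(fn, xs, ys, k=5) for name, fn in learners.items()}
```

Because every learner is scored on the same folds, the numbers in `results` can be compared directly, which is what the RapidMiner result page gives you after running the models one by one.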
Statistica 8
Statistica is a commercial statistics and data mining software package for Windows.
There is a 90 day trial for Statistica 8 with the Data Miner module in the textbook Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
- Generally very polished and good at everything, but it is also the only non open source program.
- High accuracy even when I gave it bad input.
- You can script everything in Statistica in VB.
- Cheap compared to SPSS and SAS.
Issues:
- So many options that it was hard to navigate the program.
- The most important video about Data Miner Recipes is the very last out of 36.
- Cost of Statistica is not available on their website.
- It is cheap in a corporate setting, but not for private use.
WEKA
WEKA is an open source statistical and data mining library written in Java.
- A lot of machine learning algorithms.
- Easy to learn and use.
- Good GUI.
- Platform independent.
Issues:
- Worse connectivity to Excel spreadsheets and non-Java-based databases.
- CSV reader not as robust as in RapidMiner.
- Not as polished.
RapidMiner vs. WEKA
The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
- Written in Java.
- Free / open source software with GPL license.
- RapidMiner includes many learning algorithms from WEKA.
Conclusion
There are several good and very different solutions. Let me finish by listing the strongest aspect of each tool:
Orange has elegant and concise scripting and can also be run in an ETL GUI mode.
R has elegant and concise scripting integrated with a vast statistical library.
RapidMiner has a lot of functionality, is polished and has good connectivity.
Statistica is the most polished product, and generally performed well in all categories. It gave good results even when I gave it bad input.
WEKA is the easiest GUI to learn and use.
-Sami Badawi
12 comments:
You said in both your original post and in this followup 2 months later that RapidMiner was Windows only. A quick check of the hosting site for the FOSS version on SourceForge shows that's not true:
Operating System:
64-bit MS Windows, All 32-bit MS Windows (95/98/NT/2000/XP), All POSIX (Linux/BSD/UNIX-like OSes),
OS X, Linux, OS Independent (Written in an interpreted language), Solaris
Please correct your posts.
Thanks
Hi,
first of all, thank you for your post!
a note about rapidminer,
You can use rapidminer in linux environment also. I have an installed one in ubuntu 9.10:)
Rapid Miner is releasing R-plugins in the 3rd week of sept'10.
Main Disadvantage(i m using vista-64bit) of rapidminer is it takes too much memory and your system will be very slow when you are using it :( may be bcoz of GUI,etc
Gurupad S. Hegde,
Student @ National Institute of Technology, Surat-India
Hi,Interesting post!
Well done!
Doing my research I find one amazing free to download book about Computer Vision.
This book presents research trends on computer vision, especially on application of robotics, and on advanced approaches for computer vision (such as omnidirectional vision).
The contents of this book allow the reader to know more technical aspects and applications of computer vision.
The intended audience is anyone who wishes to become familiar with the latest research work on computer vision, especially its applications on robots.
This book features representative work on the computer vision, and it puts more focus on robotics vision and omnidirectional vision.
This is the link where you can find it: http://sciyo.com/books/show/title/computer_vision
Hi blah blah,
I did not intend to censor your comment, you got caught in the spam filter.
I have posted your correction regarding RapidMiner working on other platforms than Windows. Thanks for bringing this to my attention.
RapidMiner seems to lack any facility to access data stored on a server, or cloud, unless you pay for the RapidAnalytics analysis server. Page 21 of the manual skips over this, as does page 96. The web page http://rapid-i.com/content/view/281/225/lang,en/ displays "Download the RapidAnalytics Community Edition: Coming Soon!" meaning no free lunch currently.
The forum also admits the application isn't designed to handle data in a pipeline, i.e. reading and processing lines. Processing huge data sets and files, out of memory, out of luck? Aside from the problem of access to data stores, and ability to process large data sets, there is difficulty even gaining access to their forum. See below, sent yesterday, which is an eternity in internet time. What could they be checking? Also, though open source, but do they expect me to write my own hook into an scp or ssh utility, or hope "coming soon" is soon?
Another question regards Pentaho, the company that actually owns Rapidminer and also is now administering Weka source code. Kind of reminds one of Oracle's new control of MySQL. Is this the future of "open source"?
Meanwhile, I will try work arounds, looks like a cool product to experiment with or copy.
From Rapid-I Forum
Sent Tuesday, December 7, 2010 6:30 pm
To (my email address)
Subject Welcome to Rapid-I Forum
Your registration request at Rapid-I Forum has been received
Before you can login and start using the forum, your request will be reviewed and approved. When this happens, you will receive another email from this address.
Regards,
The Rapid-I Forum Team.
I am hoping to do some statistical analysis using R on data obtain from a OLAP server preferably SSAS. Is there a package for R that you know of the would facilitate this or what approach do you have a suggestion.
Hi pechang,
To get started I would just export data from SSAS to a simple csv file. Then read it into R from that and do my statistical analysis.
When you have a working system you can worry about making more streamlined, if you need to.
Hi Mr.Badawi,
Did you evaluate these tools on datasize handling? I caught onto Orange, since i was familiar with python. But it seems not equipped for handling big data (time-series, 2mn+ instances). what tool would you suggest?
Hi Vcs,
I have not worked with time-series. So I don't know.
Maybe somebody knows if there are any cloud based data mining/clasification/visualisation web tool available?
Great post.
What would you choose between Orange and Scikits-learn for your Python project?
Thanks a lot for the post. In fact, I found weka is useful when I want to start working with ML since its GUI is not complicated. Beside, I enjoyed my time practicing with its examples, but when I tried using my own data, I found it not flexible because weka deals with arff file. I did not find any effective way to convert my excel file that has the data to "arff" file without using any sort of coding. Therefore, if you have any idea please email me at:
samirsarsam.ss@gmail.com
Thanks and Regards. Sam