Choosing cheap software packages to get started with Data Mining
You have a data mining problem and you want to try to solve it with a data mining software package. The most popular packages in the industry are SAS and SPSS, but they are quite expensive, so you might have a hard time convincing your boss to purchase them before you have produced impressive results.
When I needed data mining or machine learning algorithms in the past, I would program them from scratch and integrate them into my Java or C# code. But recently I needed a more interactive graphical environment to help with what CRISP-DM calls the Data Understanding phase. I also wanted a way to compare the predictive accuracy of a broad array of algorithms, so I tried out several packages:
- R
- RapidMiner
- SciPy
- SQL Server Analysis Services, Business Intelligence Development Studio
- SQL Server Analysis Services, Table Analysis Tool for Excel
- Statistica 8 with Data Miner module
- WEKA
Disclaimer for review
Here is a review of my first impressions of these packages. A first impression is not the best indicator of what is going to work for you in the long run, and I am sure that I have missed many features. Still, I hope this can save you some time finding a solution that will work for your problem.
R
R is an open source statistical and data mining package and programming language.
- Very extensive statistical library.
- Very concise for solving statistical problems.
- It is a powerful elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
- In a few lines you can set up an R program that does data mining and machine learning.
- You have full control.
- It is easier to integrate into a workflow with your other programs. You just spawn an R process, pass the input in through a pipe and read the output back from a pipe.
- Good plotting functionality.
Issues:
- Less interactive GUI.
- Less specialized towards data mining.
- Language is pretty different from current mainstream languages like C, C#, C++, Java, PHP and VB.
- There is a learning curve, unless you are familiar with array languages.
- R dates from the early 1990s.
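The pipe-based integration mentioned in the list above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not a definitive setup: it assumes R is installed with `Rscript` on the PATH, and the script name `summarize.R` is hypothetical. The helper `run_through_pipe` is also illustrative; it works with any command that reads from stdin and writes to stdout, so a small stand-in child process is used to keep the example runnable without R.

```python
import subprocess
import sys

def run_through_pipe(command, input_text, timeout=60):
    """Spawn `command`, write input_text to its stdin, return its stdout."""
    completed = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout

# Hypothetical R invocation: requires Rscript on the PATH and a script
# (here called summarize.R) that reads CSV from stdin.
R_COMMAND = ["Rscript", "summarize.R"]

# Stand-in child process so the sketch runs without R installed:
# it counts the data rows (all lines minus the header) it receives on stdin.
STAND_IN_COMMAND = [
    sys.executable, "-c",
    "import sys; rows = sys.stdin.read().splitlines(); print(len(rows) - 1)",
]

csv_text = "x,y\n1,2\n3,4\n5,6\n"
row_count = run_through_pipe(STAND_IN_COMMAND, csv_text).strip()
print(row_count)
```

Swapping `STAND_IN_COMMAND` for `R_COMMAND` is all that changes when a real R script is available; the calling program only ever sees text going in and text coming out.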
RapidMiner
RapidMiner is an open source statistical and data mining package written in Java.
- Lots of data mining algorithms.
- Feels polished.
- Good graphics.
- It easily reads and writes Excel files and different databases.
- You program by piping components together in graphical ETL workflows.
- If you set up an illegal workflow, RapidMiner suggests Quick Fixes to make it legal.
- Good video tutorials / European dance parties. *:o)
Issues:
- I only got it to work under Windows, but others have gotten it to work in other environments.
- Harder to compare different algorithms than WEKA.
SciPy
SciPy is an open source Python wrapper around numerical libraries.
- Good for mathematics.
- Python is a simple, elegant and mature language.
Issues:
- Data mining part is too immature.
- Too much duct tape.
SQL Server Business Intelligence Development Studio
Microsoft SQL Server Analysis Services comes with a data mining service.
If you have access to SQL Server 2005 or later with SSAS installed, you can use some of the data mining algorithms for free. If you want to scale, it can become quite expensive.
- If you are working with the Microsoft stack, this integrates well.
- Good data mining functionality.
- Organized well.
- Comes with some graphics.
Issues:
- The machine learning is closely tied to data warehouses and cubes. This makes the learning curve steeper and deployment harder.
- Documentation about using the BIDS GUI was hard to find. I looked in several books and several videos.
- I need to do my data mining from within a web server or a command line program. For this you need to access the models using Analysis Management Objects (AMO). Documentation for this was also hard to find.
- You need good cooperation from your DBA, unless you have your own instance of SQL Server.
- If you want to evaluate the performance of your predictive model, cross-validation is available only in SQL Server 2008 Enterprise.
SQL Server Analysis Services, Table Analysis Tool for Excel
The Microsoft Excel data mining plug-in depends on SQL Server 2008 and Excel 2007.
- This takes less interaction with the database and DBA than the Development Studio.
- A lot of users have their data in Excel.
- There is an Analysis ribbon / menu that is very simple to use, even for users with a very limited understanding of data mining.
- The Data Mining ribbon gives more control over the internals of the algorithms.
- You can run with huge amounts of data since the number crunching is done on the server.
Issues:
- This also needs a connection to a SQL Server 2008 instance with Analysis Services running, despite the data mining algorithms being relatively simple.
- You need a special database inside Analysis Services that you have write permissions to.
Link: Excel Table Analysis Tool video
Statistica 8
Statistica is a commercial statistics and data mining software package.
There is a 90 day trial for Statistica 8 with the Data Miner module in the textbook Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
- Statistica is cheaper than SAS and SPSS.
- Six hours of instructional videos.
- Data Miner Recipes wizard is the easiest tool for a beginner.
- Lots of data mining algorithms.
- GUI with a lot of functionality.
- You program using menus and wizards.
- Good graphics.
- Easy to find and clean up outliers and missing data attributes.
Issues:
- Overwhelming number of menu items.
- The most important video about Data Miner Recipes is the very last.
- Cost of Statistica is not available on their website.
- It is cheap in a corporate setting, but not for private use.
WEKA
WEKA is an open source statistical and data mining library written in Java.
- Many machine learning packages.
- Good graphics.
- Specialized for data mining.
- Easy to work with.
- Written in pure Java so it is multi platform.
- Good for text mining.
- You can train different learning algorithms at the same time and compare their results.
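A side-by-side comparison of learning algorithms like the one mentioned above typically rests on cross-validation: each algorithm is trained and tested on the same folds and the accuracies are compared. The following is a minimal Python sketch of that idea, not WEKA's implementation. The synthetic data and the two toy learners (a majority-class baseline and a nearest-centroid classifier) are assumptions chosen so that the example runs with the standard library alone.

```python
import random
from collections import Counter

def make_data(n=200, seed=0):
    """Synthetic two-class data: class 1 is shifted by +2 in both features."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        shift = 2.0 * label
        rows.append(([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)], label))
    rng.shuffle(rows)
    return rows

def train_majority(train):
    """Baseline: always predict the most common training label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common

def train_nearest_centroid(train):
    """Predict the label whose mean feature vector is closest."""
    sums, counts = {}, Counter()
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(features):
        def sq_dist(center):
            return sum((a - b) ** 2 for a, b in zip(features, center))
        return min(centroids, key=lambda lab: sq_dist(centroids[lab]))
    return predict

def cross_val_accuracy(trainer, rows, k=5):
    """Pooled accuracy over k folds; each row is tested exactly once."""
    correct = 0
    for fold in range(k):
        test = rows[fold::k]  # rows whose index i satisfies i % k == fold
        train = [row for i, row in enumerate(rows) if i % k != fold]
        model = trainer(train)
        correct += sum(model(features) == label for features, label in test)
    return correct / len(rows)

rows = make_data()
results = {
    "majority baseline": cross_val_accuracy(train_majority, rows),
    "nearest centroid": cross_val_accuracy(train_nearest_centroid, rows),
}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

Because both learners see exactly the same folds, the difference in accuracy reflects the algorithms rather than a lucky split.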
RapidMiner vs WEKA:
The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
- Written in Java.
- Free / open source software with GPL license.
- RapidMiner includes many learning algorithms from WEKA.
Issues compared to RapidMiner:
- Worse connectivity to Excel spreadsheets and non-Java-based databases.
- CSV reader not as robust.
- Not as polished.
Criteria for software package comparison
My current data mining needs are relatively simple. I do not need the most sophisticated software packages. This is what I need now:
- Supervised learning for categorization.
- Over 200 features mainly numeric but 2 categorical.
- Data is clean so no need to clean outliers and missing data.
- It is not critical to avoid every mistake.
- Equal cost for type 1 and type 2 errors.
- Accuracy is a good metric.
- Easy to export model to production environment.
- Good GUI with good graphics to explore the data.
- Easy to compare a few different models, e.g. boosted trees, naive Bayes, neural network, random forest and support vector machine.
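With equal costs for type 1 errors (false positives) and type 2 errors (false negatives), the average misclassification cost is just one minus the accuracy, which is why accuracy is a reasonable metric under the criteria above. A small Python sketch shows this and shows how the ranking can flip once the costs differ. The confusion-matrix counts and the two models are made up for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of examples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def expected_cost(tp, fp, fn, tn, cost_fp=1.0, cost_fn=1.0):
    """Average misclassification cost per example.

    cost_fp is the cost of a type 1 error (false positive),
    cost_fn is the cost of a type 2 error (false negative).
    """
    return (cost_fp * fp + cost_fn * fn) / (tp + fp + fn + tn)

# Hypothetical confusion-matrix counts, for illustration only.
model_a = dict(tp=40, fp=10, fn=5, tn=45)
model_b = dict(tp=35, fp=3, fn=10, tn=52)

for name, counts in (("A", model_a), ("B", model_b)):
    print(
        name,
        "accuracy:", round(accuracy(**counts), 3),
        "equal-cost:", round(expected_cost(**counts), 3),
        "fn costs 5x:", round(expected_cost(**counts, cost_fn=5.0), 3),
    )
```

In this made-up example model B wins on accuracy and on equal-cost error, but model A becomes the cheaper choice when a false negative costs five times as much as a false positive.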
Summary
I did not have time to test all the tools enough for a real review. I was only trying to determine which data mining software packages to try first.
Try first list
- Statistica: Most polished, easiest to get started with. Good graphics and documentation.
- RapidMiner: Polished. Simplest and most uniform GUI. Good graphics. Open source.
- WEKA: A little unpolished. Good functionality for comparing different data mining algorithms.
- SSAS Table Analysis Tool, Data Mining ribbon: Showed promise, but I did not get it to do what I need.
- SSAS BIDS: Close tie to cube and data warehouse. Hard to find documentation about AMO programming. Could possibly give best integration with C# and VB.NET.
- SSAS Table Analysis Tool, Analysis ribbon: Simple to use but does not have the functionality I need.
- R: Not specialized towards data mining. Elegant but different programming paradigm.
- SciPy: Data mining library too immature.
Preliminary quality comparison of Statistica and RapidMiner
I ran my predictive modeling task in both Statistica and RapidMiner. In the first match the model that performed best in Statistica was a neural network, with an error rate of approximately 10%.
When I ran the neural network in RapidMiner, the error rate was approximately 18%. I was surprised by the big difference. The reason is probably that one of my most important attributes is categorical with many values, and neural networks do not work well with that. Statistica might have performed better due to more hidden layers.
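One common way to feed a categorical attribute to a neural network is one-hot encoding, where each distinct value becomes its own 0/1 input. Whether Statistica or RapidMiner encode categorical attributes this way is an assumption here, not something these runs confirm. The Python sketch below only illustrates why an attribute with many values widens the input layer; the attribute values are made up.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        encoded.append(vector)
    return categories, encoded

# Made-up attribute values, for illustration only.
regions = ["north", "south", "east", "north", "west"]
categories, encoded = one_hot_encode(regions)
print(categories)
print(encoded[0])
print("inputs added by this attribute:", len(categories))
```

An attribute with hundreds of distinct values would add hundreds of mostly-zero inputs under this encoding, which gives the network many more weights to fit from the same amount of data.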
The second time I ran my predictive model, Statistica had some numeric overflow for the neural network and there were missing prediction values. This also surprised me. I would expect that there could be problems with the training of the neural network, but not with the calculation of an output for an input on a trained model.
These problems can easily be the result of my unfamiliarity with the software packages, but this was my first impression.
Link to my follow up post that is based on solving an actual data mining problem in Orange, R, RapidMiner, Statistica and WEKA after working with them for 2 months.
-Sami Badawi