Thursday, April 29, 2010

R, RapidMiner, Statistica, SSAS or WEKA

Choosing cheap software packages to get started with Data Mining

You have a data mining problem and you want to try to solve it with a data mining software package. The most popular packages in the industry are SAS and SPSS, but they are quite expensive, so you might have a hard time convincing your boss to purchase them before you have produced impressive results.

When I needed data mining or machine learning algorithms in the past, I would program them from scratch and integrate them into my Java or C# code. But recently I needed a more interactive graphical environment to help with what CRISP-DM calls the Data Understanding phase. I also wanted a way to compare the predictive accuracy of a broad array of algorithms, so I tried out several packages:
  • R
  • RapidMiner
  • SciPy
  • SQL Server Analysis Services, Business Intelligence Development Studio
  • SQL Server Analysis Services, Table Analysis Tool for Excel
  • Statistica 8 with Data Miner module
  • WEKA

Disclaimer for review

Here is a review of my first impressions of these packages. A first impression is not the best indicator of what is going to work for you in the long run, and I am sure that I have missed many features. Still, I hope this can save you some time in finding a solution that will work for your problem.

R

R is an open source statistical and data mining package and programming language.
  • Very extensive statistical library.
  • Very concise for solving statistical problems.
  • It is a powerful elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
  • In a few lines you can set up an R program that does data mining and machine learning.
  • You have full control.
  • It is easy to integrate R into a workflow with your other programs: you just spawn an R process, pass input to it, and read output from a pipe.
  • Good plotting functionality.
Issues:
  • Less interactive GUI.
  • Less specialized towards data mining.
  • Language is pretty different from current mainstream languages like C, C#, C++, Java, PHP and VB.
  • There is a learning curve, unless you are familiar with array languages.
  • R was created in 1990.
Link: Screencast showing how a trained R user can generate a PMML neural network model in 60 seconds.
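
The pipe-based integration mentioned above can be sketched in a few lines. This is a minimal sketch: in practice the command would be something like ["Rscript", "score_model.R"] (a hypothetical script name), but here a tiny inline Python child stands in for the R program so the example is runnable anywhere.

```python
import subprocess
import sys

# The child program: reads numbers from stdin, prints their sum to stdout.
# In a real workflow this would be your R script invoked via Rscript.
child_code = "import sys; print(sum(float(x) for x in sys.stdin.read().split()))"

# Spawn the child process and communicate over stdin/stdout pipes.
proc = subprocess.run(
    [sys.executable, "-c", child_code],
    input="1.5 2.5 3.0",   # data passed to the child on stdin
    capture_output=True,
    text=True,
)
result = float(proc.stdout.strip())
print(result)  # 7.0
```

The same pattern works from Java or C# by starting a process and wiring up its standard streams.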

RapidMiner

RapidMiner is an open source statistical and data mining package written in Java.
  • Lots of data mining algorithms.
  • Feels polished.
  • Good graphics.
  • It easily reads and writes Excel files and different databases.
  • You program by piping components together into graphical ETL workflows.
  • If you set up an invalid workflow, RapidMiner suggests Quick Fixes to make it valid.
  • Good video tutorials / European dance parties. *:o)
Issues:
  • I only got it to work under Windows, but others have gotten it to work in other environments.
  • Harder to compare different algorithms than WEKA.

SciPy

SciPy is an open source Python wrapper around numerical libraries.
  • Good for mathematics.
  • Python is a simple, elegant and mature language.
Issues:
  • Data mining part is too immature.
  • Too much duct tape.

SQL Server Business Intelligence Development Studio

Microsoft SQL Server Analysis Services comes with a data mining service.
If you have access to SQL Server 2005 or later with SSAS installed, you can use some of the data mining algorithms for free. If you want to scale, it can become quite expensive.
  • If you are working with the Microsoft stack, this integrates well.
  • Good data mining functionality.
  • Organized well.
  • Comes with some graphics.
Issues:
  • The machine learning is closely tied to data warehouses and cubes. This makes the learning curve steeper and deployment harder.
  • Documentation about using the BIDS GUI was hard to find. I looked in several books and several videos.
  • I need to do my data mining from within a web server or a command line program. For this you need to access the models using Analysis Management Objects (AMO). Documentation for this was also hard to find.
  • You need good cooperation from your DBA, unless you have your own instance of SQL Server.
  • If you want to evaluate the performance of your predictive model, cross-validation is available only in SQL Server 2008 Enterprise.
Link: Good screencast about data mining with SSAS.
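
Cross-validation itself is a simple procedure; what the Enterprise edition restriction affects is having it automated inside SSAS. As an illustration of what the procedure involves (not tied to SSAS in any way), a minimal k-fold split can be sketched in a few lines of Python:

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        # Spread any remainder rows over the first folds.
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_splits(10, 3))
print(len(folds))    # 3
print(folds[0][1])   # [0, 1, 2, 3] -- the first fold absorbs the remainder row
```

You would train the model on each train split, score it on the matching test split, and average the error rates.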

SQL Server Analysis Services, Table Analysis Tool Excel

The Microsoft Excel data mining plug-in depends on SQL Server 2008 and Excel 2007.
  • This takes less interaction with the database and DBA than the Development Studio.
  • A lot of users have their data in Excel.
  • There is an Analysis ribbon / menu that is very simple to use, even for users with a very limited understanding of data mining.
  • The Machine Learning ribbon has more control over internals of the algorithms.
  • You can run with huge amounts of data, since the number crunching is done on the server.
Issues:
  • This also needs a connection to a SQL Server 2008 instance with Analysis Services running, despite the data mining algorithms being relatively simple.
  • You need a special database inside Analysis Services that you have write permissions to.

Link: Excel Table Analysis Tool video

Statistica 8

Statistica is a commercial statistics and data mining software package.
There is a 90 day trial for Statistica 8 with data miner module in the textbook:
Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
  • Statistica is cheaper than SAS and SPSS.
  • Six hours of instructional videos.
  • Data Miner Recipes wizard is the easiest tool for a beginner.
  • Lots of data mining algorithms.
  • GUI with a lot of functionality.
  • You program using menus and wizards.
  • Good graphics.
  • Easy to find and clean up outliers and missing data attributes.
Issues:
  • Overwhelming number of menu items.
  • The most important video, about Data Miner Recipes, is the very last one.
  • Cost of Statistica is not available on their website.
  • It is cheap in a corporate setting, but not for private use.

WEKA

WEKA is an open source statistical and data mining library written in Java.
  • Many machine learning packages.
  • Good graphics.
  • Specialized for data mining.
  • Easy to work with.
  • Written in pure Java so it is multi platform.
  • Good for text mining.
  • You can train different learning algorithms at the same time and compare their result.
RapidMiner vs WEKA:
The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
  • Written in Java.
  • Free / open source software with GPL license.
  • RapidMiner includes many learning algorithms from WEKA.
Therefore the issues with WEKA are really about how it compares to RapidMiner.
Issues compared to RapidMiner:
  • Worse connectivity to Excel spreadsheets and non-Java-based databases.
  • CSV reader not as robust.
  • Not as polished.

Criteria for software package comparison

My current data mining needs are relatively simple. I do not need the most sophisticated software packages. This is what I need now:
  • Supervised learning for categorization.
  • Over 200 features, mainly numeric, but 2 categorical.
  • Data is clean so no need to clean outliers and missing data.
  • It is not critical to avoid occasional mistakes.
  • Equal cost for type 1 and type 2 errors.
  • Accuracy is a good metric.
  • Easy to export model to production environment.
  • Good GUI with good graphic to explore the data.
  • Easy to compare a few different models, e.g. boosted trees, naive Bayes, neural networks, random forests and support vector machines.
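
The last point, comparing a few models on one accuracy metric, amounts to a small loop over trained classifiers. A sketch in pure Python, using two deliberately simple stand-in models (a majority-class baseline and a 1-nearest-neighbor classifier, hypothetical stand-ins for boosted trees, neural networks, etc.) and a made-up toy dataset:

```python
from collections import Counter

def majority_classifier(train):
    """Always predict the most common label seen in training."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbor_classifier(train):
    """Predict the label of the closest training point (1-NN, squared Euclidean)."""
    def predict(x):
        def dist(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], x))
        return min(train, key=dist)[1]
    return predict

def accuracy(model, test):
    """Fraction of test points the model labels correctly."""
    return sum(model(x) == y for x, y in test) / len(test)

# Tiny synthetic dataset: (features, label) pairs.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.0), "B"),
         ((0.9, 1.1), "B"), ((0.0, 0.1), "A")]
test = [((0.05, 0.05), "A"), ((1.05, 0.95), "B"),
        ((0.2, 0.1), "A"), ((0.8, 1.0), "B")]

for name, trainer in [("majority", majority_classifier),
                      ("1-NN", nearest_neighbor_classifier)]:
    model = trainer(train)
    print(f"{name}: accuracy {accuracy(model, test):.2f}")
```

Swapping in real algorithms only changes the list of trainers; the comparison loop stays the same, which is essentially what WEKA's Experimenter and Statistica's wizards automate.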

Summary

I did not have time to test all the tools enough for a real review. I was only trying to determine what data mining software packages to try first.
Try first list
  1. Statistica: Most polished, easiest to get started with. Good graphics and documentation.
  2. RapidMiner: Polished. Simplest and most uniform GUI. Good graphics. Open source.
  3. WEKA: A little unpolished. Good functionality for comparing different data mining algorithms.
  4. SSAS Table Analysis Tool, Data Mining ribbon: Showed promise, but I did not get it to do what I need.
  5. SSAS BIDS: Close tie to cube and data warehouse. Hard to find documentation about AMO programming. Could possibly give best integration with C# and VB.NET.
  6. SSAS Table Analysis Tool, Analysis ribbon: Simple to use but does not have the functionality I need.
  7. R: Not specialized towards data mining. Elegant but different programming paradigm.
  8. SciPy: Data mining library too immature.
Both RapidMiner and Statistica 8 do what I need now. So far I have found it easier to find functions using Statistica's menus and wizards than RapidMiner's ETL workflows, but RapidMiner is open source. Still, I would not be surprised if I ended up using more than one package.

Preliminary quality comparison of Statistica and RapidMiner

I ran my predictive modeling task in both Statistica and RapidMiner. In the first match the model that performed best in Statistica was a neural network, with an error rate of approximately 10%.

When I ran the neural network in RapidMiner, the error rate was approximately 18%. I was surprised by the big difference. The reason is probably that one of my most important attributes is categorical with many values, and neural networks do not work well with that. Statistica might have performed better due to more hidden layers.
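
The categorical-attribute problem usually comes down to encoding: a neural network needs numeric inputs, and the common workaround is one-hot encoding, which turns a categorical attribute with many values into many sparse 0/1 input columns. A minimal sketch (the attribute values below are made up for illustration):

```python
def one_hot_encode(values):
    """Map each distinct categorical value to its own 0/1 indicator column."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical categorical attribute: one column becomes len(categories) inputs.
values = ["red", "blue", "red", "green", "blue"]
categories, encoded = one_hot_encode(values)
print(categories)   # ['blue', 'green', 'red']
print(encoded[0])   # [0, 0, 1]
```

With many distinct values this inflates the input layer considerably, which is one reason a high-cardinality categorical attribute can hurt a neural network's accuracy.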

The second time I ran my predictive model, Statistica had some numeric overflow in the neural network, and some prediction values were missing. This also surprised me: I would expect that there could be problems with the training of a neural network, but not with calculating output from a trained model.

These problems can easily be the result of my unfamiliarity with the software packages, but this was my first impression.

Link to my follow up post that is based on solving an actual data mining problem in Orange, R, RapidMiner, Statistica and WEKA after working with them for 2 months.

-Sami Badawi

Saturday, April 3, 2010

Data Mining rediscovers Artificial Intelligence

Artificial intelligence started in the 1950s with very high expectations. AI did not deliver on the expectations and fell into decades long discredit. I am seeing signs that Data Mining and Business Intelligence are bringing AI into mainstream computing. This blog posting is a personal account of my long struggle to work in artificial intelligence during different trends in computer science.

In the 1980s I was studying mathematics and physics, which I really enjoyed. But I was concerned about my job prospects: there are not many math or science jobs outside of academia. Artificial intelligence seemed equally interesting but more practical, and I thought that it could provide me with a living wage. Little did I know that artificial intelligence was about to become an unmentionable phrase that you should not put on your resume if you wanted a paying job.

Highlights of the history of artificial intelligence

  • In 1956 AI was founded as a field.
  • In 1957 Frank Rosenblatt invented the Perceptron, the first generation of neural networks. It was based on the way the human brain works, and provided simple solutions to some simple problems.
  • In 1958 John McCarthy invented LISP, the classic AI language. Mainstream programming languages have borrowed heavily from LISP and are only now catching up with LISP.
  • In the 1960s AI got lots of defense funding, especially for military translation software translating from Russian to English.
AI theory made quick advances and a lot was developed early on. AI techniques worked well on small problems. It was expected that AI could learn, using machine learning, and that this would soon lead to human-like intelligence.

This did not work out as planned. The machine translation did not work well enough to be usable. The defense funding dried up. The approaches that had worked well for small problems did not scale to bigger domains. Artificial intelligence fell out of favor in the 1970s.

AI advances in the 1980s

When I started studying AI, it was in the middle of a renaissance and I was optimistic about recent advances:
  • The discovery of new types of neural networks, after Perceptron networks had been discredited in an article by Marvin Minsky
  • Commercial expert systems were thriving
  • The Japanese Fifth Generation Computer Systems project, written in the new elegant declarative Prolog language had many people in the West worried
  • Advances in probability theory Bayesian Networks / Causal Network
In order to combat this brittleness of intelligence, Doug Lenat started a large scale AI project, CYC, in 1984. His idea was that there is no free lunch, and in order to build an intelligent system, you have to use many different types of fine tuned logical inference; and you have to hand encode it with a lot of common sense knowledge. Cycorp spent hundreds of man years building their huge ontology. Their hope was that CYC would be able to start learning on its own, after training it for some years.

AI in the 1990s

I did not lose my patience, but other people did, and AI went from the technology of the future to yesterday's news. It had become a loser that you did not want to be associated with.

During the Internet bubble, when venture capital funding was abundant, I was briefly involved with an AI Internet start-up company. The company did not take off; its main business was emailing discount coupons out to AOL customers. This left me disillusioned, thinking that I would just have to put on a happy face when I worked on the next web application or trading system.

AI usage today

Even though AI stopped being cool, regular people are using it in more and more places:
  • Spam filter
  • Search engines use natural language processing
  • Biometrics, face and fingerprint detection
  • OCR, check reading in ATMs
  • Image processing in coffee machines detecting misaligned cups
  • Fraud detection
  • Movie and book recommendations
  • Machine translation
  • Speech understanding and generation in phone menu systems

Euphemistic words for AI techniques

The rule seems to be that you can use AI techniques as long as you call them something else, e.g.:
  • Business Intelligence
  • Collective Intelligence
  • Data Mining
  • Information Retrieval
  • Machine Learning
  • Natural Language Processing
  • Predictive Analytics
  • Pattern Matching

AI is entering mainstream computing now

Recently I have seen signs that AI techniques are moving into mainstream computing:
  • I went to a presentation for SPSS statistical modeling software, and was shocked by how many people are now using data mining and machine learning techniques. I was sitting next to people working in a prison, an adoption agency, marketing, and a disease prevention NGO.
  • I started working on a data warehouse using SQL Server Analysis Services, and found that SSAS has a suite of machine learning tools.
  • Functional and declarative techniques are spreading to mainstream programming languages.

Business Intelligence compared to AI

Business Intelligence is about aggregating a company's data into an understandable format and analyzing it to provide better business decisions. BI is currently the most popular field using artificial intelligence techniques. Here are a few words about how it differs from AI:
  • BI is driven by vendors instead of academia
  • BI is centered around expensive software packages with a lot of marketing
  • The scope is limited, e.g. find good prospective customers for your products
  • Everything is living in databases or data warehouses
  • BI is data driven
  • Reporting is a very important component of BI

Getting a job in AI

I recently made a big effort to steer my career towards AI. I started an open source computer vision project, ShapeLogic, and put AI back on my resume. A head hunter contacted me and asked if I had any experience in Predictive Analytics. It took me 15 minutes to convince her that Predictive Analytics and AI were close enough that she could forward my resume. I got the job, my first real AI and NLP job.

The work I am doing is not dramatically different from normal software development work. I spend less time on machine learning than on getting AJAX to work with C# ASP.NET for the web GUI, or upgrading the database ORM from ADO.NET strongly typed datasets to LINQ to SQL. However, it was very gratifying to see my program start to perform a task that had been very time consuming for the company's medical staff.

Is AI regaining respect?

No, not now. There are lots of job postings for BI and data mining but barely any for artificial intelligence. AI is still not a popular word, except in video games, where AI means something different. When I worked as a games developer, what was called AI was just checking if your character was close to an enemy, and then the enemy would start shooting in your character's direction.

After 25 long years of waiting I am very happy to see AI techniques finally become a commodity, and I enjoy working with them even if I have to disguise the work behind whatever the buzzword of the day is.

-Sami Badawi