Thursday, April 29, 2010

R, RapidMiner, Statistica, SSAS or WEKA

Choosing cheap software packages to get started with Data Mining

You have a data mining problem and you want to try to solve it with a data mining software package. The most popular packages in the industry are SAS and SPSS, but they are quite expensive, so you might have a hard time convincing your boss to purchase them before you have produced impressive results.

When I needed data mining or machine learning algorithms in the past, I would program them from scratch and integrate them into my Java or C# code. But recently I needed a more interactive graphical environment to help with what CRISP-DM calls the Data Understanding phase. I also wanted a way to compare the predictive accuracy of a broad array of algorithms, so I tried out several packages:
  • R
  • RapidMiner
  • SciPy
  • SQL Server Analysis Services, Business Intelligence Development Studio
  • SQL Server Analysis Services, Table Analysis Tool for Excel
  • Statistica 8 with Data Miner module
  • WEKA

Disclaimer for review

Here is a review of my first impressions of these packages. A first impression is not the best indicator of what is going to work for you in the long run. I am sure that I have missed many features. Still, I hope this can save you some time finding a solution that will work for your problem.

R

R is an open source statistical and data mining package and programming language.
  • Very extensive statistical library.
  • Very concise for solving statistical problems.
  • It is a powerful, elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
  • In a few lines you can set up an R program that does data mining and machine learning.
  • You have full control.
  • It is easy to integrate this into a workflow with your other programs: you just spawn an R process, pass input in and read output from a pipe (see the sketch at the end of this section).
  • Good plotting functionality.
Issues:
  • Less interactive GUI.
  • Less specialized towards data mining.
  • Language is pretty different from current mainstream languages like C, C#, C++, Java, PHP and VB.
  • There is a learning curve, unless you are familiar with array languages.
  • R dates back to the early 1990s.
Link: Screencast showing how a trained R user can generate a PMML neural network model in 60 seconds.
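To illustrate the pipe-based integration mentioned above, here is a minimal sketch in Python; the R script name score.R and the toy data are made up, and the same pattern works from Java or C# with their process APIs.

    # Minimal sketch: run a hypothetical R script as a child process,
    # pipe a CSV in on stdin and read its predictions back from stdout.
    import subprocess

    csv_text = "f1,f2,label\n1.0,2.0,a\n0.5,1.5,b\n"  # toy input rows

    result = subprocess.run(
        ["Rscript", "score.R"],   # score.R is assumed to read a CSV on stdin
        input=csv_text,           # and write one prediction per line on stdout
        capture_output=True,
        text=True,
        check=True,
    )
    predictions = result.stdout.splitlines()
    print(predictions)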

RapidMiner

RapidMiner is an open source statistical and data mining package written in Java.
  • Lot of data mining algorithms.
  • Feels polished.
  • Good graphics.
  • It easily reads and writes Excel files and different databases.
  • You program by piping components together in graphical ETL workflows.
  • If you set up an invalid workflow, RapidMiner suggests Quick Fixes to make it valid.
  • Good video tutorials / European dance parties. *:o)
Issues:
  • I only got it to work under Windows, but others have gotten it to work in other environments.
  • Harder to compare different algorithms than WEKA.

SciPy

SciPy is an open source Python wrapper around numerical libraries.
  • Good for mathematics (see the small example at the end of this section).
  • Python is a simple, elegant and mature language.
Issues:
  • Data mining part is too immature.
  • Too much duct tape.
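As a small illustration of the mathematics side, here is a minimal SciPy sketch fitting a straight line by least squares; the numbers are toy data and nothing here is data mining specific.

    # Minimal sketch: ordinary least squares fit of a line with SciPy.
    from scipy import stats

    x = [1, 2, 3, 4, 5]
    y = [2.1, 4.0, 6.2, 7.9, 10.1]

    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    print(slope, intercept, r_value ** 2)  # slope, intercept and R squared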

SQL Server Business Intelligence Development Studio

Microsoft SQL Server Analysis Services comes with a data mining service.
If you have access to SQL Server 2005 or later with SSAS installed, you can use some of the data mining algorithms for free. If you want to scale, it can become quite expensive.
  • If you are working with the Microsoft stack, this integrates well.
  • Good data mining functionality.
  • Well organized.
  • Comes with some graphics.
Issues:
  • The machine learning is closely tied to data warehouses and cubes. This makes the learning curve steeper and deployment harder.
  • Documentation about using the BIDS GUI was hard to find. I looked in several books and several videos.
  • I need to do my data mining from within a web server or a command line program. For this you need to access the models using Analysis Management Objects (AMO). Documentation for this was also hard to find.
  • You need good cooperation from your DBA, unless you have your own instance of SQL Server.
  • If you want to evaluate the performance of your predictive model, cross-validation is available only in SQL Server 2008 Enterprise.
Link: Good screencast about data mining with SSAS.

SQL Server Analysis Services, Table Analysis Tool Excel

Microsoft Excel data mining plug-in is dependent on SQL Server 2008 and Excel 2007.
  • This takes less interaction with the database and DBA than the Development Studio.
  • A lot of users have their data in Excel.
  • There is an Analysis ribbon / menu that is very simple to use, even for users with very limited understanding of data mining.
  • The Machine Learning ribbon gives more control over the internals of the algorithms.
  • You can run with huge amounts of data, since the number crunching is done on the server.
Issues:
  • This also needs a connection to a SQL Server 2008 with Analysis Services running, even though the data mining algorithms are relatively simple.
  • You need a special database inside Analysis Services that you have write permissions to.

Link: Excel Table Analysis Tool video

Statistica 8

Statistica is a commercial statistics and data mining software package.
There is a 90 day trial for Statistica 8 with the data miner module in the textbook:
Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
  • Statistica is cheaper than SAS and SPSS.
  • Six hours of instructional videos.
  • Data Miner Recipes wizard is the easiest tool for a beginner.
  • Lot of data mining algorithms.
  • GUI with a lot of functionality.
  • You program using menus and wizards.
  • Good graphics.
  • Easy to find and clean up outliers and missing data attributes.
Issues:
  • Overwhelming number of menu items.
  • The most important video, about Data Miner Recipes, is the very last one.
  • Cost of Statistica is not available on their website.
  • It is cheap in a corporate setting, but not for private use.

WEKA

WEKA is an open source statistical and data mining library written in Java.
  • Many machine learning packages.
  • Good graphics.
  • Specialized for data mining.
  • Easy to work with.
  • Written in pure Java so it is multi platform.
  • Good for text mining.
  • You can train different learning algorithms at the same time and compare their result.
RapidMiner vs WEKA:
The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
  • Written in Java.
  • Free / open source software with GPL license.
  • RapidMiner includes many learning algorithms from WEKA.
Therefore the issues with WEKA really come down to how it compares to RapidMiner.
Issues compared to RapidMiner:
  • Worse connectivity to Excel spreadsheets and non-Java databases.
  • CSV reader not as robust.
  • Not as polished.

Criteria for software package comparison

My current data mining needs are relatively simple. I do not need the most sophisticated software packages. This is what I need now:
  • Supervised learning for categorization.
  • Over 200 features, mainly numeric, but 2 categorical.
  • Data is clean, so no need to handle outliers and missing data.
  • Avoiding individual mistakes is not critical.
  • Equal cost for type 1 and type 2 errors.
  • Accuracy is a good metric.
  • Easy to export model to production environment.
  • Good GUI with good graphic to explore the data.
  • Easy to compare a few different models, e.g. boosted trees, naive Bayes, neural networks, random forests and support vector machines.

Summary

I did not have time to test all the tools enough for a real review. I was only trying to determine what data mining software packages to try first.
Try first list
  1. Statistica: Most polished, easiest to get started with. Good graphics and documentation.
  2. RapidMiner: Polished. Simplest and most uniform GUI. Good graphics. Open source.
  3. WEKA: A little unpolished. Good functionality for comparing different data mining algorithms.
  4. SSAS Table Analysis Tool, Data Mining ribbon: Showed promise, but I did not get it to do what I need.
  5. SSAS BIDS: Close ties to cubes and data warehouses. Hard to find documentation about AMO programming. Could possibly give the best integration with C# and VB.NET.
  6. SSAS Table Analysis Tool, Analysis ribbon: Simple to use but does not have the functionality I need.
  7. R: Not specialized towards data mining. Elegant but different programming paradigm.
  8. SciPy: Data mining library too immature.
Both RapidMiner and Statistica 8 do what I need now. So far I have found it easier to find functions using Statistica's menus and wizards than in RapidMiner's ETL workflows, but RapidMiner is open source. Still, I would not be surprised if I ended up using more than one package.

Preliminary quality comparison of Statistica and RapidMiner

I ran my predictive modeling task in both Statistica and RapidMiner. In the first match the model that performed best in Statistica was a neural network, with an error rate of approximately 10%.

When I ran the neural network in RapidMiner, the error rate was approximately 18%. I was surprised by the big difference. The reason is probably that one of my most important attributes is categorical with many values, and neural networks do not work well with that. Statistica might have performed better due to more hidden layers.

The second time I ran my predictive model, Statistica had some numeric overflow in the neural network and there were missing prediction values. This also surprised me; I would expect problems in the training of a neural network, but not in calculating the output of a trained model on new input.

These problems can easily be the result of my unfamiliarity with the software packages, but this was my first impression.

Link to my follow up post that is based on solving an actual data mining problem in Orange, R, RapidMiner, Statistica and WEKA after working with them for 2 months.

-Sami Badawi

Saturday, April 3, 2010

Data Mining rediscovers Artificial Intelligence

Artificial intelligence started in the 1950s with very high expectations. AI did not deliver on the expectations and fell into decades-long discredit. I am seeing signs that Data Mining and Business Intelligence are bringing AI into mainstream computing. This blog posting is a personal account of my long struggle to work in artificial intelligence during different trends in computer science.

In the 1980s I was studying mathematics and physics, which I really enjoyed. I was concerned about my job prospects, though; there are not many math or science jobs outside of academia. Artificial intelligence seemed equally interesting but more practical, and I thought that it could provide me with a living wage. Little did I know that artificial intelligence was about to become an unmentionable phrase that you should not put on your resume if you wanted a paying job.

Highlights of the history of artificial intelligence

  • In 1956 AI was founded.
  • In 1957 Frank Rosenblatt invented the Perceptron, the first generation of neural networks. It was based on the way the human brain works, and provided simple solutions to some simple problems.
  • In 1958 John McCarthy invented LISP, the classic AI language. Mainstream programming languages have borrowed heavily from LISP and are only now catching up with LISP.
  • In the 1960s AI got lots of defense funding, especially for military translation software translating from Russian to English.
AI theory made quick advances and a lot was developed early on. AI techniques worked well on small problems. It was expected that AI could learn, using machine learning, and that this would soon lead to human-like intelligence.

This did not work out as planned. The machine translation did not work well enough to be usable. The defense funding dried up. The approaches that had worked well for small problems did not scale to bigger domains. Artificial intelligence fell out of favor in the 1970s.

AI advances in the 1980s

When I started studying AI, it was in the middle of a renaissance and I was optimistic about recent advances:
  • The discovery of new types of neural networks, after Perceptron networks had been discredited by Marvin Minsky
  • Commercial expert systems were thriving
  • The Japanese Fifth Generation Computer Systems project, written in the new, elegant, declarative Prolog language, had many people in the West worried
  • Advances in probability theory: Bayesian networks / causal networks
In order to combat the brittleness of these AI systems, Doug Lenat started a large scale AI project, CYC, in 1984. His idea was that there is no free lunch, and that in order to build an intelligent system, you have to use many different types of fine-tuned logical inference, and you have to hand-encode a lot of common sense knowledge. Cycorp spent hundreds of man-years building their huge ontology. Their hope was that CYC would be able to start learning on its own after training it for some years.

AI in the 1990s

I did not lose my patience, but other people did, and AI went from the technology of the future to yesterday's news. It had become a loser that you did not want to be associated with.

During the Internet bubble, when venture capital funding was abundant, I was briefly involved with an AI Internet startup. The company did not take off; its main business was emailing discount coupons to AOL customers. This left me disillusioned, thinking that I would just have to put on a happy face while I worked on the next web application or trading system.

AI usage today

Even though AI stopped being cool, regular people are using it in more and more places:
  • Spam filters
  • Search engines use natural language processing
  • Biometrics: face and fingerprint detection
  • OCR, check reading in ATMs
  • Image processing in coffee machines detecting misaligned cups
  • Fraud detection
  • Movie and book recommendations
  • Machine translation
  • Speech understanding and generation in phone menu systems

Euphemistic words for AI techniques

The rule seems to be that you can use AI techniques as long as you call them something else, e.g.:
  • Business Intelligence
  • Collective Intelligence
  • Data Mining
  • Information Retrieval
  • Machine Learning
  • Natural Language Processing
  • Predictive Analytics
  • Pattern Matching

AI is entering mainstream computing now

Recently I have seen signs that AI techniques are moving into mainstream computing:
  • I went to a presentation for SPSS statistical modeling software, and was shocked by how many people are now using data mining and machine learning techniques. I was sitting next to people working in a prison, an adoption agency, marketing and a disease prevention NGO.
  • I started working on a data warehouse using SQL Server Analysis Services, and found that SSAS has a suite of machine learning tools.
  • Functional and declarative techniques are spreading to mainstream programming languages.

Business Intelligence compared to AI

Business Intelligence is about aggregating a company's data into an understandable format and analyzing it to provide better business decisions. BI is currently the most popular field using artificial intelligence techniques. Here are a few words about how it differs from AI:
  • BI is driven by vendors instead of academia
  • BI is centered around expensive software packages with a lot of marketing
  • The scope is limited, e.g. find good prospective customers for your products
  • Everything is living in databases or data warehouses
  • BI is data driven
  • Reporting is a very important component of BI

Getting a job in AI

I recently made a big effort to steer my career towards AI. I started an open source computer vision project, ShapeLogic, and put AI back on my resume. A head hunter contacted me and asked if I had any experience in Predictive Analytics. It took me 15 minutes to convince her that Predictive Analytics and AI were close enough that she could forward my resume. I got the job, my first real AI and NLP job.

The work I am doing is not dramatically different from normal software development work. I spend less time on machine learning than on getting AJAX to work with C# ASP.NET for the web GUI, or on upgrading the database ORM from ADO.NET strongly typed datasets to LINQ to SQL. However, it was very gratifying to see my program start to perform a task that had been very time consuming for the company's medical staff.

Is AI regaining respect?

No, not now. There are lots of job postings for BI and data mining but barely any for artificial intelligence. AI is still not a popular word, except in video games, where AI means something different. When I worked as a games developer, what was called AI was just checking whether your character was close to an enemy, and if so the enemy would start shooting in your character's direction.

After 25 long years of waiting I am very happy to see AI techniques finally become a commodity, and I enjoy working with them even if I have to disguise this work behind whatever the buzzword of the day is.

-Sami Badawi

Wednesday, March 17, 2010

SharpNLP vs NLTK called from C# review

C# and VB.NET have fewer open source NLP libraries than languages like C++, Java, LISP and Perl. My last blog post, Open Source NLP in C# 3.5 using NLTK, is about calling NLTK, which is written in Python, from IronPython embedded under C# or VB.NET.

An alternative is to use SharpNLP, which is the leading open source NLP project written in C# 2.0. SharpNLP is not as big as other Open Source NLP projects. This blog posting is a short comparison of SharpNLP and NLTK embedded in C#.

Documentation

NLTK has excellent documentation, including an introductory online book on NLP and Python programming.

For SharpNLP the source code is the documentation. There is also a short introductory article by SharpNLP's author Richard J. Northedge.

Ease of learning

NLTK is very easy to work with under Python, but integrating it as embedded IronPython under C# took me a few days. It is still a lot simpler to get Python and C# to work together than Python and C++.

SharpNLP's lack of documentation makes it harder to use; but it is very simple to install.

Ease of use

NLTK is great to work with in the Python interpreter.

SharpNLP simplifies life by not having to deal with embedding IronPython under C# and the mismatch between the two languages.

Machine learning and statistical models

NLTK comes with a variety of machine learning and statistical models: decision trees, naive Bayes and maximum entropy. They are very easy to train and validate, but do not perform well on large data sets.
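As a rough sketch of how little code training and validation take in NLTK, here is a toy example; the feature dictionaries and labels are made up for illustration.

    # Minimal sketch: train and evaluate an NLTK naive Bayes classifier on toy data.
    import nltk

    # Each example is a (feature_dict, label) pair; the features are invented.
    train = [({"contains_free": True,  "length": 8},  "spam"),
             ({"contains_free": False, "length": 42}, "ham"),
             ({"contains_free": True,  "length": 12}, "spam"),
             ({"contains_free": False, "length": 35}, "ham")]
    test  = [({"contains_free": True,  "length": 9},  "spam"),
             ({"contains_free": False, "length": 40}, "ham")]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(nltk.classify.accuracy(classifier, test))  # fraction of test examples labeled correctly
    classifier.show_most_informative_features()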

SharpNLP is focused on maximum entropy modeling.

Tokenizer quality

NLTK has a very simple RegEx based tokenizer that works well in most cases.

SharpNLP has a more advanced maximum entropy based tokenizer that can split "don't" into "do | n't". On the other hand it sometimes makes errors and splits a normal word into two words.
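For the NLTK side, a simple regular expression tokenizer looks roughly like this; the pattern is just an example.

    # Minimal sketch: a simple regular expression tokenizer in NLTK.
    from nltk.tokenize import RegexpTokenizer

    # Runs of word characters, or runs of anything else that is not whitespace.
    tokenizer = RegexpTokenizer(r"\w+|[^\w\s]+")
    print(tokenizer.tokenize("Don't panic, it's just a tokenizer."))
    # -> ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'just', 'a', 'tokenizer', '.']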

Development community

NLTK has an active development community, with an active mailing list.

SharpNLP's last release was in December 2006. It is a port of the Java based OpenNLP and can read models from OpenNLP. SharpNLP has a low volume mailing list.

Code quality

NLTK lets you write programs that read from web pages, clean HTML out of text and do machine learning in a few lines of code.
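A minimal sketch of that kind of program is shown below. At the time of this post the HTML cleaning step was nltk.clean_html, which NLTK 3 later removed, so a crude regex stands in for it here.

    # Minimal sketch: fetch a web page, strip the HTML and count word frequencies with NLTK.
    import re
    import urllib.request

    import nltk  # nltk.download('punkt') is needed once for word_tokenize

    html = urllib.request.urlopen("http://example.com/").read().decode("utf-8", "ignore")
    text = re.sub(r"<[^>]+>", " ", html)           # crude HTML tag removal
    tokens = nltk.word_tokenize(text.lower())
    print(nltk.FreqDist(tokens).most_common(10))   # the ten most frequent tokens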

SharpNLP is written in C# 2.0 using generics. It is a port from OpenNLP and maintains a Java flavor, but it is still very readable and pleasant to work with.

License

NLTK's license is Apache License, Version 2.0, which should fit most people's need.

SharpNLP's license is LGPL 2.1. This is a versatile license, but maybe a little harder to work with when the project is not active.

Applications

NLTK comes with a theorem prover for reasoning about semantic content of text.

SharpNLP comes with a name, organization, time, date and percentage finder.
It is very simple to add an advanced GUI using WPF or WinForms.

Conclusion

Both packages come with a lot of functionality. They both have weaknesses, but they are definitely usable. I have both SharpNLP and embedded NLTK in my NLP toolbox.

-Sami Badawi

Thursday, March 11, 2010

Open Source NLP in C# 3.5 using NLTK

I am working on natural language processing algorithms in a C# 3.5 environment. I did not find any open source NLP packages for C# or VB.NET.
NLTK is a great open source NLP package written in Python. It comes with an online book. I decided to try to embed IronPython under C# and run NLTK from there. Here are a few thoughts about the experience.

Problems with embedding IronPython and NLTK

  • Some libraries that NLTK uses are not installed in IronPython, e.g. zlib and numpy; you can mostly patch this up
  • You need a good understanding of how embedded IronPython works
  • The connection between Python and C# is not seamless
  • Sending data between Python and C# takes work
  • NLTK is pretty slow at starting up
  • Doing large scale machine learning in NLTK is slow

C# and IronPython

IronPython is a very good implementation of Python, but in C# 3.5 there is still a mismatch between C# and Python; this becomes an issue when you are dealing with a library as big as NLTK.
The integration between IronPython and C# is going to improve with C# 4.0. How much remains to be seen.

To embed or not to embed

When is embedding IronPython and NLTK inside C# a good idea?

Separate processes for NLTK under CPython and C#

If your C# tasks and your NLP tasks do not interact too much, it might be simpler to have a C# program call an NLP CPython program as an external process. Say you want to analyze the content of a Word document: you would open the Word document in C#, create a Python process, pipe the text into it, read the result back as JSON or XML, and display it in ASP.NET, WPF or WinForms. A sketch of the Python side of this pattern is shown below.
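Here is a rough sketch of what the Python side of that pattern could look like; the script name, the exact NLTK calls and the JSON fields are just examples.

    # nlp_worker.py (hypothetical name): read text on stdin, write an NLTK analysis
    # as JSON on stdout. A C# program would start this as a child process, write the
    # document text to its stdin and parse the JSON it prints.
    import json
    import sys

    import nltk  # nltk.download('punkt') is needed once for word_tokenize

    def main():
        text = sys.stdin.read()
        tokens = nltk.word_tokenize(text)
        result = {
            "token_count": len(tokens),
            "most_common": nltk.FreqDist(t.lower() for t in tokens).most_common(5),
        }
        sys.stdout.write(json.dumps(result))

    if __name__ == "__main__":
        main()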

Small NLP tasks

There is a learning curve for both NLTK and embedded IronPython that slows you down when you start work.

Medium sized NLP projects

The setup cost is not an issue so embedding IronPython and NLTK could work very well here.

Big NLP projects

The setup cost is not an issue, but at some point the mismatch between Python and C# will start to outweigh the advantages you get.

Prototyping in NLTK

Start by writing your application in NLTK, either under CPython or IronPython. This should improve development time substantially. You might find that your prototype is good enough and you do not need to port it to C#; or you will have a working program that you can port to C#.

References


-Sami Badawi

Monday, November 17, 2008

Computer vision C++ libraries review

I am trying to create an easy to use, minimalistic, cross-platform C++ computer vision system with a non-restrictive license. My biggest challenge was to choose the best libraries and to get them to work together; this took some investigation and experimenting. This posting is a brief description of my findings.

This is what ShapeLogic C++ currently looks like (screenshots on Windows and Linux).

In order to construct ShapeLogic C++, I had to make choices within the following categories:
  • Computer vision and image processing libraries
  • GUI libraries
  • Unit test systems
  • Build systems
  • Compilers under Windows
  • C++ IDEs under UNIX
ImageJ, the Java open source image processing tool, is the inspiration for the first part of my work: it is very simple to learn, use and program in. This is a follow up to my last posting: Computer Vision C++ vs Java. The result of my work is released as an open source project ShapeLogic C++, under the MIT license.

Computer vision and image processing libraries

The candidates I considered were:
  • GIL, Generic Image Library
  • OpenCV
  • VXL

GIL, Generic Image Library

GIL, Generic Image Library by Adobe.

Pros
  • Very non-intrusive, only based on header files
  • Puts a wrapper around most image formats
  • You can write an algorithm once and it will work for most image types
  • Part of Boost since 1.35
Cons
  • Does not come with a lot of image processing algorithms

OpenCV

OpenCV, Open Computer Vision by Intel.

Pros
  • Very simple
  • Works with both C and C++
  • Very broad range of algorithms
  • Complex algorithms: face detection, convexity defects
  • Very popular
Cons
  • You have to use OpenCV's IplImage
  • IplImage byte order is BGR instead of the normal RGB

VXL, Vision X Library

VXL is a combination of 2 big, older vision libraries, TargetJR and IUE.

Pros
  • Well tested technology
  • Simpler build process using cmake
  • Uses modern programming techniques: classes, template and STL
  • It has a lot of functionality
  • Simple to get started with
Cons
  • It does not use the normal STL; in order to work on different compilers it had to make its own version with different names.
  • Class structure is somewhat complex.
Choice of computer vision library for ShapeLogic C++
OpenCV for existing image processing and vision algorithms, GIL for writing new algorithms.

Cross platform GUI

The candidates I considered were:
  • GIMP plugin
  • GTK+, GIMP toolkit
  • FLTK
  • HighGui from OpenCV
  • PhotoShop plugin
  • wxWidget

Run ShapeLogic as a GIMP plugin

Pros
  • GIMP is the main cross platform OSS image editing program.
  • It is in wide use.
  • Has a lot of powerful features including scripting functionality in Scheme and Python.
  • From a user perspective this would be an excellent choice.
Cons
  • GIMP is GPL, but you could have a wrapper around your plugins in order to expose them as GIMP plugins.
  • The plugin interface works with tiles, which gives good performance, but does not fit well with the way either OpenCV or GIL processes images.

GTK+, GIMP Toolkit

Pros
  • GTK+ is a great looking and very powerful framework that works on: Windows, Linux, Mac, a.o.
Cons
  • It is written in C and has a homegrown object system, which is not type safe.

GTKMM C++ wrapper around GTK+

Pros
  • GTKMM is a great looking and very powerful framework that works on Windows, Linux, Mac, a.o.
  • It feels natural to program in for a C++ programmer.
Cons
  • The class hierarchy is somewhat deep since it is built on top of GTK.

FLTK, Fast light toolkit

Pros
  • FLTK is very lightweight.
  • Very clean C++, you actually have a main().
  • Native C++, built on top of X11 or Windows.
  • Fluid, a simple GUI builder
Cons
  • Not as many widgets.
  • Dated look.

HighGui from OpenCV

Pros
  • Very lightweight.
  • There is some functionality for displaying images, video and an event handler for mouse events.
Cons
  • Does not come with a menu system.

Run ShapeLogic as a PhotoShop plugin

Pros
  • PhotoShop is the main image editing program.
  • It is in wide use and has a lot of powerful features, including macros.
  • From a user perspective this would be an excellent choice.
Cons
  • The PhotoShop SDK is not freely available; you have to apply to get it.
  • The plugin interface does not fit well with the way either OpenCV or GIL processes images.

wxWidgets

Pros
  • wxWidgets is a full featured GUI toolkit, built on top of native toolkits: Win32, Mac OS X, GTK+, X11, Motif, WinCE and more.
  • Looks good and modern.
  • Big community.
  • Several GUI builders.
Cons
  • The programming style is close to Windows MFC programming.
  • There are many layers.
Choice of GUI toolkit for ShapeLogic C++
This was a hard choice and I went back and forth between FLTK and wxWidgets, but went with FLTK. All the GUI code is separate from the image processing code, so if I wanted to change from FLTK to another toolkit later it should not be too dramatic.

C++ unit test frameworks

There are a lot of different choices and no clear leader. Some of the candidates were:
  • Boost.test
  • CppUnit
  • Google C++ Testing Framework

Boost.test

Pros
  • Boost.test is part of the Boost library.
  • Powerful with a lot of options.
Cons
  • You have to manually set up test suites.
  • It is somewhat heavy.
  • The documentation is extensive but not easy to read.

CppUnit

Pros
  • CppUnit follows the standard xUnit unit testing convention.
  • Integration with Eclipse CDT.
Cons
  • You have to manually set up test suites.
  • It is an extra library to install.

Google C++ Testing Frameworks

Pros
  • Google test follows the standard xUnit unit testing convention.
  • Strong focus on simplicity.
  • Documentation is short and easy to read.
Cons
  • It is an extra library to install.

Choice of C++ unit test framework for ShapeLogic C++

I spent quite a bit of time reading the Boost Test documentation; finally I tried Google C++ Testing Framework and got it working very fast.

Build system

The candidates I considered were:
  • Boost build
  • Make

Boost build

Pros
  • Boost build is part of Boost.
  • Clean design, made as a Make replacement.
  • Works on most platforms and with most compilers.
  • The scripts are pretty short.
Cons
  • There is a learning curve.

Make

Pros
  • Make is the standard build tool for C++.
  • Widely used.
  • Works with Eclipse, MSVC, NetBeans.
  • Short scripts.
Cons
  • It has gotten messy over time.
  • Shell script dependency.
  • There is too much magic for my taste.

Choice of build system for ShapeLogic C++

I chose to go with Boost Build because it has a cleaner design, but Make looks very competitive when looking over the pros and cons.

Compilers under Windows

In order to compile Boost you need a pretty modern and standard compliant compiler. The candidates that I looked at are:
  • Cygwin GCC
  • MinGW GCC
  • MSVC Microsoft Visual C++

Cygwin GCC

Pros
  • Cygwin GCC is close to GCC under UNIX
Cons
  • Uses emulation of UNIX system calls.
  • You can only use it to build GPL compatible applications.

MinGW GCC

Pros
  • MinGW integrates well with Eclipse CDT.
  • Works more natively with Windows.
  • Most libraries build fine with MinGW.
Cons
  • It was supposed to be able to build FLTK, but I tried a few times and could not get it to work.
  • In order to run Make files you also have to install MSYS, which is a minimal shell.

MSVC, Microsoft Visual C++

Pros
  • MSVC is a high quality compiler.
  • Most used compiler under Windows.
  • There is a free Express version.
Cons
  • There seem to be some restrictions on the Express version that I did not quite understand.

Choice of compiler under Windows for ShapeLogic C++

MSVC.

C++ IDEs under UNIX

The candidates I considered were:
  • Eclipse 3.4
  • Emacs / Xemacs
  • NetBeans 6.1

Eclipse 3.4

Pros
  • Eclipse 3.4 CDT has a good debugger.
  • Easy to jump from classes to files defining the classes.
Cons
  • Not nearly as good as Eclipse for Java.
  • Unstable under Linux AMD64.

Emacs

Pros
  • Emacs is a powerful tool that runs in a terminal.
  • Takes up fewer resources.
  • Not dependent on Java.
Cons
  • The Java based IDEs have more features.
  • Demands more knowledge to use.

NetBeans 6.1

Pros
Cons
  • It is made to work with Make files, and ShapeLogic C++ is using Boost Build / Bjam.

Choice of IDE under UNIX for ShapeLogic C++

Eclipse.

Summary of libraries and tools used

  • Computer vision and image processing: OpenCV for existing algorithms, GIL for writing new algorithms
  • GUI toolkit: FLTK
  • Unit test framework: Google C++ Testing Framework
  • Build system: Boost Build
  • Compiler under Windows: MSVC
  • IDE under UNIX: Eclipse

Status of ShapeLogic C++

ShapeLogic C++ 0.4 is the first alpha release. It can do some useful work, but it is still mainly an example application.

Currently has
  • Comes with some image processing operations
  • Comes with 3 brushes
  • It is pretty simple to program an image processing algorithm
Missing
  • Drawing is currently slow and there is only one pen size
  • None of the ShapeLogic Java algorithms have been ported yet
  • Documentation is poor

Hardest problems

  • Learning how FLTK works
  • Building a cross platform C++ build script covering several libraries
None of these problems will affect ShapeLogic users.

Porting computer vision code from Java to C++

Before I started porting ShapeLogic from Java, I thought that C++ was moving towards becoming a legacy language. What I have learned from this work is that C++ has advanced substantially since 2002, when I last used it professionally. C++ still seems competitive, at least in computer vision, and according to my old video game colleagues also in games, where I thought that C# might have taken the lead by now. Both C++ and Java have substantial advantages.

C++
  • OpenCV has a lot of vision algorithms, e.g. face recognition
  • C++ is faster than Java
  • Better for video processing
  • Programs are shorter
  • Generic programming working on primitive types
  • You can make build script that build under both Windows and UNIX
Java / ImageJ
  • ImageJ has more open source algorithms for medical image processing
  • Better support for medical image files formats under ImageJ
  • IDEs are better under Java
  • Build process is simpler than C++
  • Simpler language
  • Better support for parallel processing
  • A lot easier to dynamically load plugins
The next step is to port my framework for declarative programming, which is based on lazy streams, and to port the Color Particle Analyzer. C++ / Boost have good support for functional programming techniques: Boost.Bind and Boost.Lambda, and the Phoenix library has just been accepted into Boost. When complete, I will do another posting about how it went.

-Sami Badawi
http://www.shapelogic.org

Tuesday, September 2, 2008

Computer Vision C++ vs Java review

In 2007 I created an open source computer vision project, ShapeLogic, built in Java to work with ImageJ. This setup has been very easy to work with and very productive. Bjarne Stroustrup, the creator of C++, gave an interview about the new features in the C++0x standard and TR1. C++ now has a lot of innovative programming constructs, e.g. template metaprogramming, lambda functions, concepts and traits. When I found out that "axiom" is going to be a keyword in C++, my inner mathematician demanded that I take a second look at C++ in connection with computer vision.

This post is a review of my personal past experience with computer vision in C++ and Java. I did my master's thesis in computer vision in the early 90s, but I ended up working in other fields: video games, Internet and finance, which only left a little time to do vision in my free time. While both C++ and Java were good choices for professional vision programmers, several of the approaches I chose caused me to run out of steam. I also tried to do computer vision with functional, declarative and hybrid languages, e.g. Oz, Scheme and Scala, but will not cover that here.

Borland C++, early 90s

C++ did not have the STL or any other standard library, so I used Borland's OWL library for images and for the application. I used C++ templates, classes with multiple inheritance and RTTI just to set up basic container functionality. There were a few books that had some free C or C++ source code for image processing and vision, but they did not spawn a user community. I did not really get to do anything interesting.

JAI, Java Advanced Imaging, late 90s

I was very excited when Java came around; this was the language to cure all programming ailments. Now they had added a library that could be used for vision, and a lot of big companies were sponsoring JAI. It turned out to be a very complex framework with a deep class hierarchy, and I spent a lot of time reading the manual trying to find out how to get access to image pixels. I gave up using it, and the framework never gained much popularity.

VXL, C++, STL, Boost, Python, GCC, Linux around 2000

Open source software, OSS had started to become prominent. There were 2 OSS libraries:
  • OpenCV (Open Computer Vision), which was still in alpha.
  • VXL (Vision X Library), which was a merge of 2 big non-OSS libraries, TargetJR and IUE.
VXL finally got into beta and I tried to combine it with Python for more high level processing.

Tools needed for build and GUI
  • VXL does builds using CMake to create Make files
  • Boost uses BJam to do builds
  • Python bindings using Pyste from Boost
  • VXL used FLTK and OpenGL as a GUI
Problems encountered
  • It was hard to get the different build systems, CMake, Bjam and Make, to work together
  • GCC 3.1 and 3.2 core dumped when compiling certain Boost classes
  • Python bindings worked for simple C++ classes, but not for the nested template classes in VXL
  • It was hard to debug the template programs
  • Emacs was not really as easy to use as Visual Studio
  • Bad drivers for OpenGL on Linux
I actually got some examples set up, but spent more time fighting with the tool stack than doing vision work.

ImageJ in Java around 2004

A colleague showed me a visualization tool he had worked on and said that he did it in around 1 month. I hardly believed him, but tried the underlying framework, ImageJ. To my big surprise I was up and running and doing real work in a few hours. ImageJ just got things right. It was built using pure Java by one man, Wayne Rasband. It is very easy to work with and very modular, so a lot of people have made plugins and there is a vibrant development community. When I started working on ShapeLogic, that was the best choice.

OpenCV, GIL Generic Image Library, Boost and Eclipse in C++ 2008

In light of the advances in the C++ language and tools, I have decided to try it again.
C++ image library choices
I chose to start with OpenCV, made by Intel, and GIL, made by Adobe but part of Boost since 1.35.

C++ IDE tried
  • Eclipse 3.4
  • NetBeans 6.1
Eclipse worked better for me; it has its own build system, so you do not have to mess with Make files.

C++ cross platform GUIs
Not sure which one will be best for my purpose.

First attempt
I tried Boost, OpenCV and GIL and got them up and running under both Linux and Windows in a few hours. Eclipse CDT C++ IDE works great.

Porting ShapeLogic algorithms to C++ version

My plan is to port some of ShapeLogic's algorithms from Java to C++. ShapeLogic is a toolkit for declarative programming, specialized for vision. In principle you should be able to make a list of rules for categorizing, say, the shape of a particle in a particle analyzer. You put them in a database or a flat file, and the same rules should work for the C++ and Java versions of ShapeLogic. In practice this might not work out.

Advantages of C++ and Java

This is a loose first assessment.
Constructs used in ShapeLogic that are missing or less convenient in C++
  • Uniform cross platform GUI
  • Dynamic cross platform libraries
  • HashTable
  • Reflection
  • Garbage collection
  • Antlr for parsing logic language
Advantages of C++ over Java in vision
  • Substantially higher speed
  • Better handling of video
  • Used more frequently for computer vision programming
  • Good tracking and face recognition algorithms in OpenCV
For me, Java has been very good for doing medical image processing algorithms. I have heard conflicting evidence about whether it is feasible to do computer vision on video using Java. Video handling in Java has been bad up to now; this is supposed to be fixed with the new JavaFX. Shadow Monsters is a computer vision based art piece that takes video footage of the silhouette of the viewer and adds monsters to it; I saw it on display at the Museum of Modern Art. It was programmed using Processing, which is a Java based image processing tool for artists. I discussed the issue with a computer programmer / artist who said that he had tried to do a motion algorithm in Processing and had to port it to the C++ based Openframeworks since Java was too slow.

After being discouraged by my prior attempts to do vision in C++, I am very happy to see the dramatic developments in C++ and to find out whether it is suitable for a simple port of ShapeLogic algorithms. The result of this C++ port will be covered in 2 postings:

-Sami Badawi
http://www.shapelogic.org

Monday, July 7, 2008

ShapeLogic 1.2 with color particle analyzer released

Here are the release notes for ShapeLogic 1.2

Changes

  • Particle analyzer working directly on color and gray scale images without manual user intervention
  • Both the particle counter and the particle analyzer now take parameters and print reports about each particle's color, area and standard deviation to the result table
  • Color replacer replaces one color, within a tolerance, with another color. Parameter input dialog with preview check box
  • Organized plugins and macros under the ShapeLogic and ShapeLogicOld menus; until 1.1 they were all placed under the shapelogic menu
  • ShapeLogic still has beta development status
The particle analyzer in ShapeLogic v 1.2 has gone through limited testing and seems to work well. There is still a bug in the edge tracer.

Using the particle analyzer as an ImageJ plugin

The particle analyzer was tested on the particle sample image embryos.jpg that comes with ImageJ.

To run it from ImageJ, select "Color Particle Analyzer" in the ShapeLogic menu:


First a particle count dialog is displayed:

Here is the result of running the non-customized particle analyzer on it. This is written to a result table that can be exported to Excel:

The categories for the particles are only examples; it is easy to set up different rules for categorizing particles. In ShapeLogic 1.3 there will be custom rules to recognize specific cells.

ShapeLogic 1.2 also contains the second version of a color particle counter. It also prints a smaller report of each particle's properties.

Plans for next release

The next release, ShapeLogic v 1.3, will be a more mature particle analyzer which will come with custom rules to recognize specific cells.

Seeking particle images

In order to create these rules, I am looking for images of particles on a relatively uniform background. Please let me know if you have sample images that I could work from, preferably standard images like the embryo sample image that comes with ImageJ.

Possible future plans for particle analyzer

  • Create rule for recognizing cells using neural networks or machine learning techniques
  • Be able to handle a background that is not uniform, and cell organelles
  • Incorporate reasoning under uncertainty using the lazy stream library
  • Find overlapping particles and distinguish them as separate
Download ShapeLogic 1.2


-Sami Badawi
http://www.shapelogic.org