Thursday, December 16, 2010

NLTK under Python 2.7 and SciPy 0.9.0

Python 2.7 has been out for months, but I have been stuck on Python 2.6 because SciPy did not work under Python 2.7. The SciPy 0.9 Beta 1 binary distribution has just been released.
Normally I try to stay clear of beta-quality software, but I really like some of the new features in Python 2.7, especially the argparse module. So, despite my better judgement, I installed Python 2.7.1 and SciPy 0.9.0 Beta 1 to run with a big NLTK-based library. This blog post describes the configuration that I use and my first impression of its stability.

SciPy 0.9 RC1 was released January 2011.
SciPy 0.9 was released February 2011.
I tried both of them and found almost the same results as for SciPy 0.9 Beta 1, for which this review was originally written.

Direct downloads
Here is a list of the programs I installed directly:

Installation of NLTK

The install was very simple, just type:

\Python27\lib\site-packages\ nltk

Other libraries installed with

  • CherryPy
  • ipython
  • PIL
  • pymongo
  • pyodbc

    YAML Library
    On a Windows Vista computer with no MS C++ compiler, where I tested this NLTK install, I also had to do a manual install of YAML from:

    Libraries from unofficial binary distributions
    There are a few packages that have build problems but can be installed from Christoph Gohlke's site, Unofficial Windows Binaries for Python Extension Packages. I downloaded and installed:
    • matplotlib-1.0.0.win32-py2.7.exe
    • opencv-python-2.2.0.win32-py2.7.exe


    The installation was simple. Everything installed cleanly. I ran some bigger scripts and they ran fine. Development and debugging also worked fine. Out of 134 NLTK-related unit tests only one failed under Python 2.7.

    Problems with SciPy algorithms

    The failing unit test was maximum entropy training using the LBFGSB optimization algorithm. These were my settings:
    nltk.MaxentClassifier.train(train, algorithm='LBFGSB', gaussian_prior_sigma=1, trace=2)

    First, the maximum entropy training would not run because it was calling the method rmatvec() in scipy/sparse/ This method has been deprecated for a while and has been removed from SciPy 0.9. I found this method in SciPy 0.8 and added it back. My unit test then ran, but instead of finishing in a couple of seconds it took around 10 minutes and ate up 1.5GB of memory before it crashed. After this I gave up on LBFGSB.

    If you do not want to use LBFGSB, megam is another efficient optimization algorithm. However it is implemented in OCaml and I did not want to install OCaml on a Windows computer.

    This problem occurred for both SciPy 0.9 Beta 1 and RC1.

    Python 2.6 and 2.7 interpreters active in PyDev

    Another problem was that having both Python 2.6 and 2.7 interpreters active in PyDev made it less stable. When I started scripts from PyDev, they sometimes timed out before starting. PyLint would also show errors in code that was correct. I deleted the Python 2.6 interpreter under PyDev Preferences, and PyDev worked fine with just Python 2.7.

    I also added a version check to the one failing unit test, since it caused problems for my machine.
    if (2, 7) < sys.version_info: return
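
On Python 2.7 and later the same guard can also be written with the unittest.skipIf decorator, so the test is reported as skipped instead of silently passing. This is only a minimal sketch: the test class, the test names and the skip message are hypothetical stand-ins for the real NLTK test.

```python
import io
import sys
import unittest

# Same intent as the check `(2, 7) < sys.version_info` above.
ON_PY27_OR_LATER = sys.version_info >= (2, 7)


class MaxentTrainingTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(ON_PY27_OR_LATER, "LBFGSB training is unstable here")
    def test_lbfgsb_training(self):
        # Placeholder for the real training assertions; never runs when skipped.
        self.fail("should have been skipped")

    def test_unrelated(self):
        self.assertEqual(1 + 1, 2)


def run_tests():
    """Run the test case quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MaxentTrainingTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


result = run_tests()
print(len(result.skipped), "skipped,", len(result.failures), "failed")
```

The skipped test still appears in the test report, which makes it harder to forget that the underlying problem is unresolved.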

    Multiple versions of Python on Windows

    If you install Python 2.7 and then realize that some code only runs under Python 2.6, or that you have to roll back, here are a few simple suggestions:

    I did a Google search for:
    python multiple versions windows
    This will show many ways to deal with this problem. One way is to call a little Python script that changes the Windows registry settings.

    Multiple versions of Python have not been a big problem for me, so I favor a very simple approach. The main issue is file extension binding: what program gets called when you double click a py file or type on the command line.

    Changing file extension binding for rollback to Python 2.6

    Under Windows XP you can change file extension bindings in Windows Explorer:
    Under Tools > Folder Options > File Types
    Select the PY extension, press Advanced, then press Change
    Select open and press Edit
    The value is:
    "C:\Python27\python.exe" "%1" %*
    You can change this to use a different interpreter:
    "C:\Python26\python.exe" "%1" %*

    Or even simpler when I want to run the older Python interpreter I just type:
    Instead of typing

    Is Python 2.7 and SciPy 0.9.0 Beta 1 stable enough for NLTK use?

    The installation of all the needed software was fast and unproblematic, but I would certainly not use it in a production environment. If you run a lot of numerical algorithms you should probably hold off. If you are impatient and do not need to do new training, it is worth trying; you can always roll back.

    Friday, November 12, 2010

    Growing Python projects from small to large scale

    You need significantly different principles for developing small, medium and large scale software systems.

    When my project started to become big I searched the Internet for some guidelines or best practices for how to scale Python, but did not find much. Here are a few of my observations on what technique to use for what project sizes.

    General principle

    For a small system you can spend most of your time solving the problem, but the bigger the system gets the more time you spend on project plans, coordination and documentation. The complexity and cost do not scale linearly with the size of the project, but maybe with the square of the size. This holds for different styles of project management, both waterfall and agile.
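
One rough way to see where a quadratic term could come from is to count the module pairs that could depend on each other. The sketch below assumes, purely for illustration, that cost follows that pair count; the function name is hypothetical.

```python
def potential_couplings(module_count):
    """Number of distinct module pairs, n * (n - 1) / 2."""
    return module_count * (module_count - 1) // 2


for size in (3, 10, 40):
    print(size, "modules ->", potential_couplings(size), "possible pairs")
```

Going from 10 to 40 modules multiplies the module count by 4 but the number of possible pairs by more than 17.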

    A central problem is minimizing dependencies and avoiding tight coupling. John Lakos has written an excellent book on software scaling called Large-Scale C++ Software Design; here is a summary. It is a very scientific and stringent approach, which is specific to C++. He developed a metric for how many dependencies you have in your system. His techniques are not a good fit for smaller projects; you could finish several scripts before you could even implement his methodology.

    Small scripts
    Keep it simple. Focus on the core functionality. Minimize the time you spend on setting up the project.

    Medium applications
    Spending some time organizing things will save you time in the long run.

    Large applications
    Here you need a lot of structure; otherwise the project will not be stable.

    Development environment

    Small scripts

    I use PyWin Windows IDE.
    • It is lightweight
    • No need for Java or Eclipse
    • Syntax highlighting
    • Code completion at run time and some at write time
    • Allows primitive debugging
    • You do not need to set up a project to use it.

    Medium applications
    I have used both PyWin and PyDev.

    Large applications
    I would strongly recommend the PyDev Eclipse plugin. It is a modern IDE that runs pylint continuously and has good code completion while you write code. It will find maybe half the errors a compiler would find. This improves the stability a lot and was the most important change that I made from my old coding style.

    Organization of code

    Small scripts
    Use one module / file with all the code in it. This can have several classes. The advantage is that deployment becomes trivial: you just email the script to the user. This works for modules up to around 3000 lines of code.

    Medium applications
    Use one directory with all modules in it. This gives you fewer issues with PYTHONPATH.

    Make a convention for naming field names, database names and parameter names. Put all these names in a module that only contains string constants, and use these in your code instead of raw strings.
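
A minimal sketch of such a constants module and of code that uses it. The module name and every constant below are hypothetical examples, not names from my project.

```python
# names.py (hypothetical module name): only string constants, no logic.
DB_NAME = "nlp_project"
COLLECTION_DOCUMENTS = "documents"
FIELD_TEXT = "text"
FIELD_LABEL = "label"
PARAM_MAX_ROWS = "max_rows"


# Elsewhere in the code base you would import the constants and never
# type the raw strings again.
def make_record(text, label):
    """Build a record keyed by the shared field name constants."""
    return {FIELD_TEXT: text, FIELD_LABEL: label}


record = make_record("great movie", "positive")
print(record)
```

A misspelled constant name is caught by pylint or fails at import time, while a misspelled raw string only shows up as missing data at run time.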

    Use a separate repository for the project. I package the Python code and other self-written executables together in a repository, even when I have another source control system for the compiled sources.

    This works up to around 40 Python modules; then it becomes hard to find anything.

    Large applications
    Read and follow the Python style guide. Previously I followed a Java style guide, since Java is big on coding conventions, but the Python style is actually pretty different. A noticeable difference is that a Java file contains a main class with a title case name and the file has the same name. In Python, modules should have short lowercase names, while classes should still have title case names.

    Organizing packages as an acyclic graph
    Refactor the modules into packages. The packages should be organized as an acyclic graph. So at the lowest level you would have a util package that is not allowed to reference anything else. You can have other specialized packages that can access the util package. Above that I have the main source directory with code that is central and general. Above that I have a loader package that can access all the other packages.
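
The layering can be written down as data and checked for cycles. This is only a sketch: it uses graphlib from the standard library (Python 3.9 or later, so far newer than the Python versions discussed in these posts), and "parsers" and "main" are hypothetical names for a specialized package and the main source directory.

```python
from graphlib import CycleError, TopologicalSorter

# Package -> packages it is allowed to import. "util" and "loader" come from
# the text; "parsers" and "main" are hypothetical names.
ALLOWED_IMPORTS = {
    "util": set(),
    "parsers": {"util"},
    "main": {"util", "parsers"},
    "loader": {"util", "parsers", "main"},
}


def build_order(dependencies):
    """Return packages lowest level first, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as error:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError("package graph has a cycle: %s" % (error.args[1],))


order = build_order(ALLOWED_IMPORTS)
print(order)
```

If util were ever allowed to import loader, build_order would raise instead of returning an order, which is the situation the layering is meant to prevent.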

    One problem when you have different directories is that you need the PYTHONPATH to include all the code. A good way to do this is to try to add the parent directory to the system path before you import any of the modules.
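
A minimal sketch of adding the parent directory to the system path. The helper name and the example path are hypothetical; in a real script you would pass __file__ and call the helper before the project imports.

```python
import os
import sys


def add_parent_to_path(module_file):
    """Put the directory above module_file's own directory on sys.path."""
    parent = os.path.dirname(os.path.dirname(os.path.abspath(module_file)))
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent


# Hypothetical location of a script inside the loader package.
project_root = add_parent_to_path("/projects/nlp/loader/run_all.py")
print(project_root)
```

After this call, packages that live directly under the project root can be imported by name from any script in a subdirectory.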


    Small scripts
    Usually I have:
    • Python docstring in the program. 
    • Print a usage message
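
Both items can come from one place if the docstring is handed to argparse, which was added to the standard library in Python 2.7. A minimal sketch; the program name and the arguments are hypothetical.

```python
"""Count the lines in a text file."""
import argparse


def build_parser(description):
    """Create a parser whose help text starts with the given description."""
    parser = argparse.ArgumentParser(prog="count_lines", description=description)
    parser.add_argument("input", help="path of the text file to read")
    parser.add_argument("--verbose", action="store_true", help="print progress")
    return parser


# Passing the module docstring keeps the documentation and the help text in sync.
parser = build_parser(__doc__)
usage = parser.format_usage()
# An explicit argument list is used here instead of the real command line.
args = parser.parse_args(["notes.txt", "--verbose"])
print(usage.strip())
```

Running the script with -h then prints the usage line, the docstring and the help for each argument.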

    Medium applications
    Have a directory for documentation. To keep it simple I prefer to use simple HTML. I find that Mozilla SeaMonkey is simple to use and generates clean HTML you can do a diff on. Often I have:
    • User documentation page 
    • Programmer documentation page
    • Release notes
    • Example data

    Large applications
    At this point using automatically generated documentation and some sort of wiki format for writing documentation is a good idea.


    Input and output account for a sizable part of your code. I prefer to use the most lightweight method I can get away with.

    Small scripts and medium applications
    Communication is done with flat files, csv files and database.

    Large applications
    Communication is done with flat files, csv files, database, MongoDB and CherryPy.

    MongoDB has dramatically simplified my work. Before, different types of structured data demanded their own database with several tables. Now I just load the data into a MongoDB collection. MongoDB makes very differently structured documents look uniform and trivial to load from Python. After that I can use the same script on very different data.

    When you have a script and find out that you need to have other programs call it, it is very simple to create an XML, JSON or text based RESTful web service using CherryPy. You just add a one-line annotation to a method and it is now a web service. You barely have to make any changes to your program. CherryPy feels very Pythonic. This gives you a very cheap way to connect to a GUI or a web site written in other languages.

    Unit tests

    Small scripts
    Unit tests give you a small advantage. I still write unit tests unless there is an emergency, and then I usually regret it.

    Large applications
    The bigger the system, the more important it is that the individual pieces work. Large systems are not maintainable if you do not have unit tests.

    Source control system

    I put any code that I use for production in a source control system. I usually use Subversion or Git.

    Subversion is good for centralized development, and it is nice that each check-in has a sequential revision number, so that revision 123 is followed by revision 124.

    Git is better for distributed development; it is easy to create a local repository for a project.

    Small scripts
    One repository for each type of script.

    Medium and large applications
    One repository for each project.

    Use of standard libraries

    Small scripts
    Use the simplest approach that gets the work done.

    Large applications

    When my application grew I realized that I recreated functionality from the standard libraries; for instance from these libraries:
    I refactored my program to use the standard library and found that it was much better than what I had written. For bigger applications, using standard libraries makes your code less buggy and more maintainable. So spend some time finding out what has already been written.

    How well does Python scale compared to compiled languages

    There are mixed opinions on this topic. Scripts are generally small and large systems are generally written in compiled languages. The extra checks and rigidity you get from a compiled language are more important the bigger your applications get. If you are writing a financial application and have very low tolerance for errors this could be significant.

    I am using Python for natural language processing: classification, named entity recognition, sentiment analysis and information extraction. I have to write many complex custom scripts fast.

    Based on my earlier experience with writing smaller Python scripts I was concerned about writing a bigger application. I found a good setup with PyDev, unit tests and source control. It gives me much of the stability I am used to in a compiled language, while I can still do rapid development.

    -Sami Badawi

    Friday, October 29, 2010

    Natural language processing in Clojure, Go and Cython

    I work in natural language processing, programming in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I do have some unmet needs. I investigated whether there are any new languages that would help. I only looked at minimal languages that would be simple to learn. The 3 top contenders were Clojure, Go and Cython. Both Clojure and Go have innovative approaches to non-locking concurrency. This is my first impression of working with these languages.

    For contrast let me start by listing the features of my current languages.

    C# 3.5

    C# is an advanced object oriented / functional hybrid language and programming platform:
    • It is fast
    • Great development environment
    • You can do almost any tasks in it
    • Great database support with LINQ to SQL
    • Advanced web development with
    • Advanced GUI toolkit with WPF
    • Good concurrency with threading library
    • Good MongoDB library
    • Works best on Windows
    • Not well suited for rapid development
    While many features of C# are not directly related to NLP they are very convenient. C# has some NLP libraries: SharpNLP is a port of OpenNLP from Java. Lucene has also been ported. The ports are behind the Java implementation, but still give a good foundation.

    Python


    Python is an elegant scripting language, with a strong focus on simplicity.
    • NLTK is a great NLP library
    • Lot of open source math and science libraries
    • PyDev is a good development environment
    • Good MongoDB library
    • Great for rapid development
    • It is interpreted and not very fast
    • Problems with GIL based threading model

      C# vs. Python and unmet needs

      I was not sure what language I would prefer to work with. I suspected that C# would win out with all its advanced features. Due to demand for fast turnaround, I ended up doing more work in Python, and have been very happy with that choice. I have a lot of scripts that can be piped together to create new applications, with the help of the very fast and flexible MongoDB.

      I do have some concerns about Python moving forward:
      • Will it scale if I get really large amount of text
      • Will speed improve on multi core processors
      • Will it work with cloud computing
      • Part of speech tagging is slow

      Java


      Java is a modern object oriented language. Like C# it is a programming platform:
      • Has most NLP libraries: OpenNLP, Mahout, Lucene, WEKA
      • It is fast
      • Great development environment: Eclipse and NetBeans
      • You can do almost any tasks in it
      • Great database support with JDBC and Hibernate
      • Many web development frameworks
      • Good GUI toolkit: Swing and JavaFX
      • Good concurrency with threading library
      • Functional style programming is clumsy
      • Working with MongoDB is clumsy
      • Java code is verbose

      I would not hesitate using Java for NLP, but my company is not a Java shop.

      Clojure


      Clojure was released in 2007. It is a right-sized LISP: not very big like Common LISP or very small like Scheme.
      • Gives easy access to Java libraries: OpenNLP, Mahout, Lucene, WEKA, OpinionFinder
      • Innovative non locking concurrency primitives
      • Good IDEs in Eclipse and NetBeans
      • Easy to work with
      • Code and data is unified
      • Interactive REPL
      • LISP is the classic artificial intelligence language
      • If you need speed you can write Java code
      • Good MongoDB library
      • The IDE is not working as well as IDEs for Java or C#

        Clojure is minimal in the sense that it is built on around 10 primitive programming constructs. The rest of the language is constructed with macros written in Clojure.

        Once I got Clojure installed it was easy to work with and program in. Most of the good features of Python also apply to Clojure: it is minimal and has batteries included. Still I think that Python is a simpler language than Clojure.

        Use case
        Clojure is a good way to script the extensive Java libraries, for rapid development. It has more natural interaction with MongoDB than Java.

        Clojure OpenNLP

        The clojure-opennlp project is a thin Clojure wrapper around OpenNLP. It came with all the corpora used as training data for OpenNLP nicely packaged and it works well. You can script OpenNLP approximately as tersely as NLTK, from an interactive REPL.

        I tried it in both Eclipse and NetBeans. They seem somewhat equal in number of features. I had a little better luck with the Eclipse version.

        clojure-opennlp uses a Maven build system but has a nontraditional directory layout. This caused problems for both Eclipse and NetBeans; they both took some configuration.

        Eclipse Counterclockwise
        The Counterclockwise instructions for labrepl mainly worked for installing clojure-opennlp.
        When you are done you have to go in and add the example directory to the source directories under properties.

        NetBeans Enclojure
        I imported the project. I had to move the Clojure file from example directory to a different position to get it to work.

        Maven plugins for Clojure
        The standard Maven directory layout has several advantages, e.g. if you want to mix Java and Clojure code. I created my own Maven POM configuration file, based on examples from other Clojure Maven projects. They used Clojure plugins for Maven, which I could not get to work. Eventually I ripped these plugins out and was left with a very plain POM file that worked.

        Go / Golang

        Go was announced in November 2009. It was created by Google to deal with multicore and networked machines. It feels like a mixture of Python and C. It is a very clean and minimal language.
        • It is fast
        • Good standard library
        • Excellent support for concurrency
        • It is trivial to write your own load balancer
        • The Eclipse IDE is in an early stage
        • Debugger is not working
        • Windows port is not done and has just been released
        It was hard to find the right Go Windows port; there are several Go Windows port projects with no code.

        Use cases
        I currently have a problem when downloading a lot of HTML pages and parsing them to a tree structure. This does not have the best support in C#. I found a library that translates HTML to XHTML and then I can use LINQ to process it. The library is not documented, not very fast and fails for some HTML files.

        Go comes with an HTML library that parses HTML 5. It is simple to write a program with some threads that download and others that parse the files into a DOM tree structure.
        I would use Golang for loading large amounts of text in a cloud computing environment.

        Cython


        Cython was released in July 2007. It is a static compiler to write Python extension modules in a mixture of Python and C.

        Process for using Cython
        • Start by writing normal Python code
        • Find modules that are too slow
        • Add static types
        • Compile it with Cython using the setup tool
        • This produces compiled modules that can be used with normal Python
        • It is still more complex than normal Python code
        • You need to know C to use it
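
The second step above, finding the code that is too slow, can be sketched with the standard cProfile module before any Cython is involved. The tokenizer below is a hypothetical stand-in for a slow NLP routine, and the helper name is also made up for this example.

```python
import cProfile
import io
import pstats


def slow_tokenize(text):
    """Deliberately naive tokenizer standing in for a slow NLP routine."""
    tokens = []
    for word in text.split():
        tokens.append("".join(ch for ch in word.lower() if ch.isalnum()))
    return tokens


def profile_report(func, *args):
    """Run func under the profiler and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


report = profile_report(slow_tokenize, "The quick brown fox " * 1000)
print(report)
```

The functions at the top of the report are the candidates for static types and compilation.
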
        I was surprised how simple it was to get it working under both Windows and Linux. I did not have to mess with make files or configure the compiler. Cython integrated well with NumPy and SciPy. This expands the programming tasks you can do with Python substantially.

        Use cases
        Speed up slow POS tagging.

          My previous language experience

          Over the years I have experimented with a long list of non mainstream languages: functional, declarative, parallel, array, dynamic and hybrid languages. Many of these were frustrating experiences. I would read about a new language and get very excited. However this would often be the chain of events:
          • Download language
          • Install Cygwin
          • Find out how the language's build system works
          • Try to find a version of the GCC compiler that will compile it
          • Get the right version of Emacs installed
          • Try to get the debugger working under Emacs
          • Start programming from scratch since the libraries were sparse
          • Burn out

          You only have so much mental capacity, and if you do not use a language you forget it. Only Scala made it into my toolbox.

          Do Clojure, Go or Cython belong in your programmer's toolbox

          Clojure, Go and Cython are all simple languages. They are easy to install and easy to learn, and they all have big standard libraries, so you can be productive in them right away. This is my first impression:
          • Clojure is a good way to script the extensive Java libraries, for rapid application development and for AI work.
          • Go is a great language but it is still rough around the edges. There are not any big NLP libraries written for Go yet. I would not try to use it for my main NLP tasks.
          • Cython was useful right away for my NLP work. It makes it possible to do fast numerical programming in Python without too much trouble.

          -Sami Badawi

          Wednesday, June 23, 2010

          Orange, R, RapidMiner, Statistica and WEKA

          Review of open source and cheap software packages for Data Mining

          This blog posting compares the following tools, after I worked with them for 2 months and used them to solve a real data mining problem:
          • Orange
          • R
          • RapidMiner
          • Statistica 8 with Data Miner module
          • WEKA
          Statistica is commercial; all the others are open source. There is also a brief mention of the following Python libraries: mlpy, ffnet, NLTK.

          Summary of first impression

          This is a follow up on my previous post R, RapidMiner, Statistica, SSAS or WEKA describing my impression of the following software packages after using them for a couple of days each:
          • R
          • RapidMiner
          • SciPy
          • SQL Server Analysis Services, Business Intelligence Development Studio
          • SQL Server Analysis Services, Table Analysis Tool for Excel
          • Statistica 8 with Data Miner module
          • WEKA
          Let me summarize what I found:

          SciPy did not have what I needed. However I found a few other good Python-based solutions: Orange, mlpy, ffnet and NLTK.

          The SSAS-based solutions held promise due to their close integration with Microsoft products, but I found them to be too closely tied to data warehouses so I postponed exploring them.

          Statistica and RapidMiner had a lot of functionality and were polished, but the many features were overwhelming.

          R was harder to get started with and WEKA was less polished, so I did not spend too much time on them.

          Comparison matrix

          In order to compress my current findings I am summarizing them in this matrix. The matrix is based only on limited work with the different software packages and is not very accurate. The categories are:
          Documentation; GUI and graphics; how polished the package is; ease of learning; controlling the package from a script or program; how many machine learning algorithms are available:

          Package      Documentation  GUI  Polished  Learning  Scripting  Algorithms
          Python libs  1              1    1         3         3          2

          Criteria for software package comparison

          The comparison is based on a real data mining task that is relatively simple:
          • Supervised learning for categorization.
          • Over 200 attributes mainly numeric but 2 categorical / text.
          • One of the categorical attributes is the most important predictor.
          • Data is clean, so no need to clean outliers and missing data.
          • Accuracy is a good metric.
          • GUI with good graphic to explore the data is a plus.
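
Since accuracy under cross validation is the yardstick used throughout this comparison, here is a minimal pure Python sketch of how such a number is computed. The labels are invented for illustration, and the majority class guess is a hypothetical baseline that stands in for a real model from any of the packages.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def majority_class(labels):
    """Baseline model: always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validated_accuracy(labels, k=5):
    """Fraction of rows predicted correctly across all k test folds."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        guess = majority_class([labels[i] for i in train])
        correct += sum(1 for i in test if labels[i] == guess)
    return correct / len(labels)


# Invented labels: 8 rows of one category and 2 of another.
labels = ["spam"] * 8 + ["ham"] * 2
accuracy = cross_validated_accuracy(labels, k=5)
print(accuracy)
```

A baseline like this is a useful sanity check: a model is only interesting if its cross validated accuracy clearly beats always guessing the most common category.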

          General observations

          The most popular data mining packages in the industry are SAS and SPSS, but they are quite expensive. Orange, R, RapidMiner, Statistica and WEKA can all be used for doing real data mining work, although some of them are unpolished.

          There was a similar learning curve for most of the programs. Most programs took me a few days to get working, between the documentation and experimenting.

          I had to reformulate my original problem. Neural network models did not work well on my categorical / text attributes. Statistica produced an accuracy of 90%, while RapidMiner produced an accuracy of 82%.
          I replaced the 2 categorical attributes with a numeric attribute and accuracy of the best model increased to around 97%, and was much more uniform between the different tools.

          Orange


          Orange is an open source data mining package built on Python, NumPy, wrapped C, C++ and Qt.
          • Works both as a script and with an ETL work flow GUI.
          • Shortest script for doing training, cross validation, algorithms comparison and prediction.
          • I found Orange the easiest tool to learn.
          • Cross platform GUI.
          • Not super polished.
          • The install is big since you need to install Qt.

          Python libs: ffnet, NumPy, mlpy, NLTK

          A few Python libs deserve to be mentioned here: ffnet, NumPy, mlpy and NLTK.
          • If you do not care about the graphic exploration, you can set up an ffnet neural network in a few lines of code.
          • There are several machine learning algorithms in mlpy.
          • The machine learning in NLTK is very elegant if you have a text mining or NLP problem.
          • The libraries are self contained.
          • Limited list of machine learning algorithms.
          • Machine learning is not handled uniformly between the different libraries.

          R


          R is an open source statistical and data mining package and programming language.
          • Very extensive statistical library.
          • It is a powerful elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
          • I was able to make a working machine learning program in just 40 lines of code.
          • Less specialized towards data mining.
          • There is a steep learning curve, unless you are familiar with array languages.

          R vs. Orange written in Python

          Python and R have a lot in common: they are both elegant, minimal, interpreted languages with good numeric libraries. Still they have a different feel. So I was interested in seeing how they compared.
          Orange / Python advantages
          • R is quite different from common programming languages.
          • Python is easier for most programmers to learn.
          • Python has a better debugger.
          • Scripting data mining categorization problems is simpler in Orange.
          • Orange also has an ETL work flow GUI.
          R advantages
          • R is even more minimal than Python.
          • Numerical programming is better integrated in R; in Python you have to use the external packages NumPy and SciPy.
          • R has better graphics.
          • R is more transparent, since the Orange classes are wrapped C++ classes.
          • Easier to combine with other statistical calculations.
          I made a small script to solve my data mining problem in both Orange and R. This was my impression:

          If all you want to do is solve a categorization problem, I found Orange to be simpler. You have to become very familiar with how Orange reads the spreadsheet and with the different attribute types, notably the Meta attribute.

          Import and export of data from spreadsheets is easier in R; spreadsheets are stored in data frames that the different machine learning algorithms operate on. Programming in R really is very different: you are working at a higher abstraction level, but you do lose control over the details.

          RapidMiner


          RapidMiner is an open source statistical and data mining package written in Java.
          • Solid and complete package.
          • It easily reads and writes Excel files and different databases.
          • You program by piping components together in graphical ETL work flows.
          • If you set up an illegal work flow, RapidMiner suggests Quick Fixes to make it legal.
          • I only got it to work under Windows, but others have gotten it to work in other environments, see comment below.
          • There are a lot of different ETL modules; it took a while to understand how to use them.
          • At first I had a hard time making a comparison between different models. Eventually I found a way: you choose a cross validation and select different models one by one. When you run the models they will all be stored on the result page and you can do the comparison there.

          Statistica 8

          Statistica is a commercial statistics and data mining software package for Windows.
          There is a 90 day trial for Statistica 8 with data miner module in the textbook:
          Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
          • Generally very polished and good at everything, but it is also the only non open source program.
          • High accuracy even when I gave it bad input.
          • You can script everything in Statistica in VB.
          • Cheap compared to SPSS and SAS.
          • So many options that it was hard to navigate the program.
          • The most important video about Data Miner Recipes is the very last out of 36.
          • Cost of Statistica is not available on their website.
          • It is cheap in a corporate setting, but not for private use.

          WEKA


          WEKA is an open source statistical and data mining library written in Java.
          • A lot of machine learning algorithms.
          • Easy to learn and use.
          • Good GUI.
          • Platform independent.
          • Worse connectivity to Excel spreadsheet and non Java based databases.
          • CSV reader not as robust as in RapidMiner.
          • Not as polished.

          RapidMiner vs. WEKA

          The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
          • Written in Java.
          • Free / open source software with GPL license.
          • RapidMiner includes many learning algorithms from WEKA.
          My first thought was that RapidMiner has everything that WEKA has, plus a lot of other functionality, and is more polished. Therefore I did not spend too much time on WEKA. For the sake of completeness I took a second look at WEKA, and I have to say that it was a lot easier to get WEKA to work. Sometimes less is more; it depends on what is more important to you, functionality or ease of use.


          There are several good and very different solutions. Let me finish by listing the strongest aspect of each tool:

          Orange has elegant and concise scripting and can also be run in an ETL GUI mode.
          R has elegant and concise scripting integrated with a vast statistical library.
          RapidMiner has a lot of functionality, is polished and has good connectivity.
          Statistica is the most polished product, and generally performed well in all categories. It gave good results even when I gave it bad input.
          WEKA is the easiest GUI to learn and use.

          -Sami Badawi

          Thursday, April 29, 2010

          R, RapidMiner, Statistica, SSAS or WEKA

          Choosing cheap software packages to get started with Data Mining

          You have a data mining problem and you want to try to solve it with a data mining software package. The most popular packages in the industry are SAS and SPSS, but they are quite expensive, so you might have a hard time convincing your boss to purchase them before you have produced impressive results.

          When I needed data mining or machine learning algorithms in the past, I would program them from scratch and integrate them in my Java or C# code. But recently I needed a more interactive graphics environment to help with what is called the Data Understanding phase in CRISP-DM. I also wanted a way to compare the predictive accuracy of a broad array of algorithms, so I tried out several packages:
          • R
          • RapidMiner
          • SciPy
          • SQL Server Analysis Services, Business Intelligence Development Studio
          • SQL Server Analysis Services, Table Analysis Tool for Excel
          • Statistica 8 with Data Miner module
          • WEKA

          Disclaimer for review

          Here is a review of my first impression of these packages. A first impression is not the best indicator of what is going to work for you in the long run. I am sure that I have missed many features. Still, I hope this can save you some time finding a solution that will work for your problem.


          R is an open source statistical and data mining package and programming language.
          • Very extensive statistical library.
          • Very concise for solving statistical problems.
          • It is a powerful elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.
          • In a few lines you can set up an R program that does data mining and machine learning.
          • You have full control.
          • It is easier to integrate this into a work flow with your other programs. You just spawn an R program and pass input in and read output from a pipe.
          • Good plotting functionality.
          • Less interactive GUI.
          • Less specialized towards data mining.
          • Language is pretty different from current mainstream languages like C, C#, C++, Java, PHP and VB.
          • There is a learning curve, unless you are familiar with array languages.
          • R dates back to the early 1990s.
          Link: Screencast showing how a trained R user can generate a PMML neural network model in 60 seconds.
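The pipe-based integration mentioned in the list above can be sketched in a few lines. This is only an illustration written in Python 3, not the code I used: the helper name run_through_pipe is made up for this example, and the R script named in the comment is hypothetical.

```python
import subprocess


def run_through_pipe(command, input_text):
    """Start `command`, write input_text to its stdin and return its stdout as text."""
    result = subprocess.run(
        command,
        input=input_text,      # sent to the child process through a pipe
        capture_output=True,   # collect stdout and stderr through pipes
        text=True,             # work with str instead of bytes
        check=True,            # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Hypothetical usage, assuming Rscript is on the PATH and summarize.R
# reads CSV from standard input and prints its result:
# report = run_through_pipe(["Rscript", "summarize.R"], csv_text)
```

The same helper works for any command line program, which is the point of the approach: the calling program only needs to agree with the R script on a text format.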


          RapidMiner is an open source statistical and data mining package written in Java.
          • A lot of data mining algorithms.
          • Feels polished.
          • Good graphics.
          • It easily reads and writes Excel files and different databases.
          • You program by piping components together in graphical ETL workflows.
          • If you set up an illegal workflow, RapidMiner suggests Quick Fixes to make it legal.
          • Good video tutorials / European dance parties. *:o)
          • I only got it to work under Windows, but others have gotten it to work in other environments.
          • Harder to compare different algorithms than WEKA.


          SciPy is an open source Python wrapper around numerical libraries.
          • Good for mathematics.
          • Python is a simple, elegant and mature language.
          • Data mining part is too immature.
          • Too much duct tape.

          SQL Server Business Intelligence Development Studio

          Microsoft SQL Server Analysis Services comes with a data mining service.
          If you have access to SQL Server 2005 or later with SSAS installed, you can use some of the data mining algorithms for free. If you want to scale it can become quite expensive.
          • If you are working with the Microsoft stack, this integrates well.
          • Good data mining functionality.
          • Organized well.
          • Comes with some graphics.
          • The machine learning is closely tied to data warehouses and cubes. This makes the learning curve steeper and deployment harder.
          • Documentation about using the BIDS GUI was hard to find. I looked in several books and several videos.
          • I need to do my data mining from within a web server or a command line program. For this you need to access the models using: Analysis Management Objects (AMO). Documentation for this was also hard to find.
          • You need good cooperation from your DBA, unless you have your own instance of SQL Server.
          • If you want to evaluate the performance of your predictive model, cross-validation is available only in SQL Server 2008 Enterprise.
          Link: Good screencast about data mining with SSAS.

          SQL Server Analysis Services, Table Analysis Tool Excel

          The Microsoft Excel data mining plug-in depends on SQL Server 2008 and Excel 2007.
          • This takes less interaction with the database and DBA than the Development Studio.
          • A lot of users have their data in Excel.
          • There is an Analysis ribbon / menu that is very simple to use, even for users with a very limited understanding of data mining.
          • The Machine Learning ribbon has more control over internals of the algorithms.
          • You can run with huge amounts of data since the number crunching is done on the server.
          • This also needs a connection to a SQL Server 2008 instance with Analysis Services running, despite the data mining algorithms being relatively simple.
          • You need a special database inside Analysis Services that you have write permissions to.

          Link: Excel Table Analysis Tool video

          Statistica 8

          Statistica is a commercial statistics and data mining software package.
          There is a 90 day trial for Statistica 8 with data miner module in the textbook:
          Handbook of Statistical Analysis and Data Mining Applications. There is also a free 30 day trial.
          • Statistica is cheaper than SAS and SPSS.
          • Six hours of instructional videos.
          • Data Miner Recipes wizard is the easiest tool for a beginner.
          • A lot of data mining algorithms.
          • GUI with a lot of functionality.
          • You program using menus and wizards.
          • Good graphics.
          • Easy to find and clean up outliers and missing data attributes.
          • Overwhelming number of menu items.
          • The most important video about Data Miner Recipes is the very last.
          • Cost of Statistica is not available on their website.
          • It is cheap in a corporate setting, but not for private use.


          WEKA is an open source statistical and data mining library written in Java.
          • Many machine learning packages.
          • Good graphics.
          • Specialized for data mining.
          • Easy to work with.
          • Written in pure Java so it is multi platform.
          • Good for text mining.
          • You can train different learning algorithms at the same time and compare their results.
          RapidMiner vs WEKA:
          The most similar data mining packages are RapidMiner and WEKA. They have many similarities:
          • Written in Java.
          • Free / open source software with GPL license.
          • RapidMiner includes many learning algorithms from WEKA.
          Therefore the issue with WEKA is really how it compares to RapidMiner.
          Issues compared to RapidMiner:
          • Worse connectivity to Excel spreadsheet and non Java based databases.
          • CSV reader not as robust.
          • Not as polished.

          Criteria for software package comparison

          My current data mining needs are relatively simple. I do not need the most sophisticated software packages. This is what I need now:
          • Supervised learning for categorization.
          • Over 200 features mainly numeric but 2 categorical.
          • Data is clean so no need to clean outliers and missing data.
          • Avoiding every mistake is not critical.
          • Equal cost for type 1 and type 2 errors.
          • Accuracy is a good metric.
          • Easy to export model to production environment.
          • Good GUI with good graphics to explore the data.
          • Easy to compare a few different models e.g. boosted trees, naive Bayes, neural network, random forest and support vector machine.
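To make the accuracy and model comparison criteria above concrete, here is a minimal sketch of k-fold cross validation in plain Python 3. It is not how any of the reviewed packages implement it. All function names are invented for this example, and the two "models" are toy stand-ins for real learning algorithms.

```python
def cross_val_accuracy(train_fn, rows, labels, k=5):
    """Fraction of rows predicted correctly when each row is held out exactly once.

    train_fn(rows, labels) must return a predict(row) function.
    """
    n = len(rows)
    correct = 0
    for i in range(k):
        # Interleaved folds: fold i holds out rows i, i + k, i + 2k, ...
        held_out = set(range(i, n, k))
        train_rows = [r for j, r in enumerate(rows) if j not in held_out]
        train_labels = [y for j, y in enumerate(labels) if j not in held_out]
        predict = train_fn(train_rows, train_labels)
        correct += sum(1 for j in held_out if predict(rows[j]) == labels[j])
    return correct / n


def train_majority(rows, labels):
    """Baseline model: always predict the most common training label."""
    # sorted() makes ties deterministic: the alphabetically first label wins.
    most_common = max(sorted(set(labels)), key=labels.count)
    return lambda row: most_common


def train_nearest_neighbour(rows, labels):
    """Toy model: predict the label of the closest training row."""
    def predict(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, r)) for r in rows]
        return labels[dists.index(min(dists))]
    return predict


# Made-up data: one numeric feature, two classes.
rows = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
labels = ["a", "a", "a", "b", "b", "b"]
for name, train_fn in [("majority", train_majority),
                       ("nearest neighbour", train_nearest_neighbour)]:
    print(name, cross_val_accuracy(train_fn, rows, labels, k=3))
```

Because type 1 and type 2 errors cost the same here, plain accuracy is enough. With unequal costs you would count the two error types separately instead.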


          I did not have time to test all the tools enough for a real review. I was only trying to determine what data mining software packages to try first.
          Try first list
          1. Statistica: Most polished, easiest to get started with. Good graphics and documentation.
          2. RapidMiner: Polished. Simplest and most uniform GUI. Good graphics. Open source.
          3. WEKA: A little unpolished. Good functionality for comparing different data mining algorithms.
          4. SSAS Table Analysis Tool, Data Mining ribbon: Showed promise, but I did not get it to do what I need.
          5. SSAS BIDS: Close tie to cube and data warehouse. Hard to find documentation about AMO programming. Could possibly give best integration with C# and VB.NET.
          6. SSAS Table Analysis Tool, Analysis ribbon: Simple to use but does not have the functionality I need.
          7. R: Not specialized towards data mining. Elegant but different programming paradigm.
          8. SciPy: Data mining library too immature.
          Both RapidMiner and Statistica 8 do what I need now. So far I have found it easier to find functions using Statistica's menus and wizards than RapidMiner's ETL workflows, but RapidMiner is open source. Still, I would not be surprised if I ended up using more than one package.

          Preliminary quality comparison of Statistica and RapidMiner

          I ran my predictive modeling task in both Statistica and RapidMiner. In the first match the model that performed best in Statistica was a neural network, with an error rate of approximately 10%.

          When I ran the neural network in RapidMiner, the error rate was approximately 18%. I was surprised by the big difference. The reason is probably that one of my most important attributes is categorical with many values, and neural networks do not work well with that. Statistica might have performed better due to more hidden layers.

          The second time I ran my predictive model, Statistica had some numeric overflow for the neural network and there were missing prediction values. This also surprised me. I would expect that there could be problems with the training of the neural network, but not with the calculation of an input on a trained model.

          These problems can easily be the result of my unfamiliarity with the software packages, but this was my first impression.

          Link to my follow up post that is based on solving an actual data mining problem in Orange, R, RapidMiner, Statistica and WEKA after working with them for 2 months.

          -Sami Badawi

          Saturday, April 3, 2010

          Data Mining rediscovers Artificial Intelligence

          Artificial intelligence started in the 1950s with very high expectations. AI did not deliver on the expectations and fell into decades long discredit. I am seeing signs that Data Mining and Business Intelligence are bringing AI into mainstream computing. This blog posting is a personal account of my long struggle to work in artificial intelligence during different trends in computer science.

          In the 1980s I was studying mathematics and physics, which I really enjoyed. I was concerned about my job prospects; there are not many math or science jobs outside of academia. Artificial intelligence seemed equally interesting but more practical, and I thought that it could provide me with a living wage. Little did I know that artificial intelligence was about to become an unmentionable phrase that you should not put on your resume if you wanted a paying job.

          Highlights of the history of artificial intelligence

          • In 1956 AI was founded.
          • In 1957 Frank Rosenblatt invented the Perceptron, the first generation of neural networks. It was based on the way the human brain works, and provided simple solutions to some simple problems.
          • In 1958 John McCarthy invented LISP, the classic AI language. Mainstream programming languages have borrowed heavily from LISP and are only now catching up with LISP.
          • In the 1960s AI got lots of defense funding, especially for military software translating from Russian to English.
          AI theory made quick advances and a lot was developed early on. AI techniques worked well on small problems. It was expected that AI could learn, using machine learning, and this soon would lead to human like intelligence.

          This did not work out as planned. The machine translation did not work well enough to be usable. The defense funding dried up. The approaches that had worked well for small problems did not scale to bigger domains. Artificial intelligence fell out of favor in the 1970s.

          AI advances in the 1980s

          When I started studying AI, it was in the middle of a renaissance and I was optimistic about recent advances:
          • The discovery of new types of neural networks, after Perceptron networks had been discredited in an article by Marvin Minsky
          • Commercial expert systems were thriving
          • The Japanese Fifth Generation Computer Systems project, written in the new elegant declarative Prolog language, had many people in the West worried
          • Advances in probability theory: Bayesian networks / causal networks
          In order to combat this brittleness of intelligence Doug Lenat started a large scale AI project CYC in 1984. His idea was that there is no free lunch, and in order to build an intelligent system, you have to use many different types of fine tuned logical inference; and you have to hand encode it with a lot of common sense knowledge. Cycorp spent hundreds of man years building their huge ontology. Their hope was that CYC would be able to start learning on its own, after training it for some years.

          AI in the 1990s

          I did not lose my patience, but other people did, and AI went from the technology of the future to yesterday's news. It had become a loser that you did not want to be associated with.

          During the Internet bubble, when venture capital funding was abundant, I was briefly involved with an AI Internet start up company. The company did not take off; its main business was emailing discount coupons out to AOL customers. This left me disillusioned, thinking that I just had to put on a happy face when I worked on the next web application or trading system.

          AI usage today

          Even though AI stopped being cool, regular people are using it in more and more places:
          • Spam filter
          • Search engines use natural language processing
          • Biometric, face and fingerprint detection
          • OCR, check reading in ATM
          • Image processing in coffee machine detecting misaligned cups
          • Fraud detection
          • Movie and book recommendations
          • Machine translation
          • Speech understanding and generation in phone menu system

          Euphemistic words for AI techniques

          The rule seems to be that you can use AI techniques as long as you call them something else, e.g.:
          • Business Intelligence
          • Collective Intelligence
          • Data Mining
          • Information Retrieval
          • Machine Learning
          • Natural Language Processing
          • Predictive Analytics
          • Pattern Matching

          AI is entering mainstream computing now

          Recently I have seen signs that AI techniques are moving into mainstream computing:
          • I went to a presentation for SPSS statistical modeling software, and was shocked by how many people are now using data mining and machine learning techniques. I was sitting next to people working in a prison, an adoption agency, marketing and a disease prevention NGO.
          • I started working on a data warehouse using SQL Server Analysis Services, and found that SSAS has a suite of machine learning tools.
          • Functional and declarative techniques are spreading to mainstream programming languages.

          Business Intelligence compared to AI

          Business Intelligence is about aggregating a company's data into an understandable format and analyzing it to provide better business decisions. BI is currently the most popular field using artificial intelligence techniques. Here are a few words about how it differs from AI:
          • BI is driven by vendors instead of academia
          • BI is centered around expensive software packages with a lot of marketing
          • The scope is limited, e.g. find good prospective customers for your products
          • Everything is living in databases or data warehouses
          • BI is data driven
          • Reporting is a very important component of BI

          Getting a job in AI

          I recently made a big effort to steer my career towards AI. I started an open source computer vision project, ShapeLogic, and put AI back on my resume. A head hunter contacted me and asked if I had any experience in Predictive Analytics. It took me 15 minutes to convince her that Predictive Analytics and AI were close enough that she could forward my resume. I got the job, my first real AI and NLP job.

          The work I am doing is not dramatically different from normal software development work. I spend less time on machine learning than on getting AJAX to work with C# ASP.NET for the web GUI, or on upgrading the database ORM from ADO.NET strongly typed datasets to LINQ to SQL. However, it was very gratifying to see my program start to perform a task that had been very time consuming for the company's medical staff.

          Is AI regaining respect?

          No, not now. There are lots of job postings for BI and data mining but barely any for artificial intelligence. AI is still not a popular word, except in video games where AI means something different. When I worked as a games developer what was called AI was just checking if your character was close to an enemy and then the enemy would start shooting in your character's direction.

          After 25 long years of waiting I am very happy to see that AI techniques have finally become a commodity, and I enjoy working with them even if I have to disguise this work with whatever the buzzword of the day is.

          -Sami Badawi

          Wednesday, March 17, 2010

          SharpNLP vs NLTK called from C# review

          C# and VB.NET have fewer open source NLP libraries than languages like C++, Java, LISP and Perl. My last blog post, Open Source NLP in C# 3.5 using NLTK, is about calling NLTK, which is written in Python, from IronPython embedded under C# or VB.NET.

          An alternative is to use SharpNLP, which is the leading open source NLP project written in C# 2.0. SharpNLP is not as big as other Open Source NLP projects. This blog posting is a short comparison of SharpNLP and NLTK embedded in C#.


          NLTK has excellent documentation, including an introductory online book on NLP and Python programming.

          For SharpNLP the source code is the documentation. There is also a short introductory article by SharpNLP's author Richard J. Northedge.

          Ease of learning

          NLTK is very easy to work with under Python, but integrating it as embedded IronPython under C# took me a few days. It is still a lot simpler to get Python and C# to work together than Python and C++.

          SharpNLP's lack of documentation makes it harder to use; but it is very simple to install.

          Ease of use

          NLTK is great to work with in the Python interpreter.

          SharpNLP simplifies life by not having to deal with the embedding of IronPython under C# and the mismatch between the two languages.

          Machine learning and statistical models

          NLTK comes with a variety of machine learning and statistical models: decision trees, naive Bayes, and maximum entropy. They are very easy to train and validate, but do not perform well on large data sets.

          SharpNLP is focused on maximum entropy modeling.

          Tokenizer quality

          NLTK has a very simple RegEx based tokenizer that works well in most cases.

          SharpNLP has a more advanced maximum entropy based tokenizer that can split "don't" into "do | n't". On the other hand it sometimes makes errors and splits a normal word into 2 words.
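The difference between the two tokenizing styles can be illustrated with a small sketch using only Python's re module. This is neither NLTK's nor SharpNLP's actual code. The pattern and both function names are assumptions made up for this example, and the contraction split is a fixed rule, not a maximum entropy model.

```python
import re

# Hypothetical pattern: a word with an optional internal apostrophe,
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(text):
    """RegEx based tokenizing: keeps "don't" as one token."""
    return TOKEN_PATTERN.findall(text)


def split_contractions(tokens):
    """Rule based imitation of the "do | n't" split."""
    result = []
    for token in tokens:
        if len(token) > 3 and token.lower().endswith("n't"):
            result.extend([token[:-3], token[-3:]])
        else:
            result.append(token)
    return result


tokens = simple_tokenize("I don't know.")
print(tokens)                      # ['I', "don't", 'know', '.']
print(split_contractions(tokens))  # ['I', 'do', "n't", 'know', '.']
```

A fixed rule like this never splits a normal word by mistake, but it also only handles the cases you thought of in advance, which is the trade-off against a trained tokenizer.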

          Development community

          NLTK has an active development community, with an active mailing list.

          SharpNLP's last release was in December 2006. It is a port of the Java based OpenNLP, and can read models from OpenNLP. SharpNLP has a low volume mailing list.

          Code quality

          NLTK lets you write programs that read from web pages, clean HTML out of text and do machine learning in a few lines of code.

          SharpNLP is written in C# 2.0 using generics. It is a port from OpenNLP and maintains a Java flavor, but it is still very readable and pleasant to work with.


          NLTK's license is Apache License, Version 2.0, which should fit most people's needs.

          SharpNLP's license is LGPL 2.1. This is a versatile license, but maybe a little harder to work with when the project is not active.


          NLTK comes with a theorem prover for reasoning about semantic content of text.

          SharpNLP comes with a name, organization, time, date and percentage finder.
          It is very simple to add an advanced GUI, using WPF or WinForms.


          Both packages come with a lot of functionality. They both have weaknesses, but they are definitely usable. I have both SharpNLP and embedded NLTK in my NLP toolbox.

          -Sami Badawi

          Thursday, March 11, 2010

          Open Source NLP in C# 3.5 using NLTK

          I am working on natural language processing algorithms in a C# 3.5 environment. I did not find any open source NLP packages for C# or VB.NET.
          NLTK is a great open source NLP package written in Python. It comes with an online book. I decided to try to embed IronPython under C# and run NLTK from there. Here are a few thoughts about the experience.

          Problems with embedding IronPython and NLTK

          • Some libraries that NLTK uses are not installed in IronPython, e.g. zlib and numpy; you can mostly patch this up
          • You need a good understanding of how embedded IronPython works
          • The connection between Python and C# is not seamless
          • Sending data between Python and C# takes work
          • NLTK is pretty slow at starting up
          • Doing large scale machine learning in NLTK is slow

          C# and IronPython

          IronPython is a very good implementation of Python, but in C# 3.5 there is still a mismatch between C# and Python; this becomes an issue when you are dealing with a library as big as NLTK.
          The integration between IronPython and C# is going to improve with C# 4.0. How much remains to be seen.

          To embed or not to embed

          When is embedding IronPython and NLTK inside C# a good idea?

          Separate processes for NLTK under CPython and C#

          If your C# tasks and your NLP tasks are not interacting too much, it might be simpler to have a C# program call an NLP CPython program as an external process. E.g. you want to analyze the content of a Word document. You would open the Word document in C#, create a Python process, pipe the text into it, read the result back in JSON or XML, and display it in ASP, WPF or WinForms.
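The Python side of such a setup could look roughly like the sketch below, written in Python 3. It reads text from standard input and writes one JSON object to standard output. The analyze function is a toy stand-in for real NLTK work, and all names here are invented for this example.

```python
import json
import re
import sys


def analyze(text):
    """Toy stand-in for real NLP work: count words and list the most frequent ones."""
    words = re.findall(r"\w+", text.lower())
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # Sort by descending count, then alphabetically, so the output is stable.
    top_words = sorted(counts, key=lambda w: (-counts[w], w))[:3]
    return {"word_count": len(words), "top_words": top_words}


def run(in_stream=sys.stdin, out_stream=sys.stdout):
    """Read all text from in_stream and write the analysis as one JSON object."""
    json.dump(analyze(in_stream.read()), out_stream)
    out_stream.write("\n")


# A script started by the C# program would end with a call to run(),
# so the text arrives on stdin and the JSON leaves on stdout.
```

Keeping the work in analyze and the I/O in run makes it possible to test the analysis without starting a process. The C# program only has to agree with the script on the JSON field names.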

          Small NLP tasks

          There is a learning curve for both NLTK and embedded IronPython that slows you down when you start work.

          Medium sized NLP projects

          The setup cost is not an issue so embedding IronPython and NLTK could work very well here.

          Big NLP projects

          The setup cost is not an issue, but at some point the mismatch between Python and C# will start to outweigh the advantages you get.

          Prototyping in NLTK

          Start writing your application in NLTK either under CPython or IronPython. This should improve development time substantially. You might find that your prototype is good enough and you do not need to port it to C#; or you will have a working program that you can port to C#.


          -Sami Badawi