Friday, October 29, 2010

Natural language processing in Clojure, Go and Cython

I work in natural language processing, programming in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I do have some unmet needs, so I investigated whether any newer languages would help. I only looked at minimal languages that would be simple to learn. The top 3 contenders were Clojure, Go and Cython. Clojure and Go both have innovative approaches to non-locking concurrency. These are my first impressions of working with these languages.

For contrast, let me start by listing the features of my current languages.

C# 3.5

C# is an advanced object-oriented / functional hybrid language and programming platform:
  • It is fast
  • Great development environment
  • You can do almost any task in it
  • Great database support with LINQ to SQL
  • Advanced web development with ASP.NET
  • Advanced GUI toolkit with WPF
  • Good concurrency with threading library
  • Good MongoDB library
Issues
  • Works best on Windows
  • Not well suited for rapid development
While many features of C# are not directly related to NLP, they are very convenient. C# has some NLP libraries: SharpNLP is a port of OpenNLP from Java, and Lucene has also been ported. The ports lag behind the Java implementations, but still give a good foundation.

Python

Python is an elegant scripting language, with a strong focus on simplicity.
  • NLTK is a great NLP library
  • Lots of open source math and science libraries
  • PyDev is a good development environment
  • Good MongoDB library
  • Great for rapid development
Issues
  • It is interpreted and not very fast
  • Problems with the GIL-based threading model

    C# vs. Python and unmet needs

    I was not sure which language I would prefer to work with. I suspected that C# would win out with all its advanced features. Due to demand for fast turnaround, I ended up doing more work in Python, and have been very happy with that choice. I have a lot of scripts that can be piped together to create new applications, with the help of the very fast and flexible MongoDB.
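
A minimal sketch of what one such pipeable script might look like. The function name `count_words`, the token pattern and the tab-separated output format are illustrative assumptions, not the scripts described above. The function takes its input and output streams as parameters, so the same code works on in-memory text and on real pipes.

```python
import io
import re
from collections import Counter

# Assumption for illustration: a token is a run of letters or apostrophes.
TOKEN = re.compile(r"[A-Za-z']+")


def count_words(in_stream, out_stream):
    """Read text lines and write 'word<TAB>count' lines, most frequent first."""
    counts = Counter()
    for line in in_stream:
        counts.update(tok.lower() for tok in TOKEN.findall(line))
    # Sort by descending count, then alphabetically, so output is deterministic.
    for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        out_stream.write("%s\t%d\n" % (word, n))


# Demo with in-memory streams instead of a real pipe.
demo_out = io.StringIO()
count_words(io.StringIO("The cat sat.\nThe dog sat too.\n"), demo_out)
```

In a real script the call would be `count_words(sys.stdin, sys.stdout)`, so that it can sit between other commands in a shell pipeline.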

    I do have some concerns about Python moving forward:
    • Will it scale if I get a really large amount of text?
    • Will speed improve on multicore processors?
    • Will it work with cloud computing?
    • Part of speech tagging is slow


    Java

    Java is a modern object oriented language. Like C# it is a programming platform:
    • Has most NLP libraries: OpenNLP, Mahout, Lucene, WEKA
    • It is fast
    • Great development environment: Eclipse and NetBeans
    • You can do almost any task in it
    • Great database support with JDBC and Hibernate
    • Many web development frameworks
    • Good GUI toolkit: Swing and JavaFX
    • Good concurrency with threading library
    Issues
    • Functional style programming is clumsy
    • Working with MongoDB is clumsy
    • Java code is verbose

    I would not hesitate to use Java for NLP, but my company is not a Java shop.

    Clojure

    Clojure was released in 2007. It is a right-sized LISP: not very big like Common LISP, and not very small like Scheme.
    • Gives easy access to Java libraries: OpenNLP, Mahout, Lucene, WEKA, OpinionFinder
    • Innovative non-locking concurrency primitives
    • Good IDEs in Eclipse and NetBeans
    • Easy to work with
    • Code and data are unified
    • Interactive REPL
    • LISP is the classic artificial intelligence language
    • If you need speed you can write Java code
    • Good MongoDB library
     Issues
    • The IDE support does not work as well as the IDEs for Java or C#

      Clojure is minimal in the sense that it is built on around 10 primitive programming constructs. The rest of the language is constructed with macros written in Clojure.

      Once I got Clojure installed, it was easy to work with and program in. Most of Python's good features also apply to Clojure: it is minimal and has batteries included. Still, I think that Python is a simpler language than Clojure.

      Use case
      Clojure is a good way to script the extensive Java libraries, for rapid development. It has more natural interaction with MongoDB than Java.

      Clojure OpenNLP

      The clojure-opennlp project is a thin Clojure wrapper around OpenNLP. It came with all the corpora used as training data for OpenNLP nicely packaged, and it works well. You can script OpenNLP about as tersely as NLTK, from an interactive REPL.

      I tried it in both Eclipse and NetBeans. They seem somewhat equal in number of features. I had a little better luck with the Eclipse version.

      clojure-opennlp uses a Maven build system, but has a nontraditional directory layout. This caused problems for both Eclipse and NetBeans, and both took some configuration.

      Eclipse Counterclockwise
      The Counterclockwise instructions for labrepl mostly worked for installing clojure-opennlp.
      Afterwards I had to add the example directory to the source directories under the project properties.

      NetBeans Enclojure
      I imported the project. I had to move the Clojure file from the example directory to a different location to get it to work.

      Maven plugins for Clojure
      The standard Maven directory layout has several advantages, e.g. if you want to mix Java and Clojure code. I set up my own Maven POM configuration file, based on examples from other Clojure Maven projects. They used Clojure plugins for Maven, which I could not get to work. Eventually I ripped these plugins out and was left with a very plain POM file that worked.

      Go / Golang

      Go was announced in November 2009. It was created at Google to deal with multicore and networked machines. It feels like a mixture of Python and C. It is a very clean and minimal language.
      • It is fast
      • Good standard library
      • Excellent support for concurrency
      • It is trivial to write your own load balancer
      Issues
      • The Eclipse IDE is in an early stage
      • Debugger is not working
      • The Windows port has only just been released and is not finished
      It was hard to find the right Go Windows port; several Go Windows port projects contain no code.

      Use cases
      I currently have a problem when downloading a lot of HTML pages and parsing them into a tree structure. This is not well supported in C#. I found a library that translates HTML to XHTML, which I can then process with LINQ. The library is undocumented, not very fast, and fails on some HTML files.

      Go comes with an HTML library that parses HTML5. It is simple to write a program with some threads that download pages and others that parse the files into a DOM tree structure.
      I would use Golang for loading large amounts of text in a cloud computing environment.
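
A minimal sketch of that two-stage shape (one stage downloads, another parses into a tree), written in Python for comparison with my current setup. In Go the queue would be a channel and the workers goroutines. Everything named here is an assumption for illustration: `FAKE_WEB` stands in for the network, and `TreeBuilder` is a deliberately naive tree builder on top of the standard `html.parser` module, not a full HTML5 parser.

```python
import queue
import threading
from html.parser import HTMLParser

# Stand-in for the network: hypothetical URLs mapped to page bodies.
FAKE_WEB = {
    "http://example.com/a": "<html><body><p>Hello</p></body></html>",
    "http://example.com/b": "<html><body><h1>Title</h1><p>Text</p></body></html>",
}


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree; void tags such as <br> are not special-cased."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": [], "text": ""}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": [], "text": ""}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1]["text"] += data


def downloader(urls, pages):
    for url in urls:
        pages.put((url, FAKE_WEB[url]))  # a real version would fetch over HTTP


def parser_worker(pages, trees):
    while True:
        item = pages.get()
        if item is None:  # sentinel: no more pages
            break
        url, body = item
        builder = TreeBuilder()
        builder.feed(body)
        builder.close()
        trees[url] = builder.root  # each worker writes a distinct key


def crawl(urls, n_parsers=2):
    pages = queue.Queue()
    trees = {}
    workers = [threading.Thread(target=parser_worker, args=(pages, trees))
               for _ in range(n_parsers)]
    for w in workers:
        w.start()
    producer = threading.Thread(target=downloader, args=(urls, pages))
    producer.start()
    producer.join()
    for _ in workers:
        pages.put(None)  # one sentinel per parser thread
    for w in workers:
        w.join()
    return trees


trees = crawl(list(FAKE_WEB))
```

The parsers start before the downloader, so pages are parsed as soon as they arrive rather than after all downloads finish.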

      Cython

      Cython was released in July 2007. It is a static compiler for writing Python extension modules in a mixture of Python and C.

      Process for using Cython
      • Start by writing normal Python code
      • Find modules that are too slow
      • Add static types
      • Compile it with Cython using the setup tool
      • This produces compiled modules that can be used with normal Python
      Issues
      • It is still more complex than normal Python code
      • You need to know C to use it
      I was surprised how simple it was to get it working under both Windows and Linux. I did not have to mess with makefiles or configure the compiler. Cython integrates well with NumPy and SciPy. This substantially expands the programming tasks you can do with Python.
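
The first two steps of the process above (write normal Python, then find what is slow) can be sketched with the standard library profiler `cProfile`. The function `count_bigrams` is a hypothetical stand-in for a slow inner loop, not code from my tagger, and `profile_call` is a small helper invented for this example.

```python
import cProfile
import io
import pstats


def count_bigrams(tokens):
    """Hypothetical hot loop: count adjacent token pairs in pure Python."""
    counts = {}
    for left, right in zip(tokens, tokens[1:]):
        pair = (left, right)
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def profile_call(func, *args):
    """Run func under cProfile; return its result and a short text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, report.getvalue()


tokens = "the cat sat on the mat and the cat slept".split() * 10000
bigrams, report = profile_call(count_bigrams, tokens)
print(report)
```

The functions at the top of the report are the candidates to move into a Cython module and annotate with static types.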

      Use cases
      Speed up slow POS tagging.

        My previous language experience

        Over the years I have experimented with a long list of non-mainstream languages: functional, declarative, parallel, array, dynamic and hybrid languages. Many of these were frustrating experiences. I would read about a new language and get very excited. However, this would often be the chain of events:
        • Download language
        • Install Cygwin
        • Find out how the language's build system works
        • Try to find a version of the GCC compiler that will compile it
        • Get the right version of Emacs installed
        • Try to get the debugger working under Emacs
        • Start programming from scratch since the libraries were sparse
        • Burn out

        You only have so much mental capacity, and if you do not use a language you forget it. Only Scala made it into my toolbox.

        Do Clojure, Go or Cython belong in your programmer's toolbox?

        Clojure, Go and Cython are all simple languages. They are easy to install and easy to learn, and they all have big standard libraries, so you can be productive in them right away. These are my first impressions:
        • Clojure is a good way to script the extensive Java libraries, for rapid application development and for AI work.
        • Go is a great language but it is still rough around the edges. There are not any big NLP libraries written for Go yet. I would not try to use it for my main NLP tasks.
        • Cython was useful right away for my NLP work. It makes it possible to do fast numerical programming in Python without too much trouble.


        -Sami Badawi