Friday, June 17, 2011

Cloud Computing For Data Mining Part 1

The first half of this blog post is about selecting a cloud provider for a data mining and natural language processing system. I will compare three leading cloud computing providers: Amazon Web Services, Windows Azure and OpenStack.
To help me choose a cloud provider I have been looking for users with experience running cloud computing for applications similar to data mining. I found them at CloudCamp New York in June 2011. It was an unconference, so the attendees were split into user discussion groups. In the second half of the post I will mention the highlights from these discussions.


The Hype

"If you are not in the cloud you are not going to be in business!"

This is the message many programmers, software architects and project managers face today. You do not want to go out of business because you could not keep up with the latest technologies; but looking back, many companies have gone out of business because they invested in the latest must-have technology that turned out to be expensive and over-engineered.


Reason For Moving To Cloud

I have a good business case for using cloud computing: scaling a data mining system to handle a lot of data. To begin with the amount of data is moderate, but it could turn into Big Data on short notice.


Horror Cloud Scenario

I am trying to minimize the risk of this scenario:
  1. I port to a cloud solution that is tied closely to one cloud provider
  2. Move the applications over
  3. After a few months I find that there are unforeseen problems
  4. No easy path back
  5. Angry customers are calling


Goals

Here are my cloud computing goals in a little more details:
  • Port data mining system and ASP.NET web applications to the cloud
  • Choose a cloud provider compatible with a code base in .NET and Python
  • Initially the data volume is moderate but it could possibly scale to Big Data
  • Keep cost and complexity under control
  • No downtime during transition
  • Minimize risk
  • Minimize vendor lock in
  • Run the same code in house and in the cloud
  • Make rollback to in house application possible


Amazon Web Services vs. Windows Azure vs. OpenStack

Choosing the right cloud computing provider has been time consuming, but also very important.

I took a quick stroll through Cloud Expo 2011, and most big computer companies were there presenting their cloud solutions.

Google App Engine is a big cloud service well suited for front-end web applications, but not good for data mining, so I will not cover it here.

The other three providers that have generated the most momentum are EC2, Azure and OpenStack.

Let me start by listing their similarities:
  • Virtual computers that can be started with short notice
  • Redundant robust storage
  • NoSQL structured data
  • Message queue for communication
  • Mountable hard disk
  • Local non persistent hard disk
Now I will write a little more about where they differ, and their good and bad parts:


Amazon Web Services, AWS, EC2, S3

Good:
  • This is the oldest cloud provider dating back to 2004
  • Very mature provider
  • Other providers are catching up with AWS's features
  • Well documented
  • Work well with open source, LAMP and Java
  • Integrated with Hadoop: Elastic MapReduce
  • A little cheaper than Windows Azure
  • Runs Linux, Open Solaris and Windows servers
  • You can run code on your local machine and just save the result into S3 storage (see the sketch below)
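
As an example, here is a minimal sketch of that last workflow in Python, assuming the boto library; the bucket and key names are made up, and it simply uploads a locally produced result file to S3:

import boto

# assumes AWS credentials are configured; bucket and key names are hypothetical
conn = boto.connect_s3()
bucket = conn.get_bucket('my-datamining-results')
key = bucket.new_key('reports/word_counts.csv')
key.set_contents_from_filename('word_counts.csv')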

Bad:
  • You cannot run the same code in house and in the cloud
  • Vendor lock in


Windows Azure

Good:
  • Works well with the .NET framework and all Microsoft's tools
  • It is very simple to port an ASP.NET application to Azure
  • You can run the same code on your development machine and in the cloud
  • Very good development and debugging tools
  • F# is a great language for data mining in cloud computing
  • Great series of video screen casts

Bad:
  • Only runs Windows
  • You need Windows 7, Windows Server 2008 or Windows Vista to develop
  • Preferably you should have Visual Studio 2010
  • Vendor lock in


OpenStack

OpenStack is a new open source collaboration that is building a software stack that can be run both in house and in the cloud.

Good:
  • Open source
  • Generating a lot of buzz
  • Main participants NASA and Rackspace
  • Backed by 70 companies
  • You can run your application either in house or in the cloud

Bad:
  • Not yet mature enough for production use
  • Windows support is immature


Java, .NET Or Mixed Platform

For data mining selecting the right platform is a hard choice. Both Java and .NET are very attractive options.

Java only
For data mining and NLP there are a lot of great open source projects written in Java. E.g. Mahout is a system for collaborative filtering and clustering of Big Data, with distributed machine learning. It is integrated with Hadoop.
There are many more open source projects: OpenNLP, Solr, ManifoldCF.

.NET only
The development tools in .NET are great. It works well with Microsoft Office.
Visual Studio 2010 comes with F#, which is a great language for writing worker roles. It is very well suited for lightweight threads or async, for highly parallel reactive programs.

Mix Java and .NET
You can mix Java and .NET. Cloud computing makes it easier than ever to integrate different platforms. You already have abstract, language-agnostic services for communication: message queues, blob storage and structured data. If you have an ASP.NET front end on top of collaborative filtering of Big Data, this would be a very attractive option.

I still think that combining two big platforms like Java and .NET introduces complexity, compared to staying within one platform. You need an organization with good resources and coordination to do this.


Choice Of Cloud Provider

I still have a lot of unanswered questions at this point.

At the time of writing June 2011 OpenStack is not ready for production use. So that is out for now.

I have run some tests on AWS. It was very easy to deploy my Python code to EC2 under Linux. Programming C# that used AWS services was simple.

I am stuck waiting to get a Windows 7 machine so I can test Windows Azure.

Both EC2 and Azure seem like viable options for what I need. I will get back to this in part 2 of the blog post.


Highlights from Cloud Camp 2011

A lot of people are trying to sell you cloud computing solutions, and I have heard plenty of cloud computing hype. I have been seeking advice from people who were not trying to sell me anything and who had some real experience, trying to find some of the failures and problems in cloud computing.

I went to Cloud Camp June 2011 during Cloud Expo 2011 in New York, where cloud computing users shared their experience. It was an unconference, meaning spontaneous user discussion breakout groups were formed. The rest of this post is highlights from these discussions.


Hadoop Is Great But Hard

Hadoop is a Java open source implementation of Google's MapReduce. You can set up a workflow of operations and Hadoop will distribute them over multiple computers, aggregate the results and rerun operations that fail. This sounds fantastic, but Hadoop is a pretty complex system, with a lot of new terminology and a steep learning curve.
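
To give a flavor of the programming model, here is a minimal word count sketch using Hadoop Streaming, which lets you write the map and reduce steps as plain Python scripts that read stdin and write stdout (file names and paths are made up):

# mapper.py -- emit "word<TAB>1" for every word on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        sys.stdout.write('%s\t1\n' % word)

# reducer.py -- sum the counts per word; Hadoop sorts the mapper output by key
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit('\t', 1)
    if word != current:
        if current is not None:
            sys.stdout.write('%s\t%d\n' % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    sys.stdout.write('%s\t%d\n' % (current, total))

The job is then submitted with the streaming jar that ships with Hadoop, roughly: hadoop jar hadoop-streaming.jar -input in_dir -output out_dir -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the jar path depends on your installation).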


Security Is Your Responsibility

Security is a big issue. You might assume that the cloud will take care of security, but you should not. E.g. you should wipe the hard disks that you have used, so the next user cannot see your data.


Cloud Does Not Automatically Scale To Big Data

The assumption is that you put massive amounts of data in the cloud and the cloud takes care of the scaling problems.
If you have a lot of data that needs little processing, cloud computing becomes expensive: you store all data in 3 different locations, and it is expensive and slow to take it down to different compute nodes. This was mentioned as the reason why NASA could not use S3, but built its own Nebula platform.


You Accumulate Cost During Development

An entrepreneur building a startup ended up paying $2000 / month for EC2. He used a lot of different servers and they had to be running with multiple instances, even though he was not using a lot of resources. This might be cheap compared to going out and buying your own servers, but it was more expensive than he expected.


Applications Written In .NET Run Fine Under EC2 Windows

An entrepreneur said that he was running his company's .NET code under EC2. He thought that Amazon was more mature than Azure, and Azure was catching up. He preferred to make his own framework.


Simpler To Run .NET Application On Azure Than On EC2

A cloud computing consultant with lots of experience in both Azure and EC2 said: EC2 gives you a raw machine; you have to do more to get your application running than if you plop it into Windows Azure.
It is very easy to port an ASP.NET application to Windows Azure.


Cash Flow, Operational Expenses And Capital Expenses

An often cited reason why cloud computing is great is that a company can replace big upfront capital expenses with smaller operational expenses. A few people mentioned that companies live by their cash flow and do not like unpredictable operational expenses; they are more comfortable with predictable capital expenses.


Friday, April 1, 2011

Practical Probabilistic Topic Models for NLP

Latent Dirichlet Allocation, LDA, is a new and very powerful technique for finding the topics in a collection of texts, using unsupervised learning. LDA is a probabilistic topic model. It was developed in 2003 and relies on advanced math. This post is a practical guide on how to get started building LDA models and software.

LDA will have a substantial impact on corpus based natural language processing, since it makes it easy to create semantic models based on machine learning.

Motivation for topic models


With the Internet we have large amounts of text available. Having the text categorized into topics makes text search much more precise and makes it possible to find similar documents.

Text categorization is not an easy problem:
  • Texts usually deal with more than one topic
  • There is no clear standard for categorization
  • Doing it by hand is infeasible
Nuanced categorization is a hard problem, with many moving parts, but in 2003 David M. Blei, Andrew Y. Ng and Michael I. Jordan published an article on a new approach called Latent Dirichlet Allocation. LDA can be implemented based on the research articles, but if you are not a machine learning academic the math is intimidating and the material is still new.

There is actually good material available, but finding all the pieces takes some work. Most things you need are available online for free. Here is a chronological account of what I did to understand LDA and start implementing it.

Need for more sophisticated hierarchical topic models


In 2009 I needed a fine grained classification of text, using unsupervised or semi supervised training. I spent a little time thinking about it, and had some ideas about bootstrapped training in a 2 layered hierarchy. It was hackish, complex and I was not sure how numerically stable it was. I never got around to implementing it.


David Blei


I went to the 4th Annual Machine Learning Symposium in 2009 and asked around for solutions to my problem. Several attendees told me to look at David Blei's work. I did, but he has written a lot of math heavy articles, so I did not know where to start.

I was lucky to see David Blei give a presentation on LDA at the 5th Annual Machine Learning Symposium. David Blei works at Princeton and just exudes brilliance. He gave a lucid, entertaining description of LDA with examples. It was really striking to see the LDA algorithm find scientific topics on its own with no human intervention.

I saw him give the same talk at the NYC Machine Learning Meetup, and luckily that was videotaped; here are part 1 and part 2. I watched these videos a few times. This gave me a good intuition for the algorithm.

I looked through his articles and found a good beginner article, BleiLafferty2009. I read through it several times, but I could not understand it.

I went out and bought the text book that David Blei recommended: Pattern Recognition and Machine Learning by Christopher M. Bishop. After reading the introduction chapter, I read BleiLafferty2009 again and was able to understand it. On page 10 the essence of the algorithm is described in a small text box.


Software implementation of LDA


There are plenty of open source implementations of LDA. Here are a few observations:

lda-c in C by David Blei is an implementation in old school C. The code is readable, concise and clean.

lda for R package by Jonathan Chang. Implementing many models with extensive documentation.

Online LDA in Python by Matt Hoffman. Short code, but not too much documentation.

LDA Apache Mahout in Java. Active development community works with Hadoop / MapReduce.



No matter what language you prefer there should be a good implementation.
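
To give a feel for how little code is needed once you pick an implementation, here is a small sketch in Python using the gensim library (an option not listed above; treat the corpus and parameters as toy values):

from gensim import corpora, models

# toy corpus: each document is already tokenized and lowercased
texts = [['cloud', 'computing', 'data'],
         ['topic', 'model', 'data', 'text'],
         ['cloud', 'storage', 'computing']]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
lda.print_topics()                                       # inspect the discovered topics
print(lda[dictionary.doc2bow(['cloud', 'data'])])        # topic mix of an unseen document

A real corpus needs much more preprocessing, which is the subject of the next section.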


Practical software considerations


All the implementations looked good. But if you want to use LDA software, robustness, scalability and extensibility are big issues. First you just want the algorithm to run for simple text input. The next day you want the following options (a preprocessing sketch follows the list):
  • Better word tokenizer
  • Bigrams and collocation
  • Word stemmer
  • LDA on structured text
  • Read from database
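
Most of the items on this list can be covered with NLTK before the text ever reaches the LDA code. A rough sketch, assuming the NLTK tokenizer models and corpora are installed:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

raw = "The quick brown fox jumps over the lazy dog. The dog was not amused."
tokens = [t.lower() for t in nltk.word_tokenize(raw) if t.isalpha()]

stemmer = nltk.PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]             # word stemming

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
collocations = finder.nbest(bigram_measures.pmi, 5)   # candidate bigrams / collocations

print(stems)
print(collocations)

Reading from a database and handling structured text are then a matter of feeding the resulting token lists into the LDA implementation of your choice.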


Programming language choice for LDA


Here is a little common sense advice on choice of programming language for LDA programming.

C
C is an elegant, simple system programming language.
C is not my first choice of a language for text processing.


C++
C++ is a very powerful but also complex language.
NLP lib: The Lemur Project
I would be happy to use C++ for text processing.


C#
C# is a great language.
NLP lib:  SharpNLP. 
You will have to implement LDA yourself or port one of the other implementations. SciPy is getting ported to C# but it does not have the best numeric libraries.


Clojure
Clojure is a moderately sized LISP dialect built on the Java JVM.
NLP lib: OpenNLP through clojure-opennlp.
LISP is the classic AI language, and you can use one of the Java LDA implementations.


Java
Java is a modern object oriented programming language with access to every library you could think of.
NLP lib: OpenNLP.


Python
Python is an elegant language very well suited for NLP.
NLP lib: NLTK, using NumPy and SciPy


R
R is a fantastic language for statistics, but not so great for low level text processing.
NLP lib:
The R implementation of LDA looks great. I think it is common to do all the preprocessing in another language, say Perl, and then do the rest of the work in R.


Different versions of LDA


There are now a lot of different LDA models geared towards different domains. Let me just mention a couple:

Online LDA
Online means that you learn the model in small batches instead of on all the documents at once. This is useful for a continuously running system.

Dynamic LDA
Good for handling text that stretches over a long time interval say 100 years.

Hierarchical LDA
This handles topics that are organized in hierarchies.


Gray box approach to LDA


The math needed for LDA is advanced. If you do not succeed in understanding it, I still think that you can learn to use the code, if you are willing to take something on faith and get your hands dirty.

Bedtime Science Stories My Science Education Blog

I started a science education blog called: Bedtime Science Stories. Here is a little excerpt from my first post: Can and should a 3 year old girl be into science?


I have a 3 year old daughter who has taken a bit of an interest in science. We have been talking about science when I put her to bed at night.


Last Sunday I discovered a new book called Battle Hymn of the Tiger Mother by Amy Chua, who is a law professor at Yale. She is using extreme methods to push her 2 daughters to academic excellence. They had to be the best in their class in everything except drama and physical education. Math was a topic that she really drilled them in. Just reading the back cover sent me into a rage; so much that I decided to start a new blog: Bedtime Science Stories, just to get my anger out.

Science should not be an elite activity. Making it very competitive will make a new generation of kids hate math and science. Understanding our world is a worthwhile activity even if you are not the best in your class.

My 3 year old daughter

Tuesday, February 15, 2011

Is IBM Watson Beginning An AI Boom?

Artificial intelligence fell out of favor in the 1970s, the start of the first artificial intelligence winter, and has mainly been out of favor since. In April 2010 I wrote a post about how you can now get a paying job doing AI, machine learning and natural language processing outside academia.

Now barely one year later I have seen a few demonstrations that signal that artificial intelligence has taken another leap towards mainstream acceptance:
  • Yann LeCun demonstrated a computer vision system that could learn to recognize objects from his pocket after being shown a few examples, during a talk about learning feature hierarchies for computer vision
  • Andrew Hogue demonstrated Google Squared and Google Sentiment Analysis at Google Tech Talk, those systems both show rudimentary understanding of web pages and use word association
  • IBM Watson super computer is competing against the best human players on Jeopardy 
All 3 systems contain some real intelligence. It is rudimentary by human standards, but AI has gone from very specialized systems to handling more general tasks. It feels like AI is picking up steam, and I am seeing startups based on machine learning pop up. This reminds me of the Internet boom in the 1990s. I moved to New York in 1996, at the beginning of the Internet boom, and saw firsthand the crazy gold rush where fortunes were made and lost in a short time, Internet startups were everywhere and everybody was talking about IPOs. This got me thinking: are we headed towards an artificial intelligence boom, and what would it look like?

    IBM Watson

    IBM Watson is a well executed factoid extraction system, but it is a brilliant marketing move, promoting IBM's new POWER7 system and their Smart Planet consulting services. It gives some people the impression that we already have human-like AI, and in that sense it could serve as a catalyst for investments in AI. This post is not about human-like artificial intelligence, but about the spread of shallow artificial intelligence.

    Applications For Shallow Artificial Intelligence

    Both people and corporations would gain value from having AI systems that they could ask free form questions to and get answers from in very diverse topics. In particular in these fields:
    • Medical science
    • Law
    • Surveillance
    • Military

Many people, me included, are concerned about a big brother state and military use of AI, but I do not think that is going to stop adoption. These people play for keeps.

    There are signs that the financial service industry is starting to use sentiment analysis for their pricing and risk models. Shallow AI would be a good candidate for more advanced algorithmic trading.

    Bottom Up vs. Top Down Approaches

Here is a very brief, simplified introduction to AI techniques and tools. AI is a loosely defined field, with a loose collection of techniques. You can roughly categorize them into top down and bottom up approaches.

    Top down or symbolic techniques
    • Automated reasoning
    • Logic
    • Many forms of tree search
    • Semantic networks
    • Planning
    Bottom up or machine learning techniques
    • Neural networks, computer with similar structure to the brain
    • Machine learning

The top down systems are programmed by hand, while the bottom up systems learn from examples without human intervention, a bit like the brain.

    What Is Causing This Sudden Leap?

Many top down techniques were developed by the 1960s. They were very good ideas, but they did not scale; they only worked for small toy problems.
Neural networks are an important bottom up technique. They started in the 1950s, but fell out of favor; they came roaring back in the 1980s. In the 1990s the machine learning / statistical approaches to natural language processing beat out Chomsky's generative grammar approach.

The technology that is needed for what we are doing now has been around for a long time. Why are these systems popping up now?

I think that we are seeing the beginning of a combination of machine learning with top down techniques. The reason why this has taken so long is that it is hard to combine top down and bottom up techniques. Let me elaborate a little:

Bottom up AI / machine learning is a black box: you give it some input and expected output, and it adjusts a lot of numeric parameters so it can mimic the result. Usually the numbers do not make much sense; they just work.

    In top down / symbolic AI you are creating detailed algorithms for working with concepts that make sense.

    Both top down and bottom up techniques are now well developed and better understood. This makes it easier to combine them.

    Other reasons for the leap are:
    • Cheap, powerful and highly parallel computers
    • Open source software, where programmers from around the world develop free software. This makes programming into more of an industrial assembly of parts.

    Who Will Benefit From An AI Boom?

    Here are some groups of companies that made a lot of money during the Internet boom:
    • Cisco and Oracle, the tool makers
    • Amazon and eBay, small companies that grew to become dominant in e-commerce
    • Google and Yahoo, advertisement driven information companies

Initially big companies like IBM and Google that can create the technology should have an advantage, whether in the capacity of tool makers or dominant players.

It is hard to predict how high the barrier to entry in AI will be. AI programs are just trained on regular text found on or off the Internet, and today's super computer is tomorrow's game console. The Internet has a few dominant players, but it is generally decentralized and anybody can have a web presence.

    New York is now filled with startups using machine learning as a central element. They are "funded", but it seems like they got some seed capital. So maybe there is room for smaller companies to compete in the AI space.

    Job Skills That Will Be Required In An AI Boom

    During the Internet boom I met people with a bit of technical flair and no education beyond high school who picked up HTML in a week and next thing they were making $60/hour doing plain HTML. I think that the jobs in artificial intelligence are going to be a little more complex than those 1990s web developer jobs.

In my own work I have noticed a move from writing programs to teaching software based on examples. This is a dramatic change, and it requires a different skill set.

    I think that there will still be plenty of need for programmers, but cognitive science, mathematics, statistics and linguistics will be skills in demand.

    My work would benefit from me having better English language skills. The topic that I am dealing with is, after all, the English language. So maybe that English literature degree could come in handy.

Currently I feel optimistic about the field of artificial intelligence; there is progress after years of stagnation. We are wrestling a few secrets away from Mother Nature, and are making progress in understanding how the brain works. Still, the introduction of such a powerful technology as artificial intelligence is going to affect society for better and worse.

    Thursday, December 16, 2010

    NLTK under Python 2.7 and SciPy 0.9.0

Python 2.7 has been out for months, but I have been stuck using Python 2.6 since SciPy was not working with Python 2.7. The SciPy 0.9 Beta 1 binary distribution has just been released.
Normally I try to stay clear of beta quality software, but I really like some of the new features in Python 2.7, especially the argparse module, so despite my better judgment I installed Python 2.7.1 and SciPy 0.9.0 Beta 1 to run with a big NLTK based library. This blog post describes the configuration that I use and my first impression of the stability.

    SciPy 0.9 RC1 was released January 2011.
    SciPy 0.9 was released February 2011.
    I tried both of them and found almost the same result as for SciPy 0.9 Beta 1, which this review was originally written for.

    Direct downloads
    Here is a list of the programs I installed directly:

    Installation of NLTK

    The install was very simple just type:

    \Python27\lib\site-packages\easy_install.py nltk
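
After the install it is worth checking that NLTK imports under the new interpreter, and pulling down the corpora and models you plan to use, for example:

import nltk
print(nltk.__version__)
nltk.download()    # opens the downloader so you can fetch corpora and tokenizer models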


    Other libraries installed with easy_install.py

    • CherryPy
    • ipython
    • PIL
    • pymongo
    • pyodbc

      YAML Library
On a Windows Vista computer with no MS C++ compiler, where I tested this NLTK install, I also had to do a manual install of YAML from:
      http://pyyaml.org/wiki/PyYAML

      Libraries from unofficial binary distributions
      There are a few packages that have build problems, but can be loaded from Christoph Gohlke's site with Unofficial Windows Binaries for Python Extension Packages: http://www.lfd.uci.edu/~gohlke/pythonlibs/ I downloaded and installed:
      • matplotlib-1.0.0.win32-py2.7.exe
      • opencv-python-2.2.0.win32-py2.7.exe

      Stability

      The installation was simple. Everything installed cleanly. I ran some bigger scripts and they ran fine. Development and debugging also worked fine. Out of 134 NLTK related unit tests only one failed under Python 2.7

      Problems with SciPy algorithms

      The failing unit test was maximum entropy training using the LBFGSB optimization algorithm. These were my settings:
      nltk.MaxentClassifier.train(train, algorithm='LBFGSB', gaussian_prior_sigma=1, trace=2)

      First the maximum entropy training would not run because it was calling the method rmatvec() in scipy/sparse/base.py. This method has been deprecated for a while and has been taken out of the SciPy 0.9. I found this method in SciPy 0.8 and added it back. My unit test ran, but instead of finishing in a couple of seconds it took around 10 minutes eating up 1.5GB before it crashed. After this I gave up on LBFGSB.

      If you do not want to use LBFGSB, megam is another efficient optimization algorithm. However it is implemented in OCaml and I did not want to install OCaml on a Windows computer.

      This problem occurred for both SciPy 0.9 Beta 1 and RC1.

      Python 2.6 and 2.7 interpreters active in PyDev

Another problem was that having both the Python 2.6 and 2.7 interpreters active in PyDev made it less stable. When I started scripts from PyDev they sometimes timed out before starting. PyLint would also show errors in code that was correct. I deleted the Python 2.6 interpreter under PyDev Preferences, and PyDev worked fine with just Python 2.7.

I also added a version check to the one failing unit test, since it caused problems on my machine:
      if (2, 7) < sys.version_info: return

      Multiple versions of Python on Windows

If you install Python 2.7 and realize that some code only runs under Python 2.6, or that you have to roll back, here are a few simple suggestions:

      I did a Google search for:
      python multiple versions windows
This will show many ways to deal with the problem. One way is calling a little Python script that changes the Windows registry settings.

      Multiple versions of Python have not been a big problem for me. So I favor a very simple approach. The main issue is file extension binding. What program gets called when you double click a py file or type script.py on the command line.

      Changing file extension binding for rollback to Python 2.6

Under Windows XP you can change file extension bindings in Windows Explorer:
Under Tools > Folder Options > File Types
Select the PY extension and press Advanced, then press Change
Select open and press Edit
      The value is:
      "C:\Python27\python.exe" "%1" %*
      You can change this to use a different interpreter:
      "C:\Python26\python.exe" "%1" %*

      Or even simpler when I want to run the older Python interpreter I just type:
      \Python26\python.exe script.py
      Instead of typing
      script.py

      Is Python 2.7 and SciPy 0.9.0 Beta 1 stable enough for NLTK use?

The installation of all the needed software was fast and unproblematic. I would certainly not use it in a production environment, and if you are doing a lot of numerical algorithms you should probably hold off. If you are impatient and do not need to do new training, it is worth trying; you can always roll back.

      Friday, November 12, 2010

      Growing Python projects from small to large scale

You need significantly different principles for developing small, medium and large scale software systems.

      When my project started to become big I searched the Internet for some guidelines or best practices for how to scale Python, but did not find much. Here are a few of my observations on what technique to use for what project sizes.

      General principle


For a small system you can spend most of your time solving the problem, but the bigger the system gets, the more time you spend on project plans, coordination and documentation. The complexity and cost do not scale linearly with the size of the project, but maybe with the square of the size. This holds for different styles of project management, both waterfall and agile.

A central problem is minimizing dependencies and avoiding tight coupling. John Lakos has written an excellent book on software scaling called Large-Scale C++ Software Design; here is a summary. It is a very scientific and stringent approach, which is specific to C++. He developed a metric for how many dependencies you have in your system. His techniques are not a good fit for smaller projects; you could finish several scripts before you could even implement his methodology.

      Small scripts
      Keep it simple. Focus on the core functionality. Minimize the time you spend on setting up the project.

      Medium applications
      Spending some time organizing things, will save you time in the long run.

      Large applications
      Here you need a lot of structure; otherwise the project will not be stable.

      Development environment


      Small scripts

      I use PyWin Windows IDE.
      • It is lightweight
      • No need for Java or Eclipse
      • Syntax highlighting
      • Code completion at run time and some at write time
      • Allow primitive debugging
      • You do not need to set up a project to use it.

      Medium applications
      I have used both PyWin and PyDev.

      Large applications
I would strongly recommend the PyDev Eclipse plugin. It is a modern IDE that runs pylint continuously and has good code completion while you write code. It will find maybe half the errors a compiler would find. This improves stability a lot and was the most important change I made to my old coding style.

      Organization of code


      Small scripts
Use one module / file with all the code in it. It can have several classes. The advantage is that deployment becomes trivial: you just email the script to the user. This works for modules up to around 3000 lines of code.

      Medium applications
      Use one directory with all modules in. This gives you fewer issues with PYTHONPATH.

Make a convention for naming field names, database names and parameter names. Put all these names in a module that only contains string constants, and use these in your code instead of raw strings.
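
A minimal sketch of such a constants module (all names here are made up):

# names.py -- the only place where raw field / database name strings are allowed
DB_NAME = "nlp_results"
FIELD_DOC_ID = "doc_id"
FIELD_TEXT = "text"
FIELD_SENTIMENT = "sentiment"

# elsewhere in the code base:
#     from names import FIELD_TEXT
#     print(record[FIELD_TEXT])      # instead of record["text"]

A typo in a constant name fails immediately, while a typo in a raw string can silently create a new field.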

      Use a separate repository for the project. I package the Python and other self written executable together in a repository, even when I have another source control system for the compiled sources.

This works up to around 40 Python modules; then it becomes hard to find anything.

      Large applications
Read and follow the Python style guide. Before, I followed a Java style guide, since Java is big on coding conventions, but the Python style is actually pretty different. A noticeable difference is that a Java file contains a main class with a title case name and the file has the same name. In Python, modules should have short lowercase names while classes still should have title case names.

Organizing packages as an acyclic graph
Refactor the modules into packages. The packages should be organized as an acyclic graph. At the lowest level you would have a util package that is not allowed to reference anything else. You can have other specialized packages that can access the util package. Above that I have the main source directory with code that is central and general. Above that I have a loader package that can access all the other packages.

One problem when you have different directories is that you need PYTHONPATH to include all the code. A good way to do this is to add the parent directory to the system path before you import any of the modules.
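
One way to do that, as a sketch placed at the top of an entry point script before any project imports:

import os
import sys

# make the package root importable no matter where the script is started from
_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if _root not in sys.path:
    sys.path.insert(0, _root)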

      Documentation


      Small scripts
      Usually I have:
      • Python docstring in the program. 
      • Print a usage message

      Medium applications
      Have a directory for documentation. To keep it simple I prefer to use simple HTML. I find that Mozilla SeaMonkey is simple to use and generates clean HTML you can do a diff on. Often I have:
      • User documentation page 
      • Programmer documentation page
      • Release notes
      • Example data

      Large applications
      At this point using automatically generated documentation and some sort of wiki format for writing documentation is a good idea.

      Communication


      Input and output account for a sizable part of your code. I prefer to use the most lightweight method I can get away with.

      Small scripts and medium applications
      Communication is done with flat files, csv files and database.

      Large applications
      Communication is done with flat files, csv files, database, MongoDB and CherryPy.

MongoDB has dramatically simplified my work. Before, different types of structured data demanded their own database with several tables. Now I just load the data into a MongoDB collection. MongoDB makes very differently structured documents look uniform and trivial to load from Python. After that I can use the same scripts on very different data.
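
A small sketch of that workflow with pymongo (database, collection and field names are made up; MongoClient and insert_one assume a recent pymongo version):

import pymongo

client = pymongo.MongoClient("localhost", 27017)
docs = client["nlp"]["documents"]

# very differently structured documents can go into the same collection
docs.insert_one({"source": "rss", "text": "some article text", "tags": ["cloud"]})
docs.insert_one({"source": "csv", "row": {"id": 42, "score": 0.87}})

for doc in docs.find({"source": "rss"}):
    print(doc["text"])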

When you have a script and find out that you need other programs to call it, it is very simple to create an XML, JSON or text based RESTful web service using CherryPy. You just add a one line annotation to a method and it is now a web service. You barely have to make any changes to your program. CherryPy feels very Pythonic. This gives you a very cheap way to connect to a GUI or a web site written in other languages.
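
A sketch of what that looks like, wrapping a hypothetical tokenizer as a JSON web service:

import json
import cherrypy

class NlpService(object):
    @cherrypy.expose                      # the one line that turns a method into a web service
    def tokenize(self, text=""):
        return json.dumps({"tokens": text.split()})

if __name__ == "__main__":
    cherrypy.quickstart(NlpService(), "/")
    # then e.g. GET http://localhost:8080/tokenize?text=hello+world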

      Unit tests


      Small scripts
      Unit tests give you a small advantage. I still write unit tests unless there is an emergency, and then I usually regret it.

      Large applications
The bigger the system, the more important it is that the individual pieces work. Large systems are not maintainable if you do not have unit tests.
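
The tests themselves can stay very plain; a sketch using the standard unittest module against a hypothetical helper function:

import unittest

from textutils import tokenize   # hypothetical module under test

class TokenizeTest(unittest.TestCase):
    def test_strips_punctuation(self):
        self.assertEqual(tokenize("Hello, world!"), ["hello", "world"])

if __name__ == "__main__":
    unittest.main()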

      Source control system


      I put any code that I use for production in a source control system. I usually use Subversion or GIT.

      Subversion is good for centralized development, and it is nice that each check in has a sequential revision number so that you can see revision number 123 and next 124.

      GIT is better for distributed development; it is easy to create a local repository for a project.

      Small scripts
      One repository for each type of script.

      Medium and large applications
      One repository for each project.

      Use of standard libraries


      Small scripts
      Use the simplest approach that gets the work done.

      Large applications

When my application grew I realized that I had recreated functionality that already exists in the standard libraries.
I refactored my program to use the standard library and found that it was much better than what I had written. For bigger applications, using standard libraries makes your code less buggy and more maintainable. So spend some time to find what has already been written.

      How well does Python scale compared to compiled languages


There are mixed opinions on this topic. Scripts are generally small, and large systems are generally written in compiled languages. The extra checks and rigidity you get from a compiled language become more important the bigger your application gets. If you are writing a financial application and have a very low tolerance for errors, this could be significant.

      I am using Python for natural language processing: classification, named entity recognition, sentiment analysis and information extraction. I have to write many complex custom scripts fast.

Based on my earlier experience with writing smaller Python scripts I was concerned about writing a bigger application. I found a good setup with PyDev, unit tests and source control. It gives me much of the stability I am used to in a compiled language, while I can still do rapid development.


      -Sami Badawi

      Friday, October 29, 2010

      Natural language processing in Clojure, Go and Cython

I work in natural language processing, programming in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I do have some unmet needs. I investigated whether there are any new languages that would help. I only looked at minimal languages that would be simple to learn. The 3 top contenders were Clojure, Go and Cython. Both Clojure and Go have innovative approaches to non locking concurrency. This is my first impression of working with these languages.

      For contrast let me start by listing the features of my current languages.

      C# 3.5

C# is an advanced object oriented / functional hybrid language and programming platform:
      • It is fast
      • Great development environment
      • You can do almost any tasks in it
      • Great database support with LINQ to SQL
      • Advanced web development with ASP.net
      • Advanced GUI toolkit with WPF
      • Good concurrency with threading library
      • Good MongoDB library
      Issues
      • Works best on Windows
      • Not well suited for rapid development
      While many features of C# are not directly related to NLP they are very convenient. C# has some NLP libraries: SharpNLP is a port of OpenNLP from Java. Lucene has also been ported. The ports are behind the Java implementation, but still give a good foundation.

      Python

      Python is an elegant scripting language, with a strong focus on simplicity.
      • NLTK is a great NLP library
      • Lot of open source math and science libraries
      • PyDev is a good development environment
      • Good MongoDB library
      • Great for rapid development
      Issues
      • It is interpreted and not very fast
      • Problems with GIL based threading model

        C# vs. Python and unmet needs

I was not sure which language I would prefer to work with. I suspected that C# would win out with all its advanced features. Due to demand for fast turnaround, I ended up doing more work in Python, and have been very happy with that choice. I have a lot of scripts that can be piped together to create new applications, with the help of the very fast and flexible MongoDB.

        I do have some concerns about Python moving forward:
        • Will it scale if I get really large amount of text
        • Will speed improve on multi core processors
        • Will it work with cloud computing
        • Part of speech tagging is slow


        Java

        Java is a modern object oriented language. Like C# it is a programming platform:
        • Has most NLP libraries: OpenNLP, Mahout, Lucene, WEKA
        • It is fast
        • Great development environment: Eclipse and NetBeans
        • You can do almost any tasks in it
        • Great database support with JDBC and Hibernate
        • Many web development frameworks
        • Good GUI toolkit: Swing and JavaFX
        • Good concurrency with threading library
        Issues
        • Functional style programming is clumsy
        • Working with MongoDB is clumsy
        • Java code is verbose

        I would not hesitate using Java for NLP, but my company is not a Java shop.

        Clojure

        Clojure was released in 2007. It is a right sized LISP. Not very big like Common LISP or very small like Scheme.
        • Gives easy access to Java libraries: OpenNLP, Mahout, Lucene, WEKA, OpinionFinder
        • Innovative non locking concurrency primitives
        • Good IDEs in Eclipse and NetBeans
        • Easy to work with
        • Code and data is unified
        • Interactive REPL
        • LISP is the classic artificial intelligence language
        • If you need speed you can write Java code
        • Good MongoDB library
         Issues
        • The IDE is not working as well as IDEs for Java or C#

Clojure is minimal in the sense that it is built on around 10 primitive programming constructs. The rest of the language is constructed with macros written in Clojure.

Once I got Clojure installed it was easy to work with and program in. Most of the good things about Python also apply to Clojure: it is minimal and has batteries included. Still, I think that Python is a simpler language than Clojure.

          Use case
          Clojure is a good way to script the extensive Java libraries, for rapid development. It has more natural interaction with MongoDB than Java.

          Clojure OpenNLP

The clojure-opennlp project is a thin Clojure wrapper around OpenNLP. It comes with all the corpora used as training data for OpenNLP nicely packaged, and it works well. You can script OpenNLP approximately as tersely as NLTK, from an interactive REPL.

          I tried it in both Eclipse and NetBeans. They seem somewhat equal in number of features. I had a little better luck with the Eclipse version.

clojure-opennlp uses a Maven build system, but has a nontraditional directory layout; this caused problems for both Eclipse and NetBeans, and both took some configuration.

          Eclipse Counterclockwise
The Counterclockwise instructions for labrepl mostly worked for installing clojure-opennlp.
When you are done you have to add the example directory to the source directories under the project properties.

          NetBeans Enclojure
          I imported the project. I had to move the Clojure file from example directory to a different position to get it to work.

          Maven plugins for Clojure
The standard Maven directory layout has several advantages, e.g. if you want to mix Java and Clojure code. I created my own Maven POM configuration file, based on examples from other Clojure Maven projects. They used Clojure plugins for Maven, which I could not get to work. Eventually I ripped these plugins out and was left with a very plain POM file that worked.

          Go / Golang

Go was announced in November 2009. It was created by Google to deal with multicore and networked machines. It feels like a mixture of Python and C. It is a very clean and minimal language.
          • It is fast
          • Good standard library
          • Excellent support for concurrency
          • It is trivial to write your own load balancer
          Issues
          • The Eclipse IDE is in an early stage
          • Debugger is not working
          • The Windows port is not done and has only just been released
          It was hard to find the right Go Windows port; there are several Go Windows port projects with no code.

          Use cases
I currently have a problem where I download a lot of HTML pages and parse them into a tree structure. This does not have the best support in C#. I found a library that translates HTML to XHTML, so I can then use LINQ to process it. The library is not documented, not very fast and fails for some HTML files.

Go comes with an HTML library that parses HTML 5; it is simple to write a program with some threads that download and others that parse the files into a DOM tree structure.
I would use Golang for loading large amounts of text in a cloud computing environment.

          Cython

Cython was released in July 2007. It is a static compiler that lets you write Python extension modules in a mixture of Python and C.

          Process for using Cython
          • Start by writing normal Python code
          • Find modules that are too slow
          • Add static types
          • Compile it with Cython using the setup tool
          • This produces compiled modules that can be used with normal Python
          Issues
          • It is still more complex than normal Python code
          • You need to know C to use it
I was surprised how simple it was to get it working both under Windows and Linux. I did not have to mess with make files or configure the compiler. Cython integrates well with NumPy and SciPy. This expands the programming tasks you can do with Python substantially.
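
A minimal sketch of the workflow (module and function names are made up): a .pyx file with a few static type declarations, and a setup script that compiles it into an extension module usable from normal Python. The cythonize helper assumes a reasonably recent Cython.

# token_stats.pyx -- mostly plain Python, plus optional C type declarations
def count_long_tokens(list tokens, int min_length):
    cdef int count = 0
    cdef int i
    for i in range(len(tokens)):
        if len(tokens[i]) >= min_length:
            count += 1
    return count

# setup.py
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules=cythonize("token_stats.pyx"))

# build and use from ordinary Python:
#     python setup.py build_ext --inplace
#     >>> import token_stats
#     >>> token_stats.count_long_tokens("the quick brown fox".split(), 4)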

          Use cases
          Speed up slow POS tagging.

            My previous language experience

            Over the years I have experimented with a long list of non mainstream languages: functional, declarative, parallel, array, dynamic and hybrid languages. Many of these were frustrating experiences. I would read about a new language and get very excited. However this would often be the chain of events:
            • Download language
            • Installed Cygwin
            • Find out how the language's build system works
            • Try to find a version of the GCC compiler that will compile it
            • Get the right version of Emacs installed
            • Try to get the debugger working under Emacs
            • Start programming from scratch since the libraries were sparse
            • Burn out

            You only have so much mental capacity, and if you do not use a language you forget it. Only Scala made it into my toolbox.

Do Clojure, Go or Cython belong in your programmer's toolbox?

Clojure, Go and Cython are all simple languages. They are easy to install, easy to learn, and they all have big standard libraries so you can be productive in them right away. This is my first impression:
            • Clojure is a good way to script the extensive Java libraries, for rapid application development and for AI work.
            • Go is a great language but it is still rough around the edges. There are not any big NLP libraries written for Go yet. I would not try to use it for my main NLP tasks.
            • Cython was useful right away for my NLP work. It makes it possible to do fast numerical programming in Python without too much trouble.


            -Sami Badawi