Thursday, December 16, 2010

NLTK under Python 2.7 and SciPy 0.9.0

Python 2.7 has been out for months, but I have been stuck using Python 2.6 since SciPy was not working for Python 2.7. The SciPy 0.9 Beta 1 binary distribution has just been released.
Normally I try to stay clear of beta-quality software, but I really like some of the new features in Python 2.7, especially the argparse module. So, despite my better judgement, I installed Python 2.7.1 and SciPy 0.9.0 Beta 1 to run with a big NLTK-based library. This blog post describes the configuration that I use and my first impression of its stability.

SciPy 0.9 RC1 was released January 2011.
SciPy 0.9 was released February 2011.
I tried both of them and found almost the same results as for SciPy 0.9 Beta 1, which this review was originally written for.

Direct downloads
Here is a list of the programs I installed directly:

Installation of NLTK

The install was very simple; just type:

\Python27\lib\site-packages\easy_install.py nltk


Other libraries installed with easy_install.py

  • CherryPy
  • ipython
  • PIL
  • pymongo
  • pyodbc

    YAML Library
    On the Windows Vista computer where I tested this NLTK install, which has no MS C++ compiler, I also had to do a manual install of YAML from:
    http://pyyaml.org/wiki/PyYAML

    Libraries from unofficial binary distributions
    There are a few packages that have build problems, but can be loaded from Christoph Gohlke's site with Unofficial Windows Binaries for Python Extension Packages: http://www.lfd.uci.edu/~gohlke/pythonlibs/ I downloaded and installed:
    • matplotlib-1.0.0.win32-py2.7.exe
    • opencv-python-2.2.0.win32-py2.7.exe

    Stability

    The installation was simple. Everything installed cleanly. I ran some bigger scripts and they ran fine. Development and debugging also worked fine. Out of 134 NLTK-related unit tests, only one failed under Python 2.7.
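A quick way to confirm that the packages above are importable is a small check script. This is only a sketch: the module names in MODULES are my assumptions about the import names (they can differ from the package names, for example PyYAML imports as yaml), and check_imports and report are hypothetical helper names.

```python
import importlib

# Assumed import names for the packages mentioned in this post.
MODULES = ["nltk", "numpy", "scipy", "matplotlib", "yaml",
           "cherrypy", "IPython", "PIL", "pymongo", "pyodbc"]


def check_imports(names):
    """Map each module name to its version string, or None if the import failed."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Missing package or a binary extension that could not be loaded.
            results[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            results[name] = getattr(module, "__version__", "unknown")
    return results


def report(results):
    """Format the results as one line per module, sorted by name."""
    lines = []
    for name in sorted(results):
        version = results[name]
        status = version if version is not None else "NOT INSTALLED"
        lines.append("%-12s %s" % (name, status))
    return "\n".join(lines)
```

Calling print(report(check_imports(MODULES))) from the Python 2.7 interpreter lists what is installed for that interpreter, which is handy when two Python versions live on the same machine.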

    Problems with SciPy algorithms

    The failing unit test was maximum entropy training using the LBFGSB optimization algorithm. These were my settings:
    nltk.MaxentClassifier.train(train, algorithm='LBFGSB', gaussian_prior_sigma=1, trace=2)

    At first the maximum entropy training would not run at all, because it calls the method rmatvec() in scipy/sparse/base.py. This method has been deprecated for a while and has been taken out of SciPy 0.9. I found the method in SciPy 0.8 and added it back. My unit test then ran, but instead of finishing in a couple of seconds it took around 10 minutes and ate up 1.5 GB of memory before it crashed. After this I gave up on LBFGSB.

    If you do not want to use LBFGSB, megam is another efficient optimization algorithm. However, it is implemented in OCaml, and I did not want to install OCaml on a Windows computer.

    This problem occurred for both SciPy 0.9 Beta 1 and RC1.

    Python 2.6 and 2.7 interpreters active in PyDev

    Another problem was that having both Python 2.6 and 2.7 interpreters active in PyDev made it less stable. When I started scripts from PyDev, they sometimes timed out before starting. PyLint would also show errors in code that was correct. I deleted the Python 2.6 interpreter under PyDev Preferences, and PyDev worked fine with just Python 2.7.

    I also added a version check to the one failing unit test, since it caused problems on my machine:
    if (2, 7) < sys.version_info: return
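The same guard can also be written with the skipIf decorator from the unittest module, which makes the test show up as skipped instead of silently passing. This is a sketch, not the code from my test suite: the class name, test name and placeholder body are made up for illustration. Note that unittest.skipIf is itself new in Python 2.7, so this form only works if the test file no longer has to load under Python 2.6.

```python
import sys
import unittest


class MaxentTrainingTest(unittest.TestCase):  # hypothetical test case name

    @unittest.skipIf(sys.version_info >= (2, 7),
                     "LBFGSB training fails with SciPy 0.9 under Python 2.7")
    def test_lbfgsb_training(self):
        # Placeholder body. The real test would call
        # nltk.MaxentClassifier.train(..., algorithm='LBFGSB').
        self.fail("should have been skipped on Python 2.7 and later")


def run_suite():
    """Run the test case above and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(MaxentTrainingTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The runner then reports the test as skipped together with the reason string, so the disabled test stays visible in the test output.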

    Multiple versions of Python on Windows

    If you install Python 2.7 and realize that some code only runs under Python 2.6, or that you have to roll back, here are a few simple suggestions:

    I did a Google search for:
    python multiple versions windows
    This will show many ways to deal with this problem. One way is to call a little Python script that changes the Windows registry settings.

    Multiple versions of Python have not been a big problem for me, so I favor a very simple approach. The main issue is file extension binding: which program gets called when you double-click a py file or type script.py on the command line.

    Changing file extension binding for rollback to Python 2.6

    Under Windows XP you can change file extension bindings in Windows Explorer:
    Go to Tools > Folder Options > File Types
    Select the PY extension and press Advanced (the Change button only picks a different program)
    Select open and press Edit
    The value is:
    "C:\Python27\python.exe" "%1" %*
    You can change this to use a different interpreter:
    "C:\Python26\python.exe" "%1" %*

    Or even simpler when I want to run the older Python interpreter I just type:
    \Python26\python.exe script.py
    Instead of typing
    script.py
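The command strings above can also be generated instead of typed by hand. The sketch below only builds the strings; it does not touch the registry. It assumes that the Python installer registered the file type name Python.File for .py files, which is the usual default, and that the resulting line is pasted into a command prompt with administrator rights, where the built-in ftype command sets the open command for a file type. The function names are hypothetical.

```python
def open_command(install_dir):
    """Build the 'open' command string for a given Python install directory."""
    # %%1 and %%* become the literal %1 and %* placeholders that Windows expands.
    return '"%s\\python.exe" "%%1" %%*' % install_dir


def ftype_command(install_dir, file_type="Python.File"):
    """Build an ftype command line that rebinds the file type to that install."""
    # "Python.File" is an assumption about the registered file type name.
    return "ftype %s=%s" % (file_type, open_command(install_dir))
```

For example, ftype_command(r"C:\Python26") gives the line to roll the binding back to Python 2.6, and ftype_command(r"C:\Python27") gives the line to switch forward again.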

    Are Python 2.7 and SciPy 0.9.0 Beta 1 stable enough for NLTK use?

    The installation of all the needed software was fast and unproblematic, but I would certainly not use this setup in a production environment. If you run a lot of numerical algorithms, you should probably hold off. If you are impatient and do not need to train new models, it is worth trying; you can always roll back.