Friday, November 12, 2010

Growing Python projects from small to large scale

You need significantly different principles for developing small, medium and large scale software systems.

When my project started to get big I searched the Internet for guidelines or best practices on how to scale Python, but did not find much. Here are a few of my observations on which techniques to use for which project sizes.

General principle


For a small system you can spend most of your time solving the problem, but the bigger the system gets, the more time you spend on project plans, coordination and documentation. Complexity and cost do not scale linearly with the size of the project; they may scale closer to the square of the size. This holds for different styles of project management, both waterfall and agile.

A central problem is minimizing dependencies and avoiding tight coupling. John Lakos has written an excellent book on software scaling called Large-Scale C++ Software Design; here is a summary. It is a very scientific and stringent approach, and it is specific to C++. He developed a metric for how many dependencies you have in your system. His techniques are not a good fit for smaller projects: you could finish several scripts before you had even implemented his methodology.

Small scripts
Keep it simple. Focus on the core functionality. Minimize the time you spend on setting up the project.

Medium applications
Spending some time organizing things will save you time in the long run.

Large applications
Here you need a lot of structure; otherwise the project will not be stable.

Development environment


Small scripts

I use the PyWin Windows IDE.
  • It is lightweight
  • No need for Java or Eclipse
  • Syntax highlighting
  • Code completion at run time and some at write time
  • Allows primitive debugging
  • You do not need to set up a project to use it.

Medium applications
I have used both PyWin and PyDev.

Large applications
I would strongly recommend the PyDev Eclipse plugin. It is a modern IDE that runs pylint continuously and has good code completion while you write code. It will find maybe half the errors a compiler would find. This improves stability a lot and was the most important change that I made from my old coding style.

Organization of code


Small scripts
Use one module (file) containing all the code. It can hold several classes. The advantage is that deployment becomes trivial: you just email the script to the user. This works for modules up to around 3000 lines of code.

Medium applications
Use one directory with all the modules in it. This gives you fewer issues with PYTHONPATH.

Make a convention for naming fields, databases and parameters. Put all these names in a module that contains only string constants, and use the constants in your code instead of raw strings.
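
A minimal sketch of such a constants module. The module name names.py, the constant names and the make_record helper are all made up for illustration; the point is only that every string literal lives in one place.

```python
# names.py (hypothetical module): nothing but string constants
DB_NAME = "articles_db"        # assumed database name
FIELD_TITLE = "title"          # assumed field names
FIELD_AUTHOR = "author"
PARAM_LIMIT = "limit"          # assumed parameter name


# In other modules you would "import names" and refer to names.FIELD_TITLE.
# It is shown inline here so that the sketch is self-contained.
def make_record(title, author):
    """Build a record using the shared field names instead of raw strings."""
    return {FIELD_TITLE: title, FIELD_AUTHOR: author}
```

A misspelled constant fails immediately with a NameError or AttributeError and is flagged by pylint, while a misspelled raw string silently creates a new field.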

Use a separate repository for the project. I package the Python code and other self-written executables together in one repository, even when I have another source control system for the compiled sources.

This works up to around 40 Python modules; after that it becomes hard to find anything.

Large applications
Read and follow the Python style guide. I used to follow a Java style guide, since Java is big on coding conventions, but the Python style is actually pretty different. A noticeable difference is that a Java file contains a main class with a title case name and the file has the same name. In Python, modules should have short lowercase names, while classes should still have title case names.
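
As a small illustration of the naming difference, assume a hypothetical module file called tokenizer.py. The class and method are invented for this example and the sentence splitting is deliberately naive.

```python
# tokenizer.py  (module name: short, all lowercase)


class SentenceTokenizer:  # class name: capitalized words, no underscores
    """Split text into sentences on full stops (deliberately naive)."""

    def split_sentences(self, text):  # method name: lowercase with underscores
        return [part.strip() for part in text.split(".") if part.strip()]
```

In Java the file would have been named after the class; here the module can hold several related classes and functions.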

Organizing packages as an acyclic graph
Refactor the modules into packages. The packages should be organized as an acyclic graph. At the lowest level you would have a util package that is not allowed to reference anything else. You can have other specialized packages that can access the util package. Above that I have the main source directory with code that is central and general. Above that I have a loader package that can access all the other packages.
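
One way to keep yourself honest about this layering is to write the allowed dependencies down and check them for cycles. This is only a sketch: the package name nlp stands in for a specialized package and is an assumption, and it uses graphlib from the standard library, which requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical layout: each package maps to the packages it may import.
ALLOWED_IMPORTS = {
    "util": set(),                      # lowest level, imports nothing
    "nlp": {"util"},                    # a specialized package (assumed name)
    "main": {"util", "nlp"},            # central, general code
    "loader": {"util", "nlp", "main"},  # may access everything
}


def layering_order(dependencies):
    """Return the packages lowest layer first.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Calling layering_order(ALLOWED_IMPORTS) lists util first and loader last. If someone later lets util import loader, the same call raises CycleError instead.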

One problem when you have different directories is that the PYTHONPATH needs to include all the code. A good way to handle this is to add the parent directory to the system path before you import any of the modules.
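
A minimal sketch of that path setup, assuming the script sits one directory below the project root. The helper name add_parent_to_path is made up for this example.

```python
import sys
from pathlib import Path


def add_parent_to_path(module_file):
    """Put the parent of the module's directory at the front of sys.path."""
    parent = str(Path(module_file).resolve().parent.parent)
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent


# Typical call at the top of a script, before importing project packages:
# add_parent_to_path(__file__)
```

Doing this in one place, for instance in the loader package, avoids repeating path manipulation in every module.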

Documentation


Small scripts
Usually I have:
  • A Python docstring in the program
  • A printed usage message
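
The two can be the same text. Here is a sketch in which the module docstring doubles as the usage message; the script name count_lines.py and what it does are invented for illustration.

```python
"""Count the lines in a text file.

Usage: count_lines.py FILE
"""
import sys


def main(argv):
    """Print the line count of argv[1], or the usage text if it is missing."""
    if len(argv) != 2:
        print(__doc__)  # in a real script this is the module docstring above
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        print(sum(1 for _ in handle))
    return 0
```

At the bottom of a real script you would call sys.exit(main(sys.argv)) under the usual check that __name__ equals "__main__".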

Medium applications
Have a directory for documentation. To keep it simple I prefer to use simple HTML. I find that Mozilla SeaMonkey is simple to use and generates clean HTML you can do a diff on. Often I have:
  • User documentation page 
  • Programmer documentation page
  • Release notes
  • Example data

Large applications
At this point using automatically generated documentation and some sort of wiki format for writing documentation is a good idea.

Communication


Input and output account for a sizable part of your code. I prefer to use the most lightweight method I can get away with.

Small scripts and medium applications
Communication is done with flat files, CSV files and databases.
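
As a rough sketch of how little code this can take, here is a CSV file loaded into a database table using only the standard library. SQLite, the table name word_counts and the column names are assumptions chosen for the example, not a recommendation of a particular database.

```python
import csv
import io
import sqlite3


def load_csv_into_table(csv_text, connection):
    """Insert CSV rows with 'word' and 'frequency' columns into word_counts."""
    rows = csv.DictReader(io.StringIO(csv_text))
    connection.execute(
        "CREATE TABLE IF NOT EXISTS word_counts (word TEXT, frequency INTEGER)"
    )
    # Each row is a dict, so the named placeholders pick out the columns.
    connection.executemany(
        "INSERT INTO word_counts (word, frequency) VALUES (:word, :frequency)",
        rows,
    )
    connection.commit()


connection = sqlite3.connect(":memory:")  # throwaway in-memory database
load_csv_into_table("word,frequency\npython,3\njava,1\n", connection)
```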

Large applications
Communication is done with flat files, CSV files, databases, MongoDB and CherryPy.

MongoDB has dramatically simplified my work. Before, each type of structured data demanded its own database with several tables. Now I just load the data into a MongoDB collection. MongoDB makes very differently structured documents look uniform and trivial to load from Python. After that I can use the same script on very different data.

When you have a script and find out that other programs need to call it, it is very simple to create an XML, JSON or text based RESTful web service using CherryPy. You just add a one-line decorator to a method and it is now a web service. You barely have to make any changes to your program. CherryPy feels very Pythonic. This gives you a very cheap way to connect to a GUI or a web site written in other languages.

Unit tests


Small scripts
Unit tests give you a small advantage. I still write unit tests unless there is an emergency, and when I skip them I usually regret it.

Large applications
The bigger the system, the more important it is that the individual pieces work. Large systems are not maintainable if you do not have unit tests.
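
A minimal sketch of what such a test looks like with the standard unittest module. The tokenize function is a made-up stand-in for one of the individual pieces.

```python
import unittest


def tokenize(text):
    """Hypothetical helper: split text into lowercase words."""
    return text.lower().split()


class TokenizeTest(unittest.TestCase):
    def test_splits_on_whitespace(self):
        self.assertEqual(tokenize("Hello  World"), ["hello", "world"])

    def test_empty_string_gives_no_tokens(self):
        self.assertEqual(tokenize(""), [])
```

Saved in a file such as test_tokenize.py, this runs with python -m unittest from the project directory.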

Source control system


I put any code that I use for production in a source control system. I usually use Subversion or Git.

Subversion is good for centralized development, and it is nice that each check-in has a sequential revision number, so that revision 123 is followed by revision 124.

Git is better for distributed development; it is easy to create a local repository for a project.

Small scripts
One repository for each type of script.

Medium and large applications
One repository for each project.

Use of standard libraries


Small scripts
Use the simplest approach that gets the work done.

Large applications

When my application grew I realized that I had recreated functionality from the standard libraries.
I refactored my program to use the standard library and found that it was much better than what I had written. For bigger applications, using standard libraries makes your code less buggy and more maintainable. So spend some time finding out what has already been written.

How well does Python scale compared to compiled languages


There are mixed opinions on this topic. Scripts are generally small, and large systems are generally written in compiled languages. The extra checks and rigidity you get from a compiled language become more important the bigger your applications get. If you are writing a financial application and have a very low tolerance for errors, this could be significant.

I am using Python for natural language processing: classification, named entity recognition, sentiment analysis and information extraction. I have to write many complex custom scripts fast.

Based on my earlier experience with writing smaller Python scripts I was concerned about writing a bigger application. I found a good setup with PyDev, unit tests and source control. It gives me much of the stability I am used to in a compiled language, while I can still do rapid development.


-Sami Badawi