Saturday, July 23, 2011

Scala, Eclipse and Maven integration tutorial

I have evaluated Scala as a language for cloud computing and Hadoop. One requirement was a robust development environment, with a real build system and a good IDE with code completion and debugging.

The combination of Scala, Eclipse and Maven seemed like a fit for this requirement, but my initial experience was mixed.


Problems with Scala, Eclipse and Maven integration

It was easy to install Scala, Eclipse and Maven, but when I set up a project it had a persistent error in Eclipse:

object Predef does not have a member AnyRef

Other problems:

  • There were problems running the unit tests.
  • I had to restart Eclipse a lot.
  • Eclipse had Scala set to version 2.9.0-1 while Maven had 2.8.0. When I tried to change Maven to use 2.9.0-1, the pom.xml file would be marked as having an error.
I searched the internet for help but could not find any. After a good deal of experimenting I sorted out the problems and found a good solution.


Software versions

My setup is:
  • Scala 2.9.0-1
  • Eclipse 3.7 Indigo
  • Scala-ide Eclipse plugin: scala nightly 29 - http://download.scala-ide.org/nightly-update-wip-experiment-2.9.0-1


Scala, Eclipse and Maven project setup tutorial

Here are the steps that I took to set up a new Scala, Eclipse and Maven project so it works with unit testing.

Press menu item: File - New - Other...



Select Maven Project




Select the org.scala-tools.archetypes scala-archetype-simple archetype




Add a group id and artifact id to the project. Click Finish



This will create the project with an example program and unit tests, but it will leave Eclipse in an unstable state.
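
For reference, the same project can also be generated from the command line with the Maven archetype plugin. This is a sketch of the equivalent command; the group id and artifact id are placeholders:

mvn archetype:generate -DarchetypeGroupId=org.scala-tools.archetypes -DarchetypeArtifactId=scala-archetype-simple -DgroupId=com.example -DartifactId=scalademo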

In the project's pom.xml file, change the Scala version property and the specs test dependency as shown below:


<properties>
 <maven.compiler.source>1.5</maven.compiler.source>
 <maven.compiler.target>1.5</maven.compiler.target>
 <encoding>UTF-8</encoding>
 <scala.version>2.9.0-1</scala.version>
</properties>


<dependency>
 <groupId>org.scala-tools.testing</groupId>
 <artifactId>specs_${scala.version}</artifactId>
 <version>1.6.8</version>
 <scope>test</scope>
</dependency>

Now the Scala IDE and Maven are both using the same version of Scala: 2.9.0-1.


Right click the whole project and select: Configure - Add Scala Nature




Now use the Maven build system to clean, build and run the unit tests, from either Eclipse or the command line.

From Eclipse, right click the whole project and select:
Maven clean
Maven install





From the command line:

C:\prog\apache-maven-2.2.1\bin\mvn clean
C:\prog\apache-maven-2.2.1\bin\mvn install

Note that you have to use Maven 2.2 and not Maven 3.



Now there should be no more errors.
The generated unit test "scalatest.scala" has some problems; delete it.

Run all unit tests from Eclipse by right clicking the whole project and selecting Run As - JUnit Test.



Now you can see the result in the JUnit runner.


Final impression of Scala, Eclipse and Maven integration

Once I had resolved the problems, the combination of Scala, Eclipse and Maven was a great development environment that met my requirements.

One thing that is currently missing from the Scala Eclipse plugin is code refactoring. Refactoring works very well in both Eclipse for Java and Visual Studio for C#.



Tuesday, July 12, 2011

Natural language processing in F# and Scala

I do natural language processing in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating if there are any new languages that would help.

In 2010 I tried out 3 new languages:
Natural language processing in Clojure, Go and Cython

Recently I have investigated F# and Scala. They are both hybrid functional and object oriented languages, inspired by ML / OCaml / Haskell and Java / C#.

Python as the benchmark

Python is widely used in natural language processing. I am most productive in Python for NLP work. Here are a few reasons why:

  • NLTK is a great Python NLP library
  • Lots of open source math and science libraries, e.g. NumPy and SciPy
  • PyDev is a good development environment
  • Good integration with MongoDB library
  • Great for rapid development

Python shortcomings
  • Slow compared to compiled languages
  • GUI support is crude
  • Multi-threading is crude
  • No compilation step; compilation does give more robustness

It should be possible to make a super language that has the elegance of Python, but without these shortcomings.


My first Scala experience

In 2006 I thought Scala was this super language. It is very advanced, and you can call any Java library from Scala, including all the open source libraries. But I ran into a list of problems with Scala:

  • The Scala IDE was far behind Eclipse Java
  • Scala is a quite complex language
  • The Java libraries and the functional programming libraries were badly integrated
  • There was no Scala REPL or interpreter like in Python
Scala was stable enough for use, but it did not improve my productivity, so after some months I went back to using Python as my scripting language.


Python's weakness

Recently I had to make a small text processing application that end users could use directly. This was not the best fit for Python. Normally my Python programs have no GUI and are controlled by command line parameters.

I had 2 Python options:

Make simple GUI using TkInter
TkInter is a Python wrapper of Tk, the cross platform GUI toolkit. It is pretty crude by modern GUI standards, but it would have been good enough. However, trying to install all the Python libraries that I needed on the end users' machines would be setting myself up for a maintenance nightmare.

Wrap code in web application
I could wrap a web interface around it. But the application uses a lot of memory, and I would have to maintain a web application.

I had a 1 week hard deadline for the task and both of these options looked unappealing. I needed something else...


My first F# application

I took a chance on F#, and managed to learn enough F# to finish the program by my 1 week deadline.

There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code. It was not pretty, but you could give it to an end user. The whole application ended up being one 40KB executable file, and it was very fast. F# had actually filled a niche that Python does not fill so well.

There were also problems: I wrote the whole application from scratch, while in Python I would have been able to use NLTK, write the code faster and get better results.

All in all this was a very good experience. I thought that F# would be a good supplement to my Python library. It would give me both raw speed when I need it and good connectivity with C#, ASP.NET, WPF and Microsoft Office.


Functional programming benefits

Functional programming is a great fit for my NLP work.

I have a lot of different text sources: databases, flat files, directories, public RESTful web services.

I have many word transformations: stop word filters, stemmers, custom filters.

I need many operations building on other operations: Bigram finder, POS tagger, named entity recognizer.

I create different reports: database, CSV, Excel.

In a functional language you can take any combination of these operations and easily pipe them together, while getting good compiler support. This does not fit so well with object oriented programming, where you are more concerned with encapsulation.
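
To make this concrete, here is a minimal Scala sketch of the kind of pipeline I mean; the stop word list and the stemmer are toy stand-ins for real components:

// A toy pipeline: tokenizer, stop word filter and stemmer composed into one function.
val stopWords = Set("the", "a", "of", "and")

def tokenize(text: String): List[String] =
  text.toLowerCase.split("\\W+").filter(_.nonEmpty).toList

def stopWordFilter(words: List[String]): List[String] =
  words.filterNot(stopWords.contains)

def stem(words: List[String]): List[String] =
  words.map(w => if (w.endsWith("s")) w.dropRight(1) else w) // toy stemmer

// Pipe the operations together; the compiler checks that the stages fit.
val pipeline = (tokenize _) andThen stopWordFilter andThen stem

println(pipeline("The dogs and the cats")) // List(dog, cat)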


F# impression

F# is the first compiled language I have tried that is comparable to Python in simplicity and elegance. It has a real Pythonic feel:

  • F# is fast
  • Simple and elegant
  • Good development environment in Visual Studio 2010
  • Best concurrency support of any language I have seen
  • Good database support
  • Good MongoDB library
  • Simple to combine F# with C# or VB.NET for ASP or WPF
  • Good REPL

Issues

  • Runs best under Windows
  • For an IDE you really need Visual Studio 2008 or 2010, and that costs at least $700
  • F# can be compiled and run from the shell in SharpDevelop 4.0 and 4.1, but you do not have the same productivity
  • The math libraries under .NET are not as good as NumPy and SciPy
  • The NLP libraries are better under Python


Scala revisited

After the success with F#, I was very curious why it had worked out so much better than my first experience with Scala.

I looked at an F# and Scala cheat sheet and thought they looked remarkably similar. I watched a few screencasts and found no obvious problems. I bought the book Programming in Scala, Second Edition; it turned out to be a very interesting computer science book, and I read all 852 pages. Scala still looked good.

I installed the Scala Eclipse plugin and wrote some code. Both the language and the IDE have come a long way during the last 5 years:
  • 15 books about Scala
  • 2 great free books
  • Tooling is much better
  • IDE is much better with code completion
  • Native NLP libs: ScalaNLP and Kiama

Of all the issues I had when I first tried Scala, the only remaining one is that Scala is a pretty complex language.

It is incredible how Scala has taken a lot of messy features from Java and turned them into a clean modular system, at the cost of some complex abstractions.


F# vs. Scala

Despite many similarities, the languages have a different feel. F# is simpler to understand, while Scala is the more orthogonal language. I have been very impressed by both.

F# better

  • Simpler to understand
  • Fantastic concurrency
  • Tail recursion optimized
  • Works well with Windows Azure

Scala better 

  • More orthogonal, reusing the same constructs
  • Works with any Java library so more libraries
  • Better NLP libraries
  • Works well with Hadoop


Cloud computing

Functional programming works well with cloud computing. For me the availability of a good functional language is a substantial factor in selecting a cloud platform.

Google introduced MapReduce to handle massive parallel multi computer applications.

Hadoop is the Java based open source version of MapReduce. To run Hadoop natively you have to use a JVM language like Java or Scala.

Hadoop Streaming extends a limited version of Hadoop to programs written in other languages, as long as they work like UNIX pipes that read from stdin and write to stdout.
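
To illustrate that contract, here is a minimal word count mapper sketch in Scala; all Hadoop Streaming requires of a mapper is that it reads lines from stdin and writes tab separated key/value pairs to stdout:

// Hadoop Streaming mapper: read text from stdin, write "word<TAB>1" lines to stdout.
object StreamingWordCountMapper {
  def main(args: Array[String]) {
    for (line <- scala.io.Source.fromInputStream(System.in).getLines();
         word <- line.toLowerCase.split("\\s+") if word.nonEmpty)
      println(word + "\t1")
  }
}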

There is a Python wrapper for Hadoop Streaming called Dumbo. Python is around 10 times slower than Java, and Dumbo is a limited version of Hadoop, so if you are trying to do NLP on massive amounts of data this might not solve your problems.

Scala is fast and will give you full access to run native Hadoop.
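
For example, a native Hadoop mapper can be written in Scala directly against the Java MapReduce API. This is a word count sketch; the class name is mine:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// A native Hadoop word count mapper written in Scala against the Java API.
class WordCountMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context) {
    for (token <- value.toString.toLowerCase.split("\\s+") if token.nonEmpty) {
      word.set(token)
      context.write(word, one)
    }
  }
}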

Microsoft's version of MapReduce is called Dryad, or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.


NLP and other languages

Let me finish by giving a few short comparisons of F# and Scala with other languages:


Clojure vs. Scala

Clojure is a LISP dialect that also runs on the JVM, and it is the other big functional language there. Clojure has some distinct niches for NLP:

Clojure better
  • Language understanding
  • Formal semantics: taking text and translating it to first order logic
  • Artificial intelligence tasks

Scala better
  • It is easy to write fast Scala code
  • Smaller learning curve coming from Java
I tried Clojure recently and was very impressed, but more of my work falls in the category that would benefit from Scala.


Java vs. Scala

Java better

  • Better IDE tools and support
  • Better GUI builders
  • Great refactoring support
  • Many more programmers that know Java

Scala better

  • Terser code
  • Closures
  • First class functions
  • More expressive language


C# vs. F#


C# better

  • Better IDE tools and support
  • Better GUI builders
  • There are a lot more programmers that know C#
  • Better LINQ to SQL support

F# better

  • Terse code
  • Better support for concurrency: async and continuations
  • More productive for NLP


Conclusion

F# and Scala are similar hybrid functional object oriented languages.

For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing; and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.

I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches where they are more productive than object oriented languages like C# and Java.

I really enjoy programming in F# and Scala; they are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.

For GUI and web programming the object oriented languages still rule. Stick with C# or Java if the NLP part is small, or if the GUI or web interface is the dominant part.

Java and C# are also improving e.g. by adding more functional features. Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...


Newer follow-up article covering Scala, ScalaNLP and Spark MLlib

Friday, June 17, 2011

Cloud Computing For Data Mining Part 1

The first half of this blog post is about selecting a cloud provider for a data mining and natural language processing system. I will compare 3 leading cloud computing providers: Amazon Web Services, Windows Azure and OpenStack.
To help me choose a cloud provider, I have been looking for users with experience running cloud computing for applications similar to data mining. I found them at CloudCamp New York in June 2011. It was an unconference, so the attendees were split into user discussion groups. In the last half of the post I will mention the highlights from these discussions.


The Hype

"If you are not in the cloud you are not going to be in business!"

This is the message many programmers, software architects and project managers face today. You do not want to go out of business because you could not keep up with the latest technologies; but looking back, many companies have gone out of business because they invested in the latest must have technology that turned out to be expensive and over engineered.


Reason For Moving To Cloud

I have a good business case for using cloud computing: namely, scaling a data mining system to handle a lot of data. To begin with it would be a moderate amount of data, but it could change to Big Data on short notice.


Horror Cloud Scenario

I am trying to minimize the risk of this scenario:
  1. I port to a cloud solution that is tied closely to one cloud provider
  2. Move the applications over
  3. After a few months I find that there are unforeseen problems
  4. No easy path back
  5. Angry customers are calling


Goals

Here are my cloud computing goals in a little more details:
  • Port data mining system and ASP.NET web applications to the cloud
  • Choose a cloud compatible with a code base in .NET and Python
  • Initially the data volume is moderate but it could possibly scale to Big Data
  • Keep cost and complexity under control
  • No downtime during transition
  • Minimize risk
  • Minimize vendor lock in
  • Run the same code in house and in the cloud
  • Make rollback to in house application possible


Amazon Web Services vs. Windows Azure vs. OpenStack

Choosing the right cloud computing provider has been time consuming, but also very important.

I took a quick stroll through Cloud Expo 2011, and most big computer companies were there presenting their cloud solutions.

Google App Engine is a big cloud service well suited for front end web applications, but it is not good for data mining, so I will not cover it here.

The other 3 providers that have generated the most momentum are EC2, Azure and OpenStack.

Let me start by listing their similarities:
  • Virtual computers that can be started with short notice
  • Redundant robust storage
  • NoSQL structured data
  • Message queue for communication
  • Mountable hard disk
  • Local non persistent hard disk
Now I will write a little more about where they differ, and about their good and bad parts:


Amazon Web Services, AWS, EC2, S3

Good:
  • This is the oldest cloud provider dating back to 2004
  • Very mature provider
  • Other providers are catching up with AWS's features
  • Well documented
  • Work well with open source, LAMP and Java
  • Integrated with Hadoop: Elastic MapReduce
  • A little cheaper than Windows Azure
  • Runs Linux, Open Solaris and Windows servers
  • You can run code on your local machine and just save the result into S3 storage (see the sketch below)

Bad:
  • You cannot run the same code in house and in the cloud
  • Vendor lock in
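
As an example of running code locally and just saving the result into S3: with the AWS SDK for Java, called here from Scala, it only takes a few lines. This is a sketch; the credentials, bucket and file names are placeholders:

import java.io.File
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client

// Upload a locally computed result file to S3.
object SaveResultToS3 {
  def main(args: Array[String]) {
    val s3 = new AmazonS3Client(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"))
    s3.putObject("my-results-bucket", "results/output.txt", new File("output.txt"))
  }
}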


Windows Azure

Good:
  • Works well with the .NET framework and all Microsoft's tools
  • It is very simple to port an ASP.NET application to Azure
  • You can run the same code on your development machine and in the cloud
  • Very good development and debugging tools
  • F# is a great language for data mining in cloud computing
  • Great series of video screen casts

Bad:
  • Only runs Windows
  • You need a Windows 7, Windows Server 2008 or Windows Vista machine to develop
  • Preferably you should have Visual Studio 2010
  • Vendor lock in


OpenStack

OpenStack is a new open source collaboration that is making a software stack that can be run both in house and in the cloud.

Good:
  • Open source
  • Generating a lot of buzz
  • Main participants are NASA and Rackspace
  • Backed by 70 companies
  • You can run your application either in house or in the cloud

Bad:
  • Not yet mature enough for production use
  • Windows support is immature


Java, .NET Or Mixed Platform

For data mining selecting the right platform is a hard choice. Both Java and .NET are very attractive options.

Java only
For data mining and NLP there are a lot of great open source projects written in Java. E.g. Mahout is a system for collaborative filtering and clustering of Big Data, with distributed machine learning. It is integrated with Hadoop.
There are many more open source projects: OpenNLP, Solr, ManifoldCF.

.NET only
The development tools in .NET are great, and .NET works well with Microsoft Office.
Visual Studio 2010 comes with F#, which is a great language for writing worker roles. It is very well suited for lightweight threads or async, and for highly parallel reactive programs.

Mix Java and .NET
You can mix Java and .NET. Cloud computing makes it easier than ever to integrate different platforms. You already have abstract, language agnostic services for communication: message queues, blob storage, structured data. If you have an ASP.NET front end on top of collaborative filtering of Big Data, this would be a very attractive option.

I still think that combining 2 big platforms like Java and .NET introduces complexity, compared to staying within one platform. You need an organization with good resources and coordination to do this.


Choice Of Cloud Provider

I still have a lot of unanswered questions at this point.

At the time of writing, June 2011, OpenStack is not ready for production use. So that is out for now.

I have run some tests on AWS. It was very easy to deploy my Python code to EC2 under Linux. Programming C# against the AWS services was simple.

I am stuck waiting to get a Windows 7 machine so I can test Windows Azure.

Both EC2 and Azure seem like viable options for what I need. I will get back to this in part 2 of the blog post.


Highlights from Cloud Camp 2011

A lot of people are trying to sell you cloud computing solutions, and I have heard plenty of cloud computing hype. I have been seeking advice from people who were not trying to sell me anything and had some real experience, trying to find some of the failures and problems in cloud computing.

I went to Cloud Camp June 2011 during Cloud Expo 2011 in New York, where cloud computing users shared their experience. It was an unconference, meaning spontaneous user discussion breakout groups were formed. The rest of this post is highlights from these discussions.


Hadoop Is Great But Hard

Hadoop is a Java open source implementation of Google's MapReduce. You can set up a workflow of operations and Hadoop will distribute them over multiple computers, aggregate the results and rerun operations that fail. This sounds fantastic, but Hadoop is a pretty complex system, with a lot of new terminology and a steep learning curve.


Security Is Your Responsibility

Security is a big issue. You might assume that the cloud will take care of security, but you should not. E.g. you should clean up the hard disks that you have used, so the next user cannot see your data.


Cloud Does Not Automatically Scale To Big Data

The assumption is that you put massive amounts of data in the cloud and the cloud takes care of the scaling problems.
If you have a lot of data that needs little processing, then cloud computing becomes expensive: you store all data in 3 different locations, and it is expensive and slow to move it down to the different compute nodes. This was mentioned as the reason why NASA could not use S3, but built its own Nebula platform.


You Accumulate Cost During Development

An entrepreneur building a startup ended up paying $2000 / month for EC2. He used a lot of different servers, and they had to be running with multiple instances, even though he was not using a lot of resources. This might be cheap compared to going out and buying your own servers, but it was more expensive than he expected.


Applications Written In .NET Run Fine Under EC2 Windows

An entrepreneur said that he was running his company's .NET code under EC2. He thought that Amazon was more mature than Azure, with Azure catching up. He preferred to build his own framework.


Simpler To Run .NET Application On Azure Than On EC2

A cloud computing consultant with lots of experience in both Azure and EC2 said: EC2 gives you a raw machine, so you have to do more to get your application running than if you plop it into Windows Azure.
It is very easy to port an ASP.NET application to Windows Azure.


Cash Flow, Operational Expenses And Capital Expenses

An often cited reason why cloud computing is great is that a company can replace big upfront capital expenses with smaller operational expenses. A few people mentioned that companies live by their cash flow, and they do not like unpredictable operational expenses; they are more comfortable with predictable capital expenses.


Friday, April 1, 2011

Practical Probabilistic Topic Models for NLP

Latent Dirichlet Allocation, LDA, is a new and very powerful technique for finding the topics in a collection of texts, using unsupervised learning. LDA is a probabilistic topic model; it was developed in 2003 and relies on advanced math. This post is a practical guide about how to get started building LDA models and software.

LDA will have a substantial impact on corpus based natural language processing, since it opens the door to easy creation of semantic models based on machine learning.

Motivation for topic models


With the Internet we have a large amount of text available. Having the text categorized into topics makes text search much more precise and makes it possible to find similar documents.

Text categorization is not an easy problem:
  • Texts usually deal with more than one topic
  • There is no clear standard for categorization
  • Doing it by hand is infeasible
Nuanced categorization is a hard problem with many moving parts, but in 2003 David M. Blei, Andrew Y. Ng and Michael I. Jordan published an article on a new approach called Latent Dirichlet Allocation. LDA can be implemented based on the research articles, but if you are not a machine learning academic the math is intimidating and the material is still new.

There is actually good material available, but finding all the pieces takes some work. Most things you need are available online for free. Here is a chronological account of what I did to understand LDA and start implementing it.

Need for more sophisticated hierarchical topic models


In 2009 I needed a fine grained classification of text, using unsupervised or semi supervised training. I spent a little time thinking about it, and had some ideas about bootstrapped training in a 2 layered hierarchy. It was hackish and complex, and I was not sure how numerically stable it was. I never got around to implementing it.


David Blei


I went to the 4th Annual Machine Learning Symposium in 2009 and asked around for solutions to my problem. Several attendees told me to look at David Blei's work. I did, but he has written a lot of math heavy articles, so I did not know where to start.

I was lucky to see David Blei give a presentation on LDA at the 5th Annual Machine Learning Symposium. David Blei works at Princeton and just exudes brilliance. He gave a lucid, entertaining description of LDA with examples. It was really shocking to see the LDA algorithm find scientific topics on its own with no human intervention.

I saw him give the same talk at the NYC Machine Learning Meetup, and luckily that was videotaped; here are part 1 and part 2. I watched these videos a few times. This gave me a good intuition for the algorithm.

I looked through his articles and found a good beginner article, BleiLafferty2009. I read through it several times, but I could not understand it.

I went out and bought the textbook that David Blei recommended: Pattern Recognition and Machine Learning by Christopher M. Bishop. After reading the introduction chapter, I read BleiLafferty2009 again and was able to understand it. On page 10 the essence of the algorithm is described in a small text box.
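
For reference, the essence is LDA's generative process. In the notation of the LDA papers, for each document d you draw a topic mixture, and for each word position n you draw a topic and then a word:

\theta_d \sim \mathrm{Dirichlet}(\alpha)             % topic mixture of document d
z_{d,n} \sim \mathrm{Multinomial}(\theta_d)          % topic of word n in document d
w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})   % word drawn from that topic's word distribution

Learning runs this process in reverse: given only the observed words w, the algorithm infers the hidden topic mixtures theta and the topic word distributions beta.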


Software implementation of LDA


There are plenty of open source implementations of LDA. Here are a few observations:

lda-c in C by David Blei is an implementation in old school C. The code is readable, concise and clean.

lda for R package by Jonathan Chang. Implements many models, with extensive documentation.

Online LDA in Python by Matt Hoffman. Short code, but not too much documentation.

LDA in Apache Mahout, in Java. An active development community; works with Hadoop / MapReduce.



No matter what language you prefer there should be a good implementation.


Practical software considerations


All the implementations looked good. But if you want to use LDA software, then robustness, scalability and extensibility are big issues. First you just want the algorithm to run on simple text input. The next day you want the following options (see the sketch after this list):
  • Better word tokenizer
  • Bigrams and collocation
  • Word stemmer
  • LDA on structured text
  • Read from database
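
Most LDA implementations also expect a bag of words per document rather than raw text, so even the simple case needs a preprocessing step. Here is a minimal Scala sketch of that step; the tokenizer and stop word list are toy placeholders:

// Turn a raw document into bag-of-words counts, the input form most LDA code expects.
val stopWords = Set("the", "a", "of", "and", "to")

def bagOfWords(text: String): Map[String, Int] =
  text.toLowerCase.split("\\W+")
    .filter(w => w.nonEmpty && !stopWords(w))
    .groupBy(identity)
    .map { case (w, occurrences) => (w, occurrences.length) }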


Programming language choice for LDA


Here is a little common sense advice on the choice of programming language for LDA programming.

C
C is an elegant, simple system programming language.
C is not my first choice of a language for text processing.


C++
C++ is a very powerful but also complex language.
NLP lib: The Lemur Project
I would be happy to use C++ for text processing.


C#
C# is a great language.
NLP lib: SharpNLP.
You will have to implement LDA yourself or port one of the other implementations. SciPy is being ported to C#, but .NET does not have the best numeric libraries.


Clojure
Clojure is a moderate sized LISP dialect built on the Java JVM.
NLP lib: OpenNLP through clojure-opennlp.
LISP is a classic AI language, and you can use one of the Java LDA implementations.


Java
Java is a modern object oriented programming language with access to every library imaginable.
NLP lib: OpenNLP.


Python
Python is an elegant language very well suited for NLP.
NLP lib: NLTK, using NumPy and SciPy


R
R is a fantastic language for statistics, but not so great for low level text processing.
The R implementation of LDA looks great. I think that it is common to do all the preprocessing in another language, say Perl, and then do the rest of the work in R.


Different versions of LDA


There are now a lot of different LDA models geared towards different domains. Let me just mention a couple:

Online LDA
Online means that you learn the model in small batches instead of on all the documents at once. This is useful for a continuously running system.

Dynamic LDA
Good for handling text that stretches over a long time interval say 100 years.

Hierarchical LDA
This handles topics that are organized in hierarchies.


Gray box approach to LDA


The math needed for LDA is advanced. If you do not succeed in understanding it, I still think that you can learn to use the code, if you are willing to take something on faith and get your hands dirty.

Bedtime Science Stories: My Science Education Blog

I started a science education blog called Bedtime Science Stories. Here is a little excerpt from my first post, Can and should a 3 year old girl be into science?


I have a 3 year old daughter who has taken a bit of an interest in science. We have been talking about science when I put her to bed at night.


Last Sunday I discovered a new book called Battle Hymn of the Tiger Mother by Amy Chua, who is a law professor at Yale. She uses extreme methods to push her 2 daughters to academic excellence. They had to be the best in their class in everything except drama and physical education. Math was a topic that she really drilled them in. Just reading the back cover sent me into a rage; so much so that I decided to start a new blog, Bedtime Science Stories, just to get my anger out.

Science should not be an elite activity. Making it very competitive will make a new generation of kids hate math and science. Understanding our world is a worthwhile activity even if you are not the best in your class.

My 3 year old daughter

Tuesday, February 15, 2011

Is IBM Watson Beginning An AI Boom?

Artificial intelligence fell out of favor in the 1970s, the start of the first artificial intelligence winter, and has mainly been out of favor since. In April 2010 I wrote a post about how you can now get a paying job doing AI, machine learning and natural language processing outside academia.

Now barely one year later I have seen a few demonstrations that signal that artificial intelligence has taken another leap towards mainstream acceptance:
  • Yann LeCun demonstrated a computer vision system that could learn to recognize objects from his pocket after being shown a few examples, in a talk about learning feature hierarchies for computer vision
  • Andrew Hogue demonstrated Google Squared and Google Sentiment Analysis at Google Tech Talk, those systems both show rudimentary understanding of web pages and use word association
  • IBM's Watson supercomputer is competing against the best human players on Jeopardy
All these 3 systems contain some real intelligence. It is rudimentary by human standards, but AI has gone from very specialized systems to handling more general tasks. It feels like AI is picking up steam. I am seeing startups based on machine learning pop up. This reminds me of the Internet boom in the 1990s. I moved to New York in 1996, at the beginning of the Internet boom, and saw firsthand the crazy gold rush where fortunes were made and lost in a short time, Internet startups were everywhere and everybody was talking about IPOs. This got me thinking: are we headed towards an artificial intelligence boom, and what would it look like?

IBM Watson

IBM Watson is a well executed factoid extraction system, but it is also a brilliant marketing move, promoting IBM's new POWER7 system and their Smart Planet consulting services. It gives some people the impression that we already have human-like AI, and in that sense it could serve as a catalyst for investments in AI. This post is not about human-like artificial intelligence, but about the spread of shallow artificial intelligence.

Applications For Shallow Artificial Intelligence

Both people and corporations would gain value from having AI systems that they could ask free form questions and get answers from on very diverse topics. In particular in these fields:
  • Medical science
  • Law
  • Surveillance
  • Military

Many people, me included, are concerned about a big brother state and military use of AI, but I do not think that is going to stop adoption. These people play for keeps.

There are signs that the financial service industry is starting to use sentiment analysis in its pricing and risk models. Shallow AI would be a good candidate for more advanced algorithmic trading.

Bottom Up vs. Top Down Approaches

Here is a very brief, simplified introduction to AI techniques and tools. AI is a loosely defined field, with a loose collection of techniques. You can roughly categorize them into top down and bottom up approaches.

Top down or symbolic techniques:
  • Automated reasoning
  • Logic
  • Many forms of tree search
  • Semantic networks
  • Planning

Bottom up or machine learning techniques:
  • Neural networks, computers with a structure similar to the brain
  • Machine learning

The top down systems are programmed by hand, while the bottom up systems learn by themselves from examples without human intervention, a bit like the brain.
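
As a toy illustration of the bottom up style, here is a minimal perceptron sketch in Scala. The training data and learning rate are made up; the point is only that the program adjusts its weight numbers from labeled examples instead of being given rules:

// Toy perceptron: learns weights from labeled examples, with no hand coded rules.
object ToyPerceptron {
  def main(args: Array[String]) {
    // Made-up training data for logical OR: (features, expected label of -1 or 1).
    val examples = List(
      (Array(0.0, 0.0), -1.0), (Array(0.0, 1.0), 1.0),
      (Array(1.0, 0.0), 1.0), (Array(1.0, 1.0), 1.0))
    var w = Array(0.0, 0.0)
    var bias = 0.0
    val rate = 0.1

    for (_ <- 1 to 100; (x, label) <- examples) {
      val out = if (w(0) * x(0) + w(1) * x(1) + bias > 0) 1.0 else -1.0
      if (out != label) { // wrong answer: nudge the numbers toward the example
        w = Array(w(0) + rate * label * x(0), w(1) + rate * label * x(1))
        bias += rate * label
      }
    }
    println("learned weights: " + w.mkString(", ") + ", bias: " + bias)
  }
}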

What Is Causing This Sudden Leap?

Many top down techniques were developed by the 1960s. They were very good ideas, but they did not scale; they only worked for small toy problems.
Neural networks are an important bottom up technique. They started in the 1950s but fell out of favor; they came roaring back in the 1980s. In the 1990s the machine learning / statistical approaches to natural language processing beat out Chomsky's generative grammar approach.

The technology that is needed for what we are doing now has been around for a long time. Why are these systems popping up now?

I think that we are seeing the beginning of a combination of machine learning with top down techniques. The reason why this has taken so long is that it is hard to combine top down and bottom up techniques. Let me elaborate a little bit:

Bottom up AI / machine learning systems are black boxes: you give them some input and expected output, and they adjust a lot of parameter numbers to mimic the result. Usually the numbers will not make much sense; they just work.

In top down / symbolic AI you are creating detailed algorithms for working with concepts that make sense.

Both top down and bottom up techniques are now well developed and better understood. This makes it easier to combine them.

Other reasons for the leap are:
  • Cheap, powerful and highly parallel computers
  • Open source software, where programmers from around the world develop free software. This makes programming more of an industrial assembly of parts.

Who Will Benefit From An AI Boom?

Here are some groups of companies that made a lot of money during the Internet boom:
  • Cisco and Oracle, the tool makers
  • Amazon and eBay, small companies that grew to become dominant in e-commerce
  • Google and Yahoo, advertisement driven information companies

Initially, big companies like IBM and Google that can create the technology should have an advantage, whether it will be in the capacity of tool makers or dominant players.

It is hard to predict how high the barrier to entry in AI will be. AI programs are just trained on regular text found on or off the Internet. And today's supercomputer is tomorrow's game console. The Internet has a few dominant players, but it is generally decentralized and anybody can have a web presence.

New York is now filled with startups using machine learning as a central element. They are "funded", but it seems like they only got some seed capital. So maybe there is room for smaller companies to compete in the AI space.

Job Skills That Will Be Required In An AI Boom

During the Internet boom I met people with a bit of technical flair and no education beyond high school who picked up HTML in a week, and the next thing they were making $60/hour writing plain HTML. I think that the jobs in artificial intelligence are going to be a little more complex than those 1990s web developer jobs.

In my own work I have noticed a move from writing programs to teaching software by example. This is a dramatic change, and it requires a different skill set.

I think that there will still be plenty of need for programmers, but cognitive science, mathematics, statistics and linguistics will be skills in demand.

My work would benefit from me having better English language skills. The topic that I am dealing with is, after all, the English language. So maybe that English literature degree could come in handy.

Currently I feel optimistic about the field of artificial intelligence; there is progress after years of stagnation. We are wrestling a few secrets away from Mother Nature, and are making progress in understanding how the brain works. Still, the introduction of such powerful technology as artificial intelligence is going to affect society for better and worse.