
Monday, September 23, 2013

Big Data: What Worked?

"Big data" created an explosion of new technologies and hype: NoSQL, Hadoop, cloud computing, highly parallel systems and analytics.

I have worked with big data technologies for several years. It has been a steep learning curve, but lately I have had more success stories.

This post is about the big data technologies I like and continue to use. Big data is a big topic. These are some highlights from my experience.

I will relate big data technologies to modern web architecture with predictive analytics and raise the question:

What big data technologies should I use for my web startup?

Classic Three Tier Architecture

For a long time software development was dominated by the three-tier / client-server architecture. It is well described and conceptually simple:
  • Client
  • Server for business logic
  • Database
It is straightforward to figure out what computations should go where.

Modern Web Architecture With Analytics

The modern web architecture is not nearly as well established. It is more like an eight-tier architecture with the following components:
  • Web client
  • Caching
  • Stateless web server
  • Real-time services
  • Database
  • Hadoop for log file processing
  • Text search
  • Ad hoc analytics system
I was hoping that I could do something closer to the 3-tier architecture, but the components have very different features. Kicking off a Hadoop job from a web request could adversely affect your time to first byte.

A problem with the modern web architecture is that a given calculation can be done in many of those components.

Architecture for Predictive Analytics

It is not at all clear in which component predictive analytics should be done.

First you need to collect user metrics. In which components can you do this?
  • Web servers, storing the metrics in Redis / the caching layer
  • Web servers, storing the metrics in the database
  • Real-time services, aggregating the user metrics
  • Hadoop jobs run on the log files

User metrics are needed for predictive analytics and machine learning. Here are some scenarios:
  • If you are doing simple popularity-based predictive analytics, this can be done in the web server or a real-time service.
  • If you use a Bayesian bandit algorithm, you will need a real-time service for that (a sketch follows this list).
  • If you recommend based on user similarity or item similarity, you will need Hadoop.
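
To make the bandit case concrete, here is a minimal Thompson sampling sketch in Scala. It assumes the Apache Commons Math library for Beta sampling, and the arm names and update calls are invented for the example; a real-time service would keep this state per ad or per recommendation slot.

```scala
import org.apache.commons.math3.distribution.BetaDistribution
import scala.collection.mutable

// Minimal Thompson sampling sketch: each arm (e.g. an ad variant) keeps a
// Beta posterior over its click-through rate, built from observed clicks.
class BayesianBandit(arms: Seq[String]) {
  private val successes = mutable.Map(arms.map(_ -> 1.0): _*)
  private val failures  = mutable.Map(arms.map(_ -> 1.0): _*)

  // Pick the arm whose sampled click-through rate is highest.
  def selectArm(): String =
    arms.maxBy(arm => new BetaDistribution(successes(arm), failures(arm)).sample())

  // Update the posterior after observing whether the user clicked.
  def update(arm: String, clicked: Boolean): Unit =
    if (clicked) successes(arm) += 1.0 else failures(arm) += 1.0
}

// Hypothetical usage:
// val bandit = new BayesianBandit(Seq("adA", "adB", "adC"))
// val shown  = bandit.selectArm()
// bandit.update(shown, clicked = true)
```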

Hadoop

Hadoop is a very complex piece of software for handling data sets so large that conventional software cannot process them, because they do not fit on one computer.

I compared different Hadoop libraries in my post: Hive, Pig, Scalding, Scoobi, Scrunch and Spark.

Most developers who have used Hadoop complain about it. I am no exception. I still have problems with Hadoop jobs failing due to errors that are hard to diagnose. Generally, though, I have been a lot happier with Hadoop lately. I only use Hadoop for big custom extractions or calculations on log files stored in HDFS. I do my Hadoop work in Scalding or HIVE.

The Hadoop library Mahout can calculate user recommendations based on user similarity or item similarity.

Scalding

Hadoop code in Scalding looks a lot like normal Scala code. The scripts I write are often just 10 lines long and look much like my other Scala code. The catch is that you need to be able to write idiomatic functional Scala code.
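
To give a feel for the size, here is roughly what the classic Scalding word count looks like; the input and output paths are placeholders passed on the command line.

```scala
import com.twitter.scalding._

// A small Scalding job: read lines from HDFS, split them into words, count them.
// Run with --input and --output arguments pointing at HDFS paths.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```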

HIVE

HIVE makes it easy to extract and combine data from HDFS. After mapping a table structure onto a directory in HDFS, you just write SQL.

Real-time Services

Libraries like Akka, Finagle and Storm are good for long-running stateful computations.
It is hard to write correct, highly parallel code that scales to multiple machines using normal multithreaded programming. For more details see my blog post: Akka vs. Finagle vs. Storm.

Akka and Spray

Akka is a simple actor model taken from the Erlang language. In Akka you have a lot of very lightweight actors that can share a thread pool. They do not block on shared state but communicate by sending immutable messages.
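
A minimal sketch of that style, using Akka's actor API; the Greeter actor and its message type are made up for illustration.

```scala
import akka.actor._

// An immutable message; actors communicate only by sending messages like this.
case class Greet(name: String)

// A lightweight actor: it processes one message at a time and holds its own state.
class Greeter extends Actor {
  def receive = {
    case Greet(name) => println(s"Hello, $name")
  }
}

object GreeterDemo extends App {
  val system  = ActorSystem("demo")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! Greet("world") // fire-and-forget send, no blocking
}
```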

One reason that Akka is a good fit for real-time services is that it allows varying degrees of loose coupling, and all services can talk to each other.

It is hard to change from traditional multithreaded programming to the actor model. There are a lot of new actor idioms and design patterns to learn. At first the actor model seems like working with a sack of fleas: you have much less control over the flow because the computation is distributed.

Spray makes it easy to put a web or RESTful interface on your service. This makes it easy to connect your service with the rest of the world. Spray also has the best Scala serialization system I have found.
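
As a rough sketch of such a RESTful front end, here is the kind of minimal spray-routing application the spray documentation shows; the route and port are just examples.

```scala
import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

// Expose a tiny HTTP endpoint in front of a service.
object HelloService extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("hello-service")

  startServer(interface = "localhost", port = 8080) {
    path("hello") {
      get {
        complete("Hello from the service")
      }
    }
  }
}
```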

Akka is well suited for e-commerce, high-frequency trading in finance, online advertising and simulations.

Akka in Online Advertising

Millions of users are interacting with fast-changing ad campaigns. You could have actors for:
  • Each available user
  • Each ad campaign
  • Each ad
  • Clustering algorithms
  • Each cluster of users
  • Each cluster of ads
Each actor evolves over time and can notify and query the other actors.

NoSQL

There are a lot of options, with no query standard:
  • Cassandra
  • CouchDB
  • HBase
  • Memcached
  • MongoDB
  • Redis
  • SOLR
I will describe my initial excitement about NoSQL, the comeback of SQL databases and my current view on where to use NoSQL and where to use SQL.

MongoDB

MongoDB was my first NoSQL technology. I used it to store structured medical documents.

Creating a normalized SQL database that represents a structured data format is a sizable task, and you easily end up with 20 tables. It is hard to insert a structured document into the database in the right sequence so that foreign key constraints are satisfied. LINQ to SQL helped with this, but it was slow.

I was amazed by MongoDB's simplicity:
  • It was trivial to install
  • It could insert 1 million documents very fast
  • I could use the same Python NLP tools for many different types of documents

I felt that SQL databases were so 20th century.

After some use I realized that interacting with MongoDB from Scala was not as easy. I tried different libraries: Subset and Casbah.
I also realized that it is a lot harder to query data from MongoDB than from a SQL database, both in syntax and in expressiveness.
Recently SQL databases have added JSON as a data type, taking away some of MongoDB's advantage.
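
To give a feel for the syntax, here is a small Casbah sketch; the database, collection and field names are invented for the example.

```scala
import com.mongodb.casbah.Imports._

// Connect and insert a structured document without defining a schema first.
val client = MongoClient("localhost", 27017)
val docs   = client("medical")("documents")

docs.insert(MongoDBObject(
  "title"    -> "Discharge summary",
  "sections" -> MongoDBList("history", "medication", "follow-up")
))

// Queries are written as query documents rather than SQL.
val withMedication = docs.find(MongoDBObject("sections" -> "medication"))
```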

Today I use SQL databases for curated data, but MongoDB for ad hoc structured document data.

Redis

Redis is an advanced key-value store that lives mainly in memory, with backup to disk. Redis is a good fit for caching. It has some specialized operations (a small usage sketch follows the list):
  • Simple to age out data
  • Simulates pub sub
  • Atomic update increments
  • Atomic list append
  • Set operations
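
Here is a small sketch of those operations from Scala through the Jedis client; the key names and expiry time are made up for the example.

```scala
import redis.clients.jedis.Jedis

// A few of the Redis operations listed above, via the Jedis client.
val redis = new Jedis("localhost", 6379)

redis.incr("user:42:pageviews")                 // atomic update increment
redis.expire("user:42:pageviews", 3600)         // age out the counter after an hour
redis.rpush("user:42:events", "clicked-ad-7")   // atomic list append
redis.sadd("ad:7:viewers", "42")                // set operations
```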

Redis also supports sharding well: in the driver you just give a list of Redis servers, and it will send the data to the right server. Redistributing data after adding more sharded servers, however, is cumbersome.

At first I thought that Redis had an odd array of features, but it fits the niche of real-time caching well.

SOLR

SOLR is the most used enterprise text search technology. It is built on top of Lucene.
It can store and search documents with many fields using an advanced query language.
It has an ecosystem of plugins that do a lot of the things you would want. It is also very useful for natural language processing. You can even use SOLR as a presentation system for your NLP algorithms.
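
For illustration, here is a small query sketch using the SolrJ client; the core URL, field name and query string are placeholders.

```scala
import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer

// Query a SOLR core over HTTP and print one field from each hit.
val solr  = new HttpSolrServer("http://localhost:8983/solr/documents")
val query = new SolrQuery("title:hadoop")
query.setRows(10)

val results = solr.query(query).getResults
results.asScala.foreach(doc => println(doc.getFieldValue("title")))
```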

To Cloud or not to Cloud

A few years back I thought that I would soon be doing all my work on cloud computing services like Amazon's AWS. This did not happen, but virtualization did. When I request a new server, the ops team usually spins up a virtual machine.
A problem with cloud services is that storage is expensive, especially Hadoop-sized storage.

If I were in a startup I would probably consider the cloud.

Big and Simple

My first rule of software engineering is: keep it simple.

This is particularly important in big data, since size creates inherent complexity.
I made the mistake of being too ambitious too early and thinking through too many scenarios.

Startup Software Stack

Back to the question:

What big data technologies should I use for my web startup?

A common startup technology stack is:
Ruby on Rails for the web server and Python for the analytics, with the hope that a lot of beefy Amazon EC2 servers will scale the application when the product takes off.
It is fast to get started, and the cloud will save you. What could possibly go wrong?

The big data approach I am describing here is more stable and scalable, but before you learn all these technologies you might run out of money.

My answer is: It depends on how much data and how much money you have.

Big Data Not Just Hype

"Big data" is misused and hyped. Still there is a real problem, we are generating an astounding amount of data and sometimes you have to work with it. You need new technologies to wrangle this data.

Whenever I see a reference to Hadoop in a library I get very uneasy. These complex big data technologies are often used where much simpler technologies would have sufficed. Make sure you really need them before you start. This could be the difference between your project succeeding or failing.

It has been humbling to learn these technologies, but after much despair I now enjoy working with them and find them essential for truly big problems.

Saturday, April 3, 2010

Data Mining rediscovers Artificial Intelligence

Artificial intelligence started in the 1950s with very high expectations. AI did not deliver on the expectations and fell into decades-long discredit. I am seeing signs that Data Mining and Business Intelligence are bringing AI into mainstream computing. This blog post is a personal account of my long struggle to work in artificial intelligence through different trends in computer science.

In the 1980s I was studying mathematics and physics, which I really enjoyed. I was concerned about my job prospects, though; there are not many math or science jobs outside academia. Artificial intelligence seemed equally interesting but more practical, and I thought that it could provide me with a living wage. Little did I know that artificial intelligence was about to become an unmentionable phrase that you should not put on your resume if you wanted a paying job.

Highlights of the history of artificial intelligence

  • In 1956 AI was founded.
  • In 1957 Frank Rosenblatt invented the Perceptron, the first generation of neural networks. It was based on the way the human brain works and provided simple solutions to some simple problems.
  • In 1958 John McCarthy invented LISP, the classic AI language. Mainstream programming languages have borrowed heavily from LISP and are only now catching up with LISP.
  • In the 1960s AI got lots of defense funding, especially for military translation software translating from Russian to English.
AI theory made quick advances, and a lot was developed early on. AI techniques worked well on small problems. It was expected that AI could learn, using machine learning, and that this would soon lead to human-like intelligence.

This did not work out as planned. The machine translation did not work well enough to be usable. The defense funding dried up. The approaches that had worked well for small problems did not scale to bigger domains. Artificial intelligence fell out of favor in the 1970s.

AI advances in the 1980s

When I started studying AI, it was in the middle of a renaissance and I was optimistic about recent advances:
  • The discovery of new types of neural networks, after Perceptron networks had been discredited in an article by Marvin Minsky
  • Commercial expert systems were thriving
  • The Japanese Fifth Generation Computer Systems project, written in the new elegant declarative Prolog language, had many people in the West worried
  • Advances in probability theory: Bayesian networks / causal networks
In order to combat this brittleness of intelligence, Doug Lenat started the large-scale AI project CYC in 1984. His idea was that there is no free lunch: in order to build an intelligent system, you have to use many different types of fine-tuned logical inference, and you have to hand-encode a lot of common sense knowledge. Cycorp spent hundreds of man-years building their huge ontology. Their hope was that CYC would be able to start learning on its own after being trained for some years.

AI in the 1990s

I did not lose my patience, but other people did, and AI went from being the technology of the future to yesterday's news. It had become a loser that you did not want to be associated with.

During the Internet bubble, when venture capital funding was abundant, I was briefly involved with an AI Internet start-up company. The company did not take off; its main business was emailing discount coupons to AOL customers. This left me disillusioned, thinking that I would just have to put on a happy face while working on the next web application or trading system.

AI usage today

Even though AI stopped being cool, regular people are using it in more and more places:
  • Spam filters
  • Search engines using natural language processing
  • Biometrics: face and fingerprint detection
  • OCR, e.g. check reading in ATMs
  • Image processing in coffee machines detecting misaligned cups
  • Fraud detection
  • Movie and book recommendations
  • Machine translation
  • Speech understanding and generation in phone menu systems

Euphemistic words for AI techniques

The rule seems to be that you can use AI techniques as long as you call them something else, e.g.:
  • Business Intelligence
  • Collective Intelligence
  • Data Mining
  • Information Retrieval
  • Machine Learning
  • Natural Language Processing
  • Predictive Analytics
  • Pattern Matching

AI is entering mainstream computing now

Recently I have seen signs that AI techniques are moving into mainstream computing:
  • I went to a presentation of SPSS statistical modeling software and was shocked at how many people are now using data mining and machine learning techniques. I was sitting next to people working in a prison, an adoption agency, marketing and a disease prevention NGO.
  • I started working on a data warehouse using SQL Server Analytic Services, and found that SSAS has a suite of machine learning tools.
  • Functional and declarative techniques are spreading to mainstream programming languages.

Business Intelligence compared to AI

Business Intelligence is about aggregating a company's data into an understandable format and analyzing it to provide better business decisions. BI is currently the most popular field using artificial intelligence techniques. Here are a few words about how it differs from AI:
  • BI is driven by vendors instead of academia
  • BI is centered around expensive software packages with a lot of marketing
  • The scope is limited, e.g. find good prospective customers for your products
  • Everything is living in databases or data warehouses
  • BI is data driven
  • Reporting is a very important component of BI

Getting a job in AI

I recently made a big effort to steer my career towards AI. I started an open source computer vision project, ShapeLogic, and put AI back on my resume. A head hunter contacted me and asked if I had any experience in Predictive Analytics. It took me 15 minutes to convince her that Predictive Analytics and AI were close enough that she could forward my resume. I got the job, my first real AI and NLP job.

The work I am doing is not dramatically different from normal software development work. I spend less time on machine learning than on getting AJAX to work with C# ASP.NET for the web GUI, or on upgrading the database ORM from ADO.NET strongly typed datasets to LINQ to SQL. However, it was very gratifying to see my program start to perform a task that had been very time consuming for the company's medical staff.

Is AI regaining respect?

No, not now. There are lots of job postings for BI and data mining but barely any for artificial intelligence. AI is still not a popular word, except in video games, where AI means something different. When I worked as a games developer, what was called AI was just checking whether your character was close to an enemy, in which case the enemy would start shooting in your character's direction.

After 25 long years of waiting, I am very happy to see AI techniques finally become a commodity, and I enjoy working with them even if I have to disguise the work with whatever the buzzword of the day is.

-Sami Badawi