I have worked with big data technologies for several years. It has been a steep learning curve, but lately I have had more successes.
This post is about the big data technologies I like and
continue to use. Big data is a big topic. These are some highlights from my
experience.
I will relate big data technologies to modern web architecture with predictive analytics and raise the question:
What big data technologies should I use for my web startup?
Classic Three Tier Architecture
For a long time, software development was dominated by the three-tier / client-server architecture. It is well described and conceptually simple:
- Client
- Server for business logic
- Database
It is straightforward to figure out what computations should go where.
The modern web architecture is not nearly as well established. It is more like an eight-tier architecture with the following components:
- Web client
- Caching
- Stateless web server
- Real-time services
- Database
- Hadoop for log file processing
- Text search
- Ad hoc analytics system
A problem with the modern web architecture is that a given calculation can be done in many of those components.
Architecture for Predictive Analytics
First you need to collect user metrics. In which components can you do this?
- Web servers, storing the metric in Redis / caching
- Web servers, storing the metric in the database
- Real-time services, aggregating the user metrics
- Hadoop, run over the log files
User metrics are needed for predictive analytics and machine learning. Here are some scenarios:
- If you are doing simple popularity based predictive analytics this can be done in the web server or a real-time service.
- If you use a Bayesian bandit algorithm you will need to use a real-time service for that.
- If you recommend based on user similarity or item similarity you will need to use Hadoop.
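As a sketch of the bandit option: a Bayesian bandit keeps a Beta posterior per variant and plays the best posterior draw each round (Thompson sampling). This is a toy in-process version with made-up click rates, not production code:

```python
import random

def pick_arm(successes, failures, rng):
    # Draw once from each arm's Beta posterior and play the highest draw.
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def simulate(true_rates, rounds=2000, seed=42):
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = pick_arm(successes, failures, rng)
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:  # simulated click
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls

# Two hypothetical ad variants with 5% and 15% click rates.
pulls = simulate([0.05, 0.15])
```

Over the rounds the traffic concentrates on the better variant while still exploring the weaker one, which is why this needs a long-running real-time service rather than a stateless web server.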
Hadoop
Hadoop is a complex piece of software for handling amounts of data too large to fit on one computer. I compared different Hadoop libraries in my post: Hive, Pig, Scalding, Scoobi, Scrunch and Spark.
The Hadoop library Mahout can calculate user recommendations based on user similarity or item similarity.
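At toy scale, item similarity boils down to comparing items' rating vectors; Mahout performs this kind of computation distributed over Hadoop. The users and ratings below are invented for illustration:

```python
from math import sqrt

# Toy rating data: user -> {item: rating}. Invented for illustration.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 4, "B": 5, "C": 1},
    "carol": {"A": 1, "B": 1, "C": 5},
}

def item_vector(item):
    # The item's column in the user-item matrix, 0 for missing ratings.
    return [user_ratings.get(item, 0) for user_ratings in ratings.values()]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(item, candidates):
    # Recommend the candidate whose rating pattern is closest to the item's.
    return max(candidates, key=lambda c: cosine(item_vector(item), item_vector(c)))
```

With real traffic the matrix has millions of rows, which is what pushes this computation onto Hadoop.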
Scalding
Hadoop code in Scalding looks a lot like normal Scala code. The scripts I am writing are often just 10 lines of code and look a lot like my other Scala code. The catch is that you need to be able to write idiomatic functional Scala code.
HIVE
HIVE makes it easy to extract and combine data from HDFS. You just write SQL, after some setup mapping a table structure onto a directory in HDFS.
Real-time Services
Libraries like Akka, Finagle and Storm are well suited for long-running stateful computations.
It is hard to write correct highly parallel code that scales to multiple machines using normal multithreaded programming. For more details see my blog post: Akka vs. Finagle vs. Storm.
Akka and Spray
Akka is a simple actor model taken from the Erlang language. In Akka you have a lot of very lightweight actors that can share a thread pool. They do not block on shared state but communicate by sending immutable messages. One reason Akka is a good fit for real-time services is that you can choose varying degrees of loose coupling, and all services can talk to each other.
It is hard to change from traditional multithreaded programming to using the actor model. There are just a lot of new actor idioms and design patterns that you have to learn. At first the actor model seems like working with a sack of fleas. You have much less control over the flow due to the distributed computation.
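The core idea, no shared mutable state and communication only through a mailbox of messages, can be sketched with plain threads and queues. This is a toy stand-in, not how Akka works internally; Akka actors are far lighter than one thread each:

```python
import threading
import queue

class Actor:
    """Toy actor: one mailbox, one thread, state touched only by that thread."""
    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg is None:  # poison pill shuts the actor down
                break
            self._handler(msg)

    def send(self, msg):
        self._mailbox.put(msg)

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()

# A counter actor for a hypothetical ad campaign.
counts = {}

def count_click(campaign):
    counts[campaign] = counts.get(campaign, 0) + 1

clicks = Actor(count_click)
for _ in range(3):
    clicks.send("campaign-42")
clicks.stop()  # drains the mailbox, then joins the thread
```

Because only the actor's own thread touches its state, no locks are needed; that is the discipline the actor idioms enforce.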
Akka is well suited for: E-commerce, high frequency trading in finance, online advertising and simulations.
Akka in Online Advertising
Millions of users are interacting with fast-changing ad campaigns. You could have actors for:
- Each available user
- Each ad campaign
- Each ad
- Clustering algorithms
- Each cluster of users
- Each cluster of ads
NoSQL
There are a lot of options, with no query standard:
- Cassandra
- CouchDB
- HBase
- Memcached
- MongoDB
- Redis
- SOLR
MongoDB
MongoDB was my first NoSQL technology. I used it to store structured medical documents.
Creating a normalized SQL database to represent a structured data format is a sizable task, and you easily end up with 20 tables. It is hard to insert a structured document into the database in the right order so that foreign key constraints are satisfied. LINQ to SQL helped with this, but it was slow.
I was amazed by MongoDB's simplicity:
- It was trivial to install
- It could insert 1 million documents very fast
- I could use the same Python NLP tools for many different types of documents
I felt that SQL databases were so 20th century.
After some use I realized that interacting with MongoDB from Scala was not as easy. I tried different libraries, Subset and Casbah.
I also realized that querying MongoDB is much harder than querying a SQL database, both in syntax and expressiveness. Recently, SQL databases have added JSON as a data type, taking away some of MongoDB's advantage.
Today I use SQL databases for curated data, but MongoDB for ad hoc structured document data.
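The JSON-in-SQL point can be tried directly with SQLite, whose JSON functions ship with most Python builds (availability depends on how the bundled SQLite was compiled). The document shape here is a made-up example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO docs (body) VALUES (?)",
    ('{"patient": {"name": "Ada", "visits": 3}}',),
)
# json_extract pulls a field out of the stored document via a JSON path.
name = conn.execute(
    "SELECT json_extract(body, '$.patient.name') FROM docs"
).fetchone()[0]
```

You get document storage without giving up SQL's query expressiveness.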
Redis
Redis is an advanced key-value store that lives mainly in memory but backs up to disk. Redis is a good fit for caching. It has some specialized operations:
- Simple aging out of data
- Pub/sub messaging
- Atomic update increments
- Atomic list append
- Set operations
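The age-out operation can be sketched in-process; the pattern below mimics what Redis SETEX gives you (a value with a time-to-live, expired lazily on read). A toy sketch, not a substitute for Redis:

```python
import time

class TTLCache:
    """Toy sketch of Redis-style SETEX: values expire after a time-to-live."""
    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp)

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.monotonic() >= expiry:
            del self._store[key]  # expire lazily on read
            return None
        return value

cache = TTLCache()
cache.setex("session:1", 60, "alice")
cache.setex("stale", -1, "gone")  # ttl already elapsed
```

This is why Redis fits session and metric caching: stale entries clean themselves up.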
Redis also supports sharding well: in the driver you just give a list of Redis servers, and it sends each key to the right server. Redistributing data after adding more sharded servers, however, is cumbersome.
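Driver-side shard selection of that kind can be sketched as hashing the key into the server list. It also shows why resharding is cumbersome: change the number of servers and the modulo moves most keys. The server names are made up:

```python
import hashlib

def shard_for(key, servers):
    # Hash the key so the same key always lands on the same server.
    digest = hashlib.md5(key.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]
```

Consistent hashing is the usual refinement that limits how many keys move when a server is added.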
I first thought that Redis had an odd array of features but it fits the niche of real-time caching.
SOLR
SOLR is the most widely used enterprise text search technology. It is built on top of Lucene. It can store and search documents with many fields using an advanced query language.
It has an ecosystem of plugins that cover much of what you would want. It is also very useful for natural language processing. You can even use SOLR as a presentation system for your NLP algorithms.
To Cloud or not to Cloud
A few years back I thought that I would soon be doing all my work using cloud computing services like Amazon's AWS. This did not happen, but virtualization did. When I request a new server the OPS team usually spins up a virtual machine.
A problem with cloud services is that storage is expensive. Especially Hadoop sized storage.
If I were in a startup I would probably consider the cloud.
Big and Simple
My first rule of software engineering is: keep it simple.
This is particularly important in big data since size creates inherent complexity.
I made the mistake of being too ambitious too early and thinking out too many scenarios.
Startup Software Stack
What big data technologies should I use for my web startup?
A common startup technology stack is:
Ruby on Rails for your web server and Python for your analytics, hoping that a lot of beefy Amazon EC2 servers will scale your application when your product takes off.
It is fast to get started and the cloud will save you. What could possibly go wrong?
The big data approach I am describing here is more stable and scalable, but before you learn all these technologies you might run out of money.
My answer is: It depends on how much data and how much money you have.
Big Data Not Just Hype
Whenever I see a reference to Hadoop in a library I get very uneasy. These complex big data technologies are often used where much simpler technologies would have sufficed. Make sure you really need them before you start. This could be the difference between your project succeeding or failing.