<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7506593179569894775</id><updated>2012-01-02T04:09:56.199-05:00</updated><category term='GIL'/><category term='dad'/><category term='SciPy'/><category term='NLTK'/><category term='Predictive Analytics'/><category term='Clojure'/><category term='VB.NET'/><category term='open source'/><category term='OpenStack'/><category term='ASP.NET'/><category term='MongoDB'/><category term='medical'/><category term='Orange'/><category term='FLTK'/><category term='AI'/><category term='girls'/><category term='haskell'/><category term='natural language processing'/><category term='toddlers'/><category term='Openframeworks'/><category term='IronPython'/><category term='TR1'/><category term='review'/><category term='closures'/><category term='ImageJ'/><category term='startups'/><category term='kids'/><category term='science education'/><category term='scripting'/><category term='Statistica'/><category term='scala'/><category term='science girls'/><category term='Tiger Mother'/><category term='java'/><category term='logic'/><category term='Data Mining'/><category term='F#'/><category term='sentiment analysis'/><category term='LDA'/><category term='NetBeans'/><category term='Big Data'/><category term='NumPy'/><category term='C++0x standard'/><category term='Google tech talk'/><category term='Cython'/><category term='particle counter'/><category term='problems'/><category term='groovy'/><category term='STL'/><category term='Latent Dirichlet Allocation'/><category term='Eclipse'/><category term='SSAS'/><category term='statistics'/><category term='Table Analysis Tool'/><category term='Boost'/><category 
term='ShapeLogic'/><category term='Python'/><category term='DLR'/><category term='javascript'/><category term='WEKA'/><category term='SQL Server'/><category term='ffnet'/><category term='OpenCV'/><category term='particle analyzer'/><category term='Business Intelligence'/><category term='Amazon Web Services'/><category term='C++'/><category term='lazy'/><category term='Artificial Intelligence'/><category term='PyDev'/><category term='unit test'/><category term='comparison'/><category term='Hadoop'/><category term='IBM Watson'/><category term='Processing'/><category term='Project Euler'/><category term='Windows Azure'/><category term='hype'/><category term='Golang'/><category term='vs'/><category term='Rapidminer'/><category term='Go'/><category term='math'/><category term='NLP'/><category term='JVM'/><category term='children'/><category term='vision'/><category term='linguistics'/><category term='cloud computing'/><category term='mlpy'/><category term='stream'/><category term='tutorial'/><category term='CherryPy'/><category term='declarative programming'/><category term='SharpNLP'/><category term='computer art'/><category term='large scale'/><category term='cell'/><category term='C#'/><category term='computer vision'/><category term='Maven'/><category term='Google Squared'/><category term='VXL'/><category term='functional programming'/><category term='mathematics'/><category term='project management'/><category term='jruby'/><category term='machine learning'/><category term='image processing'/><category term='failure'/><category term='R'/><title type='text'>AI Computer Vision</title><subtitle type='html'>Machine learning, natural language processing and ShapeLogic</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default?max-results=100'/><link rel='alternate' 
type='text/html' href='http://blog.samibadawi.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>32</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-2349368635938451327</id><published>2011-07-23T23:37:00.022-04:00</published><updated>2011-07-25T19:55:42.937-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='problems'/><category scheme='http://www.blogger.com/atom/ns#' term='Eclipse'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='Maven'/><category scheme='http://www.blogger.com/atom/ns#' term='unit test'/><category scheme='http://www.blogger.com/atom/ns#' term='scala'/><title type='text'>Scala, Eclipse and Maven integration tutorial</title><content type='html'>I have evaluated Scala as a language for cloud computing and Hadoop. 
One requirement was a robust development environment, with a real build system and a good IDE with code completion and debugging.&lt;br /&gt;&lt;br /&gt;The combination of &lt;a href="http://www.scala-lang.org/"&gt;Scala&lt;/a&gt;,&amp;nbsp;&lt;a href="http://eclipse.org/"&gt;Eclipse&lt;/a&gt; and &lt;a href="http://maven.apache.org/"&gt;Maven&lt;/a&gt; seemed like a fit for this requirement, but my initial experience was mixed.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Problems with Scala, Eclipse and Maven integration&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;It was easy to install Scala, Eclipse and Maven, but when I set up a project it had a persistent error in Eclipse:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;object Predef does not have a member AnyRef&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Other problems:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There were problems running the unit tests.&lt;/li&gt;&lt;li&gt;I had to restart Eclipse a lot.&lt;/li&gt;&lt;li&gt;Eclipse had Scala set to version 2.9.0.1 while Maven had 2.8.0. When I tried to change Maven to use 2.9.0.1 the pom.xml file would be marked as having an error.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;I searched the internet for help but could not find any. 
After a good deal of experimenting I sorted out the problems and found a good solution.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Software versions&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;My setup is:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Scala 2.9.0.1&lt;/li&gt;&lt;li&gt;Eclipse 3.7 Indigo&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.scala-ide.org/"&gt;Scala-ide Eclipse plugin&lt;/a&gt;: scala nightly 29 - http://download.scala-ide.org/nightly-update-wip-experiment-2.9.0-1&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Scala, Eclipse and Maven project setup tutorial&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here are the steps that I took to set up a new Scala, Eclipse and Maven project so that it works with unit testing.&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Press menu item: File - New - Other...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-XxDbiFTmpek/TiuMx3c4GVI/AAAAAAAAAHo/4WYIuPafF6U/s1600/file-menu-new.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="197" src="http://1.bp.blogspot.com/-XxDbiFTmpek/TiuMx3c4GVI/AAAAAAAAAHo/4WYIuPafF6U/s400/file-menu-new.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Select Maven Project&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Fge4Ex9FG54/Tit5hXnP56I/AAAAAAAAAHE/QP80RVn8L_A/s1600/create_maven.png" 
imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="380" src="http://3.bp.blogspot.com/-Fge4Ex9FG54/Tit5hXnP56I/AAAAAAAAAHE/QP80RVn8L_A/s400/create_maven.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Select the org.scala-tools.archetypes scala-archetype-simple archetype&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-W5M6QHlEapA/Tit6i6ETyRI/AAAAAAAAAHI/YD1Ee7DxIis/s1600/maven_archetype.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="305" src="http://3.bp.blogspot.com/-W5M6QHlEapA/Tit6i6ETyRI/AAAAAAAAAHI/YD1Ee7DxIis/s400/maven_archetype.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Add a group id and an artifact id to the project, then click Finish.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-M4PRn4HYBIA/TiuOFcVmZ1I/AAAAAAAAAHs/_0op6i8T8P0/s1600/group-artifact-names.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="303" src="http://2.bp.blogspot.com/-M4PRn4HYBIA/TiuOFcVmZ1I/AAAAAAAAAHs/_0op6i8T8P0/s400/group-artifact-names.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;This will create the project with an example program and unit tests, but it will leave&amp;nbsp;Eclipse in an unstable state.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-3eogqUCzSGo/Tit7tMjceZI/AAAAAAAAAHM/SOh4Jdb7N_M/s1600/error1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="238" src="http://2.bp.blogspot.com/-3eogqUCzSGo/Tit7tMjceZI/AAAAAAAAAHM/SOh4Jdb7N_M/s400/error1.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; 
text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-Z84AjKOByS8/Tit7yS3PSFI/AAAAAAAAAHQ/dFJOCn8MUqQ/s1600/error2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="156" src="http://4.bp.blogspot.com/-Z84AjKOByS8/Tit7yS3PSFI/AAAAAAAAAHQ/dFJOCn8MUqQ/s400/error2.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In the project's pom.xml file, make the changes that I have marked in red:&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&amp;lt;properties&amp;gt;&lt;br /&gt; &amp;lt;maven.compiler.source&amp;gt;1.5&amp;lt;/maven.compiler.source&amp;gt;&lt;br /&gt; &amp;lt;maven.compiler.target&amp;gt;1.5&amp;lt;/maven.compiler.target&amp;gt;&lt;br /&gt; &amp;lt;encoding&amp;gt;UTF-8&amp;lt;/encoding&amp;gt;&lt;br /&gt; &amp;lt;scala.version&amp;gt;&lt;span class="Apple-style-span" style="color: red;"&gt;2.9.0-1&lt;/span&gt;&amp;lt;/scala.version&amp;gt;&lt;br /&gt;&amp;lt;/properties&amp;gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;lt;dependency&amp;gt;&lt;br /&gt; &amp;lt;groupId&amp;gt;org.scala-tools.testing&amp;lt;/groupId&amp;gt;&lt;br /&gt; &amp;lt;artifactId&amp;gt;specs_${scala.version}&amp;lt;/artifactId&amp;gt;&lt;br /&gt; &amp;lt;version&amp;gt;&lt;span class="Apple-style-span" style="color: red;"&gt;1.6.8&lt;/span&gt;&amp;lt;/version&amp;gt;&lt;br /&gt; &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;&lt;br /&gt;&amp;lt;/dependency&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;Now the Scala IDE and Maven are both using the same version of Scala: 
Scala 2.9.0.1&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Right&amp;nbsp;click&amp;nbsp;the whole project and select: Configure - Add Scala Nature&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-P_5Ur3qISJA/TiuHa61_4dI/AAAAAAAAAHg/qp2sB8t2ROk/s1600/scala-nature.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="53" src="http://2.bp.blogspot.com/-P_5Ur3qISJA/TiuHa61_4dI/AAAAAAAAAHg/qp2sB8t2ROk/s400/scala-nature.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Now use the Maven build system to clean, build and run unit tests. Run from either Eclipse or the command line.&lt;br /&gt;&lt;br /&gt;From Eclipse, right click the whole project and select:&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Maven clean&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Maven install&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-R_Wc0dUK8z0/TiuImigNDoI/AAAAAAAAAHk/pglnPYV0W6Q/s1600/run-as-menu.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="195" src="http://4.bp.blogspot.com/-R_Wc0dUK8z0/TiuImigNDoI/AAAAAAAAAHk/pglnPYV0W6Q/s400/run-as-menu.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;From the command line:&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;C:\prog\apache-maven-2.2.1\bin\mvn clean&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace;"&gt;C:\prog\apache-maven-2.2.1\bin\mvn install&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Note that you have to use Maven 2.2 and not Maven 3.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-PPAaHDxG6Ck/Tit9k4W_W3I/AAAAAAAAAHY/ah_ImLZSz6s/s1600/package-explore-ok.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-PPAaHDxG6Ck/Tit9k4W_W3I/AAAAAAAAAHY/ah_ImLZSz6s/s320/package-explore-ok.png" width="280" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Now there should be no more errors.&lt;br /&gt;The unit test "scalatest.scala" has some problems; delete it.&lt;br /&gt;&lt;br /&gt;Run all unit tests from Eclipse by right clicking the whole project and selecting Run As JUnit Test.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-e7h2i3NjXN4/Ti04C-JrEUI/AAAAAAAAAHw/t5BODp0pPgo/s1600/unittest1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-e7h2i3NjXN4/Ti04C-JrEUI/AAAAAAAAAHw/t5BODp0pPgo/s1600/unittest1.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Now you can see the result in the JUnit runner.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Final impression of Scala, Eclipse and Maven integration&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;Once I had resolved the problems, the Scala, Eclipse and Maven combination was a great development environment&amp;nbsp;that met my requirements.&lt;br /&gt;&lt;br /&gt;One thing that is currently missing from the Scala Eclipse plugin is code refactoring. 
Refactoring works very well in both Eclipse for Java and Visual Studio for C#.</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/2349368635938451327/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=2349368635938451327' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/2349368635938451327'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/2349368635938451327'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2011/07/scala-eclipse-and-maven-integration.html' title='Scala, Eclipse and Maven integration tutorial'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-XxDbiFTmpek/TiuMx3c4GVI/AAAAAAAAAHo/4WYIuPafF6U/s72-c/file-menu-new.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-9114932582834947880</id><published>2011-07-12T00:28:00.018-04:00</published><updated>2011-07-13T22:08:53.518-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='functional programming'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='natural language processing'/><category scheme='http://www.blogger.com/atom/ns#' term='F#'/><category scheme='http://www.blogger.com/atom/ns#' term='C#'/><category scheme='http://www.blogger.com/atom/ns#' term='cloud computing'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><category scheme='http://www.blogger.com/atom/ns#' term='NLP'/><category scheme='http://www.blogger.com/atom/ns#' term='Clojure'/><category scheme='http://www.blogger.com/atom/ns#' term='scala'/><title type='text'>Natural language processing in F# and Scala</title><content type='html'>I do natural language processing in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating whether any new languages would help.&lt;br /&gt;&lt;br /&gt;In 2010 I tried out 3 new languages:&lt;br /&gt;&lt;a href="http://blog.samibadawi.com/2010/10/natural-language-processing-in-clojure.html"&gt;Natural language processing in Clojure, Go and Cython&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Recently I have investigated &lt;a href="http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/"&gt;F#&lt;/a&gt; and &lt;a href="http://www.scala-lang.org/"&gt;Scala&lt;/a&gt;. 
They are both hybrid functional / object oriented languages, inspired&amp;nbsp;by &lt;a href="http://en.wikipedia.org/wiki/ML_(programming_language)"&gt;ML&lt;/a&gt;&amp;nbsp;/ &lt;a href="http://caml.inria.fr/"&gt;OCaml&lt;/a&gt;&amp;nbsp;/ &lt;a href="http://www.haskell.org/haskellwiki/Haskell"&gt;Haskell&lt;/a&gt; and Java / C#.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Python as the benchmark&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; is widely used in natural language processing. I am most productive in Python for NLP work. Here are a few reasons why:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.nltk.org/"&gt;NLTK&lt;/a&gt; is a great Python NLP library&lt;/li&gt;&lt;li&gt;Lots of open source math and science libraries, e.g.&amp;nbsp;&lt;a href="http://numpy.scipy.org/"&gt;NumPy&lt;/a&gt; and &lt;a href="http://scipy.scipy.org/"&gt;SciPy&lt;/a&gt;&lt;/li&gt;&lt;li&gt;PyDev is a good development environment&lt;/li&gt;&lt;li&gt;Good integration with the MongoDB library&lt;/li&gt;&lt;li&gt;Great for rapid development&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;Python shortcomings&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Slow compared to&amp;nbsp;compiled languages&lt;/li&gt;&lt;li&gt;GUI support is crude&lt;/li&gt;&lt;li&gt;Multi-threading&amp;nbsp;is crude&lt;/li&gt;&lt;li&gt;No compile step, and compilation does give more&amp;nbsp;robustness&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;It should be possible to make a super language that has the elegance of Python, but without these shortcomings.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;My first Scala experience&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;In 2006 I thought&amp;nbsp;Scala 
was&amp;nbsp;this super language. It is very advanced; you can call any Java libraries from Scala, including all the open source libraries. But I ran into a list of problems with Scala:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The Scala IDE was far behind Eclipse Java&lt;/li&gt;&lt;li&gt;Scala is quite a complex language&lt;/li&gt;&lt;li&gt;The Java libraries and the functional programming libraries were badly integrated&lt;/li&gt;&lt;li&gt;There was no Scala REPL or interpreter like in Python&lt;/li&gt;&lt;/ul&gt;Scala was stable enough for use, but it did not improve my productivity, so after some months I went back to using Python as my scripting language.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Python's weakness&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Recently&amp;nbsp;I had to make a small text processing application that end users could use directly. This was not the best fit for Python.&amp;nbsp;Normally my Python programs have no GUI and are controlled by command line parameters.&lt;br /&gt;&lt;br /&gt;I had 2 Python options:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Make a simple GUI using TkInter&lt;/b&gt;&lt;br /&gt;&lt;a href="http://wiki.python.org/moin/TkInter"&gt;TkInter&lt;/a&gt; is a Python wrapper of &lt;a href="http://www.tcl.tk/"&gt;TK&lt;/a&gt;, the cross platform GUI toolkit. It is pretty crude by modern GUI standards, but would have been good enough. However, trying to install all the Python libraries that I needed on the end user's machine would be setting myself up for a maintenance nightmare.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Wrap code in a web application&lt;/b&gt;&lt;br /&gt;I could wrap a web interface around it. The application uses a lot of memory, and I would have to maintain a web application.&lt;br /&gt;&lt;br /&gt;I had a 1 week hard deadline for the task and both of&amp;nbsp;these options looked&amp;nbsp;unappealing. 
I needed something else...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;My first F# application&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;I took a chance on F#, and managed to learn enough F# to finish the program by my&amp;nbsp;1 week deadline.&lt;/div&gt;&lt;br /&gt;There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code.&amp;nbsp;It was not pretty but you could give it to an end user.&amp;nbsp;The whole application ended up being one 40KB executable&amp;nbsp;file, and it was very fast. F# had actually filled a niche that Python does not fill so well.&lt;br /&gt;&lt;br /&gt;There were also problems: I wrote the whole application from scratch, while in Python I would have been able to use NLTK,&amp;nbsp;write the code faster and&amp;nbsp;get better results.&lt;br /&gt;&lt;br /&gt;All in all this was a very good experience. I thought that F# would be a good supplement to my Python library. 
It would give me both raw speed when I need it and good connectivity with C#, ASP.NET, WPF&amp;nbsp;and Microsoft Office.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Functional programming benefits&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;Functional programming is a great fit for my NLP work.&lt;br /&gt;&lt;br /&gt;I have a lot of different text sources: database, flat file, directory, public RESTful web application services.&lt;br /&gt;&lt;br /&gt;I have&amp;nbsp;many word transformations: stop word filters, stemmers, custom filters.&lt;br /&gt;&lt;br /&gt;I need many operations building on other operations: bigram finder, POS tagger, named entity recognizer.&lt;br /&gt;&lt;br /&gt;I create different reports: database, CSV, Excel.&lt;br /&gt;&lt;br /&gt;In functional languages you can just take any combination of these operations and&amp;nbsp;easily&amp;nbsp;pipe them together while getting good compiler support. This does not fit so well with object oriented programming, where you are more&amp;nbsp;concerned&amp;nbsp;with encapsulation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;F# impression&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;F# is the first compiled language I tried that is comparable to Python in simplicity and elegance. 
It has a real Pythonic feel:&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;F# is fast&lt;/li&gt;&lt;li&gt;Simple and elegant&lt;/li&gt;&lt;li&gt;Good development environment in Visual Studio 2010&lt;/li&gt;&lt;li&gt;Best concurrency support of any language I have seen&lt;/li&gt;&lt;li&gt;Good database support&lt;/li&gt;&lt;li&gt;Good MongoDB library&lt;/li&gt;&lt;li&gt;Simple to combine F# with C# or VB.NET for ASP.NET or WPF&lt;/li&gt;&lt;li&gt;Good REPL&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;Issues&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Runs best under Windows&lt;/li&gt;&lt;li&gt;For an IDE you really need&amp;nbsp;Visual Studio 2008 or 2010, and that costs at least $700&lt;/li&gt;&lt;li&gt;F# can be compiled and the shell can be run from &lt;a href="http://sharpdevelop.net/opensource/sd/"&gt;SharpDevelop&lt;/a&gt; 4.0 and 4.1, but you do not have the same productivity&lt;/li&gt;&lt;li&gt;The math libraries under .NET are not as good as NumPy and SciPy&lt;/li&gt;&lt;li&gt;The NLP libraries are better under Python&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Scala revisited&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;After the success with F# I was very curious about why my F# experience had been so much more successful than my first experience with Scala.&lt;br /&gt;&lt;br /&gt;I looked at an&amp;nbsp;&lt;a href="http://hyperpolyglot.org/ml"&gt;F# and Scala cheat sheet&lt;/a&gt;&amp;nbsp;and thought they looked remarkably similar.&amp;nbsp;I watched a few screen casts and found no obvious problems. I bought the book&amp;nbsp;&lt;a href="http://www.artima.com/shop/programming_in_scala_2ed"&gt;Programming in Scala, Second Edition&lt;/a&gt;. It turned out to be a very interesting computer science book and&amp;nbsp;I read all 852 pages.&amp;nbsp;Scala still looked good.&lt;br /&gt;&lt;br /&gt;I installed the Scala Eclipse plugin and wrote some code. 
Both the language and the IDE have come a long way during the last 5 years:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;15 books about Scala&lt;/li&gt;&lt;li&gt;2 great free books&lt;/li&gt;&lt;li&gt;Tooling is much better&lt;/li&gt;&lt;li&gt;IDE is much better with code&amp;nbsp;completion&lt;/li&gt;&lt;li&gt;Native NLP libs: &lt;a href="http://www.scalanlp.org/"&gt;ScalaNLP&lt;/a&gt; and &lt;a href="http://code.google.com/p/kiama/"&gt;Kiama&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Of all the issues I had when I first tried Scala, the only remaining one is:&lt;br /&gt;&lt;i&gt;Scala is a pretty complex language&lt;/i&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;It is incredible how Scala has taken a lot of messy features from Java and&amp;nbsp;turned&amp;nbsp;them into a clean modular system, at the cost of some complex abstractions.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;F# vs. Scala&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;Despite many similarities, the languages have a different&amp;nbsp;feel. F# is simpler to understand, while Scala is&amp;nbsp;the&amp;nbsp;more orthogonal language. 
I have been very impressed by both.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;F# better&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Simpler to understand&lt;/li&gt;&lt;li&gt;Fantastic concurrency&lt;/li&gt;&lt;li&gt;Tail recursion optimized&lt;/li&gt;&lt;li&gt;Works well with Windows Azure&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;Scala better&lt;/b&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;More orthogonal, reusing the same constructs&lt;/li&gt;&lt;li&gt;Works with any Java library, so more libraries are available&lt;/li&gt;&lt;li&gt;Better NLP libraries&lt;/li&gt;&lt;li&gt;Works well with Hadoop&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Cloud computing&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;Functional programming works well with cloud computing. For me the availability of a good functional language is a substantial factor in selecting a cloud platform.&lt;br /&gt;&lt;br /&gt;Google introduced MapReduce to handle massively parallel multi-computer applications.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is the Java based open source version of MapReduce. To run Hadoop natively you have to use a JVM language like Java or Scala.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://hadoop.apache.org/common/docs/r0.15.2/streaming.html"&gt;Hadoop Streaming&lt;/a&gt; extends a limited version of Hadoop to work with programs written in other programming languages as long as they work like UNIX pipes that read from stdin and write to stdout.&lt;br /&gt;&lt;br /&gt;There is a Python wrapper for Hadoop Streaming called &lt;a href="http://www.audioscrobbler.net/development/dumbo/"&gt;Dumbo&lt;/a&gt;. 
Python is around 10 times slower than Java and Dumbo is a limited version of Hadoop, so if you are trying to do NLP on massive amounts of data this might not solve your problems.&lt;br /&gt;&lt;br /&gt;Scala is fast and will give you full access to run native Hadoop.&lt;br /&gt;&lt;br /&gt;Microsoft's version of MapReduce is called &lt;a href="http://research.microsoft.com/en-us/projects/dryad/"&gt;Dryad&lt;/a&gt; or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;NLP and other languages&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Let me finish by giving a few short comparisons of F# and Scala with other languages:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Clojure vs. Scala&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://clojure.org/"&gt;Clojure&lt;/a&gt; is a LISP dialect that also runs on the JVM, and it is the other big functional language running there. 
Clojure has some distinct niches for NLP:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Clojure better&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Language understanding&lt;/li&gt;&lt;li&gt;Formal semantics: taking text and translating it to first order propositional logic&lt;/li&gt;&lt;li&gt;Artificial intelligence tasks&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;Scala better&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is easy to write fast Scala code&lt;/li&gt;&lt;li&gt;Smaller learning curve coming from Java&lt;/li&gt;&lt;/ul&gt;I tried Clojure recently and was very impressed, but more of my work falls in the category that would benefit from Scala.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Java vs. Scala&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;Java better&lt;/b&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;Better IDE tools and support&lt;/li&gt;&lt;li&gt;Better GUI builders&lt;/li&gt;&lt;li&gt;Great refactoring support&lt;/li&gt;&lt;li&gt;Many more programmers that know Java&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;Scala better&lt;/b&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;Terser code&lt;/li&gt;&lt;li&gt;Closures&lt;/li&gt;&lt;li&gt;First class functions&lt;/li&gt;&lt;li&gt;More expressive 
language&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;C# vs. F#&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;C# better&lt;/b&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;Better IDE tools and support&lt;/li&gt;&lt;li&gt;Better GUI builders&lt;/li&gt;&lt;li&gt;There are a lot more programmers that know C#&lt;/li&gt;&lt;li&gt;Better LINQ to SQL support&lt;/li&gt;&lt;/ul&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;F# better&lt;/b&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;Terse code&lt;/li&gt;&lt;li&gt;Better support for&amp;nbsp;concurrency,&amp;nbsp;Synch, continuations&lt;/li&gt;&lt;li&gt;More productive for NLP&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Conclusion&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;F# and Scala are similar hybrid functional object oriented languages.&lt;br /&gt;&lt;br /&gt;For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing, and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.&lt;br /&gt;&lt;br /&gt;I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches where they are more productive than object oriented languages like C# and Java.&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;I really enjoy programming in F# and Scala; they are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;For GUI and web programming the object oriented languages still rule. Stick with C# or Java if the NLP part is small or the GUI or web interface is the dominant part.&lt;br /&gt;&lt;br /&gt;Java and C# are also improving, e.g. by adding more functional features. 
Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-9114932582834947880?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/9114932582834947880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=9114932582834947880' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/9114932582834947880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/9114932582834947880'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2011/07/natural-language-processing-in-f-and.html' title='Natural language processing in F# and Scala'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-7168257420998988288</id><published>2011-06-17T14:30:00.011-04:00</published><updated>2011-06-19T22:32:59.030-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='F#'/><category scheme='http://www.blogger.com/atom/ns#' term='C#'/><category scheme='http://www.blogger.com/atom/ns#' term='cloud computing'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category 
scheme='http://www.blogger.com/atom/ns#' term='JVM'/><category scheme='http://www.blogger.com/atom/ns#' term='NLP'/><category scheme='http://www.blogger.com/atom/ns#' term='Windows Azure'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><category scheme='http://www.blogger.com/atom/ns#' term='hype'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenStack'/><category scheme='http://www.blogger.com/atom/ns#' term='Big Data'/><category scheme='http://www.blogger.com/atom/ns#' term='problems'/><category scheme='http://www.blogger.com/atom/ns#' term='ASP.NET'/><category scheme='http://www.blogger.com/atom/ns#' term='natural language processing'/><category scheme='http://www.blogger.com/atom/ns#' term='Amazon Web Services'/><category scheme='http://www.blogger.com/atom/ns#' term='failure'/><title type='text'>Cloud Computing For Data Mining Part 1</title><content type='html'>The first half of this blog post is about selecting a&amp;nbsp;cloud provider for a data mining and natural language processing system. 
I will compare 3 leading cloud computing providers: Amazon Web Services, Windows Azure and OpenStack.&lt;br /&gt;To help me choose a cloud provider I have been looking for users with experience running cloud computing for applications similar to data mining. I found them at &lt;a href="http://cloudcamp-ny-2011.eventbrite.com/"&gt;CloudCamp New York June 2011&lt;/a&gt;. It was an unconference, so the attendees were split into user discussion groups. In the last half of the post I will mention the highlights from these discussions.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;The Hype&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;b&gt;&lt;i&gt;"If you are not in the cloud you are not going to be in business!"&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;This is the message many programmers, software architects and project managers face today. 
You do not want to go out of business because you could not keep up with the latest technologies; but looking back, many companies have gone out of business because they invested in the latest must have technology that turned out to be expensive and over engineered.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Reason For Moving To Cloud&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;I have a good business case for using cloud computing: namely, scaling a data mining system to handle a lot of data. To begin with it could be a moderate amount of data, but it could grow to Big Data on short notice.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Horror Cloud Scenario&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;I am trying to minimize the risk of this scenario:&lt;/div&gt;&lt;ol&gt;&lt;li&gt;I port to a cloud solution that is tied closely to one cloud provider&lt;/li&gt;&lt;li&gt;Move the applications over&lt;/li&gt;&lt;li&gt;After a few months I find that there are unforeseen problems&lt;/li&gt;&lt;li&gt;No easy path back&lt;/li&gt;&lt;li&gt;Angry customers are calling&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Goals&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;br /&gt;Here are my cloud computing goals in a little more 
detail:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Port data mining system and ASP.NET web applications to the cloud&lt;/li&gt;&lt;li&gt;Choose a cloud compatible with a code base in .NET and Python&lt;/li&gt;&lt;li&gt;Initially the data volume is moderate but it could possibly scale to Big Data&lt;/li&gt;&lt;li&gt;Keep cost and complexity under control&lt;/li&gt;&lt;li&gt;No downtime during transition&lt;/li&gt;&lt;li&gt;Minimize risk&lt;/li&gt;&lt;li&gt;Minimize vendor lock in&lt;/li&gt;&lt;li&gt;Run the same code in house and in the cloud&lt;/li&gt;&lt;li&gt;Make rollback to in house application possible&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Amazon Web Services vs. Windows Azure vs. OpenStack&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Choosing the right cloud computing provider has been time consuming, but also very important.&lt;br /&gt;&lt;br /&gt;I took a quick stroll through Cloud Expo 2011, and most big computer companies were there presenting their cloud solutions.&lt;br /&gt;&lt;br /&gt;Google App Engine is a big cloud service well suited for front end web applications, but not good for data mining, so I will not cover that here.&lt;br /&gt;&lt;br /&gt;The other 3 providers that have generated the most momentum are: EC2, Azure and OpenStack.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let me start by listing their similarities:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Virtual computers that can be started on short notice&lt;/li&gt;&lt;li&gt;Redundant robust storage&lt;/li&gt;&lt;li&gt;NoSQL structured data&lt;/li&gt;&lt;li&gt;Message queue for communication&lt;/li&gt;&lt;li&gt;Mountable hard disk&lt;/li&gt;&lt;li&gt;Local non persistent hard disk&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Now I 
will write a little more about where they differ, and about their good and bad parts:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Amazon Web Services, AWS, EC2, S3&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Good:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;This is the oldest cloud provider dating back to 2004&lt;/li&gt;&lt;li&gt;Very mature provider&lt;/li&gt;&lt;li&gt;Other providers are catching up with AWS's features&lt;/li&gt;&lt;li&gt;Well documented&lt;/li&gt;&lt;li&gt;Works well with open source, LAMP and Java&lt;/li&gt;&lt;li&gt;Integrated with Hadoop: Elastic MapReduce&lt;/li&gt;&lt;li&gt;A little cheaper than Windows Azure&lt;/li&gt;&lt;li&gt;Runs Linux, OpenSolaris and Windows servers&lt;/li&gt;&lt;li&gt;You can run code on your local machine and just save the result into S3 storage&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Bad:&lt;br /&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;You cannot run the same code in house and in the cloud&lt;/li&gt;&lt;li&gt;Vendor lock in&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Windows Azure&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Good:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Works well with the .NET framework and all Microsoft's tools&lt;/li&gt;&lt;li&gt;It is very simple to port an ASP.NET application to Azure&lt;/li&gt;&lt;li&gt;You can run the same code on your development machine and in the cloud&lt;/li&gt;&lt;li&gt;Very good development and debugging tools&lt;/li&gt;&lt;li&gt;F# is a great language for data mining in cloud computing&lt;/li&gt;&lt;li&gt;Great series of &lt;a href="http://www.msdev.com/Directory/SearchResults.aspx?keyword=windows+Azure#5,1"&gt;video screen 
casts&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Bad:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Only runs Windows&lt;/li&gt;&lt;li&gt;You need Windows 7, Windows Server 2008 or Windows Vista to develop&lt;/li&gt;&lt;li&gt;Preferably you should have Visual Studio 2010&lt;/li&gt;&lt;li&gt;Vendor lock in&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;OpenStack&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;a href="http://www.openstack.org/"&gt;OpenStack&lt;/a&gt; is a new open source collaboration that is making a software stack that can be run both in house and in the cloud.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Good:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Open source&lt;/li&gt;&lt;li&gt;Generating a lot of buzz&lt;/li&gt;&lt;li&gt;Main participants &lt;a href="http://www.nasa.gov/"&gt;NASA&lt;/a&gt; and &lt;a href="http://www.rackspace.com/"&gt;Rackspace&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Backed by 70 companies&lt;/li&gt;&lt;li&gt;You can run your application either in house or in the cloud&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Bad:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Not yet mature enough for production use&lt;/li&gt;&lt;li&gt;Windows support is immature&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Java, .NET Or Mixed Platform&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;br /&gt;For data mining selecting the right platform is a hard choice. 
Both Java and .NET are very attractive options.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Java only&lt;/b&gt;&lt;br /&gt;For data mining and NLP there are a lot of great open source projects written in Java. E.g. &lt;a href="http://en.wikipedia.org/wiki/Mahout"&gt;Mahout&lt;/a&gt; is a system for collaborative filtering and clustering of Big Data, with distributed machine learning. It is integrated with Hadoop.&lt;br /&gt;There are many more OSS projects: &lt;a href="http://incubator.apache.org/opennlp/"&gt;OpenNLP&lt;/a&gt;, &lt;a href="http://lucene.apache.org/solr/"&gt;Solr&lt;/a&gt;, &lt;a href="http://incubator.apache.org/connectors/"&gt;ManifoldCF&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;.NET only&lt;/b&gt;&lt;br /&gt;The development tools in .NET are great. It works well with Microsoft Office.&lt;br /&gt;Visual Studio 2010 comes with F#, which is a great language for writing worker roles. It is very well suited for light weight threads or async, for highly parallel reactive programs.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Mix Java and .NET&lt;/b&gt;&lt;br /&gt;You can mix Java and .NET. Cloud computing makes it easier than ever to integrate different platforms. You already have abstract, language agnostic services for communication: message queues, blob storage and structured data. If you have an ASP.NET front end on top of collaborative filtering of Big Data this would be a very attractive option.&lt;br /&gt;&lt;br /&gt;I still think that combining 2 big platforms like Java and .NET introduces complexity, compared to staying within one platform. 
You need an organization with good resources and coordination to do this.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Choice Of Cloud Provider&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;I still have a lot of unanswered questions at this point.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At the time of writing, June 2011, OpenStack is not ready for production use. So that is out for now.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;I have run some tests on AWS. It was very easy to deploy my Python code to EC2 under Linux. Programming C# that used AWS services was simple.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;I am stuck waiting to get a Windows 7 machine so I can test Windows Azure.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;Both EC2 and Azure seem like viable options for what I need. 
I will get back to this in part 2 of the blog post.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Highlights from Cloud Camp 2011&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;A lot of people are trying to sell you cloud computing solutions. I have heard plenty of cloud computing hype. I have been seeking advice from people who were not trying to sell me anything and had some real experience, and trying to find some of the failures and problems in cloud computing.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;I went to &lt;a href="http://cloudcamp-ny-2011.eventbrite.com/"&gt;Cloud Camp June 2011&lt;/a&gt; during Cloud Expo 2011 in New York. 
Cloud computing users shared their experience. It was an unconference, meaning spontaneous user discussion breakout groups were formed. The rest of this post is highlights from these discussions.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Hadoop Is Great But Hard&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is a Java open source implementation of Google's MapReduce. You can set up a workflow of operations and Hadoop will distribute them over multiple computers, aggregate the result and rerun operations that fail. This sounds fantastic, but Hadoop is a pretty complex system, with a lot of new terminology and a steep learning curve.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Security Is Your Responsibility&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Security is a big issue. You might assume that the cloud will take care of security, but you should not. E.g. you should clean up the hard disks that you have used, so the next user cannot see your data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Cloud Does Not Automatically Scale To Big Data&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The assumption is that you put massive amounts of data in the cloud, and the cloud takes care of the scaling problems.&lt;br /&gt;If you have a lot of data that needs little processing, 
then cloud computing becomes expensive: you store all data in 3 different locations and it is expensive and slow to move it to different compute nodes. This was mentioned as the reason why NASA could not use S3, but built its own &lt;a href="http://nebula.nasa.gov/"&gt;Nebula&lt;/a&gt; platform.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;You Accumulate Cost During Development&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;An entrepreneur building a startup ended up paying $2000 / month for EC2. He used a lot of different servers and they had to be running with multiple instances, even though he was not using a lot of resources. This might be cheap compared to going out and buying your own servers, but it was more expensive than he expected.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Applications Written In .NET Run Fine Under EC2 Windows&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;An entrepreneur said that he was running his company's .NET code under EC2. He thought that Amazon was more mature than Azure, and Azure was catching up. 
He preferred to make his own framework.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Simpler To Run .NET Application On Azure Than On EC2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;A cloud computing consultant with lots of experience in both Azure and EC2 said: EC2 gives you a raw machine, so you have to do more to get your application running than if you plop it into Windows Azure.&lt;br /&gt;It is very easy to port an ASP.NET application to Windows Azure.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Cash Flow, Operational Expenses And Capital Expenses&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;br /&gt;An often cited reason why cloud computing is great is that a company can replace big upfront capital expenses with smaller operational expenses. 
A few people mentioned that companies live by their cash flow and they do not like to have unpredictable operational expenses, but are more comfortable with predictable capital expenses.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-7168257420998988288?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/7168257420998988288/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=7168257420998988288' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/7168257420998988288'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/7168257420998988288'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2011/06/cloud-computing-for-data-mining-part-1.html' title='Cloud Computing For Data Mining Part 1'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>3</thr:total><georss:featurename>New York, NY, USA</georss:featurename><georss:point>40.7143528 -74.0059731</georss:point><georss:box>40.4942638 -74.2853821 40.9344418 -73.7265641</georss:box></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-2226367747908646123</id><published>2011-04-01T13:29:00.002-04:00</published><updated>2011-07-29T10:31:49.793-04:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='LDA'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><category scheme='http://www.blogger.com/atom/ns#' term='Latent Dirichlet Allocation'/><title type='text'>Practical Probabilistic Topic Models for NLP</title><content type='html'>Latent Dirichlet Allocation, &lt;a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation"&gt;LDA&lt;/a&gt;, is a new and very powerful technique for finding the topics in a collection of texts, using unsupervised learning. LDA is a probabilistic topic model. LDA was developed in 2003 and relies on advanced math. This post is a practical guide about how to get started building LDA models and software.&lt;br /&gt;&lt;br /&gt;LDA will have a substantial impact on corpus based natural language processing, since it opens the way to easy creation of semantic models based on machine learning.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Motivation for topic models&lt;/h3&gt;&lt;br /&gt;With the Internet we have large amounts of text available. Having the text categorized into topics makes text search much more precise and makes it possible to find similar documents.&lt;br /&gt;&lt;br /&gt;Text categorization is not an easy problem:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Texts usually deal with more than one topic&lt;/li&gt;&lt;li&gt;There is no clear standard for categorization&lt;/li&gt;&lt;li&gt;Doing it by hand is infeasible&lt;/li&gt;&lt;/ul&gt;Nuanced categorization is a hard problem, with many moving parts, but in 2003 David M. Blei, Andrew Y. Ng and Michael I. Jordan published an article on a new approach called &lt;a href="http://www.cs.princeton.edu/~blei/papers/BleiJordan2003.pdf"&gt;Latent Dirichlet Allocation&lt;/a&gt;. 
LDA&amp;nbsp;can be implemented based on research articles, but if you are not a machine learning academic the math is intimidating and the material is still new.&lt;br /&gt;&lt;br /&gt;There is actually good material available, but finding all the pieces takes some work.&amp;nbsp;Most things you need are available&amp;nbsp;&lt;a href="http://www.cs.princeton.edu/~blei/topicmodeling.html"&gt;online&amp;nbsp;for free&lt;/a&gt;.&amp;nbsp;Here is a&amp;nbsp;chronological&amp;nbsp;account of what I did to understand LDA and start implementing it.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Need for more&amp;nbsp;sophisticated&amp;nbsp;hierarchical&amp;nbsp;topic models&lt;/h3&gt;&lt;br /&gt;In 2009 I needed&amp;nbsp;a fine-grained classification of text, using unsupervised or semi-supervised training.&amp;nbsp;I spent a little time thinking about it, and had some ideas about bootstrapped training in a two-layered&amp;nbsp;hierarchy. It was hackish and complex, and I was not sure how numerically stable it was. I never got around to implementing it.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;h3&gt;David Blei&lt;/h3&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;I went to the 4th&amp;nbsp;Annual Machine Learning Symposium in 2009 and asked around for solutions to my problem. Several attendees&amp;nbsp;told me to look at David Blei's work. 
I did, but he has written a lot of math-heavy articles, so I did not know where to start.&lt;br /&gt;&lt;br /&gt;I was lucky to first see David Blei give a presentation on LDA at the 5th Annual Machine Learning Symposium.&amp;nbsp;David Blei works at&amp;nbsp;Princeton&amp;nbsp;and just&amp;nbsp;exudes&amp;nbsp;brilliance.&amp;nbsp;He gave a lucid, entertaining description of LDA with examples.&amp;nbsp;It was really shocking to see the LDA algorithm find scientific topics on its own with no human&amp;nbsp;intervention.&lt;br /&gt;&lt;br /&gt;I saw him give the same talk at the &lt;a href="http://www.meetup.com/NYC-Machine-Learning/"&gt;NYC Machine Learning Meetup&lt;/a&gt;, and luckily that was videotaped; here are&amp;nbsp;&lt;a href="http://blip.tv/file/4446448"&gt;part 1&lt;/a&gt; and &lt;a href="http://blip.tv/file/4446525"&gt;part 2&lt;/a&gt;.&amp;nbsp;I watched these videos a few times. This gave me a good intuition for the algorithm.&lt;br /&gt;&lt;br /&gt;I looked through his articles and found a good beginner article,&amp;nbsp;&lt;a href="http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf"&gt;BleiLafferty2009&lt;/a&gt;. I read&amp;nbsp;through&amp;nbsp;that several times, but I could not understand it.&lt;br /&gt;&lt;br /&gt;I went out and bought the textbook that David Blei recommended: Pattern Recognition and Machine Learning by Christopher M. Bishop. After reading the introductory chapter, I read&amp;nbsp;BleiLafferty2009 again and was able to understand it. 
On page 10 the&amp;nbsp;essence&amp;nbsp;of the algorithm is described in a small text box.&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;h3&gt;Software implementation&amp;nbsp;of LDA&lt;/h3&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;There are plenty of open source implementations of LDA. Here are a few&amp;nbsp;observations:&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://www.cs.princeton.edu/~blei/lda-c/index.html"&gt;lda-c in C&lt;/a&gt;&amp;nbsp;by David Blei is an implementation in old school C. The code is readable, concise and clean.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cran.r-project.org/web/packages/lda/"&gt;lda package for R&lt;/a&gt;&amp;nbsp;by Jonathan Chang. It implements many models&amp;nbsp;and has extensive documentation.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar"&gt;Online LDA in Python&lt;/a&gt; by Matt Hoffman. Short code, but not too much documentation.&lt;br /&gt;&lt;br /&gt;&lt;a href="https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html"&gt;LDA Apache Mahout in Java&lt;/a&gt;. 
It has an active development community and works with Hadoop / MapReduce.&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;a href="http://jgibblda.sourceforge.net/"&gt;JGibbLDA in Java&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://gibbslda.sourceforge.net/"&gt;GibbsLDA++ in C++&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;a href="http://www.arbylon.net/projects/"&gt;Arbylon Projects in Java&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;No matter what language you prefer, there should be a good implementation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;h3&gt;Practical software considerations&lt;/h3&gt;&lt;/div&gt;&lt;br /&gt;All the implementations looked good. But if you want to use LDA software, then robustness, scalability and extensibility are big issues.&amp;nbsp;First you just want the algorithm to run on simple text input. 
The next day you want the following options:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Better word&amp;nbsp;tokenizer&lt;/li&gt;&lt;li&gt;Bigrams and collocations&lt;/li&gt;&lt;li&gt;Word stemmer&lt;/li&gt;&lt;li&gt;LDA on structured text&lt;/li&gt;&lt;li&gt;Reading from a database&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;h3&gt;Programming language choice for LDA&lt;/h3&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Here is a little common sense advice on the choice of programming language for LDA programming.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;b&gt;C&lt;/b&gt;&lt;br /&gt;C is an elegant, simple systems programming language.&lt;br /&gt;C is not my first choice of language for text processing.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;C++&lt;/b&gt;&lt;br /&gt;C++&amp;nbsp;is a very powerful but also complex language.&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;NLP lib:&amp;nbsp;The Lemur Project&lt;/div&gt;I would be happy&amp;nbsp;to use&amp;nbsp;C++ for text processing.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;C#&lt;/b&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;C# is a great language.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;NLP 
lib:&amp;nbsp;&amp;nbsp;SharpNLP.&amp;nbsp;&lt;/div&gt;&lt;div&gt;You will have to implement LDA yourself or port one of the other implementations. SciPy is being ported to C#, but it does not have the best numeric libraries.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Clojure&lt;/b&gt;&lt;/div&gt;Clojure is a&amp;nbsp;moderate-sized&amp;nbsp;LISP dialect built on the JVM.&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;NLP lib: OpenNLP through clojure-opennlp.&lt;/div&gt;&lt;div&gt;LISP is a classic AI language and you can use one of the Java LDA implementations.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;Java&lt;/b&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Java is a modern object-oriented programming language with access to every conceivable library.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;NLP lib: OpenNLP.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;Python&lt;/b&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Python is an elegant language very well suited for NLP.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;NLP lib: NLTK, using NumPy and SciPy&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; 
margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;b&gt;R&lt;/b&gt;&lt;br /&gt;R is a&amp;nbsp;fantastic&amp;nbsp;language for statistics, but not so great for low level text processing.&lt;br /&gt;NLP lib:&lt;br /&gt;The R implementation of LDA looks great;&amp;nbsp;I think that it is common to do all the preprocessing in another language, say Perl, and then do all the rest of the work in R.&lt;br /&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;h3&gt;Different versions of LDA&lt;/h3&gt;&lt;/div&gt;&lt;br /&gt;There are now a lot of different LDA models&amp;nbsp;geared&amp;nbsp;towards different domains. Let me just mention a couple:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Online LDA&lt;/b&gt;&lt;br /&gt;Online means that you train the model in small batches instead of on all the documents at once. This is useful for a continuously running system.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Dynamic LDA&lt;/b&gt;&lt;br /&gt;Good for handling text that stretches over a long time interval, say 100 years.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Hierarchical&amp;nbsp;LDA&lt;/b&gt;&lt;br /&gt;This handles topics that are organized in&amp;nbsp;hierarchies.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Gray box approach to LDA&lt;/h3&gt;&lt;br /&gt;The math needed for LDA is advanced. 
If you do not succeed in understanding it, I still think that you can learn to use the code if you are willing to take some things on faith and get your hands dirty.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-2226367747908646123?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/2226367747908646123/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=2226367747908646123' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/2226367747908646123'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/2226367747908646123'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2011/04/practical-probabilistic-topic-models.html' title='Practical Probabilistic Topic Models for NLP'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-8235726724800065773</id><published>2011-04-01T13:29:00.000-04:00</published><updated>2011-04-01T13:29:26.051-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='kids'/><category scheme='http://www.blogger.com/atom/ns#' term='science education'/><category scheme='http://www.blogger.com/atom/ns#' term='girls'/><category scheme='http://www.blogger.com/atom/ns#' term='science girls'/><category 
scheme='http://www.blogger.com/atom/ns#' term='dad'/><category scheme='http://www.blogger.com/atom/ns#' term='mathematics'/><category scheme='http://www.blogger.com/atom/ns#' term='toddlers'/><category scheme='http://www.blogger.com/atom/ns#' term='children'/><category scheme='http://www.blogger.com/atom/ns#' term='math'/><category scheme='http://www.blogger.com/atom/ns#' term='Tiger Mother'/><title type='text'>Bedtime Science Stories My Science Education Blog</title><content type='html'>I started a science education blog called&amp;nbsp;&lt;a href="http://www.bedtime-science-stories.org/"&gt;Bedtime Science Stories&lt;/a&gt;.&amp;nbsp;Here is a little&amp;nbsp;excerpt&amp;nbsp;from my first post: &lt;a href="http://www.bedtime-science-stories.org/2011/03/can-and-should-3-year-old-girl-be-into.html"&gt;Can and should a 3 year old girl be into science?&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;I have a 3 year old daughter who has taken a bit of an interest in science. We have been talking about science when I put her to bed at night.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;Last Sunday I discovered a new book called Battle Hymn of the Tiger Mother by Amy Chua, who is a law professor at Yale. She is using extreme methods to push her 2 daughters to academic excellence. They had to be the best in their class in everything except drama and physical education. Math was a topic that she really drilled them in. Just reading the back cover sent me into a rage; so much so that I decided to start a new blog, &lt;a href="http://www.bedtime-science-stories.org/"&gt;Bedtime Science Stories&lt;/a&gt;, just to get my anger out.&lt;br /&gt;&lt;br /&gt;Science should not be an elite activity. 
Making it very competitive will make a new generation of kids hate math and science.&amp;nbsp;Understanding our world is a worthwhile activity even if you are not the best in your class.&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-RZtP-TTGmYQ/TZSx4JWe1BI/AAAAAAAAAD4/iZ0Uz7Edb5E/s1600/kid_2010-12.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="400" src="http://2.bp.blogspot.com/-RZtP-TTGmYQ/TZSx4JWe1BI/AAAAAAAAAD4/iZ0Uz7Edb5E/s400/kid_2010-12.JPG" width="167" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;My 3 year old daughter&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-8235726724800065773?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/8235726724800065773/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=8235726724800065773' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/8235726724800065773'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/8235726724800065773'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2011/04/bedtime-science-stories-my-science.html' title='Bedtime Science Stories My Science Education Blog'/><author><name>Sami 
Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-RZtP-TTGmYQ/TZSx4JWe1BI/AAAAAAAAAD4/iZ0Uz7Edb5E/s72-c/kid_2010-12.JPG' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-7511619318875662810</id><published>2011-02-15T01:56:00.043-05:00</published><updated>2011-02-18T09:31:04.657-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='Google tech talk'/><category scheme='http://www.blogger.com/atom/ns#' term='AI'/><category scheme='http://www.blogger.com/atom/ns#' term='linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Artificial Intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='natural language processing'/><category scheme='http://www.blogger.com/atom/ns#' term='startups'/><category scheme='http://www.blogger.com/atom/ns#' term='IBM Watson'/><category scheme='http://www.blogger.com/atom/ns#' term='Google Squared'/><category scheme='http://www.blogger.com/atom/ns#' term='NLP'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Is IBM Watson Beginning An AI Boom?</title><content type='html'>Artificial intelligence fell out of favor in the 1970s, the start of the first artificial intelligence winter, and has mainly been out of favor since. 
In April 2010 I wrote a &lt;a href="http://samibadawi.blogspot.com/2010/04/data-mining-rediscovers-artificial.html"&gt;post&lt;/a&gt; about how you can now get a paying job doing AI, machine learning and natural language processing outside academia.&lt;br /&gt;&lt;br /&gt;Now, barely one year later, I have seen a few demonstrations that signal that artificial intelligence has taken another leap towards mainstream acceptance:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Yann LeCun demonstrated a computer vision system that could learn to recognize objects from his pocket after being shown a few examples, during a talk about &lt;a href="http://www.meetup.com/NYC-Machine-Learning/events/16489490/"&gt;learning feature hierarchies&lt;/a&gt; for computer vision&amp;nbsp;&lt;/li&gt;&lt;li&gt;Andrew Hogue demonstrated Google Squared and Google Sentiment Analysis at a &lt;a href="http://www.youtube.com/watch?v=5lCSDOuqv1A&amp;amp;p=AD8A7B6D66DDD297"&gt;Google Tech Talk&lt;/a&gt;; both systems show rudimentary understanding of web pages and use word association&lt;/li&gt;&lt;li&gt;The IBM Watson supercomputer is competing against the best human players on Jeopardy&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;/ul&gt;All three of these systems contain some real intelligence. It is rudimentary by human standards, but AI has gone from very specialized systems to handling more general tasks. It feels like AI is picking up steam. I am seeing startups based on machine learning pop up. This reminds me of the Internet boom in the 1990s. I moved to New York in 1996, at the beginning of the Internet boom. I saw firsthand the crazy gold rush where fortunes were made and lost in a short time, Internet startups were everywhere and everybody was talking about IPOs. 
This got me thinking: are we headed towards an artificial intelligence boom, and what would it look like?&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;IBM Watson&lt;/h3&gt;IBM Watson is a well executed factoid extraction system, but it is also a brilliant marketing move, promoting IBM's new POWER7 system and their Smarter Planet consulting services. It gives some people the impression that we already have human-like AI, and in that sense it could serve as a catalyst for investments in AI.&amp;nbsp;This post is not about human-like artificial intelligence, but about the spread of shallow artificial intelligence.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Applications For Shallow Artificial Intelligence&lt;/h3&gt;Both people and corporations would gain value from having AI systems that they could ask free-form questions and get answers from on very diverse topics, particularly in these fields:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Medical science&lt;/li&gt;&lt;li&gt;Law&lt;/li&gt;&lt;li&gt;Surveillance&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Military&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Many people, me included, are concerned about a big brother state and military use of AI, but I do not think that is going to stop adoption. These people play for keeps.&lt;br /&gt;&lt;br /&gt;There are signs that the financial services industry is starting to use sentiment analysis for their pricing and risk models. Shallow AI would be a good candidate for more advanced algorithmic trading.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Bottom Up vs. Top Down Approaches&lt;/h3&gt;Here is a very brief, simplified introduction to AI techniques and tools. AI is a loosely defined field, with a loose collection of techniques. 
You can roughly categorize them into top down and bottom up approaches.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Top down or symbolic techniques&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Automated reasoning&lt;/li&gt;&lt;li&gt;Logic&lt;/li&gt;&lt;li&gt;Many forms of tree search&lt;/li&gt;&lt;li&gt;Semantic networks&lt;/li&gt;&lt;li&gt;Planning&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Bottom up or machine learning techniques&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Neural networks, computing systems with a structure similar to the brain&lt;/li&gt;&lt;li&gt;Machine learning &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The top down systems are programmed by hand, while the bottom up systems learn by themselves from examples without human intervention, a bit like the brain.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;What Is Causing This Sudden Leap?&lt;/h3&gt;Many top down techniques were developed by the 1960s. They were very good ideas, but they did not scale; they only worked for small toy problems. &lt;br /&gt;Neural networks are an important bottom up technique. They started in the 1950s, but fell out of favor; they came roaring back in the 1980s. In the 1990s the machine learning / statistical approaches to natural language processing beat out Chomsky's generative grammar approach.&lt;br /&gt;&lt;br /&gt;The technology that is needed for what we are doing now has been around for a long time. Why are these systems popping up now?&lt;br /&gt;&lt;br /&gt;I think that we are seeing the beginning of a combination of machine learning with top down techniques. The reason why this has taken so long is that it is hard to combine top down and bottom up techniques. Let me elaborate a little bit:&lt;br /&gt;&lt;br /&gt;Bottom up AI / machine learning systems are black boxes: you give them some input and the expected output, and they adjust a lot of numeric parameters so that they can mimic the result. 
Usually the numbers will not make much sense; they just work.&lt;br /&gt;&lt;br /&gt;In top down / symbolic AI you are creating detailed algorithms for working with concepts that make sense.&lt;br /&gt;&lt;br /&gt;Both top down and bottom up techniques are now well developed and better understood. This makes it easier to combine them.&lt;br /&gt;&lt;br /&gt;Other reasons for the leap are: &lt;br /&gt;&lt;ul&gt;&lt;li&gt;Cheap, powerful and highly parallel computers&lt;/li&gt;&lt;li&gt;Open source software, where programmers from around the world develop free software. This makes programming into more of an industrial assembly of parts.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;h3&gt;Who Will Benefit From An AI Boom?&lt;/h3&gt;Here are some groups of companies that made a lot of money during the Internet boom:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Cisco and Oracle, the tool makers&lt;/li&gt;&lt;li&gt;Amazon and eBay, small companies that grew to become dominant in e-commerce&lt;/li&gt;&lt;li&gt;Google and Yahoo, advertisement driven information companies&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Initially big companies like IBM and Google that can create the technology should have an advantage, whether it will be in the capacity of tool makers or dominant players.&lt;br /&gt;&lt;br /&gt;It is hard to predict how high the barrier to entry in AI will be. AI programs are just trained on regular text found on or off the Internet. And today's supercomputer is tomorrow's game console. The Internet has a few dominant players, but it is generally decentralized and anybody can have a web presence.&lt;br /&gt;&lt;br /&gt;New York is now filled with startups using machine learning as a central element. They are "funded", but it seems like they got some seed capital. 
So maybe there is room for smaller&amp;nbsp;companies to&amp;nbsp;compete&amp;nbsp;in the AI space.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Job Skills That Will Be Required In An AI Boom&lt;/h3&gt;During the Internet boom I met people with a bit of technical flair and no education beyond high school who picked up HTML in a week, and the next thing you knew they were making $60/hour doing plain HTML. I think that the jobs in artificial intelligence are going to be a little more complex than those 1990s web developer jobs.&lt;br /&gt;&lt;br /&gt;In my own work I have noticed a move from writing programs to teaching software based on examples. This is a dramatic change, and it requires a different skill set.&lt;br /&gt;&lt;br /&gt;I think that there will still be plenty of need for programmers, but cognitive science, mathematics, statistics and linguistics will be skills in demand.&lt;br /&gt;&lt;br /&gt;My work would benefit from me having better English language skills. The topic that I am dealing with is, after all, the English language. So maybe that English literature degree could come in handy.&lt;br /&gt;&lt;br /&gt;Currently I feel optimistic about the field of artificial intelligence; there is progress after years of stagnation. We are wrestling a few secrets away from Mother Nature, and are making progress in understanding how the brain works. 
Still, the introduction of such a powerful technology as artificial intelligence is going to affect society for better and worse.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-7511619318875662810?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/7511619318875662810/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=7511619318875662810' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/7511619318875662810'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/7511619318875662810'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2011/02/is-ibm-watson-beginning-of-artificial.html' title='Is IBM Watson Beginning An AI Boom?'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-2578426564817637783</id><published>2010-12-16T13:09:00.037-05:00</published><updated>2011-03-02T13:46:00.200-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='NLTK'/><category scheme='http://www.blogger.com/atom/ns#' term='PyDev'/><category scheme='http://www.blogger.com/atom/ns#' term='SciPy'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><category scheme='http://www.blogger.com/atom/ns#' term='NumPy'/><title type='text'>NLTK under Python 2.7 and 
SciPy 0.9.0</title><content type='html'>Python 2.7 has been out for months, but I have been stuck using Python 2.6 since SciPy was not working for Python 2.7. The &lt;a href="http://sourceforge.net/projects/scipy/files/scipy/0.9.0b1/"&gt;SciPy 0.9 Beta 1&lt;/a&gt; binary distribution has just been released.&lt;br /&gt;Normally I try to steer clear of beta quality software, but&amp;nbsp;I really like some of the new features in Python 2.7, especially the argparse module, so despite my better judgement I installed Python 2.7.1 and SciPy 0.9.0 Beta 1, to run with a big NLTK based library. This blog post describes the configuration that I use and my first impression of the stability.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://sourceforge.net/projects/scipy/files/scipy/0.9.0rc1/"&gt;SciPy 0.9 RC1&lt;/a&gt; was released in&amp;nbsp;January&amp;nbsp;2011.&lt;br /&gt;&lt;a href="http://sourceforge.net/projects/scipy/files/scipy/0.9.0/"&gt;SciPy 0.9&lt;/a&gt;&amp;nbsp;was released in&amp;nbsp;February&amp;nbsp;2011.&lt;br /&gt;I tried both of them and found almost the same result as for SciPy 0.9 Beta 1, which this review was originally written for.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Direct downloads&lt;/b&gt;&lt;br /&gt;Here is a list of the programs I installed directly:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.python.org/download/releases/2.7.1/"&gt;python-2.7.1.msi&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://sourceforge.net/projects/pywin32/files/pywin32/Build%20214/"&gt;pywin32-214.win32-py2.7.exe&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://sourceforge.net/projects/numpy/files/NumPy/"&gt;numpy-1.5.1-win32-superpack-python2.7.exe&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://sourceforge.net/projects/scipy/files/scipy/0.9.0b1/"&gt;scipy-0.9.0b1-win32-superpack-python2.7.exe&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://pypi.python.org/pypi/setuptools"&gt;setuptools-0.6c11.win32-py2.7.exe&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;Installation of 
NLTK&lt;/b&gt;&lt;br /&gt;&lt;h3&gt;&lt;span class="Apple-style-span" style="font-size: small; font-weight: normal;"&gt;The install was very simple; just type:&lt;/span&gt;&lt;/h3&gt;&lt;h3&gt;&lt;span class="Apple-style-span" style="font-size: small; font-weight: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;\Python27\lib\site-packages\easy_install.py nltk&lt;/span&gt;&lt;/span&gt;&lt;/h3&gt;&lt;br /&gt;&lt;b&gt;Other libraries installed with easy_install.py&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;CherryPy&lt;/li&gt;&lt;li&gt;ipython&lt;/li&gt;&lt;li&gt;PIL&lt;/li&gt;&lt;li&gt;pymongo&lt;/li&gt;&lt;li&gt;pyodbc&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;YAML Library&lt;/b&gt;&lt;br /&gt;On a Windows Vista computer with no MS C++ compiler, where I tested this NLTK install, I also had to do a manual install of YAML from:&lt;br /&gt;&lt;a href="http://pyyaml.org/wiki/PyYAML"&gt;http://pyyaml.org/wiki/PyYAML&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Libraries from unofficial binary distributions&lt;/b&gt;&lt;br /&gt;There are a few packages that have build problems, but can be loaded from Christoph Gohlke's site with&amp;nbsp;Unofficial Windows Binaries for Python Extension Packages:&amp;nbsp;&lt;a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/"&gt;http://www.lfd.uci.edu/~gohlke/pythonlibs/&lt;/a&gt;&amp;nbsp;I downloaded and installed:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;matplotlib-1.0.0.win32-py2.7.exe&lt;/li&gt;&lt;li&gt;opencv-python-2.2.0.win32-py2.7.exe&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;h3&gt;Stability&lt;/h3&gt;The installation was simple. Everything installed cleanly. I ran some bigger scripts and they ran fine. Development and debugging also worked fine. 
Out of 134 NLTK related unit tests only one failed under Python 2.7.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Problems with SciPy algorithms&lt;/h3&gt;The failing unit test was maximum entropy training using the &lt;a href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.maxent.MaxentClassifier-class.html"&gt;LBFGSB&lt;/a&gt; optimization algorithm. These were my settings:&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;nltk.MaxentClassifier.train(train, algorithm='LBFGSB', gaussian_prior_sigma=1, trace=2)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At first the&amp;nbsp;maximum entropy training&amp;nbsp;would not run because it was calling the method rmatvec() in scipy/sparse/base.py.&amp;nbsp;This method has been deprecated for a while and has been taken out of SciPy 0.9.&amp;nbsp;I found this method in SciPy 0.8 and added it back. My unit test ran, but instead of finishing in a couple of seconds it took around 10 minutes, eating up 1.5GB before it crashed. After this I gave up on LBFGSB.&lt;br /&gt;&lt;br /&gt;&lt;div style="margin: 0px;"&gt;If you do not want to use LBFGSB, &lt;a href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.megam-module.html"&gt;megam&lt;/a&gt; is another efficient optimization algorithm. However, it is implemented in OCaml and I did not want to install OCaml on a Windows computer.&lt;br /&gt;&lt;br /&gt;This problem&amp;nbsp;occurred&amp;nbsp;for both SciPy 0.9 Beta 1 and RC1.&lt;/div&gt;&lt;br /&gt;&lt;h3&gt;Python 2.6 and 2.7 interpreters active in PyDev&lt;/h3&gt;Another problem was that having both Python 2.6 and 2.7 interpreters active in PyDev made it less stable. When I started scripts from PyDev, sometimes they timed out before starting. PyLint would also show errors in code that was correct. 
I deleted the Python 2.6 interpreter under PyDev Preferences, and PyDev worked fine with just Python 2.7.&lt;br /&gt;&lt;br /&gt;I also added a version check to the one failing unit test, since it caused problems on my machine.&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;if (2, 7) &amp;lt; sys.version_info: return&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Multiple versions of Python on Windows&lt;/h3&gt;If you install Python 2.7 and realize that some code only runs under Python 2.6, or that you have to roll back, here are a few simple suggestions:&lt;br /&gt;&lt;br /&gt;I did a Google search for:&lt;br /&gt;python multiple versions windows&lt;br /&gt;This will show many ways to deal with this problem. One way is calling a little Python script that changes the Windows registry settings.&lt;br /&gt;&lt;br /&gt;Multiple versions of Python have not been a big problem for me, so I favor a very simple approach. The main issue is file extension binding. 
That is, which program gets called when you double-click a .py file or type script.py on the command line.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Changing file extension binding for rollback to Python 2.6&lt;/h3&gt;Under Windows XP you can change file extensions in Windows Explorer:&lt;br /&gt;Under Tools &amp;gt; Folder Options &amp;gt; File Types&lt;br /&gt;Select the PY Extension and&amp;nbsp;press&amp;nbsp;Advanced then press Change&lt;br /&gt;Select open and&amp;nbsp;press&amp;nbsp;Edit&lt;br /&gt;The value is:&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;"C:\Python27\python.exe" "%1" %*&lt;/span&gt;&lt;br /&gt;You can change this to use a different interpreter:&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;"C:\Python26\python.exe" "%1" %*&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Or, even simpler, when I want to run the older Python interpreter I just type:&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;\Python26\python.exe script.py&lt;/span&gt;&lt;br /&gt;instead of typing&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;script.py&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Are Python 2.7 and SciPy 0.9.0 Beta 1 stable enough for NLTK use?&lt;/h3&gt;The installation of all the needed software was fast and unproblematic. 
I would certainly not use it in a production environment.&amp;nbsp;If you are running a lot of numerical algorithms you should probably hold off.&amp;nbsp;If you are impatient and you do not need to do new training, it is worth trying; you can always roll back.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-2578426564817637783?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/2578426564817637783/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=2578426564817637783' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/2578426564817637783'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/2578426564817637783'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2010/12/nltk-under-python-27-and-scipy-090-beta.html' title='NLTK under Python 2.7 and SciPy 0.9.0'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-1072506514946234146</id><published>2010-11-12T21:50:00.005-05:00</published><updated>2010-11-12T22:50:08.498-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MongoDB'/><category scheme='http://www.blogger.com/atom/ns#' term='PyDev'/><category scheme='http://www.blogger.com/atom/ns#' term='large scale'/><category 
scheme='http://www.blogger.com/atom/ns#' term='project management'/><category scheme='http://www.blogger.com/atom/ns#' term='CherryPy'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><title type='text'>Growing Python projects from small to large scale</title><content type='html'>You need significantly different principles for developing small, medium and large scale software systems.&lt;br /&gt;&lt;br /&gt;When my project started to become big I searched the Internet for some guidelines or best practices for how to scale Python, but did not find much. Here are a few of my observations on what techniques to use for what project sizes.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;General principle&lt;/h3&gt;&lt;br /&gt;For a small system you can spend most of your time solving the problem, but the bigger the system gets the more time you spend on project plans, coordination and documentation. The complexity and cost do not scale linearly with the size of the project but may scale with the square of the size. This holds for different styles of project management, both waterfall and agile.&lt;br /&gt;&lt;br /&gt;A central problem is minimizing dependencies and avoiding tight coupling. John Lakos has written an excellent book on software scaling called &lt;a href="http://www.amazon.com/Large-Scale-Software-Design-John-Lakos/dp/0201633620"&gt; Large-Scale C++  Software Design&lt;/a&gt;; here is a &lt;a href="http://blog.codeimproved.net/2009/03/the-large-scale-c-software-design-rules-in-practice/"&gt;summary&lt;/a&gt;. It is a very scientific and stringent approach, which is specific to C++. He developed a metric for how many dependencies you have in your system. His techniques are not a good fit for smaller projects; you could finish several scripts before you could even implement his methodology.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Small scripts&lt;/b&gt;&lt;br /&gt;Keep it simple. Focus on the core functionality. 
Minimize the time you spend on setting up the project.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Medium applications&lt;/b&gt;&lt;br /&gt;Spending some time organizing things will save you time in the long run.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Large applications&lt;/b&gt;&lt;br /&gt;Here you need a lot of structure; otherwise the project will not be stable.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Development environment&lt;/h3&gt;&lt;br /&gt;&lt;b&gt;Small scripts&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I use the PyWin Windows IDE. &lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is lightweight&lt;/li&gt;&lt;li&gt;No need for Java or Eclipse&lt;/li&gt;&lt;li&gt;Syntax highlighting&lt;/li&gt;&lt;li&gt;Code completion at run time and some at write time&lt;/li&gt;&lt;li&gt;Allows primitive debugging&lt;/li&gt;&lt;li&gt;You do not need to set up a project to use it.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;Medium applications&lt;/b&gt;&lt;br /&gt;I have used both PyWin and PyDev.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Large applications&lt;/b&gt;&lt;br /&gt;I would strongly recommend the &lt;a href="http://pydev.org/"&gt;PyDev Eclipse plugin&lt;/a&gt;. It is a modern IDE that runs pylint continuously and has good code completion while writing code. It will find maybe half the errors a compiler would find. This improves the stability a lot and was the most important change that I made from my old coding style.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Organization of code&lt;/h3&gt;&lt;br /&gt;&lt;b&gt;Small scripts&lt;/b&gt;&lt;br /&gt;Use one module / file with all the code in it. This can have several classes. The advantage is that deployment becomes trivial: you just email the script to the user. This works for modules up to around 3000 lines of code.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Medium applications&lt;/b&gt;&lt;br /&gt;Use one directory with all modules in it. This gives you fewer issues with PYTHONPATH.&lt;br /&gt;&lt;br /&gt;Make a convention for naming field names, database names and parameter names. 
Put all these names in a module that only contains string constants, and use these in your code instead of raw strings.&lt;br /&gt;&lt;br /&gt;Use a separate repository for the project. I package the Python code and other self-written executables together in a repository, even when I have another source control system for the compiled sources.&lt;br /&gt;&lt;br /&gt;This works up to around 40 Python modules; then it becomes hard to find anything. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Large applications&lt;/b&gt;&lt;br /&gt;Read and follow the &lt;a href="http://www.python.org/doc/essays/styleguide.html"&gt;Python style guide&lt;/a&gt;. Previously I followed a Java style guide since Java is big on coding conventions, but the Python style is actually pretty different. A noticeable difference is that a Java file contains a main class with a title case name and the file has the same name. In Python, modules should have short lowercase names while the classes should still have title case names.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Organizing packages as an acyclic graph&lt;/b&gt;&lt;br /&gt;Refactor the modules into packages. The packages should be organized as an acyclic graph. So at the lowest level you would have a util package that is not allowed to reference anything else. You can have other specialized packages that can access the util package. Above that I have the main source directory with code that is central and general. Above that I have a loader package that can access all the other packages.&lt;br /&gt;&lt;br /&gt;One problem when you have different directories is that you need the PYTHONPATH to include all the code. 
A good way to do this is to add the parent directory to the system path before you import any of the modules.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Documentation&lt;/h3&gt;&lt;br /&gt;&lt;b&gt;Small scripts&lt;/b&gt;&lt;br /&gt;Usually I have:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Python docstring in the program.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Print a usage message&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;Medium applications&lt;/b&gt; &lt;br /&gt;Have a directory for documentation. To keep it simple I prefer to use simple HTML. I find that &lt;a href="http://www.seamonkey-project.org/"&gt;Mozilla SeaMonkey&lt;/a&gt; is simple to use and generates clean HTML you can do a diff on. Often I have:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;User documentation page&amp;nbsp;&lt;/li&gt;&lt;li&gt;Programmer documentation page&lt;/li&gt;&lt;li&gt;Release notes&lt;/li&gt;&lt;li&gt;Example data&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;Large applications&lt;/b&gt;&lt;br /&gt;At this point using automatically generated documentation and some sort of wiki format for writing documentation is a good idea.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Communication&lt;/h3&gt;&lt;br /&gt;Input and output account for a sizable part of your code. I prefer to use the most lightweight method I can get away with.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Small scripts and medium applications&lt;/b&gt;&lt;br /&gt;Communication is done with flat files, csv files and databases.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Large applications&lt;/b&gt;&lt;br /&gt;Communication is done with flat files, csv files, databases, MongoDB and CherryPy.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.mongodb.org/"&gt;MongoDB&lt;/a&gt; has dramatically simplified my work. Before, different types of structured data demanded their own database with several tables. Now I just load the data into a MongoDB collection. MongoDB makes very differently structured documents look very uniform and trivial to load from Python. 
After that I can use the same script on very different data.&lt;br /&gt;&lt;br /&gt;When you have a script and find out that you need to have other programs call it, it is very simple to create an XML, JSON or text based RESTful web service using &lt;a href="http://www.cherrypy.org/"&gt;CherryPy&lt;/a&gt;. You just add a one-line decorator to a method and it is now a web service. You barely have to make any changes to your program. CherryPy feels very Pythonic. This gives you a very cheap way to connect to a GUI and a web site written in other languages.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Unit tests&lt;/h3&gt;&lt;br /&gt;&lt;b&gt;Small scripts&lt;/b&gt;&lt;br /&gt;Unit tests give you a small advantage. I still write unit tests unless there is an emergency, and then I usually regret it.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Large applications&lt;/b&gt;&lt;br /&gt;The bigger the system the more important it is that the individual pieces work. Large systems are not maintainable if you do not have unit tests.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Source control system&lt;/h3&gt;&lt;br /&gt;I put any code that I use for production in a source control system. 
I usually use Subversion or Git.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://subversion.tigris.org/"&gt;Subversion&lt;/a&gt; is good for centralized development, and it is nice that each check-in has a sequential revision number, so that you can see revision number 123 followed by 124.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://git-scm.com/"&gt;Git&lt;/a&gt; is better for distributed development; it is easy to create a local repository for a project.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Small scripts&lt;/b&gt;&lt;br /&gt;One repository for each type of script.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Medium and large applications&lt;/b&gt;&lt;br /&gt;One repository for each project.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Use of standard libraries&lt;/h3&gt;&lt;br /&gt;&lt;b&gt;Small scripts&lt;/b&gt;&lt;br /&gt;Use the simplest approach that gets the work done. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Large applications&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;When my application grew I realized that I had recreated functionality from the standard libraries, for instance from these libraries:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://docs.python.org/dev/library/argparse.html"&gt;argparse&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://docs.python.org/library/csv.html"&gt;csv&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://docs.python.org/library/itertools.html"&gt;itertools&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;I refactored my program to use the standard library and found that it was much better than what I had written. For bigger applications, using standard libraries makes your code less buggy and more maintainable. So spend some time finding out what has already been written.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;How well does Python scale compared to compiled languages?&lt;/h3&gt;&lt;br /&gt;There are mixed opinions on this topic. Scripts are generally small and large systems are generally written in compiled languages. 
The extra checks and rigidity you get from a compiled language are more important the bigger your applications get. If you are writing a financial application and have very low tolerance for errors this could be significant.&lt;br /&gt;&lt;br /&gt;I am using Python for natural language processing: classification, named entity recognition, sentiment analysis and information extraction. I have to write many complex custom scripts fast.&lt;br /&gt;&lt;br /&gt;Based on my earlier experience with writing smaller Python scripts I was concerned about writing a bigger application. I found a good setup with PyDev, unit tests and source control. It gives me much of the stability I am used to in a compiled language, while I can still do rapid development.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-1072506514946234146?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/1072506514946234146/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=1072506514946234146' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1072506514946234146'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1072506514946234146'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2010/11/growing-python-projects-from-small-to.html' title='Growing Python projects from small to large scale'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-7092636845598357792</id><published>2010-10-29T22:51:00.001-04:00</published><updated>2010-11-14T06:15:04.860-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='MongoDB'/><category scheme='http://www.blogger.com/atom/ns#' term='Cython'/><category scheme='http://www.blogger.com/atom/ns#' term='NLTK'/><category scheme='http://www.blogger.com/atom/ns#' term='natural language processing'/><category scheme='http://www.blogger.com/atom/ns#' term='Go'/><category scheme='http://www.blogger.com/atom/ns#' term='C#'/><category scheme='http://www.blogger.com/atom/ns#' term='Golang'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><category scheme='http://www.blogger.com/atom/ns#' term='NLP'/><category scheme='http://www.blogger.com/atom/ns#' term='Clojure'/><title type='text'>Natural language processing in Clojure, Go and Cython</title><content type='html'>I work in natural language processing, programming in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I do have some unmet needs. I investigated whether there are any new languages that would help. I only looked at minimal languages that would be simple to learn. The 3 top contenders were: Clojure, Go and Cython. Both Clojure and Go have innovative approaches to non-locking concurrency. 
This is my first impression of working with these languages.&lt;br /&gt;&lt;br /&gt;For contrast let me start by listing the features of my current languages.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;C# 3.5&lt;/h3&gt;C# is an advanced object-oriented / functional hybrid language and programming platform:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is fast&lt;/li&gt;&lt;li&gt;Great development environment&lt;/li&gt;&lt;li&gt;You can do almost any task in it&lt;/li&gt;&lt;li&gt;Great database support with LINQ to SQL&lt;/li&gt;&lt;li&gt;Advanced web development with ASP.NET&lt;/li&gt;&lt;li&gt;Advanced GUI toolkit with WPF&lt;/li&gt;&lt;li&gt;Good concurrency with threading library&lt;/li&gt;&lt;li&gt;Good MongoDB library&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Issues&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Works best on Windows&lt;/li&gt;&lt;li&gt;Not well suited for rapid development&lt;/li&gt;&lt;/ul&gt;While many features of C# are not directly related to NLP they are very convenient. C# has some NLP libraries: SharpNLP is a port of OpenNLP from Java. Lucene has also been ported. The ports are behind the Java implementations, but still give a good foundation.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Python&lt;/h3&gt;Python is an elegant scripting language, with a strong focus on simplicity.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;NLTK is a great NLP library&lt;/li&gt;&lt;li&gt;Lots of open source math and science libraries&lt;/li&gt;&lt;li&gt;PyDev is a good development environment &lt;/li&gt;&lt;li&gt;Good MongoDB library&lt;/li&gt;&lt;li&gt;Great for rapid development &lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Issues&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is interpreted and not very fast&lt;/li&gt;&lt;li&gt;Problems with GIL based threading model&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;/ul&gt;&lt;h3&gt;C# vs. Python and unmet needs&lt;/h3&gt;I was not sure what language I would prefer to work with. I suspected that C# would win out with all its advanced features. 
Due to demand for fast turnaround, I ended up doing more work in Python, and have been very happy with that choice. I have a lot of scripts that can be piped together to create new applications, with the help of the very fast and flexible MongoDB.&lt;br /&gt;&lt;br /&gt;I do have some concerns about Python moving forward:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Will it scale if I get a really large amount of text&lt;/li&gt;&lt;li&gt;Will speed improve on multi core processors&lt;/li&gt;&lt;li&gt;Will it work with cloud computing&lt;/li&gt;&lt;li&gt;Part of speech tagging is slow&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Java&lt;/h3&gt;Java is a modern object oriented language. Like C# it is a programming platform:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Has most NLP libraries: OpenNLP, Mahout, Lucene, WEKA&lt;/li&gt;&lt;li&gt;It is fast&lt;/li&gt;&lt;li&gt;Great development environment: Eclipse and NetBeans&lt;/li&gt;&lt;li&gt;You can do almost any task in it&lt;/li&gt;&lt;li&gt;Great database support with JDBC and Hibernate&lt;/li&gt;&lt;li&gt;Many web development frameworks &lt;/li&gt;&lt;li&gt;Good GUI toolkit: Swing and JavaFX&lt;/li&gt;&lt;li&gt;Good concurrency with threading library&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Issues&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Functional style programming is clumsy&lt;/li&gt;&lt;li&gt;Working with MongoDB is clumsy&lt;/li&gt;&lt;li&gt;Java code is verbose &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;I would not hesitate to use Java for NLP, but my company is not a Java shop.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Clojure&lt;/h3&gt;&lt;a href="http://clojure.org/"&gt;Clojure&lt;/a&gt; was released in 2007. It is a right sized LISP: not very big like Common LISP or very small like Scheme. 
&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Gives easy access to Java libraries: OpenNLP, Mahout, Lucene, WEKA, OpinionFinder&lt;/li&gt;&lt;li&gt;Innovative non locking concurrency primitives&lt;/li&gt;&lt;li&gt;Good IDEs in Eclipse and NetBeans&lt;/li&gt;&lt;li&gt;Easy to work with&lt;/li&gt;&lt;li&gt;Code and data are unified&lt;/li&gt;&lt;li&gt;Interactive REPL&lt;/li&gt;&lt;li&gt;LISP is the classic artificial intelligence language&lt;/li&gt;&lt;li&gt;If you need speed you can write Java code&lt;/li&gt;&lt;li&gt;Good MongoDB library&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;&amp;nbsp;Issues&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The IDE is not working as well as IDEs for Java or C# &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;/ul&gt;Clojure is minimal in the sense that it is built on around 10 primitive programming constructs. The rest of the language is constructed with macros written in Clojure.&lt;br /&gt;&lt;br /&gt;Once I got Clojure installed it was easy to work with and program in. Most of the good features of Python also apply to Clojure: it is minimal and has batteries included. Still I think that Python is a simpler language than Clojure.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Use case&lt;/b&gt;&lt;br /&gt;Clojure is a good way to script the extensive Java libraries, for rapid development. It has more natural interaction with MongoDB than Java.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Clojure OpenNLP&lt;/h3&gt;The &lt;a href="http://github.com/dakrone/clojure-opennlp"&gt;clojure-opennlp&lt;/a&gt; project is a thin Clojure wrapper around OpenNLP. It came with all the corpora used as training data for OpenNLP nicely packaged and it works well. You can script OpenNLP about as tersely as NLTK, from an interactive REPL.&lt;br /&gt;&lt;br /&gt;I tried it in both Eclipse and NetBeans. They seem somewhat equal in number of features. 
I had slightly better luck with the Eclipse version.&lt;br /&gt;&lt;br /&gt;clojure-opennlp uses a Maven build system, but has a nontraditional directory layout. This caused problems for both Eclipse and NetBeans; they both took some configuration.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Eclipse Counterclockwise&lt;/b&gt;&lt;br /&gt;The &lt;a href="http://www.assembla.com/wiki/show/clojure/Getting_Started_with_Eclipse_and_Counterclockwise"&gt;Counterclockwise instructions&lt;/a&gt; for labrepl mainly worked for installing clojure-opennlp.&lt;br /&gt;When you are done you have to add the example directory to the source directories under Properties. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;NetBeans Enclojure&lt;/b&gt;&lt;br /&gt;I imported the project. I had to move the Clojure file from the example directory to a different position to get it to work.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Maven plugins for Clojure&lt;/b&gt;&lt;br /&gt;The standard Maven directory layout has several advantages, e.g. if you want to mix Java and Clojure code. I created my own Maven POM configuration file, based on examples of other Clojure Maven projects. They used Clojure plugins for Maven, but I could not get these to work. Eventually I ripped these plugins out and was left with a very plain POM file that worked.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Go / Golang&lt;/h3&gt;&lt;a href="http://golang.org/"&gt;Go&lt;/a&gt; was announced in November 2009. It was created by Google to deal with multicore and networked machines. It feels like a mixture of Python and C. 
It is a very clean and minimal language.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is fast &lt;/li&gt;&lt;li&gt;Good standard library&lt;/li&gt;&lt;li&gt;Excellent support for concurrency&lt;/li&gt;&lt;li&gt;It is trivial to write your own load balancer&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Issues&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The Eclipse IDE is in an early stage&lt;/li&gt;&lt;li&gt;Debugger is not working&lt;/li&gt;&lt;li&gt;The Windows port is not finished and has only just been released&lt;/li&gt;&lt;/ul&gt;It was hard to find the right &lt;a href="http://code.google.com/p/gomingw/downloads/list"&gt;Go Windows port&lt;/a&gt;; there are several Go Windows port projects with no code.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Use cases&lt;/b&gt;&lt;br /&gt;I currently have a problem when downloading a lot of HTML pages and parsing them to a tree structure. This does not have the best support in C#. I found a library that translates HTML to XHTML and then I can use LINQ to process it. The library is not documented, not very fast and fails for some HTML files.&lt;br /&gt;&lt;br /&gt;Go comes with an HTML library that parses HTML 5. It is simple to write a program with some threads that download and others that parse the files into a DOM tree structure.&lt;br /&gt;I would use Golang for loading large amounts of text in a cloud computing environment.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Cython&lt;/h3&gt;&lt;a href="http://www.cython.org/"&gt;Cython&lt;/a&gt; was released in July 2007. 
It is a static compiler for writing Python extension modules in a mixture of Python and C.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Process for using Cython&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Start by writing normal Python code&lt;/li&gt;&lt;li&gt;Find modules that are too slow&lt;/li&gt;&lt;li&gt;Add static types&lt;/li&gt;&lt;li&gt;Compile it with Cython using the setup tool&lt;/li&gt;&lt;li&gt;This produces compiled modules that can be used with normal Python &lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Issues&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is still more complex than normal Python code&lt;/li&gt;&lt;li&gt;You need to know C to use it&lt;/li&gt;&lt;/ul&gt;I was surprised how simple it was to get it working under both Windows and Linux. I did not have to mess with make files or configure the compilers. Cython integrated well with NumPy and SciPy. This expands the programming tasks you can do with Python substantially.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Use cases&lt;/b&gt;&lt;br /&gt;Speed up slow POS tagging.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;/ul&gt;&lt;h3&gt;My previous language experience&lt;/h3&gt;Over the years I have experimented with a long list of non-mainstream languages: functional, declarative, parallel, array, dynamic and hybrid languages. Many of these were frustrating experiences. I would read about a new language and get very excited. 
However, this would often be the chain of events:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Download the language &lt;/li&gt;&lt;li&gt;Install Cygwin&lt;/li&gt;&lt;li&gt;Find out how the language's build system works&lt;/li&gt;&lt;li&gt;Try to find a version of the GCC compiler that will compile it&lt;/li&gt;&lt;li&gt;Get the right version of Emacs installed&lt;/li&gt;&lt;li&gt;Try to get the debugger working under Emacs &lt;/li&gt;&lt;li&gt;Start programming from scratch since the libraries were sparse&lt;/li&gt;&lt;li&gt;Burn out&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;You only have so much mental capacity, and if you do not use a language you forget it. Only Scala made it into my toolbox. &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Do Clojure, Go or Cython belong in your programmer's toolbox?&lt;/h3&gt;Clojure, Go and Cython are all simple languages. They are easy to install and easy to learn, and they all have big standard libraries so you can be productive in them right away. This is my first impression:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Clojure&lt;/b&gt; is a good way to script the extensive Java libraries, for rapid application development and for AI work.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Go&lt;/b&gt; is a great language but it is still rough around the edges. There are not any big NLP libraries written for Go yet. I would not try to use it for my main NLP tasks.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Cython&lt;/b&gt; was useful right away for my NLP work. 
It makes it possible to do fast numerical programming in Python without too much trouble.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-7092636845598357792?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/7092636845598357792/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=7092636845598357792' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/7092636845598357792'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/7092636845598357792'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2010/10/natural-language-processing-in-clojure.html' title='Natural language processing in Clojure, Go and Cython'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-6120721954256205779</id><published>2010-06-23T21:20:00.122-04:00</published><updated>2010-12-05T23:24:52.243-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='NLTK'/><category scheme='http://www.blogger.com/atom/ns#' term='Orange'/><category scheme='http://www.blogger.com/atom/ns#' term='mlpy'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='machine 
learning'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><category scheme='http://www.blogger.com/atom/ns#' term='comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='ffnet'/><category scheme='http://www.blogger.com/atom/ns#' term='WEKA'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistica'/><category scheme='http://www.blogger.com/atom/ns#' term='Rapidminer'/><category scheme='http://www.blogger.com/atom/ns#' term='review'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><title type='text'>Orange, R, RapidMiner, Statistica and WEKA</title><content type='html'>&lt;h4&gt;Review of open source and cheap software packages for Data Mining&lt;/h4&gt;This blog posting is comparing the following tools, after working with them for 2 months and using them for solving a real data mining problem:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Orange&lt;br /&gt;&lt;/li&gt;&lt;li&gt;R&lt;/li&gt;&lt;li&gt;RapidMiner&lt;/li&gt;&lt;li&gt;Statistica 8 with Data Miner module&lt;/li&gt;&lt;li&gt;WEKA&lt;/li&gt;&lt;/ul&gt;Statistica is commercial, all the other are open source. 
There is also a brief mention of the following Python libraries: mlpy, ffnet, NLTK.&lt;br /&gt;&lt;h4&gt;Summary of first impression&lt;br /&gt;&lt;/h4&gt;This is a follow-up to my previous post &lt;a href="http://samibadawi.blogspot.com/2010/04/r-rapidminer-statistica-ssas-or-weka.html"&gt;R, RapidMiner, Statistica, SSAS or WEKA&lt;/a&gt; describing my impression of the following software packages after using them for a couple of days each:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;R&lt;/li&gt;&lt;li&gt;RapidMiner&lt;/li&gt;&lt;li&gt;SciPy&lt;br /&gt;&lt;/li&gt;&lt;li&gt;SQL Server Analysis Services, Business Intelligence Development Studio&lt;/li&gt;&lt;li&gt;SQL Server Analysis Services, Table Analysis Tool for Excel &lt;/li&gt;&lt;li&gt;Statistica 8 with Data Miner module&lt;br /&gt;&lt;/li&gt;&lt;li&gt;WEKA&lt;/li&gt;&lt;/ul&gt;Let me summarize what I found:&lt;br /&gt;&lt;br /&gt;SciPy did not have what I needed. However, I found a few other good Python-based solutions: Orange, mlpy, ffnet and NLTK.&lt;br /&gt;&lt;br /&gt;The SSAS-based solutions held promise due to their close integration with Microsoft products, but I found them to be too closely tied to data warehouses, so I postponed exploring them.&lt;br /&gt;&lt;br /&gt;Statistica and RapidMiner had a lot of functionality and were polished, but the many features were overwhelming.&lt;br /&gt;&lt;br /&gt;R was harder to get started with and WEKA was less polished, so I did not spend too much time on them.&lt;br /&gt;&lt;h3&gt;Comparison matrix&lt;/h3&gt;To condense my current findings, I am summarizing them in this matrix. The matrix is based only on limited work with the different software packages and is not very accurate. 
The categories are:&lt;br /&gt;Documentation; GUI and graphics; how polished the package is; ease of learning; controlling package from a script or program; how many machine learning algorithms that are available:&lt;br /&gt;&lt;table border="1"&gt;&lt;tbody&gt;&lt;tr&gt; &lt;th&gt;&lt;br /&gt;&lt;/th&gt;&lt;th&gt;Doc&lt;/th&gt;&lt;th&gt;GUI&lt;/th&gt;&lt;th&gt;Polished&lt;/th&gt;&lt;th&gt;Ease&lt;/th&gt;&lt;th&gt;Scripting&lt;/th&gt;&lt;th&gt;Algorithms&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Orange&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Python libs&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td&gt;R&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td&gt;RapidMiner&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Statistica&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td&gt;WEKA&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3&lt;/td&gt; &lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;h3&gt;Criteria for software package comparison&lt;br /&gt;&lt;/h3&gt;The comparison is based on a real data  mining task  that is relatively simple:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Supervised  learning for categorization.&lt;/li&gt;&lt;li&gt;Over 200 attributes mainly numeric  but 2 categorical / text.&lt;/li&gt;&lt;li&gt;One of the categorical attributes is the most important 
predictor.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Data is clean, so no need to clean outliers and missing data.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Accuracy is a good metric.&lt;/li&gt;&lt;li&gt;A GUI with good graphics for exploring the data is a plus.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;General observations&lt;br /&gt;&lt;/h3&gt;The most popular data mining packages in the industry are &lt;a href="http://www.sas.com/"&gt;SAS&lt;/a&gt; and &lt;a href="http://www.spss.com/"&gt;SPSS&lt;/a&gt;, but they are quite expensive. Orange, R, RapidMiner, Statistica and WEKA can all be used for real data mining work, although some of them are unpolished.&lt;br /&gt;&lt;br /&gt;There was a similar learning curve for most of the programs. Most programs took me a few days to get working, between the documentation and experimenting.&lt;br /&gt;&lt;br /&gt;I had to reformulate my original problem. Neural network models did not work well on my categorical / text attributes. Statistica produced an accuracy of 90%, while RapidMiner produced an accuracy of 82%.&lt;br /&gt;I replaced the 2 categorical attributes with a numeric attribute, and the accuracy of the best model increased to around 97% and was much more uniform across the different tools.&lt;br /&gt;&lt;h3&gt;Orange&lt;/h3&gt;&lt;a href="http://www.ailab.si/orange/"&gt;Orange&lt;/a&gt; is an open source data mining package built on Python, NumPy, wrapped C, C++ and Qt.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Works both as a script and with an ETL work flow GUI.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Shortest script for doing training, cross validation, algorithm comparison and prediction.&lt;/li&gt;&lt;li&gt;I found Orange the easiest tool to learn.&lt;/li&gt;&lt;li&gt;Cross platform GUI.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Not super polished.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The install is big since you need to install Qt.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Python libs: ffnet, NumPy, mlpy, NLTK&lt;br /&gt;&lt;/h3&gt;A few Python 
libs deserve to be mentioned here: ffnet, NumPy, mlpy and NLTK.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If you do not care about the graphic exploration, you can set up an &lt;a href="http://pypi.python.org/pypi/ffnet/0.6"&gt;ffnet&lt;/a&gt; neural network in a few lines of code.&lt;/li&gt;&lt;li&gt;There are several machine learning algorithms in &lt;a href="https://mlpy.fbk.eu/"&gt;mlpy&lt;/a&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The machine learning in &lt;a href="http://www.nltk.org/"&gt;NLTK&lt;/a&gt; is very elegant if you have a text mining or NLP problem.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The libraries are self-contained.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Limited list of machine learning algorithms.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Machine learning is not handled uniformly between the different libraries.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;R&lt;/h3&gt;&lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; is an open source statistical and data mining package and programming language.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Very extensive statistical library.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It is a powerful, elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.&lt;/li&gt;&lt;li&gt;I was able to make a working machine learning program in just 40 lines of code.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Less specialized towards data mining.&lt;/li&gt;&lt;li&gt;There is a steep learning curve, unless you are familiar with array languages.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;R vs. Orange written in Python&lt;br /&gt;&lt;/h3&gt;Python and R have a lot in common: they are both elegant, minimal, interpreted languages with good numeric libraries. Still they have a different feel. 
So I was interested in seeing how they compared.&lt;br /&gt;&lt;h5&gt;Orange / Python advantages&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;R is quite different from common programming languages.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Python is easier for most programmers to learn.&lt;/li&gt;&lt;li&gt;Python has a better debugger.&lt;/li&gt;&lt;li&gt;Scripting data mining categorization problems is simpler in Orange.&lt;/li&gt;&lt;li&gt;Orange also has an ETL work flow GUI.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;R advantages&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;R is even more minimal than Python.&lt;/li&gt;&lt;li&gt;Numerical programming is better integrated in R, whereas in Python you have to use the external packages NumPy and SciPy.&lt;/li&gt;&lt;li&gt;R has better graphics.&lt;/li&gt;&lt;li&gt;R is more transparent, since the Orange classes are wrapped C++ classes.&lt;/li&gt;&lt;li&gt;Easier to combine with other statistical calculations.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;I made a small script to solve my data mining problem in both Orange and R. This was my impression:&lt;br /&gt;&lt;br /&gt;If all you want to do is to solve a categorization problem, I found Orange to be simpler. You do have to become very familiar with how Orange reads the spreadsheet and with the different attribute types, notably the Meta attribute.&lt;br /&gt;&lt;br /&gt;Import and export of spreadsheet data is easier in R: a spreadsheet is stored in a data frame that the different machine learning algorithms operate on. 
Programming in R really is very different: you are working at a higher level of abstraction, but you do lose control over the details.&lt;br /&gt;&lt;h3&gt;RapidMiner&lt;/h3&gt;&lt;a href="http://rapid-i.com/content/view/181/190/"&gt;RapidMiner&lt;/a&gt; is an open source statistical and data mining package written in Java.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Solid and complete package.&lt;/li&gt;&lt;li&gt;It easily reads and writes Excel files and different databases.&lt;/li&gt;&lt;li&gt;You program by piping components together in a graphical ETL work flow.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;If you set up an illegal work flow, RapidMiner suggests Quick Fixes to make it legal.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;I only got it to work under Windows, but others have gotten it to work in other environments, see comment below.&lt;/li&gt;&lt;li&gt;There are a lot of different ETL modules; it took a while to understand how to use them.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;At first I had a hard time comparing different models. Eventually I found a way: you choose a cross validation and select different models one by one. When you run the models, they are all stored on the result page, and you can compare them there.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Statistica 8&lt;/h3&gt;&lt;a href="http://www.statsoft.com/"&gt;Statistica&lt;/a&gt; is a commercial statistics and data mining software package for Windows.&lt;br /&gt;There is a 90 day trial for Statistica 8 with data miner module in the textbook:&lt;br /&gt;&lt;a href="http://www.elsevier.com/wps/find/bookdescription.cws_home/717661/description#description"&gt;Handbook of Statistical Analysis and Data Mining Applications&lt;/a&gt;. 
There is  also a &lt;a href="http://www.statsoft.com/support/free-statistica-9-trial/"&gt;free 30  day trial&lt;/a&gt;.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Generally very polished and good at everything, but it is also the only non open source program.&lt;/li&gt;&lt;li&gt;High accuracy even when I gave it bad input.&lt;/li&gt;&lt;li&gt;You can script everything in Statistica in VB.&lt;/li&gt;&lt;li&gt;Cheap compared to SPSS and SAS.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;br /&gt;&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;So many options that it was hard to navigate the program.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The most important video about &lt;a href="http://www.youtube.com/watch?v=woMHzX4O5nw"&gt;Data Miner Recipes&lt;/a&gt;  is the very last out of 36.&lt;/li&gt;&lt;li&gt;Cost of Statistica is not available on their  website.&lt;/li&gt;&lt;li&gt;It is cheap in a corporate setting, but not for private  use.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;WEKA&lt;/h3&gt;&lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;WEKA&lt;/a&gt; is an open  source statistical and data mining library written in Java.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A lot of machine learning algorithms. &lt;/li&gt;&lt;li&gt;Easy to learn and use.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Good GUI. &lt;/li&gt;&lt;li&gt;Platform independent.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;br /&gt;&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Worse connectivity to Excel spreadsheet and non Java based  databases.&lt;/li&gt;&lt;li&gt;CSV reader not as robust as in RapidMiner.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Not as  polished.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;RapidMiner vs. WEKA&lt;/h3&gt;The most similar data mining packages are RapidMiner and WEKA. 
They have many similarities:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Written in Java.&lt;/li&gt;&lt;li&gt;Free / open source software with GPL license.&lt;/li&gt;&lt;li&gt;RapidMiner includes many learning algorithms from WEKA.&lt;/li&gt;&lt;/ul&gt;My first thought was that RapidMiner has everything that WEKA has, plus a lot of other functionality, and is more polished. Therefore I did not spend too much time on WEKA. For the sake of completeness I took a second look at WEKA, and I have to say that it was a lot easier to get WEKA to work. Sometimes less is more. Which one to choose depends on whether functionality or ease of use is more important.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;There are several good and very different solutions. Let me finish by listing the strongest aspect of each tool:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Orange&lt;/span&gt; has elegant and concise scripting and can also be run in an ETL GUI mode.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;R&lt;/span&gt; has elegant and concise scripting integrated with a vast statistical library.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;RapidMiner&lt;/span&gt; has a lot of functionality, is polished and has good connectivity.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Statistica&lt;/span&gt; is the most polished product, and generally performed well in all categories. 
It gave good result when I gave it bad input.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;WEKA&lt;/span&gt; is the easiest GUI to learn and use.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-6120721954256205779?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/6120721954256205779/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=6120721954256205779' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/6120721954256205779'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/6120721954256205779'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2010/06/orange-r-rapidminer-statistica-and-weka.html' title='Orange, R, RapidMiner, Statistica and WEKA'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-1419166022712154266</id><published>2010-04-29T20:58:00.082-04:00</published><updated>2010-12-05T23:28:46.297-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Business Intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='Table Analysis Tool'/><category scheme='http://www.blogger.com/atom/ns#' term='vs'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL Server'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><category scheme='http://www.blogger.com/atom/ns#' term='comparison'/><category scheme='http://www.blogger.com/atom/ns#' term='WEKA'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistica'/><category scheme='http://www.blogger.com/atom/ns#' term='SSAS'/><category scheme='http://www.blogger.com/atom/ns#' term='Rapidminer'/><category scheme='http://www.blogger.com/atom/ns#' term='review'/><title type='text'>R, RapidMiner, Statistica, SSAS or WEKA</title><content type='html'>&lt;h4&gt;Choosing cheap software packages to get started with Data Mining&lt;br /&gt;&lt;/h4&gt;You have a data mining problem and you want to try to solve it with a data mining software package. The most popular packages in the industry are &lt;a href="http://www.sas.com/"&gt;SAS&lt;/a&gt; and &lt;a href="http://www.spss.com/"&gt;SPSS&lt;/a&gt;, but they are quite expensive, so you might have a hard time convincing your boss to purchase them before you have produced impressive results.&lt;br /&gt;&lt;br /&gt;When I needed data mining or machine learning algorithms in the past, I would program them from scratch and integrate them in my Java or C# code. But recently I needed a more interactive graphics environment to help with what is called the Data Understanding phase in &lt;a href="http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining"&gt;CRISP-DM&lt;/a&gt;. 
I also wanted a way to compare the predictive accuracy of a broad array of algorithms, so I tried out several packages:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;R&lt;/li&gt;&lt;li&gt;RapidMiner&lt;/li&gt;&lt;li&gt;SciPy&lt;br /&gt;&lt;/li&gt;&lt;li&gt;SQL Server Analysis Services, Business Intelligence Development Studio&lt;/li&gt;&lt;li&gt;SQL Server Analysis Services, Table Analysis Tool for Excel &lt;/li&gt;&lt;li&gt;Statistica 8 with Data Miner module&lt;br /&gt;&lt;/li&gt;&lt;li&gt;WEKA&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Disclaimer for review&lt;br /&gt;&lt;/h4&gt;Here is a review of my first impression of these packages. A first impression is not the best indicator of what is going to work for you in the long run. I am sure that I have missed many features. Still, I hope this can save you some time finding a solution that will work for your problem.&lt;br /&gt;&lt;h3&gt;R&lt;/h3&gt;&lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; is an open source statistical and data mining package and programming language.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Very extensive statistical library.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Very concise for solving statistical problems.&lt;/li&gt;&lt;li&gt;It is a powerful, elegant array language in the tradition of APL, Mathematica and MATLAB, but also LISP/Scheme.&lt;/li&gt;&lt;li&gt;In a few lines you can set up an R program that does data mining and machine learning.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;You have full control.&lt;/li&gt;&lt;li&gt;It is easier to integrate this into a work flow with your other programs. 
You just spawn an R program and pass input in and read output from a pipe.&lt;/li&gt;&lt;li&gt;Good plotting functionality.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Less interactive GUI.&lt;/li&gt;&lt;li&gt;Less specialized towards data mining.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Language is pretty different from current mainstream languages like C, C#, C++, Java, PHP and VB.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;There is a learning curve, unless you are familiar with array languages.&lt;/li&gt;&lt;li&gt;R dates from the early 1990s.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Link: Screencast showing how a trained R user can generate a &lt;a href="http://www.youtube.com/watch?v=3t8LiXlBL40"&gt;PMML neural network model in 60 seconds&lt;/a&gt;.&lt;br /&gt;&lt;h3&gt;RapidMiner&lt;br /&gt;&lt;/h3&gt;&lt;a href="http://rapid-i.com/content/view/181/190/"&gt;RapidMiner&lt;/a&gt; is an open source statistical and data mining package written in Java.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Lots of data mining algorithms.&lt;/li&gt;&lt;li&gt;Feels polished.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Good graphics.&lt;/li&gt;&lt;li&gt;It easily reads and writes Excel files and different databases.&lt;/li&gt;&lt;li&gt;You program by piping components together in a graphical ETL workflow.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;If you set up an illegal workflow, RapidMiner suggests Quick Fixes to make it legal.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Good &lt;a href="http://rapid-i.com/content/view/189/198/lang,en/"&gt; video &lt;/a&gt;&lt;a href="http://rapid-i.com/content/view/189/198/lang,en/"&gt;tutorials&lt;/a&gt; / European dance parties. 
*:o)&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;I only got it to work under Windows, but others have gotten it to work in other environments.&lt;/li&gt;&lt;li&gt;Harder to compare different algorithms than WEKA.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;SciPy&lt;/h3&gt;&lt;a href="http://www.scipy.org/"&gt;SciPy&lt;/a&gt; is an open source Python wrapper around numerical libraries.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Good for mathematics.&lt;/li&gt;&lt;li&gt;Python is a simple, elegant and mature language.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Data mining part is too immature.&lt;/li&gt;&lt;li&gt;Too much duct tape.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;SQL Server Business Intelligence Development Studio&lt;br /&gt;&lt;/h3&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/ms175609%28SQL.90%29.aspx"&gt;Microsoft SQL Server Analysis Services&lt;/a&gt; comes with a data mining service.&lt;br /&gt;If you have access to SQL Server 2005 or later with SSAS installed, you can use some of the data mining algorithms for free. If you want to scale, it can become quite expensive.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If you are working with the Microsoft stack, this integrates well.&lt;/li&gt;&lt;li&gt;Good data mining functionality.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Organized well.&lt;/li&gt;&lt;li&gt;Comes with some graphics.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;The machine learning is closely tied to data warehouses and cubes. This makes the learning curve steeper and deployment harder.&lt;/li&gt;&lt;li&gt;Documentation about using the BIDS GUI was hard to find. I looked in several books and several videos.&lt;/li&gt;&lt;li&gt;I need to do my data mining from within a web server or a command line program. For this you need to access the models using Analysis Management Objects (AMO). 
Documentation for this was also hard to find.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;You need good cooperation from your DBA, unless you have your own instance of SQL Server.&lt;/li&gt;&lt;li&gt;If you want to evaluate the performance of your predictive model, cross-validation is available &lt;a href="http://technet.microsoft.com/en-us/library/bb895226.aspx"&gt;only in SQL Server 2008 Enterprise&lt;/a&gt;.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Link: Good &lt;a href="http://www.microsoft.com/events/series/technetsqlserver2008.aspx?tab=Webcasts&amp;amp;seriesid=111&amp;amp;webcastid=4812"&gt;screencast&lt;/a&gt;  about data mining with SSAS.&lt;br /&gt;&lt;h3&gt;SQL Server Analysis Services, Table Analysis Tool Excel&lt;br /&gt;&lt;/h3&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/dd299412.aspx"&gt;Microsoft Excel data mining plug-in&lt;/a&gt; is dependent on SQL Server 2008 and Excel 2007.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;This takes less interaction with the database and DBA than the Development Studio.&lt;/li&gt;&lt;li&gt;A lot of users have their data in Excel.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;There is an Analysis ribbon / menu that is  very simple to use. Even for users with very limited understanding of data mining.&lt;/li&gt;&lt;li&gt;The Machine Learning ribbon has more control over internals of the algorithms.&lt;/li&gt;&lt;li&gt;You can run with huge amount of data since the number crunching is done on the server.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;This also needs a connection to a SQL Server 2008 with Analysis Services running. 
That is required even though the data mining algorithms are relatively simple.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;You need a special database inside Analysis Services that you have write permissions to.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Link: Excel Table Analysis Tool &lt;a href="http://msdn.microsoft.com/en-us/library/dd299412.aspx"&gt;video&lt;/a&gt;&lt;br /&gt;&lt;h3&gt;Statistica 8&lt;/h3&gt;&lt;a href="http://www.statsoft.com/"&gt;Statistica&lt;/a&gt; is a commercial statistics and data mining software package.&lt;br /&gt;There is a 90 day trial for Statistica 8 with data miner module in the textbook:&lt;br /&gt;&lt;a href="http://www.elsevier.com/wps/find/bookdescription.cws_home/717661/description#description"&gt;Handbook of Statistical Analysis and Data Mining Applications&lt;/a&gt;. There is also a &lt;a href="http://www.statsoft.com/support/free-statistica-9-trial/"&gt;free 30 day trial&lt;/a&gt;.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Statistica is cheaper than SAS and SPSS.&lt;/li&gt;&lt;li&gt;Six hours of instructional &lt;a href="http://www.statsoft.com/support/download/video-tutorials/"&gt;videos&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.youtube.com/watch?v=woMHzX4O5nw"&gt;Data Miner Recipes&lt;/a&gt; wizard is the easiest tool for a beginner.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Lots of data mining algorithms.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;GUI with a lot of functionality.&lt;/li&gt;&lt;li&gt;You program using menus and wizards.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Good graphics.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Easy to find and clean up outliers and missing data attributes.&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Issues:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Overwhelming number of menu items.&lt;/li&gt;&lt;li&gt;The most important video about &lt;a href="http://www.youtube.com/watch?v=woMHzX4O5nw"&gt;Data Miner Recipes&lt;/a&gt; is the very last.&lt;/li&gt;&lt;li&gt;Cost of Statistica is not available on their website.&lt;/li&gt;&lt;li&gt;It is cheap in a corporate setting, but not for 
private use.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;WEKA&lt;/h3&gt;&lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;WEKA&lt;/a&gt; is an open source statistical and data mining library written in Java.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Many machine learning packages.&lt;/li&gt;&lt;li&gt;Good graphics.&lt;/li&gt;&lt;li&gt;Specialized for data mining.&lt;/li&gt;&lt;li&gt;Easy to work with.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Written in pure Java so it is multi platform.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Good for &lt;a href="http://sentimentmining.net/weka/"&gt;text mining&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;You can train different learning algorithms at the same time and compare their results.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;RapidMiner vs WEKA:&lt;/h5&gt;The most similar data mining packages are RapidMiner and WEKA. They have many similarities:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Written in Java.&lt;/li&gt;&lt;li&gt;Free / open source software with GPL license.&lt;/li&gt;&lt;li&gt;RapidMiner includes many learning algorithms from WEKA.&lt;/li&gt;&lt;/ul&gt;Therefore the main question with WEKA is how it compares to RapidMiner.&lt;br /&gt;&lt;h5&gt;Issues compared to RapidMiner:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Worse connectivity to Excel spreadsheets and non Java based databases.&lt;/li&gt;&lt;li&gt;CSV reader not as robust.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Not as polished.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Criteria for software package comparison&lt;br /&gt;&lt;/h3&gt;My current data mining needs are relatively simple. I do not need the most sophisticated software packages. 
This is what I need now:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Supervised learning for categorization.&lt;/li&gt;&lt;li&gt;Over 200 features mainly numeric but 2 categorical.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Data is clean so no need to clean outliers and missing data.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Avoiding mistakes is not critical.&lt;/li&gt;&lt;li&gt;Equal cost for type 1 and type 2 errors.&lt;/li&gt;&lt;li&gt;Accuracy is a good metric.&lt;/li&gt;&lt;li&gt;Easy to export model to production environment.&lt;/li&gt;&lt;li&gt;Good GUI with good graphics to explore the data.&lt;/li&gt;&lt;li&gt;Easy to compare a few different models e.g. boosted trees, naive Bayes, neural network, random forest and support vector machine.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Summary&lt;br /&gt;&lt;/h3&gt;I did not have time to test all the tools enough for a real review. I was only trying to determine which data mining software packages to try first.&lt;br /&gt;&lt;h5&gt;Try first list&lt;/h5&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Statistica:&lt;/span&gt; Most polished, easiest to get started with. Good graphics and documentation.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;RapidMiner:&lt;/span&gt; Polished. Simplest and most uniform GUI. Good graphics. Open source.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;WEKA:&lt;/span&gt; A little unpolished. Good functionality for comparing different data mining algorithms.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;SSAS Table Analysis Tool, Data Mining ribbon:&lt;/span&gt; Showed promise, but I did not get it to do what I need.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;SSAS BIDS:&lt;/span&gt; Close tie to cube and data warehouse. Hard to find documentation about AMO programming. 
Could possibly give best integration with C# and VB.NET.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;SSAS Table Analysis Tool, Analysis ribbon:&lt;/span&gt; Simple to use but does not have the functionality I need.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;R:&lt;/span&gt; Not specialized towards data mining. Elegant but different programming paradigm.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;SciPy:&lt;/span&gt; Data mining library too immature.&lt;/li&gt;&lt;/ol&gt;Both RapidMiner and Statistica 8 do what I need now. So far I have found it easier to find functions using Statistica's menus and wizards than RapidMiner's ETL workflows, but RapidMiner is open source. Still, I would not be surprised if I ended up using more than one package.&lt;br /&gt;&lt;h3&gt;Preliminary quality comparison of Statistica and RapidMiner&lt;/h3&gt;I ran my predictive modeling task in both Statistica and RapidMiner. In the first round the model that performed best in Statistica was a neural network, with an error rate of approximately 10%.&lt;br /&gt;&lt;br /&gt;When I ran the neural network in RapidMiner, the error rate was approximately 18%. I was surprised by the big difference. The reason is probably that one of my most important attributes is categorical with many values, and neural networks do not work well with that. Statistica might have performed better due to more hidden layers.&lt;br /&gt;&lt;br /&gt;The second time I ran my predictive model, Statistica had a numeric overflow in the neural network, and some prediction values were missing. 
This also surprised me; I would expect that there could be problems with the training of the neural network, but not with the calculation of an input on a trained model.&lt;br /&gt;&lt;br /&gt;These problems can easily be the result of my unfamiliarity with the software packages, but this was my first impression.&lt;br /&gt;&lt;br /&gt;Link to my follow up post that is based on solving an actual data mining problem in &lt;a href="http://samibadawi.blogspot.com/2010/06/orange-r-rapidminer-statistica-and-weka.html"&gt;Orange, R, RapidMiner, Statistica and WEKA&lt;/a&gt; after working with them for 2 months.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-1419166022712154266?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/1419166022712154266/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=1419166022712154266' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1419166022712154266'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1419166022712154266'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2010/04/r-rapidminer-statistica-ssas-or-weka.html' title='R, RapidMiner, Statistica, SSAS or WEKA'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-7861979148324362855</id><published>2010-04-03T23:08:00.040-04:00</published><updated>2010-04-05T08:33:09.843-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='AI'/><category scheme='http://www.blogger.com/atom/ns#' term='Business Intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='Artificial Intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='Predictive Analytics'/><title type='text'>Data Mining rediscovers Artificial Intelligence</title><content type='html'>&lt;a href="http://en.wikipedia.org/wiki/Artificial_intelligence"&gt;Artificial intelligence&lt;/a&gt; started in the 1950s with very high expectations. AI did not deliver on the expectations and fell into decades long discredit. I am seeing signs that Data Mining and Business Intelligence are bringing AI into mainstream computing. This blog posting is a personal account of my long struggle to work in artificial intelligence during different trends in computer science.&lt;br /&gt;&lt;br /&gt;In the 1980s I was studying mathematics and physics, which I really enjoyed. I was concerned about my job prospects, there are not many math or science jobs outside of academia. Artificial intelligence seemed equally interesting but more practical, and I thought that it could provide me with a living wage. Little did I know that artificial intelligence was about to become an  unmentionable phrase that you should not put on your resume if you wanted  a paying job.&lt;br /&gt;&lt;h3&gt;Highlights of the history of artificial intelligence&lt;br /&gt;&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;In 1956 AI was founded. 
&lt;/li&gt;&lt;li&gt;In 1957 Frank Rosenblatt invented the &lt;a href="http://en.wikipedia.org/wiki/Perceptron"&gt;Perceptron&lt;/a&gt;, the first generation of neural networks. It was based on the way the human brain works, and provided simple solutions to some simple problems.&lt;/li&gt;&lt;li&gt;In 1958 John McCarthy invented &lt;a href="http://en.wikipedia.org/wiki/Lisp_%28programming_language%29"&gt;LISP&lt;/a&gt;, the classic AI language. Mainstream  programming languages have borrowed heavily from LISP and are only now catching up with LISP.&lt;/li&gt;&lt;li&gt;In the 1960s AI got lots of defense funding, especially for military software translating from Russian to English.&lt;/li&gt;&lt;/ul&gt; AI theory made quick advances and a lot was  developed early on. AI techniques worked well on small  problems. It was expected that AI could learn, using machine learning, and that this would soon lead to human-like intelligence.&lt;br /&gt;&lt;br /&gt;This did not work out as planned. The machine translation did not work well enough  to be usable. The defense funding dried up. The approaches that had  worked well for small problems did not scale to bigger domains. 
Artificial intelligence fell out of favor in the 1970s.&lt;br /&gt;&lt;h3&gt;AI advances in the 1980s&lt;br /&gt;&lt;/h3&gt;When I started studying AI, it was in the middle of a renaissance and I was optimistic about recent advances:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The discovery of new types of neural networks, after Perceptron networks had been discredited in a book by Marvin Minsky and Seymour Papert&lt;/li&gt;&lt;li&gt;Commercial expert systems were thriving&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The Japanese &lt;a href="http://en.wikipedia.org/wiki/Fifth_generation_computer"&gt;Fifth  Generation Computer Systems&lt;/a&gt; project, written in the new elegant  declarative &lt;a href="http://en.wikipedia.org/wiki/Prolog"&gt;Prolog&lt;/a&gt;   language, had many people in the West worried &lt;/li&gt;&lt;li&gt;Advances in probability theory: Bayesian networks / causal networks&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;In order to combat the brittleness of AI systems, Doug Lenat started a large scale AI project, &lt;a href="http://cyc.com/"&gt;CYC&lt;/a&gt;, in 1984. His idea was that there is no free lunch, and in order to build an intelligent system, you have to use many different types of fine tuned logical inference; and you have to hand encode it with a lot of common sense knowledge. Cycorp spent hundreds of man years building their huge ontology. Their hope was that CYC would be able to start learning on its own, after training it for some years.&lt;br /&gt;&lt;h3&gt;AI in the 1990s&lt;br /&gt;&lt;/h3&gt;I did not lose my patience but other people did, and AI went from the technology of the future to yesterday's news. It had become a loser that you did not want to be associated with.&lt;br /&gt;&lt;br /&gt;During the Internet bubble, when venture capital funding was abundant, I was briefly involved with an AI Internet start up company. The company did not take off; its main business was emailing discount coupons out to AOL customers. 
This left me disillusioned, thinking that I just had to put on a happy face when I worked on the next web application or trading system.&lt;br /&gt;&lt;h3&gt;AI usage today&lt;br /&gt;&lt;/h3&gt; Even though AI stopped being cool, regular people are using it in more and more places:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Spam filter&lt;/li&gt;&lt;li&gt;Search engines use natural language processing&lt;/li&gt;&lt;li&gt;Biometric, face and fingerprint detection&lt;br /&gt;&lt;/li&gt;&lt;li&gt;OCR, check reading in ATM&lt;/li&gt;&lt;li&gt;Image processing in coffee machine detecting misaligned cups&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Fraud detection&lt;/li&gt;&lt;li&gt;Movie and book recommendations&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Machine translation&lt;/li&gt;&lt;li&gt;Speech understanding and generation in phone menu system&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Euphemistic words for AI techniques&lt;br /&gt;&lt;/h3&gt; The rule seems to be that you can use AI techniques as long as you call them something else, e.g.:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Business Intelligence&lt;/li&gt;&lt;li&gt;Collective Intelligence &lt;/li&gt;&lt;li&gt;Data Mining &lt;/li&gt;&lt;li&gt;Information Retrieval &lt;/li&gt;&lt;li&gt;Machine Learning&lt;/li&gt;&lt;li&gt;Natural Language Processing &lt;/li&gt;&lt;li&gt;Predictive Analytics&lt;/li&gt;&lt;li&gt;Pattern Matching&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;AI is entering mainstream computing now&lt;br /&gt;&lt;/h3&gt;Recently I have seen signs that AI techniques are moving into mainstream computing:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;I went to a presentation for &lt;a href="http://www.spss.com/software/modeling/modeler-pro/"&gt;SPSS statistical modeling software&lt;/a&gt;, and was shocked how many people now are using data  mining and machine learning techniques. 
I was sitting next to people working in a prison, an adoption agency, marketing and a disease prevention NGO.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I started working on a data warehouse using SQL Server Analysis Services, and found that SSAS has a suite of machine learning tools.&lt;/li&gt;&lt;li&gt;Functional and declarative techniques are spreading to mainstream programming languages.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Business Intelligence compared to AI&lt;br /&gt;&lt;/h3&gt;Business Intelligence is about aggregating a company's data into an understandable format and analyzing it to provide better business decisions. BI is currently the most popular field using artificial intelligence techniques. Here are a few words about how it differs from AI:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;    BI is driven by vendors instead of academia&lt;br /&gt;&lt;/li&gt;&lt;li&gt;    BI is centered around expensive software packages with a lot of marketing&lt;/li&gt;&lt;li&gt;The scope is limited, e.g. find good prospective customers for your products&lt;br /&gt;&lt;/li&gt;&lt;li&gt;  Everything is living in  databases or data warehouses&lt;/li&gt;&lt;li&gt;BI is data driven&lt;/li&gt;&lt;li&gt;Reporting is a very important component of BI&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Getting a job in AI&lt;br /&gt;&lt;/h3&gt;I recently made a big effort to steer my career towards AI. I started an open  source computer vision project, &lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt; and put AI back on my resume. A head hunter contacted me and asked if I had any experience in  Predictive Analytics. It took me 15 minutes to convince her that  Predictive Analytics and AI were close enough that she could forward my resume. I got the job, my first real AI and NLP job.&lt;br /&gt;&lt;br /&gt;The  work I am doing is not dramatically different from normal software development work. 
I  spend less time on machine  learning than on getting AJAX to work with C# ASP.NET for the web GUI, or upgrading the database ORM from ADO.NET strongly typed datasets to LINQ to SQL. However, it was very gratifying to see my program start to perform a task that had been very time consuming for the company's medical staff.&lt;br /&gt;&lt;h3&gt;Is AI regaining respect?&lt;br /&gt;&lt;/h3&gt;No, not now. There are lots of job postings for BI and data mining but barely any   for artificial intelligence. AI is still not a popular word, except in video games where AI means something different. When I worked as a games developer what was called AI was just checking if your character was close to an enemy and then the enemy would start shooting in your character's direction.&lt;br /&gt;&lt;br /&gt;After 25 long years of waiting I am very happy to see that AI techniques have finally become a commodity, and I enjoy working with them even if I have to disguise this work by whatever the buzzword of the day is.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-7861979148324362855?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/7861979148324362855/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=7861979148324362855' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/7861979148324362855'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/7861979148324362855'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2010/04/data-mining-rediscovers-artificial.html' title='Data Mining rediscovers Artificial 
Intelligence'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-1516230596802250365</id><published>2010-03-17T22:23:00.023-04:00</published><updated>2011-04-21T12:19:10.213-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='VB.NET'/><category scheme='http://www.blogger.com/atom/ns#' term='NLTK'/><category scheme='http://www.blogger.com/atom/ns#' term='C#'/><category scheme='http://www.blogger.com/atom/ns#' term='IronPython'/><category scheme='http://www.blogger.com/atom/ns#' term='DLR'/><category scheme='http://www.blogger.com/atom/ns#' term='SharpNLP'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><category scheme='http://www.blogger.com/atom/ns#' term='NLP'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>SharpNLP vs NLTK called from C# review</title><content type='html'>C# and VB.net have fewer open source NLP libraries than languages like C++, Java, LISP and Perl. My last blog post: &lt;a href="http://samibadawi.blogspot.com/2010/03/open-source-nlp-in-c-35-using-nltk.html"&gt;Open Source NLP in C# 3.5 using NLTK&lt;/a&gt; is about calling &lt;a href="http://www.nltk.org/"&gt;NLTK&lt;/a&gt;, which is written in Python, from IronPython embedded under C# or VB.net.&lt;br /&gt;&lt;br /&gt;An alternative is to use &lt;a href="http://www.codeplex.com/sharpnlp"&gt;SharpNLP&lt;/a&gt;, which is the leading open source NLP project written in C# 2.0. 
SharpNLP is not as big as other &lt;a href="http://nlp.stanford.edu/links/statnlp.html"&gt;Open Source NLP&lt;/a&gt; projects. This blog posting is a short comparison of SharpNLP and NLTK embedded in C#.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Documentation&lt;br /&gt;&lt;/h3&gt;NLTK has excellent documentation, including an introductory &lt;a href="http://www.nltk.org/book"&gt;online book&lt;/a&gt; on NLP and Python programming.&lt;br /&gt;&lt;br /&gt;For SharpNLP the source code is the documentation. There is also a short &lt;a href="http://www.codeproject.com/KB/recipes/englishparsing.aspx"&gt;introductory article&lt;/a&gt; by SharpNLP's author Richard J. Northedge.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Ease of learning&lt;br /&gt;&lt;/h3&gt;NLTK is very easy to work with under Python, but integrating it as embedded IronPython under C# took me a few days. It is still a lot simpler to get Python and C# to work together than Python and C++.&lt;br /&gt;&lt;br /&gt;SharpNLP's lack of documentation makes it harder to use; but it is very simple to install.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Ease of use&lt;br /&gt;&lt;/h3&gt;NLTK is great to work with in the Python interpreter.&lt;br /&gt;&lt;br /&gt;SharpNLP simplifies life since you do not have to deal with the embedding of IronPython under C# and the mismatch between the 2 languages.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Machine learning and statistical models&lt;br /&gt;&lt;/h3&gt;NLTK comes with a variety of machine learning and statistical models: decision trees, naive Bayesian, and maximum entropy. They are very easy to train and validate, but do not perform well for large data sets.&lt;br /&gt;&lt;br /&gt;SharpNLP is focused on maximum entropy modeling.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Tokenizer quality&lt;br /&gt;&lt;/h3&gt;NLTK has a very simple RegEx based tokenizer that works well in most cases.&lt;br /&gt;&lt;br /&gt;SharpNLP has a more advanced maximum entropy based tokenizer that can split "don't" into "do | n't". 
On the other hand it sometimes makes errors and splits a normal word into 2 words.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Development community&lt;br /&gt;&lt;/h3&gt;NLTK has an active development community, with an active mailing list.&lt;br /&gt;&lt;br /&gt;SharpNLP's last release was in December 2006. It is a port of the Java based OpenNLP, and can read models from OpenNLP. SharpNLP has a low volume mailing list.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Code quality&lt;br /&gt;&lt;/h3&gt;NLTK lets you write programs that read from web pages, clean HTML out of text and do machine learning in a few lines of code.&lt;br /&gt;&lt;br /&gt;SharpNLP is written in C# 2.0 using generics. It is a port from OpenNLP and maintains a Java flavor, but it is still very readable and pleasant to work with.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;License&lt;br /&gt;&lt;/h3&gt;NLTK's license is Apache License, Version 2.0, which should fit most people's needs.&lt;br /&gt;&lt;br /&gt;SharpNLP's license is LGPL 2.1. This is a versatile license, but maybe a little harder to work with when the project is not active.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Applications&lt;br /&gt;&lt;/h3&gt;NLTK comes with a theorem prover for reasoning about semantic content of text.&lt;br /&gt;&lt;br /&gt;SharpNLP comes with a name, organization, time, date and percentage finder.&lt;br /&gt;It is very simple to add an advanced GUI, using WPF or WinForms.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Conclusion&lt;br /&gt;&lt;/h3&gt;Both packages come with a lot of functionality. They both have weaknesses, but they are definitely usable. 
I have both SharpNLP and embedded NLTK in my NLP toolbox.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-1516230596802250365?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/1516230596802250365/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=1516230596802250365' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1516230596802250365'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1516230596802250365'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2010/03/sharpnlp-vs-nltk-called-from-c-review.html' title='SharpNLP vs NLTK called from C# review'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-14061637374794049</id><published>2010-03-11T07:02:00.022-05:00</published><updated>2011-04-22T18:47:00.756-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='VB.NET'/><category scheme='http://www.blogger.com/atom/ns#' term='C#'/><category scheme='http://www.blogger.com/atom/ns#' term='IronPython'/><category scheme='http://www.blogger.com/atom/ns#' term='DLR'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><category scheme='http://www.blogger.com/atom/ns#' term='machine 
learning'/><title type='text'>Open Source NLP in C# 3.5 using NLTK</title><content type='html'>I am working on natural language processing algorithms in a C# 3.5 environment. I did not find any &lt;a href="http://nlp.stanford.edu/links/statnlp.html"&gt;open source NLP&lt;/a&gt; packages for C# or VB.NET.&lt;br /&gt;&lt;a href="http://www.nltk.org/"&gt;NLTK&lt;/a&gt; is a great open source NLP package written in Python. It comes with an online &lt;a href="http://www.nltk.org/book"&gt;book&lt;/a&gt;. I decided to try to embed IronPython under C# and run NLTK from there. Here are a few thoughts about the experience.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Problems with embedding IronPython and NLTK&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;Some libraries that NLTK uses are not installed in IronPython, e.g. zlib and numpy, you can mainly patch this up&lt;br /&gt;&lt;/li&gt;&lt;li&gt;You need a good understanding of how embedded IronPython works&lt;/li&gt;&lt;li&gt;The connection between Python and C# is not seamless&lt;/li&gt;&lt;li&gt;Sending data between Python and C#  takes work&lt;/li&gt;&lt;li&gt;NLTK is pretty slow at starting up&lt;/li&gt;&lt;li&gt;Doing large scale machine learning in NLTK is slow&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;C# and IronPython&lt;br /&gt;&lt;/h4&gt;IronPython is a very good implementation of Python, but in C# 3.5 there is still a mismatch between C# and Python; this becomes an issue when you are dealing with a library as big as NLTK.&lt;br /&gt;The integration between IronPython and C# is going to improve with C# 4.0. How much remains to be seen.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;To embed or not to embed&lt;/h3&gt;When is embedding IronPython and NLTK inside C# a good idea?&lt;br /&gt;&lt;h4&gt;Separate processes for NLTK under CPython and C#&lt;br /&gt;&lt;/h4&gt;If your C# tasks and your NLP tasks are not interacting too much, it might be simpler to have a C# program call a NLP CPython program as an external process. E.g. 
you want to analyze the content of a Word document. You would open the Word document in C#, create a Python process, pipe the text into it, read the result back as JSON or XML, and display it in ASP, WPF or WinForms.&lt;br /&gt;&lt;h4&gt;Small NLP tasks&lt;br /&gt;&lt;/h4&gt;There is a learning curve for both NLTK and embedded IronPython that slows you down when you start work.&lt;br /&gt;&lt;h4&gt;Medium sized NLP projects&lt;/h4&gt;The setup cost is not an issue so embedding IronPython and NLTK could work very well here.&lt;br /&gt;&lt;h4&gt;Big NLP projects&lt;br /&gt;&lt;/h4&gt;The setup cost is not an issue, but at some point the mismatch between Python and C# will start to outweigh the advantages you get.&lt;br /&gt;&lt;h4&gt;Prototyping in NLTK&lt;br /&gt;&lt;/h4&gt;Start writing your application in NLTK either under CPython or IronPython. This should improve development time substantially. You might find that your prototype is good enough and you do not need to port it to C#; or you will have a working program that you can port to C#.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;References&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;Post about running &lt;a href="http://ironpython.codeplex.com/WorkItem/View.aspx?WorkItemId=24357"&gt;NLTK from IronPython&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Chapter 15 of &lt;a href="http://www.ironpythoninaction.com/"&gt;IronPython in Action&lt;/a&gt; is about embedding IronPython in C# or VB.NET&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.ironpythoninaction.com/download.html"&gt;Source code examples&lt;/a&gt; from IronPython in Action&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Here is a short &lt;a href="http://www.voidspace.org.uk/ironpython/hosting_api.shtml"&gt;intro to embedding IronPython&lt;/a&gt; by Michael Foord&lt;/li&gt;&lt;li&gt;I tried loading &lt;a href="http://jdhardy.blogspot.com/2008/12/solving-zlib-problem-ironpythonzlib.html"&gt;Jeff Hardy's&lt;/a&gt; &lt;a href="http://bitbucket.org/jdhardy/ironpythonzlib/"&gt;IronPython.Zlib.dll&lt;/a&gt; 
using Assembly.LoadFile, that did not work but I could add it with clr.AddReference from the embedded Python code&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-14061637374794049?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/14061637374794049/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=14061637374794049' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/14061637374794049'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/14061637374794049'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2010/03/open-source-nlp-in-c-35-using-nltk.html' title='Open Source NLP in C# 3.5 using NLTK'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-4065720047424471322</id><published>2008-11-17T15:23:00.004-05:00</published><updated>2008-11-20T12:54:43.871-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Boost'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='image processing'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' 
term='Eclipse'/><category scheme='http://www.blogger.com/atom/ns#' term='STL'/><category scheme='http://www.blogger.com/atom/ns#' term='functional programming'/><category scheme='http://www.blogger.com/atom/ns#' term='VXL'/><category scheme='http://www.blogger.com/atom/ns#' term='GIL'/><category scheme='http://www.blogger.com/atom/ns#' term='C++'/><category scheme='http://www.blogger.com/atom/ns#' term='FLTK'/><category scheme='http://www.blogger.com/atom/ns#' term='ImageJ'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenCV'/><category scheme='http://www.blogger.com/atom/ns#' term='unit test'/><title type='text'>Computer vision C++ libraries review</title><content type='html'>I am trying to create an easy to use, minimalistic C++ cross-platform computer vision system, with a non-restrictive license.  My biggest challenge was to choose the best libraries and to get them to work together; this took some investigation and experimenting.  This posting is a brief description of my findings.&lt;br /&gt;&lt;br /&gt;This is what ShapeLogic C++ currently looks like:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_bxfuAiCUZwY/SSDaAPqpedI/AAAAAAAAABo/CXPRAr743rI/s1600-h/shapelogic-cpp-windows.jpg"&gt;&lt;img style="cursor: pointer; width: 320px; height: 228px;" src="http://4.bp.blogspot.com/_bxfuAiCUZwY/SSDaAPqpedI/AAAAAAAAABo/CXPRAr743rI/s320/shapelogic-cpp-windows.jpg" alt="" id="BLOGGER_PHOTO_ID_5269451261763746258" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Windows&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_bxfuAiCUZwY/SR9GlpJzL0I/AAAAAAAAABg/It1nng_CqsU/s1600-h/shapelogic-cpp-linux.jpg"&gt;&lt;img style="cursor: pointer; width: 320px; height: 208px;" src="http://3.bp.blogspot.com/_bxfuAiCUZwY/SR9GlpJzL0I/AAAAAAAAABg/It1nng_CqsU/s320/shapelogic-cpp-linux.jpg" alt="" 
id="BLOGGER_PHOTO_ID_5269007701562830658" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Linux&lt;br /&gt;&lt;br /&gt;In order to construct ShapeLogic C++, I had to make choices within the following categories:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Computer vision and image processing libraries&lt;/li&gt;&lt;li&gt;GUI libraries&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Unit test systems&lt;/li&gt;&lt;li&gt;Build systems&lt;/li&gt;&lt;li&gt;Compilers under Windows&lt;br /&gt;&lt;/li&gt;&lt;li&gt;C++ IDEs under UNIX&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;a href="http://rsb.info.nih.gov/ij/"&gt;ImageJ,&lt;/a&gt; the Java open source image processing tool, is the inspiration for the first part of my work: it is very simple to learn, use and program in. This is a follow up to my last posting: &lt;a href="http://samibadawi.blogspot.com/2008/09/computer-vision-c-vs-java-review.html"&gt;Computer Vision C++ vs Java&lt;/a&gt;. The result of my work is released as an open source project &lt;a href="http://www.shapelogic.org/cpp.html"&gt;ShapeLogic C++,&lt;/a&gt; under the MIT license.&lt;br /&gt;&lt;h3&gt;Computer vision and image processing libraries&lt;br /&gt;&lt;/h3&gt;The candidates I considered were:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;GIL, Generic Image Library&lt;br /&gt;&lt;/li&gt;&lt;li&gt;OpenCV&lt;/li&gt;&lt;li&gt;VXL&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;GIL, Generic Image Library&lt;br /&gt;&lt;/h4&gt; GIL, &lt;a href="http://opensource.adobe.com/wiki/display/gil/Generic+Image+Library"&gt;Generic Image Library&lt;/a&gt; by Adobe.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Very non intrusive, only based on header files&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Puts a wrapper around most image formats&lt;/li&gt;&lt;li&gt;You can write an algorithm once and it will work for most image types&lt;/li&gt;&lt;li&gt;Part of Boost since 1.35&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br 
/&gt;&lt;ul&gt;&lt;li&gt;Does not come with a lot of image processing algorithms&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;OpenCV&lt;br /&gt;&lt;/h4&gt; &lt;a href="http://sourceforge.net/projects/opencvlibrary/"&gt; OpenCV&lt;/a&gt;, Open Computer Vision by Intel.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt; Very simple&lt;/li&gt;&lt;li&gt;Works with both C and C++&lt;/li&gt;&lt;li&gt;Very broad range of algorithms&lt;/li&gt;&lt;li&gt;Complex algorithms: face detection, convexity defects&lt;/li&gt;&lt;li&gt;Very popular&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;You have to use OpenCV's IplImage&lt;/li&gt;&lt;li&gt;IplImage byte order is BGR instead of the normal RGB&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;VXL, Vision X Library&lt;br /&gt;&lt;/h4&gt;&lt;a href="http://vxl.sourceforge.net/"&gt;VXL&lt;/a&gt;, a combination of 2 big older vision libraries, TargetJR and IUE&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Well tested technology&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Simpler build process using &lt;a href="http://www.cmake.org/"&gt;cmake&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Uses modern programming techniques: classes, templates and &lt;a href="http://www.sgi.com/tech/stl/"&gt;STL&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It has a lot of functionality&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Simple to get started with&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It does not use the normal STL; in order to work with different compilers it had to make its own version with different names.&lt;/li&gt;&lt;li&gt;Class structure is somewhat complex.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Choice of computer vision library for ShapeLogic C++&lt;/span&gt;&lt;br /&gt;OpenCV for 
existing image processing and vision algorithms. GIL for writing new algorithms.&lt;br /&gt;&lt;h3&gt;Cross platform GUI&lt;/h3&gt;The candidates I considered were:&lt;ul&gt;&lt;li&gt;GIMP plugin&lt;/li&gt;&lt;li&gt;GTK+, GIMP toolkit&lt;/li&gt;&lt;li&gt;FLTK&lt;/li&gt;&lt;li&gt;HighGui from OpenCV&lt;/li&gt;&lt;li&gt;PhotoShop plugin&lt;/li&gt;&lt;li&gt;wxWidgets&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Run ShapeLogic as a GIMP plugin&lt;/h4&gt;&lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.gimp.org/"&gt;GIMP&lt;/a&gt; is the main cross platform OSS image editing program.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It is in wide use.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Has a lot of powerful features including scripting functionality in Scheme and Python.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;From a user perspective this would be an excellent choice.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;GIMP is GPL, but you could have a wrapper around plugins in order to access them as GIMP plugins.&lt;/li&gt;&lt;li&gt;The plugin works with tiles, which gives good performance, but does not fit well with the way either OpenCV or GIL processes images.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;GTK+, GIMP Toolkit&lt;br /&gt;&lt;/h4&gt;&lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.gtk.org/"&gt;GTK+&lt;/a&gt; is a great looking and very powerful framework that works on: Windows, Linux, Mac, a.o.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is written in C and has a homegrown object system, which is not type safe.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;GTKMM C++ wrapper around GTK+&lt;br /&gt;&lt;/h4&gt; &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.gtkmm.org/"&gt;GTKMM&lt;/a&gt; is a great looking and very 
powerful framework that works on Windows, Linux, Mac and others.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It feels natural to program in for a C++ programmer.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The class hierarchy is somewhat deep since it is built on top of GTK.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;FLTK, Fast Light Toolkit&lt;br /&gt;&lt;/h4&gt;   &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.fltk.org/"&gt;FLTK&lt;/a&gt; is very lightweight.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Very clean C++; you actually have a main().&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Native C++ built on top of X11 or Windows.&lt;/li&gt;&lt;li&gt;Fluid, a simple GUI builder&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Not as many widgets.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Dated look.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;HighGui from OpenCV&lt;br /&gt;&lt;/h4&gt; &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Very lightweight.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;There is some functionality for displaying images and video, and an event handler for mouse events.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Does not come with a menu system.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Run ShapeLogic as a PhotoShop plugin&lt;/h4&gt;&lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.adobe.com/products/photoshop/index.html"&gt;PhotoShop&lt;/a&gt; is the main image editing program.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It is in wide use and has a lot of powerful features, including macros.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;From a user perspective this would be an excellent choice.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The 
PhotoShop SDK is not freely available; you have to apply to get it.&lt;/li&gt;&lt;li&gt;The plugin does not fit well with the way either OpenCV or GIL processes images.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;wxWidgets&lt;br /&gt;&lt;/h4&gt;   &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.wxwidgets.org/" class="externalLink"&gt;wxWidgets&lt;/a&gt; is a full featured GUI toolkit, built on top of native toolkits: Win32, Mac OS X, GTK+, X11, Motif, WinCE and more.&lt;/li&gt;&lt;li&gt;Looks good and modern.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Big community.&lt;/li&gt;&lt;li&gt;Several GUI builders.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The programming style is close to Windows MFC programming.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;There are many layers.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Choice&lt;/span&gt;&lt;br /&gt;This was a hard choice and I went back and forth between FLTK and wxWidgets, but went with FLTK. All the GUI code is separate from the image processing code, so if I wanted to change from FLTK to another toolkit later, the change should not be too dramatic.&lt;br /&gt;&lt;h3&gt;C++ unit test frameworks&lt;br /&gt;&lt;/h3&gt;There are a lot of different choices and no clear leader. 
Some of the candidates were:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Boost.Test&lt;/li&gt;&lt;li&gt;CppUnit&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Google C++ Testing Framework&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Boost.Test&lt;br /&gt;&lt;/h4&gt;    &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.boost.org/doc/libs/1_37_0/libs/test/doc/html/index.html"&gt;Boost.Test&lt;/a&gt; is part of the Boost library.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Powerful with a lot of options.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;You have to manually set up test suites.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It is somewhat heavy.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The documentation is extensive but not easy to read.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;CppUnit&lt;br /&gt;&lt;/h4&gt;     &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://sourceforge.net/projects/cppunit"&gt;CppUnit&lt;/a&gt; follows the standard xUnit unit testing convention.&lt;/li&gt;&lt;li&gt;Integration with Eclipse CDT.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;You have to manually set up test suites.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It is an extra library to install.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Google C++ Testing Framework&lt;br /&gt;&lt;/h4&gt;     &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/googletest/"&gt;Google test&lt;/a&gt; follows the standard xUnit unit testing convention.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Strong focus on simplicity.&lt;/li&gt;&lt;li&gt;Documentation is short and easy to read.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is an extra library to 
install.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Choice of C++ unit test framework for ShapeLogic C++&lt;br /&gt;&lt;/h4&gt;     I spent quite a bit of time reading the Boost Test documentation. Finally I tried the Google C++ Testing Framework and got it working very quickly.&lt;br /&gt;&lt;h3&gt;Build system&lt;br /&gt;&lt;/h3&gt;The candidates I considered were:&lt;ul&gt;&lt;li&gt;Boost build&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Make&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Boost build&lt;br /&gt;&lt;/h4&gt;     &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.boost.org/doc/tools/build/doc/html/index.html"&gt;Boost build&lt;/a&gt; is part of Boost.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Clean design, made as a Make replacement.&lt;/li&gt;&lt;li&gt;Works on most platforms and with most compilers.&lt;/li&gt;&lt;li&gt;The scripts are pretty short.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There is a learning curve.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Make&lt;br /&gt;&lt;/h4&gt;     &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.gnu.org/software/make/"&gt;Make&lt;/a&gt; is the standard build tool for C++.&lt;/li&gt;&lt;li&gt;Widely used.&lt;/li&gt;&lt;li&gt;Works with Eclipse, MSVC, NetBeans.&lt;/li&gt;&lt;li&gt;Short scripts.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It has gotten messy over time.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Shell script dependency.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;There is too much magic for my taste.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Choice of build system for ShapeLogic C++&lt;br /&gt;&lt;/h4&gt;     I chose to go with Boost Build because it has a cleaner design, but Make looks very competitive when looking over the pros and cons.&lt;br 
/&gt;&lt;h3&gt;Compilers under Windows&lt;br /&gt;&lt;/h3&gt;In order to compile Boost you need a pretty modern and standards-compliant compiler. The candidates that I looked at are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Cygwin GCC&lt;br /&gt;&lt;/li&gt;&lt;li&gt;MinGW GCC&lt;/li&gt;&lt;li&gt;MSVC Microsoft Visual C++&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;Cygwin GCC&lt;br /&gt;&lt;/h4&gt;      &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.cygwin.com/"&gt;Cygwin&lt;/a&gt; GCC is close to GCC under UNIX&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Uses emulation of UNIX system calls.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;You can only use it to build GPL compatible applications.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;MinGW GCC&lt;br /&gt;&lt;/h4&gt;      &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.mingw.org/"&gt;MinGW&lt;/a&gt; integrates well with Eclipse CDT.&lt;/li&gt;&lt;li&gt;Works more natively with Windows.&lt;/li&gt;&lt;li&gt;Most libraries build fine with MinGW.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It was supposed to be able to build FLTK, but I tried a few times and could not get it to work.&lt;/li&gt;&lt;li&gt;In order to run Make files you also have to install MSYS, which is a minimal shell.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;MSVC, Microsoft Visual C++&lt;br /&gt;&lt;/h4&gt;       &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://msdn.microsoft.com/en-us/visualc/default.aspx"&gt;MSVC&lt;/a&gt; is a high quality compiler.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Most used compiler under Windows.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;There is a free &lt;a 
href="http://www.microsoft.com/express/vc/"&gt;Express&lt;/a&gt; version.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There seem to be some restrictions in the Express version that I did not quite understand.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;    &lt;h4&gt;Choice of compiler under Windows for ShapeLogic C++&lt;br /&gt;&lt;/h4&gt;      MSVC.&lt;br /&gt;&lt;h3&gt;C++ IDEs under UNIX&lt;br /&gt;&lt;/h3&gt;  The candidates I considered were:&lt;ul&gt;&lt;li&gt;Eclipse 3.4&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Emacs / &lt;a href="http://www.xemacs.org/"&gt;Xemacs&lt;/a&gt;&lt;/li&gt;&lt;li&gt;NetBeans 6.1&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;h4&gt;Eclipse 3.4&lt;br /&gt;&lt;/h4&gt;       &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.eclipse.org/cdt/"&gt;Eclipse 3.4 CDT&lt;/a&gt; has a good debugger.&lt;/li&gt;&lt;li&gt;Easy to jump from classes to files defining the classes.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;   &lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Not nearly as good as Eclipse for Java.&lt;/li&gt;&lt;li&gt;Unstable under Linux AMD64.&lt;/li&gt;&lt;/ul&gt;  &lt;h4&gt;Emacs&lt;br /&gt;&lt;/h4&gt;       &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.gnu.org/software/emacs/"&gt;Emacs&lt;/a&gt;  is a powerful tool that runs in a terminal.&lt;/li&gt;&lt;li&gt;Takes up fewer resources.&lt;/li&gt;&lt;li&gt;Not dependent on Java.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The Java based IDEs have more features.&lt;/li&gt;&lt;li&gt;Demands more knowledge to use.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;NetBeans 6.1&lt;br /&gt;&lt;/h4&gt;        &lt;span style="font-weight: bold;"&gt;Pros&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a 
href="http://www.netbeans.org/features/cpp/"&gt;NetBeans 6.1&lt;/a&gt; seems a little lighter than Eclipse.&lt;/li&gt;&lt;/ul&gt;  &lt;span style="font-weight: bold;"&gt;Cons&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is made to work with Make files, and ShapeLogic C++ uses Boost Build / Bjam.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;     &lt;h4&gt;Choice of IDE under UNIX for ShapeLogic C++&lt;br /&gt;&lt;/h4&gt;       Eclipse.&lt;br /&gt;&lt;h3&gt;Summary of libraries and tools used&lt;br /&gt;&lt;/h3&gt;   &lt;ul&gt;&lt;li&gt;&lt;a href="http://www.boost.org/" class="externalLink"&gt;Boost&lt;/a&gt;  the C++ library&lt;/li&gt;&lt;li&gt;&lt;a href="http://opensource.adobe.com/wiki/display/gil/Generic+Image+Library" class="externalLink"&gt; Generic Image Library&lt;/a&gt;  for writing new image processing code&lt;/li&gt;&lt;li&gt;&lt;a href="http://opencvlibrary.sourceforge.net/" class="externalLink"&gt; OpenCV&lt;/a&gt;  for existing computer vision algorithms&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.fltk.org/" class="externalLink"&gt; FLTK, Fast Light Toolkit&lt;/a&gt;  lightweight cross platform GUI&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/googletest/" class="externalLink"&gt; Google C++ Testing Framework&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.boost.org/doc/tools/build/index.html" class="externalLink"&gt; Boost.build v2&lt;/a&gt;  for command line based build system&lt;/li&gt;&lt;/ul&gt; &lt;h3&gt;Status of ShapeLogic C++&lt;br /&gt;&lt;/h3&gt; &lt;a href="http://code.google.com/p/shapelogic-cpp/downloads/list"&gt;ShapeLogic C++ 0.4&lt;/a&gt; is the first alpha release. 
It can do some useful work, but it is still mainly an example application.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Currently has&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Comes with some image processing operations&lt;/li&gt;&lt;li&gt;Comes with 3 brushes&lt;/li&gt;&lt;li&gt;It is pretty simple to program an image processing algorithm&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;    &lt;span style="font-weight: bold;"&gt;Missing&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Drawing is currently slow and there is only one pen size&lt;/li&gt;&lt;li&gt;None of the ShapeLogic Java algorithms have been ported yet&lt;/li&gt;&lt;li&gt;Documentation is poor&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Hardest problems&lt;br /&gt;&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;  Learning how FLTK works&lt;/li&gt;&lt;li&gt;Building a cross platform C++ build script covering several libraries&lt;/li&gt;&lt;/ul&gt;None of these problems will affect ShapeLogic users.&lt;br /&gt;&lt;h3&gt;Porting computer vision code from Java to C++&lt;br /&gt;&lt;/h3&gt;Before I started porting ShapeLogic from Java, I thought that C++ was moving towards becoming a legacy language.  What I have learned from this work is that C++ has advanced substantially since 2002, when I last used it professionally. C++ still seems competitive, at least in computer vision, and according to my old video game colleagues also in games, where I thought that C# might have taken the lead by now. Both C++ and Java have substantial advantages.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;C++&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;OpenCV has a lot of vision algorithms, e.g. 
face recognition&lt;/li&gt;&lt;li&gt;C++ is faster than Java&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Better for video processing&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Programs are shorter&lt;/li&gt;&lt;li&gt;Generic programming working on primitive types&lt;br /&gt;&lt;/li&gt;&lt;li&gt;You can make build scripts that build under both Windows and UNIX&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Java / ImageJ&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;ImageJ has more open source algorithms for medical image processing&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Better support for medical image file formats under ImageJ&lt;/li&gt;&lt;li&gt;IDEs are better under Java&lt;/li&gt;&lt;li&gt;Build process is simpler than C++&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Simpler language&lt;/li&gt;&lt;li&gt;Better support for parallel processing&lt;/li&gt;&lt;li&gt;A lot easier to dynamically load plugins&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;  The next step is to port my framework for declarative programming, which is based on lazy streams, and to port the &lt;a href="http://www.shapelogic.org/particle.html"&gt;Color Particle Analyzer&lt;/a&gt;. C++ / Boost have good support for functional programming techniques: &lt;a href="http://www.boost.org/doc/libs/1_37_0/boost/bind.hpp"&gt;Boost.Bind&lt;/a&gt; and &lt;a href="http://www.boost.org/doc/libs/1_37_0/doc/html/lambda.html"&gt;Boost.Lambda&lt;/a&gt;, and the &lt;a href="http://spirit.sourceforge.net/dl_docs/phoenix-2/libs/spirit/phoenix/doc/html/index.html"&gt;Phoenix&lt;/a&gt; library has just been accepted into Boost.  
When complete, I will do another posting about how it went.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;br /&gt;http://www.shapelogic.org&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-4065720047424471322?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/4065720047424471322/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=4065720047424471322' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/4065720047424471322'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/4065720047424471322'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/11/computer-vision-c-libraries-review.html' title='Computer vision C++ libraries review'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_bxfuAiCUZwY/SSDaAPqpedI/AAAAAAAAABo/CXPRAr743rI/s72-c/shapelogic-cpp-windows.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-2229200947310071334</id><published>2008-09-02T06:43:00.031-04:00</published><updated>2008-11-18T23:27:14.794-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Boost'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category 
scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='Eclipse'/><category scheme='http://www.blogger.com/atom/ns#' term='STL'/><category scheme='http://www.blogger.com/atom/ns#' term='functional programming'/><category scheme='http://www.blogger.com/atom/ns#' term='computer art'/><category scheme='http://www.blogger.com/atom/ns#' term='Processing'/><category scheme='http://www.blogger.com/atom/ns#' term='VXL'/><category scheme='http://www.blogger.com/atom/ns#' term='GIL'/><category scheme='http://www.blogger.com/atom/ns#' term='C++0x standard'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='C++'/><category scheme='http://www.blogger.com/atom/ns#' term='FLTK'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenCV'/><category scheme='http://www.blogger.com/atom/ns#' term='Openframeworks'/><category scheme='http://www.blogger.com/atom/ns#' term='TR1'/><title type='text'>Computer Vision C++ vs Java review</title><content type='html'>In 2007 I created an open source computer vision project, &lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt;, built in Java to work with &lt;a href="http://rsb.info.nih.gov/ij/"&gt;ImageJ&lt;/a&gt;.  This setup has been very easy to work with and very productive.  Bjarne Stroustrup, the creator of C++, gave an interview about the &lt;a href="http://www.devx.com/SpecialReports/Article/38813"&gt;new features in the C++0x standard and TR1&lt;/a&gt;.  C++ now has a lot of innovative programming constructs, e.g. template metaprogramming, lambda functions, concepts and traits.  
When I found out that "axiom" is going to be a keyword in C++ my inner mathematician demanded that I take a second look at C++ in connection with computer vision.&lt;br /&gt;&lt;br /&gt;This post is a review of my personal past experience with computer vision in C++ and Java.  I did my master's thesis in computer vision in the early 90s, but I ended up working in other fields: video games, Internet and finance, which only left a little time to do vision in my free time.  While both C++ and Java were good choices for professional vision programmers, several of the approaches I chose caused me to run out of steam.  I also tried to do computer vision with functional, declarative and hybrid languages, e.g. Oz, Scheme and Scala, but will not cover that here.&lt;br /&gt;&lt;h3&gt;Borland C++ early 90s&lt;/h3&gt;C++ did not have STL or any other standard library, so I used Borland's OWL library for images and for the application.  I used C++ templates, classes with multiple inheritance and RTTI just to set up basic container functionality.  There were a few books that had some free C or C++ source code for image processing and vision, but they did not spawn a user community.  I did not really get to do anything interesting.&lt;br /&gt;&lt;h3&gt;JAI, Java Advanced Imaging late 90s&lt;br /&gt;&lt;/h3&gt;I was very excited when Java came around; this was the language to cure all programming ailments. Now they had added a library that could be used for vision and a lot of big companies were sponsoring JAI.  It turned out to be a very complex framework with a deep class hierarchy. I spent a lot of time reading the manual trying to find out how to get access to image pixels.  I gave up using it and the framework never gained much popularity.&lt;br /&gt;&lt;h3&gt;VXL, C++, STL, Boost, Python, GCC, Linux around 2000&lt;br /&gt;&lt;/h3&gt; Open source software (OSS) had started to become prominent. 
There were 2 OSS libraries:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;OpenCV (Open Computer Vision), which was still in alpha.&lt;/li&gt;&lt;li&gt;VXL (Vision X Library), which was a merger of 2 big non-OSS libraries, TargetJR and IUE.&lt;/li&gt;&lt;/ul&gt;VXL finally got into beta and I tried to combine it with Python for more high level processing.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tools needed for build and GUI&lt;/span&gt; &lt;ul&gt;&lt;li&gt;&lt;a href="http://vxl.sourceforge.net/"&gt;VXL&lt;/a&gt; does builds using &lt;a href="http://www.cmake.org/HTML/index.html"&gt;CMake&lt;/a&gt; to create Make files&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.boost.org/"&gt;Boost&lt;/a&gt; uses BJam to do builds&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; bindings using &lt;a href="http://www.boost.org/doc/libs/1_36_0/libs/python/pyste/index.html"&gt;Pyste&lt;/a&gt; from Boost&lt;/li&gt;&lt;li&gt;VXL used &lt;a href="http://www.fltk.org/"&gt;FLTK&lt;/a&gt; and OpenGL as a GUI&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Problems encountered&lt;/span&gt;&lt;ul&gt;&lt;li&gt;It was hard to get the different build systems, CMake, Bjam and Make, to work together&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gcc.gnu.org/"&gt;GCC&lt;/a&gt; 3.1 and 3.2 core dumped when compiling certain Boost classes&lt;/li&gt;&lt;li&gt;Python bindings worked for simple C++ classes, but not for the nested template classes in VXL&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It was hard to debug the template programs&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Emacs was not really as easy to use as Visual Studio&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Bad drivers for OpenGL on Linux&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; I actually got some examples set up, but spent more time fighting with the tool stack than doing vision work.&lt;br /&gt;&lt;h3&gt;ImageJ in Java around 2004&lt;br /&gt;&lt;/h3&gt;  A colleague showed me a visualization tool he had worked on and said 
that he did it in around 1 month.  I barely believed him, but tried the underlying framework, ImageJ.  To my big surprise I was up and running and doing real work in a few hours.  ImageJ just got things right.  It was built using pure Java by one man, Wayne Rasband.  It is very easy to work with and very modular, so a lot of people have made plugins and there is a vibrant development community.  When I started working on ShapeLogic that was the best choice.&lt;br /&gt;&lt;h3&gt;OpenCV, GIL Generic Image Library, Boost and Eclipse in C++ 2008&lt;br /&gt;&lt;/h3&gt;In light of the advances in the C++ language and tools, I have decided to try it again.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;C++ image library choices&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://opencvlibrary.sourceforge.net/"&gt;OpenCV, Open Computer Vision&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.openframeworks.cc/"&gt;Openframeworks&lt;/a&gt; built on top of OpenCV&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://opensource.adobe.com/wiki/display/gil/Generic+Image+Library"&gt;GIL Generic Image Library&lt;/a&gt;&lt;/li&gt;&lt;li&gt;VXL&lt;/li&gt;&lt;/ul&gt;I chose to start with OpenCV, made by Intel, and GIL, made by Adobe but a part of Boost since 1.35.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;C++ IDE tried&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Eclipse 3.4&lt;br /&gt;&lt;/li&gt;&lt;li&gt;NetBeans 6.1&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Eclipse worked better for me; it has its own build system so you do not have to mess with Make files.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;C++ cross platform GUIs&lt;/span&gt; &lt;ul&gt;&lt;li&gt;&lt;a href="http://www.fltk.org/"&gt;FLTK Fast Light Toolkit&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.wxwidgets.org/"&gt;wxWidgets&lt;/a&gt;&lt;/li&gt;&lt;li&gt;HighGui from OpenCV &lt;/li&gt;&lt;/ul&gt;Not sure which one will be best for my purpose.&lt;br 
/&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;First attempt&lt;/span&gt;&lt;br /&gt;I tried Boost, OpenCV and GIL and got them up and running under both Linux and Windows in a few hours.  &lt;a href="http://www.eclipse.org/cdt/"&gt;Eclipse CDT C++ IDE&lt;/a&gt; works great.&lt;br /&gt;&lt;h3&gt;Porting ShapeLogic algorithms to C++ version&lt;br /&gt;&lt;/h3&gt; My plan is to port some algorithms from ShapeLogic from Java to C++.   ShapeLogic is a toolkit for declarative programming, specialized for vision.  In principle you should be able to make a list of rules for categorizing, say, the shape of a particle in a particle analyzer.  You put them in a database or a flat file and the same rules should work for the C++ and Java versions of ShapeLogic.  In practice this might not work out.&lt;br /&gt;&lt;h3&gt;Advantages of C++ and Java&lt;br /&gt;&lt;/h3&gt; This is a loose first assessment.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Constructs used in ShapeLogic that are missing or less convenient in C++&lt;/span&gt; &lt;ul&gt;&lt;li&gt; Uniform cross platform GUI&lt;/li&gt;&lt;li&gt; Dynamic cross platform libraries&lt;/li&gt;&lt;li&gt;HashTable&lt;/li&gt;&lt;li&gt; Reflection&lt;/li&gt;&lt;li&gt; Garbage collection&lt;/li&gt;&lt;li&gt;&lt;a href="http://antlr.org/"&gt;Antlr&lt;/a&gt; for parsing logic language&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Advantages of C++ over Java in vision&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Substantially higher speed&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Better handling of video&lt;/li&gt;&lt;li&gt;Used more frequently for computer vision programming&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Good tracking and face recognition algorithms in OpenCV&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;For me, Java has been very good for doing medical image processing algorithms.  I have heard conflicting evidence about whether it is feasible to do computer vision on video using Java.  
Video handling in Java has been bad up to now; this is supposed to be fixed with the new &lt;a href="http://www.javafx.com/"&gt;JavaFX&lt;/a&gt;.  &lt;a href="http://www.youtube.com/watch?v=4HUPX9sE-Cw&amp;amp;feature=related"&gt;Shadow Monsters&lt;/a&gt; is a computer vision based art piece that takes video footage of the viewers' silhouettes and adds monsters to them. I saw it on display at the &lt;a href="http://moma.org/"&gt;Museum of Modern Art&lt;/a&gt;. It was programmed using &lt;a href="http://processing.org/"&gt;Processing&lt;/a&gt;, which is a Java based image processing tool for artists.   I discussed the issue with a computer programmer / artist who said that he had tried to do a motion algorithm in Processing and had to port it to C++ based Openframeworks since Java was too slow.   After being discouraged by my prior attempts to do vision in C++, I am very happy to see the dramatic developments in C++, and I want to see if it is suitable for a simple port of ShapeLogic algorithms. The result of this C++ port will be covered in 2 postings:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://samibadawi.blogspot.com/2008/11/computer-vision-c-libraries-review.html"&gt;Computer vision C++ libraries review&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Declarative framework  and &lt;a href="http://www.shapelogic.org/particle.html"&gt;Particle Analyzer&lt;/a&gt; Java to C++ port&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;-Sami Badawi&lt;br /&gt;http://www.shapelogic.org&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-2229200947310071334?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/2229200947310071334/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=2229200947310071334' title='3 Comments'/><link 
rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/2229200947310071334'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/2229200947310071334'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/09/computer-vision-c-vs-java-review.html' title='Computer Vision C++ vs Java review'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-8282007904140447638</id><published>2008-07-07T02:43:00.006-04:00</published><updated>2008-07-07T04:25:53.313-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='image processing'/><category scheme='http://www.blogger.com/atom/ns#' term='particle analyzer'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='ImageJ'/><category scheme='http://www.blogger.com/atom/ns#' term='cell'/><category scheme='http://www.blogger.com/atom/ns#' term='medical'/><title type='text'>ShapeLogic 1.2 with color particle analyzer released</title><content type='html'>&lt;p&gt;Here are the release notes for &lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt; 1.2&lt;br /&gt;&lt;/p&gt;&lt;h3&gt;Changes&lt;br /&gt;&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.shapelogic.org/particle.html" rel="nofollow"&gt;Particle analyzer&lt;/a&gt; working directly on color and gray scale images without manual user intervention 
&lt;/li&gt;&lt;li&gt;Both particle counter and particle analyzer now take parameters and print reports about each particle's color, area and standard deviation to the result table &lt;/li&gt;&lt;li&gt;Color replacer replaces one color within a tolerance with another color. Parameter input dialog with preview check box &lt;/li&gt;&lt;li&gt;Organize plugins and macros under ShapeLogic and ShapeLogicOld menus; until 1.1 they were all placed under the shapelogic menu &lt;/li&gt;&lt;li&gt;ShapeLogic still has beta development status&lt;/li&gt;&lt;/ul&gt; The particle analyzer in ShapeLogic v 1.2 has gone through limited testing and seems to work well. There is still a bug in the edge tracer.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Using particle analyzer as an ImageJ plugin&lt;br /&gt;&lt;/h3&gt;&lt;p&gt;The particle analyzer was tested on embryos.jpg, one of the particle sample images from ImageJ:&lt;/p&gt; &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_bxfuAiCUZwY/SCQ60Kn8XHI/AAAAAAAAABI/MAAG1JjQPcU/s1600-h/embryos.jpg"&gt;&lt;img style="cursor: pointer;" src="http://bp2.blogger.com/_bxfuAiCUZwY/SCQ60Kn8XHI/AAAAAAAAABI/MAAG1JjQPcU/s320/embryos.jpg" alt="embryos.jpg" id="BLOGGER_PHOTO_ID_5198344537771891826" border="0" /&gt;&lt;/a&gt;&lt;p&gt;To run it from &lt;a href="http://rsbweb.nih.gov/ij/"&gt;ImageJ&lt;/a&gt; select "Color Particle Analyzer" in the ShapeLogic menu:&lt;/p&gt;&lt;br /&gt;&lt;p&gt;&lt;img src="http://shapelogic.googlecode.com/svn/wiki/images/particles/shapelogicmenu.png" /&gt; &lt;/p&gt;&lt;p&gt;First a particle count dialog is displayed: &lt;/p&gt;&lt;p&gt;&lt;img src="http://shapelogic.googlecode.com/svn/wiki/images/particles/particleCountDialog.png" 
/&gt; &lt;/p&gt;&lt;p&gt;Here is the result of running the non-customized particle analyzer on it. This is written to a result table that can be exported to Excel: &lt;/p&gt;&lt;p&gt;&lt;img src="http://shapelogic.googlecode.com/svn/wiki/images/particles/particleAnalyzerResult.png" /&gt; &lt;/p&gt;&lt;p&gt;The categories for the particles are only examples; it is easy to set up different rules for categorizing particles. In ShapeLogic 1.3 there will be custom rules to recognize specific cells. &lt;/p&gt;&lt;p&gt;ShapeLogic 1.2 also contains the second version of a color particle counter. It prints a smaller report of each particle's properties. &lt;/p&gt;&lt;p&gt;&lt;img src="http://shapelogic.googlecode.com/svn/wiki/images/particles/particleCounterResult.png" /&gt; &lt;/p&gt;&lt;h3&gt;Plans for next release&lt;br /&gt;&lt;/h3&gt; The next release, ShapeLogic v 1.3, will include a more mature particle analyzer that comes with custom rules to recognize specific cells.&lt;br /&gt;&lt;h3&gt;Seeking particle images&lt;br /&gt;&lt;/h3&gt;In order to create these rules, I am looking for images of particles on a relatively uniform background. 
Please let me know if you have sample images that I could work from, preferably standard images like the embryo sample image that comes with ImageJ.&lt;br /&gt;&lt;h3&gt;Possible future plans for particle analyzer&lt;br /&gt;&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;Create rule for recognizing cells using neural networks or machine learning techniques&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Be able to handle a background that is not uniform, and cell organelles&lt;/li&gt;&lt;li&gt;Incorporate reasoning under uncertainty using the lazy stream library&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Find overlapping particles and distinguish them as separate &lt;/li&gt;&lt;/ul&gt;&lt;a href="http://code.google.com/p/shapelogic/downloads/list"&gt;Download ShapeLogic 1.2&lt;/a&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;-Sami Badawi&lt;br /&gt;http://www.shapelogic.org&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-8282007904140447638?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/8282007904140447638/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=8282007904140447638' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/8282007904140447638'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/8282007904140447638'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/07/shapelogic-12-with-color-particle.html' title='ShapeLogic 1.2 with color particle analyzer released'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_bxfuAiCUZwY/SCQ60Kn8XHI/AAAAAAAAABI/MAAG1JjQPcU/s72-c/embryos.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-3601627100952497394</id><published>2008-05-09T07:05:00.009-04:00</published><updated>2008-05-09T10:52:02.694-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='image processing'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='particle counter'/><category scheme='http://www.blogger.com/atom/ns#' term='ImageJ'/><category scheme='http://www.blogger.com/atom/ns#' term='NetBeans'/><category scheme='http://www.blogger.com/atom/ns#' term='medical'/><title type='text'>ShapeLogic 1.1 with particle counter released</title><content type='html'>&lt;p&gt;Here are the release notes for &lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt; 1.1&lt;br /&gt;&lt;/p&gt;&lt;h3&gt;Changes&lt;br /&gt;&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.shapelogic.org/particle.html" rel="nofollow"&gt;Particle counter&lt;/a&gt; working directly on color and gray scale images without manual user intervention&lt;/li&gt;&lt;li&gt;Particle counter finds average color, standard deviation, area and location for each particle&lt;/li&gt;&lt;li&gt;Framework to build more advanced particle counters and particle analyzers&lt;/li&gt;&lt;li&gt;Color clustering using the K-means algorithm&lt;/li&gt;&lt;li&gt;Background color finder&lt;/li&gt;&lt;li&gt;Extend all the image processing algorithms in ShapeLogic 
to work in both ImageJ and in plain Java&lt;/li&gt;&lt;li&gt;Better support for &lt;a href="http://www.shapelogic.org/setup.html#NetBeans"&gt;NetBeans&lt;/a&gt; &lt;/li&gt;&lt;li&gt;ShapeLogic still has beta development status&lt;/li&gt;&lt;li&gt;29000 lines of Java code&lt;/li&gt;&lt;li&gt;440 unit tests that all pass on a local machine&lt;/li&gt;&lt;/ul&gt;The particle counter in ShapeLogic v 1.1 has gone through limited testing, and seems to &lt;a href="http://www.shapelogic.org/particle.html#Test"&gt;work well&lt;/a&gt;, though &lt;a href="http://www.shapelogic.org/particle.html#parameters"&gt;tweaking the parameters&lt;/a&gt; is still a bit clumsy. Users looking for a mature particle counter should probably wait for ShapeLogic v 1.2.&lt;br /&gt;&lt;h3&gt;Test on sample images from ImageJ&lt;br /&gt;&lt;/h3&gt;&lt;p&gt;The particle counter was tested on the particle images from ImageJ:&lt;/p&gt; &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_bxfuAiCUZwY/SCQ60Kn8XHI/AAAAAAAAABI/MAAG1JjQPcU/s1600-h/embryos.jpg"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_bxfuAiCUZwY/SCQ60Kn8XHI/AAAAAAAAABI/MAAG1JjQPcU/s320/embryos.jpg" alt="embryos.jpg" id="BLOGGER_PHOTO_ID_5198344537771891826" border="0" /&gt;&lt;/a&gt;&lt;p&gt;embryos.jpg. The un-tweaked particle counter in ShapeLogic 1.1 found&lt;br /&gt;&lt;/p&gt;&lt;p&gt;particle count = 9&lt;br /&gt;&lt;/p&gt;&lt;p&gt;embryos.jpg contains 6 particles and a few shadows.&lt;/p&gt; &lt;p&gt;After changing the parameter setting it found&lt;br /&gt;&lt;/p&gt;&lt;p&gt;particle count = 5&lt;br /&gt;&lt;/p&gt;&lt;p&gt;which is the correct value, since ShapeLogic 1.1 cannot split overlapping particles.&lt;/p&gt;&lt;h3&gt;Direction&lt;br /&gt;&lt;/h3&gt;&lt;p&gt;This is the first use of ShapeLogic in medical image processing. 
The next few releases should also be focused on the particle counter, making it more robust and automatic. This is the plan for ShapeLogic 1.2:&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;More testing and tweaking of the particle counter&lt;/li&gt;&lt;li&gt;Make it easier to set parameters for the particle counter in a macro or a configuration file&lt;/li&gt;&lt;li&gt;Print a report about each particle's color, area and standard deviation to a file&lt;/li&gt;&lt;li&gt;Different implementations of particle counters&lt;/li&gt;&lt;li&gt;Vectorize particles by tracing the edge&lt;/li&gt;&lt;li&gt;Filter particles based on geometric properties of the edge using the same techniques as the letter matcher&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;a href="http://code.google.com/p/shapelogic/downloads/list"&gt;Download ShapeLogic 1.1&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;br /&gt;http://www.shapelogic.org &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-3601627100952497394?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/3601627100952497394/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=3601627100952497394' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/3601627100952497394'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/3601627100952497394'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/05/shapelogic-11-with-particle-counter.html' title='ShapeLogic 1.1 with particle counter released'/><author><name>Sami 
Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_bxfuAiCUZwY/SCQ60Kn8XHI/AAAAAAAAABI/MAAG1JjQPcU/s72-c/embryos.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-4454711251766383003</id><published>2008-03-07T08:56:00.006-05:00</published><updated>2008-03-07T16:49:26.744-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='functional programming'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='closures'/><category scheme='http://www.blogger.com/atom/ns#' term='jruby'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='lazy'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='stream'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='groovy'/><category scheme='http://www.blogger.com/atom/ns#' term='scripting'/><title type='text'>ShapeLogic 1.0 with stream based rules released</title><content type='html'>Here are the release notes for &lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt; 1.0&lt;br /&gt;&lt;h3&gt;Changes&lt;br /&gt;&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;Rules for image processing have been migrated; previously they were implemented as goal driven tasks with sub tasks. 
Version 1.0 uses lazy streams, which are simpler and more powerful.&lt;/li&gt;&lt;li&gt;Letter match example now matches all polygons instead of just the first found.&lt;/li&gt;&lt;li&gt;When running ShapeLogic as an ImageJ plugin, it is now easy for users to define rules for matching in external Java files.&lt;/li&gt;&lt;li&gt;New number matcher to demonstrate how to define rules for matching in an external Java file in &lt;a href="http://code.google.com/p/shapelogic/source/browse/trunk/src/main/java/DigitStreamVectorizer_.java"&gt;130 lines of code&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Enabled use of &lt;a href="http://java.sun.com/javase/6/docs/technotes/guides/scripting/programmer_guide/index.html"&gt; Java 6 Scripting&lt;/a&gt; for rule database, which gives the user access to the 25 scripting languages that are supported&lt;/li&gt;&lt;li&gt;ShapeLogic now has beta development status&lt;/li&gt;&lt;li&gt;Many unit tests added for Lazy Stream library&lt;/li&gt;&lt;li&gt;Fixed bugs in Lazy Stream library&lt;/li&gt;&lt;li&gt;Fixed bugs in vectorizer&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Lazy streams features&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;Lazy streams can be named and defined based on other lazy streams&lt;br /&gt;&lt;/li&gt;&lt;li&gt;They work similarly to UNIX pipes or calculation Legos&lt;/li&gt;&lt;li&gt;They serve as your query construct; you can query them directly&lt;/li&gt;&lt;li&gt;A stream can be wrapped around an Iterator&lt;/li&gt;&lt;li&gt;A stream can be created from an input stream and a Calculation&lt;br /&gt;&lt;/li&gt;&lt;li&gt;They can be instantiated lazily, so you can load calculation networks independently&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The lazy stream worked very well with the letter match. 
It started out as a functional construct, but with the named streams they got a more declarative feel to them.&lt;br /&gt;&lt;br /&gt;The Calculation in the stream uses the same signature as &lt;a href="http://gafter.blogspot.com/2006/08/closures-for-java.html"&gt;Neal Gafter's Java 7 closure prototype&lt;/a&gt;. This might make integration with Java 7 easier.&lt;br /&gt;&lt;h3&gt;User defined rules for number match&lt;/h3&gt;The most significant change is that it has become a lot simpler for users to define rules. Here is the start of &lt;a href="http://code.google.com/p/shapelogic/source/browse/trunk/src/main/java/DigitStreamVectorizer_.java"&gt;DigitStreamVectorizer_, the number matcher, in 130 lines of code&lt;/a&gt;. Notice that there is barely any setup code. It is almost only rules, defined in a very simple format.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;public class DigitStreamVectorizer_ extends StreamVectorizer_ {&lt;br /&gt;&lt;br /&gt;@Override&lt;br /&gt;public void matchSetup(ImageProcessor ip) {&lt;br /&gt; loadDigitStream();&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;public static void loadDigitStream() {&lt;br /&gt; LoadPolygonStreams.loadStreamsRequiredForLetterMatch();&lt;br /&gt; makeDigitStream();&lt;br /&gt; String[] digits = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"};&lt;br /&gt; LoadLetterStreams.makeLetterXOrStream(digits);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;public static void makeDigitStream() {&lt;br /&gt;&lt;br /&gt; rule("0", HOLE_COUNT, "==", 1.);&lt;br /&gt; rule("0", T_JUNCTION_POINT_COUNT, "==", 0.);&lt;br /&gt; rule("0", END_POINT_COUNT, "==", 0.);&lt;br /&gt; rule("0", MULTI_LINE_COUNT, "==", 1.);&lt;br /&gt; rule("0", CURVE_ARCH_COUNT, "&gt;", 0.);&lt;br /&gt; rule("0", HARD_CORNER_COUNT, "==", 0.);&lt;br /&gt; rule("0", SOFT_POINT_COUNT, "&gt;", 0.);&lt;br /&gt;&lt;br /&gt; rule("1", HOLE_COUNT, "==", 0.);&lt;br /&gt; rule("1", T_JUNCTION_LEFT_POINT_COUNT, "==", 0.);&lt;br /&gt; rule("1", T_JUNCTION_RIGHT_POINT_COUNT, "==", 
0.);&lt;br /&gt; rule("1", END_POINT_BOTTOM_POINT_COUNT, "==", 1.);&lt;br /&gt; rule("1", HORIZONTAL_LINE_COUNT, "==", 0.);&lt;br /&gt; rule("1", VERTICAL_LINE_COUNT, "==", 1.);&lt;br /&gt; rule("1", END_POINT_COUNT, "==", 2.);&lt;br /&gt; rule("1", MULTI_LINE_COUNT, "==", 0.);&lt;br /&gt; rule("1", SOFT_POINT_COUNT, "==", 0.);&lt;br /&gt; rule("1", ASPECT_RATIO, "&lt;", 0.4);  &lt;/pre&gt;&lt;br /&gt;&lt;h3&gt;ShapeLogic's 3 different approaches to declarative programming&lt;br /&gt;&lt;/h3&gt;Here is a chronological listing of ShapeLogic's 3 different approaches to declarative logic:&lt;ol&gt;&lt;li&gt;Declarative goal driven logic engine. From ShapeLogic 0.2.&lt;/li&gt;&lt;li&gt;Logic filter language. From ShapeLogic 0.8. The syntax and development of the logic language are better described in &lt;a href="http://www.shapelogic.org/logic-language.html"&gt; Logic language&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Lazy streams. From ShapeLogic 0.9.&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;Result of different approaches so far&lt;/h3&gt;&lt;p&gt;For the letter matching example, the lazy stream approach has been both simpler and more powerful than the goal driven logic engine.&lt;/p&gt;&lt;p&gt;The Artificial Intelligence choice tree is built into the logical structure of the goal driven logic engine, so this approach might work well when reasoning under uncertainty.&lt;/p&gt;&lt;p&gt;The logic filter language is used with both the lazy stream and the goal driven approach.&lt;/p&gt;&lt;p&gt;ShapeLogic is a toolkit; all 3 approaches are available with unit tests. &lt;/p&gt;&lt;p&gt;For now development is focused on the lazy stream approach.&lt;/p&gt; &lt;h3&gt;JSR 223 scripting surprise&lt;br /&gt;&lt;/h3&gt; I had put a lot of energy into making it easy to create a stream from a Scripting snippet in either Groovy, JRuby or JavaScript. 
This was clearly superior to text snippet rules defined using &lt;a href="http://commons.apache.org/jexl"&gt;Apache Commons JEXL&lt;/a&gt; under the Declarative goal driven approach, but using the new Stream it turned out not to be necessary. The rules could easily be defined in plain Java. I still think that good integration with streams and Scripting from ShapeLogic might turn out to be useful.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://code.google.com/p/shapelogic/downloads/list"&gt;Download ShapeLogic 1.0&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;br /&gt;http://www.shapelogic.org&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-4454711251766383003?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/4454711251766383003/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=4454711251766383003' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/4454711251766383003'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/4454711251766383003'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/03/shapelogic-10-with-stream-based-rules.html' title='ShapeLogic 1.0 with stream based rules released'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-8025798691001932169</id><published>2008-01-29T06:50:00.000-05:00</published><updated>2008-01-29T10:17:08.575-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='lazy'/><category scheme='http://www.blogger.com/atom/ns#' term='scripting'/><category scheme='http://www.blogger.com/atom/ns#' term='functional programming'/><category scheme='http://www.blogger.com/atom/ns#' term='jruby'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='mathematics'/><category scheme='http://www.blogger.com/atom/ns#' term='stream'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='groovy'/><category scheme='http://www.blogger.com/atom/ns#' term='Project Euler'/><category scheme='http://www.blogger.com/atom/ns#' term='scala'/><title type='text'>Lazy streams in mathematics and vision</title><content type='html'>&lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt; 0.9 contains new functional and declarative constructs&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Lazy streams&lt;/li&gt;&lt;li&gt;Simplified and expanded lazy calculations&lt;/li&gt;&lt;/ol&gt;The functional and declarative constructs of ShapeLogic can be used independently of the image processing code. Using them in other applications only requires a 200 KB jar file, and there is no dependency on third party jars.&lt;br /&gt;&lt;br /&gt;The ShapeLogic 0.9 letter recognition example is still using a different system for declarative programming. 
More on that later in this post.&lt;br /&gt;&lt;br /&gt;To test lazy streams in pure form I tried them on the first 10 mathematical problems in &lt;a href="http://projecteuler.net/"&gt;Project Euler&lt;/a&gt;. I think that ShapeLogic streams provided simple solutions to the mathematical problems.&lt;br /&gt;However, they are more verbose than, say, solutions in the Scala language:&lt;br /&gt;&lt;a href="http://scala-blogs.org/2007/12/project-euler-fun-in-scala.html"&gt;http://scala-blogs.org/2007/12/project-euler-fun-in-scala.html&lt;/a&gt;&lt;br /&gt;&lt;h3&gt;Solutions to a few of the Project Euler mathematical problems&lt;/h3&gt;Project Euler is a list of 178 mathematical problems that can be solved by computers.&lt;br /&gt;&lt;h4&gt;1 Add all the natural numbers below 1000 that are multiples of 3 or 5&lt;/h4&gt;&lt;pre&gt;NaturalNumberStream naturalNumberStream = new NaturalNumberStream(1,999);&lt;br /&gt;ListFilterStream&amp;lt;Integer&amp;gt; filter = new BaseListFilterStream&amp;lt;Integer&amp;gt;(naturalNumberStream) {&lt;br /&gt;public boolean evaluate(Integer object) {return object % 3 == 0 || object % 5 == 0;}&lt;br /&gt;};&lt;br /&gt;SumAccumulator accumulator = new SumAccumulator(filter);&lt;br /&gt;System.out.println(accumulator.getValue());&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h4&gt;2 Find the sum of all the even-valued terms in the Fibonacci sequence which do not exceed one million&lt;/h4&gt;&lt;pre&gt;BaseListStream1&amp;lt;Object,Integer&amp;gt; fibonacci = new BaseListStream1&amp;lt;Object,Integer&amp;gt;(){&lt;br /&gt;{ _list.add(1); _list.add(2);}&lt;br /&gt;public Integer invoke(Object input, int index) {return get(index-2) + get(index-1);}&lt;br /&gt;};&lt;br /&gt;ListFilterStream&amp;lt;Integer&amp;gt; filter = new BaseListFilterStream&amp;lt;Integer&amp;gt;(fibonacci) {&lt;br /&gt;public boolean evaluate(Integer object) { return object % 2 == 0; }&lt;br /&gt;};&lt;br /&gt;SumAccumulator accumulator = new SumAccumulator(filter) {&lt;br 
/&gt;{_inputElement = 0;}&lt;br /&gt;public boolean hasNext(){ return _inputElement &amp;lt;= theNumber; }&lt;br /&gt;};&lt;br /&gt;System.out.println(accumulator.getValue());&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h4&gt;More solutions to Project Euler&lt;/h4&gt;&lt;br /&gt;The next 8 solutions are included in the full set:&lt;br /&gt;&lt;a href="http://www.shapelogic.org/project-euler.html"&gt;First 10 solutions to Project Euler in ShapeLogic&lt;/a&gt;&lt;br /&gt;&lt;a href="http://shapelogic.googlecode.com/svn/trunk/src/test/java/org/shapelogic/euler/ProjectEuler1Test.java"&gt;First 10 solutions to Project Euler as Java unit tests&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;h3&gt;Comparison between streams and old declarative constructs&lt;/h3&gt;In ShapeLogic 0.8 a cornerstone of declarative programming was a goal driven system of tasks with sub tasks. The letter recognition example reads in a rule database and translates it into goals/tasks with sub tasks, somewhat like the programming language Prolog.&lt;br /&gt;In case of uncertainty the different choices would live in an artificial intelligence choice tree that is tied to the tree of tasks and sub tasks.&lt;br /&gt;&lt;h3&gt;Integrating old declarative constructs with lazy streams&lt;/h3&gt;Based on what is needed in the current letter match example, it seems that the stream based approach is simpler and able to handle all the problems that the goal driven approach could.&lt;br /&gt;I expect the stream approach to supersede the goal approach.&lt;br /&gt;The rule database will be read and translated to streams. 
So the rule for the letter A would be translated into a filter that turns a stream of polygons into a stream of only those polygons that represent the letter A.&lt;br /&gt;&lt;br /&gt;Currently the rules use &lt;a href="http://commons.apache.org/jexl"&gt;JEXL&lt;/a&gt; to translate a text rule into something executable.&lt;br /&gt;Example:&lt;br /&gt;Rule for letter A: polygon.holeCount == 1.&lt;br /&gt;In ShapeLogic 1.0, you should be able to use Java 6 Scripting, JSR 223, to define these rules in one of the 25 supported scripting languages. In ShapeLogic 0.9, &lt;a href="http://groovy.codehaus.org/"&gt;Groovy&lt;/a&gt;, &lt;a href="http://jruby.codehaus.org/"&gt;JRuby&lt;/a&gt; and JavaScript were tested with streams, but not with the rule databases.&lt;br /&gt;&lt;h3&gt;Using streams for concurrent programming&lt;br /&gt;&lt;/h3&gt;Streams can also support parallel or concurrent programming, which is important for the CPU intensive operations in image processing and computer vision, especially with the advent of cheap multi processor machines.&lt;br /&gt;&lt;h4&gt;Example: Find polygons in a stack of images&lt;/h4&gt;You define a lazy data stream for this and set a stream property&lt;br /&gt;randomAccess = true&lt;br /&gt;This indicates that individual elements can be calculated independently. The factory creating the stream could create a parallel version of the stream and assign each operation its own thread. 
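&lt;br /&gt;&lt;br /&gt;A minimal sketch of that idea in plain Java follows. It does not use the ShapeLogic stream classes or the randomAccess property; it swaps in the JDK's own java.util.stream.IntStream (Java 8 or later) as a stand-in, and polygonCountFor is a hypothetical placeholder for the per-image work. Because each index is computed independently, the runtime is free to spread the calls over several threads:&lt;br /&gt;
&lt;pre&gt;
```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelElements {

    // Hypothetical stand-in for an expensive per-image step such as polygon finding.
    static int polygonCountFor(int imageIndex) {
        return (imageIndex * 7) % 5;
    }

    // Every index is independent of the others, so the range can be processed in parallel.
    // toArray() still returns the results in index order.
    static int[] computeAll(int imageCount) {
        return IntStream.range(0, imageCount)
                .parallel()
                .map(ParallelElements::polygonCountFor)
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeAll(8)));
    }
}
```
&lt;/pre&gt;
The results come back in index order even though the elements may be computed out of order, which is the behaviour a lazy stream with independent elements would need.&lt;br /&gt;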
Note that the result would be a stream of polygons for each image.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-8025798691001932169?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/8025798691001932169/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=8025798691001932169' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/8025798691001932169'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/8025798691001932169'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/01/lazy-streams-in-mathematics-and-vision.html' title='Lazy streams in mathematics and vision'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-1671608474062410223</id><published>2008-01-23T01:02:00.000-05:00</published><updated>2008-01-23T08:18:44.929-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='functional programming'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='jruby'/><category scheme='http://www.blogger.com/atom/ns#' term='image processing'/><category 
scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='stream'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='ImageJ'/><category scheme='http://www.blogger.com/atom/ns#' term='groovy'/><category scheme='http://www.blogger.com/atom/ns#' term='scripting'/><title type='text'>ShapeLogic 0.9 with lazy stream library released</title><content type='html'>Here are the release notes; I will soon describe the changes in more detail.&lt;br /&gt;&lt;br /&gt;This is the first release where &lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt; is moving beyond its current scope as a plugin library for &lt;a href="http://rsb.info.nih.gov/ij"&gt;ImageJ&lt;/a&gt;, currently only used in a letter recognition example. The improved system will be for declarative programming where the user can define rules in either a database or flat file. The focus will still be on image processing and computer vision, but the system will be more broadly applicable. There has been no new work on image processing or letter recognition in this release. 
ShapeLogic 1.0 will combine these new changes with the current image processing code.&lt;h3&gt;Changes&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;Introduce new functional, declarative and query constructs to Java&lt;/li&gt;&lt;li&gt;Implement lazy streams like those in Haskell, &lt;a href="http://www.scala-lang.org/"&gt; Scala&lt;/a&gt; or Scheme&lt;/li&gt;&lt;li&gt;These functional constructs are very lightweight and you only need one 200KB jar file to use them in other applications&lt;/li&gt;&lt;li&gt;Test streams by solving the first 10 &lt;a href="http://www.shapelogic.org/project-euler.html"&gt;mathematical problems&lt;/a&gt; from &lt;a href="http://projecteuler.net/"&gt;Project Euler&lt;/a&gt; &lt;/li&gt;&lt;li&gt;Enabled &lt;a href="http://java.sun.com/javase/6/docs/technotes/guides/scripting/programmer_guide/index.html"&gt;Java 6 Scripting&lt;/a&gt; for evaluating expressions.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Tested with &lt;a href="http://groovy.codehaus.org/"&gt;Groovy&lt;/a&gt;, &lt;a href="http://jruby.codehaus.org/"&gt;JRuby&lt;/a&gt; and &lt;a href="http://www.mozilla.org/rhino"&gt;JavaScript&lt;/a&gt;, but should work with the other supported Scripting languages; currently there are 25 of these. This makes it possible for users to add rules, formulas and queries in real time using text format. 
They can interact with a running Java application, which can be useful in science, finance or web applications.&lt;/li&gt;&lt;/ul&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-1671608474062410223?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/1671608474062410223/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=1671608474062410223' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1671608474062410223'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1671608474062410223'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/01/shapelogic-09-with-lazy-stream-library.html' title='ShapeLogic 0.9 with lazy stream library released'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-266863923588136427</id><published>2008-01-12T20:40:00.000-05:00</published><updated>2008-01-13T09:38:49.212-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='functional programming'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='closures'/><category scheme='http://www.blogger.com/atom/ns#' 
term='jruby'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='groovy'/><title type='text'>Lazy streams in Java, Groovy, JavaScript, JRuby</title><content type='html'>This is a follow-up to my last blog post: &lt;a href="http://samibadawi.blogspot.com/2008/01/functional-constructs-java-7-groovy.html"&gt;Functional constructs Java 7, Groovy, Commons&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I have finished coding the first part of the lazy streams for &lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt;. They are going to be key for the declarative programming query interface to ShapeLogic. As a proof of concept I wanted to see how easy it would be to implement a lazy stream of &lt;a href="http://en.wikipedia.org/wiki/Fibonacci_number"&gt;Fibonacci numbers&lt;/a&gt;. So far I have tested my framework with definitions written in:&lt;ol&gt;&lt;li&gt;Groovy &lt;/li&gt;&lt;li&gt;JavaScript &lt;/li&gt;&lt;li&gt;JRuby&lt;/li&gt;&lt;li&gt;Java &lt;/li&gt;&lt;/ol&gt;It is essential that lazy streams are easy to define so they can be put in a flat file or a database. 
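&lt;br /&gt;&lt;br /&gt;To make the idea concrete, here is a minimal sketch of a memoized lazy stream in plain Java. It is not the ShapeLogic implementation: the class LazyStreamSketch, its Rule interface and the methods get and computedCount are hypothetical names chosen for illustration, and the sketch assumes Java 8 or later for the lambda. Elements are computed only when they are first requested and are then cached, which is what makes a recursive definition such as Fibonacci cheap to evaluate.&lt;br /&gt;

```java
import java.util.Arrays;

/** Hypothetical sketch of a memoized lazy stream; not the ShapeLogic FunctionStream class. */
public class LazyStreamSketch {

    /** Computes element i from elements that are already defined in the stream. */
    public interface Rule {
        long apply(LazyStreamSketch stream, int i);
    }

    private final Rule rule;
    private long[] cache; // elements computed so far, plus spare capacity
    private int size;     // number of valid entries in cache

    public LazyStreamSketch(Rule rule, long... seed) {
        this.rule = rule;
        this.cache = Arrays.copyOf(seed, Math.max(seed.length, 4));
        this.size = seed.length;
    }

    /** Returns element i, computing and caching any missing elements in order. */
    public long get(int i) {
        while (i >= size) {
            long next = rule.apply(this, size);
            if (size == cache.length) {
                cache = Arrays.copyOf(cache, cache.length * 2);
            }
            cache[size] = next;
            size = size + 1;
        }
        return cache[i];
    }

    /** Number of elements computed so far, to show that evaluation is on demand. */
    public int computedCount() {
        return size;
    }

    /** Fibonacci stream seeded with 1, 1. */
    public static LazyStreamSketch fibo() {
        return new LazyStreamSketch((s, i) -> s.get(i - 2) + s.get(i - 1), 1, 1);
    }

    public static void main(String[] args) {
        LazyStreamSketch fibo = fibo();
        System.out.println(fibo.get(9));
        System.out.println(fibo.computedCount());
    }
}
```

Calling LazyStreamSketch.fibo().get(9) computes the first ten elements and returns 55; nothing beyond index 9 is evaluated.&lt;br /&gt;&lt;br /&gt;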
I was very satisfied with the concise definitions of the Fibonacci streams in the different languages:&lt;br /&gt;&lt;h3&gt;Lazy Fibonacci stream in Groovy&lt;/h3&gt;&lt;span style="font-family:courier new;"&gt;new FunctionStream("fibo","def fibo_FUNCTION_ = { fibo.get(it-2) + fibo.get(it-1) };",1,1);&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;fibo&lt;/b&gt; is the name the lazy stream has in the context / name space&lt;br /&gt;&lt;b&gt;fibo_FUNCTION_&lt;/b&gt; is the name that, by naming convention, the function used to create the stream must have&lt;br /&gt;&lt;b&gt;1,1&lt;/b&gt; are the initial elements of the lazy stream&lt;br /&gt;&lt;h3&gt;Lazy Fibonacci stream in JRuby&lt;/h3&gt;&lt;span style="font-family:courier new;"&gt;new FunctionStream("fibo","jruby",null,"def fibo_FUNCTION_(it) return $fibo.get(it-2) + $fibo.get(it-1) end",1,1)&lt;/span&gt;&lt;br /&gt;&lt;h3&gt;Lazy Fibonacci stream in JavaScript&lt;/h3&gt;&lt;span style="font-family:courier new;"&gt;new FunctionStream("fibo","javascript",null,"function fibo_FUNCTION_(it) { return parseInt(fibo.get(it-2) ) + parseInt(fibo.get(it-1))};",1,1);&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Scripting using JSR 223&lt;/h3&gt;I added a formula language for users to an enterprise Java system a few years back using Antlr and BeanShell. JSR 223 is dramatically easier to work with, but there are still some problems.&lt;br /&gt;&lt;br /&gt;The only scripting language that works out of the box with Java 6 is JavaScript. To get others to work you have to download jsr223-engines.zip from:&lt;br /&gt;&lt;a href="https://scripting.dev.java.net/servlets/ProjectDocumentList"&gt;https://scripting.dev.java.net/servlets/ProjectDocumentList&lt;/a&gt;&lt;br /&gt;For each language you want to use there is an engine jar file.&lt;br /&gt;E.g. jruby-engine.jar.&lt;br /&gt;They have to be on your classpath. 
So does the jar file implementing the language.&lt;br /&gt;&lt;h3&gt;JSR 223 and Maven 2 problems&lt;br /&gt;&lt;/h3&gt;JSR 223 does not work well with Maven 2. The engine jar files do not reside in the Maven repository, so you have to install them separately into your local Maven repository. Here is the command to install the JRuby engine:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;~/bin/maven-2.0.8/bin/mvn install:install-file -Dfile=jruby-engine.jar -DgroupId=org.jruby -DartifactId=jruby-engine -Dversion=1.0.1 -Dpackaging=jar&lt;/span&gt;&lt;br /&gt;&lt;h3&gt;Scripting languages currently under JSR 223&lt;/h3&gt;beanshell, browserjs, ejs, freemarker, groovy, jacl, jaskell, java, javascript, jawk, jelly, jep, jexl, jruby, jst, judo, juel, jython, ognl, pnuts, scheme, sleep, velocity, xpath, xslt.&lt;br /&gt;These languages should in theory work with ShapeLogic, without any additional code.&lt;br /&gt;&lt;h3&gt;Other implementations of lazy streams&lt;/h3&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;BaseStream&lt;/span&gt;: The abstract base class with most of the lazy stream functionality&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;IteratorStream:&lt;/span&gt; Generates elements using a Java Iterator&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;TransformerStream:&lt;/span&gt; Generates elements using a Java interface&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;FunctionStream:&lt;/span&gt; Generates elements using JSR 223 as described above&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;Next steps for functional constructs for ShapeLogic&lt;/h3&gt;&lt;ol&gt;&lt;li&gt;Filter that is easy to define in external text using scripting&lt;/li&gt;&lt;li&gt;Transformer transforming one lazy stream to the next&lt;/li&gt;&lt;li&gt;Query interface from where a result can be retrieved&lt;/li&gt;&lt;/ol&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/7506593179569894775-266863923588136427?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/266863923588136427/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=266863923588136427' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/266863923588136427'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/266863923588136427'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/01/lazy-streams-in-java-groovy-javascript.html' title='Lazy streams in Java, Groovy, JavaScript, JRuby'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-3614667019713188830</id><published>2008-01-09T07:17:00.000-05:00</published><updated>2008-01-09T11:37:48.481-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='functional programming'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='closures'/><category scheme='http://www.blogger.com/atom/ns#' term='haskell'/><category scheme='http://www.blogger.com/atom/ns#' term='groovy'/><category scheme='http://www.blogger.com/atom/ns#' term='scala'/><title type='text'>Functional constructs Java 7, Groovy, Commons</title><content type='html'>&lt;h3&gt;What functional constructs to use for 
ShapeLogic&lt;/h3&gt;I have started to code the lazy streams for &lt;a href="http://www.shapelogic.org"&gt;ShapeLogic&lt;/a&gt;. They are going to be key for the query interface to ShapeLogic. I need some functional constructs to work with these. The top candidate sources are:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Apache Commons&lt;/li&gt;&lt;li&gt;Groovy&lt;/li&gt;&lt;li&gt;Java 7&lt;/li&gt;&lt;li&gt;Hand coding&lt;/li&gt;&lt;/ol&gt;None of these is a perfect fit. This post lists some of their advantages and disadvantages.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Apache Commons functional constructs&lt;/h3&gt;I am currently using &lt;a href="http://commons.apache.org/jexl"&gt;Apache Commons JEXL&lt;/a&gt; for user expressions.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Apache Commons does not use generics&lt;/li&gt;&lt;li&gt;Apache Commons Functor is still in the sandbox and not actively developed&lt;/li&gt;&lt;li&gt;The code is not uniform&lt;/li&gt;&lt;li&gt;You cannot create user-defined functions in Apache Commons JEXL&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;Groovy functional constructs&lt;/h3&gt;I need to use a scripting language to define user functions.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Groovy contains all of Java's constructs&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Groovy comes with good functional constructs&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I like the Groovy syntax for using these&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I cannot use these expressions directly since Groovy is not a lazy language&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;Java 7 functional constructs&lt;/h3&gt;&lt;a href="http://tech.puredanger.com/java7"&gt;Java 7&lt;/a&gt; is expected to come with good functional constructs.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;There is a lot of interest in functional constructs in Java 7&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I would rather use Java 7 than compete with it&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Java 7 still seems to be pretty far away&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It is not certain that closures will make it 
into Java 7&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;Thoughts on immutable constructs&lt;/h3&gt;I will probably make it a convention that a lazy stream like the Fibonacci numbers is immutable, but not enforce this by implementing it as a LISP-style list.&lt;br /&gt;&lt;h3&gt;Scala envy&lt;/h3&gt;Now I suffer from &lt;a href="http://www.scala-lang.org/"&gt;Scala language&lt;/a&gt; envy. These Scala language features would come in handy now:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Lazy streams&lt;br /&gt;&lt;/li&gt;&lt;li&gt;List comprehension, with the same syntax for lists, iterators and streams&lt;br /&gt;&lt;/li&gt;&lt;li&gt;First-class functions&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Closures&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Lazy calculation&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Uniform access to functions and lists&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;I could of course try to include the scala.jar and directly use the Scala constructs in ShapeLogic, but that way lies madness.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Hand coding functional constructs&lt;/h3&gt;This might not be too much work, but I would rather not create yet another implementation of functional constructs. &lt;a href="http://commons.apache.org/sandbox/functor"&gt;Apache Commons Functor&lt;/a&gt; never took off, despite some good press. 
That is probably an indication that Java was not that well suited for elegant functional constructs when this library was made, and maybe still isn't.&lt;br /&gt;&lt;h3&gt;Current work plan&lt;/h3&gt;&lt;ol&gt;&lt;li&gt;I will start by implementing a lazy Fibonacci stream&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Then I will move the polygon finder to a lazy stream; until now it could only find one polygon&lt;/li&gt;&lt;/ol&gt;I will try to get ShapeLogic 0.9 out soon, even if it is a very limited version.&lt;br /&gt;&lt;br /&gt;If you know of any libraries that would fit my needs and not be too heavyweight, please let me know.&lt;br /&gt;&lt;h3&gt;Two Fibonacci implementations&lt;/h3&gt;The &lt;a href="http://www.haskell.org/"&gt;Haskell language&lt;/a&gt; has an incredibly elegant lazy stream implementation:&lt;br /&gt;fibs :: [Int]&lt;br /&gt;fibs = 1 : 1 : [ a + b | (a, b) &lt;- zip fibs (tail fibs)] &lt;br /&gt;&lt;br /&gt;&lt;a href="http://legacy.drools.codehaus.org/Fibonacci+Example"&gt;Drools Fibonacci implementation&lt;/a&gt;. 
&lt;a href="http://labs.jboss.com/drools"&gt;This&lt;/a&gt; is not a good fit for my computer vision project.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-3614667019713188830?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/3614667019713188830/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=3614667019713188830' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/3614667019713188830'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/3614667019713188830'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/01/functional-constructs-java-7-groovy.html' title='Functional constructs Java 7, Groovy, Commons'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-6536618862179961263</id><published>2008-01-05T05:22:00.000-05:00</published><updated>2008-01-07T10:23:54.227-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='Maven'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><title type='text'>Good resources about generating a Site and Documentation in Maven 2</title><content type='html'>The &lt;a 
href="http://www.shapelogic.org"&gt;ShapeLogic project site&lt;/a&gt; is generated with the Maven site generation tool.&lt;br /&gt;&lt;br /&gt;I am trying to customize the ShapeLogic site a little more. Unfortunately the documentation for the Maven site generation tool is very spotty, but Eric Redmond has made 2 great resources about this:&lt;br /&gt;&lt;br /&gt;His online Maven 2 book:&lt;br /&gt;&lt;a href="http://propellors.net/maven/book/site-generation.html"&gt;http://propellors.net/maven/book/site-generation.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;His blog entry:&lt;br /&gt;&lt;a href="http://www.coderoshi.com/2007/02/generating-site-and-documentation-in.html"&gt;過労死 Death by Overcoding: Generating a Site and Documentation in Maven&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Now the whole ShapeLogic site is generated cleanly with this command: &lt;br /&gt;mvn clean site&lt;br /&gt;Before, you had to copy the images and css directories manually.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-6536618862179961263?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/6536618862179961263/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=6536618862179961263' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/6536618862179961263'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/6536618862179961263'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/01/death-by-overcoding-generating-site-and.html' title='Good resources about generating a Site and Documentation in Maven 2'/><author><name>Sami 
Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-1131519058147930868</id><published>2008-01-02T15:21:00.000-05:00</published><updated>2008-01-02T23:04:33.768-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='logic'/><category scheme='http://www.blogger.com/atom/ns#' term='scripting'/><title type='text'>ShapeLogic for general declarative programming</title><content type='html'>&lt;a href="http://www.shapelogic.org"&gt;ShapeLogic&lt;/a&gt; 0.9 is going to be the first release where users define rule sets read from either databases or flat files. I have decided to broaden the scope of the project.&lt;br /&gt;&lt;br /&gt;At first I hoped that I would be able to make minor adjustments to ShapeLogic's letter match example so that it would read the rules from an external source, and then gradually expand it. But since backward compatibility is not a problem yet, I came to the conclusion that it was better to try to construct a solid foundation for declarative programming. While it will be primarily geared toward computer vision problem solving, the system will have other, broader, applications as well.&lt;br /&gt;&lt;h3&gt;Do we need another system for declarative programming?&lt;/h3&gt;&lt;br /&gt;SQL rules supreme in the field of simple data, but once you go outside this field there are many different directions that you can take. 
&lt;br /&gt;&lt;br /&gt;The declarative logic programming system that I had the highest expectations for was &lt;a href="http://www.cyc.com"&gt;CYC&lt;/a&gt;. It is a large, hand-coded knowledge base trying to capture common sense, using many different inference techniques. Strangely, it never got a big following even after they released part of the engine as &lt;a href="http://www.opencyc.org"&gt;open source&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I hope to find a sweet spot where a simple system will be able to do some real work and be simple to learn. My work is geared toward a domain where there are a limited number of objects with somewhat constant features.&lt;br /&gt;&lt;h3&gt;A few of the changes in ShapeLogic 0.9&lt;/h3&gt;&lt;br /&gt;&lt;h4&gt;Lazy data streams&lt;/h4&gt;&lt;br /&gt;Implement lazy data streams like those in &lt;a href="http://www.scala-lang.org"&gt;Scala&lt;/a&gt; or Scheme.&lt;br /&gt;&lt;h4&gt;Decouple annotation&lt;/h4&gt;&lt;br /&gt;Make the annotation of shapes more loosely coupled to classes in ShapeLogic. I was not super happy with the first solution that I came up with for annotating points, lines and polygons.&lt;br /&gt;If ShapeLogic is to be used for more general problems, this is completely unacceptable.&lt;br /&gt;&lt;h4&gt;Better interpreter&lt;/h4&gt;&lt;br /&gt;In order to save user-defined rules in a database or a flat file, I need to be able to compile or interpret this code. 
The code in the rules should mostly be fairly straightforward.&lt;br /&gt;&lt;br /&gt;The top contenders are:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Java 6 Scripting interface, which should give access to different scripting languages for the JDK&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Java 6 Compiler interface; the problem is that Java is pretty verbose&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Groovy interpreter&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Do the parsing in ShapeLogic and write a little interpreter&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Stick with &lt;a href="http://commons.apache.org/jexl"&gt;JEXL&lt;/a&gt; for now, and make the move in a later release&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-1131519058147930868?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/1131519058147930868/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=1131519058147930868' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1131519058147930868'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1131519058147930868'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2008/01/shapelogic-for-general-declarative.html' title='ShapeLogic for general declarative programming'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-367672670398276965</id><published>2007-12-26T13:12:00.000-05:00</published><updated>2007-12-26T14:34:40.058-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='scripting'/><title type='text'>Language and tutorials for the external rule base</title><content type='html'>&lt;h3&gt;Language&lt;/h3&gt;&lt;br /&gt;I have spent a couple of weeks thinking about how to organize the external rule database for &lt;a href="http://www.shapelogic.org"&gt;ShapeLogic&lt;/a&gt;, and have done more reading about &lt;a href="http://java.sun.com/javase/6/docs/technotes/guides/scripting/programmer_guide/index.html"&gt;Java 6 Scripting&lt;/a&gt;. 
That should be adequate for my needs, but I am reluctant to add more dependencies to ShapeLogic than absolutely necessary.&lt;br /&gt;&lt;br /&gt;I am now debating what scripting language would be best for Java 6 Scripting:&lt;br /&gt;&lt;a href="http://groovy.codehaus.org"&gt;Groovy&lt;/a&gt;: Probably the strongest contender. I tried Groovy twice before, only to find that it was not ready for prime time, but maybe with version 1.5 it is finally there.&lt;br /&gt;&lt;a href="http://www.beanshell.org"&gt;BeanShell 2&lt;/a&gt;: It has been in beta for over 2 years and does not seem to be in active development.&lt;br /&gt;&lt;a href="http://www.jython.org"&gt;Jython&lt;/a&gt;: I have been a big fan of Python for almost 10 years now, but the whitespace indentation does not work so well with code stored in a database or flat file.&lt;br /&gt;&lt;a href="http://www.mozilla.org/rhino"&gt;JavaScript/Rhino&lt;/a&gt;: I like it and people know it, but it would be better if it used native Java types.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Tutorial&lt;/h3&gt;&lt;br /&gt;After seeing a 20-minute screencast for &lt;a href="http://www.rubyonrails.org"&gt;Ruby on Rails&lt;/a&gt; by David Heinemeier Hansson, my new test for whether a programming library or language is worth spending time on is whether it has a 20-minute screencast that does something non-trivial. 
I do not adhere to this rigorously.&lt;br /&gt;Given that I have a thicker Danish accent than David Hansson, I have been looking for other options.&lt;br /&gt;One of my friends, Joe Orr (see &lt;a href="http://www.3dtree.com/wp"&gt;Joe's blog 3DTree Notebook&lt;/a&gt;), has created a very interesting alternative to screencasts called &lt;a href="http://www.screenbooks.net"&gt;Screenbook Maker&lt;/a&gt;. It is a program that takes screenshots of a demonstration and adds text to them, turning them into a searchable tutorial.&lt;br /&gt;Joe has promised to help me make a Screenbook tutorial when the external rule database is released.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-367672670398276965?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/367672670398276965/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=367672670398276965' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/367672670398276965'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/367672670398276965'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2007/12/language-and-tutorials-for-external.html' title='Language and tutorials for the external rule base'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-9159549438555677652</id><published>2007-12-19T18:48:00.000-05:00</published><updated>2007-12-19T23:38:46.975-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='scripting'/><title type='text'>Declarative programming using Java 6 Scripting</title><content type='html'>I am working on moving the declarative programming in &lt;a href="http://www.shapelogic.org/"&gt;ShapeLogic&lt;/a&gt; into an external rule database now.&lt;br /&gt;&lt;br /&gt;Currently the rules in ShapeLogic are parsed from strings using &lt;a href="http://commons.apache.org/jexl"&gt;Apache Commons JEXL&lt;/a&gt; library.&lt;br /&gt;So the letter A would have a rule saying:&lt;br /&gt;polygon.holeCount == 1&lt;br /&gt;&lt;br /&gt;This is not trivial since a variable say polygon.holeCount could have different values in different contexts.&lt;br /&gt;E.g. 
if there were a choice of 2 different threshold levels, then in one part of the choice tree we could have&lt;br /&gt;polygon.holeCount == 1 and in another we could have&lt;br /&gt;polygon.holeCount == 2.&lt;br /&gt;&lt;br /&gt;I am considering changing from JEXL to &lt;a href="http://java.sun.com/javase/6/docs/technotes/guides/scripting/programmer_guide/index.html"&gt;Java 6 Scripting&lt;/a&gt; instead.&lt;br /&gt;JEXL has not had a release in over a year, and handling static fields and functions in it is a little awkward.&lt;br /&gt;It might also be better to let users choose what scripting language they want to use.&lt;br /&gt;Currently these languages should be available for scripting: &lt;a href="http://www.mozilla.org/rhino/scriptjava.html"&gt;JavaScript&lt;/a&gt;, &lt;a href="http://www.beanshell.org/"&gt;BeanShell&lt;/a&gt;, &lt;a href="http://www.jython.org"&gt;Jython&lt;/a&gt;, &lt;a href="http://groovy.codehaus.org"&gt;Groovy&lt;/a&gt; and &lt;a href="http://jruby.codehaus.org"&gt;JRuby&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;One issue is that I cannot just use variable binding in a global scripting context.&lt;br /&gt;In my example from above, if the variable polygon.holeCount does not exist in the top context, I will have to make sure that it is taken from the right context. This was relatively easy in JEXL, since a context there is mainly just a map in which you store your key-value pairs. I am not sure whether this will be a problem when dealing with a whole dynamic scripting language. I am also a little concerned about performance.&lt;br /&gt;&lt;br /&gt;I might make a release of ShapeLogic 0.9 where you can just select another rule database stored in a flat file or a database, but still using the current system, in order not to drag the next release out too long. 
This should allow the users to define rules for matching a separate alphabet, say the Greek.&lt;br /&gt;But it is far from what I want ShapeLogic to be able to do.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-9159549438555677652?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/9159549438555677652/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=9159549438555677652' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/9159549438555677652'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/9159549438555677652'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2007/12/declarative-programming-using-java-6.html' title='Declarative programming using Java 6 Scripting'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-3401602685398523138</id><published>2007-12-18T22:11:00.000-05:00</published><updated>2007-12-18T22:43:46.491-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' 
term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='logic'/><title type='text'>Declarative and Object Oriented programming</title><content type='html'>One of the main objectives for &lt;a href="http://www.shapelogic.org"&gt;ShapeLogic&lt;/a&gt; is to make a good hybrid of Declarative programming and Object Oriented programming. This is not specific to computer vision, but is a general problem. It is a daunting task; many people have tried, and the state of the art still leaves a lot to be desired.&lt;br /&gt;&lt;br /&gt;I have started to work on the first release of ShapeLogic with an external rule-based engine, but I have not worked through all the problems yet. I think that this is a case of evolutionary programming where you have to try out an approach, not knowing if it will lead to anything useful. Hopefully I will have ShapeLogic 0.9 ready pretty soon.&lt;br /&gt;&lt;br /&gt;I think that the key is to keep it simple and keep the syntax easy to work with. Let me just give my 2 cents on a few projects that combine Declarative programming and Object Oriented programming.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Approaches that impressed me&lt;/h3&gt;&lt;br /&gt;&lt;a href="http://www.prova.ws"&gt;Prova&lt;/a&gt; is a Java Prolog hybrid. I was very impressed by the simplicity and how well it managed to integrate queries with normal database access. Unfortunately I do not think that Prolog is applicable to the approach to computer vision that I am pursuing in ShapeLogic now.&lt;br /&gt;&lt;br /&gt;List comprehension in &lt;a href="http://www.python.org/doc/2.3.5/tut/node7.html"&gt;Python&lt;/a&gt; and &lt;a href="http://www.haskell.org/"&gt;Haskell&lt;/a&gt;. 
It is somewhat limited, but it is very convenient to work with.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://msdn2.microsoft.com/en-us/netframework/aa904594.aspx"&gt;Microsoft LINQ&lt;/a&gt;: I think that it is great that you can use the same simple syntax to query databases, XML and collections.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.hibernate.org/"&gt;Hibernate&lt;/a&gt; and ORM tools: While I do not think that the dust has settled yet as to how feasible they are for production systems with large databases, I think they are very promising. This was the reason that I included Hibernate in &lt;a href="http://www.shapelogic.org"&gt;ShapeLogic&lt;/a&gt;, despite it not being used much yet.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Promising approaches that I found hard to work with&lt;/h3&gt;&lt;br /&gt;&lt;a href="http://labs.jboss.com/drools"&gt;Drools&lt;/a&gt;: An open source RETE engine for the JVM. It comes with a lot of cool features, but I thought that the example program setting up a rule to calculate Fibonacci numbers was too complicated.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.w3.org/TR/owl-features/"&gt;OWL&lt;/a&gt;: Works with XML / RDF. It is a standard. 
It comes with good open source tools, but it just seems too heavyweight for my purpose.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-3401602685398523138?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/3401602685398523138/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=3401602685398523138' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/3401602685398523138'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/3401602685398523138'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2007/12/declarative-and-object-oriented.html' title='Declarative and Object Oriented programming'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-1801416466457776635</id><published>2007-12-15T17:40:00.000-05:00</published><updated>2007-12-18T17:06:09.956-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><title type='text'>Registered blog with technorati</title><content type='html'>Blogging is still new to me, but I registered my blog with Technorati.&lt;br /&gt;I also adjusted my terminology from Robotic Vision to Computer Vision.&lt;br /&gt;In 1990 when I told people that I was 
writing my thesis on Computer Vision they would always say: &lt;br /&gt;Oh you mean virtual reality. &lt;br /&gt;That was big in the early 90s, so I started saying Robotic Vision instead. I have now realized that this term is not used much and is not applicable as a tag.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-1801416466457776635?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/1801416466457776635/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=1801416466457776635' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1801416466457776635'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/1801416466457776635'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2007/12/registered-blog-with-technorati.html' title='Registered blog with technorati'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-9048882834770826220</id><published>2007-12-14T11:36:00.001-05:00</published><updated>2007-12-18T17:06:09.957-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='vision'/><category scheme='http://www.blogger.com/atom/ns#' term='image processing'/><category 
scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='logic'/><title type='text'>Day of Judgment was postponed</title><content type='html'>Yesterday I sent out an announcement about &lt;a href="http://www.shapelogic.org"&gt;ShapeLogic&lt;/a&gt; to the &lt;a href="http://rsb.info.nih.gov/ij/list.html"&gt;ImageJ mailing list&lt;/a&gt;, not sure if it would generate any interest at all. I was very happy to see that there were 21 downloads on the first day, so at least there was some interest.&lt;br /&gt;&lt;br /&gt;I was considering starting work on a medical image analysis problem next. It was presented to me by a medical scientist doing Alzheimer's research. It was a very interesting problem, but it would take me at least a few months to finish it.&lt;br /&gt;&lt;br /&gt;Since there is some interest in ShapeLogic, I am now thinking that maybe it is better to first focus on ShapeLogic's primary goal: to be a toolkit for declarative logic in machine vision and image processing.&lt;br /&gt;&lt;br /&gt;When a user tries the letter matching example, the rules are stored in a Java class. There is a unit test that first writes these rules to a database and then reads them back from the database before doing the letter match, but this is not available in the user interface yet.&lt;br /&gt;It would be better if the users could define a new set of rules themselves in a database or a flat file. E.g. the user could have a file that does matching for a different alphabet or other symbols.&lt;br /&gt;&lt;br /&gt;I will probably also clean the letter match example up a little more. 
E.g.: The skeletonizer will create little Y junctions at the bottom point of a V, which will trick some rules.&lt;br /&gt;&lt;br /&gt;I hope that this will get ShapeLogic to beta quality in a couple of releases, maybe around ShapeLogic v 1.0. This is probably a statement that I will live to regret when I announce that ShapeLogic v 2.0 has finally reached beta quality in half a year.&lt;br /&gt;&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-9048882834770826220?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/9048882834770826220/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=9048882834770826220' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/9048882834770826220'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/9048882834770826220'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2007/12/day-of-judgment-was-postponed.html' title='Day of Judgment was postponed'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-5078056768758843905</id><published>2007-12-13T14:57:00.000-05:00</published><updated>2007-12-18T17:06:09.958-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category 
scheme='http://www.blogger.com/atom/ns#' term='vision'/><category scheme='http://www.blogger.com/atom/ns#' term='image processing'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='logic'/><title type='text'>Day of Judgment or day of indifference?</title><content type='html'>&lt;div class="Ih2E3d"&gt;The time of hiding in shame is over. ShapeLogic is finally presentable enough, so I sent out an announcement about ShapeLogic to the &lt;a href="http://rsb.info.nih.gov/ij/list.html"&gt;ImageJ mailing list&lt;/a&gt; this morning.&lt;br /&gt;&lt;br /&gt;I have not had any contact with the image processing or vision community, so I have no idea if the list readers will think it is horrible or promising or not think anything at all.&lt;br /&gt;&lt;br /&gt;I have still not made up my mind about what to work on next: There is an idea for a medical image processing example that is interesting, but also quite involved.&lt;br /&gt;&lt;br /&gt;The letter match could also use a little more cleanup:&lt;br /&gt;&lt;/div&gt;Sometimes the skeletonizer will create little Y junctions at the bottom point of a V, which will trick simple rules.&lt;br /&gt;&lt;br /&gt;I would like to get ShapeLogic to beta quality as soon as possible; I hope that this will happen in the next couple of releases.&lt;br /&gt;&lt;span style="color: rgb(136, 136, 136);"&gt;&lt;br /&gt;-Sami Badawi&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-5078056768758843905?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/5078056768758843905/comments/default' title='Post 
Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=5078056768758843905' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/5078056768758843905'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/5078056768758843905'/><link rel='alternate' type='text/html' href='http://blog.samibadawi.com/2007/12/day-of-judgment-or-day-of-indifference.html' title='Day of Judgment or day of indifference?'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7506593179569894775.post-8026594935361451577</id><published>2007-12-09T21:35:00.000-05:00</published><updated>2007-12-18T17:06:09.959-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='vision'/><category scheme='http://www.blogger.com/atom/ns#' term='image processing'/><category scheme='http://www.blogger.com/atom/ns#' term='computer vision'/><category scheme='http://www.blogger.com/atom/ns#' term='declarative programming'/><category scheme='http://www.blogger.com/atom/ns#' term='ShapeLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='logic'/><title type='text'>100 downloads of ShapeLogic</title><content type='html'>I am Sami Badawi. Robotic vision has been my biggest passion since 1989, and starting in August 2007 I created a Java software library for image processing and robotic vision called ShapeLogic: &lt;a 
href="http://www.shapelogic.org/"&gt;http://www.shapelogic.org&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I did not advertise ShapeLogic while it was pre-alpha quality. However, two weeks ago I released ShapeLogic v 0.7, which was the first alpha quality release. Last week I got the website up and running. To my big surprise somebody found it, and I got over 100 downloads of ShapeLogic 0.7 last week. I am not sure how many were downloads by spiders and robots, but this was enough for me to create this blog and a mailing list: &lt;a href="http://groups.google.com/group/shapelogic"&gt;http://groups.google.com/group/shapelogic&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I released ShapeLogic v 0.8 two days ago. It has a better syntax for the logical expressions, which were pretty convoluted in 0.7 and also completely undocumented.&lt;br /&gt;&lt;br /&gt;I am planning to improve the documentation and clean up the web site next.&lt;br /&gt;After that I would like to work on applying ShapeLogic to a medical image processing problem.&lt;br /&gt;&lt;br /&gt;This is my first blog so I am in unknown territory.&lt;br /&gt;&lt;br /&gt;Thanks for your interest,&lt;br /&gt;-Sami Badawi&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7506593179569894775-8026594935361451577?l=blog.samibadawi.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.samibadawi.com/feeds/8026594935361451577/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7506593179569894775&amp;postID=8026594935361451577' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/8026594935361451577'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7506593179569894775/posts/default/8026594935361451577'/><link rel='alternate' 
type='text/html' href='http://blog.samibadawi.com/2007/12/100-downloads-of-shapelogic.html' title='100 downloads of ShapeLogic'/><author><name>Sami Badawi</name><uri>http://www.blogger.com/profile/12508131380437723177</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://1.bp.blogspot.com/-Yc0XSAU9nbM/ThxJRBvG9NI/AAAAAAAAAGo/ihoG6J8t8i0/s220/SamiBadawiBlog2.JPG'/></author><thr:total>0</thr:total></entry></feed>
