I have been studying the Mahout v0.4 API. Mahout is a machine learning API, built in Java, that may be used on top of Hadoop. In particular, I have been digging into Mahout’s clustering and collaborative filtering code. In this blog post, I will not say much, since most of what I would say has already been put into Word and PowerPoint files. I will merely give you the link to download the document and slides.
The document and slide deck are licensed under the Creative Commons Attribution 3.0 Unported License.
I think anybody interested in machine learning and/or Hadoop’s Map/Reduce (M/R) programming paradigm may find the files helpful. The document is a bit more detailed than the slides, and it also uses a working toy example to let the reader observe the inputs and outputs of the Map/Reduce classes/tasks/jobs.
I did not make too many subjective statements in the document and slides. But in this blog post, let me say that I find the API fascinating. First, if you just want to know how to implement your own M/R jobs, study this API, since its developers use a lot of data structures and “tricks” in intermediary steps to achieve the machine learning goals. Second, for some algorithms in this API (e.g. the clustering algorithms), you can run them in standalone mode or on a Hadoop cluster (you do not necessarily need a Hadoop cluster or Cygwin + Hadoop to use this API). Third, there is a very good (progressive and responsive) team working on this API, and you can study the code for real-world best practices.
For example, in the recommender algorithm (collaborative filtering), it is interesting to note the required input. The required input is just a plain old text file where each comma-delimited line holds the user’s ID, the item’s ID, and the preference weight. Consequently, each association between a user and an item is its own record. Usually, the way I have worked with data is that a record would be equivalent to a user. So “billions” of records in Mahout may not match your own familiarity with or understanding of a record (e.g. as in a database).
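To make the difference in record granularity concrete, here is a minimal sketch in plain Java. It does not use the Mahout API; the class name PreferenceRecords, the method groupByUser, and the three sample lines are all made up for illustration. It parses comma-delimited lines of user ID, item ID, and preference weight, then groups them so that each user becomes a single entry, which is closer to the "one record per user" view.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; this is not a Mahout class.
final class PreferenceRecords {

    // Groups "userID,itemID,preference" lines into userID -> (itemID -> preference).
    static Map<Long, Map<Long, Float>> groupByUser(List<String> lines) {
        Map<Long, Map<Long, Float>> byUser = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split(",");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + line);
            }
            long userId = Long.parseLong(fields[0].trim());
            long itemId = Long.parseLong(fields[1].trim());
            float preference = Float.parseFloat(fields[2].trim());
            // One input line is one user-item association; fold it into the user's entry.
            byUser.computeIfAbsent(userId, k -> new LinkedHashMap<>())
                  .put(itemId, preference);
        }
        return byUser;
    }

    public static void main(String[] args) {
        // Made-up sample data in the comma-delimited format described above.
        List<String> lines = Arrays.asList("1,101,5.0", "1,102,3.0", "2,101,2.5");
        Map<Long, Map<Long, Float>> byUser = groupByUser(lines);
        // Prints: 3 input records, 2 users
        System.out.println(lines.size() + " input records, " + byUser.size() + " users");
        // Prints: {1={101=5.0, 102=3.0}, 2={101=2.5}}
        System.out.println(byUser);
    }
}
```

Three input lines collapse into two user entries here, so a record count quoted for this kind of input can be much larger than the number of users behind it.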
At any rate, I hope the document and slide files help you to understand Mahout’s API a little bit better. Enjoy. Zai jian.