Introduction Implementing the org.apache.hadoop.io.RawComparator interface can speed up your Map/Reduce (MR) Jobs. As you may recall, an MR Job receives and emits key-value pairs. The process looks like the following: (K1,V1) –> Map –> (K2,V2), then (K2,List[V2]) –> Reduce –> (K3,V3). The key-value pairs (K2,V2) are called the intermediate key-value pairs. […]
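To make the idea of raw comparison concrete, here is a minimal sketch in plain Java. It does not use Hadoop itself; the class name RawIntComparatorSketch and its helper methods are hypothetical, and only the parameter shape of compareRaw mirrors RawComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2). It assumes each key is a single 4-byte big-endian int and compares two keys directly from their serialized bytes, without building key objects first.

```java
import java.nio.ByteBuffer;

public class RawIntComparatorSketch {

    // Same parameter shape as RawComparator.compare: each key is a slice
    // (start offset, length) of a byte array holding serialized data.
    // Assumption: every key is exactly one 4-byte big-endian int.
    public static int compareRaw(byte[] b1, int s1, int l1,
                                 byte[] b2, int s2, int l2) {
        int left = readInt(b1, s1);
        int right = readInt(b2, s2);
        return Integer.compare(left, right);
    }

    // Decode a big-endian int in place; no key object is allocated.
    public static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             | ((bytes[start + 1] & 0xff) << 16)
             | ((bytes[start + 2] & 0xff) << 8)
             | (bytes[start + 3] & 0xff);
    }

    // Demo helper: write an int high byte first (ByteBuffer defaults to big-endian).
    public static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        byte[] a = serialize(7);
        byte[] b = serialize(42);
        // A negative result means the first key sorts before the second.
        System.out.println(compareRaw(a, 0, a.length, b, 0, b.length) < 0);
    }
}
```

The saving comes from skipping deserialization: the sort only needs the ordering of two keys, and that can often be read straight from the bytes.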
Introduction Last time, I talked about controlling logging on Amazon Web Services’ (AWS) Elastic MapReduce (EMR). However, that approach works only when you provision an EMR cluster of one node and need to get the log files from that one node. In this blog post, I will talk about how to control logging for an EMR […]
Intro I have been using the Elastic MapReduce (EMR) product. EMR is one of many products and services available from Amazon Web Services (AWS). EMR is AWS’s product to dynamically provision a Hadoop cluster. One problem I ran into was how to control logging. Hadoop uses Apache Commons Logging. Both Hadoop and AWS seem to […]
Introduction Sometimes, we would like to sort the values coming into the Reducer of a Hadoop Map/Reduce (MR) Job. You can sort the values indirectly by combining three pieces: use a composite key, extend org.apache.hadoop.mapreduce.Partitioner, and extend org.apache.hadoop.io.WritableComparator. Other tutorials that explain this approach to sorting values going into a […]
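The three pieces can be illustrated with a small simulation in plain Java. This is only a sketch under stated assumptions: it does not use the Hadoop classes, and the names SecondarySortSketch, CompositeKey, partition, SORT and sortAndGroup are hypothetical. The partition method stands in for the Partitioner (route by natural key only), SORT stands in for a sort comparator (order by natural key, then value), and the grouping step stands in for a grouping comparator (group by natural key only).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    // Composite key: the natural key plus the value we want sorted.
    public static final class CompositeKey {
        final String naturalKey;
        final int value;

        public CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    // Partitioner role: only the natural key decides the partition, so all
    // records for one natural key reach the same reducer.
    public static int partition(CompositeKey key, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sort comparator role: order by natural key first, then by value.
    public static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.value);

    // Grouping role: after sorting, keys sharing a natural key form one group,
    // and the values inside each group arrive already sorted.
    public static Map<String, List<Integer>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            groups.computeIfAbsent(k.naturalKey, x -> new ArrayList<>()).add(k.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
                new CompositeKey("b", 3), new CompositeKey("a", 2),
                new CompositeKey("b", 1), new CompositeKey("a", 5));
        System.out.println(sortAndGroup(keys)); // prints {a=[2, 5], b=[1, 3]}
    }
}
```

The sort is called indirect because the framework only ever sorts keys; moving the value into the key is what lets the values arrive in order.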
Introduction I am reading a book by Lin and Dyer (2010). This book is very informative about designing efficient algorithms under the Map/Reduce (M/R) programming paradigm. Of particular interest is the “in-mapper combining” design pattern that I came across while reading this book. As if engineers and data miners did not have to change their […]
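As a rough illustration of in-mapper combining, the sketch below counts words in plain Java. It is not Hadoop code: the class InMapperCombiningSketch is hypothetical, and its map and cleanup methods only imitate the roles of a mapper's per-record call and its end-of-task hook. Instead of emitting (word, 1) for every token, the mapper accumulates partial counts in memory and emits one pair per distinct word at the end.

```java
import java.util.HashMap;
import java.util.Map;

public class InMapperCombiningSketch {

    // State kept across map() calls for the lifetime of one map task.
    private final Map<String, Integer> counts = new HashMap<>();

    // Role of map(): accumulate locally instead of emitting (word, 1) per token.
    public void map(String line) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Role of cleanup(): emit one (word, partialCount) pair per distinct word.
    public Map<String, Integer> cleanup() {
        return new HashMap<>(counts);
    }

    public static void main(String[] args) {
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        mapper.map("a b a");
        mapper.map("b a");
        // Five tokens went in, but only two pairs come out: a -> 3, b -> 2.
        System.out.println(mapper.cleanup());
    }
}
```

The trade-off is memory: the in-memory map grows with the number of distinct keys a task sees, so a real job may need to flush it periodically.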
I am going to continue on the Map/Reduce Text Mining Toolkit (MRTMT) API in this blog post. I have worked on it a little bit more, and now I will be releasing v0.3. The added improvements include allowing the user to specify the local and global weights used to build the vector space model (VSM).
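For readers unfamiliar with local and global weights, here is a minimal sketch of one common pairing, assumed here purely for illustration: raw term frequency as the local weight and inverse document frequency as the global weight. It is not MRTMT code, and the class and method names (TermWeightSketch, localWeight, globalWeight, weight) are hypothetical; the weighting options MRTMT actually offers are not shown in this excerpt.

```java
public class TermWeightSketch {

    // Local weight: how important the term is within one document.
    // Assumption: raw term frequency.
    public static double localWeight(int termFrequency) {
        return termFrequency;
    }

    // Global weight: how informative the term is across the collection.
    // Assumption: inverse document frequency, ln(N / df), with df >= 1.
    public static double globalWeight(int numDocs, int docFrequency) {
        return Math.log((double) numDocs / docFrequency);
    }

    // An entry of the vector space model is the product of the two weights.
    public static double weight(int termFrequency, int numDocs, int docFrequency) {
        return localWeight(termFrequency) * globalWeight(numDocs, docFrequency);
    }

    public static void main(String[] args) {
        // A term found in every document carries no global information.
        System.out.println(weight(3, 10, 10)); // prints 0.0
    }
}
```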
I recently released MRTMT v0.1. In that article, I stated there was still a lot of work to be done. Since then, I have simplified the process of building a vector space model (VSM) in a newer version, MRTMT v0.2. The full source code may be downloaded here. Again, this API is released under the Apache […]