Running a cluster of virtual machines with Hadoop (HDFS + YARN) v2.4.1 and Spark v1.0.1 using Vagrant

I was able to set up a cluster of virtual machines (VMs) running Hadoop v2.4.1 with HDFS and YARN as well as Spark v1.0.1. The cluster was set up using Vagrant v1.5.1 and VirtualBox v4.3.10. The project is open source under the Apache License v2.0 and is available on GitHub. I created this project for multiple reasons: to learn about creating […]

Implementing RawComparator will speed up your Hadoop Map/Reduce (MR) Jobs

Introduction: Implementing the org.apache.hadoop.io.RawComparator interface can speed up your Map/Reduce (MR) Jobs, because keys are compared in their serialized byte form instead of being deserialized first. As you may recall, an MR Job works by receiving and emitting key-value pairs. The process looks like the following: (K1,V1) -> Map -> (K2,V2), then (K2,List[V2]) -> Reduce -> (K3,V3). The key-value pairs (K2,V2) are called the intermediate key-value pairs. […]
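
To make the byte-level comparison concrete, here is a minimal, Hadoop-free sketch. It only assumes that a key is a single int serialized as 4 big-endian bytes (the layout java.io.DataOutput.writeInt produces). The class name RawIntCompareSketch and its helper methods are hypothetical and used for illustration only; compareRaw merely mirrors the parameter shape of RawComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2).

```java
import java.nio.ByteBuffer;

// Hypothetical sketch (not from the original post): compares two serialized
// int keys directly from their bytes, without building key objects first.
class RawIntCompareSketch {

    // Serialize an int as 4 big-endian bytes, matching DataOutput.writeInt.
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Read a big-endian int starting at offset 'start'.
    static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             | ((bytes[start + 1] & 0xff) << 16)
             | ((bytes[start + 2] & 0xff) << 8)
             | (bytes[start + 3] & 0xff);
    }

    // Same parameter shape as RawComparator.compare: two buffers, each with a
    // start offset and a length. The lengths are unused here because an int
    // key always occupies exactly 4 bytes.
    static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] a = serialize(7);
        byte[] b = serialize(42);
        // Negative result: 7 sorts before 42.
        System.out.println(compareRaw(a, 0, a.length, b, 0, b.length) < 0);
    }
}
```

In a real job the same logic would live in a class implementing org.apache.hadoop.io.RawComparator; the sketch leaves Hadoop out so it can be run on its own.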

Secondary sorting aka sorting values in Hadoop’s Map/Reduce programming paradigm

Introduction: Sometimes we would like to sort the values coming into the Reducer of a Hadoop Map/Reduce (MR) Job. You can sort the values indirectly by combining the following: use a composite key, extend org.apache.hadoop.mapreduce.Partitioner, and extend org.apache.hadoop.io.WritableComparator. Other tutorials that explain this approach to sorting values going into a […]
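
The combination listed above can be simulated in plain Java, without Hadoop, to show why it yields sorted values. This is a rough sketch under stated assumptions: the class SecondarySortSketch, the CompositeKey type, and the sample data are hypothetical, and the grouping-by-natural-key step reflects the commonly used secondary-sort pattern rather than anything confirmed by the excerpt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free simulation of the secondary-sort idea.
class SecondarySortSketch {

    // Composite key: the natural key plus the value we want sorted.
    static final class CompositeKey {
        final String naturalKey;
        final int value;

        CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    // Partitioner role: route on the natural key only, so every record for
    // one natural key lands in the same partition (reducer).
    static int partition(CompositeKey key, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sort comparator role: order by natural key first, then by value.
    static final Comparator<CompositeKey> SORT_ORDER =
        Comparator.comparing((CompositeKey k) -> k.naturalKey)
                  .thenComparingInt(k -> k.value);

    // Grouping role (assumed): after sorting, collect values per natural key.
    // Because the input is already sorted, each group's values arrive in order.
    static Map<String, List<Integer>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            grouped.computeIfAbsent(k.naturalKey, unused -> new ArrayList<>()).add(k.value);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
            new CompositeKey("b", 3), new CompositeKey("a", 9),
            new CompositeKey("b", 1), new CompositeKey("a", 2));
        System.out.println(sortAndGroup(keys)); // prints {a=[2, 9], b=[1, 3]}
    }
}
```

The point of the design is that the framework only ever sorts keys, so moving the value into the key lets the existing sort do the work, while partitioning on the natural key alone keeps related records together.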