Running a cluster of virtual machines with Hadoop (HDFS + YARN) v2.4.1 and Spark v1.0.1 using Vagrant

Posted on August 3, 2014 by Jee Vang, Ph.D.

I was able to set up a cluster of virtual machines (VMs) running Hadoop v2.4.1 with HDFS and YARN as well as Spark v1.0.1. The cluster setup was possible using Vagrant v1.5.1 and VirtualBox v4.3.10. The project is open source using Apache v2.0 License and is available at GitHub. I created this project for multiple reasons.

to learn about creating a cluster of VMs with Vagrant,
to have a sandbox with Hadoop (HDFS + YARN),
to have a sandbox with Spark, and
to AVOID having to download the sandbox VMs from Cloudera and Hortonworks (it takes forever to download these VMs; I, myself, was ever only able to finish downloading the CDH sandbox once, and yet, after 24+ hours, the file was corrupted; morever, these VMs are are standalone VMs and do not emulate the cluster environment).

To use this project you will need to install Vagrant and VirtualBox. You will also need to install a git client to clone the project from GitHub. After you have the above installed, then you can change into the cloned directory and simply type in

vagrant up

Continue reading →

Installing a single-node Hadoop (v2.3.0) instance with YARN using Vagrant

Posted on March 5, 2014 by Jee Vang, Ph.D.

In this blog, I will show how to install a single-node Hadoop (v2.3.0) instance with YARN using Vagrant. You might think this is a crazy idea, given that HortonWorks and Cloudera offers free sandboxes with Hadoop. However, it’s not so crazy if you think about wanting to learn about how to actually do it yourself (DIY). There’s a lot that one can learn with a DIY approach (such as dependencies and minimal requirements). Also, I find these sandboxes quite confusing (where’s Hadoop actually installed; you might find files are all over the place) and resembles bloatware (Spark, Hue, Impala, etc…). Furthermore, I found the installation documentation on Hadoop unclear, and I just had to figure out for myself what’s involved. To follow along in this blog, you will need to download the following software.

VirtualBox v4.3.6
Vagrant v1.4.3

The first thing you need to do is install VirtualBox. The second thing you need to do is install Vagrant. Next, on the command-line, add the required Vagrant box.

vagrant box add centos65 https://github.com/2creatives/vagrant-centos/releases/download/v6.5.1/centos65-x86_64-20131205.box

Then, using your favorite Git client, check out the Vagrant project from GitHub at https://github.com/vangj/vagrant-hadoop-2.3.0.git. After you checkout the Vagrant project, go into this directory and simply type in the following.

vagrant up

Depending on your connection, it will take a while for the virtual machine (VM) to get created. The primary reason for the installation time is that after the VM is created, we have to download and install OpenJDK and Hadoop. The download of OpenJDK happens through using yum, while the download of Hadoop happens through the use of curl. The secondary reason is that I couldn’t store the Hadoop archive on GitHub (GitHub does not allow files larger than 50 MB), so, the workaround is to have Vagrant execute a script to download Hadoop.

After the VM finishes being created, you can SSH into the VM by typing the following.

vagrant ssh

When you are done with the VM, you can destroy it by using the following command.

vagrant destroy

But, before you destroy the VM, you may verify that Hadoop was successfully installed by pointing your browsers to the following URLs.

Note that the URLs are pointing to localhost and NOT the VM. The reason why this is possible is because Vagrant can setup port forwarding from your desktop to the VM. This feature is another reason why Vagrant is an awesome product.

You should also try the hdfs shell command.

hdfs dfs -ls /

Well, that is it for this blog. I hope and expect that we all can easily and at-will now setup our own sandboxes of Hadoop with YARN using Vagrant and VirtualBox. Now, we can move onto real fun things like building applications to run on YARN.

As always, cheers!

Implementing RawComparator will speed up your Hadoop Map/Reduce (MR) Jobs

Posted on March 30, 2012 by Jee Vang, Ph.D.

Introduction

Implementing the org.apache.hadoop.io.RawComparator interface will definitely help speed up your Map/Reduce (MR) Jobs. As you may recall, a MR Job is composed of receiving and sending key-value pairs. The process looks like the following.

(K1,V1) –> Map –> (K2,V2)
(K2,List[V2]) –> Reduce –> (K3,V3)

The key-value pairs (K2,V2) are called the intermediary key-value pairs. They are passed from the mapper to the reducer. Before these intermediary key-value pairs reach the reducer, a shuffle and sort step is performed. The shuffle is the assignment of the intermediary keys (K2) to reducers and the sort is the sorting of these keys. In this blog, by implementing the RawComparator to compare the intermediary keys, this extra effort will greatly improve sorting. Sorting is improved because the RawComparator will compare the keys by byte. If we did not use RawComparator, the intermediary keys would have to be completely deserialized to perform a comparison.
Continue reading →

Controlling logging on Amazon Web Service’s (AWS) Elastic MapReduce (EMR) on a multi-node cluster

Posted on March 27, 2012 by Jee Vang, Ph.D.

Introduction

Last time, I talked about controlling logging on Amazon Web Service’s (AWS) Elastic MapReduce (EMR). However, that approach works only when you provision an EMR cluster of 1 node and need to get the log files from that 1 node. In this blog, I will talk about how to control logging for an EMR cluster of more than 1 node. In fact, you should probably read the original article before reading this one because the approach is identical except for that the way you pull/acquire your custom log files is different.
Continue reading →

An approach to controlling logging on Amazon Web Services’ (AWS) Elastic MapReduce (EMR)

Posted on March 24, 2012 by Jee Vang, Ph.D.

Intro

I have been using the Elastic MapReduce (EMR) product. EMR is one of many products and services available from Amazon Web Services (AWS). EMR is AWS’s product to dynamically provision a Hadoop cluster. One problem I ran into was how to control logging. Hadoop uses Apache Commons Logging. Both Hadoop and AWS seem to encourage using Log4j as the actual logging implementation. This blog assumes you are familiar with all these components.

Apache Commons Logging
Log4j
Hadoop v0.20.x
AWS EMR
AWS Simple Storage Service (S3)

One problem I ran into was how to control logging using Apache Commons Logging and Log4j. What I wanted to do was to simply be able to specify logging properties for Log4j. I asked for help on the AWS EMR discussion forum, but for several days (and thus far), no one responded (where did all the AWS/EMR evangelists go?). From digging around the AWS site and other discussion threads, I did get some scattered “hints” of how to go about controlling logging. Still, when I searched for a more detailed explanation (on the general internet) putting all the parts together, I did not see anything helpful or instructive. But, with some perseverance, I did pull through and found a way, reported in this blog, to control logging on an AWS EMR Hadoop cluster.

As you may already know, there is a conf directory inside the root of the Hadoop installation directory (i.e. /home/hadoop/conf). Inside this directory, is a log4j.properties file (i.e. /home/hadoop/conf/log4j.properties). This file is precisely where you need to make modifications to control logging.
Continue reading →

Secondary sorting aka sorting values in Hadoop’s Map/Reduce programming paradigm

Posted on March 20, 2012 by Jee Vang, Ph.D.

Introduction

Sometimes, we would like to sort the values coming into the Reducer of a Hadoop Map/Reduce (MR) Job. You can indirectly sort the values by using a combination of implementations. They are as follows.

Use a composite key.
Extend org.apache.hadoop.mapreduce.Partitioner.
Extend org.apache.hadoop.io.WritableComparator.

Other tutorials that explains this approach on sorting values going into a Reducer are explained in the links below. In this blog, I summarize what I have learned from the links below and also provide a self-contained example. The main difference between this blog and the links below is that I will show how to do this using the new M/R API (i.e. org.apache.hadoop.mapreduce.*).

Continue reading →

The “in-mapper combining” design pattern for Map/Reduce programming in Java

Posted on March 7, 2012 by Jee Vang, Ph.D.

Introduction

I am reading a book by (Lin and Dyer 2010). This book is very informative about designing efficient algorithms under the Map/Reduce (M/R) programming paradigm. Of particular interest is the “in-mapper combining” design pattern that I came across while reading this book. As if engineers and data miners did not have to change their way of thinking enough while adapting to the M/R programming paradigm, our change in thinking and development must also be sensitive to the particular M/R framework as well. The in-mapper combining design pattern is meant to address some issues with M/R programming, and in particular, M/R programming under the Hadoop platform. In this blog I will discuss this in-mapper combining design patterns and show some examples. This design pattern seems to me an excellent technical screening problem—if you are so (un)fortunate. 🙂 Hereafter, I will refer to the in-mapper combining design pattern with the acronym IMCDP.
Continue reading →

Map/Reduce Text Mining Toolkit (MRTMT) version 0.3 Released

Posted on March 7, 2012 by Jee Vang, Ph.D.

I am going to continue on the Map/Reduce Text Mining Toolkit (MRTMT) API in this blog post. I have worked on it a little bit more, and now I will be releasing v0.3. The added improvements include allowing the user to specify the local and global weights used to build the vector space model (VSM).
Continue reading →

Map/Reduce Text Mining Toolkit (MRTMT) version 0.2 Released

Posted on March 6, 2012 by Jee Vang, Ph.D.

I recently released MRTMT v0.1. In that article, I stated there were still a lot of work to be done. Since, I have simplified the process of building a vector space model (VSM) in a newer version, MRTMT v0.2. The full source code may be downloaded here. Again, this API is released under the Apache 2.0 license.

Continue reading →

A Simple Toolkit to Create a Vector Space Model using Map/Reduce

Posted on March 5, 2012 by Jee Vang, Ph.D.

In this blog I will discuss a simple toolkit you may use to create a vector space model (VSM) (Salton 75). The toolkit is called, Map/Reduce Text Mining Toolkit (MRTMT), however, for now, its accomplishments does not entirely cover the scope of text mining and just merely creating a VSM from text documents.

The purpose of MRTMT is to create a VSM from a very large corpus of documents using Hadoop’s M/R programming paradigm. For a smaller corpus of document, please visit Word Vector Tool.

Continue reading →

Jee’s Blog

Hard to ask questions and hard to find answers

Category Archives: Hadoop

Running a cluster of virtual machines with Hadoop (HDFS + YARN) v2.4.1 and Spark v1.0.1 using Vagrant

Installing a single-node Hadoop (v2.3.0) instance with YARN using Vagrant

Implementing RawComparator will speed up your Hadoop Map/Reduce (MR) Jobs

Introduction

Controlling logging on Amazon Web Service’s (AWS) Elastic MapReduce (EMR) on a multi-node cluster

Introduction

An approach to controlling logging on Amazon Web Services’ (AWS) Elastic MapReduce (EMR)

Intro

Secondary sorting aka sorting values in Hadoop’s Map/Reduce programming paradigm

Introduction

The “in-mapper combining” design pattern for Map/Reduce programming in Java

Introduction

Map/Reduce Text Mining Toolkit (MRTMT) version 0.3 Released

Map/Reduce Text Mining Toolkit (MRTMT) version 0.2 Released

A Simple Toolkit to Create a Vector Space Model using Map/Reduce

Share this:

Introduction

Share this:

Introduction

Share this:

Intro

Share this:

Introduction

Share this:

Introduction

Share this:

Share this:

Share this:

Share this: