Running a cluster of virtual machines with Hadoop (HDFS + YARN) v2.4.1 and Spark v1.0.1 using Vagrant
I was able to set up a cluster of virtual machines (VMs) running Hadoop v2.4.1 (HDFS and YARN) as well as Spark v1.0.1. The cluster was set up using Vagrant v1.5.1 and VirtualBox v4.3.10. The project is open source under the Apache v2.0 License and is available on GitHub. I created this project for several reasons:
- to learn about creating a cluster of VMs with Vagrant,
- to have a sandbox with Hadoop (HDFS + YARN),
- to have a sandbox with Spark, and
- to AVOID having to download the sandbox VMs from Cloudera and Hortonworks (it takes forever to download these VMs; I was only ever able to finish downloading the CDH sandbox once, and even then, after 24+ hours, the file was corrupted; moreover, these are standalone VMs and do not emulate a cluster environment).
To use this project, you will need to install Vagrant and VirtualBox. You will also need a Git client to clone the project from GitHub. After you have the above installed, change into the cloned directory and simply type in the following.
vagrant up
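For reference, a multi-machine setup of the kind this project performs can be sketched in a Vagrantfile roughly as follows. This is a minimal illustration, not the project's actual Vagrantfile; the box name, hostnames, and IP addresses are placeholders.

```ruby
# Minimal multi-VM sketch (illustrative values, not the project's actual
# Vagrantfile): three nodes defined in a loop on a private network.
Vagrant.configure("2") do |config|
  config.vm.box = "centos65" # placeholder box name

  ["node1", "node2", "node3"].each_with_index do |name, i|
    config.vm.define name do |node|
      node.vm.hostname = name
      node.vm.network :private_network, ip: "10.211.55.#{100 + i}"
    end
  end
end
```

Each `config.vm.define` block yields one VM, so a single `vagrant up` brings the whole cluster online.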
Filed under: bash, Big Data, Cloud Computing, Hadoop, Ruby, Vagrant, Virtual Machines
The Bayesian Dirichlet (BD) scoring function is defined as follows.
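Following [Cooper92], in its K2 special case (uniform Dirichlet priors, which is what the count-based examples below correspond to), the score of a network structure $B_S$ given a database $D$ is:

```latex
P(B_S, D) \;=\; P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i}
\frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!
```

where $n$ is the number of variables, $r_i$ is the number of states of $X_i$, $q_i$ is the number of configurations of the parents of $X_i$, $N_{ijk}$ is the number of cases in which $X_i$ takes its $k$-th state while its parents take their $j$-th configuration, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$. In practice, the logarithm of this product is computed to avoid numerical underflow.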
Let’s see how we may quickly use these APIs to compute the score of a Bayesian Belief Network (BBN). In [Cooper92], a set of data with three variables (X1, X2, X3) was given as follows.
There were also 3 Bayesian network structures (BS) representing the relationships among the variables. Those 3 BS were reported as follows.
- BS1: X1 → X2 → X3
- BS2: X2 ← X1 → X3
- BS3: X1 ← X2 ← X3
In Java, we can use the API to quickly estimate the scores of BS1, BS2, and BS3 as follows.
double bs1 = (new BayesianDirchletBuilder())
  .addKutato(5, 5) //X1
  .addKutato(1, 4) //X2
  .addKutato(4, 1)
  .addKutato(0, 5) //X3
  .addKutato(4, 1)
  .build()
  .get();

double bs2 = (new BayesianDirchletBuilder())
  .addKutato(5, 5) //X1
  .addKutato(1, 4) //X2
  .addKutato(4, 1)
  .addKutato(2, 3) //X3
  .addKutato(4, 1)
  .build()
  .get();

double bs3 = (new BayesianDirchletBuilder())
  .addKutato(1, 4) //X1
  .addKutato(4, 1)
  .addKutato(0, 4) //X2
  .addKutato(5, 1)
  .addKutato(6, 4) //X3
  .build()
  .get();
In JavaScript, the usage is similar.

var bs1 = (new BayesianDirichletBuilder())
  .addKutato([5, 5])
  .addKutato([1, 4])
  .addKutato([4, 1])
  .addKutato([0, 5])
  .addKutato([4, 1])
  .build()
  .get();

var bs2 = (new BayesianDirichletBuilder())
  .addKutato([5, 5])
  .addKutato([1, 4])
  .addKutato([4, 1])
  .addKutato([2, 3])
  .addKutato([4, 1])
  .build()
  .get();

var bs3 = (new BayesianDirichletBuilder())
  .addKutato([1, 4])
  .addKutato([4, 1])
  .addKutato([0, 4])
  .addKutato([5, 1])
  .addKutato([6, 4])
  .build()
  .get();
Notice how, in both APIs, you only add the counts. Easy.
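For intuition, here is a hypothetical from-scratch sketch of what such a builder accumulates for binary variables (the class and method names below are mine, not part of the API above): each pair of counts (a, b) for one variable under one parent configuration contributes the K2 family term a! b! / (a + b + 1)!, and the score is the product of these terms.

```java
// Hypothetical sketch of the K2 family term for binary variables
// (r = 2, so (r - 1)! = 1). Names are illustrative, not the real API.
public class K2Score {

    // K2 term for one variable under one parent configuration,
    // with counts a (state 1) and b (state 2): a! * b! / (a + b + 1)!
    public static double family(int a, int b) {
        return factorial(a) * factorial(b) / factorial(a + b + 1);
    }

    static double factorial(int n) {
        double f = 1.0;
        for (int i = 2; i <= n; i++) f *= i;
        return f;
    }

    public static void main(String[] args) {
        // BS1: X1 -> X2 -> X3, using the same counts passed to addKutato above
        double bs1 = family(5, 5)                  // X1
                   * family(1, 4) * family(4, 1)   // X2 | X1
                   * family(0, 5) * family(4, 1);  // X3 | X2
        // BS2: X2 <- X1 -> X3
        double bs2 = family(5, 5)
                   * family(1, 4) * family(4, 1)
                   * family(2, 3) * family(4, 1);  // X3 | X1
        System.out.println("bs1 = " + bs1 + ", bs2 = " + bs2);
    }
}
```

Running this on the counts above shows bs1 is exactly ten times bs2, since the two structures differ only in the X3 family.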
As always, enjoy and cheers! Sib ntsib dua nawb mog! (See you again!)
- [Cooper92] G.F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347 (1992).
In this blog, I will show how to install a single-node Hadoop (v2.3.0) instance with YARN using Vagrant. You might think this is a crazy idea, given that Hortonworks and Cloudera offer free sandboxes with Hadoop. However, it's not so crazy if you want to learn how to actually do it yourself (DIY). There's a lot one can learn with a DIY approach (such as dependencies and minimal requirements). Also, I find these sandboxes quite confusing (where is Hadoop actually installed? you might find files all over the place), and they resemble bloatware (Spark, Hue, Impala, etc.). Furthermore, I found the installation documentation on Hadoop unclear, and I just had to figure out for myself what's involved. To follow along in this blog, you will need to download the following software.
- VirtualBox v4.3.6
- Vagrant v1.4.3
The first thing you need to do is install VirtualBox. The second thing you need to do is install Vagrant. Next, on the command-line, add the required Vagrant box.
vagrant box add centos65 https://github.com/2creatives/vagrant-centos/releases/download/v6.5.1/centos65-x86_64-20131205.box
Then, using your favorite Git client, clone the Vagrant project from GitHub at https://github.com/vangj/vagrant-hadoop-2.3.0.git. After you clone the project, go into its directory and simply type in the following.
vagrant up
Depending on your connection, it will take a while for the virtual machine (VM) to be created. The primary reason for the installation time is that after the VM is created, we have to download and install OpenJDK and Hadoop. OpenJDK is downloaded via yum, while Hadoop is downloaded via curl. The secondary reason is that I couldn't store the Hadoop archive on GitHub (GitHub does not allow files larger than 50 MB), so the workaround is to have Vagrant execute a script to download Hadoop.
After the VM finishes being created, you can SSH into it by typing the following.
vagrant ssh
When you are done with the VM, you can destroy it with the following command.
vagrant destroy
But, before you destroy the VM, you may verify that Hadoop was successfully installed by pointing your browser to the Hadoop web UIs; assuming the default Hadoop web ports are forwarded, these are http://localhost:50070 (NameNode) and http://localhost:8088 (ResourceManager).
Note that the URLs point to localhost and NOT the VM. This is possible because Vagrant can set up port forwarding from your desktop to the VM. This feature is another reason why Vagrant is an awesome product.
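Port forwarding can be declared in the Vagrantfile roughly as follows. This is a minimal sketch, not the project's actual Vagrantfile; the guest ports shown are the Hadoop 2.x web UI defaults, and the host ports are illustrative.

```ruby
# Illustrative port-forwarding sketch: forward the Hadoop 2.x default
# web UI ports from the guest VM to the host machine.
Vagrant.configure("2") do |config|
  config.vm.network :forwarded_port, guest: 50070, host: 50070 # NameNode UI
  config.vm.network :forwarded_port, guest: 8088,  host: 8088  # ResourceManager UI
end
```

With these lines in place, requests to localhost on the host are transparently routed to the corresponding services inside the VM.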
You should also try the hdfs shell command.
hdfs dfs -ls /
Well, that is it for this blog. I hope we can all now easily set up our own Hadoop-with-YARN sandboxes at will using Vagrant and VirtualBox. Now we can move on to the real fun, like building applications to run on YARN.
As always, cheers!
Filed under: Hadoop, Java
Recently, the Data Science DC Meetup group held a competition for their members to visualize their RSVP data. I did not have too much time, but I took a stab at visualizing the data. In one approach, I simply clustered the Meetup event titles into 4 groups using the k-means algorithm, and from each group/cluster, I created word clouds. In another approach, I built an n-by-n co-occurrence matrix, where n is the number of members and cell (i, j) holds the number of times the i-th and j-th members attended the same Meetup event. From this matrix, I built a maximum weight spanning tree (MWST), where each vertex corresponded to a member and each edge was weighted by the co-occurrence value. I then visualized this MWST using the Yifan Hu layout algorithm. The original data and visualizations may be downloaded below.
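The co-occurrence plus MWST construction can be sketched as follows. This is a hypothetical from-scratch illustration (the class and method names are mine, not from the original analysis): the matrix counts shared events per member pair, and Kruskal's algorithm run on edges sorted by descending weight yields a maximum weight spanning tree.

```java
import java.util.*;

// Hypothetical sketch of the co-occurrence matrix + MWST approach.
public class CoAttendance {

    // attendance.get(m) = set of event ids member m attended;
    // cell (i, j) = number of events members i and j both attended.
    public static int[][] cooccurrence(List<Set<Integer>> attendance) {
        int n = attendance.size();
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                Set<Integer> shared = new HashSet<>(attendance.get(i));
                shared.retainAll(attendance.get(j));
                c[i][j] = c[j][i] = shared.size();
            }
        return c;
    }

    // Kruskal's algorithm with edges sorted by DESCENDING weight gives a
    // maximum weight spanning tree; returns edges as {i, j, weight}.
    public static List<int[]> mwst(int[][] c) {
        int n = c.length;
        List<int[]> edges = new ArrayList<>();
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (c[i][j] > 0) edges.add(new int[]{i, j, c[i][j]});
        edges.sort((a, b) -> b[2] - a[2]);
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        List<int[]> tree = new ArrayList<>();
        for (int[] e : edges) {
            int ri = find(parent, e[0]), rj = find(parent, e[1]);
            if (ri != rj) { parent[ri] = rj; tree.add(e); }
        }
        return tree;
    }

    // union-find with path halving
    static int find(int[] p, int x) {
        while (p[x] != x) x = p[x] = p[p[x]];
        return x;
    }

    public static void main(String[] args) {
        List<Set<Integer>> att = List.of(Set.of(1, 2), Set.of(1, 2, 3), Set.of(3));
        for (int[] e : mwst(cooccurrence(att)))
            System.out.println(e[0] + " -- " + e[1] + " (w=" + e[2] + ")");
    }
}
```

The resulting edge list can then be handed to a graph tool for layout and rendering.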
As usual, cheers! Sib ntsib dua mog! (See you again!)
Filed under: Uncategorized
Tags: data mining, visualization