Last time, I talked about controlling logging on Amazon Web Services' (AWS) Elastic MapReduce (EMR). However, that approach works only when you provision an EMR cluster of one node and need to get the log files from that single node. In this blog, I will talk about how to control logging for an EMR cluster of more than one node. You should probably read the original article first, because the approach is identical; only the way you pull your custom log files differs.
Multiple log files
Unless I have missed some crucial detail, it seems that when you provision N instances (nodes) with AWS's EMR, where N > 1, one of these instances will be both the NameNode and the JobTracker. It will not be part of the task and data nodes (it will not have the TaskTracker and DataNode daemons running), so you will always have N-1 nodes available for Map/Reduce (MR) computations. If you provisioned N instances in an AWS EMR cluster and followed the steps of the previous blog post, you will find that when you SSH into the master node, your custom logs may be empty. Since the master node performs no MR computations, this is expected. What you really need to do is SSH into the slave nodes and grab the log files from there. You have to repeat this step for every single slave node.
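One way to confirm which role a node plays is to look at which Hadoop daemons it runs. The sketch below classifies a node from the output of jps (the JVM process lister that ships with the JDK); the sample output is illustrative, not captured from a real cluster, so on an actual node you would replace the hard-coded string with the real jps output.

```shell
#!/bin/sh
# Sketch: classify a node as master or worker from its running daemons.
# On a real node, replace the sample string below with: DAEMONS=$(jps)
DAEMONS="1234 DataNode
2345 TaskTracker"    # sample jps output for a worker node

case "$DAEMONS" in
  *TaskTracker*) echo "worker node: runs MR tasks, writes your custom logs" ;;
  *JobTracker*)  echo "master node: no MR computation, custom logs stay empty" ;;
esac
```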
SSH into task nodes
This blog assumes you have already successfully signed up for AWS EMR and generated an SSH key pair (this is the file ending with the .pem extension, e.g. mrhadoop.pem). Before you can use the SSH key pair to SSH into your task nodes, you need to authorize access to your instances. The master node already has a TCP rule allowing SSH access; the slave nodes do not. The slave nodes belong to a security group called ElasticMapReduce-slave. If you follow the instructions linked above, you will end up setting a TCP rule that allows SSH access to all your slave nodes.
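If you prefer the command line over the console, the same rule can be added with the AWS CLI's authorize-security-group-ingress command. The sketch below only echoes the command rather than running it, so it is safe to execute without credentials; drop the variable/echo indirection to actually apply the rule, and restrict the CIDR range to your own IP in practice.

```shell
#!/bin/sh
# Sketch: open TCP port 22 on the EMR slave security group.
# ElasticMapReduce-slave is the default slave group mentioned above.
GROUP="ElasticMapReduce-slave"
CMD="aws ec2 authorize-security-group-ingress --group-name $GROUP --protocol tcp --port 22 --cidr 0.0.0.0/0"
echo "$CMD"   # remove the echo/variable indirection to run it for real
```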
After you have authorized SSH access to your slave nodes, you may use SSH under Cygwin, PuTTY, or any other SSH client to access your slave instances. For example, from the Cygwin command line, I type the following.
ssh -i /cygdrive/c/Ruby187/elastic-mapreduce-cli/mrhadoop.pem firstname.lastname@example.org
The generic command line expression is as follows.
ssh -i [path/to/ssh/key/pair] [username]@[slave-public-dns-name]
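Since you have to repeat this for every slave, a small loop saves typing. The hostnames below are hypothetical placeholders; substitute the public DNS names of your own slaves (visible in the EMR console). The echo makes this a dry run that only prints each ssh command; remove it to actually open the connections. The hadoop login user is the default on EMR instances, to my knowledge.

```shell
#!/bin/sh
# Sketch: SSH into every slave node in turn (dry run).
KEY=/cygdrive/c/Ruby187/elastic-mapreduce-cli/mrhadoop.pem   # path from the example above
SLAVES="ec2-11-22-33-44.compute-1.amazonaws.com ec2-55-66-77-88.compute-1.amazonaws.com"   # placeholders

for host in $SLAVES; do
  echo ssh -i "$KEY" "hadoop@$host"   # drop 'echo' to actually connect
done
```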
After you SSH into your slave nodes, you may then use the hadoop dfs -copyFromLocal command to copy your log files to your S3 bucket (as before in the other blog).
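On each slave, the copy step might look like the sketch below. The log path and bucket name are hypothetical, so adjust them to wherever your Log4j configuration writes and to your own S3 bucket; hadoop dfs is the pre-YARN command used in the original post. The echo keeps this a dry run.

```shell
#!/bin/sh
# Sketch, run on a slave node: push a custom log file into S3.
LOG=/mnt/var/log/mrhadoop/custom.log                    # assumed log location
DEST="s3://my-log-bucket/logs/$(hostname)-custom.log"   # assumed bucket; hostname disambiguates slaves
echo hadoop dfs -copyFromLocal "$LOG" "$DEST"           # drop 'echo' to actually copy
```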
Summary and conclusion
An approach to controlling logging via Apache Commons Logging and Log4j on AWS's EMR was reported previously. In this blog, I expanded on controlling logging for EMR clusters with more than one node. The gist is that your custom log files are written only on the compute (task/slave) nodes. Before you can SSH into them, you have to authorize access to your slave instances. After you allow SSH access, you may use the hadoop dfs -copyFromLocal command to copy your log files over to S3.