An approach to controlling logging on Amazon Web Services’ (AWS) Elastic MapReduce (EMR)

Intro

I have been using Elastic MapReduce (EMR), one of the many products and services available from Amazon Web Services (AWS). EMR is AWS’s product for dynamically provisioning a Hadoop cluster. Hadoop uses Apache Commons Logging, and both Hadoop and AWS seem to encourage using Log4j as the actual logging implementation. This blog assumes you are familiar with all of these components.

One problem I ran into was how to control logging with Apache Commons Logging and Log4j. All I wanted was to be able to specify my own logging properties for Log4j. I asked for help on the AWS EMR discussion forum, but for several days (and thus far), no one has responded (where did all the AWS/EMR evangelists go?). Digging around the AWS site and other discussion threads turned up some scattered “hints” about how to go about controlling logging, but when I searched the wider internet for a more detailed explanation putting all the parts together, I did not find anything helpful or instructive. With some perseverance, I did pull through and found a way, reported in this blog, to control logging on an AWS EMR Hadoop cluster.

As you may already know, there is a conf directory inside the root of the Hadoop installation directory (i.e. /home/hadoop/conf). Inside this directory is a log4j.properties file (i.e. /home/hadoop/conf/log4j.properties). This file is precisely where you need to make modifications to control logging.

Bootstrap Action

When you provision a Hadoop cluster with EMR, you can specify bootstrap actions (up to 16 per job flow). If you want to modify log4j.properties, you need to specify a bootstrap action that overwrites the current log4j.properties with your own. Below is an example of how to create a Hadoop cluster (or Job Flow) with EMR, with a bootstrap action that overwrites the default log4j.properties file with your own. You will need Ruby and Amazon’s EMR Ruby Client.

ruby elastic-mapreduce --create --name j-small --alive --enable-debugging --log-uri s3n://mrhadoop-uri/log/ --bootstrap-action "s3://mrhadoop-uri/copy-log4j.sh"

What you really need to pay attention to is the command-line option --bootstrap-action “s3://mrhadoop-uri/copy-log4j.sh”. The --bootstrap-action flag takes the S3 location of the script you want to run; in this case, I want to run the copy-log4j.sh script located in my S3 bucket, mrhadoop-uri. Below is an example of what copy-log4j.sh might look like for you.

#!/bin/bash
hadoop dfs -copyToLocal s3n://mrhadoop-uri/log4j.properties /home/hadoop/log4j.properties
cp -f /home/hadoop/log4j.properties /home/hadoop/conf/log4j.properties
mkdir /home/hadoop/logs

As you can see in this script, the first thing I do is copy log4j.properties from my S3 location to the local file system using the “hadoop dfs -copyToLocal” command. Next, I force-copy the new log4j.properties file over the current one. Finally, I create a /home/hadoop/logs directory, since my log4j.properties file specifies that directory as the location to which my log files will be written.
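Of course, both copy-log4j.sh and your custom log4j.properties must already exist in the S3 bucket before you create the job flow. Below is a minimal sketch of how you might upload them from your local machine; it assumes you have s3cmd installed and configured (any other S3 client, or the AWS Management Console, works just as well).

# Upload the bootstrap script and the custom log4j.properties to the bucket
# used in the examples above (s3cmd is an assumption; use any S3 client).
s3cmd put copy-log4j.sh s3://mrhadoop-uri/copy-log4j.sh
s3cmd put log4j.properties s3://mrhadoop-uri/log4j.properties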

Below is a snippet of my log4j.properties file. As you can see, I have three rolling file appenders: one each for my mappers, reducers, and jobs.

log4j.logger.net.mrtmt.mapper=DEBUG,R2
log4j.logger.net.mrtmt.reducer=DEBUG,R3
log4j.logger.net.mrtmt.job=DEBUG,R5

log4j.appender.R2=org.apache.log4j.RollingFileAppender
log4j.appender.R2.File=/home/hadoop/logs/map.log
log4j.appender.R2.MaxFileSize=10MB
log4j.appender.R2.MaxBackupIndex=10
log4j.appender.R2.layout=org.apache.log4j.PatternLayout
log4j.appender.R2.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.R2.Threshold=ALL

log4j.appender.R3=org.apache.log4j.RollingFileAppender
log4j.appender.R3.File=/home/hadoop/logs/reduce.log
log4j.appender.R3.MaxFileSize=10MB
log4j.appender.R3.MaxBackupIndex=10
log4j.appender.R3.layout=org.apache.log4j.PatternLayout
log4j.appender.R3.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.R3.Threshold=ALL

log4j.appender.R5=org.apache.log4j.RollingFileAppender
log4j.appender.R5.File=/home/hadoop/logs/job.log
log4j.appender.R5.MaxFileSize=10MB
log4j.appender.R5.MaxBackupIndex=10
log4j.appender.R5.layout=org.apache.log4j.PatternLayout
log4j.appender.R5.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.R5.Threshold=ALL

At this point, you can go ahead and submit your Map/Reduce (MR) Job. An example is shown below.

ruby elastic-mapreduce --jobflow <job-id> --json <json-step-file>

Getting the log files

At this point, your EMR Hadoop cluster should be set up to log per your specification. The natural question is, how do I get these log files back? You may SSH into your EMR Hadoop cluster and use Hadoop’s copy utility command to get your log files. To SSH into your EMR Hadoop instance, do the following.

ruby elastic-mapreduce --ssh --jobflow <job-id>

Now you are in a command-line shell on the cluster. From there, go ahead and transfer the log files to your S3 bucket.

hadoop dfs -copyFromLocal /home/hadoop/logs/map.log s3n://mrhadoop-uri/map.log
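If you want to grab all three log files defined in the log4j.properties snippet above in one go, a small loop will do (a sketch, using the same mrhadoop-uri bucket):

for f in map.log reduce.log job.log; do
  hadoop dfs -copyFromLocal /home/hadoop/logs/$f s3n://mrhadoop-uri/$f
done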

Future work

Future work (and this work should be contributed by Amazon itself, IMO) would be a Log4j appender that logs directly to S3. With such an appender, all you would really need to do is specify the bootstrap action to copy the log4j.properties file over. In fact, here’s a link, http://aws.amazon.com/code/4030, to a project that does something like what I am talking about. If you click on “download”, it’s a dead link. If you click on the location of the source code, http://code.google.com/p/nebulent-log4j-appenders/, and browse around the SVN repository, there is no code to be found. So, until someone creates an appender that is able to write to S3, what I presented here is a way to control logging and acquire the log output files. It is multi-step, and because of that, error-prone, but it should suffice until something better comes along.

Warning

You probably do NOT want to completely overwrite the original log4j.properties file. What you may actually want to do is modify it (i.e. add additional appenders or specify log levels for different classes/packages). To get a copy of the original file to modify, what I actually did was the following (a command sketch follows the list):

  1. create the AWS EMR Hadoop instance,
  2. SSH into the Hadoop instance,
  3. use Hadoop’s copy utility to copy the original /home/hadoop/conf/log4j.properties file to my S3 bucket, and
  4. download log4j.properties from my S3 bucket to my computer, modify it, and upload it back to S3.
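In command form, steps 2 through 4 look roughly like the sketch below. The s3cmd calls at the end are an assumption (any S3 client or the AWS console will do); the bucket and job flow id are the ones used throughout this blog.

# Step 2: SSH into the job flow.
ruby elastic-mapreduce --ssh --jobflow <job-id>

# Step 3 (on the cluster): copy the original file to the S3 bucket.
hadoop dfs -copyFromLocal /home/hadoop/conf/log4j.properties s3n://mrhadoop-uri/log4j.properties

# Step 4 (back on your own computer): download, modify, and re-upload.
s3cmd get s3://mrhadoop-uri/log4j.properties
# ... edit log4j.properties ...
s3cmd put log4j.properties s3://mrhadoop-uri/log4j.properties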

Just to save you the time and trouble, here is the default /home/hadoop/conf/log4j.properties file.

# Define some default values that can be overridden by system properties
hadoop.root.logger=INFO,console
hadoop.log.dir=.
hadoop.log.file=hadoop.log

#
# Job Summary Appender 
#
# Use following logger to send summary to separate file defined by 
# hadoop.mapreduce.jobsummary.log.file rolled daily:
# hadoop.mapreduce.jobsummary.logger=INFO,JSA
# 
hadoop.mapreduce.jobsummary.logger=${hadoop.root.logger}
hadoop.mapreduce.jobsummary.log.file=hadoop-mapreduce.jobsummary.log
#AWS SDK Logging
log4j.logger.com.amazonaws=WARN

# Define the root logger to the system property "hadoop.root.logger".
log4j.rootLogger=${hadoop.root.logger}, EventCounter

# Logging Threshold
log4j.threshhold=ALL

#
# Daily Rolling File Appender
#

log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}

# Rollver at every hour
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd-HH

# 30-day backup
#log4j.appender.DRFA.MaxBackupIndex=30
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout

# Pattern format: Date LogLevel LoggerName LogMessage
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c (%t): %m%n
# Debugging Pattern format
#log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} (%F:%M(%L)) - %m%n


#
# console
# Add "console" to rootlogger above if you want to use this 
#

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n

#
# TaskLog Appender
#

#Default values
hadoop.tasklog.taskid=null
hadoop.tasklog.iscleanup=false
hadoop.tasklog.noKeepSplits=4
hadoop.tasklog.totalLogFileSize=100
hadoop.tasklog.purgeLogSplits=true
hadoop.tasklog.logsRetainHours=12

log4j.appender.TLA=org.apache.hadoop.mapred.TaskLogAppender
log4j.appender.TLA.taskId=${hadoop.tasklog.taskid}
log4j.appender.TLA.isCleanup=${hadoop.tasklog.iscleanup}
log4j.appender.TLA.totalLogFileSize=${hadoop.tasklog.totalLogFileSize}

log4j.appender.TLA.layout=org.apache.log4j.PatternLayout
log4j.appender.TLA.layout.ConversionPattern=%d{ISO8601} %p %c (%t): %m%n

#
#Security audit appender
#
hadoop.security.log.file=SecurityAuth.audit
log4j.appender.DRFAS=org.apache.log4j.DailyRollingFileAppender 
log4j.appender.DRFAS.File=${hadoop.log.dir}/${hadoop.security.log.file}

log4j.appender.DRFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
#new logger
log4j.logger.SecurityLogger=OFF,console
log4j.logger.SecurityLogger.additivity=false

#
# Rolling File Appender
#

#log4j.appender.RFA=org.apache.log4j.RollingFileAppender
#log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}

# Logfile size and and 30-day backups
#log4j.appender.RFA.MaxFileSize=1MB
#log4j.appender.RFA.MaxBackupIndex=30

#log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
#log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
#log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} (%F:%M(%L)) - %m%n

#
# FSNamesystem Audit logging
# All audit events are logged at INFO level
#
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN

# Custom Logging levels

hadoop.metrics.log.level=INFO
#log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
#log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG
log4j.logger.org.apache.hadoop.metrics2=${hadoop.metrics.log.level}

# Jets3t library
log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=ERROR

#
# Null Appender
# Trap security logger on the hadoop client side
#
log4j.appender.NullAppender=org.apache.log4j.varia.NullAppender

#
# Event Counter Appender
# Sends counts of logging messages at different severity levels to Hadoop Metrics.
#
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter

#
# Job Summary Appender
#
log4j.appender.JSA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.JSA.File=${hadoop.log.dir}/${hadoop.mapreduce.jobsummary.log.file}
log4j.appender.JSA.layout=org.apache.log4j.PatternLayout
log4j.appender.JSA.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.JSA.DatePattern=.yyyy-MM-dd-HH
log4j.logger.org.apache.hadoop.mapred.JobInProgress$JobSummary=${hadoop.mapreduce.jobsummary.logger}
log4j.additivity.org.apache.hadoop.mapred.JobInProgress$JobSummary=false

Summary and conclusion

This blog was about controlling the logging output of Log4j on an AWS EMR Hadoop cluster using a bootstrap action. Future work should be to write a Log4j appender that logs directly to S3. IMO, this work should be done by Amazon: it would help all users of AWS EMR, and since Amazon charges for data transfer in and out of S3, writing such an S3 appender is also in the interest of their profits. An attempt of some sort has been made, but there is no code to show for the effort. As always, happy coding, and I hope you find this blog helpful.

See you again!
