The Hadoop 2.x release involves many changes to Hadoop and MapReduce. The centralized JobTracker service is replaced with a ResourceManager that manages the resources in the cluster and a per-application ApplicationMaster that manages the application lifecycle. These architectural changes enable Hadoop to scale to much larger clusters. The new release also has minor changes to the scripts, directories and environment variables needed to get started. This is a getting-started tutorial for 2.x. The intended audience is someone who is completely new to Hadoop and needs a jumpstart, or someone who has played a little with a previous version and wants to start using 2.x. The emphasis is on getting Hadoop running, not on explaining concepts, which is covered in many other blogs.
In this tutorial we will
(1) Set up Hadoop in a single-node environment
(2) Create and move files to HDFS
(3) Write and execute a simple MapReduce Application
Step 1: Download and install Hadoop
Download the current 2.x.x release from http://hadoop.apache.org/releases.html.
I downloaded hadoop-2.0.1-alpha.tar.gz.
Untar the file to a directory, say ~/hadoop-2.0.1-alpha.
Step 2: Set the following environment variables
hadoop-2.0.1-alpha$ export HADOOP_HOME=~/hadoop-2.0.1-alpha
hadoop-2.0.1-alpha$ export HADOOP_MAPRED_HOME=~/hadoop-2.0.1-alpha
hadoop-2.0.1-alpha$ export HADOOP_COMMON_HOME=~/hadoop-2.0.1-alpha
hadoop-2.0.1-alpha$ export HADOOP_HDFS_HOME=~/hadoop-2.0.1-alpha
hadoop-2.0.1-alpha$ export YARN_HOME=~/hadoop-2.0.1-alpha
hadoop-2.0.1-alpha$ export HADOOP_CONF_DIR=~/hadoop-2.0.1-alpha/etc/hadoop
If these environment variables are not set up correctly, Step 4 might fail.
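If you do not want to re-export these every time you open a new shell, you can put the same exports in your shell profile. A minimal sketch, assuming a bash shell whose profile file is ~/.bashrc (on a Mac it may be ~/.bash_profile instead):
export HADOOP_HOME=~/hadoop-2.0.1-alpha
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
After adding the lines, run source ~/.bashrc (or open a new shell) and confirm with echo $HADOOP_HOME.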
Step 3: Update the configuration files
hdfs-site.xml
Add the following configuration to etc/hadoop/hdfs-site.xml. If you do not set dfs.namenode.name.dir and dfs.datanode.data.dir explicitly, Hadoop will default to a temp directory that the OS may clean up on restart and you will lose data. This is a common omission for newbies.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/Users/joe/hadoop-hdfs201/data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/Users/joe/hadoop-hdfs201/data/hdfs/datanode</value>
</property>
</configuration>
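The paths above are just examples from my machine; point dfs.namenode.name.dir and dfs.datanode.data.dir at any location your user can write to. Hadoop will normally create these directories when the namenode is formatted and the datanode starts, but creating them up front avoids permission surprises. A sketch, assuming the same paths as the config above:
mkdir -p /Users/joe/hadoop-hdfs201/data/hdfs/namenode
mkdir -p /Users/joe/hadoop-hdfs201/data/hdfs/datanode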
core-site.xml
Add the following to etc/hadoop/core-site.xml.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
yarn-site.xml
Add the following to etc/hadoop/yarn-site.xml.
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
mapred-site.xml
Add the following to etc/hadoop/mapred-site.xml.
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step 4: Start the processes.
Change to the directory where hadoop is installed.
cd ~/hadoop-2.0.1-alpha
If you are running Hadoop for the first time, the following command will format HDFS. Do not run this every time, as formatting deletes any existing data.
hadoop-2.0.1-alpha$ bin/hadoop namenode -format
Start the namenode.
hadoop-2.0.1-alpha$ sbin/hadoop-daemon.sh start namenode
Start the datanode.
hadoop-2.0.1-alpha$ sbin/hadoop-daemon.sh start datanode
In Hadoop 2.x, there is no JobTracker. Instead there is a ResourceManager and a NodeManager.
Start the resourcemanager.
hadoop-2.0.1-alpha$ sbin/yarn-daemon.sh start resourcemanager
Start the nodemanager.
hadoop-2.0.1-alpha$ sbin/yarn-daemon.sh start nodemanager
Start the history server.
hadoop-2.0.1-alpha$ sbin/mr-jobhistory-daemon.sh start historyserver
Type jps. It lists the running Java processes. Check that all the processes have started.
hadoop-2.0.1-alpha$ jps
1380 DataNode
1558 Jps
1433 ResourceManager
1536 JobHistoryServer
1335 NameNode
1849 NodeManager
I once ran into a problem where a MapReduce job was accepted but never executed. Looking at the logs, I found that the NodeManager had not started. The jps command is a good check that all the necessary processes are running.
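If a process is missing, its log file is the first place to look. The daemons write their logs under the logs directory of the install; the exact file names include your user name and host name, so the ones below are only an illustration:
hadoop-2.0.1-alpha$ ls logs/
hadoop-2.0.1-alpha$ tail -50 logs/yarn-joe-nodemanager-joe.local.log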
Step 5: Get familiar with HDFS
The HDFS shell commands are documented in the docs of the older Hadoop releases as well, for example:
http://hadoop.apache.org/docs/r1.0.3/file_system_shell.html
hadoop-2.0.1-alpha$ bin/hadoop fs -ls
will list your HDFS home directory. If you are user joe, your home directory in HDFS is /user/joe, and any files or directories you create with a relative path will be created there.
hadoop-2.0.1-alpha$ bin/hadoop fs -mkdir /user/joe/input
creates the directory input under /user/joe.
In the local filesystem, create a file app.log with the following data:
user01|1|2|3|4|5
user02|1|2|3|4|5
user03|1|2|3|4|5
user01|1|2|3|4|5
user02|1|2|3|4|5
user01|1|2|3|4|5
user03|1|2|3|4|5
user01|1|2|3|4|5
user04|1|2|3|4|5
user01|1|2|3|4|5
Let us pretend this is a log file from a web application where, for each request, we have logged the user id and some additional data. We will later use this as input for a MapReduce program.
You can move it to HDFS using the command
hadoop-2.0.1-alpha$ bin/hadoop fs -moveFromLocal ~/projects/app.log /user/joe/input/
To print the file just moved to HDFS:
hadoop-2.0.1-alpha$ bin/hadoop fs -cat /user/joe/input/app.log
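A few other standard fs commands are worth trying at this point, using the same paths as above (the full list is in the file system shell documentation linked earlier):
hadoop-2.0.1-alpha$ bin/hadoop fs -ls /user/joe/input
hadoop-2.0.1-alpha$ bin/hadoop fs -du /user/joe/input
hadoop-2.0.1-alpha$ bin/hadoop fs -copyToLocal /user/joe/input/app.log ~/app.log.copy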
Step 6: Create a MapReduce program
The MapReduce programming model is explained in the blog post What is MapReduce?. Let us write a simple MapReduce program that uses the app.log we created in Step 5 as input and outputs the number of times each user visited the site. UserCountMap reads a line and outputs (username, 1). UserCountReducer takes as input (username, list of 1s) and outputs (username, sum).
package com.mj;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserCount {

    // Mapper: reads one log line, splits it on '|' and emits (username, 1)
    public static class UserCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] tokens = line.split("\\|");
            if (tokens.length > 0) {
                context.write(new Text(tokens[0]), new IntWritable(1));
            }
        }
    }

    // Reducer: sums the 1s for each username and emits (username, count)
    public static class UserCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) {
                count = count + value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(UserCount.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(UserCountMap.class);
        job.setReducerClass(UserCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compile the program and package it into a jar called, say, usercount.jar.
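One way to do this from the command line is sketched below. It assumes the source file lives at ~/projects/com/mj/UserCount.java and declares package com.mj, to match the com.mj.UserCount class name used in Step 7; adjust the paths to your own layout, or use an IDE or build tool instead:
hadoop-2.0.1-alpha$ cd ~/projects
projects$ javac -classpath "$(~/hadoop-2.0.1-alpha/bin/hadoop classpath)" com/mj/UserCount.java
projects$ jar cf usercount.jar com/mj/*.class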
Step 7: Run the MapReduce program
hadoop-2.0.1-alpha$ bin/hadoop jar ~/projects/usercount.jar com.mj.UserCount /user/joe/input /user/joe/output
You should see output, some of which is shown below.
12/09/12 17:41:28 INFO mapreduce.Job: The url to track the job: http://joe.local:8088/proxy/application_1347494786422_0003/
12/09/12 17:41:28 INFO mapreduce.Job: Running job: job_1347494786422_0003
12/09/12 17:41:37 INFO mapreduce.Job: Job job_1347494786422_0003 running in uber mode : false
12/09/12 17:41:37 INFO mapreduce.Job: map 0% reduce 0%
12/09/12 17:41:43 INFO mapreduce.Job: map 100% reduce 0%
12/09/12 17:41:45 INFO mapreduce.Job: map 100% reduce 100%
12/09/12 17:41:45 INFO mapreduce.Job: Job job_1347494786422_0003 completed successfully
You can see the status of the job in the ResourceManager web UI at http://localhost:8088.
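You can also check on applications from the command line with the yarn CLI; the listing should show the same application id that appears in the job output above:
hadoop-2.0.1-alpha$ bin/yarn application -list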
hadoop-2.0.1-alpha$ bin/hadoop fs -ls /user/joe/output
12/09/12 17:45:29 WARN util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
Found 2 items
-rw-r--r-- 1 joe supergroup 0 2012-09-12 17:41 /user/joe/output/_SUCCESS
-rw-r--r-- 1 joe supergroup 39 2012-09-12 17:41 /user/joe/output/part-r-00000
The file part-r-00000 will have the output, which is:
user01 5
user02 2
user03 2
user04 1
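To print the result directly from HDFS, use the same fs -cat command as before:
hadoop-2.0.1-alpha$ bin/hadoop fs -cat /user/joe/output/part-r-00000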
Hopefully these steps jumpstart you with Hadoop and get you on your way to writing more complex MapReduce jobs to analyze your big data.
Update 2/28/2014: For how to set up a Hadoop 2.x cluster, see the Hadoop 2.x YARN cluster setup tutorial.