Friday, February 28, 2014

Hadoop 2.x cluster setup tutorial

Recently I had the opportunity to set up a multi-node hadoop cluster. The apache documentation is a little thin and I had to spend several hours troubleshooting issues and googling for help before I got it right. The purpose of this tutorial is to save time for those setting up a hadoop cluster for the first time. If you are new to hadoop, you may read my tutorial on single node setup at Hadoop 2.x tutorial. If you have never set up hadoop before, it is a good idea to do a single node setup first and then try the multi node setup.

In this tutorial we will

(1) set up a multi node hadoop cluster with 4 servers
(2) test it by copying files to hdfs and cat-ing them
(3) test it further by running the simple map reduce job we developed in the single node tutorial

Step 1: Download a stable hadoop release and untar it.

Download a stable 2.x.x release from http://hadoop.apache.org/releases.html#Download.
I downloaded hadoop-2.2.0.tar.gz.
Untar the file to a directory, say ~/hadoop-2.2.0.
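
For example, the download and untar might look like this (a sketch; the archive URL below is the apache archive location for 2.2.0 and the home directory is assumed as the install location):

$ wget http://archive.apache.org/dist/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ tar -xzf hadoop-2.2.0.tar.gz -C ~/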

Step 2: Decide on the topology

For this tutorial, we shall set up a 4 node cluster.

Host : master
Host : slave1
Host : slave2
Host : slave3

On the master we will run the namenode, resourcemanager, datanode, nodemanager and historyserver. On the slaves, we run the datanode and nodemanager.

To make it more real world, we will bring up and test the cluster first with just master, slave1 and slave2. Typically you add more capacity as needed. So we will add slave3 after the fact.

Step 3: Ensure proper network connectivity

I am not going to cover networking details here. But it goes without saying that the master should be able to reach the slaves using their hostnames and the slaves should be able to reach the master. You may have to add the hostname to ip address mappings in /etc/hosts.
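
For example, /etc/hosts on every server might contain entries like these (the IP addresses are placeholders; substitute the real addresses of your servers):

192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2
192.168.1.13   slave3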

Several startup scripts use ssh to connect and start processes on hosts, so ssh must be set up for passwordless login on all hosts.
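
One common way to do this, assuming you run hadoop as the same user (here called hduser, a placeholder) on every server, is to generate a key pair on the master and copy the public key to each host, including the master itself:

$ ssh-keygen -t rsa
$ ssh-copy-id hduser@master
$ ssh-copy-id hduser@slave1
$ ssh-copy-id hduser@slave2

Afterwards, ssh slave1 from the master should log in without prompting for a password.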

Step 4: Set these environment variables
export HADOOP_HOME=path_to_hadoop_install_dir
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
To hadoop_install_dir/libexec/hadoop-config.sh, add
export JAVA_HOME=path_to_jdk
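
As a concrete example, assuming hadoop was untarred to ~/hadoop-2.2.0 and the JDK lives under /usr/lib/jvm/java-7-openjdk-amd64 (both paths are assumptions; substitute your own), the two path-specific lines become:

export HADOOP_HOME=$HOME/hadoop-2.2.0
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Set the environment variables on every server, for example in ~/.bashrc, since we will repeat this step on the slaves.
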
Step 5: Update the config files
These files are located at hadoop_install_dir/etc/hadoop

core-site.xml
This applies to all servers.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>


hdfs-site.xml
This applies to all servers.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/mnt/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/mnt/hadoop/data/datanode</value>
  </property>
</configuration>

dfs.namenode.name.dir points to the location where the namenode stores its metadata. dfs.datanode.data.dir points to the location where the datanode stores data. It is important to put these directories on disks that have lots of free space (terabytes). Default block sizes are large, and if there is not enough space you will encounter errors that do not clearly point out the space issue.
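
If these directories do not exist or are not writable by the user running hadoop, the daemons may fail to start. A quick way to prepare them on every server, assuming the /mnt/hadoop paths above and that you run hadoop as the current user:

$ sudo mkdir -p /mnt/hadoop/data/namenode /mnt/hadoop/data/datanode
$ sudo chown -R $USER /mnt/hadoop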

yarn-site.xml
This applies to all servers.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
</configuration>

The last three properties tell the nodemanagers and the job clients where to reach the resourcemanager.

Only in the yarn-site.xml of the master, add
<property>
  <name>yarn.nodemanager.localizer.address</name>
  <value>master:8060</value>
</property>
Without this, the nodemanager will not start on the master, because the localizer's default port (8040) clashes with the port we chose for yarn.resourcemanager.address above.

mapred-site.xml
<configuration>
<property>

  <name>mapreduce.framework.name</name>
  <value>yarn</value>

</property>
</configuration>
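
Note that the hadoop-2.2.0 tarball ships a mapred-site.xml.template rather than a mapred-site.xml in etc/hadoop. If the file is missing, copy the template first:

hadoop_install_dir$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml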


slaves
This is necessary on the master only and is used by the scripts to start datanodes and nodemanagers on the servers listed.

master
slave1
slave2
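
Since the *-site.xml files above apply to all servers, one simple approach is to edit them on the master and copy them out to each slave (a sketch, assuming the same install path, here ~/hadoop-2.2.0, on every host):

hadoop_install_dir$ scp etc/hadoop/*.xml slave1:~/hadoop-2.2.0/etc/hadoop/
hadoop_install_dir$ scp etc/hadoop/*.xml slave2:~/hadoop-2.2.0/etc/hadoop/

Take care not to push the master-only yarn.nodemanager.localizer.address property to the slaves.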

Step 6: Start the processes

Change to the directory where hadoop is installed.
cd hadoop_install_dir

If you are running hadoop for the first time, the following command will format HDFS. Do not run this every time, as it formats and thus deletes any existing data.

hadoop_install_dir$ bin/hdfs namenode -format

Start the namenode.
hadoop_install_dir$ sbin/hadoop-daemon.sh start namenode

Start the datanodes.
hadoop_install_dir$ sbin/hadoop-daemons.sh start datanode

Note that the script name is hadoop-daemons.sh (plural). This starts the datanode on all the servers listed in the slaves file. If you use the singular hadoop-daemon.sh script, it will only start the datanode on the server on which you ran the script.

In hadoop 2.x , there is no jobtracker. Instead there is a resourcemanager and a nodemanager.
Start the resourcemanager.
hadoop_install_dir$ sbin/yarn-daemon.sh start resourcemanager

Start the nodemanager.
hadoop_install_dir$ sbin/yarn-daemons.sh start nodemanager
As mentioned in the case of the datanode, the -daemons script will start the nodemanager on all servers listed in the slaves file, whereas the -daemon script will start it only on the server on which the script is executed.

Start the history server.
hadoop_install_dir$ sbin/mr-jobhistory-daemon.sh start historyserver

On the master, type jps. It lists the running java processes. Check that all the processes are started.

hadoop_install_dir$ jps

1380 DataNode
1558 Jps
1433 ResourceManager
1536 JobHistoryServer
1335 NameNode
1849 NodeManager

Do the same on each of the slaves

hadoop_install_dir$ jps

1380 DataNode
1558 Jps
1849 NodeManager


The jps command is a good check to ensure all necessary processes are started.

You can use the following urls to see the state of the cluster.

For HDFS
http://master:50070
For YARN/Mapreduce
http://master:8088

Step 7: Test HDFS

The HDFS commands are documented at
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

hadoop_install_dir$ bin/hdfs dfs -ls /

will list the root directory

hadoop_install_dir$ bin/hdfs dfs -mkdir /input
creates a directory /input

In the local filesystem, create a file app.log with the following data:

user01|1|2|3|4|5
user02|1|2|3|4|5
user03|1|2|3|4|5
user01|1|2|3|4|5
user02|1|2|3|4|5
user01|1|2|3|4|5
user03|1|2|3|4|5
user01|1|2|3|4|5
user04|1|2|3|4|5
user01|1|2|3|4|5

Let us pretend this is a log file from a web application where for each request we have logged userid and some additional data. We will later use this as input for a MapReduce program.
You can move it to hdfs using the command

hadoop_install_dir$ bin/hdfs dfs -moveFromLocal ~/projects/app.log /input/

To print the file just moved to hdfs
hadoop_install_dir$ bin/hdfs dfs -cat /input/app.log

Step 8: Run a map reduce program

Writing and running an MR program is no different from what we did in the single node tutorial. See Step 6 and Step 7 of the single node tutorial at Hadoop 2.x tutorial.

Step 9: Add additional capacity

Your hadoop cluster is working well. But you are running out of space and your MR jobs are backed up because not enough mappers or reducers are available. Let us increase capacity by adding an additional server slave3.

On slave3, do
Step 1 -- untar the hadoop binaries
Step 3 -- ensure network connectivity
Step 4 -- set the environment variables
Step 5 -- set up the config files

Start the datanode
hadoop_install_dir$ sbin/hadoop-daemon.sh start datanode

Start the nodemanager
hadoop_install_dir$ sbin/yarn-daemon.sh start nodemanager

Use jps to check that the processes started. Use the web urls to check that the node got added to the cluster.
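
You can also confirm from the master's command line that the new datanode has joined HDFS; the report should now list slave3 among the live datanodes:

hadoop_install_dir$ bin/hdfs dfsadmin -report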

I hope these steps help jumpstart your hadoop cluster setup and save you time.