Wednesday, July 20, 2011

Hadoop Installation and Single Node Setup


This document describes how to set up and configure a single-node Hadoop installation with the Hadoop Distributed File System (HDFS).

Prerequisites:


1) Java 1.6.x must be installed
2) ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

To test them, try:

'java -version'
'ssh localhost'

If ssh works without asking for a passphrase, you are all set. If not, follow the steps sketched below to generate your ssh key and allow passphraseless access to localhost.

If ssh is not installed, run 'sudo apt-get install ssh' on Debian/Ubuntu-style Linux systems.
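
A minimal sketch for enabling passphraseless ssh (assuming OpenSSH; the key type and filename below are just one common choice):

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

After this, 'ssh localhost' should log you in without prompting for a passphrase.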

Download a stable release from the Apache Hadoop mirrors and unpack the downloaded Hadoop distribution.
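
For example (the version number below is only an illustration; use whichever stable release you actually downloaded):

$ tar xzf hadoop-0.20.203.0.tar.gz
$ cd hadoop-0.20.203.0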

Try the following command:


$ bin/hadoop

It should print the usage documentation for the hadoop script.

Standalone Operation


By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.

Let's try running a standalone example.

It copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

$ cat output/*

Pseudo-Distributed Operation (Using HDFS)


Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

Configuration

Use the following:



conf/core-site.xml:

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

conf/mapred-site.xml:
<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>   

By default, HDFS data is stored under /tmp/hadoop-<yourusername>/, but you can change this location.

Add these two more properties to conf/hdfs-site.xml:
  
    <property>
        <name>dfs.data.dir</name>
        <value>/Users/ashish/hadoop_home/hdfs/data</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/Users/ashish/hadoop_home/hdfs/name</value>
    </property>
Also, make sure JAVA_HOME is set in conf/hadoop-env.sh:

export JAVA_HOME=/Library/Java/Home
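
The path above is for Mac OS X. On Linux the JDK usually lives somewhere else; the path below is only an illustration, so adjust it to your installation:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk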

Format a new distributed filesystem:


$ bin/hadoop namenode -format

Start the Hadoop daemons:
$ bin/start-all.sh
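
To check that everything came up, you can list the Java processes with jps (shipped with the JDK) and visit the built-in web interfaces; the ports below are the defaults assumed for this Hadoop version:

$ jps

You should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker listed. The NameNode web UI is at http://localhost:50070/ and the JobTracker UI at http://localhost:50030/.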

Running the example


Note that in pseudo-distributed mode Hadoop expects its input files to be in HDFS. So to run the same example, first copy the files from the local filesystem into HDFS:

$ bin/hadoop fs -put conf input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
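
To confirm the copy landed in your HDFS home directory (the default per-user location is assumed here), you can list it before or after running the job:

$ bin/hadoop fs -ls input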

Examine the output files:

Similarly, the output directory that Hadoop creates is on HDFS. Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop fs -get output output
$ cat output/*

Or you can view them directly on HDFS:

$ bin/hadoop fs -cat output/*

When you're done, stop the daemons with:

$ bin/stop-all.sh