Wednesday, December 24, 2008

Running Hadoop with HBase on CentOS Linux (Single-Node Cluster)


1.Purpose of this tutorial:


In this tutorial, I will describe the required steps for setting up a single-node Hadoop/HBase cluster using the Hadoop Distributed File System (HDFS) on CentOS Linux.

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware. It incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.

HBase's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.

2.Prerequisites

  • CentOS Linux.
  • Hadoop 0.18.2


  • [root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz'
    --02:49:53-- http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz
    Resolving mirror.mirimar.net... 194.90.150.47
    Connecting to mirror.mirimar.net|194.90.150.47|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 16836495 (16M) [application/x-gzip]
    Saving to: `hadoop-0.18.2.tar.gz'

    100%[=========================================================================================================================================>] 16,836,495 471K/s in 36s

  • HBase 0.18.1


  • [root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz'
    --02:51:42-- http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz
    Resolving mirror.mirimar.net... 194.90.150.47
    Connecting to mirror.mirimar.net|194.90.150.47|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 16295734 (16M) [application/x-gzip]
    Saving to: `hbase-0.18.1.tar.gz'

    100%[=========================================================================================================================================>] 16,295,734 496K/s in 34s

  • JDK 1.5.0_14 from Sun
  • SSH
  • Edit environment settings


  • [root@34 ~]# vi ~/.bash_profile
    # JDK installation directory
    export JAVA_HOME=/usr/local/bin/jdk1.5.0_14
    PATH=$JAVA_HOME/bin:$PATH:$HOME/bin
    export HADOOP_HOME=/usr/local/bin/hadoop
    export HBASE_HOME=/usr/local/bin/hbase
    # optional
    export PS1="[\u@\h \w]"

Please don't forget to add all environment variables to ~/.bash_profile, otherwise the exports will be lost when you close your SSH session.
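To confirm that the variables are really visible to Java processes after a fresh login, a small check like the one below can help. This is only a sketch: the class name EnvCheck is made up for illustration and is not part of Hadoop or HBase, and the list of required variables is taken from the ~/.bash_profile settings above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hadoop or HBase): reports which of the
// variables from ~/.bash_profile are missing from the environment.
final class EnvCheck {
    static final List<String> REQUIRED =
            Arrays.asList("JAVA_HOME", "HADOOP_HOME", "HBASE_HOME");

    // Returns the names from REQUIRED that are unset or blank in env.
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<String>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().length() == 0) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("All required variables are set.");
        } else {
            System.out.println("Missing variables: " + missing);
        }
    }
}
```

Run it from a new SSH session; if it lists any variable as missing, the export did not make it into ~/.bash_profile.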

2.1 Java

For now we'll use HBase 0.18.1, which was compiled with JDK 1.5, and Hadoop 0.18.2, which supports JDK 1.5.x. Hadoop 0.19.0 is available today, but it requires JDK 1.6, and HBase is supposed to support JDK 1.6 only from version 0.19.

Install JDK 1.5.0_14:


[root@localhost /usr/local/bin]wget 'http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin' -O jdk-1_5_0_14-linux-i586.bin
--03:25:00-- http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin
Resolving cds.sun.com... 72.5.239.134
Connecting to cds.sun.com|72.5.239.134|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin [following]
--03:25:01-- http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin
Resolving cds-esd.sun.com... 98.27.88.9, 98.27.88.39
Connecting to cds-esd.sun.com|98.27.88.9|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49649265 (47M) [application/x-sdlc]
Saving to: `jdk-1_5_0_14-linux-i586.bin'

100%[=========================================================================================================================================>] 49,649,265 5.36M/s in 9.0s

03:25:11 (5.25 MB/s) - `jdk-1_5_0_14-linux-i586.bin' saved [49649265/49649265]

After running the self-extracting installer, define JAVA_HOME and add $JAVA_HOME/bin to $PATH.

2.2 SSH public key authentication

Hadoop requires SSH access with public key authentication to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it, so generate a key pair and authorize it for logins to localhost:

[root@rt ~/.ssh] ssh-keygen -t rsa
//This will create two files in your ~/.ssh directory
//id_rsa: your private key
//id_rsa.pub: is your public key.
[root@rt ~/.ssh] cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[root@rt ~/.ssh] ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 55:d7:91:86:ea:86:8f:51:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
[root@rt ~]

As you can see, the final test is to check that you are able to open an SSH connection to localhost using public key authentication.

If the SSH connection fails, these general tips might help:

  • Enable debugging with ssh -vvv localhost and investigate the error in detail.
  • Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the user that runs Hadoop, root in this tutorial, to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with /etc/init.d/sshd reload.

2.3 Disabling IPv6


To disable IPv6 on CentOS Linux, open /etc/modprobe.d/blacklist in the editor of your choice and add the following lines to the end of the file:
 # disable IPv6
blacklist ipv6

You have to reboot your machine in order to make the changes take effect.

2.4 Edit open files limit.

Edit file /etc/security/limits.conf, add the following lines:

root - nofile 100000

root - locks 100000

Run ulimit -n 100000 in the current shell to apply the new limit without logging in again.

3.Hadoop

3.1 Installation

- Unpack the Hadoop archive to /usr/local/bin (it could be any directory)

- Move the unpacked directory to /usr/local/bin/hadoop: mv hadoop-0.18.2 hadoop

- Set HADOOP_HOME: export HADOOP_HOME=/usr/local/bin/hadoop

3.2 Configuration

Set up JAVA_HOME in $HADOOP_HOME/conf/hadoop-env.sh to point to your java location:

# The java implementation to use. Required.
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14


Set up $HADOOP_HOME/conf/hadoop-site.xml


Any site-specific configuration of Hadoop is configured in $HADOOP_HOME/conf/hadoop-site.xml. Here we will configure the directory where Hadoop will store its data files, the ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.

You can leave the settings below as is with the exception of the hadoop.tmp.dir variable, which you have to change to the directory of your choice, for example:

/usr/local/bin/hadoop/datastore/hadoop-${user.name}

Hadoop will expand ${user.name} to the system user which is running Hadoop, so in our case this will be root and thus the final path will be /usr/local/bin/hadoop/datastore/hadoop-root.
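The substitution can be pictured with a few lines of plain Java. The sketch below is not Hadoop's own implementation; the class name TmpDirExpansion and its expand method are invented for illustration and simply replace ${...} placeholders with values from a Properties object, such as the JVM system properties where user.name lives.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mimics the effect of ${user.name} expansion,
// it is not the code Hadoop itself uses.
final class TmpDirExpansion {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces every ${name} in value with props.getProperty(name);
    // placeholders without a matching property are left unchanged.
    static String expand(String value, Properties props) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = props.getProperty(m.group(1));
            if (replacement == null) {
                replacement = m.group(0);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String configured = "/usr/local/bin/hadoop/datastore/hadoop-${user.name}";
        // Prints the directory for whichever user runs this program.
        System.out.println(expand(configured, System.getProperties()));
    }
}
```

Running it as the user that will start Hadoop shows the directory you should expect to appear on disk.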







<?xml version="1.0"?>
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/bin/hadoop/datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>

</configuration>
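Before starting the daemons it can be useful to double-check which name/value pairs a site file really contains. The following sketch uses only the XML parser that ships with the JDK; the class SiteConfigReader is a made-up name, and it assumes the usual configuration/property/name/value layout of hadoop-site.xml. It does not use Hadoop's own Configuration class.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative reader for a Hadoop-style site file; not Hadoop's own parser.
final class SiteConfigReader {

    // Returns the name -> value pairs of all <property> elements, in file order.
    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            Map<String, String> props = new LinkedHashMap<String, String>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = text(property, "name");
                String value = text(property, "value");
                if (name != null && value != null) {
                    props.put(name, value);
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse configuration", e);
        }
    }

    // Text of the first child element with the given tag, or null if absent.
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>fs.default.name</name><value>hdfs://master:54310</value></property>"
                + "<property><name>dfs.replication</name><value>2</value></property>"
                + "</configuration>";
        for (Map.Entry<String, String> e : read(xml).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

In practice you would read the content of $HADOOP_HOME/conf/hadoop-site.xml into the string instead of the inline sample.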





3.3 Formatting the name node

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem: this will erase all your data.

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

[root@cc hadoop]# bin/hadoop namenode -format

The output should look like this:



[root@cc hadoop]# bin/hadoop namenode -format
08/12/24 10:56:34 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = cc.d.de.static.xlhost.com/206.222.13.204
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.18.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
Re-format filesystem in /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name ? (Y or N) Y
08/12/24 10:57:40 INFO fs.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
08/12/24 10:57:40 INFO fs.FSNamesystem: supergroup=supergroup
08/12/24 10:57:40 INFO fs.FSNamesystem: isPermissionEnabled=true
08/12/24 10:57:40 INFO dfs.Storage: Image file of size 78 saved in 0 seconds.
08/12/24 10:57:40 INFO dfs.Storage: Storage directory /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name has been successfully formatted.
08/12/24 10:57:40 INFO dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at cc.d.de.static.xlhost.com/206.222.13.204
************************************************************/


3.4 Starting/Stopping your single-node cluster

Run the command:



[root@cc hadoop]# $HADOOP_HOME/bin/start-all.sh

You should see the following output:

starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-cc.com.out
localhost: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-cc.out
localhost: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-cc.out
starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-cc.com.out
localhost: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-cc.com.out

Run the example Map-Reduce job that comes with the Hadoop installation:

[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 11358 2008-12-25 04:54 /user/root/testWordCount
[root@38 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-output
08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 04:55:48 INFO mapred.JobClient: Running job: job_200812250447_0001
08/12/25 04:55:49 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 04:55:51 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 04:55:56 INFO mapred.JobClient: Job complete: job_200812250447_0001
08/12/25 04:55:56 INFO mapred.JobClient: Counters: 16
08/12/25 04:55:56 INFO mapred.JobClient: Job Counters
08/12/25 04:55:56 INFO mapred.JobClient: Data-local map tasks=2
08/12/25 04:55:56 INFO mapred.JobClient: Launched reduce tasks=1
08/12/25 04:55:56 INFO mapred.JobClient: Launched map tasks=2
08/12/25 04:55:56 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 04:55:56 INFO mapred.JobClient: Map output records=1581
08/12/25 04:55:56 INFO mapred.JobClient: Reduce input records=593
08/12/25 04:55:56 INFO mapred.JobClient: Map output bytes=16546
08/12/25 04:55:56 INFO mapred.JobClient: Map input records=202
08/12/25 04:55:56 INFO mapred.JobClient: Combine output records=1292
08/12/25 04:55:56 INFO mapred.JobClient: Map input bytes=11358
08/12/25 04:55:56 INFO mapred.JobClient: Combine input records=2280
08/12/25 04:55:56 INFO mapred.JobClient: Reduce input groups=593
08/12/25 04:55:56 INFO mapred.JobClient: Reduce output records=593
08/12/25 04:55:56 INFO mapred.JobClient: File Systems
08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes written=6117
08/12/25 04:55:56 INFO mapred.JobClient: Local bytes written=18568
08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes read=13872
08/12/25 04:55:56 INFO mapred.JobClient: Local bytes read=8542
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls testWordCount-output
Found 2 items
drwxr-xr-x - root supergroup 0 2008-12-25 04:55 /user/root/testWordCount-output/_logs
-rw-r--r-- 1 root supergroup 6117 2008-12-25 04:55 /user/root/testWordCount-output/part-00000
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -cat testWordCount-output/part-00000
// you should see something like this
...
tracking 1
trade 1
trademark, 1
trademarks, 1
transfer 1
transformation 1
translation 1
...
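The bundled wordcount example splits every input line on whitespace and sums the occurrences of each token, which is why entries such as "trademark," keep their trailing comma in the output above. The same counting logic, without any Hadoop classes, might look like the sketch below; WordCountSketch is an illustrative name and the code runs in a single JVM rather than as map and reduce tasks.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process illustration of what the wordcount job computes.
final class WordCountSketch {

    // Counts whitespace-separated tokens; punctuation stays part of the token.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            Integer seen = counts.get(word);
            counts.put(word, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "trade trademark, trade\ntransfer of trademarks,";
        for (Map.Entry<String, Integer> e : count(sample).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

In the real job the tokenizing happens in the map phase and the summing in the combine and reduce phases, but the result per word is the same.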

To stop the Hadoop cluster, run the following:


[root@37 /usr/local/bin/hadoop]bin/stop-all.sh
no jobtracker to stop
localhost: no tasktracker to stop
no namenode to stop
localhost: no datanode to stop
localhost: no secondarynamenode to stop

3.5 Hadoop monitoring and debugging

See the Hadoop tips on how to debug Map-Reduce programs. The Hadoop logs provide the most information: you can read them in $HADOOP_HOME/logs or follow the links to them from the Hadoop web interfaces.

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

  • http://localhost:50030/ - job tracker web UI
  • http://localhost:50060/ - task tracker web UI
  • http://localhost:50070/ - name node web UI

The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file.

The task tracker web UI shows you running and non-running tasks.

The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser.

If everything works fine, you should see the following output after running the jps utility:


[root@cc hadoop]# jps
3060 SecondaryNameNode
3136 JobTracker
2814 NameNode
3270 TaskTracker
3458 Jps

[root@38 /usr/local/bin/hadoop]netstat -plten | grep java
tcp 0 0 :::50020 :::* LISTEN 0 1248347094 5131/java
tcp 0 0 ::ffff:127.0.0.1:54310 :::* LISTEN 0 1248346675 5032/java
tcp 0 0 ::ffff:127.0.0.1:54311 :::* LISTEN 0 1248347128 5351/java
tcp 0 0 :::50090 :::* LISTEN 0 1248347119 5272/java
tcp 0 0 :::50060 :::* LISTEN 0 1248347327 5467/java
tcp 0 0 :::50030 :::* LISTEN 0 1248347293 5351/java
tcp 0 0 :::48498 :::* LISTEN 0 1248347011 5272/java
tcp 0 0 :::50070 :::* LISTEN 0 1248346888 5032/java
tcp 0 0 :::53210 :::* LISTEN 0 1248347103 5351/java
tcp 0 0 :::50010 :::* LISTEN 0 1248347000 5131/java
tcp 0 0 :::50075 :::* LISTEN 0 1248347020 5131/java
tcp 0 0 :::40315 :::* LISTEN 0 1248346669 5032/java
tcp 0 0 ::ffff:127.0.0.1:47198 :::* LISTEN 0 1248347345 5467/java
tcp 0 0 :::56575 :::* LISTEN 0 1248346834 5131/java
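If netstat is not at hand, the same question ("is something listening on the expected ports?") can be answered from Java. The sketch below is an assumption-laden helper, not a Hadoop tool: PortCheck is a made-up class, and the port list is taken from the configuration and netstat output above (54310 and 54311 from hadoop-site.xml, 50030, 50060 and 50070 for the web interfaces).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative helper: tries a TCP connect to see whether a port is open.
final class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful to do if closing fails.
            }
        }
    }

    public static void main(String[] args) {
        int[] ports = {54310, 54311, 50030, 50060, 50070};
        for (int port : ports) {
            boolean open = isListening("localhost", port, 1000);
            System.out.println(port + (open ? " open" : " closed"));
        }
    }
}
```

A closed port for a daemon that jps reports as running usually means the daemon is bound to a different host name or port than the one you are probing.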


3.HBase

3.1 Configuration

- Unpack the HBase archive to /usr/local/bin
- Move hbase-0.18.1 to hbase
- Define HBASE_HOME to point to /usr/local/bin/hbase (don't forget to edit ~/.bash_profile)
- Define JAVA_HOME in $HBASE_HOME/conf/hbase-env.sh, for instance:

export JAVA_HOME=/usr/local/bin/jdk1.5.0_14/

- Edit $HADOOP_HOME/conf/hadoop-env.sh and add the HBase configuration directory and jars to the Hadoop classpath, for instance:

export HADOOP_CLASSPATH=/usr/local/bin/hbase/conf:/usr/local/bin/hbase/hbase-0.18.1.jar:/usr/local/bin/hbase/hbase-0.18.1-test.jar


3.2 Pseudo-Distributed Operation

A pseudo-distributed operation is simply a distributed operation run on a single host. Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of ${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance. Use hbase-site.xml to override the properties defined in ${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself should never be modified). At a minimum the hbase.rootdir property should be redefined in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property below to your hbase-site.xml says that HBase should use the /hbase directory in the HDFS whose namenode is at port 54310 on your local machine:





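<configuration>

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:54310/hbase</value>
  <description>The directory shared by region servers.</description>
</property>

</configuration>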
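Since hbase.rootdir has to point at the same name node that fs.default.name names in hadoop-site.xml, a quick textual comparison of the two values can catch a mismatched host or port early. The sketch below is purely illustrative: RootDirCheck is an invented class, and it compares the URIs as strings, so two different host names that resolve to the same machine (for example master and localhost) are reported as different.

```java
import java.net.URI;

// Illustrative check that hbase.rootdir targets the configured name node.
final class RootDirCheck {

    // True if both URIs use hdfs and name the same host and port.
    static boolean sameNameNode(String fsDefaultName, String hbaseRootDir) {
        URI fs = URI.create(fsDefaultName);
        URI root = URI.create(hbaseRootDir);
        return "hdfs".equals(root.getScheme())
                && root.getScheme().equals(fs.getScheme())
                && root.getHost() != null
                && root.getHost().equalsIgnoreCase(fs.getHost())
                && root.getPort() == fs.getPort();
    }

    public static void main(String[] args) {
        String fsDefaultName = "hdfs://master:54310";
        String hbaseRootDir = "hdfs://localhost:54310/hbase";
        if (sameNameNode(fsDefaultName, hbaseRootDir)) {
            System.out.println("hbase.rootdir matches fs.default.name");
        } else {
            System.out.println("Host or port differ; make sure both names resolve to the name node");
        }
    }
}
```

With the values used in this tutorial the check prints the second message, which is harmless only as long as master resolves to the local machine.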



3.3 Example API Usage

Once you have a running HBase, you probably want a way to hook your application up to it. If your application is in Java, then you should use the Java API. The following example takes as input an Excel-format file and the name of an already existing HBase table, processes the records and writes them to HBase. See also the HBase client example.



package org.examples;

import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Sample uploader Map-Reduce example class. Takes an Excel-format file as input and writes the output to HBase.
 */
public class SampleUploader extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>, Tool
{
    static enum Counters { MAP_LINES, REDUCE_LINES }
    private static final String NAME = "SampleUploader";
    private Configuration conf;
    static final String OUTPUT_COLUMN = "value:";
    static final String OUTPUT_KEY = "key:";
    long numRecords;
    private Text idText = new Text();
    private Text recordText = new Text();
    private String inputFile;

    /** A WritableComparator optimized for Text keys. */
    public static class Comparator extends WritableComparator
    {
        public Comparator()
        {
            super(Text.class);
        }

        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2)
        {
            // Skip the variable-length size prefix of each serialized Text.
            int n1 = WritableUtils.decodeVIntSize(b1[s1]);
            int n2 = WritableUtils.decodeVIntSize(b2[s2]);
            return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
        }
    }

    /** Builds the job: args[0] is the input path, args[1] the name of the target table. */
    public JobConf createSubmittableJob(String[] args)
    {
        JobConf c = new JobConf(getConf(), SampleUploader.class);
        c.setJobName(NAME);
        c.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(c, new Path(args[0]));
        c.setMapperClass(this.getClass());
        c.setMapOutputKeyClass(Text.class);
        c.setMapOutputValueClass(Text.class);
        c.setReducerClass(TableUploader.class);
        TableReduce.initJob(args[1], TableUploader.class, c);
        return c;
    }

    public void configure(JobConf job)
    {
        inputFile = job.get("map.input.file");
    }

    /** Emits (userID, cleaned line) for every input line. */
    public void map(LongWritable k, Text v, OutputCollector<Text, Text> output, Reporter r) throws IOException
    {
        // Strip the quoted strings (the URLs) from the line.
        String lineWithoutURLs = Pattern.compile("\"[^\"]* \"").matcher(v.toString()).replaceAll("");
        // The user ID sits at a fixed position in the cleaned line.
        String userID = lineWithoutURLs.substring(22, 54);

        r.incrCounter(Counters.MAP_LINES, 1);
        if ((++numRecords % 10000) == 0)
        {
            System.out.println("Finished mapping of " + numRecords + " records " + "from the input file: " + inputFile);
        }
        idText.set(userID);
        recordText.set(lineWithoutURLs);
        output.collect(idText, recordText);
    }

    /** Writes all records of one user into a single row, one column per date stamp. */
    public static class TableUploader extends TableReduce<Text, Text>
    {
        @Override
        public void reduce(Text k, Iterator<Text> v,
                           OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
                           Reporter r) throws IOException
        {
            BatchUpdate outval = new BatchUpdate(k.toString());
            while (v.hasNext())
            {
                String value = v.next().toString();
                String dateStamp = value.substring(0, 20).replaceAll("[-:,]", "");
                outval.put(OUTPUT_COLUMN + dateStamp, value.getBytes());
                r.incrCounter(Counters.REDUCE_LINES, 1);
            }
            // Text.getBytes() returns the backing buffer, which may be longer
            // than the key itself, so convert through the String value.
            output.collect(new ImmutableBytesWritable(k.toString().getBytes()), outval);
        }
    }

    static int printUsage()
    {
        System.out.println(NAME + " <input_path> <table_name>");
        return -1;
    }

    public int run(String[] args) throws Exception
    {
        // Make sure there are exactly 2 parameters left.
        if (args.length != 2)
        {
            System.out.println("ERROR: Wrong number of parameters: " +
                               args.length + " instead of 2.");
            return printUsage();
        }
        JobClient.runJob(createSubmittableJob(args));
        return 0;
    }

    public Configuration getConf()
    {
        return this.conf;
    }

    public void setConf(final Configuration c)
    {
        this.conf = c;
    }

    public static void main(String[] args) throws Exception
    {
        int errCode = ToolRunner.run(new Configuration(), new SampleUploader(), args);
        System.exit(errCode);
    }
}
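The string handling inside map and reduce can be tried out without a cluster. The sketch below repeats only those steps in plain Java: removing the quoted strings, cutting the user ID out of positions 22 to 54, and turning the first 20 characters into a column name. The class RecordParsingSketch and the sample line layout are invented for illustration, and the regular expression is my assumption about the intended pattern; your real input format decides the actual offsets.

```java
import java.util.regex.Pattern;

// Stand-alone illustration of the parsing steps used by the uploader.
final class RecordParsingSketch {
    // Assumed pattern: a quote, any non-quote characters, a space, a quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]* \"");

    // Removes quoted strings (the URLs) from the line.
    static String stripQuoted(String line) {
        return QUOTED.matcher(line).replaceAll("");
    }

    // The 32-character user ID at a fixed position in the cleaned line.
    static String userId(String cleaned) {
        return cleaned.substring(22, 54);
    }

    // Column name: the "value:" family plus the date stamp without - : ,
    static String columnName(String cleaned) {
        return "value:" + cleaned.substring(0, 20).replaceAll("[-:,]", "");
    }

    public static void main(String[] args) {
        // Hypothetical record: 20-character time stamp, "u=", 32-character ID, quoted URL.
        String line = "2008-12-25 03:55:40,u=0123456789abcdef0123456789abcdef "
                + "\"http://example.com/a \" rest";
        String cleaned = stripQuoted(line);
        System.out.println("row key: " + userId(cleaned));
        System.out.println("column:  " + columnName(cleaned));
    }
}
```

Lines shorter than 54 characters would make substring throw an exception here, just as they would in the job itself, so it is worth checking a few real records this way first.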


3.4 Running HBase and using HBase shell

Start HBase with the following command:

${HBASE_HOME}/bin/start-hbase.sh

If HBase started successfully, you should see output like the following after running jps:

[root@37 /usr/local/bin/hadoop]jps
10379 DataNode
18303 Jps
10637 TaskTracker
10536 JobTracker
10286 NameNode
17512 HMaster

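Check whether HBase is running through its web interfaces: by default the master UI listens on http://localhost:60010 and the region server UI on http://localhost:60030.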

Once HBase has started, run HBase Shell with

${HBASE_HOME}/bin/hbase shell

Create a sample table for our tests:

create 'table_keyInMemory', {NAME => 'key',IN_MEMORY => true, VERSIONS => 1,BLOCKCACHE => true},{NAME => 'value',VERSIONS => 1}

To stop HBase, exit the HBase shell and enter:

${HBASE_HOME}/bin/stop-hbase.sh

3.5 Finally: Running Map-Reduce jobs:

First copy an input file from local filesystem:

[root@37 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal /localFile fileNameInHadoopDFS
[root@37 /usr/local/bin/hadoop] bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 27325 2008-12-25 03:26 /user/root/fileNameInHadoopDFS
// Run Map-Reduce script
[root@37 /usr/local/bin/hadoop]bin/hadoop jar Test.jar org.exelate.Uploader 100 sampleTable
08/12/25 03:55:40 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 03:55:40 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 03:55:40 INFO mapred.JobClient: Running job: job_200812250352_0002
08/12/25 03:55:41 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 03:55:44 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 03:55:50 INFO mapred.JobClient: Job complete: job_200812250352_0002
08/12/25 03:55:50 INFO mapred.JobClient: Counters: 17
08/12/25 03:55:50 INFO mapred.JobClient: Job Counters
08/12/25 03:55:50 INFO mapred.JobClient: Data-local map tasks=2
08/12/25 03:55:50 INFO mapred.JobClient: Launched reduce tasks=1
08/12/25 03:55:50 INFO mapred.JobClient: Launched map tasks=2
08/12/25 03:55:50 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 03:55:50 INFO mapred.JobClient: Map output records=100
08/12/25 03:55:50 INFO mapred.JobClient: Reduce input records=100
08/12/25 03:55:50 INFO mapred.JobClient: Map output bytes=16516
08/12/25 03:55:50 INFO mapred.JobClient: Map input records=100
08/12/25 03:55:50 INFO mapred.JobClient: Combine output records=0
08/12/25 03:55:50 INFO mapred.JobClient: Map input bytes=27325
08/12/25 03:55:50 INFO mapred.JobClient: Combine input records=0
08/12/25 03:55:50 INFO mapred.JobClient: Reduce input groups=44
08/12/25 03:55:50 INFO mapred.JobClient: Reduce output records=44
08/12/25 03:55:50 INFO mapred.JobClient: File Systems
08/12/25 03:55:50 INFO mapred.JobClient: Local bytes written=33928
08/12/25 03:55:50 INFO mapred.JobClient: HDFS bytes read=30048
08/12/25 03:55:50 INFO mapred.JobClient: Local bytes read=16921
08/12/25 03:55:50 INFO mapred.JobClient: org.exelate.SampleUploader$Counters
08/12/25 03:55:50 INFO mapred.JobClient: MAP_LINES=100
08/12/25 03:55:50 INFO mapred.JobClient: REDUCE_LINES=100

HDFS Command Reference

There are many more commands in bin/hadoop dfs than were demonstrated here, although these basic operations will get you started. Running bin/hadoop dfs with no additional arguments will list all commands which can be run with the FsShell system. Furthermore, bin/hadoop dfs -help commandName will display a short usage summary for the operation in question, if you are stuck.

A table of all operations is reproduced below. Parameters in [brackets] are optional, and path... means that one or more paths may be given.

Command | Operation
-ls path | Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr path | Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du path | Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-dus path | Like -du, but prints a summary of disk usage of all files/directories in the path.
-mv src dest | Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest | Copies the file or directory identified by src to dest, within HDFS.
-rm path | Removes the file or empty directory identified by path.
-rmr path | Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrc dest | Copies the file or directory from the local file system identified by localSrc to dest within HDFS.
-copyFromLocal localSrc dest | Identical to -put.
-moveFromLocal localSrc dest | Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] src localDest | Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge src localDest [addnl] | Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat filename | Displays the contents of filename on stdout.
-copyToLocal [-crc] src localDest | Identical to -get.
-moveToLocal [-crc] src localDest | Works like -get, but deletes the HDFS copy on success.
-mkdir path | Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).
-setrep [-R] [-w] rep path | Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
-touchz path | Creates a zero-length file at path with the current time as its timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] path | Returns 1 if path exists; has zero length; or is a directory, or 0 otherwise.
-stat [format] path | Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file | Shows the last 1KB of file on stdout.
-chmod [-R] mode,mode,... path... | Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path... | Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
-chgrp [-R] group path... | Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.
-help cmd | Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
