The crontab (cron derives from chronos, Greek for time; tab stands for table) command, found in Unix and Unix-like operating systems, is used to schedule commands to be executed periodically. To see which cron jobs are currently scheduled for your user, open a terminal and run:
crontab -l
To edit the list of cron jobs you can run:
crontab -e
This will open the default editor (vi or pico) to manipulate the crontab settings. When you save and exit the editor, all your cron jobs are saved into the crontab. Cron jobs are written in the following format:
* * * * * /bin/execute/this/script.sh
Scheduling explained
As you can see there are 5 stars. The stars represent different date parts in the following order:
minute (from 0 to 59)
hour (from 0 to 23)
day of month (from 1 to 31)
month (from 1 to 12)
day of week (from 0 to 6) (0=Sunday)
Execute every minute
If you leave the star, or asterisk, in place, it means every. Maybe that's a bit unclear, so let's use the previous example again:
* * * * * /bin/execute/this/script.sh
They are all still asterisks! So this means: execute /bin/execute/this/script.sh:
every minute
of every hour
of every day of the month
of every month
and every day of the week.
In short: This script is being executed every minute. Without exception.
Execute every Friday 1AM
So if we want to schedule the script to run at 1AM every Friday, we would need the following cronjob:
0 1 * * 5 /bin/execute/this/script.sh
Get it? The script is now being executed when the system clock hits:
minute: 0
of hour: 1
of day of month: * (every day of month)
of month: * (every month)
and weekday: 5 (=Friday)
Execute at 10 minutes past every hour on the 1st of every month
Here's another one, just for practice:
10 * 1 * * /bin/execute/this/script.sh
Fair enough, it takes some getting used to, but it offers great flexibility.
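Standard cron implementations also accept lists, ranges and step values in each field, which is where much of that flexibility comes from. The entries below are illustrative sketches rather than part of the original example; check your system's crontab(5) man page for the exact syntax your cron supports:

# every 15 minutes
*/15 * * * * /bin/execute/this/script.sh
# at minute 0 of every hour from 9:00 to 17:00, Monday to Friday
0 9-17 * * 1-5 /bin/execute/this/script.sh
# at 02:30 on the 1st and 15th of every month
30 2 1,15 * * /bin/execute/this/script.sh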
To do that you need to add the public key from master to the slave's ~/.ssh/authorized_keys and vice versa, so eventually you should be able to do the following:
[root@37 /usr/local/bin/hbase]ssh master
The authenticity of host 'master (10.1.0.55)' can't be established.
RSA key fingerprint is 09:e2:73:ac:6f:42:d1:da:13:20:76:10:36:29:c4:62.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master,10.1.0.55' (RSA) to the list of known hosts.
Last login: Thu Dec 25 07:19:16 2008 from slave
[root@37 ~]exit
logout
Connection to master closed.
[root@37 /usr/local/bin/hbase]ssh slave
The authenticity of host 'slave (10.1.0.56)' can't be established.
RSA key fingerprint is 5c:4f:81:19:ad:f3:78:02:ce:64:f1:67:10:ca:c5:b8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave,10.1.0.56' (RSA) to the list of known hosts.
Last login: Thu Dec 25 07:16:47 2008 from 38.d.de.static.xlhost.com
[root@38 ~]exit
logout
Connection to slave closed.
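One way to exchange the keys is sketched below; it assumes the key pairs already exist on both machines (see the ssh-keygen step later in this guide) and that the hostnames master and slave resolve on both boxes:

# on master: append master's public key to slave's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh root@slave 'cat >> ~/.ssh/authorized_keys'
# on slave: append slave's public key to master's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh root@master 'cat >> ~/.ssh/authorized_keys'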
2. Cluster Overview
The master node will also act as a slave because we only have two machines available in our cluster but still want to spread data storage and processing to multiple machines.
The master node will run the "master" daemons for each layer: namenode for the HDFS storage layer, and jobtracker for the MapReduce processing layer. Both machines will run the "slave" daemons: datanode for the HDFS layer, and tasktracker for the MapReduce processing layer. Basically, the "master" daemons are responsible for coordination and management of the "slave" daemons, while the latter do the actual data storage and data processing work.
3. Hadoop Core Configuration
3.1 conf/masters (master only)
The conf/masters file defines the master nodes of our multi-node cluster.
On master, update $HADOOP_HOME/conf/masters so that it looks like this:
master
3.2 conf/slaves (master only)
This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.
On master, update $HADOOP_HOME/conf/slaves so that it looks like this:
master
slave
If you have additional slave nodes, just add them to the conf/slaves file, one per line (do this on all machines in the cluster).
master
slave
slave01
slave02
slave03
3.3 conf/hadoop-site.xml (all machines)
Below is a sample configuration for all hosts; an explanation of all parameters is available at the link.
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/bin/hadoop/datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx500m</value>
  <description>Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc</description>
</property>

</configuration>
The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.
Very important: now format the data storage on each node before proceeding (the format command is shown below).
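The format step itself is covered in detail in the single-node walkthrough later in this guide; on the namenode it boils down to the following (a sketch, run from $HADOOP_HOME):

bin/hadoop namenode -format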
3.4 Start/Stop Hadoop cluster
HDFS daemons: start or stop the namenode with bin/start-dfs.sh or bin/stop-dfs.sh.
MapReduce daemons: run the command $HADOOP_HOME/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the previous command on, and tasktrackers on the machines listed in the conf/slaves file.
[root@37 /usr/local/bin/hadoop]bin/start-dfs.sh
starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-37.c3.33.static.xlhost.com.out
master: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-37.c3.33.static.xlhost.com.out
slave: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-38.c3.33.static.xlhost.com.out
master: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-37.c3.33.static.xlhost.com.out
[root@37 /usr/local/bin/hadoop]bin/start-mapred.sh
starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-37.c3.33.static.xlhost.com.out
master: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-37.c3.33.static.xlhost.com.out
slave: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-38.c3.33.static.xlhost.com.out
[root@37 /usr/local/bin/hadoop]jps]
-bash: jps]: command not found
[root@37 /usr/local/bin/hadoop]jps
28638 DataNode
29052 Jps
28527 NameNode
28764 SecondaryNameNode
28847 JobTracker
28962 TaskTracker
[root@37 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
[root@37 /usr/local/bin/hadoop]bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 11358 2008-12-25 08:20 /user/root/testWordCount
[root@37 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-out
08/12/25 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 08:21:47 INFO mapred.JobClient: Running job: job_200812250818_0001
08/12/25 08:21:48 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 08:21:53 INFO mapred.JobClient: map 50% reduce 0%
08/12/25 08:21:55 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 08:22:01 INFO mapred.JobClient: map 100% reduce 12%
08/12/25 08:22:02 INFO mapred.JobClient: map 100% reduce 25%
08/12/25 08:22:05 INFO mapred.JobClient: map 100% reduce 41%
08/12/25 08:22:09 INFO mapred.JobClient: map 100% reduce 52%
08/12/25 08:22:14 INFO mapred.JobClient: map 100% reduce 64%
08/12/25 08:22:19 INFO mapred.JobClient: map 100% reduce 66%
08/12/25 08:22:24 INFO mapred.JobClient: map 100% reduce 77%
08/12/25 08:22:29 INFO mapred.JobClient: map 100% reduce 79%
08/12/25 08:25:30 INFO mapred.JobClient: map 100% reduce 89%
08/12/25 08:26:12 INFO mapred.JobClient: Job complete: job_200812250818_0001
08/12/25 08:26:12 INFO mapred.JobClient: Counters: 17
08/12/25 08:26:12 INFO mapred.JobClient: Job Counters
08/12/25 08:26:12 INFO mapred.JobClient: Data-local map tasks=1
08/12/25 08:26:12 INFO mapred.JobClient: Launched reduce tasks=9
08/12/25 08:26:12 INFO mapred.JobClient: Launched map tasks=2
08/12/25 08:26:12 INFO mapred.JobClient: Rack-local map tasks=1
08/12/25 08:26:12 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 08:26:12 INFO mapred.JobClient: Map output records=1581
08/12/25 08:26:12 INFO mapred.JobClient: Reduce input records=593
08/12/25 08:26:12 INFO mapred.JobClient: Map output bytes=16546
08/12/25 08:26:12 INFO mapred.JobClient: Map input records=202
08/12/25 08:26:12 INFO mapred.JobClient: Combine output records=1292
08/12/25 08:26:12 INFO mapred.JobClient: Map input bytes=11358
08/12/25 08:26:12 INFO mapred.JobClient: Combine input records=2280
08/12/25 08:26:12 INFO mapred.JobClient: Reduce input groups=593
08/12/25 08:26:12 INFO mapred.JobClient: Reduce output records=593
08/12/25 08:26:12 INFO mapred.JobClient: File Systems
08/12/25 08:26:12 INFO mapred.JobClient: HDFS bytes written=6117
08/12/25 08:26:12 INFO mapred.JobClient: Local bytes written=19010
08/12/25 08:26:12 INFO mapred.JobClient: HDFS bytes read=13872
08/12/25 08:26:12 INFO mapred.JobClient: Local bytes read=8620
4. HBase configuration
4.1 $HBASE_HOME/conf/hbase-site.xml
<configuration>

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:54310/hbase</value>
  <description>The directory shared by region servers.</description>
</property>

<property>
  <name>hbase.master</name>
  <value>master:60000</value>
  <description>The host and port that the HBase master runs at.</description>
</property>

</configuration>
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
HBase's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
Hadoop 0.18.2
[root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz'
--02:49:53-- http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz
Resolving mirror.mirimar.net... 194.90.150.47
Connecting to mirror.mirimar.net|194.90.150.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16836495 (16M) [application/x-gzip]
Saving to: `hadoop-0.18.2.tar.gz'
100%[=========================================================================================================================================>] 16,836,495 471K/s in 36s
HBase 0.18.1
[root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz'
--02:51:42-- http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz
Resolving mirror.mirimar.net... 194.90.150.47
Connecting to mirror.mirimar.net|194.90.150.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16295734 (16M) [application/x-gzip]
Saving to: `hbase-0.18.1.tar.gz'
100%[=========================================================================================================================================>] 16,295,734 496K/s in 34s
Please don't forget to add all environment variables to ~/.bash_profile; otherwise the exports will be lost after you disconnect your SSH session.
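A sketch of the relevant lines in ~/.bash_profile, assuming the install locations used in this guide (adjust the paths to match your setup; adding the bin directories to PATH is optional but convenient):

# locations used throughout this guide
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14
export HADOOP_HOME=/usr/local/bin/hadoop
export HBASE_HOME=/usr/local/bin/hbase
# optional: put the tools on the PATH
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin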
2.1 Java
For now we'll use HBase 0.18.1, which was compiled with JDK 1.5, and Hadoop 0.18.2, which supports JDK 1.5.x. Hadoop 0.19.0 is available today, but it requires JDK 1.6, and HBase is supposed to support JDK 1.6 only as of version 0.19.
Install JDK 1.5.0_14:
[root@localhost /usr/local/bin]wget 'http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin' -O jdk-1_5_0_14-linux-i586.bin
--03:25:00-- http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin
Resolving cds.sun.com... 72.5.239.134
Connecting to cds.sun.com|72.5.239.134|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin [following]
--03:25:01-- http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin
Resolving cds-esd.sun.com... 98.27.88.9, 98.27.88.39
Connecting to cds-esd.sun.com|98.27.88.9|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49649265 (47M) [application/x-sdlc]
Saving to: `jdk-1_5_0_14-linux-i586.bin'
100%[=========================================================================================================================================>] 49,649,265 5.36M/s in 9.0s
Hadoop requires passwordless SSH public-key access to manage its nodes, i.e. the remote machines plus your local machine if you want to use Hadoop on it, so:
[root@rt ~/.ssh] ssh-keygen -t rsa
# This will create two files in your ~/.ssh directory:
#   id_rsa: your private key
#   id_rsa.pub: your public key
[root@rt ~/.ssh] cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[root@rt ~/.ssh] ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 55:d7:91:86:ea:86:8f:51:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
[root@rt ~]
As you can see, the final test is to check whether you are able to make an SSH public-key authenticated connection to localhost.
If the SSH connection fails, these general tips might help:
Enable debugging with ssh -vvv localhost and investigate the error in detail.
Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hadoop user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
2.3 Disabling IPv6
To disable IPv6 on CentOS Linux, open /etc/modprobe.d/blacklist in the editor of your choice and add the following lines to the end of the file:
# disable IPv6
blacklist ipv6
You have to reboot your machine in order to make the changes take effect.
2.4 Edit the open files limit
Edit the file /etc/security/limits.conf and add the following lines:
root - nofile 100000
root - locks 100000
Run ulimit -n 1000000 in shell.
3. Hadoop
3.1 Installation
- Unpack the Hadoop archive to /usr/local/bin (it could be any directory)
- Move the unpacked directory to /usr/local/bin/hadoop: mv hadoop-0.18.2 hadoop
- Set HADOOP_HOME: export HADOOP_HOME=/usr/local/bin/hadoop
3.2 Configuration
Set up JAVA_HOME in $HADOOP_HOME/conf/hadoop-env.sh to point to your Java location:
# The java implementation to use. Required.
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14
Set up $HADOOP_HOME/conf/hadoop-site.xml
Any site-specific configuration of Hadoop is configured in $HADOOP_HOME/conf/hadoop-site.xml. Here we will configure the directory where Hadoop will store its data files, the ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.
You can leave the settings below as is with the exception of the hadoop.tmp.dir variable which you have to change to the directory of your choice, for example:
/usr/local/hadoop-datastore/hadoop-${user.name}.
Hadoop will expand ${user.name} to the system user which is running Hadoop, so in our case this will be hadoop and thus the final path will be /usr/local/hadoop-datastore/hadoop-hadoop.
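If the directory does not exist yet, it can help to create it up front and make sure the user running Hadoop owns it. A sketch, using the path from the configuration below (substitute whatever you chose for hadoop.tmp.dir and your actual Hadoop user):

mkdir -p /usr/local/bin/hadoop/datastore
# if Hadoop runs as a dedicated user rather than root, give it ownership, e.g.:
# chown -R hadoop:hadoop /usr/local/bin/hadoop/datastore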
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/bin/hadoop/datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>

</configuration>
2.2 Formatting the name node
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
[root@cc hadoop]# bin/hadoop namenode -format
The output is supposed to look like this:
[root@cc hadoop]# bin/hadoop namenode -format
08/12/24 10:56:34 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = cc.d.de.static.xlhost.com/206.222.13.204
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.18.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
Re-format filesystem in /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name ? (Y or N) Y
08/12/24 10:57:40 INFO fs.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
08/12/24 10:57:40 INFO fs.FSNamesystem: supergroup=supergroup
08/12/24 10:57:40 INFO fs.FSNamesystem: isPermissionEnabled=true
08/12/24 10:57:40 INFO dfs.Storage: Image file of size 78 saved in 0 seconds.
08/12/24 10:57:40 INFO dfs.Storage: Storage directory /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name has been successfully formatted.
08/12/24 10:57:40 INFO dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at cc.d.de.static.xlhost.com/206.222.13.204
************************************************************/
2.3 Starting/Stopping your single-node cluster
Run the command:
[root@cc hadoop]# $HADOOP_HOME/bin/start-all.sh
You are supposed to see the following output:
starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-cc.com.out
localhost: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-cc.out
localhost: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-cc.out
starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-cc.com.out
localhost: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-cc.com.out
Run the example MapReduce job that comes with the Hadoop installation:
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 11358 2008-12-25 04:54 /user/root/testWordCount
[root@38 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-output
08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 04:55:48 INFO mapred.JobClient: Running job: job_200812250447_0001
08/12/25 04:55:49 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 04:55:51 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 04:55:56 INFO mapred.JobClient: Job complete: job_200812250447_0001
08/12/25 04:55:56 INFO mapred.JobClient: Counters: 16
08/12/25 04:55:56 INFO mapred.JobClient: Job Counters
08/12/25 04:55:56 INFO mapred.JobClient: Data-local map tasks=2
08/12/25 04:55:56 INFO mapred.JobClient: Launched reduce tasks=1
08/12/25 04:55:56 INFO mapred.JobClient: Launched map tasks=2
08/12/25 04:55:56 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 04:55:56 INFO mapred.JobClient: Map output records=1581
08/12/25 04:55:56 INFO mapred.JobClient: Reduce input records=593
08/12/25 04:55:56 INFO mapred.JobClient: Map output bytes=16546
08/12/25 04:55:56 INFO mapred.JobClient: Map input records=202
08/12/25 04:55:56 INFO mapred.JobClient: Combine output records=1292
08/12/25 04:55:56 INFO mapred.JobClient: Map input bytes=11358
08/12/25 04:55:56 INFO mapred.JobClient: Combine input records=2280
08/12/25 04:55:56 INFO mapred.JobClient: Reduce input groups=593
08/12/25 04:55:56 INFO mapred.JobClient: Reduce output records=593
08/12/25 04:55:56 INFO mapred.JobClient: File Systems
08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes written=6117
08/12/25 04:55:56 INFO mapred.JobClient: Local bytes written=18568
08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes read=13872
08/12/25 04:55:56 INFO mapred.JobClient: Local bytes read=8542
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls testWordCount-output
Found 2 items
drwxr-xr-x - root supergroup 0 2008-12-25 04:55 /user/root/testWordCount-output/_logs
-rw-r--r-- 1 root supergroup 6117 2008-12-25 04:55 /user/root/testWordCount-output/part-00000
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -cat testWordCount-output/part-00000
# you are supposed to see something like this:
...
tracking 1
trade 1
trademark, 1
trademarks, 1
transfer 1
transformation 1
translation 1
...
To stop the Hadoop cluster, run the following:
[root@37 /usr/local/bin/hadoop]bin/stop-all.sh
no jobtracker to stop
localhost: no tasktracker to stop
no namenode to stop
localhost: no datanode to stop
localhost: no secondarynamenode to stop
2.4 Hadoop monitoring and debugging
Please see the Hadoop tips on how to debug MapReduce programs. It is worth mentioning that the Hadoop logs in $HADOOP_HOME/logs provide the most information; there are also links to them from the Hadoop web interfaces.
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser.
- Unpack the HBase archive to /usr/local/bin
- Move hbase-0.18.1 to hbase
- Define HBASE_HOME to point to /usr/local/bin/hbase (don't forget to edit ~/.bash_profile)
- Define JAVA_HOME in $HBASE_HOME/conf/hbase-env.sh
A pseudo-distributed operation is simply a distributed operation run on a single host. Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of ${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance. Use hbase-site.xml to override the properties defined in ${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself should never be modified). At a minimum the hbase.rootdir property should be redefined in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property below to your hbase-site.xml says that HBase should use the /hbase directory in the HDFS whose namenode is at port 54310 on your local machine:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:54310/hbase</value>
  <description>The directory shared by region servers.</description>
</property>
Once you have a running HBase, you probably want a way to hook your application up to it. If your application is in Java, then you should use the Java API. The following example takes as input an Excel-format file and the name of an already existing HBase table, processes the records and writes them to HBase. You can look at a client example here.
// Imports for Hadoop 0.18.x / HBase 0.18.x; they were not part of the original listing.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Sample uploader Map-Reduce example class. Takes an Excel-format file as input
 * and writes its output to HBase.
 */
// Note: the generic type parameters below are assumed; they appear to have been
// stripped from the original listing.
public class SampleUploader extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text>, Tool {

  static enum Counters { MAP_LINES, REDUCE_LINES }

  private static final String NAME = "SampleUploader";
  private Configuration conf;

  static final String OUTPUT_COLUMN = "value:";
  static final String OUTPUT_KEY = "key:";

  long numRecords;
  private Text idText = new Text();
  private Text recordText = new Text();
  private String inputFile;

  /** A WritableComparator optimized for Text keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(Text.class);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
    }
  }

  public JobConf createSubmittableJob(String[] args) {
    JobConf c = new JobConf(getConf(), SampleUploader.class);
    c.setJobName(NAME);
    c.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(c, new Path(args[0]));
    // c.setInputPaths(new Path(args[0]));
    c.setMapperClass(this.getClass());
    c.setMapOutputKeyClass(Text.class);
    c.setMapOutputValueClass(Text.class);
    // TableUploader is the reducer class; it is referenced here but was not
    // included in the original listing.
    c.setReducerClass(TableUploader.class);
    TableReduce.initJob(args[1], TableUploader.class, c);
    return c;
  }

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
  }

  public void map(LongWritable k, Text v, OutputCollector<Text, Text> output, Reporter r)
      throws IOException {
    // The body of the map method was omitted in the original listing.
  }

  public int run(String[] args) throws Exception {
    // Make sure there are exactly 2 parameters left.
    if (args.length != 2) {
      System.out.println("ERROR: Wrong number of parameters: " + args.length + " instead of 2.");
      // printUsage() was not included in the original listing.
      return printUsage();
    }
    JobClient.runJob(createSubmittableJob(args));
    return 0;
  }

  public Configuration getConf() {
    return this.conf;
  }

  public void setConf(final Configuration c) {
    this.conf = c;
  }

  public static void main(String[] args) throws Exception {
    int errCode = ToolRunner.run(new Configuration(), new SampleUploader(), args);
    System.exit(errCode);
  }
}
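To actually launch a job class like this, you would typically package it into a jar and run it through the hadoop command. The jar and file names below are hypothetical; the two arguments correspond to args[0] (the input path in HDFS) and args[1] (the name of the existing HBase table) expected by run() above:

# hypothetical jar name and paths; adjust to your own build and data
bin/hadoop jar sample-uploader.jar SampleUploader /user/root/input.xls myTable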
3.4 Running HBase and using HBase shell
Start HBase with the following command:
${HBASE_HOME}/bin/start-hbase.sh
If HBase has started successfully, running jps should show the HBase daemons alongside the Hadoop ones.
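A rough sketch of what that can look like (process IDs will differ and the exact set of daemons depends on your configuration):

[root@37 /usr/local/bin/hbase]jps
28527 NameNode
28638 DataNode
28764 SecondaryNameNode
28847 JobTracker
28962 TaskTracker
29410 HMaster
29531 HRegionServer
29765 Jps

To try the HBase shell mentioned in this section's title, start it from $HBASE_HOME and issue a few commands. A small sketch, assuming the JRuby-based shell shipped with this HBase release; the table and column family names are made up, and typing help inside the shell lists the exact commands your version supports:

[root@37 /usr/local/bin/hbase]bin/hbase shell
hbase(main):001:0> create 'testtable', {NAME => 'data'}
hbase(main):002:0> put 'testtable', 'row1', 'data:1', 'value1'
hbase(main):003:0> scan 'testtable'
hbase(main):004:0> list
hbase(main):005:0> exit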
-ls path
Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr path
Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du path
Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-dus path
Like -du, but prints a summary of disk usage of all files/directories in the path.
-mv src dest
Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest
Copies the file or directory identified by src to dest, within HDFS.
-rm path
Removes the file or empty directory identified by path.
-rmr path
Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrc dest
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
-copyFromLocal localSrc dest
Identical to -put
-moveFromLocal localSrc dest
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] src localDest
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge src localDest [addnl]
Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat filename
Displays the contents of filename on stdout.
-copyToLocal [-crc] src localDest
Identical to -get
-moveToLocal [-crc] src localDest
Works like -get, but deletes the HDFS copy on success.
-mkdir path
Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).
-setrep [-R] [-w] rep path
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)
-touchz path
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] path
Returns 1 if path exists, has zero length, or is a directory; returns 0 otherwise.
-stat [format] path
Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file
Shows the last 1KB of file on stdout.
-chmod [-R] mode,mode,... path...
Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path...
Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
-chgrp [-R] group path...
Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.
-help cmd
Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
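A short session tying a few of these commands together (directory and file names are only examples):

bin/hadoop dfs -mkdir /user/root/demo
bin/hadoop dfs -put LICENSE.txt /user/root/demo
bin/hadoop dfs -ls /user/root/demo
bin/hadoop dfs -cat /user/root/demo/LICENSE.txt
bin/hadoop dfs -rmr /user/root/demo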
First you need to find out your hostname; you can do this with:
$ hostname
localhost.localdomain
$
Edit /etc/hosts
You need to edit /etc/hosts and add a line for your host name:
$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
$
My new server's IP is 72.232.196.90 and I need to assign it the hostname server12.hosthat.com. To do this, I have edited /etc/hosts as follows.
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
72.232.196.90 server12.hosthat.com server12
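After saving the file you can quickly check that the name resolves and that the hostname is what you expect (on CentOS you may also want to persist the hostname in /etc/sysconfig/network); a sketch:

hostname
ping -c 1 server12.hosthat.com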
vmstat reports information about processes, memory, paging, block IO, traps, and cpu activity.
The first report produced gives averages since the last reboot. Additional reports give information on a sampling period of length delay. The process and memory reports are instantaneous in either case.
[root@]# vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 6085936 91884 1707572 1333604 48 50 70 73 46 7 1 1 23
0 0 0 6085936 88464 1707608 1335120 0 0 1374 49 2750 1784 0 0 100
0 0 0 6085936 88048 1707660 1335952 0 0 753 913 2532 1883 0 0 100
0 0 0 6085936 87632 1707824 1337460 0 0 1282 908 2452 2054 0 0 100
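As mentioned above, the first report shows averages since the last reboot, so for live monitoring it is the later samples that matter. A couple of common invocations (the delay and count values are arbitrary examples):

# one report per second, indefinitely (Ctrl-C to stop)
vmstat 1
# one report every 5 seconds, 10 reports, then exit
vmstat 5 10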
First, create a public/private key pair on the client that you will use to connect to the server (you will need to do this from each client machine from which you connect):
$ ssh-keygen -t rsa
This will create two files in your ~/.ssh directory:
id_rsa: your private key
id_rsa.pub: your public key
If you don't want to be asked for a passphrase each time you connect, just press Enter when asked for a passphrase while creating the key pair. It is up to you to decide whether or not you should passphrase-protect your key when you create it. If you don't protect your key, then anyone gaining access to your local machine automatically has SSH access to the remote server. Also, root on the local machine has access to your keys, although one assumes that if you can't trust root (or root is compromised) then you're in real trouble. Encrypting the key adds additional security at the expense of replacing the password prompt for the SSH server with a prompt for the key's passphrase. Now set permissions on your private key:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/id_rsa
Copy the public key (id_rsa.pub) to the server and install it to the authorized_keys list:
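A sketch of one common way to do this (substitute your own user name and server address):

cat ~/.ssh/id_rsa.pub | ssh user@server 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
# or, where the helper is available:
ssh-copy-id user@server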
Once you've checked you can successfully log in to the server using your public/private key pair, you can disable password authentication completely by adding the following setting to your /etc/ssh/sshd_config:
# Disable password authentication forcing use of keys
PasswordAuthentication no
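For the change to take effect, reload or restart the SSH daemon; on CentOS the service is typically called sshd (a sketch, assuming root access):

service sshd reload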
No long write-ups this week, just a short list of some great resources that I've found very inspirational and thought provoking. I've broken these resources up into two lists: Blogs and Presentations.
Blogs
The blogs listed below are ones that I subscribe to and are filled with some great posts about capacity planning, scalability problems and solutions, and distributed system information. Each blog is authored by exceptionally smart people and many of them have significant experience building production-level scalable systems.
Presentations
The presentations listed below are from the SlideShare site and are primarily the slides used to accompany scalability talks from around the world. Many of them outline the problems that various companies have encountered during their non-linear growth phases and how they've solved them by scaling their systems.
Amazon EC2 setup - link
Yahoo Hadoop tutorial - link
Michael Noll blog - link
Hadoop main page - link
Google lectures - link
HBase resources - link
Distributed computing (IBM) - link
HBase and BigTable - link