Sunday, December 28, 2008

Tasks management on Linux using crontab

Crontab

The crontab (cron derives from chronos, Greek for time; tab stands for table) command, found in Unix and Unix-like operating systems, is used to schedule commands to be executed periodically. To see what crontabs are currently running on your system, you can open a terminal and run:

crontab -l

To edit the list of cronjobs you can run:

crontab -e

This will open the default editor (vi or pico) so you can manipulate the crontab settings. When you save and exit the editor, all your cronjobs are saved into the crontab. Cronjobs are written in the following format:

* * * * * /bin/execute/this/script.sh

Scheduling explained

As you can see there are 5 stars. The stars represent different date parts in the following order:

  1. minute (from 0 to 59)
  2. hour (from 0 to 23)
  3. day of month (from 1 to 31)
  4. month (from 1 to 12)
  5. day of week (from 0 to 6) (0=Sunday)

Execute every minute

If you leave the star, or asterisk, it means every. Maybe that's a bit unclear. Let's use the previous example again:

* * * * * /bin/execute/this/script.sh

They are all still asterisks! So this means execute /bin/execute/this/script.sh:

  1. every minute
  2. of every hour
  3. of every day of the month
  4. of every month
  5. and every day in the week.

In short: This script is being executed every minute. Without exception.

Execute every Friday 1AM

So if we want to schedule the script to run at 1AM every Friday, we would need the following cronjob:

0 1 * * 5 /bin/execute/this/script.sh

Get it? The script is now being executed when the system clock hits:

  1. minute: 0
  2. of hour: 1
  3. of day of month: * (every day of month)
  4. of month: * (every month)
  5. and weekday: 5 (=Friday)
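
Two more schedules of the same kind, just as a sketch (the script path is the same placeholder used throughout this post); ranges (1-5) and comma-separated lists (1,15) are accepted in every field:

# At 8:30 on weekdays (Monday to Friday):
30 8 * * 1-5 /bin/execute/this/script.sh

# At midnight on the 1st and the 15th of every month:
0 0 1,15 * * /bin/execute/this/script.sh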

Execute 10 minutes past every hour on the 1st of every month

Here's another one, just for practice:

10 * 1 * * /bin/execute/this/script.sh

Fair enough, it takes some getting used to, but it offers great flexibility.

If you want to run something every 10 minutes

0,10,20,30,40,50 * * * * /bin/execute/this/script.sh

Mailing the output of just one cronjob

If you only want the output of a single cronjob mailed to you, pipe the job's output to the mail command and change the cronjob like this:

*/10 * * * * /bin/execute/this/script.sh 2>&1 | mail -s "Cronjob output" yourname@
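
If you would rather have cron mail you the output of all your jobs, most cron implementations (Vixie cron on Linux, for example) also honour a MAILTO variable at the top of the crontab; a minimal sketch, with a placeholder address:

MAILTO="yourname@example.com"
*/10 * * * * /bin/execute/this/script.sh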



Thursday, December 25, 2008

Running Hadoop with HBase On CentOS Linux (Multi-Node Cluster)

1.Prerequisites

1.1 Configure single nodes

Use my tutorial "Running Hadoop with HBase On CentOS Linux (Single-Node Cluster)".

1.2 SSH public access

Add a mapping to /etc/hosts for each of your nodes:

10.1.0.55 master
10.1.0.56 slave

To do that you need to append the public key from master to the slave's ~/.ssh/authorized_keys and vice versa, so eventually you should be able to do the following:

[root@37 /usr/local/bin/hbase]ssh master
The authenticity of host 'master (10.1.0.55)' can't be established.
RSA key fingerprint is 09:e2:73:ac:6f:42:d1:da:13:20:76:10:36:29:c4:62.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master,10.1.0.55' (RSA) to the list of known hosts.
Last login: Thu Dec 25 07:19:16 2008 from slave
[root@37 ~]exit
logout
Connection to master closed.
[root@37 /usr/local/bin/hbase]ssh slave
The authenticity of host 'slave (10.1.0.56)' can't be established.
RSA key fingerprint is 5c:4f:81:19:ad:f3:78:02:ce:64:f1:67:10:ca:c5:b8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave,10.1.0.56' (RSA) to the list of known hosts.
Last login: Thu Dec 25 07:16:47 2008 from 38.d.de.static.xlhost.com
[root@38 ~]exit
logout
Connection to slave closed.
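
One way to get the keys in place is sketched below. It assumes the key pairs already exist (see the ssh-keygen step in the single-node tutorial) and that password login between the hosts still works at this point:

# on master: append master's public key to slave's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh slave 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'

# on slave: the same in the other direction
cat ~/.ssh/id_rsa.pub | ssh master 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'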

2.Cluster Overview

The master node will also act as a slave because we only have two machines available in our cluster but still want to spread data storage and processing to multiple machines.

[Image: Hadoop multi-node cluster overview]

The master node will run the "master" daemons for each layer: namenode for the HDFS storage layer, and jobtracker for the MapReduce processing layer. Both machines will run the "slave" daemons: datanode for the HDFS layer, and tasktracker for MapReduce processing layer. Basically, the "master" daemons are responsible for coordination and management of the "slave" daemons while the latter will do the actual data storage and data processing work.

3. Hadoop core Configuration

3.1 conf/masters (master only)

The conf/masters file defines the master nodes of our multi-node cluster.

On master, update $HADOOP_HOME/conf/masters so that it looks like this:

master

3.2 conf/slaves (master only)

This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.

On master, update $HADOOP_HOME/conf/slaves so that it looks like this:

master
slave

If you have additional slave nodes, just add them to the conf/slaves file, one per line (do this on all machines in the cluster).


master
slave
slave01
slave02
slave03

3.3 conf/hadoop-site.xml (all machines)

The following is a sample configuration for all hosts; an explanation of all the parameters can be found at this link




<?xml version="1.0"?>
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/bin/hadoop/datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx500m</value>
  <description>Java opts for the task tracker child processes.
  The following symbol, if present, will be interpolated: @taskid@ is replaced
  by current TaskID. Any other occurrences of '@' will go unchanged.
  For example, to enable verbose gc logging to a file named for the taskid in
  /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
  -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc

  The configuration variable mapred.child.ulimit can be used to control the
  maximum virtual memory of the child processes.</description>
</property>

</configuration>




Very important: format each node's data storage now, before proceeding.
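
The formatting itself is done through the namenode on master, exactly as in the single-node tutorial (never run this against a cluster that already holds data):

[root@37 /usr/local/bin/hadoop]bin/hadoop namenode -format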

3.4 Start/Stop Hadoop cluster


HDFS daemons: start or stop them with bin/start-dfs.sh or bin/stop-dfs.sh on the machine you want the namenode to run on. This brings up HDFS with the namenode on that machine and datanodes on the machines listed in the conf/slaves file.
MapReduce daemons: run the command $HADOOP_HOME/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the previous command on, and tasktrackers on the machines listed in the conf/slaves file.
[root@37 /usr/local/bin/hadoop]bin/start-dfs.sh
starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-37.c3.33.static.xlhost.com.out
master: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-37.c3.33.static.xlhost.com.out
slave: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-38.c3.33.static.xlhost.com.out
master: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-37.c3.33.static.xlhost.com.out
[root@37 /usr/local/bin/hadoop]bin/start-mapred.sh
starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-37.c3.33.static.xlhost.com.out
master: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-37.c3.33.static.xlhost.com.out
slave: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-38.c3.33.static.xlhost.com.out
[root@37 /usr/local/bin/hadoop]jps
28638 DataNode
29052 Jps
28527 NameNode
28764 SecondaryNameNode
28847 JobTracker
28962 TaskTracker
[root@37 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
[root@37 /usr/local/bin/hadoop]bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 11358 2008-12-25 08:20 /user/root/testWordCount
[root@37 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-out
08/12/25 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 08:21:47 INFO mapred.JobClient: Running job: job_200812250818_0001
08/12/25 08:21:48 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 08:21:53 INFO mapred.JobClient: map 50% reduce 0%
08/12/25 08:21:55 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 08:22:01 INFO mapred.JobClient: map 100% reduce 12%
08/12/25 08:22:02 INFO mapred.JobClient: map 100% reduce 25%
08/12/25 08:22:05 INFO mapred.JobClient: map 100% reduce 41%
08/12/25 08:22:09 INFO mapred.JobClient: map 100% reduce 52%
08/12/25 08:22:14 INFO mapred.JobClient: map 100% reduce 64%
08/12/25 08:22:19 INFO mapred.JobClient: map 100% reduce 66%
08/12/25 08:22:24 INFO mapred.JobClient: map 100% reduce 77%
08/12/25 08:22:29 INFO mapred.JobClient: map 100% reduce 79%
08/12/25 08:25:30 INFO mapred.JobClient: map 100% reduce 89%
08/12/25 08:26:12 INFO mapred.JobClient: Job complete: job_200812250818_0001
08/12/25 08:26:12 INFO mapred.JobClient: Counters: 17
08/12/25 08:26:12 INFO mapred.JobClient: Job Counters
08/12/25 08:26:12 INFO mapred.JobClient: Data-local map tasks=1
08/12/25 08:26:12 INFO mapred.JobClient: Launched reduce tasks=9
08/12/25 08:26:12 INFO mapred.JobClient: Launched map tasks=2
08/12/25 08:26:12 INFO mapred.JobClient: Rack-local map tasks=1
08/12/25 08:26:12 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 08:26:12 INFO mapred.JobClient: Map output records=1581
08/12/25 08:26:12 INFO mapred.JobClient: Reduce input records=593
08/12/25 08:26:12 INFO mapred.JobClient: Map output bytes=16546
08/12/25 08:26:12 INFO mapred.JobClient: Map input records=202
08/12/25 08:26:12 INFO mapred.JobClient: Combine output records=1292
08/12/25 08:26:12 INFO mapred.JobClient: Map input bytes=11358
08/12/25 08:26:12 INFO mapred.JobClient: Combine input records=2280
08/12/25 08:26:12 INFO mapred.JobClient: Reduce input groups=593
08/12/25 08:26:12 INFO mapred.JobClient: Reduce output records=593
08/12/25 08:26:12 INFO mapred.JobClient: File Systems
08/12/25 08:26:12 INFO mapred.JobClient: HDFS bytes written=6117
08/12/25 08:26:12 INFO mapred.JobClient: Local bytes written=19010
08/12/25 08:26:12 INFO mapred.JobClient: HDFS bytes read=13872
08/12/25 08:26:12 INFO mapred.JobClient: Local bytes read=8620

4. HBase configuration

4.1 $HBASE_HOME/conf/hbase-site.xml



<?xml version="1.0"?>
<configuration>

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:54310/hbase</value>
  <description>The directory shared by region servers.</description>
</property>

<property>
  <name>hbase.master</name>
  <value>master:60000</value>
  <description>The host and port that the HBase master runs at.</description>
</property>

</configuration>




4.2 $HBASE_HOME/conf/regionservers

master
slave

4.3 Start HBase on master

Use the command $HBASE_HOME/bin/start-hbase.sh

[root@37 /usr/local/bin/hbase]bin/start-hbase.sh
starting master, logging to /usr/local/bin/hbase/bin/../logs/hbase-root-master-37.c3.33.static.xlhost.com.out
slave: starting regionserver, logging to /usr/local/bin/hbase/bin/../logs/hbase-root-regionserver-38.c3.33.static.xlhost.com.out
master: starting regionserver, logging to /usr/local/bin/hbase/bin/../logs/hbase-root-regionserver-37.c3.33.static.xlhost.com.out
[root@37 /usr/local/bin/hbase]jps
30362 SecondaryNameNode
30455 JobTracker
30570 TaskTracker
30231 DataNode
32003 Jps
31772 HMaster
30115 NameNode
31919 HRegionServer

On the slave you should see the following output:

[root@38 /usr/local/bin/hbase]jps
513 DataNode
6360 Jps
6214 HRegionServer
628 TaskTracker

Wednesday, December 24, 2008

Code highlighting in blogspot

Adding syntax highlight to blogger is very simple, first of all you need to download SyntaxHighlighter.

Now extract the contents of the package and upload the Scripts and Styles folders to any host or website that can be linked from your blog.

To make it work, you will need to edit your blog's template and add the following code after the <!-- end outer-wrapper --> tag:





To add some java code to your blog use:





The list of supported languages and their aliases can be found at the syntaxhighlighter wiki

Running Hadoop with HBase on CentOS Linux (Single-Node Cluster)


1.Purpose of this tutorial:


In this tutorial, I will describe the required steps for setting up a single-node Hadoop/Hbase cluster using the Hadoop Distributed File System (HDFS) on CentOS Linux.

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and like Hadoop designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.

HBase's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.

2.Prerequisites

  • CentOS Linux.
  • Hadoop 0.18.2


  • [root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz'
    --02:49:53-- http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz
    Resolving mirror.mirimar.net... 194.90.150.47
    Connecting to mirror.mirimar.net|194.90.150.47|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 16836495 (16M) [application/x-gzip]
    Saving to: `hadoop-0.18.2.tar.gz'

    100%[=========================================================================================================================================>] 16,836,495 471K/s in 36s

  • HBase 0.18.1


  • [root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz'
    --02:51:42-- http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz
    Resolving mirror.mirimar.net... 194.90.150.47
    Connecting to mirror.mirimar.net|194.90.150.47|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 16295734 (16M) [application/x-gzip]
    Saving to: `hbase-0.18.1.tar.gz'

    100%[=========================================================================================================================================>] 16,295,734 496K/s in 34s

  • JDK 1.5.0_14 from Sun
  • SSH
  • Edit environment settings


  • [root@34 ~]# vi ~/.bash_profile
    # JDK installation directory
    export JAVA_HOME=/usr/local/bin/jdk1.5.0_14
    PATH=$JAVA_HOME/bin:$PATH:$HOME/bin
    export HADOOP_HOME=/usr/local/bin/hadoop
    export HBASE_HOME=/usr/local/bin/hbase
    # optional
    export PS1="[\u@\h \w]"

Please don't forget to add all environment variables to ~/.bash_profile; otherwise all your exports will be lost after you disconnect your SSH session.
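
To apply the changes to the current session without reconnecting, you can source the profile and check one of the variables, for example:

source ~/.bash_profile
echo $JAVA_HOME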

2.1 Java

For now we'll use HBase 0.18.1, which was compiled with JDK 1.5, and Hadoop 0.18.2, which supports JDK 1.5.x. Hadoop 0.19.0 is available today, but it requires JDK 1.6; HBase is expected to support JDK 1.6 only from version 0.19.

Install JDK 1.5.0_14:


[root@localhost /usr/local/bin]wget 'http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin' -O jdk-1_5_0_14-linux-i586.bin
--03:25:00-- http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin
Resolving cds.sun.com... 72.5.239.134
Connecting to cds.sun.com|72.5.239.134|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin [following]
--03:25:01-- http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin
Resolving cds-esd.sun.com... 98.27.88.9, 98.27.88.39
Connecting to cds-esd.sun.com|98.27.88.9|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49649265 (47M) [application/x-sdlc]
Saving to: `jdk-1_5_0_14-linux-i586.bin'

100%[=========================================================================================================================================>] 49,649,265 5.36M/s in 9.0s

03:25:11 (5.25 MB/s) - `jdk-1_5_0_14-linux-i586.bin' saved [49649265/49649265]

Define JAVA_HOME and add $JAVA_HOME/bin to $PATH (see the ~/.bash_profile example above).

2.2 SSH public key authentication

Hadoop requires SSH public-key access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it, so:

[root@rt ~/.ssh] ssh-keygen -t rsa
# This will create two files in your ~/.ssh directory
# id_rsa: your private key
# id_rsa.pub: your public key
[root@rt ~/.ssh] cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[root@rt ~/.ssh] ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 55:d7:91:86:ea:86:8f:51:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
[root@rt ~]

As you can see, the final test is to check that you are able to make an SSH public-key authenticated connection to localhost.

If the SSH connection fails, these general tips might help:

  • Enable debugging with ssh -vvv localhost and investigate the error in detail.
  • Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hadoop user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.

2.3 Disabling IPv6


To disable IPv6 on CentOS Linux, open /etc/modprobe.d/blacklist in the editor of your choice and add the following lines to the end of the file:
 # disable IPv6
blacklist ipv6

You have to reboot your machine in order to make the changes take effect.
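
After the reboot, a quick way to verify the change is to check that the ipv6 module is no longer loaded (no output means it is disabled):

lsmod | grep ipv6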

2.4 Edit open files limit.

Edit file /etc/security/limits.conf, add the following lines:

root - nofile 100000

root - locks 100000

Run ulimit -n 1000000 in shell.

3.Hadoop

3.1 Installation

- Unpack the Hadoop archive to /usr/local/bin (it could be any directory)

- Move the unpacked directory to /usr/local/bin/hadoop: mv hadoop-0.18.2 hadoop

- Set HADOOP_HOME: export HADOOP_HOME=/usr/local/bin/hadoop

3.2 Configuration

Set up JAVA_HOME in $HADOOP_HOME/conf/hadoop-env.sh to point to your java location:

# The java implementation to use. Required.
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14


Set up $HADOOP_HOME/conf/hadoop-site.xml


Any site-specific configuration of Hadoop is configured in $HADOOP_HOME/conf/hadoop-site.xml. Here we will configure the directory where Hadoop will store its data files, the ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.

You can leave the settings below as is, with the exception of the hadoop.tmp.dir variable, which you have to change to the directory of your choice, for example:

/usr/local/bin/hadoop/datastore/hadoop-${user.name}

Hadoop will expand ${user.name} to the system user which is running Hadoop, so in our case (running as root) the final path will be /usr/local/bin/hadoop/datastore/hadoop-root.
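
If the datastore directory does not exist yet, it is worth creating it up front (a small sketch using the path from this tutorial; Hadoop can usually create it itself, but creating it manually avoids permission surprises):

mkdir -p /usr/local/bin/hadoop/datastore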







<?xml version="1.0"?>
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/bin/hadoop/datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>

</configuration>





3.3 Formatting the name node

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, this will cause all your data to be erased.

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

[root@cc hadoop]# bin/hadoop namenode -format

The output should look like this:



[root@cc hadoop]# bin/hadoop namenode -format
08/12/24 10:56:34 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = cc.d.de.static.xlhost.com/206.222.13.204
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.18.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
Re-format filesystem in /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name ? (Y or N) Y
08/12/24 10:57:40 INFO fs.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
08/12/24 10:57:40 INFO fs.FSNamesystem: supergroup=supergroup
08/12/24 10:57:40 INFO fs.FSNamesystem: isPermissionEnabled=true
08/12/24 10:57:40 INFO dfs.Storage: Image file of size 78 saved in 0 seconds.
08/12/24 10:57:40 INFO dfs.Storage: Storage directory /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name has been successfully formatted.
08/12/24 10:57:40 INFO dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at cc.d.de.static.xlhost.com/206.222.13.204
************************************************************/


3.4 Starting/Stopping your single-node cluster

Run the command:



[root@cc hadoop]# $HADOOP_HOME/bin/start-all.sh

You should see the following output:

starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-cc.com.out
localhost: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-cc.out
localhost: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-cc.out
starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-cc.com.out
localhost: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-cc.com.out

Run example map-reduce job that comes with hadoop installation:

[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 11358 2008-12-25 04:54 /user/root/testWordCount
[root@38 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-output
08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 04:55:48 INFO mapred.JobClient: Running job: job_200812250447_0001
08/12/25 04:55:49 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 04:55:51 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 04:55:56 INFO mapred.JobClient: Job complete: job_200812250447_0001
08/12/25 04:55:56 INFO mapred.JobClient: Counters: 16
08/12/25 04:55:56 INFO mapred.JobClient: Job Counters
08/12/25 04:55:56 INFO mapred.JobClient: Data-local map tasks=2
08/12/25 04:55:56 INFO mapred.JobClient: Launched reduce tasks=1
08/12/25 04:55:56 INFO mapred.JobClient: Launched map tasks=2
08/12/25 04:55:56 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 04:55:56 INFO mapred.JobClient: Map output records=1581
08/12/25 04:55:56 INFO mapred.JobClient: Reduce input records=593
08/12/25 04:55:56 INFO mapred.JobClient: Map output bytes=16546
08/12/25 04:55:56 INFO mapred.JobClient: Map input records=202
08/12/25 04:55:56 INFO mapred.JobClient: Combine output records=1292
08/12/25 04:55:56 INFO mapred.JobClient: Map input bytes=11358
08/12/25 04:55:56 INFO mapred.JobClient: Combine input records=2280
08/12/25 04:55:56 INFO mapred.JobClient: Reduce input groups=593
08/12/25 04:55:56 INFO mapred.JobClient: Reduce output records=593
08/12/25 04:55:56 INFO mapred.JobClient: File Systems
08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes written=6117
08/12/25 04:55:56 INFO mapred.JobClient: Local bytes written=18568
08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes read=13872
08/12/25 04:55:56 INFO mapred.JobClient: Local bytes read=8542
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls testWordCount-output
Found 2 items
drwxr-xr-x - root supergroup 0 2008-12-25 04:55 /user/root/testWordCount-output/_logs
-rw-r--r-- 1 root supergroup 6117 2008-12-25 04:55 /user/root/testWordCount-output/part-00000
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -cat testWordCount-output/part-00000
# you should see something like this
...
tracking 1
trade 1
trademark, 1
trademarks, 1
transfer 1
transformation 1
translation 1
...

To stop Hadoop cluster run the following:


[root@37 /usr/local/bin/hadoop]bin/stop-all.sh
no jobtracker to stop
localhost: no tasktracker to stop
no namenode to stop
localhost: no datanode to stop
localhost: no secondarynamenode to stop

3.5 Hadoop monitoring and debugging

Please see the Hadoop tips on how to debug Map-Reduce programs. It is worth mentioning that the Hadoop logs under $HADOOP_HOME/logs provide the most information; they are also linked from the Hadoop web interfaces.

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations: the job tracker on port 50030, the task tracker on port 50060 and the name node on port 50070 (you can see them listening in the netstat output below).

The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. The task tracker web UI shows you running and non-running tasks. The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser.

If everything works fine, you should see the following output after running the jps utility:


[root@cc hadoop]# jps
3060 SecondaryNameNode
3136 JobTracker
2814 NameNode
3270 TaskTracker
3458 Jps

[root@38 /usr/local/bin/hadoop]netstat -plten | grep java
tcp 0 0 :::50020 :::* LISTEN 0 1248347094 5131/java
tcp 0 0 ::ffff:127.0.0.1:54310 :::* LISTEN 0 1248346675 5032/java
tcp 0 0 ::ffff:127.0.0.1:54311 :::* LISTEN 0 1248347128 5351/java
tcp 0 0 :::50090 :::* LISTEN 0 1248347119 5272/java
tcp 0 0 :::50060 :::* LISTEN 0 1248347327 5467/java
tcp 0 0 :::50030 :::* LISTEN 0 1248347293 5351/java
tcp 0 0 :::48498 :::* LISTEN 0 1248347011 5272/java
tcp 0 0 :::50070 :::* LISTEN 0 1248346888 5032/java
tcp 0 0 :::53210 :::* LISTEN 0 1248347103 5351/java
tcp 0 0 :::50010 :::* LISTEN 0 1248347000 5131/java
tcp 0 0 :::50075 :::* LISTEN 0 1248347020 5131/java
tcp 0 0 :::40315 :::* LISTEN 0 1248346669 5032/java
tcp 0 0 ::ffff:127.0.0.1:47198 :::* LISTEN 0 1248347345 5467/java
tcp 0 0 :::56575 :::* LISTEN 0 1248346834 5131/java


4. HBase

4.1 Configuration

- Unpack the HBase archive to /usr/local/bin
- Move hbase-0.18.1 to hbase: mv hbase-0.18.1 hbase
- Define HBASE_HOME to point to /usr/local/bin/hbase (don't forget to edit ~/.bash_profile)
- Define JAVA_HOME in $HBASE_HOME/conf/hbase-env.sh, for instance:

export JAVA_HOME=/usr/local/bin/jdk1.5.0_14/

- Edit $HADOOP_HOME/conf/hadoop-env.sh and add the HBase jars and configuration to the Hadoop classpath:

export HADOOP_CLASSPATH=/usr/local/bin/hbase/conf:/usr/local/bin/hbase/hbase-0.18.1.jar:/usr/local/bin/hbase/hbase-0.18.1-test.jar


4.2 Pseudo-Distributed Operation

A pseudo-distributed operation is simply a distributed operation run on a single host. Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of ${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance. Use hbase-site.xml to override the properties defined in ${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself should never be modified). At a minimum the hbase.rootdir property should be redefined in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property below to your hbase-site.xml says that HBase should use the /hbase directory in the HDFS whose namenode is at port 54310 on your local machine:





<?xml version="1.0"?>
<configuration>

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:54310/hbase</value>
  <description>The directory shared by region servers.</description>
</property>

</configuration>



4.3 Example API Usage

Once you have a running HBase, you probably want a way to hook your application up to it. If your application is in Java, then you should use the Java API. The following example takes as input an Excel-formatted file and the name of an already existing HBase table, processes the records and writes them to HBase. You can also look at the client example here



package org.examples;

import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Sample uploader Map-Reduce example class. Takes an Excel-format file as input
 * and writes its output to HBase.
 */
public class SampleUploader extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text>, Tool {

    static enum Counters { MAP_LINES, REDUCE_LINES }

    private static final String NAME = "SampleUploader";
    static final String OUTPUT_COLUMN = "value:";
    static final String OUTPUT_KEY = "key:";

    private Configuration conf;
    private long numRecords;
    private Text idText = new Text();
    private Text recordText = new Text();
    private String inputFile;

    /** A WritableComparator optimized for Text keys. */
    public static class Comparator extends WritableComparator {
        public Comparator() {
            super(Text.class);
        }

        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            // Skip the encoded length prefix and compare the raw bytes.
            int n1 = WritableUtils.decodeVIntSize(b1[s1]);
            int n2 = WritableUtils.decodeVIntSize(b2[s2]);
            return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
        }
    }

    public JobConf createSubmittableJob(String[] args) {
        JobConf c = new JobConf(getConf(), SampleUploader.class);
        c.setJobName(NAME);
        c.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(c, new Path(args[0]));
        c.setMapperClass(this.getClass());
        c.setMapOutputKeyClass(Text.class);
        c.setMapOutputValueClass(Text.class);
        c.setReducerClass(TableUploader.class);
        // args[1] is the name of the (already existing) HBase table to write to.
        TableReduce.initJob(args[1], TableUploader.class, c);
        return c;
    }

    public void configure(JobConf job) {
        inputFile = job.get("map.input.file");
    }

    public void map(LongWritable k, Text v,
                    OutputCollector<Text, Text> output, Reporter r) throws IOException {
        // Strip the double-quoted fields (URLs) from the line, then extract the user id.
        String lineWithoutURLs = Pattern.compile("\"[^\"]*\" ").matcher(v.toString()).replaceAll("");
        String userID = lineWithoutURLs.substring(22, 54);

        r.incrCounter(Counters.MAP_LINES, 1);
        if ((++numRecords % 10000) == 0) {
            System.out.println("Finished mapping of " + numRecords
                + " records from the input file: " + inputFile);
        }
        idText.set(userID);
        recordText.set(lineWithoutURLs);
        output.collect(idText, recordText);
    }

    public static class TableUploader extends TableReduce<Text, Text> {

        @Override
        public void reduce(Text k, Iterator<Text> v,
                           OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
                           Reporter r) throws IOException {
            // One BatchUpdate per row key; each value becomes a timestamped cell.
            BatchUpdate outval = new BatchUpdate(k.toString());
            while (v.hasNext()) {
                String value = v.next().toString();
                String dateStamp = value.substring(0, 20).replaceAll("[-:,]", "");
                outval.put(OUTPUT_COLUMN + dateStamp, value.getBytes());
                r.incrCounter(Counters.REDUCE_LINES, 1);
            }
            output.collect(new ImmutableBytesWritable(k.getBytes()), outval);
        }
    }

    static int printUsage() {
        System.out.println(NAME + " <input file> <table name>");
        return -1;
    }

    public int run(String[] args) throws Exception {
        // Make sure there are exactly 2 parameters left.
        if (args.length != 2) {
            System.out.println("ERROR: Wrong number of parameters: "
                + args.length + " instead of 2.");
            return printUsage();
        }
        JobClient.runJob(createSubmittableJob(args));
        return 0;
    }

    public Configuration getConf() {
        return this.conf;
    }

    public void setConf(final Configuration c) {
        this.conf = c;
    }

    public static void main(String[] args) throws Exception {
        int errCode = ToolRunner.run(new Configuration(), new SampleUploader(), args);
        System.exit(errCode);
    }
}


4.4 Running HBase and using HBase shell

Start HBase with the following command:

${HBASE_HOME}/bin/start-hbase.sh

If HBase has started successfully you should see the following output after running jps:

[root@37 /usr/local/bin/hadoop]jps
10379 DataNode
18303 Jps
10637 TaskTracker
10536 JobTracker
10286 NameNode
17512 HMaster

Check if HBase is running with web interface: http://localhost:60030

Once HBase has started, run HBase Shell with

${HBASE_HOME}/bin/hbase shell

Create a sample table for our tests:

create 'table_keyInMemory', {NAME => 'key',IN_MEMORY => true, VERSIONS => 1,BLOCKCACHE => true},{NAME => 'value',VERSIONS => 1}
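
A few basic operations on the new table, as a sketch from inside the HBase shell (the row key, column and value are only examples; the column must belong to one of the families defined above):

put 'table_keyInMemory', 'row1', 'value:20081225', 'some value'
get 'table_keyInMemory', 'row1'
scan 'table_keyInMemory'
list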

To stop HBase, exit the HBase shell and enter:

${HBASE_HOME}/bin/stop-hbase.sh

4.5 Finally: Running Map-Reduce jobs

First copy an input file from local filesystem:

[root@37 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal /localFile fileNameInHadoopDFS
[root@37 /usr/local/bin/hadoop] bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 27325 2008-12-25 03:26 /user/root/fileNameInHadoopDFS
# Run the Map-Reduce job
[root@37 /usr/local/bin/hadoop]bin/hadoop jar Test.jar org.exelate.Uploader 100 sampleTable
08/12/25 03:55:40 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 03:55:40 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 03:55:40 INFO mapred.JobClient: Running job: job_200812250352_0002
08/12/25 03:55:41 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 03:55:44 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 03:55:50 INFO mapred.JobClient: Job complete: job_200812250352_0002
08/12/25 03:55:50 INFO mapred.JobClient: Counters: 17
08/12/25 03:55:50 INFO mapred.JobClient: Job Counters
08/12/25 03:55:50 INFO mapred.JobClient: Data-local map tasks=2
08/12/25 03:55:50 INFO mapred.JobClient: Launched reduce tasks=1
08/12/25 03:55:50 INFO mapred.JobClient: Launched map tasks=2
08/12/25 03:55:50 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 03:55:50 INFO mapred.JobClient: Map output records=100
08/12/25 03:55:50 INFO mapred.JobClient: Reduce input records=100
08/12/25 03:55:50 INFO mapred.JobClient: Map output bytes=16516
08/12/25 03:55:50 INFO mapred.JobClient: Map input records=100
08/12/25 03:55:50 INFO mapred.JobClient: Combine output records=0
08/12/25 03:55:50 INFO mapred.JobClient: Map input bytes=27325
08/12/25 03:55:50 INFO mapred.JobClient: Combine input records=0
08/12/25 03:55:50 INFO mapred.JobClient: Reduce input groups=44
08/12/25 03:55:50 INFO mapred.JobClient: Reduce output records=44
08/12/25 03:55:50 INFO mapred.JobClient: File Systems
08/12/25 03:55:50 INFO mapred.JobClient: Local bytes written=33928
08/12/25 03:55:50 INFO mapred.JobClient: HDFS bytes read=30048
08/12/25 03:55:50 INFO mapred.JobClient: Local bytes read=16921
08/12/25 03:55:50 INFO mapred.JobClient: org.exelate.SampleUploader$Counters
08/12/25 03:55:50 INFO mapred.JobClient: MAP_LINES=100
08/12/25 03:55:50 INFO mapred.JobClient: REDUCE_LINES=100

HDFS Command Reference

There are many more commands in bin/hadoop dfs than were demonstrated here, although these basic operations will get you started. Running bin/hadoop dfs with no additional arguments will list all commands which can be run with the FsShell system. Furthermore, bin/hadoop dfs -help commandName will display a short usage summary for the operation in question, if you are stuck.

A table of all operations is reproduced below. The following conventions are used for parameters:

Command    Operation
-ls path Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr path Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du path Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-dus path Like -du, but prints a summary of disk usage of all files/directories in the path.
-mv src dest Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest Copies the file or directory identified by src to dest, within HDFS.
-rm path Removes the file or empty directory identified by path.
-rmr path Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrc dest Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
-copyFromLocal localSrc dest Identical to -put
-moveFromLocal localSrc dest Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] src localDest Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge src localDest [addnl] Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat filename Displays the contents of filename on stdout.
-copyToLocal [-crc] src localDest Identical to -get
-moveToLocal [-crc] src localDest Works like -get, but deletes the HDFS copy on success.
-mkdir path Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).
-setrep [-R] [-w] rep path Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)
-touchz path Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] path Returns 1 if path exists; has zero length; or is a directory, or 0 otherwise.
-stat [format] path Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file Shows the last 1KB of file on stdout.
-chmod [-R] mode,mode,... path... Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path... Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
-chgrp [-R] group path... Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.
-help cmd Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd
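
A short session tying a few of these commands together (testdir is just an example name):

# create a directory, upload a file, inspect it and clean up
bin/hadoop dfs -mkdir testdir
bin/hadoop dfs -put LICENSE.txt testdir
bin/hadoop dfs -ls testdir
bin/hadoop dfs -cat testdir/LICENSE.txt
bin/hadoop dfs -rmr testdir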

Tuesday, December 23, 2008

Change hostname in Linux

Change hostname in Linux

First you need to find out your hostname, you can do this with

$ hostname
localhost.localdomain
$

Edit /etc/hosts

You need to edit /etc/hosts and add a line for your host name

$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
$

My new server's IP is 72.232.196.90 and I need to assign it the hostname server12.hosthat.com. To do this, I have edited /etc/hosts as follows.

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
72.232.196.90 server12.hosthat.com server12

Edit /etc/sysconfig/network

First lets see what is in the file

$ cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=localhost.localdomain
$

To change the server's hostname to server12.hosthat.com, change the file as follows.

$ cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=server12.hosthat.com
$

Now you need to reboot the server to change the hostname.
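
After the reboot you can verify the change, for example:

hostname      # should now print server12.hosthat.com
hostname -f   # should print the fully qualified name taken from /etc/hosts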

vmstat

vmstat

Report virtual memory statistics

SYNOPSIS

vmstat [-n] [delay [count]]
vmstat [-V]

vmstat reports information about processes, memory, paging, block IO, traps, and cpu activity.

The first report produced gives averages since the last reboot. Additional reports give information on a sampling period of length delay. The process and memory reports are instantaneous in either case.


[root@]# vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 6085936 91884 1707572 1333604 48 50 70 73 46 7 1 1 23
0 0 0 6085936 88464 1707608 1335120 0 0 1374 49 2750 1784 0 0 100
0 0 0 6085936 88048 1707660 1335952 0 0 753 913 2532 1883 0 0 100
0 0 0 6085936 87632 1707824 1337460 0 0 1282 908 2452 2054 0 0 100
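
For example, to take a sample every 2 seconds and stop after 5 reports (remember that the first line is still the average since the last reboot):

vmstat 2 5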

Monday, December 15, 2008

Use Public/Private Keys for Authentication

First, create a public/private key pair on the client that you will use to connect to the server (you will need to do this from each client machine from which you connect):

$ ssh-keygen -t rsa

This will create two files in your ~/.ssh directory:
id_rsa: your private key
id_rsa.pub: your public key

If you don't want to still be asked for a password each time you connect, just press enter when asked for a password when creating the key pair. It is up to you to decide whether or not you should password encrypt your key when you create it. If you don't password encrypt your key, then anyone gaining access to your local machine will automatically have ssh access to the remote server. Also, root on the local machine has access to your keys, although one assumes that if you can't trust root (or root is compromised) then you're in real trouble. Encrypting the key adds additional security at the expense of replacing the password prompt for the ssh server with a password prompt for the use of the key. Now set permissions on your private key:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/id_rsa

Copy the public key (id_rsa.pub) to the server and install it to the authorized_keys list:

$ cat id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys

Once you've checked you can successfully login to the server using your public/private key pair,
you can disable password authentication completely by adding the following setting to your
/etc/ssh/sshd_config

# Disable password authentication forcing use of keys
PasswordAuthentication no
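
For the change to take effect, reload the SSH daemon; on RedHat/CentOS-style systems something like the following should work (keep your current session open until you have confirmed key-based login, so you don't lock yourself out):

/etc/init.d/sshd reload
# or, equivalently
service sshd reload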

Thursday, December 11, 2008

Distributed Systems and Web Scalability Resources

Distributed Systems and Web Scalability Resources

No long write-ups this week, just a short list of some great resources that I've found very inspirational and thought provoking. I've broken these resources up into two lists: Blogs and Presentations.

Blogs

The blogs listed below are ones that I subscribe to and are filled with some great posts about capacity planning, scalability problems and solutions, and distributed system information. Each blog is authored by exceptionally smart people and many of them have significant experience building production-level scalable systems.

Nati Shalom's Blog: Discussions about middleware and distributed technologies
http://natishalom.typepad.com/nati_shaloms_blog/

All Things Distributed: Werner Vogels' weblog on building scalable and robust distributed systems.
http://www.allthingsdistributed.com/

High Scalability: Building bigger, faster, more reliable websites
http://highscalability.com/

ProductionScale: Information Technology, Scalability, Technology Operations, and Cloud Computing
http://www.productionscale.com/

iamcal.com
http://www.iamcal.com/ (the "talks" section is particularly interesting)

Kitchen Soap: Thoughts on capacity planning and web operations
http://www.kitchensoap.com/

MySQL Performance Blog: Everything about MySQL Performance
http://www.mysqlperformanceblog.com/

Presentations

The presentations listed below are from the SlideShare site and are primarily the slides used to accompany scalability talks from around the world. Many of them outline the problems that various companies have encountered during their non-linear growth phases and how they've solved them by scaling their systems.

Scalable Internet Architectures
http://www.slideshare.net/shiflett/scalable-internet-architectures

How to build the Web
http://www.slideshare.net/simon/how-to-build-the-web

Netlog: What we learned about scalability & high availability
http://www.slideshare.net/folke/netlog-what-we-learned-about-scalability-high-availability-430211

Database Sharding at Netlog
http://www.slideshare.net/oemebamo/database-sharding-at-netlog-presentation

MySQL 2007 Techn At Digg V3
http://www.slideshare.net/epee/mysql-2007-tech-at-digg-v3

Flickr and PHP
http://www.slideshare.net/coolpics/flickr-44054

Scalable Web Architectures: Common Patterns and Approaches
http://www.slideshare.net/techdude/scalable-web-architectures-common-patterns-and-approaches

How to scale your web app
http://www.slideshare.net/Georgio_1999/how-to-scale-your-web-app

Google Cluster Innards
http://www.slideshare.net/ultradvorka/google-cluster-innards

Sharding Architectures
http://www.slideshare.net/guest0e6d5e/sharding-architectures

Amazon EC2 setup - link
Yahoo Hadoop tutorial - link
Michael Noll blog - link
Hadoop main page - link
Google lectures - link
HBase resources - link
Distributed computing(IBM) - link
HBase and BigTable - link




Understanding HBase column-family performance options - link

Debugging and Tuning Map-Reduce Applications


by Arun C Murthy, Principal Engineer at Yahoo! and Member of Apache Hadoop PMC

http://www.vimeo.com/2085477


Sunday, November 30, 2008

Process File Descriptor Tuning on Linux

I've recently run into the file handle limit problem while running a Java program that holds a large hash of file descriptors. The following example describes how to raise the maximum number of file descriptors per process to 4096 on the RedHat/CentOS distribution of Linux:

Process File Descriptor Tuning

In addition to configuring system-wide global file-descriptor values, you must also consider per-process limits.


  1. Allow all users to modify their file descriptor limits from an initial value of 1024 up to the maximum permitted value of 4096 by changing /etc/security/limits.conf

       *    soft    nofile    1024
       *    hard    nofile    4096

    In /etc/pam.d/login, add:

       session required /lib/security/pam_limits.so
  2. Increase the system-wide file descriptor limit by adding the following line to the /etc/rc.d/rc.local startup script:

       echo -n "8192" > /proc/sys/fs/file-max

    or, on 2.6 kernels:

       echo -n "8192" > $( mount | grep sysfs | cut -d" " -f 3 )/fs/file-max

    Now restart the system or run these commands from a command line to apply these changes.

  3. You will then need to tell the system to use the new limits:

    ulimit -n unlimited (bash)

    or

    ulimit -n 65535 (bash)

    or

    unlimit descriptors (csh, tcsh).
  4. Verify this has raised the limit by checking the output of:
    ulimit -a (bash) or limit (csh, tcsh)

Wednesday, November 26, 2008

HANDY ONE-LINERS FOR AWK

FILE SPACING:

# double space a file
awk '1;{print ""}'
awk 'BEGIN{ORS="\n\n"};1'

# double space a file which already has blank lines in it. Output file
# should contain no more than one blank line between lines of text.
# NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
# often treated as non-blank, and thus 'NF' alone will return TRUE.
awk 'NF{print $0 "\n"}'

# triple space a file
awk '1;{print "\n"}'

NUMBERING AND CALCULATIONS:

# precede each line by its line number FOR THAT FILE (left alignment).
# Using a tab (\t) instead of space will preserve margins.
awk '{print FNR "\t" $0}' files*

# precede each line by its line number FOR ALL FILES TOGETHER, with tab.
awk '{print NR "\t" $0}' files*

# number each line of a file (number on left, right-aligned)
# Double the percent signs if typing from the DOS command prompt.
awk '{printf("%5d : %s\n", NR,$0)}'

# number each line of file, but only print numbers if line is not blank
# Remember caveats about Unix treatment of \r (mentioned above)
awk 'NF{$0=++a " :" $0};{print}'
awk '{print (NF? ++a " :" :"") $0}'

# count lines (emulates "wc -l")
awk 'END{print NR}'

# print the sums of the fields of every line
awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'

# add all fields in all lines and print the sum
awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'

# print every line after replacing each field with its absolute value
awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'

# print the total number of fields ("words") in all lines
awk '{ total = total + NF }; END {print total}' file

# print the total number of lines that contain "Beth"
awk '/Beth/{n++}; END {print n+0}' file

# print the largest first field and the line that contains it
# Intended for finding the longest string in field #1
awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'

# print the number of fields in each line, followed by the line
awk '{ print NF ":" $0 } '

# print the last field of each line
awk '{ print $NF }'

# print the last field of the last line
awk '{ field = $NF }; END{ print field }'

# print every line with more than 4 fields
awk 'NF > 4'

# print every line where the value of the last field is > 4
awk '$NF > 4'


TEXT CONVERSION AND SUBSTITUTION:

# IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
awk '{sub(/\r$/,"");print}' # assumes EACH line ends with Ctrl-M

# IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk '{sub(/$/,"\r");print}

# IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk 1

# IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
# Cannot be done with DOS versions of awk, other than gawk:
gawk -v BINMODE="w" '1' infile >outfile

# Use "tr" instead.
tr -d \r <infile >outfile # GNU tr version 1.22 or higher

# delete leading whitespace (spaces, tabs) from front of each line
# aligns all text flush left
awk '{sub(/^[ \t]+/, ""); print}'

# delete trailing whitespace (spaces, tabs) from end of each line
awk '{sub(/[ \t]+$/, "");print}'

# delete BOTH leading and trailing whitespace from each line
awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
awk '{$1=$1;print}' # also removes extra space between fields

# insert 5 blank spaces at beginning of each line (make page offset)
awk '{sub(/^/, " ");print}'

# align all text flush right on a 79-column width
awk '{printf "%79s\n", $0}' file*

# center all text on a 79-character width
awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*

# substitute (find and replace) "foo" with "bar" on each line
awk '{sub(/foo/,"bar");print}' # replaces only 1st instance
gawk '{$0=gensub(/foo/,"bar",4);print}' # replaces only 4th instance
awk '{gsub(/foo/,"bar");print}' # replaces ALL instances in a line

# substitute "foo" with "bar" ONLY for lines which contain "baz"
awk '/baz/{gsub(/foo/, "bar")};{print}'

# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")};{print}'

# change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'

# reverse order of lines (emulates "tac")
awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*

# if a line ends with a backslash, append the next line to it
# (fails if there are multiple lines ending with backslash...)
awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*

# print and sort the login names of all users
awk -F ":" '{ print $1 | "sort" }' /etc/passwd

# print the first 2 fields, in opposite order, of every line
awk '{print $2, $1}' file

# switch the first 2 fields of every line
awk '{temp = $1; $1 = $2; $2 = temp}' file

# print every line, deleting the second field of that line
awk '{ $2 = ""; print }'

# print in reverse order the fields of every line
awk '{for (i=NF; i>0; i--) printf("%s ",$i);printf ("\n")}' file

# remove duplicate, consecutive lines (emulates "uniq")
awk 'a !~ $0; {a=$0}'

# remove duplicate, nonconsecutive lines
awk '! a[$0]++' # most concise script
awk '!($0 in a) {a[$0];print}' # most efficient script

# concatenate every 5 lines of input, using a comma separator
# between fields
awk 'ORS=NR%5?",":"\n"' file



SELECTIVE PRINTING OF CERTAIN LINES:

# print first 10 lines of file (emulates behavior of "head")
awk 'NR < 11'

# print first line of file (emulates "head -1")
awk 'NR>1{exit};1'

# print the last 2 lines of a file (emulates "tail -2")
awk '{y=x "\n" $0; x=$0};END{print y}'

# print the last line of a file (emulates "tail -1")
awk 'END{print}'

# print only lines which match regular expression (emulates "grep")
awk '/regex/'

# print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'

# print the line immediately before a regex, but not the line
# containing the regex
awk '/regex/{print x};{x=$0}'
awk '/regex/{print (x=="" ? "match on line 1" : x)};{x=$0}'

# print the line immediately after a regex, but not the line
# containing the regex
awk '/regex/{getline;print}'

# grep for AAA and BBB and CCC (in any order)
awk '/AAA/; /BBB/; /CCC/'

# grep for AAA and BBB and CCC (in that order)
awk '/AAA.*BBB.*CCC/'

# print only lines of 65 characters or longer
awk 'length > 64'

# print only lines of less than 65 characters
awk 'length < 64'

# print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'

# print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'

# print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files

# print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/' # case sensitive


SELECTIVE DELETION OF CERTAIN LINES:

# delete ALL blank lines from a file (same as "grep '.' ")
awk NF
awk '/./'

Tuesday, November 25, 2008

Searching with shell utilities

1.3 Matching Text

A number of Unix text-processing utilities let you search for, and in some cases change, text patterns rather than fixed strings. These utilities include the editing programs ed, ex, vi, and sed, the awk programming language, and the commands grep and egrep. Text patterns (formally called regular expressions) contain normal characters mixed with special characters (called metacharacters).
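For example, the same pattern syntax drives both searching and editing (the file name notes.txt here is only a placeholder):

$ grep 'ch[0-9]' notes.txt                    # print the lines that mention ch0 through ch9
$ sed 's/ch\([0-9]\)/chapter \1/' notes.txt   # rewrite the first such reference on each line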

1.3.1 Filenames Versus Patterns

Metacharacters used in pattern matching are different from metacharacters used for filename expansion. When you issue a command on the command line, special characters are seen first by the shell, then by the program; therefore, unquoted metacharacters are interpreted by the shell for filename expansion. For example, the command:

$ grep [A-Z]* chap[12]

could be transformed by the shell into:

$ grep Array.c Bug.c Comp.c chap1 chap2

and would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. To bypass the shell and pass the special characters to grep, use quotes as follows:

$ grep "[A-Z]*" chap[12]

Double quotes suffice in most cases, but single quotes are the safest bet.

Note also that in pattern matching, ? matches zero or one instance of a regular expression; in filename expansion, ? matches a single character.
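A quick way to see the difference (hypothetical names; egrep is used because the ? is not special in basic grep):

$ ls bug?                    # shell: file names such as bug1 or bugs (exactly one extra character)
$ egrep 'bugs?' report.txt   # regex: lines containing "bug" or "bugs" (zero or one s)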

1.3.2 Metacharacters

Different metacharacters have different meanings, depending upon where they are used. In particular, regular expressions used for searching through text (matching) have one set of metacharacters, while the metacharacters used when processing replacement text have a different set. These sets also vary somewhat per program. This section covers the metacharacters used for searching and replacing, with descriptions of the variants in the different utilities.

1.3.2.1 Search patterns

The characters in the following table have special meaning only in search patterns:

Character   Pattern
.           Match any single character except newline. Can match newline in awk.
*           Match any number (or none) of the single character that immediately precedes it. The preceding character can also be a regular expression. For example, since . (dot) means any character, .* means "match any number of any character."
^           Match the following regular expression at the beginning of the line or string.
$           Match the preceding regular expression at the end of the line or string.
\           Turn off the special meaning of the following character.
[ ]         Match any one of the enclosed characters. A hyphen (-) indicates a range of consecutive characters. A circumflex (^) as the first character in the brackets reverses the sense: it matches any one character not in the list. A hyphen or close bracket (]) as the first character is treated as a member of the list. All other metacharacters are treated as members of the list (i.e., literally).
{n,m}       Match a range of occurrences of the single character that immediately precedes it. The preceding character can also be a metacharacter. {n} matches exactly n occurrences; {n,} matches at least n occurrences; and {n,m} matches any number of occurrences between n and m. n and m must be between 0 and 255, inclusive.
\{n,m\}     Just like {n,m}, but with backslashes in front of the braces.
\( \)       Save the pattern enclosed between \( and \) into a special holding space. Up to nine patterns can be saved on a single line. The text matched by the subpatterns can be "replayed" in substitutions by the escape sequences \1 to \9.
\n          Replay the nth sub-pattern enclosed in \( and \) into the pattern at this point. n is a number from 1 to 9, with 1 starting on the left.
\< \>       Match characters at beginning (\<) or end (\>) of a word.
+           Match one or more instances of the preceding regular expression.
?           Match zero or one instance of the preceding regular expression.
|           Match the regular expression specified before or after.
( )         Apply a match to the enclosed group of regular expressions.
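As a rough illustration of \( \) and \1 (the file name is hypothetical, and the pattern is only approximate about word boundaries), the first line prints lines where a lowercase word is immediately repeated; the second uses a POSIX egrep interval:

$ grep '\([a-z][a-z]*\) \1' draft.txt   # lines with a repeated lowercase word, e.g. "the the"
$ egrep '[0-9]{4}' draft.txt            # lines containing four digits in a row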

Many Unix systems allow the use of POSIX character classes within the square brackets that enclose a group of characters. These are typed enclosed in [: and :]. For example, [[:alnum:]] matches a single alphanumeric character.

Class     Characters matched
alnum     Alphanumeric characters
alpha     Alphabetic characters
blank     Space or TAB
cntrl     Control characters
digit     Decimal digits
graph     Nonspace characters
lower     Lowercase characters
print     Printable characters
space     Whitespace characters
upper     Uppercase characters
xdigit    Hexadecimal digits
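For instance (data.txt is just a placeholder):

$ grep '[[:digit:]][[:digit:]]*' data.txt   # lines containing a run of decimal digits
$ grep '[[:space:]]$' data.txt              # lines whose last character is a space or TAB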

1.3.2.2 Replacement patterns

The characters in the following table have special meaning only in replacement patterns:

Character   Pattern
\           Turn off the special meaning of the following character.
\n          Restore the text matched by the nth pattern previously saved by \( and \). n is a number from 1 to 9, with 1 starting on the left.
&           Reuse the text matched by the search pattern as part of the replacement pattern.
~           Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern (ex and vi).
%           Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern (ed).
\u          Convert first character of replacement pattern to uppercase.
\U          Convert entire replacement pattern to uppercase.
\l          Convert first character of replacement pattern to lowercase.
\L          Convert entire replacement pattern to lowercase.
\E          Turn off previous \U or \L.
\e          Turn off previous \u or \l.
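Two quick sed one-liners show & and the numbered replays in action (the echo input is arbitrary):

$ echo "error 42" | sed 's/[0-9][0-9]*/(&)/'                 # & replays the match: error (42)
$ echo "chapter 7" | sed 's/\([a-z]*\) \([0-9]*\)/\2: \1/'   # \1 and \2 replay the saved pieces: 7: chapter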

1.3.3 Metacharacters, Listed by Unix Program

Some metacharacters are valid for one program but not for another. The programs that support each symbol are listed next to it in the following table. (This table is correct for SVR4 and Solaris and most commercial Unix systems, but it's always a good idea to verify your system's behavior.) Items marked with a "P" are specified by POSIX; double-check your system's version. Full descriptions were provided in the previous section.

Symbol   Available in                        Action
.        ed, ex, vi, sed, awk, grep, egrep   Match any character.
*        ed, ex, vi, sed, awk, grep, egrep   Match zero or more preceding.
^        ed, ex, vi, sed, awk, grep, egrep   Match beginning of line/string.
$        ed, ex, vi, sed, awk, grep, egrep   Match end of line/string.
\        ed, ex, vi, sed, awk, grep, egrep   Escape following character.
[ ]      ed, ex, vi, sed, awk, grep, egrep   Match one from a set.
\( \)    ed, ex, vi, sed, grep               Store pattern for later replay.[1]
\n       ed, ex, vi, sed, grep               Replay sub-pattern in match.
{ }      awk (P), egrep (P)                  Match a range of instances.
\{ \}    ed, sed, grep                       Match a range of instances.
\< \>    ex, vi, grep                        Match word's beginning or end.
+        awk, egrep                          Match one or more preceding.
?        awk, egrep                          Match zero or one preceding.
|        awk, egrep                          Separate choices to match.
( )      awk, egrep                          Group expressions to match.

[1] Stored sub-patterns can be "replayed" during matching. See the searching examples in section 1.3.4.

Note that in ed, ex, vi, and sed, you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters listed in this table are meaningful only in a search pattern.

In ed, ex, vi, and sed, the following metacharacters are valid only in a replacement pattern:

Symbol   Available in      Action
\        ed, ex, vi, sed   Escape following character.
\n       ed, ex, vi, sed   Text matching pattern stored in \( \).
&        ed, ex, vi, sed   Text matching search pattern.
~        ex, vi            Reuse previous replacement pattern.
%        ed                Reuse previous replacement pattern.
\u \U    ex, vi            Change character(s) to uppercase.
\l \L    ex, vi            Change character(s) to lowercase.
\E       ex, vi            Turn off previous \U or \L.
\e       ex, vi            Turn off previous \u or \l.

1.3.4 Examples of Searching

When used with grep or egrep, regular expressions should be surrounded by quotes. (If the pattern contains a $, you must use single quotes; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by slashes, although (except in awk) any delimiter works. Here are some example patterns:

Pattern            What does it match?
bag                The string bag.
^bag               bag at the beginning of the line.
bag$               bag at the end of the line.
^bag$              bag as the only word on the line.
[Bb]ag             Bag or bag.
b[aeiou]g          Second letter is a vowel.
b[^aeiou]g         Second letter is a consonant (or uppercase or symbol).
b.g                Second letter is any character.
^...$              Any line containing exactly three characters.
^\.                Any line that begins with a dot.
^\.[a-z][a-z]      Same as previous, followed by two lowercase letters (e.g., troff requests).
^\.[a-z]\{2\}      Same as previous; ed, grep and sed only.
^[^.]              Any line that doesn't begin with a dot.
bugs*              bug, bugs, bugss, etc.
"word"             A word in quotes.
"*word"*           A word, with or without quotes.
[A-Z][A-Z]*        One or more uppercase letters.
[A-Z]+             Same as previous; egrep or awk only.
[[:upper:]]+       Same as previous; POSIX egrep or awk.
[A-Z].*            An uppercase letter, followed by zero or more characters.
[A-Z]*             Zero or more uppercase letters.
[a-zA-Z]           Any letter, either lower- or uppercase.
[^0-9A-Za-z]       Any symbol or space (not a letter or a number).
[^[:alnum:]]       Same, using POSIX character class.

egrep or awk pattern      What does it match?
[567]                     One of the numbers 5, 6, or 7.
five|six|seven            One of the words five, six, or seven.
80[2-4]?86                8086, 80286, 80386, or 80486.
80[2-4]?86|Pentium        8086, 80286, 80386, 80486, or Pentium.
compan(y|ies)             company or companies.

ex or vi pattern          What does it match?
\<the                     Words like theater, there, or the.
the\>                     Words like breathe, seethe, or the.
\<the\>                   The word the.

ed, sed, or grep pattern                  What does it match?
0\{5,\}                                   Five or more zeros in a row.
[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}          U.S. Social Security number (nnn-nn-nnnn).
\(why\).*\1                               A line with two occurrences of why.
\([[:alpha:]_][[:alnum:]_.]*\) = \1;      C/C++ simple assignment statements.
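To try a couple of these out (the echo input is arbitrary, and the sed line just shows that a delimiter other than / is allowed):

$ echo "an 80386 or a Pentium" | egrep '80[2-4]?86|Pentium'       # the line matches, so it is printed
$ echo "087-65-4321" | grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}'    # the nnn-nn-nnnn pattern matches
$ echo "/usr/local/bin" | sed 's|/usr/local|/opt|'                # prints /opt/bin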

1.3.4.1 Examples of searching and replacing

The following examples show the metacharacters available to sed or ex. Note that ex commands begin with a colon. A space is marked by ␣; a TAB is marked by →.

Command                     Result
s/.*/( & )/                 Redo the entire line, but add parentheses.
s/.*/mv & &.old/            Change a wordlist (one word per line) into mv commands.
/^$/d                       Delete blank lines.
:g/^$/d                     Same as previous, in ex editor.
/^[␣→]*$/d                  Delete blank lines, plus lines containing only spaces or TABs.
:g/^[␣→]*$/d                Same as previous, in ex editor.
s/␣␣*/␣/g                   Turn one or more spaces into one space.
:%s/␣␣*/␣/g                 Same as previous, in ex editor.
:s/[0-9]/Item &:/           Turn a number into an item label (on the current line).
:s                          Repeat the substitution on the first occurrence.
:&                          Same as previous.
:sg                         Same as previous, but for all occurrences on the line.
:&g                         Same as previous.
:%&g                        Repeat the substitution globally (i.e., on all lines).
:.,$s/Fortran/\U&/g         On current line to last line, change word to uppercase.
:%s/.*/\L&/                 Lowercase entire file.
:s/\<./\u&/g                Uppercase first letter of each word on current line. (Useful for titles.)
:%s/yes/No/g                Globally change a word to No.
:%s/Yes/~/g                 Globally change a different word to No (previous replacement).
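For instance, the mv trick above can rename a batch of files, provided the names contain no spaces (a sketch only; review the generated commands before piping them to sh):

$ ls *.log | sed 's/.*/mv & &.old/'        # prints one "mv name name.old" line per file
$ ls *.log | sed 's/.*/mv & &.old/' | sh   # ...and this actually runs them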

Finally, here are some sed examples for transposing words. A simple transposition of two words might look like this:

s/die or do/do or die/

The real trick is to use the \( \) holding space to transpose variable patterns:

s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/
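Run against a sample line, the saved pieces come back in the opposite order (the echo input is arbitrary):

$ echo "We die or do here" | sed 's/\([Dd]ie\) or \([Dd]o\)/\2 or \1/'
We do or die here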