The crontab (cron derives from chronos, Greek for time; tab stands for table) command, found in Unix and Unix-like operating systems, is used to schedule commands to be executed periodically. To see which cron jobs are currently scheduled for your user, open a terminal and run:
crontab -l
To edit the list of cron jobs you can run:
crontab -e
This will open the default editor (vi or pico) to manipulate the crontab settings. When you save and exit the editor, all your cron jobs are saved into the crontab. Cron jobs are written in the following format:
* * * * * /bin/execute/this/script.sh
Scheduling explained
As you can see there are 5 stars. The stars represent different date parts in the following order:
minute (from 0 to 59)
hour (from 0 to 23)
day of month (from 1 to 31)
month (from 1 to 12)
day of week (from 0 to 6) (0=Sunday)
Execute every minute
If you leave the star, or asterisk, in place, it means every. Maybe that's a bit unclear, so let's use the previous example again:
* * * * * /bin/execute/this/script.sh
They are all still asterisks! So this means: execute /bin/execute/this/script.sh:
every minute
of every hour
of every day of the month
of every month
and every day of the week.
In short: This script is being executed every minute. Without exception.
Execute every Friday 1AM
So if we want to schedule the script to run at 1AM every Friday, we would need the following cronjob:
0 1 * * 5 /bin/execute/this/script.sh
Get it? The script is now being executed when the system clock hits:
minute: 0
of hour: 1
of day of month: * (every day of month)
of month: * (every month)
and weekday: 5 (=Friday)
Execute at 10 minutes past every hour on the 1st of every month
Here's another one, just for practice:
10 * 1 * * /bin/execute/this/script.sh
Fair enough, it takes some getting used to, but it offers great flexibility.
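Standard cron implementations also accept lists, ranges and step values in each field, which is where much of that flexibility comes from. The entries below are illustrative sketches rather than part of the original example; check your system's crontab(5) man page for the exact syntax your cron supports:

# every 15 minutes
*/15 * * * * /bin/execute/this/script.sh
# at minute 0 of every hour from 9:00 to 17:00, Monday to Friday
0 9-17 * * 1-5 /bin/execute/this/script.sh
# at 02:30 on the 1st and 15th of every month
30 2 1,15 * * /bin/execute/this/script.sh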
To do that you need to add the public key from master to the slave's ~/.ssh/authorized_keys and vice versa, so eventually you should be able to do the following:
[root@37 /usr/local/bin/hbase]ssh master
The authenticity of host 'master (10.1.0.55)' can't be established.
RSA key fingerprint is 09:e2:73:ac:6f:42:d1:da:13:20:76:10:36:29:c4:62.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master,10.1.0.55' (RSA) to the list of known hosts.
Last login: Thu Dec 25 07:19:16 2008 from slave
[root@37 ~]exit
logout
Connection to master closed.
[root@37 /usr/local/bin/hbase]ssh slave
The authenticity of host 'slave (10.1.0.56)' can't be established.
RSA key fingerprint is 5c:4f:81:19:ad:f3:78:02:ce:64:f1:67:10:ca:c5:b8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave,10.1.0.56' (RSA) to the list of known hosts.
Last login: Thu Dec 25 07:16:47 2008 from 38.d.de.static.xlhost.com
[root@38 ~]exit
logout
Connection to slave closed.
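One way to exchange the keys is sketched below; it assumes the key pairs already exist on both machines (see the ssh-keygen step later in this guide) and that the hostnames master and slave resolve on both boxes:

# on master: append master's public key to slave's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh root@slave 'cat >> ~/.ssh/authorized_keys'
# on slave: append slave's public key to master's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh root@master 'cat >> ~/.ssh/authorized_keys'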
2. Cluster Overview
The master node will also act as a slave because we only have two machines available in our cluster but still want to spread data storage and processing to multiple machines.
The master node will run the "master" daemons for each layer: namenode for the HDFS storage layer, and jobtracker for the MapReduce processing layer. Both machines will run the "slave" daemons: datanode for the HDFS layer, and tasktracker for the MapReduce processing layer. Basically, the "master" daemons are responsible for coordination and management of the "slave" daemons, while the latter do the actual data storage and data processing work.
3. Hadoop Core Configuration
3.1 conf/masters (master only)
The conf/masters file defines the master nodes of our multi-node cluster.
On master, update $HADOOP_HOME/conf/masters so that it looks like this:
master
3.2 conf/slaves (master only)
This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.
On master, update $HADOOP_HOME/conf/slaves so that it looks like this:
master
slave
If you have additional slave nodes, just add them to the conf/slaves file, one per line (do this on all machines in the cluster).
master
slave
slave01
slave02
slave03
3.3 conf/hadoop-site.xml (all machines)
Below is a sample configuration for all hosts; an explanation of all parameters is available at the link.
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/bin/hadoop/datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx500m</value>
  <description>Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc</description>
</property>

</configuration>
The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.
Very important: now format the data storage on each node before proceeding (the format command is shown below).
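The format step itself is covered in detail in the single-node walkthrough later in this guide; on the namenode it boils down to the following (a sketch, run from $HADOOP_HOME):

bin/hadoop namenode -format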
3.4 Start/Stop Hadoop cluster
HDFS daemons: start or stop the namenode with bin/start-dfs.sh or bin/stop-dfs.sh.
MapReduce daemons: run the command $HADOOP_HOME/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the previous command on, and tasktrackers on the machines listed in the conf/slaves file.
[root@37 /usr/local/bin/hadoop]bin/start-dfs.sh
starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-37.c3.33.static.xlhost.com.out
master: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-37.c3.33.static.xlhost.com.out
slave: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-38.c3.33.static.xlhost.com.out
master: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-37.c3.33.static.xlhost.com.out
[root@37 /usr/local/bin/hadoop]bin/start-mapred.sh
starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-37.c3.33.static.xlhost.com.out
master: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-37.c3.33.static.xlhost.com.out
slave: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-38.c3.33.static.xlhost.com.out
[root@37 /usr/local/bin/hadoop]jps]
-bash: jps]: command not found
[root@37 /usr/local/bin/hadoop]jps
28638 DataNode
29052 Jps
28527 NameNode
28764 SecondaryNameNode
28847 JobTracker
28962 TaskTracker
[root@37 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
[root@37 /usr/local/bin/hadoop]bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 11358 2008-12-25 08:20 /user/root/testWordCount
[root@37 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-out
08/12/25 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 08:21:47 INFO mapred.JobClient: Running job: job_200812250818_0001
08/12/25 08:21:48 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 08:21:53 INFO mapred.JobClient: map 50% reduce 0%
08/12/25 08:21:55 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 08:22:01 INFO mapred.JobClient: map 100% reduce 12%
08/12/25 08:22:02 INFO mapred.JobClient: map 100% reduce 25%
08/12/25 08:22:05 INFO mapred.JobClient: map 100% reduce 41%
08/12/25 08:22:09 INFO mapred.JobClient: map 100% reduce 52%
08/12/25 08:22:14 INFO mapred.JobClient: map 100% reduce 64%
08/12/25 08:22:19 INFO mapred.JobClient: map 100% reduce 66%
08/12/25 08:22:24 INFO mapred.JobClient: map 100% reduce 77%
08/12/25 08:22:29 INFO mapred.JobClient: map 100% reduce 79%
08/12/25 08:25:30 INFO mapred.JobClient: map 100% reduce 89%
08/12/25 08:26:12 INFO mapred.JobClient: Job complete: job_200812250818_0001
08/12/25 08:26:12 INFO mapred.JobClient: Counters: 17
08/12/25 08:26:12 INFO mapred.JobClient: Job Counters
08/12/25 08:26:12 INFO mapred.JobClient: Data-local map tasks=1
08/12/25 08:26:12 INFO mapred.JobClient: Launched reduce tasks=9
08/12/25 08:26:12 INFO mapred.JobClient: Launched map tasks=2
08/12/25 08:26:12 INFO mapred.JobClient: Rack-local map tasks=1
08/12/25 08:26:12 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 08:26:12 INFO mapred.JobClient: Map output records=1581
08/12/25 08:26:12 INFO mapred.JobClient: Reduce input records=593
08/12/25 08:26:12 INFO mapred.JobClient: Map output bytes=16546
08/12/25 08:26:12 INFO mapred.JobClient: Map input records=202
08/12/25 08:26:12 INFO mapred.JobClient: Combine output records=1292
08/12/25 08:26:12 INFO mapred.JobClient: Map input bytes=11358
08/12/25 08:26:12 INFO mapred.JobClient: Combine input records=2280
08/12/25 08:26:12 INFO mapred.JobClient: Reduce input groups=593
08/12/25 08:26:12 INFO mapred.JobClient: Reduce output records=593
08/12/25 08:26:12 INFO mapred.JobClient: File Systems
08/12/25 08:26:12 INFO mapred.JobClient: HDFS bytes written=6117
08/12/25 08:26:12 INFO mapred.JobClient: Local bytes written=19010
08/12/25 08:26:12 INFO mapred.JobClient: HDFS bytes read=13872
08/12/25 08:26:12 INFO mapred.JobClient: Local bytes read=8620
4. HBase configuration
4.1 $HBASE_HOME/conf/hbase-site.xml
<configuration>

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:54310/hbase</value>
  <description>The directory shared by region servers.</description>
</property>

<property>
  <name>hbase.master</name>
  <value>master:60000</value>
  <description>The host and port that the HBase master runs at.</description>
</property>

</configuration>
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
HBase's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
Hadoop 0.18.2
[root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz'
--02:49:53-- http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz
Resolving mirror.mirimar.net... 194.90.150.47
Connecting to mirror.mirimar.net|194.90.150.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16836495 (16M) [application/x-gzip]
Saving to: `hadoop-0.18.2.tar.gz'
100%[=========================================================================================================================================>] 16,836,495 471K/s in 36s
HBase 0.18.1
[root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz'
--02:51:42-- http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz
Resolving mirror.mirimar.net... 194.90.150.47
Connecting to mirror.mirimar.net|194.90.150.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16295734 (16M) [application/x-gzip]
Saving to: `hbase-0.18.1.tar.gz'
100%[=========================================================================================================================================>] 16,295,734 496K/s in 34s
Please don't forget to add all environment variables to ~/.bash_profile; otherwise the exports will be lost after you disconnect your SSH session.
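A sketch of the relevant lines in ~/.bash_profile, assuming the install locations used in this guide (adjust the paths to match your setup; adding the bin directories to PATH is optional but convenient):

# locations used throughout this guide
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14
export HADOOP_HOME=/usr/local/bin/hadoop
export HBASE_HOME=/usr/local/bin/hbase
# optional: put the tools on the PATH
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin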
2.1 Java
For now we'll use HBase 0.18.1, which was compiled with JDK 1.5, and Hadoop 0.18.2, which supports JDK 1.5.x. Hadoop 0.19.0 is available today, but it requires JDK 1.6, and HBase is supposed to support JDK 1.6 only as of version 0.19.
Install JDK 1.5.0_14:
[root@localhost /usr/local/bin]wget 'http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin' -O jdk-1_5_0_14-linux-i586.bin
--03:25:00-- http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin
Resolving cds.sun.com... 72.5.239.134
Connecting to cds.sun.com|72.5.239.134|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin [following]
--03:25:01-- http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin
Resolving cds-esd.sun.com... 98.27.88.9, 98.27.88.39
Connecting to cds-esd.sun.com|98.27.88.9|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49649265 (47M) [application/x-sdlc]
Saving to: `jdk-1_5_0_14-linux-i586.bin'
100%[=========================================================================================================================================>] 49,649,265 5.36M/s in 9.0s
Hadoop requires passwordless SSH public-key access to manage its nodes, i.e. the remote machines plus your local machine if you want to use Hadoop on it, so:
[root@rt ~/.ssh] ssh-keygen -t rsa
# This will create two files in your ~/.ssh directory:
#   id_rsa: your private key
#   id_rsa.pub: your public key
[root@rt ~/.ssh] cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[root@rt ~/.ssh] ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 55:d7:91:86:ea:86:8f:51:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
[root@rt ~]
As you can see, the final test is to check whether you are able to make an SSH public-key authenticated connection to localhost.
If the SSH connection fails, these general tips might help:
Enable debugging with ssh -vvv localhost and investigate the error in detail.
Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hadoop user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
2.3 Disabling IPv6
To disable IPv6 on CentOS Linux, open /etc/modprobe.d/blacklist in the editor of your choice and add the following lines to the end of the file:
# disable IPv6
blacklist ipv6
You have to reboot your machine in order to make the changes take effect.
2.4 Edit the open files limit
Edit the file /etc/security/limits.conf and add the following lines:
root - nofile 100000
root - locks 100000
Run ulimit -n 1000000 in shell.
3. Hadoop
3.1 Installation
- Unpack the Hadoop archive to /usr/local/bin (it could be any directory)
- Move the unpacked directory to /usr/local/bin/hadoop: mv hadoop-0.18.2 hadoop
- Set HADOOP_HOME: export HADOOP_HOME=/usr/local/bin/hadoop
3.2 Configuration
Set up JAVA_HOME in $HADOOP_HOME/conf/hadoop-env.sh to point to your Java location:
# The java implementation to use. Required.
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14
Set up $HADOOP_HOME/conf/hadoop-site.xml
Any site-specific configuration of Hadoop is configured in $HADOOP_HOME/conf/hadoop-site.xml. Here we will configure the directory where Hadoop will store its data files, the ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.
You can leave the settings below as is with the exception of the hadoop.tmp.dir variable which you have to change to the directory of your choice, for example:
/usr/local/hadoop-datastore/hadoop-${user.name}.
Hadoop will expand ${user.name} to the system user which is running Hadoop, so in our case this will be hadoop and thus the final path will be /usr/local/hadoop-datastore/hadoop-hadoop.
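If the directory does not exist yet, it can help to create it up front and make sure the user running Hadoop owns it. A sketch, using the path from the configuration below (substitute whatever you chose for hadoop.tmp.dir and your actual Hadoop user):

mkdir -p /usr/local/bin/hadoop/datastore
# if Hadoop runs as a dedicated user rather than root, give it ownership, e.g.:
# chown -R hadoop:hadoop /usr/local/bin/hadoop/datastore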
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/bin/hadoop/datastore/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>

</configuration>
2.2 Formatting the name node
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
[root@cc hadoop]# bin/hadoop namenode -format
The output is supposed to look like this:
[root@cc hadoop]# bin/hadoop namenode -format
08/12/24 10:56:34 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = cc.d.de.static.xlhost.com/206.222.13.204
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.18.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
Re-format filesystem in /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name ? (Y or N) Y
08/12/24 10:57:40 INFO fs.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
08/12/24 10:57:40 INFO fs.FSNamesystem: supergroup=supergroup
08/12/24 10:57:40 INFO fs.FSNamesystem: isPermissionEnabled=true
08/12/24 10:57:40 INFO dfs.Storage: Image file of size 78 saved in 0 seconds.
08/12/24 10:57:40 INFO dfs.Storage: Storage directory /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name has been successfully formatted.
08/12/24 10:57:40 INFO dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at cc.d.de.static.xlhost.com/206.222.13.204
************************************************************/
2.3 Starting/Stopping your single-node cluster
Run the command:
[root@cc hadoop]# $HADOOP_HOME/bin/start-all.sh
You are supposed to see the following output:
starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-cc.com.out
localhost: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-cc.out
localhost: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-cc.out
starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-cc.com.out
localhost: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-cc.com.out
Run the example MapReduce job that comes with the Hadoop installation:
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 root supergroup 11358 2008-12-25 04:54 /user/root/testWordCount
[root@38 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-output
08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
08/12/25 04:55:48 INFO mapred.JobClient: Running job: job_200812250447_0001
08/12/25 04:55:49 INFO mapred.JobClient: map 0% reduce 0%
08/12/25 04:55:51 INFO mapred.JobClient: map 100% reduce 0%
08/12/25 04:55:56 INFO mapred.JobClient: Job complete: job_200812250447_0001
08/12/25 04:55:56 INFO mapred.JobClient: Counters: 16
08/12/25 04:55:56 INFO mapred.JobClient: Job Counters
08/12/25 04:55:56 INFO mapred.JobClient: Data-local map tasks=2
08/12/25 04:55:56 INFO mapred.JobClient: Launched reduce tasks=1
08/12/25 04:55:56 INFO mapred.JobClient: Launched map tasks=2
08/12/25 04:55:56 INFO mapred.JobClient: Map-Reduce Framework
08/12/25 04:55:56 INFO mapred.JobClient: Map output records=1581
08/12/25 04:55:56 INFO mapred.JobClient: Reduce input records=593
08/12/25 04:55:56 INFO mapred.JobClient: Map output bytes=16546
08/12/25 04:55:56 INFO mapred.JobClient: Map input records=202
08/12/25 04:55:56 INFO mapred.JobClient: Combine output records=1292
08/12/25 04:55:56 INFO mapred.JobClient: Map input bytes=11358
08/12/25 04:55:56 INFO mapred.JobClient: Combine input records=2280
08/12/25 04:55:56 INFO mapred.JobClient: Reduce input groups=593
08/12/25 04:55:56 INFO mapred.JobClient: Reduce output records=593
08/12/25 04:55:56 INFO mapred.JobClient: File Systems
08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes written=6117
08/12/25 04:55:56 INFO mapred.JobClient: Local bytes written=18568
08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes read=13872
08/12/25 04:55:56 INFO mapred.JobClient: Local bytes read=8542
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls testWordCount-output
Found 2 items
drwxr-xr-x - root supergroup 0 2008-12-25 04:55 /user/root/testWordCount-output/_logs
-rw-r--r-- 1 root supergroup 6117 2008-12-25 04:55 /user/root/testWordCount-output/part-00000
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -cat testWordCount-output/part-00000
# you are supposed to see something like this:
...
tracking 1
trade 1
trademark, 1
trademarks, 1
transfer 1
transformation 1
translation 1
...
To stop the Hadoop cluster, run the following:
[root@37 /usr/local/bin/hadoop]bin/stop-all.sh
no jobtracker to stop
localhost: no tasktracker to stop
no namenode to stop
localhost: no datanode to stop
localhost: no secondarynamenode to stop
2.4 Hadoop monitoring and debugging
Please see the Hadoop tips on how to debug MapReduce programs. It is worth mentioning that the Hadoop logs in $HADOOP_HOME/logs provide the most information; there are also links to them from the Hadoop web interfaces.
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser.
- Unpack the HBase archive to /usr/local/bin
- Move hbase-0.18.1 to hbase
- Define HBASE_HOME to point to /usr/local/bin/hbase (don't forget to edit ~/.bash_profile)
- Define JAVA_HOME in $HBASE_HOME/conf/hbase-env.sh
A pseudo-distributed operation is simply a distributed operation run on a single host. Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of ${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance. Use hbase-site.xml to override the properties defined in ${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself should never be modified). At a minimum the hbase.rootdir property should be redefined in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property below to your hbase-site.xml says that HBase should use the /hbase directory in the HDFS whose namenode is at port 54310 on your local machine:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:54310/hbase</value>
  <description>The directory shared by region servers.</description>
</property>
Once you have a running HBase, you probably want a way to hook your application up to it. If your application is in Java, then you should use the Java API. The following example takes as input an Excel-format file and the name of an already existing HBase table, processes the records and writes them to HBase. You can look at a client example here.
// Imports for Hadoop 0.18.x / HBase 0.18.x; they were not part of the original listing.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Sample uploader Map-Reduce example class. Takes an Excel-format file as input
 * and writes its output to HBase.
 */
// Note: the generic type parameters below are assumed; they appear to have been
// stripped from the original listing.
public class SampleUploader extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text>, Tool {

  static enum Counters { MAP_LINES, REDUCE_LINES }

  private static final String NAME = "SampleUploader";
  private Configuration conf;

  static final String OUTPUT_COLUMN = "value:";
  static final String OUTPUT_KEY = "key:";

  long numRecords;
  private Text idText = new Text();
  private Text recordText = new Text();
  private String inputFile;

  /** A WritableComparator optimized for Text keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(Text.class);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
    }
  }

  public JobConf createSubmittableJob(String[] args) {
    JobConf c = new JobConf(getConf(), SampleUploader.class);
    c.setJobName(NAME);
    c.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(c, new Path(args[0]));
    // c.setInputPaths(new Path(args[0]));
    c.setMapperClass(this.getClass());
    c.setMapOutputKeyClass(Text.class);
    c.setMapOutputValueClass(Text.class);
    // TableUploader is the reducer class; it is referenced here but was not
    // included in the original listing.
    c.setReducerClass(TableUploader.class);
    TableReduce.initJob(args[1], TableUploader.class, c);
    return c;
  }

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
  }

  public void map(LongWritable k, Text v, OutputCollector<Text, Text> output, Reporter r)
      throws IOException {
    // The body of the map method was omitted in the original listing.
  }

  public int run(String[] args) throws Exception {
    // Make sure there are exactly 2 parameters left.
    if (args.length != 2) {
      System.out.println("ERROR: Wrong number of parameters: " + args.length + " instead of 2.");
      // printUsage() was not included in the original listing.
      return printUsage();
    }
    JobClient.runJob(createSubmittableJob(args));
    return 0;
  }

  public Configuration getConf() {
    return this.conf;
  }

  public void setConf(final Configuration c) {
    this.conf = c;
  }

  public static void main(String[] args) throws Exception {
    int errCode = ToolRunner.run(new Configuration(), new SampleUploader(), args);
    System.exit(errCode);
  }
}
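To actually launch a job class like this, you would typically package it into a jar and run it through the hadoop command. The jar and file names below are hypothetical; the two arguments correspond to args[0] (the input path in HDFS) and args[1] (the name of the existing HBase table) expected by run() above:

# hypothetical jar name and paths; adjust to your own build and data
bin/hadoop jar sample-uploader.jar SampleUploader /user/root/input.xls myTable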
3.4 Running HBase and using HBase shell
Start HBase with the following command:
${HBASE_HOME}/bin/start-hbase.sh
If HBase has started successfully, running jps should show the HBase daemons alongside the Hadoop ones.
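A rough sketch of what that can look like (process IDs will differ and the exact set of daemons depends on your configuration):

[root@37 /usr/local/bin/hbase]jps
28527 NameNode
28638 DataNode
28764 SecondaryNameNode
28847 JobTracker
28962 TaskTracker
29410 HMaster
29531 HRegionServer
29765 Jps

To try the HBase shell mentioned in this section's title, start it from $HBASE_HOME and issue a few commands. A small sketch, assuming the JRuby-based shell shipped with this HBase release; the table and column family names are made up, and typing help inside the shell lists the exact commands your version supports:

[root@37 /usr/local/bin/hbase]bin/hbase shell
hbase(main):001:0> create 'testtable', {NAME => 'data'}
hbase(main):002:0> put 'testtable', 'row1', 'data:1', 'value1'
hbase(main):003:0> scan 'testtable'
hbase(main):004:0> list
hbase(main):005:0> exit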
-ls path
Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr path
Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du path
Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-dus path
Like -du, but prints a summary of disk usage of all files/directories in the path.
-mv src dest
Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest
Copies the file or directory identified by src to dest, within HDFS.
-rm path
Removes the file or empty directory identified by path.
-rmr path
Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrc dest
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
-copyFromLocal localSrc dest
Identical to -put
-moveFromLocal localSrc dest
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] src localDest
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge src localDest [addnl]
Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat filename
Displays the contents of filename on stdout.
-copyToLocal [-crc] src localDest
Identical to -get
-moveToLocal [-crc] src localDest
Works like -get, but deletes the HDFS copy on success.
-mkdir path
Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).
-setrep [-R] [-w] rep path
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)
-touchz path
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] path
Returns 1 if path exists, has zero length, or is a directory; returns 0 otherwise.
-stat [format] path
Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file
Shows the last 1KB of file on stdout.
-chmod [-R] mode,mode,... path...
Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path...
Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
-chgrp [-R] group path...
Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.
-help cmd
Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
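A short session tying a few of these commands together (directory and file names are only examples):

bin/hadoop dfs -mkdir /user/root/demo
bin/hadoop dfs -put LICENSE.txt /user/root/demo
bin/hadoop dfs -ls /user/root/demo
bin/hadoop dfs -cat /user/root/demo/LICENSE.txt
bin/hadoop dfs -rmr /user/root/demo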
First you need to find out your hostname; you can do this with:
$ hostname
localhost.localdomain
$
Edit /etc/hosts
You need to edit /etc/hosts and add a line for your host name:
$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
$
My new server's IP is 72.232.196.90 and I need to assign it the hostname server12.hosthat.com. To do this, I have edited /etc/hosts as follows.
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
72.232.196.90 server12.hosthat.com server12
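After saving the file you can quickly check that the name resolves and that the hostname is what you expect (on CentOS you may also want to persist the hostname in /etc/sysconfig/network); a sketch:

hostname
ping -c 1 server12.hosthat.com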
vmstat reports information about processes, memory, paging, block IO, traps, and cpu activity.
The first report produced gives averages since the last reboot. Additional reports give information on a sampling period of length delay. The process and memory reports are instantaneous in either case.
[root@]# vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 6085936 91884 1707572 1333604 48 50 70 73 46 7 1 1 23
0 0 0 6085936 88464 1707608 1335120 0 0 1374 49 2750 1784 0 0 100
0 0 0 6085936 88048 1707660 1335952 0 0 753 913 2532 1883 0 0 100
0 0 0 6085936 87632 1707824 1337460 0 0 1282 908 2452 2054 0 0 100
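As mentioned above, the first report shows averages since the last reboot, so for live monitoring it is the later samples that matter. A couple of common invocations (the delay and count values are arbitrary examples):

# one report per second, indefinitely (Ctrl-C to stop)
vmstat 1
# one report every 5 seconds, 10 reports, then exit
vmstat 5 10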
First, create a public/private key pair on the client that you will use to connect to the server (you will need to do this from each client machine from which you connect):
$ ssh-keygen -t rsa
This will create two files in your ~/.ssh directory:
id_rsa: your private key
id_rsa.pub: your public key
If you don't want to be asked for a passphrase each time you connect, just press Enter when asked for a passphrase while creating the key pair. It is up to you to decide whether or not you should passphrase-protect your key when you create it. If you don't protect your key, then anyone gaining access to your local machine automatically has SSH access to the remote server. Also, root on the local machine has access to your keys, although one assumes that if you can't trust root (or root is compromised) then you're in real trouble. Encrypting the key adds additional security at the expense of replacing the password prompt for the SSH server with a prompt for the key's passphrase. Now set permissions on your private key:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/id_rsa
Copy the public key (id_rsa.pub) to the server and install it to the authorized_keys list:
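A sketch of one common way to do this (substitute your own user name and server address):

cat ~/.ssh/id_rsa.pub | ssh user@server 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
# or, where the helper is available:
ssh-copy-id user@server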
Once you've checked you can successfully log in to the server using your public/private key pair, you can disable password authentication completely by adding the following setting to your /etc/ssh/sshd_config:
# Disable password authentication forcing use of keys
PasswordAuthentication no
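For the change to take effect, reload or restart the SSH daemon; on CentOS the service is typically called sshd (a sketch, assuming root access):

service sshd reload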
No long write-ups this week, just a short list of some great resources that I've found very inspirational and thought provoking. I've broken these resources up into two lists: Blogs and Presentations.
Blogs
The blogs listed below are ones that I subscribe to and are filled with some great posts about capacity planning, scalability problems and solutions, and distributed system information. Each blog is authored by exceptionally smart people and many of them have significant experience building production-level scalable systems.
Presentations
The presentations listed below are from the SlideShare site and are primarily the slides used to accompany scalability talks from around the world. Many of them outline the problems that various companies have encountered during their non-linear growth phases and how they've solved them by scaling their systems.
Amazon EC2 setup - link
Yahoo Hadoop tutorial - link
Michael Noll blog - link
Hadoop main page - link
Google lectures - link
HBase resources - link
Distributed computing (IBM) - link
HBase and BigTable - link