The crontab (cron derives from chronos, Greek for time; tab stands for table) command, found in Unix and Unix-like operating systems, is used to schedule commands to be executed periodically. To see what crontabs are currently running on your system, you can open a terminal and run:
crontab -l
To edit the list of cron jobs you can run:
crontab -e
This will open the default editor (vi or pico) to manipulate the crontab settings. When you save and exit the editor, all your cron jobs are saved into the crontab. Cron jobs are written in the following format:
* * * * * /bin/execute/this/script.sh
Scheduling explained
As you can see there are 5 stars. The stars represent different date parts in the following order:
minute (from 0 to 59)
hour (from 0 to 23)
day of month (from 1 to 31)
month (from 1 to 12)
day of week (from 0 to 6) (0=Sunday)
Execute every minute
If you leave the star, or asterisk, in place, it means every. Maybe that's a bit unclear. Let's use the previous example again:
* * * * * /bin/execute/this/script.sh
They are all still asterisks! So this means execute /bin/execute/this/script.sh:
every minute
of every hour
of every day of the month
of every month
and every day in the week.
In short: This script is being executed every minute. Without exception.
Execute every Friday 1AM
So if we want to schedule the script to run at 1AM every Friday, we would need the following cronjob:
0 1 * * 5 /bin/execute/this/script.sh
Get it? The script is now being executed when the system clock hits:
minute: 0
of hour: 1
of day of month: * (every day of month)
of month: * (every month)
and weekday: 5 (=Friday)
Execute 10 minutes past every hour on the 1st of every month
Here's another one, just for practice:
10 * 1 * * /bin/execute/this/script.sh
Fair enough, it takes some getting used to, but it offers great flexibility.
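If you prefer not to open the editor at all, a small sketch like the following appends a job to the current crontab from the command line (the script path is just the placeholder used in the examples above):
(crontab -l 2>/dev/null; echo "0 1 * * 5 /bin/execute/this/script.sh") | crontab -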
To do that you need to add the master's public key to the slave's ~/.ssh/authorized_keys and vice versa, so eventually you should be able to do the following:
[root@37 /usr/local/bin/hbase]ssh master The authenticity of host 'master (10.1.0.55)' can't be established. RSA key fingerprint is 09:e2:73:ac:6f:42:d1:da:13:20:76:10:36:29:c4:62. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'master,10.1.0.55' (RSA) to the list of known hosts. Last login: Thu Dec 25 07:19:16 2008 from slave [root@37 ~]exit logout Connection to master closed. [root@37 /usr/local/bin/hbase]ssh slave The authenticity of host 'slave (10.1.0.56)' can't be established. RSA key fingerprint is 5c:4f:81:19:ad:f3:78:02:ce:64:f1:67:10:ca:c5:b8. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'slave,10.1.0.56' (RSA) to the list of known hosts. Last login: Thu Dec 25 07:16:47 2008 from 38.d.de.static.xlhost.com [root@38 ~]exit logout Connection to slave closed.
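A minimal sketch of the key exchange itself, assuming the key pairs already exist on both machines and that the hostnames master and slave resolve as in the transcript above:
# on master: append master's public key to slave's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh root@slave 'cat >> ~/.ssh/authorized_keys'
# on slave: append slave's public key to master's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh root@master 'cat >> ~/.ssh/authorized_keys'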
2. Cluster Overview
The master node will also act as a slave because we only have two machines available in our cluster but still want to spread data storage and processing to multiple machines.
The master node will run the "master" daemons for each layer: namenode for the HDFS storage layer, and jobtracker for the MapReduce processing layer. Both machines will run the "slave" daemons: datanode for the HDFS layer, and tasktracker for the MapReduce processing layer. Basically, the "master" daemons are responsible for coordination and management of the "slave" daemons, while the latter do the actual data storage and data processing work.
3. Hadoop Core Configuration
3.1 conf/masters (master only)
The conf/masters file defines the master nodes of our multi-node cluster.
On master, update $HADOOP_HOME/conf/masters so that it looks like this:
master
3.2 conf/slaves (master only)
This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.
On master, update $HADOOP_HOME/conf/slaves so that it looks like this:
master
slave
If you have additional slave nodes, just add them to the conf/slaves file, one per line (do this on all machines in the cluster).
master
slave
slave01
slave02
slave03
3.3 conf/hadoop-site.xml (all machines)
Below is a sample configuration for all hosts; an explanation of all parameters can be found at the link.
hadoop.tmp.dir = /usr/local/bin/hadoop/datastore/hadoop-${user.name}
A base for other temporary directories.
fs.default.name = hdfs://master:54310
The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.
mapred.job.tracker = master:54311
The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
dfs.replication = 1
Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
mapred.reduce.tasks = 8
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".
mapred.tasktracker.reduce.tasks.maximum = 8
The maximum number of reduce tasks that will be run simultaneously by a task tracker.
mapred.child.java.opts = -Xmx500m
Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.
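In the actual $HADOOP_HOME/conf/hadoop-site.xml file these settings are written as <property> elements. A minimal sketch covering the first few values above (adjust paths, hostnames and the remaining properties to your cluster):
# quoted heredoc so the shell does not expand ${user.name}; Hadoop expands it at runtime
cat > $HADOOP_HOME/conf/hadoop-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/bin/hadoop/datastore/hadoop-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF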
Very important: format each node's data storage now, before proceeding.
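For example, on the master and with the paths used elsewhere in this guide (answer Y if asked to re-format):
cd $HADOOP_HOME
bin/hadoop namenode -format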
3.4 Start/Stop Hadoop cluster
HDFS daemons: start/stop the namenode with bin/start-dfs.sh or bin/stop-dfs.sh. MapReduce daemons: run the command $HADOOP_HOME/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the command on, and tasktrackers on the machines listed in the conf/slaves file.
[root@37 /usr/local/bin/hadoop]bin/start-dfs.sh starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-37.c3.33.static.xlhost.com.out master: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-37.c3.33.static.xlhost.com.out slave: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-38.c3.33.static.xlhost.com.out master: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-37.c3.33.static.xlhost.com.out [root@37 /usr/local/bin/hadoop]bin/start-mapred.sh starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-37.c3.33.static.xlhost.com.out master: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-37.c3.33.static.xlhost.com.out slave: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-38.c3.33.static.xlhost.com.out [root@37 /usr/local/bin/hadoop]jps] -bash: jps]: command not found [root@37 /usr/local/bin/hadoop]jps 28638 DataNode 29052 Jps 28527 NameNode 28764 SecondaryNameNode 28847 JobTracker 28962 TaskTracker [root@37 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount [root@37 /usr/local/bin/hadoop]bin/hadoop dfs -ls Found 1 items -rw-r--r-- 1 root supergroup 11358 2008-12-25 08:20 /user/root/testWordCount [root@37 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-out 08/12/25 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 1 08/12/25 08:21:46 INFO mapred.FileInputFormat: Total input paths to process : 1 08/12/25 08:21:47 INFO mapred.JobClient: Running job: job_200812250818_0001 08/12/25 08:21:48 INFO mapred.JobClient: map 0% reduce 0% 08/12/25 08:21:53 INFO mapred.JobClient: map 50% reduce 0% 08/12/25 08:21:55 INFO mapred.JobClient: map 100% reduce 0% 08/12/25 08:22:01 INFO mapred.JobClient: map 100% reduce 12% 08/12/25 08:22:02 INFO mapred.JobClient: map 100% reduce 25% 08/12/25 08:22:05 INFO mapred.JobClient: map 100% reduce 41% 08/12/25 08:22:09 INFO mapred.JobClient: map 100% reduce 52% 08/12/25 08:22:14 INFO mapred.JobClient: map 100% reduce 64% 08/12/25 08:22:19 INFO mapred.JobClient: map 100% reduce 66% 08/12/25 08:22:24 INFO mapred.JobClient: map 100% reduce 77% 08/12/25 08:22:29 INFO mapred.JobClient: map 100% reduce 79% 08/12/25 08:25:30 INFO mapred.JobClient: map 100% reduce 89% 08/12/25 08:26:12 INFO mapred.JobClient: Job complete: job_200812250818_0001 08/12/25 08:26:12 INFO mapred.JobClient: Counters: 17 08/12/25 08:26:12 INFO mapred.JobClient: Job Counters 08/12/25 08:26:12 INFO mapred.JobClient: Data-local map tasks=1 08/12/25 08:26:12 INFO mapred.JobClient: Launched reduce tasks=9 08/12/25 08:26:12 INFO mapred.JobClient: Launched map tasks=2 08/12/25 08:26:12 INFO mapred.JobClient: Rack-local map tasks=1 08/12/25 08:26:12 INFO mapred.JobClient: Map-Reduce Framework 08/12/25 08:26:12 INFO mapred.JobClient: Map output records=1581 08/12/25 08:26:12 INFO mapred.JobClient: Reduce input records=593 08/12/25 08:26:12 INFO mapred.JobClient: Map output bytes=16546 08/12/25 08:26:12 INFO mapred.JobClient: Map input records=202 08/12/25 08:26:12 INFO mapred.JobClient: Combine output records=1292 08/12/25 08:26:12 INFO mapred.JobClient: Map input bytes=11358 08/12/25 08:26:12 INFO mapred.JobClient: Combine input records=2280 08/12/25 08:26:12 INFO mapred.JobClient: Reduce input groups=593 08/12/25 08:26:12 INFO mapred.JobClient: 
Reduce output records=593 08/12/25 08:26:12 INFO mapred.JobClient: File Systems 08/12/25 08:26:12 INFO mapred.JobClient: HDFS bytes written=6117 08/12/25 08:26:12 INFO mapred.JobClient: Local bytes written=19010 08/12/25 08:26:12 INFO mapred.JobClient: HDFS bytes read=13872 08/12/25 08:26:12 INFO mapred.JobClient: Local bytes read=8620
4. HBase configuration
4.1 $HBASE_HOME/conf/hbase-site.xml
hbase.rootdir = hdfs://master:54310/hbase
The directory shared by region servers.
hbase.master = master:60000
The host and port that the HBase master runs at.
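As with the Hadoop configuration, these are <property> elements in $HBASE_HOME/conf/hbase-site.xml; a sketch of the file using the values above:
cat > $HBASE_HOME/conf/hbase-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:54310/hbase</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>master:60000</value>
  </property>
</configuration>
EOF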
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware; it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
HBase's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
[root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz' --02:49:53-- http://mirror.mirimar.net/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz Resolving mirror.mirimar.net... 194.90.150.47 Connecting to mirror.mirimar.net|194.90.150.47|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 16836495 (16M) [application/x-gzip] Saving to: `hadoop-0.18.2.tar.gz'
100%[=========================================================================================================================================>] 16,836,495 471K/s in 36s
HBase 0.18.1
[root@localhost bin]# wget 'http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz' --02:51:42-- http://mirror.mirimar.net/apache/hadoop/hbase/hbase-0.18.1/hbase-0.18.1.tar.gz Resolving mirror.mirimar.net... 194.90.150.47 Connecting to mirror.mirimar.net|194.90.150.47|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 16295734 (16M) [application/x-gzip] Saving to: `hbase-0.18.1.tar.gz'
100%[=========================================================================================================================================>] 16,295,734 496K/s in 34s
Please don't forget to add all environment variables to ~/.bash_profile; otherwise your exports will be lost after you disconnect your SSH session.
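A sketch of the kind of lines to append to ~/.bash_profile, using the install locations assumed in this guide (adjust them to your own layout):
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14
export HADOOP_HOME=/usr/local/bin/hadoop
export HBASE_HOME=/usr/local/bin/hbase
# optional: put the JDK tools (java, jps) on the PATH
export PATH=$PATH:$JAVA_HOME/bin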
2.1 Java
For now we'll use HBase 0.18.1, which was compiled with JDK 1.5, and Hadoop 0.18.2, which supports JDK 1.5.x. Today Hadoop 0.19.0 is available, but it requires JDK 1.6; HBase is expected to support JDK 1.6 only in version 0.19.
Install JDK 1.5.0_14:
[root@localhost /usr/local/bin]wget 'http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin' -O jdk-1_5_0_14-linux-i586.bin --03:25:00-- http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-1_5_0_14-linux-i586.bin?BundledLineItemUUID=PspIBe.oQkIAAAEeWNw8f4HX&OrderID=na1IBe.ouqIAAAEePNw8f4HX&ProductID=YOzACUFBuXAAAAEYlak5AXuQ&FileName=/jdk-1_5_0_14-linux-i586.bin Resolving cds.sun.com... 72.5.239.134 Connecting to cds.sun.com|72.5.239.134|:80... connected. HTTP request sent, awaiting response... 302 Found Location: http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin [following] --03:25:01-- http://cds-esd.sun.com/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin?AuthParam=1230539222_7e7c419133c9fb57c076e8e08293fd8c&TicketId=B%2Fw5lxuBTFhPQB1LOFJTnQTr&GroupName=CDS&FilePath=/ESD37/JSCDL/jdk/1.5.0_14/jdk-1_5_0_14-linux-i586.bin&File=jdk-1_5_0_14-linux-i586.bin Resolving cds-esd.sun.com... 98.27.88.9, 98.27.88.39 Connecting to cds-esd.sun.com|98.27.88.9|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 49649265 (47M) [application/x-sdlc] Saving to: `jdk-1_5_0_14-linux-i586.bin'
100%[=========================================================================================================================================>] 49,649,265 5.36M/s in 9.0s
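The downloaded file is a self-extracting installer; a sketch of installing it into /usr/local/bin, the location assumed elsewhere in this guide:
cd /usr/local/bin
chmod +x jdk-1_5_0_14-linux-i586.bin
./jdk-1_5_0_14-linux-i586.bin    # accept the license; unpacks into jdk1.5.0_14
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14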
Hadoop requires SSH access to manage its nodes, i.e. the remote machines plus your local machine if you want to use Hadoop on it, so:
[root@rt ~/.ssh] ssh-keygen -t rsa //This will create two files in your ~/.ssh directory //id_rsa: your private key //id_rsa.pub: your public key. [root@rt ~/.ssh] cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys [root@rt ~/.ssh] ssh localhost The authenticity of host 'localhost (127.0.0.1)' can't be established. RSA key fingerprint is 55:d7:91:86:ea:86:8f:51:89:9f:68:b0:75:88:52:72. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'localhost' (RSA) to the list of known hosts. [root@rt ~]
As you can see, the final test is to check whether you are able to make an SSH public-key authenticated connection to localhost.
If the SSH connection fails, these general tips might help:
Enable debugging with ssh -vvv localhost and investigate the error in detail.
Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hadoop user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
2.3 Disabling IPv6
To disable IPv6 on CentOS Linux, open /etc/modprobe.d/blacklist in the editor of your choice and add the following lines to the end of the file:
# disable IPv6
blacklist ipv6
You have to reboot your machine in order to make the changes take effect.
2.4 Edit the open files limit
Edit /etc/security/limits.conf and add the following lines:
root - nofile 100000
root - locks 100000
Run ulimit -n 1000000 in the shell.
3. Hadoop
3.1 Installation
- Unpack the Hadoop archive to /usr/local/bin (could be any directory); a consolidated sketch follows this list
- Move the unpacked directory to /usr/local/bin/hadoop: mv hadoop-0.18.2 hadoop
- Set HADOOP_HOME: export HADOOP_HOME=/usr/local/bin/hadoop
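Put together, and assuming the tarball was downloaded to /usr/local/bin as shown earlier, the steps look roughly like this:
cd /usr/local/bin
tar xzf hadoop-0.18.2.tar.gz
mv hadoop-0.18.2 hadoop
export HADOOP_HOME=/usr/local/bin/hadoop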
3.2 Configuration
Set up JAVA_HOME in $HADOOP_HOME/conf/hadoop-env.sh to point to your Java location:
# The java implementation to use. Required.
export JAVA_HOME=/usr/local/bin/jdk1.5.0_14
Set up $HADOOP_HOME/conf/hadoop-site.xml
Any site-specific configuration of Hadoop is configured in $HADOOP_HOME/conf/hadoop-site.xml. Here we will configure the directory where Hadoop will store its data files, the ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.
You can leave the settings below as is with the exception of the hadoop.tmp.dir variable which you have to change to the directory of your choice, for example:
/usr/local/hadoop-datastore/hadoop-${user.name}.
Hadoop will expand ${user.name} to the system user which is running Hadoop, so in our case this will be hadoop and thus the final path will be /usr/local/hadoop-datastore/hadoop-hadoop.
hadoop.tmp.dir = /usr/local/bin/hadoop/datastore/hadoop-${user.name}
A base for other temporary directories.
fs.default.name = hdfs://master:54310
The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.
mapred.job.tracker = master:54311
The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
mapred.reduce.tasks = 4
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".
mapred.tasktracker.reduce.tasks.maximum = 4
The maximum number of reduce tasks that will be run simultaneously by a task tracker.
dfs.replication = 2
Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
2.2 Formatting the name node
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
[root@cc hadoop]# bin/hadoop namenode -format
The output should look like this:
[root@cc hadoop]# bin/hadoop namenode -format 08/12/24 10:56:34 INFO dfs.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = cc.d.de.static.xlhost.com/206.222.13.204 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 0.18.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008 ************************************************************/ Re-format filesystem in /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name ? (Y or N) Y 08/12/24 10:57:40 INFO fs.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel 08/12/24 10:57:40 INFO fs.FSNamesystem: supergroup=supergroup 08/12/24 10:57:40 INFO fs.FSNamesystem: isPermissionEnabled=true 08/12/24 10:57:40 INFO dfs.Storage: Image file of size 78 saved in 0 seconds. 08/12/24 10:57:40 INFO dfs.Storage: Storage directory /usr/local/bin/hadoop/datastore/hadoop-root/dfs/name has been successfully formatted. 08/12/24 10:57:40 INFO dfs.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at cc.d.de.static.xlhost.com/206.222.13.204 ************************************************************/
2.3 Starting/Stopping your single-node cluster
Run the command:
[root@cc hadoop]# $HADOOP_HOME/bin/start-all.sh
You should see the following output:
starting namenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-namenode-cc.com.out localhost: starting datanode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-datanode-cc.out localhost: starting secondarynamenode, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-secondarynamenode-cc.out starting jobtracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-jobtracker-cc.com.out localhost: starting tasktracker, logging to /usr/local/bin/hadoop/bin/../logs/hadoop-root-tasktracker-cc.com.out
Run the example MapReduce job that comes with the Hadoop installation:
[root@38 /usr/local/bin/hadoop]bin/hadoop dfs -copyFromLocal LICENSE.txt testWordCount [root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls Found 1 items -rw-r--r-- 1 root supergroup 11358 2008-12-25 04:54 /user/root/testWordCount [root@38 /usr/local/bin/hadoop]bin/hadoop jar hadoop-0.18.2-examples.jar wordcount testWordCount testWordCount-output 08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1 08/12/25 04:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1 08/12/25 04:55:48 INFO mapred.JobClient: Running job: job_200812250447_0001 08/12/25 04:55:49 INFO mapred.JobClient: map 0% reduce 0% 08/12/25 04:55:51 INFO mapred.JobClient: map 100% reduce 0% 08/12/25 04:55:56 INFO mapred.JobClient: Job complete: job_200812250447_0001 08/12/25 04:55:56 INFO mapred.JobClient: Counters: 16 08/12/25 04:55:56 INFO mapred.JobClient: Job Counters 08/12/25 04:55:56 INFO mapred.JobClient: Data-local map tasks=2 08/12/25 04:55:56 INFO mapred.JobClient: Launched reduce tasks=1 08/12/25 04:55:56 INFO mapred.JobClient: Launched map tasks=2 08/12/25 04:55:56 INFO mapred.JobClient: Map-Reduce Framework 08/12/25 04:55:56 INFO mapred.JobClient: Map output records=1581 08/12/25 04:55:56 INFO mapred.JobClient: Reduce input records=593 08/12/25 04:55:56 INFO mapred.JobClient: Map output bytes=16546 08/12/25 04:55:56 INFO mapred.JobClient: Map input records=202 08/12/25 04:55:56 INFO mapred.JobClient: Combine output records=1292 08/12/25 04:55:56 INFO mapred.JobClient: Map input bytes=11358 08/12/25 04:55:56 INFO mapred.JobClient: Combine input records=2280 08/12/25 04:55:56 INFO mapred.JobClient: Reduce input groups=593 08/12/25 04:55:56 INFO mapred.JobClient: Reduce output records=593 08/12/25 04:55:56 INFO mapred.JobClient: File Systems 08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes written=6117 08/12/25 04:55:56 INFO mapred.JobClient: Local bytes written=18568 08/12/25 04:55:56 INFO mapred.JobClient: HDFS bytes read=13872 08/12/25 04:55:56 INFO mapred.JobClient: Local bytes read=8542 [root@38 /usr/local/bin/hadoop]bin/hadoop dfs -ls testWordCount-output Found 2 items drwxr-xr-x - root supergroup 0 2008-12-25 04:55 /user/root/testWordCount-output/_logs -rw-r--r-- 1 root supergroup 6117 2008-12-25 04:55 /user/root/testWordCount-output/part-00000 [root@38 /usr/local/bin/hadoop]bin/hadoop dfs -cat testWordCount-output/part-00000 // suppose to see something like this ... tracking 1 trade 1 trademark, 1 trademarks, 1 transfer 1 transformation 1 translation 1 ...
To stop the Hadoop cluster, run the following:
[root@37 /usr/local/bin/hadoop]bin/stop-all.sh no jobtracker to stop localhost: no tasktracker to stop no namenode to stop localhost: no datanode to stop localhost: no secondarynamenode to stop
2.4 Hadoop monitoring and debugging
Please see the Hadoop tips on how to debug MapReduce programs. It is worth mentioning that the Hadoop logs in $HADOOP_HOME/logs provide the most information; there are also links to them from the Hadoop web interfaces.
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser.
- Unpack the HBase archive to /usr/local/bin
- Move hbase-0.18.1 to hbase
- Define HBASE_HOME to point to /usr/local/bin/hbase (don't forget to edit ~/.bash_profile)
- Define JAVA_HOME in $HBASE_HOME/conf/hbase-env.sh
A pseudo-distributed operation is simply a distributed operation run on a single host. Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of ${HBASE_HOME}/conf/hbase-site.xml, which needs to be pointed at the running Hadoop DFS instance. Use hbase-site.xml to override the properties defined in ${HBASE_HOME}/conf/hbase-default.xml (hbase-default.xml itself should never be modified). At a minimum the hbase.rootdir property should be redefined in hbase-site.xml to point HBase at the Hadoop filesystem to use. For example, adding the property below to your hbase-site.xml says that HBase should use the /hbase directory in the HDFS whose namenode is at port 54310 on your local machine:
hbase.rootdir = hdfs://localhost:54310/hbase
The directory shared by region servers.
Once you have a running HBase, you probably want a way to hook your application up to it. If your application is in Java, then you should use the Java API. The following example takes as input an Excel-formatted file and the name of an already existing HBase table, processes the records, and writes them to HBase. You can also look at the client example here.
/**
 * Sample uploader Map-Reduce example class. Takes an excel format file as
 * input and writes its output to HBase.
 */
public class SampleUploader extends MapReduceBase implements Mapper, Tool {

  static enum Counters { MAP_LINES, REDUCE_LINES }

  private static final String NAME = "SampleUploader";
  private Configuration conf;

  static final String OUTPUT_COLUMN = "value:";
  static final String OUTPUT_KEY = "key:";

  long numRecords;
  private Text idText = new Text();
  private Text recordText = new Text();
  private String inputFile;

  /** A WritableComparator optimized for Text keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(Text.class);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
    }
  }

  public JobConf createSubmittableJob(String[] args) {
    JobConf c = new JobConf(getConf(), SampleUploader.class);
    c.setJobName(NAME);
    c.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(c, new Path(args[0]));
    //c.setInputPaths(new Path(args[0]));
    c.setMapperClass(this.getClass());
    c.setMapOutputKeyClass(Text.class);
    c.setMapOutputValueClass(Text.class);
    c.setReducerClass(TableUploader.class);
    TableReduce.initJob(args[1], TableUploader.class, c);
    return c;
  }

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
  }

  public void map(LongWritable k, Text v, OutputCollector output, Reporter r)
      throws IOException {
    // Parse the input record and emit a key/value pair for the reducer
    // (implementation omitted).
  }

  public int run(String[] args) throws Exception {
    // Make sure there are exactly 2 parameters left.
    if (args.length != 2) {
      System.out.println("ERROR: Wrong number of parameters: " + args.length
          + " instead of 2.");
      return printUsage();
    }
    JobClient.runJob(createSubmittableJob(args));
    return 0;
  }
public Configuration getConf() { return this.conf; }
public void setConf(final Configuration c) { this.conf = c; }
public static void main(String[] args) throws Exception { int errCode = ToolRunner.run(new Configuration(), new SampleUploader(), args); System.exit(errCode);
}
}
3.4 Running HBase and using HBase shell
Start HBase with the following command:
${HBASE_HOME}/bin/start-hbase.sh
If HBase started successfully, you should see output like the following after running jps:
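The exact list depends on which daemons are running; on a single host that also runs the Hadoop daemons it should include HMaster and HRegionServer alongside them, roughly like this (the process IDs below are made up):
12101 HMaster
12204 HRegionServer
11527 NameNode
11638 DataNode
11764 SecondaryNameNode
11847 JobTracker
11962 TaskTracker
12350 Jps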
The following HDFS shell commands are available via bin/hadoop dfs:
-ls path
Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr path
Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du path
Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-dus path
Like -du, but prints a summary of disk usage of all files/directories in the path.
-mv src dest
Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest
Copies the file or directory identified by src to dest, within HDFS.
-rm path
Removes the file or empty directory identified by path.
-rmr path
Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrc dest
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
-copyFromLocal localSrc dest
Identical to -put
-moveFromLocal localSrc dest
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] src localDest
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge src localDest [addnl]
Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat filename
Displays the contents of filename on stdout.
-copyToLocal [-crc] src localDest
Identical to -get
-moveToLocal [-crc] src localDest
Works like -get, but deletes the HDFS copy on success.
-mkdir path
Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).
-setrep [-R] [-w] rep path
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)
-touchz path
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] path
Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.
-stat [format] path
Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file
Shows the last 1KB of file on stdout.
-chmod [-R] mode,mode,... path...
Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path...
Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
-chgrp [-R] group path...
Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.
-help cmd
Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
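For example, a short hypothetical session combining a few of the commands above (the paths are placeholders):
bin/hadoop dfs -mkdir input
bin/hadoop dfs -put LICENSE.txt input
bin/hadoop dfs -ls input
bin/hadoop dfs -cat input/LICENSE.txt
bin/hadoop dfs -rmr input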
First you need to find out your hostname; you can do this with:
$ hostname
localhost.localdomain
$
Edit /etc/hosts
You need to edit /etc/hosts and add a line for your host name
$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
$
My new server's IP is 72.232.196.90 and I need to assign it the hostname server12.hosthat.com; to do this, I edited /etc/hosts as follows.
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
72.232.196.90 server12.hosthat.com server12
vmstat reports information about processes, memory, paging, block IO, traps, and cpu activity.
The first report produced gives averages since the last reboot. Additional reports give information on a sampling period of length delay. The process and memory reports are instantaneous in either case.
[root@]# vmstat 1 procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 0 0 0 6085936 91884 1707572 1333604 48 50 70 73 46 7 1 1 23 0 0 0 6085936 88464 1707608 1335120 0 0 1374 49 2750 1784 0 0 100 0 0 0 6085936 88048 1707660 1335952 0 0 753 913 2532 1883 0 0 100 0 0 0 6085936 87632 1707824 1337460 0 0 1282 908 2452 2054 0 0 100
First, create a public/private key pair on the client that you will use to connect to the server (you will need to do this from each client machine from which you connect):
$ ssh-keygen -t rsa
This will create two files in your ~/.ssh directory: id_rsa (your private key) and id_rsa.pub (your public key).
If you don't want to be asked for a password each time you connect, just press Enter when asked for a password while creating the key pair. It is up to you to decide whether or not you should password-encrypt your key when you create it. If you don't password-encrypt your key, then anyone gaining access to your local machine will automatically have SSH access to the remote server. Also, root on the local machine has access to your keys, although one assumes that if you can't trust root (or root is compromised) then you're in real trouble. Encrypting the key adds security, but the password you no longer enter for the SSH server is simply replaced by a password for the use of the key. Now set permissions on your private key:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/id_rsa
Copy the public key (id_rsa.pub) to the server and install it to the authorized_keys list:
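For example (user and server are placeholders for your own account and host; the second form works on systems that ship ssh-copy-id):
cat ~/.ssh/id_rsa.pub | ssh user@server 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
# or, equivalently:
ssh-copy-id user@server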
Once you've checked you can successfully login to the server using your public/private key pair, you can disable password authentication completely by adding the following setting to your /etc/ssh/sshd_config
# Disable password authentication forcing use of keys
PasswordAuthentication no
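After changing sshd_config, the SSH daemon has to re-read its configuration; on a RedHat/CentOS style system this is typically something like the following (the service name and init script path vary by distribution):
/etc/init.d/sshd reload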
No long write-ups this week, just a short list of some great resources that I've found very inspirational and thought provoking. I've broken these resources up into two lists: Blogs and Presentations.
Blogs
The blogs listed below are ones that I subscribe to and are filled with some great posts about capacity planning, scalability problems and solutions, and distributed system information. Each blog is authored by exceptionally smart people and many of them have significant experience building production-level scalable systems.
Presentations
The presentations listed below are from the SlideShare site and are primarily the slides used to accompany scalability talks from around the world. Many of them outline the problems that various companies have encountered during their non-linear growth phases and how they've solved them by scaling their systems.
Amazon EC2 setup - link
Yahoo Hadoop tutorial - link
Michael Noll blog - link
Hadoop main page - link
Google lectures - link
HBase resources - link
Distributed computing (IBM) - link
HBase and BigTable - link
I recently ran into the file handle limit problem while running a Java program that holds a large hash of file descriptors. The following example describes how to raise the maximum number of file descriptors per process to 4096 on the RedHat/CentOS distribution of Linux:
Process File Descriptor Tuning
In addition to configuring system-wide global file-descriptor values, you must also consider per-process limits.
The following example describes how to raise the maximum number of file descriptors per process to 4096 on the RedHat/CentOS distribution of Linux:
Allow all users to modify their file descriptor limits from an initial value of 1024 up to the maximum permitted value of 4096 by changing /etc/security/limits.conf
* soft nofile 1024
* hard nofile 4096
In /etc/pam.d/login, add:
session required /lib/security/pam_limits.so
Increase the system-wide file descriptor limit by adding the following line to the /etc/rc.d/rc.local startup script:
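A commonly used line for this step (the exact value is an assumption on my part; pick one that suits your workload) is:
echo 65536 > /proc/sys/fs/file-max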
# double space a file
awk '1;{print ""}'
awk 'BEGIN{ORS="\n\n"};1'

# double space a file which already has blank lines in it. Output file
# should contain no more than one blank line between lines of text.
# NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
# often treated as non-blank, and thus 'NF' alone will return TRUE.
awk 'NF{print $0 "\n"}'

# triple space a file
awk '1;{print "\n"}'
NUMBERING AND CALCULATIONS:
# precede each line by its line number FOR THAT FILE (left alignment).
# Using a tab (\t) instead of space will preserve margins.
awk '{print FNR "\t" $0}' files*

# precede each line by its line number FOR ALL FILES TOGETHER, with tab.
awk '{print NR "\t" $0}' files*

# number each line of a file (number on left, right-aligned)
# Double the percent signs if typing from the DOS command prompt.
awk '{printf("%5d : %s\n", NR,$0)}'

# number each line of file, but only print numbers if line is not blank
# Remember caveats about Unix treatment of \r (mentioned above)
awk 'NF{$0=++a " :" $0};{print}'
awk '{print (NF? ++a " :" :"") $0}'

# print the sums of the fields of every line
awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'

# add all fields in all lines and print the sum
awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'

# print every line after replacing each field with its absolute value
awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'

# print the total number of fields ("words") in all lines
awk '{ total = total + NF }; END {print total}' file

# print the total number of lines that contain "Beth"
awk '/Beth/{n++}; END {print n+0}' file

# print the largest first field and the line that contains it
# Intended for finding the longest string in field #1
awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'

# print the number of fields in each line, followed by the line
awk '{ print NF ":" $0 } '

# print the last field of each line
awk '{ print $NF }'

# print the last field of the last line
awk '{ field = $NF }; END{ print field }'

# print every line with more than 4 fields
awk 'NF > 4'

# print every line where the value of the last field is > 4
awk '$NF > 4'
TEXT CONVERSION AND SUBSTITUTION:
# IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
awk '{sub(/\r$/,"");print}'   # assumes EACH line ends with Ctrl-M

# IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk '{sub(/$/,"\r");print}'

# IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk 1

# IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
# Cannot be done with DOS versions of awk, other than gawk:
gawk -v BINMODE="w" '1' infile >outfile

# Use "tr" instead.
tr -d '\r' <infile >outfile   # GNU tr version 1.22 or higher

# delete leading whitespace (spaces, tabs) from front of each line
# aligns all text flush left
awk '{sub(/^[ \t]+/, ""); print}'

# delete trailing whitespace (spaces, tabs) from end of each line
awk '{sub(/[ \t]+$/, "");print}'

# delete BOTH leading and trailing whitespace from each line
awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
awk '{$1=$1;print}'   # also removes extra space between fields

# insert 5 blank spaces at beginning of each line (make page offset)
awk '{sub(/^/, "     ");print}'

# align all text flush right on a 79-column width
awk '{printf "%79s\n", $0}' file*

# center all text on a 79-character width
awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*

# substitute (find and replace) "foo" with "bar" on each line
awk '{sub(/foo/,"bar");print}'           # replaces only 1st instance
gawk '{$0=gensub(/foo/,"bar",4);print}'  # replaces only 4th instance
awk '{gsub(/foo/,"bar");print}'          # replaces ALL instances in a line

# substitute "foo" with "bar" ONLY for lines which contain "baz"
awk '/baz/{gsub(/foo/, "bar")};{print}'

# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")};{print}'

# change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'

# reverse order of lines (emulates "tac")
awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*

# if a line ends with a backslash, append the next line to it
# (fails if there are multiple lines ending with backslash...)
awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*

# print and sort the login names of all users
awk -F ":" '{ print $1 | "sort" }' /etc/passwd

# print the first 2 fields, in opposite order, of every line
awk '{print $2, $1}' file

# switch the first 2 fields of every line
awk '{temp = $1; $1 = $2; $2 = temp; print}' file

# print every line, deleting the second field of that line
awk '{ $2 = ""; print }'

# print in reverse order the fields of every line
awk '{for (i=NF; i>0; i--) printf("%s ",$i);printf ("\n")}' file

# remove duplicate, nonconsecutive lines
awk '! a[$0]++'                 # most concise script
awk '!($0 in a) {a[$0];print}'  # most efficient script

# concatenate every 5 lines of input, using a comma separator
# between fields
awk 'ORS=NR%5?",":"\n"' file
SELECTIVE PRINTING OF CERTAIN LINES:
# print first 10 lines of file (emulates behavior of "head")
awk 'NR < 11'

# print first line of file (emulates "head -1")
awk 'NR>1{exit};1'

# print the last 2 lines of a file (emulates "tail -2")
awk '{y=x "\n" $0; x=$0};END{print y}'

# print the last line of a file (emulates "tail -1")
awk 'END{print}'

# print only lines which match regular expression (emulates "grep")
awk '/regex/'

# print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'

# print the line immediately before a regex, but not the line
# containing the regex
awk '/regex/{print x};{x=$0}'
awk '/regex/{print (x=="" ? "match on line 1" : x)};{x=$0}'

# print the line immediately after a regex, but not the line
# containing the regex
awk '/regex/{getline;print}'

# grep for AAA and BBB and CCC (in any order)
awk '/AAA/; /BBB/; /CCC/'

# grep for AAA and BBB and CCC (in that order)
awk '/AAA.*BBB.*CCC/'

# print only lines of 65 characters or longer
awk 'length > 64'

# print only lines of less than 65 characters
awk 'length < 64'

# print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'

# print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'

# print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}'   # more efficient on large files

# print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/'   # case sensitive
SELECTIVE DELETION OF CERTAIN LINES:
# delete ALL blank lines from a file (same as "grep '.' ")
awk NF
awk '/./'
A number of Unix text-processing utilities let you search for, and in some cases change, text patterns rather than fixed strings. These utilities include the editing programs ed, ex, vi, and sed, the awk programming language, and the commands grep and egrep. Text patterns (formally called regular expressions) contain normal characters mixed with special characters (called metacharacters).
1.3.1 Filenames Versus Patterns
Metacharacters used in pattern matching are different from metacharacters used for filename expansion. When you issue a command on the command line, special characters are seen first by the shell, then by the program; therefore, unquoted metacharacters are interpreted by the shell for filename expansion. For example, the command:
$ grep [A-Z]* chap[12]
could be transformed by the shell into:
$ grep Array.c Bug.c Comp.c chap1 chap2
and would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. To bypass the shell and pass the special characters to grep, use quotes as follows:
$ grep "[A-Z]*" chap[12]
Double quotes suffice in most cases, but single quotes are the safest bet.
Note also that in pattern matching, ? matches zero or one instance of a regular expression; in filename expansion, ? matches a single character.
1.3.2 Metacharacters
Different metacharacters have different meanings, depending upon where they are used. In particular, regular expressions used for searching through text (matching) have one set of metacharacters, while the metacharacters used when processing replacement text have a different set. These sets also vary somewhat per program. This section covers the metacharacters used for searching and replacing, with descriptions of the variants in the different utilities.
1.3.2.1 Search patterns
The characters in the following table have special meaning only in search patterns:
Character
Pattern
.
Match any single character except newline. Can match newline in awk.
*
Match any number (or none) of the single character that immediately precedes it. The preceding character can also be a regular expression. For example, since . (dot) means any character, .* means "match any number of any character."
^
Match the following regular expression at the beginning of the line or string.
$
Match the preceding regular expression at the end of the line or string.
\
Turn off the special meaning of the following character.
[ ]
Match any one of the enclosed characters. A hyphen (-) indicates a range of consecutive characters. A circumflex (^) as the first character in the brackets reverses the sense: it matches any one character not in the list. A hyphen or close bracket (]) as the first character is treated as a member of the list. All other metacharacters are treated as members of the list (i.e., literally).
{n,m}
Match a range of occurrences of the single character that immediately precedes it. The preceding character can also be a metacharacter. {n} matches exactly n occurrences; {n,} matches at least n occurrences; and {n,m} matches any number of occurrences between n and m. n and m must be between 0 and 255, inclusive.
\{n,m\}
Just like {n,m}, but with backslashes in front of the braces.
\( \)
Save the pattern enclosed between \( and \) into a special holding space. Up to nine patterns can be saved on a single line. The text matched by the subpatterns can be "replayed" in substitutions by the escape sequences \1 to \9.
\n
Replay the nth sub-pattern enclosed in \( and \) into the pattern at this point. n is a number from 1 to 9, with 1 starting on the left.
\< \>
Match characters at beginning (\<) or end (\>) of a word.
+
Match one or more instances of preceding regular expression.
?
Match zero or one instances of preceding regular expression.
|
Match the regular expression specified before or after.
( )
Apply a match to the enclosed group of regular expressions.
Many Unix systems allow the use of POSIX character classes within the square brackets that enclose a group of characters. These are typed enclosed in [: and :]. For example, [[:alnum:]] matches a single alphanumeric character.
Class
Characters matched
alnum
Alphanumeric characters
alpha
Alphabetic characters
blank
Space or TAB
cntrl
Control characters
digit
Decimal digits
graph
Nonspace characters
lower
Lowercase characters
print
Printable characters
space
Whitespace characters
upper
Uppercase characters
xdigit
Hexadecimal digits
1.3.2.2 Replacement patterns
The characters in the following table have special meaning only in replacement patterns:
Character
Pattern
\
Turn off the special meaning of the following character.
\n
Restore the text matched by the nth pattern previously saved by \( and \). n is a number from 1 to 9, with 1 starting on the left.
&
Reuse the text matched by the search pattern as part of the replacement pattern.
~
Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern (ex and vi).
%
Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern (ed).
\u
Convert first character of replacement pattern to uppercase.
\U
Convert entire replacement pattern to uppercase.
\l
Convert first character of replacement pattern to lowercase.
\L
Convert entire replacement pattern to lowercase.
\E
Turn off previous \U or \L.
\e
Turn off previous \u or \l.
1.3.3 Metacharacters, Listed by Unix Program
Some metacharacters are valid for one program but not for another. Those that are available to a Unix program are marked by a bullet (•) in the following table. (This table is correct for SVR4 and Solaris and most commercial Unix systems, but it's always a good idea to verify your system's behavior.) Items marked with a "P" are specified by POSIX; double check your system's version. Full descriptions were provided in the previous section.
[1] Stored sub-patterns can be "replayed" during matching. See the examples in the next table.
Note that in ed, ex, vi, and sed, you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters listed in this table are meaningful only in a search pattern.
In ed, ex, vi, and sed, the following metacharacters are valid only in a replacement pattern:
Symbol
ex
vi
sed
ed
Action
\
Escape following character.
\n
Text matching pattern stored in \( \).
&
Text matching search pattern.
~
Reuse previous replacement pattern.
%
Reuse previous replacement pattern.
\u \U
Change character(s) to uppercase.
\l \L
Change character(s) to lowercase.
\E
Turn off previous \U or \L.
\e
Turn off previous \u or \l.
1.3.4 Examples of Searching
When used with grep or egrep, regular expressions should be surrounded by quotes. (If the pattern contains a $, you must use single quotes; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by /, although (except for awk) any delimiter works. Here are some example patterns:
Pattern
What does it match?
bag
The string bag.
^bag
bag at the beginning of the line.
bag$
bag at the end of the line.
^bag$
bag as the only word on the line.
[Bb]ag
Bag or bag.
b[aeiou]g
Second letter is a vowel.
b[^aeiou]g
Second letter is a consonant (or uppercase or symbol).
b.g
Second letter is any character.
^...$
Any line containing exactly three characters.
^\.
Any line that begins with a dot.
^\.[a-z][a-z]
Same as previous, followed by two lowercase letters (e.g., troff requests).
^\.[a-z]\{2\}
Same as previous; ed, grep and sed only.
^[^.]
Any line that doesn't begin with a dot.
bugs*
bug, bugs, bugss, etc.
"word"
A word in quotes.
"*word"*
A word, with or without quotes.
[A-Z][A-Z]*
One or more uppercase letters.
[A-Z]+
Same as previous; egrep or awk only.
[[:upper:]]+
Same as previous; POSIX egrep or awk.
[A-Z].*
An uppercase letter, followed by zero or more characters.
[A-Z]*
Zero or more uppercase letters.
[a-zA-Z]
Any letter, either lower- or uppercase.
[^0-9A-Za-z]
Any symbol or space (not a letter or a number).
[^[:alnum:]]
Same, using POSIX character class.
egrep or awk pattern
What does it match?
[567]
One of the numbers 5, 6, or 7.
five|six|seven
One of the words five, six, or seven.
80[2-4]?86
8086, 80286, 80386, or 80486.
80[2-4]?86|Pentium
8086, 80286, 80386, 80486, or Pentium.
compan(y|ies)
company or companies.
ex or vi pattern
What does it match?
\<the
Words like theater, there, or the.
the\>
Words like breathe, seethe, or the.
\<the\>
The word the.
ed, sed, or grep pattern
What does it match?
0\{5,\}
Five or more zeros in a row.
[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}
U.S. Social Security number (nnn-nn-nnnn).
\(why\).*\1
A line with two occurrences of why.
\([[:alpha:]_][[:alnum:]_.]*\) = \1;
C/C++ simple assignment statements.
1.3.4.1 Examples of searching and replacing
The following examples show the metacharacters available to sed or ex. Note that ex commands begin with a colon. In the patterns below, a space is marked by ␣ and a TAB is marked by ⇥.
Command
Result
s/.*/( & )/
Redo the entire line, but add parentheses.
s/.*/mv & &.old/
Change a wordlist (one word per line) into mv commands.
/^$/d
Delete blank lines.
:g/^$/d
Same as previous, in ex editor.
/^[␣⇥]*$/d
Delete blank lines, plus lines containing only spaces or TABs.
:g/^[␣⇥]*$/d
Same as previous, in ex editor.
s/␣␣*/␣/g
Turn one or more spaces into one space.
:%s/␣␣*/␣/g
Same as previous, in ex editor.
:s/[0-9]/Item &:/
Turn a number into an item label (on the current line).
:s
Repeat the substitution on the first occurrence.
:&
Same as previous.
:sg
Same as previous, but for all occurrences on the line.
:&g
Same as previous.
:%&g
Repeat the substitution globally (i.e., on all lines).
:.,$s/Fortran/\U&/g
On current line to last line, change word to uppercase.
:%s/.*/\L&/
Lowercase entire file.
:s/\<./\u&/g
Uppercase first letter of each word on current line. (Useful for titles.)
:%s/yes/No/g
Globally change a word to No.
:%s/Yes/~/g
Globally change a different word to No (previous replacement).
Finally, here are some sed examples for transposing words. A simple transposition of two words might look like this:
s/die or do/do or die/
The real trick is to use hold buffers to transpose variable patterns. For example, to transpose using hold buffers:
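For instance, a sketch using the \( \) and \1 notation described earlier, which swaps the two captured words regardless of their capitalization:
s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/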