Sunday, November 30, 2008

Process File Descriptor Tuning on Linux

I recently ran into a file-handle limit while running a Java program that holds a large hash of file descriptors. The notes below describe how I raised the per-process limit.

Process File Descriptor Tuning

In addition to configuring system-wide global file-descriptor values, you must also consider per-process limits.

The following example describes how to raise the maximum number of file descriptors per process to 4096 on the Red Hat/CentOS distributions of Linux:

  1. Allow all users to modify their file descriptor limits from an initial value of 1024 up to the maximum permitted value of 4096 by changing /etc/security/limits.conf:

       *       soft    nofile  1024
       *       hard    nofile  4096

    In /etc/pam.d/login, add:

       session required /lib/security/pam_limits.so

  2. Increase the system-wide file descriptor limit by adding the following line to the /etc/rc.d/rc.local startup script:

       echo -n "8192" > /proc/sys/fs/file-max

    or, equivalently, with sysctl:

       sysctl -w fs.file-max=8192

    Now restart the system or run these commands from a command line to apply these changes.

  3. You will then need to tell the shell to use the new limit. Without root privileges the value cannot exceed the hard limit of 4096 set in step 1:

    ulimit -n 4096 (bash)

    or

    unlimit descriptors (csh, tcsh)

  4. Verify that the limit has been raised by checking the output of:

    ulimit -a (bash) or limit (csh, tcsh)
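
To see where a shell stands relative to these limits, a small sketch like the following may help. It assumes a Linux system with /proc mounted; the function name show_fd_limits is made up for illustration and is not part of any standard tool.

```shell
#!/bin/sh
# Hypothetical helper (not from the original notes): report the per-process
# and system-wide file descriptor limits as seen by the current shell.
show_fd_limits() {
    # Soft limit: what a process gets by default.
    echo "soft nofile: $(ulimit -S -n)"
    # Hard limit: the ceiling an unprivileged user may raise the soft limit to.
    echo "hard nofile: $(ulimit -H -n)"
    # System-wide maximum (Linux only, needs /proc).
    if [ -r /proc/sys/fs/file-max ]; then
        echo "fs.file-max: $(cat /proc/sys/fs/file-max)"
    fi
    # Descriptors currently open in this shell (Linux only).
    if [ -d "/proc/$$/fd" ]; then
        echo "open in this shell: $(ls "/proc/$$/fd" | wc -l)"
    fi
}

show_fd_limits
```

If the soft value still reads 1024 after logging in again, the pam_limits line from step 1 is probably not being picked up by the login path in use.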

Wednesday, November 26, 2008

HANDY ONE-LINERS FOR AWK

FILE SPACING:

# double space a file
awk '1;{print ""}'
awk 'BEGIN{ORS="\n\n"};1'

# double space a file which already has blank lines in it. Output file
# should contain no more than one blank line between lines of text.
# NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
# often treated as non-blank, and thus 'NF' alone will return TRUE.
awk 'NF{print $0 "\n"}'

# triple space a file
awk '1;{print "\n"}'

NUMBERING AND CALCULATIONS:

# precede each line by its line number FOR THAT FILE (left alignment).
# Using a tab (\t) instead of space will preserve margins.
awk '{print FNR "\t" $0}' files*

# precede each line by its line number FOR ALL FILES TOGETHER, with tab.
awk '{print NR "\t" $0}' files*

# number each line of a file (number on left, right-aligned)
# Double the percent signs if typing from the DOS command prompt.
awk '{printf("%5d : %s\n", NR,$0)}'

# number each line of file, but only print numbers if line is not blank
# Remember caveats about Unix treatment of \r (mentioned above)
awk 'NF{$0=++a " :" $0};{print}'
awk '{print (NF? ++a " :" :"") $0}'

# count lines (emulates "wc -l")
awk 'END{print NR}'

# print the sums of the fields of every line
awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'

# add all fields in all lines and print the sum
awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}'

# print every line after replacing each field with its absolute value
awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'

# print the total number of fields ("words") in all lines
awk '{ total = total + NF }; END {print total}' file

# print the total number of lines that contain "Beth"
awk '/Beth/{n++}; END {print n+0}' file

# print the largest first field and the line that contains it
# Intended for finding the longest string in field #1
awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'

# print the number of fields in each line, followed by the line
awk '{ print NF ":" $0 } '

# print the last field of each line
awk '{ print $NF }'

# print the last field of the last line
awk '{ field = $NF }; END{ print field }'

# print every line with more than 4 fields
awk 'NF > 4'

# print every line where the value of the last field is > 4
awk '$NF > 4'
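
The difference between FNR and NR is easiest to see with two small files. The sketch below uses throwaway files created with mktemp; the file names and contents are invented for the demonstration.

```shell
#!/bin/sh
# Demo data (made up): two short files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\nd\n' > "$tmpdir/two.txt"

# FNR restarts at 1 for each input file; NR keeps counting across files.
per_file=$(awk '{print FNR "\t" $0}' "$tmpdir/one.txt" "$tmpdir/two.txt")
overall=$(awk '{print NR "\t" $0}' "$tmpdir/one.txt" "$tmpdir/two.txt")

# Sum every field on every line, as in the "add all fields" one-liner.
total=$(printf '1 2 3\n4 5\n' | awk '{for (i=1; i<=NF; i++) s=s+$i}; END{print s}')

printf '%s\n' "$per_file"
printf '%s\n' "$overall"
printf '%s\n' "$total"
rm -rf "$tmpdir"
```

The first listing numbers the lines 1, 2, 1, 2; the second numbers them 1 through 4.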


TEXT CONVERSION AND SUBSTITUTION:

# IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
awk '{sub(/\r$/,"");print}' # assumes EACH line ends with Ctrl-M

# IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk '{sub(/$/,"\r");print}'

# IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
awk 1

# IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
# Cannot be done with DOS versions of awk, other than gawk:
gawk -v BINMODE="w" '1' infile >outfile

# Use "tr" instead.
tr -d \r <infile >outfile # GNU tr version 1.22 or higher

# delete leading whitespace (spaces, tabs) from front of each line
# aligns all text flush left
awk '{sub(/^[ \t]+/, ""); print}'

# delete trailing whitespace (spaces, tabs) from end of each line
awk '{sub(/[ \t]+$/, "");print}'

# delete BOTH leading and trailing whitespace from each line
awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
awk '{$1=$1;print}' # also removes extra space between fields

# insert 5 blank spaces at beginning of each line (make page offset)
awk '{sub(/^/, "     ");print}'

# align all text flush right on a 79-column width
awk '{printf "%79s\n", $0}' file*

# center all text on a 79-character width
awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*

# substitute (find and replace) "foo" with "bar" on each line
awk '{sub(/foo/,"bar");print}' # replaces only 1st instance
gawk '{$0=gensub(/foo/,"bar",4);print}' # replaces only 4th instance
awk '{gsub(/foo/,"bar");print}' # replaces ALL instances in a line

# substitute "foo" with "bar" ONLY for lines which contain "baz"
awk '/baz/{gsub(/foo/, "bar")};{print}'

# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")};{print}'

# change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'

# reverse order of lines (emulates "tac")
awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file*

# if a line ends with a backslash, append the next line to it
# (fails if there are multiple lines ending with backslash...)
awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*

# print and sort the login names of all users
awk -F ":" '{ print $1 | "sort" }' /etc/passwd

# print the first 2 fields, in opposite order, of every line
awk '{print $2, $1}' file

# switch the first 2 fields of every line
awk '{temp = $1; $1 = $2; $2 = temp; print}' file

# print every line, deleting the second field of that line
awk '{ $2 = ""; print }'

# print in reverse order the fields of every line
awk '{for (i=NF; i>0; i--) printf("%s ",$i);printf ("\n")}' file

# remove duplicate, consecutive lines (emulates "uniq")
awk 'a != $0; {a=$0}'

# remove duplicate, nonconsecutive lines
awk '! a[$0]++' # most concise script
awk '!($0 in a) {a[$0];print}' # most efficient script

# concatenate every 5 lines of input, using a comma separator
# between fields
awk 'ORS=NR%5?",":"\n"' file
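
Two of the idioms in this section are compact enough to be puzzling at first sight, so here is a short sketch that runs them on invented input. The variable names are only for the demonstration.

```shell
#!/bin/sh
# Remove nonconsecutive duplicates. a[$0]++ is 0 (false) the first time a
# line is seen, so the negation is true and the line is printed once.
deduped=$(printf 'x\ny\nx\nz\ny\n' | awk '!a[$0]++')

# Join every 5 input lines with commas. ORS is reassigned for each record:
# a comma unless NR is a multiple of 5, in which case a newline.
joined=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | awk 'ORS=NR%5?",":"\n"')

printf '%s\n' "$deduped"
printf '%s\n' "$joined"
```

The assignment to ORS doubles as the pattern: its value is always a non-empty string, which awk treats as true, so every record is printed.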



SELECTIVE PRINTING OF CERTAIN LINES:

# print first 10 lines of file (emulates behavior of "head")
awk 'NR < 11'

# print first line of file (emulates "head -1")
awk 'NR>1{exit};1'

# print the last 2 lines of a file (emulates "tail -2")
awk '{y=x "\n" $0; x=$0};END{print y}'

# print the last line of a file (emulates "tail -1")
awk 'END{print}'

# print only lines which match regular expression (emulates "grep")
awk '/regex/'

# print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'

# print the line immediately before a regex, but not the line
# containing the regex
awk '/regex/{print x};{x=$0}'
awk '/regex/{print (x=="" ? "match on line 1" : x)};{x=$0}'

# print the line immediately after a regex, but not the line
# containing the regex
awk '/regex/{getline;print}'

# grep for AAA and BBB and CCC (in any order)
awk '/AAA/ && /BBB/ && /CCC/'

# grep for AAA and BBB and CCC (in that order)
awk '/AAA.*BBB.*CCC/'

# print only lines of 65 characters or longer
awk 'length > 64'

# print only lines of less than 65 characters
awk 'length < 65'

# print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'

# print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'

# print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files

# print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/' # case sensitive
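
The range patterns above (start,end) can be tried out on a short list. The state names below are sample input made up for this sketch.

```shell
#!/bin/sh
# Sample input, invented for the demonstration.
states=$(printf 'Idaho\nIowa\nKansas\nMontana\nNevada\n')

# From the first line matching /Iowa/ through the next line matching /Montana/.
between=$(printf '%s\n' "$states" | awk '/Iowa/,/Montana/')

# By line number: lines 2 through 3, inclusive.
lines_2_to_3=$(printf '%s\n' "$states" | awk 'NR==2,NR==3')

# An end condition of 0 is never true, so the range runs to end of file.
from_match=$(printf '%s\n' "$states" | awk '/Montana/,0')

printf '%s\n' "$between" "--" "$lines_2_to_3" "--" "$from_match"
```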


SELECTIVE DELETION OF CERTAIN LINES:

# delete ALL blank lines from a file (same as "grep '.' ")
awk NF
awk '/./'
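
The CRLF caveat mentioned earlier applies here as well. The sketch below assumes an awk such as gawk or mawk, where a lone carriage return counts as a field; the input is invented.

```shell
#!/bin/sh
# Three DOS-style lines; the middle one is "blank" but still holds a \r.
# NF is 1 for that line, so a plain NF test keeps it.
kept=$(printf 'a\r\n\r\nb\r\n' | awk 'NF{n++} END{print n+0}')

# Stripping the trailing \r first makes the blank line truly empty.
kept_after_strip=$(printf 'a\r\n\r\nb\r\n' | awk '{sub(/\r$/,"")} NF{n++} END{print n+0}')

echo "lines kept by NF alone: $kept"
echo "lines kept after stripping CR: $kept_after_strip"
```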

Tuesday, November 25, 2008

Searching with shell utilities

1.3 Matching Text

A number of Unix text-processing utilities let you search for, and in some cases change, text patterns rather than fixed strings. These utilities include the editing programs ed, ex, vi, and sed, the awk programming language, and the commands grep and egrep. Text patterns (formally called regular expressions) contain normal characters mixed with special characters (called metacharacters).

1.3.1 Filenames Versus Patterns

Metacharacters used in pattern matching are different from metacharacters used for filename expansion. When you issue a command on the command line, special characters are seen first by the shell, then by the program; therefore, unquoted metacharacters are interpreted by the shell for filename expansion. For example, the command:

$ grep [A-Z]* chap[12]

could be transformed by the shell into:

$ grep Array.c Bug.c Comp.c chap1 chap2

and would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. To bypass the shell and pass the special characters to grep, use quotes as follows:

$ grep "[A-Z]*" chap[12]

Double quotes suffice in most cases, but single quotes are the safest bet.

Note also that in pattern matching, ? matches zero or one instance of a regular expression; in filename expansion, ? matches a single character.
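
The effect of quoting can be checked without running grep at all, by letting echo show what the shell would pass along. This is a sketch using a scratch directory; the file names mirror the example above, and LC_ALL=C is set as an assumption so that [A-Z] means ASCII uppercase only.

```shell
#!/bin/sh
LC_ALL=C
export LC_ALL

# Scratch directory with the files from the example.
tmpdir=$(mktemp -d)
touch "$tmpdir/Array.c" "$tmpdir/Bug.c" "$tmpdir/Comp.c" "$tmpdir/chap1" "$tmpdir/chap2"

# Unquoted: the shell expands [A-Z]* to matching filenames before grep runs.
unquoted=$(cd "$tmpdir" && echo grep [A-Z]* chap[12])

# Quoted: the pattern reaches grep intact; only chap[12] is expanded.
quoted=$(cd "$tmpdir" && echo grep '[A-Z]*' chap[12])

echo "$unquoted"
echo "$quoted"
rm -rf "$tmpdir"
```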

1.3.2 Metacharacters

Different metacharacters have different meanings, depending upon where they are used. In particular, regular expressions used for searching through text (matching) have one set of metacharacters, while the metacharacters used when processing replacement text have a different set. These sets also vary somewhat per program. This section covers the metacharacters used for searching and replacing, with descriptions of the variants in the different utilities.

1.3.2.1 Search patterns

The characters in the following table have special meaning only in search patterns:

Character   Pattern
.           Match any single character except newline. Can match newline in awk.
*           Match any number (or none) of the single character that immediately precedes it. The preceding character can also be a regular expression. For example, since . (dot) means any character, .* means "match any number of any character."
^           Match the following regular expression at the beginning of the line or string.
$           Match the preceding regular expression at the end of the line or string.
\           Turn off the special meaning of the following character.
[ ]         Match any one of the enclosed characters. A hyphen (-) indicates a range of consecutive characters. A circumflex (^) as the first character in the brackets reverses the sense: it matches any one character not in the list. A hyphen or close bracket (]) as the first character is treated as a member of the list. All other metacharacters are treated as members of the list (i.e., literally).
{n,m}       Match a range of occurrences of the single character that immediately precedes it. The preceding character can also be a metacharacter. {n} matches exactly n occurrences; {n,} matches at least n occurrences; and {n,m} matches any number of occurrences between n and m. n and m must be between 0 and 255, inclusive.
\{n,m\}     Just like {n,m}, but with backslashes in front of the braces.
\( \)       Save the pattern enclosed between \( and \) into a special holding space. Up to nine patterns can be saved on a single line. The text matched by the subpatterns can be "replayed" in substitutions by the escape sequences \1 to \9.
\n          Replay the nth sub-pattern enclosed in \( and \) into the pattern at this point. n is a number from 1 to 9, with 1 starting on the left.
\< \>       Match characters at beginning (\<) or end (\>) of a word.
+           Match one or more instances of preceding regular expression.
?           Match zero or one instances of preceding regular expression.
|           Match the regular expression specified before or after.
( )         Apply a match to the enclosed group of regular expressions.

Many Unix systems allow the use of POSIX character classes within the square brackets that enclose a group of characters. These are typed enclosed in [: and :]. For example, [[:alnum:]] matches a single alphanumeric character.

Class     Characters matched
alnum     Alphanumeric characters
alpha     Alphabetic characters
blank     Space or TAB
cntrl     Control characters
digit     Decimal digits
graph     Nonspace characters
lower     Lowercase characters
print     Printable characters
space     Whitespace characters
upper     Uppercase characters
xdigit    Hexadecimal digits
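
A few of these classes in action with grep, as a rough sketch. The sample lines are invented, and the classes are assumed to be supported by the local grep, as they are on POSIX systems.

```shell
#!/bin/sh
# Sample input, made up for the demonstration; the third line is spaces only.
sample=$(printf 'abc\nABC123\n   \n0xFF\n')

# Lines containing at least one uppercase letter.
upper=$(printf '%s\n' "$sample" | grep '[[:upper:]]')

# Lines made up entirely of hexadecimal digits.
hex=$(printf 'ff\nzz\n1A\n' | grep '^[[:xdigit:]][[:xdigit:]]*$')

# Number of lines that are empty or contain only whitespace.
blank=$(printf '%s\n' "$sample" | grep -c '^[[:space:]]*$')

printf '%s\n' "$upper" "--" "$hex" "--" "$blank"
```

Note the doubled brackets: the outer pair is the ordinary bracket expression, the inner [: :] pair names the class.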

1.3.2.2 Replacement patterns

The characters in the following table have special meaning only in replacement patterns:

Character   Pattern
\           Turn off the special meaning of the following character.
\n          Restore the text matched by the nth pattern previously saved by \( and \). n is a number from 1 to 9, with 1 starting on the left.
&           Reuse the text matched by the search pattern as part of the replacement pattern.
~           Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern (ex and vi).
%           Reuse the previous replacement pattern in the current replacement pattern. Must be the only character in the replacement pattern (ed).
\u          Convert first character of replacement pattern to uppercase.
\U          Convert entire replacement pattern to uppercase.
\l          Convert first character of replacement pattern to lowercase.
\L          Convert entire replacement pattern to lowercase.
\E          Turn off previous \U or \L.
\e          Turn off previous \u or \l.
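
The replacement metacharacters that sed shares with the editors (\, \n, and &) can be tried directly from the shell. This is a small sketch with invented input strings.

```shell
#!/bin/sh
# & stands for the whole text matched by the search pattern.
wrapped=$(echo 'hello world' | sed 's/.*/( & )/')

# \1 and \2 restore the text saved by the first and second \( \) pairs.
swapped=$(echo 'key=value' | sed 's/\(.*\)=\(.*\)/\2=\1/')

# A backslash turns off the special meaning of & in the replacement.
literal=$(echo 'R and D' | sed 's/ and / \& /')

printf '%s\n' "$wrapped" "$swapped" "$literal"
```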

1.3.3 Metacharacters, Listed by Unix Program

Some metacharacters are valid for one program but not for another. Those that are available to a Unix program are marked by a bullet (•) in the following table. (This table is correct for SVR4 and Solaris and most commercial Unix systems, but it's always a good idea to verify your system's behavior.) Items marked with a "P" are specified by POSIX; double check your system's version. Full descriptions were provided in the previous section.

Symbol  ed  ex  vi  sed  awk  grep  egrep  Action
.       •   •   •   •    •    •     •      Match any character.
*       •   •   •   •    •    •     •      Match zero or more preceding.
^       •   •   •   •    •    •     •      Match beginning of line/string.
$       •   •   •   •    •    •     •      Match end of line/string.
\       •   •   •   •    •    •     •      Escape following character.
[ ]     •   •   •   •    •    •     •      Match one from a set.
\( \)   •   •   •   •         •            Store pattern for later replay.[1]
\n      •   •   •   •         •            Replay sub-pattern in match.
{ }                      •P         •P     Match a range of instances.
\{ \}   •           •         •            Match a range of instances.
\< \>   •   •   •                          Match word's beginning or end.
+                        •          •      Match one or more preceding.
?                        •          •      Match zero or one preceding.
|                        •          •      Separate choices to match.
( )                      •          •      Group expressions to match.

[1] Stored sub-patterns can be "replayed" during matching. See the examples in the next table.

Note that in ed, ex, vi, and sed, you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters listed in this table are meaningful only in a search pattern.

In ed, ex, vi, and sed, the following metacharacters are valid only in a replacement pattern:

Symbol  ex  vi  sed  ed  Action
\       •   •   •    •   Escape following character.
\n      •   •   •    •   Text matching pattern stored in \( \).
&       •   •   •    •   Text matching search pattern.
~       •   •            Reuse previous replacement pattern.
%                    •   Reuse previous replacement pattern.
\u \U   •   •            Change character(s) to uppercase.
\l \L   •   •            Change character(s) to lowercase.
\E      •   •            Turn off previous \U or \L.
\e      •   •            Turn off previous \u or \l.

1.3.4 Examples of Searching

When used with grep or egrep, regular expressions should be surrounded by quotes. (If the pattern contains a $, you must use single quotes; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by / although (except for awk), any delimiter works. Here are some example patterns:

Pattern           What does it match?
bag               The string bag.
^bag              bag at the beginning of the line.
bag$              bag at the end of the line.
^bag$             bag as the only word on the line.
[Bb]ag            Bag or bag.
b[aeiou]g         Second letter is a vowel.
b[^aeiou]g        Second letter is a consonant (or uppercase or symbol).
b.g               Second letter is any character.
^...$             Any line containing exactly three characters.
^\.               Any line that begins with a dot.
^\.[a-z][a-z]     Same as previous, followed by two lowercase letters (e.g., troff requests).
^\.[a-z]\{2\}     Same as previous; ed, grep and sed only.
^[^.]             Any line that doesn't begin with a dot.
bugs*             bug, bugs, bugss, etc.
"word"            A word in quotes.
"*word"*          A word, with or without quotes.
[A-Z][A-Z]*       One or more uppercase letters.
[A-Z]+            Same as previous; egrep or awk only.
[[:upper:]]+      Same as previous; POSIX egrep or awk.
[A-Z].*           An uppercase letter, followed by zero or more characters.
[A-Z]*            Zero or more uppercase letters.
[a-zA-Z]          Any letter, either lower- or uppercase.
[^0-9A-Za-z]      Any symbol or space (not a letter or a number).
[^[:alnum:]]      Same, using POSIX character class.

egrep or awk pattern    What does it match?
[567]                   One of the numbers 5, 6, or 7.
five|six|seven          One of the words five, six, or seven.
80[2-4]?86              8086, 80286, 80386, or 80486.
80[2-4]?86|Pentium      8086, 80286, 80386, 80486, or Pentium.
compan(y|ies)           company or companies.

ex or vi pattern    What does it match?
\<the               Words like theater, there, or the.
the\>               Words like breathe, seethe, or the.
\<the\>             The word the.

ed, sed, or grep pattern                 What does it match?
0\{5,\}                                  Five or more zeros in a row.
[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}         U.S. Social Security number (nnn-nn-nnnn).
\(why\).*\1                              A line with two occurrences of why.
\([[:alpha:]_][[:alnum:]_.]*\) = \1;     C/C++ simple assignment statements.
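
Several of these patterns can be exercised with grep on a handful of lines. The sample text below is invented for this sketch, and the variable names are only for the demonstration.

```shell
#!/bin/sh
# Sample input, made up for the demonstration.
text=$(printf 'bag\nBag of bugs\nhandbag\n.TH title\n123-45-6789\nwhy oh why\n')

# ^bag$ matches only a line consisting of exactly "bag".
only_bag=$(printf '%s\n' "$text" | grep -c '^bag$')

# [Bb]ag matches Bag or bag anywhere on the line.
any_bag=$(printf '%s\n' "$text" | grep -c '[Bb]ag')

# ^\. matches lines that begin with a literal dot.
dot_lines=$(printf '%s\n' "$text" | grep '^\.')

# \{n\} intervals: three digits, two digits, four digits.
ssn=$(printf '%s\n' "$text" | grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}')

# \(why\).*\1 needs "why" twice on the same line.
twice=$(printf '%s\n' "$text" | grep '\(why\).*\1')

# Alternation needs extended regular expressions (egrep, or grep -E).
alt=$(printf 'four\nsix\n' | grep -E 'five|six|seven')

printf '%s\n' "$only_bag" "$any_bag" "$dot_lines" "$ssn" "$twice" "$alt"
```

All patterns are in single quotes, so the shell passes them to grep untouched.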

1.3.4.1 Examples of searching and replacing

The following examples show the metacharacters available to sed or ex. Note that ex commands begin with a colon. A space is marked by a □; a TAB is marked by a →.

Command                Result
s/.*/( & )/            Redo the entire line, but add parentheses.
s/.*/mv & &.old/       Change a wordlist (one word per line) into mv commands.
/^$/d                  Delete blank lines.
:g/^$/d                Same as previous, in ex editor.
/^[□→]*$/d             Delete blank lines, plus lines containing only spaces or TABs.
:g/^[□→]*$/d           Same as previous, in ex editor.
s/□□*/□/g              Turn one or more spaces into one space.
:%s/□□*/□/g            Same as previous, in ex editor.
:s/[0-9]/Item &:/      Turn a number into an item label (on the current line).
:s                     Repeat the substitution on the first occurrence.
:&                     Same as previous.
:sg                    Same as previous, but for all occurrences on the line.
:&g                    Same as previous.
:%&g                   Repeat the substitution globally (i.e., on all lines).
:.,$s/Fortran/\U&/g    On current line to last line, change word to uppercase.
:%s/.*/\L&/            Lowercase entire file.
:s/\<./\u&/g           Uppercase first letter of each word on current line. (Useful for titles.)
:%s/yes/No/g           Globally change a word to No.
:%s/Yes/~/g            Globally change a different word to No (previous replacement).

Finally, here are some sed examples for transposing words. A simple transposition of two words might look like this:

s/die or do/do or die/

The real trick is to use hold buffers (the patterns saved by \( and \)) to transpose variable patterns. For example, to transpose using hold buffers:

s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/
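
Both substitutions can be run from the shell to compare their behaviour. This is a brief sketch; the input phrases are invented.

```shell
#!/bin/sh
# Fixed-string version: only the exact lowercase phrase is transposed.
fixed=$(echo 'die or do' | sed 's/die or do/do or die/')

# Saved-pattern version: whatever matched each \( \) pair is swapped,
# so the capitalization travels with the word.
swapped=$(echo 'Die or do' | sed 's/\([Dd]ie\) or \([Dd]o\)/\2 or \1/')
swapped2=$(echo 'die or Do' | sed 's/\([Dd]ie\) or \([Dd]o\)/\2 or \1/')

printf '%s\n' "$fixed" "$swapped" "$swapped2"
```

The fixed-string form would leave "Die or do" unchanged, which is why the saved patterns are needed once the input varies.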