[Linux for Newbies]

Linux for Newbies, Part 11:
Regular Expressions

by Gene Wilburn

(The Computer Paper, June 2000. Copyright © Wilburn Communications Ltd. All rights reserved)


In this beginner's series, based on Red Hat Linux 6, we've progressed through a set of building blocks designed to make you more comfortable with your new Linux environment. In the last installment we explored the various ways to locate things on a GNU/Linux system. Some of the examples in that installment used wildcards familiar to most Windows, OS/2 and DOS users, such as "*.html" to match all files that end in ".html".

In this installment we'll take wildcards and their ilk to another level. Most Linux programs and utilities provide a feature called "regular expressions", a mild-sounding term for something that is rich in scope and deep in applicability. Regular expressions--often shortened to "regex" or "regexp"--provide powerful pattern matching and text manipulation capabilities.

Regular expressions are integrated into grep, Perl, Tcl, awk, Python, vi, Emacs, sed and Expect, to name a few popular Unix programs that use them. Regexes are also supported in several Windows programming products, including Delphi and Visual C++. The reason for the popularity of regular expressions is that they provide a logical tool for manipulating text and data. They can make complex tasks simple. Learning how to formulate regular expressions is an important step on the way to becoming a Linux power user.

Simple Regular Expressions

Let's start with some simple examples and work our way up. As you know by now, Linux is a case-sensitive operating system. This means that "Index.html" and "index.html" are two completely different files. Let's say you've transferred some HTML files from a Windows environment to a Linux-based Apache server. Due to eccentricities of Windows operating systems, you may have a mixed bag of upper and lower case filenames when the files arrive. Furthermore, you may have some files on your Web site that end in ".html" and some that end in ".htm" because some of the pages were created with Windows products that retrofit filenames back to DOS 8.3 limitations.

So let's assume that, initially, you're looking for a transferred file called "default.htm", except that it might be "Default.htm". To do a listing that would catch either variant of the filename, you can type the following:


$ ls -l [Dd]efault.htm

The "[Dd]" is a Unix-style wildcard, meaning that for the first character in the filename, match any character in this character set (delimited by brackets). It will find both "Default.htm" and "default.htm".

Now let's assume you're in a large HTML directory where you suspect a mix of .htm and .html files resides, and you would like to see what they are. Listing these files, and only these, is not as simple as the previous example. Using wildcards, you could type:


$ ls *.htm*
Temp.html      temp.htm       temp.html.bak  temp.html~

But as you can see, this includes backup files as well. Not only that, it doesn't catch any imported files that end in .HTM (upper case). If, instead of wildcarding the "ls" command, we pipe the ls output through egrep (a member of the grep family), we can set up a regular expression that limits what is displayed to just source Web pages and no backup files. Here's one way of doing it:


$ ls * | egrep -i ".html?$"
TEMP2.HTM
Temp.html
temp.htm

This gives us exactly what we wanted, so let's examine this command more closely to see what it's doing. First we ran ls against everything, with the wildcard "*". We piped the output of ls to egrep. In the grep family, regular expression support comes in two levels, basic and extended. Grep supports basic regex while egrep supports extended regex, which we required for this example.

We ran egrep with the "-i" flag, meaning "ignore case", allowing us to match both upper- and lower-case filenames. Then we come to the pattern itself. The ".htm" part of the pattern matches anything that contains ".htm" (ignoring case). This is followed by "l?". In "regex speak" a question mark ("?") is a special character, meaning match the preceding character zero or one time. In other words, the match is optional. By this logic, filenames ending in either "htm" or "html" qualify.

If we had stopped there we would have once again listed all the backup files, so we needed one additional piece of logic: the dollar sign ("$"). The dollar sign is a special character signifying the end of a line or string. It's called an anchor character. (Its opposite, anchoring the beginning of a line or string, is the caret symbol "^".) Hence the simple pattern ".html?$" packs a neat bit of logic into a few characters: it matches a pattern with an optional character, but requires that the pattern fall at the end of the line. A file such as "temp.html.bak" does not qualify because the "html" part of its filename is not at the end of the string.
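
One refinement worth knowing: the leading dot in ".html?$" is itself a metacharacter that matches any single character (see Fig. 1 below). To insist on a literal period, escape it with a backslash:


$ ls * | egrep -i "\.html?$"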

Regular Expression Metacharacters

Regular expressions use a fairly standardized set of metacharacters that work from one program to another. They're worth memorizing because they come in handy over and over. Once you learn them, you can move on to advanced regular expressions. (See fig. 1)


Fig. 1 -- Regular Expression Metacharacters

Basic regular expression metacharacters

.       any one character
[...]   any character listed in a character class
[^...]  any character NOT listed in the class
^       beginning of line anchor
$       end of line anchor
\<      start of word anchor
\>      end of word anchor
*       unlimited optional matches on preceding, no match required
\       escape (used before a metacharacter to match a literal)

Extended regular expression metacharacters

?       one optional match on preceding, no match required
+       one match on preceding required, unlimited allowed
|       or bar ("or" logic separating expressions)
()      parentheses (limits scope of | "or bar")


Unfortunately there are a few minor variations in regex symbols from program to program, but the expressions above work in most Linux tools. Grep supports the basic characters while egrep adds the extended symbols.
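
To see the extended symbols in action, here's a quick illustration, using hypothetical image files, of the or bar and parentheses. The "jpe?g" part matches both "jpg" and "jpeg":


$ ls * | egrep -i "\.(gif|jpe?g|png)$"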

The variations of regular expressions are documented in Mastering Regular Expressions, Jeffrey E.F. Friedl, O'Reilly & Associates (ISBN 1-56592-257-3, $42.95). This book is a masterpiece--one I would highly recommend for your technical library, especially if you write Perl scripts.

Practicing Regular Expressions

The best way to get started learning regular expressions is to use grep and egrep to poke through text files. Because the greps are read-only, you can experiment at will. You can do no inadvertent damage.

A typical use for the greps is fishing for patterns in log files. I'll demonstrate with an example from my home Linux gateway server that connects me to the Internet.

Because I'm on a cable modem, I like to monitor for unauthorized access attempts on my server. I can use a simple grep to scan for the word "refused" in /var/log/secure:


# grep refused /var/log/secure
Apr 10 18:40:14 cr123456-a in.telnetd[24748]: refused connect from
cr654321-a.wlfdle1.on.wave.home.com

As you can see, one of my network neighbours attempted to telnet into my system. I don't have telnet running, but the system logged the attempt. (In the interests of security I've altered both hostnames in this real example.)
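
If all you want is a tally, grep's -c flag prints a count of matching lines instead of the lines themselves. On the log above, it would print 1:


# grep -c refused /var/log/secure
1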

This kind of grep is simple and requires no regular expressions. Now let's move on to an example that employs a regex. I have all my "Linux for Newbies" columns in the directory ~/docs/linux/newbies. Using egrep I can find out how many lines, in all the columns to date, contain the word "the". At the same time, I don't want to match words like "theatre" or "them" or "these", but I do want to include any instances of "The" that might start a sentence.

First I'll create the following expression and test it visually:


$ egrep "\<[Tt]he\>" ~/docs/linux/newbies/newb*.txt
...
newb011.txt:The best way to get started learning regular expressions
newb011.txt:A typical use for the greps is fishing
...

I then pipe the results through wc to get a line count:


$ egrep "\<[Tt]he\>" ~/docs/linux/newbies/newb*.txt | wc -l
    668

If I changed the expression to "[Tt]h.n", looking for "then", "than", or "thin", I would also snag words like "something". By adding word anchors, I narrow the scope of the expression to the words I'm after: "\<[Tt]h.n\>".
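
Run against the same files, the anchored version counts only whole words (the total will vary with your own text):


$ egrep "\<[Tt]h.n\>" ~/docs/linux/newbies/newb*.txt | wc -l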

Regular Expressions in Scripts

While regular expressions are extremely handy to run right from the command line for one-off searches, they're even more useful inside programs and scripts. In a previous column we used the df command to check on disk usage. On one of my home Linux servers, a df returns the following:


$ df -h
Filesystem            Size  Used  Avail  Capacity Mounted on
/dev/hda4             701M  112M   553M     17%   /
/dev/hda1              78M  1.1M    73M      1%   /boot
/dev/hdb2             1.6G  516M   1.0G     32%   /home
/dev/hda3              97M   10K    92M      0%   /tmp
/dev/hdb1             1.3G  843M   404M     68%   /usr

All operating systems get cranky if you run out of disk space, so you may want to create a system script that watches for a certain capacity threshold and warns you when disk usage goes beyond it.

I'll use awk and regular expressions to build a watchdog script. Awk, a predecessor of Perl, is a useful scripting tool. What awk does, by default, is parse lines into fields. Hence in the "df -h" example, the output breaks down into six fields, separated by white space. In awk, those fields can be referenced by the symbols $1, $2, $3, $4, $5, and $6.

To show how this works, you can type the following on the command line:


$ df -h | awk '{print $5 "\t" $1}'
Capacity        Filesystem
17%     /dev/hda4
1%      /dev/hda1
32%     /dev/hdb2
0%      /dev/hda3
68%     /dev/hdb1

Don't worry about the awk syntax for now. We'll touch on that in a later column. Notice how we used awk to rearrange the output. Now let's add a regular expression that matches anything 50% or greater:

$ df -h | awk '$5 ~ "[5-9].%" {print $5 "\t" $1}'     
68%     /dev/hdb1

The tilde symbol "~" in awk means match a regular expression. What we're saying on this line is that if field $5 (Capacity) contains a digit from 5 through 9, followed by any single character (the dot, which here matches the second digit), followed by a percent sign ("%"), then print field 5, print a tab, and print field 1.
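
If you prefer a stricter pattern, the anchors from Fig. 1 work inside awk as well. This variant insists that the field be exactly two digits, the first between 5 and 9, followed by the percent sign:


$ df -h | awk '$5 ~ /^[5-9][0-9]%$/ {print $5 "\t" $1}'
68%     /dev/hdb1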

If we wanted only to scan for file systems that were 80% or more full, we'd change [5-9] to [8-9].
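
On the sample system shown above, no filesystem has reached 80%, so the tightened test would print nothing:


$ df -h | awk '$5 ~ "[8-9].%" {print $5 "\t" $1}'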

What this logic would not pick up, however, is any file system that is 100% full. To get that we need to add another piece to the expression:


... $5 ~ "[5-9].%" || $5 == "100%" ...

The double pipe symbols mean "or" in awk and the double equal signs mean "equals".
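
Putting the pieces together, the complete command-line test reads:


$ df -h | awk '$5 ~ "[5-9].%" || $5 == "100%" {print $5 "\t" $1}'
68%     /dev/hdb1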

Once this works to your satisfaction at the command line, write the command to a file and turn it into a script. Call it dfcheck and add it to /usr/local/bin. Remember to chmod the script so it's executable:


# chmod a+x /usr/local/bin/dfcheck
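
For reference, the dfcheck file itself can be a minimal sketch like this (the first line tells the system which shell should interpret the script):


#!/bin/sh
# dfcheck -- report any filesystem at 50% capacity or more
df -h | awk '$5 ~ "[5-9].%" || $5 == "100%" {print $5 "\t" $1}'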

You, as root, could now create another script, based on dfcheck, called something like dfcheck.daily and put it into your /etc/cron.daily directory. You can have it email the results to you on a daily basis. The new script is a one-liner:


/usr/local/bin/dfcheck | mail -s "Daily disk usage check" me@somewhere

Change "me@somewhere" to a real email address.

That's a quick introduction to regular expressions, and an even quicker introduction to script writing. Regex formulation is a useful, enabling skill. The more adept you become at creating regex patterns, the more confident you become at writing scripts and Perl programs.

I will admit, though, that it's difficult to explain to family and non-computing friends that you're studying "regular expressions." Get used to the odd looks you'll get every time you mention it.

Next time: building a Linux home network

Gene Wilburn (gene@wilburn.ca) is a Toronto-based IT specialist, musician and writer who operates a small farm of Linux servers.

-30-