Corning Community College


UNIX/Linux Fundamentals



Lab 0x9: Pattern Matching with Regular Expressions

Objective

To become familiar with Regular Expressions and their applications.

Reading

There are many documents available on Regular Expressions. Check out some of the following:

In “Harley Hahn's Guide to UNIX and Linux”, please read:

For anyone with access to the book “UNIX in a Nutshell, 3rd Edition”, Chapter 6 in the book deals with Regular Expressions on pages 295-301.

Additionally, the grep(1) manual page has a section dedicated to Regular Expressions.

Background

Back in the UNIX Shell lab, you were introduced to file wildcards: *, ?, and [ ]. As you should have experienced, this useful functionality of the UNIX shell allows for some fairly precise filename searching.

Now, with UNIX being as flexible as it is, a similar pattern-matching idea can also be applied to text processing, although the symbols do not always mean the same thing as they do in the shell. That's right! You can search through streams of text for occurrences of letters, specific words, or specific places in the text.

The Regular Expression symbols are as follows:

Symbol   Description of Functionality
.        Match any single character
*        Match 0 or more of the preceding character or expression
^        Beginning of line or string
$        End of line or string
[ ]      Character class: match any one of the enclosed characters
[^ ]     Negated character class: match any one character that is not enclosed
\<       Beginning of word
\>       End of word

NOTE: With character classes, you can specify ranges, such as the uppercase alphabet: [A-Z]

Sets of ranges are simply concatenated; no commas are needed. For example, both the lowercase alphabet and any numeric digit: [a-z0-9]

The first five are considered the basic Regular Expressions. All programs that support Regular Expressions will support these basic types.

If you include the ^ inside the character class, it will change the function to exclude any of the enclosed characters, for example:

[^abcd]

is a character class that matches any single character except a, b, c, or d. Note that those are lowercase letters, so only those letters are excluded; their uppercase counterparts will still match.
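To see how the symbols above behave, here is a minimal sketch that runs grep(1) against a small sample file. The file contents and variable names are made up for illustration and are not part of the lab.

```shell
# Build a small sample file (hypothetical data, not from the lab).
sample=$(mktemp)
printf '%s\n' 'cat' 'cot' 'coat' 'ct' 'Dog' 'the cat sat' > "$sample"

# .  matches any single character: c, exactly one character, then t
dot_matches=$(grep 'c.t' "$sample")

# *  matches zero or more of the preceding character: c, any number of o's, then t
star_matches=$(grep 'co*t' "$sample")

# ^ and $ anchor the pattern to the start and end of the line
anchored_matches=$(grep '^c.t$' "$sample")

# [^ ] matches one character that is NOT listed: lines whose first character is not c
negated_matches=$(grep '^[^c]' "$sample")

echo "c.t      matches:"; printf '%s\n' "$dot_matches"
echo "co*t     matches:"; printf '%s\n' "$star_matches"
echo "^c.t\$    matches:"; printf '%s\n' "$anchored_matches"
echo "^[^c]    matches:"; printf '%s\n' "$negated_matches"

rm -f "$sample"
```

Notice that 'c.t' also finds "the cat sat" because the pattern may match anywhere in the line, while the anchored version only accepts lines that consist of exactly three characters.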

Procedure

The grep(1) utility is extremely useful for text searching with Regular Expressions. We will be calling upon the capability of this tool quite often, so let us take a look at it:

1. Using the grep(1) utility on the /etc/passwd file, perform the following searches:
a. grep for the substring 'System' (note capitalization). What did you type on the command-line?
b. What does this search do?

As you can see, grep(1) can be used to search for literal text strings, but it can also be used to search based upon a pattern:

2. Using the grep(1) utility on the /etc/passwd file, perform the following search:
a. grep for the pattern '^[b-d][aeiou]'. What did you type on the command-line?
b. What does this search do?
c. How is this more powerful than just searching for a literal string?

And of course, the more practice you have, the better off you are.

3. Using the grep(1) utility on the /etc/passwd file, perform the following searches:
a. Search for all the lines starting with any of your initials (first or last). Be sure to include the command used and the matching lines.
b. Search for all the lines starting with r, followed by any lowercase vowel, and ending with an h. How did you do it? What were your results?

NOTE: Be sure to use quotes around the regular expression arguments to grep. This keeps the shell from treating regexp characters as shell wildcards. Single quotes are preferable, as they specify a literal string from the shell's perspective.
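The effect of quoting can be sketched as follows. This is an illustration only: the scratch directory, the file names data.txt and banana, and the variable names are all hypothetical.

```shell
# Work in a scratch directory so the demonstration is repeatable.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical files: a data file to search, plus one whose name matches the glob b*
printf '%s\n' 'bat' 'cab' 'dog' > data.txt
touch banana

# Unquoted: the shell expands b* to "banana" before grep ever runs,
# so grep searches data.txt for the literal text "banana" and finds nothing.
unquoted=$(grep b* data.txt || true)

# Quoted: grep receives the regular expression b* unchanged.
# "Zero or more b" matches every line, including "dog".
quoted=$(grep 'b*' data.txt)

echo "unquoted result: [$unquoted]"
echo "quoted result:"
printf '%s\n' "$quoted"

cd / && rm -rf "$workdir"
```

The same characters gave two different searches; only the quoted form passed the pattern to grep as written.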

The less pager can also be used with Regular Expressions. The on-line manual pages are set up to use the default pager, which should be less in most cases.

To search in less (and therefore the manual pages), use the forward slash /, followed by your search pattern.

Finally, we take a look again at the vi editor, which has some very powerful functionality when dealing with Regular Expressions. The substitute function, an ex command, can be quite useful.

The basic syntax is as follows:

:[address]s[/pattern/replacement/][options][count]

Where:

In the regex/ subdirectory of the UNIX Public Directory you will find a file called regex.html, which is a copy of lab #0, with some changes. Looking through this file, you will see several HTML tags. Having to make changes to this file could result in massive changes, so why worry about doing it by hand? Let Regular Expressions help!
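The substitute syntax can be tried outside the editor as well. The sketch below uses sed(1), which accepts the same s/pattern/replacement/ form, as a stand-in for the ex command so that it can run non-interactively; inside vi the comparable command would begin with a colon and an address such as %. The sample text and variable names are hypothetical and are not taken from regex.html.

```shell
# Hypothetical sample text (not the lab's regex.html).
input=$(mktemp)
printf '%s\n' 'colour one, colour two' 'path: /usr/local/bin' > "$input"

# s/pattern/replacement/g : the g option replaces every match on a line,
# not just the first. In vi this would look like  :%s/colour/color/g
spelling_fixed=$(sed 's/colour/color/g' "$input")

# A literal / inside the pattern or replacement must be escaped as \/ ...
escaped=$(sed 's/\/usr\/local/\/opt/' "$input")

# ... or a different delimiter can be chosen so that no escaping is needed.
alt_delim=$(sed 's#/usr/local#/opt#' "$input")

printf '%s\n' "$spelling_fixed"
printf '%s\n' "$escaped"

rm -f "$input"
```

Both the escaped and the alternate-delimiter forms produce the same result; which one reads better is a matter of taste.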

4. Do the following (be sure to show the substitution command used):
a. Oops! I made a typo! All the <center> tags are spelled British style as <centre>. Go ahead and correct this for all occurrences in the entire file.
b. The closing center tags are currently </CENTRE>, so go change them to </center>. Be sure to properly handle the /.
c. This file uses the old <b>-style boldness tags. We want to be fairly modern and use <strong> instead. So go ahead and get that all set.
d. Go ahead and make the appropriate changes to all the </b> tags to their corresponding </strong> counterparts.
e. No need to provide the updated file, just show me the substitution commands given in the first four parts.

Imagine you had a massive file in need of changes. Would you want to spend hours doing it all by hand? Or construct a simple RegEx pattern and have the computer do the work for you? THAT is the power of Regular Expressions.

5. Change into the /usr/share/dict directory and locate the 'words' file.
a. Do you see it? It is a symbolic link. Chase it down to its destination, show me what it is, and how you found it.
b. View this file… how does the file appear to be made up?
c. How many entries are in this file? Show me how you accomplished this.

Using this dictionary, I'd like for you to perform some searches, aided by Regular Expressions you construct. Be sure to show your pattern, as well as provide a count of how many words match your pattern.
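As a rough sketch of how a pattern and its count can be reported together, the example below uses a tiny hypothetical word list rather than the real dictionary. The -c option of grep(1) prints the number of matching lines instead of the lines themselves.

```shell
# Tiny hypothetical word list, one word per line like a dictionary file.
words=$(mktemp)
printf '%s\n' 'ring' 'singer' 'thing' 'Ingot' 'king' > "$words"

# Show the words that end in "ing" ...
ing_words=$(grep 'ing$' "$words")

# ... and count them with -c
ing_count=$(grep -c 'ing$' "$words")

printf '%s\n' "$ing_words"
echo "count: $ing_count"

rm -f "$words"
```

Because the file holds one word per line, the ^ and $ anchors effectively mark the start and end of each word.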

6. Construct RegEx according to the following criteria and show me what you typed, and show me how many words match your pattern:
a. All words exactly 5 characters in length
b. All words starting with any of your initials
c. All words starting with your first initial, having your middle initial occur somewhere after the first, and ending with your last initial.
d. All words that start and end with lowercase vowels.
e. All words that start with any of your initials, immediately followed by any lowercase vowel, and ending with the letters 'e', 's', or 't'
f. All words that do not start with any of your initials.
g. All words at least 3 characters in length that do not start with “th”
h. All 3 letter words that end in 'e'
i. All words that contain the substring “bob” but do not end with the letter 'b'
j. Only the words that start with the substring “blue”.
k. All the words that contain no vowels (consider 'Y' in all cases a vowel).
l. All the words that do not begin with a vowel, that can have anything for the second character, only 'a', 'b', 'c', or 'd' for the third character, and end with a vowel.

It is important to understand the nature of RegEx and the patterns they create. We will be using this knowledge when we wish to perform advanced searches, and in shell scripting. So be sure to ask any questions if you don't understand something.

Conclusions

This assignment has activities which you should tend to. While there is no formal submission required, it would be prudent to document some of the knowledge and experience gained on your Opus.

As always, the class mailing list and class IRC channel are available for assistance, but not answers.