\\
Corning Community College
\\
UNIX/Linux Fundamentals
\\
\\
Lab 0xA: Pattern Matching with Regular Expressions
\\
\\

~~TOC~~

======Objective======
To become familiar with Regular Expressions and their applications.

======Reading======
There are many documents available on Regular Expressions. Check out some of the following:

  * Article on Wikipedia: [[http://en.wikipedia.org/wiki/Regular_Expressions|Regular Expressions]]
  * Nice overview: [[http://www.cs.colorado.edu/~schenkc/UNIX_Regular_Expressions.pdf|A Tao of Regular Expressions, by Steve Mansour]]
  * The ultimate RegEx website: [[http://www.regular-expressions.info/|Regular-Expressions.info]] (be sure to explore this site; it has a lot of excellent resources)
  * Regular Expression Tutorial: [[http://www.regular-expressions.info/tutorial.html|http://www.regular-expressions.info/tutorial.html]]

In "Harley Hahn's Guide to UNIX and Linux", please read:

  * Chapter 20, "Regular Expressions", on pages 497-519.

For anyone with access to the book "UNIX in a Nutshell, 3rd Edition", Chapter 6 deals with Regular Expressions on pages 295-301.

Additionally, the **grep**(**1**) manual page has a section dedicated to Regular Expressions.

======Background======
Back in the UNIX Shell lab, you were introduced to file wildcards: *****, **?**, and **[ ]**. As you should have experienced, this useful functionality of the UNIX shell allows for some fairly precise filename searching.

Now, with UNIX being as flexible as it is, a similar "wildcard" idea can also be applied to text processing. That's right! You can also search through streams of text for occurrences of letters, specific words, or specific places in the text. Be aware that the symbols do not always mean the same thing in both places: to the shell, ***** matches any run of characters by itself, while in a Regular Expression it repeats whatever precedes it. The basic **Regular Expression** symbols are as follows:

^ Regular Expression Symbol ^ Description of Functionality ^
| .  | Match any single character |
| *  | Match 0 or more of the preceding item |
| %%^%%  | Beginning of line or string |
| $  | End of line or string |
| [ ]  | Character class - match any one of the enclosed characters |
| %%[^ ]%%  | Negated character class - match any one character that is not enclosed |
| \<  | Beginning of word |
| \>  | End of word |

NOTE: With character classes, you can specify ranges, such as the uppercase alphabet: **[A-Z]**

Sets of ranges are concatenated; there is no need for commas. For example, both the lowercase alphabet and any numeric digit: **[a-z0-9]**

The first five are considered the basic Regular Expressions. All programs that support Regular Expressions will support these basic types.

If **^** is the first character inside the character class, it changes the function to exclude the enclosed characters, for example: **[^abcd]**

This is a character class that will not match any of **a**, **b**, **c**, or **d**. Note that those are lowercase letters, so it will only exclude those letters, and not their uppercase counterparts.

======Procedure======
The **grep**(**1**) utility is extremely useful for text searching with Regular Expressions. We will be calling upon the capability of this tool quite often, so let us take a look at it:

^ 1. ^|Using the **grep**(**1**) utility in the **/etc/passwd** file, perform the following searches:|
| ^ a.|grep for the substring 'System' (note capitalization). What did you type on the command-line?|
|:::^ b.|What does this search do?|

As you can see, **grep**(**1**) can be used to search for literal text strings, but it can also be used to search based upon a pattern:

^ 2. ^|Using the **grep**(**1**) utility in the **/etc/passwd** file, perform the following search:|
| ^ a.|grep for the pattern %%'^[b-d][aeiou]'%%. What did you type on the command-line?|
|:::^ b.|What does this search do?|
|:::^ c.|How is this more powerful than just searching for a literal string?|

And of course, the more practice you have, the better off you are.
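Before moving on, it can help to try the symbols from the Background table on a small file whose contents you already know. The following is only a sketch: the sample words, the temporary file, and the variable names are made up for illustration, and the patterns are deliberately not the ones asked for in the exercises. **LC_ALL=C** is set on the assumption that you want **[A-Z]** to mean strictly uppercase ASCII letters.

```shell
# Use the C locale so that ranges such as [A-Z] cover only ASCII uppercase.
export LC_ALL=C

# Build a throwaway sample file, one made-up word per line.
sample=$(mktemp)
printf '%s\n' at cat cot coot Cat scat dog > "$sample"

# .  matches any single character: a 'c', any one character, then a 't'
grep 'c.t' "$sample"        # matches: cat, cot, scat

# ^ and $ anchor the pattern to the start and end of the line
grep '^c.t$' "$sample"      # matches: cat, cot

# *  repeats the preceding item zero or more times (here: zero or more 'c')
grep '^c*at$' "$sample"     # matches: at, cat

# [ ] matches any one of the enclosed characters (here: a range)
grep '^[A-Z]' "$sample"     # matches: Cat

# [^ ] matches any one character that is NOT enclosed
grep '^[^cC]' "$sample"     # matches: at, scat, dog

# -c prints a count of matching lines instead of the lines themselves
anchored_count=$(grep -c '^c.t$' "$sample")
negated_count=$(grep -c '^[^cC]' "$sample")
echo "anchored: $anchored_count, negated: $negated_count"

# Clean up the temporary file.
rm -f "$sample"
```

Note that every pattern is wrapped in single quotes, so the shell hands it to **grep** untouched instead of trying to expand it as a filename wildcard.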
^ 3. ^|Using the **grep**(**1**) utility in the **/etc/passwd** file, perform the following search:|
| ^ a.|Search for all the lines starting with any of your initials (first or last). Be sure to include the command used and the matching lines.|
|:::^ b.|Search for all the lines starting with r, followed by any lowercase vowel, and ending with an h. How did you do it? What were your results?|

**__NOTE__**: Be sure to use quotes around regular expression arguments to grep. Quoting keeps the shell from treating regexp symbols as shell wildcards. Single quotes are preferable, since the shell then passes the string through literally.

The **less** pager can also be used with Regular Expressions. The on-line manual pages are set up to use the default pager, which should be **less** in most cases. To search in **less** (and therefore the manual pages), type a forward slash **/**, followed by your search pattern.

Finally, we take another look at the **vi** editor, which has some very powerful functionality when dealing with Regular Expressions. The substitute function, an **ex** command, can be quite useful. The basic syntax is as follows:

  :[address]s[/pattern/replacement/][options][count]

Where:

  * //address// selects the lines to operate on: **%** for the entire file, or any other valid addressing scheme used in vi.
  * //pattern// is the text you are searching for (which can include regular expressions)
  * //replacement// is the text that will replace the text found by pattern
  * and finally, //options// can be any of c, g, or p (c = confirm, g = global, p = print).

In the **regex/** subdirectory of the UNIX Public Directory you will find a file called **regex.html**, which is a copy of lab #0, with some changes. Looking through this file, you will see several HTML tags. Updating this file could mean a massive number of edits, so why worry about doing it by hand? Let Regular Expressions help!

^ 4. ^|Do the following (be sure to show the substitution command used):|
| ^ a.|Oops! I made a typo! All the %%<center>%% tags are spelled British style as %%<centre>%%. Go ahead and correct this for all occurrences in the entire file.|
|:::^ b.|The closing center tags are currently %%</centre>%%, so go change them to %%</center>%%. Be sure to properly handle the **/**.|
|:::^ c.|This file uses the old %%<b>%%-style boldness tags. We want to be fairly modern and use %%<strong>%% instead. So go ahead and get that all set.|
|:::^ d.|Go ahead and make the appropriate changes to all the %%</b>%% tags to their corresponding %%</strong>%% counterparts.|
|:::^ e.|No need to provide the updated file, just show me the substitution commands given in the first four parts.|

Imagine you had a massive file in need of changes. Would you want to spend hours doing it all by hand? Or construct a simple RegEx pattern and have the computer do the work for you? THAT is the power of Regular Expressions.

^ 5. ^|Change into the **/usr/share/dict** directory and locate the '**words**' file.|
| ^ a.|Do you see it? It is a symbolic link. Chase it down to its destination, show me what it is, and how you found it.|
|:::^ b.|View this file... how does the file appear to be organized?|
|:::^ c.|How many entries are in this file? Show me how you accomplished this.|

Using this dictionary, I'd like for you to perform some searches, aided by Regular Expressions you construct. Be sure to show your pattern, as well as provide a count of how many words match your pattern.

^ 6. ^|Construct a RegEx for each of the following criteria. Show me what you typed and how many words match your pattern:|
| ^ a.|All words exactly 5 characters in length|
|:::^ b.|All words starting with any of your initials|
|:::^ c.|All words starting with your first initial, having your middle initial occur somewhere after the first, and ending with your last initial.|
|:::^ d.|All words that start and end with lowercase vowels.|
|:::^ e.|All words that start with any of your initials, immediately followed by any lowercase vowel, and ending with the letters '**e**', '**s**', or '**t**'|
|:::^ f.|All words that do not start with any of your initials.|
|:::^ g.|All words that are at least 3 characters in length and do not start with "**th**"|
|:::^ h.|All 3 letter words that end in '**e**'|
|:::^ i.|All words that contain the substring "**bob**" but do not end with the letter '**b**'|
|:::^ j.|Only the words that start with the substring "**blue**".|
|:::^ k.|All the words that contain no vowels (consider '**Y**' in all cases a vowel).|
|:::^ l.|All the words that do not begin with a vowel, that can have anything for the second character, only '**a**', '**b**', '**c**', or '**d**' for the third character, and end with a vowel.|

It is important to understand the nature of RegEx and the patterns they create. We will be using this knowledge when we wish to perform advanced searches, and in shell scripting. So be sure to ask questions if you don't understand something.

======Conclusions======
All questions in this assignment require an action or response. Please organize your responses into an easily readable format and submit the final results to your instructor.

Your assignment is expected to be performed and submitted in a clear and organized fashion; messy or unorganized assignments may have points deducted. Be sure to adhere to the submission policy.

The successful results of the following actions will be considered for evaluation:

  * your responses to questions submitted at the following form: http://lab46.corning-cc.edu/haas/content/unix/submit.php?laba
  * the response from the form (received via e-mail) saved as **laba.txt** to your **~/src/unix/** directory
  * addition/commit of **~/src/unix/laba.txt** into your repository (CS 0x0 sets you up to do this).

As always, the class mailing list and class IRC channel are available for assistance, but not answers.