Lab46 Wiki

Objective

To become familiar with the concepts of text filtering, and some of the UNIX utilities that are useful in this process.

Reading

Check out the manual pages for the following utilities:

cat(1) - concatenate files
cut(1) - cut text
grep(1) - globally search for regular expression and print
head(1) - print first “n” lines of output
sed(1) - stream editor
sort(1) - sort output
tail(1) - print last “n” lines of output
tr(1) - translate characters
uniq(1) - filter out duplicate lines from sorted file
wc(1) - word count

In “Harley Hahn's Guide to UNIX and Linux”, please read:

Chapter 16 (“Filters: Introduction and Basic Operations”, pages 373-394).
Chapter 17 (“Filters: Comparing and Extracting”, pages 395-420).
Chapter 18 (“Filters: Counting and Formatting”, pages 421-446).
Chapter 19 (“Filters: Selecting, Sorting, Combining, and Changing”, pages 447-496).

Background

Filtering is a big deal in many areas that deal with information processing. Say you've got a database of produce for a grocery store, and you want to view JUST the information regarding the banana shipments. Instead of sorting through the entire database and picking out the data you want manually, why not put all the data through a filter and simply view the pertinent data?

A filter, as defined on http://dictionary.reference.com/, is a program or routine that blocks access to data that meet a particular criterion.

For example: a Web filter that screens out vulgar sites.

UNIX provides some utilities that allow you to accomplish impressive amounts of filtering. When coupled with Regular Expressions, you can combine the power of pattern matching with your filter, adding considerable flexibility to your arsenal of tricks.

The next step is to apply shell scripts, which allow you to write “programs” that can take advantage of all the utilities and features available on a system. We will be looking at shell scripts another week. First we must get the foundations in place so we can better appreciate shell scripts.

So, some basics of filtering:

In order to do any sort of filtering, we need to know what we want to filter. Makes sense.

Before we employ filtering, we must have some clear idea about what we would like to filter, and how to safely maintain the data we wish to let through. (Filtering is no good if the data you are after gets damaged in the process).

The UNIX cat(1) utility is a general all-purpose tool that can be used to display the contents of text files. cat(1) also provides a number of other features that can be handy for debugging problems you may encounter with text files. (The -n and -e options can be particularly useful.)
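To see what those two options do, here is a minimal sketch. It assumes a cat(1) that supports -e (the GNU and BSD versions do) and uses invented sample text rather than any file from the lab.

```shell
# Hypothetical two-line sample, generated with printf so no file is needed.
# The second line has a trailing space that is invisible in plain output.
numbered=$(printf 'first line\nsecond line \n' | cat -n)
marked=$(printf 'first line\nsecond line \n' | cat -e)

echo "$numbered"   # each line is prefixed with its line number
echo "$marked"     # each line ends with a visible $ marker
```

The $ marker makes stray whitespace at the end of a line easy to spot, which is the sort of problem that otherwise stays hidden.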

Let's play with a sample database. In the filters/ subdirectory of the UNIX Public Directory you will find a file called “sample.db”. Copy it to your home directory.

Let's try some stuff out.

No filtering, or a filter that lets everything through

Display the contents:

lab46:~$ cat sample.db

This is the simplest form of filtering possible: none at all. All the data in the text file is passed to STDOUT.

Even at this stage we can do some useful things with the data. For example, if we wanted to find out how many lines were in the database:

lab46:~$ cat sample.db | wc -l

When you display the file with cat(1), the database appears on STDOUT in its entirety. You will notice the database is set up as follows:

name:sid:major:year:favorite candy

With this information we can make some important observations about the structure of the database:

fields are separated by a colon (:)
last entry on the line is followed by a star (*)

To be effective in filtering text, we must be aware of the structure of that text. The more we know about how the structure is set up, the better we can design a solution to the particular problem.
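One way to check a file's structure before filtering it is to count its lines and its separators. The sketch below is only an illustration: the file mini.db and every record in it are invented, and it borrows tr(1), which is covered later in this lab.

```shell
# Hypothetical three-line file in the same layout as sample.db;
# the names and values are invented for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/mini.db" <<'EOF'
name:sid:major:year:favorite candy
Ann Example:1001:Biology:Freshman:Lollipops*
Bob Example:1002:History:Senior:Gumdrops*
EOF

# Count the lines (header included).
lines=$(wc -l < "$workdir/mini.db")

# Count the separators on the header line: delete everything that is
# not a colon, then count the characters that remain.
colons=$(head -n 1 "$workdir/mini.db" | tr -cd ':' | wc -c)

echo "lines: $lines, separators on header: $colons"
rm -r "$workdir"
```

Four separators on a line means five fields, which matches the layout described above.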

keyword filtering

Ok, so let us filter some of this information:

Find all the students who are in Biology:

lab46:~$ cat sample.db | grep Biology

We can do more complicated searches too:

Find all the students who are in Biology AND like Lollipops:

lab46:~$ cat sample.db | grep Biology | grep Lollipops
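The chaining above works because each grep(1) only sees the lines the previous one let through. A small sketch on invented records (not the contents of sample.db) shows the effect:

```shell
# Hypothetical records in the name:sid:major:year:candy* layout.
records='Ann Example:1001:Biology:Freshman:Lollipops*
Bob Example:1002:Biology:Senior:Gumdrops*
Cal Example:1003:History:Junior:Lollipops*'

# First filter keeps Biology lines; second keeps, of those, Lollipops lines.
both=$(printf '%s\n' "$records" | grep Biology | grep Lollipops)

# -c makes grep print a count of matching lines instead of the lines.
count=$(printf '%s\n' "$records" | grep -c Biology)

echo "$both"
echo "Biology lines: $count"
```

Only the first record survives both filters, since it is the only one that matches both keywords.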

1.		Perform the following searches on the database:
	a.	Find all the students that are a Freshman
	b.	Same as above but in alphabetical order
	c.	Any duplicate entries? Remove any duplicates.
	d.	Using the wc(1) utility, how many matches did you get?

Be sure to give me the command-line incantations you came up with, and any observations you made.

filter for manipulation

So we've done some simple searches on our database. We've filtered the output to get desired values. But we don't have to stop there. Not only can we filter the text, we can manipulate it to our liking.

The cut(1) utility lets us literally cut columns from the output.

It relies on a thing called a field-separator, which will be used as a logical separator of the data.

Using the “-d” argument to cut, we can specify the field separator in our data. The “-f” option then selects which fields to display, counting them according to the established field separator.

So, looking at the following text:

hello there:this:is:a:bunch of:text.

Looking at this example, we can see that “:” would make for an excellent field separator.

With “:” as the field separator, the logical structure of the above text can be represented as follows:

Field 1	Field 2	Field 3	Field 4	Field 5	Field 6
hello there	this	is	a	bunch of	text.

We can test these properties out by using cut(1) on the command-line:

lab46:~$ echo "hello there:this:is:a:bunch of:text." | cut -d":" -f#

Where # is a specific field or range of fields (i.e., -f2 or -f2,4 or -f1-3).
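As a quick illustration of the single-field and range forms, here is a sketch on a made-up record (the string is invented and is not part of the lab data):

```shell
# Hypothetical colon-separated record, invented for illustration.
record="apple:red:sweet:fall"

second=$(echo "$record" | cut -d":" -f2)          # a single field
first_three=$(echo "$record" | cut -d":" -f1-3)   # a range of fields

echo "$second"        # red
echo "$first_three"   # apple:red:sweet
```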

2.		Let's play with the cut(1) utility:
	a.	What would the following command-line display: echo "hello there:this:is:a:bunch of:text." | cut -d":" -f3
	b.	If you wanted to get “hello there text.” to display to the screen, what manipulation to the text would you have to do?
	c.	Did your general attempt work? Is there extra information?

If you found that extra information showed up when you tried to do that last part, taking a closer look will show why:

Whenever cut(1) displays more than one field, it inserts the field separator between them to indicate the separation, even when the fields were not next to one another in the original text.

So how do you keep this functionality while still getting the exact data you seek? Well, nobody said we could only apply one filter to text.

The Stream Editor - sed

Remember back when we played with vi/vim? Remember that useful search and replace command:

:%s/regex/replacement/g

That was quite useful. And luckily, we've got that same ability on the command line. Introducing “sed(1)”, the stream editor.

sed provides some of the features we've come to enjoy in vi, and is for all intents and purposes a non-interactive editor. Its most useful ability, however, is editing data streams (that is, data arriving on STDIN, including the output generated by our command lines).

Perhaps the most immediately useful command found in sed will be its search and replace, which is pretty much just like the vi/vim variant:

sed -e 's/regex/replacement/g'

However, if you look closely, you will see that we did not include any sort of file to operate on. While we can, one of the other common uses of sed is to pop it in a command-line with everything else, stuck together with the all-powerful pipe (|).

For example, to solve the above problem with the field separator:

lab46:~$ echo "hello there:this:is:a:bunch of:text." | cut -d":" -f1,6 | sed -e 's/:/ /g'

We used sed to replace any occurrence of the “:” with a single space.
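The trailing g in that expression matters. A minimal sketch on an invented string shows the difference it makes:

```shell
# Without the trailing g, only the first match on each line is replaced.
first_only=$(echo "a:b:c" | sed -e 's/:/ /')

# With g, every match on the line is replaced.
every=$(echo "a:b:c" | sed -e 's/:/ /g')

echo "$first_only"   # a b:c
echo "$every"        # a b c
```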

3.		Answer me the following:
	a.	Does the above command-line fix the problem from #2c?
	b.	If you wanted to change all “t”'s to uppercase “T”'s in addition to that, what would you do?
	c.	If you wanted to replace all the period symbols in the text with asterisks, how would you do it?
	d.	What does the resulting output look like?

From head(1) to tail(1)

Two other utilities you may want to become acquainted with are the head(1) and tail(1) utilities.

head(1) will allow you to print a specified number of lines from 1 to n. So if you needed to print, say, the first 12 lines of a file, head(1) will be a good bet.

For example, to display the first 12 lines of our sample database:

lab46:~$ head -12 sample.db

And, of course, head(1) can be added onto an existing command line using the pipe. In this example, we display the first two results of all the *ology majors:

lab46:~$ cat sample.db | grep "ology" | head -2

See where we're going with this? We can use these utilities to put together massively powerful command-line incantations that create all sorts of interesting filters.

tail(1) works from the opposite end, starting at the end of the file and working backwards towards the beginning. So if you wanted to display the last 8 lines of a file, for example, tail(1) would be the tool to use. tail(1) also has the nifty ability to continually monitor a file and update its output should the source file change. This is useful for monitoring log files that are continually updated.
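A short sketch of both utilities on invented input follows. It uses the -n form of the line count, which does the same job as the bare -12 style shown earlier. The log path in the final comment is a placeholder, not a real file on the system.

```shell
# Hypothetical five-line input generated with printf.
last_two=$(printf '%s\n' one two three four five | tail -n 2)
first_two=$(printf '%s\n' one two three four five | head -n 2)

echo "$last_two"    # four, then five
echo "$first_two"   # one, then two

# To follow a growing file, tail offers -f; the path here is a
# placeholder, and the command keeps running until interrupted (Ctrl-C):
#   tail -f /path/to/some.log
```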

Translating characters with tr

This is another useful tool to be familiar with. With tr(1), you can substitute any character, or any set of characters, with another, one character at a time. The nice thing is that you can quickly use it to do end-of-line character translations, useful in converting text files from DOS format to UNIX or Mac format (or any combination thereof).

ASCII file line endings

An important thing to be aware of is how the various systems terminate their lines. Check the following table:

System	Line Ending Character(s)
DOS	Carriage Return, Line Feed (CRLF)
Mac	Carriage Return (CR)
UNIX	Line Feed (LF)

So what does this mean to you? Well, if you have a file that was formatted with Mac-style line endings, and you're trying to read that file on a UNIX system, you may notice that everything appears as a single line at the top of the screen. This is because the Mac uses just Carriage Return to terminate its lines, and UNIX uses just Line Feeds… so the two are drastically incompatible for standard text display reasons.
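When it is unclear which convention a file uses, dumping its bytes can help. The sketch below assumes od(1) is available (it is a standard utility) and uses invented two-line samples rather than the lab's files.

```shell
# Hypothetical two-line samples in each convention, built with printf.
dos_dump=$(printf 'one\r\ntwo\r\n' | od -c)
unix_dump=$(printf 'one\ntwo\n' | od -c)

echo "$dos_dump"    # each line ending shows up as \r followed by \n
echo "$unix_dump"   # each line ending shows up as \n only
```

od -c prints the Carriage Return and Line Feed as the visible escapes \r and \n, so the line-ending style can be read straight off the dump.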

For example, let's say we have a UNIX file we wish to convert to Mac format. We would need to convert every terminating Line Feed to a Carriage Return. We would do something that looks like this:

lab46:~$ tr "\n" "\r" < file.unix > file.mac

To interpret this:

\n is the special escape sequence that we're all familiar with. In C, you can use it to issue an end-of-line character. So in UNIX, this represents a Line Feed (LF).

\r is the special escape sequence that corresponds to a Carriage Return (CR).

The first argument is the set of characters to look for. The second is the set of characters to replace them with, matched up position by position (in this case, replace every LF with a CR). Because tr(1) works one character at a time, it cannot by itself turn a single LF into the two-character CRLF combination that DOS expects (where the Carriage Return comes first and then the Line Feed). A conversion that adds or removes characters needs tr's -d option or help from another filter such as sed(1).

Then, using UNIX I/O redirection operations, file.unix is redirected as input to tr(1), and file.mac is created and will contain the output.
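tr(1) can also delete characters outright with its -d option, which is handy when a conversion needs to drop a character rather than swap it. A minimal sketch on invented DOS-style text (not one of the lab's files):

```shell
# Hypothetical DOS-style text built with printf: each line ends in CR LF.
# -d tells tr to delete every character in the given set, here the CR,
# leaving just the LF that UNIX expects.
unix_text=$(printf 'one\r\ntwo\r\n' | tr -d '\r')
echo "$unix_text"

# Confirm no CR remains: keep only CR characters, then count them.
cr_count=$(printf '%s\n' "$unix_text" | tr -cd '\r' | wc -c)
echo "carriage returns left: $cr_count"
```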

In the filters/ subdirectory of the UNIX Public Directory you will find some text files in DOS, Mac, and UNIX format.

4.		Let's do some tr(1) conversions:
	a.	Convert file.mac to UNIX format. Show me how you did this, as well as any interesting messages you find inside.
	b.	Convert readme.unix to DOS format. Same deal as above.
	c.	Convert dos.txt to Mac format. Show me the command-line used.

Procedure

Looking back on our database (sample.db in the filters/ subdirectory of the UNIX Public Directory), let's do some more operations on it:

5.		Develop, explain, and show me the command-lines for the following:
	a.	How many unique students are there in the database?
	b.	How many unique majors are there in the database?
	c.	How many unique “favorite candies” in the database? (remove any trailing asterisks from the output)

HINT: sort them in alphabetical order, and make sure there are no duplicates. Also- make sure you don't count the title banner as a “student”. Also be sure to either omit the header, or have the header at the top of any provided output.

6.		Using the pelopwar.txt file from the grep/ subdirectory of the UNIX Public Directory, construct filters to do the following:
	a.	Show me the first 22 lines of this file. How did you do this?
	b.	Show me the last 4 lines of this file. How did you do this?
	c.	Show me lines 32-48 of this file. How did you do this? (HINT: the last 17 lines of the first 48)
	d.	Of the last 12 lines in this file, show me the first 4. How did you do this?

Being familiar with the commands and utilities available to you on the system greatly increases your ability to construct effective filters, and ultimately solve problems in a more efficient and creative manner.

Conclusions

This assignment has activities which you should tend to: document and summarize the knowledge learned on your Opus.

As always, the class mailing list and class IRC channel are available for assistance, but not answers.
