haas:spring2014:unix:labs:laba [2014/03/23 19:08] – removed wedge | haas:spring2014:unix:labs:laba [2014/03/23 19:10] (current) – wedge | ||
+ | <WRAP centeralign round box> | ||
+ | < | ||
+ | < | ||
+ | <fs 125%>Lab 0xA: Filters</ | ||
+ | </ | ||
+ | ~~TOC~~ | ||
+ | =====Objective===== | ||
+ | To become familiar with the concepts of text filtering and with some of the UNIX utilities that are useful in this process. | ||
+ | =====Reading===== | ||
+ | Check out the manual pages for the following utilities: | ||
+ | |||
+ | * **cat**(**1**) - concatenate files | ||
+ | * **cut**(**1**) - cut text | ||
+ | * **grep**(**1**) - globally search for regular expression and print | ||
+ | * **head**(**1**) - print first " | ||
+ | * **sed**(**1**) - stream editor | ||
+ | * **sort**(**1**) - sort output | ||
+ | * **tail**(**1**) - print last " | ||
+ | * **tr**(**1**) - translate characters | ||
+ | * **uniq**(**1**) - filter out duplicate lines from sorted file | ||
+ | * **wc**(**1**) - word count | ||
+ | |||
+ | In " | ||
+ | |||
+ | * Chapter 16 (" | ||
+ | * Chapter 17 (" | ||
+ | * Chapter 18 (" | ||
+ | * Chapter 19 (" | ||
+ | =====Background===== | ||
+ | Filtering is a big deal in many areas that deal with information processing. Say you've got a database of produce for a grocery store, and you want to view JUST the information regarding the banana shipments. Instead of sorting through the entire database and picking out the data you want manually, why not put all the data through a filter and simply view the pertinent data? | ||
+ | |||
+ | <WRAP round info box>A filter, as defined on http:// | ||
+ | \\ | ||
+ | For example: a Web filter that screens out vulgar sites. | ||
+ | </ | ||
+ | |||
+ | UNIX provides some utilities that allow you to accomplish impressive amounts of filtering. When coupled with Regular Expressions, | ||
+ | |||
+ | The next step is to apply shell scripts, which allow you to write " | ||
+ | |||
+ | So, some basics of filtering: | ||
+ | |||
+ | Before we employ filtering, we need a clear idea of what we want to filter and of how to safely preserve the data we wish to let through. (Filtering is no good if the data you are after gets damaged in the process.) | ||
+ | |||
+ | The UNIX **cat**(**1**) utility is a general-purpose tool that can be used to display the contents of text files. **cat**(**1**) also provides a number of other features that can be handy for debugging problems you may encounter with text files (the **-n** and **-e** arguments can be particularly useful). | ||
+ | |||
+ | Let's play with a sample database. In the **filters/ | ||
+ | |||
+ | Let's try some stuff out. | ||
+ | |||
+ | ====No filtering, or a filter that lets everything through==== | ||
+ | |||
+ | Display the contents: | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ cat sample.db | ||
+ | </ | ||
+ | |||
+ | This is the simplest form of filtering possible: none at all. All the data in the text file is passed to **STDOUT**. | ||
+ | |||
+ | Even at this stage we can do some useful things with the data. For example, if we wanted to find out how many lines were in the database: | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ cat sample.db | wc -l | ||
+ | </ | ||
+ | |||
+ | When you run plain **cat sample.db**, the database displays to STDOUT in its entirety. You will notice the database is set up as follows: | ||
+ | |||
+ | < | ||
+ | name: | ||
+ | </ | ||
+ | |||
+ | With this information we can make some important observations about the structure of the database: | ||
+ | |||
+ | * fields are separated by a colon (:) | ||
+ | * last entry on the line is followed by a star (*) | ||
+ | |||
+ | To be effective in filtering text, we must be aware of the structure of that text. The more we know about how that structure is set up, the better we can design a solution to the particular problem. | ||
+ | ====keyword filtering==== | ||
+ | |||
+ | Ok, so let us filter some of this information: | ||
+ | |||
+ | Find all the students who are in // | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ cat sample.db | grep Biology | ||
+ | </ | ||
+ | |||
+ | We can do more complicated searches too: | ||
+ | |||
+ | Find all the students who are in Biology AND like Lollipops: | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ cat sample.db | grep Biology | grep Lollipops | ||
+ | </ | ||
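
The chained search above can be tried on a few made-up records. This is only a sketch: the names and the three-field layout below are hypothetical and are not taken from the real **sample.db**; only the colon separator and the trailing star follow the structure described earlier.

```shell
# Hypothetical records (NOT the real sample.db): colon-separated fields,
# with a star after the last field on each line.
sample_records() {
    printf '%s\n' \
        'Ann:Biology:Lollipops*' \
        'Bob:Chemistry:Lollipops*' \
        'Cid:Biology:Taffy*'
}

# Each grep only passes the lines that match its pattern, so chaining two
# greps behaves like a logical AND: a line must match both to survive.
sample_records | grep Biology | grep Lollipops
```

Only the first record matches both patterns, so it is the only line printed. Swapping the order of the two greps gives the same result, since each stage just narrows what the previous one let through.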
+ | |||
+ | ^ 1. ^|Perform the following searches on the database:| | ||
+ | | ^ a.|Find all the students that are a // | ||
+ | |:::^ b.|Same as above but in // | ||
+ | |:::^ c.|Any duplicate entries? Remove any duplicates.| | ||
+ | |:::^ d.|Using the **wc**(**1**) utility, how many matches did you get?| | ||
+ | |||
+ | Be sure to give me the command-line incantations you came up with, and any observations you made. | ||
+ | ====filter for manipulation==== | ||
+ | |||
+ | So we've done some simple searches on our database. We've filtered the output to get desired values. But we don't have to stop there. Not only can we filter the text, we can manipulate it to our liking. | ||
+ | |||
+ | The **cut**(**1**) utility lets us literally cut columns from the output. | ||
+ | |||
+ | It relies on a thing called a field-separator, | ||
+ | |||
+ | Using the " | ||
+ | |||
+ | So, looking at the following text: | ||
+ | |||
+ | < | ||
+ | hello there: | ||
+ | </ | ||
+ | |||
+ | Looking at this example, we can see that ":" | ||
+ | |||
+ | With ":" | ||
+ | |||
+ | ^ Field 1 ^ Field 2 ^ Field 3 ^ Field 4 ^ Field 5 ^ Field 6 | | ||
+ | | hello there | this | is | a | bunch of | text. | | ||
+ | |||
+ | |||
+ | We can test these properties out by using **cut**(**1**) on the command-line: | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ echo "hello there: | ||
+ | </ | ||
+ | |||
+ | Where # is a specific field or range of fields (e.g., **-f2** or **-f2,4** or **-f1-3**). | ||
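
As a small illustration of field selection, here is a sketch on a made-up line. The text in the `line` variable is hypothetical and deliberately different from the lab's example, so it does not answer the questions that follow.

```shell
# Hypothetical colon-delimited line (not taken from the lab files).
line='alpha:beta:gamma:delta'

# -d sets the field separator, -f selects fields by number.
echo "$line" | cut -d: -f2      # a single field: beta
echo "$line" | cut -d: -f1-3    # a range of fields: alpha:beta:gamma
```

Quoting the delimiter (as in `-d":"`) is equivalent here; quoting only becomes necessary when the separator is a character the shell treats specially, such as a space.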
+ | |||
+ | ^ 2. ^|Let' | ||
+ | | ^ a.|What would the following command-line display: **echo "hello there: | ||
+ | |:::^ b.|If you wanted to get "hello there text." to display to the screen, what manipulation to the text would you have to do?| | ||
+ | |:::^ c.|Did your general attempt work? Is there extra information? | ||
+ | |||
+ | If you found that extra information showed up when you tried to do that last part, taking a closer look will show why: | ||
+ | |||
+ | If you tell **cut**(**1**) to display more than one field, it joins the selected fields in its output with the field separator, whether or not those fields were next to one another in the input. | ||
+ | |||
+ | So how do you keep this functionality while still getting the exact data you seek? Well, nobody said we could only apply one filter to text. | ||
+ | =====The Stream Editor - sed===== | ||
+ | |||
+ | Remember back when we played with **vi/vim**? Remember that useful search and replace command: | ||
+ | |||
+ | < | ||
+ | : | ||
+ | </ | ||
+ | |||
+ | That was quite useful. And luckily, we've got that same ability on the command line. Introducing " | ||
+ | |||
+ | sed provides some of the features we've come to enjoy in vi, and is for all intents and purposes a non-interactive editor. Its most useful trait, however, is that it can edit data streams (that is, text arriving on **STDIN**, including the **STDOUT** generated by the commands earlier on our command lines). | ||
+ | |||
+ | Perhaps the most immediately useful command found in sed will be its search and replace, which is pretty much just like the **vi/vim** variant: | ||
+ | |||
+ | < | ||
+ | sed -e ' | ||
+ | </ | ||
+ | |||
+ | However, if you look closely, you will see that we did not include any sort of file to operate on. While we can, one of the other common uses of sed is to pop it in a command-line with everything else, stuck together with the all-powerful pipe (**|**). | ||
+ | |||
+ | For example, to solve the above problem with the field separator: | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ echo "hello there: | ||
+ | </ | ||
+ | |||
+ | We used sed to replace any occurrence of the ":" | ||
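
To see the substitute command at work on its own, here is a minimal sketch. The input text and the helper name `colons_to_spaces` are made up for illustration; the `s/old/new/g` form is the same one used in **vi/vim**.

```shell
# Hypothetical helper: replace every colon on each input line with a space.
colons_to_spaces() {
    sed -e 's/:/ /g'
}

# With the trailing g, every match on the line is replaced.
printf 'red:green:blue\n' | colons_to_spaces      # red green blue

# Without the g, only the first match on each line is replaced.
printf 'red:green:blue\n' | sed -e 's/:/ /'       # red green:blue
```

Because sed reads from STDIN when no file is named, it slots into the end of a pipeline just like **grep**(**1**) or **cut**(**1**).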
+ | |||
+ | ^ 3. ^|Answer me the following:| | ||
+ | | ^ a.|Does the above command-line fix the problem from #2c?| | ||
+ | |:::^ b.|If you wanted to change all " | ||
+ | |:::^ c.|If you wanted to replace all the period symbols in the text with asterisks, how would you do it?| | ||
+ | |:::^ d.|What does the resulting output look like?| | ||
+ | |||
+ | =====From head(1) to tail(1)===== | ||
+ | |||
+ | Two other utilities you may want to become acquainted with are the **head**(**1**) and **tail**(**1**) utilities. | ||
+ | |||
+ | **head**(**1**) will allow you to print a specified number of lines from //1 to n//. So if you needed to print, say, the first 12 lines of a file, **head**(**1**) will be a good bet. | ||
+ | |||
+ | For example, to display the first 12 lines of our sample database: | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ head -12 sample.db | ||
+ | </ | ||
+ | |||
+ | And, of course, **head**(**1**) can be added onto an existing command line using the pipe. This example shows the first two results among all the *ology majors: | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ cat sample.db | grep " | ||
+ | </ | ||
+ | |||
+ | See where we're going with this? We can use these utilities to put together massively powerful command-line incantations that create all sorts of interesting filters. | ||
+ | |||
+ | **tail**(**1**) works from the opposite end, starting at the end of the file and working backwards towards the beginning. So if you wanted to display the last 8 lines of a file, for example, **tail**(**1**) is the tool to reach for. **tail**(**1**) also has the nifty ability to continually monitor a file and update its output should the source file change. This is useful for monitoring log files that are continually updated. | ||
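
The behavior of both utilities is easy to check on numbered input. This sketch assumes the **seq**(**1**) utility is available (it is part of GNU coreutils, but it is not mentioned in this lab); it simply stands in for a 20-line file.

```shell
# seq prints the numbers 1 through 20, one per line.
seq 1 20 | head -n 3     # first three lines: 1, 2, 3
seq 1 20 | tail -n 3     # last three lines: 18, 19, 20

# Chaining the two selects a window from the middle:
# the last 2 of the first 10 lines are lines 9 and 10.
seq 1 20 | head -n 10 | tail -n 2
```

The `-n` form is the portable spelling of the line count. The file-monitoring mode described above is `tail -f`, which keeps running until interrupted, so it is left out of this runnable sketch.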
+ | =====Translating characters with tr===== | ||
+ | |||
+ | This is another useful tool to be familiar with. With **tr**(**1**), | ||
+ | ====ASCII file line endings==== | ||
+ | |||
+ | An important thing to be aware of is how the various systems terminate their lines. Check the following table: | ||
+ | |||
+ | ^ System ^ Line ending | | ||
+ | | DOS | Carriage Return, Line Feed (CRLF) | | ||
+ | | Mac | Carriage Return (CR) | | ||
+ | | UNIX | Line Feed (LF) | | ||
+ | |||
+ | So what does this mean to you? Well, if you have a file that was formatted with Mac-style line endings, and you're trying to read that file on a UNIX system, you may notice that everything appears as a single line at the top of the screen. This is because the Mac (classic Mac OS, prior to OS X) uses just a Carriage Return to terminate its lines, and UNIX uses just a Line Feed, so the two are incompatible for standard text display purposes. | ||
+ | |||
+ | For example, let's say we have a UNIX file we wish to convert to DOS format. We would need to convert every terminating Line Feed to a Carriage Return & Line Feed combination (and take note that the Carriage Return needs to come first and then the Line Feed). We would do something that looks like this: | ||
+ | |||
+ | <cli> | ||
+ | lab46:~$ tr " | ||
+ | </ | ||
+ | |||
+ | To interpret this: | ||
+ | |||
+ | **\n** is the special escape sequence that we're all familiar with. In C, you can use it to issue an // | ||
+ | |||
+ | **\r** is the special escape sequence that corresponds to a Carriage Return (**CR**). | ||
+ | |||
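+ | The first argument is the original sequence. The second is what we would like to replace it with (in this case, the intent is to replace every **LF** with a **CRLF** combination). Be aware that **tr**(**1**) translates one character at a time: with GNU **tr**(**1**), when the second argument is longer than the first, the excess characters are ignored, so check the result (for example with **cat -e**) rather than assuming both characters were written. | ||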
+ | |||
+ | Then, using UNIX I/O redirection operations, **file.unix** is redirected as input to **tr**(**1**), | ||
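
Line endings are invisible in normal output, so it helps to look at the raw bytes. The sketch below uses made-up two-line input and **od**(**1**) with `-c`, which prints each byte and shows the escape sequences `\n` and `\r` explicitly; neither the input nor the use of **od**(**1**) comes from the lab files.

```shell
# Plain UNIX text: each line ends in \n.
printf 'one\ntwo\n' | od -c

# tr maps single characters one-to-one: here every LF becomes a CR.
printf 'one\ntwo\n' | tr '\n' '\r' | od -c

# -d deletes characters instead of translating them: here every CR is dropped.
printf 'one\r\ntwo\r\n' | tr -d '\r' | od -c
```

Running a conversion through `od -c` (or `cat -e`) before and after is a quick way to confirm that the line endings really changed the way you intended.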
+ | |||
+ | In the **filters/ | ||
+ | |||
+ | ^ 4. ^|Let' | ||
+ | | ^ a.|Convert **file.mac** to UNIX format. Show me how you did this, as well as any interesting messages you find inside.| | ||
+ | |:::^ b.|Convert **readme.unix** to DOS format. Same deal as above.| | ||
+ | |:::^ c.|Convert **dos.txt** to Mac format. Show me the command-line used.| | ||
+ | =====Procedure===== | ||
+ | Looking back on our database (**sample.db** in the **filters/ | ||
+ | |||
+ | ^ 5. ^|Develop, explain, and show me the command-lines for the following:| | ||
+ | | ^ a.|How many unique // | ||
+ | |:::^ b.|How many unique //majors// are there in the database?| | ||
+ | |:::^ c.|How many unique " | ||
+ | |||
+ | <WRAP round info box> | ||
+ | </ | ||
+ | |||
+ | ^ 6. ^|Using the **pelopwar.txt** file from the **grep/** subdirectory of the UNIX Public Directory, construct filters to do the following:| | ||
+ | | ^ a.|Show me the first 22 lines of this file. How did you do this?| | ||
+ | |:::^ b.|Show me the last 4 lines of this file. How did you do this?| | ||
+ |:::^ c.|Show me lines 32-48 of this file. How did you do this? (HINT: the last 17 lines of the first 48)| | ||
+ | |:::^ d.|Of the last 12 lines in this file, show me the first 4. How did you do this?| | ||
+ | |||
+ | Being familiar with the commands and utilities available to you on the system greatly increases your ability to construct effective filters, and ultimately solve problems in a more efficient and creative manner. | ||
+ | |||
+ | =====Conclusions===== | ||
+ | This assignment has activities which you should tend to- document/ | ||
+ | |||
+ | As always, the class mailing list and class IRC channel are available for assistance, but not answers. |