This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
haas:spring2014:unix:labs:labb [2013/11/13 13:59] – external edit 127.0.0.1 | haas:spring2014:unix:labs:labb [2014/04/15 09:18] (current) – [Exercise] wedge | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | <WRAP round box> | + | < |
- | \\ | + | < |
- | < | + | < |
- | \\ | + | <fs 125%>Lab 0xB: Data Manipulation</fs> |
- | < | + | |
- | \\ | + | |
- | \\ | + | |
- | Lab 0xB: Filters | + | |
- | \\ | + | |
- | \\ | + | |
- | </WRAP> | + | |
</ | </ | ||
+ | |||
~~TOC~~ | ~~TOC~~ | ||
=====Objective===== | =====Objective===== | ||
- | To become familiar with the concepts | + | To explore some aspects |
=====Reading===== | =====Reading===== | ||
- | Check out the manual pages for the following utilities: | + | Please reference |
- | * **cat**(**1**) - concatenate files | + | * **dd**(**1**) |
- | * **cut**(**1**) - cut text | + | * **md5sum**(**1**) |
- | * **grep**(**1**) | + | * **diff**(**1**) |
- | * **head**(**1**) | + | * **bvi**(**1**) |
- | * **sed**(**1**) | + | * **hexedit**(**1**) |
- | * **sort**(**1**) - sort output | + | * **file**(**1**) |
- | * **tail**(**1**) - print last " | + | |
- | * **tr**(**1**) - translate characters | + | |
- | * **uniq**(**1**) - filter out duplicate lines from sorted | + | |
- | * **wc**(**1**) | + | |
- | In " | ||
- | |||
- | * Chapter 16 (" | ||
- | * Chapter 17 (" | ||
- | * Chapter 18 (" | ||
- | * Chapter 19 (" | ||
=====Background===== | =====Background===== | ||
- | Filtering | + | The **dd**(**1**) utility, short for //data dump//, |
- | <WRAP round info box>A filter, as defined on http:// | + | ====Copying==== |
- | \\ | + | To illustrate the basic nature of **dd**(**1**), we will perform a file copy. Typically, **dd**(**1**) |
- | For example: a Web filter that screens out vulgar sites. | + | |
- | </ | + | |
- | UNIX provides some utilities that allow you to accomplish impressive amounts of filtering. | + | When given just a source and a destination, **dd**(**1**) will happily copy the source data, from start to finish, to the destination location (filling it up from beginning to end). The end result should be identical to the source. |
- | The next step is to apply shell scripts, which allow you to write " | + | For example: |
- | + | ||
- | So, some basics of filtering: | + | |
- | + | ||
- | In order to do any sort of filtering, we need to know what we want to filter. Makes sense. | + | |
- | + | ||
- | Before we employ filtering, we must have some clear idea about what we would like to filter, and how to safely maintain the data we wish to let through. (Filtering is no good if the data you are after gets damaged in the process). | + | |
- | + | ||
- | The UNIX **cat**(**1**) utility is a general all-purpose tool that can be used to display the contents of text files. **cat**(**1**) also provides a number of other features that can be handy for debugging problems you may encounter with text files (the **-n** and **-e** arguments can be particularly useful). | + |
- | + | ||
- | Let's play with a sample database. In the **filters/ | + | |
- | + | ||
- | Let's try some stuff out. | + | |
- | + | ||
- | ====No filtering, or a filter that lets everything through==== | + | |
- | + | ||
- | Display the contents: | + | |
<cli> | <cli> | ||
- | lab46: | + | lab46: |
+ | 9+1 records in | ||
+ | 9+1 records out | ||
+ | 4912 bytes (4.9 kB) copied, 0.0496519 s, 98.9 kB/s | ||
+ | lab46: | ||
</ | </ | ||
- | This is the simplest form of filtering possible-- none at all. All the data in the text file is passed to **STDOUT**. | + | Here, **if=** specifies |
- | Even at this stage we can do some useful things with the data. For example, if we wanted to find out how many lines were in the database: | + | Doing some comparisons: |
<cli> | <cli> | ||
- | lab46: | + | lab46: |
+ | -rwxr-xr-x 1 root root 4912 May 4 2010 / | ||
+ | -rw-r--r-- 1 user lab46 4912 Nov 13 14:57 howlong | ||
+ | lab46: | ||
</ | </ | ||
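As a hedged illustration of the copy-and-compare steps above, the following sketch works on scratch files only. The names `source.bin` and `copy.bin` are made up for this example and are not part of the lab, and the script assumes a POSIX-style shell with `mktemp`, `dd`, and `cmp` available. The 4912-byte size merely mirrors the transcript: with the default 512-byte block size of **dd**(**1**), 4912 bytes is 9 full blocks plus one partial block of 304 bytes, which is what a report of `9+1 records` denotes.

```shell
#!/bin/sh
# Minimal sketch: copy a file with dd, then confirm the copy is byte-identical.
# All file names here are hypothetical scratch files, not lab files.
set -e

workdir=$(mktemp -d)
src="$workdir/source.bin"
dst="$workdir/copy.bin"

# Create 4912 bytes of sample data (one block of 4912 bytes from /dev/zero).
dd if=/dev/zero of="$src" bs=4912 count=1 2>/dev/null

# Copy it: if= names the input file, of= names the output file.
# dd writes its statistics to STDERR, so capture them with 2>&1.
stats=$(dd if="$src" of="$dst" 2>&1)

# Show only the first line of the statistics (the "records in" count).
printf '%s\n' "$stats" | head -n 1

# cmp -s is silent and exits 0 only when both files match byte for byte.
if cmp -s "$src" "$dst"; then
    result="identical"
else
    result="different"
fi
echo "$result"

rm -rf "$workdir"
```

Note that `cmp` only compares contents; it says nothing about ownership, permissions, or timestamps, which is why a long listing is still worth examining separately.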
- | The database will display to STDOUT in its entirety. You will notice the database is set up as follows: | + | ====Investigating==== |
- | < | + | ^ 1. ^|Answer me the following:| |
- | name:sid:major:year:favorite candy | + | | ^ a.|What is different about these two files?| |
- | </ | + | |:::^ b.|What is similar?| |
+ | |:::^ c.|If **dd**(**1**) copies (or duplicates) data, why do you suppose these differences exist?| | ||
+ | |:::^ d.|What is the output of **file**(**1**) when you run it on both of these files?| | ||
+ | |:::^ e.|When you execute each file, is the output the same or different? | ||
+ | |:::^ f.|Any prerequisite steps needed to get either file to run? What were they?| | ||
- | With this information we can make some important observations about the structure | + | Consistency of data has been a desire of computer users since long before computers were readily available. To be able to verify |
- | * fields are separated by a colon (:) | + | ====Comparisons==== |
- | * last entry on the line is followed by a star (*) | + | |
- | To be effective in filtering text, we must be aware of the structure of that text. The more we know about how some structure is set up, the better we can design a solution to the particular problem. | + | Although many ways exist, there are two common ways of comparing two files: |
- | ====keyword filtering==== | + | |
- | Ok, so let us filter some of this information: | + | * **diff**(**1**): |
+ | * **md5sum**(**1**): computes an MD5 hash of a file's contents, creating a compact data fingerprint that is, in practice, unique to those contents (deliberately constructed collisions aside) | ||
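To make the distinction between the two tools concrete, here is a minimal sketch, assuming a POSIX-style shell with `mktemp` and the GNU coreutils **md5sum**(**1**). The three scratch files (`a.txt`, `b.txt`, `c.txt`) are hypothetical and exist only for this illustration: `b.txt` is an exact copy of `a.txt`, while `c.txt` differs by one line.

```shell
#!/bin/sh
# Minimal sketch: compare files by hash (md5sum) and by content (diff).
# All file names here are hypothetical scratch files, not lab files.
set -e

workdir=$(mktemp -d)
printf 'alpha\nbeta\n'  > "$workdir/a.txt"
cp "$workdir/a.txt" "$workdir/b.txt"
printf 'alpha\ngamma\n' > "$workdir/c.txt"

# md5sum prints "<hash>  <filename>"; keep only the hash field.
hash_a=$(md5sum "$workdir/a.txt" | cut -d' ' -f1)
hash_b=$(md5sum "$workdir/b.txt" | cut -d' ' -f1)
hash_c=$(md5sum "$workdir/c.txt" | cut -d' ' -f1)

# Identical contents give identical hashes, regardless of file name.
if [ "$hash_a" = "$hash_b" ]; then echo "a and b: same hash"; fi
if [ "$hash_a" != "$hash_c" ]; then echo "a and c: different hash"; fi

# diff exits 0 when the files match and 1 when they differ;
# unlike a hash, its output also shows which lines changed.
if diff "$workdir/a.txt" "$workdir/c.txt" > /dev/null; then
    diff_ac="same"
else
    diff_ac="differ"
fi
echo "diff of a and c: $diff_ac"

rm -rf "$workdir"
```

A hash answers only "same or not", which makes it convenient for checking integrity; **diff**(**1**) is the better fit when you need to see where text files diverge.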
- | Find all the students who are in //Biology//: | + | ^ 2. ^|Answer me the following: |
+ | | ^ a.|Are **/usr/bin/uptime** and **howlong** text files or binary files? What is your proof?| | ||
+ | |:::^ b.|Using **diff**(**1**), | ||
+ | |:::^ c.|Using **md5sum**(**1**), | ||
+ | |:::^ d.|Using **md5sum**(**1**), | ||
+ | |:::^ e.|How could an MD5 hash be useful with regard to data integrity and security? | ||
+ | |:::^ f.|In what situations could **diff**(**1**) be a useful tool for comparing differences? | ||
+ | =====Exercise===== | ||
- | <cli> | + | ^ 3. ^|Do the following:| |
- | lab46:~$ cat sample.db | grep Biology | + | | ^ a.|Using **dd**(**1**), |
- | </cli> | + | |:::^ b.|How did you do this?| |
+ | |:::^ c.|How could you verify you were successful? | ||
+ | |:::^ d.|If you ran **echo "more information" | ||
+ | |:::^ e.|Can you find this information in **test.file**? | ||
+ | |:::^ f.|If you wanted to retrieve the information you just added using **dd**(**1**), | ||
- | We can do more complicated searches too: | + | <WRAP round info box> |
- | Find all the students who are in Biology AND like Lollipops: | + | In the **data/** subdirectory of the UNIX Public Directory is a file called **data.file** |
- | < | + | Please copy this to your home directory |
- | lab46:~$ cat sample.db | grep Biology | grep Lollipops | + | |
- | </ | + | |
- | + | ||
- | ^ 1. ^|Perform the following searches on the database: | + | |
- | | ^ a.|Find all the students that are a // | + | |
- | |:::^ b.|Same as above but in // | + | |
- | |:::^ c.|Any duplicate entries? Remove any duplicates.| | + | |
- | |:::^ d.|Using the **wc**(**1**) utility, how many matches did you get?| | + | |
- | + | ||
- | Be sure to give me the command-line incantations you came up with, and any observations you made. | + | |
- | ====filter for manipulation==== | + | |
- | + | ||
- | So we've done some simple searches on our database. We've filtered the output to get desired values. But we don't have to stop there. Not only can we filter the text, we can manipulate it to our liking. | + | |
- | + | ||
- | The **cut**(**1**) utility lets us literally cut columns from the output. | + | |
- | + | ||
- | It relies on a thing called a field-separator, | + | |
- | + | ||
- | Using the " | + | |
- | + | ||
- | So, looking at the following text: | + | |
- | + | ||
- | < | + | |
- | hello there:this:is:a:bunch of:text. | + | |
- | </ | + | |
- | + | ||
- | Looking at this example, we can see that ":" | + | |
- | + | ||
- | With ":" | + | |
- | + | ||
- | ^ Field 1 ^ Field 2 ^ Field 3 ^ Field 4 ^ Field 5 ^ Field 6 | | + | |
- | | hello there | this | is | a | bunch of | text. | | + | |
- | + | ||
- | + | ||
- | We can test these properties out by using **cut**(**1**) on the command-line: | + | |
- | + | ||
- | < | + | |
- | lab46:~$ echo "hello there: | + | |
- | </ | + | |
- | + | ||
- | Where # is a specific field or range of fields. (ie **-f2** or **-f2,4** or **-f1-3**) | + | |
- | + | ||
- | ^ 2. ^|Let' | + | |
- | | ^ a.|What would the following command-line display: **echo "hello there: | + | |
- | |:::^ b.|If you wanted | + | |
- | |:::^ c.|Did your general attempt | + | |
- | + | ||
- | If you found that extra information showed up when you tried that last part, a closer look will show why: | + |
- | + | ||
- | If you tell **cut**(**1**) to display any fields that aren't immediately next to one another, it will insert the field separator to indicate the separation. | + | |
- | + | ||
- | So how do you keep this functionality while still getting the exact data you seek? Well, nobody said we could only apply one filter to text. | + | |
- | =====The Stream Editor - sed===== | + | |
- | + | ||
- | Remember back when we played with **vi/vim**? Remember that useful search and replace command: | + | |
- | + | ||
- | < | + | |
- | : | + | |
- | </ | + | |
- | + | ||
- | That was quite useful. And luckily, we've got that same ability | + | |
- | + | ||
- | sed provides some of the features we've come to enjoy in vi, and is for all intents and purposes a non-interactive editor. One useful ability, however, is its ability to edit data streams (that is, **STDOUT**, including that generated from our command lines). | + | |
- | + | ||
- | Perhaps the most immediately useful command found in sed will be its search and replace, which is pretty much just like the **vi/vim** variant: | + | |
- | + | ||
- | < | + | |
- | sed -e ' | + | |
- | </ | + | |
- | + | ||
- | However, if you look closely, you will see that we did not include any sort of file to operate on. While we can, one of the other common uses of sed is to pop it in a command-line with everything else, stuck together with the all-powerful pipe (**|**). | + |
- | + | ||
- | For example, to solve the above problem with the field separator: | + |
- | + | ||
- | < | + | |
- | lab46:~$ echo "hello there: | + | |
- | </ | + | |
- | + | ||
- | We used sed to replace any occurrence of the ":" | + | |
- | + | ||
- | ^ 3. ^|Answer me the following:| | + | |
- | | ^ a.|Does the above command-line fix the problem from #2c?| | + | |
- | |:::^ b.|If you wanted to change all " | + | |
- | |:::^ c.|If you wanted to replace all the period symbols in the text with asterisks, how would you do it?| | + | |
- | |:::^ d.|What does the resulting output look like?| | + | |
- | + | ||
- | =====From head(1) to tail(1)===== | + | |
- | + | ||
- | Two other utilities you may want to become acquainted with are the **head**(**1**) and **tail**(**1**) utilities. | + | |
- | + | ||
- | **head**(**1**) will allow you to print a specified number of lines from //1 to n//. So if you needed to print, say, the first 12 lines of a file, **head**(**1**) will be a good bet. | + | |
- | + | ||
- | For example, to display the first 12 lines of our sample database: | + |
- | + | ||
- | < | + | |
- | lab46:~$ head -12 sample.db | + | |
- | </ | + | |
- | + | ||
- | And, of course, it can be added onto an existing command line using the pipe. This example shows the first two results of all the *ology majors: | + |
- | + | ||
- | < | + | |
- | lab46:~$ cat sample.db | grep " | + | |
- | </ | + | |
- | + | ||
- | See where we're going with this? We can use these utilities to put together massively powerful command-line incantations that create all sorts of interesting filters. | + |
- | + | ||
- | **tail**(**1**) works from the opposite end, starting at the end of the file and working backwards towards the beginning. It is the tool to reach for if you wanted to display the last 8 lines of a file, for example. **tail**(**1**) also has the nifty ability to continually monitor a file and update its output should the source file change. This is useful for monitoring log files that are continually updated. | + |
- | =====Translating characters with tr===== | + | |
- | + | ||
- | This is another useful tool to be familiar with. With **tr**(**1**), | + | |
- | ====ASCII file line endings==== | + | |
- | + | ||
- | An important thing to be aware of is how the various systems terminate their lines. Check the following table: | + | |
- | + | ||
- | ^ System | + | |
- | | DOS | Carriage Return, Line Feed (CRLF) | + | |
- | | Mac | Carriage Return (CR) | | + | |
- | | UNIX | Line Feed (LF) | | + | |
- | + | ||
- | So what does this mean to you? Well, if you have a file that was formatted with Mac-style line endings, and you're trying to read that file on a UNIX system, you may notice that everything appears as a single line at the top of the screen. This is because the Mac uses just Carriage Return to terminate its lines, and UNIX uses just Line Feeds... so the two are drastically incompatible for standard text display reasons. | + | |
- | + | ||
- | For example, let's say we have a UNIX file we wish to convert to DOS format. We would need to convert every terminating Line Feed to a Carriage Return & Line Feed combination (and take note that the Carriage Return needs to come first and then the Line Feed). We would do something that looks like this: | + | |
- | + | ||
- | < | + | |
- | lab46:~$ tr " | + | |
- | </ | + | |
- | + | ||
- | To interpret this: | + | |
- | + | ||
- | **\n** is the special escape sequence that we're all familiar with. In C, you can use it to issue an // | + | |
- | + | ||
- | **\r** is the special escape sequence that corresponds to a Carriage Return (**CR**). | + | |
- | + | ||
- | The first argument is the original sequence. The second is what we would like to replace it with. (in this case, replace every **LF** with a **CRLF** combination). | + | |
- | + | ||
- | Then, using UNIX I/O redirection operations, **file.unix** is redirected as input to **tr**(**1**), | + | |
- | + | ||
- | In the **filters/ | + | |
- | + | ||
- | ^ 4. ^|Let' | + | |
- | | ^ a.|Convert **file.mac** to UNIX format. Show me how you did this, as well as any interesting messages you find inside.| | + | |
- | |:::^ b.|Convert **readme.unix** to DOS format. Same deal as above.| | + | |
- | |:::^ c.|Convert **dos.txt** to Mac format. Show me the command-line used.| | + | |
- | =====Procedure===== | + | |
- | Looking back on our database (**sample.db** in the **filters/ | + | |
- | + | ||
- | ^ 5. ^|Develop, explain, and show me the command-lines for the following: | + | |
- | | ^ a.|How many unique // | + | |
- | |:::^ b.|How many unique //majors// are there in the database? | + | |
- | |:::^ c.|How many unique " | + | |
- | + | ||
- | <WRAP round info box> | + | |
- | </ | + | |
- | ^ | + | ^ |
- | | ^ a.|Show me the first 22 lines of this file. How did you do this?| | + | | ^ a.|How large (in bytes) is this file?| |
- | |:::^ b.|Show me the last 4 lines of this file. How did you do this?| | + | |:::^ b.|What information predominantly appears to be in the first 3kB of the file?| |
- | |:::^ c.|Show me lines 32-48 of this file. How did you do this? (HINT: | + | |:::^ c.|Does this information remain constant throughout the file? Are there ranges where it differs? What are they?| |
- | |::: | + | |:::^ d.|How would you extract |
+ | |::: | ||
+ | |:::^ f.|Run **file**(**1**) on each file that hosts extracted data. What is each type of file?| | ||
+ | |:::^ g.|Based on the output of **file**(**1**), react accordingly to the data to unlock its functionality/ | ||
- | Being familiar with the commands and utilities available to you on the system greatly increases your ability to construct effective filters, and ultimately solve problems in a more efficient and creative manner. | ||
=====Conclusions===== | =====Conclusions===== |