Differences

This shows you the differences between two versions of the page.

--- haas:spring2014:unix:labs:labb [2013/11/13 13:59] – external edit 127.0.0.1
+++ haas:spring2014:unix:labs:labb [2014/04/15 09:18] (current) – [Exercise] wedge
@@ Line 1: / Line 1: @@
-<WRAP round box>
+<WRAP centeralign round box>
-\\
+<WRAP><color red><fs 200%>Corning Community College</fs></color></WRAP>
-<WRAP centeralign bigger><WRAP bigger fgred>Corning Community College</WRAP>
+<WRAP><fs 150%>CSCS1730 UNIX/Linux Fundamentals</fs></WRAP>
-\\
+<fs 125%>Lab 0xB: Data Manipulation</fs>
-<WRAP muchbigger>UNIX/Linux Fundamentals</WRAP>
-\\
-\\
-Lab 0xB: Filters
-\\
-\\
-</WRAP>
 </WRAP>
 ~~TOC~~
 =====Objective=====
-To become familar with the concepts of text filtering, and some of the UNIX utilities that are useful in this process.
+To explore some aspects of data manipulation and data security on the system.
 =====Reading=====
-Check out the manual pages for the following utilities:
+Please reference the following manual pages:
-  * **cat**(**1**)	- concatenate files
+  * **dd**(**1**)
-  * **cut**(**1**)	- cut text
+  * **md5sum**(**1**)
-  * **grep**(**1**) - globally search for regular expression and print
+  * **diff**(**1**)
-  * **head**(**1**) - print first "n" lines of output
+  * **bvi**(**1**)
-  * **sed**(**1**) - stream editor
+  * **hexedit**(**1**)
-  * **sort**(**1**) - sort output
+  * **file**(**1**)
-  * **tail**(**1**) - print last "n" lines of output
-  * **tr**(**1**) - translate characters
-  * **uniq**(**1**) - filter out duplicate lines from sorted file
-  * **wc**(**1**) - word count
-In "Harley Hahn's Guide to UNIX and Linux", please read:
-  * Chapter 16 ("Filters: Introduction and Basic Operations", pages 373-394).
-  * Chapter 17 ("Filters: Comparing and Extracting", pages 395-420).
-  * Chapter 18 ("Filters: Counting and Formatting", pages 421-446).
-  * Chapter 19 ("Filters: Selecting, Sorting, Combining, and Changing", pages 447-496).
 =====Background=====
-Filtering is a big deal in many areas that deal with information processing. Say you've got a database of produce for a grocery store, and you want to view JUST the information regarding the banana shipments.. instead of sorting through the entire database and picking out the data you want manually- why not put all the data through a filter and simply view the pertinent data?
+The **dd**(**1**) utility, commonly expanded as //data dump//, specializes in taking data from a source file and depositing it in a destination file. Its options give finer-grained access to data than is convenient with the standard data manipulation tools (**cp**(**1**), **split**(**1**), **cat**(**1**)).
-<WRAP round info box>A filter, as defined on http://dictionary.reference.com/ is a program or routine that blocks access to data that meet a particular criterion.\\
+====Copying====
-\\
+To illustrate the basic nature of **dd**(**1**), we will perform a file copy. Typically, **dd**(**1**) is given two arguments: the source of the data, and the destination of the data.
-For example: a Web filter that screens out vulgar sites.
-</WRAP>
-UNIX provides some utilities that allow you to accomplish impressive amounts of filtering. When coupled with Regular Expressions, you can combine the power of pattern matching with your filter, adding considerable flexibility to your arsenal of tricks.
+When given just a source and a destination, **dd**(**1**) will happily copy the source data, from start to finish, to the destination location (filling it from beginning to end). The result should be identical to the source.
-The next step is to apply shell scripts, which allow you to write "programs" that can take advantage of all the utilities and features available on a system. We will be looking at shell scripts another week. First we must get the foundations in place so we can better appreciate shell scripts.
+For example:
-So, some basics of filtering:
-In order to do any sort of filtering, we need to know what we want to filter. Makes sense.
-Before we employ filtering, we must have some clear idea about what we would like to filter, and how to safely maintain the data we wish to let through. (Filtering is no good if the data you are after gets damaged in the process).
-The UNIX **cat**(**1**) utility is a general all-purpose tool that can be used to display the contents of text files. **cat**(**1**) also provides a number of other features that can be handy for debugging problems you may encounter with text files. (the **-n** and **-e** arguments can be particularly useful).
-Let's play with a sample database. In the **filters/** subdirectory of the UNIX Public Directory you will find a file called "**sample.db**". Copy it to your home directory.
-Let's try some stuff out.
-====No filtering, or a filter that lets everything through====
-Display the contents:
 <cli>
-lab46:~$ cat sample.db
+lab46:~$ dd if=/usr/bin/uptime of=howlong
+9+1 records in
+9+1 records out
+4912 bytes (4.9 kB) copied, 0.0496519 s, 98.9 kB/s
+lab46:~$
 </cli>
-This is the simplest form of filtering possible-- none at all. All the data in the text file is passed to **STDOUT**.
+Here, **if=** specifies the source (input file) of our data, and **of=** specifies the destination (output file) for the data.
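The same **if=**/**of=** pattern can be tried on a scratch file. The following is a minimal sketch, not part of the lab itself: the file names (letters.txt, copy.txt, slice.txt) are hypothetical, and the **bs=**, **skip=**, and **count=** operands are standard **dd**(**1**) operands shown here to illustrate the finer-grained access mentioned in the Background.

```shell
# Work in a scratch directory so nothing in the home directory is touched.
workdir=$(mktemp -d)

# Build a 26-byte sample file (hypothetical name).
printf 'abcdefghijklmnopqrstuvwxyz' > "$workdir/letters.txt"

# Plain copy: everything read from if= is written to of=.
# dd reports its statistics on STDERR, discarded here to keep the output quiet.
dd if="$workdir/letters.txt" of="$workdir/copy.txt" 2>/dev/null

# Partial copy: with 1-byte blocks, skip the first 10 blocks of the input
# and then copy only the next 5.
dd if="$workdir/letters.txt" of="$workdir/slice.txt" bs=1 skip=10 count=5 2>/dev/null

# Show the extracted bytes, followed by a newline.
cat "$workdir/slice.txt"; echo
```

Using a block size of 1 keeps the arithmetic simple, since **skip=** and **count=** are measured in blocks rather than bytes; larger block sizes are faster but require the offsets to be expressed in those units.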
-Even at this stage we can do some useful things with the data. For example, if we wanted to find out how many lines were in the database:
+Doing some comparisons:
 <cli>
-lab46:~$ cat sample.db | wc -l
+lab46:~$ ls -l /usr/bin/uptime howlong
+-rwxr-xr-x 1 root  root 4912 May  4  2010 /usr/bin/uptime
+-rw-r--r-- 1 user lab46 4912 Nov 13 14:57 howlong
+lab46:~$
 </cli>
-The database will display to STDOUT in all its entirety. You will notice the database is setup as follows:
+====Investigating====
-<code>
+^  1.  ^|Answer me the following:|
-name:sid:major:year:favorite candy
+| ^  a.|What is different about these two files?|
-</code>
+|:::^  b.|What is similar?|
+|:::^  c.|If **dd**(**1**) copies (or duplicates) data, why do you suppose these differences exist?|
+|:::^  d.|What is the output of **file**(**1**) when you run it on both of these files?|
+|:::^  e.|When you execute each file, is the output the same or different?|
+|:::^  f.|Any prerequisite steps needed to get either file to run? What were they?|
-With this information we can make some important observations about the structure of the database:
+People have wanted consistent data since long before computers were readily available. Being able to verify that two pieces of data are identical, and so minimize the chance of a hidden alteration or forgery, is an important capability to possess.
-  * fields are separated by a colon (:)
+====Comparisons====
-  * last entry on the line is followed by a star (*)
-To be effective in filtering text, we must be aware of the structure of that text. The more you know about how some structure is set up, the better we can design a solution to the particular problem.
+Although many methods exist, two common ways of comparing files are:
-====keyword filtering====
-Ok, so let us filter some of this information:
+  * **diff**(**1**): compares two files line by line, indicating differences (useful for text files)
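+  * **md5sum**(**1**): computes an MD5 hash of a file's contents, creating a compact fingerprint of the data (identical files always produce the same hash; different files almost always produce different ones)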
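To see how the two tools report their results, here is a small sketch using throwaway text files. The file names (one.txt, two.txt, same.txt) are hypothetical and are not part of the lab; only the behavior of **diff**(**1**) and **md5sum**(**1**) is being illustrated.

```shell
# Scratch directory with three small text files (hypothetical names).
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/one.txt"
printf 'alpha\nBETA\ngamma\n' > "$workdir/two.txt"
cp "$workdir/one.txt" "$workdir/same.txt"

# diff prints nothing and exits with status 0 when the files match;
# it prints the differing lines and exits with status 1 when they differ.
diff "$workdir/one.txt" "$workdir/same.txt" && echo "one.txt and same.txt match"
diff "$workdir/one.txt" "$workdir/two.txt" || echo "one.txt and two.txt differ"

# md5sum prints one hash per file; matching hashes indicate matching contents.
md5sum "$workdir/one.txt" "$workdir/same.txt" "$workdir/two.txt"
```

Note the division of labor: **diff**(**1**) shows where two files differ, while **md5sum**(**1**) only indicates whether they differ, but does so with a short value that is easy to record and compare later.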
-Find all the students who are in //Biology//:
+^  2.  ^|Answer me the following:|
+| ^  a.|Are **/usr/bin/uptime** and **howlong** text files or binary files? What is your proof?|
+|:::^  b.|Using **diff**(**1**), verify whether or not these files are identical. Show me the results.|
+|:::^  c.|Using **md5sum**(**1**), verify whether or not these files are identical. Show me the results.|
+|:::^  d.|Using **md5sum**(**1**), compare the MD5 hash of one of these files against that of **/bin/cp**. Is there a difference?|
+|:::^  e.|How could an MD5 hash be useful with regards to data integrity and security?|
+|:::^  f.|In what situations could **diff**(**1**) be a useful tool for comparing differences?|
+=====Exercise=====
-<cli>
+^  3.  ^|Do the following:|
-lab46:~$ cat sample.db | grep Biology
+| ^  a.|Using **dd**(**1**), create an 8kB file called "test.file" filled entirely with zeros.|
-</cli>
+|:::^  b.|How did you do this?|
+|:::^  c.|How could you verify you were successful?|
+|:::^  d.|If you ran **echo "more information" >> test.file**, what would happen?|
+|:::^  e.|Can you find this information in **test.file**? Where is it (think in terms of file offsets)?|
+|:::^  f.|If you wanted to retrieve the information you just added using **dd**(**1**), how would you do it?|
-We can do more complicated searches too:
+<WRAP round info box>__Hint:__ When on the subject of viewing the contents of non-text files, the typical tools we regularly use likely will not be of much help. Explore **bvi**(**1**) and **hexedit**(**1**).</WRAP>
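Both **bvi**(**1**) and **hexedit**(**1**) are interactive. As a non-interactive alternative (swapped in here for illustration, not named by the lab), **od**(**1**) can dump a file's bytes alongside their offsets. The sketch below assumes a hypothetical scratch file called mixed.bin.

```shell
# Scratch directory holding a 6-byte file: "AB", two NUL bytes, then "CD".
workdir=$(mktemp -d)
printf 'AB\000\000CD' > "$workdir/mixed.bin"

# -A d   : print file offsets in decimal
# -t x1  : print each byte as a two-digit hexadecimal value
od -A d -t x1 "$workdir/mixed.bin"
```

The left-hand column of the output is the offset of the first byte on that line, which is a convenient way to work out where a given piece of data begins within a file.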
-Find all the students who are in Biology AND like Lollipops:
+In the **data/** subdirectory of the UNIX Public Directory is a file called **data.file**.
-<cli>
+Please copy this to your home directory to work on the following question.
-lab46:~$ cat sample.db | grep Biology | grep Lollipops
-</cli>
-^  1.  ^|Perform the following searches on the database:|
-| ^  a.|Find all the students that are a //Freshman//|
-|:::^  b.|Same as above but in //alphabetical order//|
-|:::^  c.|Any duplicate entries? Remove any duplicates.|
-|:::^  d.|Using the **wc**(**1**) utility, how many matches did you get?|
-Be sure to give me the command-line incantations you came up with, and any observations you made.
-====filter for manipulation====
-So we've done some simple searches on our database. We've filtered the output to get desired values. But we don't have to stop there. Not only can we filter the text, we can manipulate it to our liking.
-The **cut**(**1**) utility lets us literally cut columns from the output.
-It relies on a thing called a field-separator, which will be used as a logical separator of the data.
-Using the "**-d**" argument to cut, we can specify the field separator in our data. The "**-f**" option will parse the text in fields based on the established field separator.
-So, looking at the following text:
-<code>
-hello there:this:is:a:bunch of:text.
-</code>
-Looking at this example, we can see that ":" would make for an excellent field separator.
-With ":" as the field separator, the logical structure of the above text is logically represented as follows:
-^  Field 1  ^  Field 2  ^  Field 3  ^  Field 4  ^  Field 5  ^  Field 6  |
-|  hello there  |  this  |  is  |  a  |  bunch of  |  text.  |
-We can test these properties out by using **cut**(**1**) on the command-line:
-<cli>
-lab46:~$ echo "hello there:this:is:a:bunch of:text." | cut -d":" -f#
-</cli>
-Where # is a specific field or range of fields. (ie **-f2** or **-f2,4** or **-f1-3**)
-^  2.  ^|Let's play with the **cut**(**1**) utility:|
-| ^  a.|What would the following command-line display: **echo "hello there:this:is:a:bunch of:text." <nowiki>|</nowiki> cut -d":" -f3**|
-|:::^  b.|If you wanted to get "hello there text." to display to the screen, what manipulation to the text would you have to do?|
-|:::^  c.|Did your general attempt work? Is there extra information?|
-If you found that extra information showed up when you tried to do that last part- taking a closer look will show why:
-If you tell **cut**(**1**) to display any fields that aren't immediately next to one another, it will insert the field separator to indicate the separation.
-So how do you keep this functionality while still getting the exact data you seek? Well, nobody said we could only apply one filter to text.
-=====The Stream Editor - sed=====
-Remember back when we played with **vi/vim**? Remember that useful search and replace command:
-<code>
-:%s/regex/replacement/g
-</code>
-That was quite useful. And luckily, we've got that same ability on the command line. Introducing "**sed**(**1**)", the stream editor.
-sed provides some of the features we've come to enjoy in vi, and is for all intents and purposes a non-interactive editor. One useful ability, however, is its ability to edit data streams (that is, **STDOUT**, including that generated from our command lines).
-Perhaps the most immediately useful command found in sed will be its search and replace, which is pretty much just like the **vi/vim** variant:
-<code>
-sed -e 's/regex/replacement/g'
-</code>
-However, if you look close, you will see that we did not include any sort of file to operate on. While we can, one of the other common uses of sed is to pop it in a command-line with everything else, stuck together with the all-powerful pipe (**|**).
-For example, so solve the above problem with the field separator:
-<cli>
-lab46:~$ echo "hello there:this:is:a:bunch of:text." | cut -d":" -f1,6 | sed -e 's/:/ /g'
-</cli>
-We used sed to replace any occurrence of the ":" with a single space.
-^  3.  ^|Answer me the following:|
-| ^  a.|Does the above command-line fix the problem from #2c?|
-|:::^  b.|If you wanted to change all "t"'s to uppercase "T"'s in addition to that, what would you do?|
-|:::^  c.|If you wanted to replace all the period symbols in the text with asterisks, how would you do it?|
-|:::^  d.|What does the resulting output look like?|
-=====From head(1) to tail(1)=====
-Two other utilities you may want to become acquainted with are the **head**(**1**) and **tail**(**1**) utilities.
-**head**(**1**) will allow you to print a specified number of lines from //1 to n//. So if you needed to print, say, the first 12 lines of a file, **head**(**1**) will be a good bet.
-For example, to display the first 4 lines of our sample database:
-<cli>
-lab46:~$ head -12 sample.db
-</cli>
-And, of course, adding it onto an existing command line using the pipe. In this example, the first two results of all the *ology Majors:
-<cli>
-lab46:~$ cat sample.db | grep "ology" | head -2
-</cli>
-See where we're going with this? We can use these utilities to put together massively powerful command-line incantations create all sorts of interesting filters.
-**tail**(**1**) works in the opposite end- starting at the end of the file and working backwards towards the beginning. So if you wanted to display the last 8 lines of a file, for example. **tail**(**1**) also has the nifty ability to continually monitor a file and update its output should the source file change. This is useful for monitoring log files that are continually updated.
-=====Translating characters with tr=====
-This is another useful tool to be familiar with. With **tr**(**1**), you can substitute any character or sequence of characters with another. The nice thing is that you can quickly use it to do end-of-line character translations, useful in converting text files from DOS format to UNIX or Mac format (or any combination therein).
-====ASCII file line endings====
-An important thing to be aware of is how the various systems terminate their lines. Check the following table:
-^  System  ^  Line Ending Character(s)  |
-|  DOS  |  Carriage Return, Line Feed (CRLF)  |
-|  Mac  |  Carriage Return (CR)  |
-|  UNIX  |  Line Feed (LF)  |
-So what does this mean to you? Well, if you have a file that was formatted with Mac-style line endings, and you're trying to read that file on a UNIX system, you may notice that everything appears as a single line at the top of the screen. This is because the Mac uses just Carriage Return to terminate its lines, and UNIX uses just Line Feeds... so the two are drastically incompatible for standard text display reasons.
-For example, let's say we have a UNIX file we wish to convert to DOS format. We would need to convert every terminating Line Feed to a Carriage Return & Line Feed combination (and take note that the Carriage Return needs to come first and then the Line Feed). We would do something that looks like this:
-<cli>
-lab46:~$ tr "\n" "\r\n" < file.unix > file.dos
-</cli>
-To interpret this:
-**\n** is the special escape sequence that we're all familiar with. In C, you can use it to issue an //end-of-line// character. So in UNIX, this represents a Line Feed (**LF**).
-**\r** is the special escape sequence that corresponds to a Carriage Return (**CR**).
-The first argument is the original sequence. The second is what we would like to replace it with. (in this case, replace every **LF** with a **CRLF** combination).
-Then, using UNIX I/O redirection operations, **file.unix** is redirected as input to **tr**(**1**), and **file.dos** is created and will contain the output.
-In the **filters/** subdirectory of the UNIX Public Directory you will find some text files in DOS, Mac, and UNIX format.
-^  4.  ^|Let's do some **tr**(**1**) conversions:|
-| ^  a.|Convert **file.mac** to UNIX format. Show me how you did this, as well as any interesting messages you find inside.|
-|:::^  b.|Convert **readme.unix** to DOS format. Same deal as above.|
-|:::^  c.|Convert **dos.txt** to Mac format. Show me the command-line used.|
-=====Procedure=====
-Looking back on our database (**sample.db** in the **filters/** subdirectory of the UNIX Public Directory), let's do some more operations on it:
-^  5.  ^|Develop, explain, and show me the command-lines for the following:|
-| ^  a.|How many unique //students// are there in the database?|
-|:::^  b.|How many unique //majors// are there in the database?|
-|:::^  c.|How many unique "favorite candies" in the database? (remove any trailing asterisks from the output)|
-<WRAP round info box>**__HINT__**: sort them in alphabetical order, and make sure there are no duplicates. Also- make sure you don't count the title banner as a "student". Also be sure to either omit the header, or have the header at the top of any provided output.
-</WRAP>
-^  6.  ^|Using the **pelopwar.txt** file from the **grep/** subdirectory of the UNIX Public Directory, construct filters to do the following:|
+^  4.  ^|Applying your skills to analyze **data.file**, do the following:|
-| ^  a.|Show me the first 22 lines of this file. How did you do this?|
+| ^  a.|How large (in bytes) is this file?|
-|:::^  b.|Show me the last 4 lines of this file. How did you do this?|
+|:::^  b.|What information predominantly appears to be in the first 3kB of the file?|
-|:::^  c.|Show me lines 32-48 of this file. How did you do this? (HINT: the last 16 lines of the first 48)|
+|:::^  c.|Does this information remain constant throughout the file? Are there ranges where it differs? What are they?|
-|:::^  d.|Of the last 12 lines in this file, show me the first 4. How did you do this?|
+|:::^  d.|How would you extract the data at one of these ranges and place it into its own file? Extract the data at each identified range.|
+|:::^  e.|How many such ranges of data are there in this file?|
+|:::^  f.|Run **file**(**1**) on each file that hosts extracted data. What is each type of file?|
+|:::^  g.|Based on the output of **file**(**1**), react accordingly to the data to unlock its functionality/data. Show me what you did.|
-Being familiar with the commands and utilities available to you on the system greatly increases your ability to construct effective filters, and ultimately solve problems in a more efficient and creative manner.
 =====Conclusions=====
