This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
haas:spring2014:unix:labs:laba [2013/11/06 17:45] – external edit 127.0.0.1 | haas:spring2014:unix:labs:laba [2014/03/23 19:10] (current) – wedge | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | <WRAP round box> | + | < |
- | \\ | + | < |
- | < | + | < |
- | \\ | + | <fs 125%>Lab 0xA: Filters</fs> |
- | < | + | |
- | \\ | + | |
- | \\ | + | |
- | Lab 0xA: Data Analysis with Regular Expressions and Scripting | + | |
- | \\ | + | |
- | \\ | + | |
- | </WRAP> | + | |
</ | </ | ||
+ | |||
~~TOC~~ | ~~TOC~~ | ||
=====Objective===== | =====Objective===== | ||
- | To continue to build on our knowledge | + | To become familiar with the concepts | ||
=====Reading===== | =====Reading===== | ||
- | Referencing | + | Check out the manual pages for the following utilities: |
- | * **grep**(**1**) | + | |
- | * **sed**(**1**) | + | * **cut**(**1**) - cut text |
- | * **awk**(**1**) | + | |
+ | * **head**(**1**) - print first " | ||
+ | * **sed**(**1**) | ||
+ | * **sort**(**1**) - sort output | ||
+ | * **tail**(**1**) - print last " | ||
+ | * **tr**(**1**) - translate characters | ||
+ | * **uniq**(**1**) - filter out duplicate lines from sorted file | ||
+ | * **wc**(**1**) | ||
- | =====Note===== | + | In " |
- | This lab is involved. Information obtained in early steps are built upon with increasingly complex functionality. | + | |
- | + | ||
- | If at any point something doesn't make sense, or you aren't getting the output you think you should be getting- **ask**. | + | |
- | + | ||
- | It is your responsibility | + | |
+ | * Chapter 16 (" | ||
+ | * Chapter 17 (" | ||
+ | * Chapter 18 (" | ||
+ | * Chapter 19 (" | ||
=====Background===== | =====Background===== | ||
+ | Filtering is a big deal in many areas that deal with information processing. Say you've got a database of produce for a grocery store, and you want to view JUST the information regarding the banana shipments. Instead of sorting through the entire database and picking out the data you want manually, why not put all the data through a filter and simply view the pertinent data? | ||
- | As we've been exploring Regular Expressions, Shell Scripting, and even the various tools on the system, you've been told that these are important building | + | <WRAP round info box>A filter, as defined |
+ | \\ | ||
+ | For example: a Web filter that screens out vulgar sites. | ||
+ | </ | ||
- | Now, we've amassed a considerable amount | + | UNIX provides some utilities that allow you to accomplish impressive amounts |
- | =====Problem Description===== | + | The next step is to apply shell scripts, which allow you to write " |
- | As students at CCC, a routine activity | + | |
- | On the main [[http:// | + | So, some basics of filtering: |
- | As it turns out, this functionality generates data in HTML format, yet it contains all the useful information | + | In order to do any sort of filtering, we need to know what we want to filter. Makes sense. |
- | This is one of those perfect examples that can be solved with our UNIX skills... the data we find available to us is in a form not immediately readable for our needs.. so what do we do when the universe doesn' | + | Before we employ filtering, |
- | =====Obtain the Data===== | + | |
- | In preparation for this exercise, I have taken the liberty of downloading a class listing for the Fall 2013 semester. This list contains all the courses offered at the primary college locations and Internet courses, and excludes ACE courses, and courses taught at non-primary college locations (high schools, etc.). | + | |
- | This file can be found in the **courselist/** subdirectory | + | The UNIX **cat**(**1**) utility is a general all-purpose tool that can be used to display |
- | We'll want to copy this file to your home directory. | + | Let's play with a sample database. In the **filters/ |
- | ^ 1. ^|Do the following: | + | Let's try some stuff out. |
- | | ^ a.|Copy the indicated file to your home directory. How did you do this?| | + | |
- | |:::^ b.|List the file. How large is it?| | + | |
- | |:::^ c.|What type of file is it? How did you determine this?| | + | |
- | |:::^ d.|We want the HTML data, so unravel this file to obtain that data. How did you do this?| | + | |
- | |:::^ e.|How large is the HTML data?| | + | |
- | |:::^ f.|What is the compression ratio achieved with this data?| | + | |
- | Now that we have a copy of the data, we can move on to studying it. | + | ====No filtering, or a filter that lets everything through==== |
- | =====Analyzing | + | Display |
- | The first step we must take when tackling a problem like this is to get an understanding of the data we are working with. Regular Expressions are cool and all, but they aren't useful unless we know what it is we are describing. | + | |
- | Our first task is to locate any common patterns in our data that we might be able to use to our advantage with Regular Expressions. | + | < |
+ | lab46:~$ cat sample.db | ||
+ | </ | ||
- | ^ 2. ^|Viewing the HTML file in **vi**, answer me the following: | + | This is the simplest form of filtering possible-- none at all. All the data in the text file is passed |
- | | ^ a.|This file contains courses offered next semester. Search for the course entry for "CSIT 2044". How did you do this?| | + | |
- | |:::^ b.|Comparing | + | |
- | As it stands, each course has an information string as follows (I'll use UNIX as an example): | + | Even at this stage we can do some useful things with the data. For example, if we wanted to find out how many lines were in the database: |
- | ^ UNIX/Linux Fundamentals - 92629 - CSCS 1730 - 001 | | + | <cli> |
- | + | lab46:~$ cat sample.db | wc -l | |
- | After the initial HTML data, we get actually course data we are interested in... there' | + | </cli> |
- | + | ||
- | - Course Title | + | |
- | - Course Reference Number (CRN) | + | |
- | - Course Prefix/ | + | |
- | - Course Section | + | |
- | + | ||
- | Check out some other courses and verify that this pattern holds true. The actual data will vary, but the pattern/ | + | |
- | + | ||
- | =====Isolating the Course Information Strings===== | + | |
- | Although there' | + | |
- | + | ||
- | Using the UNIX class again as an example, the actual line in question is as follows: | + | |
- | + | ||
- | <code html> | + | |
- | <th class=" | + | |
- | </ | + | |
- | + | ||
- | If we go and look at another class, say ARTS 1030, we see the following: | + | |
- | + | ||
- | <code html> | + | |
- | <th class=" | + | |
- | </ | + | |
- | + | ||
- | and GOVT 1010, we see the following: | + | |
- | + | ||
- | <code html> | + | |
- | <th class=" | + | |
- | </code> | + | |
- | In context, these lines are surrounded by other lines of information, | + | The database will display to STDOUT |
- | < | + | < |
- | </ | + | name:sid:major:year: |
- | < | + | |
- | < | + | |
- | </ | + | |
- | </ | + | |
- | < | + | |
- | <th class=" | + | |
- | </ | + | |
- | < | + | |
- | <td class=" | + | |
- | <span class=" | + | |
- | < | + | |
- | <span class=" | + | |
</ | </ | ||
- | Each class should be in a similar situation. The line containing the course | + | With this information we can make some important observations about the structure of the database: |
- | ^ 3. ^|Through analyzing the data, answer me the following: | + | * fields are separated by a colon (:) |
- | | ^ | + | |
- | |:::^ b.|Perform | + | |
- | |:::^ c.|Hit **n** to go to the next match. And hit **n** again. And again. Are you consistently hitting the course information line for each course?| | + | |
- | <WRAP round warning box>You absolutely need to have a correctly working pattern | + | To be effective |
+ | ====keyword filtering==== | ||
- | Return to the command prompt. Time to start prototyping our solution. | + | Ok, so let us filter some of this information: |
- | We'll want to come up with a command-line that isolates those course information lines for us. A prototype for that command-line will look something like this (substitute your working RegEx pattern in place of the string " | + | Find all the students who are in //Biology//: |
<cli> | <cli> | ||
- | lab46:~$ cat fall2013-20110417.html | grep ' | + | lab46:~$ cat sample.db | grep Biology |
</ | </ | ||
- | When you put in the same pattern you came up with while searching | + | We can do more complicated searches too: |
+ | |||
+ | Find all the students who are in Biology AND like Lollipops: | ||
<cli> | <cli> | ||
- | <th class=" | + | lab46:~$ cat sample.db | grep Biology | grep Lollipops |
- | <a href=" | + | |
- | <th class=" | + | |
- | <a href=" | + | |
- | <th class=" | + | |
- | <a href=" | + | |
- | <th class=" | + | |
- | <a href=" | + | |
- | <th class=" | + | |
- | <a href=" | + | |
- | <th class=" | + | |
- | <a href=" | + | |
- | : | + | |
</ | </ | ||
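As a rough sketch of how chained keyword filters behave, the following builds a small stand-in database and runs it through one and then two **grep**(**1**) stages. The file name **demo.db** and every record in it are made up for illustration; they are not the contents of **sample.db**.

```shell
# Hypothetical stand-in data; the real sample.db will differ.
cat > demo.db <<'EOF'
name:sid:major:year:candy
Alice:1001:Biology:Freshman:Lollipops
Bob:1002:Chemistry:Sophomore:Gummi Bears
Carol:1003:Biology:Junior:Snickers
Dave:1004:Geology:Senior:Lollipops
EOF

# One filter: keep only the lines containing "Biology"
bio=$(cat demo.db | grep Biology)
echo "$bio"

# Two filters in series behave like a logical AND
both=$(cat demo.db | grep Biology | grep Lollipops)
echo "$both"

# Count the matching lines instead of displaying them
count=$(cat demo.db | grep Biology | wc -l)
echo "$count"
```

Each stage only ever sees what the previous stage let through, which is why the order of the two **grep**s does not change the final result here.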
- | Because we piped our output to **less**(**1**), | + | ^ 1. ^|Perform the following searches on the database: |
+ | | ^ a.|Find all the students that are a // | ||
+ | |:::^ b.|Same as above but in // | ||
+ | |:::^ c.|Any duplicate entries? Remove any duplicates.| | ||
+ | |:::^ d.|Using the **wc**(**1**) | ||
- | What we're interested in at this point is that the data that is being produced all seems to match those lines in the file that contain the course information string. | + | Be sure to give me the command-line incantations you came up with, and any observations you made. |
- | =====Filtering unnecessary data===== | + | ====filter for manipulation==== |
- | When you're satisfied with the information your pattern and resultant **grep** search has produced, our next step is to refine the information-- to make it more readable. | + | |
- | To do this, we will make use of the **sed**(**1**) utility, which is a //stream editor//; it allows us to take the output and perform edits on it, much like we could in a text editor, | + | So we've done some simple searches on our database. We've filtered | ||
- | If you recall from our explorations of **vi**, it has a //search and replace// capability that proved to be rather powerful. **sed**(**1**) | + | The **cut**(**1**) |
- | <code bash> | + | It relies on a thing called a field-separator, |
- | cat FILE | grep ' | + | |
- | </ | + | |
- | Of specific interest to us are the **s**, **PATTERN**, **REPLACEMENT** and **g** options to **sed**(**1**). They have the following functionality: | + | Using the "**-d**" argument to cut, we can specify the field separator in our data. The "**-f**" option then selects which fields to display, counted according to that field separator. | ||
- | * **s**: Invoke | + | So, looking at the following text: |
- | * **PATTERN**: | + | |
- | * **REPLACEMENT**: This field is what we wish to replace any matched text with. | + | |
- | * **g**: Not necessarily needed in all cases, the **g** indicates we wish to perform this search and replace globally for **all** occurrences on a line. I'd recommend getting in the habit of using it, and then recognizing when you don't want to use it. | + | |
- | So, looking at the data leading up to the course information we're interested in, can we come up with a pattern to describe it? I think so. | + | < |
+ | hello there: | ||
+ | </ | ||
- | ^ 4. ^|Craft a RegEx pattern | + | Looking at this example, we can see that ":" would make for an excellent field separator. |
- | | ^ a.|Starts at the beginning of the line.| | + | |
- | |:::^ b.|Goes until it encounters some unique text just before our desired information.| | + | |
- | |:::^ c.|Specifically describe the pattern of the data just before our desired information.| | + | |
- | |:::^ d.|What is your pattern?| | + | |
- | To test your pattern, you'll want to do the following: | + | With ":" |
- | <cli> | + | ^ Field 1 ^ Field 2 ^ Field 3 ^ Field 4 ^ Field 5 ^ Field 6 | |
- | lab46:~$ cat fall2013-20110417.html | + | | hello there |
- | </ | + | |
- | Where **PATTERN** is a new Regular Expression pattern that successfully matches the beginning of the lines we're interested in (actually all that **grep** is producing at this point), and replacing it with nothing (the two consecutive slashes indicate we're not interested in replacing the matched data with anything). | ||
- | If successful your output should appear as follows | + | We can test these properties out by using **cut**(**1**) on the command-line: |
<cli> | <cli> | ||
- | Accounting Practices - 81559 - ACCT 1000 - 001</a></ | + | lab46:~$ echo "hello there: |
- | Accounting Practices - 82350 - ACCT 1000 - 003</ | + | |
- | Financial Accounting - 82355 - ACCT 1030 - 001</ | + | |
- | Financial Accounting - 81558 - ACCT 1030 - 002</ | + | |
- | Financial Accounting - 81107 - ACCT 1030 - 003</ | + | |
- | Financial Accounting - 81108 - ACCT 1030 - 004</ | + | |
- | Financial Accounting - 81173 - ACCT 1030 - 005</ | + | |
- | Financial Accounting - 82115 - ACCT 1030 - 006</ | + | |
- | Managerial Accounting - 82078 - ACCT 1040 - 003</ | + | |
- | Accounting Procedures - 81123 - ACCT 1050 - 001</ | + | |
- | Federal Income Tax - 81783 - ACCT 1100 - 001</ | + | |
- | Federal Income Tax - 82358 - ACCT 1100 - 002</ | + | |
- | Intermediate Accounting I - 81124 - ACCT 2030 - 001</ | + | |
- | Intermediate Accounting I - 82359 - ACCT 2030 - 002</ | + | |
- | Computerized Accounting - 82361 - ACCT 2100 - 001</ | + | |
- | Cultural Anthropology - 81139 - ANTH 2120 - 001</ | + | |
- | Elem Mod Stand Arabic Con& | + | |
- | Elem Mod Arabic Con& | + | |
- | Introduction Art Appreciation - 81505 - ARTS 1004 - 002</ | + | |
- | Drawing I - 81771 - ARTS 1030 - 001</ | + | |
- | Drawing I - 82176 - ARTS 1030 - 002</ | + | |
- | Drawing I - 82112 - ARTS 1030 - 003</ | + | |
- | Drawing I - 81503 - ARTS 1030 - 004</ | + | |
- | Ceramics I - 81151 - ARTS 1210 - 001</ | + | |
- | Ceramics I - 82110 - ARTS 1210 - 002</ | + | |
- | Ceramics I - 81504 - ARTS 1210 - 003</ | + | |
- | Ceramics I - 82134 - ARTS 1210 - 004</ | + | |
- | Ceramics I - 81176 - ARTS 1210 - 005</ | + | |
- | Basic Black & White Photo - 81873 - ARTS 1220 - 001</ | + | |
- | Basic Black & White Photo - 81874 - ARTS 1220 - 002</ | + | |
- | Basic Black & White Photo - 81875 - ARTS 1220 - 003</ | + | |
- | History/ | + | |
- | : | + | |
</ | </ | ||
- | Our **sed** should have successfully stripped off the leading HTML text that we're uninterested in. Once that happens, suddenly our data becomes that much more readable. | + | Where # is a specific field, a list of fields, or a range of fields (e.g. **-f2** or **-f2,4** or **-f1-3**). | ||
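The three forms of the field argument can be tried out on a throwaway record. This is only a sketch: the record below is invented for the example and has nothing to do with the lab text or **sample.db**.

```shell
# Hypothetical colon-delimited record (made up for illustration)
record="banana:yellow:0.25:fruit:Ecuador"

# A single field
one=$(echo "$record" | cut -d":" -f2)
echo "$one"

# A contiguous range of fields
range=$(echo "$record" | cut -d":" -f1-3)
echo "$range"

# A list of individual fields
picked=$(echo "$record" | cut -d":" -f1,4)
echo "$picked"
```

Fields are counted from 1, starting at the left edge of the line.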
- | Note that there' | + | ^ 2. ^|Let' |
+ | | ^ a.|What would the following command-line display: | ||
+ | |:::^ b.|If you wanted to get "hello there text." to display to the screen, what manipulation to the text would you have to do?| | ||
+ | |:::^ c.|Did your general attempt work? Is there extra information? | ||
- | < | + | If you found that extra information showed up when you tried to do that last part- taking |
- | lab46:~$ cat fall2013-20110417.html | grep ' | + | |
- | Accounting Practices - 81559 - ACCT 1000 - 001 | + | |
- | Accounting Practices - 82350 - ACCT 1000 - 003 | + | |
- | Financial Accounting - 82355 - ACCT 1030 - 001 | + | |
- | Financial Accounting - 81558 - ACCT 1030 - 002 | + | |
- | Financial Accounting - 81107 - ACCT 1030 - 003 | + | |
- | Financial Accounting - 81108 - ACCT 1030 - 004 | + | |
- | Financial Accounting - 81173 - ACCT 1030 - 005 | + | |
- | Financial Accounting - 82115 - ACCT 1030 - 006 | + | |
- | Managerial Accounting - 82078 - ACCT 1040 - 003 | + | |
- | Accounting Procedures - 81123 - ACCT 1050 - 001 | + | |
- | Federal Income Tax - 81783 - ACCT 1100 - 001 | + | |
- | Federal Income Tax - 82358 - ACCT 1100 - 002 | + | |
- | Intermediate Accounting I - 81124 - ACCT 2030 - 001 | + | |
- | Intermediate Accounting I - 82359 - ACCT 2030 - 002 | + | |
- | Computerized Accounting - 82361 - ACCT 2100 - 001 | + | |
- | Cultural Anthropology - 81139 - ANTH 2120 - 001 | + | |
- | Elem Mod Stand Arabic Con& | + | |
- | Elem Mod Arabic Con& | + | |
- | Introduction Art Appreciation - 81505 - ARTS 1004 - 002 | + | |
- | Drawing I - 81771 - ARTS 1030 - 001 | + | |
- | Drawing I - 82176 - ARTS 1030 - 002 | + | |
- | Drawing I - 82112 - ARTS 1030 - 003 | + | |
- | Drawing I - 81503 - ARTS 1030 - 004 | + | |
- | Ceramics I - 81151 - ARTS 1210 - 001 | + | |
- | Ceramics I - 82110 - ARTS 1210 - 002 | + | |
- | Ceramics I - 81504 - ARTS 1210 - 003 | + | |
- | Ceramics I - 82134 - ARTS 1210 - 004 | + | |
- | Ceramics I - 81176 - ARTS 1210 - 005 | + | |
- | Basic Black & White Photo - 81873 - ARTS 1220 - 001 | + | |
- | Basic Black & White Photo - 81874 - ARTS 1220 - 002 | + | |
- | Basic Black & White Photo - 81875 - ARTS 1220 - 003 | + | |
- | History/ | + | |
- | </ | + | |
- | In the provided expression, the following happens: | + | If you tell **cut**(**1**) to display any fields that aren't immediately next to one another, it will insert the field separator to indicate |
- | * The pattern **< | + | So how do you keep this functionality while still getting |
- | * We replace that matched pattern with NOTHING. | + | =====The Stream Editor - sed===== |
- | <WRAP round info box>Note the presence of the backslash | + | Remember back when we played with **vi/vim**? Remember that useful search |
- | The result should be as appears in the sample above... no HTML data, just real readable course information. | + | < |
+ | : | ||
+ | </ | ||
- | ^ 5. | + | That was quite useful. And luckily, we've got that same ability on the command line. Introducing "**sed**(**1**)", the stream editor. |
- | | ^ a.|Of this list, how many courses is CCC offering next semester? | + | |
- | |:::^ b.|How did you produce this result?| | + | |
- | |:::^ c.|How many **CSCS** classes is CCC offering next semester? How did you find this?| | + | |
- | |:::^ d.|How did you produce this result?| | + | |
- | |:::^ e.|How many upper level (2000 and above) | + | |
- | |:::^ f.|How did you produce this result?| | + | |
- | |:::^ g.|Who is offering more courses next semester, the English or Math department? | + | |
- | |:::^ h.|How did you produce this result?| | + | |
- | Hopefully you're starting | + | sed provides some of the features we've come to enjoy in vi, and is for all intents and purposes a non-interactive editor. One useful ability, however, is its ability |
- | Once in that format, we can then perform some more valuable tasks on that data. | + | Perhaps the most immediately useful command found in sed will be its search and replace, which is pretty much just like the **vi/vim** variant: |
- | =====Data Analysis===== | + | < |
- | In the **courselist/** subdirectory of the UNIX Public Directory are some additional files of value: | + | sed -e 's/regex/ |
+ | </ | ||
- | * fall2010-20100315.html.gz | + | However, if you look closely, you will see that we did not include any sort of file to operate on. While we can name one, another common use of sed is to pop it in a command-line with everything else, stuck together with the all-powerful pipe (**|**). | ||
- | * fall2010-20101113.html.gz | + | |
- | * fall2011-20110417.html.gz | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | * spring2011-20101113.html.gz | + | |
- | * winter2011-20101113.html.gz | + | |
- | Each of these files contains a snapshot of semester course information, referenced by semester, and snapshot date. Please make a copy of these additional files, uncompress them, and let's create a script to perform some data analysis. | + | For example, to solve the above problem with the field separator: | ||
- | ^ 6. ^|Write a script that does the following:| | + | <cli> |
- | | ^ a.|Accepts 1 or more of these files as an argument.| | + | lab46:~$ echo "hello there:this:is:a:bunch of:text." |
- | |:::^ b.|If no files are specified, display an error with usage information and exit.| | + | </ |
- | |:::^ c.|If one file is given, perform the logic we've done manually on the command-line to produce and display the total number of courses offered in the given semester' | + | |
- | |:::^ d.|If two files are given, and are both for the same semester+year, | + | |
- | |::: | + | |
- | |:::^ f.|If more than two semesters are listed, do the same, but **also** display the totals for MATH, CSCS, BIOL, and PFIT.| | + | |
- | |:::^ g.|Provide a copy of your script.| | + | |
- | ^ 7. ^|As you are playing with the different course data files:| | + | We used sed to replace |
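To see the search and replace at work inside a pipeline, here is a minimal sketch. The input line is invented for the example; the point is the difference the trailing **g** makes, and how **sed** can tidy up what **cut** leaves behind.

```shell
# Hypothetical input line (made up for illustration)
line="2014-03-23:lab46:filters:done."

# Without g, only the first ":" on the line is replaced
first=$(echo "$line" | sed -e 's/:/ /')
echo "$first"

# With g, every ":" on the line is replaced
all=$(echo "$line" | sed -e 's/:/ /g')
echo "$all"

# cut picks the fields, then sed swaps the leftover separator for a space
tidy=$(echo "$line" | cut -d":" -f1,4 | sed -e 's/:/ /g')
echo "$tidy"
```

Since no file is named, **sed** reads whatever arrives on STDIN, which is exactly what makes it fit so naturally after a pipe.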
- | | ^ a.|Comparing fall2011 | + | |
- | |:::^ b.|Do any of the files seem to break your logic?| | + | |
- | |:::^ c.|Which one(s)?| | + | |
- | |:::^ d.|Comparing a "working" | + | |
- | |:::^ e.|Between which two snapshot dates did this change seem to take place?| | + | |
- | |:::^ f.|What can you surmise as being a cause of this change?| | + | |
- | |:::^ g.|Could you adapt your script to handle the two different formats of data? How would you do this?| | + | |
- | |:::^ h.|Provide a copy of your updated script.| | + | |
- | <WRAP round info box> | + | ^ 3. ^|Answer me the following:| |
+ | | ^ a.|Does the above command-line fix the problem from #2c?| | ||
+ | |:::^ b.|If you wanted | ||
+ | |:::^ c.|If you wanted to replace all the period symbols in the text with asterisks, how would you do it?| | ||
+ | |:::^ d.|What does the resulting output look like?| | ||
- | There are a ton of questions we could ask of this data: | + | =====From head(1) to tail(1)===== |
- | | + | Two other utilities you may want to become acquainted with are the **head**(**1**) and **tail**(**1**) utilities. |
- | | + | |
- | | + | |
- | Then there are some questions that, with our current skill level, may cause us a bit of trouble: | + | **head**(**1**) will allow you to print a specified number of lines from //1 to n//. So if you needed to print, say, the first 12 lines of a file, **head**(**1**) will be a good bet. |
- | * What is the range of CRN numbers for a given semester? (Lowest through highest) | + | For example, to display |
- | * Which course prefix has the MOST offerings a given semester? | + | |
- | * Which course prefix has the LEAST offerings a given semester? | + | |
- | * Which course prefix offered the MOST remedial course offerings? | + | |
- | While we may be able to derive answers to these questions... in some respects the data is not conveniently arranged for our analysis purposes. At the moment we have our data in the following format: | + | < |
+ | lab46:~$ head -12 sample.db | ||
+ | </ | ||
- | ^ UNIX/Linux Fundamentals - 81769 - CSCS 1730 - 001 ^ | + | You can, of course, also add it onto an existing command line using the pipe. In this example, we display the first two results of all the *ology Majors: | ||
- | And to answer some of these questions, especially when **grep**' | + | < |
+ | lab46:~$ cat sample.db | grep " | ||
+ | </ | ||
- | ^ 78400:CSCS 1730-001: | + | See where we're going with this? We can use these utilities to put together massively powerful command-line incantations to create all sorts of interesting filters. | ||
- | So once again our data may not be exactly | + | **tail**(**1**) works from the opposite end, starting at the end of the file and working backwards towards the beginning. | ||
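Putting the two together lets you carve a block out of the middle of some input. The sketch below uses **seq**(**1**) purely to generate twenty numbered lines as stand-in data, and uses the **-n** form of the line count, which should behave the same as the shorter **-12** style shown above.

```shell
# seq is used here only to produce the lines 1..20 as stand-in input
first3=$(seq 1 20 | head -n 3)
echo "$first3"

last3=$(seq 1 20 | tail -n 3)
echo "$last3"

# Lines 8 through 10: take the first 10, then the last 3 of those
middle=$(seq 1 20 | head -n 10 | tail -n 3)
echo "$middle"
```

The count handed to **tail** is the size of the block you want (last line minus first line, plus one), not the number of the line where it starts.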
+ | =====Translating characters with tr===== | ||
- | =====Rearranging Data with Regular Expressions===== | + | This is another useful tool to be familiar with. With **tr**(**1**), |
- | I consider where we are at now to be amongst some of the most powerful of concepts we learn in this class. What we are going to do now hopefully should take the cake and illustrate the true potential | + | ====ASCII file line endings==== |
- | To do our next trick, we need to study our data once again: | + | An important thing to be aware of is how the various systems terminate their lines. Check the following table: |
- | ^ | + | ^ |
+ | | DOS | Carriage Return, Line Feed (CRLF) | ||
+ | | Mac | Carriage Return (CR) | | ||
+ | | UNIX | Line Feed (LF) | | ||
- | As you can see, the information as it is currently | + | So what does this mean to you? Well, if you have a file that was formatted |
- | ^ Current: | + | For example, let's say we have a UNIX file we wish to convert to DOS format. We would need to convert every terminating Line Feed to a Carriage Return & Line Feed combination (and take note that the Carriage Return needs to come first and then the Line Feed). We would do something that looks like this: |
- | ^ Desired: | + | |
- | + | ||
- | So how could we do this? To accomplish this task, we need to explore another RegEx capability and exercise our options in the **sed** REPLACEMENT field. | + | |
- | + | ||
- | ^ 8. ^|With our data in the current structure: | + | |
- | | ^ | + | |
- | |:::^ b.|Derive a RegEx pattern that will match the CRN up to the second "space dash space" | + | |
- | |:::^ c.|Derive | + | |
- | |:::^ d.|Finally, round out with a fourth RegEx pattern that matches the **Section**, | + | |
- | + | ||
- | For my examples, I'll name your patterns REGEX1, REGEX2, REGEX3, | + | |
- | + | ||
- | In order to rearrange our data, we need to effectively describe | + | |
- | + | ||
- | Check this out: | + | |
<cli> | <cli> | ||
- | lab46: | + | lab46: |
- | lab46: | + | |
</ | </ | ||
- | Notice what we just did here... we took our information in its current form of filtering and output it to a file (called **output**), | + | To interpret this: |
- | That should make sense, we're just using **I/O Redirection** to send the output of that pipelined command-line to a file instead of to **STDOUT**. | + | **\n** is the special escape sequence |
- | Feel free to make use of similar output junctures during | + | **\r** is the special escape sequence that corresponds |
- | Moving on: | + | The first argument is the original sequence. The second is what we would like to replace it with. (in this case, replace every **LF** with a **CRLF** combination). |
- | < | + | Then, using UNIX I/O redirection operators, **file.unix** is redirected as input to **tr**(**1**), and **file.dos** is created and will contain the output. | ||
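The conversions can be checked on tiny files built with **printf**. This is a sketch under a few assumptions: the **demo.*** and ***.out** file names are made up, and because **tr**(**1**) translates one character into one character, the one-to-two LF to CRLF direction is handled here by having **sed** append a Carriage Return to each line instead.

```shell
# Build tiny stand-in files (names are hypothetical)
printf 'one\r\ntwo\r\n' > demo.dos
printf 'one\rtwo\r'     > demo.mac
printf 'one\ntwo\n'     > demo.unix

# DOS -> UNIX: delete every Carriage Return
tr -d '\r' < demo.dos > dos2unix.out

# Mac -> UNIX: translate each Carriage Return into a Line Feed
tr '\r' '\n' < demo.mac > mac2unix.out

# UNIX -> DOS: append a Carriage Return to the end of every line with sed
cr=$(printf '\r')
sed -e "s/\$/$cr/" < demo.unix > unix2dos.out

# cmp exits 0 when two files are byte-for-byte identical
cmp dos2unix.out demo.unix && echo "dos2unix ok"
cmp mac2unix.out demo.unix && echo "mac2unix ok"
cmp unix2dos.out demo.dos  && echo "unix2dos ok"
```

Comparing against a known-good file with **cmp**(**1**) is a quick way to confirm that a conversion did what you expected, since line endings are invisible in most displays.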
- | lab46:~$ cat output | sed 's/^\(REGEX1\) - \(REGEX2\) - \(REGEX3\) - \(REGEX4\)$/ | + | |
- | 81559:ACCT 1000-001: | + | |
- | 82350:ACCT 1000-003: | + | |
- | 82355:ACCT 1030-001: | + | |
- | 81558:ACCT 1030-002: | + | |
- | 81107:ACCT 1030-003: | + | |
- | 81108:ACCT 1030-004: | + | |
- | 81173:ACCT 1030-005: | + | |
- | 82115:ACCT 1030-006: | + | |
- | 82078:ACCT 1040-003: | + | |
- | 81123:ACCT 1050-001: | + | |
- | 81783:ACCT 1100-001: | + | |
- | 82358:ACCT 1100-002: | + | |
- | 81124:ACCT 2030-001: | + | |
- | 82359:ACCT 2030-002: | + | |
- | 82361:ACCT 2100-001: | + | |
- | 81139:ANTH 2120-001: | + | |
- | 82296:ARAB 1010-001: | + | |
- | 82297:ARAB 1010-071: | + | |
- | 81505:ARTS 1004-002: | + | |
- | 81771:ARTS 1030-001: | + | |
- | 82176:ARTS 1030-002: | + | |
- | 82112:ARTS 1030-003: | + | |
- | 81503:ARTS 1030-004: | + | |
- | 81151:ARTS 1210-001: | + | |
- | 82110:ARTS 1210-002: | + | |
- | 81504:ARTS 1210-003: | + | |
- | 82134:ARTS 1210-004: | + | |
- | 81176:ARTS 1210-005: | + | |
- | 81873:ARTS 1220-001: | + | |
- | 81874:ARTS 1220-002: | + | |
- | 81875:ARTS 1220-003: | + | |
- | 81180:ARTS 1310-001: | + | |
- | </ | + | |
- | <WRAP round warning box>**__NOTE:__** If the format of your data does not seem to change, | + | In the **filters/** subdirectory of the UNIX Public Directory |
- | Once you get it--- **WOW!** the data changed, just the way we wanted. Now we can do further analysis and write shell scripts that better assist us in our tasks. **Activities like this is what separates someone who can effectively command technology | + | ^ 4. ^|Let' |
+ | | ^ a.|Convert **file.mac** to UNIX format. Show me how you did this, as well as any interesting messages | ||
+ | |:::^ b.|Convert **readme.unix** to DOS format. Same deal as above.| | ||
+ | |:::^ c.|Convert **dos.txt** | ||
+ | =====Procedure===== | ||
+ | Looking back on our database (**sample.db** in the **filters/ | ||
- | That is the power of Regular Expressions. We can effectively delegate the manual labor to the computer, which is very good at manual (and menial) tasks, and is great at following | + | ^ 5. |
+ | | ^ a.|How many unique // | ||
+ | |:::^ b.|How many unique //majors// are there in the database? | ||
+ | |:::^ c.|How many unique " | ||
- | Plus, the less we are involved at the grunt-work level, the less chance | + | <WRAP round info box> |
- | =====Additional Data Wrangling===== | + | </ |
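One pattern worth having in your toolbox for "how many unique" questions is **sort**(**1**) followed by **uniq**(**1**). The sketch below runs on a made-up list of values rather than on **sample.db**, so the numbers it prints say nothing about the real database.

```shell
# Hypothetical list with repeats (not taken from sample.db)
majors=$(printf 'Biology\nChemistry\nBiology\nGeology\nChemistry\n')

# uniq only collapses duplicates that are adjacent, so sort first
unique=$(echo "$majors" | sort | uniq)
echo "$unique"

# Count the distinct values
howmany=$(echo "$majors" | sort | uniq | wc -l)
echo "$howmany"
```

Leaving out the **sort** would let non-adjacent repeats slip through, since **uniq** compares each line only with the one directly before it.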
- | + | ||
- | To cap off our experience, let's do one last foray into rearranging our data. | + | |
- | ^ | + | ^ |
- | | ^ a.|PREFIX NUMBER-SECTION(CRN): | + | | ^ a.|Show me the first 22 lines of this file. How did you do this?| |
- | |:::^ b.|PREFIX NUMBER:CRN (omit the section and title)| | + | |:::^ b.|Show me the last 4 lines of this file. How did you do this?| |
- | |:::^ c.|PREFIXNUMBER-SECTION: | + | |:::^ c.|Show me lines 32-48 of this file. How did you do this? (HINT: the last 17 lines of the first 48)| | ||
+ | |:::^ d.|Of the last 12 lines in this file, show me the first 4. How did you do this?| | ||
- | <WRAP round info box> | + | Being familiar |
=====Conclusions===== | =====Conclusions===== |