Associated Term: Spring 2011
Registration Dates: Oct 17, 2010 to May 22, 2011
Each class should follow a similar layout: the line containing the course information is surrounded by lines that carry other information. Whether useful or useless, that other data is not what we are presently interested in locating.
^ 3. ^|Through analyzing the data, answer me the following:|
| ^ a.|If we wanted to perform a search that would only hit the course information lines (ie a pattern that would match just that line, and match that line for each course in the file), what does the RegEx pattern look like?|
|:::^ b.|Perform the search in **vi** (using **/**, verify that it hits that line in some course). Does it snap to the appropriate line?|
|:::^ c.|Hit **n** to go to the next match. And hit **n** again. And again. Are you consistently hitting the course information line for each course?|
You absolutely need to have a correctly working pattern in order to proceed. If you have ANY questions, please ask them. This lab will fail to cooperate with you if your pattern is not adequate.
Return to the command prompt. Time to start prototyping our solution.
We'll want to come up with a command-line that isolates those course information lines for us. A prototype for that command-line will look something like this (substitute your working RegEx pattern in place of the string "REGEX" in the example below):
lab46:~$ cat spring2011-20101105.html | grep 'REGEX' | less
When you put in the same pattern you came up with while searching in **vi**, your screen should be filled with data that looks like this (and much much more):
Accounting Practices - 78010 - ACCT 1000 - 001 |
Financial Accounting - 78014 - ACCT 1030 - 002 |
Financial Accounting - 78577 - ACCT 1030 - 005 |
Financial Accounting - 78016 - ACCT 1030 - 006 |
Managerial Accounting - 78573 - ACCT 1040 - 001 |
Managerial Accounting - 78017 - ACCT 1040 - 002 |
Managerial Accounting - 78372 - ACCT 1040 - 003 |
Accounting Procedures - 78019 - ACCT 1050 - 001 |
Accounting Procedures - 79108 - ACCT 1050 - 002 |
Federal Income Tax - 78027 - ACCT 1100 - 001 |
Cost Accounting - 78021 - ACCT 2050 - 001 |
Cost Accounting - 79100 - ACCT 2050 - 002 |
Computerized Accounting - 78026 - ACCT 2100 - 001 |
Cultural Anthropology - 78886 - ANTH 2120 - 002 |
Introduction Art Appreciation - 78493 - ARTS 1004 - 001 |
:
Because we piped our output to **less**(**1**), it stops after the first screenful of information. Pressing the down/up arrow keys or the space bar will navigate us through this data.
What matters at this point is that every line being produced appears to be one of the lines in the file that contains the course information string.
=====Filtering unnecessary data=====
Once you're satisfied with what your pattern and the resulting **grep** search have produced, our next step is to refine the information-- to make it more readable.
To do this, we will make use of the **sed**(**1**) utility, which is a //stream editor//; it allows us to take the output and perform edits on it, much like we could in a text editor, only we specify on the command-line the actual work we wish to perform.
If you recall from our explorations of **vi**, it has a //search and replace// capability that proved to be rather powerful. **sed**(**1**) also possesses this ability, and we can unlock it as follows:
cat FILE | grep 'REGEX' | sed 's/PATTERN/REPLACEMENT/g' | less
Of specific interest to us are the **s**, **PATTERN**, **REPLACEMENT** and **g** options to **sed**(**1**). They have the following functionality:
* **s**: Invoke the **sed**(**1**) //search and replace// command. By default the forward slash **/** is the field separator.
* **PATTERN**: The first field following the search command is the pattern we are looking for. In this case, we want to come up with a new pattern that will match a portion of the text we wish to get rid of.
* **REPLACEMENT**: This field is what we wish to replace any matched text with.
* **g**: Not necessarily needed in all cases, the **g** indicates we wish to perform this search and replace globally for **all** occurrences on a line. I'd recommend getting in the habit of using it, and then recognizing when you don't want to use it.
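To see what the **g** flag and an empty **REPLACEMENT** actually do, here is a minimal sketch run against a made-up string (it is not taken from the course file, and the variable names are only for illustration):

```shell
# Made-up sample text; not from the course file.
sample='a-b-c-d'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$sample" | sed 's/-/+/')

# With g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$sample" | sed 's/-/+/g')

# An empty REPLACEMENT deletes whatever PATTERN matched.
deleted=$(printf '%s\n' "$sample" | sed 's/-//g')

printf '%s\n%s\n%s\n' "$first_only" "$all_matches" "$deleted"
```

The three lines printed should be **a+b-c-d**, **a+b+c+d**, and **abcd**, in that order.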
So, looking at the data leading up to the course information we're interested in, can we come up with a pattern to describe it? I think so.
^ 4. ^|Craft a RegEx pattern that does the following:|
| ^ a.|Starts at the beginning of the line.|
|:::^ b.|Goes until it encounters some unique text just before our desired information.|
|:::^ c.|Specifically describe the pattern of the data just before our desired information.|
|:::^ d.|What is your pattern?|
To test your pattern, you'll want to do the following:
lab46:~$ cat spring2011-20101105.html | grep 'REGEX' | sed 's/PATTERN//g' | less
Here **PATTERN** is a new Regular Expression pattern that matches the beginning of the lines we're interested in (which is all that **grep** is producing at this point). We replace whatever it matches with nothing: the two consecutive slashes indicate an empty replacement.
If successful your output should appear as follows:
Accounting Practices - 78010 - ACCT 1000 - 001
Financial Accounting - 78014 - ACCT 1030 - 002
Financial Accounting - 78577 - ACCT 1030 - 005
Financial Accounting - 78016 - ACCT 1030 - 006
Managerial Accounting - 78573 - ACCT 1040 - 001
Managerial Accounting - 78017 - ACCT 1040 - 002
Managerial Accounting - 78372 - ACCT 1040 - 003
Accounting Procedures - 78019 - ACCT 1050 - 001
Accounting Procedures - 79108 - ACCT 1050 - 002
Federal Income Tax - 78027 - ACCT 1100 - 001
Cost Accounting - 78021 - ACCT 2050 - 001
Cost Accounting - 79100 - ACCT 2050 - 002
Computerized Accounting - 78026 - ACCT 2100 - 001
Cultural Anthropology - 78886 - ANTH 2120 - 002
Introduction Art Appreciation - 78493 - ARTS 1004 - 001
Drawing I - 78524 - ARTS 1030 - 001
Drawing I - 78943 - ARTS 1030 - 002
Drawing I - 78391 - ARTS 1030 - 003
Ceramics I - 78041 - ARTS 1210 - 001
Ceramics I - 78388 - ARTS 1210 - 002
Ceramics I - 78472 - ARTS 1210 - 003
Ceramics I - 78043 - ARTS 1210 - 004
Basic Black & White Photo - 78740 - ARTS 1220 - 001
Basic Black & White Photo - 78741 - ARTS 1220 - 002
History/Appreciation of Art I - 79029 - ARTS 1310 - 001
History/Appreciation of Art II - 79052 - ARTS 1320 - 001
History/Appreciation of Art II - 78659 - ARTS 1320 - 002
History/Appreciation of Art II - 78950 - ARTS 1320 - 003
Introduction to Digital Art - 78420 - ARTS 1400 - 001
Introduction to Digital Art - 78421 - ARTS 1400 - 002
:
Our **sed** should have successfully stripped off the leading HTML text that we're uninterested in. Once that happens, suddenly our data becomes that much more readable.
Note that there's still HTML data trailing our information. That can be addressed in another **sed** call:
lab46:~$ cat spring2011-20101105.html | grep 'REGEX' | sed 's/PATTERN//g' | sed 's/<\/A>.*$//g' | less
Accounting Practices - 78010 - ACCT 1000 - 001
Financial Accounting - 78014 - ACCT 1030 - 002
Financial Accounting - 78577 - ACCT 1030 - 005
Financial Accounting - 78016 - ACCT 1030 - 006
Managerial Accounting - 78573 - ACCT 1040 - 001
Managerial Accounting - 78017 - ACCT 1040 - 002
Managerial Accounting - 78372 - ACCT 1040 - 003
Accounting Procedures - 78019 - ACCT 1050 - 001
Accounting Procedures - 79108 - ACCT 1050 - 002
Federal Income Tax - 78027 - ACCT 1100 - 001
Cost Accounting - 78021 - ACCT 2050 - 001
Cost Accounting - 79100 - ACCT 2050 - 002
Computerized Accounting - 78026 - ACCT 2100 - 001
Cultural Anthropology - 78886 - ANTH 2120 - 002
In the provided expression, the following happens:
* The pattern **<\/A>.*$** explicitly matches the closing **A** tag, and then matches whatever follows until the end of the line.
* We replace that matched pattern with NOTHING.
Note the presence of the backslash **\** before the closing slash of the **A** tag. This is needed because the forward slash **/** is the default field separator in **sed**(**1**), and to avoid the error of prematurely terminating the field, we use the backslash to escape it in order to match a literal forward slash.
The result should be as appears in the sample above... no HTML data, just real readable course information.
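On the matter of escaping the forward slash: since **/** is only the //default// separator, **sed**(**1**) will also accept a different separator character after the **s**, in which case the slash needs no backslash. The sketch below compares both forms on one hypothetical line; the trailing **</TH>** tag is an assumption for illustration, and the markup in the real file may differ.

```shell
# Hypothetical line with trailing markup; the real file's markup may differ.
line='Cost Accounting - 78021 - ACCT 2050 - 001</A></TH>'

# Default / separator: the slash inside the pattern must be escaped.
escaped=$(printf '%s\n' "$line" | sed 's/<\/A>.*$//g')

# Alternate separator (here |): the slash is just an ordinary character.
alternate=$(printf '%s\n' "$line" | sed 's|</A>.*$||g')

printf '%s\n%s\n' "$escaped" "$alternate"
```

Both commands should print the same cleaned-up line. Which form to use is a matter of taste; the alternate separator tends to be easier to read when a pattern contains many slashes.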
^ 5. ^|Perform some data mining for me:|
| ^ a.|Of this list, how many courses is CCC offering next semester?|
|:::^ b.|How did you produce this result?|
|:::^ c.|How many **CSCS** classes is CCC offering next semester?|
|:::^ d.|How did you produce this result?|
|:::^ e.|How many upper level (2000 and above) **ENGL** classes are being offered next semester?|
|:::^ f.|How did you produce this result?|
|:::^ g.|Who is offering more courses next semester, the English or Math department?|
|:::^ h.|How did you produce this result?|
Hopefully you're starting to see the value in what the Regular Expressions have enabled for us-- we were able to take raw data in some arbitrary format, and through analyzing it, adequately whittle away at it until it becomes a format readable to us.
Once in that format, we can then perform some more valuable tasks on that data.
=====Data Analysis=====
In the **courselist/** subdirectory of the UNIX Public Directory are some additional files of value:
* fall2010-20100315.html.gz
* fall2010-20101113.html.gz
* spring2010-20091022.html.gz
* spring2010-20101113.html.gz
* spring2011-20101113.html.gz
* winter2011-20101113.html.gz
Each of these files contains a snapshot of semester course information, referenced by semester, and snapshot date. Please make a copy of these additional files, uncompress them, and let's create a script to perform some data analysis.
^ 6. ^|Write a script that does the following:|
| ^ a.|Accepts 1 or more of these files as an argument.|
|:::^ b.|If no files are specified, display an error with usage information and exit.|
|:::^ c.|If one file is given, perform the logic we've done manually on the command-line to produce and display the total number of courses offered in the given semester's course file.|
|:::^ d.|If two files are given, and are both for the same semester+year, display the totals for each file, and if the numbers do not match, display how both files differ (in an attempt to show what change took place).|
|:::^ e.|If two files are given, and are **not** the same semester+year, display the totals for each semester, and display how many English courses are being offered in each of the files.|
|:::^ f.|If more than two files are listed, do the same, but **also** display the totals for MATH, CSCS, BIOL, and PFIT.|
|:::^ g.|Provide a copy of your script on the submission form.|
^ 7. ^|As you are playing with the different course data files:|
| ^ a.|Comparing spring2010 to spring2011, which semester offered more courses?|
|:::^ b.|Do any of the files seem to break your logic?|
|:::^ c.|Which one(s)?|
|:::^ d.|Comparing a "working" file to a "nonworking" one, what seems to be a difference that trips up your patterns?|
|:::^ e.|Between which two snapshot dates did this change seem to take place?|
|:::^ f.|What can you surmise as being a cause of this change?|
|:::^ g.|Could you adapt your script to handle the two different formats of data? How would you do this?|
|:::^ h.|Provide a copy of your updated script on the submission form.|
__Hint:__ to compare differences between textual data sets, explore the **diff**(**1**) tool.
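As a rough idea of what **diff**(**1**) reports, here is a small sketch using two stand-in files. The file names and the difference between them are invented for illustration; they are not real snapshots.

```shell
# Two small stand-in course lists; the change between them is invented.
workdir=$(mktemp -d)
printf '%s\n' 'Drawing I - 78524 - ARTS 1030 - 001' \
              'Drawing I - 78943 - ARTS 1030 - 002' > "$workdir/old.txt"
printf '%s\n' 'Drawing I - 78524 - ARTS 1030 - 001' \
              'Ceramics I - 78041 - ARTS 1210 - 001' > "$workdir/new.txt"

# Lines starting with < appear only in the first file, > only in the second.
# diff exits with status 1 when the files differ, so record that as well.
status=0
changes=$(diff "$workdir/old.txt" "$workdir/new.txt") || status=$?

printf '%s\n' "$changes"
echo "diff exit status: $status"
rm -r "$workdir"
```

An exit status of 1 here simply means the files differ; 0 would mean they are identical.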
There are a ton of questions we could ask of this data:
* How many remedial courses (those below the 1000 level) are offered in a given semester?
* Does the number of offerings of any particular course increase or decrease over time?
* Is there a noticeable change in certain course offerings between a fall and a spring?
Then there are some questions that, with our current skill level, may cause us a bit of trouble:
* What is the range of CRN numbers for a given semester? (Lowest through highest)
* Which course prefix has the MOST offerings a given semester?
* Which course prefix has the LEAST offerings a given semester?
* Which course prefix offered the MOST remedial course offerings?
While we may be able to derive answers to these questions... in some respects the data is not conveniently arranged for our analysis purposes. At the moment we have our data in the following format:
^ UNIX/Linux Fundamentals - 78400 - CSCS 1730 - 001 |
And to answer some of these questions, especially when **grep**'s are concerned, we'd ideally want the data arranged more like:
^ 78400:CSCS 1730-001:UNIX/Linux Fundamentals |
So once again our data may not be exactly the way we want it. Do we give up? **HECK NO**, we conform the universe to our demands...
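To illustrate why the colon-separated arrangement is friendlier, here is a sketch that assumes the data is //already// in the desired layout. The four lines are hand-built from titles shown earlier, and the use of **cut**(**1**) and **sort**(**1**) is just one possible approach.

```shell
# Hand-built lines in the desired CRN:PREFIX NUMBER-SECTION:TITLE layout.
data='78886:ANTH 2120-002:Cultural Anthropology
78010:ACCT 1000-001:Accounting Practices
78524:ARTS 1030-001:Drawing I
78493:ARTS 1004-001:Introduction Art Appreciation'

# cut splits each line on the colon; field 1 is the CRN.
lowest_crn=$(printf '%s\n' "$data" | cut -d: -f1 | sort -n | head -n 1)
highest_crn=$(printf '%s\n' "$data" | cut -d: -f1 | sort -n | tail -n 1)

# Anchoring on the colon after the CRN lets grep target the prefix field only.
arts_count=$(printf '%s\n' "$data" | grep -c '^[0-9]*:ARTS ')

echo "CRN range: $lowest_crn through $highest_crn"
echo "ARTS offerings: $arts_count"
```

With a single, unambiguous separator between fields, each tool can address one field at a time instead of guessing where a title ends and a CRN begins.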
=====Rearranging Data with Regular Expressions=====
I consider what we are about to cover to be among the most powerful concepts we learn in this class. What we do next should take the cake, illustrating the true potential of what we are able to wield given a good working knowledge of Regular Expressions and related tools.
To do our next trick, we need to study our data once again:
^ UNIX/Linux Fundamentals - 78400 - CSCS 1730 - 001 |
As you can see, the information as it is currently formatted takes the following structure, as compared to the desired structure:
^ Current: | Course Title - CRN - Course Prefix/Number - Section |
^ Desired: | CRN:Course Prefix/Number-Section:Course Title |
So how could we do this? To accomplish this task, we need to explore another RegEx capability and exercise our options in the **sed** REPLACEMENT field.
^ 8. ^|With our data in the current structure:|
| ^ a.|Derive a RegEx pattern that will match up to the first "space dash space". What is your pattern?|
|:::^ b.|Derive a RegEx pattern that will match the CRN up to the second "space dash space". What is your pattern?|
|:::^ c.|Derive a RegEx pattern that will match the Course Prefix/Number up to the third "space dash space". What is your pattern?|
|:::^ d.|Finally, round out with a fourth RegEx pattern that matches the **Section**, which is at the end of the line. What is your pattern?|
For my examples, I'll name your patterns REGEX1, REGEX2, REGEX3, and REGEX4.
In order to rearrange our data, we need to effectively describe the data (as you did above) in order to reference it in groups. The RegEx symbols **\(** and **\)** denote Regular Expression groups, which we can use to isolate specific patterns for later reference.
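Before applying groups to the course data, it may help to see them on something simpler. This sketch uses a made-up name (deliberately not the course data) and swaps its two halves by capturing each in a group:

```shell
# Made-up sample; deliberately not the course data.
name='Hopper, Grace'

# \(...\) captures a group; \1 and \2 in REPLACEMENT refer back to the
# first and second groups, so they can be emitted in a different order.
swapped=$(printf '%s\n' "$name" | sed 's/^\(.*\), \(.*\)$/\2 \1/')

printf '%s\n' "$swapped"
```

This should print **Grace Hopper**. The literal comma and space sit //outside// the groups, so they are matched but not carried into the replacement.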
Check this out:
lab46:~$ cat spring2011-20101105.html | grep 'REGEX' | sed 's/PATTERN//g' | sed 's/<\/A>.*$//g' > output
lab46:~$
Notice what we just did here... we took our information in its current form of filtering and output it to a file (called **output**), effectively taking a snapshot of our progress.
That should make sense: we're just using **I/O Redirection** to send the output of that pipelined command-line to a file instead of to **STDOUT**.
Feel free to make use of similar output junctures during the solution of a problem like this- and who knows, you might need to do particular processing with certain arrangements of data. So if you output your data at certain key points, you could be making your work a lot easier.
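The idea of saving output at key points can be sketched as follows. The sample lines, the **keep**/**skip** labels, and the **stage1**/**stage2** file names are all made up for illustration:

```shell
workdir=$(mktemp -d)

# Stage 1: filter, then save the intermediate result with > (made-up lines).
printf '%s\n' 'keep: alpha' 'skip: beta' 'keep: gamma' \
  | grep '^keep' > "$workdir/stage1"

# Stage 2: pick up from the saved file instead of rerunning stage 1.
sed 's/^keep: //' "$workdir/stage1" > "$workdir/stage2"

# Each snapshot can be inspected on its own.
stage1_lines=$(wc -l < "$workdir/stage1")
stage2=$(cat "$workdir/stage2")

echo "stage1 has $stage1_lines lines"
printf '%s\n' "$stage2"
rm -r "$workdir"
```

If a later step goes wrong, only that step needs to be redone, starting from the most recent saved file.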
Moving on:
lab46:~$ cat output | sed 's/^\(REGEX1\) - \(REGEX2\) - \(REGEX3\) - \(REGEX4\)$/\2:\3-\4:\1/g' | less
78010:ACCT 1000-001:Accounting Practices
78014:ACCT 1030-002:Financial Accounting
78577:ACCT 1030-005:Financial Accounting
78016:ACCT 1030-006:Financial Accounting
78573:ACCT 1040-001:Managerial Accounting
78017:ACCT 1040-002:Managerial Accounting
78372:ACCT 1040-003:Managerial Accounting
78019:ACCT 1050-001:Accounting Procedures
79108:ACCT 1050-002:Accounting Procedures
78027:ACCT 1100-001:Federal Income Tax
78021:ACCT 2050-001:Cost Accounting
79100:ACCT 2050-002:Cost Accounting
78026:ACCT 2100-001:Computerized Accounting
78886:ANTH 2120-002:Cultural Anthropology
78493:ARTS 1004-001:Introduction Art Appreciation
78524:ARTS 1030-001:Drawing I
78943:ARTS 1030-002:Drawing I
78391:ARTS 1030-003:Drawing I
78041:ARTS 1210-001:Ceramics I
78388:ARTS 1210-002:Ceramics I
**__NOTE:__** If the format of your data does not seem to change, you've got a typo, or a RegEx that doesn't adequately describe the data. Go over your syntax, look for any possible gotchas. Ask questions, seek clarification, **and don't be afraid to have someone look at your pattern**... you'd be amazed what a second pair of eyes can do.
Once you get it--- **WOW!** the data changed, just the way we wanted. Now we can do further analysis and write shell scripts that better assist us in our tasks. **Activities like this are what separate someone who can effectively command technology as a tool** from someone who resorts to manual data entry, racking up hours preparing the data by hand to answer the same questions we've asked and gotten answers to. And our processing takes a fraction of the time it would take to do all this data filtering and rearranging by hand.
That is the power of Regular Expressions. We can effectively delegate the manual labor to the computer, which is very good at manual (and menial) tasks, and is great at following instructions.
Plus, the less we are involved at the grunt-work level, the less chance there is of errors being introduced. The computer, when it follows correct instructions, will process the data effectively, versus the unpredictability of a human manually working on the data, accidentally inserting typos or other glitches that would threaten the validity of the end data.
=====Additional Data Wrangling=====
To cap off our experience, let's do one last foray into rearranging our data.
^ 9. ^|Rearrange the course information as follows (and show your command-lines):|
| ^ a.|PREFIX NUMBER-SECTION(CRN):TITLE|
|:::^ b.|PREFIX NUMBER:CRN (omit the section and title)|
|:::^ c.|PREFIXNUMBER-SECTION:TITLE (CRN) (merge PREFIX and NUMBER together, no space separating them).|
**PLEASE- ASK QUESTIONS, SEEK CLARIFICATION**. You're all just starting out, developing a proficiency with Regular Expressions. Typos happen. Don't let them trainwreck your progress on the lab.
=====Conclusions=====
All questions in this assignment require an action or response. Please organize your responses into an easily readable format and submit the final results to your instructor.
Your assignment is expected to be performed and submitted in a clear and organized fashion- messy or unorganized assignments may have points deducted. Be sure to adhere to the submission policy.
When complete, electronically submit your assignment by filling out the following form:
http://lab46.corning-cc.edu/haas/content/unix/submit.php?labb
As always, the class mailing list is available for assistance, but not answers.