haas:spring2010:basic

Pattern Matching

Another challenge appears on the horizon of UNIXy possibility, one that some of you can put to near-immediate use in some upcoming academic activities.

Some of you may have noticed that Summer and Fall 2010 course registrations are now upon us. Take it from me: the sooner you get your courses taken care of, the easier it will be. With record-high enrollments likely to continue, the chances of the classes you want filling up are high, and the speed at which that can happen has been known to surprise people.

With that said, this week's quest is going to focus on playing with CCC's course offering data, and gleaning useful information from it.

To start, you will need a copy of the CCC course data for Fall 2010. To save time and frustration, I have gone ahead and downloaded it; it can be found in a gzipped HTML file called fall2010-20100315.html.gz, located in the courselist/ subdirectory of the UNIX Public Directory. Obtain a copy of this file and place it in a good working location within your home directory.

Notice the size of this file. Uncompress it and check the file size… what is the ratio of raw to compressed data?
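One way to work out such a ratio is sketched below. It builds a small stand-in file rather than touching the real course data, so sample.html, its contents, and the variable names are all assumptions made up for illustration; the same size-checking steps should carry over to the real file.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the course file: 500 identical HTML-like lines.
# Repetitive text like this compresses very well.
i=0
while [ "$i" -lt 500 ]; do
    echo '<tr><td>CSCS 1200</td><td>Sample Course Title</td></tr>'
    i=$((i + 1))
done > sample.html

# Compress to a separate file, keeping the original for comparison.
gzip -c sample.html > sample.html.gz

# wc -c reports a size in bytes; tr strips any padding some systems add.
raw=$(wc -c < sample.html | tr -d ' ')
packed=$(wc -c < sample.html.gz | tr -d ' ')

# Ratio of raw to compressed size, to two decimal places.
ratio=$(awk -v r="$raw" -v p="$packed" 'BEGIN { printf "%.2f", r / p }')

ls -l sample.html sample.html.gz
echo "raw=$raw compressed=$packed ratio=${ratio}:1"
```

On the real file, keep in mind that gunzip replaces the .gz file with the uncompressed one by default, so note the compressed size first or work on a copy.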

Take a look at the data inside this file (use your favorite modal text editor, or cat it with a pipe to less)… study the data and look for any patterns within it.

Specifically, I want you to be able to identify:

blocks of data containing information on a single course
how does a new course start (pattern-wise in this file)?
what is the format of the information related to a course title/prefix/section? Is there an order to it from one to the next?
any big deviations amongst this data (regular vs. internet vs. hybrid vs. lecture/lab vs. ACE vs. independent study)
check out information like class room, meeting times, instructor and see if there's a common pattern amongst the different courses (be on the lookout for exceptions)

Once you think you have a handle on how the file is arranged (be sure to jot down some of your observations… would make great journal content as you explore this problem), I'd like for you to, using your skills on the UNIX command-line, obtain for me the following information:

how many total courses is CCC offering in the fall?
how many ACE courses are being offered?
how many English courses?
how many courses are being offered at the Elmira Center?
how many upper-level (2000 or above) courses are being offered at Airport Corporate Park AND the BDC (total between both)?
how many lower-level courses (1000-1999) are being offered in the fall?
how many unique courses are being offered in the fall (count CSNT 1200 and all its sections as “1”, CSCS 1200 and all its sections as “1”, etc.)?
how many unique course prefixes are present in the fall course offering (“CSNT”, “CSCS”, “FIRE”, etc.)?

To assist you, I would recommend exploring and becoming more familiar with some of the following commands (in addition to your working toolset):

grep
sort
uniq
sed
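To give a feel for how these four commands chain together, here is a minimal sketch. The file courses.txt, its one-line-per-section layout, the CRNs, and the course titles are all invented for illustration; the real course file is HTML and is arranged differently, so treat this as a pattern to adapt rather than an answer.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical, already-simplified listing: PREFIX NUMBER SECTION CRN TITLE
cat > courses.txt <<'EOF'
CSCS 1200 001 10001 Computer Essentials
CSCS 1200 002 10002 Computer Essentials
CSNT 1200 001 10003 Network Fundamentals
ENGL 1010 001 10004 College Composition I
ENGL 1010 002 10005 College Composition I
ENGL 2030 001 10006 American Literature
FIRE 1100 001 10007 Fire Protection Basics
EOF

# grep -c counts matching lines: how many sections carry the ENGL prefix?
english=$(grep -c '^ENGL ' courses.txt)

# sed keeps only "PREFIX NUMBER", sort groups the duplicates together,
# uniq collapses each group to one line, and wc -l counts what is left.
unique_courses=$(sed 's/^\([A-Z]* [0-9]*\) .*/\1/' courses.txt |
    sort | uniq | wc -l | tr -d ' ')

# Same idea, keeping only the prefix (everything before the first space).
unique_prefixes=$(sed 's/ .*//' courses.txt | sort | uniq | wc -l | tr -d ' ')

# uniq -c additionally reports how many times each prefix appeared.
sed 's/ .*//' courses.txt | sort | uniq -c

echo "English sections: $english"
echo "Unique courses:   $unique_courses"
echo "Unique prefixes:  $unique_prefixes"
```

Note that uniq only collapses adjacent duplicate lines, which is why sort comes first in each pipeline.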

Now, with sed and grep we approach another big area of topic coverage: Regular Expressions.

This is an area we will be spending some time and attention on, but it may behoove you to start reading up on them and asking questions and playing with them.

Some of your books have information on Regular Expressions, as do some manual pages (the grep manual page has an informative section on “REGULAR EXPRESSIONS”; search for it in all caps like that, no quotes needed).

Basically, a regular expression is a pattern that can be applied to text, much as wildcards are applied to file names. Both grep and sed understand regular expressions, and through using them one can obtain some amazing capabilities.
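The comparison with wildcards can be made concrete with a small sketch. The file names below are made up for illustration; the point is only that the two notations look similar but are spelled differently.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch notes.txt todo.txt readme.md    # hypothetical file names

# Shell wildcard: the shell itself expands *.txt to the matching file names.
glob_count=$(ls *.txt | wc -l | tr -d ' ')

# Regular expression: grep filters lines of text (here, the output of ls).
# "." is any one character and "*" repeats the previous item, so ".*"
# does the job of the wildcard "*"; "\." is a literal dot; "$" is end of line.
regex_count=$(ls | grep -c '.*\.txt$')

echo "wildcard matched $glob_count, regular expression matched $regex_count"
```

The wildcard acts on file names before the command ever runs, while the regular expression acts on whatever text is fed to grep.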

Quickly, a basic table of Regular Expressions:

Symbol	Description
^	Match Beginning of Line
$	Match End of Line
.	Match Any Single Character
*	Match 0 or More of the Previous Character or Item
[ ]	Match Any One of the Enclosed Characters (Character Class)
[^ ]	Match Any One Character NOT Among the Enclosed Characters (Inverted Character Class)

There are more, but for now let's focus on these.
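To see each of the symbols in the table at work, here is a short sketch using grep -c, which prints the number of matching lines. The contents of sample.txt are invented for illustration and do not reflect the layout of the real course file.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample lines, made up for this illustration.
cat > sample.txt <<'EOF'
CSCS 1200 Elmira
CSNT 2400 Corning
ENGL 1010 Elmira
note: ENGL 1010 is full
MATH 999X Corning
EOF

# ^ anchors to the start of the line: only the line that begins with ENGL.
starts=$(grep -c '^ENGL' sample.txt)

# $ anchors to the end of the line: lines that end in Elmira.
ends=$(grep -c 'Elmira$' sample.txt)

# . matches any single character: CS, then any two characters, then a space.
anychar=$(grep -c 'CS.. ' sample.txt)

# * repeats the previous item 0 or more times: a run of capitals, a space,
# a run of digits, a space, all starting at the beginning of the line.
star=$(grep -c '^[A-Z]* [0-9]* ' sample.txt)

# [ ] matches one of the enclosed characters: a 4-digit number from 2000 up.
class=$(grep -c ' [2-9][0-9][0-9][0-9] ' sample.txt)

# [^ ] inverts the class: lines whose first character is NOT a capital.
inverted=$(grep -c '^[^A-Z]' sample.txt)

echo "^:$starts \$:$ends .:$anychar *:$star []:$class [^]:$inverted"
```

Always put the pattern in single quotes so the shell does not try to interpret characters like * and $ before grep sees them.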

Try to use Regular Expressions with grep and sed to assist you in finding the information you seek. This can help you answer the questions I ask above, and it opens the door to more exciting and more powerful capabilities we will soon be exploring.

Additionally, try your hand at the following:

using the “search and replace” functionality in sed to erase the HTML tags in some of the data you are viewing, so that you are dealing only with plain ASCII text
can you get a list of all the course prefix / number / CRN / title / section data without HTML tags?
can you put each course's data in its own text file (named by the unique course CRN)? How might you go about trying to do this?
using this file, go ahead and put together your schedule for Fall 2010. Perform searches against this file to get the CRNs (needed for registration), which will significantly speed up your registration process for the fall (also creates the opportunity to ask questions with regard to requirements)
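As a starting point for the first few items, here is a minimal sketch of tag removal with sed, followed by one possible way of splitting lines into per-CRN files. The markup in snippet.html, the CRNs, and the titles are hypothetical, and the sketch assumes each course sits on a single line with the CRN first, which the real file may well not do.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical markup; the real file's tags and layout will differ.
cat > snippet.html <<'EOF'
<tr><td class="crn">10001</td><td><b>CSCS 1200</b></td><td>Computer Essentials</td></tr>
<tr><td class="crn">10003</td><td><b>CSNT 1200</b></td><td>Network Fundamentals</td></tr>
EOF

# s/<[^>]*>/ /g : replace every tag ("<", any non-">" characters, ">")
#                 with a space
# s/  */ /g     : squeeze each run of spaces down to a single space
# s/^ // s/ $// : trim a leftover space at the start and end of the line
sed -e 's/<[^>]*>/ /g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//' \
    snippet.html > plain.txt
cat plain.txt

# One file per course, named by the CRN (assumed to be the first field).
while read -r crn rest; do
    echo "$rest" > "$crn.txt"
done < plain.txt

ls
```

The tag pattern only handles tags that open and close on the same line; a tag split across lines would need a different approach.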

Once again, ASKING QUESTIONS will be greatly beneficial. This isn't a problem you are likely to solve in 20 minutes, so you'll want to gradually poke away at it throughout the week.

Good luck.