<fs 125%>Lab 0xB: Data Manipulation</fs>
~~TOC~~
=====Objective=====
To explore some aspects of data manipulation on the system.
=====Reading=====
Please reference the following man pages:

  * **dd**(**1**)
  * **md5sum**(**1**)
  * **diff**(**1**)
  * **bvi**(**1**)
  * **hexedit**(**1**)
  * **file**(**1**)
=====Background=====
The **dd**(**1**) utility, often said to be short for //data dump//, is a tool that specializes in taking data from a source file and depositing it in a destination file. In combination with its various options, it gives us more fine-grained access to data than would otherwise be convenient using the standard data manipulation tools (**cp**(**1**), **cat**(**1**), etc.).
====Copying====
To illustrate the basic nature of **dd**(**1**), let's start by using it to make a simple copy of a file.

When given just a source and a destination, **dd**(**1**) will happily copy the source data (from start to finish) to the destination location (filling it from beginning to end). The end result should be identical to the source.

For example:
<cli>
lab46:~$ dd if=/... of=howlong
9+1 records in
9+1 records out
4912 bytes (4.9 kB) copied, 0.0496519 s, 98.9 kB/s
lab46:~$ 
</cli>
Here, **if=** specifies the source (the //input file//) and **of=** specifies the destination (the //output file//). With no other options given, **dd**(**1**) copies the entire input, working by default in 512-byte blocks (which is why the 4912-byte copy above reports 9 full records plus 1 partial record).
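The same idea can be sketched with throwaway files of our own (the file names here are invented for illustration, not part of the lab):

```shell
# A minimal dd copy; source.txt and copy.txt are made-up names.
printf 'some sample data' > source.txt

# if= names the input file, of= the output file; dd copies the
# whole input to the output in (default) 512-byte blocks.
dd if=source.txt of=copy.txt

# The result should be byte-for-byte identical to the source:
cmp source.txt copy.txt && echo "files match"
```

Since the sample is smaller than one block, **dd**(**1**) here reports 0+1 records in/out: zero full blocks plus one partial block.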
Doing some comparisons:
<cli>
lab46:~$ ls -l /... howlong
-rwxr-xr-x 1 root root 4912 May  4  2010 /...
-rw-r--r-- 1 user lab46 4912 Nov 13 14:57 howlong
lab46:~$ 
</cli>
====Investigating====
^ 1. ^|Answer me the following:|
| ^ a.|What is different about these two files?|
|:::^ b.|What is similar?|
|:::^ c.|If **dd**(**1**) copies (or duplicates) the data exactly, why might some of these attributes differ?|
|:::^ d.|What is the output of running the original file?|
|:::^ e.|When you execute each file, is the output the same or different?|
|:::^ f.|Any prerequisite steps needed to get either file to run? What were they?|
Consistency of data has been a desire ever since we started storing it, and the system provides us with tools to verify whether two pieces of data are in fact identical.
====Comparisons====
Although many ways exist to compare data, two tools are of particular use to us here:

  * **md5sum**(**1**): computes an MD5 hash (a 128-bit fingerprint) of its input; identical data produces an identical hash.
  * **diff**(**1**): compares two files and reports any differences between them.
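A quick sketch of both tools in action (the file names are invented for illustration):

```shell
# Two files with identical content should hash identically and
# show no differences (one.txt/two.txt are made-up names).
printf 'some data\n' > one.txt
cp one.txt two.txt

# md5sum prints one hash per file; matching hashes mean
# matching content.
md5sum one.txt two.txt

# diff is silent (and exits 0) when the files are identical:
diff one.txt two.txt && echo "no differences"
```

Note that **diff**(**1**) compares files line by line; for binary data, the hash-based approach of **md5sum**(**1**) tends to be the more convenient check.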
^ 2. ^|Answer me the following:|
| ^ a.|Are the two files identical?|
|:::^ b.|Using **diff**(**1**), verify whether or not the two files differ. What did you find?|
|:::^ c.|Using **md5sum**(**1**), generate a hash of each file. Do the hashes match?|
|:::^ d.|What do these results tell you about the two files?|
|:::^ e.|How could an MD5 hash be useful with regards to data integrity and security?|
|:::^ f.|In what situations could **diff**(**1**) be a useful tool for comparing differences?|
=====Exercise=====
^ 3. ^|Do the following:|
| ^ a.|Using **dd**(**1**), make a copy of one of the files.|
|:::^ b.|How did you do this?|
|:::^ c.|How could you verify you were successful?|
|:::^ d.|If you ran **echo "more information" >> file** against your copy, would the two files still be identical?|
|:::^ e.|Can you verify this? How?|
|:::^ f.|If you wanted to make the two files identical once again, how could you accomplish this?|
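One way to see the effect of appending data, sketched with an invented file name:

```shell
# Appending even a little data changes the file's hash
# (copy.txt is a made-up name for illustration).
printf 'original contents\n' > copy.txt
md5sum copy.txt

# Append a line, as with an echo redirection:
echo "more information" >> copy.txt

# The hash is now different, flagging the file as no longer
# identical to the original:
md5sum copy.txt
```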
<WRAP round info box>
In the **data/** subdirectory of the UNIX Public Directory, you will find a file to use for the following exercise.

Please copy this to your home directory to work on the following question.
</WRAP>
^ 4. ^|Do the following:|
| ^ a.|How large (in bytes) is this file?|
|:::^ b.|What information predominantly appears to be in the first 3kB of the file?|
|:::^ c.|Does this information remain constant throughout the file, or are there ranges where it changes?|
|:::^ d.|How would you extract the data at one of these ranges into its own file?|
|:::^ e.|Do so for each range of data you identify.|
|:::^ f.|Run **file**(**1**) on each file that hosts extracted data. What is each type of file?|
|:::^ g.|Based on the output of **file**(**1**), react accordingly to the data to unlock its functionality/content.|
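Extracting a byte range is exactly what **dd**(**1**)'s **bs=**, **skip=**, and **count=** options are for. A sketch with a tiny invented file (the offsets in the real exercise will differ):

```shell
# Build a 12-byte sample file: three 4-byte regions.
printf 'AAAABBBBCCCC' > sample.dat

# bs=1 makes skip= and count= operate on single bytes:
# skip the first 4 bytes, then copy the next 4.
dd if=sample.dat of=middle.dat bs=1 skip=4 count=4

cat middle.dat    # the extracted region: BBBB
```

Once a region is isolated in its own file, **file**(**1**) can often identify what that data actually is.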
=====Conclusions=====