Corning Community College
CSCS1730 UNIX/Linux Fundamentals
~~TOC~~
======Project: DATA PROCESSING======
=====Objective=====
To apply your growing and versatile skills on the command-line by massaging data through the deployment of innovative command-line incantations and slick scripts.
=====Background=====
Oftentimes, we will find ourselves encountering data in a slightly off format, not quite meeting some requirement we have for further processing.
Luckily, the UNIX environment provides many facilities for filtering and manipulating data so that we can "reformat" it to meet expectations.
This activity has you dabbling in one such scenario: a program that generates "raw" data (simulated from a scientific/industrial instrument). This "raw" data needs to be sanitized and reformatted (perhaps to be analyzed further by other tools downstream).
=====Task 0: Post/respond to a question=====
* Because the class mailing list has been rather quiet of late, and we've got a break coming up, I would like each person to post at least 1 focused question regarding this project to the class mailing list.
* Please do not give away any answers to the actions requested by this project in doing so.
* Be sure to identify which "task" or aspect of the project you are asking about
* Respond to at least 1 question, not by giving an explicit answer, but by asking further questions or giving a pointer to a resource that may contain additional information (e.g. see the **cut(1)** manual page)
* To get credit, your response can**not** be to one of your own questions.
* Put a URL to the mailing list post of your question asked in a file called: **task0.question**
* See http://lab46.corning-cc.edu/mailman/listinfo/unix to access the archives
* Put a URL to the mailing list post of your response in a file called: **task0.response**
* See http://lab46.corning-cc.edu/mailman/listinfo/unix to access the archives
* A question may receive multiple answers.
=====Task 1: Obtain source code=====
On Lab46, in the **/var/public/unix/projects/dataproc/** directory, there is a file called **info.c**.
* Copy this into your home directory. How did you do it?
* Write down the command-line used in a file called **task1.txt**
=====Task 2: Study the file contents=====
Determine:
* How to properly compile the file (so that it will run without displaying an error)?
* How to properly execute the resulting program (to generate 8 lines of output)?
* When you figure out the answers to both of these, put your responses in a file called **task2.txt**
A copy of the code follows:
/*
 * info.c - program to generate information stream for processing.
 *
 * In order to run, this program must be named according
 * to the value stored in the name[] array. Do not change
 * the code or values in this source code, but match the
 * executable name as appropriate.
 *
 * By default, no data is generated. In order to alter
 * this behavior, provide a whole number as the first
 * argument on the command-line, and that many lines of
 * output will be generated (to STDOUT by default).
 *
 * To compile: gcc -o PROGRAM_NAME info.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int index, max, x, y, i;
    char name[] = { 0x64, 0160, (114-63), (064+03), 0x00 };
    char file[(strlen(name)+1)];

    x = strlen(*(argv+0));
    y = strlen(name);

    for (i = 0; i <= y; i++)
    {
        file[i] = *(*(argv+0)+(x-y)+i);
    }

    if (strcasecmp(file, name) != 0)
    {
        fprintf(stderr, "ERROR: filename is incorrect!\n");
        fprintf(stderr, " must match name[] string\n");
        exit(1);
    }

    if (argc >= 2)
    {
        max = atoi(*(argv+1));
    }
    else
    {
        max = 0;
    }

    if (argc >= 3)
    {
        srand(atoi(*(argv+2)));
    }
    else
    {
        srand(1730);
    }

    for (index = 1; index <= max; index++)
    {
        x = rand() % 849 + 50;
        y = rand() % 1899 + 100;

        if (((x % 3) == 0) && ((y % 4) > 2))
            fprintf(stdout, "%d\tblank\n", index);
        else if (((x % 7) < 4) && ((y % 5) > 3))
            fprintf(stdout, "%d\terror %d\n", index, ((x % 20) + 1));
        else
            fprintf(stdout, "%d\t%.3d-%.3d\n", index, x, y);
    }

    return(0);
}
NOTE: Copying/pasting this code into a file to do the project will not earn you credit for task 1. You MUST copy the file from the specified location.
=====Task 3: Execute your program=====
Once you have things working:
* Run the program and have it generate 1024 lines of output
* Write down the command-line used in a file called **task3.txt**
=====Task 4: Store your output=====
* Save your program's output (the 1024 lines) to a file called **task4.txt**
=====Task 5: Find and count the duplicates=====
* Ignoring the index values in the left-most column, determine which numerical codes occur more than once by concocting a command-line incantation or script that appropriately filters and processes the output.
* Also display a count of the total number of lines in the output, along with the total number of lines with valid numeric values (ignore "blank" lines and lines with error codes). Finally, display the total count of lines that have duplicates.
* Put your resulting command-line(s) or script in a file called **task5.sh**
* Put the output (result) of your command-line(s) or script in a file called **task5.out**
For example, let's say we had the following output:
1 671-477
2 error 4
3 742-703
4 671-477
5 blank
6 516-336
7 671-477
8 742-703
9 546-031
10 089-322
11 442-1220
12 blank
As a result of running your solution, the following output should be produced:
671-477 occurs 3 times
742-703 occurs 2 times
Out of 12 lines (9 with numeric values), there were a total of 5 lines with duplicate values
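One possible way to arrive at that report is sketched below. It is only an illustration run against the 12 example lines above, not the required solution: the temporary file, the variable names, and the regular expression used to recognize a "valid numeric value" are all assumptions made for the sketch, and the data is assumed to be tab-separated as the program emits it.

```shell
#!/bin/bash
# Recreate the 12 example lines in a temporary file (hypothetical stand-in
# for the real program output); index and value are separated by a tab.
sample=$(mktemp)
printf '%s\t%s\n' \
    1 671-477 2 'error 4' 3 742-703 4 671-477 5 blank 6 516-336 \
    7 671-477 8 742-703 9 546-031 10 089-322 11 442-1220 12 blank > "$sample"

# Drop the index column, then keep only lines shaped like digits-dash-digits.
codes=$(cut -f2 "$sample" | grep -E '^[0-9]+-[0-9]+$')

total=$(wc -l < "$sample")
numeric=$(printf '%s\n' "$codes" | wc -l)

# sort brings identical codes together so uniq -c can count each group;
# awk keeps only the groups seen more than once.
report=$(printf '%s\n' "$codes" | sort | uniq -c | sort -k1,1nr -k2,2 |
    awk '$1 > 1 { printf "%s occurs %d times\n", $2, $1 }')

# Add up the sizes of the duplicated groups to get the duplicate line count.
dups=$(printf '%s\n' "$codes" | sort | uniq -c |
    awk '$1 > 1 { n += $1 } END { print n + 0 }')

printf '%s\n' "$report"
echo "Out of $total lines ($numeric with numeric values), there were a total of $dups lines with duplicate values"

rm -f "$sample"
```

The important detail is that **uniq** only compares adjacent lines, so the data has to be sorted before it is counted.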
=====Task 6: Find and display the max duplicates=====
From your filtered output in the previous task, write some logic that:
* Removes the "blank" lines and error codes from your original output
* Collapses any duplicates (have just 1 value for each duplicate set)
* Sorts the resulting numeric data according to the value to the left of the dash.
* Re-indexes the data to create a new, more refined, data file. Have a single tab separate the index value from the data value on each line.
* Put your logic in a file called **task6.sh**
* Put your output in a file called **task6.out**
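The four steps listed for this task can be chained into a single pipeline. The following is a rough sketch under a few assumptions: it uses a small hypothetical sample file rather than the real 1024 lines, it breaks ties on the left-hand value by the right-hand value (the task does not say how ties should be ordered), and the variable names are made up for the illustration.

```shell
#!/bin/bash
# Hypothetical tab-separated sample standing in for the real program output.
sample=$(mktemp)
printf '%s\t%s\n' \
    1 671-477 2 'error 4' 3 742-703 4 671-477 5 blank 6 516-336 \
    7 671-477 8 742-703 9 546-031 10 089-322 11 442-1220 12 blank > "$sample"

refined=$(
    cut -f2 "$sample" |                      # discard the old index column
        grep -E '^[0-9]+-[0-9]+$' |          # drop "blank" and "error N" lines
        sort -t- -k1,1n -k2,2n |             # numeric sort on left value, then right
        uniq |                               # collapse adjacent duplicates
        awk '{ printf "%d\t%s\n", NR, $0 }'  # new index, one tab, then the value
)

printf '%s\n' "$refined"
rm -f "$sample"
```

Note that deduplicating with **sort -u** on the left-hand key alone would be a mistake here, since it would also discard distinct codes that merely share the same left value; that is why the sketch sorts on both fields and lets **uniq** compare whole lines.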
=====Submission=====
To successfully complete this project, the following criteria must be met:
* All criteria indicated above
* To signal completion, submit an archive containing all the files generated in each task above.
* Task 0: **task0.question** and **task0.response**
* Task 1: **task1.txt**
* Task 2: **task2.txt**
* Task 3: **task3.txt**
* Task 4: **task4.txt**
* Task 5: **task5.sh** and **task5.out**
* Task 6: **task6.sh** and **task6.out**
* Put all these files in a **tar** archive called **dataproc.tar**
* Compress it with max compression using **gzip**
* The resulting archive should be named: **dataproc.tar.gz**
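As a rough illustration of the archiving steps, the sketch below builds the archive in a scratch directory using empty placeholder files. The scratch directory and the placeholder contents are assumptions for the example only; in practice you would run the **tar** and **gzip** commands in the directory holding your real files.

```shell
#!/bin/bash
# Work in a throwaway directory with placeholder files (illustration only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

files="task0.question task0.response task1.txt task2.txt task3.txt task4.txt task5.sh task5.out task6.sh task6.out"
for f in $files; do
    echo "placeholder" > "$f"
done

# -c creates an archive, -f names the archive file.
tar -cf dataproc.tar $files

# -9 selects the highest gzip compression level; gzip replaces
# dataproc.tar with dataproc.tar.gz.
gzip -9 dataproc.tar

# List the contents of the compressed archive to verify it.
tar -tzf dataproc.tar.gz
```

Listing the archive with **tar -tzf** before submitting is a cheap way to confirm that every required file made it in.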
To submit this project to me using the **submit** tool, run the following command at your lab46 prompt:
$ submit unix dataproc dataproc.tar.gz
Submitting unix project "dataproc":
-> dataproc.tar.gz(OK)
SUCCESSFULLY SUBMITTED
You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or locational mismatches.