<WRAP centeralign round box>
<WRAP><color red><fs 200%>Corning Community College</fs></color></WRAP>
<WRAP><fs 150%>CSCS1730 UNIX/Linux Fundamentals</fs></WRAP>
</WRAP>

~~TOC~~

======Project: DATA PROCESSING======

=====Objective=====
To apply your growing and versatile skills on the command-line by massaging data through the deployment of innovative command-line incantations.
=====Background=====
Often, we will find ourselves encountering data in a slightly off format, not quite meeting some requirement we need for further processing.

Luckily, the UNIX environment provides many facilities for filtering and manipulating data so that we can "reformat" it to meet expectations.

This activity has you dabbling in one such scenario: a program that generates "raw" data (simulated from a scientific/industrial instrument). This "raw" data needs to be sanitized and reformatted (perhaps to be analyzed further by other tools downstream).
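
As a rough illustration of that kind of filtering, the sketch below reshapes a few made-up records with standard tools. The sample data and the function name **reformat_demo** are hypothetical and have nothing to do with the project's data; the point is only to show how small filters chain together.

<code bash>
# Hypothetical sample: semicolon-separated records with a header line.
reformat_demo() {
    printf 'id;city;temp\n1;elmira;71\n2;corning;68\n' |
        grep -v '^id;' |    # drop the header line
        cut -d';' -f2,3 |   # keep only the city and temp fields
        tr ';' '\t'         # turn the remaining separator into a tab
}

reformat_demo
</code>

Each stage does one small job and hands its output to the next, which is the general pattern the later tasks build on.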

=====Task 0: Post/respond to a question=====
  * To ensure adequate out-of-class communications, I'd like for you to make use of the class mailing list.
    * I would like each person to post at least 1 focused question regarding this project to the class mailing list.
      * This also helps to make sure everyone has subscribed to the list (as you should have the first week)
      * Please do not give away any answers to the actions requested by this project in doing so.
      * Be sure to identify which "task" or aspect of the project you are asking about
    * Respond to at least 1 question, not by giving an explicit answer, but by asking further questions, or giving a pointer to a resource that may contain additional information (e.g. see the **cut(1)** manual page)
    * To get credit, your response can**not** be to one of your own questions.
  * Put a URL to the mailing list post of your question asked in a file called: **task0.question**
    * See http://lab46.corning-cc.edu/mailman/listinfo/unix to access the archives
  * Put a URL to the mailing list post of your response in a file called: **task0.response**
    * See http://lab46.corning-cc.edu/mailman/listinfo/unix to access the archives
    * A question may receive multiple answers.

=====Task 1: Obtain source code=====
On Lab46, in the **/var/public/unix/projects/dataproc/** directory, is a file called **info.c**

  * Copy this into your home directory. How did you do it?
  * Write down the command-line used in a file called **task1.txt**

=====Task 2: Study the file contents=====

Determine:
  * How to properly compile the file (so that the resulting program will run without displaying an error)
  * How to properly execute the resulting program (to generate 8 lines of output)
  * When you figure out the answers to both of these, put your responses in a file called **task2.txt**

A copy of the code follows:

<code c 1>
/*
 * info.c - program to generate information stream for processing.
 *
 *          In order to run, this program must be named according
 *          to the value stored in the name[] array. Do not change
 *          the code or values in this source code, but match the
 *          executable name as appropriate.
 *
 *          By default, no data is generated. In order to alter
 *          this behavior, provide a whole number as the first
 *          argument on the command-line, and that many lines of
 *          output will be generated (to STDOUT by default).
 *
 * To compile: gcc -o PROGRAM_NAME info.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	int index, max, x, y, i;
	char name[] = { 0x64, 0160, (114-63), (064+03), 0x00 };
	char file[(strlen(name)+1)]; 

	x = strlen(*(argv+0));
	y = strlen(name);

	for (i = 0; i <= y; i++)
	{
		file[i] = *(*(argv+0)+(x-y)+i);
	}

	if (strcasecmp(file, name) != 0)
	{
		fprintf(stderr, "ERROR: filename is incorrect!\n");
		fprintf(stderr, "       must match name[] string\n");
		exit(1);
	}

	if (argc >= 2)
	{
		max = atoi(*(argv+1));
	}
	else
	{
		max = 0;
	}

	if (argc >= 3)
	{
		srand(atoi(*(argv+2)));
	}
	else
	{
		srand(1730);
	}

	for (index = 1; index <= max; index++)
	{
		x = rand() % 849 + 50;
		y = rand() % 1899 + 100;

		if (((x % 3) == 0) && ((y % 4) > 2))
			fprintf(stdout, "%d\tblank\n", index);
		else if (((x % 7) < 4) && ((y % 5) > 3))
			fprintf(stdout, "%d\terror %d\n", index, ((x % 20) + 1));
		else
			fprintf(stdout, "%d\t%.3d-%.3d\n", index, x, y);
	}

	return(0);
}
</code>

NOTE: Copying/pasting this code into a file to do the project will not earn you credit for task 1. You MUST copy the file from the specified location.
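
C source can spell the same character code in decimal, hexadecimal (**0x** prefix), or octal (leading **0**), and shell arithmetic understands the same three notations. The sketch below, using a hypothetical helper called **code_to_char** and values deliberately unrelated to this project, shows one way to turn such a constant into its character. Treat it as an aid for studying the notation, not as a prescribed method.

<code bash>
# Hypothetical helper: print the character for a numeric constant written
# in C notation (decimal, 0x-prefixed hex, or 0-prefixed octal).
code_to_char() {
    # $(( ... )) evaluates the constant (or expression) to a decimal value;
    # the inner printf '%o' converts that to octal for a \ooo escape.
    printf "\\$(printf '%o' "$(( $1 ))")"
}

code_to_char 0x48       # hex 48    = decimal 72
code_to_char 0151       # octal 151 = decimal 105
code_to_char '(30+3)'   # arithmetic is evaluated first: decimal 33
echo                    # finish the line; together these print: Hi!
</code>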
=====Task 3: Execute your program=====

Once you have things working:

  * Run the program and have it generate 1024 lines of output
  * Write down the command-line used in a file called **task3.txt**

=====Task 4: Store your output=====

  * Save your program's output (the 1024 lines) to a file called **task4.txt**
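
For reference, output redirection in general looks something like the sketch below. The **make_lines** function is a hypothetical stand-in for any program that writes to STDOUT, and a scratch file from **mktemp** is used so nothing real gets overwritten.

<code bash>
# Hypothetical stand-in for any program that writes lines to STDOUT.
make_lines() {
    seq 1 "$1"
}

outfile=$(mktemp)            # scratch file for the demonstration
make_lines 5 > "$outfile"    # '>' sends STDOUT to the file instead of the screen
wc -l < "$outfile"           # count the lines that were captured
rm -f "$outfile"             # clean up the scratch file
</code>

Checking the line count afterward is a cheap way to confirm the file holds what you expect.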

=====Task 5: Find and count the duplicates=====
  * Ignoring the index values in the left-most column, determine which numerical codes occur more than once by concocting a command-line incantation that appropriately filters and processes the output.
  * Also display a count of the total number of lines in the output, along with the total number of lines with valid numeric values (ignore "blank" lines and lines with error codes). Finally, display the total count of lines that have duplicates.
    * Omit all the lines that occurred only once (i.e. have no duplicates); it will make your data set immediately more manageable.
  * Put your resulting command-line(s) in a file called **task5.sh**
  * Put the output (result) of your command-line(s) in a file called **task5.out**

For example, let's say we had the following output:

<cli>
1	671-477
2	error 4
3	742-703
4	671-477
5	blank
6	516-336
7	671-477
8	742-703
9	546-031
10	089-322
11	442-1220
12	blank
</cli>

As a result of running your solution, the following output should be produced:

<cli>
671-477 occurs 3 times
742-703 occurs 2 times
Out of 12 lines (9 with numeric values), there were a total of 5 lines with duplicate values
</cli>
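
The general idea of counting repeated values can be sketched on unrelated data, as below. The function name **count_repeats** and the fruit list are hypothetical, and this is only one of several possible approaches; your own solution still needs to deal with the index column, the "blank" and error lines, and the summary line.

<code bash>
# Hypothetical helper: report values that appear more than once on STDIN.
count_repeats() {
    sort |              # group identical lines together
        uniq -c |       # collapse each group, prefixing its count
        awk '$1 > 1 { print $2, $1 }'   # keep only counts above 1
}

printf 'pear\napple\nfig\napple\npear\napple\n' | count_repeats
# prints:
# apple 3
# pear 2
</code>

Note that **uniq** only notices adjacent repeats, which is why the input is sorted first.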

=====Task 6: Refine and re-index the data=====

From your filtered output in the previous task, write some logic that:

  * Removes the "blank" lines and error codes from your original output
  * Collapses any duplicates (have just 1 value for each duplicate set)
  * Sorts the resulting numeric data according to the value to the left of the dash.
  * Re-indexes the data to create a new, more refined, data file. Have a single tab separate the index value from the data value on each line.
  * Put your logic in a file called **task6.sh**
  * Put your output in a file called **task6.out**
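
Sorting on the field to the left of a delimiter and then renumbering can be sketched as follows. The colon-separated sample data and the name **reindex_sorted** are hypothetical; adapt the idea rather than the exact commands.

<code bash>
# Hypothetical helper: de-duplicate, sort on the first ':'-separated field,
# then number the surviving lines.
reindex_sorted() {
    sort -u |               # collapse exact duplicate lines
        sort -t: -k1,1n |   # numeric sort on the field left of ':'
        awk '{ printf "%d\t%s\n", NR, $0 }'   # new index, one tab, value
}

printf '30:b\n4:z\n30:b\n120:a\n' | reindex_sorted
</code>

The two **sort** passes are kept separate on purpose: combining **-u** with a key would treat lines as duplicates whenever their keys match, even if the rest of the line differs.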

=====Submission=====
To successfully complete this project, the following criteria must be met:

  * All criteria indicated above
  * To signal completion, submit an archive containing all the files generated in each task above.
    * Task 0: **task0.question** and **task0.response**
    * Task 1: **task1.txt**
    * Task 2: **task2.txt**
    * Task 3: **task3.txt**
    * Task 4: **task4.txt**
    * Task 5: **task5.sh** and **task5.out**
    * Task 6: **task6.sh** and **task6.out**
    * Put all these files in a **tar** archive called **dataproc.tar**
    * Compress it with max compression using **gzip**
    * The resulting archive should be named: **dataproc.tar.gz**
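
The archive-then-compress sequence can be sketched on placeholder files as shown below. The function **make_demo_archive** and the names **a.txt**, **b.txt**, and **demo.tar** are hypothetical and chosen to stay clear of your real project files.

<code bash>
# Hypothetical helper: build a small gzip-compressed tar archive in directory $1.
make_demo_archive() {
    ( cd "$1" &&
      echo 'one' > a.txt &&                 # placeholder files
      echo 'two' > b.txt &&
      tar -cf demo.tar a.txt b.txt &&       # bundle the files into one archive
      gzip -9 demo.tar )                    # -9 = best compression; yields demo.tar.gz
}

workdir=$(mktemp -d)                # scratch directory for the demonstration
make_demo_archive "$workdir"
tar -tzf "$workdir/demo.tar.gz"     # list the archive contents to verify
rm -rf "$workdir"                   # clean up
</code>

Listing the archive before submitting is a quick check that every expected file made it in.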

To submit this project to me using the **submit** tool, run the following command at your lab46 prompt:

<cli>
$ submit unix dataproc dataproc.tar.gz
Submitting unix project "dataproc":
    -> dataproc.tar.gz(OK)

SUCCESSFULLY SUBMITTED
</cli>

You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or location mismatches.