Corning Community College

CSCS1730 UNIX/Linux Fundamentals

Project: DATA PROCESSING

Objective

To apply your growing and versatile skills on the command-line by massaging data through the deployment of innovative command-line incantations.

Background

Oftentimes, we encounter data in a slightly off format, one that does not quite meet some requirement we need for further processing.

Luckily, the UNIX environment provides many facilities for filtering and manipulating data so that we can “reformat” it to meet expectations.

This activity has you dabbling in one such scenario: a program that generates “raw” data (simulated output from a scientific/industrial instrument). This “raw” data needs to be sanitized and reformatted (perhaps to be analyzed further by other tools downstream).

Task 0: Post/respond to a question

  • To ensure adequate out-of-class communications, I'd like for you to make use of the class mailing list.
    • I would like each person to post at least 1 focused question regarding this project to the class mailing list.
      • This also helps to make sure everyone has subscribed to the list (as you should have the first week)
      • Please do not give away any answers to the actions requested by this project in doing so.
      • Be sure to identify which “task” or aspect of the project you are asking about
    • Respond to at least 1 question, not by giving an explicit answer, but by asking further questions or by giving a pointer to a resource that may contain additional information (e.g., see the cut(1) manual page)
    • To get credit, your response cannot be to one of your own questions.
  • Put a URL to the mailing list post of your question asked in a file called: task0.question
  • Put a URL to the mailing list post of your response in a file called: task0.response

Task 1: Obtain source code

On Lab46, in the /var/public/unix/projects/dataproc/ directory, is a file called info.c

  • Copy this into your home directory. How did you do it?
  • Write down the command-line used in a file called task1.txt

Task 2: Study the file contents

Determine:

  • How to properly compile the file (so that it will run without displaying an error)?
  • How to properly execute the resulting program (to generate 8 lines of output)?
  • When you figure out the answers to both of these, put your responses in a file called task2.txt

A copy of the code follows:

/*
 * info.c - program to generate information stream for processing.
 *
 *          In order to run, this program must be named according
 *          to the value stored in the name[] array. Do not change
 *          the code or values in this source code, but match the
 *          executable name as appropriate.
 *
 *          By default, no data is generated. In order to alter
 *          this behavior, provide a whole number as the first
 *          argument on the command-line, and that many lines of
 *          output will be generated (to STDOUT by default).
 *
 * To compile: gcc -o PROGRAM_NAME info.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
int main(int argc, char **argv)
{
	int index, max, x, y, i;
	char name[] = { 0x64, 0160, (114-63), (064+03), 0x00 };
	char file[(strlen(name)+1)]; 
 
	x = strlen(*(argv+0));
	y = strlen(name);
 
	for (i = 0; i <= y; i++)
	{
		file[i] = *(*(argv+0)+(x-y)+i);
	}
 
	if (strcasecmp(file, name) != 0)
	{
		fprintf(stderr, "ERROR: filename is incorrect!\n");
		fprintf(stderr, "       must match name[] string\n");
		exit(1);
	}
 
	if (argc >= 2)
	{
		max = atoi(*(argv+1));
	}
	else
	{
		max = 0;
	}
 
	if (argc >= 3)
	{
		srand(atoi(*(argv+2)));
	}
	else
	{
		srand(1730);
	}
 
	for (index = 1; index <= max; index++)
	{
		x = rand() % 849 + 50;
		y = rand() % 1899 + 100;
 
		if (((x % 3) == 0) && ((y % 4) > 2))
			fprintf(stdout, "%d\tblank\n", index);
		else if (((x % 7) < 4) && ((y % 5) > 3))
			fprintf(stdout, "%d\terror %d\n", index, ((x % 20) + 1));
		else
			fprintf(stdout, "%d\t%.3d-%.3d\n", index, x, y);
	}
 
	return(0);
}

NOTE: Copying/pasting this code into a file to do the project will not earn you credit for task 1. You MUST copy the file from the specified location.

Task 3: Execute your program

Once you have things working:

  • Run the program and have it generate 1024 lines of output
  • Write down the command-line used in a file called task3.txt

Task 4: Store your output

  • Save your program's output (the 1024 lines) to a file called task4.txt

Task 5: Find and count the duplicates

  • Ignoring the index values in the left-most column, determine which numerical codes occur more than once by concocting a command-line incantation that appropriately filters and processes the output.
  • Also display a count of the total number of lines in the output, along with the total number of lines with valid numeric values (ignore “blank” lines and lines with error codes). Finally, display the total count of lines that have duplicates.
    • Omit all the lines that occurred only once (i.e., have no duplicates); it will make your data set immediately more manageable.
  • Put your resulting command-line(s) in a file called task5.sh
  • Put the output (result) of your command-line(s) in a file called task5.out

For example, let's say we had the following output:

1	671-477
2	error 4
3	742-703
4	671-477
5	blank
6	516-336
7	671-477
8	742-703
9	546-031
10	089-322
11	442-1220
12	blank

As a result of running your solution, the following output should be produced:

671-477 occurs 3 times
742-703 occurs 2 times
Out of 12 lines (9 with numeric values), there were a total of 5 lines with duplicate values

Task 6: Find and display the max duplicates

From your filtered output in the previous task, write some logic that:

  • Removes the “blank” lines and error codes from your original output
  • Collapses any duplicates (have just 1 value for each duplicate set)
  • Sorts the resulting numeric data according to the value to the left of the dash.
  • Re-indexes the data to create a new, more refined, data file. Have a single tab separate the index value from the data value on each line.
  • Put your logic in a file called task6.sh
  • Put your output in a file called task6.out

Submission

To successfully complete this project, the following criteria must be met:

  • All criteria indicated above
  • To signal completion, submit an archive containing all the files generated in each task above.
    • Task 0: task0.question and task0.response
    • Task 1: task1.txt
    • Task 2: task2.txt
    • Task 3: task3.txt
    • Task 4: task4.txt
    • Task 5: task5.sh and task5.out
    • Task 6: task6.sh and task6.out
    • Put all these files in a tar archive called dataproc.tar
    • Compress it with max compression using gzip
    • The resulting archive should be named: dataproc.tar.gz

To submit this project to me using the submit tool, run the following command at your Lab46 prompt:

$ submit unix dataproc dataproc.tar.gz
Submitting unix project "dataproc":
    -> dataproc.tar.gz(OK)

SUCCESSFULLY SUBMITTED

You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or location mismatches.

haas/fall2020/unix/projects/dataproc.txt · Last modified: 2014/09/29 18:28 by 127.0.0.1