Project: Data Archiving and Compression (dac0)

Corning Community College

CSCS1730 UNIX/Linux Fundamentals

Objective

Consult technical documentation to locate and operate the tools needed to accomplish a task. Collaboratively construct an informative document detailing how to prepare for and begin this process.

Prerequisites

To complete this project successfully, you should already have the following skills:

ability to read the manual pages and use the information therein
ability to copy, move, and list files
ability to navigate around the filesystem

grabit

Because part of this activity is to test your ability to navigate around the filesystem and manipulate files on your own, no grabit is configured for this project.

EDIT

You will want to go here to edit and fill in the various sections of the document:

https://lab46.g7n.org/notes/unix/fall2023/projects/dac0

DAC0 project documentation

Toolbox

It would be especially useful to review the manual pages or any documentation on the following resources:

ls(1) - lists files.
ls -l - long listing; shows additional information (permissions, owner, size, modification time) for everything in the current directory.
zip(1) - packs files into a compressed .zip archive; unzip(1) extracts them again.
rev(1) - reverses the order of the characters on each line of a text file.
tac(1) - prints the lines of a file in reverse order, so that the last line comes first and the first line comes last.
uuencode(1) - encodes a file as printable ASCII text; its companion uudecode(1) decodes such a file back into the original.
file(1) - reports what kind of data a file contains, which tells you whether anything special needs to be done to it.
stat -c %s filename - prints the size of a file in bytes (see stat(1)).
tar -xf archive.tar[.gz|.bz2] - extracts tar, gzip-compressed tar, and bzip2-compressed tar archives (see tar(1)).
rm -rf directoryname - removes a directory along with all of the files and subdirectories inside it (see rm(1)); use with care.

Background

What is an archive

An archive is a single file that holds multiple files. An uncompressed archive is roughly the size of all of its member files added together, plus a small amount of bookkeeping overhead. If you were to put seven 10 MB files into an archive, the archive would be about 70 MB. tar(1) is a command that performs archiving; on its own it does not compress, but it can pass the archive through a compressor such as gzip or bzip2 when given the -z or -j option.
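
A minimal sketch of the size relationship described above, assuming GNU tar and GNU stat as found on most Linux systems. The scratch directory and the file names (one.dat, bundle.tar, and so on) are made up for illustration. It archives three small files without compression and compares the archive size against the sum of the member sizes.

```shell
# Work in a scratch directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"

# Three hypothetical 1000-byte files.
for name in one two three; do
    head -c 1000 /dev/zero > "$name.dat"
done

# Archive them with tar; no compression option is given.
tar -cf bundle.tar one.dat two.dat three.dat

# Sum of the member sizes versus the size of the archive.
members=$(( $(stat -c %s one.dat) + $(stat -c %s two.dat) + $(stat -c %s three.dat) ))
archive=$(stat -c %s bundle.tar)
echo "members: $members bytes, archive: $archive bytes"
```

The archive comes out somewhat larger than the members combined, because tar stores a header for each file and pads the data to fixed-size blocks.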

What actions can be performed on an archive?

With an archive you can list its contents, append more files to it, extract selected members, or 'unpack' all of its contents.
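
The actions listed above could look like this with GNU tar. This is only a sketch: the scratch directory, file names, and the directories some/ and all/ are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"
echo "first"  > one.txt
echo "second" > two.txt
echo "third"  > three.txt

tar -cf bundle.tar one.txt two.txt   # create an archive holding two files
tar -tf bundle.tar                   # view (list) the contents
tar -rf bundle.tar three.txt         # append another file to the archive

mkdir some all
tar -xf bundle.tar -C some two.txt   # extract only one member, into some/
tar -xf bundle.tar -C all            # unpack everything, into all/
ls some all
```

The -C option tells tar to change into the named directory before extracting, which keeps the two extractions from mixing.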

What is compression

Compression is a technique used to reduce file size.

How does compression differ from archiving?

Short answer: archiving stores multiple files in one file without changing the amount of space the files take up; compression re-encodes the data (changes the bit pattern) to reduce the size of the file or files.

Types of compression (lossy vs lossless)

The two types of compression are referred to as lossy and lossless. Lossy compression discards some data during the compression process, so the original cannot be recovered exactly, while lossless compression allows the original data to be reconstructed perfectly from the compressed data when the file is extracted. Hence the names lossy and lossless.
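
gzip is a lossless compressor, so a round trip through it should give back exactly the bytes that went in. The following is a small sketch of that idea, assuming GNU gzip and coreutils; sample.txt and the other file names are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"

# Hypothetical sample: highly repetitive text compresses well.
yes "the quick brown fox jumps over the lazy dog" | head -n 2000 > sample.txt

# Compress to a separate file (-c writes to standard output, -9 is maximum effort).
gzip -9 -c sample.txt > sample.txt.gz

# Decompress to a new name and compare byte for byte.
gzip -dc sample.txt.gz > restored.txt
stat -c '%s %n' sample.txt sample.txt.gz
cmp sample.txt restored.txt && echo "restored copy is identical"
```

The stat line shows the compressed file taking far fewer bytes than the original, while cmp confirms that nothing was lost.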

Procedure

In the UNIX class Public Directory on lab46 you will find a dac0/ subdirectory. To reach it, first cd /var/public/fall2023/unix/ and use ls to check that dac0 is there. Enter the dac0 directory, and then the directory named after your username. From there you can cp the files to the dac0 directory in your home directory (assuming you have already made a dac0 directory where you store your projects), specifying its path, such as ~/src/fall2023/unix/dac0/. Then head to that directory using the same path and verify with ls that all of the files are there.

Next, extract each of the archives, using “tar -xf file.tar” for tar archives and “unzip file.zip” for zip archives.
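
One way to choose the extraction command is by file name. The sketch below assumes GNU tar, which detects gzip or bzip2 compression on its own when extracting. The sample archives are built on the spot and their names are hypothetical, not the names of the project files.

```shell
demo=$(mktemp -d)
cd "$demo"

# Build two hypothetical sample archives to practice on.
echo "plain tar member" > alpha.txt
echo "gzip tar member"  > beta.txt
tar -cf  first.tar     alpha.txt
tar -czf second.tar.gz beta.txt
rm alpha.txt beta.txt

# Pick the extraction command from the file name.
for f in *; do
    case "$f" in
        *.tar|*.tar.gz|*.tar.bz2) tar -xf "$f" ;;   # tar detects the compression itself
        *.zip)                    unzip "$f" ;;
        *)                        echo "not an archive: $f" ;;
    esac
done
ls
```

File names can mislead, so running file(1) on anything that does not extract as expected is a reasonable next step.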

Check each of the text files: some are in English, and others you will have to reverse, flip, or decode using the commands in the toolbox above.
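
To see what reversed and flipped text look like, and how to undo each, here is a small sketch using rev and tac on a made-up two-line message. All file names are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"

# A hypothetical two-line message, then two scrambled versions of it.
printf 'hello world\nsecond line\n' > original.txt
rev original.txt > reversed.txt    # each line's characters are reversed
tac original.txt > flipped.txt     # the order of the lines is reversed

cat reversed.txt
cat flipped.txt

# Applying the same tool a second time undoes the change.
rev reversed.txt > unreversed.txt
tac flipped.txt  > unflipped.txt
cmp original.txt unreversed.txt && cmp original.txt unflipped.txt && echo "both restored"
```

A file that has been both reversed and flipped needs both tools applied, in either order.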

Once this is done, use the “stat -c %s filename” command to see how big each file is, then rename the files in order from smallest to largest using the letters of the alphabet, a being the smallest and h being the largest.
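
Checking sizes one file at a time works, but GNU stat can also print a size and name for several files at once, and sort can order the result. A sketch with three hypothetical files (the real project has eight):

```shell
demo=$(mktemp -d)
cd "$demo"

# Hypothetical files of different sizes.
head -c 300 /dev/zero > notes.txt
head -c 100 /dev/zero > memo.txt
head -c 200 /dev/zero > draft.txt

# Print "size name" for every file, smallest first.
stat -c '%s %n' * | sort -n

# Rename by hand according to that listing.
mv memo.txt  a
mv draft.txt b
mv notes.txt c
stat -c '%s %n' a b c
```

Here %s is the size in bytes and %n is the file name; sort -n orders the lines numerically by the leading size.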

Repository Operations

Checking current repository status

You can check the current repository status by using “hg status”

Adding untracked files to repository

You can add untracked files to your repository by using the “hg add” command. With no arguments it adds every untracked file; name specific files (“hg add file1 file2”) or use -I with a pattern to add only the files that match.

Committing changes

To commit, use the “hg commit [file1 file2 …]” command with the files that you already added. It will open a text editor such as nano or pico. Once there, type a message describing your commit, then save and exit. The editor may ask you to confirm saving to a confusing-looking temporary file name; press enter to accept it, and you will be returned to the command line. Use hg status to check whether the files you just committed are still listed. If you don't see them, then your commit was successful.

Pushing commits upstream to server

To send your commits to the server, use “hg push”; this makes them available to be pulled at a later time.

Pulling changes from server

You can pull changes from the server by using the “hg pull” command.

Updating current repository

Use “hg update” to update the files in your working directory after changes have been pulled.

On your development system

On your development system, I want you to do the following:

Figure out the format of the files, and read up on the available tools for manipulating them
Install any needed tools to accomplish the task of accessing the information contained within
Extract the contents of the archive and study it (contents will extract to the current working directory, so you WILL want to be in a custom project directory)
Analyze the files extracted from the archive. Each file will ultimately be contextually readable plain text (in English), but some may be encoded or compressed or otherwise manipulated and will need further processing to get to the final readable state.
- Once in their readable states, name the files a, b, c, d, e, f, g, h, in order of their file sizes (in bytes), from least to greatest.
Place these single-lettered files in a new tar archive called result.tar (files should be added to the archive in the current directory, do not embed any directory information in the archive).
Compress it (using maximum compression) with gzip(1); it should now be called result.tar.gz
- you are going to submit this archive
In addition to the created archive, you will also submit a text file named dac0.steps which will contain step-by-step command-lines used to copy, extract, manipulate, rename, create a new archive and compress result.tar.gz (document from the point of having the copied files in place on your development system).
- you do NOT need to include any repository, verify, or submit commands, JUST those steps for accomplishing the core task of the files in the project to stated specifications.
- The file should JUST contain the exact commands you used, in order from start to finish. If you'd like to add any additional commentary, prefix it with a # sign.
  - Commands should be left justified, one command-line per line (lines can wrap).
  - Do NOT number your steps. Just place the command-line incantations utilized, one after the other.
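
The packaging steps above (a tar archive with no directory information, then maximum gzip compression) can be rehearsed on placeholder files first. This is only a sketch; the placeholder contents are invented and stand in for your eight finished files.

```shell
demo=$(mktemp -d)
cd "$demo"

# Placeholder content standing in for the eight finished files.
for name in a b c d e f g h; do
    echo "placeholder for file $name" > "$name"
done

# Name the files directly (no directory prefix) so no path is stored in the archive.
tar -cf result.tar a b c d e f g h

# -9 asks gzip for maximum compression; result.tar becomes result.tar.gz.
gzip -9 result.tar

# List the compressed archive to confirm what it holds.
tar -tzf result.tar.gz
```

If the listing shows bare names such as a and b rather than something like dac0/a, the archive will extract into the current directory.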

Verification

One of the tests I will perform for compliance will involve comparing your resulting files against the expected results, to see if they all conform to project specifications.

I will make use of a checksum to verify exactness.

You will need to run this from your dac0 project directory, where your individual a-h files are located.

You can check your project by typing in the following at the prompt (on lab46):

lab46:~/src/SEMESTER/unix/dac0$ filechk unix dac0

If all aligns, you will see this:

==========unix/dac0 whole file comparison=========================================
 For the file: a
     you want: cca000c9cb8a5c134bed61154a7907ba
     you have: cca000c9cb8a5c134bed61154a7907ba MATCH

 For the file: b
     you want: c8136ca761229bad59497021a8f425af
     you have: c8136ca761229bad59497021a8f425af MATCH

 For the file: c
     you want: d6db0da4b084fff4b255ae7a4e95ed62
     you have: d6db0da4b084fff4b255ae7a4e95ed62 MATCH

 For the file: d
     you want: dadd5272203fa77b80f26cf355e6e833
     you have: dadd5272203fa77b80f26cf355e6e833 MATCH

 For the file: e
     you want: af095aeaaf55a8a3b351a921baebc9e7
     you have: af095aeaaf55a8a3b351a921baebc9e7 MATCH

 For the file: f
     you want: 84d0fd81532fac6c743c8054f76f0270
     you have: 84d0fd81532fac6c743c8054f76f0270 MATCH

 For the file: g
     you want: c36a56a9ab8190e4d007bd16e377639a
     you have: c36a56a9ab8190e4d007bd16e377639a MATCH

 For the file: h
     you want: 226c53b09f112cf7323cd5263302ea95
     you have: 226c53b09f112cf7323cd5263302ea95 MATCH

If something is off, your checksum will not match the dac0 checksum, and verification will instead say “MISMATCH”, as follows (note that a mismatched checksum can be anything, and likely not what is seen in this example):

==========unix/dac0 whole file comparison=========================================
 For the file: a
     you want: cca000c9cb8a5c134bed61154a7907ba
     you have: cca000c9cb8a5c134bed61154a7907ba MATCH

 For the file: b
     you want: d8136ca761229bad59497021a8f425af
     you have: c8136ca761229bad59497021a8f425af MISMATCH

 For the file: c
     you want: d6db0da4b084fff4b255ae7a4e95ed62
     you have: d6db0da4b084fff4b255ae7a4e95ed62 MATCH

 For the file: d
     you want: dadd5272203fa77b80f26cf355e6e833
     you have: dadd5272203fa77b80f26cf355e6e833 MATCH

 For the file: e
     you want: af095aeaaf55a8a3b351a921baebc9e7
     you have: af095aeaaf55a8a3b351a921baebc9e7 MATCH

 For the file: f
     you want: 84d0fd81532fac6c743c8054f76f0270
     you have: 84d0fd81532fac6c743c8054f76f0270 MATCH

 For the file: g
     you want: d36a56a9ab8190e4d007bd16e377639a
     you have: c36a56a9ab8190e4d007bd16e377639a MISMATCH

 For the file: h
     you want: 226c53b09f112cf7323cd5263302ea95
     you have: 226c53b09f112cf7323cd5263302ea95 MATCH
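
The 32-digit hexadecimal values in the output above have the form of MD5 digests, although this page does not say which algorithm filechk uses. Assuming MD5, a sketch with md5sum shows why even a tiny difference in a file leads to a MISMATCH. The file names are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"

printf 'The same text.\n'  > expected.txt
printf 'The same text.\n'  > identical.txt
printf 'The same text. \n' > altered.txt    # one extra trailing space

# Identical content gives identical checksums; any difference changes the sum.
md5sum expected.txt identical.txt altered.txt

want=$(md5sum < expected.txt)
for f in identical.txt altered.txt; do
    have=$(md5sum < "$f")
    if [ "$want" = "$have" ]; then
        echo "$f MATCH"
    else
        echo "$f MISMATCH"
    fi
done
```

A stray blank line, trailing space, or missed decoding step is enough to change the checksum, so a MISMATCH usually means the file needs another look rather than a rename.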

SUBMISSION

To be successful in this project, the following criteria (or their equivalent) must be met:

Project must be submitted on time, by the deadline.
- Late submissions will lose 33% credit per day, with the submission window closing on the 3rd day following the deadline.
Track/version your project files in your lab46 semester repository
Submit a copy of your final archive to me using the submit tool.

Submit Tool Usage

Let's say you have completed work on the project and are ready to submit; you would do the following:

lab46:~/src/SEMESTER/DESIG/PROJECT$ submit DESIG PROJECT file1 file2 file3 ... fileN

A less abstract instantiation of the above (to help you transition):

lab46:~/src/SEMESTER/unix/dac0$ submit unix dac0 result.tar.gz dac0.steps
Submitting unix project "dac0":
    -> result.tar.gz(OK)
    -> dac0.steps(OK)

SUCCESSFULLY SUBMITTED

You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or locational mismatches.

RUBRIC

I'll be evaluating the project based on the following criteria:

26:dac0:final tally of results (26/26)
*:dac0:archive submitted [2/2]
*:dac0:archive has correct name of result.tar.gz [2/2]
*:dac0:archive is max compressed with gzip [2/2]
*:dac0:archive is a tar archive [2/2]
*:dac0:archive extracts into current directory [2/2]
*:dac0:archive contains 8 english readable files [2/2]
*:dac0:archived files are named a-h [2/2]
*:dac0:archived files named in order of size [2/2]
*:dac0:instructions submitted in text file [2/2]
*:dac0:instructions in file named dac0.steps [2/2]
*:dac0:dac0.steps contains list of instructions for accomplishing task [2/2]
*:dac0:dac0.steps instructions are accurate and correct [2/2]
*:dac0:dac0.steps any extra information after hash mark [2/2]

Pertaining to the collaborative authoring of project documentation

each class member is to participate in the contribution of relevant information and formatting of the documentation
- minimal member contributions consist of:
  - near the class average edits (a value of at least four productive edits)
  - near the class average for content change (a value of at least 256 bytes (absolute value of data content change))
  - within 20% of the class content contribution average
  - no zero-sum commits (adding in one commit then later removing in its entirety for the sake of satisfying edit requirements)
- adding and formatting data in an organized fashion, aiming to create an informative and readable document that anyone in the class can reference
- content contributions will be factored into a documentation coefficient, a value multiplied against your actual project submission to influence the end result:
  - no contributions: coefficient is 0.50
  - less than minimum contributions: coefficient is 0.75
  - met minimum contribution threshold: coefficient is 1.00

Additionally

Solutions not abiding by the spirit of the project will be subject to a 50% overall deduction
Solutions not utilizing descriptive why and how comments will be subject to a 25% overall deduction
Solutions not utilizing indentation to promote scope and clarity or otherwise maintaining consistency in code style and presentation will be subject to a 25% overall deduction
Solutions not organized and easy to read (assume a terminal at least 90 characters wide, 40 characters tall) are subject to a 25% overall deduction
