CSCS1730 UNIX/Linux Fundamentals
PROJECT: DATA ARCHIVING AND COMPRESSION (dac0)
OBJECTIVE
Reference technical documentation to locate and operate particular tools to aid you in accomplishing a task.
PREREQUISITES
To successfully accomplish/perform this project, the listed resources/experiences need to be consulted/achieved:
- ability to read the manual pages and use the information therein
- ability to copy, move, and list files
- ability to navigate around the filesystem
NO GRABIT
Because part of this activity is to test your ability to navigate around the filesystem and manipulate files on your own, there is no grabit configured for this project.
Navigate to the UNIX PUBLIC DIRECTORY yourself and manually copy your project files back into your repository.
TOOLBOX
It would be especially useful to review the manual pages or any documentation on the following resources:
- cp(1)
- mv(1)
- ls(1)
- mkdir(1)
- tar(1)
- xz(1)
- gzip(1)
- bzip2(1)
- zip(1)
- tac(1)
- rev(1)
- cat(1)
- file(1)
- uudecode(1)
- md5sum(1)
BACKGROUND
When we talk about archives, there are commonly two separate actions taking place. Sometimes they are intertwined; at other times they are discrete steps.
They are:
- archiving / extracting
- compression / decompression
Archives are merely a manifestation of a common computing concept: a container.
Containers encapsulate things: in this case, files. The fact that UNIX tries to make everything a file makes this ability all the more useful.
Compression, on the other hand, is an action performed on a single file. Using various algorithms, we accomplish a sort of "more in less": we take the data present and cram it into a smaller box (file), where the aim is to take up less storage on the filesystem (which also makes copying easier).
There are many compression algorithms in existence. There are commonly two categories of compression algorithm:
- lossless compression - no data is lost as a part of the compression process
- lossy compression - unnecessary data is discarded as part of the compression process
Wikipedia has categories identifying various algorithms implemented for both lossless and lossy compression algorithms.
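To make "lossless" concrete, here is a minimal sketch using gzip(1). The file sample.txt and its contents are made up for illustration and have nothing to do with the project files. It compresses a file, compares the sizes, then decompresses and checks that the result is byte-for-byte identical to an untouched copy.

```shell
# Demonstration only: sample.txt is a made-up file, not a project file.
workdir=$(mktemp -d)
cd "$workdir"

# Build a repetitive text file; repetition compresses well.
i=0
while [ "$i" -lt 1000 ]; do
    echo "the quick brown fox jumps over the lazy dog"
    i=$((i + 1))
done > sample.txt

cp sample.txt original.txt          # keep an untouched copy for comparison
gzip sample.txt                     # replaces sample.txt with sample.txt.gz
wc -c original.txt sample.txt.gz    # compare the sizes in bytes

gzip -d sample.txt.gz               # decompress; sample.txt comes back
cmp original.txt sample.txt && echo "identical: nothing was lost"
```

A lossy algorithm could not pass that final cmp(1) check, which is why lossy compression is reserved for data such as audio or images, where an approximation is acceptable.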
Where confusion may set in is when a tool combines the actions of archival AND compression. But if you think about it, even in such cases, we always end up with one file, and that file is compressed (unless we have a concatenation of separately compressed files into a single file).
Archives are useful in that they let us pack items together. If something needs 100 files, copying it or installing it onto another system would be more complex if we had to deal with each of those files individually. Archives simplify the problem in that they can provide us all those files, all contained within a single file (lessening opportunities for error). So, archives make our lives easier.
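The two actions can be seen separately, and then combined, in a short sketch with tar(1) and gzip(1). All file names here are invented for the demonstration. It first archives three files into one container, then compresses that single file, and finally does both in one command.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Three made-up files standing in for "the files something needs".
echo "alpha" > one.txt
echo "beta"  > two.txt
echo "gamma" > three.txt

# Step 1, archiving: pack the files into one container (no compression yet).
tar -cf bundle.tar one.txt two.txt three.txt

# Step 2, compression: act on that single file.
# -c writes to standard output, so bundle.tar itself is left in place.
gzip -c bundle.tar > bundle.tar.gz

# Both steps at once: the z option asks tar to pass the archive through gzip.
tar -czf combined.tar.gz one.txt two.txt three.txt

# Either way, listing the compressed archive shows the same members.
tar -tzf bundle.tar.gz
tar -tzf combined.tar.gz
```

Whichever route is taken, the end product is a single compressed file holding a tar archive.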
ON YOUR DEVELOPMENT SYSTEM
Once you obtain the project files on the LAB46 SHELL SYSTEM, and transfer them over to your development system (you can add them to your private repository), you will want to perform the project off of lab46, as you may need to install additional packages and tools to process the data.
On your development system, I want you to do the following:
- Figure out the format of the files, and read up on the available tools for manipulating them
- Install any needed tools to accomplish the task of accessing the information contained within
- Extract the contents of the archive and study it (contents will extract to the current working directory, so you WILL want to be in a custom project directory)
- Analyze the files extracted from the archive. Each file will ultimately be contextually readable plain text (in English), but some may be encoded or compressed or otherwise manipulated and will need further processing to get to the final readable state.
- Once in their readable states, name the files a, b, c, d, e, f, g, h, in order of their file sizes (in bytes), from least to greatest.
- Place these single-lettered files in a new tar archive called result.tar (files should be added to the archive in the current directory, do not embed any directory information in the archive).
- Compress it (using maximum compression) with gzip(1); it should now be called result.tar.gz
- you are going to submit this archive
- In addition to the created archive, you will also submit a text file named dac0.steps containing the step-by-step command-lines you used to copy, extract, manipulate, and rename the files, create the new archive, and compress it into result.tar.gz (document from the point of having the copied files in place on your development system).
- you do NOT need to include any repository, verify, or submit commands, JUST those steps for accomplishing the core task of the files in the project to stated specifications.
- The file should JUST contain the exact commands you used, in order from start to finish. If you'd like to add any additional commentary, prefix it with a # sign.
- Commands should be left justified, one command-line per line (lines can wrap).
- Do NOT number your steps. Just place the command-line incantations utilized, one after the other.
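The packaging steps above might look something like the following sketch. The eight files here are placeholders created on the spot, not the real processed files, and the scratch directory stands in for your project directory. The command-lines are written the way they could appear in dac0.steps: left justified, unnumbered, with commentary after a # sign.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Placeholder stand-ins for the eight processed files (NOT the real contents).
for name in a b c d e f g h; do
    echo "placeholder text for $name" > "$name"
done

# Archive by bare file name from inside the directory, so that no
# directory component is stored with the members.
tar -cf result.tar a b c d e f g h

# -9 selects gzip's highest compression level; result.tar becomes result.tar.gz.
gzip -9 result.tar

# Confirm the member names carry no leading path.
tar -tzf result.tar.gz
```

Listing the archive before submitting is a cheap way to catch an embedded directory name, which would cause the files to extract somewhere other than the current directory.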
PROCESS
On the system hosting the needed resources, egress from your home directory and navigate to the UNIX PUBLIC DIRECTORY, locate the subdirectory for this project and navigate there.
Assess the layout of files. What type of files are here? Are they named in a manner so as to indicate a specific course of action?
Once you locate your files, proceed to copy them into your home directory, into a custom project subdirectory you've made, ideally in your repository.
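As a rough sketch of the copy step, the following uses temporary stand-in directories, since locating the real public directory and your own repository path is part of the exercise. Every path and file name here is hypothetical.

```shell
# Stand-ins only: these temporary directories play the roles of the public
# project directory and your repository; the real paths are for you to find.
sandbox=$(mktemp -d)
public_dir="$sandbox/public/dac0"      # hypothetical source location
project_dir="$sandbox/repo/dac0"       # hypothetical destination

mkdir -p "$public_dir"
echo "pretend archive data" > "$public_dir/example.archive"   # made-up file name

ls -l "$public_dir"                     # look before copying: what is here?
mkdir -p "$project_dir"                 # create the project subdirectory (and parents)
cp -v "$public_dir"/* "$project_dir"/   # copy everything; -v reports each file
ls -l "$project_dir"                    # confirm the copy arrived
```

Copying (rather than moving) leaves the originals untouched, so you can always fetch a fresh copy if an experiment goes wrong.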
Ingress to that destination, ensure said files have been included into your repository. Do note: your repository by default may be configured to ignore many archive files. To override this, add them specifically by name.
Transition to your development system, navigate into your repository. Obtain the files, and verify they are present.
Focusing on one archive at a time:
- what type of archive is it?
- what tools might be needed to extract and/or decompress it?
- reference the manual page(s) for any tools in question, determine any options that need to be applied.
- attempt to extract files from the archive. Did it work? How can you tell what new additions there are? (Do note: many of these tools have options to enable verbose operation, which might prove particularly helpful in this endeavour).
- one at a time, investigate any new files:
- is it viewable?
- is it readable?
- if not, is there something that looks out of sorts that you could manipulate to correct it?
- is it contextually readable, English text?
- take note of the file size
- as you accumulate processed files, proceed to order them by file size
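The investigative loop above can be practiced on a throwaway archive before touching the real ones. This sketch builds its own made-up archive (mystery.bin and its members are invented, and the per-line reversal is just one example of a manipulation, not a statement about what the project files contain), then identifies, lists, extracts, repairs, and measures.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a made-up archive to practice on (not a project file): one plain
# file and one whose lines were each reversed with rev(1).
mkdir build
echo "this line reads normally" > build/plain.txt
echo "this line was reversed" | rev > build/odd.txt
tar -czf mystery.bin -C build plain.txt odd.txt
rm -r build

# What is it? file(1) inspects content, not the name (skipped if not installed).
if command -v file >/dev/null 2>&1; then file mystery.bin; fi

tar -tzvf mystery.bin      # list members verbosely without extracting
tar -xzvf mystery.bin      # extract; -v names each new file as it appears

cat odd.txt                # viewable, but not readable as English
rev odd.txt > odd.fixed    # undo the per-line reversal
cat odd.fixed

# Byte counts, smallest first (wc also prints a "total" line, which sorts last).
wc -c plain.txt odd.fixed | sort -n
```

Note that the archive's name says nothing reliable about its format, which is why checking with file(1) and the tool's own listing option comes before extracting.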
VERIFICATION
One of the tests I will perform for compliance will involve comparing your processed files against the expected results, to see if they conform to project specifications.
I will make use of a checksum to verify exactness.
You will need to run this from your dac0 project directory, where your individual a-h files are located.
You can check your project by typing in the following at the prompt (on lab46):
lab46:~/src/SEMESTER/unix/dac0$ filechk unix dac0
If all aligns, you will see this:
==========unix/dac0 whole file comparison=========================================
For the file: a
you want: cca000c9cb8a5c134bed61154a7907ba
you have: cca000c9cb8a5c134bed61154a7907ba MATCH
For the file: b
you want: c8136ca761229bad59497021a8f425af
you have: c8136ca761229bad59497021a8f425af MATCH
For the file: c
you want: d6db0da4b084fff4b255ae7a4e95ed62
you have: d6db0da4b084fff4b255ae7a4e95ed62 MATCH
For the file: d
you want: dadd5272203fa77b80f26cf355e6e833
you have: dadd5272203fa77b80f26cf355e6e833 MATCH
For the file: e
you want: af095aeaaf55a8a3b351a921baebc9e7
you have: af095aeaaf55a8a3b351a921baebc9e7 MATCH
For the file: f
you want: 84d0fd81532fac6c743c8054f76f0270
you have: 84d0fd81532fac6c743c8054f76f0270 MATCH
For the file: g
you want: c36a56a9ab8190e4d007bd16e377639a
you have: c36a56a9ab8190e4d007bd16e377639a MATCH
For the file: h
you want: 226c53b09f112cf7323cd5263302ea95
you have: 226c53b09f112cf7323cd5263302ea95 MATCH
If something is off, your checksum will not match the dac0 checksum, and verification will instead say "MISMATCH", as follows (note that a mismatched checksum can be anything, and likely not what is seen in this example):
==========unix/dac0 whole file comparison=========================================
For the file: a
you want: cca000c9cb8a5c134bed61154a7907ba
you have: cca000c9cb8a5c134bed61154a7907ba MATCH
For the file: b
you want: c8136ca761229bad59497021a8f425af
you have: d8136ca761229bad59497021a8f425af MISMATCH
For the file: c
you want: d6db0da4b084fff4b255ae7a4e95ed62
you have: d6db0da4b084fff4b255ae7a4e95ed62 MATCH
For the file: d
you want: dadd5272203fa77b80f26cf355e6e833
you have: dadd5272203fa77b80f26cf355e6e833 MATCH
For the file: e
you want: af095aeaaf55a8a3b351a921baebc9e7
you have: af095aeaaf55a8a3b351a921baebc9e7 MATCH
For the file: f
you want: 84d0fd81532fac6c743c8054f76f0270
you have: 84d0fd81532fac6c743c8054f76f0270 MATCH
For the file: g
you want: c36a56a9ab8190e4d007bd16e377639a
you have: d36a56a9ab8190e4d007bd16e377639a MISMATCH
For the file: h
you want: 226c53b09f112cf7323cd5263302ea95
you have: 226c53b09f112cf7323cd5263302ea95 MATCH
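If you want to look at a checksum yourself on your development system, md5sum(1) from the toolbox produces 32-hex-digit digests of the same shape as those shown above. That filechk uses MD5 internally is an assumption on my part here, suggested only by the digest length. The file below is a made-up stand-in, so its checksum will not match any project value.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A made-up file named a; its checksum will NOT match the project's.
printf 'hello\n' > a

md5sum a                             # prints "<32 hex digits>  a"
sum=$(md5sum a | cut -d ' ' -f 1)    # keep only the digest
echo "you have: $sum"

# Changing even one byte changes the digest completely.
printf 'Hello\n' > a
md5sum a
```

Because a single differing byte (a stray newline, a leftover carriage return) changes the whole digest, a MISMATCH usually means the file needs one more processing step or has picked up an unintended edit.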
SUBMISSION
To be successful in this project, the following criteria (or their equivalent) must be met:
- Project must be submitted on time, by the deadline.
- Late submissions will lose 33% credit per day, with the submission window closing on the 3rd day following the deadline.
- Track/version your project files in your private semester repository
- Submit a copy of your final archive to me using the submit tool.
SUBMIT TOOL USAGE
Let's say you have completed work on the project and are ready to submit. You would do the following:
lab46:~/src/SEMESTER/DESIG/PROJECT$ submit DESIG PROJECT file1 file2 file3 ... fileN
A less abstract instantiation of the above (to help you transition):
lab46:~/src/SEMESTER/unix/dac0$ submit unix dac0 result.tar.gz dac0.steps
Submitting unix project "dac0":
-> result.tar.gz(OK)
-> dac0.steps(OK)
SUCCESSFULLY SUBMITTED
You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or locational mismatches.
RUBRIC
I'll be evaluating the project based on the following criteria:
78:dac0:final tally of results (78/78)
*:dac0:archive submitted [6/6]
*:dac0:archive has correct name of result.tar.gz [6/6]
*:dac0:archive is max compressed with gzip [6/6]
*:dac0:archive is a tar archive [6/6]
*:dac0:archive extracts into current directory [6/6]
*:dac0:archive contains 8 english readable files [6/6]
*:dac0:archived files are named a-h [6/6]
*:dac0:archived files named in order of size [6/6]
*:dac0:instructions submitted in text file [6/6]
*:dac0:instructions in file named dac0.steps [6/6]
*:dac0:dac0.steps contains list of instructions for accomplishing task [6/6]
*:dac0:dac0.steps instructions are accurate and correct [6/6]
*:dac0:dac0.steps any extra information after hash mark [6/6]
ADDITIONALLY
- Solutions not abiding by the spirit of the project will be subject to a 50% overall deduction
- Solutions not utilizing descriptive why and how comments will be subject to a 25% overall deduction
- Solutions not utilizing indentation to promote scope and clarity or otherwise maintaining consistency in code style and presentation will be subject to a 25% overall deduction
- Solutions not organized and easy to read (assume a terminal at least 90 characters wide, 40 characters tall) are subject to a 25% overall deduction