Corning Community College
CSCS2330 Discrete Structures
~~TOC~~
======Project: RUN-LENGTH ENCODING - DATA COMPRESSION FUN (dcf0)======
=====Errata=====
Any changes that have been made.
* Revision 0.1: Enhanced included 'check' script (20170901)
* Revision 0.2: Filled out the "Verify Results" section (20170903)
* Revision 0.3: Added a "Check Results" section (20170904)
* Revision 0.4: Further enhanced 'check' script, and updated project page verify section to reflect improved functionality (20170904)
* Revision 0.5: After some absolutely incredible irc shenanigans in what I've come to call the "discrete irc labor day power hours", some 1337 h4xxing was done, and it was discovered that the included in/sample2.bmp.rle file was actually incorrectly encoded. Finally a seemingly working copy was created, only to later discover that it, while not incorrect, was also not optimal. So, as of 8pm Monday Sept 4th, a more optimal and correct in/sample2.bmp.rle file was placed in the project directory. Please run 'make getdata' to share in the goodness. And thank you to everyone who showed up: THAT is why I do what I do. (20170904)
=====Objective=====
To apply your skills in implementing an encoding scheme that, in ideal circumstances, will lead to a smaller storage footprint.
=====Encoding=====
A Google search defines encoding as:
* general: convert into a coded form.
* COMPUTING-centric: convert (information or an instruction) into a particular form.
When we program, we encode ideas into syntax.
When you compile source code, the compiler encodes your coding syntax into machine code.
When we lay out a pattern, we encode particular aspects of a process.
Encoding can be as simple as a 1:1 translation (such as shifting letters around to encipher a message), but it can also take other forms.
=====Compression=====
Compression is a certain form of encoding, one where we focus on taking something (typically data, bits, bandwidth, instructions) and encoding it in a way where it ends up requiring less storage space than when we began.
According to Wikipedia, data compression is: the process of encoding digital information using fewer bits.
Another aspect to consider is whether the act of encoding data results in a reversible pathway or not. Sometimes the process of encoding can eliminate seemingly unnecessary data, aiding in the task of taking up less space. When data is trimmed in this manner, we say the process is **lossy**; when all data is preserved, it is **lossless**.
For more information on lossy and lossless data compression:
* [[http://en.wikipedia.org/wiki/Lossless_data_compression|lossless]] - no data is lost as a part of the compression process
* [[http://en.wikipedia.org/wiki/Lossy_data_compression|lossy]] - unnecessary data is discarded as part of the compression process
Wikipedia has categories identifying various algorithms implemented for both [[http://en.wikipedia.org/wiki/Category:Lossless_compression_algorithms|lossless]] and [[http://en.wikipedia.org/wiki/Category:Lossy_compression_algorithms|lossy]] compression algorithms.
====This week's algorithm: RLE====
The algorithm we'll be implementing this week is a relatively simple and straightforward one known as Run-Length Encoding (RLE for short).
Sometimes associated with certain image formats, RLE is an algorithm that, when matched with appropriately patterned data, can yield some space savings.
However, like other special-purpose algorithms, in its worst cases it can actually balloon the storage footprint of your data.
To demonstrate what RLE does, let us look at the following data:
aaaabcccdddddefghhhhhhhhhhhh (28 bytes total)
Notice how there are sequences of repeating letters? Turns out in some patterns of data, that can be quite the frequent occurrence (for example, images with large swaths of the same color).
RLE adds a sort of narrative to the data, sort of like what we might do if we were describing that string:
- there are four a's
- one b
- three c's
- five d's
- one e
- one f
- one g
- twelve h's
Or, to concisely represent it:
4a1b3c5d1e1f1g12h (17 bytes total)
By applying that "narrative" to the data, we were able to both preserve the integrity of the data and shrink its storage footprint.
And in that case, we got some pretty decent gains:
* 28 bytes in
* 17 bytes out
* 17 is ~60% of 28
* so we were able to compress that data by about 40%. Not bad.
However, the data must be appropriately repetitive... if not, it can grow in size. Take this example:
abcdefgh (8 bytes)
Applying our same narrative:
- one a
- one b
- one c
- ...
- one h
Or, concisely:
1a1b1c1d1e1f1g1h (16 bytes)
In that case, we not only didn't shrink, but we grew to twice our original size.
When RLE works, it can work pretty well, but when the data isn't conducive, it really doesn't help out that much at all.
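The "narrative" form of both examples above can be reproduced with a short sketch like the following. Note that **rle_text()** is a hypothetical helper name, and that it produces the human-readable text form (decimal count followed by the character) used in this section only, not the binary dcfX format specified below:

```c
#include <stdio.h>
#include <string.h>

/*
 * Illustration only: write the human-readable RLE "narrative" of src
 * (decimal run length followed by the character) into dst.
 * rle_text() is a hypothetical name, and this text form is NOT the
 * binary dcfX format the project asks for.
 * Returns the number of characters written (excluding the terminator).
 */
size_t rle_text(const char *src, char *dst, size_t dstsize)
{
    size_t len = strlen(src);
    size_t out = 0;
    size_t i = 0;

    if (dstsize == 0)
        return 0;
    dst[0] = '\0';

    while (i < len) {
        size_t run = 1;
        int n;

        /* count how many times the current character repeats */
        while (i + run < len && src[i + run] == src[i])
            run++;

        n = snprintf(dst + out, dstsize - out, "%zu%c", run, src[i]);
        if (n < 0 || (size_t)n >= dstsize - out) {
            dst[out] = '\0';    /* out of room: keep what fit so far */
            break;
        }
        out += (size_t)n;
        i += run;
    }
    return out;
}
```

Given "aaaabcccdddddefghhhhhhhhhhhh" this fills dst with "4a1b3c5d1e1f1g12h" (17 characters), while "abcdefgh" becomes "1a1b1c1d1e1f1g1h" (16 characters), matching the shrink and growth cases described above.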
=====dcfX RLE v1 specification=====
You'll be writing an **encode** and a **decode** program implementing RLE, in accordance with these published specifications (this way, anyone can take an RL-encoded file from someone else and decode it with theirs, and vice versa).
====Header====
Every RL-encoded file will start with the following header (12 fixed bytes, followed by the original filename):
* byte 0: 0x64
* byte 1: 0x63
* byte 2: 0x66
* byte 3: 0x58
* byte 4: 0x20
* byte 5: 0x52
* byte 6: 0x4c
* byte 7: 0x45
* byte 8: 0x00 (reserved)
* byte 9: 0x01 (version of our RLE specification)
* byte 10: 1 byte for stride value (will be a value of 1 for the time being)
* byte 11: 1 byte for source file name's length (doesn't include NULL terminator)
* bytes 12 through 12+length-1 (where length is the value in byte 11): ASCII string of original filename to write. This will also be the name of the file **decode** creates, leaving the RL-encoded file intact.
Following this we will have a repeating sequence of **count** and **value** fields, continuing until the end of the file.
For example, encoding our 28-byte example from above we'd get:
04 61 01 62 03 63 05 64 01 65 01 66 01 67 0c 68
Or, if looking at the ENTIRE encoded file, with header, where the source file was named '**sample.txt**' and has a newline character at the end (I've wrapped it at 16 bytes per line so it better fits on the page without a horizontal scroll):
64 63 66 58 20 52 4c 45 00 01 01 0a 73 61 6d 70
6c 65 2e 74 78 74 04 61 01 62 03 63 05 64 01 65
01 66 01 67 0c 68 01 0a
Now, in this case, with full headers, our original 28-byte (29, counting the newline) example would actually end up at 40 bytes, but that sample is rather small. A repetitive file originally 40-ish bytes, and of course larger, would start to yield space savings.
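To make the layout concrete, here is a minimal sketch that builds the header and the count/value pairs in a memory buffer. It is not the required program structure: **dcfx_encode()** is a hypothetical name, working on buffers instead of files is a simplification, and splitting runs longer than 255 into several pairs is an assumption (the specification above only says the count is one byte):

```c
#include <stddef.h>
#include <string.h>

/*
 * Sketch only: build a dcfX RLE v1 image of src (stride 1) in out.
 * dcfx_encode() is a hypothetical helper, not part of the skeleton files.
 * Assumption: a run longer than 255 is split into multiple pairs,
 * since the count field is a single byte.
 * Returns the number of bytes written, or 0 if the name is longer
 * than 255 bytes or out is too small.
 */
size_t dcfx_encode(const unsigned char *src, size_t srclen,
                   const char *name,
                   unsigned char *out, size_t outsize)
{
    static const unsigned char magic[8] =
        { 0x64, 0x63, 0x66, 0x58, 0x20, 0x52, 0x4c, 0x45 }; /* "dcfX RLE" */
    size_t namelen = strlen(name);
    size_t pos;
    size_t i = 0;

    if (namelen > 255 || outsize < 12 + namelen)
        return 0;

    memcpy(out, magic, 8);
    out[8]  = 0x00;                     /* reserved                       */
    out[9]  = 0x01;                     /* specification version          */
    out[10] = 0x01;                     /* stride                         */
    out[11] = (unsigned char)namelen;   /* name length, no terminator     */
    memcpy(out + 12, name, namelen);
    pos = 12 + namelen;

    while (i < srclen) {
        size_t run = 1;

        /* extend the run while the byte repeats and the count fits */
        while (i + run < srclen && run < 255 && src[i + run] == src[i])
            run++;

        if (pos + 2 > outsize)
            return 0;
        out[pos++] = (unsigned char)run;    /* count field */
        out[pos++] = src[i];                /* value field */
        i += run;
    }
    return pos;
}
```

Calling it on the 29 bytes of the example (28 letters plus the newline) with the name "sample.txt" yields the same 40 bytes shown in the dump above.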
=====Program=====
It is your task to write an encoder and decoder for this specification of the dcfX RLE format:
- **encode.c**: read in source data, encode according to specifications
- **decode.c**: read in RL-encoded data, decode to produce original data
Your program should:
* obtain 2 parameters from the command-line (see **command-line arguments** section below):
* argv[1]: name of the input file
* this should be a file that exists, but you should do appropriate error checking and bail out if the file cannot be accessed
* argv[2]: name of the output file
* for **encode**, the output file will be in RLE format (ideally with an "**.rle**" suffix appended to the name). This adds a certain universal aspect to how we'll go about naming things (**tar** and **gzip** do this too).
* for **decode**, the input file should be an RL-encoded file.
* be sure to perform appropriate error checking and bail out as needed.
* implement the specified algorithm in both encoding and decoding forms.
* please be sure to test it against varying types of data, to make sure it works no matter what you throw at it.
* I mean it: don't just test it on some small ASCII example, be sure to test it against the full set of sample files in the **in/** directory.
* calculate and display some statistics gleaned during the performance of the process.
* for example, **encode** should display information on:
* how many bytes read in
* how many bytes written out
* compression rate
* **decode** should also display:
* RLE header information
* filename information
* how many bytes read in
* how many bytes written out
* decompression rate
* see the sample program outputs below
* display errors to STDERR
* display run-time information to STDOUT
* your RL-encoded data **MUST** be conformant to the project specifications described above.
* you should be able to encode/decode a set of data with 100% retrieval rate. No data should be lost.
* remember, you are encoding/decoding **binary** data, NOT ASCII. Don't fall prey to your misconceptions.
* **decode** should validate the header information (is it encoded in version 1? if not, complain to STDERR of "version mismatch!" and exit).
* if the first 8 bytes of the header do not check out, error out with an "invalid data format detected! aborting process..." message to STDERR.
* if the file only contains a header (and no encoded data), report to STDERR "empty data segment" and exit.
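The header checks listed above, plus the expansion of count/value pairs, might be sketched as follows. All names here (**dcfx_check_header()**, **dcfx_message()**, **dcfx_expand()**, the enum) are hypothetical. Checking the first 8 bytes before the version, and treating a file too short to hold its own header as an invalid format, are assumptions rather than requirements from the specification:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for this sketch */
enum dcfx_status { DCFX_OK, DCFX_BAD_FORMAT, DCFX_BAD_VERSION, DCFX_EMPTY };

/*
 * Validate a dcfX RLE header held in buf. On DCFX_OK, *data_offset is
 * set to the index of the first count/value pair.
 * Assumptions: the 8 identifying bytes are checked before the version,
 * and a truncated header is reported as an invalid format.
 */
enum dcfx_status dcfx_check_header(const unsigned char *buf, size_t len,
                                   size_t *data_offset)
{
    static const unsigned char magic[8] =
        { 0x64, 0x63, 0x66, 0x58, 0x20, 0x52, 0x4c, 0x45 }; /* "dcfX RLE" */
    size_t hdrlen;

    if (len < 12 || memcmp(buf, magic, 8) != 0)
        return DCFX_BAD_FORMAT;
    if (buf[9] != 0x01)
        return DCFX_BAD_VERSION;

    hdrlen = 12 + (size_t)buf[11];      /* fixed part + filename */
    if (len < hdrlen)
        return DCFX_BAD_FORMAT;
    if (len == hdrlen)
        return DCFX_EMPTY;              /* header only, no data  */

    *data_offset = hdrlen;
    return DCFX_OK;
}

/* Map a status to the STDERR message text given in the list above */
const char *dcfx_message(enum dcfx_status s)
{
    switch (s) {
    case DCFX_BAD_FORMAT:  return "invalid data format detected! aborting process...";
    case DCFX_BAD_VERSION: return "version mismatch!";
    case DCFX_EMPTY:       return "empty data segment";
    default:               return "";
    }
}

/*
 * Expand count/value pairs (stride 1) into out.
 * Returns bytes written, or (size_t)-1 if the data has a dangling
 * byte or out is too small.
 */
size_t dcfx_expand(const unsigned char *data, size_t len,
                   unsigned char *out, size_t outsize)
{
    size_t pos = 0;
    size_t i;

    if (len % 2 != 0)
        return (size_t)-1;
    for (i = 0; i < len; i += 2) {
        size_t count = data[i];

        if (count > outsize - pos)
            return (size_t)-1;
        memset(out + pos, data[i + 1], count);  /* repeat value count times */
        pos += count;
    }
    return pos;
}
```

A decode program could call dcfx_check_header() first, print dcfx_message() to STDERR and exit on anything other than DCFX_OK, and only then expand the pairs that start at the returned offset.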
=====Grabit Integration=====
For those familiar with the **grabit** tool on lab46, I have made some skeleton files and a custom **Makefile** available for this project.
To "grab" it:
lab46:~/src/discrete$ grabit discrete dcf0
make: Entering directory '/var/public/fall2017/discrete/dcf0'
‘/var/public/fall2017/discrete/dcf0/Makefile’ -> ‘/home/USERNAME/src/discrete/dcf0/Makefile’
‘/var/public/fall2017/discrete/dcf0/encode.c’ -> ‘/home/USERNAME/src/discrete/dcf0/encode.c’
‘/var/public/fall2017/discrete/dcf0/decode.c’ -> ‘/home/USERNAME/src/discrete/dcf0/decode.c’
‘/var/public/fall2017/discrete/dcf0/in/sample0.txt’ -> ‘/home/USERNAME/src/discrete/dcf0/in/sample0.txt’
‘/var/public/fall2017/discrete/dcf0/in/sample1.txt’ -> ‘/home/USERNAME/src/discrete/dcf0/in/sample1.txt’
‘/var/public/fall2017/discrete/dcf0/in/sample2.bmp’ -> ‘/home/USERNAME/src/discrete/dcf0/in/sample2.bmp’
‘/var/public/fall2017/discrete/dcf0/in/sample3.wav’ -> ‘/home/USERNAME/src/discrete/dcf0/in/sample3.wav’
‘/var/public/fall2017/discrete/dcf0/in/sample4.bmp.rle’ -> ‘/home/USERNAME/src/discrete/dcf0/in/sample4.bmp.rle’
‘/var/public/fall2017/discrete/dcf0/in/sample5.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf0/in/sample5.txt.rle’
‘/var/public/fall2017/discrete/dcf0/in/sample6.mp3.rle’ -> ‘/home/USERNAME/src/discrete/dcf0/in/sample6.mp3.rle’
‘/var/public/fall2017/discrete/dcf0/in/sample7.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf0/in/sample7.txt.rle’
make: Leaving directory '/var/public/fall2017/discrete/dcf0'
lab46:~/src/discrete$ cd dcf0
lab46:~/src/discrete/dcf0 $ ls
Makefile in out decode.c encode.c
lab46:~/src/discrete/dcf0$
Just another "nice thing" we deserve.
NOTE: You do NOT want to do this on a populated dcf0 project directory-- it will overwrite files. Only do this on an empty directory.
=====Makefile fun=====
With the Makefile, we have your basic compile and clean-up operations:
* **make**: compile everything
* **make debug**: compile everything with debug support
* **make clean**: remove all binaries
* **make getdata**: re-obtain a fresh copy of project data files
* **make save**: make a backup of your project
* **make submit**: submit project (uses submit tool)
=====Command-Line Arguments=====
====setting up main()====
To accept (or rather, to gain access to) arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do. I like the traditional **argc** and **argv** names, although it is also common to see them abbreviated as **ac** and **av**.
Please declare your main() function as follows:
int main(int argc, char **argv)
The arguments are accessible via argv, in the order they were specified:
* argv[0]: program invocation (path + program name)
* argv[1]: input file
* argv[2]: output file
====Simple argument checks====
Although I'm not going to require extensive argument parsing or checking for this project, we should check to see if the minimal number of arguments has been provided:
if (argc < 3) // fewer than 3 arguments (program name + input file + output file)
{
    fprintf(stderr, "Not enough arguments!\n");
    exit(1);
}
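Once the argument count checks out, the files themselves still need to be opened with error checking. One possible sketch follows; **open_or_null()** and **count_bytes()** are hypothetical helper names, and the wording of the error message is an assumption. Opening with the "rb" and "wb" modes and reading into an int are there because the data is binary:

```c
#include <stdio.h>
#include <string.h>
#include <errno.h>

/*
 * Hypothetical helper: open path with the given mode, reporting any
 * failure to STDERR. Returns NULL on failure so the caller can bail out.
 */
FILE *open_or_null(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);

    if (fp == NULL)
        fprintf(stderr, "Error: cannot open '%s': %s\n",
                path, strerror(errno));
    return fp;
}

/*
 * Hypothetical helper: count the bytes remaining in an open stream.
 * fgetc() returns an int so that EOF can be told apart from a
 * legitimate 0xFF data byte; storing the result in a char would
 * break on binary data.
 */
long count_bytes(FILE *fp)
{
    long total = 0;
    int  c;

    while ((c = fgetc(fp)) != EOF)
        total++;
    return total;
}
```

In main(), this could be used as open_or_null(argv[1], "rb") for the input and open_or_null(argv[2], "wb") for the output, calling exit(1) if either returns NULL.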
=====Execution=====
Your program output should be as follows (given the specific input):
====Encode====
lab46:~/src/discrete/dcf0$ ./encode in/sample0.txt out/sample0.txt.rle
input name length: 14 bytes
input filename: in/sample0.txt
output name length: 19 bytes
output filename: out/sample0.txt.rle
stride value: 1 byte
read in: 82 bytes
wrote out: 62 bytes
compression rate: 24.39%
lab46:~/src/discrete/dcf0$
====Decode====
lab46:~/src/discrete/dcf0$ mkdir tmp
lab46:~/src/discrete/dcf0$ ./decode out/sample0.txt.rle tmp/sample0.txt
input name length: 19 bytes
input filename: out/sample0.txt.rle
output name length: 15 bytes
output filename: tmp/sample0.txt
header text: dcfX RLE v1
stride value: 1 byte
read in: 62 bytes
wrote out: 82 bytes
inflation rate: 32.26%
lab46:~/src/discrete/dcf0$
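The rate formulas are not spelled out on this page, but the sample numbers above are consistent with measuring the change in size relative to the bytes read in. A sketch under that assumption (the function names are hypothetical, as is returning 0 for an empty input):

```c
#include <stdio.h>

/*
 * Assumed formula, inferred from the sample output:
 * percentage of the input size that was saved by encoding.
 */
double compression_rate(unsigned long in_bytes, unsigned long out_bytes)
{
    if (in_bytes == 0)
        return 0.0;     /* assumption: avoid dividing by zero */
    return 100.0 * ((double)in_bytes - (double)out_bytes) / (double)in_bytes;
}

/*
 * Assumed formula, inferred from the sample output:
 * percentage by which decoding grew the data, relative to the input.
 */
double inflation_rate(unsigned long in_bytes, unsigned long out_bytes)
{
    if (in_bytes == 0)
        return 0.0;     /* assumption: avoid dividing by zero */
    return 100.0 * ((double)out_bytes - (double)in_bytes) / (double)in_bytes;
}
```

Printed with a "%.2f%%" format, compression_rate(82, 62) gives 24.39% and inflation_rate(62, 82) gives 32.26%, matching the sample runs above.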
=====Check Results=====
A good way to test that both encode and decode are working is to encode data then immediately turn around and decode that same data. If the decoded file is in the same state as the original, pre-encoded file, you know things are working.
====diff compare====
A quick way to check if two files are identical is to run the **diff(1)** command on them, so assuming the original file is in **in/sample0.txt**, and the decoded version (which should be the same thing) is in **tmp/sample0.txt**:
lab46:~/src/discrete/dcf0$ diff in/sample0.txt tmp/sample0.txt
lab46:~/src/discrete/dcf0$
Just getting your prompt back indicates no differences were found.
====MD5sum compare====
If you'd like to be REALLY sure, generate MD5sum hashes and compare:
lab46:~/src/discrete/dcf0$ md5sum in/sample0.txt tmp/sample0.txt
10f9bc85023dcf37be2b04638cb45ee2 in/sample0.txt
10f9bc85023dcf37be2b04638cb45ee2 tmp/sample0.txt
lab46:~/src/discrete/dcf0$
As you can see, both hashes match (the MD5sum hashes are analyzing the file contents, NOT the name/location).
====Hex Dump/Visualization====
You may want to check and see what exactly your program is generating.
This can be done by performing a hex data dump (or visualization) of the raw data in the output file.
The tool I'd recommend for quick viewing is **xxd(1)**; please see the following example:
lab46:~/src/discrete/dcf0$ xxd out/sample0.txt.rle
0000000: 6463 6658 2052 4c45 0001 010e 696e 2f73 dcfX RLE....in/s
0000010: 616d 706c 6530 2e74 7874 0161 0262 0363 ample0.txt.a.b.c
0000020: 0464 0565 0666 0767 0868 0969 086a 076b .d.e.f.g.h.i.j.k
0000030: 066c 056d 046e 036f 0270 0171 010a .l.m.n.o.p.q..
lab46:~/src/discrete/dcf0$
With this output, we can confirm, byte-by-byte, what has been placed in our encoded file. What you'll see are three fields:
* leftmost: byte offset (from start of file)
* middle: hex data (grouped two bytes at a time by default, shown in file order, so it reads as you'd expect)
* rightmost: the ASCII-ized representation of the middle data
=====Verify Results=====
If you'd like to verify your implementations, there is a **check** script included when you use the **grabit** tool to obtain the skeleton files and data.
**NOTE:** As there have been updates to this script since the project was first released, you may want to manually obtain a copy, to ensure you have the latest and greatest:
lab46:~/src/discrete/dcf0$ cp /var/public/fall2017/discrete/dcf0/check .
To run it, you need a functioning **encode** and **decode** program (although it does its best otherwise).
It runs through four separate tests, storing the results in a corresponding **o#/** directory (sometimes, if applicable, intermediate results in a corresponding **m#/** directory):
* test 0: takes the raw data files in **in/** and encodes them (**o0/**)
* test 1: takes pre-encoded data files in **in/** and decodes them (**o1/**)
* test 2: takes the raw data files in **in/**, encodes them (**m2/**), then decodes them (**o2/**)
* test 3: takes pre-encoded data files in **in/**, decodes them (**m3/**), then encodes them (**o3/**)
How it works:
- depending on the test, encodes or decodes a file in the **in/** directory.
* if single step, result is in **o#/** directory
* if multi-step, result is in **m#/** directory, then second operation puts its result into **o#/**
- A checksum is taken of the original file in **in/**
- Another checksum is taken of the new file in **o#/**
- The checksums are compared. If they match, "OK" is displayed; if they do not match, a corresponding "FAIL" message appears.
====Successful operation====
If all goes according to plan, you'll see "OK" status messages displayed.
lab46:~/src/discrete/dcf0$ ./check
=================================================
= PHASE 0: Raw -> Encode data verification test =
=================================================
in/sample0.txt -> o0/sample0.txt.rle: OK
in/sample1.txt -> o0/sample1.txt.rle: OK
in/sample2.bmp -> o0/sample2.bmp.rle: OK
in/sample3.wav -> o0/sample3.wav.rle: OK
=================================================
= PHASE 1: Decode -> Raw data verification test =
=================================================
in/sample0.txt.rle -> o1/sample0.txt: OK
in/sample1.txt.rle -> o1/sample1.txt: OK
in/sample2.bmp.rle -> o1/sample2.bmp: OK
in/sample3.wav.rle -> o1/sample3.wav: OK
================================================
= PHASE 2: Raw -> Encode -> Decode -> Raw test =
================================================
in/sample0.txt -> m2/sample0.txt.rle -> o2/sample0.txt: OK
in/sample1.txt -> m2/sample1.txt.rle -> o2/sample1.txt: OK
in/sample2.bmp -> m2/sample2.bmp.rle -> o2/sample2.bmp: OK
in/sample3.wav -> m2/sample3.wav.rle -> o2/sample3.wav: OK
=============================================
= PHASE 3: Decode -> Raw -> Encode Raw test =
=============================================
in/sample0.txt.rle -> m3/sample0.txt -> o3/sample0.txt.rle: OK
in/sample1.txt.rle -> m3/sample1.txt -> o3/sample1.txt.rle: OK
in/sample2.bmp.rle -> m3/sample2.bmp -> o3/sample2.bmp.rle: OK
in/sample3.wav.rle -> m3/sample3.wav -> o3/sample3.wav.rle: OK
====Unsuccessful operation====
Should something not work correctly, you'll see a "FAIL" message:
lab46:~/src/discrete/dcf0$ ./check
=================================================
= PHASE 0: Raw -> Encode data verification test =
=================================================
in/sample0.txt -> o0/sample0.txt.rle: OK
in/sample1.txt -> o0/sample1.txt.rle: OK
in/sample2.bmp -> o0/sample2.bmp.rle: FAIL: checksums do not match
in/sample3.wav -> o0/sample3.wav.rle: OK
=================================================
= PHASE 1: Decode -> Raw data verification test =
=================================================
in/sample0.txt.rle -> o1/sample0.txt: OK
in/sample1.txt.rle -> o1/sample1.txt: OK
in/sample2.bmp.rle -> o1/sample2.bmp: FAIL: checksums do not match
in/sample3.wav.rle -> o1/sample3.wav: OK
================================================
= PHASE 2: Raw -> Encode -> Decode -> Raw test =
================================================
in/sample0.txt -> m2/sample0.txt.rle -> o2/sample0.txt: OK
in/sample1.txt -> m2/sample1.txt.rle -> o2/sample1.txt: OK
in/sample2.bmp -> m2/sample2.bmp.rle -> o2/sample2.bmp: FAIL: checksums do not match
in/sample3.wav -> m2/sample3.wav.rle -> o2/sample3.wav: OK
=============================================
= PHASE 3: Decode -> Raw -> Encode Raw test =
=============================================
in/sample0.txt.rle -> m3/sample0.txt -> o3/sample0.txt.rle: OK
in/sample1.txt.rle -> m3/sample1.txt -> o3/sample1.txt.rle: OK
in/sample2.bmp.rle -> m3/sample2.bmp -> o3/sample2.bmp.rle: FAIL: checksums do not match
in/sample3.wav.rle -> m3/sample3.wav -> o3/sample3.wav.rle: OK
====Incomplete operation====
Should something not work at all (like a decode binary that is missing or failed to compile), you'll see a "MISSING" message:
lab46:~/src/discrete/dcf0$ ./check
...
=================================================
= PHASE 1: Decode -> Raw data verification test =
=================================================
in/sample0.txt.rle -> o1/sample0.txt: MISSING: decode
in/sample1.txt.rle -> o1/sample1.txt: MISSING: decode
in/sample2.bmp.rle -> o1/sample2.bmp: MISSING: decode
in/sample3.wav.rle -> o1/sample3.wav: MISSING: decode
...
=====Submission=====
To successfully complete this project, the following criteria must be met:
* Code must compile cleanly (no warnings or errors)
* Output must be correct, and match the form given in the sample output above.
* Code must be nicely and consistently indented (you may use the **indent** tool)
* Code must implement the algorithm(s) presented above.
* **encode.c**
* **decode.c**
* indicated error conditions are identified and reported, along with expected program behavior
* Code must be commented
* comments must be meaningful and descriptive of the process (tell me why you're doing what you're doing)
* have a properly filled-out comment banner at the top
* be sure to include any compiling instructions
* Track/version the source code in a repository
* Submit a copy of your source code to me using the **submit** tool.
To submit this program to me using the **submit** tool, run the following command at your lab46 prompt:
lab46:~/src/discrete/dcf0$ make submit
removed 'decode'
removed 'encode'
removed 'errors'
Project backup process commencing
Taking snapshot of current project (dcf0) ... OK
Compressing snapshot of dcf0 project archive ... OK
Setting secure permissions on dcf0 archive ... OK
Project backup process complete
Submitting discrete project "dcf0":
-> ../dcf0-DATESTRING-HOUR.tar.gz(OK)
SUCCESSFULLY SUBMITTED
You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or location mismatches.