Corning Community College

CSCS2330 Discrete Structures

======Project: RUN-LENGTH ENCODING - DATA COMPRESSION FUN (dcf1)======

=====Errata=====
Any changes that have been made.

  * Revision 0.1: Updated dcfX v2 spec and added some additional implementation constraints (20170907)
  * Revision 0.2: Finalized project data files, adapted included 'check' script for dcf1 (20170909)
  * Revision 0.3: Updated check script so it no longer gives out false negatives. **make getdata** to grab the updated copy (20170921)

=====Objective=====
To apply your skills in implementing an encoding scheme that, in ideal circumstances, will lead to a smaller storage footprint.

====This week's algorithm: RLE+stride====
Last week's project dealt with the first version of this algorithm; this week we add another variable into the mix which can fundamentally change the effectiveness of compressing data.

This week's algorithm is a minor tweak of our RLE algorithm from last week. The change? A configurable **stride** value.

Google defines **stride** as: a long, decisive step.

In dcf0, our stride value was fixed to 1 byte. We could only count up sequences of single byte runs, which in some cases yielded compression; in others, not so much.

With a configurable stride, we can then start counting up new sorts of data runs (such as when groups of two bytes may see strings of repetition, or 5 bytes, or 11 bytes).
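To make the stride idea concrete, here is a minimal sketch that works on an in-memory buffer rather than on files. The function name **rle_stride_encode** and its parameters are hypothetical (they are not part of the project skeleton), and the handling of leftover bytes follows the "put down 01 and those remaining bytes" rule from this specification. Treat it as an illustration of the counting logic, not as the required file-based implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not project code):
 * RLE+stride over an in-memory buffer. Writes count/value groups to out and
 * returns the number of bytes written. The caller must make sure out has
 * room for at least n + n / stride + 1 bytes. Returns 0 for a zero stride. */
size_t rle_stride_encode(const unsigned char *in, size_t n,
                         size_t stride, unsigned char *out)
{
    size_t i = 0;   /* read position in the input   */
    size_t o = 0;   /* write position in the output */

    if (stride == 0)
        return 0;

    /* governing condition: a full stride-sized value is still available */
    while (n - i >= stride)
    {
        const unsigned char *value = in + i;
        unsigned char count = 1;

        i += stride;

        /* extend the run while the next full chunk matches; cap at 255 */
        while (count < 255 && n - i >= stride &&
               memcmp(value, in + i, stride) == 0)
        {
            count++;
            i += stride;
        }

        out[o++] = count;
        memcpy(out + o, value, stride);
        o += stride;
    }

    /* fewer than stride bytes remain: count of 01 plus the leftovers */
    if (i < n)
    {
        out[o++] = 1;
        memcpy(out + o, in + i, n - i);
        o += n - i;
    }

    return o;
}
```

For instance, feeding it the 12 bytes of "xyzxyzxyzxyz" with a stride of 3 yields the 4 bytes 04 78 79 7a, while a stride of 1 on the same data finds no runs at all and doubles the size to 24 bytes.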
To demonstrate what RLE+stride does, let us look at the following data:

aaaaaabcdbcdbcdddddddefghhhhhhhhhhhh (36 bytes total)

Encoding with RLE+1 (1 byte stride), we would get the following (this is what your dcf0 programs should be producing if fed this data):

  * 06 61 01 62 01 63 01 64 01 62 01 63 01 64 01 62 01 63 07 64 01 65 01 66 01 67 0C 68 (28 bytes)

Encoding with RLE+2 (2 byte stride), we would get the following:

  * 03 61 61 01 62 63 01 64 62 01 63 64 01 62 63 03 64 64 01 64 65 01 66 67 06 68 68 (27 bytes)

Notice here with a stride of 2 bytes, we still have the singular count byte, but that is then followed by **TWO BYTES** of what the count is keeping track of.

In this example we actually gained some ground over the 1 byte stride. Not much, but let's see if we can improve it somewhat.

Encoding with RLE+3 (3 byte stride):

  * 02 61 61 61 03 62 63 64 02 64 64 64 01 65 66 67 04 68 68 68 (20 bytes)

We see here a 1 byte count byte, followed by 3 bytes of value. ... a bigger savings.

This was possible because the source data had a lot of 3 byte sequences that allowed a 3 byte stride to work particularly well. And what is neat is various types and formats of data will have patterns that better fit various strides.

=====dcfX RLE v2 specification=====
You'll be writing an **encode** and a **decode** program implementing RLE+stride, in accordance with these published specifications (this way, anyone can take an RL-encoded file from someone else and decode it with theirs, and vice versa).

====Header====
It is actually **identical** to the specifications of last week, save for four changes:

  - we're no longer hard-coding the **stride** value to 1 (byte 10), but instead obtaining it from the command-line (argv[3]), any valid value between 1 and 255 (inclusive).
  - we're placing a 2 in the version byte (byte 9)
  - the embedded source file name will now be stripped of any path (i.e. "in/sample0.txt" should now just be stored as "sample0.txt")
  - the destination argument (argv[2]) is now merely a path, NOT a path+filename (i.e. "out/sample0.txt.rle" should now just be "out")
    * the destination file is a combination of the destination path + source filename + ".rle" extension (for encode). And specifically for **decode**, the source filename will be retrieved out of the post-header information at the start of the encoded file.

Every RL-encoded file will start with the following 12-byte header, immediately followed by the embedded file name:

  * byte 0: 0x64
  * byte 1: 0x63
  * byte 2: 0x66
  * byte 3: 0x58
  * byte 4: 0x20
  * byte 5: 0x52
  * byte 6: 0x4c
  * byte 7: 0x45
  * byte 8: 0x00 (reserved)
  * byte 9: version of our RLE specification
  * byte 10: 1 byte for stride value
  * byte 11: 1 byte for source file name's length (doesn't include NULL terminator)
  * bytes 12 through 12+length-1 (where length is the value stored in byte 11): ASCII string of original filename to write. This will also be the name of the file **decode** creates, leaving the RL-encoded file intact.

Following this we will have a repeating sequence of **count** and **value** fields (where **count** is still of length 1, but **value** is of length **stride**), continuing until the end of the file.
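The header layout above can be sketched in a few lines of C. This is only an illustration under stated assumptions: **dcfx_build_header** is a hypothetical name, it fills a caller-supplied buffer instead of writing to a file, and it strips the path with **strrchr(3)** (the listed **basename(3)** would serve the same purpose).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption, not part of the project skeleton):
 * fills out with a dcfX RLE v2 header plus the path-stripped file name.
 * out needs room for 12 + 255 bytes. Returns the number of bytes stored,
 * or 0 if the name is empty, longer than 255 bytes, or the stride is 0. */
size_t dcfx_build_header(const char *source_path, unsigned char stride,
                         unsigned char *out)
{
    /* bytes 0-7 spell "dcfX RLE" in ASCII */
    static const unsigned char magic[8] =
        { 0x64, 0x63, 0x66, 0x58, 0x20, 0x52, 0x4c, 0x45 };
    const char *name = strrchr(source_path, '/');
    size_t len;

    /* keep only what follows the last '/', if there is one */
    name = (name != NULL) ? name + 1 : source_path;
    len = strlen(name);

    if (len == 0 || len > 255 || stride == 0)
        return 0;

    memcpy(out, magic, 8);
    out[8]  = 0x00;                  /* reserved                     */
    out[9]  = 0x02;                  /* version of the specification */
    out[10] = stride;                /* stride value                 */
    out[11] = (unsigned char) len;   /* name length, no terminator   */
    memcpy(out + 12, name, len);     /* the file name itself         */

    return 12 + len;
}
```

Called with "in/sample3.txt" and a stride of 3, this stores 23 bytes: the 12 header bytes ending in 02 03 0b, then the 11 characters of "sample3.txt".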
=====Useful functions=====
The following standard library functions may prove helpful:

  * **strlen(3)**
  * **malloc(3)**
  * **calloc(3)**
  * **fopen(3)**
  * **fclose(3)**
  * **fgetc(3)**
  * **basename(3)**
  * **feof(3)**
  * **sprintf(3)**
  * **atoi(3)**
  * **fprintf(3)**

=====Program=====
It is your task to write an encoder and decoder for this specification of the dcfX RLE v2 format:

  - **encode.c**: read in source data, encode according to specifications
  - **decode.c**: read in RL-encoded data, decode to produce original data

Your program should:

  * for encode, obtain 3 parameters from the command-line (see **command-line arguments** section below):
    * argv[1]: name of path + source file
      * this should be a file that exists, but you should do appropriate error checking and bail out if the file cannot be accessed
    * argv[2]: path of destination location (NOT file NAME, just PATH). The destination file name will be generated, based on the source file and whether we are encoding or decoding:
      * if encoding, destination is argv[2] + argv[1] (minus source path) + ".rle"
      * if decoding, destination is argv[2] + embedded file name (which shouldn't have an ".rle" per project specifications)
    * argv[3]: stride value
      * this is a value between 1 and 255 (inclusive).
      * be sure to do error checking to make sure it is in this range.
  * for **decode**, only accept 2 arguments; this should be an RL-encoded file and the destination location.
    * the output file name will be obtained from parsing the file's header data and combining with argv[2].
    * be sure to perform appropriate error checking and bail out as needed.
    * if the version byte in the header is 1, assume a 1 byte stride and decode; be sure to report RLE version in output
      * this way, v2 decode will be (somewhat) backwards compatible with v1 encode
    * if the version byte in the header is 2, read the stride value from the header stride byte, and decode in accordance with that
  * implement the specified algorithm in both encoding and decoding forms.
    * please be sure to test it against varying types of data, to make sure it works no matter what you throw at it.
  * calculate and display some statistics gleaned during the performance of the process.
    * for example, **encode** should display information on:
      * how many bytes read in
      * how many bytes written out
      * compression rate
    * **decode** should also display:
      * RLE header information
      * filename information
      * how many bytes read in
      * how many bytes written out
      * decompression rate
    * see the sample program outputs below
  * display errors to STDERR
  * display run-time information to STDOUT
  * your RL-encoded data **MUST** be conformant to the project specifications described above.
  * you should be able to encode/decode a set of data with 100% retrieval rate. No data should be lost.
  * **decode** should validate the header information (is it encoded in version 1 or 2? if not, complain to STDERR of "version mismatch!" and exit).
    * if the first 8 bytes of the header do not check out, error out with an "invalid data format detected! aborting process..." message to STDERR.
    * if the file only contains a header (and no encoded data), report to STDERR "empty data segment" and exit.
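The header checks that **decode** is asked to perform can be sketched as follows. This is a hedged illustration only: the names **dcfx_check_header** and **enum dcfx_status** are hypothetical, the bytes are assumed to already sit in a buffer (your decode will read them from the file as it goes), and the assumption that a v1 header stores the name length in byte 11 as well comes from the statement that the header is otherwise identical to last week's. Printing the STDERR messages listed above is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for illustration (not part of the project). */
enum dcfx_status { DCFX_OK, DCFX_BAD_FORMAT, DCFX_BAD_VERSION, DCFX_EMPTY };

/* Hypothetical helper: validates the n bytes in buf as a dcfX RLE file.
 * On DCFX_OK or DCFX_EMPTY, *stride holds the stride to decode with and
 * *data_start holds the offset of the first count byte. */
enum dcfx_status dcfx_check_header(const unsigned char *buf, size_t n,
                                   unsigned *stride, size_t *data_start)
{
    /* bytes 0-7 must spell "dcfX RLE" */
    static const unsigned char magic[8] =
        { 0x64, 0x63, 0x66, 0x58, 0x20, 0x52, 0x4c, 0x45 };
    size_t name_len;

    if (n < 12 || memcmp(buf, magic, 8) != 0)
        return DCFX_BAD_FORMAT;      /* "invalid data format detected!" */

    if (buf[9] == 1)
        *stride = 1;                 /* v1: stride is fixed at 1 byte   */
    else if (buf[9] == 2)
        *stride = buf[10];           /* v2: stride comes from byte 10   */
    else
        return DCFX_BAD_VERSION;     /* "version mismatch!"             */

    if (*stride == 0)
        return DCFX_BAD_FORMAT;      /* 0 is outside the 1-255 range    */

    name_len = buf[11];
    if (n < 12 + name_len)
        return DCFX_BAD_FORMAT;      /* file name is cut short          */

    *data_start = 12 + name_len;     /* no hard-coded offset past here  */
    if (n == *data_start)
        return DCFX_EMPTY;           /* "empty data segment"            */

    return DCFX_OK;
}
```

Note how the start of the encoded data is computed from byte 11 rather than assumed, which is the point of the "no hardcoding fixed offsets beyond the header" constraint.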
====Other specification details====

  * if you end up with a less-than-stride quantity of bytes remaining at the very end of the file, merely put down 01 and those remaining bytes
  * your program should be mindful of data reads/writes that may hit the end of file marker
  * your program also should avoid writing EOF (the end-of-file indicator returned by **fgetc(3)**) as a data byte (or reading it and processing it as anything other than a marker to stop reading).
  * Since we are using single bytes to store our counts, one needs to be mindful not to allow the byte value to "roll over"; limit your counts to 255.
  * the **feof(3)** function can be of great use.
  * NO infinite loops; a number of people in dcf0 made their processing loop intentionally infinite (think "while(1)"), and then broke out of it when various conditions were met. And while a functional solution can be obtained this way, it demonstrates an ignorance of structured programming: there should be at least one over-arching condition you are looking for, and the central loop should terminate based on that. I'm not saying you can't have other loop terminating conditions, but I do NOT want to see unglorified infinite loops put there "just to make it go". UNDERSTAND your conditions better and you won't have to do things like that.
  * NO hardcoding fixed offsets beyond the header. While we know how big the header is, we do not have the same universal guarantee about the filename length. As such, prior to reading the header information, we do not know precisely where the encoded section actually starts. Some people discovered **fseek(3)** and related file positioning functions, and became overly reliant on (or trigger happy with) them. Like the infinite loop issue noted above, this demonstrates an unfamiliarity with how the computer works, and an attempt to exert excessive control (and therefore complexity) onto the solution. There's a natural flow: recognize it and write your code to work with it.
You'll find your solutions are a lot more durable, not to mention less complex.
    * in fact, I'd recommend avoiding things like **fseek(3)** and **ftell(3)** altogether (or anything that artificially adjusts the current file position); your solutions will benefit from the resulting simplicity (but again, that simplicity will only come with an understanding of the process).
  * NO solutions centrally based on knowledge of the file's length. Do not compute file length and then encode/decode (with encode/decode process being somehow based on the length). You can (hint: and should) gradually accumulate the file lengths via the process, but the process itself should in no way be based on some fixed, known length. Again, this is further preparing us for letting the computer take control of what it is good at, and freeing us to focus on the conceptual crafting of solutions.
  * when computing compression/inflation rates, only calculate the DATA, omit the header and filename information. But you'll also want to keep track of total bytes read/written for purposes of displaying such statistics in the info table (see examples).
  * I will be looking for the output prompts, but not necessarily the precise values (for example, if your compression/inflation equation yields different results than mine, that won't count against you, provided you have at least tried to compute something reasonable).

=====Grabit Integration=====
For those familiar with the **grabit** tool on lab46, I have made some skeleton files and a custom **Makefile** available for this project.

To "grab" it:

  lab46:~/src/discrete$ grabit discrete dcf1
  make: Entering directory '/var/public/SEMESTER/discrete/dcf1'
  ‘/var/public/SEMESTER/discrete/dcf1/Makefile’ -> ‘/home/USERNAME/src/discrete/dcf1/Makefile’
  ‘/var/public/SEMESTER/discrete/dcf1/encode.c’ -> ‘/home/USERNAME/src/discrete/dcf1/encode.c’
  ‘/var/public/SEMESTER/discrete/dcf1/decode.c’ -> ‘/home/USERNAME/src/discrete/dcf1/decode.c’
  ...
  make: Leaving directory '/var/public/SEMESTER/discrete/dcf1'
  lab46:~/src/discrete$ cd dcf1
  lab46:~/src/discrete/dcf1$ ls
  Makefile in decode.c encode.c
  lab46:~/src/discrete/dcf1$

Just another "nice thing" we deserve.

NOTE: You do NOT want to do this on a populated dcf1 project directory-- it will overwrite files. Only do this on an empty directory.

=====Makefile fun=====
With the Makefile, we have your basic compile and clean-up operations:

  * **make**: compile everything
  * **make debug**: compile everything with debug support
  * **make clean**: remove all binaries
  * **make getdata**: re-obtain a fresh copy of project data files
  * **make save**: make a backup of your project
  * **make submit**: submit project (uses submit tool)

=====Command-Line Arguments=====

====setting up main()====
To accept (or rather, to gain access to) arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do.

I like the traditional **argc** and **argv** names, although it is also common to see them abbreviated as **ac** and **av**.
Please declare your main() function as follows:

  int main(int argc, char **argv)

The arguments are accessible via the argv array, in the order they were specified:

  * argv[0]: program invocation (path + program name)
  * argv[1]: our input file (path + file name)
  * argv[2]: our output destination (just path info, no file name)
  * argv[3]: our stride value (1-255)

====Simple argument checks====
Although I'm not going to require extensive argument parsing or checking for this project, we should check to see if the minimal number of arguments has been provided. Remember that argc also counts argv[0], so **encode** (which reads argv[3]) needs argc to be at least 4, while **decode** needs at least 3:

  if (argc < 4)  // encode: fewer than 3 arguments after the program name (for decode, test argc < 3)
  {
      fprintf(stderr, "Not enough arguments!\n");
      exit(1);
  }

=====Execution=====
Your program output should be as follows (given the specific input):

====Encode====

  lab46:~/src/discrete/dcf1$ ./encode in/sample3.txt out 3
  input name length: 14 bytes
  input filename: in/sample3.txt
  embedded name length: 11 bytes
  embedded file name: sample3.txt
  output name length: 19 bytes
  output filename: out/sample3.txt.rle
  stride value: 3 bytes
  read in: 82 bytes
  data written out: 78 bytes
  total written out: 101 bytes
  compression rate: 4.88%
  lab46:~/src/discrete/dcf1$

Similarly, if we were to encode the **sample2.bmp** data file from dcf0 with the right stride, we can actually achieve a notable amount of compression (unlike our results from dcf0 with a stride fixed at 1 byte):

  lab46:~/src/discrete/dcf1$ ./encode ../dcf0/in/sample2.bmp out 37
  input name length: 22 bytes
  input filename: ../dcf0/in/sample2.bmp
  embedded name length: 11 bytes
  embedded file name: sample2.bmp
  output name length: 19 bytes
  output filename: out/sample2.bmp.rle
  stride value: 37 bytes
  read in: 250934 bytes
  data written out: 183730 bytes
  total written out: 183753 bytes
  compression rate: 26.78%
  lab46:~/src/discrete/dcf1$

With various formats, you'll likely want to play with the stride in order to find better compression results.
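Two small pieces of the encode run above lend themselves to a sketch: validating the stride argument and computing the compression rate. Both function names below are hypothetical. The stride check uses **strtol(3)** instead of the listed **atoi(3)** so that trailing junk such as "3x" can be rejected, and the rate formula is an assumption that happens to reproduce the sample figures shown above (82 in, 78 out gives 4.88%); any reasonable formula is acceptable per the specification.

```c
#include <stdlib.h>

/* Hypothetical helper: converts a command-line stride argument.
 * Returns the stride (1-255), or 0 if the text is not a valid stride. */
int parse_stride(const char *text)
{
    char *end = NULL;
    long value = strtol(text, &end, 10);

    /* reject empty input, trailing characters, and out-of-range values */
    if (end == text || *end != '\0' || value < 1 || value > 255)
        return 0;

    return (int) value;
}

/* Hypothetical helper: percentage saved, counting DATA bytes only
 * (header and file name bytes are left out of data_out). */
double compression_rate(unsigned long bytes_in, unsigned long data_out)
{
    if (bytes_in == 0)
        return 0.0;

    return ((double) bytes_in - (double) data_out) * 100.0 / (double) bytes_in;
}
```

A negative result from **compression_rate** simply means the encoded data came out larger than the source, which is what a poorly chosen stride tends to produce.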
====Decode====

  lab46:~/src/discrete/dcf1$ ./decode in/sample0.txt.rle out
  input filename: in/sample0.txt.rle
  output name length: 11 bytes
  output filename: sample5.txt
  header text: dcfX RLE v2
  stride value: 4 bytes
  read in: 3093 bytes
  wrote out: 3600 bytes
  inflation rate: 14.08%
  lab46:~/src/discrete/dcf1$

=====Check Results=====
A good way to test that both encode and decode are working is to encode data then immediately turn around and decode that same data. If the decoded file is in the same state as the original, pre-encoded file, you know things are working.

====diff compare====
A quick way to check if two files are identical is to run the **diff(1)** command on them, so assuming the original file is in **in/sample1.txt**, and the decoded version (which should be the same thing) is in **tmp/sample1.txt**:

  lab46:~/src/discrete/dcf1$ diff in/sample1.txt tmp/sample1.txt
  lab46:~/src/discrete/dcf1$

Just getting your prompt back indicates no differences were found.

====MD5sum compare====
If you'd like to be REALLY sure, generate MD5sum hashes and compare:

  lab46:~/src/discrete/dcf1$ md5sum in/sample1.txt tmp/sample1.txt
  10f9bc85023dcf37be2b04638cb45ee2 in/sample1.txt
  10f9bc85023dcf37be2b04638cb45ee2 tmp/sample1.txt
  lab46:~/src/discrete/dcf1$

As you can see, both hashes match (the MD5sum hashes are analyzing the file contents, NOT the name/location).

====Hex Dump/Visualization====
You may want to check and see what exactly your program is generating. This can be done by performing a hex data dump (or visualization) of the raw data in the output file.

The tool I'd recommend for quick viewing is **xxd(1)**; please see the following example:

  lab46:~/src/discrete/dcf1$ xxd out/sample3.txt.rle
  0000000: 6463 6658 2052 4c45 0002 030b 7361 6d70  dcfX RLE....samp
  0000010: 6c65 332e 7478 7401 6162 6201 6363 6301  le3.txt.abb.ccc.
  0000020: 6464 6401 6465 6501 6565 6502 6666 6602  ddd.dee.eee.fff.
  0000030: 6767 6701 6768 6802 6868 6803 6969 6902  ggg.ghh.hhh.iii.
  0000040: 6a6a 6a01 6a6a 6b02 6b6b 6b02 6c6c 6c01  jjj.jjk.kkk.lll.
  0000050: 6d6d 6d01 6d6d 6e01 6e6e 6e01 6f6f 6f01  mmm.mmn.nnn.ooo.
  0000060: 7070 7101 0a                             ppq..
  lab46:~/src/discrete/dcf1$

With this output, we can confirm, byte-by-byte, what has been placed in our encoded file.

What you'll see are three fields:

  * leftmost: byte offset (from start of file)
  * middle: hex data (in pairs- big endian by default, so as you expect to read it)
  * rightmost: the ASCII-ized representation of the middle data

=====Verify Results=====
If you'd like to verify your implementations, there is a **check** script included when you use the **grabit** tool to obtain the skeleton files and data.

To run it, you need a functioning **encode** and **decode** program (although it does its best otherwise).

It runs through four separate tests, storing the results in a corresponding **o#/** directory (sometimes, if applicable, intermediate results in a corresponding **m#/** directory):

  * test 0: takes the raw data files in **in/** and encodes them (**o0/**)
  * test 1: takes pre-encoded data files in **in/** and decodes them (**o1/**)
  * test 2: takes the raw data files in **in/**, encodes them (**m2/**), then decodes them (**o2/**)
  * test 3: takes pre-encoded data files in **in/**, decodes them (**m3/**), then encodes them (**o3/**)

How it works:

  - depending on the test, encodes or decodes a file in the **in/** directory.
    * if single step, result is in **o#/** directory
    * if multi-step, result is in **m#/** directory, then second operation puts its result into **o#/**
  - A checksum is taken of the original file in **in/**
  - Another checksum is taken of the new file in **o#/**
  - The checksums are compared. If they match, "OK" is displayed; if they do not match, a corresponding "FAIL" message appears.

====Successful operation====
If all goes according to plan, you'll see "OK" status messages displayed.
  lab46:~/src/discrete/dcf1$ ./check
  =================================================
  = PHASE 0: Raw -> Encode data verification test =
  =================================================
  in/ascii1.art -> o0/ascii1.art.rle: OK
  in/ascii3.art -> o0/ascii3.art.rle: OK
  in/ascii7.art -> o0/ascii7.art.rle: OK
  in/ascii8.art -> o0/ascii8.art.rle: OK
  in/blunders2.mp3 -> o0/blunders2.mp3.rle: OK
  in/blunders4.mp3 -> o0/blunders4.mp3.rle: OK
  in/blunders7.mp3 -> o0/blunders7.mp3.rle: OK
  in/blunders93.mp3 -> o0/blunders93.mp3.rle: OK
  in/sample1.txt -> o0/sample1.txt.rle: OK
  in/sample2.txt -> o0/sample2.txt.rle: OK
  in/sample3.txt -> o0/sample3.txt.rle: OK
  in/sample4.txt -> o0/sample4.txt.rle: OK
  in/sprite13.png -> o0/sprite13.png.rle: OK
  in/sprite1.png -> o0/sprite1.png.rle: OK
  in/sprite2.png -> o0/sprite2.png.rle: OK
  in/sprite7.png -> o0/sprite7.png.rle: OK
  ...

====Unsuccessful operation====
Should something not work correctly, you'll see a "FAIL" message:

  lab46:~/src/discrete/dcf1$ ./check
  =================================================
  = PHASE 0: Raw -> Encode data verification test =
  =================================================
  in/ascii1.art -> o0/ascii1.art.rle: OK
  in/ascii3.art -> o0/ascii3.art.rle: OK
  in/ascii7.art -> o0/ascii7.art.rle: OK
  in/ascii8.art -> o0/ascii8.art.rle: OK
  in/blunders2.mp3 -> o0/blunders2.mp3.rle: OK
  in/blunders4.mp3 -> o0/blunders4.mp3.rle: OK
  in/blunders7.mp3 -> o0/blunders7.mp3.rle: OK
  in/blunders93.mp3 -> o0/blunders93.mp3.rle: FAIL: checksums do not match
  in/sample1.txt -> o0/sample1.txt.rle: OK
  in/sample2.txt -> o0/sample2.txt.rle: OK
  in/sample3.txt -> o0/sample3.txt.rle: OK
  in/sample4.txt -> o0/sample4.txt.rle: OK
  in/sprite13.png -> o0/sprite13.png.rle: OK
  in/sprite1.png -> o0/sprite1.png.rle: OK
  in/sprite2.png -> o0/sprite2.png.rle: OK
  in/sprite7.png -> o0/sprite7.png.rle: OK
  ...
====Incomplete operation====
Should something not work at all (like a missing or uncompiling decode binary), you'll see a "MISSING" message:

  lab46:~/src/discrete/dcf1$ ./check
  ...
  =================================================
  = PHASE 1: Decode -> Raw data verification test =
  =================================================
  Missing 'decode', skipping test.
  ...

=====Submission=====
To successfully complete this project, the following criteria must be met:

  * Code must compile cleanly (no warnings or errors)
  * Output must be correct, and match the form given in the sample output above.
  * Implementations must be compliant to dcfX v2 spec, and pass all tests in the check tool.
  * Code must be nicely and consistently indented (you may use the **indent** tool)
  * Code must implement the algorithm(s) presented above.
    * **encode.c**
    * **decode.c**
    * indicated error conditions are identified and reported, along with expected program behavior
  * Code must be commented
    * comments must be meaningful and descriptive of the process (tell me how/why you're doing what you're doing)
    * have a properly filled-out comment banner at the top
    * be sure to include any compiling instructions, if they differ from just typing 'make'
  * Track/version the source code in a repository
  * Submit a copy of your source code to me using the **submit** tool.

To submit this program to me using the **submit** tool, run the following command at your lab46 prompt:

  lab46:~/src/discrete/dcf1$ make submit
  removed 'decode'
  removed 'encode'
  removed 'errors'
  Project backup process commencing
  Taking snapshot of current project (dcf1) ... OK
  Compressing snapshot of dcf1 project archive ... OK
  Setting secure permissions on dcf1 archive ... OK
  Project backup process complete
  Submitting discrete project "dcf1":
  -> ../dcf1-DATESTRING-HOUR.tar.gz(OK)
  SUCCESSFULLY SUBMITTED

You should get some sort of confirmation indicating successful submission if all went according to plan.
If not, check for typos and/or locational mismatches.