Corning Community College
CSCS2330 Discrete Structures
Any changes that have been made.
To apply your skills in implementing an encoding scheme that, in ideal circumstances, will lead to a smaller storage footprint.
Last week's project dealt with the first version of this algorithm; this week we add another variable into the mix which can fundamentally change the effectiveness of compressing data.
The algorithm we implement this week is a minor tweak of last week's RLE algorithm.
The change? A configurable stride value.
Google defines stride as: a long, decisive step.
In dcf0, our stride value was fixed to 1 byte. We could only count up sequences of single byte runs, which in some cases yielded compression; in others, not so much.
With a configurable stride, we can then start counting up new sorts of data runs (such as when groups of two bytes may see strings of repetition, or 5 bytes, or 11 bytes).
To demonstrate what RLE+stride does, let us look at the following data:
aaaaaabcdbcdbcdddddddefghhhhhhhhhhhh (36 bytes total)
Encoding with RLE+1 (1 byte stride), we would get the following (this is what your dcf0 programs should be producing if fed this data):
Encoding with RLE+2 (2 byte stride), we would get the following:
Notice here with a stride of 2 bytes, we still have the singular count byte, but that is then followed by TWO BYTES of what the count is keeping track of.
In this example we actually gained some ground over the 1 byte stride. Not much, but let's see if we can improve it somewhat.
Encoding with RLE+3 (3 byte stride):
We see here a 1 byte count byte, followed by 3 bytes of value.
… a bigger savings. This was possible because the source data contained many 3 byte sequences, which let a 3 byte stride work particularly well. What is neat is that different types and formats of data will have patterns that better fit different strides.
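To make the comparison between strides concrete, here is a minimal sketch of an in-memory RLE+stride encoder. The function name `rle_encode` and its buffer-based interface are hypothetical (your programs work on files), and two behaviors are assumptions rather than quotes from the specification: a run is capped at 255 because the count is a single byte, and a trailing chunk shorter than the stride is written after a count of 1 (which agrees with the tail of the hex dump later on this page). Fed the 36 byte sample above, it should produce 28 bytes at stride 1, 27 bytes at stride 2, and 20 bytes at stride 3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Encode `len` bytes from `in` as (count, value) groups, where count is a
 * single byte and value is `stride` bytes wide. Returns the number of
 * bytes written to `out`, which must have room for 2 * len bytes.
 * Assumptions: runs longer than 255 are split; a final chunk shorter than
 * `stride` is emitted with a count of 1.
 */
size_t rle_encode(const unsigned char *in, size_t len, size_t stride,
                  unsigned char *out)
{
    size_t i = 0, o = 0;

    assert(stride > 0);

    while (i < len) {
        /* The last chunk may be narrower than the stride. */
        size_t width = (len - i < stride) ? len - i : stride;
        unsigned char count = 1;

        /* Extend the run while the next full chunk matches this one. */
        while (count < 255 &&
               i + ((size_t)count + 1) * width <= len &&
               memcmp(in + i, in + i + (size_t)count * width, width) == 0) {
            count++;
        }

        out[o++] = count;               /* count byte            */
        memcpy(out + o, in + i, width); /* value, `width` bytes  */
        o += width;
        i += (size_t)count * width;     /* skip the whole run    */
    }
    return o;
}
```

Calling `rle_encode(data, 36, 3, out)` on the sample leaves `out` starting with the bytes `02 'a' 'a' 'a'`: one count byte followed by a 3 byte value.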
You'll be writing an encode and a decode program implementing RLE+stride, in accordance with these published specifications (this way, anyone can take an RL-encoded file from someone else and decode it with theirs, and vice versa).
It is actually identical to the specifications of last week, save for four changes:
And specifically for decode, the source filename will be retrieved out of the post-header information at the start of the encoded file.
Every RL-encoded file will start with the following 12-byte header:
Following this we will have a repeating sequence of count and value fields (where count is still of length 1, but value is of length stride), continuing until the end of the file.
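As a rough illustration, the 12 header bytes can be built and checked as below. The field layout here is inferred from the xxd dump of out/sample3.txt.rle later on this page (`6463 6658 2052 4c45 0002 030b`), not copied from the specification, so treat every offset as an assumption: in particular, byte 8 is simply reproduced as 0x00, and byte 9 is assumed to be the version number. The names `DCF_HEADER_SIZE`, `dcf_make_header`, and `dcf_check_header` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DCF_HEADER_SIZE 12

/*
 * Fill `hdr` for a given stride and embedded file name length.
 * Assumed layout (inferred from a sample dump, not authoritative):
 *   bytes 0-7   the text "dcfX RLE"
 *   byte  8     0x00
 *   byte  9     format version (2)
 *   byte  10    stride
 *   byte  11    length of the embedded file name that follows
 */
void dcf_make_header(unsigned char hdr[DCF_HEADER_SIZE],
                     unsigned char stride, unsigned char name_len)
{
    assert(stride > 0);
    memcpy(hdr, "dcfX RLE", 8);
    hdr[8]  = 0x00;
    hdr[9]  = 2;
    hdr[10] = stride;
    hdr[11] = name_len;
}

/* Return 1 if `hdr` carries the expected text, version, and a usable stride. */
int dcf_check_header(const unsigned char hdr[DCF_HEADER_SIZE])
{
    return memcmp(hdr, "dcfX RLE", 8) == 0 && hdr[9] == 2 && hdr[10] > 0;
}
```

A decoder would read 12 bytes, run a check like this, then read `hdr[11]` more bytes to recover the file name before touching the count and value fields.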
The following standard library functions may prove helpful:
It is your task to write an encoder and decoder for this specification of the dcfX RLE v2 format:
Your program should:
For those familiar with the grabit tool on lab46, I have made some skeleton files and a custom Makefile available for this project.
To “grab” it:
lab46:~/src/discrete$ grabit discrete dcf1
make: Entering directory '/var/public/SEMESTER/discrete/dcf1'
‘/var/public/SEMESTER/discrete/dcf1/Makefile’ -> ‘/home/USERNAME/src/discrete/dcf1/Makefile’
‘/var/public/SEMESTER/discrete/dcf1/encode.c’ -> ‘/home/USERNAME/src/discrete/dcf1/encode.c’
‘/var/public/SEMESTER/discrete/dcf1/decode.c’ -> ‘/home/USERNAME/src/discrete/dcf1/decode.c’
...
make: Leaving directory '/var/public/SEMESTER/discrete/dcf1'
lab46:~/src/discrete$ cd dcf1
lab46:~/src/discrete/dcf1$ ls
Makefile  in  decode.c  encode.c
lab46:~/src/discrete/dcf1$
Just another “nice thing” we deserve.
NOTE: You do NOT want to do this on a populated dcf1 project directory; it will overwrite files. Only do this on an empty directory.
With the Makefile, we have your basic compile and clean-up operations:
To gain access to arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do. I like the traditional argc and argv names, although it is also common to see them abbreviated as ac and av.
Please declare your main() function as follows:
int main(int argc, char **argv)
The arguments are accessible via the argv array, in the order they were specified:
Although I'm not going to require extensive argument parsing or checking for this project, we should check to see if the minimal number of arguments has been provided:
if (argc < 3)  // argc counts the program name, so this requires at least 2 arguments
{
    fprintf(stderr, "Not enough arguments!\n");
    exit(1);
}
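Since encode also takes the stride on the command line (as in `./encode in/sample3.txt out 3`), it helps to convert that text to a number defensively. The helper below is a hypothetical sketch, not part of the skeleton files; the 1 to 255 range assumes the stride is stored in a single header byte.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Convert a command-line argument into a stride.
 * Returns the stride (1..255), or 0 if the text is not a usable stride.
 * Assumption: the stride must fit in one byte.
 */
int parse_stride(const char *text)
{
    char *end = NULL;
    long value;

    assert(text != NULL);

    errno = 0;
    value = strtol(text, &end, 10);

    /* Reject overflow, empty input, and trailing junk such as "4x". */
    if (errno != 0 || end == text || *end != '\0')
        return 0;

    if (value < 1 || value > 255)
        return 0;

    return (int)value;
}
```

In encode, something like `stride = parse_stride(argv[3]);` followed by an error message and `exit(1)` when the result is 0 would fit alongside the argument count check.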
Your program output should be as follows (given the specific input):
lab46:~/src/discrete/dcf1$ ./encode in/sample3.txt out 3
input name length: 14 bytes
input filename: in/sample3.txt
embedded name length: 11 bytes
embedded file name: sample3.txt
output name length: 19 bytes
output filename: out/sample3.txt.rle
stride value: 3 bytes
read in: 82 bytes
data written out: 78 bytes
total written out: 101 bytes
compression rate: 4.88%
lab46:~/src/discrete/dcf1$
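The file names in that output can be derived from the command-line arguments with a couple of string operations. This is a minimal sketch under the assumption that the embedded name is whatever follows the last '/' in the input path, and that the output path is the output directory, a '/', the embedded name, and ".rle"; the helper names `embedded_name` and `output_path` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the part of `path` after the last '/', or all of it if there is none. */
const char *embedded_name(const char *path)
{
    const char *slash;

    assert(path != NULL);
    slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/*
 * Build "<outdir>/<embedded name>.rle" into `buf`.
 * Returns 0 on success, -1 if the result would not fit in `size` bytes.
 */
int output_path(char *buf, size_t size, const char *outdir, const char *in)
{
    int n = snprintf(buf, size, "%s/%s.rle", outdir, embedded_name(in));

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With `in/sample3.txt` and `out` as inputs, this yields `sample3.txt` (11 bytes) and `out/sample3.txt.rle` (19 bytes), matching the lengths reported above.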
Similarly, if we were to encode the sample2.bmp data file from dcf0 with the right stride, we can actually achieve a notable amount of compression (unlike our results from dcf0 with a stride fixed at 1 byte):
lab46:~/src/discrete/dcf1$ ./encode ../dcf0/in/sample2.bmp out 37
input name length: 22 bytes
input filename: ../dcf0/in/sample2.bmp
embedded name length: 11 bytes
embedded file name: sample2.bmp
output name length: 19 bytes
output filename: out/sample2.bmp.rle
stride value: 37 bytes
read in: 250934 bytes
data written out: 183730 bytes
total written out: 183753 bytes
compression rate: 26.78%
lab46:~/src/discrete/dcf1$
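The compression rate in both sample runs is consistent with comparing the bytes read in against the data written out (excluding the header and embedded file name). A small sketch of that arithmetic, with a hypothetical function name, follows; it is inferred from the numbers shown rather than taken from the specification.

```c
#include <assert.h>

/*
 * Percentage of the input saved by encoding, based on data bytes only.
 * A negative result means the encoded data is larger than the input.
 */
double compression_rate(unsigned long bytes_in, unsigned long data_out)
{
    assert(bytes_in > 0);
    return 100.0 * ((double)bytes_in - (double)data_out) / (double)bytes_in;
}
```

For the sample3.txt run this is 100 * (82 - 78) / 82, which rounds to 4.88; for sample2.bmp, 100 * (250934 - 183730) / 250934 rounds to 26.78. Printing with `printf("%.2f%%\n", rate)` gives the two-decimal form shown.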
With various formats, you'll likely want to play with the stride in order to find better compression results.
lab46:~/src/discrete/dcf1$ ./decode in/sample0.txt.rle out
input filename: in/sample0.txt.rle
output name length: 11 bytes
output filename: sample5.txt
header text: dcfX RLE v2
stride value: 4 bytes
read in: 3093 bytes
wrote out: 3600 bytes
inflation rate: 14.08%
lab46:~/src/discrete/dcf1$
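Once the header and embedded file name have been consumed, decoding is a matter of reading a count byte, reading one value of stride bytes, and writing that value out count times. Here is a minimal in-memory sketch; the function name `rle_decode` and the buffer interface are hypothetical, and it assumes the final group may carry fewer than stride value bytes when the original file length was not a multiple of the stride.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Expand (count, value) groups from `in` into `out`.
 * Returns the number of bytes written, or (size_t)-1 if `out` (capacity
 * `cap`) is too small. Assumption: a short final value is allowed.
 */
size_t rle_decode(const unsigned char *in, size_t len, size_t stride,
                  unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0;

    assert(stride > 0);

    while (i < len) {
        unsigned char count = in[i++];
        /* The last value may be narrower than the stride. */
        size_t width = (len - i < stride) ? len - i : stride;
        unsigned char c;

        for (c = 0; c < count; c++) {
            if (o + width > cap)
                return (size_t)-1;
            memcpy(out + o, in + i, width);
            o += width;
        }
        i += width; /* move past the value to the next count byte */
    }
    return o;
}
```

Given the ten bytes `02 "aaa" 03 "bcd" 01 '\n'` and a stride of 3, this writes the 16 bytes `aaaaaabcdbcdbcd` plus a newline.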
A good way to test that both encode and decode are working is to encode data then immediately turn around and decode that same data. If the decoded file is in the same state as the original, pre-encoded file, you know things are working.
A quick way to check if two files are identical is to run the diff(1) command on them. Assuming the original file is in in/sample1.txt, and the decoded version (which should be the same thing) is in tmp/sample1.txt:
lab46:~/src/discrete/dcf1$ diff in/sample1.txt tmp/sample1.txt
lab46:~/src/discrete/dcf1$
Just getting your prompt back indicates that no differences were found.
If you'd like to be REALLY sure, generate MD5sum hashes and compare:
lab46:~/src/discrete/dcf1$ md5sum in/sample1.txt tmp/sample1.txt
10f9bc85023dcf37be2b04638cb45ee2  in/sample1.txt
10f9bc85023dcf37be2b04638cb45ee2  tmp/sample1.txt
lab46:~/src/discrete/dcf1$
As you can see, both hashes match (the MD5sum hashes are analyzing the file contents, NOT the name/location).
You may want to check and see what exactly your program is generating.
This can be done by performing a hex data dump (or visualization) of the raw data in the output file.
The tool I'd recommend for quick viewing is xxd(1); please see the following example:
lab46:~/src/discrete/dcf1$ xxd out/sample3.txt.rle
0000000: 6463 6658 2052 4c45 0002 030b 7361 6d70  dcfX RLE....samp
0000010: 6c65 332e 7478 7401 6162 6201 6363 6301  le3.txt.abb.ccc.
0000020: 6464 6401 6465 6501 6565 6502 6666 6602  ddd.dee.eee.fff.
0000030: 6767 6701 6768 6802 6868 6803 6969 6902  ggg.ghh.hhh.iii.
0000040: 6a6a 6a01 6a6a 6b02 6b6b 6b02 6c6c 6c01  jjj.jjk.kkk.lll.
0000050: 6d6d 6d01 6d6d 6e01 6e6e 6e01 6f6f 6f01  mmm.mmn.nnn.ooo.
0000060: 7070 7101 0a                             ppq..
lab46:~/src/discrete/dcf1$
With this output, we can confirm, byte-by-byte, what has been placed in our encoded file. What you'll see are three fields:
If you'd like to verify your implementations, there is a check script included when you use the grabit tool to obtain the skeleton files and data.
To run it, you need a functioning encode and decode program (although it does its best otherwise).
It runs through four separate tests, storing the results in a corresponding o#/ directory (sometimes, if applicable, intermediate results in a corresponding m#/ directory):
How it works:
If all goes according to plan, you'll see “OK” status messages displayed.
lab46:~/src/discrete/dcf1$ ./check
=================================================
= PHASE 0: Raw -> Encode data verification test =
=================================================
in/ascii1.art -> o0/ascii1.art.rle: OK
in/ascii3.art -> o0/ascii3.art.rle: OK
in/ascii7.art -> o0/ascii7.art.rle: OK
in/ascii8.art -> o0/ascii8.art.rle: OK
in/blunders2.mp3 -> o0/blunders2.mp3.rle: OK
in/blunders4.mp3 -> o0/blunders4.mp3.rle: OK
in/blunders7.mp3 -> o0/blunders7.mp3.rle: OK
in/blunders93.mp3 -> o0/blunders93.mp3.rle: OK
in/sample1.txt -> o0/sample1.txt.rle: OK
in/sample2.txt -> o0/sample2.txt.rle: OK
in/sample3.txt -> o0/sample3.txt.rle: OK
in/sample4.txt -> o0/sample4.txt.rle: OK
in/sprite13.png -> o0/sprite13.png.rle: OK
in/sprite1.png -> o0/sprite1.png.rle: OK
in/sprite2.png -> o0/sprite2.png.rle: OK
in/sprite7.png -> o0/sprite7.png.rle: OK
...
Should something not work correctly, you'll see a “FAIL” message:
lab46:~/src/discrete/dcf1$ ./check
=================================================
= PHASE 0: Raw -> Encode data verification test =
=================================================
in/ascii1.art -> o0/ascii1.art.rle: OK
in/ascii3.art -> o0/ascii3.art.rle: OK
in/ascii7.art -> o0/ascii7.art.rle: OK
in/ascii8.art -> o0/ascii8.art.rle: OK
in/blunders2.mp3 -> o0/blunders2.mp3.rle: OK
in/blunders4.mp3 -> o0/blunders4.mp3.rle: OK
in/blunders7.mp3 -> o0/blunders7.mp3.rle: OK
in/blunders93.mp3 -> o0/blunders93.mp3.rle: FAIL: checksums do not match
in/sample1.txt -> o0/sample1.txt.rle: OK
in/sample2.txt -> o0/sample2.txt.rle: OK
in/sample3.txt -> o0/sample3.txt.rle: OK
in/sample4.txt -> o0/sample4.txt.rle: OK
in/sprite13.png -> o0/sprite13.png.rle: OK
in/sprite1.png -> o0/sprite1.png.rle: OK
in/sprite2.png -> o0/sprite2.png.rle: OK
in/sprite7.png -> o0/sprite7.png.rle: OK
...
Should something not work at all (like a decode binary that is missing or failed to compile), you'll see a “MISSING” message:
lab46:~/src/discrete/dcf1$ ./check
...
=================================================
= PHASE 1: Decode -> Raw data verification test =
=================================================
Missing 'decode', skipping test.
...
To successfully complete this project, the following criteria must be met:
To submit this program to me using the submit tool, run the following command at your lab46 prompt:
lab46:~/src/discrete/dcf1$ make submit
removed 'decode'
removed 'encode'
removed 'errors'
Project backup process commencing
Taking snapshot of current project (dcf1) ... OK
Compressing snapshot of dcf1 project archive ... OK
Setting secure permissions on dcf1 archive ... OK
Project backup process complete
Submitting discrete project "dcf1":
    -> ../dcf1-DATESTRING-HOUR.tar.gz(OK)
SUCCESSFULLY SUBMITTED
You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or location mismatches.