Corning Community College
CSCS2330 Discrete Structures
To apply your skills in implementing an encoding scheme that, in ideal circumstances, will lead to a smaller storage footprint.
Last week's project dealt with the second version of this algorithm, implementing a configurable yet global stride value; this week we add another feature into the mix which should further improve overall efficiency, and perhaps even reduce wasted space.
The addition? The use of a control sequence byte.
Instead of assuming fixed count/value sequences throughout the data section of the file, we will be on the lookout for a special control sequence byte. Once we encounter it, we determine our count and stride, process that section, then revert to normal data.
In dcf0, our stride value was fixed to 1 byte. We could only count up sequences of single byte runs, which in some cases yielded compression; in others, not so much.
In dcf1, we added a configurable stride, which lets us count up new sorts of data runs (such as when groups of 2 bytes repeat, or 5 bytes, or 11 bytes).
In this project (dcf2), we still have a specified stride, but we eliminate unnecessary encodings when there's nothing of value to encode. As we know, some data encodes into RLE very effectively, while other data does not. Our control sequence+stride will aid us in that endeavor.
The control sequence consists of a special byte, designated at run-time (for encode) or read out of an encoded file's header (for decode), followed by a 1-byte count and a 1-byte stride. In a similar manner to how we read the file name (read the file name length byte in the header, then count out the bytes that follow), we do the same here with our control sequence.
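To make the idea of "runs at a given stride" concrete, here is a minimal sketch of a helper that counts how many times a stride-sized group repeats back to back. The name `count_run` and its parameters are illustrative assumptions, not part of the project skeleton, and the cap of 255 is assumed from the count field being 1 byte.

```c
#include <stddef.h>
#include <string.h>

/* Count how many times the stride-sized group starting at buf[pos]
 * repeats back to back, capped at 255 so the result fits in a 1-byte
 * count field. Returns 0 if a full group does not fit at pos.
 * (Hypothetical helper for illustration only.) */
size_t count_run(const unsigned char *buf, size_t len,
                 size_t pos, size_t stride)
{
    size_t count;

    if (stride == 0 || pos > len || len - pos < stride)
        return 0;

    count = 1;  /* the group at pos counts as the first occurrence */
    while (count < 255 &&
           len - pos - count * stride >= stride &&
           memcmp(buf + pos, buf + pos + count * stride, stride) == 0)
        count++;

    return count;
}
```

An encoder could call this at each position and only emit an encoded run when the count is high enough to be worth the overhead; where that threshold lies is a design decision left to you.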
To demonstrate what RLE+control_sequence does, let us look at the following data:
aaaaaabcdbcdbcdddddddefghijklmnnnnnnowxyz (41 bytes total)
Encoding with our new algorithm, we would get the following (our control sequence byte will be 0x2A, followed by the count byte, then the stride byte, then the encoded data):
The advantage here is that if we were to have a long sequence of non-patterned data, we don't have to bother with encoding… we just let it be. Once we find something interesting, then we break out the control byte, specify our count and our stride, then pack in our selected data.
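The decode side of this idea can be sketched in memory as follows. This is one assumed reading of the format, not the published specification: a control byte is followed by a count byte, a stride byte, and then `stride` value bytes, meaning "repeat this value `count` times"; every other byte is copied through unchanged. The function name `expand_runs` is hypothetical, and the sketch does not address what should happen if the control byte occurs naturally in the unencoded data.

```c
#include <stddef.h>

/* Expand src into dst under ONE ASSUMED reading of the format:
 *   control, count, stride, then `stride` value bytes
 * meaning "repeat the value `count` times". Any other byte is copied
 * through unchanged. Returns the number of bytes written, or
 * (size_t)-1 on a truncated sequence or if dst is too small. */
size_t expand_runs(const unsigned char *src, size_t srclen,
                   unsigned char *dst, size_t dstcap,
                   unsigned char control)
{
    size_t in = 0, out = 0;

    while (in < srclen) {
        if (src[in] != control) {       /* plain data: copy as-is */
            if (out >= dstcap)
                return (size_t)-1;
            dst[out++] = src[in++];
            continue;
        }

        if (srclen - in < 3)            /* need control + count + stride */
            return (size_t)-1;
        size_t count  = src[in + 1];
        size_t stride = src[in + 2];
        in += 3;

        if (srclen - in < stride || dstcap - out < count * stride)
            return (size_t)-1;

        for (size_t rep = 0; rep < count; rep++)    /* repeat the value */
            for (size_t i = 0; i < stride; i++)
                dst[out++] = src[in + i];
        in += stride;
    }
    return out;
}
```

Working on a buffer keeps the sketch short; a real decoder would read from the input file after the header and write to the output file instead.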
You'll be writing an encode and a decode program implementing RLE+control_sequences, in accordance with these published specifications (this way, anyone can take an RL-encoded file from someone else and decode it with theirs, and vice versa).
It is actually identical to the specifications of last week, save for three changes:
Every RL-encoded file will start with the following 12-byte header:
Following this we will have a repeating sequence of count and value fields (where count is still of length 1, but value is of length stride), continuing until the end of the file.
It is your task to write an encoder and decoder for this specification of the dcfX RLE v3 format:
Your program should:
For those familiar with the grabit tool on lab46, I have made some skeleton files and a custom Makefile available for this project.
To “grab” it:
lab46:~/src/discrete$ grabit discrete dcf2
make: Entering directory '/var/public/fall2016/discrete/dcf2'
‘/var/public/fall2016/discrete/dcf2/Makefile’ -> ‘/home/USERNAME/src/discrete/dcf2/Makefile’
‘/var/public/fall2016/discrete/dcf2/encode.c’ -> ‘/home/USERNAME/src/discrete/dcf2/encode.c’
‘/var/public/fall2016/discrete/dcf2/decode.c’ -> ‘/home/USERNAME/src/discrete/dcf2/decode.c’
‘/var/public/fall2016/discrete/dcf2/data/sample0.txt’ -> ‘/home/USERNAME/src/discrete/dcf2/data/sample0.txt’
‘/var/public/fall2016/discrete/dcf2/data/sample1.txt’ -> ‘/home/USERNAME/src/discrete/dcf2/data/sample1.txt’
‘/var/public/fall2016/discrete/dcf2/data/sample2.bmp’ -> ‘/home/USERNAME/src/discrete/dcf2/data/sample2.bmp’
‘/var/public/fall2016/discrete/dcf2/data/sample3.wav’ -> ‘/home/USERNAME/src/discrete/dcf2/data/sample3.wav’
‘/var/public/fall2016/discrete/dcf2/data/sample4.bmp.rle’ -> ‘/home/USERNAME/src/discrete/dcf2/data/sample4.bmp.rle’
‘/var/public/fall2016/discrete/dcf2/data/sample5.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf2/data/sample5.txt.rle’
‘/var/public/fall2016/discrete/dcf2/data/sample6.mp3.rle’ -> ‘/home/USERNAME/src/discrete/dcf2/data/sample6.mp3.rle’
‘/var/public/fall2016/discrete/dcf2/data/sample7.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf2/data/sample7.txt.rle’
make: Leaving directory '/var/public/fall2016/discrete/dcf2'
lab46:~/src/discrete$ cd dcf2
lab46:~/src/discrete/dcf2$ ls
Makefile  data  decode.c  encode.c
lab46:~/src/discrete/dcf2$
Just another “nice thing” we deserve.
NOTE: You do NOT want to do this on a populated dcf2 project directory, as it will overwrite files. Only do this on an empty directory.
With the Makefile, we have your basic compile and clean-up operations:
To accept (or rather, to gain access to) arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do. I like the traditional argc and argv names, although it is also common to see them abbreviated as ac and av.
Please declare your main() function as follows:
int main(int argc, char **argv)
The arguments are accessible via the argv array, in the order they were specified:
Although I'm not going to require extensive argument parsing or checking for this project, we should check to see if the minimal number of arguments has been provided:
if (argc < 3)  // fewer than 3 entries in argv (argv[0] is the program name)
{
    fprintf(stderr, "Not enough arguments!\n");
    exit(1);
}
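Beyond counting arguments, the stride and control byte arrive as text and need converting to numbers. The sketch below shows one cautious way to do that with strtol(). The helper name `parse_byte_arg` is hypothetical, and the assumption that both values are given in decimal is inferred from the sample encode run (where an argument of 37 is reported as control byte 0x25).

```c
#include <errno.h>
#include <stdlib.h>

/* Parse text as a decimal value in [lo, hi]. Returns 0 on success and
 * stores the result in *out, or -1 if text is not a clean number in
 * range. (Hypothetical helper, not part of the project skeleton.) */
int parse_byte_arg(const char *text, long lo, long hi, unsigned char *out)
{
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);

    /* reject overflow, empty input, and trailing junk such as "3x" */
    if (errno != 0 || end == text || *end != '\0')
        return -1;
    if (value < lo || value > hi)
        return -1;

    *out = (unsigned char)value;
    return 0;
}
```

In encode, this might be called as `parse_byte_arg(argv[2], 1, 255, &stride)` and `parse_byte_arg(argv[3], 0, 255, &control)`, assuming the argument order shown in the sample run (file, stride, control byte).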
Your program output should be as follows (given the specific input):
lab46:~/src/discrete/dcf2$ ./encode sample2.bmp 3 37
dcfX v3 encode details
==================================
input name length: 11 bytes
input filename: sample2.bmp
output filename: sample2.bmp.rle
control byte: 0x25
stride value: 3 bytes
read in: 250934 bytes
wrote out: 112390 bytes
compression rate: 55.21%
lab46:~/src/discrete/dcf2$
With various formats, you'll likely want to play with the stride in order to find better compression scenarios.
lab46:~/src/discrete/dcf2$ ./decode sample5.txt.rle
input filename: sample5.txt.rle
output name length: 11 bytes
output filename: sample5.txt
header text: dcfX RLE v3
control byte: 0x29
stride value: 4 bytes
read in: 2734 bytes
wrote out: 3600 bytes
inflation rate: 24.06%
lab46:~/src/discrete/dcf2$
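The percentages in the sample runs appear to follow a simple pattern: the size difference divided by the larger (unencoded) size. The formulas below are inferred from those two sample outputs rather than taken from a stated definition, and the function names are hypothetical.

```c
/* Compression rate as a percentage of the input size:
 * 100 * (in - out) / in. Negative if encoding made the file larger.
 * (Formula inferred from the sample encode output.) */
double compression_rate(unsigned long bytes_in, unsigned long bytes_out)
{
    if (bytes_in == 0)
        return 0.0;
    return 100.0 * ((double)bytes_in - (double)bytes_out) / (double)bytes_in;
}

/* Inflation rate as a percentage of the output size:
 * 100 * (out - in) / out.
 * (Formula inferred from the sample decode output.) */
double inflation_rate(unsigned long bytes_in, unsigned long bytes_out)
{
    if (bytes_out == 0)
        return 0.0;
    return 100.0 * ((double)bytes_out - (double)bytes_in) / (double)bytes_out;
}
```

With the sample figures, `compression_rate(250934, 112390)` and `inflation_rate(2734, 3600)` reproduce the 55.21% and 24.06% shown when printed with `"%.2f%%"`.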
A good way to test that both encode and decode are working is to encode data then immediately turn around and decode that same data. If the decoded file is in the same state as the original, pre-encoded file, you know things are working.
If you'd like to verify your implementations beyond simply encoding a file, moving the original out of the way, and then decoding, you can use the md5sum tool to verify an exact match.
Run it on the original unencoded file, then run it on the decoded file… the md5sum hashes should match.
The diff(1) tool will also likely work well enough for our endeavors here.
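If you would rather build the round-trip check into a small test program of your own, a byte-for-byte comparison is easy to sketch in C. The function name `files_match` is hypothetical; this is simply one way to do what md5sum or diff tells you from the command line.

```c
#include <stdio.h>

/* Compare two files byte for byte. Returns 1 if identical, 0 if they
 * differ, -1 if either file cannot be opened. */
int files_match(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    int result = 1;
    int ca, cb;

    if (a == NULL || b == NULL) {
        if (a != NULL) fclose(a);
        if (b != NULL) fclose(b);
        return -1;
    }

    do {
        ca = fgetc(a);
        cb = fgetc(b);
        if (ca != cb)           /* also catches one file ending early */
            result = 0;
    } while (result == 1 && ca != EOF);

    fclose(a);
    fclose(b);
    return result;
}
```

Calling it on the original file and the decoded file should return 1 when encode and decode are both working.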
To submit this program to me using the submit tool, run the following command at your lab46 prompt:
lab46:~/src/discrete/dcf2$ make submit
removed 'decode'
removed 'encode'
removed 'errors'
Project backup process commencing
Taking snapshot of current project (dcf2) ... OK
Compressing snapshot of dcf2 project archive ... OK
Setting secure permissions on dcf2 archive ... OK
Project backup process complete
Submitting discrete project "dcf2":
-> ../dcf2-DATESTRING-HOUR.tar.gz(OK)
SUCCESSFULLY SUBMITTED
You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or locational mismatches.
To be successful in this project, the following criteria must be met: