Corning Community College
CSCS2330 Discrete Structures
To apply your skills in implementing an encoding scheme that, in ideal circumstances, will lead to a smaller storage footprint.
A Google search defines encoding as:
When we program, we encode ideas into syntax.
When you compile source code, the compiler encodes your coding syntax into machine code.
When we lay out a pattern, we encode particular aspects of a process.
Encoding can be as simple as a 1:1 translation (such as shifting letters around to encipher a message), but it can also take other forms.
Compression is a certain form of encoding, one where we focus on taking something (typically data, bits, bandwidth, instructions) and encoding it in a way where it ends up requiring less storage space than when we began.
According to Wikipedia, data compression is: the process of encoding digital information using fewer bits.
Another aspect to consider is whether the act of encoding data results in a reversible pathway or not. Sometimes the process of encoding can eliminate seemingly unnecessary data, aiding in the task of taking up less space. When data is trimmed in this manner, we say the process is lossy; when all data is preserved, it is lossless.
For more information on lossy and lossless data compression:
Wikipedia has categories identifying various algorithms for both lossless and lossy compression.
Our algorithm of implementation this week is a relatively simple and straightforward one known as Run-Length Encoding (RLE for short).
Sometimes associated with certain image formats, RLE is an algorithm that, when matched with appropriately patterned data, can yield some space savings.
However, like other special-purpose algorithms, in its worst cases it can actually balloon the storage footprint of your data.
To demonstrate what RLE does, let us look at the following data:
aaaabcccdddddefghhhhhhhhhhhh (28 bytes total)
Notice how there are sequences of repeating letters? Turns out in some patterns of data, that can be quite the frequent occurrence (for example, images with large swaths of the same color).
RLE adds a sort of narrative to the data, sort of like what we might do if we were describing that string:
Or, to concisely represent it:
4a1b3c5d1e1f1g12h (17 bytes total)
By applying that “narrative” to the data, we were able to preserve the integrity of the data while shrinking its storage footprint.
And in that case, we got some pretty decent gains:
However, the data must be appropriately repetitive… if not, it can grow in size. Take this example:
abcdefgh (8 bytes)
Applying our same narrative:
Or, concisely:
1a1b1c1d1e1f1g1h (16 bytes)
In that case, we not only didn't shrink, but we grew to twice our original size.
When RLE works, it can work pretty well, but when the data isn't conducive, it really doesn't help out that much at all.
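The two text examples above can be reproduced with a short C sketch. The function name `rle_describe` is hypothetical and not part of the project skeleton. It only illustrates the "count, then letter" narrative and is not the binary file format described later. Note that this text form would be ambiguous if the data itself contained digits.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the "count then letter" text form of src into
 * dst (e.g. "aaab" -> "3a1b"). Returns the number of characters written, or
 * -1 if dst is too small. Illustrative only; not the project's file format. */
int rle_describe(const char *src, char *dst, size_t dstsize)
{
    size_t len  = strlen(src);
    size_t used = 0;
    size_t i    = 0;

    if (dstsize == 0)
        return -1;
    dst[0] = '\0';

    while (i < len)
    {
        size_t run = 1;

        /* Count how many times src[i] repeats consecutively. */
        while (i + run < len && src[i + run] == src[i])
            run++;

        /* Append "<count><char>" to the output, checking for truncation. */
        int n = snprintf(dst + used, dstsize - used, "%zu%c", run, src[i]);
        if (n < 0 || (size_t) n >= dstsize - used)
            return -1;

        used += (size_t) n;
        i    += run;
    }

    return (int) used;
}
```

Called on the 28-byte string it produces the 17-character form, and on `abcdefgh` it produces the 16-character form, matching the growth described above.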
You'll be writing an encode and a decode program implementing RLE, in accordance with these published specifications (this way, anyone can take an RL-encoded file from someone else and decode it with theirs, and vice versa).
Every RL-encoded file will start with the following 12-byte header:
Following this we will have a repeating sequence of count and value fields, continuing until the end of the file.
For example, encoding our 28-byte example from above we'd get:
04 61 01 62 03 63 05 64 01 65 01 66 01 67 0c 68
Or, if looking at the ENTIRE encoded file, with header, where the source file was named 'sample.txt' and has a newline character at the end (I've wrapped it at 16 bytes per line so it better fits on the page without a horizontal scroll):
64 63 66 58 20 52 4c 45 00 01 01 0a 73 61 6d 70
6c 65 2e 74 78 74 04 61 01 62 03 63 05 64 01 65
01 66 01 67 0c 68 01 0a
Now, in this case, with full headers, our original 28-byte (29, counting the newline) example would actually end up at 40 bytes, but that sample is rather small. A repetitive file originally 40-ish bytes, and of course larger, would start to yield space savings.
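As a rough illustration of how the 40-byte dump above comes together, here is a minimal in-memory sketch in C. Several things in it are assumptions rather than part of the specification:

- The function name `dcf_encode_buffer` and the constant `DCF_PREFIX` are hypothetical.
- The first 11 header bytes are copied verbatim from the dump and treated as opaque.
- The 12th byte is assumed to be the length of the file name that follows (0x0a for "sample.txt").
- The stride is assumed to be 1 byte.
- Each count is assumed to fit in one byte, so longer runs are split.

```c
#include <stddef.h>
#include <string.h>

/* First 11 header bytes, copied verbatim from the sample dump:
 * "dcfX RLE" followed by 00 01 01. The meaning of the three trailing
 * bytes is not assumed here; they are treated as opaque constants. */
static const unsigned char DCF_PREFIX[11] = {
    0x64, 0x63, 0x66, 0x58, 0x20, 0x52, 0x4c, 0x45, 0x00, 0x01, 0x01
};

/* Hypothetical helper: writes header, source name, and (count, value)
 * pairs for data[0..n-1] into out[].
 * ASSUMPTIONS: header byte 12 is the name length; stride is 1 byte;
 * a count occupies one byte, so runs are capped at 255 and then split.
 * Returns total bytes written, or 0 if out[] is too small. */
size_t dcf_encode_buffer(const char *name,
                         const unsigned char *data, size_t n,
                         unsigned char *out, size_t outcap)
{
    size_t namelen = strlen(name);
    size_t w = 0;
    size_t i = 0;

    if (namelen > 255 || sizeof DCF_PREFIX + 1 + namelen > outcap)
        return 0;

    /* Header: fixed prefix, then the name length, then the name itself. */
    memcpy(out, DCF_PREFIX, sizeof DCF_PREFIX);
    w = sizeof DCF_PREFIX;
    out[w++] = (unsigned char) namelen;
    memcpy(out + w, name, namelen);
    w += namelen;

    /* Body: one (count, value) pair per run of identical bytes. */
    while (i < n)
    {
        unsigned char value = data[i];
        size_t run = 1;

        while (i + run < n && data[i + run] == value && run < 255)
            run++;

        if (w + 2 > outcap)
            return 0;

        out[w++] = (unsigned char) run;
        out[w++] = value;
        i += run;
    }

    return w;
}
```

Feeding it the 29 bytes of the sample (the 28 letters plus the newline) with the name "sample.txt" yields 12 + 10 + 18 = 40 bytes, the same total as in the dump. A real encoder would read from and write to files rather than buffers.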
It is your task to write an encoder and decoder for this specification of the dcfX RLE format:
Your program should:
For those familiar with the grabit tool on lab46, I have made some skeleton files and a custom Makefile available for this project.
To “grab” it:
lab46:~/src/discrete$ grabit discrete dcf0
make: Entering directory '/var/public/fall2016/discrete/dcf0'
‘/var/public/fall2016/discrete/dcf0/Makefile’ -> ‘/home/USERNAME/src/discrete/dcf0/Makefile’
‘/var/public/fall2016/discrete/dcf0/encode.c’ -> ‘/home/USERNAME/src/discrete/dcf0/encode.c’
‘/var/public/fall2016/discrete/dcf0/decode.c’ -> ‘/home/USERNAME/src/discrete/dcf0/decode.c’
‘/var/public/fall2016/discrete/dcf0/data/sample0.txt’ -> ‘/home/USERNAME/src/discrete/dcf0/data/sample0.txt’
‘/var/public/fall2016/discrete/dcf0/data/sample1.txt’ -> ‘/home/USERNAME/src/discrete/dcf0/data/sample1.txt’
‘/var/public/fall2016/discrete/dcf0/data/sample2.bmp’ -> ‘/home/USERNAME/src/discrete/dcf0/data/sample2.bmp’
‘/var/public/fall2016/discrete/dcf0/data/sample3.wav’ -> ‘/home/USERNAME/src/discrete/dcf0/data/sample3.wav’
‘/var/public/fall2016/discrete/dcf0/data/sample4.bmp.rle’ -> ‘/home/USERNAME/src/discrete/dcf0/data/sample4.bmp.rle’
‘/var/public/fall2016/discrete/dcf0/data/sample5.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf0/data/sample5.txt.rle’
‘/var/public/fall2016/discrete/dcf0/data/sample6.mp3.rle’ -> ‘/home/USERNAME/src/discrete/dcf0/data/sample6.mp3.rle’
‘/var/public/fall2016/discrete/dcf0/data/sample7.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf0/data/sample7.txt.rle’
make: Leaving directory '/var/public/fall2016/discrete/dcf0'
lab46:~/src/discrete$ cd dcf0
lab46:~/src/discrete/dcf0$ ls
Makefile  data  decode.c  encode.c
lab46:~/src/discrete/dcf0$
Just another “nice thing” we deserve.
NOTE: You do NOT want to do this on a populated dcf0 project directory, as it will overwrite files. Only do this on an empty directory.
With the Makefile, we have your basic compile and clean-up operations:
To gain access to arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do. I like the traditional argc and argv names, although it is also common to see them abbreviated as ac and av.
Please declare your main() function as follows:
int main(int argc, char **argv)
The arguments are accessible via the argv array, in the order they were specified:
Although I'm not going to require extensive argument parsing or checking for this project, we should check to see if the minimal number of arguments has been provided:
if (argc < 2)  // if fewer than 2 arguments have been provided
{
    fprintf(stderr, "Not enough arguments!\n");
    exit(1);
}
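The same check can also be wrapped in a small function, which makes it easy to exercise without running the whole program. This is only a sketch: the name `first_argument` is hypothetical, and it returns NULL instead of calling exit() so that the caller decides how to bail out.

```c
#include <stdio.h>

/* Hypothetical helper: returns the first user-supplied argument (argv[1]),
 * or NULL after printing an error when none was given.
 * argv[0] holds the program name, so argc counts it as well. */
const char *first_argument(int argc, char **argv)
{
    if (argc < 2)
    {
        fprintf(stderr, "Not enough arguments!\n");
        return NULL;
    }

    return argv[1];
}
```

Inside main(), the result could be stored as the input filename, with the program exiting with a non-zero status when NULL comes back.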
Your program output should be as follows (given the specific input):
lab46:~/src/discrete/dcf0$ ./encode data/sample0.txt
input name length: 16 bytes
input filename: data/sample0.txt
output filename: data/sample0.txt.rle
stride value: 1 byte
read in: 82 bytes
wrote out: 64 bytes
compression rate: 21.95%
lab46:~/src/discrete/dcf0$
lab46:~/src/discrete/dcf0$ ./decode data/sample0.txt.rle
input filename: data/sample0.txt.rle
output name length: 16 bytes
output filename: data/sample0.txt
header text: dcfX RLE v1
stride value: 1 byte
read in: 64 bytes
wrote out: 82 bytes
inflation rate: 21.95%
lab46:~/src/discrete/dcf0$
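The percentage in the sample runs is consistent with the size difference divided by the larger (unencoded) size: (82 - 64) / 82 is roughly 21.95%. That formula is inferred from the sample numbers, not stated in the specification. A minimal sketch, with the hypothetical name `compression_rate`:

```c
/* Percentage saved relative to the unencoded size:
 * (original - encoded) / original * 100.
 * ASSUMPTION: formula inferred from the sample output (82 -> 64 bytes).
 * A negative result means the encoded form is larger than the original.
 * Returns 0.0 for an empty original to avoid dividing by zero. */
double compression_rate(long original_bytes, long encoded_bytes)
{
    if (original_bytes <= 0)
        return 0.0;

    return 100.0 * (double) (original_bytes - encoded_bytes)
                 / (double) original_bytes;
}
```

Printed with a format such as "%.2f%%", the value for 82 and 64 bytes rounds to 21.95%. For the 8-byte worst-case example that doubles to 16 bytes, the result is -100.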
A good way to test that both encode and decode are working is to encode data then immediately turn around and decode that same data. If the decoded file is in the same state as the original, pre-encoded file, you know things are working.
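The decoding half of that round trip can be sketched in memory as follows. The function name `rle_expand` is hypothetical, and the sketch assumes a stride of 1 byte and one-byte counts. It handles only the count/value body, not the header.

```c
#include <stddef.h>

/* Hypothetical helper: expands (count, value) byte pairs back into raw
 * bytes. ASSUMPTIONS: stride of 1 byte, one-byte counts.
 * Returns the number of bytes written, or 0 on malformed input (odd
 * length) or when out[] is too small. */
size_t rle_expand(const unsigned char *pairs, size_t n,
                  unsigned char *out, size_t outcap)
{
    size_t w = 0;

    if (n % 2 != 0)
        return 0;

    for (size_t i = 0; i < n; i += 2)
    {
        size_t        count = pairs[i];
        unsigned char value = pairs[i + 1];

        if (w + count > outcap)
            return 0;

        /* Emit the value 'count' times. */
        for (size_t k = 0; k < count; k++)
            out[w++] = value;
    }

    return w;
}
```

Expanding the 16 count/value bytes from the earlier example gives back the original 28 letters. Comparing that result byte for byte against the source is the in-memory equivalent of the encode-then-decode check described above.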
If you'd like to verify your implementations,
To successfully complete this project, the following criteria must be met:
To submit this program to me using the submit tool, run the following command at your lab46 prompt:
lab46:~/src/discrete/dcf0$ make submit
removed 'decode'
removed 'encode'
removed 'errors'
Project backup process commencing
Taking snapshot of current project (dcf0) ... OK
Compressing snapshot of dcf0 project archive ... OK
Setting secure permissions on dcf0 archive ... OK
Project backup process complete
Submitting discrete project "dcf0":
-> ../dcf0-DATESTRING-HOUR.tar.gz(OK)
SUCCESSFULLY SUBMITTED
You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or location mismatches.