Corning Community College
CSCS2330 Discrete Structures
~~TOC~~
======Project: RUN-LENGTH ENCODING - DATA COMPRESSION FUN (dcf1)======
=====Objective=====
To apply your skills in implementing an encoding scheme that, in ideal circumstances, will lead to a smaller storage footprint.
====This week's algorithm: RLE+stride====
Last week's project dealt with the first version of this algorithm; this week we add another variable into the mix which can fundamentally change the effectiveness of compressing data.
The algorithm we implement this week is a minor tweak of our RLE algorithm from last week.
The change? A configurable **stride** value.
Google defines **stride** as: a long, decisive step.
In dcf0, our stride value was fixed to 1 byte. We could only count up sequences of single byte runs, which in some cases yielded compression; in others, not so much.
With a configurable stride, we can then start counting up new sorts of data runs (such as when groups of two bytes may see strings of repetition, or 5 bytes, or 11 bytes).
To demonstrate what RLE+stride does, let us look at the following data:
aaaaaabcdbcdbcdddddddefghhhhhhhhhhhh (36 bytes total)
Encoding with RLE+1 (1 byte stride), we would get the following (this is what your dcf0 programs should be producing if fed this data):
* 06 61 01 62 01 63 01 64 01 62 01 63 01 64 01 62 01 63 07 64 01 65 01 66 01 67 0C 68 (28 bytes)
Encoding with RLE+2 (2 byte stride), we would get the following:
* 03 61 61 01 62 63 01 64 62 01 63 64 01 62 63 03 64 64 01 64 65 01 66 67 06 68 68 (27 bytes)
Notice here with a stride of 2 bytes, we still have the singular count byte, but that is then followed by **TWO BYTES** of what the count is keeping track of.
In this example we actually gained some ground over the 1 byte stride. Not much, but let's see if we can improve it somewhat.
Encoding with RLE+3 (3 byte stride):
* 02 61 61 61 03 62 63 64 02 64 64 64 01 65 66 67 04 68 68 68 (20 bytes)
We see here a 1 byte count byte, followed by 3 bytes of value.
That is a bigger savings. This was possible because the source data had a lot of 3 byte sequences, which allowed a 3 byte stride to work particularly well. What is neat is that various types and formats of data will have patterns that better fit various strides.
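The counting shown in the examples above can be sketched in C as follows. This is a minimal in-memory sketch, not the required implementation: the function name **rle_stride_encode** and its parameters are made up for illustration, it works on a buffer rather than on files, and it writes no header. It compares whole stride-sized chunks, caps each count at 255, and writes any leftover short chunk with a count of 01.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: encode n bytes of src as count/value groups.
 * dst must have room for (n / stride + 1) * (stride + 1) bytes.
 * Returns the number of bytes written to dst.
 */
size_t rle_stride_encode(const unsigned char *src, size_t n,
                         size_t stride, unsigned char *dst)
{
    size_t in  = 0;   /* read position in src  */
    size_t out = 0;   /* write position in dst */

    assert(stride >= 1 && stride <= 255);

    /* process the data one full stride-sized chunk at a time */
    while (n - in >= stride)
    {
        const unsigned char *value = src + in;
        unsigned char count = 1;

        in += stride;

        /* extend the run while the next full chunk matches; stop at 255 */
        while (count < 255 && n - in >= stride &&
               memcmp(value, src + in, stride) == 0)
        {
            count++;
            in += stride;
        }

        dst[out++] = count;               /* 1 byte of count       */
        memcpy(dst + out, value, stride); /* stride bytes of value */
        out += stride;
    }

    /* fewer than stride bytes left: count of 01, then the leftovers */
    if (in < n)
    {
        dst[out++] = 1;
        memcpy(dst + out, src + in, n - in);
        out += n - in;
    }

    return out;
}
```

Feeding it the 36 byte sample with a stride of 3 should produce the same 20 bytes listed for RLE+3 above. A file-based encoder would do the same comparisons on chunks obtained from the input file.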
=====dcfX RLE v2 specification=====
You'll be writing an **encode** and a **decode** program implementing RLE+stride, in accordance with these published specifications (this way, anyone can take an RL-encoded file from someone else and decode it with theirs, and vice versa).
====Header====
It is actually **identical** to the specifications of last week, save for two changes:
- we're no longer hard-coding the **stride** value to 1 (byte 10)
- we're placing a 2 in the version byte (byte 9)
Every RL-encoded file will start with the following 12-byte header:
* byte 0: 0x64
* byte 1: 0x63
* byte 2: 0x66
* byte 3: 0x58
* byte 4: 0x20
* byte 5: 0x52
* byte 6: 0x4c
* byte 7: 0x45
* byte 8: 0x00 (reserved)
* byte 9: 0x02 (version of our RLE specification; 0x01 when the stride is 1)
* byte 10: 1 byte for stride value
* byte 11: 1 byte for source file name's length (doesn't include NULL terminator)
* bytes 12 through 12+length-1 (where length is the value stored in byte 11): ASCII string of original filename to write. This will also be the name of the file **decode** creates, leaving the RL-encoded file intact.
Following this we will have a repeating sequence of **count** and **value** fields (where **count** is still of length 1, but **value** is of length **stride**), continuing until the end of the file.
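One way to lay that header out in memory before writing it is sketched below. The function name **build_header** and its parameters are hypothetical, not part of the provided skeleton files. The version byte follows the rule stated on this page that a stride of 1 is written as version 1, and anything else as version 2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: fill buf with the fixed header plus the file name.
 * buf must have room for 12 + strlen(name) bytes.
 * Returns the total number of header bytes placed in buf.
 */
size_t build_header(unsigned char *buf, unsigned char stride,
                    const char *name)
{
    /* bytes 0-7 spell out "dcfX RLE" in ASCII */
    static const unsigned char magic[8] =
        { 0x64, 0x63, 0x66, 0x58, 0x20, 0x52, 0x4c, 0x45 };
    size_t len = strlen(name);

    assert(stride >= 1);  /* a stride of 0 is outside the 1-255 range   */
    assert(len <= 255);   /* the name length has to fit in a single byte */

    memcpy(buf, magic, 8);
    buf[8]  = 0x00;                        /* reserved                */
    buf[9]  = (stride == 1) ? 0x01 : 0x02; /* specification version   */
    buf[10] = stride;                      /* stride value            */
    buf[11] = (unsigned char) len;         /* name length, no NULL    */
    memcpy(buf + 12, name, len);           /* name, no NULL terminator */

    return 12 + len;
}
```

The returned length can be handed directly to **fwrite(3)**, after which the count and value fields follow.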
=====Program=====
It is your task to write an encoder and decoder for this specification of the dcfX RLE v2 format:
- **encode.c**: read in source data, encode according to specifications
- **decode.c**: read in RL-encoded data, decode to produce original data
Your program should:
* for encode, obtain 2 parameters from the command-line (see **command-line arguments** section below):
* argv[1]: name of source file
* this should be a file that exists, but you should do appropriate error checking and bail out if the file cannot be accessed
* argv[2]: stride value
* this is a value between 1 and 255 (inclusive).
* for **encode**, the output file will be the filename specified in argv[1] with an "**.rle**" suffixed to the end. This adds a certain universal aspect to how we'll go about naming things (**tar** and **gzip** do this too).
* if stride is set to 1, set the RLE version to 1 in the header (that way our **decode** from last week can still work with it)
* for **decode**, only accept 1 argument; this should be an RL-encoded file.
* the output file name will be obtained from parsing the file's header data.
* be sure to perform appropriate error checking and bail out as needed.
* if the version byte in the header is 1, assume a 1 byte stride and decode; be sure to report RLE version in output
* this way, v2 decode will be backwards compatible with v1 encode
* if the version byte in the header is 2, read the stride value from the header stride byte, and decode in accordance with that
* implement the specified algorithm in both encoding and decoding forms.
* please be sure to test it against varying types of data, to make sure it works no matter what you throw at it.
* calculate and display some statistics gleaned during the performance of the process.
* for example, **encode** should display information on:
* how many bytes read in
* how many bytes written out
* compression rate
* **decode** should also display:
* RLE header information
* filename information
* how many bytes read in
* how many bytes written out
* decompression rate
* see the sample program outputs below
* display errors to STDERR
* display run-time information to STDOUT
* your RL-encoded data **MUST** be conformant to the project specifications described above.
* you should be able to encode/decode a set of data with 100% retrieval rate. No data should be lost.
* **decode** should validate the header information (is it encoded in version 1 or 2? if not, complain to STDERR of "version mismatch!" and exit).
* if the first 8 bytes of the header do not check out, error out with an "invalid data format detected! aborting process..." message to STDERR.
* if the file only contains a header (and no encoded data), report to STDERR "empty data segment" and exit.
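The header checks that **decode** has to make can be sketched as below. The enum and function names are hypothetical. Treating a stride byte of 0 in a version 2 file as an invalid format is an assumption on my part, since the specification above only defines strides from 1 to 255.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for the header check */
enum header_status
{
    HEADER_OK,          /* header accepted, stride has been filled in    */
    HEADER_BAD_FORMAT,  /* "invalid data format detected! aborting process..." */
    HEADER_BAD_VERSION  /* "version mismatch!"                            */
};

/*
 * hdr points at the first n bytes read from the RL-encoded file.
 * On HEADER_OK, *stride holds the stride to decode with.
 */
enum header_status check_header(const unsigned char *hdr, size_t n,
                                unsigned char *stride)
{
    assert(hdr != NULL && stride != NULL);

    /* the first 8 bytes must spell out "dcfX RLE" */
    if (n < 12 || memcmp(hdr, "dcfX RLE", 8) != 0)
        return HEADER_BAD_FORMAT;

    if (hdr[9] == 1)        /* version 1: stride is always 1 byte */
    {
        *stride = 1;
        return HEADER_OK;
    }

    if (hdr[9] == 2)        /* version 2: stride comes from byte 10 */
    {
        if (hdr[10] == 0)   /* assumption: 0 is not a usable stride */
            return HEADER_BAD_FORMAT;
        *stride = hdr[10];
        return HEADER_OK;
    }

    return HEADER_BAD_VERSION;
}
```

The caller would print the matching message to STDERR and exit on anything other than HEADER_OK. The "empty data segment" check happens afterwards, once the file name has been read and no count byte follows it.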
====Other specification details====
* if you end up with a less-than-stride quantity of bytes remaining at the very end of the file, merely write a count of 01 followed by those remaining bytes
* your program should be mindful of data reads/writes that may hit the end of file marker
* your program also should avoid writing the EOF byte as a data byte (or reading it and processing it as anything other than a marker to stop reading).
* Since we are using single bytes to store our counts, one needs to be mindful not to allow the byte value to "roll over"; limit your counts to 255.
* the **feof(3)** function can be of great use.
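To show how the short final chunk affects decoding, here is a minimal in-memory sketch of the decode side. The function name **rle_stride_decode** is hypothetical, and it expects only the count and value fields (the header is assumed to have been consumed already). When fewer than stride bytes remain after a count, it uses however many bytes are actually there.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: expand n bytes of count/value groups from src.
 * dst must be large enough for the decoded data
 * (at most 255 * stride bytes per group).
 * Returns the number of bytes written to dst.
 */
size_t rle_stride_decode(const unsigned char *src, size_t n,
                         size_t stride, unsigned char *dst)
{
    size_t in  = 0;   /* read position in src  */
    size_t out = 0;   /* write position in dst */

    assert(stride >= 1 && stride <= 255);

    while (in < n)
    {
        unsigned int count = src[in++];   /* 1 byte of count */
        unsigned int i;
        size_t len = n - in;              /* value bytes still available */

        if (len > stride)                 /* a normal, full-size value */
            len = stride;

        /* write the value count times; a short final value stays short */
        for (i = 0; i < count; i++)
        {
            memcpy(dst + out, src + in, len);
            out += len;
        }

        in += len;
    }

    return out;
}
```

In a file-based decoder the same idea applies: the return value of **fread(3)** says how many value bytes were really obtained, and that is the amount to write out.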
=====Grabit Integration=====
For those familiar with the **grabit** tool on lab46, I have made some skeleton files and a custom **Makefile** available for this project.
To "grab" it:
lab46:~/src/discrete$ grabit discrete dcf1
make: Entering directory '/var/public/summer2017/discrete/dcf1'
‘/var/public/summer2017/discrete/dcf1/Makefile’ -> ‘/home/USERNAME/src/discrete/dcf1/Makefile’
‘/var/public/summer2017/discrete/dcf1/encode.c’ -> ‘/home/USERNAME/src/discrete/dcf1/encode.c’
‘/var/public/summer2017/discrete/dcf1/decode.c’ -> ‘/home/USERNAME/src/discrete/dcf1/decode.c’
‘/var/public/summer2017/discrete/dcf1/data/sample0.txt’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample0.txt’
‘/var/public/summer2017/discrete/dcf1/data/sample1.txt’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample1.txt’
‘/var/public/summer2017/discrete/dcf1/data/sample2.bmp’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample2.bmp’
‘/var/public/summer2017/discrete/dcf1/data/sample3.wav’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample3.wav’
‘/var/public/summer2017/discrete/dcf1/data/sample4.bmp.rle’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample4.bmp.rle’
‘/var/public/summer2017/discrete/dcf1/data/sample5.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample5.txt.rle’
‘/var/public/summer2017/discrete/dcf1/data/sample6.mp3.rle’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample6.mp3.rle’
‘/var/public/summer2017/discrete/dcf1/data/sample7.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample7.txt.rle’
make: Leaving directory '/var/public/summer2017/discrete/dcf1'
lab46:~/src/discrete$ cd dcf1
lab46:~/src/discrete/dcf1$ ls
Makefile data decode.c encode.c
lab46:~/src/discrete/dcf1$
Just another "nice thing" we deserve.
NOTE: You do NOT want to do this on a populated dcf1 project directory-- it will overwrite files. Only do this on an empty directory.
=====Makefile fun=====
With the Makefile, we have your basic compile and clean-up operations:
* **make**: compile everything
* **make debug**: compile everything with debug support
* **make clean**: remove all binaries
* **make getdata**: re-obtain a fresh copy of project data files
* **make save**: make a backup of your project
* **make submit**: submit project (uses submit tool)
=====Command-Line Arguments=====
====setting up main()====
To accept (or rather, to gain access to) arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do. I like the traditional **argc** and **argv** names, although it is also common to see them abbreviated as **ac** and **av**.
Please declare your main() function as follows:
int main(int argc, char **argv)
The arguments are accessible via the argv array, in the order they were specified:
* argv[0]: program invocation (path + program name)
* argv[1]: our input file
* argv[2]: our stride value (1-255)
====Simple argument checks====
Although I'm not going to require extensive argument parsing or checking for this project, we should check to see if the minimal number of arguments has been provided:
if (argc < 3) // if fewer than 2 arguments follow the program name (encode)
{
    fprintf(stderr, "Not enough arguments!\n");
    exit(1);
}
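Since argv[2] arrives as a string, it has to be converted and range-checked before being used as a stride. A minimal sketch using **strtol(3)** follows; the function name **parse_stride** and its -1 error return are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert text to a stride value.
 * Returns the stride (1-255) on success, or -1 if text is not a
 * plain decimal number in that range.
 */
int parse_stride(const char *text)
{
    char *end = NULL;
    long value;

    assert(text != NULL);

    errno = 0;
    value = strtol(text, &end, 10);

    /* reject overflow, empty input, trailing junk, and out-of-range values */
    if (errno != 0 || end == text || *end != '\0' ||
        value < 1 || value > 255)
    {
        return -1;
    }

    return (int) value;
}
```

In **encode**, a result of -1 from something like parse_stride(argv[2]) would be a reason to print an error to STDERR and exit.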
=====Execution=====
Your program output should be as follows (given the specific input):
====Encode====
lab46:~/src/discrete/dcf1$ ./encode data/sample2.bmp 37
input name length: 16 bytes
input filename: data/sample2.bmp
output filename: data/sample2.bmp.rle
stride value: 37 bytes
read in: 250934 bytes
wrote out: 183758 bytes
compression rate: 26.77%
lab46:~/src/discrete/dcf1$
With various formats, you'll likely want to play with the stride in order to find better compression scenarios.
====Decode====
lab46:~/src/discrete/dcf1$ ./decode data/sample5.txt.rle
input filename: data/sample5.txt.rle
output name length: 11 bytes
output filename: sample5.txt
header text: dcfX RLE v2
stride value: 4 bytes
read in: 3093 bytes
wrote out: 3600 bytes
inflation rate: 14.08%
lab46:~/src/discrete/dcf1$
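The formulas behind the two rates are not spelled out on this page. The sketch below uses formulas inferred from the sample numbers above (bytes saved relative to bytes read for **encode**, bytes gained relative to bytes written for **decode**), so treat them as an assumption. The function names are hypothetical.

```c
#include <assert.h>

/* percentage of the input that encode managed to save (assumed formula) */
double compression_rate(unsigned long bytes_in, unsigned long bytes_out)
{
    assert(bytes_in > 0);
    return 100.0 * ((double) bytes_in - (double) bytes_out)
                 / (double) bytes_in;
}

/* percentage of the output that decode added back (assumed formula) */
double inflation_rate(unsigned long bytes_in, unsigned long bytes_out)
{
    assert(bytes_out > 0);
    return 100.0 * ((double) bytes_out - (double) bytes_in)
                 / (double) bytes_out;
}
```

Printing either result with a "%.2f%%" format gives two decimal places followed by a percent sign, as in the sample outputs.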
=====Check Results=====
A good way to test that both encode and decode are working is to encode data then immediately turn around and decode that same data. If the decoded file is in the same state as the original, pre-encoded file, you know things are working.
To verify your implementations beyond simply encoding (moving the original file out of the way) and then decoding, you can use the **md5sum** tool to check for an exact match.
Run it on the original unencoded file, then run it on the decoded file... the md5sum hashes should match.
The **diff(1)** tool will also likely work well enough for our endeavors here.
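If you would rather do the comparison yourself, a byte-by-byte check is short to write. This is only a sketch with a made-up function name (**streams_match**); it also shows the usual way of reading until end of file without mistaking a data byte for EOF.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 when both streams hold identical bytes
 * from their current positions through end of file, 0 otherwise.
 */
int streams_match(FILE *a, FILE *b)
{
    int ca, cb;   /* int, not char, so EOF stays distinct from byte 0xFF */

    assert(a != NULL && b != NULL);

    do
    {
        ca = fgetc(a);
        cb = fgetc(b);

        if (ca != cb)   /* differing bytes, or one stream ended early */
            return 0;
    }
    while (ca != EOF);

    return 1;           /* both streams reached EOF together */
}
```

Open the original and the decoded file with **fopen(3)** in "rb" mode and pass both streams in. Note that a read error also makes **fgetc(3)** return EOF, so a careful version would check **ferror(3)** as well.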
=====Submission=====
====Project Submission====
To submit this program to me using the **submit** tool, run the following command at your lab46 prompt:
lab46:~/src/discrete/dcf1$ make submit
removed 'decode'
removed 'encode'
removed 'errors'
Project backup process commencing
Taking snapshot of current project (dcf1) ... OK
Compressing snapshot of dcf1 project archive ... OK
Setting secure permissions on dcf1 archive ... OK
Project backup process complete
Submitting discrete project "dcf1":
-> ../dcf1-DATESTRING-HOUR.tar.gz(OK)
SUCCESSFULLY SUBMITTED
You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or locational mismatches.
====Submission Criteria====
To be successful in this project, the following criteria must be met:
* Project must be submitted on time, by the posted deadline.
* Early submissions will earn 1 bonus point per full day in advance of the deadline.
* Bonus eligibility requires an honest attempt at performing the project (no blank efforts accepted)
* Late submissions will lose 25% credit per day, with the submission window closing on the 4th day following the deadline.
* To clarify: if a project is due on Wednesday (before its end), it would then be 25% off on Thursday, 50% off on Friday, 75% off on Saturday, and worth 0% once it becomes Sunday.
* Certain projects may not have a late grace period, and the due date is the absolute end of things.
* all requested functionality must conform to stated requirements (either on this project page or in comment banner in source code files themselves).
* code resulting in two binaries must be submitted:
* source code that when compiled produces the **encode** program
* if you're only using one file for the encode, that source file should be called **encode.c**
* source code that when compiled produces the **decode** program
* if you're only using one file for the decode, that source file should be called **decode.c**
* Output generated must conform to any provided requirements and specifications (be it in syntax or sample output)
* output obviously must also be correct based on input.
* Processing must be correct based on input given and output requested
* Specification details are NOT to be altered. This project will be evaluated according to the specifications laid out in this document.
* Code must compile cleanly.
* Each source file must compile cleanly (worth 3 total points):
* 3/3: no compiler warnings, notes or errors.
* 2/3: one warning or note present during compile
* 1/3: two warnings or notes present during compile
* 0/3: compiler errors present (code doesn't compile)
* Code must be nicely and consistently indented (you may use the **indent** tool)
* You are free to use your own coding style, but you must be **consistent**
* Avoid unnecessary blank lines (some are good for readability, but do not go overboard- double-spacing your code will get points deducted).
* Indentation will be rated on the following scale (worth 3 total points):
* 3/3: Aesthetically pleasing, pristine indentation, easy to read, organized
* 2/3: Mostly consistent indentation, but some distractions (superfluous or lacking blank lines, or some sort of "busy" ness to the code)
* 1/3: Some indentation issues, difficult to read
* 0/3: Lack of consistent indentation (didn't appear to try)
* Code must be commented
* Commenting will be rated on the following scale (worth 4 total points):
* 4/4: Not only aesthetically pleasing, but also adequately explains the WHY behind what you are doing
* 3/4: Aesthetically pleasing (comments aligned or generally not distracting), easy to read, organized
* 2/4: Mostly consistent, some distractions or gaps in comments (not explaining important things)
* 1/4: Light commenting effort, not much time or energy appears to have been put in.
* 0/4: No original comments
* should I deserve nice things: my terminal is usually 90 characters wide. Keeping your code within 90 columns and avoiding line-wrapped comments, at least as far as is reasonable, are two sure-fire ways of making a good impression on me with respect to code presentation and comments.
* Sufficient comments explaining the point of provided logic **MUST** be present
* Code must be appropriately modified
* Appropriate modifications will be rated on the following scale (worth 3 total points):
* 3/3: Complete attention to detail, original-looking implementation
* 2/3: Lacking some details (like variable initializations), but otherwise complete (still conforms, or conforms mostly to specifications)
* 1/3: Incomplete implementation (typically lacking some obvious details/does not conform to specifications)
* 0/3: Incomplete implementation to the point of non-functionality (or was not started at all)
* Implementation must be accurate with respect to the spirit/purpose of the project (if the focus is on exploring a certain algorithm to produce results, but you avoid the algorithm yet still produce the same results-- that's what I'm talking about here). Worth 3 total points:
* 3/3: Implementation is in line with spirit of project
* 2/3: Some avoidance/shortcuts taken (note this does not mean optimization-- you can optimize all you want, so long as it doesn't violate the spirit of the project).
* 1/3: Generally avoiding the spirit of the project (new, different things, resorting to old and familiar, despite it being against the directions)
* 0/3: entirely avoiding.
* Error checking must be adequately and appropriately performed, according to the following scale (worth 3 total points):
* 3/3: Full and proper error checking performed for all reasonable cases, including queries for external resources and data.
* 2/3: Enough error checking performed to pass basic project requirements and work for most operational cases.
* 1/3: Minimal error checking, code is fragile (code may not work in full accordance with project requirements)
* 0/3: No error checking (code likely does not work in accordance with project requirements)
* Track/version the source code in a repository
* Submit a copy of your source code to me using the **submit** tool (**make submit** will do this) by the deadline.