Project: RUN-LENGTH ENCODING - DATA COMPRESSION FUN (dcf2)

Corning Community College

CSCS2330 Discrete Structures

Errata

Any changes that have been made.

<DESCRIPTION> (DATESTAMP)

Objective

To apply your skills in implementing an encoding scheme that, in ideal circumstances, will lead to a smaller storage footprint.

This week's algorithm: RLE+control_sequences

Last week's project dealt with the second version of this algorithm, implementing a configurable yet global stride value; this week we add another feature into the mix which should further improve overall efficiency, and perhaps even reduce wasted space.

The addition? The use of a control sequence byte.

Instead of assuming fixed count/value sequences throughout the data section of the file, we will be on the lookout for a special control sequence byte. Once we encounter it, we determine our count and stride, process that section, then revert to normal data.

In dcf0, our stride value was fixed to 1 byte. We could only count up sequences of single byte runs, which in some cases yielded compression; in others, not so much.

In dcf1, we added a configurable stride, which can then start counting up new sorts of data runs (such as when groups of two bytes may see strings of repetition, or 5 bytes, or 11 bytes).

In this project (dcf2), we still have a specified stride, but we eliminate unnecessary encodings when there's nothing of value to encode. As we know, some data encodes into RLE very effectively while other data does not, and our control sequence+stride will aid us in that endeavor.

The control sequence consists of a special byte, designated at run-time (for encode) or read out of an encoded file's header (for decode), followed by a 1 byte count and a 1 byte stride. In a similar manner to how we read the file name (read the file name length byte in the header, then proceed to count out the bytes that follow), we read the count and stride bytes, then count out the stride bytes of data that follow.

To demonstrate what RLE+control_sequence does, let us look at the following data:

aaaaaabcdbcdbcdddddddefghijklmnnnnnnowxyz (41 bytes total)

Encoding with our new algorithm, we would get the following (our control sequence byte will be a 2A, then followed by the count byte, then the stride byte, then the encoded data):

2A 06 01 61 62 63 64 62 63 64 62 63 2A 07 01 64 65 66 67 68 69 6A 6B 6C 6D 2A 06 01 6E 6F 77 78 79 7A
(34 bytes total)

The advantage here is that if we were to have a long sequence of non-patterned data, we don't have to bother with encoding… we just let it be. Once we find something interesting, then we break out the control byte, specify our count and our stride, then pack in our selected data.
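
To make the walk-through above concrete, here is a minimal sketch of how a decoder might expand such a data section once it is in memory. The name expand_stream and the buffer-based interface are assumptions made for illustration; they are not part of the skeleton files, and a real decode would first read past the header.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: expand an RLE+control_sequence data section held in
 * memory. Returns the number of bytes written to `dest`, or (size_t) -1 if
 * the input is truncated or `dest` is too small.
 */
size_t expand_stream(const unsigned char *data, size_t data_len,
                     unsigned char control,
                     unsigned char *dest, size_t dest_cap)
{
    size_t read_pos  = 0;
    size_t write_pos = 0;

    while (read_pos < data_len)
    {
        if (data[read_pos] != control)
        {
            /* raw byte: copy it through untouched */
            if (write_pos >= dest_cap)
                return (size_t) -1;
            dest[write_pos++] = data[read_pos++];
            continue;
        }

        /* control sequence: control byte, count, stride, then `stride` bytes */
        if (data_len - read_pos < 3)
            return (size_t) -1;
        size_t count  = data[read_pos + 1];
        size_t stride = data[read_pos + 2];
        read_pos += 3;

        if (stride == 0 || data_len - read_pos < stride)
            return (size_t) -1;
        if (dest_cap - write_pos < count * stride)
            return (size_t) -1;

        /* write the value `count` times; its bytes are never scanned for control */
        for (size_t repeat = 0; repeat < count; repeat++)
        {
            memcpy(dest + write_pos, data + read_pos, stride);
            write_pos += stride;
        }
        read_pos += stride;
    }

    return write_pos;
}
```

Fed the 34 encoded bytes above with a control byte of 0x2A, this sketch should hand back the original 41 bytes.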

dcfX RLE v3 specification

You'll be writing an encode and a decode program implementing RLE+control_sequences, in accordance with these published specifications (this way, anyone can take an RL-encoded file from someone else and decode it with theirs, and vice versa).

It is actually identical to the specifications of dcf1, save for the following changes:

we're placing a 3 in the version byte (byte 9)
we're placing our control byte in byte 8 (previously our reserved byte)

Every RL-encoded file will start with the following 12-byte header:

byte 0: 0x64
byte 1: 0x63
byte 2: 0x66
byte 3: 0x58
byte 4: 0x20
byte 5: 0x52
byte 6: 0x4c
byte 7: 0x45
byte 8: (previously reserved, now control byte)
byte 9: 0x03 (version of our RLE specification)
byte 10: 1 byte for stride value
byte 11: 1 byte for source file name's length (doesn't include NULL terminator)
bytes 12 through 12+length-1 (length as indicated in byte 11): ASCII string of the original filename to write. This will also be the name of the file decode creates, leaving the RL-encoded file intact.

Following this we will have the data section: raw (unencoded) bytes, interspersed with control sequences (the control byte, a count of length 1, a stride of length 1, then a value of length stride), continuing until the end of the file.
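
As a rough illustration of the header layout above, the following sketch fills in a v3 header and classifies one on the way back in. The names build_header, check_header, DCFX_MAGIC, and DCFX_FIXED_SIZE are hypothetical, and the return codes are an assumption; the messages in the comments are the ones this project asks decode to print.

```c
#include <stddef.h>
#include <string.h>

#define DCFX_MAGIC      "dcfX RLE"   /* bytes 0-7: 64 63 66 58 20 52 4c 45 */
#define DCFX_FIXED_SIZE 12           /* bytes 0-11 precede the file name */

/*
 * Hypothetical helper: fill `header` with a v3 header followed by the file
 * name. Returns the total bytes written, or 0 if the name does not fit in a
 * one-byte length or the buffer is too small.
 */
size_t build_header(unsigned char *header, size_t header_cap,
                    unsigned char control, unsigned char stride,
                    const char *file_name)
{
    size_t name_len = strlen(file_name);

    if (name_len > 255 || header_cap < DCFX_FIXED_SIZE + name_len)
        return 0;

    memcpy(header, DCFX_MAGIC, 8);
    header[8]  = control;                    /* previously the reserved byte */
    header[9]  = 0x03;                       /* version of the specification */
    header[10] = stride;
    header[11] = (unsigned char) name_len;   /* no NULL terminator stored */
    memcpy(header + DCFX_FIXED_SIZE, file_name, name_len);

    return DCFX_FIXED_SIZE + name_len;
}

/*
 * Hypothetical helper: classify the fixed part of a header. Returns the
 * version (1, 2, or 3), -1 for a bad signature, -2 for an unknown version.
 */
int check_header(const unsigned char *header, size_t header_len)
{
    if (header_len < DCFX_FIXED_SIZE || memcmp(header, DCFX_MAGIC, 8) != 0)
        return -1;   /* "invalid data format detected! aborting process..." */
    if (header[9] < 1 || header[9] > 3)
        return -2;   /* "version mismatch!" */
    return header[9];
}
```

For a file named sample2.bmp this would produce 23 bytes: the 12 fixed bytes plus the 11-byte name.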

Program

It is your task to write an encoder and decoder for this specification of the dcfX RLE v3 format:

encode.c: read in source data, encode according to specifications
decode.c: read in RL-encoded data, decode to produce original data

Your program should:

for encode, obtain 4 parameters from the command-line (see command-line arguments section below):
- argv[1]: name of path + source file
  - this should be a file that exists, but you should do appropriate error checking and bail out if the file cannot be accessed
- argv[2]: path of destination location (NOT file NAME, just PATH). The destination file name will be generated, based on the source file and whether we are encoding or decoding:
  - if encoding, destination is argv[2] + argv[1] (minus source path) + “.rle”
  - if decoding, destination is argv[2] + embedded file name (which shouldn't have an “.rle” per project specifications)
- argv[3]: stride value
  - this is a value between 1 and 255 (inclusive).
    - be sure to do error checking to make sure it is in this range.
- argv[4]: control byte
  - this is a value between 0 and 255 (inclusive).
for encode, the output file will be the filename specified in argv[1] with an “.rle” suffix appended to the end. This adds a certain universal aspect to how we'll go about naming things (tar and gzip do this too).
for decode, only accept 2 arguments: argv[1] should be an RL-encoded file, and argv[2] the destination path.
- the output file name will be obtained from parsing the file's header data.
- be sure to perform appropriate error checking and bail out as needed.
- if the version byte in the header is 1, assume a 1 byte stride and decode; be sure to report RLE version 1 in output
  - this way, v3 decode will be backwards compatible with v1 encode
- if the version byte in the header is 2, check the stride byte and decode according to RLE v2 (report as version 2 in output)
  - once again, baking in backwards compatibility
implement the specified algorithm in both encoding and decoding forms.
- please be sure to test it against varying types of data, to make sure it works no matter what you throw at it.
calculate and display some statistics gleaned during the performance of the process.
- for example, encode should display information on:
  - how many bytes read in
  - how many bytes written out
  - control byte used (display the hex value)
  - stride
  - compression rate
- decode should also display:
  - RLE header information
  - filename information
  - control byte used (display the hex value)
  - stride
  - how many bytes read in
  - how many bytes written out
  - decompression rate
- see the sample program outputs below
display errors to STDERR
display run-time information to STDOUT
your RL-encoded data MUST be conformant to the project specifications described above.
- you should be able to encode/decode a set of data with 100% retrieval rate. No data should be lost.
decode should validate the header information (is it encoded in version 1, 2, or 3? if not, complain to STDERR of “version mismatch!” and exit).
- if the first 8 bytes of the header do not check out, error out with an “invalid data format detected! aborting process…” message to STDERR.
- if the file only contains a header (and no encoded data), report to STDERR “empty data segment” and exit.
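
The destination naming rule for encode in the list above (argv[2] + source name minus its path + “.rle”) could be sketched as follows. The helper name build_encode_name is hypothetical, and two assumptions are baked in: '/' is the path separator, and a '/' is always placed between the destination path and the file name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<dest_path>/<base name of source>.rle" in
 * `output`. Returns 0 on success, -1 if the result would not fit.
 */
int build_encode_name(char *output, size_t output_cap,
                      const char *source_file, const char *dest_path)
{
    /* drop any leading directories from the source file */
    const char *base_name = strrchr(source_file, '/');
    base_name = (base_name != NULL) ? base_name + 1 : source_file;

    /* snprintf reports the length it wanted, so truncation is detectable */
    int needed = snprintf(output, output_cap, "%s/%s.rle", dest_path, base_name);

    return (needed < 0 || (size_t) needed >= output_cap) ? -1 : 0;
}
```

With a source of sample2.bmp and a destination path of ".", this yields ./sample2.bmp.rle.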

Other specification details

end of file bytes should no longer be tricky in v3, simply treat them as unencoded data
your program should be mindful of data reads/writes that may hit the end of file marker
ALL variables must be well and pertinently named, and be no fewer than 4 symbols in length
your program also should avoid writing the EOF byte as a data byte (or reading it and processing it as anything other than a marker to stop reading).
Since we are using single bytes to store our counts, one needs to be mindful not to allow the byte value to “roll over”; limit your counts to 255.
the feof(3) function can be of great use.
what if the data contains, as legitimate data, the same byte we're using as our control byte? Simple: escape it with the control byte, then a count, stride, and the data. Your code should not be checking for control bytes in the middle of an encoded data packet, merely in raw data. This escape should be the only instance of a 01 count, 01 stride control sequence…
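
Pulling the details above together, one possible shape for the encoding loop is sketched below. The names encode_stream and count_repeats are hypothetical, the whole input is assumed to be in memory already, and the rule for when a run is "worth" encoding (only when the packet is smaller than the raw bytes it replaces) is an assumption rather than something these specifications dictate.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count how many times the `stride`-byte chunk at `src`
 * repeats back to back, capped at 255 so the count cannot roll over. */
static size_t count_repeats(const unsigned char *src, size_t remaining, size_t stride)
{
    size_t repeats = 0;

    while (repeats < 255 && (repeats + 1) * stride <= remaining &&
           memcmp(src, src + repeats * stride, stride) == 0)
        repeats++;

    return repeats;
}

/* Hypothetical helper: returns bytes written to `dest`, or (size_t) -1 on error. */
size_t encode_stream(const unsigned char *src, size_t src_len,
                     unsigned char control, unsigned char stride,
                     unsigned char *dest, size_t dest_cap)
{
    size_t read_pos  = 0;
    size_t write_pos = 0;

    if (stride == 0)
        return (size_t) -1;

    while (read_pos < src_len)
    {
        size_t repeats = count_repeats(src + read_pos, src_len - read_pos, stride);

        if (repeats * stride > 3u + stride)
        {
            /* worthwhile run: control byte, count, stride, one copy of the chunk */
            if (dest_cap - write_pos < 3u + stride)
                return (size_t) -1;
            dest[write_pos++] = control;
            dest[write_pos++] = (unsigned char) repeats;
            dest[write_pos++] = stride;
            memcpy(dest + write_pos, src + read_pos, stride);
            write_pos += stride;
            read_pos  += repeats * stride;
        }
        else if (src[read_pos] == control)
        {
            /* literal control byte: escape it as a 01 count, 01 stride sequence */
            if (dest_cap - write_pos < 4)
                return (size_t) -1;
            dest[write_pos++] = control;
            dest[write_pos++] = 0x01;
            dest[write_pos++] = 0x01;
            dest[write_pos++] = control;
            read_pos++;
        }
        else
        {
            /* nothing worth encoding: pass the byte through as raw data */
            if (write_pos >= dest_cap)
                return (size_t) -1;
            dest[write_pos++] = src[read_pos++];
        }
    }

    return write_pos;
}
```

Under that threshold a 01 count, 01 stride packet can only come from the escape branch, and a run longer than 255 chunks is simply split across several packets.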

Grabit Integration

For those familiar with the grabit tool on lab46, I have made some skeleton files and a custom Makefile available for this project.

To “grab” it:

lab46:~/src/discrete$ grabit discrete dcf2

Just another “nice thing” we deserve.

NOTE: You do NOT want to do this on a populated dcf2 project directory; it will overwrite files. Only do this on an empty directory.

Makefile fun

With the Makefile, we have your basic compile and clean-up operations:

make: compile everything
make debug: compile everything with debug support
make clean: remove all binaries
make getdata: re-obtain a fresh copy of project data files
make save: make a backup of your project
make submit: submit project (uses submit tool)

Command-Line Arguments

setting up main()

To accept (or rather, to gain access to) arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do. I like the traditional argc and argv names, although it is also common to see them abbreviated as ac and av.

Please declare your main() function as follows:

int main(int argc, char **argv)

The arguments are accessible via the argv array, in the order they were specified:

argv[0]: program invocation (path + program name)
argv[1]: our input file
argv[2]: our output path
argv[3]: our stride value (1-255)
argv[4]: our control sequence byte (0-255)

Simple argument checks

Although I'm not going to require extensive argument parsing or checking for this project, we should check to see if the minimal number of arguments has been provided:

    if (argc < 3)  // argc counts argv[0], so this rejects fewer than 2 arguments
    {              // (enough for decode; encode takes 4, so it would test argc < 5)
        fprintf(stderr, "Not enough arguments!\n");
        exit(1);
    }
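
Beyond counting arguments, the stride and control byte each need to land in a range. A small sketch of that check follows; the helper name parse_byte_arg is hypothetical, and it assumes the values are given in decimal (the sample encode invocation passes 37 and reports 0x25, which suggests as much).

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert a command-line string to a number in
 * [lowest, highest], where highest is assumed to be at most 255.
 * Returns 0 and stores the result on success, -1 otherwise.
 */
int parse_byte_arg(const char *text, long lowest, long highest,
                   unsigned char *result)
{
    char *end_ptr = NULL;

    errno = 0;
    long parsed = strtol(text, &end_ptr, 10);

    if (errno != 0 || end_ptr == text || *end_ptr != '\0')
        return -1;   /* empty, overflowing, or followed by stray characters */
    if (parsed < lowest || parsed > highest)
        return -1;   /* outside the allowed range */

    *result = (unsigned char) parsed;
    return 0;
}
```

Encode might then call parse_byte_arg(argv[3], 1, 255, &stride) and parse_byte_arg(argv[4], 0, 255, &control), printing to STDERR and exiting if either returns -1.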

Execution

Your program output should be as follows (given the specific input):

Encode

lab46:~/src/discrete/dcf2$ ./encode sample2.bmp . 3 37
dcfX v3 encode details
==================================
input name length: 11 bytes
   input filename: sample2.bmp
  output filename: ./sample2.bmp.rle
     control byte: 0x25
     stride value: 3 bytes
          read in: 250934 bytes
        wrote out: 112390 bytes
 compression rate: 55.21%
lab46:~/src/discrete/dcf2$

With various formats, you'll likely want to play with the stride in order to find better compression scenarios.

Decode

lab46:~/src/discrete/dcf2$ ./decode sample5.txt.rle .
    input filename: sample5.txt.rle
output name length: 11 bytes
   output filename: ./sample5.txt
       header text: dcfX RLE v3
      control byte: 0x29
      stride value: 4 bytes
           read in: 2734 bytes
         wrote out: 3600 bytes
    inflation rate: 24.06%
lab46:~/src/discrete/dcf2$
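
Both sample rates above are consistent with taking the larger of the two byte counts as the baseline and reporting how much smaller the other one is. That formula is inferred from the sample numbers rather than stated by the specifications, and the helper name percent_saved is hypothetical.

```c
/* Hypothetical helper: percentage by which `smaller` falls short of `larger`. */
double percent_saved(unsigned long larger, unsigned long smaller)
{
    if (larger == 0)
        return 0.0;   /* avoid dividing by zero on empty input */

    return 100.0 * (1.0 - (double) smaller / (double) larger);
}
```

For encode that would be percent_saved(bytes read, bytes written); for decode, percent_saved(bytes written, bytes read). Printing the result with a "%.2f%%" format gives two decimal places, as in the samples.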

Check Results

A good way to test that both encode and decode are working is to encode data then immediately turn around and decode that same data. If the decoded file is in the same state as the original, pre-encoded file, you know things are working.

If you'd like to verify your implementations beyond simply encoding (moving the original file out of the way) and then decoding, you can use the md5sum tool to verify an exact match.

Run it on the original unencoded file, then run it on the decoded file… the md5sum hashes should match.

The diff(1) tool will also likely work well enough for our endeavors here.

Submission

Project Submission

To submit this program to me using the submit tool, run the following command at your lab46 prompt:

lab46:~/src/discrete/dcf2$ make submit
removed 'decode'
removed 'encode'
removed 'errors'

Project backup process commencing

Taking snapshot of current project (dcf2)      ... OK
Compressing snapshot of dcf2 project archive   ... OK
Setting secure permissions on dcf2 archive     ... OK

Project backup process complete

Submitting discrete project "dcf2":
    -> ../dcf2-DATESTRING-HOUR.tar.gz(OK) 

SUCCESSFULLY SUBMITTED

You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or locational mismatches.

What I will be looking for:

234:dcf2:final tally of results (234/234)
*:dcf2:encode.c compiles cleanly, no compiler messages [13/13]
*:dcf2:encode.c consistent indentation throughout code [13/13]
*:dcf2:encode.c relevant how and why comments in code [13/13]
*:dcf2:encode.c implementation flexible, not hardcoded [13/13]
*:dcf2:encode.c implementation free from restrictions [13/13]
*:dcf2:encode.c conforms to project specifications [13/13]
*:dcf2:encode runtime output conforms to specifications [13/13]
*:dcf2:encode verification tests succeed [13/13]
*:dcf2:decode.c compiles cleanly, no compiler messages [13/13]
*:dcf2:decode.c consistent indentation throughout code [13/13]
*:dcf2:decode.c relevant how and why comments in code [13/13]
*:dcf2:decode.c implementation flexible, not hardcoded [13/13]
*:dcf2:decode.c implementation free from restrictions [13/13]
*:dcf2:decode.c conforms to project specifications [13/13]
*:dcf2:decode runtime output conforms to specifications [13/13]
*:dcf2:decode verification tests succeed [13/13]
*:dcf2:project committed and pushed to lab46 repository [26/26]

Additionally:

Solutions not abiding by spirit of project will be subject to a 25% overall deduction
Solutions not utilizing descriptive why and how comments will be subject to a 25% overall deduction
Solutions not utilizing indentation to promote scope and clarity will be subject to a 25% overall deduction
Solutions not organized and easy to read are subject to a 25% overall deduction
