Corning Community College

CSCS2330 Discrete Structures

Project: RUN-LENGTH ENCODING - DATA COMPRESSION FUN (dcf1)

Objective

To apply your skills in implementing an encoding scheme that, in ideal circumstances, will lead to a smaller storage footprint.

This week's algorithm: RLE+stride

Last week's project dealt with the first version of this algorithm; this week we add another variable into the mix which can fundamentally change the effectiveness of compressing data.

The algorithm we are implementing this week is a minor tweak of last week's RLE algorithm.

The change? A configurable stride value.

Google defines stride as: a long, decisive step.

In dcf0, our stride value was fixed to 1 byte. We could only count up sequences of single byte runs, which in some cases yielded compression; in others, not so much.

With a configurable stride, we can start counting new sorts of data runs, such as repetitions of 2-byte groups, or of 5-byte or 11-byte groups.

To demonstrate what RLE+stride does, let us look at the following data:

aaaaaabcdbcdbcdddddddefghhhhhhhhhhhh (36 bytes total)

Encoding with RLE+1 (1 byte stride), we would get the following (this is what your dcf0 programs should be producing if fed this data):

Encoding with RLE+2 (2 byte stride), we would get the following:

Notice here with a stride of 2 bytes, we still have the singular count byte, but that is then followed by TWO BYTES of what the count is keeping track of.

In this example we actually gained some ground over the 1 byte stride. Not much, but let's see if we can improve it somewhat.

Encoding with RLE+3 (3 byte stride):

We see here a 1 byte count byte, followed by 3 bytes of value.

… a bigger savings. This was possible because the source data contained many 3-byte sequences, which let a 3-byte stride work particularly well. What is neat is that different types and formats of data have patterns that fit different strides better.

dcfX RLE v2 specification

You'll be writing an encode and a decode program implementing RLE+stride, in accordance with these published specifications (this way, anyone can take an RL-encoded file from someone else and decode it with theirs, and vice versa).

It is actually identical to the specifications of last week, save for two changes:

  1. we're no longer hard-coding the stride value to 1 (byte 10)
  2. we're placing a 2 in the version byte (byte 9)

Every RL-encoded file will start with the following 12-byte header:

Following this we will have a repeating sequence of count and value fields (where count is still of length 1, but value is of length stride), continuing until the end of the file.

Program

It is your task to write an encoder and decoder for this specification of the dcfX RLE v2 format:

  1. encode.c: read in source data, encode according to specifications
  2. decode.c: read in RL-encoded data, decode to produce original data

Your program should:

Other specification details

Grabit Integration

For those familiar with the grabit tool on lab46, I have made some skeleton files and a custom Makefile available for this project.

To “grab” it:

lab46:~/src/discrete$ grabit discrete dcf1
make: Entering directory '/var/public/summer2017/discrete/dcf1'

‘/var/public/summer2017/discrete/dcf1/Makefile’ -> ‘/home/USERNAME/src/discrete/dcf1/Makefile’
‘/var/public/summer2017/discrete/dcf1/encode.c’ -> ‘/home/USERNAME/src/discrete/dcf1/encode.c’
‘/var/public/summer2017/discrete/dcf1/decode.c’ -> ‘/home/USERNAME/src/discrete/dcf1/decode.c’
‘/var/public/summer2017/discrete/dcf1/data/sample0.txt’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample0.txt’
‘/var/public/summer2017/discrete/dcf1/data/sample1.txt’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample1.txt’
‘/var/public/summer2017/discrete/dcf1/data/sample2.bmp’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample2.bmp’
‘/var/public/summer2017/discrete/dcf1/data/sample3.wav’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample3.wav’
‘/var/public/summer2017/discrete/dcf1/data/sample4.bmp.rle’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample4.bmp.rle’
‘/var/public/summer2017/discrete/dcf1/data/sample5.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample5.txt.rle’
‘/var/public/summer2017/discrete/dcf1/data/sample6.mp3.rle’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample6.mp3.rle’
‘/var/public/summer2017/discrete/dcf1/data/sample7.txt.rle’ -> ‘/home/USERNAME/src/discrete/dcf1/data/sample7.txt.rle’

make: Leaving directory '/var/public/summer2017/discrete/dcf1'
lab46:~/src/discrete$ cd dcf1
lab46:~/src/discrete/dcf1$ ls
Makefile          data            decode.c            encode.c
lab46:~/src/discrete/dcf1$ 

Just another “nice thing” we deserve.

NOTE: You do NOT want to do this on a populated dcf1 project directory, as it will overwrite files. Only do this on an empty directory.

Makefile fun

With the Makefile, we have your basic compile and clean-up operations:

Command-Line Arguments

setting up main()

To accept (or rather, to gain access to) arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do. I like the traditional argc and argv names, although it is also common to see them abbreviated as ac and av.

Please declare your main() function as follows:

int main(int argc, char **argv)

The arguments are accessible via the argv array, in the order they were specified:

Simple argument checks

Although I'm not going to require extensive argument parsing or checking for this project, we should check that the minimum number of arguments has been provided:

    if (argc < 3)  // argc counts the program name, so fewer than 2 user arguments were given
    {
        fprintf(stderr, "Not enough arguments!\n");
        exit(1);
    }

Execution

Your program output should be as follows (given the specific input):

Encode

lab46:~/src/discrete/dcf1$ ./encode data/sample2.bmp 37
input name length: 16 bytes
   input filename: data/sample2.bmp
  output filename: data/sample2.bmp.rle
     stride value: 37 bytes
          read in: 250934 bytes
        wrote out: 183758 bytes
 compression rate: 26.77%
lab46:~/src/discrete/dcf1$ 

With various formats, you'll likely want to play with the stride in order to find better compression scenarios.

Decode

lab46:~/src/discrete/dcf1$ ./decode data/sample5.txt.rle
    input filename: data/sample5.txt.rle
output name length: 11 bytes
   output filename: sample5.txt
       header text: dcfX RLE v2
      stride value: 4 bytes
           read in: 3093 bytes
         wrote out: 3600 bytes
    inflation rate: 14.08%
lab46:~/src/discrete/dcf1$ 

Check Results

A good way to test that both encode and decode are working is to encode data then immediately turn around and decode that same data. If the decoded file is in the same state as the original, pre-encoded file, you know things are working.

If you'd like to verify your implementations beyond simply encoding (moving the original file out of the way) and then decoding, you can use the md5sum tool to confirm an exact match.

Run it on the original unencoded file, then run it on the decoded file… the md5sum hashes should match.

The diff(1) tool will also likely work well enough for our endeavors here.

Submission

Project Submission

To submit this program to me using the submit tool, run the following command at your lab46 prompt:

lab46:~/src/discrete/dcf1$ make submit
removed 'decode'
removed 'encode'
removed 'errors'

Project backup process commencing

Taking snapshot of current project (dcf1)      ... OK
Compressing snapshot of dcf1 project archive   ... OK
Setting secure permissions on dcf1 archive     ... OK

Project backup process complete

Submitting discrete project "dcf1":
    -> ../dcf1-DATESTRING-HOUR.tar.gz(OK) 

SUCCESSFULLY SUBMITTED

You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or location mismatches.

Submission Criteria

To be successful in this project, the following criteria must be met: