Corning Community College
CSCS1320 C/C++ Programming
======Project: C Binary Fun (cbf0)======
=====Objective=====
To practice manipulating binary data in a C program (for fun and glory).
=====Background=====
We've had a newfound exposure to data this semester, and how programming languages like C interpret different forms of data; we look at things like data type and corresponding storage allocated, and we use these as ingredients in our programmatic solutions.
Yet we've also inserted a layer of abstraction between us and the computer: integers, floating point values, and ASCII characters... each with its own ways of being accessed and manipulated.
The thing is: to the computer all data is largely the same: sequences of 1s and 0s, accessed in units of bytes.
This project will expose us to some of the underlying aspects of the realm of data that lies closer to the computer, that of "binary data", where we often come in contact with hexadecimal values to aid us in the interaction.
So, as we are here to learn more about the computer, it only makes sense to steer some of our activities towards the manipulation of binary data as well- one cannot effectively solve a whole domain of problems if they have no idea how to work with it.
This project aims to ameliorate that.
Binary data merely refers to data as the computer stores it. The computer is a binary device, so its raw data (as it exists on various forms of storage and media) is often referred to as binary data, to reflect the 1s and 0s being represented.
The data we have become familiar with is textual data. We read from and write to files (even with those files commonly being the keyboard and screen) with the express purpose of retrieving or storing text with them. And with the use of various text processing tools, we can easily manipulate these text files.
But: did you know that all text data is also binary data?
The trick to remember is that the converse is not always true: not all binary data is text. In fact, most of it isn't. Text represents a very narrow range of possible data values, and then only within a certain context. You may "see" random letters when viewing binary data, but there is no continuity. The data values that we utilize when interacting with text are also valid combinations of binary values, which can mean almost anything.
So, text is really ONE (of many) possible representations of binary data. We need to gain a wider perspective and get more familiar with this more expansive and general notion of binary data.
The computer works in units of **bytes**, which these days means groups of 8 bits. C has the ability to arbitrarily read and write individual bytes of data, and we will want to make use of that to aid us in our current task.
=====Opening and reading from files=====
The nice thing about C is that it tends to embody the "everything is a file" mantra from UNIX.
What this means, basically, is that interacting with data in a file is really no different than interacting with data from the keyboard or data to the screen. We merely need a FILE pointer and appropriate resources allocated.
To interact with a file, we must first declare a pointer to type FILE that will be our point of transaction.
Common names for our file pointer variable are **fp**, **fPtr**, **input**, **inp**, but in reality it can be anything you want.
The intention is that, of course, you name variables so they are meaningful in the context of the overall implementation.
FILE *input = NULL;
====Opening a file with fopen()====
To attach a file stream to a FILE pointer, we utilize a file opening function such as **fopen()**.
It takes two arguments:
- the path and name of file we wish to open (provided as a string)
- the mode we wish to open the file as (provided as a string)
There are 3 common file opening modes (and combinations thereof, among others, sometimes dependent on the particular operating system being run). For now, I highly recommend just sticking to ONE mode of operation per FILE pointer. This can avoid messy things like data corruption and indirect logic/runtime errors.
The 3 file modes:
* r - open file for reading (start at beginning)
* w - open file for writing (start at beginning)
* a - open file for appending (add to end)
If we wanted to open the file "**sample0.txt**" in the **in/** sub-directory off our current working directory for **reading**, using the file pointer **input**, we would do the following:
input = fopen ("in/sample0.txt", "r");
Note the double quotes around each argument. Both need to be strings (i.e. arrays of char terminated by a NUL terminator character, '\0'), and the double quotes make them so.
====Reading from the file====
If the file is filled with a set format of data you'd like to retrieve, such as one short integer per line (basically, a text file filled with numbers), we can just use our trusty and familiar **fscanf()** function. We merely have to indicate the correct file pointer:
short int value = 0;
...
fscanf (input, "%hd", &value);
If there is no simple universal "format" to the file, or if the raw information in the file is what we are interested in, we instead need to look at it as a consecutive collection of bytes, and grab a char's worth of data at a time. I would recommend starting out by treating a file like this as a byte-by-byte (char-by-char) endeavor; hold off on transacting with groups of bytes until you get the process down with individual chars.
The **fscanf()** function is still viable here, but if all we're after is a char value, there's a special purpose input function we can use instead: **fgetc()**
To read a byte of data from a file and store it in our variable (called byte), we would do the following:
int byte = 0; // int rather than char: fgetc() returns an int so that EOF can be told apart from a valid byte
...
byte = fgetc (input);
The **fgetc()** function takes the intended FILE pointer it is to read from as its argument, so **input** should be a FILE pointer AND should have previously been **fopen()**'ed (and for **reading**!) prior to calling **fgetc()**.
To make things easier, placing logic to read from a file in a loop can be a very powerful combination.
=====Task=====
Your task is to write a hex viewer, along the lines of the **xxd(1)** tool found on the system.
=====Experiencing xxd=====
If we don't know what it is we are implementing, we won't be all that successful. So, here's a quick overview of the **xxd(1)** tool we will be simulating aspects of; first up, a plain text look at a data file we will be processing:
lab46:~/src/SEMESTER/DESIG/cbf0$ cat in/sample0.txt
>ABCDEFGHIJKLMNOPQRSTUVWXYZ<
[abcdefghijklmnopqrstuvwxyz]
01:              BINARY
01234567:        OCTAL
0123456789:      DECIMAL
0123456789ABCDEF:HEXADECIMAL
)!@#$%^&*(
.
lab46:~/src/SEMESTER/DESIG/cbf0$
Note how it is filled with ASCII text- many of our recognizable symbols we use when using a text editor.
But, to illustrate how text is just a form of binary, witness what we are shown when we peel away a layer, and view the binary data (represented in hex for convenience) of that same file:
lab46:~/src/SEMESTER/DESIG/cbf0$ xxd in/sample0.txt
00000000: 3e41 4243 4445 4647 4849 4a4b 4c4d 4e4f  >ABCDEFGHIJKLMNO
00000010: 5051 5253 5455 5657 5859 5a3c 0a5b 6162  PQRSTUVWXYZ<.[ab
00000020: 6364 6566 6768 696a 6b6c 6d6e 6f70 7172  cdefghijklmnopqr
00000030: 7374 7576 7778 797a 5d0a 3031 3a20 2020  stuvwxyz].01:   
00000040: 2020 2020 2020 2020 2020 2042 494e 4152             BINAR
00000050: 590a 3031 3233 3435 3637 3a20 2020 2020  Y.01234567:     
00000060: 2020 204f 4354 414c 0a30 3132 3334 3536     OCTAL.0123456
00000070: 3738 393a 2020 2020 2020 4445 4349 4d41  789:      DECIMA
00000080: 4c0a 3031 3233 3435 3637 3839 4142 4344  L.0123456789ABCD
00000090: 4546 3a48 4558 4144 4543 494d 414c 0a29  EF:HEXADECIMAL.)
000000a0: 2140 2324 255e 262a 280a 2e0a            !@#$%^&*(...
lab46:~/src/SEMESTER/DESIG/cbf0$
The EXACT same file, with the EXACT same arrangement of data, only represented more as the computer looks at it (sequentially, one byte immediately following the next).
The output of **xxd(1)** has 3 distinct sections:
- the address or offset (from the start of file). This is a hexadecimal address, starting at 0 (beginning of the file), and increments according to the number of bytes displayed. You'll notice that there are (at maximum) the same number of bytes on each line, so the offset increments by that amount with each new line it displays.
- the actual data (represented in hex); here we see 8 columns of hex values, grouped together in pairs of two bytes (other hex viewers may separate into 16 columns, isolating each byte for better viewing).
- the ASCII rendering (far right field); if we are viewing an ASCII file, we will easily see the ASCII contents of this file. If we are viewing a non-ASCII file, we may still see random ASCII values, but that is just that the value stored in the particular byte maps to that ASCII value, and should NOT be considered actual ASCII data.
This is one of those conceptual roadblocks many develop- they think that binary is somehow more complicated than it is, and create all sorts of obstacles to effective access. Here we will try to break down some of those walls, because this is really important stuff to know.
Your task is to write a C program that takes a file name as a command-line argument, opens that file, reads its contents, and displays that data to the screen in the manner that the **xxd(1)** tool does in the above example (note that while the **xxd(1)** tool has other features, we are not looking to implement them; only this simple rendering view).
Your program must:
* Require the user supply a file via the command-line
* if the file specified does not exist/cannot be opened, display an error message and exit.
* error message should be of the form: **Error: Could not open 'filename' for reading!**
* Where **filename** is the name of the file specified on the command-line (make sure the quotes surround it in the output).
* no further processing should be done if the file is not able to be accessed.
* Detect the current size of the terminal (see "Detecting Terminal Size" section below), and record the lines and columns into variables for use in your program.
* If the terminal your program is being run in is **less than** 80 columns, display an error message and exit.
* error message should be of the form: **Error: Terminal width is less than 80 columns!**
* Your program will only be displaying to an area up to 80 characters wide, so a wider terminal will not influence program output.
* Similarly, if the number of lines in the terminal is **less than** 20, display a similar error message and exit.
* error message should be of the form: **Error: Terminal height is less than 20 lines!**
* Unlike the width, the height can impact program output (taller terminals, if not otherwise throttled by a second command-line argument, can auto-expand if there is more room and data to display).
* The second command-line argument is a sizing throttle (controlling the number of lines your program will display). If no argument, or a **0** is given, assume autosize (use the detected height to be your maximum in your calculations).
* Each row will display:
* an 8-digit hex offset (referring to the first data byte on a given line)
* followed by a colon and a single space
* differently from **xxd(1)**: sixteen space-separated bytes, each shown as two hex digits (rather than eight two-byte groups)
* however you arrive at it: two total spaces following the hex bytes (again, see output example)
* a 16-character ASCII representation field (no separating spaces between the values)
* all printable characters should be displayed.
* all non-printable (and various whitespace) characters should be substituted with a '**.**'
* A newline will be the last character on each line.
* The hex values and rendered ASCII displayed will be sourced from the file specified on the command-line. While the target files for this project are less than 512 bytes, your program should be able to handle larger and smaller files, and update its display accordingly.
* If a line throttle is given, your program is to stop output of data and ASCII rendering at that line, once it completes.
* Once the data in the file has been exhausted, you need to wrap up as appropriate; finish the current line (even if you have to pad spaces), and display the corresponding ascii field (padding spaces as appropriate).
* Don't forget to **fclose()** any open file pointers! And **free()** any **malloc()**'ed or **calloc()**'ed memory.
Sample output of your program should be as follows (compared to the **xxd(1)** output above):
system:~/src/SEMESTER/DESIG/cbf0$ ./cbf0.$(uname -m) in/sample0.txt
00000000: 3e 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f  >ABCDEFGHIJKLMNO
00000010: 50 51 52 53 54 55 56 57 58 59 5a 3c 0a 5b 61 62  PQRSTUVWXYZ<.[ab
00000020: 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72  cdefghijklmnopqr
00000030: 73 74 75 76 77 78 79 7a 5d 0a 30 31 3a 20 20 20  stuvwxyz].01:   
00000040: 20 20 20 20 20 20 20 20 20 20 20 42 49 4e 41 52             BINAR
00000050: 59 0a 30 31 32 33 34 35 36 37 3a 20 20 20 20 20  Y.01234567:     
00000060: 20 20 20 4f 43 54 41 4c 0a 30 31 32 33 34 35 36     OCTAL.0123456
00000070: 37 38 39 3a 20 20 20 20 20 20 44 45 43 49 4d 41  789:      DECIMA
00000080: 4c 0a 30 31 32 33 34 35 36 37 38 39 41 42 43 44  L.0123456789ABCD
00000090: 45 46 3a 48 45 58 41 44 45 43 49 4d 41 4c 0a 29  EF:HEXADECIMAL.)
000000a0: 21 40 23 24 25 5e 26 2a 28 0a 2e 0a              !@#$%^&*(...
Or, if using a line throttle:
system:~/src/SEMESTER/DESIG/cbf0$ ./cbf0.$(uname -m) in/sample0.txt 4
00000000: 3e 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f  >ABCDEFGHIJKLMNO
00000010: 50 51 52 53 54 55 56 57 58 59 5a 3c 0a 5b 61 62  PQRSTUVWXYZ<.[ab
00000020: 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72  cdefghijklmnopqr
00000030: 73 74 75 76 77 78 79 7a 5d 0a 30 31 3a 20 20 20  stuvwxyz].01:   
=====Detecting Terminal Size=====
To detect the current size of your terminal, you may make use of the following code, provided in the form of a complete program for you to test, and then adapt into your code as appropriate.
It makes use of a **structure**, which we have not extensively covered yet, but the example shows you how you can make use of an existing struct, which is all you have to do in the program (we're just using it to retrieve information to help us on our program).
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

int main ()
{
    struct winsize terminal;
    ioctl (0, TIOCGWINSZ, &terminal);
    printf ("lines: %d\n", terminal.ws_row);
    printf ("columns: %d\n", terminal.ws_col);
    return (0);
}
An **ioctl(2)** is a method (and system/library call) for manipulating underlying device parameters of special files (for the UNIX people: everything is a file, including your keyboard, and **terminal screen**). We are basically querying the screen (or accessing lower level information made possible by communicating with the driver of the device) to obtain some useful information.
Here we are accessing the information on our terminal file, retrieving the width and height so that we can make use of them productively in our programs.
Compile and run the above code to see how it works. Try it in different size terminals. Then incorporate the logic into your hex viewer for this project.
=====Command-Line Arguments=====
====setting up main()====
To gain access to arguments given to your program at runtime, we need to specify two parameters to the main() function. While the names don't matter, the types do. I like the traditional **argc** and **argv** names, although it is also common to see them abbreviated as **ac** and **av**.
Please declare your main() function as follows:
int main(int argc, char **argv)
There are two very important variables involved here (again, it is the types that matter; the names are, well, variable):
* int argc: the count (an integer) of tokens given on the command line (program name + arguments)
* char **argv: an array of strings (technically a pointer to an array of pointers to char) that contains the various tokens provided on the command-line.
The arguments are accessible via the argv array, in the order they were specified:
* argv[0]: program invocation (path + program name)
* argv[1]: our first argument
* argv[2]: second argument
* argv[3]: third argument
* ...
* argv[N]: Nth argument
Additionally, let's not forget the **argc** variable, an integer, which contains a count of arguments (argc == argument count). If we provided argv[0] through argv[4], argc would contain a 5.
===example===
For example, if we were to execute a program as follows:
$ ./program word 73 another word
We'd have:
* argv[0]: "./program"
* argv[1]: "word"
* argv[2]: "73" (note not the integer number 73, but the string "73")
* argv[3]: "another"
* argv[4]: "word"
and let's not forget:
* argc: 5 (there are 5 things, argv indexes 0, 1, 2, 3, and 4)
=====Loops=====
A loop is basically instructing the computer to repeat a section, or block, of code a given number of times (it can be based on a fixed value-- repeat this 4 times, or be based on a conditional value-- keep repeating as long as (or while) this value is not 4).
Loops enable us to simplify our code-- allowing us to write a one-size-fits all algorithm (provided the algorithm itself can appropriately scale!), where the computer merely repeats the instructions we gave. We only have to write them once, but the computer can do that task any number of times.
Loops can be initially difficult to comprehend because unlike other programmatic actions, they are not single-state in nature-- loops are multi-state. What this means is that in order to correctly "see" or visualize a loop, you must analyze what is going on with EACH iteration or cycle, watching the values/algorithm/process slowly march from its initial state to its resultant state. Think of it as climbing a set of stairs... yes, we can describe that action succinctly as "climbing a set of stairs", but there are multiple "steps" (heh, heh) involved: we place our foot, adjust our balance-- left foot, right foot, from one step, to the next, to the next, allowing us to progress from the bottom step to the top step... that process of scaling a stairway is the same as iterating through a loop-- but what is important as we implement is what needs to happen each step along the way.
With that said, it is important to be able to focus on the process of the individual steps being taken. What is involved in taking a step? What constitutes a basic unit of stairway traversal? If that unit can be easily repeated for the next and the next (and in fact, the rest of the) steps, we've described the core process of the loop, or what will be iterated a given number of times.
In C and C-syntax influenced languages (C++, Java, PHP, among others), we typically have 3 types of loops:
* **for** loop (automatic counter loop, stepping loop; top-driven) - when we know exactly how many times we wish something to run; we know where we want to start, where we want to end, and exactly how to progress from start to end (step value)
* **while** loop (top-driven conditional loop) - when we want to repeat a process, but the exact number of iterations is either not known, not important, or variable in nature. While loops can run 0 or more times.
* **do-while** loop (bottom-driven conditional loop) - similar to the while loop, only we do the check for loop termination at the bottom of the loop, meaning it runs 1 or more times (a do-while loop is guaranteed to run at least once).
====for() loops====
A **for()** loop is the most syntactically unique of the loops, so care must be taken to use the proper syntax.
With any loop, we need (at least one) looping variable, which the loop will use to analyze whether or not we've met our looping destination, or to perform another iteration.
A for loop typically also has a defined starting point, a "keep-looping-while" condition, and a stepping equation.
Here's a sample for() loop, in C, which will display the squares of each number, starting at 0, and stepping one at a time, for 8 total iterations:
int i = 0;
for (i = 0; i < 8; i++)
{
    fprintf(stdout, "loop #%d ... %d\n", (i+1), (i*i));
}
The output of this code, with the help of our loop should be:
loop #1 ... 0
loop #2 ... 1
loop #3 ... 4
loop #4 ... 9
loop #5 ... 16
loop #6 ... 25
loop #7 ... 36
loop #8 ... 49
Note how we can use our looping variable (**i**) within mathematical expressions to drive a process along... loops can be of enormous help in this way.
And again, we shouldn't look at this as one step-- we need to see there are 8 discrete, distinct steps happening here (when i is 0, when i is 1, when i is 2, ... up until (and including) when i is 7).
The loop exits once **i** reaches a value of 8, because our loop determinant condition states that as long as **i** is **less than** **8**, continue to loop. Once **i** becomes **8**, our looping condition is no longer true, and the loop will no longer iterate.
The stepping (that third) field is a mathematical expression indicating how we wish for **i** to progress from its starting state (of being equal to 0) to the point where the loop's iterating condition fails (**i** no longer being less than 8).
**i++** is a shortcut we can use in C; the longhand (and likely more familiar) equivalent is: **i = i + 1**
====while() loops====
A **while()** loop isn't as specific about starting and stepping values, really only caring about what condition needs to be met in order to exit the loop (keep looping while this condition is true).
In actuality, anything we use a for loop for can be expressed as a while loop-- we merely have to ensure we provide the necessary loop variables and progressions within the loop.
That same loop above, expressed as a while loop, could look like:
int i = 0;
while (i < 8)
{
    fprintf(stdout, "loop #%d ... %d\n", (i+1), (i*i));
    i = i + 1; // I could have used "i++;" here
}
The output of this code should be identical, even though we used a different loop to accomplish the task (try them both out and confirm!)
**while()** loops, like **for()** loops, will run 0 or more times; if the conditions enabling the loop to occur are not initially met, they will not run... if met, they will continue to iterate until their looping conditions are no longer met.
It is possible to introduce a certain kind of **logical error** into your programs using loops-- what is known as an "infinite loop"; this is basically where you erroneously provide incorrect conditions to the particular loop used, allowing it to start running, but never arriving at its conclusion, thereby iterating forever.
Another common **logical error** that loops will allow us to encounter will be the "off by one" error-- where the conditions we pose to the loop are incorrect, and the loop runs one iteration more or fewer than we had intended. Again, proper debugging of our code will resolve this situation.
====do-while loops====
The third commonly recognized looping structure in C, the do-while loop is identical to the while() (and therefore also the for()) loop, only it differs in where it checks the looping condition: where **for()** and **while()** are "top-driven" loops (ie the test for loop continuance occurs at the top of the loop, **before** running the code in the loop body), the **do-while** is a "bottom-driven" loop (ie the test for loop continuance occurs at the bottom of the loop).
The placement of this test determines the minimal number of times a loop can run.
In the case of the for()/while() loops, because the test is at the top, if the looping conditions are not met, the loop may not run at all. This is why these loops can run "0 or more times".
For the do-while loop, because the test occurs at the bottom, the body of the loop (one full iteration) is run before the test is encountered. So even if the conditions for looping are not met, a do-while will run "1 or more times".
That may seem like a minor, and possibly annoying, difference, but in nuanced algorithm design, such distinctions can drastically change the layout of your code, potentially being the difference between beautifully elegant-looking solutions and those which appear slightly more hackish. They can BOTH be used to solve the same problems; it is merely the nature of how we choose to express the solution that should make one preferable over the other in any given moment.
I encourage you to intentionally try your hand at taking your completed programs and implementing other versions that utilize the other types of loops you haven't utilized. This way, you can get more familiar with how to structure your solutions and express them. You will find you tend to think in a certain way (from experience, we seem to get in the habit of thinking "top-driven", and as we're unsure, we tend to exert far more of a need to control the situation, so we tend to want to use **for** loops for everything-- but practicing the others will free your mind to craft more elegant and efficient solutions; but only if you take the time to play and explore these possibilities).
So, expressing that same program in the form of a do-while loop (note the changes from the while):
int i = 0;
do
{
    fprintf(stdout, "loop #%d ... %d\n", (i+1), (i*i));
    i = i + 1; // again, we could just as easily use "i++;" here
} while(i < 8);
In this case, the 0 or more vs. 1 or more minimal iterations wasn't important; the difference is purely syntactical.
With the do-while loop, we start the loop with a **do** statement.
Also, the do-while is the only one of our loops which NEEDS a terminating semi-colon (**;**).. please take note of this.
=====Process=====
In general, you will be looking to do something like the following:
address <- zero
byte <- readfromfile
ascii[zero] <- byte
count <- one
loop as long as there is still data in the file
    display the address in hex
    if count is equal to one
        display the byte in hex
    endif
    loop as long as count is less than sixteen
        byte <- readfromfile
        if there is still data in the file
            ascii[count] <- byte
            display the byte in hex
            let count increment by one
        else
            let pad be equal to count
            loop as long as pad is less than sixteen
                display spaces in place of the missing hex byte
                let pad increment by one
            endloop
            break from loop
        endif
    endloop
    loop index from zero to count
        if ascii[index] is a printable character
            display ascii[index] as an ASCII character
        else
            display the period symbol
        endif
    endloop
    display a newline
    let count be equal to zero
    let address be incremented by sixteen
endloop
=====Submission=====
To successfully complete this project, the following criteria must be met:
* Code must compile cleanly (no warnings or errors)
* Use the **-Wall** and **--std=gnu99** flags when compiling.
* Code must be nicely and consistently indented (you may use the **indent** tool)
* Code must utilize the algorithm/approach presented above
* Output **must** match the specifications presented above (when given the same inputs)
* Code must be commented
* be sure your comments reflect the **how** and **why** of what you are doing, not merely the **what**.
* Track/version the source code in a repository
* Submit a copy of your source code to me using the **submit** tool.
To submit this program to me using the **submit** target in the Makefile, run the following command at your lab46 prompt:
lab46:~/src/SEMESTER/DESIG/cbf0$ make submit
...
You should get some sort of confirmation indicating successful submission if all went according to plan. If not, check for typos and/or locational mismatches.
=====Evaluation Criteria=====
What I will be looking for:
65:cbf0:final tally of results (65/65)
*:cbf0:obtained via grabit by Sunday before deadline [5/5]
*:cbf0:program compiles successfully, no errors [5/5]
*:cbf0:program compiles with no warnings [5/5]
*:cbf0:program performs stated task/algorithm [5/5]
*:cbf0:program output conforms to formatting expectations [5/5]
*:cbf0:proper error checking and status reporting performed [5/5]
*:cbf0:code implements solution using relevant concepts [5/5]
*:cbf0:code updates committed/pushed to lab46 semester repo [5/5]
*:cbf0:code uses correct variable types and name lengths [5/5]
*:cbf0:project is submitted with relevant and complete source [5/5]
*:cbf0:project is submitted on lab46 using 'make submit' [5/5]
*:cbf0:project is submitted with pi and lab46 binaries [5/5]
*:cbf0:runtime tests of submitted program succeed [5/5]
Additionally:
* Solutions not abiding by **SPIRIT** of project will be subject to a 25% overall deduction
* Solutions not utilizing descriptive why and how **COMMENTS** will be subject to a 25% overall deduction
* Solutions not utilizing **INDENTATION** to promote scope and clarity will be subject to a 25% overall deduction
* Solutions lacking **ORGANIZATION** or that are not easy to read (within 90 char width) are subject to a 25% overall deduction