
Properly checking for the end of file when dealing with a binary file

Here is a copy of a reply I wrote to some questions I suspect many of you may be encountering when dealing with certain file aspects of discrete/bdt0 (the student's questions are interleaved with my answers):

I was having trouble finding EOF when reading from my input files last night. Neither the char nor the unsigned char data type seemed able to store EOF when I was reading it in with fgetc().

Yeah, you may notice that fgetc(3) will seem to read in a 0xFF as it encounters the EOF. The trick is, how are you to know THIS 0xFF is the EOF vs. any of the other 0xFF's potentially read in elsewhere in the file?

My preferred approach to this is to use fgetc(3) in combination with feof(3), which reports an underlying flag that stdio sets once a read has actually run into the end of the file (so the check belongs after the read attempt, not before it). This way, I don't have to bother interpreting whether this 0xFF is the critical “droids I'm looking for”; I let the stdio subsystem do it for me.
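A minimal sketch of that fgetc(3) plus feof(3) pattern follows. The function name count_ff_bytes and its total parameter are made up for illustration; it assumes the caller has already opened the stream in binary mode. Because the end of the file is detected through the stream's flags rather than through the byte value, the result can be stored in an unsigned char without trouble.

```c
#include <stdio.h>

/* Hypothetical helper: walk an already-open binary stream one byte at a
 * time, counting how many bytes it holds (*total) and how many of them
 * are 0xFF (return value). The end of the file is detected with feof(3),
 * never by inspecting the byte value itself. */
long count_ff_bytes(FILE *fp, long *total)
{
    long ff_count = 0;

    *total = 0;
    for (;;) {
        /* Attempt the read first; the flags are only set afterwards. */
        unsigned char byte = (unsigned char)fgetc(fp);

        if (feof(fp) || ferror(fp))   /* the read came up empty (or failed) */
            break;

        (*total)++;
        if (byte == 0xFF)             /* a real data byte, not the end of file */
            ff_count++;
    }
    return ff_count;
}
```

Note that the feof(3) check sits after the fgetc(3) call. Testing it before the read (as in a while (!feof(fp)) loop) tends to process one bogus extra byte, since the flag is not raised until a read has actually failed.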

Because of this my file reading loop check would never fail and continued to read values after the end of the file was reached. But once I changed the variable to an integer it recognized EOF and the check utility lit up nicely. Why is this?

I suspect you are playing with the symbol EOF? As in:

  if (inputvalue == EOF) // ?

According to a quick google search:

  • EOF is a macro which expands to an integer constant expression with type int and an implementation dependent negative value but is very commonly -1.

This would explain why I am seeing 0xFF when I hit EOF (when storing the result in a char), and also why it would be exceedingly difficult (well, impossible without using other factors to make my determination; I could not rely on EOF alone) to detect EOF when using fgetc(3) with char values.

But, to show why, we need to make sure we understand negative numbers, especially as they are represented on the computer. This is that whole “two's complement” thing. Let's take a look at unsigned vs. signed for a 3-bit storage quantity:

  bits  unsigned  signed
  ====  ========  ======
  000   0          0
  001   1         +1
  010   2         +2
  011   3         +3
  100   4         -4
  101   5         -3
  110   6         -2
  111   7         -1

As you can see, the -1 value in the signed column corresponds with all the bits being set in the value:

  • 11111111 would be -1 for a signed char,
  • 1111111111111111 would be -1 for a signed short int, and
  • 11111111111111111111111111111111 would be -1 for a signed int.
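The table and the all-ones patterns above can be reproduced with a small sketch. The helpers as_signed and print_table are hypothetical names introduced here, and the width is assumed to be between 1 and 31 bits.

```c
#include <stdio.h>

/* Hypothetical helper: interpret the low `width` bits of `bits` as a
 * two's complement number. Assumes 1 <= width <= 31. */
int as_signed(unsigned int bits, int width)
{
    unsigned int sign_bit = 1u << (width - 1);
    unsigned int mask = (sign_bit << 1) - 1u;   /* all `width` bits set */

    bits &= mask;
    if (bits & sign_bit)                        /* top bit set: negative */
        return (int)bits - (int)mask - 1;       /* same as bits - 2^width */
    return (int)bits;
}

/* Print every pattern of the given width alongside its unsigned and
 * signed interpretations. */
void print_table(int width)
{
    for (unsigned int bits = 0; bits < (1u << width); bits++) {
        for (int i = width - 1; i >= 0; i--)
            putchar(((bits >> i) & 1u) ? '1' : '0');
        printf("  %u  %+d\n", bits, as_signed(bits, width));
    }
}
```

Calling print_table(3) walks the eight 3-bit patterns, and as_signed(0xFF, 8) or as_signed(0xFFFF, 16) shows that an all-ones pattern comes out as -1 whatever the width.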

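Now take that all-ones byte out of a char and widen it back to an int, which is what happens when it gets compared against EOF. The result depends on the signedness of the char:

  • an unsigned char is zero-extended to 00000000000000000000000011111111, which an int sees as +255 (this is about the value, so endianness does not enter into it), and
  • a signed char is sign-extended to 11111111111111111111111111111111, which an int sees as -1.

So an unsigned char holding 0xFF is NOT the same value as (int)-1 and will never compare equal to EOF, which is why the loop check never failed. A signed char holding 0xFF does compare equal to EOF, but so does every genuine 0xFF data byte in the file, so when encountering a (char)-1 the best we can do is say “this COULD be an EOF, but we cannot conclusively prove such”. (Whether a plain char is signed or unsigned is up to the platform.)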
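To see how a value returned by fgetc(3) fares once it is stored in each type, here is a small sketch. The three helper names are hypothetical, and the comments assume the common case where EOF is -1 and out-of-range values wrap when converted to signed char (true of the usual gcc/clang platforms, but not guaranteed everywhere).

```c
#include <stdio.h>

/* Each hypothetical helper takes a value exactly as fgetc(3) would return
 * it (0 to 255 for a byte, or EOF), stores it in a particular type, and
 * reports whether the stored copy still compares equal to EOF. */

int eof_test_signed_char(int returned)
{
    signed char c = (signed char)returned;     /* 255 typically wraps to -1 */
    return c == EOF;                           /* c is sign-extended to int */
}

int eof_test_unsigned_char(int returned)
{
    unsigned char c = (unsigned char)returned; /* EOF (-1) wraps to 255 */
    return c == EOF;                           /* 255 never equals a negative EOF */
}

int eof_test_int(int returned)
{
    int c = returned;                          /* nothing is lost */
    return c == EOF;
}
```

Under those assumptions, the signed char version answers "yes" for both a 0xFF data byte and a real EOF, the unsigned char version answers "no" for both, and only the int version tells the two apart. A compiler may warn that the unsigned char comparison is always false, which is exactly the point.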

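And it works with an int as you describe because fgetc(3) actually returns an int, as its function prototype in the manual page shows: each byte comes back as an unsigned char converted to an int (0 through 255), while EOF comes back as a negative value that no byte can collide with. So any encountered EOF gets returned intact, and you can then do manual EOF detection.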
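For comparison, the manual EOF check with an int might look like the sketch below. The function name count_bytes is made up for illustration, and the stream is assumed to be already open in binary mode.

```c
#include <stdio.h>

/* Hypothetical helper: count the bytes in an already-open binary stream
 * by comparing fgetc(3)'s int return value against EOF directly. */
long count_bytes(FILE *fp)
{
    long count = 0;
    int ch;                          /* int, to match fgetc(3)'s return type */

    while ((ch = fgetc(fp)) != EOF)  /* a 0xFF byte arrives as 255, not EOF */
        count++;

    return count;
}
```

One caveat with this form: fgetc(3) also returns EOF on a read error, so a caller that cares about the difference can check ferror(3) or feof(3) after the loop ends.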

Yet, as I said, I try not to do EOF detection personally. I let the I/O subsystem of the standard library do it, via feof(3). Fewer mistakes made by me, and I can assume the people who wrote the library functions know a lot more about the details of everything than I do.

Certainly not wrong to do it the way you've done it (especially if it works!), just another way of going about it. I personally did manual EOF detection for years, occasionally running into these problems you mentioned, before discovering feof(3), and then generally switching to always using that due to perceived reliability.

And since I only JUST NOW, while writing this e-mail, discovered what EOF truly, actually is (a negative int, commonly -1), a lot of things suddenly make more sense. While I could now use EOF more confidently, I personally will still be using feof(3), because I am even more confident in its ability to be consistently correct no matter how I am personally interacting with my data.

haas/fall2020/common/helpances/binaryfileandeof.txt · Last modified: 2020/08/14 17:43 by wedge