Corning Community College

CSCS1730 UNIX/Linux Fundamentals


Project: UNIX DATA RECOVERY (udr2)

Errata

Typos and bug fixes:

<description>

Objective

Continuing our “1337 haxxing” series of projects, we've found that considerable self-imposed conceptual roadblocks keep us from employing otherwise simple computing properties (not only that everything is a file, but that files are fundamentally a series of bytes). The sooner we accept this truth, the sooner many challenges begin to vanish.

We resume our exploration with another practical example, this time based on real data generated by an EEG device. The intersection of hardware, software, and logic plays a vital role in problem-solving activities (even if it just enables analysts to make more educated guesses), yet working at that intersection seems to be a skill increasingly taken for granted, and increasingly alien.

Background

An electroencephalogram (EEG) is a test that detects electrical activity in your brain. Brain cells communicate via electrical impulses and are active all the time, even when you're asleep. This activity can be visualized as wavy lines on an EEG recording, but ultimately is sourced from raw bytes sampled from the device performing the data acquisition.

Sleep is a common area of study where this is particularly applicable, and is even somewhat of a modern-day fad: everything from smartphone apps to special wristbands can be used to monitor aspects of our sleep quality, and more products are coming to market all the time.

We will be analyzing data generated by a consumer-grade EEG headset: basically a device one wears when going to sleep that, via conductive pads in contact with the skin on the forehead, monitors the wearer's brainwaves and can determine their level of activity (especially whether the wearer is asleep, and at what level of sleep).

The data was obtained from a live session (me, sleeping) during my initial polyphasic sleep adaptation a few years ago, so there'll be opportunities to see “normal, boring” sleep patterns, transitions, and even more optimized sleep sessions (including rather restful 20-minute power naps).

The device used generated bytes of raw data, which I captured into individual data files. We will be learning how that data is structured so that we may parse it, and ultimately derive information such as sleep duration, type of sleep, etc.

Like udr0 and udr1… we're just manipulating (reading/writing) bytes of data, and applying specific rules and methods to how we interpret various bytes, or sequences of bytes.

Once again there is a conceptual as well as practical angle… some people will struggle more with one over the other, and as always: questions are not just encouraged, they are expected for success!

EEG data packet format

EEG data is represented in the form of data packets– collections of bytes that can be decoded to convey a particular meaning (state of sleep, timestamp, signal strength, etc.).

It is important to note that the data, in cases of multibyte values, is little endian in orientation.
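
To make “little endian” concrete: the byte that appears first in the file is the least significant one. Here is a minimal sketch using shell arithmetic; the helper name le16 is made up for this illustration and is not part of any provided tool.

```shell
# le16 LO HI: combine two hex bytes, given in the order they appear in the
# file (low-order byte first), into one 16-bit value.
# (le16 is a hypothetical helper name, not part of binhaxx.)
le16() {
    printf '0x%04X\n' $(( (0x$2 << 8) | 0x$1 ))
}

le16 34 12    # bytes 0x34 0x12 in the file -> prints 0x1234
le16 05 00    # bytes 0x05 0x00 in the file -> prints 0x0005
```

Note that the second byte read ends up on the left of the assembled value.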

The format of the data packet is as follows:

Data Packet

Field	Length (bytes)	Description
'A' (0x41)	1	character starting the packet
'4' (0x34)	1	the protocol “version”, of which only '4' is currently supported
checksum	1	a one-byte checksum formed by summing the identifier (datatype) byte and all the datablock bytes, keeping only the low 8 bits
msglen	2	a two-byte message length (little endian). This length includes the size of the data payload plus the identifier (datatype) byte
inv_msglen	2	the bitwise inverse of msglen, sent for redundancy. If msglen does not match ~inv_msglen, we can start looking for the next packet immediately, instead of reading some arbitrary number of bytes based on a bad length
time_sec	1	the lower 8 bits of the current unix time (when session was recorded)
sub_sec	2	the 16-bit sub-second (runs through 0xFFFF in 1 second), LSB first
seqnum	1	the 8-bit sequence number
datatype	1	the datatype (see data type subfield table below)
datablock	variable	the array of binary data

Data Type Subfield of Data Packet

These are the data types generated by the EEG device and could manifest within the data file. Note that this data will be contained in the datatype field of the data packet, and any follow-up data will be present in the datablock array field.

Type ID (hex)	Type Name	Description
0x00	event	an event has occurred (see event table below)
0x02	slice_end	marks the end of a slice of data (a slice can span multiple packets)
0x03	version	version of the raw data output
0x80	waveform	raw time domain brainwave
0x83	frequency_bins	frequency bins derived from waveform
0x84	signal	signal quality range of waveform (0-30)
0x8A	timestamp	full timestamp from EEG device's RTC
0x97	impedance	impedance across the headband
0x9C	badsignal	signal contains artifacts
0x9D	sleepstage	current sleep stage (produced in 30 second samples, see sleepstage table below)

Event table

These are the possible events generated by the EEG device and could manifest within the data file. Note that this data will be contained in the datablock field (the array) of the data packet when the datatype has been identified as an event.

Event ID	Event Name	Description
0x05	session_start	data acquisition session has commenced
0x07	sleep_start	user is asleep
0x0E	headset_disengaged	EEG headset has been set on dock
0x0F	headset_engaged	EEG headset taken off dock
0x10	alarm_off	user turned off alarm functionality
0x11	alarm_snooze	user hit snooze (snooze delay enabled on alarm functionality)
0x13	alarm_play	set alarm is now going off
0x15	session_end	data acquisition session has ceased
0x24	headset_introduce	a new headband ID has been read

Sleep Stage table

These are the possible sleep stages recognized by the EEG device. This data will be located in the datablock field (the array) of the data packet when the data type has been identified as a sleepstage.

SleepStage ID	SleepStage Name	Description
0x00	undefined	insufficient data to determine sleep stage
0x01	conscious	user is in an awakened state
0x02	rem	user is experiencing REM (Rapid Eye Movement) sleep
0x03	light	user is experiencing light sleep
0x04	deep	user is experiencing deep sleep (SWS)
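
The table above, together with the note that sleep stages are produced in 30-second samples, suggests two small conveniences: translating an ID into its name, and turning a count of samples into a duration. The following is only a sketch; stage_name and samples_to_time are made-up helper names, and it assumes each sleepstage value you count stands for exactly one 30-second sample.

```shell
# stage_name ID: map a sleep stage ID (two hex digits) to its table name.
# (stage_name is a hypothetical helper name.)
stage_name() {
    case $1 in
        00) echo undefined ;;
        01) echo conscious ;;
        02) echo rem ;;
        03) echo light ;;
        04) echo deep ;;
        *)  echo "unknown ($1)" ;;
    esac
}

# samples_to_time N: express N samples of 30 seconds each as hh:mm:ss.
# (Assumes one sleepstage value per 30-second sample.)
samples_to_time() {
    secs=$(( $1 * 30 ))
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

stage_name 04          # prints: deep
samples_to_time 45     # 45 samples = 1350 seconds -> prints: 00:22:30
```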

Frequency Bins table

Frequency Bins are a measurement of the current waveform frequency being experienced, which is analyzed by the EEG device and factors into the Sleep Stage determination. This would be considered a more raw form of data, should additional analysis be desired.

ID	Named Range (Hz)	Description
0x00	2-4	Delta
0x01	4-8	Theta
0x02	8-13	Alpha
0x03	13-18	Beta
0x04	18-21	Beta
0x05	11-14	Beta (sleep spindles)
0x06	30-50	Gamma

Example Analysis

With the use of a hex editor, we can manually identify and decode the EEG data packets, using the information provided above.

In the file session-201211020309.raw (November 2nd, 2012, core sleep session starting at 3:09am), the following data can be seen (excerpted from a bvi session):

00002580  3F 05 00 FA FF D7 0E 00 34 02 51 DA 12 00 41 34 7F 05 00 FA ?.......4.Q...A4....
00002594  FF D8 06 00 35 8A D8 3A 93 50 41 34 06 05 00 FA FF D8 08 00 ....5..:.PA4........
000025A8  36 03 03 00 00 00 41 34 40 05 00 FA FF D8 0E 00 37 02 52 DA 6.....A4@.......7.R.
000025BC  12 00 41 34 80 05 00 FA FF D9 04 00 38 8A D9 3A 93 50 41 34 ..A4........8..:.PA4

If you look over in the ASCII field on the far right of the line starting at offset 00002594, you will see a “:.PA4”… according to the data packet field breakdown above, the start of the packet will be an 'A', followed by a '4'… so seeing a fairly isolated “A4” is an excellent indication we are looking at a new data packet.

bvi informs us that the lone “A4” 2-byte sequence ('A' byte followed by '4' byte) is at offset 0000259E.

The byte just before the next “A4” (found on the next line, which starts at offset 000025A8) is at offset 000025AD.

It would seem (especially upon converting 259E and 25AD to decimal) that there is a 15-byte difference (so a 16-byte length) for this particular packet. Let's dig deeper…

First, to reduce analysis paralysis, let us extract specifically this packet.

We need the decimal equivalents of 259E and 25AD:

$ echo "ibase=16; 259E" | bc
9630
$ echo "ibase=16; 25AD" | bc
9645
$

And then, calculate their difference (how long is this packet):

$ echo "9645-9630" | bc
15
$

Okay, so we have 15 bytes of data following offset 9630 (decimal). We need to remember to include the byte at offset 9630, so 15+1=16 total bytes in this packet. Let us extract just that packet into a file for further analysis:

$ dd if=session-201211020309.raw of=packet bs=1 skip=9630 count=16
16+0 records in
16+0 records out
16 bytes (16 B) copied, 0.141976 s, 0.1 kB/s
$
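
The offset arithmetic can also be handled by the shell itself, since $(( )) understands 0x-prefixed hex values. The following is only a sketch; packet_span is a made-up helper name that prints the skip and count values one would hand to dd.

```shell
# packet_span FIRST LAST: given the hex offsets of the first and last byte
# of a packet, print the decimal skip= and count= values for dd.
# (packet_span is a hypothetical helper, not a provided tool.)
packet_span() {
    first=$((0x$1))
    last=$((0x$2))
    echo "skip=$first count=$((last - first + 1))"
}

packet_span 259E 25AD    # prints: skip=9630 count=16
```

These are the same skip and count values used in the dd command above; the +1 accounts for the byte at the starting offset.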

Finally, let's get a hexdump and further decode this arbitrary packet:

$ od -A x -t x1z -v packet 
000000 41 34 06 05 00 fa ff d8 08 00 36 03 03 00 00 00  >A4........6.....<
000010
$

Note that with our extraction from the data file, the original offset is no longer valid (we now have a file with JUST our packet in it, and our file begins at offset 0).

Okay… let's break this down (reference the info tables above):

byte 0: packet start (0x41 – 'A')
byte 1: protocol version (0x34 – '4')
byte 2: checksum– see below for calculation (0x06)
byte 3: lower-order byte of message length (0x05)
byte 4: upper-order byte of message length (0x00)

According to this, our message length is 0x0005 (or 5 in decimal) bytes long.

byte 5: lower-order inverted byte of message length (was 0x05 above, should be 0xFA)
byte 6: upper-order inverted byte of message length (was 0x00 above, should be 0xFF)

If you have questions about bit inversions, it is merely flipping 0 to 1, and 1 to 0. In our 0x05 example, we have this:

normal: 00000101 (05) or 0000 (0) 0101 (5)
inverted: 11111010 (FA) or 1111 (F) 1010 (A)

The EEG device is inverting the message length data and placing them in our data packet so we can use it as a form of data validation, to make sure we're looking at a real packet (strategies like this are not uncommon– it is part of interfacing real world devices to the digital environments of computers).

And we see that the inverted message length checks out with the regular message length… we've passed one of the tests ensuring this is a valid packet.
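
The same check can be expressed with shell arithmetic. This is a sketch under the assumption that both lengths have already been assembled into 16-bit values (low-order byte first); len_ok is a made-up helper name.

```shell
# len_ok MSGLEN INV_MSGLEN: succeed when the inverted length, flipped back
# and trimmed to 16 bits, equals the message length.
# (len_ok is a hypothetical helper name.)
len_ok() {
    [ $(( ~$2 & 0xFFFF )) -eq $(( $1 )) ]
}

if len_ok 0x0005 0xFFFA; then
    echo "lengths agree"
else
    echo "bad packet, resume scanning"
fi
```

The mask with 0xFFFF matters: the shell does its arithmetic in a wider integer, so without it the flipped value would come out negative rather than as a 16-bit quantity.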

byte 7: lower-order byte of 32-bit UNIX time (0xd8) – this will make more sense in the context of the actual time (once known)
byte 8: lower-order byte of subsecond (0x08)
byte 9: upper-order byte of subsecond (0x00)
byte 10: sequence number (0x36)
byte 11: data type (0x03) – according to the table, 0x03 is a 'version'
byte 12 onward: datablock (msglen-1 bytes long)

It would seem the “message length” consists of the data type byte plus the length of the datablock. We see from the 2-byte message length sequence above that the msglen is 5 bytes… 1 of those bytes is the data type byte, which leaves 4 bytes remaining for the datablock array.

As it is multibyte, it needs to be treated as little endian (lower-order byte first, followed by upper-order bytes)… we see from our hex display there are 4 bytes remaining in our packet:

03 00 00 00

So, doing a straight reversal, that would give us: 00 00 00 03, a 32-bit (4-byte) value, containing the number 3, the apparent version of things (different from the packet format version above).

Let's address the checksum calculation skipped above… now that we know our data type + datablock bytes (all 5 of them), the checksum is calculated by adding together all 5 of those bytes (but only storing the result in a 1 byte storage space, which will likely mean wraparounds like it is nobody's business with more exotic values). Let's trace it out:

0x03 (data type) + 0x03 (first byte of datablock) + 0x00 (second byte of datablock) + 0x00 (third byte of datablock) + 0x00 (fourth byte of data block) = 0x06.

What was the value stored in the checksum field of our extracted data packet (byte #2)? 0x06. Aha! The sum of the data checks out (this is our other test to ensure packet data validity).
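
The checksum trace can be automated along these lines. It is a sketch only: checksum is a made-up helper name, and it expects the data type byte followed by the datablock bytes, each as two hex digits.

```shell
# checksum BYTE...: add up the given hex bytes (datatype first, then the
# datablock) and keep only the low 8 bits, as a 1-byte field would.
# (checksum is a hypothetical helper name.)
checksum() {
    sum=0
    for byte in "$@"; do
        sum=$(( (sum + 0x$byte) & 0xFF ))
    done
    printf '%02x\n' "$sum"
}

checksum 03 03 00 00 00    # the packet decoded above -> prints 06
checksum 8a d8 3a 93 50    # packet at offset 258E in the dump -> prints 7f
```

The second call shows the wraparound in action: those five bytes add up to 0x27F, and only the low byte (0x7F) survives, which matches the checksum byte stored in that packet in the hex dump above.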

There we have it… one decoded packet, of potentially many.

Pretty awesome, right?

Obtain the files

There are two resources you need to obtain for udr2:

binhaxx suite

Located at: http://lab46.corning-cc.edu/~wedge/projects/binhaxx/

Here you will find a collection of compressed archives for the binhaxx suite of pedagogical data manipulation tools. These are helper programs (or converters) optimized for various binary operations you may find yourself needing.

Please download the latest release, extract it, read the documentation, build it, and install it into your own custom ~/bin directory (and add that custom bin to your PATH).

Explore these tools and get a feel for how they work. You may find use for some of them while performing this and other projects.

sleepdata

The data for this week's project is located in the udr2/ directory of the UNIX Public Directory, in an archive called: sleepfun.tar.bz2

Make a copy of this into your home directory somewhere and set to work.

NOTE: Hopefully it has been standard practice to locate project files in their own unique subdirectory, such as under src/unix/, where you can then add/commit/push the results to your repository (you ARE regularly putting stuff in your repository, aren't you?)

NOTE: You probably do not want to add/commit/push this sleepfun.tar.bz2 archive, nor its extracted .raw files, as they do consume a bit of space.

Data Files

Upon extraction of the files in sleepfun.tar.bz2, you should have the following files:

session-201211020309.raw (5866460 bytes, or 5.6MB) – core sleep session
session-201301041418.raw (360135 bytes) – nap
session-201301311908.raw (4955855 bytes) – core sleep session
session-201302010218.raw (2719296 bytes) – core sleep session
session-201302200614.raw (524705 bytes) – nap
session-201303051015.raw (511190 bytes) – nap

Session files are named with the date and time of the start of the particular sleep session, encoded as follows (YYYYMMDDhhmm):

YYYY - 4-digit year (2012)
MM - 2-digit month (11)
DD - 2-digit day (02)
hh - 2-digit hour (24-hour time, so 03 means 3am)
mm - 2-digit minute (09)

So, 201211020309 means 2012/11/02 at 3:09am was the recorded time of the start of this particular sleep session (I was exploring with a dual core sleep schedule around this time, so this would have been my 2nd core).
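
Should you want your script to report the start time straight from a file name, the encoding above can be unpacked with parameter expansion and sed. This is a sketch; session_start is a made-up helper name, and it assumes it is handed the bare file name (no leading directory).

```shell
# session_start FILE: turn session-YYYYMMDDhhmm.raw into "YYYY-MM-DD hh:mm".
# (session_start is a hypothetical helper name.)
session_start() {
    stamp=${1#session-}     # drop the leading "session-"
    stamp=${stamp%.raw}     # drop the trailing ".raw"
    echo "$stamp" | sed -E 's/^(....)(..)(..)(..)(..)$/\1-\2-\3 \4:\5/'
}

session_start session-201211020309.raw    # prints: 2012-11-02 03:09
```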

Task

With the provided data files, I'd like for you to do the following (be sure to provide commands for each as well as the answer you got):

determine the number of data packets in each file
determine the total time elapsed in the session file
determine the total time in a sleep state (not undefined, not conscious)
find a data packet during a time of rem or deep sleep that stores the complete timestamp, and:
- extract that packet from the pertinent data file (provide command)
- what is the timestamp (as a 32-bit value)
- what is the calendar date and time of that timestamp, when appropriately translated?
which file had the most deep sleep?
- how much took place?
- how did you figure this out?
- what was the approximate time?
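
For the timestamp questions, the general shape of the conversion might look like the following sketch. It uses the datablock of the timestamp (0x8A) packet visible at offset 258E in the Example Analysis dump (D8 3A 93 50), not a packet that satisfies the task. le32 is a made-up helper name, and the date invocation assumes GNU date (the @ syntax for seconds since the epoch).

```shell
# le32 B0 B1 B2 B3: combine four hex bytes, low-order byte first, into a
# decimal 32-bit value. (le32 is a hypothetical helper name.)
le32() {
    echo $(( 0x$1 | (0x$2 << 8) | (0x$3 << 16) | (0x$4 << 24) ))
}

# Datablock of the timestamp packet in the Example Analysis dump.
ts=$(le32 d8 3a 93 50)
echo "$ts"                                   # prints: 1351826136

# GNU date: @N means "N seconds since the epoch"; -u displays it as UTC.
date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S'       # prints: 2012-11-02 03:15:36
```

Displayed without any time zone shift, the result lands a few minutes after the 3:09am in the file name, which hints that the device clock may have been keeping local time; treat that as something to verify rather than a given. Notice also that the low-order byte of the timestamp (0xD8) is the same value found in the time_sec field of the neighboring packets.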

Useful tools

You may want to become familiar with the manual pages of the following tools (in addition to tools you've already encountered):

dd(1)
bc(1)
od(1) - as I've said to others, od is like cat, but for binary data
bvi(1)
hexedit(1)
grep(1) - can be contorted to cooperate
date(1) - might be useful for time/date manipulations
binhaxx tools

… along with other tools previously encountered.

binhaxx search

To assist you with this project, a special “binary search” has been developed, provided via the binhaxx tools, called search. search looks for patterns in binary data read from STDIN.

It supports space-separated bytes of data, and even allows the use of '.' to denote any hex digit (remember, it takes 2 hex digits to occupy a byte).

Example Usage

Let's say you wanted to search for the consecutive bytes 0x12 and 0x34 within a binary file:

$ cat session-201302200614.raw | search '12 34' 
533b:12 34 
29af3:12 34 
29dff:12 34 
29f85:12 34 
2a8a9:12 34 
2aa2f:12 34 
2abb5:12 34 
2aec1:12 34 
2b353:12 34
$

What you see are the addresses (in hex) that denote the start of this requested pattern (0x12 immediately followed by 0x34).

If you wanted 0x12 followed by anything, followed by 0x45, you'd do:

$ cat session-201302200614.raw | search '12 .. 45' 
3326:12 e0 45
$

In this case, there is only one such match in the entire file.

The '.' pattern can also be applied to only part of a byte… 0x12 0xe# (we don't care what the lower order 4-bits are, but the upper 4-bits of the second byte MUST be an 0xe):

$ cat session-201302200614.raw | search '12 e.' 
1cf4:12 ee 
206d:12 e0 
3325:12 e0 
3907:12 e0 
4077:12 e0 
4795:12 e0 
50a1:12 e0 
552b:12 e0 
5edb:12 e0 
73e7:12 e0 
81b9:12 e0 
8df9:12 e0 
8fcf:12 e0 
aae3:12 e0 
aae7:12 e0 
b859:12 e0 
3415c:12 e9 
4e11f:12 e0 
6bd5b:12 ed 
796f7:12 e0 
7b877:12 e0 
7d3df:12 e0 
7e7e1:12 e0 
7e7f5:12 e0 
7ecf7:12 e0
$

We can see variations in the lower 4-bits as it matches our desired pattern.

Finally, upper 4-bits can be anything, lower 4 must be 0xc, followed by 0x34:

$ cat session-201302200614.raw | search '.c 34' 
91c1:3c 34 
29029:8c 34 
297e5:0c 34 
322d3:ec 34 
6152b:dc 34 
6a683:0c 34 
6ef95:6c 34
$

This will hopefully prove to be a useful tool in your binary analysis endeavors.
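
If search is not at hand on a given machine, a rough stand-in for its simplest case (two exact bytes, no '.' wildcards) can be assembled from od, tr, and awk. This is a sketch, not a replacement: find_bytes is a made-up helper name, it prints only the hex offsets, and it will be far slower on the larger session files.

```shell
# find_bytes FILE B1 B2: print the hex offset of every place where byte B1
# is immediately followed by byte B2 (each given as two lowercase hex digits).
# (find_bytes is a hypothetical helper, not part of binhaxx.)
find_bytes() {
    od -An -v -tx1 "$1" | tr -s ' ' '\n' |
    awk -v a="$2" -v b="$3" '
        NF == 0 { next }                      # skip blank lines left by tr
        {
            # compare as strings so hex digits are never read as numbers
            if ((prev " " $1) == (a " " b)) printf "%x\n", n - 1
            prev = $1
            n++                               # n counts bytes seen so far
        }'
}

# Tiny demonstration file holding the bytes 00 41 34 05 41 34
# (printf octal escapes: \101 is 0x41, \064 is 0x34).
demo=$(mktemp)
printf '\000\101\064\005\101\064' > "$demo"
find_bytes "$demo" 41 34                      # prints 1, then 4
rm -f "$demo"
```

As with search, a match is only a candidate: the same byte pair can turn up inside a datablock, so the validity checks described in the Example Analysis still apply.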

Submission

Successful completion will result in the following criteria being met:

When all is said and done, you will submit 2 files:
- udr2.text, containing:
  - an organized presentation of the answers/responses to all the above questions
- udr2.sh
  - a shell script containing all the necessary commands to accomplish the project (and will automate a run of the project when executed).
  - be sure to adequately comment the script so I can see your thought process (particular commands used, options utilized, logic used).
  - the script should output important information related to the particular step being taken (“Determining amount of Deep Sleep”, “Determining total time spent asleep”, etc.), along with the determined result of that particular data point.

Submit

Please submit as follows:

lab46:~/src/unix/udr2$ submit unix udr2 udr2.text udr2.sh
Submitting unix project "udr2":
    -> udr2.text(OK) 
    -> udr2.sh(OK)

SUCCESSFULLY SUBMITTED
lab46:~/src/unix/udr2$
