Corning Community College CSCS1730 UNIX/Linux Fundamentals

~~TOC~~

======Project: UNIX DATA RECOVERY (udr2)======

=====Errata=====
Typos and bug fixes:

  * **bgrep** was reporting the address of the last byte of a matched pattern instead of the address of the start of the matched pattern (the intended behavior). This has been corrected, and **bgrep** has been updated. (20150322)
    * This should not change anything, other than saving you an additional calculation to determine the start of the packet.
  * The aforementioned fix did not work; **bgrep** was reverted to its original version. (20150324)
  * A new fix has been implemented: **bgrep** should now correctly report the starting address of the matched pattern. No change is needed on your part; just start using it, and be aware that the address represents the start of the pattern, not the end. (20150330)

=====Objective=====
Continuing our "1337 haxxing" series of projects: we've found that considerable self-imposed conceptual roadblocks get in the way of applying otherwise simple computing properties (that data is a series of bytes, and ultimately, that **everything is a file**). We resume our exploration with another practical example, this time based on real data generated by an EEG device.

The intersection of hardware, software, and logic plays a vital role in problem solving (even if it just enables analysts to make more educated guesses), yet working at that intersection seems to be a skill that is increasingly taken for granted and, at the same time, increasingly alien.

=====Background=====
An electroencephalogram (EEG) is a test that detects electrical activity in your brain. Brain cells communicate via electrical impulses and are active all the time, even when you're asleep. This activity can be visualized as wavy lines on an EEG recording, but ultimately is sourced from raw bytes sampled from the device performing the data acquisition.
Sleep is a common area of study where this is particularly applicable, and is even something of a modern-day fad: everything from smartphone apps to special wristbands can be used to monitor aspects of our sleep quality, and more products are coming to market all the time.

We will be analyzing data generated by a consumer-grade EEG headset: basically a device one wears when going to sleep which, via conductive pads in contact with the skin of the forehead, monitors the wearer's brainwaves and can determine their level of activity (especially with regard to whether the wearer is asleep, and at what level of sleep).

The data was obtained from live sessions (me, sleeping) during my initial polyphasic sleep adaptation a few years ago, so there'll be opportunities to see "normal, boring" sleep patterns, transitions, and even more optimized sleep sessions (including rather restful 20-minute power naps).

The device generated bytes of raw data, which I captured into individual data files. We will be learning how that data is structured so that we may parse it, and ultimately derive information such as sleep duration, type of sleep, etc.

Like udr0 and udr1, we're just manipulating (reading/writing) bytes of data, and applying specific rules and methods to how we interpret various bytes, or sequences of bytes.

Once again there is a conceptual as well as a practical angle; some people will struggle more with one than the other, and as always: questions are not just encouraged, they are expected for success!

=====EEG data packet format=====
EEG data is represented in the form of data packets: collections of bytes that can be decoded to convey a particular meaning (state of sleep, timestamp, signal strength, etc.).

It is important to note that multibyte values in the data are stored in little endian order (lowest-order byte first).
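As a small illustration of little endian ordering, the following shell sketch assembles a multibyte value from its bytes in the order they would appear in a file (lowest-order byte first). The function name **le_to_dec** is made up for this example; it is not one of the project's tools.

```shell
#!/bin/sh
# le_to_dec: hypothetical helper (not part of the project files).
# Takes hex bytes in file order (little endian: lowest-order byte first)
# and prints the value they represent, in decimal.
le_to_dec() {
    value=0
    shift_by=0
    for byte in "$@"; do
        # each later byte is worth 256 times more than the one before it
        value=$(( value + (0x$byte << shift_by) ))
        shift_by=$(( shift_by + 8 ))
    done
    echo "$value"
}

le_to_dec 05 00      # bytes 05 00 represent 0x0005
le_to_dec 34 12      # bytes 34 12 represent 0x1234
```

The two calls should print 5 and 4660 (0x1234 in decimal): the byte that comes first in the file ends up as the lowest-order byte of the value.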
The format of the data packet is as follows:

====Data Packet====
^ Field ^ Length (bytes) ^ Description |AncllLLTttsid |
| 'A' (0x41) | 1 | character starting the packet |
| '4' (0x34) | 1 | the protocol “version”, of which only '4' is currently supported |
| checksum | 1 | a one-byte checksum formed by summing the identifier (datatype) byte and all the data bytes |
| msglen | 2 | a two-byte message length (little endian); this length includes the size of the data payload plus the identifier |
| inv_msglen | 2 | the bitwise inverse of msglen, sent for redundancy. If msglen does not match ~inv_msglen, we can start looking for the next packet immediately, instead of reading some arbitrary number of bytes based on a bad length |
| time_sec | 1 | the lower 8 bits of the current UNIX time (when the session was recorded) |
| sub_sec | 2 | the 16-bit sub-second counter (runs through 0xFFFF in 1 second), LSB first |
| seqnum | 1 | the 8-bit sequence number |
| datatype | 1 | the datatype (see data type subfield table below) |
| datablock | variable | the array of binary data |

====Data Type Subfield of Data Packet====
These are the data types generated by the EEG device that could appear within the data file. Note that this value is contained in the **datatype** field of the data packet, and any follow-up data will be present in the **datablock** array field.
^ Type ID (hex) ^ Type Name ^ Description |
| 0x00 | event | an event has occurred (see event table below) |
| 0x02 | slice_end | marks the end of a slice of data (a slice can span multiple packets) |
| 0x03 | version | version of the raw data output |
| 0x80 | waveform | raw time domain brainwave |
| 0x83 | frequency_bins | frequency bins derived from waveform |
| 0x84 | signal | signal quality range of waveform (0-30) |
| 0x8A | timestamp | full timestamp from EEG device's RTC |
| 0x97 | impedance | impedance across the headband |
| 0x9C | badsignal | signal contains artifacts |
| 0x9D | sleepstage | current sleep stage (produced in 30 second samples, see sleepstage table below) |

====Event table====
These are the possible events generated by the EEG device that could appear within the data file. Note that this value is contained in the **datablock** field (the array) of the data packet when the datatype has been identified as an **event**.

^ Event ID ^ Event Name ^ Description |
| 0x05 | session_start | data acquisition session has commenced |
| 0x07 | sleep_start | user is asleep |
| 0x0E | headset_disengaged | EEG headset has been set on dock |
| 0x0F | headset_engaged | EEG headset taken off dock |
| 0x10 | alarm_off | user turned off alarm functionality |
| 0x11 | alarm_snooze | user enabled the snooze delay on the alarm functionality |
| 0x13 | alarm_play | set alarm is now going off |
| 0x15 | session_end | data acquisition session has ceased |
| 0x24 | headset_introduce | a new headband ID has been read |

====Sleep Stage table====
These are the possible sleep stages recognized by the EEG device. This value is located in the **datablock** field (the array) of the data packet when the data type has been identified as a **sleepstage**.
^ SleepStage ID ^ SleepStage Name ^ Description |
| 0x00 | undefined | insufficient data to determine sleep stage |
| 0x01 | conscious | user is in an awakened state |
| 0x02 | rem | user is experiencing REM (Rapid Eye Movement) sleep |
| 0x03 | light | user is experiencing light sleep |
| 0x04 | deep | user is experiencing deep sleep (SWS) |

====Frequency Bins table====
Frequency bins are a measurement of the frequencies present in the current waveform. The EEG device analyzes them, and they factor into its sleep stage determination. This would be considered a more raw form of data, should additional analysis be desired.

^ ID ^ Range (Hz) ^ Name |
| 0x00 | 2-4 | Delta |
| 0x01 | 4-8 | Theta |
| 0x02 | 8-13 | Alpha |
| 0x03 | 13-18 | Beta |
| 0x04 | 18-21 | Beta |
| 0x05 | 11-14 | Beta (sleep spindles) |
| 0x06 | 30-50 | Gamma |

=====Example Analysis=====
With the use of a hex editor, we can manually identify and decode the EEG data packets, using the information provided above.

In the file **session-201211020309.raw** (November 2nd, 2012, core sleep session starting at 3:09am), the following data can be seen (excerpted from a **bvi** session):

  00002580  3F 05 00 FA FF D7 0E 00 34 02 51 DA 12 00 41 34 7F 05 00 FA  ?.......4.Q...A4....
  00002594  FF D8 06 00 35 8A D8 3A 93 50 41 34 06 05 00 FA FF D8 08 00  ....5..:.PA4........
  000025A8  36 03 03 00 00 00 41 34 40 05 00 FA FF D8 0E 00 37 02 52 DA  6.....A4@.......7.R.
  000025BC  12 00 41 34 80 05 00 FA FF D9 04 00 38 8A D9 3A 93 50 41 34  ..A4........8..:.PA4

If you look at the ASCII column on the far right of the line starting at offset **00002594**, you will see ":.PA4". According to the data packet field breakdown above, a packet starts with an 'A' followed by a '4', so seeing a fairly isolated "A4" is an excellent indication that we are looking at the start of a new data packet.

**bvi** informs us that this lone "A4" 2-byte sequence ('A' byte followed by '4' byte) is at offset **0000259E**.
The byte prior to the next "A4" (on the next line, **000025A8**) is at offset **000025AD**. It would seem (especially upon converting 259E and 25AD to decimal) that there is a 15-byte difference between the two, so this particular packet is 16 bytes long. Let's dig deeper...

First, to reduce analysis paralysis, let us extract specifically this packet. We need the decimal equivalents of 259E and 25AD:

  $ echo "ibase=16; 259E" | bc
  9630
  $ echo "ibase=16; 25AD" | bc
  9645
  $

And then, calculate their difference (how long is this packet):

  $ echo "9645-9630" | bc
  15
  $

Okay, so we have 15 bytes of data following offset 9630 (decimal). We need to remember to include the byte at offset 9630 itself, so 15+1=16 total bytes in this packet.

Let us extract just that packet into a file for further analysis:

  $ dd if=session-201211020309.raw of=packet bs=1 skip=9630 count=16
  16+0 records in
  16+0 records out
  16 bytes (16 B) copied, 0.141976 s, 0.1 kB/s
  $

Finally, let's get a hexdump and further decode this arbitrary packet:

  $ od -A x -t x1z -v packet
  000000 41 34 06 05 00 fa ff d8 08 00 36 03 03 00 00 00  >A4........6.....<
  000010
  $

Note that with our extraction from the data file, the original offset is no longer valid (we now have a file with JUST our packet in it, and our file begins at offset 0).

Okay... let's break this down (reference the info tables above):

  * byte 0: packet start (0x41 -- 'A')
  * byte 1: protocol version (0x34 -- '4')
  * byte 2: checksum (0x06) -- see below for the calculation
  * byte 3: lower-order byte of message length (0x05)
  * byte 4: upper-order byte of message length (0x00)

According to this, our message length is 0x0005 (5 bytes in decimal).

  * byte 5: lower-order inverted byte of message length (was 0x05 above, should be 0xFA)
  * byte 6: upper-order inverted byte of message length (was 0x00 above, should be 0xFF)

If you have questions about bit inversion, it is merely flipping every 0 to 1, and every 1 to 0.
In our 0x05 example, we have this:

  * normal: 00000101 (05) or 0000 (0) 0101 (5)
  * inverted: 11111010 (FA) or 1111 (F) 1010 (A)

The EEG device inverts the message length and places the result in the data packet so we can use it as a form of data validation, to make sure we're looking at a real packet (strategies like this are not uncommon; it is part of interfacing real world devices to the digital environments of computers).

And we see that the inverted message length checks out against the regular message length... we've passed one of the tests ensuring this is a valid packet.

  * byte 7: lower-order byte of 32-bit UNIX time (0xd8) -- this will make more sense in the context of the actual time (once known)
  * byte 8: lower-order byte of subsecond (0x08)
  * byte 9: upper-order byte of subsecond (0x00)
  * byte 10: sequence number (0x36)
  * byte 11: data type (0x03) -- according to the table, 0x03 is a 'version'
  * bytes 12 and up: datablock (msglen-1 bytes)

It would seem the "message length" consists of the data type byte plus the length of the datablock. We see from the 2-byte message length sequence above that the msglen is 5 bytes... 1 of those bytes is the **data type** byte, which leaves 4 bytes remaining for the **datablock** array.

As the datablock is multibyte, it needs to be treated as little endian (lower-order byte first, followed by upper-order bytes)... we see from our hex display that there are 4 bytes remaining in our packet: 03 00 00 00

So, doing a straight reversal, that would give us: **00 00 00 03**, a 32-bit (4-byte) value containing the number **3**, the apparent version of the raw data output (different from the packet format version above).

Let's address the checksum calculation skipped above... now that we know our data type + datablock bytes (all 5 of them), the checksum is calculated by adding together all 5 of those bytes (but only storing the result in 1 byte of storage space, which will likely mean wraparounds like it is nobody's business with more exotic values).
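The two validity tests described here (the inverted message length and the one-byte checksum) can be sketched with shell arithmetic. This is only an illustration under the packet layout given above: the helper names **invert16** and **checksum8** are made up for the example and are not provided with the project, and the byte values are typed in by hand from the extracted packet.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only (not part of the project files).

# invert16: flip all 16 bits of a value, which is what inv_msglen
# should hold for a given msglen
invert16() {
    echo $(( ~$1 & 0xFFFF ))
}

# checksum8: add up the given hex bytes (data type byte + datablock bytes),
# keeping only the lowest 8 bits, since the checksum field is a single byte
checksum8() {
    sum=0
    for byte in "$@"; do
        sum=$(( (sum + 0x$byte) & 0xFF ))
    done
    printf '0x%02x\n' "$sum"
}

# Values from the example packet: msglen bytes 05 00, inverted bytes fa ff
msglen=$(( 0x0005 ))
inv_msglen=$(( 0xfffa ))
if [ "$(invert16 "$msglen")" -eq "$inv_msglen" ]; then
    echo "length check passed"
fi

# data type 0x03 followed by the datablock bytes 03 00 00 00
checksum8 03 03 00 00 00
```

For a case where the sum wraps around, try the bytes of the timestamp packet visible in the **bvi** excerpt (8a d8 3a 93 50): the full sum is 0x27f, but only 0x7f fits in one byte, which matches the 7F stored in that packet's checksum field.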
Let's trace it out: 0x03 (data type) + 0x03 (first byte of datablock) + 0x00 (second byte of datablock) + 0x00 (third byte of datablock) + 0x00 (fourth byte of datablock) = 0x06.

What was the value stored in the checksum field of our extracted data packet (byte #2)? 0x06.

Aha! The sum of the data checks out (this is our other test to ensure packet data validity).

There we have it... one decoded packet, of potentially many. Pretty awesome, right?

=====Obtain the files=====
This week's project is located in the **spring2015/udr2/** directory of the UNIX Public Directory, in an archive called: **sleepfun.tar.bz2**

Copy it somewhere into your home directory and set to work.

**NOTE:** Hopefully it has been standard practice to locate project files in their own unique subdirectory, such as under **src/unix/**, where you can then add/commit/push the results to your repository (you ARE regularly putting stuff in your repository, aren't you?)

=====Data Files=====
Upon extraction of the files in **sleepfun.tar.bz2**, you should have the following files:

  * session-201211020309.raw (5866460 bytes, or 5.6MB) -- core sleep session
  * session-201301041418.raw (360135 bytes) -- nap
  * session-201301311908.raw (4955855 bytes) -- core sleep session
  * session-201302010218.raw (2719296 bytes) -- core sleep session
  * session-201302200614.raw (524705 bytes) -- nap
  * session-201303051015.raw (511190 bytes) -- nap

Session files are named with the date and time of the start of the particular sleep session, encoded as follows (YYYYMMDDhhmm):

  * YYYY - 4-digit year (2012)
  * MM - 2-digit month (11)
  * DD - 2-digit day (02)
  * hh - 2-digit hour (24-hour time, so 03 means 3am)
  * mm - 2-digit minute (09)

So, 201211020309 means 2012/11/02 at 3:09am was the recorded time of the start of this particular sleep session (I was experimenting with a dual core sleep schedule around this time, so this would have been my 2nd core).
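The file naming scheme above can be pulled apart with ordinary shell tools. The following is a minimal sketch that assumes the name has exactly the form session-YYYYMMDDhhmm.raw; the function name **session_start** is made up for this example and is not part of the project.

```shell
#!/bin/sh
# session_start: hypothetical helper (not part of the project files).
# Pulls the YYYYMMDDhhmm stamp out of a session file name and prints it
# in a more readable form.
session_start() {
    stamp=${1#session-}      # drop the leading "session-"
    stamp=${stamp%.raw}      # drop the trailing ".raw"
    year=$(echo "$stamp"   | cut -c1-4)
    month=$(echo "$stamp"  | cut -c5-6)
    day=$(echo "$stamp"    | cut -c7-8)
    hour=$(echo "$stamp"   | cut -c9-10)
    minute=$(echo "$stamp" | cut -c11-12)
    echo "$year/$month/$day $hour:$minute"
}

session_start session-201211020309.raw
```

For the file name used in the example analysis, this should print 2012/11/02 03:09.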
=====Task=====
With the provided data files, I'd like for you to do the following (be sure to provide the commands used for each, as well as the answer you got):

  * determine the number of data packets in each file
  * determine the total time elapsed in the session file
  * determine the total time in a sleep state (not undefined, not conscious)
  * find a data packet during a time of rem or deep sleep that stores the complete timestamp, and:
    * extract that packet from the pertinent data file (provide command)
    * what is the timestamp (as a 32-bit value)?
    * what is the calendar date and time of that timestamp, when appropriately translated?
  * which file had the most deep sleep?
    * how much took place?
    * how did you figure this out?
    * what was the approximate time?

=====Useful tools=====
You may want to become familiar with the manual pages of the following tools (in addition to tools you've already encountered):

  * **dd**(1)
  * **bc**(1)
  * **od**(1) - as I've said to others, **od** is like **cat**, but for binary data
  * **bvi**(1)
  * **hexedit**(1)
  * **grep**(1) - can be contorted to cooperate
  * **date**(1) - might be useful for time/date manipulations
  * **bgrep** (see below for usage)

... along with other tools previously encountered.

====bgrep====
To assist you with this project, a special "binary grep" has been deployed on the system, called **bgrep**.

**bgrep** searches for patterns among binary data read from STDIN. It accepts bytes of data written as hex digits, space-separated or not, and allows the use of '.' to match any hex digit (remember, it takes 2 hex digits to make up a byte).

===Example Usage===
Let's say you wanted to search for the consecutive bytes 0x12 and 0x34 within a binary file:

  $ cat session-201302200614.raw | bgrep '12 34'
  533b:12 34
  29af3:12 34
  29dff:12 34
  29f85:12 34
  2a8a9:12 34
  2aa2f:12 34
  2abb5:12 34
  2aec1:12 34
  2b353:12 34
  $

What you see are the addresses (in hex) that denote the start of each occurrence of the requested pattern (0x12 immediately followed by 0x34).
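Since **bgrep** reports addresses in hex, while the **dd** extraction in the example analysis used a decimal offset, a conversion step sits between the two. Besides the **bc** approach shown earlier, the shell's **printf** can do the conversion. The function name **hex_to_dec** below is made up for this sketch.

```shell
#!/bin/sh
# hex_to_dec: hypothetical helper (not part of the project files).
# Converts a hex address (as printed by bgrep, without a 0x prefix)
# into decimal, e.g. for use with dd's skip= option.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

hex_to_dec 533b     # first match in the listing above
hex_to_dec 259E     # packet start from the example analysis
```

These should print 21307 and 9630; the second agrees with the value obtained from **bc** in the example analysis.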
If you wanted 0x12 followed by any byte, followed by 0x45, you'd do:

  $ cat session-201302200614.raw | bgrep '12 .. 45'
  3326:12 e0 45
  $

In this case, there is only one such match in the entire file.

The '.' pattern can also be applied to only part of a byte: 0x12 0xe# (we don't care what the lower-order 4 bits are, but the upper 4 bits of the second byte MUST be 0xe):

  $ cat session-201302200614.raw | bgrep '12 e.'
  1cf4:12 ee
  206d:12 e0
  3325:12 e0
  3907:12 e0
  4077:12 e0
  4795:12 e0
  50a1:12 e0
  552b:12 e0
  5edb:12 e0
  73e7:12 e0
  81b9:12 e0
  8df9:12 e0
  8fcf:12 e0
  aae3:12 e0
  aae7:12 e0
  b859:12 e0
  3415c:12 e9
  4e11f:12 e0
  6bd5b:12 ed
  796f7:12 e0
  7b877:12 e0
  7d3df:12 e0
  7e7e1:12 e0
  7e7f5:12 e0
  7ecf7:12 e0
  $

We can see variations in the lower 4 bits of the second byte, all of which match our desired pattern.

Finally, a byte whose upper 4 bits can be anything and whose lower 4 bits must be 0xc, followed by 0x34:

  $ cat session-201302200614.raw | bgrep '.c34'
  91c1:3c 34
  29029:8c 34
  297e5:0c 34
  322d3:ec 34
  6152b:dc 34
  6a683:0c 34
  6ef95:6c 34
  $

Notice that in this last pattern, we opted not to space-separate the bytes... it works either way (output will be space-separated regardless).

This will hopefully prove to be a useful tool in your binary analysis endeavors.

=====Submission=====
Successful completion will result in the following criteria being met:

  * When all is said and done, you will submit:
    * **udr2.text**, containing the answers/responses to all the above questions (including the commands used to pull off the project)

====Submit====
Please submit as follows:

  lab46:~/src/unix/udr2$ submit unix udr2 udr2.text
  Submitting unix project "udr2":
      -> udr2.text(OK)
  
  SUCCESSFULLY SUBMITTED
  lab46:~/src/unix/udr2$