Project: PARSING DATA SETS (pds0)

Errata

This section will document any updates applied to the project since original release:

revision #: <description> (DATESTAMP)

Objective

To review prior programming concepts while starting in a new direction in our programming pursuits.

Overview

The name of the course, Data Communications, contains two important words, and their meanings (both separately and together) are valid topics of exploration for this course:

data: the quantities, characters, or symbols on which operations are performed by a computer, being stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media
communications: the imparting or exchanging of information
data communications: the electronic transmission of encoded information to, from, or between points

Looking at those 3 definitions, some important things come to mind:

we've been dealing with “data” and “communications” for the entirety of our programming experience
- obtaining input from the keyboard is a form of communication involving data
while “networking” has a very popular and contemporary association with “data communications”, and while ALL networking IS data communications, NOT all data communications is networking (in the sense of utilizing specific hardware and software).
even if we utilize networking, there is an underlying resource we need to be quite keen at using and manipulating properly: data

In this course, we will be exploring different ways of manipulating data (some beyond existing experiences, others in entirely new scenarios), which in many ways will be a form of communication, if only within the same program (input to output), and then also scenarios of “data communication” where there is more than one entity involved in the transaction (be it program, computer, etc.).

So we are starting simple, with data: partly reviewing, partly breaking new ground. I've envisioned a sequence of projects that will provide a common theme, hopefully facilitating our explorations a bit.

Program

In the DATACOMM public directory will be a subdirectory called pds0/; in this directory will be 6 files, named dataset0.txt through dataset5.txt.

Please copy these files (or reference them via absolute path) in your program implementation.

These files are datasets containing intraday stock data for various securities, over varying blocks of time.

The format of the files is as follows:

EXCHANGE%3DOTCMKTS
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=60
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-240
a1500903000,0.097,0.097,0.0965,0.0965,53758
1,0.096,0.097,0.096,0.097,102502
2,0.0974,0.099,0.095,0.09525,159489
3,0.099,0.099,0.097,0.0975,238832
...

What we have is some lead-in information, sometimes known as a header, which provides some initial values we can use to calibrate our program logic to better fit the data.

In the example above, the header would be the first 7 lines:

EXCHANGE%3DOTCMKTS
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=60
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-240

What this is basically telling us is:

EXCHANGE: which stock exchange this data pertains to (somewhat unimportant for our current project); note that this line is written with %3D, the URL encoding of "=", in place of a literal "="
MARKET_OPEN_MINUTE, MARKET_CLOSE_MINUTE: the absolute minute from the start of the day when the markets opened and closed (potentially important for what we are doing)
INTERVAL: the interval of data being reported (in units of seconds)
COLUMNS: the overall format of the data (date, close, high, etc.)
DATA: a seemingly unused (maybe reserved?) option
TIMEZONE_OFFSET: a timezone offset (what timezone is this data being reported in?); the value -240 looks like minutes relative to UTC, which would be UTC-4
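As a rough illustration of reading the header, here is a minimal sketch in C. The struct pds_header and the function pds_parse_header_line are hypothetical names invented for this example (they are not part of any provided code), and treating TIMEZONE_OFFSET as minutes is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the header values we care about. */
struct pds_header {
    char exchange[32];
    int  market_open_minute;
    int  market_close_minute;
    int  interval;          /* seconds between records */
    int  timezone_offset;   /* ASSUMED to be minutes relative to UTC */
};

/*
 * Parse one header line into *h.
 * Returns 1 if the line looks like KEY=VALUE (or KEY%3DVALUE),
 * 0 otherwise (for example, when we have reached the data lines).
 */
int pds_parse_header_line(const char *line, struct pds_header *h)
{
    char key[64];
    const char *sep = strstr(line, "%3D");   /* URL-encoded '=' */
    const char *eq  = strchr(line, '=');
    size_t seplen = 3;

    /* Use whichever separator appears first on the line. */
    if (sep == NULL || (eq != NULL && eq < sep)) {
        sep = eq;
        seplen = 1;
    }
    if (sep == NULL)
        return 0;

    size_t klen = (size_t)(sep - line);
    if (klen == 0 || klen >= sizeof key)
        return 0;
    memcpy(key, line, klen);
    key[klen] = '\0';

    const char *value = sep + seplen;

    if (strcmp(key, "EXCHANGE") == 0) {
        /* Copy the value up to (not including) any line ending. */
        size_t vlen = strcspn(value, "\r\n");
        if (vlen >= sizeof h->exchange)
            vlen = sizeof h->exchange - 1;
        memcpy(h->exchange, value, vlen);
        h->exchange[vlen] = '\0';
    } else if (strcmp(key, "MARKET_OPEN_MINUTE") == 0) {
        h->market_open_minute = atoi(value);
    } else if (strcmp(key, "MARKET_CLOSE_MINUTE") == 0) {
        h->market_close_minute = atoi(value);
    } else if (strcmp(key, "INTERVAL") == 0) {
        h->interval = atoi(value);
    } else if (strcmp(key, "TIMEZONE_OFFSET") == 0) {
        h->timezone_offset = atoi(value);
    }
    /* COLUMNS and DATA are recognized as header lines but not stored. */
    return 1;
}
```

One possible use is to call this on each line read with fgets() until it returns 0, at which point the line in hand is the first data line.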

Following the header we have a stanza pertaining to a day, which will kick off with a line like this:

a1500903000,0.097,0.097,0.0965,0.0965,53758

This is effectively kicking off item 0 in the reported interval.

That first field (note that the line is a comma-separated list) is actually an encoded UNIX time value, prefixed with the letter a, which we'll want to decode to report more recognizable date information (YYYY-MM-DD HH:MM).
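One way the decoding might look in C is sketched below. The function name pds_format_date is hypothetical, and the sketch assumes that TIMEZONE_OFFSET is expressed in minutes relative to UTC; under that assumption the sample value a1500903000 with an offset of -240 comes out as 09:30, which lines up with MARKET_OPEN_MINUTE=570.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Decode a DATE field such as "a1500903000" into "YYYY-MM-DD HH:MM".
 * tz_offset_min is the header's TIMEZONE_OFFSET, ASSUMED to be minutes
 * relative to UTC. Returns 0 on success, -1 on failure.
 */
int pds_format_date(const char *field, int tz_offset_min,
                    char *out, size_t outsz)
{
    char *end;

    if (field[0] != 'a')            /* only 'a'-prefixed fields are absolute */
        return -1;

    long long secs = strtoll(field + 1, &end, 10);
    if (end == field + 1)           /* no digits followed the 'a' */
        return -1;

    /* Shift into the reported timezone, then format as if it were UTC. */
    time_t t = (time_t)(secs + (long long)tz_offset_min * 60);
    struct tm *tm = gmtime(&t);
    if (tm == NULL)
        return -1;

    if (strftime(out, outsz, "%Y-%m-%d %H:%M", tm) == 0)
        return -1;
    return 0;
}
```

Calling pds_format_date("a1500903000", -240, buf, sizeof buf) with a 17-byte (or larger) buffer fills buf with 2017-07-24 09:30. Shifting the value by hand and using gmtime() keeps the result independent of the machine's own timezone setting.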

The successive fields correspond, in order, to the values laid out in the COLUMNS option in the header (after DATE comes the prior CLOSE, then the HIGH, the LOW, the OPEN, and finally the VOLUME).

With the exception of DATE and VOLUME, everything else is represented as a decimal cost (you may assume dollars).

Subsequent lines in the stanza are merely offset intervals from the first, for instance:

1,0.096,0.097,0.096,0.097,102502
2,0.0974,0.099,0.095,0.09525,159489
3,0.099,0.099,0.097,0.0975,238832
4,0.0975,0.099,0.097,0.097,21000

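There is no UNIX time value to decode on these lines, merely an offset, counted in units of INTERVAL, to add to that initial UNIX time value (with INTERVAL=60, offset 3 is 180 seconds after the first entry).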
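To make the offset arithmetic concrete, here is a minimal sketch of parsing a single data line in C. The struct pds_record and the function pds_parse_record are hypothetical names for this example; the field order follows the COLUMNS line shown above, and the offset is assumed to count INTERVAL-sized steps from the most recent a-prefixed line.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record layout, following COLUMNS order. */
struct pds_record {
    long long timestamp;   /* absolute UNIX time, in seconds */
    double    close;
    double    high;
    double    low;
    double    open;
    long      volume;
};

/*
 * Parse one data line into *r.
 * *base holds the timestamp from the most recent 'a' line and is
 * updated whenever a new stanza starts. interval is the header's
 * INTERVAL value in seconds.
 * Returns 1 on success, 0 if the line is not a well-formed record.
 */
int pds_parse_record(const char *line, long long *base, int interval,
                     struct pds_record *r)
{
    int absolute = (line[0] == 'a');
    const char *p = absolute ? line + 1 : line;
    long long first;

    int n = sscanf(p, "%lld,%lf,%lf,%lf,%lf,%ld",
                   &first, &r->close, &r->high, &r->low, &r->open,
                   &r->volume);
    if (n != 6)
        return 0;

    if (absolute) {
        *base = first;                 /* start of a new stanza */
        r->timestamp = first;
    } else {
        /* ASSUMPTION: offsets count INTERVAL-sized steps from *base. */
        r->timestamp = *base + first * interval;
    }
    return 1;
}
```

Feeding it the sample lines in order, the line starting with 3 would receive the timestamp 1500903000 + 3 * 60 = 1500903180. Each parsed record could then be appended to an array of struct pds_record.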

Your job is to write a program that, when provided one of these dataset files as a command-line argument, will open and read its contents into memory (I'm leaving the structure of how you store it somewhat flexible for now, but let's just say it may make a whole lot of sense to use a struct to aid in storing this data, perhaps even an array of structs…), and then be able to interactively (perhaps via a menu?) report:

the CLOSE, HIGH, LOW, OPEN, or VOLUME
at a specified time interval (minute, 10 minute, 30 minute, hour, day, 2 days, 5 days)
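The project leaves open exactly how values are combined when the requested interval spans several records. One plausible interpretation, sketched below in C, is to fold every record in a time window into a single summary: OPEN from the first record, CLOSE from the last, the maximum HIGH, the minimum LOW, and the summed VOLUME. The names pds_record and pds_summarize are hypothetical, and this aggregation rule is an assumption, not a stated requirement.

```c
#include <stddef.h>

/* Hypothetical record layout, following COLUMNS order. */
struct pds_record {
    long long timestamp;   /* absolute UNIX time, in seconds */
    double    close;
    double    high;
    double    low;
    double    open;
    long      volume;
};

/*
 * Summarize all records whose timestamp falls in [start, start + span).
 * recs is assumed to be sorted by ascending timestamp.
 * On return, *out holds the summary (its timestamp is that of the first
 * record in the window). Returns the number of records folded in; when
 * that is 0, *out is left untouched.
 */
size_t pds_summarize(const struct pds_record *recs, size_t count,
                     long long start, long long span,
                     struct pds_record *out)
{
    size_t used = 0;

    for (size_t i = 0; i < count; i++) {
        long long t = recs[i].timestamp;
        if (t < start || t >= start + span)
            continue;

        if (used == 0) {
            *out = recs[i];                 /* first record sets OPEN */
        } else {
            if (recs[i].high > out->high)
                out->high = recs[i].high;
            if (recs[i].low < out->low)
                out->low = recs[i].low;
            out->close = recs[i].close;     /* last record sets CLOSE */
            out->volume += recs[i].volume;
        }
        used++;
    }
    return used;
}
```

With this shape, the menu choices map onto span values in seconds (60 for a minute, 600 for 10 minutes, 1800 for 30 minutes, 3600 for an hour, and so on), and the requested column is simply the matching field of the summary.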

Results for now should just be displayed to STDOUT.

Clearly, there are a lot of different directions we can go from here, but for now we're aiming to establish a baseline (can we interact with and parse known data in expected ways?). Once we have that down, we can get into more sophisticated variations.

Submission is via the lab46 submit tool, by the posted deadline, for the source code (able to compile and run without issue on lab46).
