Corning Community College

CSCS2700 Data Communications

~~TOC~~

======Project: PARSING DATA SETS (pds0)======

=====Errata=====
This section will document any updates applied to the project since original release:

  * __revision #__: (DATESTAMP)

=====Objective=====
A simultaneous review of prior programming along with the start of a new direction in our programming pursuits.

=====Overview=====
As we see in the name of the course, **Data Communications**, there are two important words in the title, and their meanings (both separate and together) are valid topics of exploration for this course:

  * **__data__**: the quantities, characters, or symbols on which operations are performed by a computer, being stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media
  * **__communications__**: the imparting or exchanging of information
  * **__data communications__**: the electronic transmission of encoded information to, from, or between points

Looking at those 3 definitions, some important things come to mind:

  * we've been dealing with "data" and "communications" for the entirety of our programming experience
  * obtaining input from the keyboard is a form of communication involving data
  * while "networking" has a very popular and contemporary association with "data communications", and while ALL networking IS data communications, NOT all data communications is networking (in the sense of utilizing specific hardware and software).
  * even if we utilize networking, there is an underlying resource we need to be quite keen at using and manipulating properly: **data**

In this course, we will be exploring different ways of manipulating data (some beyond existing experiences, others in entirely new scenarios), which in many ways will be a form of communication, if only within the same program (input to output), and then also scenarios of "data communication" where there is more than one entity involved in the transaction (be it program, computer, etc.).

So, we are starting simple, with data, somewhat reviewing, although somewhat breaking new ground. I've envisioned a sequence of projects that will provide a common theme, hopefully facilitating our explorations a bit.

=====Program=====
In the **DATACOMM** public directory will be a subdirectory called **pds0/**; in this directory will be 6 files, named **dataset0.txt** through **dataset5.txt**. Please copy these files (or reference them via absolute path) for use in your program implementation.

These files are datasets containing intraday stock data for various securities, over varying blocks of time. The format of the files is as follows:

<code>
EXCHANGE%3DOTCMKTS
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=60
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-240
a1500903000,0.097,0.097,0.0965,0.0965,53758
1,0.096,0.097,0.096,0.097,102502
2,0.0974,0.099,0.095,0.09525,159489
3,0.099,0.099,0.097,0.0975,238832
...
</code>

What we have is some lead-in information, sometimes known as a **header**, which provides some initial values we can use to calibrate our program logic to better fit the data.
In the example above, the header would be the first 7 lines:

<code>
EXCHANGE%3DOTCMKTS
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=60
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-240
</code>

What this is basically telling us is:

  * which stock exchange this data pertains to (somewhat unimportant for our current project)
  * the minute of the day, counted from midnight, at which the markets opened and closed (potentially important for what we are doing; 570 is 09:30 and 960 is 16:00)
  * the interval of data being reported (in units of seconds)
  * the overall format of the data (date, close, high, etc.)
  * a seemingly unused (maybe reserved?) DATA option
  * a timezone offset (what timezone is this data being reported in?)

Following the header we have a stanza pertaining to a day, which will kick off with a line like this:

<code>
a1500903000,0.097,0.097,0.0965,0.0965,53758
</code>

This is effectively kicking off item 0 in the reported interval. That first field (note that each line is a comma-separated list) is actually an encoded UNIX time value, which we'll want to decode to report more recognizable date information (YYYY-MM-DD HH:MM). The successive fields correspond, in order, to the values laid out in the **COLUMNS** option in the header (after DATE comes the prior CLOSE, then the HIGH, the LOW, the OPEN, and finally the VOLUME). With the exception of DATE and VOLUME, everything else is represented as a decimal cost (you may assume dollars).

Subsequent lines in the stanza are merely offset intervals from the first, for instance:

<code>
1,0.096,0.097,0.096,0.097,102502
2,0.0974,0.099,0.095,0.09525,159489
3,0.099,0.099,0.097,0.0975,238832
4,0.0975,0.099,0.097,0.097,21000
</code>

There is no UNIX time value to decode on these lines, merely an offset (counted in intervals) to add to that initial UNIX time value.
Your job is to write a program that, when provided one of these dataset files as a command-line argument, will open and read its contents into memory (I'm leaving the //structure// of how you store it somewhat flexible for now, but let's just say it may make a whole lot of sense to use a **struct** to aid in storing this data, perhaps even an **array** of structs...), and then be able to interactively (perhaps via a menu?) report:

  * at a specified time interval (minute, 10 minute, 30 minute, hour, day, 2 days, 5 days)
  * the CLOSE, HIGH, LOW, OPEN, or VOLUME at that interval

Results for now should just be displayed to STDOUT.

Clearly, there's a lot of different directions we can go from here, but for now we're aiming to establish a baseline (can we interact with and parse known data in expected ways?). Once we have that down, we can get into more sophisticated variations.

Submission is via the lab46 submit tool, by the posted deadline, for the source code (which must compile and run without issue on lab46).