Project: DATA MINING

A project for UNIX/Linux Fundamentals by Christopher M. Caccia during the Fall 2011 semester.

This project was begun on November 17th, 2011 and took 3 hours to complete.

Objectives

The purpose of this project was to take a text file containing business department names, addresses, employee names, phone numbers, and e-mail addresses, and to extract selected fields from it. Everything else in the file had to be discarded so that only the business name, employee name, and e-mail address of each entry were output, in a specific format.

Prerequisites

To perform this project successfully, familiarity with the following commands and concepts is needed:

  • ls command
  • grep command
  • sed command
  • cat command
  • Regular Expressions

Background

This project was attempted because a real-world business was consolidating many text files, in many different formats, into one file with one specific format. It was a practical application of data manipulation and a large exercise in using regular expressions. I extracted specific data and output it to the /tmp directory, where it was merged with the other files.

This process is called data mining, and to most computer users it can seem like magic. The ability to cut and re-order specific information from a text file is extremely useful in the business world. Completing this task manually would require many redundant man-hours; with regular expressions and utilities like grep, sed, and cut it takes only a few hours.

Scope

I will take a text document, extract only specific data from it, and output the extracted data to a new file. This new file will be used to build a new database of business names, employee names, and their respective e-mail addresses for an anonymous company, as a real-world application. There are many documents in many different formats, and each needs to have specific data extracted and then combined into one master file.

Using text processing commands and regular expressions, I will organize the data in the original file so that specific parts of it can be extracted and reused.

Attributes

  • Files and directories: I manipulated text files, redirected extracted data to a new file, and copied that file to another directory.
  • Commands: I used the “ls” and “cat” commands as well as utilities like grep and sed.
  • Text processing: I used utilities like grep and sed to process and manipulate text.
  • The UNIX development environment: I used the “^” character as a field separator.
  • Regular Expressions: I manipulated data using regular expressions with grep and sed.
  • Groups: I output the final text document to a directory accessible to group users.
  • Multitasking: Many files were manipulated at the same time.
  • Filters: I used grep and cut to filter the data for output.

Procedure

I started by viewing the original text file with the “cat” command and then searched for patterns within the text that could be extracted and organized.

I noticed that blank lines separated the clusters of data, and that each piece of information within a cluster was on its own line.

Using regular expressions, grep, and sed, I was able to remove the line endings and place all the data for each cluster on one line. Once this was done, I introduced “^” as a field separator so that the fields within each line were separated only by the ^ character.
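
The flattening step can be sketched in isolation as follows. This is a simplified illustration, not the exact pipeline used in the project: the sample data and the function name flatten_records are hypothetical, and it assumes records are separated by truly empty lines and that neither “^” nor “$” occurs in the data.

```shell
# Hypothetical sample: two records separated by a blank line.
sample='Dept A
Jane Doe
email: jane@example.com

Dept B
John Roe
email: john@example.com'

flatten_records() {
    # 1. mark every empty line with '^'
    # 2. turn every newline into '$', giving one long line
    # 3. turn each '^' marker back into a newline (one record per line)
    # 4. trim the stray '$' at the start and end of each record
    # 5. turn the remaining '$' into the '^' field separator
    sed 's/^$/^/' | tr '\n' '$' | tr '^' '\n' \
        | sed -e 's/^\$//' -e 's/\$$//' | tr '$' '^'
}

printf '%s\n' "$sample" | flatten_records
```

Each record ends up on a single line with its original lines joined by “^”, which makes it possible to address individual fields in the later steps.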

Using regular expressions, I was able to cut out only the fields I needed and then re-order them so that the output would list business name, employee name, and e-mail.
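
As a rough sketch of this kind of selection and re-ordering, sed capture groups can pick fields out of a “^”-separated record and emit them in a new order. The record layout below (name, department, address, phone, e-mail) and the function name reorder_fields are assumptions made for illustration only; the layout of the real file may differ.

```shell
# Hypothetical flattened record: name^department^address^phone^email
record='Jane Doe^Dept A^123 Main St^555-0100^email: jane@example.com'

reorder_fields() {
    # [^^]* matches one whole field (any run of characters except '^').
    # \1 = employee name, \2 = department, \3 = e-mail address;
    # the address and phone fields are matched but not captured.
    sed 's/^\([^^]*\)\^\([^^]*\)\^[^^]*\^[^^]*\^email: \(.*\)$/"\2","\1","\3"/'
}

printf '%s\n' "$record" | reorder_fields
```

Only the captured groups appear in the replacement, so unwanted fields are dropped and the remaining ones can be written in any order.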

After the extracted data was organized the way I needed, I sent a copy of the final outcome to the /tmp directory where the data could be merged with all other files.
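
A minimal sketch of this hand-off step is shown below. The function name publish_copy is hypothetical, and adding group read permission is an assumption based on the goal of making the result available to group users; it is not taken from the original session.

```shell
publish_copy() {
    # $1: finished output file, $2: destination directory
    src=$1
    dest_dir=$2
    cp "$src" "$dest_dir"/ || return 1
    # allow members of the file's group to read the copy
    chmod g+r "$dest_dir/$(basename "$src")"
}

# Example (not run here): publish_copy file.txt /tmp
```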

Execution

lab46:~$ cat "Institute Addresses NLP.txt" | sed 's/^$/^/g' | tr '\n' '$' | tr '^' '\n' | sed 's/^\$//' \
> | sed 's/\$/^1^/' | sed 's/\$/^2^/' | sed 's/\$/^3^/' | sed 's/\$/^4^/' | sed 's/\$/^5^/' \
> | sed 's/\$/^6^/' | sed 's/^\(.*\)\^1\^\(.*\)\^2\^email: \(.*\)\^3\^/\1:\2:\3/g'| sed 's/\^1\^/:/' \
> | sed 's/\^2\^//g' | sed 's/^\(.*\):\(.*\):\(.*\)$/"\1","\2","\3"/g' | grep -v 'none' | \
> sed 's/"[Ee]mail",//g' | grep -v '^$' | sed 's/^\(".*"\),\(".*@.*"\)/\1,"",\2/g' > file.txt
lab46:~$ cp "Institute Addresses NLP.txt" /tmp
lab46:~$
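
For comparison, the same kind of extraction can be sketched with awk's paragraph mode. This is an alternative technique, not the method used above. It assumes each record is a blank-line-separated block whose first line is the business name, whose second line is the employee name, and which contains a line starting with “email: ”; the sample data and the function name extract_with_awk are hypothetical.

```shell
# Hypothetical sample: the second record has no usable address.
records='Dept A
Jane Doe
email: jane@example.com

Dept B
John Roe
email: none'

extract_with_awk() {
    # RS = "" selects paragraph mode: blank lines separate records.
    # FS = "\n" makes every line of a record its own field.
    awk 'BEGIN { RS = ""; FS = "\n" }
    {
        email = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^email: /) email = substr($i, 8)  # skip "email: "
        # drop records with no address or with the placeholder "none"
        if (email != "" && email != "none")
            printf "\"%s\",\"%s\",\"%s\"\n", $1, $2, email
    }'
}

printf '%s\n' "$records" | extract_with_awk
```

Reading whole blocks as records avoids the temporary “$” and “^” markers, at the cost of relying on a fixed line order within each block.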

Reflection

This was an interesting project for me personally. Regular expressions can be overwhelming to learn when manipulating data. This project in particular was very complicated for a new user, but there is really no way to learn unless you dive right in. I now have a much better understanding of how useful data mining skills are and how often they could be applied. This is not a skill that many people have mastered, or even realize is possible.

References

In performing this project, the following resources were referenced:

  • grep
  • cut
  • sed
  • regex

user/ccaccia/portfolio/project1.txt · Last modified: 2011/12/15 15:37 by ccaccia