
Project 5: Helping a Friend

A project for Unix/Linux by Corey Forman during Fall 2011.

This project was begun on 11/17/11 and was finished the same day.

Objectives

The objective is to “data-mine” six folders of information and retrieve the names, emails, and companies from the data. Because this is a large project, it will be divided into smaller portions, each producing output in the same format.

Prerequisites

In order to successfully accomplish/perform this project, the listed resources/experiences need to be consulted/achieved:

  • competent Command Line skills
  • basic text editing
  • competent RegEx use
  • pattern recognition doesn't hurt either

Background

The purpose of this project was to assist a friend in data-mining a large amount of information. I attempted to format the data into the version he needed to turn in to his boss.

Scope

I am going to use RegEx commands to grab and format the necessary data. I will also manually edit some of the data. The focus of the project is the RegEx command used to get the data into the right format.

Attributes

State and justify the attributes you'd like to receive upon successful approval and completion of this project.

  • filter: we are grepping out data that is not needed.
  • regular expressions: we filter by using regular expressions.
  • text processing: we are manipulating text with RegExs.
  • files and directories: we are working with files to get data out of them.
  • security: we had to cp the file to our own directory because it was under someone else's ownership, meaning we could not edit the data in place.
  • command line: we use RegExs on the command line to manipulate data.

Procedure

I copied the file that I needed to data-mine. Next, I formatted the data into a layout that I could easily edit with RegExs. I then tried out various RegEx commands until I got the data I wanted. Finally, I saved that data to a file so it could be transferred back to the tmp file.
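The copy, preview, and save steps above can be sketched roughly as follows. This is only an illustration: the directories "shared" and "mine" are hypothetical stand-ins created in a temporary location, and the single sed expression is a placeholder for whichever expression is being tried out.

```shell
#!/bin/sh
# Hypothetical layout: "$work/shared" stands in for the directory owned by
# another user, "$work/mine" for a directory we own.
work=$(mktemp -d)
mkdir "$work/shared" "$work/mine"
printf 'Arcturus\nEric Schneider\nemail: info@arcturus.be\n' > "$work/shared/InstNLP2.txt"

# 1. Copy the file so that all edits happen on a file we own.
cp "$work/shared/InstNLP2.txt" "$work/mine/InstNLP2.txt"

# 2. Try an expression and look at the first few lines; nothing is saved yet.
sed 's/^email: //' "$work/mine/InstNLP2.txt" | head -n 3

# 3. Once the preview looks right, redirect the output to a new file.
sed 's/^email: //' "$work/mine/InstNLP2.txt" > "$work/mine/InstNLP2Edited.txt"

# Keep the last saved line for inspection, then clean up the scratch area.
last=$(tail -n 1 "$work/mine/InstNLP2Edited.txt")
rm -rf "$work"
```

Previewing with head before redirecting means a wrong expression never overwrites a good result file.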

Execution

lab46:~$ ls
1275799069694.jpg    archivecompilationfile  data                                         mystery                   testdir
250px-P2_glados.jpg  archives                emvideo-youtube-nd2rBWbvDbA_3.jpg            nom-nom-nom-babies.jpg    testdir.tar
Downloads            archives.tar.bz2        error.log                                    public_html               testdir2
InstNLP2.txt         archives.zip            fiddlesticks.jpg                             puzzlebox                 testfile
InstNLP2Edited.txt   bin                     funny-pictures-taco-cat-is-a-palindrome.jpg  shaco.jpg                 tmp
Maildir              cake                    goonies-musical.jpg                          shellscripting            trollin
RageFaceBlackSS.png  closet                  irc                                          spring2012-20111103.html  trolling-400x345.jpg
archive              corningcourses          linktestfile                                 src                       veigar.jpg
archive1.tar.gz      corningcoursesorg       minecraft-creeper-comic-600x694.png          src.orig                  wicked-witch.jpg
archive2.zip         courses                 motd                                         tempfile                  words
lab46:~$
~/src/cprog$ ./hello
Hello, World!
lab46:~/src/cprog$ 

This is the file I was working with, after some text editing. The file name is InstNLP2.txt.

Arcturus België
Eric Schneider
email: info@arcturus.be

Heart Systems n.v. - International Training Institute for Communication and NLP
Paul Liekens
email: Paul.Liekens@hookon.be

InMind
Peter Wrycza and Jan Ardui
email: pwrycza@indosat.net.id

Institut Ressources
Alain Moenaert
email: alain.moenaert@infoboard.be

BrainNet
Dr. Helosio Rodrigues, MD
email: brainet@unisys.com.br

Centro de Aprendizado Linguistico
Wilma Steagall de Tomasso
email: silveira@dialdata.com.br

Conexao Evolving Center of NLP
Getulio Barnasque
email: conexao@pro.via-rs.com.br

The command used to manipulate this data:
 sed 's/^$/^/g' InstNLP2.txt |
   tr '\n' '$' |
   tr '^' '\n' |
   sed 's/-----------/unknown/g' |
   sed 's/^\$\(.*\)\$\(.*\)\$\(.*\)\$$/"\3","\2","\1"/g' |
   sed 's/email: //g' > InstNLP2Edited.txt
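To see what each stage of that pipeline contributes, here is a minimal sketch run against a two-record sample. The sample is an assumption: it is a shortened stand-in for InstNLP2.txt that begins with a blank line and uses the same three-line record layout (company, contact name, email line). The variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical, shortened sample: blank line, company, name, "email: ..." line.
sample=$(mktemp)
printf '\nArcturus\nEric Schneider\nemail: info@arcturus.be\n\nInMind\nPeter Wrycza\nemail: pwrycza@indosat.net.id\n' > "$sample"

# Stage 1: mark every blank line with '^' so record boundaries survive.
# Stage 2: fold every newline into '$', giving one long line.
# Stage 3: turn each '^' back into a newline, leaving one record per line
#          shaped like: $company$name$email: address$
records=$(sed 's/^$/^/' "$sample" | tr '\n' '$' | tr '^' '\n')

# Stage 4: capture the three '$'-delimited fields, emit them quoted and in
#          reverse order, then drop the "email: " label.
csv=$(printf '%s\n' "$records" |
  sed 's/^\$\(.*\)\$\(.*\)\$\(.*\)\$$/"\3","\2","\1"/' |
  sed 's/email: //')

# The first output line is empty because the sample starts with a blank line.
printf '%s\n' "$csv"
rm -f "$sample"
```

The '^' and '$' characters act only as temporary markers, so this approach assumes neither character appears in the data itself.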

The results were as follows:
"info@arcturus.be ","Eric Schneider ","Arcturus België "
"Paul.Liekens@hookon.be ","Paul Liekens ","Heart Systems n.v. - International Training Institute for Communication and NLP "
"pwrycza@indosat.net.id ","Peter Wrycza and Jan Ardui ","InMind "
"alain.moenaert@infoboard.be ","Alain Moenaert ","Institut Ressources "

This data can then be imported into Excel, which recognizes the quoted, comma-separated fields, and turned into a spreadsheet.
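Each field in the results ends with a space before its closing quote. If that matters for the spreadsheet, one possible cleanup is sketched below. This step was not part of the original command; it is an assumption about what a tidy import might want, and it also counts any rows that do not have exactly three fields.

```shell
#!/bin/sh
# Two lines copied from the results, trailing spaces included.
rows='"pwrycza@indosat.net.id ","Peter Wrycza and Jan Ardui ","InMind "
"alain.moenaert@infoboard.be ","Alain Moenaert ","Institut Ressources "'

# Remove any spaces that sit directly before a double quote.
trimmed=$(printf '%s\n' "$rows" | sed 's/ *"/"/g')

# Count rows that do not split into exactly three fields on the "," separator.
bad=$(printf '%s\n' "$trimmed" | awk -F'","' 'NF != 3 { n++ } END { print n + 0 }')

printf '%s\n' "$trimmed"
printf 'rows without exactly 3 fields: %s\n' "$bad"
```

A non-zero count would point at records where a line was missing or extra in the source file, which are the ones worth editing by hand.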

Reflection

Comments/thoughts generated through performing the project, observations made, analysis rendered, conclusions wrought. What did you learn from doing this project?

Data mining can be a useful skill when applying for a job, because most industries today revolve around data. Being able to data-mine can set you apart from other techies.

References

In performing this project, the following resources were referenced:

  • none; in-class information only
user/cforman/portfolio/project5.txt · Last modified: 2011/12/15 21:09 by cforman