user:cforman:portfolio:project5

Differences

This shows you the differences between two versions of the page.

user:cforman:portfolio:project5 [2011/12/16 01:47] – [Scope] cforman
user:cforman:portfolio:project5 [2011/12/16 02:09] (current) – [References] cforman
 +======Project 5: Helping a Friend======
  
 +A project for Unix/Linux by Corey Forman during Fall 2011.
 +
 +This project was begun on 11/17/11 and finished the same day.
 +
 +=====Objectives=====
 +The objective is to "data-mine" six folders of information and retrieve names, emails, and companies from the data. Since this is a large project, it will be divided into smaller portions, each producing the same pattern of data in the end result.
 +
 +=====Prerequisites=====
 +In order to successfully accomplish/perform this project, the listed resources/experiences need to be consulted/achieved:
 +
 +  * competent Command Line skills
 +  * basic text editing
 +  * competent RegEx use
 +  * pattern recognition doesn't hurt either
 + 
 +
 +=====Background=====
 +
 +The purpose of this project was to assist a friend in data-mining a large amount of information. I attempted to format the data into the form he needed to turn in to his boss.
 +=====Scope=====
 +I am going to use RegEx commands to grab and format the necessary data. I will also manually edit some of the data. The focus of the project is the RegEx command used to get the data into the right format.
 +=====Attributes=====
 +State and justify the attributes you'd like to receive upon successful approval and completion of this project.
 +
 +  * filter: we are grepping out data that is not needed.
 +  * regular expressions: we filter by using regular expressions.
 +  * text processing: we are manipulating text with RegExs.
 +  * files and directories: we are working with files to get data out of them.
 +  * security: we had to cp the file to our own directory because it was under someone else's ownership, meaning we could not edit the original.
 +  * command line: we use RegExs on the command line to manipulate data.
 +
 +=====Procedure=====
 +I copied the file that needed to be data-mined.
 +Next, I formatted the data into a layout that I could easily edit with RegExs.
 +I then tried out various RegEx commands until I got the data I wanted.
 +I then saved that data to a file so it could be transferred back to the tmp file.
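
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the temporary directory and the file names `shared.txt`, `mine.txt`, and `mine.edited.txt` are hypothetical stand-ins (the original paths are not recorded on this page), and a read-only file is used to simulate a file owned by someone else.

```shell
#!/bin/sh
# Hypothetical workspace; stands in for the real source and home directories.
workdir=$(mktemp -d)
src="$workdir/shared.txt"

# Stand-in for the friend's data file (made-up sample record).
printf 'Example Co\nJane Doe\nemail: jane@example.com\n' > "$src"
chmod a-w "$src"                 # simulate a file we can read but not edit

# Step 1: copy the file so we work on a copy that belongs to us.
cp "$src" "$workdir/mine.txt"
chmod u+w "$workdir/mine.txt"    # cp keeps the mode bits, so re-enable writing

# Steps 2-3: try an expression and preview the result without changing any file.
sed 's/email: //' "$workdir/mine.txt" | head -n 5

# Step 4: once the output looks right, save it to a new file.
sed 's/email: //' "$workdir/mine.txt" > "$workdir/mine.edited.txt"
```

Previewing with `head` before redirecting to a file keeps each trial cheap and leaves the working copy untouched until the expression is right.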
 +
 +=====Execution=====
 +<cli>
 +lab46:~$ ls
 +1275799069694.jpg    archivecompilationfile  data                                         mystery                   testdir
 +250px-P2_glados.jpg  archives                emvideo-youtube-nd2rBWbvDbA_3.jpg            nom-nom-nom-babies.jpg    testdir.tar
 +Downloads            archives.tar.bz2        error.log                                    public_html               testdir2
 +InstNLP2.txt         archives.zip            fiddlesticks.jpg                             puzzlebox                 testfile
 +InstNLP2Edited.txt   bin                     funny-pictures-taco-cat-is-a-palindrome.jpg  shaco.jpg                 tmp
 +Maildir              cake                    goonies-musical.jpg                          shellscripting            trollin
 +RageFaceBlackSS.png  closet                  irc                                          spring2012-20111103.html  trolling-400x345.jpg
 +archive              corningcourses          linktestfile                                 src                       veigar.jpg
 +archive1.tar.gz      corningcoursesorg       minecraft-creeper-comic-600x694.png          src.orig                  wicked-witch.jpg
 +archive2.zip         courses                 motd                                         tempfile                  words
 +lab46:~$
 +
 +The file I was working with, after some text editing. The file name is InstNLP2.txt.
 +
 +Arcturus België
 +Eric Schneider
 +email: info@arcturus.be
 +
 +Heart Systems n.v. - International Training Institute for Communication and NLP
 +Paul Liekens
 +email: Paul.Liekens@hookon.be
 +
 +InMind
 +Peter Wrycza and Jan Ardui
 +email: pwrycza@indosat.net.id
 +
 +Institut Ressources
 +Alain Moenaert
 +email: alain.moenaert@infoboard.be
 +
 +BrainNet
 +Dr. Helosio Rodrigues, MD
 +email: brainet@unisys.com.br
 +
 +Centro de Aprendizado Linguistico
 +Wilma Steagall de Tomasso
 +email: silveira@dialdata.com.br
 +
 +Conexao Evolving Center of NLP
 +Getulio Barnasque
 +email: conexao@pro.via-rs.com.br
 +
 +The RegEx pipeline used to manipulate this data:
 +sed 's/^$/^/g' InstNLP2.txt | tr '\n' '$' | tr '^' '\n' | sed 's/-----------/unknown/g' | sed 's/^\$\(.*\)\$\(.*\)\$\(.*\)\$$/"\3","\2","\1"/g' | sed 's/email: //g' > InstNLP2Edited.txt
 +
 +The results were as follows:
 +"info@arcturus.be ","Eric Schneider ","Arcturus België "
 +"Paul.Liekens@hookon.be ","Paul Liekens ","Heart Systems n.v. - International Training Institute for Communication and NLP "
 +"pwrycza@indosat.net.id ","Peter Wrycza and Jan Ardui ","InMind "
 +"alain.moenaert@infoboard.be ","Alain Moenaert ","Institut Ressources "
 +
 +This data can then be imported into Excel, which recognizes it as comma-separated values, and turned into a spreadsheet.
 +</cli>
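
The pipeline shown in the execution block can be read one stage at a time. The sketch below wraps the same stages in a function and runs them on a made-up sample. It assumes that records are three lines long, that each record is preceded by a blank line (including the first), and that the characters `^` and `$` never occur in the data. The function name `records_to_csv` and the sample names are hypothetical. The row of dashes is presumably a placeholder for a missing field, which the original command turns into `unknown`.

```shell
#!/bin/sh
# Convert blank-line-separated 3-line records (company, name, email)
# into quoted CSV rows ordered as email, name, company.
records_to_csv() {
    sed 's/^$/^/' |                 # mark every blank line with a '^' placeholder
    tr '\n' '$' |                   # join all lines; '$' now stands for a newline
    tr '^' '\n' |                   # split at the placeholders: one record per line
    sed 's/-----------/unknown/g' | # a row of dashes becomes the word "unknown"
    sed 's/^\$\(.*\)\$\(.*\)\$\(.*\)\$$/"\3","\2","\1"/' |  # reorder and quote fields
    sed 's/email: //'               # drop the "email: " label
}

# Made-up sample input with a leading blank line, as assumed above.
printf '\nExample Co\nJane Doe\nemail: jane@example.com\n\nSample Ltd\n-----------\nemail: john@example.org\n' |
    records_to_csv
```

Because the first record is also preceded by a blank line, the output starts with one empty line. The trailing spaces visible in the results above come from the input lines themselves; adding a stage such as `sed 's/ *$//'` in front of the pipeline would likely remove them.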
 +
 +=====Reflection=====
 +Comments/thoughts generated through performing the project, observations made, analysis rendered, conclusions wrought. What did you learn from doing this project?
 +
 +Data mining can be a useful skill when applying for a job because most industries revolve around data today. Being able to data-mine can set you apart from the rest of the techies.
 +=====References=====
 +In performing this project, the following resources were referenced:
 +
 +  * none; in-class information only