A project for Unix/Linux by Corey Forman during Fall 2011.
This project was begun on 11/17/11 and finished the same day.
The objective is to “data-mine” six folders of information and retrieve names, emails, and companies out of the data. Because this is a large project, it will be divided into smaller portions, each producing the same data layout in the end result.
In order to successfully accomplish/perform this project, the listed resources/experiences need to be consulted/achieved:
The purpose of this project was to assist a friend in data-mining a large amount of information. I attempted to format the data into the form he needed to turn in to his boss.
I am going to use RegEx commands to grab and format the necessary data. I will also manually edit some of the data. The focus of the project is the RegEx command used to get the data into the right format.
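As a small illustration of the kind of RegEx command meant here, the sketch below uses sed capture groups, written as \( ... \), and back-references, written as \1 and \2, to reorder two fields on a line. The helper name swap_fields and the semicolon-separated input are assumptions made for this example only; they are not part of the project data.

```shell
#!/bin/sh
# swap_fields is a hypothetical helper used only for illustration.
# It expects lines of the form "company;name" (an assumed input format)
# and prints them as "name","company".
swap_fields() {
    # \(.*\) captures a field; \1 and \2 refer back to the captures,
    # so writing \2 before \1 swaps the order of the two fields.
    sed 's/^\(.*\);\(.*\)$/"\2","\1"/'
}

printf 'InMind;Peter Wrycza and Jan Ardui\n' | swap_fields
```

Run on that one line, this prints "Peter Wrycza and Jan Ardui","InMind".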
State and justify the attributes you'd like to receive upon successful approval and completion of this project.
I copied the file that needed to be data-mined. Next, I formatted the data into a layout that I could easily edit with RegExes. I then tried out various RegEx commands until I got the data I wanted. Finally, I saved that data to a file so it could be transferred back to the tmp file.
lab46:~$ ls
1275799069694.jpg    archivecompilationfile  data                                         mystery                   testdir
250px-P2_glados.jpg  archives                emvideo-youtube-nd2rBWbvDbA_3.jpg            nom-nom-nom-babies.jpg    testdir.tar
Downloads            archives.tar.bz2        error.log                                    public_html               testdir2
InstNLP2.txt         archives.zip            fiddlesticks.jpg                             puzzlebox                 testfile
InstNLP2Edited.txt   bin                     funny-pictures-taco-cat-is-a-palindrome.jpg  shaco.jpg                 tmp
Maildir              cake                    goonies-musical.jpg                          shellscripting            trollin
RageFaceBlackSS.png  closet                  irc                                          spring2012-20111103.html  trolling-400x345.jpg
archive              corningcourses          linktestfile                                 src                       veigar.jpg
archive1.tar.gz      corningcoursesorg       minecraft-creeper-comic-600x694.png          src.orig                  wicked-witch.jpg
archive2.zip         courses                 motd                                         tempfile                  words
lab46:~$

lab46:~/src/cprog$ ./hello
Hello, World!
lab46:~/src/cprog$

The file I was working with, after some text editing. The file name is InstNLP2.txt.

Arcturus België
Eric Schneider
email: info@arcturus.be

Heart Systems n.v. - International Training Institute for Communication and NLP
Paul Liekens
email: Paul.Liekens@hookon.be

InMind
Peter Wrycza and Jan Ardui
email: pwrycza@indosat.net.id

Institut Ressources
Alain Moenaert
email: alain.moenaert@infoboard.be

BrainNet
Dr. Helosio Rodrigues, MD
email: brainet@unisys.com.br

Centro de Aprendizado Linguistico
Wilma Steagall de Tomasso
email: silveira@dialdata.com.br

Conexao Evolving Center of NLP
Getulio Barnasque
email: conexao@pro.via-rs.com.br

The RegEx used to manipulate this data:

cat InstNLP2.txt | sed 's/^$/^/g' | tr '\n' '$' | tr '^' '\n' | sed 's/-----------/unknown/g' | sed 's/^\$\(.*\)\$\(.*\)\$\(.*\)\$$/"\3","\2","\1"/g' | sed 's/email: //g' > InstNLP2Edited.txt

The results were as follows:

"info@arcturus.be ","Eric Schneider ","Arcturus België "
"Paul.Liekens@hookon.be ","Paul Liekens ","Heart Systems n.v. - International Training Institute for Communication and NLP "
"pwrycza@indosat.net.id ","Peter Wrycza and Jan Ardui ","InMind "
"alain.moenaert@infoboard.be ","Alain Moenaert ","Institut Ressources "

This data can then be imported into Excel, recognized as data, and turned into a spreadsheet.
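To make the one-line command above easier to follow, here is a minimal sketch of the same pipeline with one stage per line, reading from standard input. The function name records_to_csv is hypothetical. Two stages are assumptions that are not in the original command: the first sed expression, which trims trailing whitespace so it does not end up inside the quoted fields, and the final grep, which keeps only the quoted record lines. The sketch also assumes that records are three lines long (company, name, email) and separated by blank lines, with a blank line before the first record.

```shell
#!/bin/sh
# records_to_csv is a hypothetical name; the stages follow the command above,
# except where a comment marks a stage as an addition.
records_to_csv() {
    sed -e 's/[[:space:]]*$//' -e 's/^$/^/' |  # addition: trim trailing blanks; then mark blank lines with ^
    tr '\n' '$' |                              # join everything into one line, $ marks each old line end
    tr '^' '\n' |                              # split at every ^, giving one record per line
    sed 's/-----------/unknown/g' |            # runs of dashes become "unknown", as in the original
    sed 's/^\$\(.*\)\$\(.*\)\$\(.*\)\$$/"\3","\2","\1"/' |  # reorder to "email","name","company"
    sed 's/email: //' |                        # drop the "email: " label
    grep '^"'                                  # addition: keep only the quoted record lines
}

# Two records taken from InstNLP2.txt, fed in through a here-document.
records_to_csv <<'EOF'

InMind
Peter Wrycza and Jan Ardui
email: pwrycza@indosat.net.id

Institut Ressources
Alain Moenaert
email: alain.moenaert@infoboard.be

EOF
```

Without the final grep, the output would also contain an empty first line and a last line holding a lone $, which would otherwise have to be removed by hand before importing the file.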
Comments/thoughts generated through performing the project, observations made, analysis rendered, conclusions wrought. What did you learn from doing this project?
Data mining can be a useful skill when applying for a job, because most industries revolve around data today. Being able to data-mine can set you apart from the rest of the techies.
In performing this project, the following resources were referenced: