projects
- intro (due 20140903)
- resume (due 20140903)
- notes (on-going)
- archives (due 20140917)
- puzzlebox (due 20140924)
- puzzlebox2 (due 20141001)
- dataproc (due 20141022)
- statuscalc (due 20141119)
- timeonline (due 20141218-172959)
Corning Community College
UNIX/Linux Fundamentals
Case Study 0x1: Archive Handling
To become familiar with archives, their purpose, and how to create and extract them.
UNIX has been around since the beginning of time (or at any rate since the start of the UNIX Epoch: 00:00:00 UTC on January 1, 1970), and magnetic tapes were in widespread use in that era. So, in common UNIX tradition, we have a utility on the system that allows us to create archives. The name of the UNIX tar utility is short for “tape archive”; it allows us to combine a set of files into one long stream of data for easy storage or transportation.
As time went by, it was realized that to better utilize our resources, we could come up with methods of compressing the data, so we could in essence fit more in the same amount (or less) of space. There are many forms of compression in existence. For this course, we will rely on one of the most popular (but not necessarily best compressing): GNU zip, available to us in the gzip and gunzip utilities.
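To make the idea of fitting more into less space concrete, here is a minimal sketch using Python's standard gzip module, which implements the same compression format as the gzip utility. The sample data is made up for illustration, and the sketch is not a substitute for learning the command-line tools.

```python
import gzip

# Hypothetical sample data: repetitive text compresses well.
original = b"UNIX archives are handy. " * 200

# Compress the bytes in memory, then decompress them again.
compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

print(len(original), "bytes before compression")
print(len(compressed), "bytes after compression")
print("round trip intact:", restored == original)
```

How much space is saved depends heavily on the data: repetitive text shrinks a great deal, while data that is already compressed barely shrinks at all.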
Archives have been around forever. They provide an easy way to keep a bunch of files in one place to send to a backup device or to send to another computer.
The advantages of backups are tremendous. In the early days, magnetic tapes were used to back up critical information (in fact, for the most part they are still the de facto large-volume backup medium). Tapes are a linear storage medium: there is a beginning and an end. The tape head (which can read/write information) can move (or “seek”) between the starting and ending points of the tape at a fixed speed. A representation is as follows:
On this “tape”, there is a fixed number, “n”, of cells, each of which can store a block of data. Of the files shown (F1, F2, F3, and Fn), F1 takes up 3 blocks on the tape (cells 0, 1, and 2). F2 takes up a single block, as does F3.
For a tape where the head is positioned at cell 0, if we wanted to extract file F3, we would have to seek past F1 (all 3 blocks) and F2 before we get to the beginning of file F3. And what about Fn, the last file on the tape? We would have to seek through the entire tape until we get to the end. How would this affect access time for the files? How does this compare to a hard disk or RAM, which is more of a random access medium (or at least less restricted by its linearity)?
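The seeking cost described above can be sketched with a small simulation. The layout below is an assumption based on the example: F1 in cells 0 through 2, followed directly by F2 and then F3. The function name `blocks_to_seek` is made up for this illustration.

```python
# Hypothetical tape layout: each entry is (file name, length in blocks).
# Assumes the files sit back to back, starting at cell 0.
tape = [("F1", 3), ("F2", 1), ("F3", 1)]

def blocks_to_seek(tape, wanted, head=0):
    """Return how many blocks lie between the head and the start of `wanted`."""
    position = 0
    for name, length in tape:
        if name == wanted:
            return abs(position - head)
        position += length  # skip over this file's blocks
    raise KeyError(wanted)

for name, _ in tape:
    print(name, "is", blocks_to_seek(tape, name), "blocks from the head")
```

The further down the tape a file sits, the more blocks the head must pass first. On a random access medium the cost does not grow with position in this way.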
The other property of the tape archive is that all our files are now combined into one long (linear) stream of data. Archives in general have this property: the files are lumped together one after another, along with some record of where each one sits. Some formats keep a table of file offsets; tar instead writes a small header block immediately before each file's data.
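To see where each file lands inside a tar stream, here is a minimal sketch using Python's standard tarfile module. It builds a small archive in memory and reports the byte offset of each member. The file names and contents are hypothetical, chosen only for illustration.

```python
import io
import tarfile

# Hypothetical file names and contents, used only for illustration.
files = {"a.txt": b"first file\n", "b.txt": b"second file, a bit longer\n"}

# Build a tar archive in memory instead of on a tape or disk.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

# Read the archive back and report where each member sits in the stream.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as archive:
    members = archive.getmembers()
    for member in members:
        print(member.name, "header at byte", member.offset,
              "data at byte", member.offset_data, "size", member.size)
```

Each member's data begins after its own header, and the second member begins only after the first one ends, which is the linear layout described above.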
In addition to their obvious benefit of backing up data, archives also are useful in organizing your files into a single location. For example, developers of large software projects (such as the Linux kernel or the Apache project) do not program everything into a single source file. Not only would this be extremely impractical, but it would undoubtedly be tedious to read through. Instead, developers create lots of small files, all of which make up the whole project.
Now, let's say you want to download a version of one of these software projects. The Linux kernel is probably composed of tens of thousands of source files. The last thing you would want to do is download 20,000 files individually just to be able to compile your own version.
To get around this, archives are used to collect everything up into a single file. Now all you have to do is download that single archive, then extract it to obtain all the individual files. Quite efficient.
As it turns out, many archive formats have appeared over the years. They vary in how the data is encoded, and some even integrate a compression algorithm, so different systems have ended up with many different ways to archive.
Although this is not a definitive guide, some popular archive formats per platform are:
Operating System | Popular Archive Format |
---|---|
UNIX+ | tar |
DOS/Windows | zip |
MacOS classic++ | binhex |
MacOS X | dmg |
AtariDOS | arc |
Debian GNU/Linux+++ | deb |
RedHat/SuSE/Fedora | rpm |
+ UNIX archives are typically also compressed with gzip, bzip2, or perhaps even compress. However, many UNIX vendors also provide some sort of package management system to handle system-specific archives.
++ MacOS classic archives are typically also compressed with StuffIt.
+++ While Debian GNU/Linux (and any other Linux distribution for that matter) is for all intents and purposes a UNIX clone, its common archives, or packages, are in a custom format for use with its particular package management system.
It has been said “the great thing about standards is that there are so many of them”. This is true even of archive formats. While there may be several different archive formats in existence, there is often a justifiable reason for having them.
Oftentimes, an advancement in technique or a new and improved compression algorithm is discovered. New systems often try to adopt newer technologies, not only to distinguish themselves from their predecessors, but also to offer genuine improvements to the users of that particular system.
Time to put your skills to the test.
1. | From the archives/ subdirectory of the UNIX public directory (/var/public/unix/archives): | |
---|---|---|
a. | Copy the archive1.tar.gz and archive2.zip files to your home directory. | |
b. | How did you do this? |
2. | Using your book or the man pages: | |
---|---|---|
a. | Determine how to extract both archives. They will both extract into the same directory: archives/ | |
b. | Record the commands and incantations used to extract both archives. Where did you find this information? | |
c. | Go into the archives/ directory and rename abc.txt to: def.text | |
d. | Return to the parent of the archives/ directory (most likely the base of your home directory). |
3. | Using the available resources: | |
---|---|---|
a. | Create a tar archive of the archives/ directory and contents. | |
b. | Name the archive: arc.tar | |
c. | How did you do this? |
4. | Using gzip: | |
---|---|---|
a. | Compress the tar archive. | |
b. | Be sure to use maximum compression. | |
c. | The resulting file should be: arc.tar.gz | |
d. | Place the resulting file into your ~/src/unix/ directory and add/commit it to your subversion repository. |
Being familiar with archiving can help in organizing your own data, as well as packaging data to share with others.
DOS and Windows systems use the ZIP archive format. However, creating a ZIP archive involves only one step, as opposed to tar'ing and then gzip'ing files on a UNIX system.
5. | Thinking on this: | |
---|---|---|
a. | Does ZIP actually work fundamentally differently than tar + gzip? Explain. | |
b. | Do compressed archives make more sense now that you can see the process behind them? |
Better understanding the concepts behind the tools we use has many advantages. Not only can we better use the tools with the tasks at hand, but it can also help us to creatively solve other problems.
This assignment has activities which you should tend to. As a result, you should be proficient at logging onto Lab46 and utilizing the concepts in this lab. You should also document/summarize the knowledge learned on your Opus.
As always, the class mailing list and class IRC channel are available for assistance, but not answers.