Corning Community College


UNIX/Linux Fundamentals



Case Study 0x2: Archive Handling

Objective

To become familiar with archives, their purpose, and how to create and extract them.

History

UNIX has been around since the beginning of time (or at any rate since the start of the UNIX Epoch: 00:00:00 UTC on January 1, 1970), and magnetic tapes were in widespread use during that era. So, in common UNIX tradition, we have a utility on the system that allows us to create archives. The UNIX tar utility, whose name is short for tape archive, allows us to combine a set of files into one long stream of data for easy storage or transportation.

As time went by, it was realized that to better utilize our resources, we could come up with methods of compressing the data, so we could in essence fit more in the same amount (or less) of space. There are many forms of compression in existence. For this course, we will rely on one of the most popular (but not necessarily best compressing): GNU zip, available to us in the gzip and gunzip utilities.
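To make the idea of compression concrete, here is a minimal sketch using Python's standard gzip module, which reads and writes the same format as the gzip and gunzip utilities. The helper name compression_report and the sample text are made up for illustration; how much space is actually saved depends heavily on how repetitive the data is.

```python
import gzip
from typing import Tuple


def compression_report(data: bytes) -> Tuple[int, int]:
    """Return (original size, compressed size) in bytes after gzip compression."""
    compressed = gzip.compress(data)
    # Compression here is lossless: decompressing gives back the exact input.
    assert gzip.decompress(compressed) == data
    return len(data), len(compressed)


if __name__ == "__main__":
    # Hypothetical sample: highly repetitive text, which compresses very well.
    sample = b"UNIX is fun. " * 1000
    original, packed = compression_report(sample)
    print(f"original: {original} bytes, compressed: {packed} bytes")
```

Data that is already compressed or close to random will shrink far less, and can even grow slightly.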

Background

Archives have been around forever. They provide an easy way to keep a bunch of files in one place to send to a backup device or to send to another computer.

The advantages of backups are tremendous. In the early days, magnetic tapes were used to back up critical information (in fact, for the most part they still are the de facto large-volume backup medium). Tapes are a linear storage medium- that is, there is a beginning and an end. The tape head (which can read/write information) can move (or “seek”) between the starting and ending points of the tape at a fixed speed. A representation is as follows:

Illustration of a linear data tape

On this “tape”, there is a fixed number, “n”, of cells, each of which can store one block of data. In our example, of the files shown (F1, F2, F3, and Fn), F1 takes up 3 blocks on the tape (cells 0, 1, and 2). F2 takes up a single block, as does F3.

For a tape where the head is positioned at cell 0, if we wanted to extract file F3, we would have to seek past F1 (all 3 blocks) and F2 before reaching the beginning of file F3. And what about Fn, the last file on the tape? We would have to seek through the entire tape until we get to the end. How would this affect access time for the files? How does this compare to a hard disk or RAM, which is more of a random access medium (or at least less restricted by its linearity)?
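The seek cost described above can be sketched as a small model. This is only an illustration, not how a real tape drive is programmed: the function name blocks_to_skip and the list-of-tuples layout are assumptions, and the block counts are the ones given for F1, F2, and F3 in the example.

```python
from typing import Dict, List, Tuple


def blocks_to_skip(tape: List[Tuple[str, int]]) -> Dict[str, int]:
    """Map each file name to the number of blocks the head must pass,
    starting from cell 0, before it reaches that file's first block."""
    skipped = {}
    position = 0
    for name, length in tape:
        skipped[name] = position  # everything stored earlier must be passed
        position += length
    return skipped


if __name__ == "__main__":
    # Layout from the example: F1 occupies 3 blocks, F2 and F3 one block each.
    tape = [("F1", 3), ("F2", 1), ("F3", 1)]
    for name, cost in blocks_to_skip(tape).items():
        print(f"{name}: seek past {cost} block(s)")
```

In this model the cost of reaching a file grows with everything stored ahead of it, which is the property to keep in mind when answering the questions above.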

The other property of the tape archive is that all our files are now combined into one long (linear) stream of data. Archives in general have this property- the archive carries some sort of bookkeeping information (a table of file offsets or, as in tar, a header in front of each file) that identifies where each file sits in the stream, and the file contents are all lumped together one after another.
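To see this linear layout for yourself, the sketch below uses Python's standard tarfile module to build a tiny tar stream in memory and report where each member's header and data begin. The helper names build_tar and member_offsets and the sample file names are invented for this illustration; tar itself works in 512-byte blocks, so the offsets come out as multiples of 512.

```python
import io
import tarfile
from typing import Dict, List, Tuple


def build_tar(files: Dict[str, bytes]) -> bytes:
    """Pack a name -> contents mapping into an uncompressed tar stream."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def member_offsets(tar_bytes: bytes) -> List[Tuple[str, int, int]]:
    """Return (name, header offset, data offset) for each member, in order."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        return [(m.name, m.offset, m.offset_data) for m in archive.getmembers()]


if __name__ == "__main__":
    # Hypothetical sample files, used only to have something to pack.
    stream = build_tar({"a.txt": b"hello", "b.txt": b"world!"})
    for name, header_at, data_at in member_offsets(stream):
        print(f"{name}: header at byte {header_at}, data at byte {data_at}")
```

Each file's data sits directly after its own header, and the next file follows after that, just like the cells on the tape.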

In addition to their obvious benefit of backing up data, archives also are useful in organizing your files into a single location. For example, developers of large software projects (such as the Linux kernel or the Apache project) do not program everything into a single source file. Not only would this be extremely impractical, but it would undoubtedly be tedious to read through. Instead, developers create lots of small files, all of which make up the whole project.

Now, let's say you want to download a version of one of these software projects. The Linux kernel is probably composed of tens of thousands of source files. The last thing you would want to do is download 20,000 files just to be able to compile your own version.

To get around this, archives are used to collect everything up into a single file. Now all you have to do is download that single archive, then extract it to obtain all the individual files. Quite efficient.

Archives

As it turns out, many archive formats have appeared over the years. They vary in everything from the way the data is encoded to whether some sort of compression algorithm is built in, which leaves many different ways for many different computers to archive.

Although not a definitive guide, some popular archive formats per platform are:

Operating System       Popular Archive Format
UNIX+                  tar
DOS/Windows            zip
MacOS classic++        binhex
MacOS X                dmg
AtariDOS               arc
Debian GNU/Linux+++    deb
RedHat/SuSE/Fedora     rpm

+ UNIX archives are typically also compressed with gzip, bzip2, or perhaps even compress. However, many UNIX vendors also provide some sort of package management system to handle system-specific archives.

++ MacOS classic archives are typically also compressed with StuffIt.

+++ While Debian GNU/Linux (and any other Linux distribution for that matter) is for all intents and purposes a UNIX clone, its common archives, or packages, are in a custom format for use with its particular package management system.

It has been said “the great thing about standards is that there are so many of them”. This is true even in archive formats. While there may be several different formats of archives in existence, there is often a justifiable reason for having them.

Oftentimes, a technique is advanced or a new and improved compression algorithm is discovered. New systems often try to adopt newer technologies, not only to distinguish themselves from their predecessors, but to offer genuine improvements to the users of that particular system.

Exercise

Time to put your skills to the test.

1. From the cs2/ subdirectory of the UNIX public directory (/var/public/unix/cs2):
   a. Copy the archive1.tar.gz and archive2.zip files to your home directory.
   b. How did you do this?
2. Using your book or the man pages:
   a. Determine how to extract both archives. They will both extract into the same directory: archives/
   b. Record the commands and incantations used to extract both archives. Where did you find this information?
   c. Go into the archives/ directory and rename abc.txt to: def.text
   d. Return to the parent of the archives/ directory (most likely the base of your home directory).
3. Using the available resources:
   a. Create a tar archive of the archives/ directory and contents.
   b. Name the archive: arc.tar
   c. How did you do this?
4. Using gzip:
   a. Compress the tar archive.
   b. Be sure to use maximum compression.
   c. The resulting file should be: arc.tar.gz
   d. Place the resulting file into your ~/src/unix/ directory and add/commit it to your Subversion repository.

Being familiar with archiving can help in organizing your own data, as well as packaging data to share with others.

Concepts

DOS and Windows systems commonly use the ZIP archive format. Creating a ZIP archive involves only one step, however, as opposed to tar'ing and then gzip'ing files on a UNIX system.
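The difference in the number of steps can be sketched with Python's standard tarfile, gzip, and zipfile modules. This is only an illustration of the two workflows, not a replacement for the command-line tools used in the exercise; the function names and the in-memory approach are assumptions made to keep the example self-contained.

```python
import gzip
import io
import tarfile
import zipfile
from typing import Dict


def tar_then_gzip(files: Dict[str, bytes]) -> bytes:
    """Step 1: bundle the files with tar. Step 2: compress the whole bundle."""
    bundle = io.BytesIO()
    with tarfile.open(fileobj=bundle, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return gzip.compress(bundle.getvalue())


def zip_in_one_step(files: Dict[str, bytes]) -> bytes:
    """Bundle and compress in a single pass using the ZIP format."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical sample contents, chosen only so there is something to pack.
    sample = {"abc.txt": b"alpha " * 200, "notes.txt": b"beta " * 200}
    print("tar + gzip:", len(tar_then_gzip(sample)), "bytes")
    print("zip:       ", len(zip_in_one_step(sample)), "bytes")
```

Both functions start from the same files and end with a single compressed file; what differs is where the bundling and the compressing take place.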

5. Thinking on this:
   a. Does ZIP actually work fundamentally differently from tar + gzip? Explain.
   b. Do compressed archives make more sense now that you can see the process behind them?

Better understanding the concepts behind the tools we use has many advantages. Not only can we better use the tools with the tasks at hand, but it can also help us to creatively solve other problems.

Submission

In addition to the responses to the various questions, be sure to submit the archive file you have created in this assignment.

6. Additionally, please do the following:
   a. Run md5sum on your compressed archive.
   b. What is (copy and paste) the output of this?
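For context on what a checksum tool such as md5sum computes, here is a minimal sketch using Python's standard hashlib module. The helper names md5_hex and md5_of_file are invented for this illustration; for the assignment itself, use the md5sum utility as instructed.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so a large archive need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Changing even one byte of the input gives a different digest.
    print(md5_hex(b"archive contents"))
    print(md5_hex(b"archive content5"))
```

Because the digest depends on every byte of the file, it gives a compact fingerprint of exactly which archive was submitted.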

Conclusions

All questions in this assignment require an action or response. Please organize your responses into an easily readable format and submit the final results to your instructor.

Your assignment is expected to be performed and submitted in a clear and organized fashion- messy or unorganized assignments may have points deducted. Be sure to adhere to the submission policy.

The successful results of the following actions will be considered for evaluation:

http://lab46.corning-cc.edu/haas/content/unix/submit.php?cs2