Chapter 4. Files and Directories in Unix

Now that you've set up your workstation, let's spend some time talking about how to get around in a Unix system. In this chapter, we introduce basic Unix concepts, including the structure of the filesystem, file ownership, and commands for moving around the filesystem and creating files.^[*] Another important focus of this chapter, however, is the approach you should take to organizing your research data so that it can be accessed efficiently by you and by others.

Filesystem Basics

All computer filesystems, whether on Unix systems or desktop PCs, are basically the same. Files are named locations on the computer's storage device. Each filename is a pointer to a discrete object with a beginning and end, whether it's a program that can be executed or simply a set of data that can be read by a program. Directories or folders are containers in which files can be grouped. Computer filesystems are organized hierarchically, with a root directory that branches into subdirectories and subdirectories of subdirectories.

This hierarchical system can help organize and share information, if used properly. Like the taxonomy of species developed by the early biologists, your file hierarchy should organize information from the general level to the specific. Each time the filesystem splits into subdirectories, it should be because there are meaningful divisions to be created within a larger class of files.

Why should you organize your computer files in a systematic, orderly way? It seems like an obvious question with an obvious answer. And yet, a common problem faced by researchers and research groups is failure to share information effectively. Problems with information management often become apparent when a research group member leaves, and others are required to take over his project.

Imagine you work with a colleague who keeps all his books and papers piled in random stacks all over his office. Now imagine that your colleague gets a new job and needs to depart in a hurry—leaving behind just about everything in his office. Your boss tells you that you can't throw away any of your colleague's papers without looking at them, because there might be something valuable in there. Your colleague has not organized or categorized any of his papers, so you have to pick up every item, look at it, determine if it's useful, and then decide where you want to file it. This might be a week's work, if you're lucky, and it's guaranteed to be a tough job.

This kind of problem is magnified when computer files are involved. First of all, many highly useful files, especially binaries of programs, aren't readable as text files by users. Therefore, it's difficult to determine what these files do if they're not documented. Other kinds of files, such as files of numerical data, may not contain useful header information. Even though they can be read as text, it may be next to impossible to figure out their purpose.

Second, space constraints on computer system usage are much more nebulous than the walls of an office. As disk space has become cheaper, it's become easier for users of a shared system simply never to clean up after themselves. Many programs produce multiple output files and, if there's no space constraint that forces you to clean up while running them, can produce a huge mess in a short time.

How can you avoid becoming this kind of problem for your colleagues? Awareness of the potential problems you can cause is the first step. You need to know what kinds of programs and files you should share with others and which you should keep in your own directories. You should establish conventions for naming datafiles and programs and stick to these conventions as you work. You should structure your filesystem in a sensible hierarchy. You should keep track of how much space you are using on your computer system and create usable archives of your data when you no longer need to access it frequently. You should create informative documentation for your work within the filesystem and within programs and datafiles.

The nature of the filesystem hierarchy means that you already have a powerful indexing system for your work at your fingertips. It's possible to do computer-based research and be just as disorganized as that coworker who piles all his books and papers in random stacks all over his office. But why would you want to do that? Without much more effort, you can use your computer's filesystem to keep your work organized.

Moving Around the Directory Hierarchy

As in all modern operating systems, the file hierarchy on a Unix system is structured as a tree. You may be used to this from PC operating systems. Open one folder, and there can be files and more folders inside it, layered as deep as you want to go. There is a root directory, designated as /. The root directory branches into a finite number of files and subdirectories. On a well-organized system, each of these subdirectories contains files and other subdirectories pertaining to a particular topic or system function.

Of course, there's nothing inside your computer that really looks like a tree. Files are stored on various media—most commonly the hard disk, which is a recordable device that lives in your computer. As its name implies, the hard disk is really a disk. And the tree structure that you perceive in Unix is simply a way of indexing what is on that disk or on other devices such as CDs, floppy disks, and Zip disks, or even on the disks of every machine in a group of networked computers. Unix has extensive networking capabilities that allow devices on networked computers to be mounted on other computers over the network. Using these capabilities, the filesystems of several networked computers can be indexed as if they were one larger, seamless filesystem.

Paths to Files and Directories

Each file on the filesystem can be uniquely identified by a combination of a filename and a path. You can reference any file on the system by giving its full name, which begins with a / indicating the root directory, continues through a list of subdirectories (the components of the path) and ends with the filename. The full name, or absolute path, of a file in someone's home directory might look like this:

/home/jambeck/mustelidae/weasels.txt

The absolute path describes the relationship of the file to the root directory, /. Each name in the path represents a subdirectory of the prior directory, and / characters separate the directory names.

Every file or directory on the system can be named by its absolute path, but it can also be named by a relative path that describes its relationship to the current working directory. Files in the directory you are in can be uniquely identified just by giving the filename they have in the current working directory. Files in subdirectories of your current directory can be named in relation to the subdirectory they are part of. From jambeck's home directory, he can uniquely identify the file weasels.txt as mustelidae/weasels.txt. The absence of a preceding / means that the path is defined relative to the current directory rather than relative to the root directory.

If you want to name a directory that is on the same level as or above the current working directory, there is a shorthand for doing so. Each directory on the system contains two links, ./ and ../, which refer to the current directory and its parent directory (the directory it's a subdirectory of), respectively. If user jambeck is working in the directory /home/jambeck/mustelidae/weasels, he can refer to the directory /home/jambeck/mustelidae/otters as ../otters. A subdirectory of a directory on the same level of the hierarchy as /home/jambeck/mustelidae would be referred to as ../../didelphiidae/opossums.

Another shorthand naming convention, which is implemented in the popular csh and tcsh shell environments (and in bash), is that the path of the home directory can be abbreviated as ~. The directory /home/jambeck/mustelidae can then be referred to as ~/mustelidae.
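The relationship between relative and absolute paths can be checked directly. The following is a minimal sketch, assuming a throwaway copy of the example hierarchy built under a scratch directory (the scratch variable and the use of mktemp are assumptions for illustration, not part of the original example):

```shell
# Build a throwaway copy of the example hierarchy (directory names from the text).
scratch=$(mktemp -d)
mkdir -p "$scratch/home/jambeck/mustelidae/weasels" \
         "$scratch/home/jambeck/mustelidae/otters" \
         "$scratch/home/jambeck/didelphiidae/opossums"

# Note: this changes the working directory of the shell that runs it.
cd "$scratch/home/jambeck/mustelidae/weasels"

# Resolve each relative path to the absolute path it names.
otters=$(cd ../otters && pwd)
opossums=$(cd ../../didelphiidae/opossums && pwd)

echo "$otters"
echo "$opossums"
```

Each echo prints an absolute path ending in the directory that the relative path pointed to, which shows that the two forms name the same place.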

Using a Process-Based File Hierarchy

Filesystems can be deep and narrow or broad and shallow. It's best to follow an intuitive scheme for organizing your files. Each level of hierarchy should be related to a step in the process you've used to carry out the project. A filesystem is probably too shallow if the output from numerous processing steps in one large project is all shoved together in one directory. However, a project directory that involves several analyses of just one data object might not need to be broken down into subdirectories. The filesystem is too deep if versions of output of a process are nested beneath each other or if analyses that require the same level of processing are nested in subdirectories. It's much easier for you to remember and for others to understand the paths to your data if they clearly symbolize steps in the process you used to do the work.

As you'll see in the upcoming example, your home directory will probably contain a number of directories, each containing data and documentation for a particular project. Each of these project directories should be organized in a way that reflects the outline of the project. Each directory should contain documentation that relates to the data within it.

Establishing File-Naming Conventions for Your Work

Unix allows an almost unlimited variability in file naming. Filenames can contain any character other than the / or the null character (the character whose binary representation is all zeros). However, it's important to remember that some characters, such as a space, a backslash, or an ampersand, have special meaning on the command line and may cause problems when naming files. Filenames can be up to 255 characters in length on most systems. However, it's wise to aim for uniformity rather than uniqueness in file naming. Most humans are much better at remembering frequently used patterns than they are at remembering unique 255-character strings, after all.

A common convention in file naming is to name the file with a unique name followed by a dot (.) and then an extension that uniquely indicates the file type.

As you begin working with computers in your research and structuring your data environment, you need to develop your own file-naming conventions, or preferably, find out what naming conventions already exist and use them consistently throughout your project. There's nothing so frustrating as looking through old data sets and finding that the same type of file has been named in several different ways. Have you found all the data or results that belong together? Can the file you are looking for be named something else entirely? In the absence of conventions, there's no way to know this except to open every unidentifiable file and check its format by eye. The next section provides a detailed example of how to set up a filesystem that won't have you tearing out your hair looking for a file you know you put there.

Here are some good rules of thumb to follow for file-naming conventions:

Files of the same type should have the same extension.
Files derived from the same source data should have a common element in their unique names.
The unique name should contain as much information as possible about the experiment.
Filenames should be as short as is possible without compromising uniqueness.

You'll probably encounter preestablished conventions for file naming in your work. For instance, if you begin working with protein sequence and structure datafiles, you will find that families of files with the same format have common extensions. You may find that others in your group have established local conventions for certain kinds of datafiles and results. You should attempt to follow any known conventions.

Structuring a Project: An Example

Let's take a look at an example of setting up a filesystem. These are real directory layouts we have used in our work; only the names have been changed to protect the innocent. In this case, we are using a single directory to hold the whole project.

It's useful to think of the filesystem as a family tree, clustering related aspects of a project into branches. The top level of your project directory should contain two text files that explain the contents of the directories and subdirectories. The first file should contain an outline of the project, with the date, the names of the people involved, the question being investigated, and references to publications related to this project. Tradition suggests that such informational files should be given a name along the lines of README or 00README. For example, in the shards project, a minimal README file might contain the following:

98-05-22 
Project: Shards 
Personnel: Per Jambeck, Cynthia Gibas 
Question: Are there recurrent structural words in the three-dimensional structure 
of proteins? 
Outline: Automatic construction of a dictionary of elements of local structure in 
proteins using entropy maximization-based learning.

The second file should be an index file (named something readily recognizable like INDEX) that explains the overall layout of the subdirectories. If you haven't really collected much data yet, a simple sketch of the directories with explanations should do. For example, the following file hierarchy:

98-03-22 PJ 
Layout of the Shards directory
(see README in subdirectories for further details) 
/shards 
/shards/data 
/shards/data/sequences
/shards/data/structures
/shards/data/results
/shards/data/results/enolases
/shards/data/results/globins 
/shards/data/test_cases 
/shards/graphics 
/shards/text
/shards/text/notebook
/shards/text/reports
/shards/programs
/shards/programs/source
/shards/programs/scripts
/shards/programs/bin

may also be represented in graphical form, as shown in Figure 4-1.

Figure 4-1. Tree diagram of a hierarchy

In this directory, we've made the first distinction between programs and data (programs contains the software we write, and data contains the information we get from databases, or files the programs generate). Within data, we further distinguish between types of source data (in this case, protein structures and protein sequences), results gleaned from running our programs on the data (for two sets of proteins, the enolase family and the globin superfamily), and some test cases. Programs are also subdivided according to type, namely whether they are the human-readable program listings (source code), scripts that aid in running the programs, or the binaries of the programs.
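A layout like this can be created in one step. The sketch below is one possible way to do it, assuming the skeleton is built under a scratch directory rather than a real home directory; the scratch variable and the placeholder README contents are assumptions for illustration:

```shell
# Create the shards skeleton under a scratch directory.
scratch=$(mktemp -d)
cd "$scratch"

# mkdir -p creates any missing intermediate directories along each path.
mkdir -p shards/data/sequences shards/data/structures \
         shards/data/results/enolases shards/data/results/globins \
         shards/data/test_cases shards/graphics \
         shards/text/notebook shards/text/reports \
         shards/programs/source shards/programs/scripts shards/programs/bin

# Start the two top-level documentation files.
printf 'Project: Shards\n' > shards/README
find shards -type d | sort > shards/INDEX

cat shards/INDEX
```

Generating INDEX from the directory tree gives only the bare layout; the explanations for each directory still have to be written by hand.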

As we mentioned earlier, when you store data in files, you should try to use a terse and consistent system for naming files. Excessively long filenames that describe the exact contents of a file but change for different file types (like all-GPCR-loops-in-SWISSPROT-on-99-7-14.text) will cause problems once you start using the facilities Unix provides for automatically searching for and updating files. In the shards project, we began with protein structures taken from the Protein Data Bank (PDB). We then used a homegrown Perl program called unique.pl to generate a nonredundant database, in which no protein's sequence had greater than 25% similarity to any other protein in the set. Thus, we can represent this information economically using the filename PDB-unique-25 for files related to this data set. For example, the list of the names of proteins in the set, and the file containing the proteins' sequences in FASTA format (a common text-file format for storing macromolecular sequence data), are stored, respectively, in:

PDB-unique-25.list 
PDB-unique-25.fasta

Files containing derived data can be named consistently as well. For example, the file containing all seven-residue pieces of protein structure derived from the nonredundant set is called PDB-unique-25-7.shard. This way, if you need to do something with all files pertaining to this nonredundant database, you can use the wildcard PDB-unique-25*, ignoring databases generated by different programs or those generated with unique.pl at different similarity thresholds.
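As a quick check of how the wildcard behaves, here is a small sketch that uses empty stand-in files in a scratch directory. The file PDB-unique-40.list is a hypothetical result from a different similarity threshold, added only to show that it is not matched:

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Empty stand-ins for the 25% set, plus one file from a hypothetical 40% run.
touch PDB-unique-25.list PDB-unique-25.fasta PDB-unique-25-7.shard PDB-unique-40.list

# The wildcard selects every file that shares the PDB-unique-25 stem.
matches=$(ls PDB-unique-25*)
echo "$matches"
```

Only the three files belonging to the 25% set are listed; the 40% file is left out because its name diverges before the wildcard.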

File naming conventions can take you only so far in organizing a project; the simple naming schemes we've laid out here will become more and more confusing as a project grows. For larger projects, you should consider using a database management system (DBMS) to manage your data. We introduce database concepts in Chapter 13.

Commands for Working with Directories and Files

Now that you have the basics of filesystems, let's dig into the specifics of working with files and directories in Unix. In the following sections, we cover the Unix commands for moving around the filesystem, finding files and directories, and manipulating files and directories.

As we introduce commands, we'll show you the format of the command line for each command (for example, "Usage: man name"), and describe the effects of some options we find most useful.

Moving Around the Filesystem

When you open a terminal window on a Linux system, you see a command prompt.

Command prompts can look different depending on the configuration of your system and your shell. For example, the following user is using the tcsh shell environment and has configured the command prompt to show the username and current working directory:

 [cgibas@gibas ~]$

Whatever the style of the command prompt, it means that your computer is waiting for you to tell it to do something. If you type an instruction at the prompt and press the Enter key, you have given your computer a command. Unix provides a set of simple navigation commands and commands for searching your filesystem for particular files and programs. We'll discuss the format of commands more thoroughly in Chapter 5. In this chapter, we'll introduce you to basic commands for getting around in Unix.

You are here: pwd

pwd stands for "print working directory," and that's exactly what it does. pwd sends the full pathname of the directory you are currently in, the current working directory, to standard output—it prints to the screen. You can think of being "in" a directory in this way: if the directory tree is a map of the filesystem, the current working directory is the "you are here" pointer on the map.

When you log in to the system, your "you are here" pointer is automatically placed in your home directory. Your home directory is a unique place. It contains the files you use almost every time you log into your system, as well as the directories that you create to store other files. What if you want to find out where your home directory is in relation to the rest of the system? Typing pwd at the command prompt in your home directory should give output something like:

/home/jambeck

This means that jambeck's home directory is a subdirectory of the home directory, which in turn is a subdirectory of the root (/) directory.

Changing directories with cd

Usage: cd pathname

The cd command^[†] changes the current working directory. The only argument commonly used with this command is the pathname of a directory. If cd is used without an argument, it changes the current working directory to the user's home directory.
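The two commands are easiest to understand when used together. The following is a minimal sketch, assuming a small scratch hierarchy whose directory names are borrowed from the earlier examples:

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/mustelidae/weasels"

cd "$scratch/mustelidae/weasels"   # absolute path
pwd

cd ..                              # relative path: up one level, to mustelidae
parent=$(pwd)

cd weasels                         # relative path: back down again
child=$(pwd)

echo "$parent"
echo "$child"
```

After each cd, pwd reports the new "you are here" position as an absolute path, whichever kind of path was given to cd.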

In order for these "you are here" tools to be helpful, you need to have organized your filesystem in a sensible way in the first place, so that the name and location of the directory that you're in gives you information about what kind of material can be found there. Most of the filesystem of your machine will have been set up by default when you installed Linux, but the organization of your own directories, where you store programs and data that you use, is your responsibility.

Finding Files and Directories

Unix provides many ways to find files, from simply listing out the contents of a directory to search programs that look for specified filenames and the locations of executable programs.

Listing files with ls

Usage: ls [-options] pathname

Now that you know where you are, how do you find out what's around you? Simply typing the Unix list command, ls, at the prompt gives you a listing of all the files and subdirectories in the current working directory. You can also give a directory name as an argument to ls. It then prints the names of all files in the named directory.

If you have a directory that contains a lot of files, you can use ls combined with the wildcard character * (asterisk) to produce a partial listing of files. There are several ways to use the *. If you have files in a series (such as ch1 to ch14), or files with common characters (like those ending in .txt), you can use * to specify all of them at once. When given as the argument in a command, * takes the place of any number of characters in a filename. For example, let's say you're looking for files called seq11, seq25, and seq34 in a directory of 400 files. Instead of scrolling through the list of files by eye, you could find them by typing:

% ls seq*

What if in that same directory you wanted to find all the text files? You know that text files usually end with .txt, so you can search for them by typing:

% ls *.txt

There are also a variety of command-line options to use with ls. The most useful of these are:

-a: Lists all the files in a directory, even those preceded by a dot. Filenames beginning with a dot (.) aren't listed by ls by default and consequently are referred to as hidden files. Hidden files often contain configuration instructions for programs, and it's sometimes necessary to examine or modify them.
-R: Lists subdirectories recursively. The content of the current directory is listed, and whenever a subdirectory is reached, its contents are also explicitly included in the listing. This option can create a catalog of files in your filesystem.
-1: Lists exactly one filename per line, a useful option. A single-column listing of all your source datafiles can quickly be turned into a shell script that executes an identical operation on each file, using just a few regular-expression tricks.
-F: Includes a code indicating the file type. A / following the filename indicates that the file is a directory, * indicates that the file is executable, and @ following the filename indicates that the file is a symbolic link.
-s: Lists the size of the file in blocks along with the filename.
-t: Lists files sorted by the time they were last modified, most recently modified first.
-l: Lists files in the long format.
--color: Uses color to distinguish different file types.
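A few of these options can be tried out safely in an empty directory. The sketch below assumes a scratch directory populated with made-up filenames (seq11, seq25, notes.txt, .hidden, and a results directory are all illustrative):

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch seq11 seq25 notes.txt .hidden
mkdir results

plain=$(ls)        # the dot file is not shown
all=$(ls -a)       # includes .hidden, as well as . and ..
marked=$(ls -F)    # the directory appears as results/

echo "$all"
echo "$marked"

# One name per line makes it easy to feed a listing to a loop.
ls -1 seq* | while IFS= read -r f; do
    echo "would process $f"
done
```

The loop only echoes each filename; in real use the echo would be replaced with whatever operation should be applied to every file in the series.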

Interpreting ls output

ls gives its output in two formats, the short and the long format. The short format is the default. It includes only the name of each file along with information requested using the -F or -s options:

#corr.pl#       commands.txt       hi.c           psimg.c 
#eva.pl#        corr.pl            nsmail         res.sty 
#pitch.txt#     corr.pl~           paircount.pl   res.sty~ 
#wish-list.txt# correlation.pl     paircount.pl~  resume.tex
Xrootenv.0      correlation.pl~    pj-resume.dvi  seq-scratch.txt
a.out           detailed-prac.txt  pj-resume.log  sources.txt

The long format of the ls command output contains a variety of useful information about file ownership and permissions, file sizes, and the dates and times that files were last modified:

drwxrwxr-x    4    jambeck    weasel    2048   Mar 5     18:23 ./
drwxr-xr-x    5    root       root      1024   Jan 20    12:13 ../ 
-rw-r--r--    1    jambeck    weasel    293    Jan 28    17:39 commands.txt 
-rw-r--r--    1    jambeck    weasel    1749   Feb 21    12:43 corr.pl 
-rw-r--r--    1    jambeck    weasel    559    Feb 23    14:52 correlation.pl 
-rwxr-xr-x    1    jambeck    weasel    3042   Jan 21    17:05 eva.pl* 
drwx------    2    jambeck    weasel    1024   Feb 16    14:44 nsmail/

This listing was generated with the command ls -alF. The first 10 characters in the line give information about file permissions. The first character describes the file type. You will commonly encounter three types of files: the ordinary file (represented by -), the directory (d), and the symbolic link (l).

The next nine characters are actually three sets of three bits containing file permission information. The first three characters following the file type are the file permissions for the user. The next set are for the user's group, and the final set are for users outside the group. The character string rwxrwxrwx indicates a file is readable (r), writable (w), and executable (x) by any user. We talk about how to change file permissions and file ownership in Section 4.3.3.2.

The next column in the long format file listing tells you how many links a file has; that is, how many directory listings for that file exist on the filesystem. The same file can be named in multiple directories. In Section 4.2.3, we talk about how to create links (directory listings) for new and existing files.

The next two columns show the ownership of the file. The owner of the files in the preceding example is jambeck, a member of the group weasel.

The next three columns show the size of the file in bytes, and the date and time that the file was last modified. The final column shows the name of the file.
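To see how the permission field lines up with the actual settings on a file, it helps to set known permissions and then read them back. This is a minimal sketch, assuming two empty stand-in files in a scratch directory; it uses chmod, which is covered later, only to put the files in a known state:

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch commands.txt eva.pl
chmod 644 commands.txt   # rw-r--r-- : owner may write, everyone may read
chmod 755 eva.pl         # rwxr-xr-x : additionally executable by everyone

ls -l commands.txt eva.pl

# Extract just the 10-character type-and-permission field for each file.
mode_txt=$(ls -l commands.txt | cut -c1-10)
mode_pl=$(ls -l eva.pl | cut -c1-10)
echo "$mode_txt $mode_pl"
```

The leading - in both fields marks an ordinary file; the nine characters after it are the user, group, and other permission sets described above.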

Finding files with find

Usage: find pathname list -[test] criterion

The find command is one of the most powerful, flexible, and complicated commands in the standard set of Unix programs. find searches a path or paths for files based on various tests. There are over 20 different tests that can be used with find; here are a few of the most useful:

-print: This test is always true and sends the pathname of the current file to standard output. -print should be the last command specified in a line, because, as it's always true, it causes every file in the pathname being searched to be sent to the list if it comes before other tests in a sequence.
-name: This is the test most commonly applied with find and the one that is the most immediately useful. find / -name weasel.txt -print lists to standard output the full pathnames of all files on the filesystem named weasel.txt. The wildcard operator * can be used within the filename criterion to find files that match a given pattern, but the pattern must be quoted so that the shell doesn't expand it before find sees it. find / -name 'weas*' -print finds not only weasel.txt, but weasel.c and weasel.
-user uname: This test finds all files owned by the specified user.
-group gname: This test finds all files owned by the specified group.
-ctime n: This test is true if the current file was last changed n days ago. Changing a file refers to any change, including a change in permissions, whereas modification refers only to changes to the contents of the file. -atime and -mtime tests, which check the access and modification times of the files, are also available.

Performing two find tests one after another amounts to applying a logical "and" between the tests. A -o between tests indicates a logical "or." An exclamation point (!) before a test negates it, which means find matches only those files that fail the test.
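The way tests combine is easier to see on a small directory tree. The following sketch assumes a scratch directory containing a few made-up files; the searches start at . (the current directory) rather than / so that they stay inside it:

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p mustelidae/weasels
touch mustelidae/weasel.txt mustelidae/weasel.c mustelidae/weasels/otter.txt

# Quote the pattern so the shell passes * through to find unexpanded.
by_name=$(find . -name 'weas*' -print | sort)

# Two tests in a row are ANDed: regular files AND names ending in .txt.
txt_files=$(find . -type f -name '*.txt' -print | sort)

# -o is OR; the escaped parentheses group the two name tests.
txt_or_c=$(find . -type f \( -name '*.txt' -o -name '*.c' \) -print | sort)

# ! negates the test that follows it.
not_txt=$(find . -type f ! -name '*.txt' -print)

echo "$by_name"
echo "$not_txt"
```

Note that the first search also reports the directory mustelidae/weasels, because -name on its own matches directories as well as files.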

find can be combined with other commands to selectively archive or remove particular files from a filesystem. Let's say you want a list of every file you have modified in your home directory and all subdirectories in the last week:

% find ~ -type f -mtime -7 -print

Changing the type to d shows only new directories; changing the -7 to +7 shows all files modified more than a week ago. Now let's go back to the original problem and find executable files. One way to do this with find is to use the following command:

% find / -name progname -type f -exec ls -alF '{}' ';'

This example finds every match for progname and executes ls -alF FullPathName for every match. Any Unix command can be used as the object of -exec. Cleanup of the /tmp directory, which is usually done automatically by the operating system, can be done with this command:

% find /tmp -type f -mtime +1 -exec rm -rf '{}' ';'

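This deletes every regular file under /tmp whose last modification is more than one full day old (find counts age in whole 24-hour periods and discards the remainder, so in practice a file must be at least two days old to match). As always, you need to refer to your manual pages, or manpages, for more details (for more on manpages, see Chapter 5).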
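Because -exec rm is unforgiving, it is worth running the same tests with -print first. The sketch below is one way to rehearse the cleanup, assuming a scratch directory and two made-up files, one of which is backdated with touch -t so that it looks old:

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch fresh.tmp
touch -t 200001010000 stale.tmp    # backdate to 1 January 2000

# Dry run: list what would be removed before removing anything.
find . -type f -mtime +1 -print

# Same tests, now with the removal attached.
find . -type f -mtime +1 -exec rm -f '{}' ';'

remaining=$(ls)
echo "$remaining"
```

Only the backdated file is removed; the file created a moment ago does not satisfy the -mtime +1 test and is left in place.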

Finding an executable file with which

Usage: which progname

The which command searches your current path and reports the full path of the program that executes if you enter progname at the command prompt. This is useful if you want to know where a program is located, if, for instance, you want to be sure you're using the right version of the program. which can't find a program in a directory that isn't in your path.

Finding an executable file with whereis

Usage: whereis [-options] progname

The whereis command searches a standard set of directories for executables, manpages, and source files. Unlike which, whereis isn't dependent on your path, but it looks for programs only in a limited set of directories, so it doesn't give a definitive answer about the existence of a program.

Manipulating Files and Directories

Of course, just as with the stacks of papers on your desk, you periodically need to do some housekeeping on your files and directories to keep everything neat and tidy. Unix provides commands for moving, copying, and deleting files, as well as creating and removing directories.

Copying files and directories with cp

Usage: cp [-options] source destination

The cp command makes a copy of a source file at a destination. If the destination is a directory, the source can be multiple files, copies of which are placed in the destination directory. Frequently used options are -R and -r. Both copy recursively; that is, they copy the source directory and all its subdirectories to the destination. On some older Unix systems the two differ in how they treat symbolic links: -R copies only the link itself, while -r follows the link and copies all files it finds, which can cause problems if the symbolic links happen to form a circular path through the filesystem. In GNU cp, the version found on Linux, -r and -R are equivalent.

Normally, new files created by cp get their file ownership and permissions from your shell settings. However, the GNU version of cp (the one found on Linux) provides an -a option that attempts to maintain the original file attributes; the POSIX-standard -p option preserves permissions and timestamps.
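The three common forms of cp can be sketched as follows, assuming a scratch directory and a few small made-up files (the names backup and mustelidae-copy are illustrative):

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p mustelidae/weasels backup
echo "least weasel" > mustelidae/weasels/notes.txt
echo "sea otter" > mustelidae/otters.txt

# A single file copied to a new name.
cp mustelidae/otters.txt otters-copy.txt

# Several files copied into an existing directory.
cp mustelidae/otters.txt otters-copy.txt backup/

# A whole directory tree; -p additionally preserves modes and timestamps.
cp -Rp mustelidae mustelidae-copy

copied=$(cat mustelidae-copy/weasels/notes.txt)
echo "$copied"
```

Since mustelidae-copy does not exist beforehand, the recursive copy creates it as a duplicate of mustelidae rather than placing a copy inside an existing directory.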

Moving and renaming files and directories with mv

Usage: mv source destination

The mv command simply moves or renames source to destination. Files and directories can both be either source or destination. If both source and destination are files or both are directories, the result of mv is essentially that the file or directory is renamed. If the destination is a directory, and the intention is to move already existing files or directories under that directory in the hierarchy, the directory must exist before the mv command is given. Otherwise the destination is created as a regular file, or the operation is treated as a renaming of a directory. One problem that can occur if mv isn't used carefully is when source represents a file list, and destination is a preexisting single file. When this happens, each member of source is renamed to destination and then promptly overwritten, leaving only the last file of the list intact. At this point, it's time to look for your system administrator and hope there is a recent backup.
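The difference between renaming and moving depends only on whether the destination is an existing directory. Here is a minimal sketch, assuming a scratch directory and made-up filenames, that also shows one simple way to guard against overwriting an existing file:

```shell
scratch=$(mktemp -d)
cd "$scratch"
echo "draft" > notes.txt
mkdir archive

mv notes.txt weasels.txt     # destination does not exist: a rename
mv weasels.txt archive/      # destination is an existing directory: a move

# Guard against clobbering: move only if nothing by that name is already there.
echo "new" > weasels.txt
if [ -e archive/weasels.txt ]; then
    echo "archive/weasels.txt already exists; not overwriting"
else
    mv weasels.txt archive/
fi
```

The explicit check is deliberately conservative; mv also has an -i option that asks for confirmation before overwriting.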

Creating new links to files and directories with ln

Usage: ln [-options] source destination

The ln command establishes a link between files or directories at different locations in the directory tree. While creating a link creates the appearance of a new file in the destination location, no data is actually copied. Instead, what's created is a new pointer in the filesystem index that allows the source file to be found at more than one location "on the map."

The most commonly used option, -s, creates a symbolic link (or symlink) to a file or directory, as in the following example:

% ln -s perl5.005_03 perl

This allows you to type in just the word perl rather than remembering the entire version nomenclature for the current version of Perl.

Another common use of the ln command is to create a link to a newly compiled binary executable file in a directory in the system path, e.g., /usr/local/bin. Doing this allows you to run the program without addressing it by its full pathname.
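Both kinds of link can be tried out without touching a real installation. The sketch below assumes a scratch directory in which perl5.005_03 is just a small stand-in file, not a real Perl binary, and perl-hard is a hypothetical name for the second link:

```shell
scratch=$(mktemp -d)
cd "$scratch"
echo '#!/bin/sh' > perl5.005_03     # stand-in file for the demonstration

ln -s perl5.005_03 perl             # symbolic link: a pointer to the name
ln perl5.005_03 perl-hard           # hard link: a second directory entry for the same file

ls -l

# The link count (second column of ls -l) for the original file is now 2.
links=$(ls -l perl5.005_03 | awk '{print $2}')
echo "$links"
```

In the long listing, the symbolic link shows up with type l and an arrow to its target, while the hard link is indistinguishable from the original apart from the raised link count.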

Creating and removing directories with mkdir and rmdir

Usage: mkdir -[ options ] dirname

Usage: rmdir -[ options ] dirname

New directories can be created with the mkdir command, which has only two command-line options.

mkdir -p creates a directory and any intermediate components of the path that are missing. For instance, if user jambeck decides to create a directory mustelidae/weasels in his home directory, but the intermediate directory mustelidae doesn't exist, mkdir -p creates the intermediate directory and its subdirectory weasels.

mkdir -m mode creates a directory with the specified file-permission mode.

rmdir removes a directory if it's empty. With the -p option, rmdir removes all the empty directories in a given path. If user jambeck decides to remove the directory mustelidae/weasels, and directory mustelidae is empty except for directory weasels, rmdir -p ~/mustelidae/weasels removes both weasels and its parent directory mustelidae.
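
The mustelidae example can be reproduced as follows. This is a minimal sketch that uses a scratch directory (created with mktemp, which we assume is available) in place of jambeck's home directory; the directory name private is hypothetical.

```shell
demo=$(mktemp -d)                # stands in for the home directory in the text
cd "$demo"

mkdir -p mustelidae/weasels      # creates mustelidae, then weasels inside it
ls mustelidae                    # prints: weasels

rmdir -p mustelidae/weasels      # removes weasels, then the now-empty mustelidae

mkdir -m u=rwx,go= private       # new directory accessible only to its owner
ls -ld private                   # the mode column reads drwx------
```

Note that rmdir -p keeps working its way up the path you give it, so a relative path such as mustelidae/weasels limits how far it can go.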

Removing files with rm

Usage: rm -[ options ] files

The rm command removes files and directories. Here are its common options:

-f: Forces the removal of files without prompting. You still can't remove files you don't own, but the write permissions on files you do own are ignored. For example, rm -f a* deletes all files starting with the letter a, but doesn't delete any subdirectories.
-i: Prompts you with rm: remove filename? Files are removed only if you begin your answer with a y or Y.
-r: (recursive option) Removes all directories and subdirectories in the list of files. Symbolic links aren't traversed; only the symlink itself is removed.
-v: (verbose option) Echoes the names of all files/directories that are removed.

While rm is a fairly simple command, there are a few instances in which it can cause serious problems for the careless user.

The command rm * removes all files in a directory. Unless you have the files set as read-only or have the interactive flag set, you will delete everything in the directory. Of course this isn't as bad as using the command rm -r * or rm -rf *, the last of which overrides any read-only file modes, traverses down through your directories and deletes everything in your current directory or below.

Occasionally you will find that you create odd files in your directories. For instance, you might have a file named -myfile where the - is part of the filename. Try deleting it with rm -myfile, and you will get an error message complaining that rm doesn't have a -m option. Because the name begins with -, rm interprets it as a set of command flags, not as a filename. The solution to this problem is trivial but not always instantly apparent: simply provide a more complete path to the file, such as rm ./-myfile or rm /home/jambeck/-myfile. A similar trick is needed if you accidentally create a file with a space in the name: enclose the name in quotes so that the shell treats it as a single word.
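
The awkward-filename cases are easy to rehearse in a scratch directory before you meet them for real. In this sketch the filenames are hypothetical, and mktemp is assumed to be available. The -- form shown last is a standard alternative to spelling out a path.

```shell
demo=$(mktemp -d)
cd "$demo"

# Create two awkward filenames on purpose.
touch ./-myfile "my file"

rm ./-myfile        # the leading ./ keeps rm from reading -m as an option
rm "my file"        # quotes keep the shell from splitting the name at the space

touch ./-other
rm -- -other        # "--" tells rm that everything after it is a filename

ls -a               # only . and .. remain
```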

Working in a Multiuser Environment

Unix systems are designed to allow multiple users to share system resources and software, yet at the same time to allow users to selectively protect their work from each other. To work with others in a multiuser environment, there are a number of general Unix concepts you need to understand.

Users and Groups

If you use a Unix system, you must be registered. You are identified by a login name and can log in only by entering the password uniquely associated with your login name. You have control over an area of the filesystem, which may be as large or small as system resources allow. You belong to one or more groups and can share files with other members of a group without needing to make the files accessible to other users. At any given time, only one of your groups is active, and new files you create are automatically associated with the active, or primary, group. If you use group permissions to share files with other users, and you need to change to a particular group ID, the command newgrp allows you to change your primary group ID. The id command tells you what your user and primary group IDs are.

Information about your account is stored in the /etc/passwd file, a file that provides the system with information needed when you log in. Your username and user ID mapping are found here, along with your default group, full name, home directory, and default shell program. The shell program is described in Chapter 5. The encrypted version of your password used to be stored here, but on most systems, for security reasons, the actual password has been removed from the passwd file. Additional group information is found in the /etc/group file. You can view the contents of these files with an editor, even though they are system files you normally can't overwrite.
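
A quick way to see this information for your own account is sketched below. The variable name uid is ours. The lookup assumes your account is listed in the local /etc/passwd file, which may not be the case if your machine is part of a larger cluster.

```shell
id                      # user ID, primary group ID, and all group memberships
uid=$(id -u)            # numeric user ID only
echo "$uid"

# Print the /etc/passwd entry whose third field matches that user ID.
# The colon-separated fields are:
#   name:password:UID:GID:full name:home directory:shell
# Nothing is printed if the account is managed somewhere else.
awk -F: -v uid="$uid" '$3 == uid' /etc/passwd
```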

User Directories

When your system administrator creates a new user account, the process includes creating an entry in the /etc/passwd file, possibly adding you to a number of groups in /etc/group, creating a home directory for you somewhere on the system, and then changing the ownership of that directory so that you own it and any files that are put into it at the time of creation. Your entry in /etc/passwd needs to match the path to your home directory, and the user and group that own your home directory. There should also be a set of files in your home directory that set up your work environment when you log in and are specific to the Unix shell listed in your passwd entry. These files are discussed in more detail in Chapter 5.

File Permissions and Statistics

As we discussed in the section on the ls command, each file and directory has an owner and a group with which it's associated. Each file is created with permissions that allow or prevent you access to the file dependent on your user ID and group. In this section we discuss how to view and change file permissions and ownership.

Viewing file attributes with stat

Usage: stat -[ options ] filename

stat lets you view the complete set of attributes of a file or directory, including permissions, modification times, and ownership. It may be more information than you need, but it's there if you want it. For example, the command stat image1.rgb returns:

image1.rgb: 
inode 11750927; dev 77; links 1; size 922112
regular; mode is rw-------; uid 12430 (jambeck); gid 280 (weasel)
projid 0  st_fstype: xfs
change time - Sun Mar 14 14:21:50 1999 <921442910>
access time - Sat Mar 13 18:11:21 1999 <921370281>
modify time - Sat Mar 13 10:28:39 1999 <921342519>
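
The layout of stat output, and the options stat accepts, vary from one version of Unix to another. As a hedged illustration, the following sketch assumes the GNU version of stat found on most Linux systems, where -c selects individual fields; other systems use different options. The file here is a small placeholder, not a real image.

```shell
demo=$(mktemp -d)
cd "$demo"
echo "sample data" > image1.rgb          # placeholder file, 12 bytes long
chmod u=rw,go= image1.rgb

stat image1.rgb                          # full attribute listing
stat -c '%A %U %G %s' image1.rgb         # GNU stat only: mode, owner, group, size
```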

Changing file permissions with chmod

On most Unix systems, you wouldn't want every file to be readable, writable, and executable by every user. The chmod command allows you to set the file permissions, or mode, on a list of files and directories. The recursive option, -R, causes chmod to descend recursively through a directory tree and change the mode of the files and directories.

For example, a long directory listing for a directory, a symlink, and a file looks like this:

drwxr-xr-x  7  jambeck  weasel  2048    Feb 10 19:08  image/
lrwxr-xr-x  1  jambeck  weasel  10      Mar 14 13:12  image.rgb -> image1.rgb
-rw-r--r--  1  jambeck  weasel  922112  Mar 13 10:28  image1.rgb

The first character in each line indicates whether the entry is a file, directory, symlink, or one of a number of other special file types found on Unix systems. The three listed here are by far the most common. The remaining nine characters describe the mode of the file. The mode is divided into three sets of three characters. The sets correspond, in order, to the user, the group, and other. The user is the account that owns the directory entry, the group can be any group on the system, and other is any user who is neither the owner nor a member of that group. Within each set, the characters correspond to read (r), write (w), and execute (x) permissions for that person or group.

In the previous example, to change the mode of the file image1.rgb so that it's readable only by the user and modified (writable) by no one, you can issue one of the following commands:

chmod  u-w,g-r,o-r image1.rgb
chmod  u=r,g=-,o=- image1.rgb
chmod  u=r,go=-    image1.rgb

Any one of these commands results in image1.rgb's permissions looking like:

-r-------- 1 jambeck weasel 922112 Mar 13 10:28 image1.rgb

The first two commands should be fairly obvious. You can add or subtract user's, group's or other's read, write or execute permissions by this mechanism. The mode parameters are:

[u,g,o]: User, group, other
[+,-,=]: Add, subtract, set
[r,w,x]: Read, write, execute

u, g, and o can be grouped or used singly. The same is true for r, w, and x. The operators +, -, and = describe the action that is to be performed.
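
The sketch below walks a scratch file through several symbolic modes so you can watch the mode string change. The file is an empty placeholder, the variable names are ours, and mktemp is assumed to be available.

```shell
demo=$(mktemp -d)
cd "$demo"
touch image1.rgb                          # empty placeholder file

chmod u=rw,go=r image1.rgb                # start from the mode shown in the listing
start=$(ls -l image1.rgb | cut -c1-10)
echo "$start"                             # -rw-r--r--

chmod u-w,go-r image1.rgb                 # g and o grouped into one clause
locked=$(ls -l image1.rgb | cut -c1-10)
echo "$locked"                            # -r--------

chmod ug=rw,o= image1.rgb                 # set user and group together, clear other
shared=$(ls -l image1.rgb | cut -c1-10)
echo "$shared"                            # -rw-rw----
```

Here cut -c1-10 simply trims the long listing down to the file type and the nine mode characters.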

Changing file and directory ownership with chown and chgrp

Usage: chown -[ options ] owner filenames

Usage: chgrp -[ options ] group filenames

The chown command lets you change the owner (or, in file-permission parlance, the user) of a file or directory. How chown behaves depends on the version of Unix you are running. For example, IRIX allows you to "give" ownership of your files to someone else, while Linux does not allow ordinary users to do this. We will therefore cite only examples of the chgrp command, which works on Linux for any user who belongs to two or more groups.

chgrp lets you change the group of a file or directory. You must be a member of the group the file is being changed to, so you have to be a member of more than one group and understand how to use the newgrp command (which was introduced earlier in this chapter). Assume for a moment that you created image/, a directory containing files, while you were in your default group. Later, you realize that you want to share these files with members of another group on the system. So, at first, the permissions look like this:

drwxr-xr-x 7 jambeck weasel 2048 Feb 10 19:08 image/

Change to the other group using the command newgrp wombat, then type:

chgrp -R wombat image

to make all files in the directory accessible to the wombat group. Finally, you should change the permissions to make the files writable by the wombat group as well. This is done with the command:

chmod -R g+w image

Your entry should now appear as follows:

drwxrwxr-x 7 jambeck wombat 2048 Feb 10 19:08 image/
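
Since the wombat group almost certainly doesn't exist on your system, the following sketch rehearses the same sequence using your own primary group as a stand-in. The variable names are ours, the directory is created in a scratch area, and the group is given by its numeric ID, which chgrp also accepts.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir image
touch image/image1.rgb
chmod u=rwx,go=rx image          # match the drwxr-xr-x mode shown in the text

group=$(id -g)                   # stand-in for wombat: your current primary group
chgrp -R "$group" image          # reassign the directory and everything in it
chmod -R g+w image               # let group members write as well

mode=$(ls -ld image | cut -c1-10)
echo "$mode"                     # drwxrwxr-x
```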

System Administration

Most files that control the configuration of the Unix system on your computer are writable only by the system administrator. Adding and deleting users, backing up and restoring files, installing new software in shared directories, configuring the Unix kernel, and controlling access to various parts of the filesystem are tasks normally handled by one specially designated user, with the username root. When you're doing day-to-day tasks, you shouldn't be logged in as root, because root has privileges ordinary users don't, and you can inadvertently mess up your computer system if you have those privileges. Use the su command from your command line to assume system-administration privileges temporarily, do only those tasks that need to be done by the system administrator, and then exit back to your normal user status.

If you set up a Unix system for yourself, you need to become the system administrator or superuser and learn to do the various system-administration tasks necessary to maintain your computer in a secure and useful condition. Fortunately, there are several informative reference books on Unix system administration available (several by O'Reilly), and an increasing number of easy-to-use graphical system-administration tools are included in every Linux distribution.

Conventions for Organizing Files

Unix uses a simple set of designations for the various types of files found on the system. Normally you can find what you need with info, find, or which, but sometimes it's necessary to search manually, and you don't want to look in /bin for a library. These designations are used at the operating-system level, but they are also often used in project subdirectories or software distributions to separate files:

bin: Executable files, or binaries
lib: Libraries, both shared (runtime) libraries and those needed when compiling
spool: Directories used by the system when communicating with external devices and machines
tmp: Temporary storage
src: Source code for programs
etc: Configuration information
man: Manual pages, documentation
doc: Documentation
X: X or X11R6 refers to X programs, libraries, src, etc.; directories typically have a fairly complete set of subdirectories
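
If you want to follow the same conventions in a project of your own, a skeleton can be laid out in a few lines. This is only a sketch: the project name myproject is hypothetical, and which subdirectories you need depends on the project.

```shell
proj=$(mktemp -d)/myproject      # hypothetical project location in a scratch area

# Create one subdirectory for each kind of file the project will hold.
for d in bin lib src doc etc tmp; do
    mkdir -p "$proj/$d"
done

ls "$proj"                       # lists the six new subdirectories
```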

Once you have a basic understanding of how to organize and manage your files and directories, you're well on your way to understanding how to work in a Unix environment. In Chapter 5 we complete our lightning Unix tutorial with a discussion of many of the most commonly used Unix commands. In order to really master the art of Unix, we strongly recommend consulting one or more of the books in the Bibliography.

Locating Files in System Directories

While all your own files should be created in your home directory or in other areas specifically designated for users to share, you need to be aware of the locations of files in other parts of the system. One benefit of a system environment designed for multiple users is that many users can share common resources while controlling access to their own files.

To say there is a standard Unix filesystem is somewhat of an overstatement, but, like Plato's vision of the perfect chair, we will attempt to imagine one out in the ether. Since Linux is being developed by thousands of programmers on different continents and has the benefit of the development of both Berkeley and AT&T's SysV Unix, along with the POSIX standards, we will use the Linux filesystem as a template and point out major discrepancies when necessary. The current standard for the Linux filesystem is described at http://www.pathname.com/fhs/. Here, we present a brief skeleton of the complete filesystem and point out a few salient features. Most directories described in this section are configurable only by the system administrator; however, as a user, you may sometimes need to know where system files and programs can be found. Figure 4-2 illustrates the major subdirectories, which are further described in the following list.

Figure 4-2. Unix subdirectories

/dev: Contains the device files through which programs reach the device drivers needed to connect peripherals to the system. Entries for SCSI, audio, IDE drives, PPP, mice, and most other devices are found here. In general there are no user-configurable options here.
/etc: Houses all the configuration files local to your machine. This includes items such as the system name, Internet address, password file (unless your machine is part of some larger cluster), filesystem information, and Unix initialization information.
/home: A common, but not standard, part of Unix. /home is usually a fairly large, separate partition that houses all user home directories. Having /home on a separate partition has the advantage of allowing it to be shared in a cluster environment, and it also makes it difficult for users to completely fill an important system partition and cause it to lock up.
/lost+found: A system directory that is a repository for files and directories that have somehow been misplaced by the system. Typically, users can't cd into this directory. Files usually end up in the lost+found because of a system crash or a disk problem. At times it's possible that your system administrator can recover files that appear to be lost simply by moving them from lost+found and renaming them. There's a separate lost+found for each partition on the system.
/mnt: While not found on all systems, this is the typical place to mount any partitions not described by the standard Unix filesystem description. Under Linux, this is where you will find a mounted CD-ROM or floppy drive.
/nfs: Often used as the top-level directory for any mount points for partitions that are mounted from remote machines.
/opt: A relatively new addition to the Unix filesystem. This is where optional, usually commercial, packages are installed. On many systems you will find higher-end, optimizing compilers installed here.
/root: The home directory for root, i.e., for the system administrator when she is logged in as root.
/sbin, /bin, and /lib: Since the machine may need to start the boot process without the /usr partition present, any programs that are needed before the /usr partition is mounted must reside on the main or root partition. The contents of the /sbin directory, for instance, are a subset of the /usr/sbin directory. Labeling directories sbin indicates that only system-level commands are present and that normal users probably won't need them, and therefore don't need to include these directories in their path. The /lib directory holds a small subset of system libraries that are needed by programs in /bin and /sbin. Current Unix programs use shared libraries, which means that many programs can use functions from the same library, and so the library needs to be loaded into memory only once. For practical purposes, this means that programs don't take as much memory as they would if each program included all the library routines, but also that a program won't run at all if the library it needs has been deleted or hasn't been mounted yet.
/tmp and /var/tmp: Typically configured to be readable/writable/executable by all users. Many standard programs, such as vi, write temporary files to one of these directories while they are running. Normally the system cleans out these directories automatically on a regular basis or when the machine is rebooted. This is a good place to write temporary files, but you can't assume that the system will wait for you to erase them.
/usr: The repository for the majority of programs, compilers, libraries, and documentation for the Unix filesystem. The current recommendation for most Unix systems is that the system should be able to mount /usr as a separate, read-only partition. In a workstation-cluster environment, this means that a server can export a /usr partition, and all the workstations in that cluster will share the programs. This makes the system administrator's job easier and provides users with a uniform set of machines.
/usr/local: The typical directory in which to install programs and documentation so that they aren't overwritten by the operating system. You will often find programs such as Perl and various others that have been downloaded from the Internet installed in this location.
/var: The directory used by all system programs that write output to the disk. All system logs, spools, and temporary data are written here. This includes logging information such as that written during the boot process, by the mailer, by the login program, and by all other system processes. Incoming and outgoing mail is stored in the /var/spool directory, as are files being sent to printers. Information needed for cron, batch, and at jobs is also found here.
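
A few standard commands let you check how these directories are set up on your own machine. The sketch below only reads information and changes nothing; the variable name ls_path is ours, and the exact output differs from system to system.

```shell
ls_path=$(command -v ls)   # full path of the ls program
echo "$ls_path"            # typically /bin/ls or /usr/bin/ls

ls -ld /tmp                # permissions on /tmp, normally writable by all users

df /usr /tmp               # shows which partition each directory lives on
```

If df reports the same filesystem for both directories, they share a partition on your system.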

^[*]Throughout this chapter and Chapter 5, we introduce many Unix commands. Our quick and dirty approach to outlining the functions of these commands and their options should help you get started working fast, but it's by no means exhaustive. The Bibliography provides several excellent Unix books that will help you fill in the details.

^[†]As you'll see when we cover the Unix shell and the command line in Chapter 5, Unix commands can be issued with or without arguments on the command line. The first word in a line is always a command. Subsequent words are arguments and can include options, which modify the command's behavior, and operands, which specify pathnames. Words in the command line are items separated by whitespace (spaces or tabs).

Developing Bioinformatics Computer Skills by Per Jambeck, Cynthia Gibas