Chapter 4. Basic Git Concepts

Basic Concepts

The previous chapter presented a typical application of Git—and probably sparked a good number of questions. Does Git store the entire file at every commit? What’s the purpose of the .git directory? Why does a commit ID resemble gibberish? Should I take note of it?

If you’ve used another version control system (VCS), such as Subversion or CVS, the commands in the last chapter likely seemed familiar. Indeed, Git serves the same function and provides all the operations you expect from a modern VCS. However, Git differs in some fundamental and surprising ways.

In this chapter, we explore why and where Git differs by examining the key components of its architecture and some important concepts. Here we focus on the basics and demonstrate how to interact with one repository. Chapter 11 explains how to work with many interconnected repositories. Keeping track of multiple repositories may seem like a daunting prospect, but the fundamentals you learn in this chapter apply just the same.

Repositories

A Git repository is simply a database containing all the information needed to retain and manage the revisions and history of a project. In Git, as with most version control systems, a repository retains a complete copy of the entire project throughout its lifetime. However, unlike most other VCSs, the Git repository provides not only a complete working copy of all the files in the repository but also a copy of the repository itself with which to work.

Git maintains a set of configuration values within each repository. You saw some of these, such as the repository user’s name and email address, in the previous chapter. Unlike file data and other repository metadata, configuration settings are not propagated from one repository to another during a clone (that is, duplication) operation. Instead, Git manages and inspects configuration and setup information on a per-site, per-user, and per-repository basis.

Within a repository, Git maintains two primary data structures, the object store and the index. All of this repository data is stored at the root of your working directory in a hidden subdirectory named .git.

The object store is designed to be efficiently copied during a clone operation as part of the mechanism that supports a fully distributed VCS. The index holds transitory information, is private to a repository, and can be created or modified on demand.

The next two sections describe the object store and index in more detail.

Git Object Types

At the heart of Git’s repository implementation is the object store. It contains your original data files and all the log messages, author information, dates, and other information required to rebuild any version or branch of the project.

Git places only four types of objects in the object store: blobs, trees, commits, and tags. These four atomic objects form the foundation of Git’s higher-level data structures:

Blobs: Each version of a file is represented as a blob. “Blob” is a contraction of “binary large object,” a term that’s commonly used in computing to refer to some variable or file that can contain any data and whose internal structure is ignored by the program. A blob is treated as opaque. A blob holds a file’s data but does not contain any metadata about the file or even its name.
Trees: A tree object represents one level of directory information. It records blob identifiers, pathnames, and a bit of metadata for all the files in one directory. It can also recursively reference other (sub)tree objects and thus build a complete hierarchy of files and subdirectories.
Commits: A commit object holds metadata for each change introduced into the repository, including the author, committer, commit date, and log message. Each commit points to a tree object that captures, in one complete snapshot, the state of the repository at the time the commit was performed. The initial commit, or root commit, has no parent. Most commits have one commit parent, though in Chapter 9, I explain how a commit can reference more than one parent.
Tags: A tag object assigns an arbitrary yet presumably human-readable name to a specific object, usually a commit. Although 9da581d910c9c4ac93557ca4859e767f5caf5169 refers to an exact and well-defined commit, a more familiar tag name like Ver-1.0-Alpha might make more sense!
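
The relationships among the four object types can be sketched as plain data structures. This is only an illustrative model, not Git's on-disk format: the class names, field names, author string, and placeholder IDs are assumptions made for the example, and real objects carry more detail than is shown here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Blob:
    data: bytes                     # file contents only: no filename, no metadata


@dataclass(frozen=True)
class TreeEntry:
    mode: str                       # a bit of metadata, e.g. "100644" for a regular file
    name: str                       # pathname component within this one directory
    object_id: str                  # ID of a blob, or of another tree for a subdirectory


@dataclass(frozen=True)
class Tree:
    entries: Tuple[TreeEntry, ...]  # one level of directory information


@dataclass(frozen=True)
class Commit:
    tree_id: str                    # complete snapshot of the project at commit time
    parent_ids: Tuple[str, ...]     # () for a root commit, usually one ID, more for a merge
    author: str
    committer: str
    date: str
    message: str


@dataclass(frozen=True)
class Tag:
    name: str                       # human-readable name
    object_id: str                  # the object being named, usually a commit


# The tree ID, date, and author below are placeholders, not real values.
root = Commit(
    tree_id="<tree id>",
    parent_ids=(),
    author="A. Developer",
    committer="A. Developer",
    date="<date>",
    message="Initial commit",
)
tag = Tag(name="Ver-1.0-Alpha",
          object_id="9da581d910c9c4ac93557ca4859e767f5caf5169")

print(root.parent_ids == ())  # True: a root commit has no parent
print(tag.name)
```

Note that the filename lives in the tree entry, not in the blob, which is why the same blob can appear under several names.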

Over time, all the information in the object store changes and grows, tracking and modeling your project edits, additions, and deletions. To use disk space and network bandwidth efficiently, Git compresses and stores the objects in pack files, which are also placed in the object store.

Index

The index is a temporary and dynamic binary file that describes the directory structure of the entire repository. More specifically, the index captures a version of the project’s overall structure at some moment in time. The project’s state could be represented by a commit and a tree from any point in the project’s history, or it could be a future state toward which you are actively developing.

One of the key distinguishing features of Git is that it enables you to alter the contents of the index in methodical, well-defined steps. The index allows a separation between incremental development steps and the committal of those changes.

Here’s how it works. As the developer, you execute Git commands to stage changes in the index. Changes usually add, delete, or edit some file or set of files. The index records and retains those changes, keeping them safe until you are ready to commit them. You can also remove or replace changes in the index. Thus, the index allows a gradual transition, usually guided by you, from one complex repository state to another, presumably better state.

As you’ll see in Chapter 9, the index plays an important role in merges, allowing multiple versions of the same file to be managed, inspected, and manipulated simultaneously.
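
As a rough mental model, the staging workflow described above can be pictured as a mapping from pathnames to blob IDs that you adjust step by step before committing. The sketch below is a deliberate simplification: the `Index` class, its method names, and the `blob-id-N` strings are hypothetical, and Git's real index is a binary file that records considerably more per entry.

```python
class Index:
    """Toy staging area: maps pathname -> ID of the blob staged at that path."""

    def __init__(self, last_commit):
        # Start from the state of the most recent commit (copied, not shared).
        self.entries = dict(last_commit)

    def stage(self, path, blob_id):
        # Add a new file or record an edit to an existing one.
        self.entries[path] = blob_id

    def remove(self, path):
        # Record a deletion.
        self.entries.pop(path, None)

    def snapshot(self):
        # The state that the next commit would capture.
        return dict(sorted(self.entries.items()))


head = {"README": "blob-id-1", "main.c": "blob-id-2"}  # placeholder IDs

index = Index(head)
index.stage("main.c", "blob-id-3")   # edit a file
index.stage("util.c", "blob-id-4")   # add a file
index.remove("README")               # delete a file
index.stage("main.c", "blob-id-5")   # replace an already staged change

print(index.snapshot())  # {'main.c': 'blob-id-5', 'util.c': 'blob-id-4'}
print(head)              # {'README': 'blob-id-1', 'main.c': 'blob-id-2'}
```

The last line shows that the committed state is untouched until you actually commit; only the index moves.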

Content-Addressable Names

The Git object store is organized and implemented as a content-addressable storage system. Specifically, each object in the object store has a unique name produced by applying SHA1 to the contents of the object, yielding an SHA1 hash value. Since the complete contents of an object contribute to the hash value and since the hash value is believed to be effectively unique to that particular content, the SHA1 hash is a sufficient index or name for that object in the object database. Any tiny change to a file causes the SHA1 hash to change, causing the new version of the file to be indexed separately.

SHA1 values are 160-bit values that are usually represented as a 40-digit hexadecimal number, such as 9da581d910c9c4ac93557ca4859e767f5caf5169. Sometimes, during display, SHA1 values are abbreviated to a smaller, unique prefix. Git users speak of SHA1, hash code, and sometimes object ID interchangeably.
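
To make the naming scheme concrete, here is a minimal Python sketch that computes the ID Git would give a blob. One detail the prose glosses over: Git hashes a short header (the object type and the content length, followed by a NUL byte) together with the raw content. The function name `git_blob_id` is made up for this example; the result should match what `git hash-object` reports for a file with the same bytes.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA1 object ID for a blob holding exactly these bytes."""
    # Header is "blob <size in bytes>" followed by a single NUL byte.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


original = git_blob_id(b"hello world\n")
edited = git_blob_id(b"hello world!\n")

print(original)            # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(len(original))       # 40 hexadecimal digits, i.e. 160 bits
print(original == edited)  # False: a one-character edit yields a new name
```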

Globally Unique Identifiers

An important characteristic of the SHA1 hash computation is that it always computes the same ID for identical content, regardless of where that content is. In other words, the same file content in different directories and even on different machines yields the exact same SHA1 hash ID. Thus, the SHA1 hash ID of a file’s content serves as a globally unique identifier for that content.

A powerful corollary is that files or blobs of arbitrary size can be compared for equality across the Internet by merely comparing their SHA1 identifiers.

Git Tracks Content

It’s important to see Git as something more than a version control system: Git is a content tracking system. This distinction, however subtle, guides much of the design of Git and is perhaps the key reason Git can perform internal data manipulations with relative ease. Yet this is also perhaps one of the most difficult concepts for new users of Git to grasp, so some exposition is worthwhile.

Git’s content tracking is manifested in two critical ways that differ fundamentally from almost all other^[7] revision control systems.

First, Git’s object store is based on the hashed computation of the contents of its objects, not on the file or directory names from the user’s original file layout. Thus, when Git places a file into the object store, it does so based on the hash of the data and not on the name of the file. In fact, Git does not track file or directory names, which are associated with files in secondary ways. Again, Git tracks content instead of files.

If two separate files located in two different directories have exactly the same content, Git stores a single copy of that content as a blob within the object store. Git computes the hash code of each file according solely to its content, determines that the files have the same SHA1 values and thus the same content, and places the blob object in the object store indexed by that SHA1 value. Both files in the project, regardless of where they are located in the user’s directory structure, use that same object for content.

If one of those files changes, Git computes a new SHA1 for it, determines that it is now a different blob object, and adds the new blob to the object store. The original blob remains in the object store for the unchanged file to use.
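
The behavior described in the last two paragraphs can be imitated with a small in-memory store. This is a sketch under stated assumptions, not Git's implementation: the `ObjectStore` class, the file paths, and the license text are invented for illustration, and only the ID computation mirrors what Git does for blobs.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: blob ID -> blob bytes."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        # Name the content by its hash; the pathname plays no part.
        header = b"blob %d\x00" % len(data)
        blob_id = hashlib.sha1(header + data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.blobs.setdefault(blob_id, data)
        return blob_id


store = ObjectStore()

# Two files in different directories with exactly the same content.
files = {
    "docs/LICENSE": store.put(b"Example license text\n"),
    "src/LICENSE": store.put(b"Example license text\n"),
}
print(files["docs/LICENSE"] == files["src/LICENSE"])  # True
print(len(store.blobs))                               # 1

# Changing one file adds a new blob; the original stays for the other file.
files["src/LICENSE"] = store.put(b"Different license text\n")
print(len(store.blobs))                               # 2
```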

Second, Git’s internal database efficiently stores every version of every file—not their differences—as files go from one revision to the next. Because Git uses the hash of a file’s complete content as the name for that file, it must operate on each complete copy of the file. It cannot base its work or its object store entries on only part of the file’s content, nor on the differences between two revisions of that file.

The typical user view of a file—that it has revisions and appears to progress from one revision to another revision—is simply an artifact. Git computes this history as a set of changes between different blobs with varying hashes, rather than storing a filename and set of differences directly. It may seem odd, but this feature allows Git to perform certain tasks with ease.
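
A small sketch may help show what "computing history" means here: given two complete blobs, the differences are derived on demand rather than read from storage. The example uses Python's standard `difflib` module, which is a stand-in and not the diff algorithm Git itself uses; the blob contents, the `diff_blobs` helper, and the filename are hypothetical.

```python
import difflib

# Two complete versions of a file, as they might sit in the object store.
old_blob = b"line one\nline two\n"
new_blob = b"line one\nline 2\nline three\n"


def diff_blobs(old: bytes, new: bytes, path: str) -> list:
    """Derive a unified diff from two whole blobs; nothing is stored as a delta."""
    return list(difflib.unified_diff(
        old.decode().splitlines(keepends=True),
        new.decode().splitlines(keepends=True),
        fromfile="a/" + path,
        tofile="b/" + path,
    ))


# The pathname is supplied separately because the blobs do not know their names.
print("".join(diff_blobs(old_blob, new_blob, "notes.txt")))
```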

Pathname Versus Content

As with many other VCSs, Git needs to maintain an explicit list of files that form the content of the repository. However, this does not require that Git’s manifest be based on filenames. Indeed, Git treats the name of a file as a piece of data that is distinct from the contents of that file. In this way, it separates “index” from “data” in the traditional database sense. It may help to look at Table 4-1, which roughly compares Git to other familiar systems.

Table 4-1. Database comparison

System	Index mechanism	Data store
Traditional database	ISAM	Data records
Unix filesystem	Directories (/path/to/file)	Blocks of data
Git	.git/objects/`hash`, tree object contents	Blob objects, tree objects

The names of files and directories come from the underlying filesystem, but Git does not really care about the names. Git merely records each pathname and makes sure it can accurately reproduce the files and directories from its content, which is indexed by hash value.

Git’s physical data layout isn’t modeled after the user’s file directory structure. Instead, it has a completely different structure that can, nonetheless, reproduce the user’s original layout. Git’s internal structure is a more efficient data structure for its own internal operations and storage considerations.

When Git needs to create a working directory, it says to the filesystem, “Hey! I have this big blob of data that is supposed to be placed at pathname path/to/directory/file. Does that make sense to you?” The filesystem is responsible for saying, “Ah, yes, I recognize that string as a set of subdirectory names, and I know where to place your blob of data! Thanks!”
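
The way pathnames are recovered from tree objects can be sketched in a few lines. In this toy model, the object IDs (`t-root`, `b-main`, and so on), the file names, and the dictionary layout are all invented for readability; real IDs are SHA1 hashes and real tree entries also carry mode information.

```python
# Toy object store: ID -> ("tree", [(name, child ID), ...]) or ("blob", bytes).
objects = {
    "t-root": ("tree", [("README", "b-readme"), ("src", "t-src")]),
    "t-src": ("tree", [("main.c", "b-main")]),
    "b-readme": ("blob", b"Read me first\n"),
    "b-main": ("blob", b"int main(void) { return 0; }\n"),
}


def walk(tree_id, prefix=""):
    """Yield (pathname, blob bytes) for every file reachable from a tree."""
    _kind, entries = objects[tree_id]
    for name, object_id in entries:
        child_kind, payload = objects[object_id]
        path = prefix + name
        if child_kind == "tree":
            # A subtree contributes one more directory level to the path.
            yield from walk(object_id, path + "/")
        else:
            yield path, payload


layout = dict(walk("t-root"))
print(sorted(layout))  # ['README', 'src/main.c']
```

Nothing in the blobs mentions `README` or `src/main.c`; the full pathnames exist only as the concatenation of names found while descending through the trees, which is the string Git then hands to the filesystem.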

^[7]Monotone, Mercurial, OpenCMS, and Venti are notable exceptions here.

Version Control with Git by Jon Loeliger