Chapter 6. Tracking Other Repositories
This chapter discusses copying or "cloning" an existing repository, and thereafter sharing changes between original and clone using the Git "push" and "pull" commands.
Cloning a Repository
The git clone command initializes a new repository with the contents of another one and sets up tracking branches in the new repository so that you can easily coordinate changes between the two with the push/pull mechanism. We call the first repository a "remote" (even if it is in fact on the same host), and by default, this remote is named origin; you can change this with the --origin (-o) option, or with git remote rename later on. You can view and manipulate remotes with git remote; a repository can have more than one remote with which it synchronizes different sets of branches.
After cloning the remote repository, Git checks out the remote HEAD branch (often master); you can have it check out a different branch with -b branch, or none at all with -n:
$ git clone http://nifty-software.org/foo.git
Cloning into 'foo'...
remote: Counting objects: 528, done.
remote: Compressing objects: 100% (425/425), done.
remote: Total 528 (delta 100), reused 528 (delta 100)
Receiving objects: 100% (528/528), 1.31 MiB | 1.30 Mi…
Resolving deltas: 100% (100/100), done.
If you give a second argument, Git will create a directory with that name for the new repository (or use an existing directory, so long as it's empty); otherwise, it derives the name from that of the source repository using some ad hoc rules. For example, foo stays foo, but foo.git and bar/foo also become foo.
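The clone options described so far can be tried without a network. The following is a minimal sketch, assuming a throwaway source repository built under a temporary directory; the directory names (src, copy), the remote name upstream, and the example identity are all made up for illustration.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

# Hypothetical source repository with two branches, master and topic.
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/master
git -C "$work/src" commit -q --allow-empty -m "in principio"
git -C "$work/src" branch topic

# Clone into an explicitly named directory, call the remote "upstream"
# instead of "origin", and check out "topic" rather than the remote HEAD.
git clone -q -o upstream -b topic "$work/src" "$work/copy"

remote_name=$(git -C "$work/copy" remote)
current_branch=$(git -C "$work/copy" symbolic-ref --short HEAD)
echo "remote: $remote_name, branch: $current_branch"
```

Here the second argument to git clone names the new directory explicitly, so none of the name-derivation rules come into play.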
You can specify the remote repository with a URL as shown, or with a simple path to a directory in the filesystem containing a Git repository. Git supports a number of transport schemes natively to access remote repositories, including HTTP, HTTPS, its own git protocol, FTP, FTPS, and rsync.
Git will also automatically use SSH if you use the ssh URL scheme (ssh://), or give the repository as [user@]host:/path/to/repo; this uses SSH to run git upload-pack on the remote side. If the path is relative (no leading slash), then it is usually relative to the home directory of the login account on the server, though this depends on the SSH server configuration. You can specify the SSH program to use with the environment variable GIT_SSH (the default is, unsurprisingly, ssh). With the long form you can also give a TCP port number for the server, e.g., ssh://nifty-software.org:2222/foo.
Clones and Hard Links
When you give the origin repository as a simple directory name, and the new repository is on the same filesystem, Git uses Unix "hard links" to the originals for certain files instead of copying them when populating the object database of the clone, saving time and disk space. This is safe for two reasons. First, the semantics of hard links are such that someone deleting a shared file in the origin repository has no effect on you; files remain accessible until the last link is removed. Second, because of content-based addressing, Git objects are immutable; an object with a given ID will not suddenly change out from under you. You can turn off this feature and force actual copying with --no-hardlinks, or by using a URL with the "file" scheme to access the same path: file:///path/to/repo.git (the empty hostname between the second and third slash indicates the local host).
Note
When we refer to a "local" repository in this section, we mean one accessible to Git using the filesystem, as opposed to needing an explicit network connection (SSH, HTTP, and so on). That may not in fact be "local" to the host itself, however (meaning on hardware directly attached to it); it could be on a file server accessed over the network via NFS or CIFS, for example. Thus, a repository that is "local" to Git might still be "remote" from the host.
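One way to see the difference between a hard-linked clone and a copied one is to count object files that have more than one link. This is only a sketch under assumptions: the repository names are invented, and whether the first clone actually ends up with hard links depends on the filesystem (Git falls back to copying when linking is not possible).

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

git init -q "$work/src"
echo hello > "$work/src/file.txt"
git -C "$work/src" add file.txt
git -C "$work/src" commit -q -m "add file"

# A plain path lets Git hard-link object files when the filesystem allows it.
git clone -q "$work/src" "$work/linked"
# --no-hardlinks forces real copies of the object files.
git clone -q --no-hardlinks "$work/src" "$work/copied"

# Count object files that have more than one hard link.
linked_count=$(find "$work/linked/.git/objects" -type f -links +1 | wc -l | tr -d ' ')
copied_count=$(find "$work/copied/.git/objects" -type f -links +1 | wc -l | tr -d ' ')
echo "multiply linked object files: linked=$linked_count copied=$copied_count"
```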
Shared Clone
An even faster method when cloning a local repository is the --shared option. Rather than either copy or link files between the origin and clone repositories, this simply configures the clone to search the object database of the origin in addition to its own. Initially, the object database of the clone is completely empty, because all the objects it needs are in the origin. New objects you create in the clone are added to its own database; the clone never modifies the origin's database via this link.
It's important to keep in mind, though, that the clone is now dependent on the origin repository to function; if the origin is not accessible, Git may abort, complaining that its object database is corrupted because it can't find objects that used to be there. If you know you're going to remove the origin repository, you can use git repack -a in the clone to force it to copy all the objects it needs into its own database. If you have to recover from accidentally deleting the origin, you can edit .git/objects/info/alternates to point at another local copy, if you have one. You can also add the other repository with git remote add, then use git fetch remote to pull over the objects you need.
Another issue with shared clones is garbage collection: if garbage collection is later run on the remote and by then it has removed some refs you still have, objects that are still part of your history may just disappear, again leading to "database corrupted" errors on your side.
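The dependency of a shared clone on its origin, and the repack step that removes it, can be sketched as follows. The repository names are hypothetical, and deleting the alternates file after repacking is an assumption of this sketch rather than something the text above prescribes.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

git init -q "$work/origin-repo"
echo hello > "$work/origin-repo/file.txt"
git -C "$work/origin-repo" add file.txt
git -C "$work/origin-repo" commit -q -m "add file"

# The shared clone borrows objects instead of copying or linking them.
git clone -q --shared "$work/origin-repo" "$work/clone"

# The clone records where to find borrowed objects in this file.
alternates="$work/clone/.git/objects/info/alternates"
borrowed_from=$(cat "$alternates")
echo "borrowing from: $borrowed_from"

# Copy every needed object into the clone's own database.
git -C "$work/clone" repack -a -q

# Assumption of this sketch: once repacked, drop the link and the origin.
rm "$alternates"
rm -rf "$work/origin-repo"

# The clone still has its full history.
subject=$(git -C "$work/clone" log -1 --format=%s)
echo "latest commit: $subject"
```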
Bare Repositories
A "bare" repository is one without a working tree or index, created by git init --bare; the files normally under .git are right inside the repository directory instead. A bare repository is usually a coordination point for a centralized workflow: each person pushes and pulls to and from the bare copy, which represents the current "official" state of the project. No one uses the bare copy directly, so it doesn't need a working tree (you can't push into a non-bare repository if the push tries to update the currently checked-out branch, as that would change the branch out from under the person using it). Another use for a bare repository, using git clone --bare, is shown in the next section.
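A quick way to see what "bare" means in practice is to create one and look inside. In this small sketch the name central.git is just an example.

```shell
set -e
work=$(mktemp -d)

git init -q --bare "$work/central.git"

# The files normally kept under .git sit at the top level of a bare repository.
ls "$work/central.git"

is_bare=$(git -C "$work/central.git" rev-parse --is-bare-repository)
echo "bare: $is_bare"
```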
Reference Repositories
Suppose that:
- You want to have checkouts of multiple branches of the same project at once; or
- Several people with access to the same filesystem want clones of the same repository; or
- Some process requires you to clone the same repository frequently
…and that the repository takes a long time to clone; perhaps it has a large history, or there's a slow network link in the way. A solution is to share one local copy of the object database, rather than pull it over repeatedly, but using git clone --shared is awkward for this, because it introduces two levels of push/pull: you push from your clone to the local shared (bare) clone, and then you have to push from there to the origin (and similarly for pull).
Git has another option that exactly fits this bill: a "reference repository." Here's how it works: first, we make a bare clone of the remote repository, to be shared locally as a reference repository (hence named "refrep"):
$ git clone --bare http://foo/bar.git refrep
Cloning into 'refrep'...
remote: Counting objects: 21259, done.
remote: Compressing objects: 100% (6730/6730), done.
Receiving objects: 100% (21259/21259), 39.84 MiB | 12…
remote: Total 21259 (delta 15427), reused 20088 (delt…
Resolving deltas: 100% (15427/15427), done.
Then, we clone the remote again, but this time giving refrep as a reference:
$ git clone --reference refrep http://foo/bar.git
Cloning into 'bar'...
done.
This happens very quickly, and you see no messages about transferring objects, because none were needed; all the objects were already available in the reference repository. Others using this repository at your site can use this command to create their clones as well, sharing the reference.
The key difference between this and the --shared option is that you are still tracking the remote repository, not the refrep clone. When you pull, you still contact http://foo/, but you don't need to wait for it to send any objects that are already stored locally in refrep; when you push, you are updating the branches and other refs of the foo repository directly.
Of course, as soon as you and others start pushing new commits, the reference repository will become out of date, and you'll start to lose some of the benefit. Periodically, you can run git fetch --all in refrep to pull in any new objects. A single reference repository can be a cache for the objects of any number of others; just add them as remotes in the reference:
$ git remote add zeus http://olympus/zeus.git
$ git fetch zeus
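The reference-repository arrangement can be rehearsed locally. In this sketch a file:// URL stands in for the slow remote, and the names upstream, refrep, and bar are placeholders; it shows that the new clone tracks the upstream while borrowing objects from refrep.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

# Stand-in for the remote repository.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
echo hello > "$work/upstream/file.txt"
git -C "$work/upstream" add file.txt
git -C "$work/upstream" commit -q -m "add file"

# A bare clone that serves as the shared reference repository.
git clone -q --bare "file://$work/upstream" "$work/refrep"

# Clone the upstream again, borrowing objects from refrep where possible.
git clone -q --reference "$work/refrep" "file://$work/upstream" "$work/bar"

# The clone's remote is the upstream, not refrep ...
origin_url=$(git -C "$work/bar" config remote.origin.url)
# ... but its object lookups also consult refrep.
borrowed_from=$(cat "$work/bar/.git/objects/info/alternates")
echo "origin: $origin_url"
echo "borrowing from: $borrowed_from"
```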
Warning
- You can't safely run garbage collection in a reference repository. Someone using it may be still using a branch that has been deleted in the upstream repository, or otherwise have references to objects that have become unreachable there. Garbage collection might delete those objects, and that person's repository would then have problems, as it now can't find objects it needs. Some Git commands periodically run garbage collection automatically, as routine maintenance. You should turn off pruning of unreachable objects in the reference repository with git config gc.pruneexpire never. This still allows other safe operations to run during garbage collection, such as collecting objects stored in individual files ("loose objects") into more efficient data structures called "packs." Since people don't normally use a reference repository directly and thus won't trigger automatic garbage collection, you may want to arrange for a periodic job to run git gc in a reference repository (after setting gc.pruneexpire as shown).
- Be careful about security. If you have restricted who can clone a repository, but then add its objects to a reference, then anyone who can read the files in the reference can get the same information.
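The maintenance setting from the first warning might be applied like this. It is a minimal sketch with a hypothetical, empty refrep; in practice the git gc line would live in whatever periodic job your site uses.

```shell
set -e
work=$(mktemp -d)

# Hypothetical reference repository.
git init -q --bare "$work/refrep"

# Never prune unreachable objects in the reference repository.
git -C "$work/refrep" config gc.pruneExpire never

# Packing loose objects is still safe, for example from a periodic job.
git -C "$work/refrep" gc --quiet

prune_setting=$(git -C "$work/refrep" config gc.pruneExpire)
echo "gc.pruneExpire = $prune_setting"
```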
Local, Remote, and Tracking Branches
When you clone a repository, Git sets up "remote-tracking" branches corresponding to the branches in the origin repository. These are branches in your local repository, which show you the state of the origin branches at the time of your last push or pull. When you check out a branch that doesn't yet exist, but there is a remote-tracking branch by that name, Git automatically creates it and sets its upstream to be that tracking branch, so that subsequent push/pull operations will synchronize your local version of this branch with the remote's version. For example, when you first clone a repository, Git checks out the remote's HEAD branch, so this happens right away for one branch:
$ git clone git://nifty-software.org/nifty.git
...
$ cd nifty
$ git branch --all
master
origin/master
origin/topic
To begin with, your local and remote-tracking branches for master are at the same commit:
$ git log --oneline --decorate=short
3a9ee5f3 (origin/master, master) in principio
If you add a commit, you will see your branch pull ahead:
$ git log --oneline --decorate=short
3307465c (master) the final word
3a9ee5f3 (origin/master) in principio
If you run git fetch, you may find that someone else has also added a commit, and the branches have now diverged:
$ git log --graph --all
* commit baa699bc (origin/master)
| Author: Nefarious O. Committer <nefarious@qoxp.net>
| Date: Fri Aug 24 09:33:10 2012 -0400
|
| not quite
|
| * commit 3307465c (master)
|/ Author: Richard E. Silverman <res@qoxp.net>
| Date: Fri Aug 24 09:32:54 2012 -0400
|
| the final word
|
* commit 3a9ee5f3
  Author: Mysterious Author <ma@qoxp.net>
  Date: Fri Aug 24 09:42:27 2012 -0400

  in principio
git pull will try to merge the now-distinct branches, which is necessary before you can push your changes; otherwise, git push would update origin/master to match your master, and lose commit baa699bc in the process.
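Divergence like this can also be measured rather than read off the graph. The following sketch rebuilds the situation with two hypothetical local repositories (upstream and mine) and counts the commits unique to each side with git rev-list.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
git -C "$work/upstream" commit -q --allow-empty -m "in principio"
git clone -q "$work/upstream" "$work/mine"

# One new commit on each side.
git -C "$work/mine" commit -q --allow-empty -m "the final word"
git -C "$work/upstream" commit -q --allow-empty -m "not quite"

# Update origin/master without touching the local master.
git -C "$work/mine" fetch -q

# Left count: commits only on master; right count: only on origin/master.
set -- $(git -C "$work/mine" rev-list --left-right --count master...origin/master)
ahead=$1
behind=$2
echo "ahead $ahead, behind $behind"
```

Both counts being nonzero is exactly the diverged state: neither branch head is an ancestor of the other.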
Synchronization: Push and Pull
Having cloned a repository, you use git push and git pull to reconcile your changes with those of others using the same upstream repository. Various things can happen when your changes conflict with theirs; we'll start discussing that here, and continue in Chapter 7.
Pulling
If a branch foo is tracking a branch in a remote repository, that remote is configured as branch.foo.remote in this repository, and is said to be the remote associated with this branch, or just the "remote of this branch." git pull updates the tracking branches of the remote for the current branch (or of the origin remote if the branch has none), fetching new objects as needed and recording new upstream branches. If the current branch is tracking an upstream in that remote, Git then tries to reconcile the current state of your branch with that of the newly updated tracking branch. If only you or the upstream has added commits to this branch since your last pull, then this will succeed with a "fast-forward" update: one branch head just moves forward along the branch to catch up with the other. If both sides have added commits, though, then a fast-forward update is not possible: just setting one side's branch head to match the other would discard the opposite side's new commits (they would become unreachable from the new head). This is the situation shown previously, and the solution is a merge:
$ git log --graph --oneline
* 2ee20b94 (master, origin/master) Merge branch…
|\
| * 3307465c the final word
* | baa699bc not quite
|/
* 3a9ee5f3 in principio
The merge commit 2ee20b94 brings together the divergent local and upstream versions of the branch, and allows both master and origin/master to advance to the same commit without losing information. git pull will automatically attempt this, and if it can combine the actual changes cleanly, this will all happen smoothly. If not, Git will stop and ask you to deal with the conflicts before making the merge commit; we'll discuss that process in Chapter 7.
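A pull that ends in a merge can be reproduced with two local repositories. This is a sketch with invented names and file contents; it passes --no-rebase explicitly because some recent Git versions decline to pull divergent branches until told how to reconcile them, and --no-edit to accept the default merge message.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
git -C "$work/upstream" commit -q --allow-empty -m "in principio"
git clone -q "$work/upstream" "$work/mine"

# Independent changes to different files on each side.
echo mine > "$work/mine/mine.txt"
git -C "$work/mine" add mine.txt
git -C "$work/mine" commit -q -m "the final word"
echo theirs > "$work/upstream/theirs.txt"
git -C "$work/upstream" add theirs.txt
git -C "$work/upstream" commit -q -m "not quite"

# Fetch and merge; the changes do not conflict, so the merge is automatic.
git -C "$work/mine" pull -q --no-rebase --no-edit

# A merge commit lists itself plus two parents.
words=$(git -C "$work/mine" rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')
parent_count=$((words - 1))
echo "HEAD has $parent_count parents"
```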
Pushing
git push is the converse of git pull, with which you apply your changes to the upstream repository. If, as before, your history has diverged from that of the remote, Git will refuse to push unless you address the divergence, which you do by pulling first (as Git helpfully reminds you):
$ git push
To git://nifty-software.org/nifty.git
! [rejected] master -> master (non-fast-forward)
error: failed to push some refs to 'git://nifty-softw…
hint: Updates were rejected because the tip of your
hint: current branch is behind its remote
hint: counterpart. Merge the remote changes
hint: (e.g. 'git pull') before pushing again. See
hint: the 'Note about fast-forwards' in 'git push
hint: --help' for details.
Once you pull and resolve any conflicts, you can push again successfully. The goal of pulling with regard to pushing is to integrate the upstream changes with your own so that you can push without discarding any commits in the upstream history. You may accomplish that by merging as previously shown, or by "rebasing" (see Pull with Rebase).
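The reject, pull, push-again cycle can be walked through with a bare central repository and two clones. All names here (central.git, yours, theirs) are hypothetical, and the pull uses an explicit merge (--no-rebase) purely as one of the two options just mentioned.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

# A bare central repository whose HEAD names master.
git init -q --bare "$work/central.git"
git -C "$work/central.git" symbolic-ref HEAD refs/heads/master

# Your clone seeds the central repository with a first commit.
git clone -q "$work/central.git" "$work/yours" 2>/dev/null
git -C "$work/yours" symbolic-ref HEAD refs/heads/master
git -C "$work/yours" commit -q --allow-empty -m "in principio"
git -C "$work/yours" push -q origin master

# Someone else clones, commits, and pushes first.
git clone -q "$work/central.git" "$work/theirs"
git -C "$work/theirs" commit -q --allow-empty -m "not quite"
git -C "$work/theirs" push -q origin master

# Your push is now a non-fast-forward update and is refused.
git -C "$work/yours" commit -q --allow-empty -m "the final word"
if git -C "$work/yours" push -q origin master 2>/dev/null; then
  first_push=accepted
else
  first_push=rejected
fi
echo "first push: $first_push"

# Integrate the upstream commit, then push again.
git -C "$work/yours" pull -q --no-rebase --no-edit origin master
git -C "$work/yours" push -q origin master

central_tip=$(git -C "$work/central.git" rev-parse master)
your_tip=$(git -C "$work/yours" rev-parse master)
```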
If you have added a local branch of your own and want to start sharing it with others, use the -u option to have Git add your branch to the remote, and set up tracking for your local branch in the usual way, for example:
$ git push -u origin new-branch
After this initial setup you can use just git push on this branch, with no options or arguments, to push to the same remote.
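To confirm what -u set up, you can ask Git for the branch's upstream. The sketch below uses a hypothetical bare repository as the remote and then relies on a plain git push for a second commit.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

git init -q --bare "$work/central.git"
git clone -q "$work/central.git" "$work/mine" 2>/dev/null

# Start a new local branch with one commit.
git -C "$work/mine" symbolic-ref HEAD refs/heads/new-branch
git -C "$work/mine" commit -q --allow-empty -m "start new-branch"

# -u pushes the branch and records origin/new-branch as its upstream.
git -C "$work/mine" push -q -u origin new-branch
upstream=$(git -C "$work/mine" rev-parse --abbrev-ref 'new-branch@{upstream}')
echo "upstream: $upstream"

# Later commits need only a plain git push.
git -C "$work/mine" commit -q --allow-empty -m "more work"
git -C "$work/mine" push -q

central_tip=$(git -C "$work/central.git" rev-parse new-branch)
local_tip=$(git -C "$work/mine" rev-parse new-branch)
```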
Push Defaults
There are several approaches Git can use when given no specific remote and ref to push (just plain git push, as opposed to git push remote branch):
- matching: Push all branches with matching local and remote names
- upstream: Push the current branch to its upstream (making push and pull symmetric operations)
- simple: Like upstream, but check that the branch names are the same (to guard against mistaken upstream settings)
- current: Push the current branch to a remote one with the same name (creating it if necessary)
- nothing: Push nothing (require explicit arguments)
You can set this with the push.default configuration variable. Before Git 2.0 the default was matching; since Git 2.0 it is simple, which is more conservative and avoids easy accidental pushing of changes on other branches that are not yet ready to be published. To choose an option, think about what would happen in your particular situation if you accidentally typed git push with each of these options in force, and pick the one that makes you most comfortable. Remember that like all options, you can set this on a per-repository basis (see Basic Configuration).
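Setting the policy for a single repository might look like the following; the repository here is a throwaway, and simple is chosen only as an example value.

```shell
set -e
work=$(mktemp -d)

git init -q "$work/repo"

# Without --global, the setting applies to this repository only.
git -C "$work/repo" config push.default simple

push_default=$(git -C "$work/repo" config push.default)
echo "push.default = $push_default"
```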
Pull with Rebase
Along with the facility of merge commits comes the need to make them wisely. The notion of what a merge should indicate with respect to content is subjective and varies as a matter of version control discipline and style, but generally you want a merge to point out a substantive combination of two lines of development. Certainly, too many merges create a commit graph that is difficult to read, thus reducing the usefulness of the structural merge feature itself. In this context, certain workflows can easily create what one might call "spurious merges," which do not actually correspond to such merging of content. Having lots of these clutters up the commit graph, and makes it difficult to discern the real history of a project.
As an example: suppose you and a colleague are coordinating your individual repositories via push/pull with a shared central one. You commit a change to your repository, while he commits an unrelated change on the same branch. The changes might be to different files, or even to the same file but such that they do not require manual conflict resolution. If he pushes first, then as described earlier, your subsequent push will fail, so you will pull; then Git will do a successful automatic merge (since the changes were independent), and this becomes part of the repository history with your final push. But if you think of a merge as a deliberate step to signal the combination of conflicting or substantially different content, then you don't really want this merge. The telltale sign of this sort of spurious merge is that it's purely an artifact of timing; if the order of events had instead been:
- You commit and push.
- He pulls.
- He commits and pushes.
then there would have been no conflict, and no merge. This observation is the key to avoiding such merges using git pull --rebase, which reorders your changes. "Rebasing" is a more general idea, which we treat in Rebasing; the pull-with-rebase option is a special case. Briefly, what happens is this: suppose your master branch diverged from its upstream several commits back. For each divergent commit on your branch, Git constructs a patch representing the changes introduced by that commit; then it applies these in order starting at the tip of the upstream tracking branch origin/master. After applying each patch, Git makes a new commit preserving the author information and message from the original commit. Finally, it resets your master branch to point to the last of these commits. The effect is to "replay" your work on top of the upstream branch as new commits, rather than effecting a merge with your existing commits.
In the earlier example, git pull --rebase would produce the following simple, linear history instead of the "merge bubble" previously pictured, with its extra commit:
* 1e6f2cb2 the final word
* baa699bc not quite
* 3a9ee5f3 in principio
A push now will succeed without further work (and without merging), because you've simply added to the upstream branch; it will be a fast-forward update of that branch. Note that the commit ID for "the final word" has changed; that's because it's a new commit made by replaying the changes of the original on top of commit baa699bc.
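The linear result can be checked mechanically: after a rebasing pull, your commit should have exactly one parent, and that parent should be the upstream tip. This is a sketch using two hypothetical local repositories and invented file names.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
git -C "$work/upstream" commit -q --allow-empty -m "in principio"
git clone -q "$work/upstream" "$work/mine"

# Independent changes to different files on each side.
echo mine > "$work/mine/mine.txt"
git -C "$work/mine" add mine.txt
git -C "$work/mine" commit -q -m "the final word"
old_id=$(git -C "$work/mine" rev-parse HEAD)
echo theirs > "$work/upstream/theirs.txt"
git -C "$work/upstream" add theirs.txt
git -C "$work/upstream" commit -q -m "not quite"

# Replay the local commit on top of the updated origin/master.
git -C "$work/mine" pull -q --rebase

new_id=$(git -C "$work/mine" rev-parse HEAD)
base=$(git -C "$work/mine" rev-parse 'HEAD^')
origin_tip=$(git -C "$work/mine" rev-parse origin/master)
words=$(git -C "$work/mine" rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')
parent_count=$((words - 1))
subject=$(git -C "$work/mine" log -1 --format=%s)
echo "$subject: $parent_count parent, on top of origin/master"
```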
If git pull starts a merge when you know there's no need for it, you can always cancel it by giving an empty commit message, or with git merge --abort if the merge failed leaving you in conflict-resolution mode. If you complete such a merge and want to undo it, use git reset HEAD^ to move your branch back again, discarding the merge commit. You can then use git pull --rebase instead. You can set a specific branch to automatically use --rebase when pulling:
$ git config branch.branch-name.rebase yes
and the configuration variable branch.autosetuprebase controls how this is set for new branches:
- never: Do not set rebase (the default)
- remote: Set for branches tracking remote branches
- local: Set for branches tracking other branches in the same repository
- always: Set for all tracking branches
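The effect of branch.autosetuprebase shows up in the configuration of branches created afterward. In this sketch the repositories and the branch name topic are hypothetical.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
git -C "$work/upstream" commit -q --allow-empty -m "in principio"
git clone -q "$work/upstream" "$work/mine"

# From now on, every new tracking branch pulls with rebase.
git -C "$work/mine" config branch.autosetuprebase always

# A new branch tracking origin/master picks up the setting.
git -C "$work/mine" branch -q --track topic origin/master
topic_rebase=$(git -C "$work/mine" config branch.topic.rebase)
echo "branch.topic.rebase = $topic_rebase"
```

Branches that already existed before the setting was changed, such as master here, are not altered; those still need branch.branch-name.rebase set by hand.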
Notes
- If you know it's the right thing to do, you can perform destructive, non-fast-forward updates with the --force option to either push or pull, although in the case of push the remote must be configured to allow it; repositories created with git init --shared have this disabled by setting receive.denyNonFastForwards. Beware! It's one thing to do a forced pull; you're just discarding some of your own history. A forced push, on the other hand, causes grief for other people, who will be unable to pull cleanly as a result. For a repository shared by a small set of people in close communication, or that is a read-only reference for most, this may be occasionally appropriate. For anything shared by a wide audience, though, you really don't want to do this.
- The command git remote show remote gives a useful summary of the status of your repository in relation to a remote:
$ git remote show origin
* remote origin
  Fetch URL: git://tamias.org/chipmunks.git
  Push URL: git://tamias.org/chipmunks.git
  HEAD branch: master
  Remote branches:
    alvin tracked
    theodore tracked
    simon tracked
  Local branches configured for 'git pull':
    alvin merges with remote alvin
    simon merges with remote simon
  Local refs configured for 'git push':
    alvin pushes to alvin (up to date)
    simon pushes to simon (local out of date)
Note that unlike most informational commands, this actually examines the remote repository, so it will run ssh or otherwise use the network if necessary. You can use the -n switch to avoid this; Git will skip those operations that require contacting the remote and note them as such in the output.
- git branch -vv gives a more compact summary without contacting the remote (and thus reflects the state as of the last fetch or pull; remember that the remote might have changed in the meantime). The following shows a purely local master branch, plus two branches tracking remote ones: alvin is up to date with respect to its upstream, whereas the current local branch, simon, has moved three commits forward:
$ git branch -vv
  alvin 7e55cfe3 [origin/alvin] I love chestnuts.
  master a675f734 Chipmunks are the real nuts.
* simon 9b0e3dc5 [origin/simon: ahead 3] Walnuts!
(This state is not one resulting from previous examples.)
- There appears to be a lot of pointless redundancy in many of these messages; things like "alvin pushes to alvin," or updates indicating "master -> master." The reason is that the default, common situation is for corresponding local and remote branches to have matching names, but this need not be the case; for more complex situations, you can have arbitrary associations, and the Git messages take this into account. For example, if you have a repository with two remotes each having a master branch, your local tracking branches can't both be named master as well. You could proceed this way:
$ git remote add foo git://foo.com/foo.git
$ git remote add bar http://bar.com/bar.git
$ git fetch --all
Fetching foo
remote: Counting objects: 6, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 6 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (6/6), done.
From foo git://foo.com/foo.git
 * [new branch] master -> foo/master
Fetching bar
remote: Counting objects: 5, done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From http://bar.com/bar.git
 * [new branch] master -> bar/master
$ git checkout -b foo-master --track foo/master
Branch foo-master set up to track remote branch master from foo.
Switched to a new branch 'foo-master'
$ git checkout -b bar-master --track bar/master
Branch bar-master set up to track remote branch master from bar.
Switched to a new branch 'bar-master'
$ git branch -vv
* bar-master f1ace62e [bar/master] bars are boring
  foo-master 11e4af82 [foo/master] foosball is fab
...
- These messages from git fetch:
 * [new branch] master -> foo/master
...
 * [new branch] master -> bar/master
...
might be a little confusing; they indicate that the remote branch master in each repository is now being tracked by local branches foo/master and bar/master, respectively (not that it somehow overwrote a local master branch, which might or might not exist and is not relevant here).
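The two-remote arrangement from these notes can be reproduced entirely on one machine. In this sketch, local directories named foo and bar stand in for the two remote repositories; the paths and commit messages are made up.

```shell
set -e
work=$(mktemp -d)
# Hypothetical identity so that commits work in a bare environment.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.invalid
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.invalid

# Two unrelated repositories, each with its own master branch.
for name in foo bar; do
  git init -q "$work/$name"
  git -C "$work/$name" symbolic-ref HEAD refs/heads/master
  git -C "$work/$name" commit -q --allow-empty -m "$name commit"
done

# A third repository that follows both of them.
git init -q "$work/mine"
git -C "$work/mine" remote add foo "$work/foo"
git -C "$work/mine" remote add bar "$work/bar"
git -C "$work/mine" fetch -q --all

# Local branches with distinct names, each tracking one remote's master.
git -C "$work/mine" checkout -q -b foo-master --track foo/master
git -C "$work/mine" checkout -q -b bar-master --track bar/master

foo_upstream=$(git -C "$work/mine" rev-parse --abbrev-ref 'foo-master@{upstream}')
bar_upstream=$(git -C "$work/mine" rev-parse --abbrev-ref 'bar-master@{upstream}')
git -C "$work/mine" branch -vv
```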
Access Control
In a word (or three): there is none.
It is important to understand that Git by itself does not provide any sort of authentication or comprehensive access control when accessing a remote repository. Git has no internal notion of "user" or "account," and although some specific actions may be forbidden by configuration (e.g., non-fast-forward updates), generally you can do whatever is possible with the operating-system level access controls in place. For example, remote repositories are often accessed via SSH. This usually means that you need to be able to log into an account on the remote machine (which account may be shared with other people); you can clone and pull from the repository if that account has read access to the repository files on that machine, and you can push to the repository if that account has write access. If you're using HTTP for access instead, then similar comments apply to the configuration of the web server and the account under which it accesses the repository. That's it. There is no way within Git to limit access to particular users according to more fine-grained notions, such as granting read-only access to one branch, commit access to another, and no access to a third. There are, however, third-party tools that add such features; Gitolite, Gitorious, and Gitosis are popular ones.