This article explains the various strategies for cloning repository metadata. Git metadata is useful when analyzing the history of a repository in terms of commits, branches, tags, etc., and is especially valuable when building tooling for linting, tagging, and reporting purposes.
Environment
The stats reported in this article will vary slightly depending on the kind of machine used, the internet connection, and the normal evolution of the repositories sampled. To provide context, here is the baseline I’m using:
Setup
We’ll use the Ruby on Rails open source framework as our demo repository because Rails is fairly large and therefore makes a good example case. To start we’ll need to clone as follows:
git clone --config transfer.fsckobjects=false https://github.com/rails/rails.git
The above will yield similar results to the following in your console depending on your environment:
Cloning into 'rails'...
remote: Enumerating objects: 1358, done.
remote: Counting objects: 100% (1358/1358), done.
remote: Compressing objects: 100% (917/917), done.
remote: Total 903070 (delta 804), reused 443 (delta 441), pack-reused 901712
Receiving objects: 100% (903070/903070), 279.13 MiB | 8.26 MiB/s, done.
Resolving deltas: 100% (665058/665058), done.
Updating files: 100% (4255/4255), done.
ctags: Warning: ignoring null tag in guides/assets/javascripts/clipboard.js
ctags: Warning: ignoring null tag in guides/assets/javascripts/turbolinks.js
ctags: Warning: ignoring null tag in guides/assets/javascripts/turbolinks.js
ctags: Warning: ignoring null tag in guides/assets/javascripts/turbolinks.js
[ctags]: CTags rebuilt.
We’ll get into full cloning in a bit, but you might be wondering why --config transfer.fsckobjects=false is being used. I have transfer.fsckobjects enabled in my global Git configuration (i.e. $HOME/.gitconfig) to ensure I’m dealing with a repository of sound integrity, so it has to be switched off for this particular clone.
If I were to leave transfer.fsckobjects enabled, the clone would abort with a fatal error:
git clone https://github.com/rails/rails.git
Cloning into 'rails'...
remote: Enumerating objects: 1358, done.
remote: Counting objects: 100% (1358/1358), done.
remote: Compressing objects: 100% (917/917), done.
error: object 4cf94979c9f4d6683c9338d694d5eb3106a4e734: badTimezone: invalid author/committer line - bad time zone
fatal: fsck error in packed object
For more on transfer.fsckobjects, here’s a snippet from the Git documentation:
When set, the fetch or receive will abort in the case of a malformed object or a link to a nonexistent object. In addition, various other issues are checked for, including legacy issues (see fsck.<msg-id>), and potential security issues like the existence of a .GIT directory or a malicious .gitmodules file (see the release notes for v2.2.1 and v2.17.1 for details). Other sanity and security checks may be added in future releases.
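If you want to see how a global setting and a one-off override interact without touching your own configuration, here’s a minimal sketch. The throwaway HOME directory and the demo_git helper are assumptions made up for illustration, not part of my actual setup:

```shell
# Throwaway HOME so the real ~/.gitconfig is never touched (demo-only assumption).
demo_home="$(mktemp -d)"

# Hypothetical helper: run git against the throwaway global configuration.
demo_git() {
  HOME="$demo_home" XDG_CONFIG_HOME="$demo_home/.config" git "$@"
}

# Enable strict object checking globally.
demo_git config --global transfer.fsckObjects true

# The value any fetch or clone sees when nothing overrides it.
global_value="$(demo_git config --global --get transfer.fsckObjects)"

# A command-line override (-c) takes precedence for that single invocation.
override_value="$(demo_git -c transfer.fsckObjects=false config --get transfer.fsckObjects)"

echo "global: $global_value, one-off override: $override_value"
```

Passing --config to git clone works along the same lines, except the value is also written to the new repository’s own configuration.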
To return to our setup, next we’ll use Git Sizer, via git-sizer --verbose, to get more insight into the structure and size of the Rails repository:
Processing blobs: 220842
Processing trees: 570287
Processing commits: 111615
Matching commits to trees: 111615
Processing annotated tags: 326
Processing references: 26773
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    | 112 k     |                                |
|   * Total size               | 45.3 MiB  |                                |
| * Trees                      |           |                                |
|   * Count                    | 570 k     |                                |
|   * Total size               | 462 MiB   |                                |
|   * Total tree entries       | 11.6 M    |                                |
| * Blobs                      |           |                                |
|   * Count                    | 221 k     |                                |
|   * Total size               | 3.94 GiB  |                                |
| * Annotated tags             |           |                                |
|   * Count                    | 326       |                                |
| * References                 |           |                                |
|   * Count                    | 26.8 k    | *                              |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size [1]         | 21.3 KiB  |                                |
|   * Maximum parents [2]      | 2         |                                |
| * Trees                      |           |                                |
|   * Maximum entries [3]      | 167       |                                |
| * Blobs                      |           |                                |
|   * Maximum size [4]         | 32.1 MiB  | ***                            |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      | 53.1 k    |                                |
| * Maximum tag depth [5]      | 1         |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories [6]  | 997       |                                |
| * Maximum path depth [6]     | 12        | *                              |
| * Maximum path length [7]    | 140 B     | *                              |
| * Number of files [8]        | 4.26 k    |                                |
| * Total size of files [9]    | 76.2 MiB  |                                |
| * Number of symlinks [10]    | 7         |                                |
| * Number of submodules [11]  | 2         |                                |
[1] a06a734da8dd8fa308b4b6429eaca93bb4f0e9c9 (refs/remotes/pull_requests/37257)
[2] a484bb7eab15c9072decdae71d74a997e986e2c4 (refs/remotes/pull_requests/39283)
[3] abaf13d83062b9448964492d3e06ffb66599c692 (refs/remotes/pull_requests/41065:activerecord/test/models)
[4] 0a11b22896aca19b3536c5866ff3fa33b75373ec (refs/remotes/origin/6-0-stable:activestorage/test/fixtures/files/racecar.tif)
[5] 9048a32486634c85c89594294399dc9fc3a47628 (refs/tags/github24)
[6] c00e55536143036a36457ec6cd3ca216aed8a229 (refs/remotes/pull_requests/41065^{tree})
[7] 0edb5efeeab66c11ba1d3aa14ba6a65471d6d7b3 (refs/remotes/pull_requests/25214^{tree})
[8] 5b1fafb40bcb2e65aa7cf4b2069739aa289e6ed3 (refs/remotes/pull_requests/37724^{tree})
[9] f0131fa57fd30862f0e27fba155c90ff7010c93b (refs/remotes/pull_requests/34977^{tree})
[10] 43105331c58c0ec5e8cbc3bee5d8d0e420e8ec02 (refs/remotes/pull_requests/1438^{tree})
[11] 5686b261c8f258e8e227e6b58cdab689c23341e8 (8834b2612b7ddda70ee6a685eb0063d3daa8e63d^{tree})
Since speed and size are our goals, ~45 MB for 111,615 commits isn’t bad, but ~4 GB in blob size is fairly large. Actual space on disk, via du -h -d0 rails, is 366 MB. This is not the worst I’ve seen. I once had to work with a proprietary repository that consumed ~4 GB of actual disk space. 😱
OK, with size out of the way, what about cloning speed? We can use time as a measurement:
time git clone --config transfer.fsckobjects=false https://github.com/rails/rails.git
Here are the results:
real 1m6.489s
user 0m33.816s
sys 0m17.000s
We are at ~1.1 minutes (real) of cloning time for 366 MB of data, although we’ll soon explore how to reduce both the time to clone and the size of the clone.
Right now, because this article is focused on cloning metadata, we’ll use the --no-checkout flag so that no directories or files are checked out into the working tree. Example:
# Time (real): 1m8.027s, Size: 314 MB
time git clone --config transfer.fsckobjects=false --no-checkout https://github.com/rails/rails.git
While the time is ~2 seconds slower with --no-checkout, the size is reduced by 52 MB. The reduction in size is nice but not of huge significance since we are only skipping the file and directory structure of the most recent commit. We can use Exa to list the contents of the rails repository clone:
exa --all --long --header --group-directories-first --time-style long-iso --git --git-ignore
As expected, we only have the .git folder as a result of the clone due to using --no-checkout:
Permissions Size User      Date Modified    Git Name
drwxr-xr-x     - bkuhlmann 2021-01-10 07:23  -- .git
We still have access to the metadata we need such as the Git Log:

Definitely a frightful mess in terms of who did what in the course of this repository’s history but we’ve got all the context we need in which to continue. 😅 We can dive into the different kinds of cloning next.
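Before moving on, here’s a small sketch of what a --no-checkout clone leaves you with. To keep it runnable offline, a throwaway local repository stands in for Rails; the origin-demo and metadata-only names, the demo identity, and the v0.1.0 tag are all made up for the example:

```shell
work="$(mktemp -d)"

# Build a tiny repository to clone from. Identity is passed inline so no global config is needed.
git init -q "$work/origin-demo"
echo "hello" > "$work/origin-demo/README.md"
git -C "$work/origin-demo" add README.md
git -C "$work/origin-demo" -c user.name="Demo" -c user.email="demo@example.com" \
  commit -q -m "Add README"
git -C "$work/origin-demo" tag v0.1.0

# Clone without populating the working tree.
git clone -q --no-checkout "file://$work/origin-demo" "$work/metadata-only"

# Only .git is present, yet commits and tags remain queryable.
ls -A "$work/metadata-only"
git -C "$work/metadata-only" log --format='%h %an %s'
git -C "$work/metadata-only" tag --list
```

The same log and tag commands work against the rails clone from earlier, which is all that linting or reporting tooling typically needs.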
Terminology
Before looking at each approach in detail, let’s define the kinds of clones:
- Full - Clones all commit history, trees (i.e. files and directories), and blobs (i.e. file contents). Basically, a full clone provides everything you need to develop within a repository.
- Shallow - Clones with a truncated commit history which is great for build purposes but terrible for local development due to complications/performance with fetches and other commands.
- Blobless - Clones all reachable commits and trees while only fetching blobs as needed.
- Treeless - Clones all reachable commits while ignoring trees and blobs. Like shallow clones, this is another great way to build a repository while being simultaneously terrible for local development use due to a significant hit in performance.
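As a quick reference, the four kinds above map onto git clone flags as sketched below. The clone_flags function name is something I made up for this example, and the depth of 1 is simply the value used throughout this article:

```shell
# Print the extra `git clone` flags for a given kind of clone.
clone_flags() {
  case "$1" in
    full)     echo "" ;;                    # no extra flags: everything is fetched
    shallow)  echo "--depth 1" ;;           # truncated history
    blobless) echo "--filter=blob:none" ;;  # commits and trees, blobs on demand
    treeless) echo "--filter=tree:0" ;;     # commits only, trees and blobs on demand
    *)        echo "unknown clone kind: $1" >&2; return 1 ;;
  esac
}

for kind in full shallow blobless treeless; do
  echo "$kind: git clone $(clone_flags "$kind") <url>"
done
```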
Kinds
There are several ways to clone a repository. We’ll start with the most common and then move into the lesser known.
Full
As mentioned in Setup, here are the stats for doing a full clone while not checking out any files or directories:
# Time (real): 1m8.027s, Size: 314 MB.
time git clone --config transfer.fsckobjects=false --no-checkout https://github.com/rails/rails.git
A full clone is quite slow and also creates a large amount of data when we only want to analyze the Git commit, branch, and tag history. 🥲
Shallow
Shallow clones are one of the most common forms of cloning, especially for continuous integration servers. Here is the difference in time and size:
# Time (real): 0m4.269s, Size: 9.2 MB.
time git clone --config transfer.fsckobjects=false --no-checkout --depth 1 https://github.com/rails/rails.git
That’s a ~16x performance boost (68 seconds down to ~4.3 seconds) compared to a full clone, along with a ~34x reduction in disk usage!
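A quick way to check ratios like these is to convert the time output to seconds and divide. Here’s a minimal sketch using the figures reported in this article; the to_seconds and ratio helper names are my own invention for the example:

```shell
# Convert a `time`-style duration such as 1m8.027s into seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Divide two numbers and round to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

full_seconds="$(to_seconds 1m8.027s)"     # full clone with --no-checkout
shallow_seconds="$(to_seconds 0m4.269s)"  # shallow clone with --depth 1

echo "speed-up: $(ratio "$full_seconds" "$shallow_seconds")x"
echo "size reduction: $(ratio 314 9.2)x"  # sizes in MB
```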
💡 While shallow clones can yield impressive results, they do have a performance impact on the remote server since that is where the calculations occur. Keep this in mind when using shallow clones. For more on this, see the Homebrew 2.6.0 Release Notes and the corresponding pull request.
Blobless
For a blobless clone, we end up skipping out on the downloading of file contents. Here are the stats:
# Time (real): 0m30.031s, Size: 134 MB.
time git clone --config transfer.fsckobjects=false --no-checkout --filter=blob:none https://github.com/rails/rails.git
The shallow clone still wins, despite speed and size of the blobless clone being better than a full clone.
Treeless
With a treeless clone, we start to get closer to achieving what we want by cloning only the metadata while avoiding trees and blobs:
# Time (real): 0m14.594s, Size: 37 MB.
time git clone --config transfer.fsckobjects=false --no-checkout --filter=tree:0 https://github.com/rails/rails.git
As you can see, the results of a treeless clone are much better, yet the shallow clone still wins.
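To make the difference between full, blobless, and treeless clones more concrete, here’s a sketch that counts which objects are physically present after each kind of clone. It runs against a toy local repository (the src, full, blobless, and treeless directory names and the object_counts helper are assumptions for the demo) and relies on the serving repository allowing filters via uploadpack.allowFilter:

```shell
demo="$(mktemp -d)"

# Tiny source repository: one commit, two trees (root and lib/), two blobs.
git init -q "$demo/src"
mkdir "$demo/src/lib"
echo "readme" > "$demo/src/README.md"
echo "code" > "$demo/src/lib/a.rb"
git -C "$demo/src" add .
git -C "$demo/src" -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Initial"
# Partial clone filters must be allowed by the serving side.
git -C "$demo/src" config uploadpack.allowFilter true

# Count the objects stored locally in a clone, grouped by type.
object_counts() {
  git -C "$1" cat-file --batch-all-objects --batch-check |
    awk '{ n[$2]++ } END { printf "commits=%d trees=%d blobs=%d\n", n["commit"], n["tree"], n["blob"] }'
}

git clone -q --no-checkout "file://$demo/src" "$demo/full"
git clone -q --no-checkout --filter=blob:none "file://$demo/src" "$demo/blobless"
git clone -q --no-checkout --filter=tree:0 "file://$demo/src" "$demo/treeless"

for kind in full blobless treeless; do
  echo "$kind: $(object_counts "$demo/$kind")"
done
```

On this toy repository the full clone should hold all five objects, the blobless clone should drop the two blobs, and the treeless clone should keep only the commit.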
Comparison
You might be thinking, based on the data above, that shallow clones are the clear winner, but we can’t be that hasty. For comparison, here is a private repo that I’m dealing with. As before, using Git Sizer:
Processing blobs: 93971
Processing trees: 231188
Processing commits: 48962
Matching commits to trees: 48962
Processing annotated tags: 18
Processing references: 4873
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    | 49.0 k    |                                |
|   * Total size               | 17.2 MiB  |                                |
| * Trees                      |           |                                |
|   * Count                    | 231 k     |                                |
|   * Total size               | 297 MiB   |                                |
|   * Total tree entries       | 6.74 M    |                                |
| * Blobs                      |           |                                |
|   * Count                    | 94.0 k    |                                |
|   * Total size               | 2.84 GiB  |                                |
| * Annotated tags             |           |                                |
|   * Count                    | 18        |                                |
| * References                 |           |                                |
|   * Count                    | 4.87 k    |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size [1]         | 36.8 KiB  |                                |
|   * Maximum parents [2]      | 2         |                                |
| * Trees                      |           |                                |
|   * Maximum entries [3]      | 797       |                                |
| * Blobs                      |           |                                |
|   * Maximum size [4]         | 129 MiB   | *************                  |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      | 13.7 k    |                                |
| * Maximum tag depth [5]      | 1         |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories [6]  | 798       |                                |
| * Maximum path depth [6]     | 9         |                                |
| * Maximum path length [7]    | 140 B     | *                              |
| * Number of files [6]        | 3.88 k    |                                |
| * Total size of files [8]    | 277 MiB   |                                |
| * Number of symlinks [9]     | 2         |                                |
| * Number of submodules [10]  | 2         |                                |
Here is the same sequence of steps for the private repository, as done for Rails, but summarized for brevity:
# Time (real): 0m57.606s, Size: 314 MB.
time git clone --no-checkout https://github.com/<private>.git
# Time (real): 0m19.117s, Size: 46 MB.
time git clone --no-checkout --depth 1 https://github.com/<private>.git
# Time (real): 0m13.540s, Size: 38 MB.
time git clone --no-checkout --filter=blob:none https://github.com/<private>.git
# Time (real): 0m3.815s, Size: 14 MB.
time git clone --no-checkout --filter=tree:0 https://github.com/<private>.git
Notice, in this case, the treeless clone is the clear winner, and not by a small margin! If you want to improve your clones in terms of both speed and size, you might need to experiment with shallow and treeless clones. Granted, which approach wins depends on the kind of repository, so the advantage isn’t always clear-cut, but the experimentation is likely still worthwhile. In this case, I believe the treeless clone has the upper hand due to this repository having a large binary (i.e. 129 MiB max blob size) checked in at one point in the repository’s history.
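If you’d like to run this kind of experiment against your own repositories, a rough sketch along these lines might help. The bench_clones name is invented for the example, the timing uses whole seconds from date (so prefer time for finer measurements), and sizes come from du in KiB rather than the MB figures quoted above:

```shell
# Clone URL once per strategy into a scratch directory, then report elapsed
# seconds and on-disk size for each strategy.
bench_clones() {
  url="$1"
  scratch="$(mktemp -d)"
  for spec in "full:" "shallow:--depth 1" "blobless:--filter=blob:none" "treeless:--filter=tree:0"; do
    kind="${spec%%:*}"   # text before the first colon
    flags="${spec#*:}"   # text after the first colon
    start="$(date +%s)"
    # $flags is intentionally unquoted so "--depth 1" splits into two arguments.
    git clone -q --no-checkout $flags "$url" "$scratch/$kind" || return 1
    end="$(date +%s)"
    size="$(du -sk "$scratch/$kind" | awk '{ print $1 }')"
    echo "$kind: $((end - start))s, ${size} KiB"
  done
  rm -rf "$scratch"
}
```

For example, bench_clones https://github.com/rails/rails.git would repeat the Rails measurements from this article, although you’d want to add --config transfer.fsckobjects=false if you have that check enabled globally like I do.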
Thoughts
For more on Git cloning, check out the documentation. I’ve been enjoying digging into how to improve the performance of cloning repositories, so I will probably write about the subject more in the future. While this article remained focused on cloning only metadata, there are better ways of cloning only specific files for build purposes via git clone --sparse or even the more experimental sparse checkout command. We’ll have to save that discussion for another time. 😉