Published January 10, 2021 Updated April 11, 2021
Git Metadata Cloning

This article explains the various strategies for cloning repository metadata. Git metadata is useful when analyzing the history of a repository in terms of commits, branches, tags, etc., and is especially valuable when building tooling for linting, tagging, and reporting purposes.

Environment

The stats reported in this article will vary, slightly, depending on the kind of machine used, internet connection, and normal evolution of repositories sampled. To provide context, here is the baseline I’m using:

  • Operating System: Apple macOS Big Sur 11.1.0.

  • Hardware: Apple MacBook Air (Retina, 13-inch, 2018) with 1.6 GHz Dual-Core Intel Core i5.

  • Internet: Wireless with ~239 Mbps down.

  • Git: Git 2.30.0.

Setup

We’ll use the Ruby on Rails open source framework as our demo repository because Rails is fairly large and therefore makes a good example case. To start we’ll need to clone as follows:

git clone --config transfer.fsckobjects=false https://github.com/rails/rails.git

The above will yield similar results to the following in your console depending on your environment:

Cloning into 'rails'...
remote: Enumerating objects: 1358, done.
remote: Counting objects: 100% (1358/1358), done.
remote: Compressing objects: 100% (917/917), done.
remote: Total 903070 (delta 804), reused 443 (delta 441), pack-reused 901712
Receiving objects: 100% (903070/903070), 279.13 MiB | 8.26 MiB/s, done.
Resolving deltas: 100% (665058/665058), done.
Updating files: 100% (4255/4255), done.
ctags: Warning: ignoring null tag in guides/assets/javascripts/clipboard.js
ctags: Warning: ignoring null tag in guides/assets/javascripts/turbolinks.js
ctags: Warning: ignoring null tag in guides/assets/javascripts/turbolinks.js
ctags: Warning: ignoring null tag in guides/assets/javascripts/turbolinks.js
[ctags]: CTags rebuilt.

We’ll get into full cloning in a bit, but you might be wondering why --config transfer.fsckobjects=false is being used. I have transfer.fsckobjects enabled in my global Git configuration (i.e. $HOME/.gitconfig) to ensure I’m dealing with repositories of sound integrity, so it has to be disabled for this particular clone. If I were to leave transfer.fsckobjects enabled, I would end up with a fatal error:

git clone https://github.com/rails/rails.git

Cloning into 'rails'...
remote: Enumerating objects: 1358, done.
remote: Counting objects: 100% (1358/1358), done.
remote: Compressing objects: 100% (917/917), done.
error: object 4cf94979c9f4d6683c9338d694d5eb3106a4e734: badTimezone: invalid author/committer line - bad time zone
fatal: fsck error in packed object

For more on transfer.fsckobjects, here’s a snippet from the Git documentation:

When set, the fetch or receive will abort in the case of a malformed object or a link to a nonexistent object. In addition, various other issues are checked for, including legacy issues (see fsck.<msg-id>), and potential security issues like the existence of a .GIT directory or a malicious .gitmodules file (see the release notes for v2.2.1 and v2.17.1 for details). Other sanity and security checks may be added in future releases.
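To check how your own machine is configured before cloning, a minimal sketch along these lines may help. The function name fsck_objects_setting is my own invention for illustration; the only Git command it relies on is git config --get.

```shell
# Print the effective transfer.fsckobjects value, or "unset" when it is not
# configured anywhere. The function name is hypothetical, not part of Git.
fsck_objects_setting() {
  git config --get transfer.fsckobjects 2>/dev/null || printf '%s\n' "unset"
}

# Example usage (prints "true", "false", or "unset"):
#   fsck_objects_setting
```

When this prints true, passing --config transfer.fsckobjects=false to git clone, as done above, overrides the global value for the newly cloned repository.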

To return to our setup, next we’ll use Git Sizer to get more insight, via git-sizer --verbose, into the structure and size of the Rails repository:

Processing blobs: 220842
Processing trees: 570287
Processing commits: 111615
Matching commits to trees: 111615
Processing annotated tags: 326
Processing references: 26773
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   112 k   |                                |
|   * Total size               |  45.3 MiB |                                |
| * Trees                      |           |                                |
|   * Count                    |   570 k   |                                |
|   * Total size               |   462 MiB |                                |
|   * Total tree entries       |  11.6 M   |                                |
| * Blobs                      |           |                                |
|   * Count                    |   221 k   |                                |
|   * Total size               |  3.94 GiB |                                |
| * Annotated tags             |           |                                |
|   * Count                    |   326     |                                |
| * References                 |           |                                |
|   * Count                    |  26.8 k   | *                              |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  21.3 KiB |                                |
|   * Maximum parents      [2] |     2     |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |   167     |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  32.1 MiB | ***                            |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |  53.1 k   |                                |
| * Maximum tag depth      [5] |     1     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |   997     |                                |
| * Maximum path depth     [6] |    12     | *                              |
| * Maximum path length    [7] |   140 B   | *                              |
| * Number of files        [8] |  4.26 k   |                                |
| * Total size of files    [9] |  76.2 MiB |                                |
| * Number of symlinks    [10] |     7     |                                |
| * Number of submodules  [11] |     2     |                                |

[1]  a06a734da8dd8fa308b4b6429eaca93bb4f0e9c9 (refs/remotes/pull_requests/37257)
[2]  a484bb7eab15c9072decdae71d74a997e986e2c4 (refs/remotes/pull_requests/39283)
[3]  abaf13d83062b9448964492d3e06ffb66599c692 (refs/remotes/pull_requests/41065:activerecord/test/models)
[4]  0a11b22896aca19b3536c5866ff3fa33b75373ec (refs/remotes/origin/6-0-stable:activestorage/test/fixtures/files/racecar.tif)
[5]  9048a32486634c85c89594294399dc9fc3a47628 (refs/tags/github24)
[6]  c00e55536143036a36457ec6cd3ca216aed8a229 (refs/remotes/pull_requests/41065^{tree})
[7]  0edb5efeeab66c11ba1d3aa14ba6a65471d6d7b3 (refs/remotes/pull_requests/25214^{tree})
[8]  5b1fafb40bcb2e65aa7cf4b2069739aa289e6ed3 (refs/remotes/pull_requests/37724^{tree})
[9]  f0131fa57fd30862f0e27fba155c90ff7010c93b (refs/remotes/pull_requests/34977^{tree})
[10] 43105331c58c0ec5e8cbc3bee5d8d0e420e8ec02 (refs/remotes/pull_requests/1438^{tree})
[11] 5686b261c8f258e8e227e6b58cdab689c23341e8 (8834b2612b7ddda70ee6a685eb0063d3daa8e63d^{tree})

Since speed and size are our goals, ~45 MB for 111,615 commits isn’t bad, but ~4 GB in blob size is fairly large. Actual space on disk, via du -h -d0 rails, is 366 MB. This is not the worst I’ve seen. I once had to work with a proprietary repository that consumed ~4 GB of actual disk space. 😱

OK, with size out of the way, what about cloning speed? We can use time as a measurement:

time git clone --config transfer.fsckobjects=false https://github.com/rails/rails.git

Here are the results:

real  1m6.489s
user  0m33.816s
sys 0m17.000s

We are at ~1.1 minutes (real) of cloning time for 366 MB of data. Soon, we’ll explore how to reduce both the time to clone and the size of the clone.
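If you want to repeat these measurements on your own machine, a rough sketch like the following can capture both numbers in one step. The measure_clone name and its output format are assumptions of mine, and it only resolves whole seconds, so prefer time when you need finer precision.

```shell
# Clone a repository, then report elapsed wall-clock seconds and size on disk.
# Usage: measure_clone <target-dir> [git clone options...] <url>
# The function name and output format are hypothetical.
measure_clone() {
  target="$1"
  shift
  started="$(date +%s)"
  git clone --quiet "$@" "$target" || return 1
  finished="$(date +%s)"
  # du -sk reports the size in 1024-byte units.
  kib="$(du -sk "$target" | awk '{ print $1 }')"
  printf 'seconds=%s kib=%s\n' "$((finished - started))" "$kib"
}

# Example usage:
#   measure_clone rails --config transfer.fsckobjects=false https://github.com/rails/rails.git
```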

Right now, because this article is focused on cloning metadata, we’ll use the --no-checkout flag to prevent any directories and files from being checked out into the working tree. Example:

# Time (real): 1m8.027s, Size: 314 MB
time git clone --config transfer.fsckobjects=false --no-checkout https://github.com/rails/rails.git

While the time is ~2 seconds slower with --no-checkout, the size is reduced by 52 MB. The reduction in size is nice but not of huge significance since we are only skipping the file and directory structure of the most recent commit. We can use Exa to list the contents of the rails repository clone:

exa --all --long --header --group-directories-first --time-style long-iso --git --git-ignore

As expected, we only have the .git folder as a result of the clone due to using --no-checkout:

Permissions Size User      Date Modified    Git Name
drwxr-xr-x     - bkuhlmann 2021-01-10 07:23  -- .git

We still have access to the metadata we need such as the Git Log:

Rails Git Log

Definitely a frightful mess in terms of who did what over the course of this repository’s history, but we’ve got all the context we need to continue. 😅 We can dive into the different kinds of cloning next.
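As a minimal sketch of reading that metadata from a --no-checkout clone, the helper below prints recent commits without needing a working tree. The metadata_log name and the chosen log format are mine, not something Git provides.

```shell
# Print the last N commits (short SHA, author, date, subject) of a repository.
# This only reads from the .git folder, so no checked out files are required.
# The function name is hypothetical.
metadata_log() {
  git -C "$1" log --max-count="${2:-5}" --date=short --format='%h %an %ad %s'
}

# Example usage:
#   metadata_log rails 10
```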

Terminology

Before looking at each kind of clone in detail, let’s define them:

  • Full - Clones all commit history, trees (i.e. files and directories), and blobs (i.e. file contents). Basically, a full clone provides everything you need to develop within a repository.

  • Shallow - Clones with a truncated commit history which is great for build purposes but terrible for local development due to complications/performance with fetches and other commands.

  • Blobless - Clones all reachable commits and trees while only fetching blobs as needed.

  • Treeless - Clones all reachable commits while ignoring trees and blobs. Like shallow clones, this is another great way to build a repository while being simultaneously terrible for local development use due to a significant hit in performance.
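The four kinds above map onto git clone flags roughly as sketched here. The clone_flags helper is a hypothetical convenience of mine; the flags themselves are the ones used throughout the rest of this article.

```shell
# Print the git clone flags for a given kind of clone.
# The function name is hypothetical; the flags are standard git clone options.
clone_flags() {
  case "$1" in
    full)     printf '%s\n' "" ;;
    shallow)  printf '%s\n' "--depth 1" ;;
    blobless) printf '%s\n' "--filter=blob:none" ;;
    treeless) printf '%s\n' "--filter=tree:0" ;;
    *)        printf 'Unknown clone kind: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Example usage (left unquoted on purpose so "--depth 1" splits into two words):
#   git clone --no-checkout $(clone_flags treeless) https://github.com/rails/rails.git
```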

Kinds

There are several ways to clone a repository. We’ll start with the most common and then move into the lesser known.

Full

As mentioned in Setup, here are the stats for doing a full clone while not checking out any files or directories:

# Time (real): 1m8.027s, Size: 314 MB.
time git clone --config transfer.fsckobjects=false --no-checkout https://github.com/rails/rails.git

A full clone is quite slow and also creates a large amount of data when we only want to analyze the Git commit, branch, and tag history. 🥲

Shallow

Shallow clones are one of the most common forms of cloning, especially for continuous integration servers. Here is the difference in time and size:

# Time (real): 0m4.269s, Size: 9.2 MB.
time git clone --config transfer.fsckobjects=false --no-checkout --depth 1 https://github.com/rails/rails.git

That’s a ~16x performance boost compared to a full clone, along with a ~34x reduction in disk usage!

💡 While shallow clones can yield impressive results, they do have a performance impact on the remote server since that is where the calculations occur. Keep this in mind when using shallow clones. For more on this, see the Homebrew 2.6.0 Release Notes and the corresponding pull request.
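Since a shallow clone truncates commit history, it can be worth confirming how much history is actually available locally before analyzing it. The following is only a sketch, and history_summary is a name I made up for it.

```shell
# Summarize how much history a clone holds locally.
# Prints: shallow=<true|false> commits=<count reachable from HEAD>
# The function name and output format are hypothetical.
history_summary() {
  shallow="$(git -C "$1" rev-parse --is-shallow-repository)"
  commits="$(git -C "$1" rev-list --count HEAD)"
  printf 'shallow=%s commits=%s\n' "$shallow" "$commits"
}

# Example usage:
#   history_summary rails
```

For a clone made with --depth 1, this reports shallow=true commits=1, which is a good reminder that only the most recent commit is present.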

Blobless

For a blobless clone, we end up skipping out on the downloading of file contents. Here are the stats:

# Time (real): 0m30.031s, Size: 134 MB.
time git clone --config transfer.fsckobjects=false --no-checkout --filter=blob:none https://github.com/rails/rails.git

The shallow clone still wins, despite speed and size of the blobless clone being better than a full clone.
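To tell afterwards whether a clone was made with a filter, you can inspect the configuration Git records for the remote. This sketch assumes the remote is named origin and that your Git version stores the filter under remote.origin.partialclonefilter, as recent versions do; clone_filter is a hypothetical name.

```shell
# Print the partial clone filter recorded for the origin remote, or "none"
# for a clone made without --filter. The function name is hypothetical.
clone_filter() {
  git -C "$1" config --get remote.origin.partialclonefilter 2>/dev/null ||
    printf '%s\n' "none"
}

# Example usage:
#   clone_filter rails
```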

Treeless

With a treeless clone, we start to get closer to achieving what we want by cloning only the metadata while avoiding trees and blobs:

# Time (real): 0m14.594s, Size: 37 MB.
time git clone --config transfer.fsckobjects=false --no-checkout --filter=tree:0 https://github.com/rails/rails.git

As you can see, the results of a treeless clone are much better yet the shallow clone still wins. 💭
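One way to see what a treeless clone actually contains is to count the locally stored objects by type. This is a minimal sketch with a made-up object_count helper; it only inspects objects already on disk.

```shell
# Count locally stored objects of a given type (commit, tree, blob, or tag).
# The function name is hypothetical.
object_count() {
  git -C "$1" cat-file --batch-check --batch-all-objects |
    awk -v type="$2" '$2 == type { count++ } END { print count + 0 }'
}

# Example usage:
#   object_count rails commit
#   object_count rails blob
```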

Comparison

You might be thinking, based on the data above, that shallow clones are the clear winner, but we can’t be that hasty. For comparison, here is a private repo that I’m dealing with. As before, using Git Sizer:

Processing blobs: 93971
Processing trees: 231188
Processing commits: 48962
Matching commits to trees: 48962
Processing annotated tags: 18
Processing references: 4873
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |  49.0 k   |                                |
|   * Total size               |  17.2 MiB |                                |
| * Trees                      |           |                                |
|   * Count                    |   231 k   |                                |
|   * Total size               |   297 MiB |                                |
|   * Total tree entries       |  6.74 M   |                                |
| * Blobs                      |           |                                |
|   * Count                    |  94.0 k   |                                |
|   * Total size               |  2.84 GiB |                                |
| * Annotated tags             |           |                                |
|   * Count                    |    18     |                                |
| * References                 |           |                                |
|   * Count                    |  4.87 k   |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  36.8 KiB |                                |
|   * Maximum parents      [2] |     2     |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |   797     |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |   129 MiB | *************                  |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |  13.7 k   |                                |
| * Maximum tag depth      [5] |     1     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |   798     |                                |
| * Maximum path depth     [6] |     9     |                                |
| * Maximum path length    [7] |   140 B   | *                              |
| * Number of files        [6] |  3.88 k   |                                |
| * Total size of files    [8] |   277 MiB |                                |
| * Number of symlinks     [9] |     2     |                                |
| * Number of submodules  [10] |     2     |                                |

Here is the same sequence of steps for the private repository, as done for Rails, but summarized for brevity:

# Time (real): 0m57.606s, Size: 314 MB.
time git clone --no-checkout https://github.com/<private>.git

# Time (real): 0m19.117s, Size: 46 MB.
time git clone --no-checkout --depth 1 https://github.com/<private>.git

# Time (real): 0m13.540s, Size: 38 MB.
time git clone --no-checkout --filter=blob:none https://github.com/<private>.git

# Time (real): 0m3.815s, Size: 14 MB.
time git clone --no-checkout --filter=tree:0 https://github.com/<private>.git

Notice, in this case, the treeless clone is the clear winner, and not by a small margin! If you want to improve your clones in terms of both speed and size, you might need to experiment with shallow and treeless clones. Granted, which kind has the advantage depends on the repository used, but the experimentation is likely still worthwhile. In this case, I believe the treeless clone has the upper hand due to this repository having a large binary (i.e. 129 MiB max blob size) checked in at one point in the repository’s history.
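To put a number on that margin, the ratios can be worked out from the reported times and sizes. The duration_seconds and ratio helpers below are illustrative names of mine, and the figures are the ones listed for the private repository.

```shell
# Convert a `time`-style duration such as "0m57.606s" (or "57.606s") to seconds.
duration_seconds() {
  printf '%s\n' "$1" | awk '{
    n = split($0, parts, "m")
    seconds = parts[n]
    sub(/s$/, "", seconds)
    minutes = (n > 1) ? parts[1] : 0
    printf "%.3f\n", minutes * 60 + seconds
  }'
}

# Print how many times larger the first number is than the second.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Full versus treeless clone of the private repository.
ratio "$(duration_seconds 0m57.606s)" "$(duration_seconds 0m3.815s)"  # time: 15.1
ratio 314 14                                                          # size: 22.4
```

In other words, the treeless clone is roughly 15x faster and 22x smaller than the full clone for this repository.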

Thoughts

For more on Git cloning, check out the documentation. I’ve been enjoying digging into how to improve the performance of cloning repositories, so I will probably write about the subject more in the future. While this article remained focused on cloning only metadata, there are better ways of cloning only specific files for build purposes via git clone --sparse or even the more experimental sparse checkout command. We’ll have to save that discussion for another time. 😉