Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack-objects: add --path-walk option for better deltas #1813

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Oct 8, 2024

This is a reviewable series introducing the path-walk API and its application within the 'git pack-objects --path-walk' machinery. This API was previously discussed in the path-walk RFC [1] and the patch series around the --full-name-hash option (branch ds/pack-name-hash-tweak) [2]. This series conflicts with ds/pack-name-hash-tweak, but that was on hold because it did not seem as critical based on community interest.

[1] https://lore.kernel.org/git/[email protected]

[2] https://lore.kernel.org/git/[email protected]

The primary motivation for this feature is its use to shrink the packfile created by 'git push' when there are many name-hash collisions. This need was discovered in several Javascript repositories that use the beachball tool [3] to generate CHANGELOG.md and CHANGELOG.json files. When a batch of these files are created at the same time and pushed to a release branch, the 'git pack-objects' process has many collisions among these files and delta bases are selected poorly.

[3] https://github.com/microsoft/beachball

In some cases, 'git push' is pushing 60-100 MB when the new path-walk algorithm will identify better delta bases and pack the same objects into a thin pack less than 1 MB. This was the most extreme example we could find and is present in a private monorepo. However, the microsoft/fluentui repo [4] is a good example for demonstrating similar improvements. The patch descriptions frequently refer to this repo and which commit should be checked out to reproduce this behavior.

[4] https://github.com/microsoft/fluentui

The path-walk API is a new way to walk objects. Traditionally, the revision API walks objects by visiting a commit, then visiting its root tree and recursing through trees to visit reachable objects that were not previously visited. The path-walk API visits objects in batches based on the path walked from a root tree to that object. (Only the first discovered path is chosen; this avoids certain kinds of Git bombs.)

This has an immediate application to 'git pack-objects'.

When using the traditional revision API to walk objects, each object is emitted with an associated path. Since this path may appear for many objects spread across the full list, a heuristic is used: the "name-hash" is stored for that object instead of the full path name. This name-hash will group objects at the same path together, but also has a form of "locality" to group likely-similar objects together. When there are few collisions in the name-hash function, this helps compress objects that appear at the same path as well as help compress objects across different paths that have similar suffixes. When there are many versions of the same path, then finding delta bases across that family of objects is very important. When there are few versions of the same path, then finding cross-name delta bases is also important. The former is important for clones and repacks while the latter is important for shallow clones. They can both be important for pushes. In all cases, this approach is fraught when there are many name-hash collisions as the window size becomes a limiting factor for finding quality delta bases.

When using the path-walk API to walk objects, we group objects by the same path from the start. We don't need to store the path name, since we have the objects already in a group. We can compute deltas within that group and then use the name-hash approach to resort the object list and look for opportunistic cross-path deltas. Thus, the path-walk approach allows finding delta bases at least as good as the traditional revision API approach. (Caveat: if we assume delta reuse and the existing deltas were computed with the revision API approach, then the path-walk API approach may result in slightly worse delta compression. The performance tests in this series use --no-reuse-delta for this reason.)

Once 'git pack-objects --path-walk' exists, we have a few ways to take advantage of it in other places:

  • The new 'pack.usePathWalk' config option will assume the --path-walk option. This allows 'git push' to assume this value and get the effect we want. This is similar to the 'pack.useSparse' config option that uses a similar path-based walk to limit the set of boundary objects.

  • 'git repack' learns a '--path-walk' option to pass to its child 'git pack-objects' process. This is also implied by 'pack.usePathWalk' but allows for testing without modifying repository config.

I'll repeat the following table of repacking results when using 'git repack -adf [--path-walk]' on a set of repositories with many name-hash collisions. Only the microsoft/fluentui repository is publicly available for testing, so the others are left as Repo B/C/D.

| Repo     | Standard Repack | With --path-walk |
|----------|-----------------|------------------|
| fluentui |         438 MB  |          148 MB  |
| Repo B   |       6,255 MB  |          778 MB  |
| Repo C   |      37,737 MB  |        6,158 MB  |
| Repo D   |     130,049 MB  |        4,432 MB  |

While this series is replacing ds/pack-name-hash-tweak and its introduction of the --full-name-hash option, it is worth comparing that option to the --path-walk option.

  • The --full-name-hash option is a much smaller code change, as it drops into the existing uses of the name-hash function.

  • The --full-name-hash option is more likely to integrate with server-side features such as delta islands and reachability bitmaps due to not changing the object walk. It was already noted that the .bitmap files store name-hash values, so there is some compatibility required to integrate with bitmaps. The --path-walk option will be more difficult to integrate with those options (at least during a repack), but maybe is not impossible; the --path-walk option will not work when reading reachability bitmaps, since we are avoiding walking trees entirely.

  • The --full-name-hash option is good when there are many name-hash collisions and many versions of the paths with those collisions. When creating a shallow clone or certain kinds of pushes, the --full-name-hash option is much worse at finding cross-path delta bases since it loses the locality of the standard name-hash function. The --path-walk option includes a second pass of delta computation using the standard name-hash function and thus finds good cross-path delta bases when they improve upon the same-path delta bases.

There are a few differences from the RFC version of this series:

  1. The last two patches refactor the approach to perform delta calculations by path after the object walk and then allows those delta calculations to happen in a threaded manner.

  2. Both 'git pack-objects' and 'git repack' are removed from the t0450 exclusion list.

  3. The path-walk API has improved technical documentation that is extended as its functionality is expanded.

  4. Various bugs have been patched with matching tests. The new 'test-tool path-walk' helper allows for careful testing of the API separately from its use within other commands.

Updates in v2

I'm sending this v2 to request some review feedback on the series. I'm sorry it's so long.

There are two updates in this version:

  • Fixed a performance issue in the presence of many annotated tags. This is caught by p5313 when run on a repo with 10,000+ annotated tags.
  • The Scalar config was previously wrong and should be pack.usePathWalk, not push.usePathWalk.

Thanks, - Stolee

cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]

derrickstolee and others added 3 commits October 8, 2024 07:35
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.

There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.

Signed-off-by: Derrick Stolee <[email protected]>
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.

Signed-off-by: Derrick Stolee <[email protected]>
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.

This adds the ability to ask for the commits in a list, as well. Future
changes will add the ability to visit annotated tags.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Oct 8, 2024

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1813/derrickstolee/path-walk-upstream-v1

To fetch this version to local tag pr-1813/derrickstolee/path-walk-upstream-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1813/derrickstolee/path-walk-upstream-v1

Copy link

gitgitgadget bot commented Oct 8, 2024

This patch series was integrated into seen via git@b0017d4.

@gitgitgadget gitgitgadget bot added the seen label Oct 8, 2024
Copy link

gitgitgadget bot commented Oct 9, 2024

This patch series was integrated into seen via git@e1dc636.

derrickstolee and others added 11 commits October 9, 2024 10:03
In anticipation of using the path-walk API to analyze tags or include
them in a pack-file, add the ability to walk the tags that were included
in the revision walk.

When these tag objects point to blobs or trees, we need to make sure
those objects are also visited. Treat tagged trees as root trees, but
put the tagged blobs in their own category.

Be careful about objects that are referred to by multiple references.

Co-authored-by: Johannes Schindelin <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
The sparse tree walk algorithm was created in d5d2e93 (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.

Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.

A use of this method will be added in a later change, with a condition
set whether the sparse or dense approach should be used.

Signed-off-by: Derrick Stolee <[email protected]>
This option causes the path-walk API to act like the sparse tree-walk
algorithm implemented by mark_trees_uninteresting_sparse() in
list-objects.c.

Starting from the commits marked as UNINTERESTING, their root trees and
all objects reachable from those trees are UNINTERSTING, at least as we
walk path-by-path. When we reach a path where all objects associated
with that path are marked UNINTERESTING, then do no continue walking the
children of that path.

We need to be careful to pass the UNINTERESTING flag in a deep way on
the UNINTERESTING objects before we start the path-walk, or else the
depth-first search for the path-walk API may accidentally report some
objects as interesting.

Signed-off-by: Derrick Stolee <[email protected]>
This will be helpful in a future change that introduces a new way to
compute deltas.

Be careful to preserve the nr_deltas counting logic in the existing
method, but take the rest of the logic wholesale.

Signed-off-by: Derrick Stolee <[email protected]>
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.

This new walk is incompatible with some features and is ignored by
others:

 * Object filters are not currently integrated with the path-walk API,
   such as sparse-checkout or tree depth. A blobless packfile could be
   integrated easily, but that is deferred for later.

 * Server-focused features such as delta islands, shallow packs, and
   using a bitmap index are incompatible with the path-walk API.

 * The path walk API is only compatible with the --revs option, not
   taking object lists or pack lists over stdin. These alternative ways
   to specify the objects currently ignores the --path-walk option
   without even a warning.

Future changes will create performance tests that demonstrate the power
of this approach.

Signed-off-by: Derrick Stolee <[email protected]>
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.

Signed-off-by: Derrick Stolee <[email protected]>
The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.

In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.

Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.

Running on my copy of the Git repository results in this data:

Test                                      this tree
---------------------------------------------------------
5313.2: thin pack                         0.01(0.00+0.00)
5313.3: thin pack size                               1.1K
5313.4: thin pack with --path-walk        0.01(0.01+0.00)
5313.5: thin pack size with --path-walk              1.1K
5313.6: big pack                          2.52(6.59+0.38)
5313.7: big pack size                               14.1M
5313.8: big pack with --path-walk         4.90(5.76+0.26)
5313.9: big pack size with --path-walk              13.2M

Note that the timing is slower because there is no threading in the
--path-walk case (yet).

The cases where the --path-walk option really shines is when the default
name-hash is overwhelmed with collisions. An open source example can be
found in the microsoft/fluentui repo [1] at a certain commit [2].

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Running the tests on this repo results in the following output:

Test                                      this tree
----------------------------------------------------------
5313.2: thin pack                         0.28(0.38+0.02)
5313.3: thin pack size                               1.2M
5313.4: thin pack with --path-walk        0.08(0.06+0.01)
5313.5: thin pack size with --path-walk             18.4K
5313.6: big pack                          4.05(29.62+0.43)
5313.7: big pack size                               20.0M
5313.8: big pack with --path-walk         5.99(9.06+0.24)
5313.9: big pack size with --path-walk              16.4M

Notice in particular that in the small thin pack, the time performance
has improved from 0.28s to 0.08s and this is likely due to the improved
size of the resulting pack: 18.4K instead of 1.2M.

Finally, running this on a copy of the Linux kernel repository results
in these data points:

Test                                      this tree
-----------------------------------------------------------
5313.2: thin pack                         0.00(0.00+0.00)
5313.3: thin pack size                               5.8K
5313.4: thin pack with --path-walk        0.00(0.01+0.00)
5313.5: thin pack size with --path-walk              5.8K
5313.6: big pack                          24.39(65.81+1.31)
5313.7: big pack size                              155.7M
5313.8: big pack with --path-walk         41.07(60.69+0.68)
5313.9: big pack size with --path-walk             150.8M

Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

In my copy of the Git repository, the new tests in p5313 show these
results:

Test                                      this tree
-------------------------------------------------------------
5313.10: repack                           27.88(150.23+2.70)
5313.11: repack size                               228.2M
5313.12: repack with --path-walk          134.59(148.77+0.81)
5313.13: repack size with --path-walk              209.7M

Note that the 'git pack-objects --path-walk' feature is not integrated
with threads. Look forward to a future change that will introduce
threading to improve the time performance of this feature with
equivalent space performance.

For the microsoft/fluentui repo [1] had some interesting aspects for the
previous tests in p5313, so here are the repack results:

Test                                      this tree
-------------------------------------------------------------
5313.10: repack                           91.76(680.94+2.48)
5313.11: repack size                               439.1M
5313.12: repack with --path-walk          110.35(130.46+0.74)
5313.13: repack size with --path-walk              155.3M

[1] https://github.com/microsoft/fluentui

Here, we see the significant improvement of a full repack using this
strategy. The name-hash collisions in this repo cause the space
problems. Those collisions also cause the repack command to spend a lot
of cycles trying to find delta bases among files that are not actually
very similar, so the lack of threading with the --path-walk feature is
less pronounced in the process time.

For the Linux kernel repository, we have these stats:

Test                                      this tree
---------------------------------------------------------------
5313.10: repack                           553.61(1929.41+30.31)
5313.11: repack size                                 2.5G
5313.12: repack with --path-walk          1777.63(2044.16+7.47)
5313.13: repack size with --path-walk                2.5G

This demonstrates that the --path-walk feature does not always present
measurable improvements, especially in cases where the name-hash has
very few collisions.

Signed-off-by: Derrick Stolee <[email protected]>
The t0450 test script verifies that the builtin usage matches the
synopsis in the documentation. Update 'git repack' to match and remove
it from the exception list.

Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <[email protected]>
Copy link

gitgitgadget bot commented Oct 9, 2024

This branch is now known as ds/path-walk.

Copy link

gitgitgadget bot commented Oct 10, 2024

This patch series was integrated into seen via git@be8d416.

Copy link

gitgitgadget bot commented Oct 10, 2024

This patch series was integrated into seen via git@5ddc828.

Copy link

gitgitgadget bot commented Oct 10, 2024

There was a status update in the "Cooking" section about the branch ds/path-walk on the Git mailing list:

A new algorithm for object graph traversal to favor visiting the
objects at the same tree path in succession (as opposed to visiting
objects that are different between trees as we walk commit
histories) is introduced to optimize object packing.

Needs review.
source: <[email protected]>

Copy link

gitgitgadget bot commented Oct 11, 2024

This patch series was integrated into seen via git@7d1078a.

Copy link

gitgitgadget bot commented Oct 11, 2024

This patch series was integrated into seen via git@24e32b0.

Copy link

gitgitgadget bot commented Oct 12, 2024

There was a status update in the "Cooking" section about the branch ds/path-walk on the Git mailing list:

A new algorithm for object graph traversal to favor visiting the
objects at the same tree path in succession (as opposed to visiting
objects that are different between trees as we walk commit
histories) is introduced to optimize object packing.

Needs review.
source: <[email protected]>

Copy link

gitgitgadget bot commented Oct 14, 2024

This patch series was integrated into seen via git@542db49.

Copy link

gitgitgadget bot commented Oct 14, 2024

This patch series was integrated into seen via git@33f243d.

Copy link

gitgitgadget bot commented Oct 16, 2024

This patch series was integrated into seen via git@1a84029.

Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <[email protected]>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <[email protected]>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                      HEAD~1    HEAD
---------------------------------------------------------------
5313.2: thin pack                           0.01    0.01  +0.0%
5313.4: thin pack with --path-walk          0.01    0.01  +0.0%
5313.6: big pack                            2.54    2.60  +2.4%
5313.8: big pack with --path-walk           4.70    3.09 -34.3%
5313.10: repack                            28.75   28.55  -0.7%
5313.12: repack with --path-walk          108.55   46.14 -57.5%

On the microsoft/fluentui repo, where the --path-walk feature has been
shown to be more effective in space savings, we get these results:

Test                                     HEAD~1    HEAD
----------------------------------------------------------------
5313.2: thin pack                         0.39     0.40  +2.6%
5313.4: thin pack with --path-walk        0.08     0.07 -12.5%
5313.6: big pack                          4.15     4.15  +0.0%
5313.8: big pack with --path-walk         6.41     3.21 -49.9%
5313.10: repack                          90.69    90.83  +0.2%
5313.12: repack with --path-walk        108.23    49.09 -54.6%

Signed-off-by: Derrick Stolee <[email protected]>
Copy link

gitgitgadget bot commented Oct 18, 2024

This patch series was integrated into seen via git@f497f37.

Copy link

gitgitgadget bot commented Oct 18, 2024

This patch series was integrated into seen via git@5f6571b.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant