filter-repo: improve support for partial history rewrites

Partial history rewrites were possible before with the (previously
hidden) --refs flag, but the defaults were wrong.  That could be worked
around with the --source or --target flags, but that disabled --no-data
for fast-export and thus slowed things down, and also would require
overridding --replace-refs.  And the defaults for --source and --target
may diverge further from what is wanted/needed for partial history
rewrites in the future.

So, add --partial as a first-class supported option with scary
documentation about how it permits mixing new and old history.  Make
--refs imply that flag.  Make the behavioral similarities (in regards to
which steps are skipped) between --source, --target, and --partial more
clear.  Add relevant documentation to round it out.

Signed-off-by: Elijah Newren <newren@gmail.com>
This commit is contained in:
Elijah Newren 2019-10-16 17:46:02 -07:00
parent 509a624b6a
commit 320c85f941
5 changed files with 98 additions and 41 deletions

View File

@ -256,9 +256,10 @@ Generic callback code snippets
Location to filter from/to
~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTE: Specifying alternate source or target locations will disable some
auxiliary steps such as disconnecting the origin remote, and avoiding
mixing new and old history.
NOTE: Specifying alternate source or target locations implies --partial
except that the normal default for --replace-refs is used. However, unlike
normal uses of --partial, this doesn't risk mixing old and new history
since the old and new histories are in different repositories.
--source <source>::
Git repository to read from
@ -278,6 +279,25 @@ Miscellaneous options
Rewrite history even if the current repo does not look like a fresh
clone.
--partial:
Do a partial history rewrite, resulting in the mixture of old and
new history. This implies a default of update-no-add for
--replace-refs, disables rewriting refs/remotes/origin/* to
refs/heads/*, disables removing of the 'origin' remote, disables
removing unexported refs, disables expiring the reflog, and
disables the automatic post-filter gc. Also, this modifies
--tag-rename and --refname-callback options such that instead of
replacing old refs with new refnames, it will instead create new
refs and keep the old ones around. Use with caution.
--refs <refs+>::
Limit history rewriting to the specified refs. Implies --partial.
In addition to the normal caveats of --partial (mixing old and new
history, no automatic remapping of refs/remotes/origin/* to
refs/heads/*, etc.), this also may cause problems for pruning of
degenerate empty merge commits when negative revisions are
specified.
--dry-run::
Do not change the repository. Run `git fast-export` and filter its
output, and save both the original and the filtered version for
@ -699,6 +719,23 @@ The reason to specify --force is two-fold: filter-repo will error out
if no arguments are specified, and the new graft commit would
otherwise trigger the not-a-fresh-clone check.
Partial history rewrites
~~~~~~~~~~~~~~~~~~~~~~~~
To rewrite the history on just one branch (which may cause it to no longer
share any common history with other branches), use `--refs`. For example,
to remove a file named 'extraneous.txt' from the 'master' branch:
--------------------------------------------------
git filter-repo --invert-paths --path extraneous.txt --refs master
--------------------------------------------------
To rewrite just some recent commits:
--------------------------------------------------
git filter-repo --invert-paths --path extraneous.txt --refs master~3..master
--------------------------------------------------
[[CALLBACKS]]
CALLBACKS
---------
@ -946,8 +983,11 @@ Some notes or exceptions on each of the above:
are that they've only rewritten trees and commits and maybe a few
blobs, so `--aggressive` isn't needed and would be too slow.)
Information about these steps is printed out when `--debug` is passed to
filter-repo.
Information about these steps is printed out when `--debug` is passed
to filter-repo. When doing a `--partial` history rewrite, steps 2, 3,
7, and 8 are unconditionally skipped, step 5 is skipped if
`--replace-refs` is `update-no-add`, and just the nuke-unused-refs
portion of step 5 is skipped if `--replace-refs` is something else.
Limitations
~~~~~~~~~~~
@ -1041,18 +1081,16 @@ Issues specific to filter-repo
such as `-M` or `-C` would break assumptions used in other places of
filter-repo.
* Partial-repo filtering does not mesh well with filter-repo's "avoid
mixing old and new history" design. filter-repo has some capability
in this area but it is intentionally underdocumented and mostly left
for use by external scripts which import filter-repo as a module
(some examples in contrib/filter-repo-demos/ do use this). The only
real usecases I've seen for partial repo filtering, though, are
sidestepping filter-branch's insanely slow execution on commits that
would not be changed by the filters in question anyway (which is
largely irrelevant since filter-repo is multiple orders of magnitude
faster), or to do operations better suited to linkgit:git-rebase[1]
and which rebase grew special options for years ago (e.g. the
`--signoff` option).
* Partial-repo filtering, while supported, runs counter to filter-repo's
"avoid mixing old and new history" design. This support has required
improvements to core git as well (e.g. it depends upon the
`--reference-excluded-parents` option to fast-export that was added
specifically for this usage within filter-repo). The `--partial` and
`--refs` options will continue to be supported since there are people
with usecases for them; however, I am concerned that this inconsistency
about mixing old and new history seems likely to lead to user mistakes.
For now, I just hope that long explanations of caveats in the
documentation of these options suffice to curtail any such problems.
Comments on reversibility
^^^^^^^^^^^^^^^^^^^^^^^^^

View File

@ -400,7 +400,7 @@ class BFG_ish:
stdin = subprocess.PIPE,
stdout = subprocess.PIPE)
self.args = bfg_args
# Setting source and target to anything prevents:
# Setting partial prevents:
# * remapping origin remote tracking branches to regular branches
# * deletion of the origin remote
# * nuking unused refs
@ -411,9 +411,8 @@ class BFG_ish:
# The third is irrelevant since BFG has no mechanism for renaming refs,
# and we'll manually add the fourth and fifth back in below by calling
# RepoFilter.cleanup().
fr_args = fr.FilteringOptions.parse_args(['--source', '.',
'--target', '.',
'--force'] + extra_args)
fr_args = fr.FilteringOptions.parse_args(['--partial', '--force'] +
extra_args)
self.filter = fr.RepoFilter(fr_args, commit_callback=self.commit_update)
self.filter.run()
if new_replace_file:

View File

@ -585,9 +585,7 @@ class UserInterfaceNightmare:
self.args.prune_empty = True
fr_args = fr.FilteringOptions.parse_args(['--preserve-commit-hashes',
'--preserve-commit-encoding',
'--replace-refs', 'update-no-add',
'--source', '.',
'--target', '.',
'--partial',
'--force'] + extra_args)
fr_args.prune_empty = 'always' if self.args.prune_empty else 'never'
fr_args.refs = self.get_extended_refs()

View File

@ -58,10 +58,7 @@ def add_signed_off_by_trailer(commit, metadata):
# * nuking reflogs
# * repacking
# so we cheat and set source and target both to '.'
args = fr.FilteringOptions.parse_args(['--source', '.',
'--target', '.',
'--force',
'--replace-refs', 'update-no-add',
args = fr.FilteringOptions.parse_args(['--force',
'--refs'] + myargs.rev_list_args)
args.refs = myargs.rev_list_args
filter = fr.RepoFilter(args, commit_callback=add_signed_off_by_trailer)

View File

@ -1673,10 +1673,6 @@ EXAMPLES
"useful in determining what to filter in a subsequent run. "
"Will not modify your repo."))
refs = parser.add_argument_group(title=_("Git References"))
refs.add_argument('--refs', nargs='*', default=['--all'],
help=argparse.SUPPRESS)
path = parser.add_argument_group(title=_("Filtering based on paths "
"(see also --filename-callback)"))
path.add_argument('--invert-paths', action='store_false', dest='inclusive',
@ -1846,9 +1842,10 @@ EXAMPLES
"CALLBACKS section below."))
desc = _(
"Specifying alternate source or target locations will disable some \n"
"auxiliary steps such as disconnecting the origin remote, and avoiding\n"
"mixing new and old history.")
"Specifying alternate source or target locations implies --partial,\n"
"except that the normal default for --replace-refs is used. However,\n"
"unlike normal uses of --partial, this doesn't risk mixing old and new\n"
"history since the old and new histories are in different repositories.")
location = parser.add_argument_group(title=_("Location to filter from/to"),
description=desc)
location.add_argument('--source', type=os.fsencode,
@ -1862,6 +1859,29 @@ EXAMPLES
misc.add_argument('--force', '-f', action='store_true',
help=_("Rewrite history even if the current repo does not look "
"like a fresh clone."))
misc.add_argument('--partial', action='store_true',
help=_("Do a partial history rewrite, resulting in the mixture of "
"old and new history. This implies a default of "
"update-no-add for --replace-refs, disables rewriting "
"refs/remotes/origin/* to refs/heads/*, disables removing "
"of the 'origin' remote, disables removing unexported refs, "
"disables expiring the reflog, and disables the automatic "
"post-filter gc. Also, this modifies --tag-rename and "
"--refname-callback options such that instead of replacing "
"old refs with new refnames, it will instead create new "
"refs and keep the old ones around. Use with caution."))
# WARNING: --refs presents a problem with become-degenerate pruning:
# * Excluding a commit also excludes its ancestors so when some other
# commit has an excluded ancestor as a parent we have no way of
# knowing what it is an ancestor of without doing a special
# full-graph walk.
misc.add_argument('--refs', nargs='+',
help=_("Limit history rewriting to the specified refs. Implies "
"--partial. In addition to the normal caveats of --partial "
"(mixing old and new history, no automatic remapping of "
"refs/remotes/origin/* to refs/heads/*, etc.), this also may "
"cause problems for pruning of degenerate empty merge "
"commits when negative revisions are specified."))
misc.add_argument('--dry-run', action='store_true',
help=_("Do not change the repository. Run `git fast-export` and "
@ -2065,6 +2085,12 @@ EXAMPLES
args.strip_blobs_with_ids = set(f.read().split())
else:
args.strip_blobs_with_ids = set()
if (args.partial or args.refs) and not args.replace_refs:
args.replace_refs = 'update-no-add'
if args.refs or args.source or args.target:
args.partial = True
if not args.refs:
args.refs = ['--all']
return args
class RepoAnalyze(object):
@ -3475,8 +3501,6 @@ class RepoFilter(object):
.format(decode(self._fe_filt)))
def _migrate_origin_to_heads(self):
if self._args.dry_run or self._args.source or self._args.target:
return
refs_to_migrate = set(x for x in self._orig_refs
if x.startswith(b'refs/remotes/origin/'))
if not refs_to_migrate:
@ -3532,7 +3556,7 @@ class RepoFilter(object):
# Remove unused refs
exported_refs, imported_refs = self.get_exported_and_imported_refs()
refs_to_nuke = exported_refs - imported_refs
if self._args.source or self._args.target:
if self._args.partial:
refs_to_nuke = set()
if refs_to_nuke and self._args.debug:
print("[DEBUG] Deleting the following refs:\n "+
@ -3690,7 +3714,8 @@ class RepoFilter(object):
start = time.time()
if not self._input and not self._output:
self._run_sanity_checks()
self._migrate_origin_to_heads()
if not self._args.dry_run and not self._args.partial:
self._migrate_origin_to_heads()
self._setup_input(use_done_feature = True)
self._setup_output()
assert self._sanity_checks_handled
@ -3725,7 +3750,7 @@ class RepoFilter(object):
self._save_marks_files()
# Notify user how long it took, before doing a gc and such
repack = (not self._args.source and not self._args.target)
repack = (not self._args.partial)
msg = "New history written in {:.2f} seconds..."
if repack:
msg = "New history written in {:.2f} seconds; now repacking/cleaning..."
@ -3749,7 +3774,7 @@ class RepoFilter(object):
# Write out data about run
self._record_metadata(self.results_tmp_dir(), self._orig_refs)
# Nuke the reflogs and repack
# If repack, then nuke the reflogs and repack. If reset, do a reset --hard
reset = not GitUtils.is_repository_bare(target_working_dir)
RepoFilter.cleanup(target_working_dir, repack, reset,
run_quietly=self._args.quiet,