filter-repo: update README.md

Signed-off-by: Elijah Newren <newren@gmail.com>
pull/13/head
Elijah Newren 5 years ago
parent 73e91edecc
commit 911f234e3d

@@ -1,11 +1,12 @@
git filter-repo is a tool for rewriting history, which includes some
capabilities I have not found anywhere else. It is most similar to
[git filter-branch](https://git-scm.com/docs/git-filter-branch),
though it fixes what I perceive to be some glaring deficiencies in
that tool and brings a much different taste in usability. Also, being
based on fast-export/fast-import, it is orders of magnitude faster (it
has speed roughly comparable to BFG repo cleaner, but isn't
multi-threaded).
git filter-repo is a tool for rewriting history, which includes [some
capabilities I have not found anywhere
else](#design-rationale-behind-filter-repo-why-create-a-new-tool). It is
most similar to [git
filter-branch](https://git-scm.com/docs/git-filter-branch), though it fixes
what I perceive to be some glaring deficiencies in that tool and brings a
much different taste in usability. Also, being based on
fast-export/fast-import, it is [orders of magnitude
faster](https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/).
filter-repo is a single-file python script, depending only on the
python standard library (and execution of git commands), all of which
@@ -14,20 +15,113 @@ $PATH.
# Table of Contents
* Background
* [Why create another repo filtering tool?](#why-git-filter-repo)
* [Warnings: Not yet ready for external usage](
#warnings-not-yet-ready-for-external-usage)
* [Why not $FAVORITE_COMPETITOR](#why-not-favorite_competitor)
* [Why filter-repo instead of filter-branch?](#why-filter-repo-instead-of-filter-branch)
* [Example usage, comparing to filter-branch](#example-usage-comparing-to-filter-branch)
* [Design rationale behind filter-repo](#design-rationale-behind-filter-repo-why-create-a-new-tool)
* [Usage](#usage)
# Background
## Why git-filter-repo?
None of the [existing repository filtering
tools](#why-not-favorite_competitor) do what I want. They're all good
in their own way, but come up short for my needs. In no particular order:
## Why filter-repo instead of filter-branch?
filter-branch has a number of problems:
* filter-branch ranges from extremely slow to unusably slow (multiple
orders of magnitude slower than it should be) for non-trivial repositories.
* filter-branch made a number of usability choices that are okay for
small repos, but these choices sometimes conflict as more options
are combined, and the overall usability often causes difficulties
for users trying to work with intermediate or larger repos.
* filter-branch is missing some basic features.
The first two are intrinsic to filter-branch's design at this point
and cannot be backward-compatibly fixed.
## Example usage, comparing to filter-branch
Let's say that we want to extract a piece of a repository, with the intent
of merging just that piece into some other bigger repo. We also want to know
how much smaller this extracted repo is without the binary-blobs/ directory
in it. For extraction, we want to:
* extract the history of a single directory, src/. This means that only
paths under src/ remain in the repo, and any commits that only touched
paths outside this directory will be removed.
* rename all files to have a new leading directory, my-module/ (e.g. so that
src/foo.c becomes my-module/src/foo.c)
* rename any tags in the extracted repository to have a 'my-module-'
prefix (to avoid any conflicts when we later merge this repo into
something else)
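To make the two renames concrete, here is a minimal plain-shell sketch of the intended mapping. The helper names `new_path` and `new_tag` are hypothetical and exist only for this illustration; they are not part of filter-repo or git, and the sketch does not touch any repository.

```shell
# Hypothetical helpers, for illustration only; not part of filter-repo.

# new_path: print the path a file should have after the rewrite,
# or print nothing if the file should be dropped from history.
new_path() {
    case "$1" in
        src/*) printf '%s\n' "my-module/$1" ;;  # kept, moved under my-module/
        *)     : ;;                             # outside src/: removed
    esac
}

# new_tag: print the tag name after adding the 'my-module-' prefix.
new_tag() {
    printf '%s\n' "my-module-$1"
}

new_path src/foo.c           # prints my-module/src/foo.c
new_path binary-blobs/a.bin  # prints nothing
new_tag v1.0                 # prints my-module-v1.0
```

The file and tag names fed to the helpers are made up; any path under src/ or any tag name behaves the same way.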
Doing this with filter-repo is as simple as the following command:
```shell
git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'
```
(the single quotes are unnecessary, but make it clearer to a human that we
are replacing the empty string as a prefix with `my-module-`)
By contrast, filter-branch comes with a pile of caveats (more on that
below) even once you figure out the necessary invocation(s):
```shell
git filter-branch --tree-filter 'mkdir -p my-module && git ls-files | grep -v ^src/ | xargs git rm -f -q && ls -d * | grep -v my-module | xargs -I files mv files my-module/' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all
git clone file://$(pwd) newcopy
cd newcopy
git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin
git gc --prune=now
```
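The tag cleanup step in the commands above can be previewed without touching a repository. The sketch below feeds made-up tag names (an assumption, standing in for real `git for-each-ref` output) through the same `grep -v` filter, showing which `delete` lines would reach `git update-ref --stdin`.

```shell
# Sample input standing in for the output of
#   git for-each-ref --format="delete %(refname)" refs/tags/
# The tag names are invented for illustration.
sample_refs() {
    printf '%s\n' \
        'delete refs/tags/v1.0' \
        'delete refs/tags/my-module-v1.0' \
        'delete refs/tags/v2.0' \
        'delete refs/tags/my-module-v2.0'
}

# Same filter as in the commands above: drop the lines for the renamed
# tags, leaving only the deletions for the original (unprefixed) tags.
sample_refs | grep -v refs/tags/my-module-
```

Only the lines for `refs/tags/v1.0` and `refs/tags/v2.0` survive the filter, so only the old, unprefixed tags would be deleted.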
Some might notice that the above filter-branch invocation will be really
slow due to using --tree-filter; you could alternatively use the
--index-filter option of filter-branch, changing the above commands to:
```shell
git filter-branch --index-filter 'git ls-files | grep -v ^src/ | xargs git rm -q --cached; git ls-files -s | sed "s%$(printf \\t)%&my-module/%" | git update-index --index-info; git ls-files | grep -v ^my-module/ | xargs git rm -q --cached' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all
git clone file://$(pwd) newcopy
cd newcopy
git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin
git gc --prune=now
```
However, for either filter-branch command there is a pile of caveats.
First, some may be wondering why I list five commands here for
filter-branch. Despite the use of --all and --tag-name-filter, and
filter-branch's manpage claiming that a clone is enough to get rid of
old objects, the extra steps to delete the other tags and do another
gc are still required to clean out the old objects and avoid mixing
new and old history before pushing somewhere. Other caveats:
* Commit messages are not rewritten; so if some of your commit
messages refer to prior commits by (abbreviated) sha1, after the
rewrite those messages will now refer to commits that are no longer
part of the history. It would be better to rewrite those
(abbreviated) sha1 references to refer to the new commit ids.
* The --prune-empty flag sometimes misses commits that should be
pruned, and it will also prune commits that *started* empty rather
than just ended empty due to filtering. For repositories that
intentionally use empty commits for versioning- and publishing-related
purposes, this can be detrimental.
* The commands above are OS-specific. GNU vs. BSD issues for sed,
xargs, and other commands often trip up users; I think I failed to
get most folks to use --index-filter since the only example in the
filter-branch manpage that both uses it and shows how to move
everything into a subdirectory is linux-specific, and it is not
obvious to the reader that it has a portability issue since it
silently misbehaves rather than failing loudly.
* The --index-filter version of the filter-branch command may be two to
three times faster than the --tree-filter version, but both
filter-branch commands are going to be multiple orders of magnitude
slower than filter-repo.
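As an illustration of the portability caveat above, the following sketch shows one way to insert a directory prefix after the tab in a `git ls-files -s` style line without relying on a `\t` escape inside the sed pattern, which some sed implementations do not treat as a tab. The index entry is made up for the example (the object id is a placeholder), and `%` is used as the sed delimiter because the replacement text contains `-`.

```shell
# A made-up line in the format printed by `git ls-files -s`:
#   <mode> <object> <stage><TAB><path>
tab=$(printf '\t')
entry="100644 0123456789abcdef0123456789abcdef01234567 0${tab}src/foo.c"

# Insert my-module/ right after the tab. The literal tab comes from
# printf rather than from a \t escape in the sed expression, and & in
# the replacement stands for the matched text (the tab itself).
printf '%s\n' "$entry" | sed "s%${tab}%&my-module/%"
```

The path field of the printed line becomes `my-module/src/foo.c`, while the mode, object id, and stage fields are left untouched.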
## Design rationale behind filter-repo (why create a new tool?)
None of the existing repository filtering tools do what I want. They're
all good in their own way, but come up short for my needs. No tool
provided any of the first seven traits below I wanted, and all failed to
provide at least one of the last three traits as well:
1. [Starting report] Provide user an analysis of their repo to help
them get started on what to prune or rename, instead of expecting
@@ -81,6 +175,17 @@ in their own way, but come up short for my needs. In no particular order:
when using the --tag-name-filter option, and sometimes also an
issue when only filtering a subset of branches.)
1. [Versatility] Provide the user the ability to extend the tool or
even write new tools that leverage existing capabilities, and
provide this extensibility in a way that (a) avoids the need to
fork separate processes (which would destroy performance), (b)
avoids making the user specify OS-dependent shell commands (which
would prevent users from sharing commands with each other), (c)
takes advantage of rich data structures (because hashes, dicts,
lists, and arrays are prohibitively difficult in shell) and (d)
provides reasonable string manipulation capabilities (which are
sorely lacking in shell).
1. [Commit message consistency] If commit messages refer to other
commits by ID (e.g. "this reverts commit 01234567890abcdef", "In
commit 0013deadbeef9a..."), those commit messages should be
@@ -116,109 +221,7 @@ in their own way, but come up short for my needs. In no particular order:
1. [Speed] Filtering should be reasonably fast
## Warnings: Not yet ready for external usage
This repository is still under heavy construction. Some caveats:
* It will not work without a specially compiled version of git:
* git clone --branch fast-export-import-improvements https://github.com/newren/git/
* Build according to normal git.git build instructions. You can find 'em.
* I have a list of known bugs, conveniently mostly tracked in my head.
I'll fix that, but the fact that you're reading this sentence means
I haven't yet.
* Actually, there's a couple exceptions to where bugs are tracked mentioned
above. In particular, the following bugs are tracked here:
* Multiple unimplemented placeholder option flags exist. Just because it
shows up in --help doesn't mean it does anything.
* Usage instructions and examples at the end of this document are rather
lacking.
* Random debugging code or extraneous files might be checked in at any
given time; I'll probably rewrite history to remove them...eventually.
* I reserve the right to:
* Rename the tool altogether (filter-repo to be like filter-branch?)
* Rename or redefine any command line options
* Rewrite the history of this repository at any time
* and possibly more...but do you really need any more reasons than
the above? This isn't ready for widespread use.
## Why not $FAVORITE_COMPETITOR?
Here are some of the prominent competitors I know of:
* git_fast_filter.py (Original link dead, use google if you care; this repo
is the successor, though.)
* [reposurgeon](http://www.catb.org/esr/reposurgeon/)
* [BFG repo cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
* [git filter-branch](https://mirrors.edge.kernel.org/pub/software/scm/git/docs/git-filter-branch.html)
Here's why I think these tools don't meet my needs:
* git_fast_filter.py:
* This was actually the basis for filter-repo, though it required lots of
additional work.
* Was meant as a library more than a tool, and had too high of an
activation energy.
* empty commit pruning was not as thorough as it should have been
* had no provision for commit message rewriting for commit message
consistency.
* missing lots of little conveniences
* reposurgeon
* focused on converting repositories between version control systems,
and handles all the crazy impedance mismatches inherent in such
conversions. I only care about rewriting history that starts in git
and ends in git. If you care about converting between version control
systems, though, reposurgeon is a much better tool.
* might be general enough to use for other uses, but can't find any
documentation or examples on anything other than huge repository
conversions between version control systems.
* way too much effort for many simple repository rewrites that many
users want to perform
* BFG repo cleaner
* Very focused on just removing crazy big files and sensitive data.
Probably the best tool if that's all you want. But lacks the ability
to handle anything outside this special (but important!) usecase.
* Has useful options for helping you remove the N biggest blobs, but
nothing to help you know how big N should be.
* Doesn't prune commits which become empty due to filtering; if you
just want to extract a directory added 3 months ago and its history,
you'd be stuck with years of commits touching other directories, all
empty.
* The refusal to rewrite HEAD, while it makes sense when trying to
remove a few crazy big files and sensitive data (users tend to
re-add and re-commit bad files if you didn't manually remove it
and have them update), is totally misaligned with more general
rewrite cases (e.g. the desire to turn a subdirectory into the
root of a repository, or move the root of the repository into a
subdirectory for merging into some other bigger repo.)
* Telling the user how to shrink the repo afterwards seems lame since
that was the whole point; just do it for them by default.
* git filter-branch
* Fundamental design flaw causing it to be orders of magnitude
slower than it should be for most repo rewriting jobs. So slow
that it becomes a major usability impediment, if not a deal
breaker. However, it is _extremely_ versatile.
* Generally quick for users to invoke (quick one-liners with lots
of examples), just missing some useful capabilities like
selecting wanted paths (as opposed to unwanted paths) and
providing easier path renaming (also, e.g. no
--to-subdirectory-filter as the opposite of
--subdirectory-filter)
* Doesn't rewrite commit hashes in commit messages, causing commit messages
to refer to phantom commits instead.
* Mixes old repository information (original tags, unrewritten branches)
with new, risking re-pushing the old stuff
* Lame defaults
* --prune-empty should be default (although only commits which become
empty, not ones which started empty)
* allows user to mess with repos which aren't a clean clone without an
override
* Makes it very difficult to actually get rid of unwanted objects and
shrink repository. Long multi-step instructions in manpage for this,
which are incomplete when --tag-name-filter is in use.
# Usage
Run `git filter-repo --help` and figure it out from there. Good luck.
Run `git filter-repo -h`; more detailed docs will be added soon...
