Add README.md explaining new repo-filter tool

Elijah Newren 6 years ago
parent ad59fffed0
commit 0b187cf667

README

@@ -1,151 +0,0 @@
git_fast_filter.py is designed to make it easy to rewrite the history of a
git repository. As such it plays a similar role to git-filter-branch, and
was created primarily to overcome the (sometimes severe) speed shortcomings
of git-filter-branch. The idea of git_fast_filter.py is to serve as a
small library which makes it easy to write python scripts that filter the
output of git-fast-export. Thus, the calling convention is typically of
the form:
git fast-export | filter_script.py | git fast-import
Though to be more precise, one would probably run this as
$ mkdir target && cd target && git init
$ (cd /PATH/LEADING/TO/source && git fast-export --branches --tags) \
| /PATH/TO/filter_script.py | git fast-import
Example filter scripts can be found in the testcases subdirectory,
with a brief README file explaining calling syntax for the scripts.
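
To give a flavor of the library, here is a minimal sketch of what such a
filter script might look like; the class, callback, and attribute names
below are assumptions written from memory rather than checked against
git_fast_filter.py itself:

#!/usr/bin/env python
# Hypothetical filter_script.py: drop everything under docs/ from history.
# The names FastExportFilter, commit_callback, file_changes, and filename
# are assumptions about the git_fast_filter API, not verified here.
from git_fast_filter import FastExportFilter

def drop_docs(commit):
    # Keep only the file changes whose path lies outside docs/.
    commit.file_changes = [change for change in commit.file_changes
                           if not change.filename.startswith('docs/')]

my_filter = FastExportFilter(commit_callback=drop_docs)
my_filter.run()  # reads a fast-export stream on stdin, writes one to stdout
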
===== Abilities =====
git_fast_filter.py can be used to
* modify repositories in ways similar to git-filter-branch
* facilitate the creation of fast-export-like output on some set of data
* splice together independent repositories (interleaving commits from
separate repositories)
It has been used to modify file contents; filter out files based on content
or on name; drop, split, and insert commits; edit author and committer
information; clean up commit log messages (storing the excess information
in git-notes format); modify branch names; drop and insert blobs (i.e.
files) and/or commits; splice together independent repositories
(interleaving their commits); and perhaps make other small changes I'm
forgetting at the moment.
There is also a filtered_sparse_shallow_clone.py library that can be used
to create scripts for creating a filtered sparse or shallow "clone" of a
repository, and for bidirectional collaboration between the filtered and
unfiltered repositories.
===== Caveats =====
I think git_fast_filter.py works pretty well, but there are some potential
gotchas if you're not using recent enough versions of git or try to do
something unusual...
You need to be using git>=1.6.3 (technically, git >= v1.6.2.1-353-gebeec7d)
in order for filtering on a subset of history not including a root commit
to work correctly. (In other words, if you're passing something like
master~5..master to git-fast-export, you need a recent version of git. If
you just pass master or --all, then old versions of git will suffice.)
You either need to use git>=1.6.2 or pass the --topo-order flag to
git-fast-export in order to avoid merge commits being squashed.
git_fast_filter passes this flag to git-fast-export if you have it invoke
git-fast-export for you.
Since git-fast-export and git_fast_filter.py both work by assigning integer
identifiers to every blob & commit (typically in the range 1..n), this
presents a uniqueness challenge when interleaving commits from separate
repositories, inserting commits, or using the --import-marks flag. In
particular, doing any of these things without letting git_fast_filter.py
know about it is a recipe for trouble. When interleaving commits, make use
of the fast_export_output() function instead of piping git fast-export
output to the script. When using the --import-marks flag to
git-fast-export, again do so via the fast_export_output() function so that
git_fast_filter.py can be aware of the range of ids to avoid.
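
As a rough sketch (the signature of fast_export_output() and the run()
arguments here are assumptions, not taken from the library), having
git_fast_filter launch git-fast-export for you might look something like:

from git_fast_filter import FastExportFilter, fast_export_output

# Hypothetical sketch: let git_fast_filter launch git-fast-export itself so
# that it sees the extra arguments (e.g. --import-marks) and can avoid the
# corresponding range of ids.  fast_export_output() is assumed to return a
# subprocess-like object whose stdout carries the fast-export stream.
source = fast_export_output('/PATH/LEADING/TO/source',
                            ['--branches', '--tags',
                             '--import-marks=/PATH/TO/marks'])
my_filter = FastExportFilter()      # identity filter, just for illustration
my_filter.run(input=source.stdout)  # filtered stream still goes to stdout
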
While git_fast_filter has some logic to keep identifiers unique when
inserting commits, using --import-marks, or splicing together commits from
separate repositories (which it does by remapping identifiers as
necessary), it may not handle all corner cases. Its identifier remapping has
been tested on special cases individually, but it has not been tested on
all combinations of special cases. In particular, I do not know if it will
handle the combination of --import-marks being passed to multiple
fast_export_output() streams and trying to combine all these streams into a
single repository. (Incidentally, I can't think of a use case for doing
that either.)
Inserting manually created commits (or interleaving commits between
repositories) provides an interesting challenge for git_fast_filter.
First, if you are inserting changes to files and expecting them to
propagate, you will be disappointed; each commit records the exact version
of every file that differs from its first parent. Thus, if you want
inserted file changes to propagate, you either have to rewrite the affected
files in all subsequent commits or use a different tool such as git rebase.
Second, if the commits you insert end up on a merged branch (that is, the
inserted commit is reachable through the second or later parent of some
commit) then any new files you inserted would normally be dropped by
git-fast-import. The reason is that git-fast-import expects each commit
to list the files which differ from its first parent. Files that exist
only on the branches corresponding to parents after the first must
therefore be repeated in the merge commit, even if these files are not
being changed in the merge commit itself. git_fast_filter.py has some
ugly hacks to make this happen behind the scenes for you, but it only works
when the inserted commits contain new, unique files that are not also
created or modified on other branches. If you do something clever or more
complicated than this that defeats my simple hack, we may need to modify
git-fast-import (and perhaps git-fast-export) to have them allow the
following behavior via some flag: diff relative to all parents and only
require merge commits to list files that conflict among the different
parents (or that were otherwise changed in the merge commit).
===== Comparing/contrasting to git-filter-branch =====
* Similar Basics: The basic abilities and warnings in the first three
paragraphs of the git-filter-branch manpage are equally applicable to
git_fast_filter.py, except that rev-list options are passed to
git-fast-export (which, as noted above, is typically executed separately
in addition to the filter script). In other words, the tools are very
similar in purpose.
* Speed of Execution: By virtue of using fast-export and fast-import,
git_fast_filter avoids lots of forks (typically thousands or millions of
them) and bypasses the need to rewrite the same file 50,000 times.
(Also, git_fast_filter does not use a temporary directory of any sort,
and moving repositories to tmpfs to accelerate I/O would not
significantly speed up the operation.)
* Speed of Development: Since usage of git_fast_filter involves writing a
separate python script and typically invoking two extra programs, it
takes longer to invoke than typing git-filter-branch one-liners. (One
can have the python script invoke fast-export and fast-import rather than
doing it on the command line and using pipes, if one wants to. It's
still a little bit of extra typing, though.) Speed of "development" is
probably more important than speed of execution for many small
repositories or simple rewrites, so git-filter-branch will likely
remain the tool of choice in many cases.
* Location of rewritten History: git-filter-branch always puts the
rewritten history back into the same repository that holds the original
history. That confuses a lot of people; while the same can be done
with git_fast_filter, examples are geared at writing the new history
into a different repository.
* Rewriting a subset of history (potential gotcha): When git-fast-export
operates on a subset of history that does not include a root commit, it
truncates history before the first exported commits. This makes sense
since the destination repository may not have the unexported commits
already. (Note that one can pass the --import-marks flag to
git-fast-export to notify it that the destination repository
does indeed have the needed commits, i.e. that an 'incremental' export is
being done and thus that history should not be truncated.) WHY THIS
MATTERS: git-filter-branch will not truncate history when dealing with a
subset of history, since it is writing the modified history back to the
source repository where it is known that the non-rewritten commits are
available. If someone tries to duplicate such behavior with
git_fast_filter, they may be surprised unless they pass the --import-marks
flag to git-fast-export.

README.md
@@ -0,0 +1,191 @@
git repo-filter is intended to be a tool similar to [git
filter-branch](https://git-scm.com/docs/git-filter-branch) for
rewriting repository history. While filter-branch is relatively quick
to learn and invoke and is relatively versatile, it has a few glaring
deficiencies. repo-filter tries to copy filter-branch's good
qualities, while bringing a significant performance boost and a
different taste in usability.
# Table of Contents
* Background
* [Why create another repo filtering tool?](#why-git-repo-filter)
* [Warnings: Not yet ready for external usage](
#warnings-not-yet-ready-for-external-usage)
* [Why not $FAVORITE_COMPETITOR](#why-not-favorite_competitor)
* [Usage](#usage)
# Background
## Why git-repo-filter?
None of the [existing repository filtering
tools](#why-not-favorite_competitor) do what I want. They're all good
in their own way, but come up short for my needs. In no particular order:
1. [Starting report] Provide users with an analysis of their repo to help
them get started on what to prune or rename, instead of expecting
them to guess or find other tools to figure it out. (Triggered, e.g.,
by running the tool a first time with a special flag such as --analyze.)
1. [Keep vs. remove] Instead of just providing a way for users to
easily remove selected paths, also provide flags for users to
only *keep* certain paths. Sure, users could work around this by
specifying to remove all paths other than the ones they want to
keep, but the need to specify all paths that *ever* existed in
**any** version of the repository could sometimes be quite
painful. For filter-branch, using pipelines like `git ls-files |
grep -v ... | xargs -r git rm` might be a reasonable workaround
but can get unwieldy and isn't as straightforward for users.
1. [Renaming] It should be easy to rename paths. For example, in
addition to allowing one to treat some subdirectory as the root
of the repository, also provide options for users to make the
root of the repository just become a subdirectory. And more
generally allow files and directories to be easily renamed.
Provide sanity checks if renaming causes multiple files to exist
at the same path. (And add special handling so that if a commit
merely renamed oldname->newname, then filtering oldname->newname
doesn't trigger the sanity check and die on that commit.)
1. [More intelligent safety] Writing copies of the original refs to
a special namespace within the repo does not provide a
user-friendly recovery mechanism. Many would struggle to recover
using that. Almost everyone I've ever seen do a repository
filtering operation has done so with a fresh clone, because
wiping out the clone in case of error is a vastly easier recovery
mechanism. Strongly encourage that workflow by detecting and
bailing if we're not in a fresh clone, unless the user overrides
with --force. (Allow the old filter-branch workflow if a special
--store-backup flag is provided.)
1. [Auto shrink] Automatically remove old cruft and repack the
repository for the user after filtering (unless overridden)
1. [Clean separation] Avoid confusing users (and prevent accidental
re-pushing of old stuff) due to mixing old repo and rewritten
repo together. (This is particularly a problem with filter-branch
when using the --tag-name-filter option, and sometimes also an
issue when only filtering a subset of branches.)
1. [Commit message consistency] If commit messages refer to other
commits by ID (e.g. "this reverts commit 01234567890abcdef", "In
commit 0013deadbeef9a..."), those commit messages should be
rewritten to refer to the new commit IDs (see the sketch after this
list).
1. [Empty pruning] Commits which become empty due to filtering
should be pruned. That includes merge commits which become empty
(e.g. when grabbing the history of a single directory that hasn't
always existed within the repo; I don't want thousands of
unrelated commits that pre-dated the introduction of that
directory). However, I do not want commits which were empty in
the original repository to be pruned.
1. [Speed] Filtering should be reasonably fast
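
To make the commit message consistency item above concrete, here is a
small illustrative sketch (not repo-filter's actual implementation) of the
idea: given a map from old commit IDs to rewritten ones, scan each message
for things that look like commit hashes and substitute the new ID while
preserving the original abbreviation length.

```python
import re

# Illustrative only: old full commit IDs mapped to their rewritten IDs,
# as would be accumulated while filtering.
old_to_new = {
    '0123456789abcdef0123456789abcdef01234567':
    '89abcdef0123456789abcdef0123456789abcdef',
}

HASH_RE = re.compile(r'\b[0-9a-f]{7,40}\b')

def rewrite_hashes(message):
    def replace(match):
        old = match.group(0)
        # Find the unique old commit ID starting with this abbreviation.
        candidates = [full for full in old_to_new if full.startswith(old)]
        if len(candidates) != 1:
            return old  # unknown or ambiguous reference; leave it alone
        new = old_to_new[candidates[0]]
        return new[:len(old)]  # keep the original abbreviation length
    return HASH_RE.sub(replace, message)

print(rewrite_hashes('This reverts commit 0123456789abcdef.'))
# -> This reverts commit 89abcdef01234567.
```

A real tool has to be more careful than this (e.g. avoiding false matches
on other hex-looking words and handling references to commits that were
pruned), but this is the basic transformation.
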
## Warnings: Not yet ready for external usage
This repository is still under heavy construction. Some caveats:
* It will not work without a specially compiled version of git:
* git clone --branch fast-export-import-improvements https://github.com/newren/git/
* Build according to normal git.git build instructions. You can find 'em.
* I have a list of known bugs, conveniently mostly tracked in my head.
I'll fix that, but the fact that you're reading this sentence means
I haven't yet.
* Actually, there are a couple of exceptions to the bug-tracking location
mentioned above. In particular, the following bugs are tracked here:
* Multiple unimplemented placeholder option flags exist. Just because it
shows up in --help doesn't mean it does anything.
* Usage instructions and examples at the end of this document are rather
lacking.
* Random debugging code or extraneous files might be checked in at any
given time; I'll probably rewrite history to remove them...eventually.
* I reserve the right to:
* Rename the tool altogether (filter-repo to be like filter-branch?)
* Rename or redefine any command line options
* Rewrite the history of this repository at any time
* and possibly more...but do you really need any more reasons than
the above? This isn't ready for widespread use.
## Why not $FAVORITE_COMPETITOR?
Here are some of the prominent competitors I know of:
* git_fast_filter.py (Original link dead, use google if you care; this repo
is the successor, though.)
* [reposurgeon](http://www.catb.org/esr/reposurgeon/)
* [BFG repo cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
* [git filter-branch](https://mirrors.edge.kernel.org/pub/software/scm/git/docs/git-filter-branch.html)
Here's why I think these tools don't meet my needs:
* git_fast_filter.py:
* This was actually the basis for repo-filter, though it required lots of
additional work.
* Was meant as a library more than a tool, and had too high an
activation energy.
* Empty commit pruning was not as thorough as it should have been.
* Had no provision for rewriting commit messages to keep them consistent
with the new commit IDs.
* Missing lots of little conveniences.
* reposurgeon
* Focused on converting repositories between version control systems,
and handles all the crazy impedance mismatches inherent in such
conversions. I only care about rewriting history that starts in git
and ends in git. If you care about converting between version control
systems, though, reposurgeon is a much better tool.
* Might be general enough for other uses, but I can't find any
documentation or examples of anything other than huge repository
conversions between version control systems.
* Requires far too much effort for the simple repository rewrites many
users want to perform.
* BFG repo cleaner
* Very focused on just removing crazy big files and sensitive data.
Probably the best tool if that's all you want. But lacks the ability
to handle anything outside this special (but important!) usecase.
* Has useful options for helping you remove the N biggest blobs, but
nothing to help you know how big N should be.
* Doesn't prune commits which become empty due to filtering; if you
just want to extract a directory added 3 months ago and its history,
you'd be stuck with years of commits touching other directories, all
empty.
* The refusal to rewrite HEAD, while it makes sense when trying to
remove a few crazy big files and sensitive data (users tend to
re-add and re-commit bad files if you didn't manually remove the files
and have the users update), is totally misaligned with more general
rewrite cases (e.g. the desire to turn a subdirectory into the
root of a repository, or move the root of the repository into a
subdirectory for merging into some other bigger repo.)
* Telling the user how to shrink the repo afterwards seems lame since
that was the whole point; just do it for them by default.
* git filter-branch
* Fundamental design flaw causing it to be orders of magnitude
slower than it should be for most repo rewriting jobs. So slow
that it becomes a major usability impediment, if not a deal
breaker. However, it is _extremely_ versatile.
* Generally quick for users to invoke (short one-liners with lots
of examples), but missing some useful capabilities like
selecting wanted paths (as opposed to unwanted paths) and
providing easier path renaming (also, e.g. no
--to-subdirectory-filter as the opposite of
--subdirectory-filter)
* Doesn't rewrite commit hashes in commit messages, causing commit messages
to refer to phantom commits instead.
* Mixes old repository information (original tags, unrewritten branches)
with new, risking re-pushing the old stuff
* Lame defaults
* --prune-empty should be default (although only commits which become
empty, not ones which started empty)
* allows users to mess with repos which aren't a clean clone without
requiring an override
* Makes it very difficult to actually get rid of unwanted objects and
shrink repository. Long multi-step instructions in manpage for this,
which are incomplete when --tag-name-filter is in use.
# Usage
Run `git repo-filter --help` and figure it out from there. Good luck.