Add README.md explaining new repo-filter tool
parent ad59fffed0
commit 0b187cf667
@@ -1,151 +0,0 @@
git_fast_filter.py is designed to make it easy to rewrite the history of a
git repository. As such it plays a similar role to git-filter-branch, and
was created primarily to overcome the (sometimes severe) speed shortcomings
of git-filter-branch. The idea of git_fast_filter.py is to serve as a
small library which makes it easy to write python scripts that filter the
output of git-fast-export. Thus, the calling convention is typically of
the form:

    git fast-export | filter_script.py | git fast-import

Though to be more precise, one would probably run this as

    $ mkdir target && cd target && git init
    $ (cd /PATH/LEADING/TO/source && git fast-export --branches --tags) \
        | /PATH/TO/filter_script.py | git fast-import

Example filter scripts can be found in the testcases subdirectory,
with a brief README file explaining calling syntax for the scripts.

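As a rough illustration of the calling convention above, here is a minimal
sketch of what a filter_script.py could look like if written from scratch,
without git_fast_filter.py (whose API is not shown in this README). It
rewrites one email address in author, committer, and tagger lines and copies
everything else through untouched. The name filter_stream and both addresses
are made up for the example, and the sketch assumes the counted "data <n>"
form that git fast-export emits.

```python
import sys

# Hypothetical addresses, used only for illustration.
OLD_IDENT = b"<old@example.com>"
NEW_IDENT = b"<new@example.com>"


def filter_stream(src, dst):
    """Copy a fast-export stream from src to dst (binary file objects)."""
    while True:
        line = src.readline()
        if not line:
            break
        if line.startswith(b"data "):
            # 'data <count>' is followed by exactly <count> raw bytes (blob
            # contents or a commit message); copy them without inspection.
            # Assumption: the delimited 'data <<EOF' form never appears.
            size = int(line[len(b"data "):])
            dst.write(line)
            dst.write(src.read(size))
        elif line.startswith((b"author ", b"committer ", b"tagger ")):
            dst.write(line.replace(OLD_IDENT, NEW_IDENT))
        else:
            dst.write(line)


def main():
    """Entry point when used as the middle stage of the pipeline."""
    filter_stream(sys.stdin.buffer, sys.stdout.buffer)
```

To use this as filter_script.py, call main() at the bottom of the file.
Because data blocks are copied by byte count, a commit message that happens
to start with "author " is left alone.
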
===== Abilities =====

git_fast_filter.py can be used to
  * modify repositories in ways similar to git-filter-branch
  * facilitate the creation of fast-export-like output on some set of data
  * splice together independent repositories (interleaving commits from
    separate repositories)

It has been used to modify file contents; filter out files based on content
and based on name; drop, split, and insert commits; edit author and
committer information; clean up commit log messages (and store excessive
information in git-note format); modify branch names; drop and insert blobs
(i.e. files) and/or commits; splice together independent repositories
(interleaving commits); and perhaps other small changes I'm forgetting at
the moment.

There is also a filtered_sparse_shallow_clone.py library that can be used
to create scripts for creating a filtered sparse or shallow "clone" of a
repository, and for bidirectional collaboration between the filtered and
unfiltered repositories.

===== Caveats =====

I think git_fast_filter.py works pretty well, but there are some potential
gotchas if you're not using recent enough versions of git or try to do
something unusual...

You need to be using git>=1.6.3 (technically, git >= v1.6.2.1-353-gebeec7d)
in order for filtering on a subset of history not including a root commit
to work correctly. (In other words, if you're passing something like
master~5..master to git-fast-export, you need a recent version of git. If
you just pass master or --all, then old versions of git will suffice.)

You either need to use git>=1.6.2 or pass the --topo-order flag to
git-fast-export in order to avoid merge commits being squashed.
git_fast_filter passes this flag to git-fast-export, if you have it call
git-fast-export for you.

Since git-fast-export and git_fast_filter.py both work by assigning integer
identifiers to every blob & commit (and typically in the range 1..n), it
presents a uniqueness challenge when interleaving commits from separate
repositories or inserting commits or using the --import-marks flag. In
particular, doing one of these things (interleaving commits from separate
repositories, inserting commits, or using the --import-marks flag) and not
letting git_fast_filter.py know about it is a recipe for trouble. When
interleaving commits, make use of the fast_export_output() function instead
of piping git fast-export output to the script. When using the
--import-marks flag to git-fast-export, again do so via the
fast_export_output() function so git_fast_filter.py can be aware of the
range of ids to avoid.

While git_fast_filter has some logic to keep identifiers unique when
inserting commits, using --import-marks, or splicing together commits from
separate repositories (which it does by remapping identifiers as
necessary), it may not handle corner cases. Its identifier remapping has
been tested on special cases individually, but it has not been tested on
all combinations of special cases. In particular, I do not know if it will
handle the combination of --import-marks being passed to multiple
fast_export_output() streams and trying to combine all these streams into a
single repository. (Incidentally, I can't think of a use case for doing
that either.)

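To make the uniqueness problem concrete: two independent fast-export streams
will both contain marks :1, :2, ... so one of them has to be renumbered
before the streams can be interleaved. Below is a minimal sketch of one way
to do that, by shifting every mark in a stream by a fixed offset. This is an
illustration only, not how git_fast_filter.py itself remaps identifiers;
shift_mark is a hypothetical helper, it ignores notemodify lines, and the
caller is assumed to skip the raw bytes of data blocks.

```python
import re

# Stream lines that start with a command followed by a ':<id>' mark.
_MARK_REF = re.compile(rb"^((?:mark|from|merge|M [0-7]+) ):(\d+)")


def shift_mark(line, offset):
    """Return `line` with its leading mark reference increased by `offset`.

    Lines without a mark reference (for example 'M 100644 <sha1> path')
    are returned unchanged.
    """
    def bump(match):
        return match.group(1) + b":%d" % (int(match.group(2)) + offset)
    return _MARK_REF.sub(bump, line, count=1)


# Hypothetical usage: the first stream used marks :1 through :250, so every
# mark in the second stream is moved out of that range.
offset = 250
print(shift_mark(b"mark :3\n", offset))              # b'mark :253\n'
print(shift_mark(b"M 100644 :7 src/a.c\n", offset))  # b'M 100644 :257 src/a.c\n'
```

The offset only has to be at least the highest mark used by the other
stream; anything larger works equally well.
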
Inserting manually created commits (or interleaving commits between
repositories) provides an interesting challenge for git_fast_filter.
First, if you are inserting changes to files and expecting them to
propagate, you will be disappointed; each commit specifies the exact
version of each file (which is different from its first parent) that it
will use. Thus, if you want to insert changes to files, you either have to
rewrite those files in all subsequent commits or use a different tool like
git rebase. Second, if the commits you insert end up on a merged branch
(that is, the inserted commit is reachable through the second or later
parent of some commit) then any new files you inserted would normally be
dropped by git-fast-import. The reason for this is that git-fast-import
expects each commit to provide the list of files which are different than
its first parent. Files must be repeated in the merge commit if they exist
only on the branches corresponding to parents after the first, even if
these files are not being changed in the merge commit. git_fast_filter.py
has some ugly hacks to make this happen behind the scenes for you, but it
only works when the inserted commits contain new, unique files that are not
also created or modified on other branches. If you do something clever or
more complicated than this that defeats my simple hack, we may need to
modify git-fast-import (and perhaps git-fast-export) to have them allow the
following behavior via some flag: diff relative to all parents and only
require merge commits to list files that conflict among the different
parents (or that were otherwise changed in the merge commit).

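The first-parent rule can be written as a small computation. The sketch
below assumes each tree is a plain dict mapping path to blob id (a made-up
representation for this example, not anything git_fast_filter.py exposes)
and returns what a merge commit has to list for git-fast-import.

```python
def changes_vs_first_parent(first_parent_tree, merge_tree):
    """Return the (op, path, blob) entries a merge commit must carry.

    Both arguments map path -> blob id. Only the first parent matters:
    a file that arrived via a later parent still has to be listed.
    """
    changes = []
    for path, blob in sorted(merge_tree.items()):
        if first_parent_tree.get(path) != blob:
            changes.append(("M", path, blob))
    for path in sorted(first_parent_tree.keys() - merge_tree.keys()):
        changes.append(("D", path, None))
    return changes


# new.txt was added on the branch being merged in (the second parent); the
# merge itself changes nothing, yet new.txt must still be listed.
first_parent = {"a.txt": "blob1"}
merged = {"a.txt": "blob1", "new.txt": "blob9"}
print(changes_vs_first_parent(first_parent, merged))
```
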
===== Comparing/contrasting to git-filter-branch =====

  * Similar Basics: The basic abilities and warnings in the first three
    paragraphs of the git-filter-branch manpage are equally applicable to
    git_fast_filter.py, except that rev-list options are passed to
    git-fast-export (which, as noted above, is typically executed separately
    in addition to the filter script). In other words, the tools are very
    similar in purpose.

  * Speed of Execution: By virtue of using fast-export and fast-import,
    git_fast_filter avoids lots of forks (typically thousands or millions of
    them) and bypasses the need to rewrite the same file 50,000 times.
    (Also, git_fast_filter does not use a temporary directory of any sort,
    and moving repositories to tmpfs to accelerate I/O would not
    significantly speed up the operation.)

  * Speed of Development: Since usage of git_fast_filter involves writing a
    separate python script and typically invoking two extra programs, it
    takes longer to invoke than typing git-filter-branch one-liners. (One
    can have the python script invoke fast-export and fast-import rather than
    doing it on the command line and using pipes, if one wants to. It's
    still a little bit of extra typing, though.) Speed of "development" is
    probably more important than speed of execution for many small
    repositories or simple rewrites, thus git-filter-branch will likely
    remain the tool of choice in many cases.

  * Location of rewritten History: git-filter-branch always puts the
    rewritten history back into the same repository that holds the original
    history. That confuses a lot of people; while the same can be done
    with git_fast_filter, examples are geared at writing the new history
    into a different repository.

  * Rewriting a subset of history (potential gotcha): When git-fast-export
    operates on a subset of history that does not include a root commit, it
    truncates history before the first exported commits. This makes sense
    since the destination repository may not have the unexported commits
    already. (Note that one can use the --import-marks feature to
    git-fast-export to notify fast-export that the destination repository
    does indeed have the needed commits, i.e. that an 'incremental' export is
    being done and thus that history should not be truncated.) WHY THIS
    MATTERS: git-filter-branch will not truncate history when dealing with a
    subset of history, since it is writing the modified history back to the
    source repository where it is known that the non-rewritten commits are
    available. If someone tries to duplicate such behavior with
    git_fast_filter, they may be surprised unless they pass the --import-marks
    flag to git-fast-export.
@@ -0,0 +1,191 @@
git repo-filter is intended to be a tool similar to [git
filter-branch](https://git-scm.com/docs/git-filter-branch) for
rewriting repository history. While filter-branch is relatively quick
to learn and invoke and is relatively versatile, it has a few glaring
deficiencies. repo-filter tries to copy filter-branch's good
qualities, while bringing a significant performance boost and a
different taste in usability.

# Table of Contents

* Background
  * [Why create another repo filtering tool?](#why-git-repo-filter)
  * [Warnings: Not yet ready for external usage](#warnings-not-yet-ready-for-external-usage)
  * [Why not $FAVORITE_COMPETITOR](#why-not-favorite_competitor)
* [Usage](#usage)

# Background

## Why git-repo-filter?

None of the [existing repository filtering
tools](#why-not-favorite_competitor) do what I want. They're all good
in their own way, but come up short for my needs. In no particular order:

1. [Starting report] Provide user an analysis of their repo to help
   them get started on what to prune or rename, instead of expecting
   them to guess or find other tools to figure it out. (Triggered, e.g.
   by running the first time with a special flag, such as --analyze.)

1. [Keep vs. remove] Instead of just providing a way for users to
   easily remove selected paths, also provide flags for users to
   only *keep* certain paths. Sure, users could work around this by
   specifying to remove all paths other than the ones they want to
   keep, but the need to specify all paths that *ever* existed in
   **any** version of the repository could sometimes be quite
   painful. For filter-branch, using pipelines like `git ls-files |
   grep -v ... | xargs -r git rm` might be a reasonable workaround
   but can get unwieldy and isn't as straightforward for users.

1. [Renaming] It should be easy to rename paths. For example, in
   addition to allowing one to treat some subdirectory as the root
   of the repository, also provide options for users to make the
   root of the repository just become a subdirectory. And more
   generally allow files and directories to be easily renamed.
   Provide sanity checks if renaming causes multiple files to exist
   at the same path. (And add special handling so that if a commit
   merely renamed oldname->newname, then filtering oldname->newname
   doesn't trigger the sanity check and die on that commit.)

1. [More intelligent safety] Writing copies of the original refs to
   a special namespace within the repo does not provide a
   user-friendly recovery mechanism. Many would struggle to recover
   using that. Almost everyone I've ever seen do a repository
   filtering operation has done so with a fresh clone, because
   wiping out the clone in case of error is a vastly easier recovery
   mechanism. Strongly encourage that workflow by detecting and
   bailing if we're not in a fresh clone, unless the user overrides
   with --force. (Allow the old filter-branch workflow if a special
   --store-backup flag is provided.)

1. [Auto shrink] Automatically remove old cruft and repack the
   repository for the user after filtering (unless overridden)

1. [Clean separation] Avoid confusing users (and prevent accidental
   re-pushing of old stuff) due to mixing old repo and rewritten
   repo together. (This is particularly a problem with filter-branch
   when using the --tag-name-filter option, and sometimes also an
   issue when only filtering a subset of branches.)

1. [Commit message consistency] If commit messages refer to other
   commits by ID (e.g. "this reverts commit 01234567890abcdef", "In
   commit 0013deadbeef9a..."), those commit messages should be
   rewritten to refer to the new commit IDs.

1. [Empty pruning] Commits which become empty due to filtering
   should be pruned. That includes merge commits which become empty
   (e.g. when grabbing the history of a single directory that hasn't
   always existed within the repo; I don't want thousands of
   unrelated commits that pre-dated the introduction of that
   directory). However, I do not want commits which were empty in
   the original repository to be pruned.

1. [Speed] Filtering should be reasonably fast

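To make the [Keep vs. remove] and [Empty pruning] items concrete, here is a
minimal sketch of the intended behavior. It is not repo-filter's actual
code: `keep_paths` and `should_prune` are hypothetical names, a commit is
reduced to the list of paths it touches, and merge commits are ignored.

```python
def keep_paths(changed_paths, wanted):
    """Return only the paths that equal, or live under, a wanted path."""
    prefixes = tuple(w.rstrip("/") + "/" for w in wanted)
    exact = {w.rstrip("/") for w in wanted}
    return [p for p in changed_paths if p in exact or p.startswith(prefixes)]


def should_prune(original_paths, filtered_paths):
    """Prune commits that *became* empty, not ones that started empty."""
    return bool(original_paths) and not filtered_paths


# A commit that only touched docs/ becomes empty when keeping src/ ...
touched = ["docs/intro.md"]
print(should_prune(touched, keep_paths(touched, ["src"])))
# ... while a commit that was already empty in the original is preserved.
print(should_prune([], keep_paths([], ["src"])))
```

Matching on `src/` rather than the bare string `src` keeps a sibling such as
`srcfoo/` from being swept in by accident.
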
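The [Commit message consistency] item could be approached as a regex
substitution over each commit message, given a map from old to new commit
IDs. The sketch below is one possible approach under that assumption, not
repo-filter's implementation; `rewrite_hashes` and the map are hypothetical,
and abbreviated IDs are resolved by unique prefix.

```python
import re

# Runs of 7 to 40 hex digits that look like (possibly abbreviated) commit IDs.
_HASH = re.compile(r"\b[0-9a-f]{7,40}\b")


def rewrite_hashes(message, old_to_new):
    """Replace old commit IDs in `message` using the `old_to_new` map."""
    def translate(match):
        abbrev = match.group(0)
        candidates = [new for old, new in old_to_new.items()
                      if old.startswith(abbrev)]
        if len(candidates) != 1:
            return abbrev  # unknown or ambiguous: leave the text untouched
        # Keep the abbreviation length the author originally used.
        return candidates[0][:len(abbrev)]
    return _HASH.sub(translate, message)


# Hypothetical IDs, for illustration only.
id_map = {"0123456789abcdef0123456789abcdef01234567":
          "fedcba9876543210fedcba9876543210fedcba98"}
print(rewrite_hashes("This reverts commit 0123456789ab.", id_map))
```

The linear scan over the map for every match is fine for a sketch but would
be slow on a large history; a sorted list or prefix tree of old IDs would
scale better.
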
## Warnings: Not yet ready for external usage

This repository is still under heavy construction. Some caveats:

* It will not work without a specially compiled version of git:
  * git clone --branch fast-export-import-improvements https://github.com/newren/git/
  * Build according to normal git.git build instructions. You can find 'em.
* I have a list of known bugs, conveniently mostly tracked in my head.
  I'll fix that, but the fact that you're reading this sentence means
  I haven't yet.
  * Actually, there are a couple of exceptions to where bugs are tracked.
    In particular, the following bugs are tracked here:
    * Multiple unimplemented placeholder option flags exist. Just because a
      flag shows up in --help doesn't mean it does anything.
    * Usage instructions and examples at the end of this document are rather
      lacking.
* Random debugging code or extraneous files might be checked in at any
  given time; I'll probably rewrite history to remove them...eventually.
* I reserve the right to:
  * Rename the tool altogether (filter-repo to be like filter-branch?)
  * Rename or redefine any command line options
  * Rewrite the history of this repository at any time
  * and possibly more...but do you really need any more reasons than
    the above? This isn't ready for widespread use.

## Why not $FAVORITE_COMPETITOR?

Here are some of the prominent competitors I know of:
* git_fast_filter.py (Original link dead, use google if you care; this repo
  is the successor, though.)
* [reposurgeon](http://www.catb.org/esr/reposurgeon/)
* [BFG repo cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
* [git filter-branch](https://mirrors.edge.kernel.org/pub/software/scm/git/docs/git-filter-branch.html)

Here's why I think these tools don't meet my needs:

* git_fast_filter.py:
  * This was actually the basis for repo-filter, though it required lots of
    additional work.
  * Was meant as a library more than a tool, and had too high of an
    activation energy.
  * empty commit pruning was not as thorough as it should have been
  * had no provision for commit message rewriting for commit message
    consistency.
  * missing lots of little conveniences

* reposurgeon
  * focused on converting repositories between version control systems,
    and handles all the crazy impedance mismatches inherent in such
    conversions. I only care about rewriting history that starts in git
    and ends in git. If you care about converting between version control
    systems, though, reposurgeon is a much better tool.
  * might be general enough to use for other uses, but can't find any
    documentation or examples on anything other than huge repository
    conversions between version control systems.
  * way too much effort for many simple repository rewrites that many
    users want to perform

* BFG repo cleaner
  * Very focused on just removing crazy big files and sensitive data.
    Probably the best tool if that's all you want. But lacks the ability
    to handle anything outside this special (but important!) use case.
  * Has useful options for helping you remove the N biggest blobs, but
    nothing to help you know how big N should be.
  * Doesn't prune commits which become empty due to filtering; if you
    just want to extract a directory added 3 months ago and its history,
    you'd be stuck with years of commits touching other directories, all
    empty.
  * The refusal to rewrite HEAD, while it makes sense when trying to
    remove a few crazy big files and sensitive data (users tend to
    re-add and re-commit bad files if you don't manually remove them
    and have users update), is totally misaligned with more general
    rewrite cases (e.g. the desire to turn a subdirectory into the
    root of a repository, or move the root of the repository into a
    subdirectory for merging into some other bigger repo.)
  * Telling the user how to shrink the repo afterwards seems lame since
    that was the whole point; just do it for them by default.

* git filter-branch

  * Fundamental design flaw causing it to be orders of magnitude
    slower than it should be for most repo rewriting jobs. So slow
    that it becomes a major usability impediment, if not a deal
    breaker. However, it is _extremely_ versatile.
  * Generally quick for users to invoke (quick one-liners with lots
    of examples), just missing some useful capabilities like
    selecting wanted paths (as opposed to unwanted paths) and
    providing easier path renaming (also, e.g. no
    --to-subdirectory-filter as the opposite of
    --subdirectory-filter)
  * Doesn't rewrite commit hashes in commit messages, causing commit messages
    to refer to phantom commits instead.
  * Mixes old repository information (original tags, unrewritten branches)
    with new, risking re-pushing the old stuff
  * Lame defaults
    * --prune-empty should be default (although only commits which become
      empty, not ones which started empty)
    * allows user to mess with repos which aren't a clean clone without an
      override
  * Makes it very difficult to actually get rid of unwanted objects and
    shrink repository. Long multi-step instructions in manpage for this,
    which are incomplete when --tag-name-filter is in use.

# Usage

Run `git repo-filter --help` and figure it out from there. Good luck.