Add README.md explaining new repo-filter tool

Elijah Newren 6 years ago
parent ad59fffed0
commit 0b187cf667

README

@@ -1,151 +0,0 @@
git_fast_filter.py is designed to make it easy to rewrite the history of a
git repository. As such it plays a similar role to git-filter-branch, and
was created primarily to overcome the (sometimes severe) speed shortcomings
of git-filter-branch. The idea of git_fast_filter.py is to serve as a
small library which makes it easy to write python scripts that filter the
output of git-fast-export. Thus, the calling convention is typically of
the form:
git fast-export | filter_script.py | git fast-import
Though to be more precise, one would probably run this as
$ mkdir target && cd target && git init
$ (cd /PATH/LEADING/TO/source && git fast-export --branches --tags) \
| /PATH/TO/filter_script.py | git fast-import
Example filter scripts can be found in the testcases subdirectory,
with a brief README file explaining calling syntax for the scripts.
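
To give a flavor of the library, here is a minimal sketch of what such a
filter script might look like; the class, callback, and attribute names
below are assumptions written from memory rather than checked against
git_fast_filter.py itself:

#!/usr/bin/env python
# Hypothetical filter_script.py: drop everything under docs/ from history.
# The names FastExportFilter, commit_callback, file_changes, and filename
# are assumptions about the git_fast_filter API, not verified here.
from git_fast_filter import FastExportFilter

def drop_docs(commit):
    # Keep only the file changes whose path lies outside docs/.
    commit.file_changes = [change for change in commit.file_changes
                           if not change.filename.startswith('docs/')]

my_filter = FastExportFilter(commit_callback=drop_docs)
my_filter.run()  # reads a fast-export stream on stdin, writes one to stdout
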
===== Abilities =====
git_fast_filter.py can be used to
* modify repositories in ways similar to git-filter-branch
* facilitate the creation of fast-export-like output on some set of data
* splice together independent repositories (interleaving commits from
separate repositories)
It has been used to modify file contents; filter out files based on content
or on name; drop, split, and insert commits; edit author and committer
information; clean up commit log messages (storing the excess information
in git-notes format); modify branch names; drop and insert blobs (i.e.
files) and/or commits; splice together independent repositories
(interleaving their commits); and perhaps make other small changes I'm
forgetting at the moment.
There is also a filtered_sparse_shallow_clone.py library that can be used
to create scripts for creating a filtered sparse or shallow "clone" of a
repository, and for bidirectional collaboration between the filtered and
unfiltered repositories.
===== Caveats =====
I think git_fast_filter.py works pretty well, but there are some potential
gotchas if you're not using recent enough versions of git or try to do
something unusual...
You need to be using git>=1.6.3 (technically, git >= v1.6.2.1-353-gebeec7d)
in order for filtering on a subset of history not including a root commit
to work correctly. (In other words, if you're passing something like
master~5..master to git-fast-export, you need a recent version of git. If
you just pass master or --all, then old versions of git will suffice.)
You either need to use git>=1.6.2 or pass the --topo-order flag to
git-fast-export in order to avoid merge commits being squashed.
git_fast_filter passes this flag to git-fast-export if you have it invoke
git-fast-export for you.
Since git-fast-export and git_fast_filter.py both work by assigning integer
identifiers to every blob & commit (typically in the range 1..n), this
presents a uniqueness challenge when interleaving commits from separate
repositories, inserting commits, or using the --import-marks flag. In
particular, doing any of these things without letting git_fast_filter.py
know about it is a recipe for trouble. When interleaving commits, make use
of the fast_export_output() function instead of piping git fast-export
output to the script. When using the --import-marks flag to
git-fast-export, again do so via the fast_export_output() function so that
git_fast_filter.py can be aware of the range of ids to avoid.
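
As a rough sketch (the signature of fast_export_output() and the run()
arguments here are assumptions, not taken from the library), having
git_fast_filter launch git-fast-export for you might look something like:

from git_fast_filter import FastExportFilter, fast_export_output

# Hypothetical sketch: let git_fast_filter launch git-fast-export itself so
# that it sees the extra arguments (e.g. --import-marks) and can avoid the
# corresponding range of ids.  fast_export_output() is assumed to return a
# subprocess-like object whose stdout carries the fast-export stream.
source = fast_export_output('/PATH/LEADING/TO/source',
                            ['--branches', '--tags',
                             '--import-marks=/PATH/TO/marks'])
my_filter = FastExportFilter()      # identity filter, just for illustration
my_filter.run(input=source.stdout)  # filtered stream still goes to stdout
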
While git_fast_filter has some logic to keep identifiers unique when
inserting commits, using --import-marks, or splicing together commits from
separate repositories (which it does by remapping identifiers as
necessary), it may not handle all corner cases. Its identifier remapping has
been tested on special cases individually, but it has not been tested on
all combinations of special cases. In particular, I do not know if it will
handle the combination of --import-marks being passed to multiple
fast_export_output() streams and trying to combine all these streams into a
single repository. (Incidentally, I can't think of a use case for doing
that either.)
Inserting manually created commits (or interleaving commits between
repositories) provides an interesting challenge for git_fast_filter.
First, if you are inserting changes to files and expecting them to
propagate, you will be disappointed; each commit records the exact version
of every file that differs from its first parent. Thus, if you want
inserted file changes to propagate, you either have to rewrite the affected
files in all subsequent commits or use a different tool such as git rebase.
Second, if the commits you insert end up on a merged branch (that is, the
inserted commit is reachable through the second or later parent of some
commit) then any new files you inserted would normally be dropped by
git-fast-import. The reason is that git-fast-import expects each commit
to list the files which differ from its first parent. Files that exist
only on the branches corresponding to parents after the first must
therefore be repeated in the merge commit, even if these files are not
being changed in the merge commit itself. git_fast_filter.py has some
ugly hacks to make this happen behind the scenes for you, but it only works
when the inserted commits contain new, unique files that are not also
created or modified on other branches. If you do something clever or more
complicated than this that defeats my simple hack, we may need to modify
git-fast-import (and perhaps git-fast-export) to have them allow the
following behavior via some flag: diff relative to all parents and only
require merge commits to list files that conflict among the different
parents (or that were otherwise changed in the merge commit).
===== Comparing/contrasting to git-filter-branch =====
* Similar Basics: The basic abilities and warnings in the first three
paragraphs of the git-filter-branch manpage are equally applicable to
git_fast_filter.py, except that rev-list options are passed to
git-fast-export (which, as noted above, is typically executed separately
in addition to the filter script). In other words, the tools are very
similar in purpose.
* Speed of Execution: By virtue of using fast-export and fast-import,
git_fast_filter avoids lots of forks (typically thousands or millions of
them) and bypasses the need to rewrite the same file 50,000 times.
(Also, git_fast_filter does not use a temporary directory of any sort,
and moving repositories to tmpfs to accelerate I/O would not
significantly speed up the operation.)
* Speed of Development: Since usage of git_fast_filter involves writing a
separate python script and typically invoking two extra programs, it
takes longer to invoke than typing git-filter-branch one-liners. (One
can have the python script invoke fast-export and fast-import rather than
doing it on the command line and using pipes, if one wants to. It's
still a little bit of extra typing, though.) Speed of "development" is
probably more important than speed of execution for many small
repositories or simple rewrites, so git-filter-branch will likely
remain the tool of choice in many cases.
* Location of rewritten History: git-filter-branch always puts the
rewritten history back into the same repository that holds the original
history. That confuses a lot of people; while the same can be done
with git_fast_filter, examples are geared at writing the new history
into a different repository.
* Rewriting a subset of history (potential gotcha): When git-fast-export
operates on a subset of history that does not include a root commit, it
truncates history before the first exported commits. This makes sense
since the destination repository may not have the unexported commits
already. (Note that one can pass the --import-marks flag to
git-fast-export to notify it that the destination repository
does indeed have the needed commits, i.e. that an 'incremental' export is
being done and thus that history should not be truncated.) WHY THIS
MATTERS: git-filter-branch will not truncate history when dealing with a
subset of history, since it is writing the modified history back to the
source repository where it is known that the non-rewritten commits are
available. If someone tries to duplicate such behavior with
git_fast_filter, they may be surprised unless they pass the --import-marks
flag to git-fast-export.

README.md
@@ -0,0 +1,191 @@
git repo-filter is intended to be a tool similar to [git
filter-branch](https://git-scm.com/docs/git-filter-branch) for
rewriting repository history. While filter-branch is relatively quick
to learn and invoke and is relatively versatile, it has a few glaring
deficiencies. repo-filter tries to copy filter-branch's good
qualities, while bringing a significant performance boost and a
different taste in usability.
# Table of Contents
* Background
* [Why create another repo filtering tool?](#why-git-repo-filter)
* [Warnings: Not yet ready for external usage](
#warnings-not-yet-ready-for-external-usage)
* [Why not $FAVORITE_COMPETITOR](#why-not-favorite_competitor)
* [Usage](#usage)
# Background
## Why git-repo-filter?
None of the [existing repository filtering
tools](#why-not-favorite_competitor) do what I want. They're all good
in their own way, but come up short for my needs. In no particular order:
1. [Starting report] Provide users with an analysis of their repo to help
them get started on what to prune or rename, instead of expecting
them to guess or find other tools to figure it out. (Triggered, e.g.,
by running the tool a first time with a special flag such as --analyze.)
1. [Keep vs. remove] Instead of just providing a way for users to
easily remove selected paths, also provide flags for users to
only *keep* certain paths. Sure, users could work around this by
specifying to remove all paths other than the ones they want to
keep, but the need to specify all paths that *ever* existed in
**any** version of the repository could sometimes be quite
painful. For filter-branch, using pipelines like `git ls-files |
grep -v ... | xargs -r git rm` might be a reasonable workaround
but can get unwieldy and isn't as straightforward for users.
1. [Renaming] It should be easy to rename paths. For example, in
addition to allowing one to treat some subdirectory as the root
of the repository, also provide options for users to make the
root of the repository just become a subdirectory. And more
generally allow files and directories to be easily renamed.
Provide sanity checks if renaming causes multiple files to exist
at the same path. (And add special handling so that if a commit
merely renamed oldname->newname, then filtering oldname->newname
doesn't trigger the sanity check and die on that commit.)
1. [More intelligent safety] Writing copies of the original refs to
a special namespace within the repo does not provide a
user-friendly recovery mechanism. Many would struggle to recover
using that. Almost everyone I've ever seen do a repository
filtering operation has done so with a fresh clone, because
wiping out the clone in case of error is a vastly easier recovery
mechanism. Strongly encourage that workflow by detecting and
bailing if we're not in a fresh clone, unless the user overrides
with --force. (Allow the old filter-branch workflow if a special
--store-backup flag is provided.)
1. [Auto shrink] Automatically remove old cruft and repack the
repository for the user after filtering (unless overridden)
1. [Clean separation] Avoid confusing users (and prevent accidental
re-pushing of old stuff) due to mixing old repo and rewritten
repo together. (This is particularly a problem with filter-branch
when using the --tag-name-filter option, and sometimes also an
issue when only filtering a subset of branches.)
1. [Commit message consistency] If commit messages refer to other
commits by ID (e.g. "this reverts commit 01234567890abcdef", "In
commit 0013deadbeef9a..."), those commit messages should be
rewritten to refer to the new commit IDs (see the sketch after this
list).
1. [Empty pruning] Commits which become empty due to filtering
should be pruned. That includes merge commits which become empty
(e.g. when grabbing the history of a single directory that hasn't
always existed within the repo; I don't want thousands of
unrelated commits that pre-dated the introduction of that
directory). However, I do not want commits which were empty in
the original repository to be pruned.
1. [Speed] Filtering should be reasonably fast
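
To make the commit message consistency item above concrete, here is a
small illustrative sketch (not repo-filter's actual implementation) of the
idea: given a map from old commit IDs to rewritten ones, scan each message
for things that look like commit hashes and substitute the new ID while
preserving the original abbreviation length.

```python
import re

# Illustrative only: old full commit IDs mapped to their rewritten IDs,
# as would be accumulated while filtering.
old_to_new = {
    '0123456789abcdef0123456789abcdef01234567':
    '89abcdef0123456789abcdef0123456789abcdef',
}

HASH_RE = re.compile(r'\b[0-9a-f]{7,40}\b')

def rewrite_hashes(message):
    def replace(match):
        old = match.group(0)
        # Find the unique old commit ID starting with this abbreviation.
        candidates = [full for full in old_to_new if full.startswith(old)]
        if len(candidates) != 1:
            return old  # unknown or ambiguous reference; leave it alone
        new = old_to_new[candidates[0]]
        return new[:len(old)]  # keep the original abbreviation length
    return HASH_RE.sub(replace, message)

print(rewrite_hashes('This reverts commit 0123456789abcdef.'))
# -> This reverts commit 89abcdef01234567.
```

A real tool has to be more careful than this (e.g. avoiding false matches
on other hex-looking words and handling references to commits that were
pruned), but this is the basic transformation.
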
## Warnings: Not yet ready for external usage
This repository is still under heavy construction. Some caveats:
* It will not work without a specially compiled version of git:
* git clone --branch fast-export-import-improvements https://github.com/newren/git/
* Build according to normal git.git build instructions. You can find 'em.
* I have a list of known bugs, conveniently mostly tracked in my head.
I'll fix that, but the fact that you're reading this sentence means
I haven't yet.
* Actually, there are a couple of exceptions to the bug-tracking location
mentioned above. In particular, the following bugs are tracked here:
* Multiple unimplemented placeholder option flags exist. Just because it
shows up in --help doesn't mean it does anything.
* Usage instructions and examples at the end of this document are rather
lacking.
* Random debugging code or extraneous files might be checked in at any
given time; I'll probably rewrite history to remove them...eventually.
* I reserve the right to:
* Rename the tool altogether (filter-repo to be like filter-branch?)
* Rename or redefine any command line options
* Rewrite the history of this repository at any time
* and possibly more...but do you really need any more reasons than
the above? This isn't ready for widespread use.
## Why not $FAVORITE_COMPETITOR?
Here are some of the prominent competitors I know of:
* git_fast_filter.py (Original link dead, use google if you care; this repo
is the successor, though.)
* [reposurgeon](http://www.catb.org/esr/reposurgeon/)
* [BFG repo cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
* [git filter-branch](https://mirrors.edge.kernel.org/pub/software/scm/git/docs/git-filter-branch.html)
Here's why I think these tools don't meet my needs:
* git_fast_filter.py:
* This was actually the basis for repo-filter, though it required lots of
additional work.
* Was meant as a library more than a tool, and had too high an
activation energy.
* Empty commit pruning was not as thorough as it should have been.
* Had no provision for rewriting commit messages to keep them consistent
with the new commit IDs.
* Missing lots of little conveniences.
* reposurgeon
* Focused on converting repositories between version control systems,
and handles all the crazy impedance mismatches inherent in such
conversions. I only care about rewriting history that starts in git
and ends in git. If you care about converting between version control
systems, though, reposurgeon is a much better tool.
* Might be general enough for other uses, but I can't find any
documentation or examples of anything other than huge repository
conversions between version control systems.
* Requires far too much effort for the simple repository rewrites many
users want to perform.
* BFG repo cleaner
* Very focused on just removing crazy big files and sensitive data.
Probably the best tool if that's all you want. But lacks the ability
to handle anything outside this special (but important!) usecase.
* Has useful options for helping you remove the N biggest blobs, but
nothing to help you know how big N should be.
* Doesn't prune commits which become empty due to filtering; if you
just want to extract a directory added 3 months ago and its history,
you'd be stuck with years of commits touching other directories, all
empty.
* The refusal to rewrite HEAD, while it makes sense when trying to
remove a few crazy big files and sensitive data (users tend to
re-add and re-commit bad files if you didn't manually remove the files
and have the users update), is totally misaligned with more general
rewrite cases (e.g. the desire to turn a subdirectory into the
root of a repository, or move the root of the repository into a
subdirectory for merging into some other bigger repo.)
* Telling the user how to shrink the repo afterwards seems lame since
that was the whole point; just do it for them by default.
* git filter-branch
* Fundamental design flaw causing it to be orders of magnitude
slower than it should be for most repo rewriting jobs. So slow
that it becomes a major usability impediment, if not a deal
breaker. However, it is _extremely_ versatile.
* Generally quick for users to invoke (short one-liners with lots
of examples), but missing some useful capabilities like
selecting wanted paths (as opposed to unwanted paths) and
providing easier path renaming (also, e.g. no
--to-subdirectory-filter as the opposite of
--subdirectory-filter)
* Doesn't rewrite commit hashes in commit messages, causing commit messages
to refer to phantom commits instead.
* Mixes old repository information (original tags, unrewritten branches)
with new, risking re-pushing the old stuff
* Lame defaults
* --prune-empty should be default (although only commits which become
empty, not ones which started empty)
* allows users to mess with repos which aren't a clean clone without
requiring an override
* Makes it very difficult to actually get rid of unwanted objects and
shrink repository. Long multi-step instructions in manpage for this,
which are incomplete when --tag-name-filter is in use.
# Usage
Run `git repo-filter --help` and figure it out from there. Good luck.