Various todo-related files

6 years ago · caffe46d77
commit caffe46d77
6 changed files with 691 additions and 0 deletions
--- a/197
+++ b/197
@ -0,0 +1,197 @@
+Before widely announcing:
+  - Notes on splitting
+    - exporter needs to know the pipe combination for commit message rewriting
+      - commit message rewriting gets weird if commits held in memory for later
+      - pruning gets weird too
+    - need to handle 1 export -> 2 imports
+  - Test setup
+    - Add several more tests, particularly around:
+      - commit pruning
+        - pruning commits that become empty
+	- pruning commits that started empty and have no parent
+	- not pruning commits that have changes or remain a merge commit
+        - pruning parent(s) of a merge
+        - coalescing common commits of a merge
+	- coalescing parents of a merge when one is an ancestor of the other
+      - ref pruning
+        - tags pointing at commits which are pruned along with their history
+	- refs pointing at commits which are pruned along with their history
+	- refs or tags behind a negative revision specification
+      - commit message rewriting
+      - renaming, particularly when it causes collisions
+      - use coverage.py to direct test writing
+  - Check whether the version of git in use supports the appropriate flags
+  - Rewrite history
+    - Remove tests from older commits until they would actually work
+
+Generate upstream patches:
+  - Tags of tags of commits fail to export:
+    - In git.git, try:
+      $ git fast-export --no-data --use-done-feature --signed-tags=strip \
+                  --tag-of-filtered-object=rewrite-feature v1.0rc1 >/dev/null
+      fatal: tag 5f4cd4ca015dc795b9f7f4fed11b3f80a60ac175 tags unexported tag!
+
+Bigger ideas
+  - 1st step, create local branches for each remote tracking branch:
+      git fetch . refs/remotes/origin/*:refs/heads/*
+    also, nuke refs/remotes/origin/*; it won't match upstream anyway
+  - Performance:
+    - Smarter record_remapping -- do it lazily
+    - Unnecessary re-computation of 'epoch' (calling fromtimestamp)
+      ...and perhaps just unnecessary use of FixedTimeZone when most of the time
+      it will not be checked or modified?
+    - What part of _parse_commit takes so much time?
+    - What part of commit.dump takes so much time?
+    - Speedup _parse_optional_filechange using str.split(None, 3) instead of re
+    - Which wait() are we waiting on?
+    - Smarter become-empty checks; only do more expensive checks if:
+      - First parent is no longer original first parent or ancestor thereof
+        - e.g. first-parent history empty, second parent becomes first parent
+        - e.g. --parent-filter causes some kind of graft operation (although
+               maybe we don't want to prune in this case anyway...)
+      - Blob filtering is active AND the only file_changes involved correspond
+        to filenames that have previously been modified.
+    - Regex optimization
+    - memoize (or just outright store?) filename remapping
+      - memoize net result: dequote -> do mods -> requote
+  - Work with submodules
+  - Important features
+    - paths-from-file (--paths-from-file <(git ls-tree -r HEAD))
+    - include-old-names-of-specified-files
+      - so users don't have to look for rename data from --analyze
+  - Do git rev-list --count to get idea of amount of work; show progress
+
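The `str.split(None, 3)` idea for `_parse_optional_filechange` noted above could look roughly like the sketch below. It is only an illustration: the regex, the function names, and the assumption that a modify line has the shape `M <mode> <dataref> <path>` are stand-ins, not the script's actual code.

```python
import re

# Hypothetical pattern standing in for whatever regex the script uses today.
MODIFY_RE = re.compile(r'M (\d+) (\S+) (.*)$')


def parse_modify_with_re(line):
    """Parse 'M <mode> <dataref> <path>' with a regex; None if no match."""
    match = MODIFY_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


def parse_modify_with_split(line):
    """Parse the same line with str.split, avoiding the regex engine."""
    if not line.startswith('M '):
        return None
    # maxsplit=3 keeps everything after the third separator together, so a
    # path containing spaces survives intact.  Caveat: whitespace at the very
    # start of the path would be swallowed here, while the regex keeps it.
    parts = line.rstrip('\n').split(None, 3)
    if len(parts) != 4:
        return None
    return parts[1], parts[2], parts[3]
```

Whether this is actually faster would need to be measured on a real export stream; the profile is what should decide.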
+Left over bits:
+  - Fix up --analyze
+    * shouldn't allow running --analyze with negative refspecs
+    * add a --no-detect-renames option (for performance)
+  - metadata
+    - On second and subsequent runs, update metadata instead of overwriting
+       - for maps, give beginning_hash -> end_hash, not intermediate hashes
+    - OR error out if .git/repo-filter already created?
+  - error out if any progress messages in stream (can't deal with them unless
+    we can pass --cat-blob-fd to fast-import, and that seems non-portable)
+
+More path stuff, maybe
+  --path-rename-regex
+  --path-stream-rename  (invoked once; must read one line then print)
+  --path-stream-filter  (invoked once per commit with new files)
+  --path-tree-filter
+Ref stuff
+  --ref-rename
+  --ref-stream-rename
+Blob filter
+  --tree-filter
+
+
+Safety stuff
+  --keep-excluded-revisions
+  --keep-excluded-refs
+  --store-backup
+  --empty-pruning={no/off,auto,always/on}
+  --negative-refs={drop,reference}
+
+Other things:
+  - add a filename_callback too, for just editing file names
+  - add --skip-cleanup (pruning, gc, etc.; keep reset --hard) for speed compare
+  - get rid of user-run fast-export & fast-import; don't want to have to
+    update two callsites.
+
+Performance notes:
+  * On rails:
+    * 1) time git fast-export --show-original-ids --signed-tags=strip \
+                           --tag-of-filtered-object=rewrite --no-data \
+                           --use-done-feature --all >/dev/null
+    * 2) time git fast-export --show-original-ids --signed-tags=strip \
+                           --tag-of-filtered-object=rewrite --no-data \
+                           --use-done-feature --all >saved_output
+    * 3a) time git fast-export --show-original-ids --signed-tags=strip \
+                           --tag-of-filtered-object=rewrite --no-data \
+                           --use-done-feature --all \
+                           | sed -e s/+051800/+0261/ >/dev/null
+    * 3b) time git fast-export --show-original-ids --signed-tags=strip \
+                           --tag-of-filtered-object=rewrite --no-data \
+                           --use-done-feature --all \
+                           | stupid.py >/dev/null
+    * 4) time git fast-export --show-original-ids --signed-tags=strip \
+                           --tag-of-filtered-object=rewrite --no-data \
+                           --use-done-feature --all \
+                           | sed -e s/+051800/+0261/ \
+                           | git fast-import --force --quiet >/dev/null
+    * 5) time git repo-filter --invert-paths --path pushgems.rb
+         (with early quit right before removing unused refs)
+    * 6) time python -m cProfile -o repo-filter.profile \
+             ~/floss/git-repo-filter/git-repo-filter \
+             --invert-paths --path pushgems.rb
+    * 7) time java -jar ~/Downloads/bfg-1.13.0.jar --delete-files pushgems.rb
+
+
+      1:  3.910    fast-export
+      2:  3.958    fast-export + save output
+      3:  4.128    fast-export + sed (but toss output)
+      3a: 4.234    fast-export + python stdin using 'for' iterator
+      3b: 4.189    fast-export + python stdin using readline
+      3c:27.796    fast-export + python from subprocess using readline
+      3d: 4.196    fast-export + python from subprocess using 'for' iterator
+      3e: 4.580    fast-export + python3 from subprocess using readline
+      3f: 5.334    fast-export + python3 from subprocess using 'for' iterator
+      3g: 4.264    fast-export + python from subprocess using readline & bufsize
+      4: 11.279    fast-export + sed + fast-import
+      5: 64.098    repo-filter
+      5: 35.914    repo-filter, after bufsize=-1 for subprocess stuff
+      6: 69.150    repo-filter run under cProfile
+      7: 20.155    bfg
+
+    Other Notes:
+      * cProfile:
+        python -m cProfile -o repo-filter.profile \
+            ~/floss/git-repo-filter/git-repo-filter \
+            --invert-paths --path pushgems.rb
+        python
+        >>> import pstats
+        >>> p = pstats.Stats('repo-filter.profile')
+        >>> p.strip_dirs().sort_stats('cumtime').print_stats()
+      * reports 64.2% of time in readline()
+      * reports 37.0% of time under _advance_currentline
+
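The gap between timings 3c and 3g above comes down to how the pipe from the subprocess is buffered. A minimal sketch of the buffered variant follows; the helper name is hypothetical and the child command is whatever the caller passes in.

```python
import subprocess
import sys


def count_output_lines(cmd):
    """Run cmd and count its stdout lines using a buffered pipe."""
    # bufsize=-1 requests the system default buffer size.  Python 2 defaulted
    # to unbuffered pipes (bufsize=0); Python 3 already defaults to -1.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=-1)
    count = 0
    for _line in proc.stdout:  # the 'for' iterator variant from the table
        count += 1
    proc.stdout.close()
    returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError('command exited with status %d' % returncode)
    return count
```

Inside a repository, something like `count_output_lines(['git', 'fast-export', '--no-data', '--all'])` would exercise the same path as timing 3g.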
+
+Argument parsing stuff:
+  # NOT YET IMPLEMENTED OPTIONS BELOW
+  misc.add_argument('--empty-pruning', choices=['always', 'auto', 'never'],
+                    default='auto',
+                    help='''The default, auto, will check if filtering
+                            causes commits to become empty (have no file
+                            changes and only have one parent) and prune them
+                            if so.  This pruning can also cause merge
+                            commits to have fewer parents and possibly
+                            become empty themselves, and thus be pruned.
+                            Further, any branch or tag whose entire history
+                            is pruned due to becoming empty will be pruned.
+                            However, auto will not prune commits which
+                            started out empty in the original repo and have
+                            a non-pruned parent.''')
+  misc.add_argument('--store-backup', default=None,
+                    metavar='NAMESPACE', dest='backup',
+                    help='Store a copy of original refs under refs/NAMESPACE/')
+  misc.add_argument('--keep-excluded-refs', action='store_true',
+                    help='''If refs are excluded either explicitly (e.g.
+                            ^master) or implicitly (e.g. a branch in the
+                            history of an excluded ref/revision, or a branch
+                            not listed in the set of revisions to filter),
+                            then that ref will be deleted by the filtering
+                            process.  Use --keep-excluded-refs to retain
+                            such refs.''')
+
+  misc.add_argument('--keep-excluded-revisions', action='store_true',
+                    help='''If negative revisions are provided to exclude
+                            the range of history we are filtering over (e.g.
+                            negative_branch..master or ^negative_branch_1
+                            ^negative_branch_2 master develop), then by
+                            default any commits in the history of those
+                            revisions are excluded from the filtered history
+                            (resulting in the first not-excluded commit in
+                            history becoming a root commit and often
+                            containing an unusually large number of file
+                            changes).  With --keep-excluded-revisions, those
+                            commits are all retained (in their unfiltered
+                            form).''')
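The option definitions above refer to a `misc` group that is defined elsewhere. To try them in isolation, a self-contained sketch might look like this; the program name and group title are assumptions, and the help strings are dropped for brevity.

```python
import argparse

# Assumed program name and group title; only the option definitions below
# are taken from the notes.
parser = argparse.ArgumentParser(prog='git-repo-filter')
misc = parser.add_argument_group('Miscellaneous options')
misc.add_argument('--empty-pruning', choices=['always', 'auto', 'never'],
                  default='auto')
misc.add_argument('--store-backup', default=None,
                  metavar='NAMESPACE', dest='backup')
misc.add_argument('--keep-excluded-refs', action='store_true')
misc.add_argument('--keep-excluded-revisions', action='store_true')

# Parse an example command line instead of sys.argv.
args = parser.parse_args(['--empty-pruning', 'never',
                          '--store-backup', 'original'])
```

Note that `--store-backup` lands in `args.backup` because of `dest='backup'`, while the other options use the default names derived from the flags.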
--- a/96
+++ b/96
@ -0,0 +1,96 @@
+#!/bin/bash
+
+if [[ $# -lt 2 || $# -gt 3 ]]; then
+  echo "Syntax:"
+  echo "  $0 REPO1 REPO2 [--summary]"
+  exit 1
+fi
+repo1="$1"
+repo2="$2"
+detail=1
+if [ $# == 3 ]; then
+  if [ "$3" != "--summary" ]; then
+    echo "Unrecognized argument: $3"
+    exit 1
+  fi
+  detail=
+fi
+
+if ( ! (cd "$repo1" && git rev-parse --git-dir > /dev/null) ); then
+  echo "$repo1 is not a directory or does not have a git repository!"
+  exit 1
+fi
+if ( ! (cd "$repo2" && git rev-parse --git-dir > /dev/null) ); then
+  echo "$repo2 is not a directory or does not have a git repository!"
+  exit 1
+fi
+
+tempfile=$(mktemp)
+
+#
+# Compare branches for identicalness
+#
+diff -u <(cd "$repo1" && git show-ref -h --heads --tags) <(cd "$repo2" && git show-ref -h --heads --tags) > $tempfile
+if [ $? != 0 ]; then
+  echo -n "Branches & tags do not match"
+  if test $detail; then
+    echo "; differences:"
+    cat $tempfile
+  else
+    echo "."
+  fi
+else
+  echo "* Branches and tags match exactly"
+  rm "$tempfile"
+  exit 0
+fi
+
+#
+# Compare branch names
+#
+diff -u <(cd "$repo1" && git for-each-ref --format="%(refname)" | grep refs/heads/) <(cd "$repo2" && git for-each-ref --format="%(refname)" | grep refs/heads/) > $tempfile
+if [ $? != 0 ]; then
+  echo -n "Branch names do not match"
+  if test $detail; then
+    echo "; differences:"
+    cat $tempfile
+  else
+    echo "."
+  fi
+else
+  echo "* Branch names match"
+fi
+
+#
+# Compare trees of branches
+#
+diff -u <(cd "$repo1" && git rev-parse $(git for-each-ref --format="%(refname)" | grep refs/heads/ | sed -e s/$/^{tree}/)) <(cd "$repo2" && git rev-parse $(git for-each-ref --format="%(refname)" | grep refs/heads/ | sed -e s/$/^{tree}/)) > $tempfile
+if [ $? != 0 ]; then
+  echo -n "Trees of branches do not match"
+  if test $detail; then
+    echo "; differences:"
+    cat $tempfile
+  else
+    echo "."
+  fi
+else
+  echo "* Trees of branches match"
+fi
+
+#
+# Compare number of commits on each branch
+#
+diff -u <(cd "$repo1" && for i in $(git for-each-ref --format="%(refname)" | grep refs/heads/); do count=$(git rev-list $i | wc -l); printf "%5d %s\n" $count $i; done) <(cd "$repo2" && for i in $(git for-each-ref --format="%(refname)" | grep refs/heads/); do count=$(git rev-list $i | wc -l); printf "%5d %s\n" $count $i; done) > $tempfile
+if [ $? != 0 ]; then
+  echo -n "Branch commit counts do not match"
+  if test $detail; then
+    echo "; differences:"
+    cat $tempfile
+  else
+    echo "."
+  fi
+else
+  echo "* Branch commit counts match"
+fi
+
+
+rm $tempfile
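The first check in the script above (diffing the ref listings of two repositories) could also be sketched in Python. This is an illustration only: the function names are hypothetical, and unlike the script it does not include HEAD in the listing.

```python
import difflib
import subprocess


def list_refs(repo):
    """Return sorted 'hash refname' lines for branches and tags in repo."""
    # git show-ref exits non-zero when no refs match, in which case
    # check_output raises CalledProcessError.
    out = subprocess.check_output(
        ['git', '-C', repo, 'show-ref', '--heads', '--tags'])
    return sorted(out.decode('utf-8', 'replace').splitlines())


def diff_refs(refs1, refs2, name1='repo1', name2='repo2'):
    """Return unified-diff lines between two ref listings (empty if equal)."""
    return list(difflib.unified_diff(refs1, refs2,
                                     fromfile=name1, tofile=name2,
                                     lineterm=''))
```

An empty result from `diff_refs(list_refs(a), list_refs(b))` corresponds to the script's "Branches and tags match exactly" case.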
--- a/102
+++ b/102
@ -0,0 +1,102 @@
+Background:
+  Desire to combine, split-apart, or clean up repositories
+  Examples: pgdev, nucleus, willamette
+Example, want:
+  Only certain paths (a specific directory)
+  move into a subdirectory
+  rename tags to not conflict
+Filter-branch command (takes 65.950 seconds, or 15.594 seconds):
+
+
+  time git filter-branch --tree-filter 'mkdir -p modules && git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -f -q && ls -d * | grep -v modules | xargs -I files mv files modules/' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
+
+
+Faster version (takes 37.802 seconds, or 6.287 seconds):
+
+
+  time git filter-branch --index-filter 'git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&modules/-" | git update-index --index-info; git ls-files | grep -v ^modules/ | xargs -r git rm -q --cached' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
+
+
+Caveats:
+  Really complicated to come up with
+  Googled solutions may be subtly os- or case- specific (sed, xargs, '*' above)
+  (I know git & bash & gnu vs. bsd, fixed filter-branch, etc.)
+Error Prone:
+  mixing old and new history
+  safety -- how to restore (refs/original hard; annotated tags may be missing)
+  pruning of empty commits overeager
+Painful, but possible:
+  selecting stuff to keep (as opposed to removing)
+  renaming files
+  figuring out what to remove (--analyze)
+  shrinking (man-page is misleading...)
+Limiting:
+  speed
+  commit message rewriting
+Compare:
+
+  git repo-filter --analyze
+  
+  time git repo-filter --path src/main/java/com/palantir/annotation --subdirectory-filter modules
+
+
+
+**********************************************************************
+
+Before demo tomorrow:
+  Submit git patch
+  Come up with basic demo and what to discuss
+
+    issues:
+      common:
+	no up-front report to help find what to remove
+        painful to select things to keep
+	shrinking is extra painful step
+
+      git-filter-branch issues:
+        doesn't rewrite commit messages
+	slow
+        mixes old and new history (& needs help to remove big objects)
+	pruning of empty commits is possible but overbearing hammer
+	painful to rename
+	safety: if using '--tag-name-filter cat', annotated tags NOT backed up
+
+      bfg:
+        cannot rename
+	does not prune empty commits
+
+    git-filter-branch:
+
+      65.950   time git filter-branch --tree-filter 'mkdir -p modules && git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -f -q && ls -d * | grep -v modules | xargs -I files mv files modules/' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
+
+      37.802   time git filter-branch --index-filter 'git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&modules/-" | git update-index --index-info; git ls-files | grep -v ^modules/ | xargs git rm -q --cached' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
+
+
+         time git clone ../whatever newcopy
+	 du -ks .git
+         git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/foo- | git update-ref --stdin
+	 time git gc --prune=now
+	 du -ks .git
+
+       0.660  time git repo-filter --path src/main/java/com/palantir/annotation --path-rename :modules/
+
+      git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/foo- | git update-ref --stdin
+
+      git for-each-ref --format="delete %(refname)" refs/original/ | git update-ref --stdin
+      git reflog expire --expire=now --all
+      git gc --prune=now
+
+    BFG:
+      bfg --delete-from <(git rev-list --objects --all | awk {print\$2} | grep -v ^$ | sort | uniq | grep -v $DIR_OF_INTEREST)
+      git fetch . refs/tags/*:refs/tags/foo-*
+      git show-ref --tags | awk {print\$2} | grep -v refs/tags/foo- | sed -e 's/^/delete /' | git update-ref --stdin
+      git reflog expire --expire-unreachable=now
+      git gc --prune=now
+
+***************************************************************************
+
+rails:
+  5252.036    time git filter-branch --tree-filter 'rm -f pushgems.rb' --tag-name-filter cat -- --all
+  1962.735    time git filter-branch --index-filter 'git rm --cached --ignore-unmatch pushgems.rb' --tag-name-filter cat -- --all
+    39.715    time git repo-filter --invert-paths --path pushgems.rb
+    33.169    <same, but with early exit>
--- a/48
+++ b/48
@ -0,0 +1,48 @@
+git-filter-branch
+  Ease-of-use differences:
+    Easier path selection and renaming
+    Rewrite sha1sums (and abbreviations) in commit messages
+    Defaults to pruning empty commits (but only commits that BECOME empty)
+      - (Technical notes, on kinds of empty:
+         - Empty due to blob filtering resulting in later patch becoming empty
+         - Empty due to path filtering
+         - Empty branch causing merge to lose parent(s) -- 3 styles
+           - One or more parents had no changes themselves or in their history
+           - Most recent non-empty commit on all branches was either the
+             merge-base or an ancestor (i.e. keeping the merge commit would
+             mean merging a commit with itself)
+           - Most recent non-empty commit on one parent's side of history is
+             an ancestor of another parent (i.e. that side no longer has any
+             interesting changes, and the parent corresponding to the empty
+             side should be removed)
+         - Empty ref due to entire history before it being empty
+    Deletes stuff not requested in the rewrite (unless overridden), so that
+      it doesn't confuse user or accidentally get re-pushed
+    Typically far faster to execute
+    Bails if not in a clean clone by default
+      - Users have a far easier time restoring if they can just nuke the clone
+      - Avoids the default need for users to mess with backups of original refs
+        (either for restoration, or for pruning to make sure repo is clean)
+    Repacks and shrinks repo for you (unless overridden)
+      - Makes it easier to ensure you've cleaned out unwanted stuff
+
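The "only commits that become empty" default described above can be written down as a small decision function. This is one reading of the notes, not the tool's implementation: the function and parameter names are hypothetical, and the behaviour of `always` (prune every empty non-merge commit) is an assumption since the notes only describe `auto`.

```python
def should_prune(mode, started_empty, has_file_changes, num_parents):
    """Decide whether a filtered commit should be dropped.

    mode             -- 'always', 'auto' or 'never'
    started_empty    -- commit had no file changes in the original repo
    has_file_changes -- commit still has file changes after filtering
    num_parents      -- number of parents remaining after pruning
    """
    if mode == 'never':
        return False
    if has_file_changes or num_parents > 1:
        return False  # still has changes, or still a merge commit
    if mode == 'always':
        return True   # assumed: prune any empty non-merge commit
    # 'auto': keep commits that were empty to begin with, unless they no
    # longer have any parent at all.
    return (not started_empty) or num_parents == 0
```

The merge-related cases in the list (losing parents, one parent being an ancestor of another) happen before this check, when the parent list itself is recomputed.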
+  Advantages over git-repo-filter:
+    - Filters every file once per revision even if unmodified between commits;
+      allows filtering differently for different commits.
+    
+    
+  
+bfg repo-cleaner
+  Ease-of-use differences:
+    Automatic repack and shrink repo (instead of documenting extra steps)
+    No stupid 'fix your current branch first manually, then run'
+    Pathname inclusion, not just exclusion
+    Full pathname matching, instead of just *basename* (globs for basename)
+
+  Capability differences:
+    Prunes commits which become empty due to filtering
+    Lots of general filtering options outside of removing a few big files
+
+  Advantages of BFG repo cleaner:
+    - Very focused on just removing crazy big files, and sensitive data
+    -
--- a/performance-notes.txt
+++ b/performance-notes.txt
@ -0,0 +1,65 @@
+rails (git clone https://github.com/rails/rails)
+  Timings of:  time git repo-filter --invert-paths --path pushgems.rb
+
+  64.098   Starting point
+  35.914   After using bufsize=-1 on output only subprocess stuff
+  27.777   After removing fi_input/fi_output write/read for sha1sum mapping
+  20.980   After removing fi_input/fi_output write/read for check_merge_if_empty
+
+Other important factors:
+
+  Am I calling is_ancestor too much?  (Only call with pruned parents)
+  Unnecessary re-computation of 'epoch' (calling fromtimestamp)
+  Excessive calls to re.compile
+  Why is posix.waitpid so long?
+  Can parse_user be sped up by if..endswith rather than try..except?
+  Memoize filename remapping in order to speed up tweak_commit?
+
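The memoization question above could be tried with a plain dictionary cache around the renaming function. The sketch below assumes renaming is a pure function of the old path; `make_memoized_renamer` and the `rename` callable are hypothetical names, not taken from the script.

```python
def make_memoized_renamer(rename):
    """Wrap rename(path) so each distinct path is computed only once."""
    cache = {}

    def newname(path):
        try:
            return cache[path]
        except KeyError:
            # First time this path is seen: do the real work and remember
            # the result for every later commit touching the same file.
            result = cache[path] = rename(path)
            return result

    return newname
```

Since the same filenames recur across many commits, the cache hit rate should be high, at the cost of holding one entry per distinct path in memory.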
+   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
+    83488    1.830    0.000   19.022    0.000 git-repo-filter:989(_parse_commit)
+    33314    1.192    0.000    1.650    0.000 git-repo-filter:123(is_ancestor)
+   997617    1.108    0.000    1.108    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
+  1020663    1.083    0.000    1.083    0.000 {method 'readline' of 'file' objects}
+   334486    0.995    0.000    1.535    0.000 {built-in method fromtimestamp}
+  1081102    0.985    0.000    0.991    0.000 re.py:230(_compile)
+    83476    0.902    0.000    3.564    0.000 git-repo-filter:490(dump)
+       11    0.855    0.078    0.855    0.078 {posix.waitpid}
+   417904    0.803    0.000    2.685    0.000 git-repo-filter:807(_parse_optional_filechange)
+   167255    0.640    0.000    1.066    0.000 git-repo-filter:56(__init__)
+   167255    0.586    0.000    3.186    0.000 git-repo-filter:871(_parse_user)
+   997598    0.560    0.000    2.589    0.000 re.py:138(match)
+  1284279    0.529    0.000    0.529    0.000 {method 'write' of 'file' objects}
+   167231    0.492    0.000    1.629    0.000 git-repo-filter:42(_write_date)
+   668972    0.485    0.000    0.485    0.000 git-repo-filter:83(dst)
+    83488    0.463    0.000    1.006    0.000 git-repo-filter:2255(tweak_commit)
+   334394    0.428    0.000    0.654    0.000 git-repo-filter:410(dump)
+    83488    0.417    0.000    0.674    0.000 collections.py:50(__init__)
+  1020663    0.408    0.000    1.492    0.000 git-repo-filter:766(_advance_currentline)
+    83488    0.353    0.000    0.377    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
+   334416    0.331    0.000    0.439    0.000 git-repo-filter:381(__init__)
+  1796776    0.304    0.000    0.304    0.000 {method 'startswith' of 'str' objects}
+        1    0.271    0.271   19.961   19.961 git-repo-filter:1367(run)
+   100432    0.260    0.000    0.618    0.000 git-repo-filter:784(_parse_optional_parent_ref)
+   334416    0.254    0.000    0.497    0.000 git-repo-filter:2267(newname)
+
+
+Python commands:
+
+  $ python -m cProfile -o repo-filter.profile \
+        ~/floss/git-repo-filter/git-repo-filter \
+        --invert-paths --path pushgems.rb
+
+  Just showing basic stats ('cumtime' and 'tottime' seem to be what matter):
+    import pstats
+    p = pstats.Stats('repo-filter.profile')
+    p.strip_dirs().sort_stats('cumtime').print_stats()
+
+  Writing to some other string instead of stdout:
+    import cStringIO
+    a = cStringIO.StringIO()
+    p = pstats.Stats('repo-filter.profile', stream=a)
+    p.strip_dirs().sort_stats('tottime').print_stats()
+
+  Get various data out of the written output
+    lines = a.getvalue().splitlines()[7:-2]
+    sum(float(line.split(None, 5)[1]) for line in lines)
+    print('\n'.join(' '.join(line.split(None, 5)[1:6:4]) for line in lines))
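The commands above are written for Python 2 (`cStringIO`). A Python 3 flavoured sketch of the same idea follows, profiling a stand-in function in-process instead of loading `repo-filter.profile`; `busy_work` is purely illustrative, and the rows are located by finding the column header rather than by fixed slice offsets.

```python
import cProfile
import io
import pstats


def busy_work():
    """Stand-in workload so the example has something to profile."""
    return sum(i * i for i in range(10000))


profiler = cProfile.Profile()
profiler.runcall(busy_work)

# Send the report to a string buffer instead of stdout.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats('tottime').print_stats()
report = buffer.getvalue()

# Table rows are the non-blank lines after the 'ncalls ...' header line.
lines = report.splitlines()
header_index = next(i for i, text in enumerate(lines)
                    if text.lstrip().startswith('ncalls'))
rows = [text for text in lines[header_index + 1:] if text.strip()]

# Second column is tottime; summing it gives total time spent in the
# listed functions themselves.
total_tottime = sum(float(row.split(None, 5)[1]) for row in rows)
```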
--- a/rfc-letter.txt
+++ b/rfc-letter.txt
@ -0,0 +1,183 @@
+----- Short version -----
+
+As suggested by Ævar[1], I am proposing git repo-filter for inclusion
+in git.git.  I hope that my documentation included in the repo-filter
+repository[2] can answer questions you have about it; if it does not,
+that may indicate I need to supplement its documentation.  However, I
+am happy to answer any and all questions you may have about the tool;
+fire away.
+
+
+Basic Info:
+
+git repo-filter is a tool for rewriting history that includes some
+capabilities I have not found anywhere else.  It is most similar to
+filter-branch, though it has a significantly different taste in
+usability.  Also, being based on fast-export/fast-import, it is orders of
+magnitude faster (it has speed roughly comparable to BFG repo cleaner,
+but isn't multi-threaded).
+
+repo-filter is a ~2500 (FIXME) line single-file python script,
+depending only on the python standard library (and execution of git
+commands), all of which is designed to make build/installation
+trivial: you just need to copy it into your $PATH.
+
+
+[1] https://public-inbox.org/git/87r2fq3b9t.fsf@evledraar.gmail.com/
+[2] Currently tracked at https://github.com/newren/git-repo-filter,
+    but the plan would be to instead point people at git.git if it is
+    merged.  (And if it is merged, the merge should just delete its
+    antique fork of t/test-lib.sh and its README.md.)
+
+
+----- Intermediate length version -----
+
+As suggested by Ævar[1], I am proposing git repo-filter[2] for inclusion
+in git.git.  There are a few issues that make me wonder if the git
+community will want it, which I've done my best to explain and address
+below.
+
+Sorry for the lengthy email; feel free to skim for whatever bits seem
+relevant to you.
+
+
+Basic background
+----------------
+
+git repo-filter is a tool for rewriting history.  It has a significantly
+different taste in usability than filter-branch, and being based on
+fast-export/fast-import, is orders of magnitude faster (it has speed
+roughly comparable to BFG repo cleaner, but isn't multi-threaded).  It
+includes some capabilities I have not found anywhere else.
+
+
+Important inclusion information
+-------------------------------
+
+  1. Build: No special build rules required; it's a single-file script
+            to simplify build/installation.  Its only dependencies are
+            git and python.  This python script only uses the python
+            standard library, so no extra python packages are needed.
+
+  2. Tests: (FIXME) git-style end-to-end tests (using an ancient fork of
+            test-lib.sh from git.git) are in use, making the inclusion
+            into git trivial.  There are also some python-style unit
+            tests, though these are also invoked from a test in the
+            end-to-end suite so no additional tooling is needed.
+
+  3. Documentation: (FIXME) Built-in help and git-style asciidoc man-page
+                    already included.
+
+
+Possible reasons to exclude from git.git
+----------------------------------------
+
+  1. Portability: repo-filter is written in Python, which I've heard
+     is difficult for some platforms where git is run.
+
+  2. Maintainability/EOL decisions: repo-filter is (currently) written
+     in Python 2 rather than Python 3.
+
+  3. User story: Since repo-filter will not and can not be backward
+     compatible to filter-branch, we inevitably would have two tools
+     for rewriting history.  Some may see that as confusing to users,
+     especially since I didn't just implement a slightly different
+     feature set: I fixed usability warts by changing a few basic
+     underlying assumptions.
+
+
+Counter-arguments against exclusion
+-----------------------------------
+
+  1) Portability:
+
+     1a) repo-filter only uses the python standard library, simplifying
+         the porting story significantly.
+     1b) repo-filter is a single file script.  While it is even longer than
+         git-send-email.perl, putting it on the big side, this does mean
+         no special build instructions are needed.
+     1c) repo-filter is not a daily-use tool, nor is it a collaboration
+         tool.  It's a tool that one person on your team uses once in
+         maybe five years, then shares the results with everyone once.  Thus,
+         portability to esoteric platforms is perhaps less critical than it
+         is for other components of git.
+
+  2) *shrug*.  repo-filter was started by importing git-fast-filter[3]
+     (which was in Python 2), and I haven't bothered porting.  I have often
+     worked with older enterprise distros, so I am a bit of a laggard with
+     the Python 3 transition.  If others find this worrisome, I can work on
+     porting.
+
+  3) I've already made this email too long so I'll summarize; let me
+     know if you want more detail.  In short: repo-filter enables
+     usage on repositories for which filter-branch is just completely
+     impractical, and also has new capabilities that I cannot even
+     emulate within filter-branch.  But it's more than just that.
+     While filter-branch is a nifty easy-to-use tool for a few very
+     simple cases and has enough versatility to sometimes handle more
+     complex cases, the complexity increases rapidly and some of
+     the underlying assumptions make for greater user confusion and/or
+     cause problems in trying to use several different features for
+     the same filtering operation.  As such, I think a tool designed
+     for larger filtering operations or less sophisticated users of
+     necessity needs to change some basic things about how
+     filter-branch operates, which implies it must be a new different
+     tool.
+
+
+So...thoughts?
+
+Thanks,
+Elijah
+
+
+[1] https://public-inbox.org/git/87r2fq3b9t.fsf@evledraar.gmail.com/
+[2] Currently tracked at https://github.com/newren/git-repo-filter,
+    but the plan would be to instead point people at git.git if it's
+    included.
+[3] https://public-inbox.org/git/51419b2c0904072035u1182b507o836a67ac308d32b9@mail.gmail.com/
+
+
+
+
+
+Background:
+  Desire to combine, split-apart, or clean up repositories
+  Examples: pgdev, nucleus, willamette
+Example, want:
+  Only certain paths (a specific directory)
+  move into a subdirectory
+  rename tags to not conflict
+Filter-branch command (takes 65.950 seconds, or 15.594 seconds):
+
+
+  time git filter-branch --tree-filter 'mkdir -p modules && git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -f -q && ls -d * | grep -v modules | xargs -I files mv files modules/' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
+
+
+Faster version (takes 37.802 seconds, or 6.287 seconds):
+
+
+  time git filter-branch --index-filter 'git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&modules/-" | git update-index --index-info; git ls-files | grep -v ^modules/ | xargs -r git rm -q --cached' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
+
+
+Caveats:
+  Really complicated to come up with
+  Googled solutions may be subtly os- or case- specific (sed, xargs, '*' above)
+  (I know git & bash & gnu vs. bsd, fixed filter-branch, etc.)
+Error Prone:
+  mixing old and new history
+  safety -- how to restore (refs/original hard; annotated tags may be missing)
+  pruning of empty commits overeager
+Painful, but possible:
+  selecting stuff to keep (as opposed to removing)
+  renaming files
+  figuring out what to remove (--analyze)
+  shrinking (man-page is misleading...)
+Limiting:
+  speed
+  commit message rewriting
+Compare:
+
+  git repo-filter --analyze
+  
+  time git repo-filter --path src/main/java/com/palantir/annotation --subdirectory-filter modules