Various todo-related files

todo
Elijah Newren 6 years ago
commit caffe46d77

197
TODO

@ -0,0 +1,197 @@
Before widely announcing:
- Notes on splitting
- exporter needs to know the pipe combination for commit message rewriting
- commit message rewriting gets weird if commits held in memory for later
- pruning gets weird too
- need to handle 1 export -> 2 imports
- Test setup
- Add several more tests, particularly around:
- commit pruning
- pruning commits that become empty
- pruning commits that started empty and have no parent
- not pruning commits that have changes or remain a merge commit
- pruning parent(s) of a merge
- coalescing common commits of a merge
- coalescing parents of a merge when one is an ancestor of the other
- ref pruning
- tags pointing at commits which are pruned along with their history
- refs pointing at commits which are pruned along with their history
- refs or tags behind a negative revision specification
- commit message rewriting
- renaming, particularly when it causes collisions
- use coverage.py to direct test writing
- Check whether the version of git in use supports the appropriate flags
- Rewrite history
- Remove tests from older commits until they would actually work
Generate upstream patches:
- Tags of tags of commits fail to export:
- In git.git, try:
$ git fast-export --no-data --use-done-feature --signed-tags=strip \
--tag-of-filtered-object=rewrite-feature v1.0rc1 >/dev/null
fatal: tag 5f4cd4ca015dc795b9f7f4fed11b3f80a60ac175 tags unexported tag!
Bigger ideas
- 1st step, create local branches for each remote tracking branch:
git fetch . refs/remotes/origin/*:refs/heads/*
also, nuke refs/remotes/origin/*; it won't match upstream anyway
- Performance:
- Smarter record_remapping -- do it lazily
- Unnecessary re-computation of 'epoch' (calling fromtimestamp)
...and perhaps just unnecessary use of FixedTimeZone when most of the time
it will not be checked or modified?
- What part of _parse_commit takes so much time?
- What part of commit.dump takes so much time?
- Speedup _parse_optional_filechange using str.split(None, 3) instead of re
- Which wait() are we waiting on?
- Smarter become-empty checks; only do more expensive checks if:
- First parent is no longer original first parent or ancestor thereof
- e.g. first-parent history empty, second parent becomes first parent
- e.g. --parent-filter causes some kind of graft operation (although
maybe we don't want to prune in this case anyway...)
- Blob filtering is active AND the only file_changes involved correspond
to filenames that have previously been modified.
- Regex optimization
- memoize (or just outright store?) filename remapping
- memoize net result: dequote -> do mods -> requote
- Work with submodules
- Important features
- paths-from-file (--paths-from-file <(git ls-tree -r HEAD))
- include-old-names-of-specified-files
- so users don't have to look for rename data from --analyze
- Do git rev-list --count to get idea of amount of work; show progress
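The `str.split(None, 3)` idea for `_parse_optional_filechange` in the Performance list can be sketched roughly as below. This is only an illustration, not the tool's actual parser: `parse_modify` and `parse_modify_re` are hypothetical helpers, only `M` (modify) lines of a fast-export stream are handled, and paths are assumed not to be quoted.

```python
import re

# Regex-based parsing, roughly the kind of code the note proposes replacing.
MODIFY_RE = re.compile(r'M (\d+) (\S+) (.*)\n$')

def parse_modify_re(line):
    """Parse 'M <mode> <dataref> <path>' with a regular expression."""
    m = MODIFY_RE.match(line)
    return m.groups() if m else None

def parse_modify(line):
    """Parse the same line with str.split, avoiding the regex engine."""
    if not line.startswith('M '):
        return None
    # maxsplit=3 leaves everything after the third field as the path.
    fields = line.rstrip('\n').split(None, 3)
    if len(fields) != 4:
        return None
    _, mode, ref, path = fields
    return mode, ref, path
```

Whether this is actually faster on a real repository would need to be measured with the cProfile recipe in these notes.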
Left over bits:
- Fix up --analyze
* shouldn't allow running --analyze with negative refspecs
* add a --no-detect-renames option (for performance)
- metadata
- On second and subsequent runs, update metadata instead of overwriting
- for maps, give beginning_hash -> end_hash, not intermediate hashes
- OR error out if .git/repo-filter already created?
- error out if any progress messages in stream (can't deal with them unless
we can pass --cat-blob-fd to fast-import, and that seems non-portable)
More path stuff, maybe
--path-rename-regex
--path-stream-rename (invoked once; must read one line then print)
--path-stream-filter (invoked once per commit with new files)
--path-tree-filter
Ref stuff
--ref-rename
--ref-stream-rename
Blob filter
--tree-filter
Safety stuff
--keep-excluded-revisions
--keep-excluded-refs
--store-backup
--empty-pruning={no/off,auto,always/on}
--negative-refs={drop,reference}
Other things:
- add a filename_callback too, for just editing file names
- add --skip-cleanup (pruning, gc, etc.; keep reset --hard) for speed compare
- get rid of user-run fast-export & fast-import; don't want to have to
update two callsites.
Performance notes:
* On rails:
* 1) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all >/dev/null
* 2) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all >saved_output
* 3a) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all \
| sed -e s/+051800/+0261/ >/dev/null
* 3b) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all \
| stupid.py >/dev/null
* 4) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all \
| sed -e s/+051800/+0261/ \
| git fast-import --force --quiet >/dev/null
* 5) time git repo-filter --invert-paths --path pushgems.rb
(with early quit right before removing unused refs)
* 6) time python -m cProfile -o repo-filter.profile \
~/floss/git-repo-filter/git-repo-filter \
--invert-paths --path pushgems.rb
* 7) time java -jar ~/Downloads/bfg-1.13.0.jar --delete-files pushgems.rb
1: 3.910 fast-export
2: 3.958 fast-export + save output
3: 4.128 fast-export + sed (but toss output)
3a: 4.234 fast-export + python stdin using 'for' iterator
3b: 4.189 fast-export + python stdin using readline
3c:27.796 fast-export + python from subprocess using readline
3d: 4.196 fast-export + python from subprocess using 'for' iterator
3e: 4.580 fast-export + python3 from subprocess using readline
3f: 5.334 fast-export + python3 from subprocess using 'for' iterator
3g: 4.264 fast-export + python from subprocess using readline & bufsize
4: 11.279 fast-export + sed + fast-import
5: 64.098 filter-repo
5: 35.914 filter-repo, after bufsize=-1 for subprocess stuff
6: 69.150 filter-repo run under cProfile
7: 20.155 bfg
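The difference between rows 3c and 3g, and between the two rows labelled 5, appears to come from pipe buffering in `subprocess`. The sketch below shows where the `bufsize` argument goes. It assumes Python 3, where `bufsize=-1` is already the default (Python 2 defaulted to `0`, i.e. unbuffered), and it uses a child Python process as a stand-in for `git fast-export`; `count_lines` is a hypothetical helper.

```python
import subprocess
import sys

def count_lines(bufsize):
    """Spawn a child that prints 1000 lines and read them through a pipe."""
    child = [sys.executable, '-c', 'for i in range(1000): print(i)']
    proc = subprocess.Popen(child, stdout=subprocess.PIPE, bufsize=bufsize)
    count = 0
    for _line in proc.stdout:  # 'for' iteration over the pipe, as in row 3d
        count += 1
    proc.stdout.close()
    proc.wait()
    return count
```

Calling `count_lines(0)` and `count_lines(-1)` under `time` (or `timeit`) with a much larger line count is one way to check how much buffering matters on a given platform.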
Other Notes:
* cProfile:
python -m cProfile -o repo-filter.profile \
~/floss/git-repo-filter/git-repo-filter \
--invert-paths --path pushgems.rb
python
>>> import pstats
>>> p = pstats.Stats('repo-filter.profile')
>>> p.strip_dirs().sort_stats('cumtime').print_stats()
* reports 64.2% of time in readline()
* reports 37.0% of time under _advance_currentline
Argument parsing stuff:
# NOT YET IMPLEMENTED OPTIONS BELOW
misc.add_argument('--empty-pruning', choices=['always', 'auto', 'never'],
default='auto',
help='''The default, auto, will check if filtering
causes commits to become empty (have no file
changes and only have one parent) and prune them
if so. This pruning can also cause merge
commits to have fewer parents and possibly
become empty themselves, and thus be pruned.
Further, any branch or tag whose entire history
is pruned due to becoming empty will be pruned.
However, auto will not prune commits which
started out empty in the original repo and have
a non-pruned parent.''')
misc.add_argument('--store-backup', default=None,
metavar='NAMESPACE', dest='backup',
help='Store a copy of original refs under refs/NAMESPACE/')
misc.add_argument('--keep-excluded-refs', action='store_true',
help='''If refs are excluded either explicitly (e.g.
^master) or implicitly (e.g. a branch in the
history of an excluded ref/revision, or a branch
not listed in the set of revisions to filter),
then that ref will be deleted by the filtering
process. Use --keep-excluded-refs to retain
such refs.''')
misc.add_argument('--keep-excluded-revisions', action='store_true',
help='''If negative revisions are provided to exclude
the range of history we are filtering over (e.g.
negative_branch..master or ^negative_branch_1
^negative_branch_2 master develop), then by
default any commits in the history of those
revisions are excluded from the filtered history
(resulting in the first not-excluded commit in
history becoming a root commit and often
containing an unusually large number of file
changes). With --keep-excluded-revisions, those
commits are all retained (in their unfiltered
form).''')
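The `--empty-pruning` help text above describes a decision rule that can be sketched as a small function. This is one reading of that help text, not the tool's implementation: `should_prune` and all of its parameters are hypothetical, and the treatment of root commits (zero parents) is an assumption based on the "started empty and have no parent" test item in the TODO list.

```python
def should_prune(policy, orig_file_changes, new_file_changes,
                 num_parents, parent_was_pruned):
    """Decide whether a commit should be dropped after filtering.

    policy is 'always', 'auto' or 'never', mirroring the option's choices.
    """
    if policy == 'never':
        return False
    # "Empty" here means no file changes and at most one parent (not a merge).
    is_empty = new_file_changes == 0 and num_parents <= 1
    if not is_empty:
        return False
    if policy == 'always':
        return True
    # policy == 'auto': prune commits that BECAME empty due to filtering...
    if orig_file_changes > 0:
        return True
    # ...but keep commits that started out empty and still have a live parent.
    return num_parents == 0 or parent_was_pruned
```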

@ -0,0 +1,96 @@
#!/bin/bash
if [[ $# -lt 2 || $# -gt 3 ]]; then
echo "Syntax:"
echo " $0 REPO1 REPO2 [--summary]"
exit 1
fi
repo1="$1"
repo2="$2"
detail=1
if [ $# == 3 ]; then
if [ "$3" != "--summary" ]; then
echo "Unrecognized argument: $3"
exit 1
fi
detail=
fi
if ( ! (cd "$repo1" && git rev-parse --git-dir > /dev/null) ); then
echo "$repo1 is not a directory or does not have a git repository!"
exit 1
fi
if ( ! (cd "$repo2" && git rev-parse --git-dir > /dev/null) ); then
echo "$repo2 is not a directory or does not have a git repository!"
exit 1
fi
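tempfile=$(mktemp)
# Remove the temporary file on every exit path, including the early "exit 0"
trap 'rm -f "$tempfile"' EXIT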
#
# Compare branches for identicalness
#
diff -u <(cd "$repo1" && git show-ref -h --heads --tags) <(cd "$repo2" && git show-ref -h --heads --tags) > $tempfile
if [ $? != 0 ]; then
echo -n "Branches & tags do not match"
if test $detail; then
echo "; differences:"
cat $tempfile
else
echo "."
fi
else
echo "* Branches and tags match exactly"
exit 0
fi
#
# Compare branch names
#
diff -u <(cd "$repo1" && git for-each-ref --format="%(refname)" | grep refs/heads/) <(cd "$repo2" && git for-each-ref --format="%(refname)" | grep refs/heads/) > $tempfile
if [ $? != 0 ]; then
echo -n "Branch names do not match"
if test $detail; then
echo "; differences:"
cat $tempfile
else
echo "."
fi
else
echo "* Branch names match"
fi
#
# Compare trees of branches
#
diff -u <(cd "$repo1" && git rev-parse $(git for-each-ref --format="%(refname)" | grep refs/heads/ | sed -e s/$/^{tree}/)) <(cd "$repo2" && git rev-parse $(git for-each-ref --format="%(refname)" | grep refs/heads/ | sed -e s/$/^{tree}/)) > $tempfile
if [ $? != 0 ]; then
echo -n "Trees of branches do not match"
if test $detail; then
echo "; differences:"
cat $tempfile
else
echo "."
fi
else
echo "* Trees of branches match"
fi
#
# Compare number of commits on each branch
#
diff -u <(cd "$repo1" && for i in $(git for-each-ref --format="%(refname)" | grep refs/heads/); do count=$(git rev-list $i | wc -l); printf "%5d %s\n" $count $i; done) <(cd "$repo2" && for i in $(git for-each-ref --format="%(refname)" | grep refs/heads/); do count=$(git rev-list $i | wc -l); printf "%5d %s\n" $count $i; done) > $tempfile
if [ $? != 0 ]; then
echo -n "Branch commit counts do not match"
if test $detail; then
echo "; differences:"
cat $tempfile
else
echo "."
fi
else
echo "* Branch commit counts match"
fi
rm $tempfile
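The first comparison in the script above (branches and tags matching exactly) could also be expressed in Python, which makes the "what differs" report easier to structure. This is a rough sketch under stated assumptions: `parse_show_ref`, `read_refs` and `diff_refs` are hypothetical names, and `read_refs` needs git installed and a real repository, so only the pure functions are exercised without one.

```python
import subprocess

def parse_show_ref(text):
    """Map refname -> object id from 'git show-ref' output."""
    refs = {}
    for line in text.splitlines():
        if line.strip():
            sha, name = line.split(None, 1)
            refs[name] = sha
    return refs

def read_refs(repo):
    """Run the same show-ref command as the script inside 'repo'."""
    out = subprocess.run(['git', 'show-ref', '-h', '--heads', '--tags'],
                         cwd=repo, check=True,
                         capture_output=True, text=True).stdout
    return parse_show_ref(out)

def diff_refs(refs1, refs2):
    """Return (only in first, only in second, present in both but different)."""
    only1 = sorted(set(refs1) - set(refs2))
    only2 = sorted(set(refs2) - set(refs1))
    changed = sorted(n for n in set(refs1) & set(refs2)
                     if refs1[n] != refs2[n])
    return only1, only2, changed
```

For two clones this would be used as `diff_refs(read_refs(repo1), read_refs(repo2))`; three empty lists correspond to the script's "Branches and tags match exactly" case.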

@ -0,0 +1,102 @@
Background:
Desire to combine, split-apart, or clean up repositories
Examples: pgdev, nucleus, willamette
Example, want:
Only certain paths (a specific directory)
move into a subdirectory
rename tags to not conflict
Filter-branch command (takes 65.950 seconds, or 15.594 seconds):
time git filter-branch --tree-filter 'mkdir -p modules && git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -f -q && ls -d * | grep -v modules | xargs -I files mv files modules/' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
Faster version (takes 37.802 seconds, or 6.287 seconds):
time git filter-branch --index-filter 'git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&modules/-" | git update-index --index-info; git ls-files | grep -v ^modules/ | xargs -r git rm -q --cached' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
Caveats:
Really complicated to come up with
Googled solutions may be subtly os- or case- specific (sed, xargs, '*' above)
(I know git & bash & gnu vs. bsd, fixed filter-branch, etc.)
Error Prone:
mixing old and new history
safety -- how to restore (refs/original hard; annotated tags may be missing)
pruning of empty commits overeager
Painful, but possible:
selecting stuff to keep (as opposed to removing)
renaming files
figuring out what to remove (--analyze)
shrinking (man-page is misleading...)
Limiting:
speed
commit message rewriting
Compare:
git repo-filter --analyze
time git repo-filter --path src/main/java/com/palantir/annotation --subdirectory-filter modules
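The three wishes in the example (keep one directory, move it into a subdirectory, prefix the tags) reduce to two small mappings once the shell plumbing is stripped away. The sketch below only illustrates that logic; `filter_path` and `rename_tag` are hypothetical helpers, not functions of either tool.

```python
def filter_path(path, keep_prefix, subdir):
    """Return the rewritten path, or None if the file should be dropped."""
    prefix = keep_prefix.rstrip('/')
    if path != prefix and not path.startswith(prefix + '/'):
        return None
    # Everything that survives is moved under 'subdir'.
    return subdir.rstrip('/') + '/' + path

def rename_tag(refname, prefix):
    """Prefix tag names so they do not collide after combining repos."""
    tags = 'refs/tags/'
    if refname.startswith(tags):
        return tags + prefix + refname[len(tags):]
    return refname
```

With `keep_prefix='src/main/java/com/palantir/annotation'`, `subdir='modules'` and `prefix='table-helper-'`, these mirror what the filter-branch commands above express through `git ls-files`, `sed` and `--tag-name-filter`.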
**********************************************************************
Before demo tomorrow:
Submit git patch
Come up with basic demo and what to discuss
issues:
common:
no up-front report to help find what to remove
painful to select things to keep
shrinking is extra painful step
git-filter-branch issues:
doesn't rewrite commit messages
slow
mixes old and new history (& needs help to remove big objects)
pruning of empty commits is possible but overbearing hammer
painful to rename
safety: if using '--tag-name-filter cat', annotated tags NOT backed up
bfg:
cannot rename
does not prune empty commits
git-filter-branch:
65.950 time git filter-branch --tree-filter 'mkdir -p modules && git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -f -q && ls -d * | grep -v modules | xargs -I files mv files modules/' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
37.802 time git filter-branch --index-filter 'git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&modules/-" | git update-index --index-info; git ls-files | grep -v ^modules/ | xargs git rm -q --cached' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
time git clone ../whatever newcopy
du -ks .git
git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/foo- | git update-ref --stdin
time git gc --prune=now
du -ks .git
0.660 time git repo-filter --path src/main/java/com/palantir/annotation --path-rename :modules/
git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/foo- | git update-ref --stdin
git for-each-ref --format="delete %(refname)" refs/original/ | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now
BFG:
bfg --delete-from <(git rev-list --objects --all | awk {print\$2} | grep -v ^$ | sort | uniq | grep -v $DIR_OF_INTEREST)
git fetch . refs/tags/*:refs/tags/foo-*
git show-ref --tags | awk {print\$2} | grep -v refs/tags/foo- | sed -e 's/^/delete /' | git update-ref --stdin
git reflog expire --expire-unreachable=now
git gc --prune=now
***************************************************************************
rails:
5252.036 time git filter-branch --tree-filter 'rm -f pushgems.rb' --tag-name-filter cat -- --all
1962.735 time git filter-branch --index-filter 'git rm --cached --ignore-unmatch pushgems.rb' --tag-name-filter cat -- --all
39.715 time git repo-filter --invert-paths --path pushgems.rb
33.169 <same, but with early exit>
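To put the rails timings above in proportion, the ratios can be computed directly from the listed numbers. The snippet is plain arithmetic on the figures in these notes; `speedups` is a hypothetical helper name.

```python
# Wall-clock seconds copied from the rails timings in these notes.
timings = {
    'filter-branch --tree-filter': 5252.036,
    'filter-branch --index-filter': 1962.735,
    'repo-filter': 39.715,
}

def speedups(timings, baseline='repo-filter'):
    """Return how many times slower each entry is than the baseline."""
    base = timings[baseline]
    return {name: secs / base
            for name, secs in timings.items() if name != baseline}

for name, ratio in speedups(timings).items():
    print('%-30s %6.1fx slower than repo-filter' % (name, ratio))
```

The ratios come out at roughly 132x for `--tree-filter` and roughly 49x for `--index-filter`.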

@ -0,0 +1,48 @@
git-filter-branch
Ease-of-use differences:
Easier path selection and renaming
Rewrite sha1sums (and abbreviations) in commit messages
Defaults to pruning empty commits (but only BECOME empty commits)
- (Technical notes, on kinds of empty:
- Empty due to blob filtering resulting in later patch becoming empty
- Empty due to path filtering
- Empty branch causing merge to lose parent(s) -- 3 styles
- One or more parents had no changes themselves or in their history
- Most recent non-empty commit on all branches was either the
merge-base or an ancestor (i.e. keeping the merge commit would
mean merging a commit with itself)
- Most recent non-empty commit on one parent's side of history is
an ancestor of another parent (i.e. that side no longer has any
interesting changes, and the parent corresponding to the empty
side should be removed)
- Empty ref due to entire history before it being empty
Deletes stuff not requested in the rewrite (unless overridden), so that
it doesn't confuse user or accidentally get re-pushed
Typically far faster to execute
Bails if not in a clean clone by default
- Users have a far easier time restoring if they can just nuke the clone
- Avoids the default need for users to mess with backups of original refs
(either for restoration, or for pruning to make sure repo is clean)
Repacks and shrinks repo for you (unless overridden)
- Makes it easier to ensure you've cleaned out unwanted stuff
Advantages over git-repo-filter:
- Filters every file once per revision even if unmodified between commits;
allows filtering differently for different commits.
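The merge-related cases in the technical notes above (a merge ending up with the same commit twice as a parent, or with one parent that is an ancestor of another) can be illustrated with a toy history. This is a sketch of the idea only: `simplify_parents`, `is_ancestor` and the `PARENTS` graph are hypothetical, and the real tool may apply additional rules, for example around the first parent.

```python
# Toy history: A is the root, B follows A, C follows B, D branches off A.
PARENTS = {'A': [], 'B': ['A'], 'C': ['B'], 'D': ['A']}

def is_ancestor(a, b):
    """True if commit a is a strict ancestor of commit b in PARENTS."""
    stack, seen = list(PARENTS[b]), set()
    while stack:
        c = stack.pop()
        if c == a:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return False

def simplify_parents(parents):
    """Drop duplicate parents and parents that are ancestors of other parents."""
    unique = list(dict.fromkeys(parents))  # de-duplicate, keep order
    return [p for p in unique
            if not any(q != p and is_ancestor(p, q) for q in unique)]
```

If the result has a single parent and the commit has no file changes of its own, the commit is a candidate for pruning as a become-empty commit.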
bfg repo-cleaner
Ease-of-use differences:
Automatic repack and shrink repo (instead of documenting extra steps)
No stupid 'fix your current branch first manually, then run'
Pathname inclusion, not just exclusion
Full pathname matching, instead of just *basename* (globs for basename)
Capability differences:
Prunes commits which become empty due to filtering
Lots of general filtering options outside of removing a few big files
Advantages of BFG repo cleaner:
- Very focused on just removing crazy big files, and sensitive data
-

@ -0,0 +1,65 @@
rails (git clone https://github.com/rails/rails)
Timings of: time git repo-filter --invert-paths --path pushgems.rb
64.098 Starting point
35.914 After using bufsize=-1 on output only subprocess stuff
27.777 After removing fi_input/fi_output write/read for sha1sum mapping
20.980 After removing fi_input/fi_output write/read for check_merge_if_empty
Other important factors:
Am I calling is_ancestor too much? (Only call with pruned parents)
Unnecessary re-computation of 'epoch' (calling fromtimestamp)
Excessive calls to re.compile
Why is posix.waitpid so long?
Can parse_user be sped up by if..endswith rather than try..except?
Memoize filename remapping in order to speed up tweak_commit?
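One possible shape for the filename-remapping memoization asked about above is shown here. It is a sketch under assumptions: it uses Python 3's `functools.lru_cache` (the script profiled in these notes is Python 2, where a plain dict cache would be needed), and `RENAMES` and this `newname` are hypothetical stand-ins rather than the script's own code.

```python
from functools import lru_cache

# Hypothetical rename rules: (old_prefix, new_prefix) pairs, first match wins.
RENAMES = ((b'src/', b'modules/src/'),)

@lru_cache(maxsize=None)
def newname(path):
    """Map an old path to its new name; repeated paths hit the cache."""
    for old, new in RENAMES:
        if path.startswith(old):
            return new + path[len(old):]
    return path
```

Since the same paths recur across many commits, `newname.cache_info()` gives a quick way to see how often the cache is hit.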
ncalls tottime percall cumtime percall filename:lineno(function)
83488 1.830 0.000 19.022 0.000 git-repo-filter:989(_parse_commit)
33314 1.192 0.000 1.650 0.000 git-repo-filter:123(is_ancestor)
997617 1.108 0.000 1.108 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
1020663 1.083 0.000 1.083 0.000 {method 'readline' of 'file' objects}
334486 0.995 0.000 1.535 0.000 {built-in method fromtimestamp}
1081102 0.985 0.000 0.991 0.000 re.py:230(_compile)
83476 0.902 0.000 3.564 0.000 git-repo-filter:490(dump)
11 0.855 0.078 0.855 0.078 {posix.waitpid}
417904 0.803 0.000 2.685 0.000 git-repo-filter:807(_parse_optional_filechange)
167255 0.640 0.000 1.066 0.000 git-repo-filter:56(__init__)
167255 0.586 0.000 3.186 0.000 git-repo-filter:871(_parse_user)
997598 0.560 0.000 2.589 0.000 re.py:138(match)
1284279 0.529 0.000 0.529 0.000 {method 'write' of 'file' objects}
167231 0.492 0.000 1.629 0.000 git-repo-filter:42(_write_date)
668972 0.485 0.000 0.485 0.000 git-repo-filter:83(dst)
83488 0.463 0.000 1.006 0.000 git-repo-filter:2255(tweak_commit)
334394 0.428 0.000 0.654 0.000 git-repo-filter:410(dump)
83488 0.417 0.000 0.674 0.000 collections.py:50(__init__)
1020663 0.408 0.000 1.492 0.000 git-repo-filter:766(_advance_currentline)
83488 0.353 0.000 0.377 0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
334416 0.331 0.000 0.439 0.000 git-repo-filter:381(__init__)
1796776 0.304 0.000 0.304 0.000 {method 'startswith' of 'str' objects}
1 0.271 0.271 19.961 19.961 git-repo-filter:1367(run)
100432 0.260 0.000 0.618 0.000 git-repo-filter:784(_parse_optional_parent_ref)
334416 0.254 0.000 0.497 0.000 git-repo-filter:2267(newname)
Python commands:
$ python -m cProfile -o repo-filter.profile \
~/floss/git-repo-filter/git-repo-filter \
--invert-paths --path pushgems.rb
Just showing basic stats ('cumtime' and 'tottime' seem to be what matter):
import pstats
p = pstats.Stats('repo-filter.profile')
p.strip_dirs().sort_stats('cumtime').print_stats()
Writing to some other string instead of stdout:
import cStringIO
a = cStringIO.StringIO()
p = pstats.Stats('repo-filter.profile', stream=a)
p.strip_dirs().sort_stats('tottime').print_stats()
Get various data out of the written output
lines = a.getvalue().splitlines()[7:-2]
sum(float(line.split(None, 5)[1]) for line in lines)
print('\n'.join(' '.join(line.split(None, 5)[1:6:4]) for line in lines))
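The commands above are Python 2 (`cStringIO`). A roughly equivalent Python 3 sketch follows, assuming `io.StringIO` as the replacement and profiling a small stand-in function instead of loading `repo-filter.profile`. `busy_work`, `profile_to_text` and `total_tottime` are hypothetical names, and the header row is located by searching for `ncalls` instead of relying on fixed line offsets.

```python
import cProfile
import io
import pstats

def busy_work(n=20000):
    """Stand-in workload to have something to profile."""
    return sum(i * i for i in range(n))

def profile_to_text(func):
    """Profile func and return the pstats report as a string."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.strip_dirs().sort_stats('tottime').print_stats()
    return buf.getvalue()

def total_tottime(text):
    """Sum the tottime column of a pstats report."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.lstrip().startswith('ncalls'))
    rows = [line for line in lines[start + 1:] if line.strip()]
    return sum(float(row.split(None, 5)[1]) for row in rows)
```

To analyze a saved profile instead, `pstats.Stats('repo-filter.profile', stream=buf)` takes the place of the `Profile` object.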

@ -0,0 +1,183 @@
----- Short version -----
As suggested by Ævar[1], I am proposing git repo-filter for inclusion
in git.git. I hope that my documentation included in the repo-filter
repository[2] can answer questions you have about it; if it does not,
that may indicate I need to supplement its documentation. However, I
am happy to answer any and all questions you may have about the tool;
fire away.
Basic Info:
git repo-filter is a tool for rewriting history that includes some
capabilities I have not found anywhere else. It is most similar to
filter-branch, though it has a significantly different taste in
usability. Also, being based on fast-export/fast-import, it is orders of
magnitude faster than filter-branch (it has speed roughly comparable to
BFG repo cleaner, but isn't multi-threaded).
repo-filter is a ~2500 (FIXME) line single-file python script,
depending only on the python standard library (and execution of git
commands), all of which is designed to make build/installation
trivial: you just need to copy it into your $PATH.
[1] https://public-inbox.org/git/87r2fq3b9t.fsf@evledraar.gmail.com/
[2] Currently tracked at https://github.com/newren/git-repo-filter,
but the plan would be to instead point people at git.git if it is
merged. (And if it is merged, the merge should just delete its
antique fork of t/test-lib.sh and its README.md.)
----- Intermediate length version -----
As suggested by Ævar[1], I am proposing git repo-filter[2] for inclusion
in git.git. There are a few issues that make me wonder if the git
community will want it; I've done my best to explain and address
them below.
Sorry for the lengthy email; feel free to skim for whatever bits seem
relevant to you.
Basic background
----------------
git repo-filter is a tool for rewriting history. It has a significantly
different taste in usability than filter-branch and, being based on
fast-export/fast-import, is orders of magnitude faster (it has speed
roughly comparable to BFG repo cleaner, but isn't multi-threaded). It
includes some capabilities I have not found anywhere else.
Important inclusion information
-------------------------------
1. Build: No special build rules required; it's a single-file script
to simplify build/installation. Its only dependencies are
git and python. This python script only uses the python
standard library, so no extra python packages are needed.
2. Tests: (FIXME) git-style end-to-end tests (using an ancient fork of
test-lib.sh from git.git) are in use, making the inclusion
into git trivial. There are also some python-style unit
tests, though these are also invoked from a test in the
end-to-end suite so no additional tooling is needed.
3. Documentation: (FIXME) Built-in help and git-style asciidoc man-page
already included.
Possible reasons to exclude from git.git
----------------------------------------
1. Portability: repo-filter is written in Python, which I've heard
is difficult for some platforms where git is run.
2. Maintainability/EOL decisions: repo-filter is (currently) written
in Python 2 rather than Python 3.
3. User story: Since repo-filter will not and cannot be backward
compatible with filter-branch, we inevitably would have two tools
for rewriting history. Some may see that as confusing to users,
especially since I didn't just implement a slightly different
feature set: I fixed usability warts by changing a few basic
underlying assumptions.
Counter-arguments against exclusion
-----------------------------------
1) Portability:
1a) repo-filter only uses the python standard library, simplifying
the porting story significantly.
1b) repo-filter is a single file script. While it is even longer than
git-send-email.perl, putting it on the big side, this does mean
no special build instructions are needed.
1c) repo-filter is not a daily-use tool, nor is it a collaboration
tool. It's a tool that one person on your team uses once in
maybe five years, then shares the results with everyone once. Thus,
portability to esoteric platforms is perhaps less critical than it
is for other components of git.
2) *shrug*. repo-filter was started by importing git-fast-filter[3]
(which was in Python 2), and I haven't bothered porting. I have often
worked with older enterprise distros, so I am a bit of a laggard with
the Python 3 transition. If others find this worrisome, I can work on
porting.
3) I've already made this email too long so I'll summarize; let me
know if you want more detail. In short: repo-filter enables
usage on repositories for which filter-branch is just completely
impractical, and also has new capabilities that I cannot even
emulate within filter-branch. But it's more than just that.
While filter-branch is a nifty easy-to-use tool for a few very
simple cases and has enough versatility to sometimes handle more
complex cases, the complexity increases rapidly and some of
the underlying assumptions make for greater user confusion and/or
cause problems in trying to use several different features for
the same filtering operation. As such, I think a tool designed
for larger filtering operations or less sophisticated users of
necessity needs to change some basic things about how
filter-branch operates, which implies it must be a new different
tool.
So...thoughts?
Thanks,
Elijah
[1] https://public-inbox.org/git/87r2fq3b9t.fsf@evledraar.gmail.com/
[2] Currently tracked at https://github.com/newren/git-repo-filter,
but the plan would be to instead point people at git.git if it's
included.
[3] https://public-inbox.org/git/51419b2c0904072035u1182b507o836a67ac308d32b9@mail.gmail.com/
Background:
Desire to combine, split-apart, or clean up repositories
Examples: pgdev, nucleus, willamette
Example, want:
Only certain paths (a specific directory)
move into a subdirectory
rename tags to not conflict
Filter-branch command (takes 65.950 seconds, or 15.594 seconds):
time git filter-branch --tree-filter 'mkdir -p modules && git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -f -q && ls -d * | grep -v modules | xargs -I files mv files modules/' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
Faster version (takes 37.802 seconds, or 6.287 seconds):
time git filter-branch --index-filter 'git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&modules/-" | git update-index --index-info; git ls-files | grep -v ^modules/ | xargs -r git rm -q --cached' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
Caveats:
Really complicated to come up with
Googled solutions may be subtly os- or case- specific (sed, xargs, '*' above)
(I know git & bash & gnu vs. bsd, fixed filter-branch, etc.)
Error Prone:
mixing old and new history
safety -- how to restore (refs/original hard; annotated tags may be missing)
pruning of empty commits overeager
Painful, but possible:
selecting stuff to keep (as opposed to removing)
renaming files
figuring out what to remove (--analyze)
shrinking (man-page is misleading...)
Limiting:
speed
commit message rewriting
Compare:
git repo-filter --analyze
time git repo-filter --path src/main/java/com/palantir/annotation --subdirectory-filter modules