Before widely announcing:
- Cleanups:
- Remove usage of 'codecs' library; PathQuoting is better
- sanity_checks() call in run() should be moved to constructor
- is_bare should check for self._args.target repo
- orig_refs is also only relevant for self._args.target repo
- Notes on splitting
- exporter needs to know the pipe combination for commit message rewriting
- commit message rewriting gets weird if commits held in memory for later
- pruning gets weird too
- need to handle both 2 exports -> 1 import, 1 export -> 2 imports,
no exports (except one created manually) -> 1 import
- Test setup
- Fix lib-oriented tests
- Add several more tests, particularly around:
- commit pruning
- pruning commits that become empty
- pruning commits that started empty and have no parent
- not pruning commits that have changes or remain a merge commit
- pruning parent(s) of a merge
- coalescing common commits of a merge
- coalescing parents of a merge when one is an ancestor of the other
- ref pruning
- tags pointing at commits which are pruned along with their history
- refs pointing at commits which are pruned along with their history
- refs or tags behind a negative revision specification
- commit message rewriting
- renaming, particularly when it causes collisions
- use coverage.py to direct test writing
- Check whether the version of git in use supports the appropriate flags
- Blob rewriting (BFG-like replacements); will need to update check-if-empty
- Rewrite history
- rename to git-repo-filter
- Change newren@palantir.com to newren@gmail.com
- remove stupid files
- rename t9302-fast-filter.sh to t9391-repo-filter-lib-usage.sh
- s/testcases/t/, for sucking into git
- do renames in analysis; modify file contents as necessary for those changes
- prefix commit messages with "repo-filter:"
- postfix messages with Signed-off-by (and add enewren@sandia.gov for Jim's)
Generate upstream patches:
- Tags of tags of commits fail to export:
- In git.git, try:
$ git fast-export --no-data --use-done-feature --signed-tags=strip \
--tag-of-filtered-object=rewrite-feature v1.0rc1 >/dev/null
fatal: tag 5f4cd4ca015dc795b9f7f4fed11b3f80a60ac175 tags unexported tag!
Bigger ideas
- 1st step, create local branches for each remote tracking branch:
git fetch . refs/remotes/origin/*:refs/heads/*
also, nuke refs/remotes/origin/*; it won't match upstream anyway
- Performance:
- Smarter record_remapping -- do it lazily
- Unnecessary re-computation of 'epoch' (calling fromtimestamp)
...and perhaps just unnecessary use of FixedTimeZone when most of the time
it will not be checked or modified?
- What part of _parse_commit takes so much time?
- What part of commit.dump takes so much time?
- Speed up _parse_optional_filechange using str.split(None, 3) instead of re
- Which wait() are we waiting on?
- Smarter become-empty checks; only do more expensive checks if:
- First parent is no longer original first parent or ancestor thereof
- e.g. first-parent history empty, second parent becomes first parent
- e.g. --parent-filter causes some kind of graft operation (although
maybe we don't want to prune in this case anyway...)
- Blob filtering is active AND the only file_changes involved correspond
to filenames that have previously been modified.
- Regex optimization
- memoize (or just outright store?) filename remapping
- memoize net result: dequote -> do mods -> requote
- Work with submodules
- Important features
- paths-from-file (--paths-from-file <(git ls-tree -r HEAD))
- include-old-names-of-specified-files
- so users don't have to look for rename data from --analyze
- --use-mailmap (point to "MAPPING AUTHORS" in git-shortlog)
- Do git rev-list --count to get idea of amount of work; show progress
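Sketch for the str.split(None, 3) idea under Performance above. This is
only an illustration: the regex and both function names are hypothetical,
not the actual _parse_optional_filechange code, and it handles only 'M'
(modify) lines of the form "M <mode> <id> <path>".

```python
import re

# Hypothetical regex for a fast-export modify line; the real pattern may differ.
_MODIFY_RE = re.compile(r'M (\d+) ([0-9a-f]{40}|:\d+) (.*)$')


def parse_modify_with_re(line):
    """Return (mode, id, path) for an 'M' line, or None if it does not match."""
    match = _MODIFY_RE.match(line.rstrip('\n'))
    return match.groups() if match else None


def parse_modify_with_split(line):
    """Same result as above, using str.split instead of a regex.

    maxsplit=3 keeps spaces inside the path intact. Note that split(None, ...)
    collapses runs of whitespace between fields, so a path that begins with a
    space would lose it; split(' ', 3) avoids that if it matters.
    """
    parts = line.rstrip('\n').split(None, 3)
    if len(parts) != 4 or parts[0] != 'M':
        return None
    return tuple(parts[1:])
```

Both functions return None for non-modify lines such as "D path", so the
caller can fall through to the other filechange types.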
Left over bits:
- Fix up --analyze
* shouldn't allow running --analyze with negative refspecs
* add a --no-detect-renames option (for performance)
- renames & copies can cause commits to become empty
- metadata
- On second and subsequent runs, update metadata instead of overwriting
- for maps, give beginning_hash -> end_hash, not intermediate hashes
- OR error out if .git/repo-filter already created?
- error out if any progress messages in stream (can't deal with them unless
we can pass --cat-blob-fd to fast-import, and that seems non-portable)
More path stuff, maybe
--path-rename-regex
--path-stream-rename (invoked once; must read one line then print)
--path-stream-filter (invoked once per commit with new files)
--path-tree-filter
Ref stuff
--ref-rename
--ref-stream-rename
Blob filter
--tree-filter
Safety stuff
--keep-excluded-revisions
--keep-excluded-refs
--store-backup
--empty-pruning={no/off,auto,always/on}
--negative-refs={drop,reference}
Other things:
- when implementing renames, check for collisions.
- add a filename_callback too, for just editing file names
- add --skip-cleanup (pruning, gc, etc.; keep reset --hard) for speed compare
- get rid of user-run fast-export & fast-import; don't want to have to
update two callsites.
- Nuke 'WIP' in commit messages
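Sketch for the rename collision check mentioned above. The function name
and its arguments are hypothetical; it assumes the rename is available as
a plain callable mapping an old path to a new path.

```python
def find_rename_collisions(paths, rename):
    """Group old paths by their renamed target and report clashes.

    paths  -- iterable of path names present in a commit
    rename -- callable mapping an old path to its new path (hypothetical)

    Returns a dict {new_path: [old_path, ...]} containing only the targets
    that more than one old path maps to.
    """
    targets = {}
    for old in paths:
        targets.setdefault(rename(old), []).append(old)
    return {new: olds for new, olds in targets.items() if len(olds) > 1}
```

For example, flattening directories with rename = lambda p: p.split('/')[-1]
makes src/a.txt and lib/a.txt collide on a.txt.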
Late stage stuff:
Naming
filter-repo (like filter-branch)
repo-filter (for preliminary version?)
Performance notes:
* On rails:
* 1) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all >/dev/null
* 2) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all >saved_output
* 3a) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all \
| sed -e s/+051800/+0261/ >/dev/null
* 3b) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all \
| stupid.py >/dev/null
* 4) time git fast-export --show-original-ids --signed-tags=strip \
--tag-of-filtered-object=rewrite --no-data \
--use-done-feature --all \
| sed -e s/+051800/+0261/ \
| git fast-import --force --quiet >/dev/null
* 5) time git repo-filter --invert-paths --path pushgems.rb
(with early quit right before removing unused refs)
* 6) time python -m cProfile -o repo-filter.profile \
~/floss/git-repo-filter/git-repo-filter \
--invert-paths --path pushgems.rb
* 7) time java -jar ~/Downloads/bfg-1.13.0.jar --delete-files pushgems.rb
1: 3.910 fast-export
2: 3.958 fast-export + save output
3: 4.128 fast-export + sed (but toss output)
3a: 4.234 fast-export + python stdin using 'for' iterator
3b: 4.189 fast-export + python stdin using readline
3c:27.796 fast-export + python from subprocess using readline
3d: 4.196 fast-export + python from subprocess using 'for' iterator
3e: 4.580 fast-export + python3 from subprocess using readline
3f: 5.334 fast-export + python3 from subprocess using 'for' iterator
3g: 4.264 fast-export + python from subprocess using readline & bufsize
4: 11.279 fast-export + sed + fast-import
5: 64.098 filter-repo
5: 35.914 filter-repo, after bufsize=-1 for subprocess stuff
6: 69.150 filter-repo run under cProfile
7: 20.155 bfg
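The timings above suggest that pipe buffering matters a lot when reading
from a subprocess (3c vs. 3g, and the two runs of 5). A minimal sketch of
the buffered variant follows; count_output_lines and DEMO_CMD are
hypothetical names, and the demo command stands in for git fast-export so
the snippet runs anywhere.

```python
import subprocess
import sys

# Stand-in for a real exporter command such as ['git', 'fast-export', '--all'].
DEMO_CMD = [sys.executable, '-c', "print('a'); print('b')"]


def count_output_lines(cmd):
    """Run cmd and count its output lines, reading through a buffered pipe."""
    # bufsize=-1 asks for the system default buffer size. Python 3 already
    # defaults to this; Python 2 defaulted to 0 (unbuffered).
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=-1)
    count = 0
    for _line in proc.stdout:  # 'for' iteration instead of repeated readline()
        count += 1
    proc.stdout.close()
    if proc.wait() != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return count
```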
Other Notes:
* cProfile:
python -m cProfile -o repo-filter.profile \
~/floss/git-repo-filter/git-repo-filter \
--invert-paths --path pushgems.rb
python
>>> import pstats
>>> p = pstats.Stats('repo-filter.profile')
>>> p.strip_dirs().sort_stats('cumtime').print_stats()
* reports 64.2% of time in readline()
* reports 37.0% of time under _advance_currentline
Argument parsing stuff:
# NOT YET IMPLEMENTED OPTIONS BELOW
misc.add_argument('--empty-pruning', choices=['always', 'auto', 'never'],
default='auto',
help='''The default, auto, will check if filtering
causes commits to become empty (have no file
changes and only have one parent) and prune them
if so. This pruning can also cause merge
commits to have fewer parents and possibly
become empty themselves, and thus be pruned.
Further, any branch or tag whose entire history
is pruned due to becoming empty will be pruned.
However, auto will not prune commits which
started out empty in the original repo and have
a non-pruned parent.''')
misc.add_argument('--store-backup', default=None,
metavar='NAMESPACE', dest='backup',
help='Store a copy of original refs under refs/NAMESPACE/')
misc.add_argument('--keep-excluded-refs', action='store_true',
help='''If refs are excluded either explicitly (e.g.
^master) or implicitly (e.g. a branch in the
history of an excluded ref/revision, or a branch
not listed in the set of revisions to filter),
then that ref will be deleted by the filtering
process. Use --keep-excluded-refs to retain
such refs.''')
misc.add_argument('--keep-excluded-revisions', action='store_true',
help='''If negative revisions are provided to exclude
the range of history we are filtering over (e.g.
negative_branch..master or ^negative_branch_1
^negative_branch_2 master develop), then by
default any commits in the history of those
revisions are excluded from the filtered history
(resulting in the first not-excluded commit in
history becoming a root commit and often
containing an unusually large number of file
changes). With --keep-excluded-revisions, those
commits are all retained (in their unfiltered
form).''')
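A self-contained sketch of how the options above might be wired up once
implemented. It assumes misc is an argparse argument group; the group
title, build_parser, and the shortened help strings are placeholders, not
the project's actual code.

```python
import argparse


def build_parser():
    """Build a parser holding the not-yet-implemented options listed above."""
    parser = argparse.ArgumentParser(prog='git-repo-filter')
    # Assumption: 'misc' is an argument group on the main parser.
    misc = parser.add_argument_group(title='Miscellaneous options')
    misc.add_argument('--empty-pruning', choices=['always', 'auto', 'never'],
                      default='auto',
                      help='Whether to prune commits that become empty')
    misc.add_argument('--store-backup', default=None,
                      metavar='NAMESPACE', dest='backup',
                      help='Store a copy of original refs under refs/NAMESPACE/')
    misc.add_argument('--keep-excluded-refs', action='store_true',
                      help='Retain refs that filtering would otherwise delete')
    misc.add_argument('--keep-excluded-revisions', action='store_true',
                      help='Retain excluded commits in their unfiltered form')
    return parser
```

With this, build_parser().parse_args(['--store-backup', 'original']) yields
backup='original', with empty_pruning left at its default of 'auto'.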