Lots o' updates

todo
Elijah Newren 5 years ago
parent 856e7ada33
commit 56d3009d41

202
TODO

@ -1,86 +1,134 @@
Before widely announcing:
- Notes on splitting
- need to handle 1 export -> 2 imports
- Test setup
- Add several more tests, particularly around:
- commit pruning
- pruning commits that become empty
- pruning commits that started empty and have no parent
- not pruning commits that have changes or remain a merge commit
- pruning parent(s) of a merge
- coalescing common commits of a merge
- coalescing parents of a merge when one is an ancestor of the other
- crazy no-ff stuff
- what about when splicing repos; does it still work?
- ref pruning
- tags pointing at commits which are pruned along with their history
- refs pointing at commits which are pruned along with their history
- refs or tags behind a negative revision specification
- commit message rewriting
- does it also work with repo splicing?
- renaming, particularly when it causes collisions
- use coverage.py to direct test writing
Callbacks
- filename_callback
- message_callback (commit and tag messages; see also commit/tag_callback)
- person_name_callback
- email_callback
- date_callback (?!? author/committer/tagger)
- refname_callback (error if annotated tag gets rewritten outside refs/tags/)
- commit_callback
- blob_callback
- tag_callback
Crazy ideas, showing filter-repo flexibility:
- implement bfg
- implement filter-branch (except only provide changed files)
- show modifying mode (e.g. mark executable)
- delete commits older than certain date (simple commit.skip())
- alternative path filtering
- doing case-insensitive path deletion
- clean via (current) .gitignore files [see git-check-ignore]
- removing/inserting files
- removing symlinks?
- adding a new file (LICENSE/COPYING) to the beginning of history
- remove a submodule (copy contents into tree for each submodule commit)
- extract to submodule (where to put submodule history though?)
- convert to git lfs
- reformatting (lint) files in history [maybe only via git-very-bad-idea??]
- extend argument parser and use extended version
- git-bfgish (bfg) and git-very-bad-idea (filter-branch, on changed-only tree)
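A minimal sketch of the "delete commits older than certain date" idea above. Only `commit.skip()` comes from this list. The `StubCommit` class, the `committer_date` attribute name, and the assumption that the date is stored in fast-import's raw `<epoch-seconds> <tz-offset>` form are hypothetical stand-ins so the example runs on its own.

```python
from datetime import datetime, timezone


class StubCommit:
    """Hypothetical stand-in for the object a commit callback would receive."""

    def __init__(self, committer_date: bytes):
        # Assumed raw fast-import date format: b"<epoch seconds> <tz offset>"
        self.committer_date = committer_date
        self.skipped = False

    def skip(self):
        # The TODO only names commit.skip(); here it just records the request.
        self.skipped = True


def make_cutoff_callback(cutoff: datetime):
    """Build a commit callback that skips every commit older than `cutoff`."""
    cutoff_epoch = cutoff.timestamp()

    def commit_callback(commit):
        # The first field of the raw date is seconds since the epoch (UTC).
        epoch = int(commit.committer_date.split()[0])
        if epoch < cutoff_epoch:
            commit.skip()

    return commit_callback
```

The comparison uses the epoch field only, so the timezone offset in the raw date does not affect which commits are dropped.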
Generate upstream patches:
- Tags of tags of commits fail to export (so does tags of blobs):
- In git.git, try:
$ git fast-export --no-data --use-done-feature --signed-tags=strip \
--tag-of-filtered-object=rewrite-feature v1.0rc1 >/dev/null
More thorough testing
- commit pruning
- pruning commits that become empty
- pruning commits that started empty and have no parent
- not pruning commits that have changes or remain a merge commit
- pruning parent(s) of a merge
- coalescing common commits of a merge
- coalescing parents of a merge when one is an ancestor of the other
- crazy no-ff stuff
- what about when splicing repos; does it still work?
- ref pruning
- tags pointing at commits which are pruned along with their history
- refs pointing at commits which are pruned along with their history
- refs or tags behind a negative revision specification
- splitting a repo in addition to splicing
- commit message rewriting
- does it also work with repo splicing/splitting?
- renaming, particularly when it causes collisions
- use coverage.py to direct test writing
Repository diffing
- Specialized fast-export --no-data --show-original-ids --all output
- If blobs differ, then I manually augment output with blob info
- Replace "from :<id>" with "from :<commit-msg-summary>"
- Filter out original ids, but remember them
- Change "commit <refname>" lines into just "commit"
- At the end, augment with "reset <refname> from :<commit-msg-summary>"
for each non-tag refname, in sorted order
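The repository-diffing steps above could be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: it only handles the stream elements named in the list (commit lines, original ids, `from` lines, trailing resets), it takes the first line of each commit message as the summary, it leaves `mark` and `merge` lines untouched, and the function name `normalize_export` is made up.

```python
def normalize_export(stream: bytes):
    """Return (normalized_text, original_ids) for a fast-export stream."""
    out = []            # normalized output lines
    summaries = {}      # mark -> first line of that commit's message
    original_ids = {}   # mark -> original object id (dropped from the output)
    ref_tips = {}       # non-tag refname -> mark of the last commit seen on it
    in_commit, mark, ref = False, None, None
    pos = 0
    while pos < len(stream):
        end = stream.find(b"\n", pos)
        if end == -1:
            end = len(stream)
        line = stream[pos:end].decode(errors="replace")
        pos = end + 1
        if line.startswith("commit "):
            # Change "commit <refname>" into just "commit".
            in_commit, mark, ref = True, None, line[len("commit "):]
            out.append("commit")
        elif line.startswith(("blob", "tag ", "reset ")):
            in_commit, mark = False, None
            out.append(line)
        elif line.startswith("mark :"):
            mark = line[len("mark :"):]
            if in_commit and not ref.startswith("refs/tags/"):
                ref_tips[ref] = mark
            out.append(line)
        elif line.startswith("original-oid "):
            # Filter out original ids, but remember them.
            if mark is not None:
                original_ids[mark] = line[len("original-oid "):]
        elif line.startswith("from :"):
            parent = line[len("from :"):]
            out.append("from :" + summaries.get(parent, parent))
        elif line.startswith("data "):
            # "data <n>" is followed by exactly n bytes of payload.
            size = int(line[len("data "):])
            body = stream[pos:pos + size].decode(errors="replace")
            pos += size
            if in_commit and mark is not None:
                summaries[mark] = body.split("\n", 1)[0]
            out.append(line)
            out.extend(body.splitlines())
        else:
            out.append(line)
    # Augment with one reset line per non-tag refname, in sorted order.
    for name in sorted(ref_tips):
        tip = ref_tips[name]
        out.append("reset %s from :%s" % (name, summaries.get(tip, tip)))
    return "\n".join(out) + "\n", original_ids
```

Running this over the output of both repositories and feeding the two results to an ordinary text diff is the intended use; the trailing reset lines are written on a single line exactly as the list words them, since the output is meant for comparison rather than for fast-import.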
Bug reporting:
git.git:
- fast export fails on tags of blobs (example: git.git)
- fast export fails on tags of tags (example: git.git)
- $ git fast-export --no-data --use-done-feature --signed-tags=strip \
--tag-of-filtered-object=rewrite-feature v1.0rc1 >/dev/null
fatal: tag 5f4cd4ca015dc795b9f7f4fed11b3f80a60ac175 tags unexported tag!
- [fundamental] fast export fails on tags of trees (example: linux.git)
- fast export does not report annotated/signed tags outside of refs/tags/
namespace correctly when name doesn't match internal 'tag' field.
- fast import always places tags under refs/tags/, which combined with
fast export reporting tags outside of refs/tags/ oddly means we get tags
that migrate to new locations.
Handle tags of trees:
- create a commit that has the given tree as its main object, rewrite the tag to tag it
- run it through normal filter-repo stuff
- at the end, rewrite the tag to tag its commit's tree
Document or start conversation around general issues:
- May not be able to force push over e.g. refs/changes/, refs/pulls/, etc.
- People need a way to do "update project", and get GitHub/Gerrit/Gitlab fixed
Bigger ideas
- Performance:
- Smarter record_remapping -- do it lazily
- memoize net result: dequote -> do mods -> requote
- Smarter become-empty checks; only do more expensive checks if:
- First parent is no longer original first parent or ancestor thereof
- e.g. first-parent history empty, second parent becomes first parent
- e.g. --parent-filter causes some kind of graft operation (although
maybe we don't want to prune in this case anyway...)
- Blob filtering is active AND the only file_changes involved correspond
to filenames that have previously been modified.
- Work with submodules
- Important features
- paths-from-file (--paths-from-file <(git ls-tree -r HEAD))
- include-old-names-of-specified-files
- so users don't have to look for rename data from --analyze
- Do git rev-list --count to get idea of amount of work; show progress
Other feature ideas:
- Compatibility:
- Handle grafts and replace files
- Is any work needed to handle submodules?
- Options:
- paths-from-file (similar to --replace-text, maybe also invert or basename?)
- include-old-names-of-specified-files (auto get rename data from --analyze?)
- Write git notes mapping old ids to new ids (or make special references?)
- add --skip-cleanup (pruning, gc, etc.; keep reset --hard) for speed compare
- Old list:
- --keep-excluded-revisions
- --keep-excluded-refs
- --store-backup
- --empty-pruning={no/off,auto,always/on}
- --no-ff-pruning={no/off,auto,always/on}
- --negative-refs={drop,reference}
Left over bits:
- Fix up --analyze
* shouldn't allow running --analyze with negative refspecs
* add a --no-detect-renames option (for performance)
Cleanups and left-over bits:
- put $(git --exec-path) in front of PYTHONPATH before importing?
- should handle remote symrefs better (don't special case origin/HEAD)
- metadata
- On second and subsequent runs, update metadata instead of overwriting
- for maps, give beginning_hash -> end_hash, not intermediate hashes
- OR error out if .git/filter-repo already created?
- error out if any progress messages in stream (can't deal with them unless
we can pass --cat-blob-fd to fast-import, and that seems non-portable)
Add a filename_callback parameter for those that want to affect just that
Add
--blob-callback <string serving as python code>
--commit-callback <string serving as python code>
--tag-callback <string serving as python code>
and two special ones:
--refname-callback <string serving as python code>
--filename-callback <string serving as python code>
Documentation
- Examples, include these in:
- help output
- README.md
- manpage
- Backward compatibility guarantees (or lack thereof)
- Big comment at top of git-filter-tree
- Reference caveat next to every import statement
- Make list of caveats:
- notes about history becoming incompatible (from rebase documentation)
- signed tags will be stripped
- empty commit pruning and topological changes
- commit message rewriting
- path rename collisions
- all the git.git bug reports
Safety stuff
--keep-excluded-revisions
--keep-excluded-refs
--store-backup
--empty-pruning={no/off,auto,always/on}
--no-ff-pruning={no/off,auto,always/on}
--negative-refs={drop,reference}
Other things:
- add --skip-cleanup (pruning, gc, etc.; keep reset --hard) for speed compare
Performance:
- Smarter record_remapping -- do it lazily
- Pathquoting memoization; or full result? (dequote -> do mods -> requote)
- Smarter become-empty checks; only do more expensive checks if:
- First parent is no longer original first parent or ancestor thereof
- e.g. first-parent history empty, second parent becomes first parent
- e.g. --parent-filter causes some kind of graft operation (although
maybe we don't want to prune in this case anyway...)
- Blob filtering is active AND the only file_changes involved correspond
to filenames that have previously been modified.
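One way the "memoize the full result" option for path quoting might look, as a sketch only. The quoting rules here are a simplified version of git's C-style path quoting (backslash escapes plus three-digit octal for other bytes), and `apply_mods` is a hypothetical placeholder for whatever path modifications are active; none of these names come from the tool itself.

```python
import re
from functools import lru_cache

# Simple backslash escapes used by C-style quoting, and their inverse.
_SIMPLE = {b"a": b"\a", b"b": b"\b", b"f": b"\f", b"n": b"\n", b"r": b"\r",
           b"t": b"\t", b"v": b"\v", b'"': b'"', b"\\": b"\\"}
_REVERSE = {v: k for k, v in _SIMPLE.items()}
_ESCAPE = re.compile(rb'\\([abfnrtv"\\]|[0-7]{3})')


def dequote(path: bytes) -> bytes:
    """Undo C-style quoting if the path is wrapped in double quotes."""
    if not (len(path) >= 2 and path.startswith(b'"') and path.endswith(b'"')):
        return path

    def unescape(match):
        code = match.group(1)
        return _SIMPLE[code] if code in _SIMPLE else bytes([int(code, 8)])

    return _ESCAPE.sub(unescape, path[1:-1])


def requote(path: bytes) -> bytes:
    """Quote the path only when it contains bytes that need escaping."""
    if not any(b < 0x20 or b >= 0x7f or b in b'"\\' for b in path):
        return path
    out = bytearray(b'"')
    for b in path:
        ch = bytes([b])
        if ch in _REVERSE:
            out += b"\\" + _REVERSE[ch]
        elif b < 0x20 or b >= 0x7f:
            out += b"\\%03o" % b
        else:
            out.append(b)
    out += b'"'
    return bytes(out)


def apply_mods(path: bytes) -> bytes:
    """Hypothetical modification: move everything under modules/."""
    return b"modules/" + path


@lru_cache(maxsize=None)
def rewrite_quoted_path(quoted: bytes) -> bytes:
    """Memoized net result: dequote -> do mods -> requote."""
    return requote(apply_mods(dequote(quoted)))
```

Caching the net result means a path that shows up in many commits goes through dequoting, modification, and requoting once; caching only the quoting steps would still repeat the modification step each time.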
Argument parsing stuff:
# NOT YET IMPLEMENTED OPTIONS BELOW
@ -122,3 +170,17 @@ Argument parsing stuff:
changes). With --keep-excluded-revisions, those
commits are all retained (in their unfiltered
form).''')
BFG issues:
- only works if repo is packed (can't find sizes of loose objects)
- can fail badly if not pre-gc'ed (issue 7)
- rewrite of files annoyingly leaves ids around if nuked by size
- private is weird/annoying: issue 139, issue 112
- filter-content-including doesn't affect -b option
- After replace text operation, index & working tree not updated
- may be the 'real' reason behind blob protection?
- Doesn't auto-repack (users think repo got bigger, may slow down)
- Issue 221: JGit looks at pack-refs and not individual refs, causing problems
- Issue 221: symrefs cause problems
- Issue 116: --convert-to-git-lfs doesn't really work
- and in Issue 215 they suggest https://github.com/bozaro/git-lfs-migrate

@ -0,0 +1,51 @@
Subject 1: filter-repo: history rewriting tool OR tool for writing history rewriting tools?
Subject 2: filter-repo versatility
Hi everyone,
A while ago, Jonathan expressed a worry that making filter-repo a core
command could discourage experimentation with history-rewriting, much
as he felt filter-branch did. So, I came up with a crazy idea to
demonstrate why I think including filter-repo may actually do the
opposite:
I re-wrote BFG and filter-branch as scripts on top of filter-repo.
You can see these scripts at t/t9392/git-bfgish and
t/t9392/git-really-bad-idea in the filter-repo repository. In BFG's
case, I left out BFG's nice post-run reports but believe I implemented
everything else and lightly tested on a couple cases to verify I got
the same results[1]. In filter-branch's case, what I implemented is
technically not backwards compatible -- it creates trees and indexes
with only the subset of files that changed in any given commit and
without access to the full 'git-log' of history to that point. BUT,
I've never seen a filter-branch invocation that made use of either in
modifying history, so it gives results that match filter-branch for
all practical intents and purposes[2].
Crazy? Genius? I don't know.
Maybe most people will just use filter-repo as a simple tool and I'm
the only one interested in this kind of flexibility, but filter-repo
is certainly far more versatile than filter-branch in addition to being
faster and (in my opinion) having much better usability.
[1] Of course, I had to disable empty commit pruning, make sure to
only match on file basenames rather than full paths, and needed to add
the concept of blob protection in order to match, but none of this
needed changes to filter-repo core. Only the match on blob size
needed a change to filter-repo core, and it was relatively small and
something I wanted anyway.
[2] Also, to match filter-branch I did have to turn off automatic
commit message updating, use less-accurate prune-empty logic, slow
history rewriting to a crawl by forking zillions of shell commands
(though it's still a lot faster than filter-branch), implement
infuriating defaults, add usability pitfalls for users, etc., but it
allowed me to compare end results (e.g. git show-ref) and verify
identicalness.

@ -1,183 +0,0 @@
----- Short version -----
As suggested by Ævar[1], I am proposing git repo-filter for inclusion
in git.git. I hope that my documentation included in the repo-filter
repository[2] can answer questions you have about it; if it does not,
that may indicate I need to supplement its documentation. However, I
am happy to answer any and all questions you may have about the tool;
fire away.
Basic Info:
git repo-filter is a tool for rewriting history that includes some
capabilities I have not found anywhere else. It is most similar to
filter-branch, though it has a significantly different taste in
usability. Also, being based on fast-export/fast-import, it is orders of
magnitude faster (it has speed roughly comparable to BFG repo cleaner,
but isn't multi-threaded).
repo-filter is a ~2500 (FIXME) line single-file python script,
depending only on the python standard library (and execution of git
commands), all of which is designed to make build/installation
trivial: you just need to copy it into your $PATH.
[1] https://public-inbox.org/git/87r2fq3b9t.fsf@evledraar.gmail.com/
[2] Currently tracked at https://github.com/newren/git-repo-filter,
but the plan would be to instead point people at git.git if it is
merged. (And if it is merged, the merge should just delete its
antique fork of t/test-lib.sh and its README.md.)
----- Intermediate length version -----
As suggested by Ævar[1], I am proposing git repo-filter[2] for inclusion
in git.git. There are a few issues that make me wonder if the git
community will want it, which I've done my best to explain and address
below.
Sorry for the lengthy email; feel free to skim for whatever bits seem
relevant to you.
Basic background
----------------
git repo-filter is a tool for rewriting history. It has a significantly
different taste in usability than filter-branch, and being based on
fast-export/fast-import, it is orders of magnitude faster (it has speed
roughly comparable to BFG repo cleaner, but isn't multi-threaded). It
includes some capabilities I have not found anywhere else.
Important inclusion information
-------------------------------
1. Build: No special build rules required; it's a single-file script
to simplify build/installation. Its only dependencies are
git and python. This python script only uses the python
standard library, so no extra python packages are needed.
2. Tests: (FIXME) git-style end-to-end tests (using an ancient fork of
test-lib.sh from git.git) are in use, making the inclusion
into git trivial. There are also some python-style unit
tests, though these are also invoked from a test in the
end-to-end suite so no additional tooling is needed.
3. Documentation: (FIXME) Built-in help and git-style asciidoc man-page
already included.
Possible reasons to exclude from git.git
----------------------------------------
1. Portability: repo-filter is written in Python, which I've heard
is difficult for some platforms where git is run.
2. Maintainability/EOL decisions: repo-filter is (currently) written
in Python 2 rather than Python 3.
3. User story: Since repo-filter will not and can not be backward
compatible to filter-branch, we inevitably would have two tools
for rewriting history. Some may see that as confusing to users,
especially since I didn't just implement a slightly different
feature set: I fixed usability warts by changing a few basic
underlying assumptions.
Counter-arguments against exclusion
-----------------------------------
1) Portability:
1a) repo-filter only uses the python standard library, simplifying
the porting story significantly.
1b) repo-filter is a single-file script. While it is even longer than
git-send-email.perl, putting it on the big side, this does mean
no special build instructions are needed.
1c) repo-filter is not a daily-use tool, nor is it a collaboration
tool. It's a tool that one person on your team uses once in
maybe five years, then shares the results with everyone once. Thus,
portability to esoteric platforms is perhaps less critical than it
is for other components of git.
2) *shrug*. repo-filter was started by importing git-fast-filter[3]
(which was in Python 2), and I haven't bothered porting. I have often
worked with older enterprise distros, so I am a bit of a laggard with
the Python 3 transition. If others find this worrisome, I can work on
porting.
3) I've already made this email too long so I'll summarize; let me
know if you want more detail. In short: repo-filter enables
usage on repositories for which filter-branch is just completely
impractical, and also has new capabilities that I cannot even
emulate within filter-branch. But it's more than just that.
While filter-branch is a nifty easy-to-use tool for a few very
simple cases and has enough versatility to sometimes handle more
complex cases, the complexity increases rapidly and some of
the underlying assumptions make for greater user confusion and/or
cause problems in trying to use several different features for
the same filtering operation. As such, I think a tool designed
for larger filtering operations or less sophisticated users of
necessity needs to change some basic things about how
filter-branch operates, which implies it must be a new different
tool.
So...thoughts?
Thanks,
Elijah
[1] https://public-inbox.org/git/87r2fq3b9t.fsf@evledraar.gmail.com/
[2] Currently tracked at https://github.com/newren/git-repo-filter,
but the plan would be to instead point people at git.git if it's
included.
[3] https://public-inbox.org/git/51419b2c0904072035u1182b507o836a67ac308d32b9@mail.gmail.com/
Background:
Desire to combine, split-apart, or clean up repositories
Examples: pgdev, nucleus, willamette
Example, want:
Only certain paths (a specific directory)
move into a subdirectory
rename tags to not conflict
Filter-branch command (takes 65.950 seconds, or 15.594 seconds):
time git filter-branch --tree-filter 'mkdir -p modules && git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -f -q && ls -d * | grep -v modules | xargs -I files mv files modules/' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
Faster version (takes 37.802 seconds, or 6.287 seconds):
time git filter-branch --index-filter 'git ls-files | grep -v ^src/main/java/com/palantir/annotation | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&modules/-" | git update-index --index-info; git ls-files | grep -v ^modules/ | xargs -r git rm -q --cached' --tag-name-filter 'echo "table-helper-$(cat)"' --prune-empty -- --all
Caveats:
Really complicated to come up with
Googled solutions may be subtly os- or case- specific (sed, xargs, '*' above)
(I know git & bash & gnu vs. bsd, fixed filter-branch, etc.)
Error Prone:
mixing old and new history
safety -- how to restore (refs/original hard; annotated tags may be missing)
pruning of empty commits overeager
Painful, but possible:
selecting stuff to keep (as opposed to removing)
renaming files
figuring out what to remove (--analyze)
shrinking (man-page is misleading...)
Limiting:
speed
commit message rewriting
Compare:
git repo-filter --analyze
time git repo-filter --path src/main/java/com/palantir/annotation --subdirectory-filter modules

@ -0,0 +1,14 @@
filter-branch questions:
https://stackoverflow.com/questions/53413645/filter-branch-wont-delete-orphan-branches
https://stackoverflow.com/questions/53691547/keeping-history-of-splitted-repository-on-renamed-folder
https://stackoverflow.com/questions/6638019/detach-subdirectory-that-was-renamed-into-a-new-repo
https://stackoverflow.com/questions/53200708/retaining-original-folder-with-git-subdirectory-filter
https://stackoverflow.com/questions/53502654/how-do-i-run-a-code-formatter-over-my-source-without-modifying-git-history
https://stackoverflow.com/questions/52505480/how-can-i-convert-this-git-filter-branch-command-from-tree-filter-to-index-filte
BFG questions:
https://stackoverflow.com/questions/54310566/cant-get-rid-of-a-big-file-in-gitlab-repository
https://stackoverflow.com/questions/54139438/bfg-is-there-any-way-to-replace-text-on-files-on-a-specific-path
https://stackoverflow.com/questions/53821522/before-git-push-how-can-we-delete-big-files-using-bfg-including-protected-dir
https://stackoverflow.com/questions/50288203/github-cleaning-history-of-unwanted-files