Commit Graph

442 Commits

Author SHA1 Message Date
Elijah Newren
bd2c9c4d4d contrib: new simple no-op-example
The purpose of this example is solely to show what to import and run to
recover filter-repo's behavior as-is.  It doesn't modify any behavior,
but instead exists as an example so people can easily find a good
starting point for making their own modifications.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 11:19:35 -07:00
Elijah Newren
caa05d15b4 filter-repo: make default replacement text a variable
Allow external scripts that import git-filter-repo to change the value
of the default replacement text instead of having it hardcoded within
some function.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 10:51:41 -07:00
Elijah Newren
31f00a9ff8 filter-repo: avoid applying --replace-text to binary files
--replace-text is meant to replace _text_ throughout the repository, not
binary data.  Use the same scheme as the lint-history script uses to
avoid applying the changes to binary blob data.
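
A minimal sketch of such a binary check, assuming the heuristic is "a NUL
byte somewhere in the first 8192 bytes" (similar to how git itself guesses
whether content is binary).  The function names and the window size are
illustrative assumptions, not the actual filter-repo code:

```python
def looks_binary(data: bytes, window: int = 8192) -> bool:
    """Guess whether blob contents are binary: a NUL byte in the leading window."""
    return b"\0" in data[:window]


def replace_text(data: bytes, old: bytes, new: bytes) -> bytes:
    """Apply a literal replacement, but leave binary-looking data untouched."""
    if looks_binary(data):
        return data
    return data.replace(old, new)
```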

Reported-by: Tobias Gruetzmacher <tobias-git@23.gs>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 10:27:48 -07:00
Elijah Newren
859e66ae1c converting-from-filter-branch.md: add a small clarification
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 10:08:56 -07:00
Elijah Newren
d32f6258a8 converting-from-bfg-repo-cleaner.md: add a small clarification
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 10:02:31 -07:00
Elijah Newren
d87b665ed4 git-filter-repo.txt: connect --no-local and fresh clones more thoroughly
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-01 08:16:50 -07:00
Elijah Newren
469a3e10f2 filter-repo (README): separate sections for different tools
Our showing of how to handle the simple example with different tools
combined three different tools into a single section which I think made
it slightly harder to read and follow.  It also concentrated almost
exclusively on filter-branch.  Provide a separate section for each tool,
and provide more details for BFG and fast-export/fast-import.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-29 17:49:03 -07:00
Elijah Newren
8ba3566119 filter-repo (README): link cheat sheets from usage section too
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-29 17:49:03 -07:00
Elijah Newren
cdb7b77f07 filter-repo: repack with --source or --target
When using --source or --target in combination with filtering paths,
users were surprised at how large the resulting repository was.  The
usage of --source and --target were turning off repacking; while we
don't want repacking for partial history rewrites and --source and
--target turn on some of the other features we want with partial history
rewrites, repacking is something that we still want turned on.

Reported-by: Alexey Volkov <alexey.volkov@ark-kun.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-28 08:16:12 -07:00
Elijah Newren
2bfb9cf261 git-filter-repo.txt: fix extraneous space
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-27 23:47:50 -07:00
Elijah Newren
7b18e6d7f5 filter-repo: fix --prune-degenerate=never with path filtering
When combining `--prune-degenerate never` with a `--path` specification,
we could end up trying to write a parent out to the fast-import stream
whose value was actually None.  The problem occurs when the parents of
a merge commit are filtered out by the path specification, leaving us
only with no-longer-extant parents.  In such a case, we need to filter
out these 'None' (i.e. invalid) parents.  The point of
`--prune-degenerate never` is to avoid removing parents that are either
the same as or an ancestor of another parent, not to avoid removing
non-existent parents.  Remove the non-existent parent(s).
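
The distinction can be sketched as follows, assuming parents that were
filtered away are represented as None in a plain list (the helper name is
hypothetical).  Only the None entries go; duplicate or ancestor parents are
deliberately left alone:

```python
def drop_missing_parents(parents):
    """Remove parents that no longer exist (None), keeping all others in order.

    Duplicates are kept on purpose: with "never" pruning we only want to
    drop parents that are not there at all.
    """
    return [p for p in parents if p is not None]
```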

Reported-by: Gaurav Kanoongo (@gauravkanoongo on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-27 07:04:17 -07:00
Elijah Newren
df6c8652a2 filter-repo: fix ugly bug with mixing path filtering and renaming
There's also a fix in here to make sure to throw an error if users are
trying to rename paths and use --invert-paths; it's not clear at all
what that would even mean.  But that also becomes important later...

Due to the ability to either filter wanted paths (default), or to just
specify unwanted paths (with --invert-paths), I keep a special
args.inclusive variable to track whether a "match" means we want the
path or not.  There are some special cases, notably when there are no
filters present (meaning e.g. no --path specifications, at most there
are some --path-rename values provided).  When there are no filters
present, that means we should keep paths even if we don't "find a match"
against any of the filters.

Now, since the rename code was embedded in the same loop as the filter
checks, it unfortunately was also being checked against the
args.inclusive setting despite never setting whether it found a match.
That happened to work in the special case that there were no filtering
paths but only because of the special logic for that case.  Since
renaming only makes sense if --invert-paths is not specified, any path
we rename is one we always want to keep.  Make sure we do.

Reported-by: Nadège (@nagreme on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-25 12:35:34 -07:00
Elijah Newren
0375758806 filter-repo: fix possible deadlock in sanity_check_args
I'm a little surprised that stdout buffers must have filled up on MacOS X, but
either way we don't have to wait for the '-h' processes to finish before
attempting to read stdout.  In fact, since we weren't storing the returncode
attribute from calling p.wait(), there wasn't much point in doing so.  Trying
to read all stdout all at once is going to implicitly take until the process
finishes anyway, so just do that.
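
A small sketch of the safe ordering, using only the standard subprocess
module.  The child command below is a made-up stand-in for the '-h'
processes; it simply writes more data than a typical pipe buffer holds:

```python
import subprocess
import sys


def read_all_output(cmd):
    """Run cmd and return (stdout bytes, exit code) without risking a pipe deadlock."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    # Reading to EOF drains the pipe, so the child can never block on a full
    # buffer.  Calling p.wait() *before* reading is what can deadlock.
    output = p.stdout.read()
    p.stdout.close()
    returncode = p.wait()  # safe now: the pipe has already been drained
    return output, returncode


# Hypothetical child that writes far more than a typical pipe buffer can hold.
child = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000000)"]
out, rc = read_all_output(child)
```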

Reported-by: Benoit Lefèvre <contact@benoit-lefevre.org>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-25 11:00:09 -07:00
Elijah Newren
15494bba8a filter-repo: make git version requirement error message more direct
Users won't know which versions of git have --mark-tags, --reencode, or
--combined-all-paths options for fast-export and diff-tree.  I didn't
either when I wrote those messages because it wasn't in a released
version of git.  Now that they are in released versions and have been
for a while, we can simplify the messages to just state which git
version is needed.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-19 16:44:27 -07:00
Elijah Newren
1e2d0e91cb Documentation: add more detailed explanation of safety checks and --force
I occasionally get people doing special things, or see people
recommending to others to just use --force.  Add some explanations
behind the safety checks so that those doing special things know when
it's okay, and to explain why it's a really bad idea to casually or
haphazardly recommend others use --force.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-19 14:52:51 -07:00
Elijah Newren
3dfaf3874e filter-repo: fix --no-local error when there is no remote
Commit 011c646ee8 (filter-repo: suggest --no-local when cloning local
repos, 2020-05-15) added an additional message to the error to make it
more clear what to do when cloning local repos.  However, if there was
no remote, then the code path would run os.path.isdir(None), triggering
a traceback.  Fix the logic.
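
The failure mode and a guard against it, as a sketch.  The function name is
hypothetical; the point is only that the None check has to come before the
os.path.isdir() call, which raises TypeError when handed None:

```python
import os


def remote_is_local_dir(remote_url):
    """Return True only if there is a remote and it points at a local directory."""
    # Check for None first; os.path.isdir(None) raises TypeError.
    return remote_url is not None and os.path.isdir(remote_url)
```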

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-18 23:19:27 -07:00
Elijah Newren
423b7d2c89 INSTALL: streamline a bit and guide folks to package managers
Now that several package managers are packaging filter-repo (Debian and
Ubuntu seem to be the primary holdouts, but maybe treating Linux as
"covered" will pressure them to package it too), guide people to use
package managers for easy installation and streamline the wording.
Still keep the old instructions around, just move them later.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-18 09:25:11 -07:00
Elijah Newren
7e1184cd42 git-filter-repo.txt: add more --paths-from-file examples with large filtering lists
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-18 07:58:18 -07:00
Elijah Newren
5c4637ff81 Documentation: add guides for people converting from filter-branch or BFG
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-18 07:57:11 -07:00
Elijah Newren
4cfc765eb1 filter-repo: allow removing .git directories from history
Commit 7cfef09e9b (filter-repo: warn users who try to use invalid path
components, 2019-12-26) attempted to protect against using invalid path
components, but also added a check against a path that has sometimes
been valid in the past and which users might want to be able to remove
from their history.  Relax the check so that users can remove '.git'
directories in subdirectories (or even at the toplevel) from their
history.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-16 23:58:47 -07:00
Elijah Newren
db9ac1fffe git-filter-repo.txt: add documentation of --no-ff option
Commit 5e04dff097 (filter-repo: add new --no-ff option, 2020-01-01)
added support for a --no-ff option, but only added documentation in the
built-in output, not in the intended-to-be-more-complete manual.  Add
documentation to the manual for this option.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-16 18:19:05 -07:00
Elijah Newren
2833ef275f filter-repo: throw an error if user specifies any path starting with a slash
All paths are intended to be relative paths, relative to the project
root, not to the filesystem root.  There have been a few people who
didn't understand this, and then ended up with fast-import crashes that
are not very clear.  Check for it early and throw a simple error message
instead.
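
A sketch of such an early check, assuming paths are handled as bytestrings.
The function names and the error wording are illustrative, not the text
filter-repo actually prints:

```python
def is_repo_relative(path: bytes) -> bool:
    """Paths must be relative to the project root, so no leading slash."""
    return not path.startswith(b"/")


def check_path(path: bytes) -> bytes:
    """Fail early with a simple message instead of a later fast-import crash."""
    if not is_repo_relative(path):
        raise SystemExit("Error: path %r must be relative to the project root" % path)
    return path
```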

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-16 18:05:59 -07:00
Elijah Newren
764e0e00dd git-filter-repo.txt: add examples for --[to-]subdirectory-filter
I had lots of examples of these being horribly mis-used and being used in place
of each other; add some examples with some description of the repository layout
to try to avoid all that confusion.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-16 17:49:09 -07:00
Elijah Newren
e834379254 filter-repo: clarify usage of --use-base-name
fast-export/fast-import only work with filenames (using full path from
the root of the repository); thus that's all that filter-repo works
with.  Full pathnames implicitly include all leading directories as part
of the pathname, which is what allows us to match against directories.
However, it obviously means --use-base-name can't be used to match paths
against directories.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-16 17:46:24 -07:00
Elijah Newren
7c877cd750 filter-repo: make --version more robust against modified shebangs
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-16 15:09:43 -07:00
Elijah Newren
e9c2d9adb5 filter-repo: ensure we write final newline after final progress update
We try to write 'Parsed %d commits' messages only after enough time has
passed to avoid writing to stdout becoming a bottleneck.  However, there
was a slight logic error that would cause it to only print the final
newline if there was a new message since the last progress update,
leaving a small race condition where we might miss it.
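
A simplified sketch of a rate-limited progress writer whose finish step
always emits the last message and a newline, whether or not that message
was already shown.  The class and method names are assumptions for
illustration and may not match the real implementation:

```python
import io
import time


class ProgressWriter:
    def __init__(self, out, interval=0.1):
        self._out = out
        self._interval = interval      # minimum seconds between updates
        self._last_update = 0.0
        self._last_message = None

    def show(self, msg):
        """Remember msg, but only write it if enough time has passed."""
        self._last_message = msg
        now = time.time()
        if now - self._last_update >= self._interval:
            self._last_update = now
            self._out.write("\r" + msg)

    def finish(self):
        """Always write the final message followed by a newline."""
        if self._last_message is not None:
            self._out.write("\r" + self._last_message + "\n")


# With a huge interval the second update is suppressed, yet finish() still
# ends the output with the final count and a newline.
buf = io.StringIO()
writer = ProgressWriter(buf, interval=3600)
writer.show("Parsed 1 commits")
writer.show("Parsed 2 commits")
writer.finish()
```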

Reported-by: Valentyn Shtronda (@valiko-ua)
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-16 12:53:43 -07:00
Elijah Newren
011c646ee8 filter-repo: suggest --no-local when cloning local repos
Cloning local repos by default makes a bunch of hardlinks, giving you a
non-packed repository, and leading folks to use and suggest --force.
That, of course, bypasses the important fresh clone checks to prevent
people from accidentally and irrecoverably deleting their non-backed-up
data.  Let's make it easier for people to avoid (and suggest) that
mistake.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-16 12:35:47 -07:00
Elijah Newren
c0c37a7656 filter-repo: fix bitrotted documentation links
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-04-04 23:02:17 -07:00
Elijah Newren
427b265195 Merge branch 'mr/filter-lamely-and-special-filenames' into master
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-04-03 11:03:23 -07:00
Marius Renner
3427ee171b contrib: fix special character handling in filter-lamely
filter-lamely does not handle filenames with special characters (such as
äöü or even \n and \t) properly when using a tree filter or index
filter. It either does not quote the input to git correctly or parses
git output incorrectly, causing affected filenames to be mangled with
extraneous double quotes in the history or even crashing the program.

Make filter-lamely correctly handle such filenames by using
NUL-delimited input and output modes for the affected git commands.
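
Why NUL delimiting helps can be seen in a short sketch: with NUL-terminated
output (for example from `git ls-files -z`), filenames arrive unquoted and
can be split on the one byte that can never appear in a path.  The helper
name is hypothetical:

```python
def parse_nul_delimited(output: bytes):
    """Split NUL-terminated output into raw, unquoted filenames."""
    return [name for name in output.split(b"\0") if name]


# Filenames containing a newline, a tab, and a non-ASCII character survive intact.
sample = b"a\nb.txt\0tab\there\0\xc3\xa4.txt\0"
names = parse_nul_delimited(sample)
```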

Signed-off-by: Marius Renner <marius@mariusrenner.de>
2020-04-03 09:42:57 +02:00
Elijah Newren
f164f2b2e6 Merge branch 'kf/fix-example-typo' into master
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-03-24 16:17:45 -07:00
Kate F
420aa32dac git-filter-repo.txt: Fix typo for example
Signed-off-by: Kate F <kate@elide.org>
2020-03-24 16:02:52 -07:00
Elijah Newren
3a394ca152 Makefile: a few sanity checks for releasing
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-03-23 16:04:07 -07:00
Elijah Newren
9928b7cb3e t9390: add missing '&&' in command chain
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-03-23 15:26:52 -07:00
Elijah Newren
e11343e504 filter-repo: handle typechange modifications when first parent is pruned
Commit 509a624b (filter-repo: fix issue with pruning of empty commits,
2019-10-03) added code to get a new list of file changes when the first
parent was pruned.  However, this logic did not handle cases where one
of the file modifications was a typechange.  Add the necessary logic to
handle that case.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-03-23 15:24:52 -07:00
Elijah Newren
4f84a74ada filter-repo: use more expensive prunability checks when needed
When users are inserting new objects into the stream, we cannot make as
many assumptions and need to do more careful checks for whether commits
become empty or not.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-03-23 14:55:07 -07:00
Elijah Newren
b1fae4819a filter-repo: relax the definition of freshly packed
transfer.unpackLimit defaults to 100, meaning that if less than 100
objects exist in the repository, git will automatically unpack the
objects to be loose as part of the clone operation.  So, if there are no
packs and less than 100 objects, consider the repo to be freshly packed
for purposes of our fresh clone sanity checks.
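
A sketch of the relaxed rule.  It assumes the stricter, earlier definition
was "exactly one pack and no loose objects"; that baseline, the function
name, and the parameters are assumptions made for illustration:

```python
def looks_freshly_packed(num_packs: int, num_loose_objects: int) -> bool:
    """Decide whether object counts are consistent with a fresh clone."""
    # Assumed baseline: a fresh clone normally has one pack and nothing loose.
    if num_packs == 1 and num_loose_objects == 0:
        return True
    # Relaxed case: a tiny repository gets unpacked into loose objects on
    # clone (transfer.unpackLimit defaults to 100).
    return num_packs == 0 and num_loose_objects < 100
```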

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-03-21 20:12:22 -07:00
Elijah Newren
fe33fc42b3 filter-repo: avoid dying with --analyze on commits with unseen parents
analyze_commit() calls add_commit_and_parents() which does a sanity
check that we have seen all parents previously.  --refs breaks that
assumption, so we need to work around that check when ref limiting is in
effect.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-03-21 19:48:16 -07:00
Elijah Newren
46549e7d3f lint-history: point people to issue with more linting examples
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-18 21:59:28 -08:00
Elijah Newren
4c28ed6b8a Merge branch 'sb/setup-idempotency' into master
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-11 22:51:19 -08:00
Sirio Balmelli
9cf87ae036 setup.py: test for FileExistsError on symlink
Multiple runs of setuptools encounter a FileExistsError exception
trying to re-symlink the same files.

This exception is safe to ignore: the files were already symlinked
so the call can be considered successful.
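
The idea, sketched with the standard library only.  The helper name and the
file names in the usage lines are hypothetical stand-ins, not necessarily
what setup.py uses:

```python
import os
import tempfile


def ensure_symlink(target, link_name):
    """Create link_name -> target; a link left by an earlier run is fine."""
    try:
        os.symlink(target, link_name)
    except FileExistsError:
        pass  # already symlinked on a previous run


# Simulate running the setup step twice in a scratch directory.
scratch = tempfile.mkdtemp()
link = os.path.join(scratch, "git_filter_repo.py")
ensure_symlink("git-filter-repo", link)
ensure_symlink("git-filter-repo", link)  # second run: no exception
```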

Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
2020-02-11 20:19:01 +01:00
Elijah Newren
b9c62540b7 filter-repo: fix cache of file renames
Users may have long lists of --path, --path-rename, --path-regex, etc.
flags (or even a --paths-from-file option with a lot of entries in the
file).  In such cases, we may have to compare any given path against a
lot of different values.  In order to avoid having to repeat that long
list of comparisons every time a given path is updated, we long ago
added a cache of the renames so that we can compute the new name for a
path once and then just reuse it each time a new commit updates the old
filepath.

Sadly, I flubbed the implementation and instead of setting
   cache[oldname] = newname
I somehow did the boneheaded
   cache[newname] = newname
For most repositories and rewrites, this would just have the effect of
making the cache useless, but it could wreak various kinds of havoc if
a newname matched the oldname of some other file.

Make sure we record the mapping from OLDNAME to newname to fix these
issues.
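
A minimal sketch of a correctly keyed rename cache.  The names are
hypothetical, and compute_new_name stands in for the potentially long list
of path comparisons:

```python
def make_cached_renamer(compute_new_name):
    """Wrap an expensive rename function with a cache keyed by the OLD name."""
    cache = {}

    def rename(oldname):
        if oldname not in cache:
            # Keying by oldname is the fix; keying by the new name would
            # make later lookups of oldname miss or hit the wrong entry.
            cache[oldname] = compute_new_name(oldname)
        return cache[oldname]

    return rename


calls = []


def slow_rename(path):
    """Stand-in for the expensive comparison loop; records each invocation."""
    calls.append(path)
    return path.replace(b"src/", b"lib/", 1)


rename = make_cached_renamer(slow_rename)
first = rename(b"src/a.c")
second = rename(b"src/a.c")  # served from the cache, slow_rename not called
```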

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-10 13:41:19 -08:00
Elijah Newren
85c8e3660d filter-repo: accelerate is_ancestor() for --analyze mode
The --analyze mode was extremely slow for the freebsd/freebsd repo on
github; digging in, the is_ancestor() function was being called a huge
number of times -- about 22 times per commit on average (and about 17
million times overall).  The analyze mode uses is_ancestor() to
determine whether a rename equivalency class should be broken (i.e.
renaming A->B means all versions of A and B are just different versions
of the same file, but if someone adds a new A in some commit which
contains the A->B rename in its history then this equivalence class no
longer holds).  Each is_ancestor() call potentially has to walk a tree
of dependencies all the way back to a sufficient depth where it can
realize that the commit cannot be an ancestor; this can be a very long
walk.

We can speed this up by keeping track of some previous is_ancestor()
results.  If commit F is not an ancestor of commit G, then F cannot be
an ancestor of children of G (unless that child has multiple parents;
but even in that case F can only be an ancestor through one of the
parents other than G).  Similarly, if F is an ancestor of commit G, then
F will always be an ancestor of any children of G.  Cache results from
previous calls to is_ancestor() and use them to accelerate subsequent
calls.
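
One simple way to sketch this kind of caching is to memoize
(ancestor, commit) pairs, so that a result computed for G is reused when a
child of G is examined.  This is an illustration under assumed names, not
the actual implementation; it treats a commit as an ancestor of itself and
uses plain recursion, which would need to be made iterative for very deep
histories:

```python
class AncestryGraph:
    def __init__(self):
        self._parents = {}   # commit -> list of parent commits
        self._cache = {}     # (ancestor, commit) -> bool

    def add_commit(self, commit, parents):
        self._parents[commit] = list(parents)

    def is_ancestor(self, ancestor, commit):
        key = (ancestor, commit)
        if key in self._cache:
            return self._cache[key]
        if ancestor == commit:
            result = True
        else:
            # A child only needs the (possibly cached) answers of its parents.
            result = any(self.is_ancestor(ancestor, p)
                         for p in self._parents.get(commit, ()))
        self._cache[key] = result
        return result


# History: A <- B <- C, A <- D, merge M of C and D, unrelated root X.
g = AncestryGraph()
g.add_commit("A", [])
g.add_commit("B", ["A"])
g.add_commit("C", ["B"])
g.add_commit("D", ["A"])
g.add_commit("M", ["C", "D"])
g.add_commit("X", [])
```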

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-07 18:04:53 -08:00
Elijah Newren
f2dccbc2ef filter-repo: avoid repeatedly translating the same string with --analyze
Translating "Processed %d blob sizes" or "Processed %d commits" hundreds
of thousands or millions of times is a waste and turns out to be pretty
expensive.  Translate it once, cache the string, and then re-use it.
Note that a similar issue was noted in commit 3999349be4 (filter-repo:
fix perf regression; avoid excessive translation, 2019-05-21), but I did
not think to check --analyze mode for similar issues back then.  Fix it
now.
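
The pattern, sketched with the standard gettext module.  The function name
and the emit callback are hypothetical; the only point is that the
translation lookup happens once, outside the loop:

```python
from gettext import gettext as _


def report_progress(total, emit):
    """Emit one progress message per commit, translating the format only once."""
    processed_fmt = _("Processed %d commits")  # single translation lookup
    for n in range(1, total + 1):
        emit(processed_fmt % n)                # cheap string formatting only


messages = []
report_progress(3, messages.append)
```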

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-07 18:00:46 -08:00
Elijah Newren
9d3d99593c lint-history: avoid dying when we get file deletions
When a file is deleted, there is nothing to lint, so we can just keep
the deletion as-is.

Reported-by: Thorben Kröger <dev@thorben.net>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-06 13:02:12 -08:00
Elijah Newren
4ea19c0bf8 filter-repo (README): streamline prerequisite wording a little bit
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-01 09:59:05 -08:00
Elijah Newren
bcd9964537 filter-repo (README): link to upstream docs
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-01 09:59:05 -08:00
Elijah Newren
96e217355c Contributing.md: start with git guidelines, then mention exceptions
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-01 09:49:26 -08:00
Elijah Newren
18f98295e4 git-filter-repo.txt: fix nested bullets to render correctly
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-02-01 09:49:26 -08:00
Elijah Newren
1dae85ee9a filter-repo: permit trailing slash for --[to-]subdirectory-filter argument
There was code to allow the argument of --to-subdirectory-filter and
--subdirectory-filter to have a trailing slash, but it was broken due to
a bug in python3's bytestring design: b'somestring/'[-1] != b'/',
despite that being the obvious expectation.  One either has to compare
b'somestring/'[-1:] to b'/' or else compare b'somestring/'[-1] to
b'/'[0].  So lame.  Note that this is essentially a follow-up to commit
385b0586ca ("filter-repo (python3): bytestr splicing and iterating is
different", 2019-04-27).
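
The behavior in question, shown with plain Python 3 bytes objects.
Indexing a bytestring yields an int, while slicing yields bytes:

```python
path = b"somestring/"

last_as_int = path[-1]       # indexing gives an int (the byte value of "/")
last_as_bytes = path[-1:]    # slicing gives a length-1 bytes object

# Three equivalent ways to test for a trailing slash correctly.
ends_with_slash = (
    path[-1:] == b"/"
    and path[-1] == b"/"[0]
    and path.endswith(b"/")
)

trimmed = path.rstrip(b"/")  # drop any trailing slashes
```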

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-01-22 07:46:20 -08:00