119 Commits (main)

Author SHA1 Message Date
Stefano Rivera 838bdd19f6 Update expected test data for git 2.35
Commit order from fast-export --first-parent has changed in git 2.35,
see 726a228dfb

This will break the same tests on older git releases.

Fixes: #344
Signed-off-by: Stefano Rivera <stefano@rivera.za.net>
2 years ago
Elijah Newren 05e3548b67 Merge branch 'rnd/add-report-dir-option'
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
rndbit e9d5ab3529 filter-repo: add option --report-dir to set custom analysis dir
--analyze is hardcoded to write to a subdirectory inside GIT_DIR.

When practicing filtering runs on a large repo it is desirable to keep
an unchanged copy read-only to reduce chance of user error. It is
desirable to be able to analyze a read-only repo without having to clone
it. This would save a lot of time and space.

Add --report-dir option to set a non-default destination directory for
writing analysis output to.

Signed-off-by: rndbit <rndbit@filter.bitman.net>
[en: fixed existing regression test broken by now not overwriting the
     analysis directory unconditionally, and also added a new test of
     the new behavior for code coverage.]
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
rndbit 993216739e filter-repo: add tests for --replace-text in binary blobs
The --replace-text option failed to detect blobs as binary and was
incorrectly applied to all blobs.
Prior to the switch from python2 to python3, it incorrectly designated
blobs containing a 0 character, rather than a NUL byte, as binary. That
would have caused text replacements to apply to binary files and not to
text files containing a 0 character.

Add regression tests with blobs containing: a 0 character, a NUL byte,
and both a 0 character and a NUL byte.
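
The difference between the two bytes can be sketched in Python. This is only an illustration, not filter-repo's actual detection code; the helper name `looks_binary` and the 8192-byte window are assumptions made for the example.

```python
def looks_binary(blob: bytes, window: int = 8192) -> bool:
    """Hypothetical helper: call a blob binary if a NUL byte appears early on."""
    # b"\0" is the NUL byte (value 0); b"0" is the ASCII digit zero (value 48).
    return b"\0" in blob[:window]


# One sample per case named in the commit message.
samples = {
    "zero_char": b"version 1.0\n",   # contains only the character "0"
    "nul_byte": b"\x89PNG\0\0data",  # contains only NUL bytes
    "both": b"10 bytes\0follow",     # contains both
}
results = {name: looks_binary(data) for name, data in samples.items()}
```

Under this check only the samples holding a real NUL byte are classified as binary; the digit zero alone leaves a blob classified as text.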

Signed-off-by: rndbit <rndbit@filter.bitman.net>
3 years ago
Gwyneth Morgan 129a3bcb8b filter-repo: add new --replace-message option
Like --replace-text, add an option --replace-message which replaces text
in commit/tag message bodies, so that users can easily replace text
without constructing a --message-callback.

Signed-off-by: Gwyneth Morgan <gwymor@tilde.club>
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Elijah Newren 47c5a29fd4 Merge branch 'sb/callback-from-file'
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Shezan Baig 5256c99e49 Allow callback body to be loaded from a file
For anything more complicated than a few lines, it's easier to write the
callback body in a file and let filter-repo load the file as a string.

Signed-off-by: Shezan Baig <sbaig1@bloomberg.net>
[en: added a testcase for code coverage]
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Stefano Rivera 24f09bd016 Share implementation with github workflow
Signed-off-by: Stefano Rivera <stefano@rivera.za.net>
3 years ago
Stefano Rivera 26e3f8c52e Exit non-zero if the tests fail
Signed-off-by: Stefano Rivera <stefano@rivera.za.net>
3 years ago
Stefano Rivera 34b26f4026 Break the actual test runner into its own script
So that we don't have to run with coverage if we don't want to.

Additionally, don't require being in the t directory to run tests

Signed-off-by: Stefano Rivera <stefano@rivera.za.net>
3 years ago
Elijah Newren 8683d6fe48 Merge branch 'js/windows-fixes'
Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Johannes Schindelin e0a3df8c62 Fix the Python path on Windows
On Windows, we want to run with a native Python, i.e. the separator is a
semicolon, and the paths should be Windows paths (although they're
allowed to have forward slashes instead of backslashes).

Since we're most likely running this in an MSYS2 Bash, allow for
`$TEST_DIRECTORY` to pretend to be a Unix path, and translate it via
`cygpath` into a Windows path.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
3 years ago
Elijah Newren cf67ccd978 filter-repo: improve invalid repository error message
Even though the repository is encoded as a bytestring, we want error
messages to be UTF-8.

Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Elijah Newren 7500fb7c5a t9390: add a testcase for --path-rename with no colon
Commit 28b479b7 (Fix bug in --path-rename argument without colon,
2021-03-12) added a new conditional error message, with no corresponding
testcase to ensure the line was covered.  I forgot to check the coverage
before merging the change.  Add a relevant test now.

Signed-off-by: Elijah Newren <newren@gmail.com>
3 years ago
Johannes Schindelin d0dcece202 t9391: guard `dos2unix` use behind a prereq
Not all setups have `dos2unix`. Most notably, the Ubuntu and macOS
agents of GitHub Actions don't.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Johannes Schindelin 85afdf9da9 t9391: don't rely on the system gitconfig defining core.autocrlf=false
The test case t9391.12 specifically wants to test LF vs CR/LF line
ending issues, expecting `core.autoCRLF` to default to `false`. This is
true on Linux and macOS and pretty much everywhere else, except on
Windows.

Let's make sure that the test operates with the `core.autoCRLF` value it
assumes to operate under.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Johannes Schindelin fe79ec9912 t9390: work around yet another Unix<->Win32 path issue
On Windows, there is no absolute path `/fake/path`, but MSYS2 (which Git
for Windows uses e.g. for running Bash scripts) pretends that it exists.
This only works within MSYS2 applications, of course, so... when MSYS2
sees that we hand a parameter to a non-MSYS2 application in a shell
script, it helpfully converts it to the full path (prepending MSYS2's
pseudo root directory).

Let's work around that by using a Win32-compatible path to begin with:
`$(pwd)` produces that on Windows. On other platforms, it still works.

As a bonus, this safe-guards our test against a setup where `/fake/path`
_actually exists_. Stranger things have been seen in the wild, after
all.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Johannes Schindelin 848cd652f0 t9390: work around clash with MSYS2's Unix<->Win32 path conversion
MSYS2 tries to be very helpful, and in most cases it even works, by
converting parameters passed from inside an MSYS2 Bash to a non-MSYS2
application (such as `git.exe`) if they look like Unix-style paths or
path lists.

Sometimes, however, this automatic path conversion is unhelpful, e.g.
when passing the parameter `foo:.` to Git, which MSYS2 will readily
convert to a Windows-style path list: `foo;.` (i.e. using a semicolon
instead of a colon).

Happily, there is a way to avoid that: the `MSYS_NO_PATHCONV` variable.
Let's use it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Johannes Schindelin 6967fad156 t9390: avoid using `colrm`
While it is true that `colrm` is available on macOS by default, and even
in Ubuntu (thanks to the `bsdmainutils` package), it is not available on
Windows.

Let's use `cut` instead.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Johannes Schindelin e6ffeded2e t9390: avoid using Bash-ism `<(...)`
The problem with this is that on Windows, we use the MSYS2 Bash which
uses the POSIX emulation layer called "MSYS2 runtime" that pretends that
there _is_ something like the `/dev/fd/` namespace, and tells `git.exe`
about it, but `git.exe` does not use the POSIX emulation layer, and
hence has no idea what Bash is talking about.

Besides, we should avoid pipes, just as we do in the Git project.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Johannes Schindelin 8bc195673c t9390: close link of broken &&-chain
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Johannes Schindelin f1ee28d78f t9390: expect the correct line count in `--strip-blobs-with-ids`
In that test case, we expect the line count to be 5, but it is actually
6 lines that we should expect:

	numbers/medium.num
	numbers/small.num
	sequence/know
	whatever
	words/know

Note the empty line at the top: this list is generated via `git log
--format=%n`, and that `%n` stands for "newline", meaning that we _must_
expect an empty line.

This expectation seems to have been broken already in the commit that
added the test case: b6a35f8 (filter-repo: implement
--strip-blobs-with-ids, 2019-05-30). It was hidden for such a long time
by a broken &&-chain, which we will fix next.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Johannes Schindelin 6c475a7e09 t9390: use the correct prereq when using "funny" file names
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
4 years ago
Elijah Newren 93ee4ae907 Merge branch 'mw/empty-author-name' into main
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Martin Wilck 282f8ddb9b filter-repo: only set author from committer if author email not set
Some commits may have a valid author email, but no valid author name.
Old versions of git didn't enforce a non-empty name.
Setting the author data from the committer is wrong in this case.

Also add a test case for this to t9390.

Example: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c6295cdf656de63d6d1123def71daba6cd91939c

(en: replaced with a dedicated test instead of tweaking existing ones)

Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 7eaaf191de filter-repo: correctly prune nested tags not matching filtering criteria
When the user specifies some kind of criteria to filter commits by (e.g.
--subdirectory-filter mysubdir), we rewrite parent commits that are
entirely filtered out to the most recent ancestor that still exists, or
just prune the parent if there isn't one.  That works great when the
parent is a commit, but nested tags have parents that are tags.  If we
only prune the first tag (i.e. the tag of a commit), then letting any
tags through that had that tag as a parent will result in a fast-import
crash with a message of the form

   fatal: mark :35390 not declared

Ensure that when a tag gets pruned, the pruning is recorded as such...so
that any children tags will get pruned as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren d79ea709b7 filter-repo: fix crash from assuming parent is an int
When filtering with --refs, parents can be a hash rather than an
integer.  There was a code path in RepoFilter._prunable() that was
written assuming the first parent would always be an integer; fix it to
handle a hash as well.

Reported-by: Niklas Hambüchen <mail@nh2.me>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren e4960a53f8 Fix undefined variable names
Reported-by: Christian Clauss <cclauss@me.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren cefeef1c0a filter-repo: use new --date-format=raw-permissive fast-import option
fast-import gained a new raw-permissive date format explicitly for
allowing people to import repositories as-is.  Make use of the flag, and
stop rewriting the bogus timezone found in rails.git.

If users do not like these bogus times, they can of course write a
filter to fix them (or even make them bogus in a different way).  For
example:

    git filter-repo ... --commit-callback '
      if commit.author_date.endswith(b"+051800"):
        # bytes.replace() returns a new object, so assign the result back
        commit.author_date = commit.author_date.replace(b"+051800", b"+0261")
    '

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 38e70b69e8 filter-repo: ignore comment lines in --paths-from-file
Allow lines starting with '#' to be treated as a comment and be ignored.
Update the documentation to note that both blank lines and comment lines
are ignored, and mention how filenames starting with '#' can be matched
(namely, the same way that filenames starting with 'regex:', 'glob:',
or 'literal:' can be -- by prefixing the filename with 'literal:').
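
The rules described above can be sketched as a small Python parser. This is a simplified illustration, not filter-repo's own code; the function name `parse_path_lines` and the `(match_type, pattern)` return shape are assumptions made for the example.

```python
def parse_path_lines(lines):
    """Hypothetical parser: yield (match_type, pattern) for each meaningful line."""
    for raw in lines:
        line = raw.rstrip(b"\n")
        # Blank lines and lines starting with '#' are ignored.
        if not line or line.startswith(b"#"):
            continue
        for prefix in (b"literal:", b"glob:", b"regex:"):
            if line.startswith(prefix):
                # Strip the prefix; the rest is taken verbatim.
                yield prefix[:-1].decode(), line[len(prefix):]
                break
        else:
            # No prefix: treat the whole line as a literal path.
            yield "literal", line


parsed = list(parse_path_lines([
    b"# keep the docs\n",
    b"\n",
    b"docs/\n",
    b"glob:*.md\n",
    b"literal:#notes.txt\n",
    b"literal:regex:odd-name\n",
]))
```

Because only the first prefix is stripped, `literal:#notes.txt` selects a file whose name really begins with `#`, and `literal:regex:odd-name` selects a file whose name begins with `regex:`.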

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 25b226b1de t9390: make tests individually re-runnable
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 49d6f02ff8 filter-repo: clarify interactions between path filtering and path renaming
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 3e1bff264c Revert "filter-repo: fix ugly bug with mixing path filtering and renaming"
This reverts commit df6c8652a2.  The
motivating example was wrong; path renaming should not be involved in
path filtering, it only says how paths should be renamed if they happen
to be selected.  A subsequent commit will improve the documentation.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren df6c8652a2 filter-repo: fix ugly bug with mixing path filtering and renaming
There's also a fix in here to make sure to throw an error if users are
trying to rename paths and use --invert-paths; it's not clear at all
what that would even mean.  But that also becomes important later...

Due to the ability to either filter wanted paths (default), or to just
specify unwanted paths (with --invert-paths), I keep a special
args.inclusive variable to track whether a "match" means we want the
path or not.  There are some special cases, notably when there are no
filters present (meaning e.g. no --path specifications, at most there
are some --path-rename values provided).  When there are no filters
present, that means we should keep paths even if we don't "find a match"
against any of the filters.

Now, since the rename code was embedded in the same loop as the filter
checks, it unfortunately was also being checked against the
args.inclusive setting despite never setting whether it found a match.
That happened to work in the special case that there were no filtering
paths but only because of the special logic for that case.  Since
renaming only makes sense if --invert-paths is not specified, any path
we rename is one we always want to keep.  Make sure we do.

Reported-by: Nadège (@nagreme on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 2833ef275f filter-repo: throw an error if user specifies any path starting with a slash
All paths are intended to be relative paths, relative to the project
root, not to the filesystem root.  There have been a few people who
didn't understand this, and then ended up with fast-import crashes that
are not very clear.  Check for it early and throw a simple error message
instead.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 011c646ee8 filter-repo: suggest --no-local when cloning local repos
Cloning local repos by default makes a bunch of hardlinks, giving you a
non-packed repository, and leading folks to use and suggest --force.
That, of course, bypasses the important fresh clone checks to prevent
people from accidentally and irrecoverably deleting their non-backed-up
data.  Let's make it easier for people to avoid (and suggest) that
mistake.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 9928b7cb3e t9390: add missing '&&' in command chain
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren e11343e504 filter-repo: handle typechange modifications when first parent is pruned
Commit 509a624b (filter-repo: fix issue with pruning of empty commits,
2019-10-03) added code to get a new list of file changes when the first
parent was pruned.  However, this logic did not handle cases where one
of the file modifications was a typechange.  Add the necessary logic to
handle that case.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 4f84a74ada filter-repo: use more expensive prunability checks when needed
When users are inserting new objects into the stream, we cannot make as
many assumptions and need to do more careful checks for whether commits
become empty or not.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 85c8e3660d filter-repo: accelerate is_ancestor() for --analyze mode
The --analyze mode was extremely slow for the freebsd/freebsd repo on
github; digging in, the is_ancestor() function was being called a huge
number of times -- about 22 times per commit on average (and about 17
million times overall).  The analyze mode uses is_ancestor() to
determine whether a rename equivalency class should be broken (i.e.
renaming A->B means all versions of A and B are just different versions
of the same file, but if someone adds a new A in some commit which
contains the A->B rename in its history then this equivalence class no
longer holds).  Each is_ancestor() call potentially has to walk a tree
of dependencies all the way back to a sufficient depth where it can
realize that the commit cannot be an ancestor; this can be a very long
walk.

We can speed this up by keeping track of some previous is_ancestor()
results.  If commit F is not an ancestor of commit G, then F cannot be
an ancestor of children of G (unless that child has multiple parents;
but even in that case F can only be an ancestor through one of the
parents other than G).  Similarly, if F is an ancestor of commit G, then
F will always be an ancestor of any children of G.  Cache results from
previous calls to is_ancestor() and use them to accelerate subsequent
calls.
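
The caching idea can be sketched in Python as a memoized ancestry check. This is a minimal illustration under stated assumptions, not the code in filter-repo: the `parents` mapping, the `make_is_ancestor` factory, and the convention that a commit counts as its own ancestor are all choices made for the example.

```python
def make_is_ancestor(parents):
    """Hypothetical factory: build an is_ancestor() that remembers past answers.

    `parents` maps each commit id to a list of its parent commit ids.
    """
    cache = {}

    def is_ancestor(ancestor, commit):
        if ancestor == commit:
            return True
        key = (ancestor, commit)
        if key in cache:
            return cache[key]
        # A child only needs its parents' answers, which are cached after the
        # first query, so repeated walks back through history are avoided.
        # (Recursive for brevity; a very deep history would need an explicit
        # stack instead.)
        result = any(is_ancestor(ancestor, p) for p in parents.get(commit, ()))
        cache[key] = result
        return result

    return is_ancestor


# A has two children B and C, M merges them, and N sits on top of M.
parents = {"A": [], "B": ["A"], "C": ["A"], "M": ["B", "C"], "N": ["M"]}
is_ancestor = make_is_ancestor(parents)
answers = {
    "B->N": is_ancestor("B", "N"),
    "A->N": is_ancestor("A", "N"),
    "C->B": is_ancestor("C", "B"),
    "B->C": is_ancestor("B", "C"),
}
```

A negative answer for G is reused for G's children: the child only has to consult its other parents, which mirrors the multiple-parent caveat in the message.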

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 1dae85ee9a filter-repo: permit trailing slash for --[to-]subdirectory-filter argument
There was code to allow the argument of --to-subdirectory-filter and
--subdirectory-filter to have a trailing slash, but it was broken due to
a bug in python3's bytestring design: b'somestring/'[-1] != b'/',
despite that being the obvious expectation.  One either has to compare
b'somestring/'[-1:] to b'/' or else compare b'somestring/'[-1] to
b'/'[0].  So lame.  Note that this is essentially a follow-up to commit
385b0586ca ("filter-repo (python3): bytestr splicing and iterating is
different", 2019-04-27).
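
The behavior is easy to reproduce in a Python 3 interpreter. The variable names below are chosen for illustration only.

```python
path = b"somestring/"

last_as_int = path[-1]      # indexing bytes yields an int: 47, the code for "/"
last_as_bytes = path[-1:]   # slicing yields a bytes object: b"/"

index_matches = (path[-1] == b"/")      # False: an int never equals a bytes object
slice_matches = (path[-1:] == b"/")     # True
int_matches = (path[-1] == b"/"[0])     # True: both sides are the int 47
endswith_matches = path.endswith(b"/")  # True, and it avoids the pitfall entirely
```

Both workarounds named in the message give the expected result; `bytes.endswith()` is a third spelling that sidesteps indexing altogether.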

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 9d51a90648 filter-repo: fix pruning of empty commits with blob callbacks
Blob callbacks, either implicit (via e.g. --replace-text) or explicit,
can modify blobs in ways that make them match other blobs, which in turn
can result in some commits becoming empty.  We need to detect such cases
and ensure we prune these empty commits when --prune-empty=auto.

Reported-by: John Gietzen <john@gietzen.us>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 8994b4e55d filter-repo: fix bad column label in path-all-sizes.txt report
Reported-by: John Gietzen <john@gietzen.us>
Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 5e04dff097 filter-repo: add new --no-ff option
Some projects have a strict --no-ff merging policy.  With the default
behavior of --prune-degenerate, we can prune merge commits in a way that
transforms the history into a fast-forward merge.  Consider this
example:
  * There are two independent commits or branches, named B & C, which
    are both built on top of A so that history looks like this diagram:
        A
        \ \
         \ B
          \
           -C
  * Someone runs the following sequence of commands:
    * git checkout A
    * git merge --no-ff B
    * git merge --no-ff C
  * This will result in a history that looks like:
        A---AB---AC
        \ \ /   /
         \ B   /
          \   /
           -C-
  * Later, someone comes along and runs filter-repo, specifying to
    remove the only path(s) that were modified by B.  That would
    naturally remove commit B and the no-longer-necessary merge
    commit AB.  For someone using a strict no-ff policy, the desired
    history is
        A---AC
         \ /
          C
    However, the default handling for --prune-degenerate would
    notice that AC merely merges C into its own ancestor A, whereas
    the original AC merged C into something separate (namely, AB).
    So, it would say that AC has become degenerate and prune it,
    leaving the simple history of
        A
         \
          C
    For projects not using a strict no-ff policy, this simpler history
    is probably better, but for folks that want a strict no-ff policy,
    it is unfortunate.

Provide a --no-ff option to tweak the --prune-degenerate behavior so
that it ignores the first parent being an ancestor of another parent
(leaving the first parent unpruned even if it is or becomes degenerate
in this fashion).

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Karl Lenz caf85b68ec filter-repo: allow --dry-run and --debug to be used together
Prior to this commit, git-filter-repo could only be used with either the
--dry-run flag or the --debug flag, not both. When run in debug mode,
git-filter-repo expected to be able to read from the output stream,
which obviously isn't created when doing a dry run, so it stack traced
when it tried to use the non-existent output stream. This commit fixes
that bug with an equally simple sanity check for the existence of the
output stream when run in debug mode.

Signed-off-by: Karl Lenz <xorangekiller@gmail.com>
4 years ago
Karl Lenz 780c74b218 filter-repo: parse mailmap entries with no email address
The mailmap format parsed by the "git shortlog" command allows for
matching mailmap entries with no email address. This is admittedly an
edge case, because most Git commits will have an email address
associated with them as well as a name, but technically the address
isn't required, and "git shortlog" accommodates that in its mailmap
format. This commit teaches git-filter-repo to do the same thing.

Signed-off-by: Karl Lenz <xorangekiller@gmail.com>
4 years ago
Elijah Newren 7cfef09e9b filter-repo: warn users who try to use invalid path components
It's hard to be exhaustive, but if users try something like:
   --path-rename foo/bar/baz:.
or
   --path ../other-dir
then bad things happen.  In the first case, filter-repo will try to
ask fast-import to create a directory named '.' and move everything
from foo/bar/baz/ into it but of course '.' is a reserved directory
name so we can't create it.  In the second case, they are probably
running from a subdirectory, but filter-repo doesn't work from a
subdirectory.  I hard-coded the assumption that everything was in the
toplevel directory and all paths were relative from there pretty
early on.  So, if the user tries to use any of these components
anywhere, just throw an early error.
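
An early check of this kind might look like the following Python sketch. It is an assumption-laden illustration, not filter-repo's implementation: the names `check_path` and `is_valid` and the exact error text are hypothetical.

```python
def check_path(path: bytes) -> bytes:
    """Hypothetical validator: reject '.' and '..' path components."""
    for component in path.split(b"/"):
        if component in (b".", b".."):
            raise ValueError(
                f"invalid path component {component!r} in {path!r}"
            )
    return path


def is_valid(path: bytes) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        check_path(path)
    except ValueError:
        return False
    return True


# A --path-rename argument has the form OLD:NEW; each side is checked.
old, new = b"foo/bar/baz:.".split(b":", 1)
rename_ok = is_valid(old) and is_valid(new)
```

Only whole components are compared, so a name such as `src/.hidden` still passes while `.` and `../other-dir` are rejected before anything reaches fast-import.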

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren a9a93d9d83 filter-repo: actually fix issue with pruning of empty commits
In commit 509a624 (filter-repo: fix issue with pruning of empty commits,
2019-10-03), it was noted that when the first parent is pruned away,
then we need to generate a corrected list of file changes relative to
the new first parent.  Unfortunately, we did not apply our set of file
filters to that new list of file changes, causing us to possibly
introduce many unwanted files from the second parent into the history.
The testcase added at the time was rather lax and totally missed this
problem (which possibly exacerbated the original bug being fixed rather
than helping).  Tighten the testcase, and fix the error by filtering the
generated list of file changes.

Signed-off-by: Elijah Newren <newren@gmail.com>
4 years ago
Elijah Newren 64aa9359ed run_coverage: prefer coverage3 to python3-coverage
Some of the systems I ran on had a 'python3-coverage' and some had a
'coverage3' program.  More were of the latter name, but more
importantly, the upstream tarball only creates the latter name;
apparently the former was just added by some distros.  So, switch to the
more official name of the program.

Signed-off-by: Elijah Newren <newren@gmail.com>
5 years ago
Elijah Newren 904e03f963 filter-repo: workaround Windows' insistence that command args be strings
It appears that in addition to Windows requiring cwd be a string (and
not a bytestring), it also requires the command line arguments to be
unicode strings.  This appears to be a python-on-Windows issue at the
surface (attempts to quote things that assume the arguments are all
strings), but whether it's solely a python-on-Windows issue or there is
also a deeper Windows issue, we can work around this brain-damage by
extending the SubprocessWrapper slightly.  As with the cwd changes, only
apply this on Windows and not elsewhere because there are perfectly
legitimate reasons to pass non-unicode parameters (e.g. filenames that
are not valid unicode).
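
The platform-conditional conversion can be sketched in Python as below. This is not the SubprocessWrapper code itself; the helper name `prepare_args`, its `on_windows` parameter, and the choice of UTF-8 are assumptions made for the example.

```python
import os


def prepare_args(args, on_windows=(os.name == "nt")):
    """Hypothetical helper: decode bytes arguments to str only on Windows."""
    if not on_windows:
        # Elsewhere, leave arguments alone: a filename may legitimately be a
        # byte sequence that is not valid unicode.
        return list(args)
    # Assumes UTF-8; an undecodable argument would raise UnicodeDecodeError.
    return [a.decode("utf-8") if isinstance(a, bytes) else a for a in args]


command = [b"git", "log", b"--", b"caf\xc3\xa9.txt"]
windows_args = prepare_args(command, on_windows=True)
other_args = prepare_args(command, on_windows=False)
```

The resulting list would then be handed to a call such as `subprocess.Popen`. The `on_windows` flag is exposed as a parameter so both branches can be exercised on any platform.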

Signed-off-by: Elijah Newren <newren@gmail.com>
5 years ago