While it is true that `colrm` is available on macOS by default, and even
in Ubuntu (thanks to the `bsdmainutils` package), it is not available on
Windows.
Let's use `cut` instead.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The problem with this is that on Windows, we use the MSYS2 Bash which
uses the POSIX emulation layer called "MSYS2 runtime" that pretends that
there _is_ something like the `/dev/fd/` namespace, and tells `git.exe`
about it, but `git.exe` does not use the POSIX emulation layer, and
hence has no idea what Bash is talking about.
Besides, we should avoid pipes, just as we do in the Git project.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
In that test case, we expect the line count to be 5, but it is actually
6 lines that we should expect:
numbers/medium.num
numbers/small.num
sequence/know
whatever
words/know
Note the empty line at the top: this list is generated via `git log
--format=%n`, and that `%n` stands for "newline", meaning that we _must_
expect an empty line.
This expectation seems to have been broken already in the commit that
added the test case: b6a35f8 (filter-repo: implement
--strip-blobs-with-ids, 2019-05-30). It was hidden for such a long time
by a broken &&-chain, which we will fix next.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Some commits may have a valid author email, but no valid author name.
Old versions of git didn't enforce a non-empty name.
Setting the author data from the committer is wrong in this case.
Also add a test case for this to t9390.
Example: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c6295cdf656de63d6d1123def71daba6cd91939c
(en: replaced with a dedicated test instead of tweaking existing ones)
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
When the user specifies some kind of criteria to filter commits by (e.g.
--subdirectory-filter mysubdir), we rewrite parent commits that are
entirely filtered out to the most recent ancestor that still exists, or
just prune the parent if there isn't one. That works great when the
parent is a commit, but nested tags have parents that are tags. If we
only prune the first tag (i.e. the tag of a commit), then letting any
tags through that had that tag as a parent will result in a fast-import
crash with a message of the form
fatal: mark :35390 not declared
Ensure that when a tag gets pruned, the pruning is recorded as such, so
that any child tags will get pruned as well.
Signed-off-by: Elijah Newren <newren@gmail.com>
When filtering with --refs, parents can be a hash rather than an
integer. There was a code path in RepoFilter._prunable() that was
written assuming the first parent would always be an integer; fix it to
handle a hash as well.
Reported-by: Niklas Hambüchen <mail@nh2.me>
Signed-off-by: Elijah Newren <newren@gmail.com>
fast-import gained a new raw-permissive date format explicitly for
allowing people to import repositories as-is. Make use of the flag, and
stop rewriting the bogus timezone found in rails.git.
If users do not like these bogus times, they can of course write a
filter to fix them (or even make them bogus in a different way). For
example:
    git filter-repo ... --commit-callback '
        if commit.author_date.endswith(b"+051800"):
          commit.author_date = commit.author_date.replace(b"+051800", b"+0261")
        '
Signed-off-by: Elijah Newren <newren@gmail.com>
Allow lines starting with '#' to be treated as a comment and be ignored.
Update the documentation to note that both blank lines and comment lines
are ignored, and mention how filenames starting with '#' can be matched
(namely, the same way that filenames starting with 'regex:', 'glob:',
or 'literal:' can be -- by prefixing the filename with 'literal:').
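
The rule above can be sketched roughly as follows. This is a minimal
illustration, not filter-repo's actual parser: parse_paths_lines is a
hypothetical helper, and it assumes that a line without a prefix is
treated as a literal path.

```python
def parse_paths_lines(lines):
    """Yield (match_type, pattern) pairs from the lines of a paths file.

    Hypothetical helper for illustration; not filter-repo's real parser.
    """
    for raw in lines:
        line = raw.rstrip(b"\n")
        # Blank lines and comment lines are ignored.
        if not line or line.startswith(b"#"):
            continue
        for prefix in (b"literal:", b"glob:", b"regex:"):
            if line.startswith(prefix):
                yield prefix[:-1].decode(), line[len(prefix):]
                break
        else:
            # Assumption: an unprefixed line is a literal path.
            yield "literal", line
```

With this shape, a file named '#notes' is still matchable by writing
'literal:#notes', because the comment check only looks at the first
character of the raw line.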
Signed-off-by: Elijah Newren <newren@gmail.com>
This reverts commit df6c8652a2. The
motivating example was wrong; path renaming should not be involved in
path filtering, it only says how paths should be renamed if they happen
to be selected. A subsequent commit will improve the documentation.
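
As a rough sketch of that separation: selection happens first and
ignores renames, and renames then apply only to paths that were
selected. The name select_and_rename, the prefix-based matching, and
the argument shapes are all illustrative assumptions, not the real
implementation.

```python
def select_and_rename(path, wanted_prefixes, renames):
    """Return the (possibly renamed) path, or None if it is filtered out.

    wanted_prefixes: list of bytes prefixes to keep (empty means keep all).
    renames: list of (old_prefix, new_prefix) pairs.
    """
    # Step 1: filtering. Rename rules play no part in this decision.
    if wanted_prefixes and not any(path.startswith(p) for p in wanted_prefixes):
        return None
    # Step 2: renaming, applied only to paths that survived step 1.
    for old, new in renames:
        if path.startswith(old):
            return new + path[len(old):]
    return path
```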
Signed-off-by: Elijah Newren <newren@gmail.com>
There's also a fix in here to make sure to throw an error if users are
trying to rename paths and use --invert-paths; it's not clear at all
what that would even mean. But that also becomes important later...
Due to the ability to either filter wanted paths (default), or to just
specify unwanted paths (with --invert-paths), I keep a special
args.inclusive variable to track whether a "match" means we want the
path or not. There are some special cases, notably when there are no
filters present (meaning e.g. no --path specifications, at most there
are some --path-rename values provided). When there are no filters
present, that means we should keep paths even if we don't "find a match"
against any of the filters.
Now, since the rename code was embedded in the same loop as the filter
checks, it unfortunately was also being checked against the
args.inclusive setting despite never setting whether it found a match.
That happened to work in the special case that there were no filtering
paths but only because of the special logic for that case. Since
renaming only makes sense if --invert-paths is not specified, any path
we rename is one we always want to keep. Make sure we do.
Reported-by: Nadège (@nagreme on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
All paths are intended to be relative paths, relative to the project
root, not to the filesystem root. There have been a few people who
didn't understand this, and then ended up with fast-import crashes that
are not very clear. Check for it early and throw a simple error message
instead.
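
A minimal sketch of such an early check, assuming paths are handled as
bytestrings; require_relative and the wording of the message are
hypothetical.

```python
def require_relative(path: bytes) -> bytes:
    """Return path unchanged, or raise if it is an absolute path."""
    if path.startswith(b"/"):
        raise ValueError(
            f"Error: {path!r} is absolute; paths must be relative to the project root"
        )
    return path
```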
Signed-off-by: Elijah Newren <newren@gmail.com>
Cloning local repos by default makes a bunch of hardlinks, giving you a
non-packed repository, and leading folks to use and suggest --force.
That, of course, bypasses the important fresh clone checks to prevent
people from accidentally and irrecoverably deleting their non-backed-up
data. Let's make it easier for people to avoid (and suggest) that
mistake.
Signed-off-by: Elijah Newren <newren@gmail.com>
Commit 509a624b (filter-repo: fix issue with pruning of empty commits,
2019-10-03) added code to get a new list of file changes when the first
parent was pruned. However, this logic did not handle cases where one
of the file modifications was a typechange. Add the necessary logic to
handle that case.
Signed-off-by: Elijah Newren <newren@gmail.com>
When users are inserting new objects into the stream, we cannot make as
many assumptions and need to do more careful checks for whether commits
become empty or not.
Signed-off-by: Elijah Newren <newren@gmail.com>
The --analyze mode was extremely slow for the freebsd/freebsd repo on
github; digging in, the is_ancestor() function was being called a huge
number of times -- about 22 times per commit on average (and about 17
million times overall). The analyze mode uses is_ancestor() to
determine whether a rename equivalency class should be broken (i.e.
renaming A->B means all versions of A and B are just different versions
of the same file, but if someone adds a new A in some commit which
contains the A->B rename in its history then this equivalence class no
longer holds). Each is_ancestor() call potentially has to walk a tree
of dependencies all the way back to a sufficient depth where it can
realize that the commit cannot be an ancestor; this can be a very long
walk.
We can speed this up by keeping track of some previous is_ancestor()
results. If commit F is not an ancestor of commit G, then F cannot be
an ancestor of children of G (unless that child has multiple parents;
but even in that case F can only be an ancestor through one of the
parents other than G). Similarly, if F is an ancestor of commit G, then
F will always be an ancestor of any children of G. Cache results from
previous calls to is_ancestor() and use them to accelerate subsequent
calls.
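
The caching idea can be sketched with a toy graph like the one below.
This is an illustration under simplifying assumptions, not the actual
AncestryGraph code: commits are plain strings, parents must be added
before their children, a commit counts as its own ancestor, and
recursion is used for brevity (a very long history would need an
explicit stack instead).

```python
class AncestryGraph:
    """Toy commit graph with a memoized is_ancestor(); a sketch only."""

    def __init__(self):
        self.parents = {}  # commit -> list of parent commits
        self.depth = {}    # commit -> 1 + max depth of its parents
        self._cache = {}   # (ancestor, commit) -> bool

    def add_commit(self, commit, parents):
        self.parents[commit] = list(parents)
        self.depth[commit] = 1 + max((self.depth[p] for p in parents), default=0)

    def is_ancestor(self, ancestor, commit):
        """Return True if `ancestor` is `commit` or is reachable from it."""
        key = (ancestor, commit)
        if key not in self._cache:
            if commit == ancestor:
                result = True
            elif self.depth[commit] <= self.depth[ancestor]:
                # A proper descendant always has strictly greater depth,
                # so the walk can stop here.
                result = False
            else:
                # The answer for a child follows from the (cached)
                # answers for its parents.
                result = any(self.is_ancestor(ancestor, p)
                             for p in self.parents[commit])
            self._cache[key] = result
        return self._cache[key]
```

Once a (F, G) result is cached, a later query for a child of G only has
to look at that child's parents rather than walking the history again.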
Signed-off-by: Elijah Newren <newren@gmail.com>
There was code to allow the argument of --to-subdirectory-filter and
--subdirectory-filter to have a trailing slash, but it was broken due to
a bug in python3's bytestring design: b'somestring/'[-1] != b'/',
despite that being the obvious expectation. One either has to compare
b'somestring/'[-1:] to b'/' or else compare b'somestring/'[-1] to
b'/'[0]. So lame. Note that this is essentially a follow-up to commit
385b0586ca ("filter-repo (python3): bytestr splicing and iterating is
different", 2019-04-27).
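
The behavior in question, shown directly. strip_trailing_slash is a
hypothetical helper illustrating the slice-based comparison; it is not
the function used in filter-repo.

```python
path = b"somestring/"

# Indexing a bytes object yields an int, so this comparison is never equal.
assert path[-1] != b"/"
assert path[-1] == 47  # ord("/")

# Slicing yields bytes, and indexing both sides compares int to int.
assert path[-1:] == b"/"
assert path[-1] == b"/"[0]


def strip_trailing_slash(p: bytes) -> bytes:
    """Drop one trailing slash, if present (hypothetical helper)."""
    return p[:-1] if p[-1:] == b"/" else p
```

The slice form has the side benefit of being safe on an empty
bytestring, where p[-1] would raise IndexError.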
Signed-off-by: Elijah Newren <newren@gmail.com>
Blob callbacks, either implicit (via e.g. --replace-text) or explicit,
can modify blobs in ways that make them match other blobs, which in turn
can result in some commits becoming empty. We need to detect such cases
and ensure we prune these empty commits when --prune-empty=auto.
Reported-by: John Gietzen <john@gietzen.us>
Signed-off-by: Elijah Newren <newren@gmail.com>
Some projects have a strict --no-ff merging policy. With the default
behavior of --prune-degenerate, we can prune merge commits in a way that
transforms the history into a fast-forward merge. Consider this
example:
* There are two independent commits or branches, named B & C, which
are both built on top of A so that history looks like this diagram:
      A
       \ \
        \ B
         \
          -C
* Someone runs the following sequence of commands:
* git checkout A
* git merge --no-ff B
* git merge --no-ff C
* This will result in a history that looks like:
      A---AB---AC
       \ \ /   /
        \ B   /
         \   /
          -C-
* Later, someone comes along and runs filter-repo, specifying to
remove the only path(s) that were modified by B. That would
naturally remove commit B and the no-longer-necessary merge
commit AB. For someone using a strict no-ff policy, the desired
history is
      A---AC
       \ /
        C
However, the default handling for --prune-degenerate would
notice that AC merely merges C into its own ancestor A, whereas
the original AC merged C into something separate (namely, AB).
So, it would say that AC has become degenerate and prune it,
leaving the simple history of
      A
       \
        C
For projects not using a strict no-ff policy, this simpler history
is probably better, but for folks that want a strict no-ff policy,
it is unfortunate.
Provide a --no-ff option to tweak the --prune-degenerate behavior so
that it ignores the first parent being an ancestor of another parent
(leaving the first parent unpruned even if it is or becomes degenerate
in this fashion).
Signed-off-by: Elijah Newren <newren@gmail.com>
Prior to this commit, git-filter-repo could only be used with either the
--dry-run flag or the --debug flag, not both. When run in debug mode,
git-filter-repo expected to be able to read from the output stream,
which obviously isn't created when doing a dry run, so it stack traced
when it tried to use the non-existent output stream. This commit fixes
that bug with an equally simple sanity check for the existence of the
output stream when run in debug mode.
Signed-off-by: Karl Lenz <xorangekiller@gmail.com>
The mailmap format parsed by the "git shortlog" command allows for
matching mailmap entries with no email address. This is admittedly an
edge case, because most Git commits will have an email address
associated with them as well as a name, but technically the address
isn't required, and "git shortlog" accommodates that in its mailmap
format. This commit teaches git-filter-repo to do the same thing.
Signed-off-by: Karl Lenz <xorangekiller@gmail.com>
It's hard to be exhaustive, but if users try something like:
--path-rename foo/bar/baz:.
or
--path ../other-dir
then bad things happen. In the first case, filter-repo will try to
ask fast-import to create a directory named '.' and move everything
from foo/bar/baz/ into it but of course '.' is a reserved directory
name so we can't create it. In the second case, they are probably
running from a subdirectory, but filter-repo doesn't work from a
subdirectory. I hard-coded the assumption that everything was in the
toplevel directory and all paths were relative from there pretty
early on. So, if the user tries to use any of these components
anywhere, just throw an early error.
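
A minimal sketch of that kind of early check, assuming paths are
bytestrings split on '/'; reject_dot_components and its message are
hypothetical.

```python
def reject_dot_components(path: bytes) -> bytes:
    """Return path unchanged, or raise if any component is '.' or '..'."""
    for component in path.split(b"/"):
        if component in (b".", b".."):
            raise ValueError(
                f"Error: invalid path component {component!r} found in {path!r}"
            )
    return path
```

For a rename such as foo/bar/baz:. the check would be applied to each
side of the colon separately.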
Signed-off-by: Elijah Newren <newren@gmail.com>
In commit 509a624 (filter-repo: fix issue with pruning of empty commits,
2019-10-03), it was noted that when the first parent is pruned away,
then we need to generate a corrected list of file changes relative to
the new first parent. Unfortunately, we did not apply our set of file
filters to that new list of file changes, causing us to possibly
introduce many unwanted files from the second parent into the history.
The testcase added at the time was rather lax and totally missed this
problem (which possibly exacerbated the original bug being fixed rather
than helping). Tighten the testcase, and fix the error by filtering the
generated list of file changes.
Signed-off-by: Elijah Newren <newren@gmail.com>
Some of the systems I ran on had a 'python3-coverage' and some had a
'coverage3' program. More were of the latter name, but more
importantly, the upstream tarball only creates the latter name;
apparently the former was just added by some distros. So, switch to the
more official name of the program.
Signed-off-by: Elijah Newren <newren@gmail.com>
It appears that in addition to Windows requiring cwd be a string (and
not a bytestring), it also requires the command line arguments to be
unicode strings. This appears to be a python-on-Windows issue at the
surface (its attempts to quote things assume the arguments are all
strings), but whether it's solely a python-on-Windows issue or there is
also a deeper Windows issue, we can work around this brain-damage by
extending the SubprocessWrapper slightly. As with the cwd changes, only
apply this on Windows and not elsewhere because there are perfectly
legitimate reasons to pass non-unicode parameters (e.g. filenames that
are not valid unicode).
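
The argument conversion could look roughly like this. decodify is a
hypothetical helper name, and the sketch assumes UTF-8 and that the
command is given either as a single str or as a list.

```python
def decodify(args):
    """Return args with any bytes elements decoded to str.

    Hypothetical helper; assumes UTF-8 and a str or list argument.
    """
    if isinstance(args, str):
        return args
    return [a.decode("utf-8") if isinstance(a, bytes) else a for a in args]
```

Because the decode raises on byte sequences that are not valid UTF-8,
applying it unconditionally would break exactly the non-unicode
filename case mentioned above, which is why it is limited to Windows.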
Signed-off-by: Elijah Newren <newren@gmail.com>
Unfortunately, it appears that Windows does not allow the 'cwd' argument
of various subprocess calls to be a bytestring. That may be functional
on Windows since Windows-related filesystems are allowed to require that
all file and directory names be valid unicode, but not all platforms
enforce such restrictions. As such, I certainly cannot change
cwd=directory
to
cwd=decode(directory)
because that could break on other platforms (and perhaps even on Windows
if someone is trying to read a non-native filesystem). Instead, create
a SubprocessWrapper class that will always call decode on the cwd
argument before passing along to the real subprocess class. Use these
wrappers on Windows, and do not use them elsewhere.
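
A minimal sketch of the wrapper idea, covering only the cwd handling
and only two of the subprocess entry points. The decode helper, the
_fix_cwd name, and the subproc alias are assumptions made for this
illustration.

```python
import platform
import subprocess


def decode(value):
    """Decode bytes as UTF-8; pass anything else through unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


class SubprocessWrapper:
    """Forward to subprocess after converting a bytes cwd to str."""

    @staticmethod
    def _fix_cwd(kwargs):
        if kwargs.get("cwd") is not None:
            kwargs["cwd"] = decode(kwargs["cwd"])
        return kwargs

    @staticmethod
    def call(*args, **kwargs):
        return subprocess.call(*args, **SubprocessWrapper._fix_cwd(kwargs))

    @staticmethod
    def check_output(*args, **kwargs):
        return subprocess.check_output(*args, **SubprocessWrapper._fix_cwd(kwargs))


# Use the wrapper only on Windows; elsewhere a bytes cwd passes through
# untouched, so directory names that are not valid unicode keep working.
subproc = SubprocessWrapper if platform.system() == "Windows" else subprocess
```

Callers then write subproc.call(cmd, cwd=directory) everywhere and do
not need to know which platform they are on.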
Signed-off-by: Elijah Newren <newren@gmail.com>
Note that this isn't a version *number* or even the more generalized
version string that folks are used to seeing, but a version hash (or
leading portion thereof).
A few important points:
* These version hashes are not strictly monotonically increasing
values. Like I said, these aren't version numbers. If that
bothers you, read on...
* This scheme has incredibly nice semantics satisfying a pair of
properties that most version schemes would assume are mutually
incompatible:
This scheme works even if the user doesn't have a clone of
filter-repo and doesn't require any build step to inject the
version into the program; it works even if people just download
git-filter-repo.py off GitHub without any of the other sources.
And:
This scheme means that a user is running precisely version X of
the code, with the version not easily faked or misrepresented
when third parties edit the code.
Given the wonderful semantics provided by satisfying this pair of
properties that all other versioning schemes seem to miss out on, I
think I should name this scheme. How about "Semantic Versioning"?
(Hehe...)
* The version hash is super easy to use; I just go to my own clone of
filter-repo and run either:
git show $VERSION_HASH
or
git describe $VERSION_HASH
* A human consumable version might suggest to folks that this software
is something they might frequently use and upgrade. This program
should only be used in exceptional cases (because rewriting history
is not for the faint of heart).
* A human consumable version (i.e. a version number or even the
more relaxed version strings in more common use) might suggest to
folks that they can rely on strict backward compatibility. It's
nice to subtly undercut any such assumption.
* Despite all that, I will make releases (downloadable tarballs with
real version numbers in the tarball name; I'm just going to re-use
whatever version git is released with at the time). But those
version numbers won't be used by the --version option; instead the
version hash will.
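
The message above does not spell out how the version hash is derived.
One scheme that would have both properties is to report the git blob id
of the script's own contents. The sketch below assumes exactly that; it
is an assumption for illustration, as are the names git_blob_hash and
version_of, the 12-character abbreviation, and the use of SHA-1 object
ids.

```python
import hashlib


def git_blob_hash(contents: bytes) -> str:
    """Return the SHA-1 object id git assigns to a blob with these contents."""
    header = b"blob %d\0" % len(contents)
    return hashlib.sha1(header + contents).hexdigest()


def version_of(path: str, length: int = 12) -> str:
    """Return an abbreviated blob id for the file at `path`."""
    with open(path, "rb") as f:
        return git_blob_hash(f.read())[:length]
```

Under that assumption no build step is needed, since the hash is
computed from the file as it sits on disk, and any edit to the file
changes the reported value.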
Signed-off-by: Elijah Newren <newren@gmail.com>
In order to build the correct tree for a commit, git-fast-import always
takes a list of file changes for a merge commit relative to the first
parent.
When the entire first-parent history of a merge commit is pruned away
and the merge had paths with no difference relative to the first parent
but which differed relative to later parents, then we really need to
generate a new list of file changes in order to have one of those other
parents become the new first parent. An example might help clarify...
Let's say that there is a merge commit, and:
* it resolved differences in pathA between its two parents by taking
the version of pathA from the first parent.
* pathB was added in the history of the second parent (it is not
present in the first parent) and is NOT included in the merge commit
(either being deleted, or via rename treated as deleted and added as
something else)
For this merge commit, neither pathA nor pathB differ from the first
parent, and thus wouldn't appear in the list of file changes shown by
fast-export. However, when our filtering rules determine that the first
parent (and all its parents) should be pruned away, then the second
parent has to become the new first parent of the merge commit. But to
end up with the right files in the merge commit despite using a
different parent, we need a list of file changes that specifies the
changes for both pathA and pathB.
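
The pathA/pathB example can be sketched with trees modeled as plain
dicts from path to blob id. This is a simplification for illustration
(file modes are ignored, and file_changes is a hypothetical helper),
not the code used to query git.

```python
def file_changes(parent_tree, merge_tree):
    """Return the changes that turn parent_tree into merge_tree.

    Each tree is a dict mapping path -> blob id. The result is a list
    of ("M", path, blob_id) and ("D", path, None) tuples sorted by path.
    """
    changes = []
    for path in sorted(set(parent_tree) | set(merge_tree)):
        old = parent_tree.get(path)
        new = merge_tree.get(path)
        if old == new:
            continue
        if new is None:
            changes.append(("D", path, None))
        else:
            changes.append(("M", path, new))
    return changes


first_parent = {b"pathA": "a1"}
second_parent = {b"pathA": "a2", b"pathB": "b1"}
merge = {b"pathA": "a1"}

# Relative to the original first parent, nothing changed.
assert file_changes(first_parent, merge) == []
# Relative to the second parent, both paths need an entry.
assert file_changes(second_parent, merge) == [
    ("M", b"pathA", "a1"),
    ("D", b"pathB", None),
]
```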
Signed-off-by: Elijah Newren <newren@gmail.com>
Allow folks to periodically update the export of a live repo without
re-exporting from the beginning. This is a performance improvement, but
can also be important for collaboration. For example, for sensitivity
reasons, folks might want to export a subset of a repo and update the
export periodically. While this could be done by just re-exporting the
repository anew each time, there is a risk that the paths used to
specify the wanted subset might need to change in the future; making the
user verify that their paths (including globs or regexes) don't also
pick up anything from history that was previously excluded so that they
don't get a divergent history is not very user friendly. Allowing them
to just export stuff that is new since the last export works much better
for them.
Signed-off-by: Elijah Newren <newren@gmail.com>
Commit 346f2ba891 (filter-repo: make reencoding of commit messages
togglable, 2019-05-11) made reencoding of commit messages togglable but
forgot to add parsing and outputting of the encoding header itself. Add
such ability now.
Signed-off-by: Elijah Newren <newren@gmail.com>
External rewrite tools using filter-repo as a library may want to add
additional objects into the stream. Some examples in t/t9391 did this
using an internal _output field and using syntax that did not seem so
clear. Provide an insert() method for doing this, and convert existing
cases over to it.
Signed-off-by: Elijah Newren <newren@gmail.com>
When we prune a commit for being empty, there is no update to the branch
associated with the commit in the fast-import stream. If the parent
commit had been associated with a different branch, then the branch
associated with the pruned commit would not be updated without
additional measures. In the past, we resolved this by recording that
the branch needed an update in _seen_refs. While this works, it is a
bit more complicated than just issuing an immediate Reset. Also, note
that we need to avoid calling callbacks on that Reset because those
could rename branches (again, if the commit-callback already renamed
once) causing us to not update the intended branch.
There was actually one testcase where the old method didn't work: when a
branch was pruned away to nothing. A testcase accidentally encoded the
wrong behavior, hiding this problem. Fix the testcase to check for
correct behavior.
Signed-off-by: Elijah Newren <newren@gmail.com>
Add a flag allowing for specifying a file filled with blob-ids which
will be stripped from the repository.
Signed-off-by: Elijah Newren <newren@gmail.com>
Fix a few issues and add a token testcase for partial repo filtering.
Add a note about how I think this is not a particularly interesting or
core usecase for filter-repo, even if I have put some good effort into
the fast-export side to ensure it worked. If there is a core usecase
that can be addressed without causing usability problems (particularly
the "don't mix old and new history" edict for normal rewrites), then
I'll be happy to add more testcases, document it better, etc.
Signed-off-by: Elijah Newren <newren@gmail.com>
Make several fixes around --source and --target:
* Explain steps we skip when source or target locations are specified
* Only write reports to the target directory, never the source
* Query target git repo for final ref values, not the source
* Make sure --debug messages avoid throwing TypeErrors due to mixing
strings and bytes
* Make sure to include entries in ref-map that weren't in the original
target repo
* Don't:
* worry about mixing old and new history (i.e. nuking refs
that weren't updated, expiring reflogs, gc'ing)
* attempt to map refs/remotes/origin/* -> refs/heads/*
* disconnect origin remote
* Continue (but only in target repo):
* fresh-clone sanity checks
* writing replace refs
* doing a 'git reset --hard'
Signed-off-by: Elijah Newren <newren@gmail.com>
Add a flag for filtering out blobs based on their size, and allow the
size to be specified using 'K', 'M', or 'G' suffixes.
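
A rough sketch of parsing such a size. parse_size is a hypothetical
helper, and treating the suffixes as powers of 1024 is an assumption
not stated above.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '512', '10K', '5M', or '2G' to bytes."""
    # Assumption: suffixes are binary multiples (1K = 1024 bytes).
    multipliers = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip()
    if text and text[-1] in multipliers:
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)
```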
Signed-off-by: Elijah Newren <newren@gmail.com>
Imperative form sounds better than --empty-pruning and
--degenerate-pruning, and it probably works better with command line
completion.
Signed-off-by: Elijah Newren <newren@gmail.com>
The reset directive can specify a commit hash for the 'from' directive,
which can be used to reset to a specific commit, or, if the hash is all
zeros, then it can be used to delete the ref. Support such operations.
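
For illustration, the stream text for such a reset could be produced as
below. format_reset and is_deletion are hypothetical helpers, and the
40-character null hash assumes SHA-1 object ids.

```python
from typing import Optional

NULL_HASH = b"0" * 40  # assumes SHA-1 object ids


def format_reset(ref: bytes, from_ref: Optional[bytes] = None) -> bytes:
    """Format a fast-import 'reset' command, with an optional 'from' line."""
    out = b"reset " + ref + b"\n"
    if from_ref is not None:
        out += b"from " + from_ref + b"\n"
    return out + b"\n"


def is_deletion(from_ref: Optional[bytes]) -> bool:
    """Return True if from_ref is a non-empty, all-zero hash."""
    return bool(from_ref) and from_ref.strip(b"0") == b""
```

For example, format_reset(b"refs/heads/old", NULL_HASH) yields a reset
whose 'from' line names the all-zero hash, which asks for the ref to be
deleted.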
Signed-off-by: Elijah Newren <newren@gmail.com>
For other programs importing git-filter-repo as a library and passing a
blob, commit, tag, or reset callback to RepoFilter, pass a second
parameter to these functions with extra metadata they might find useful.
For simplicity of implementation, this technically changes the calling
signature of the --*-callback functions passed on the command line, but
we hide that behind a _do_not_use_this_variable parameter for now, leave
it undocumented, and encourage folks who want to use it to write an
actual python program that imports git-filter-repo. In the future, we
may modify the --*-callback functions to not pass this extra parameter,
or if it is deemed sufficiently useful, then we'll rename the second
parameter and document it.
As already noted in our API compatibility caveat near the top of
git-filter-repo, I am not guaranteeing API backwards compatibility.
That especially applies to this metadata argument, other than the fact
that it'll be a dict mapping strings to some kind of value. I might add
more keys, rename them, change the corresponding value, or even remove
keys that used to be part of metadata.
Signed-off-by: Elijah Newren <newren@gmail.com>