Some commits may have a valid author email, but no valid author name.
Old versions of git didn't enforce a non-empty name.
Setting the author data from the committer is wrong in this case.
Also add a test case for this to t9390.
Example: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c6295cdf656de63d6d1123def71daba6cd91939c
(en: replaced with a dedicated test instead of tweaking existing ones)
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
When the user specifies some kind of criteria to filter commits by (e.g.
--subdirectory-filter mysubdir), we rewrite parent commits that are
entirely filtered out to the most recent ancestor that still exists, or
just prune the parent if there isn't one. That works great when the
parent is a commit, but nested tags have parents that are tags. If we
only prune the first tag (i.e. the tag of a commit), then letting any
tags through that had that tag as a parent will result in a fast-import
crash with a message of the form
fatal: mark :35390 not declared
Ensure that when a tag gets pruned, the pruning is recorded as such, so
that any child tags will get pruned as well.
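The idea of recording pruned tags can be sketched roughly as follows. This is a simplified illustration, not filter-repo's actual code: the (tag_id, target_id) tuples, the `pruned` set, and the name prune_nested_tags are all hypothetical.

```python
def prune_nested_tags(tags, pruned):
    """Return the ids of tags that survive pruning.

    tags:   list of (tag_id, target_id) pairs in stream order (hypothetical model)
    pruned: set of ids of objects that were already filtered out
    """
    kept = []
    for tag_id, target_id in tags:
        if target_id in pruned:
            # Record the pruning, so a tag pointing at this tag is pruned too
            # instead of referencing a mark that was never declared.
            pruned.add(tag_id)
            continue
        kept.append(tag_id)
    return kept


pruned_ids = {1}                        # object 1 was filtered out
tag_list = [(2, 1), (3, 2), (5, 4)]     # tag 3 is a tag of tag 2
surviving = prune_nested_tags(tag_list, pruned_ids)
```

Without the `pruned.add(tag_id)` step, tag 3 in this example would be emitted with a reference to tag 2, which no longer exists.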
Signed-off-by: Elijah Newren <newren@gmail.com>
In repositories with annotated tags filter-lamely crashes with the
message: "NameError: name 'Reset' is not defined".
This is because of a missing "fr" module prefix in the code, which this
commit adds.
Signed-off-by: Marius Renner <marius@mariusrenner.de>
When filtering with --refs, parents can be a hash rather than an
integer. There was a code path in RepoFilter._prunable() that was
written assuming the first parent would always be an integer; fix it to
handle a hash as well.
Reported-by: Niklas Hambüchen <mail@nh2.me>
Signed-off-by: Elijah Newren <newren@gmail.com>
fast-import gained a new raw-permissive date format explicitly for
allowing people to import repositories as-is. Make use of the flag, and
stop rewriting the bogus timezone found in rails.git.
If users do not like these bogus times, they can of course write a
filter to fix them (or even make them bogus in a different way). For
example:
    git filter-repo ... --commit-callback '
        if commit.author_date.endswith(b"+051800"):
            commit.author_date = commit.author_date.replace(b"+051800", b"+0261")
    '
Signed-off-by: Elijah Newren <newren@gmail.com>
"no-op" might suggest that it doesn't do anything, when in reality it
does exactly what filter-repo does. Rename it to barebones-example.
Signed-off-by: Elijah Newren <newren@gmail.com>
Homebrew and scoop are both package managers and package repositories.
Fedora 32 is not a package manager, but does map to a package
repository. Clarify wording that the list from repology.org is a list
of package repositories, not package managers.
Signed-off-by: Elijah Newren <newren@gmail.com>
Useful commit and reference mappings are created on every run. These are
helpful in a number of situations, and should be documented so that
end-users and Git hosts can understand how to use the output.
The commit-map is particularly useful for Git hosts to override
retention mechanisms, like hidden refs. This allows end-users to purge
large files and sensitive data.
Signed-off-by: James Ramsay <james@jramsay.com.au>
Apparently, despite the fact that *overwrite* *repo* *history* are three
important words that each individually convey a lot of important
meaning, people ignore it and instinctively add --force. Insert the
word "destructively" to get people to pause.
Further, change the end of the warning so that, rather than explaining
how to get around the warning in the current repository, it suggests
operating on a fresh clone instead, and only then mentions as a side
comment that the --force flag can be used to override.
Signed-off-by: Elijah Newren <newren@gmail.com>
Online blogs/articles/Q&A as well as direct feedback suggest that
people use the --force flag rather cavalierly. Add words like
"irreversible" and "immediate pruning" to discourage such blithe
application of this flag. I hope this encourages folks to either learn
the ramifications of irreversible full-repository entire history
rewrites first, or to follow the recommendation of only operating on a
fresh clone.
Signed-off-by: Elijah Newren <newren@gmail.com>
People keep trying to read this file, unaware that it is the source code
for generating the documentation, not the generated documentation. Add
a comment at the top that explains this and points people in the right
direction.
Signed-off-by: Elijah Newren <newren@gmail.com>
Allow lines starting with '#' to be treated as a comment and be ignored.
Update the documentation to note that both blank lines and comment lines
are ignored, and mention how filenames starting with '#' can be matched
(namely, the same way that filenames starting with 'regex:', 'glob:',
or 'literal:' can be -- by prefixing the filename with 'literal:').
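A minimal sketch of the parsing rule described above, assuming one path specification per line. The function name parse_path_specs is hypothetical, and the sketch ignores every other feature of the real file format.

```python
def parse_path_specs(lines):
    """Return (match_type, pattern) pairs, skipping blank and comment lines."""
    specs = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comments are ignored
        for prefix in ("literal:", "glob:", "regex:"):
            if line.startswith(prefix):
                # Strip the prefix; what follows is taken verbatim, so
                # 'literal:#notes.txt' matches a file named '#notes.txt'.
                specs.append((prefix[:-1], line[len(prefix):]))
                break
        else:
            specs.append(("literal", line))
    return specs


example_lines = ["# keep the docs\n", "\n", "docs/\n", "glob:*.md\n",
                 "literal:#notes.txt\n"]
specs = parse_path_specs(example_lines)
```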
Signed-off-by: Elijah Newren <newren@gmail.com>
I added special code to filter-repo so that --path expressions could
match filenames or some leading directory name. --path-regex, since it
does not implicitly add anchorings, can also match a leading path, and
can thus be used to match against directories. --path-glob could not be
used to match a leading directory of a path, since fnmatch.fnmatch()
requires the full string to match. But users like being able to specify
directory names, such as '*/bin', so let's take any glob expression and
treat it as two: '<glob>' and '<glob>/*' and try to match against either
one; this will allow it to match against file or directory names like
the other two types of path matching.
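As a rough sketch of the two-pattern approach described above (the helper name glob_matches is hypothetical, and this is not necessarily how the real code is structured):

```python
import fnmatch


def glob_matches(path, glob):
    """Match a glob against a full path or against a leading directory of it."""
    # fnmatch.fnmatch() requires the whole string to match, so also try the
    # glob with '/*' appended to catch anything beneath a matching directory.
    return fnmatch.fnmatch(path, glob) or fnmatch.fnmatch(path, glob + b"/*")


in_bin_dir = glob_matches(b"tools/bin/run.sh", b"*/bin")     # via '*/bin/*'
exact_match = glob_matches(b"tools/bin", b"*/bin")           # via '*/bin'
not_a_match = glob_matches(b"tools/binary/x", b"*/bin")      # neither form matches
```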
Signed-off-by: Elijah Newren <newren@gmail.com>
Make use of `git --man-path` and `git --html-path` to simplify the
manual installation instructions a bit. Also, there appears to be a
site.getsitepackages() call in python to give similar information about
where git_filter_repo.py can be installed.
Signed-off-by: Elijah Newren <newren@gmail.com>
Flags like --local, --shared, --reference (and --dissociate), and
--origin would all mess up the fresh clone checker. Attempting to
defend against all of them would not only be costly, but make it harder
to draw the line about guesses as to whether a repository is a fresh
clone or not. --origin also has problems in that filter-repo has
special handling for the 'origin' remote that I don't want to apply to
other random remotes.
Flags like --depth, --single-branch, and --no-tags could prevent enough
data from being downloaded to do a full rewrite and result in a
partially rewritten or possibly even corrupt history (no idea how
shallow clones interact; probably badly). --filter would also make the
repo start without enough info though it'd at least be downloaded on
demand; it'd still be a really slow way to do it, though, so it's a bad
idea.
filter-repo doesn't really provide an easy mechanism to rewrite a repo
and its submodule simultaneously, so recursing submodules seems useless
and unhelpful. --shallow-submodules would be bad for at least the same
reasons --depth is for the parent module, assuming we handled
submodules. --remote-submodules just provides a way to make the repo
dirty to start, which is counter-productive. --jobs could be useful, if
recursing submodules was.
--no-checkout might be safe to use and --sparse might also be okay as
long as it only affects the working tree, but in both cases why not
go --bare or --mirror if you're doing that? Likewise, --no-hardlinks is
useless given that we're already saying people need to use --no-local.
-b would be okay to use, but why wouldn't you just change the default
branch on the server rather than just within this one clone used for
rewriting the history? Whether you push back to the original repository
or to a new repo, you'd have to take a separate step to change it in
that remote repo. And if you really will use this new local repository
as the official source, then you can switch branches at the end of the
rewrite just as easily.
--separate-git-dir and --template might be okay to use, I haven't
tested. If either doesn't work now, or breaks at any point in the
future, I feel much better being able to say, "I told you to only use
these three flags to git clone."
-u only affects the ability to receive the clone; it's fine to use.
Also, -q only affects the console output during the clone operation, so
you could use it.
There will probably be more flags added to git-clone over time. Testing
against all of them is insanity. Recommend people only use --no-local,
--bare, and --mirror, with the first only needed when cloning from a
local filesystem, and the other two never needed but allowed for those
that prefer.
Signed-off-by: Elijah Newren <newren@gmail.com>
This reverts commit df6c8652a2. The
motivating example was wrong; path renaming should not be involved in
path filtering, it only says how paths should be renamed if they happen
to be selected. A subsequent commit will improve the documentation.
Signed-off-by: Elijah Newren <newren@gmail.com>
The desire to format or lint code throughout history has arisen several
times. It's more natural to do this in filter-branch since it somewhat
forces people to run external commands. We do have an example contrib
demo that shows how to run an external command on each file in history,
which I created before any of these requests came in, and yet I still
periodically get requests about it.
Make lint-history ever-so-slightly easier to apply to a subset of
filenames, and include its usage as an extra cheat sheet comparison for
filter-branch-vs-filter-repo commands.
Signed-off-by: Elijah Newren <newren@gmail.com>
The purpose of this example is solely to show what to import and run to
recover filter-repo's behavior as-is. It doesn't modify any behavior,
but instead exists as an example so people can easily find a good
starting point for making their own modifications.
Signed-off-by: Elijah Newren <newren@gmail.com>
Allow external scripts that import git-filter-repo to change the value
of the default replacement text instead of having it hardcoded within
some function.
Signed-off-by: Elijah Newren <newren@gmail.com>
--replace-text is meant to replace _text_ throughout the repository, not
binary data. Use the same scheme as the lint-history script uses to
avoid applying the changes to binary blob data.
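A minimal sketch of such a guard, assuming the scheme is the common heuristic of treating a blob as binary when a NUL byte appears near its start. The 8192-byte window, the function names, and the replacement text used in the example are assumptions for illustration.

```python
def looks_binary(data, sniff_len=8192):
    """Heuristic: a NUL byte in the first sniff_len bytes means binary data."""
    return b"\0" in data[:sniff_len]


def replace_text(data, replacements):
    """Apply (old, new) byte replacements, leaving binary blobs untouched."""
    if looks_binary(data):
        return data
    for old, new in replacements:
        data = data.replace(old, new)
    return data


rules = [(b"hunter2", b"***REMOVED***")]
text_result = replace_text(b"password=hunter2\n", rules)
binary_result = replace_text(b"\x00\x01hunter2", rules)
```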
Reported-by: Tobias Gruetzmacher <tobias-git@23.gs>
Signed-off-by: Elijah Newren <newren@gmail.com>
Our showing of how to handle the simple example with different tools
combined three different tools into a single section which I think made
it slightly harder to read and follow. It also concentrated almost
exclusively on filter-branch. Provide a separate section for each tool,
and provide more details for BFG and fast-export/fast-import.
Signed-off-by: Elijah Newren <newren@gmail.com>
When using --source or --target in combination with filtering paths,
users were surprised at how large the resulting repository was. The
usage of --source and --target were turning off repacking; while we
don't want repacking for partial history rewrites and --source and
--target turn on some of the other features we want with partial history
rewrites, repacking is something that we still want turned on.
Reported-by: Alexey Volkov <alexey.volkov@ark-kun.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
When combining `--prune-degenerate never` with a `--path` specification,
we could end up trying to write a parent out to the fast-import stream
whose value was actually None. The problem occurs when the parents of
a merge commit are filtered out by the path specification, leaving us
only with no-longer-extant parents. In such a case, we need to filter
out these 'None' (i.e. invalid) parents. The point of
`--prune-degenerate never` is to avoid removing parents that are either
the same as or an ancestor of another parent, not to avoid removing
non-existent parents. Remove the non-existent parent(s).
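The distinction can be sketched like this, with parents modeled as a plain list of ids where None stands for a parent that was filtered out (a simplified, hypothetical model of the real data):

```python
def drop_missing_parents(parents):
    """Remove parents that no longer exist; keep everything else as-is."""
    # Only None entries are dropped. Duplicate or redundant parents stay,
    # since keeping those is the point of `--prune-degenerate never`.
    return [p for p in parents if p is not None]


one_left = drop_missing_parents([None, 12, None])
duplicates_kept = drop_missing_parents([7, 7, None])
```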
Reported-by: Gaurav Kanoongo (@gauravkanoongo on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
There's also a fix in here to make sure to throw an error if users are
trying to rename paths and use --invert-paths; it's not clear at all
what that would even mean. But that also becomes important later...
Due to the ability to either filter wanted paths (default), or to just
specify unwanted paths (with --invert-paths), I keep a special
args.inclusive variable to track whether a "match" means we want the
path or not. There are some special cases, notably when there are no
filters present (meaning e.g. no --path specifications, at most there
are some --path-rename values provided). When there are no filters
present, that means we should keep paths even if we don't "find a match"
against any of the filters.
Now, since the rename code was embedded in the same loop as the filter
checks, it unfortunately was also being checked against the
args.inclusive setting despite never setting whether it found a match.
That happened to work in the special case that there were no filtering
paths but only because of the special logic for that case. Since
renaming only makes sense if --invert-paths is not specified, any path
we rename is one we always want to keep. Make sure we do.
Reported-by: Nadège (@nagreme on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
I'm a little surprised that stdout buffers must have filled up on MacOS X, but
either way we don't have to wait for the '-h' processes to finish before
attempting to read stdout. In fact, since we weren't storing the returncode
attribute from calling p.wait(), there wasn't much point in doing so. Trying
to read all stdout all at once is going to implicitly take until the process
finishes anyway, so just do that.
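For illustration, a sketch of the read-first pattern using Python's subprocess module. The helper name read_all_stdout and the child command are made up for the example and do not come from the test suite.

```python
import subprocess
import sys


def read_all_stdout(cmd):
    """Run cmd and return everything it writes to stdout."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as p:
        # read() blocks until the child closes stdout, so there is no need
        # to wait() first; waiting first can hang once the pipe buffer fills.
        return p.stdout.read()


# A child that writes more than a typical pipe buffer holds.
child = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 200000)"]
output = read_all_stdout(child)
```

Leaving the `with` block closes the pipe and reaps the process, so no separate p.wait() call is needed.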
Reported-by: Benoit Lefèvre <contact@benoit-lefevre.org>
Signed-off-by: Elijah Newren <newren@gmail.com>
Users won't know which versions of git have --mark-tags, --reencode, or
--combined-all-paths options for fast-export and diff-tree. I didn't
either when I wrote those messages because it wasn't in a released
version of git. Now that they are in released versions and have been
for a while, we can simplify the messages to just state which git
version is needed.
Signed-off-by: Elijah Newren <newren@gmail.com>
I occasionally get people doing special things, or see people
recommending to others to just use --force. Add some explanations
behind the safety checks so that those doing special things know when
it's okay, and to explain why it's a really bad idea to casually or
haphazardly recommend others use --force.
Signed-off-by: Elijah Newren <newren@gmail.com>
Commit 011c646ee8 (filter-repo: suggest --no-local when cloning local
repos, 2020-05-15) added an additional message to the error to make it
more clear what to do when cloning local repos. However, if there was
no remote, then the code path would run os.path.isdir(None), triggering
a traceback. Fix the logic.
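A minimal sketch of the kind of guard involved. The function name is hypothetical; `remote_url` stands for whatever the remote lookup returned, or None when there is no remote.

```python
import os


def is_local_clone_source(remote_url):
    """Return True only if a remote exists and points at a local directory."""
    # Check for None before calling os.path.isdir(), which does not
    # accept None as an argument.
    return remote_url is not None and os.path.isdir(remote_url)


no_remote = is_local_clone_source(None)
local_dir = is_local_clone_source(os.getcwd())
```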
Signed-off-by: Elijah Newren <newren@gmail.com>