426 Commits

Author SHA1 Message Date
Elijah Newren
9282a33a02 git-filter-repo.txt: regexes & globs apply to entire file, not to lines
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-10-19 08:10:08 -07:00
Elijah Newren
93ee4ae907 Merge branch 'mw/empty-author-name' into main
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-10-17 17:07:27 -07:00
Martin Wilck
282f8ddb9b filter-repo: only set author from committer if author email not set
Some commits may have a valid author email, but no valid author name.
Old versions of git didn't enforce a non-empty name.
Setting the author data from the committer is wrong in this case.

Also add a test case for this to t9390.

Example: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c6295cdf656de63d6d1123def71daba6cd91939c

(en: replaced with a dedicated test instead of tweaking existing ones)

Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-10-17 17:06:53 -07:00
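The rule described in 282f8ddb9b can be sketched roughly as follows. This is a minimal illustration, not filter-repo's actual code; the function name `resolve_author` and its parameters are hypothetical.

```python
def resolve_author(author_name, author_email, committer_name, committer_email):
    """Return the (name, email) pair to record as the author (hypothetical helper)."""
    # Fall back to the committer only when the author email is missing.
    # An empty author name alone is left as-is, since old versions of git
    # allowed it and the email is still valid.
    if not author_email:
        return committer_name, committer_email
    return author_name, author_email
```

With this rule, a commit with an empty author name but a valid author email keeps its own author data.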
Elijah Newren
7eaaf191de filter-repo: correctly prune nested tags not matching filtering criteria
When the user specifies some kind of criteria to filter commits by (e.g.
--subdirectory-filter mysubdir), we rewrite parent commits that are
entirely filtered out to the most recent ancestor that still exists, or
just prune the parent if there isn't one.  That works great when the
parent is a commit, but nested tags have parents that are tags.  If we
only prune the first tag (i.e. the tag of a commit), then letting any
tags through that had that tag as a parent will result in a fast-import
crash with a message of the form

   fatal: mark :35390 not declared

Ensure that when a tag gets pruned, the pruning is recorded as such...so
that any children tags will get pruned as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-10-17 12:14:18 -07:00
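The idea in 7eaaf191de, recording each pruned tag so that tags pointing at it are pruned too, might look like the sketch below. It is a simplified model under stated assumptions: tags arrive as hypothetical `(mark, target_mark)` pairs with targets listed before the tags that reference them, and `prune_tags` is an illustrative name rather than a filter-repo function.

```python
def prune_tags(tags, pruned_marks):
    """Drop tags whose target was pruned, propagating through nested tags.

    tags: list of (mark, target_mark) pairs, targets before their children.
    pruned_marks: marks already pruned (e.g. filtered-out commits).
    """
    pruned = set(pruned_marks)
    kept = []
    for mark, target in tags:
        if target in pruned:
            # Record the pruning so that tags of this tag get pruned as well.
            pruned.add(mark)
        else:
            kept.append((mark, target))
    return kept, pruned
```

Without the `pruned.add(mark)` step, a tag of a pruned tag would be emitted with a reference to a mark that was never declared, which is the kind of situation the fast-import error above reports.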
Elijah Newren
b1606ba8ac Merge branch 'mr/fix-filter-lamely-name-error' into main
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-10-07 10:52:54 -07:00
Elijah Newren
f9a54f36d9 Merge branch 'tm/fix-typo' into main
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-10-07 10:51:58 -07:00
Marius Renner
70f83c2526 filter-lamely: fix NameError because of forgotten fr module prefix
In repositories with annotated tags filter-lamely crashes with the
message: "NameError: name 'Reset' is not defined".

This is because of a missing "fr" module prefix in the code, which this
commit adds.

Signed-off-by: Marius Renner <marius@mariusrenner.de>
2020-10-06 16:27:39 +02:00
Tom Matthews
96959d1174 converting-from-bfg-repo-cleaner.md: fix typo
Signed-off-by: Tom Matthews <trcm@pm.me>
2020-10-06 11:04:47 +01:00
Elijah Newren
7b3e714b94 filter-repo (README): remove outdated 2.28.0-not-yet-released comment
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-07-27 11:34:33 -07:00
Elijah Newren
d79ea709b7 filter-repo: fix crash from assuming parent is an int
When filtering with --refs, parents can be a hash rather than an
integer.  There was a code path in RepoFilter._prunable() that was
written assuming the first parent would always be an integer; fix it to
handle a hash as well.

Reported-by: Niklas Hambüchen <mail@nh2.me>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-07-27 10:52:59 -07:00
Elijah Newren
4b452da4ef Merge branch 'jb/ignore-generated-docs' into main
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-07-27 10:03:01 -07:00
Elijah Newren
e4960a53f8 Fix undefined variable names
Reported-by: Christian Clauss <cclauss@me.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-07-27 09:49:43 -07:00
Jonas Bernoulli
8a8278701f .gitignore: ignore the generated documentation
Signed-off-by: Jonas Bernoulli <jonas@bernoul.li>
2020-07-09 13:47:45 +02:00
Elijah Newren
ed6f410088 Contributing.md: link to Nicolai Hähnle's code review comments
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-07-07 16:05:14 -06:00
Elijah Newren
cefeef1c0a filter-repo: use new --date-format=raw-permissive fast-import option
fast-import gained a new raw-permissive date format explicitly for
allowing people to import repositories as-is.  Make use of the flag, and
stop rewriting the bogus timezone found in rails.git.

If users do not like these bogus times, they can of course write a
filter to fix them (or even make them bogus in a different way).  For
example:

    git filter-repo ... --commit-callback '
      if commit.author_date.endswith(b"+051800"):
        commit.author_date = commit.author_date.replace(b"+051800", b"+0261")
    '

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-07-07 09:38:34 -06:00
Elijah Newren
debe52000d contrib: rename no-op-example to barebones-example
"no-op" might suggest that it doesn't do anything, when in reality it
does exactly what filter-repo does.  Rename it to barebones-example.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-23 19:59:39 -07:00
Elijah Newren
2f26e4bce5 INSTALL.md: wording clarification on what repology.org tracks
Homebrew and scoop are both package managers and package repositories.
Fedora 32 is not a package manager, but does map to a package
repository.  Clarify wording that the list from repology.org is a list
of package repositories, not package managers.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-23 09:38:57 -07:00
Elijah Newren
b74eb6b69d Merge branch 'jr/document-commit-and-ref-map' into main
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-23 08:41:32 -07:00
James Ramsay
f867bb6ad7 git-filter-repo.txt: document mapping output
Useful commit and reference mappings are created on every run. These are
helpful in a number of situations, and should be documented so that
end-users and Git hosts can understand how to use the output.

The commit-map is particularly useful for Git hosts to override
retention mechanisms, like hidden refs. This allows end-users to purge
large files and sensitive data.

Signed-off-by: James Ramsay <james@jramsay.com.au>
2020-06-23 10:12:57 +10:00
Elijah Newren
1e0c3ab3ae filter-repo: make fresh clone warning scarier
Apparently, despite the fact that *overwrite* *repo* *history* are three
important words that each individually convey a lot of important
meaning, people ignore it and instinctively add --force.  Insert the
word "destructively" to get people to pause.

Further, change the end of the warning: instead of explaining how to get
around the warning in the current repository, suggest operating on a
fresh clone, and only then mention as a side comment that the --force
flag can be used to override.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-20 23:51:22 -07:00
Elijah Newren
8abf8faec8 git-filter-repo.txt: be more forceful on the wording of --force
Online blogs/articles/Q&A as well as direct feedback suggests that
people use the --force flag rather cavalierly.  Add words like
"irreversible" and "immediate pruning" to discourage such blithe
application of this flag.  I hope this encourages folks to either learn
the ramifications of irreversible full-repository entire history
rewrites first, or to follow the recommendation of only operating on a
fresh clone.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-20 23:51:22 -07:00
Elijah Newren
f8c14d159c git-filter-repo.txt: point people at the generated documentation
People keep trying to read this file, unaware that it is the source code
for generating the documentation, not the generated documentation.  Add
a comment at the top that explains this and points people in the right
direction.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-20 23:51:22 -07:00
Elijah Newren
38e70b69e8 filter-repo: ignore comment lines in --paths-from-file
Allow lines starting with '#' to be treated as a comment and be ignored.
Update the documentation to note that both blank lines and comment lines
are ignored, and mention how filenames starting with '#' can be matched
(namely, the same way that filenames starting with 'regex:', 'glob:',
or 'literal:' can be -- by prefixing the filename with 'literal:').

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-20 09:26:38 -07:00
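The line handling described in 38e70b69e8 can be sketched as below. This is only an approximation: `parse_paths_lines` is a hypothetical name, unprefixed lines are assumed to be literal matches, and anything else the real file format supports is left out.

```python
def parse_paths_lines(lines):
    """Return (match_type, pattern) pairs, skipping blank and comment lines."""
    specs = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        for prefix in ("literal:", "glob:", "regex:"):
            if line.startswith(prefix):
                specs.append((prefix[:-1], line[len(prefix):]))
                break
        else:
            # Assumption: a line without a prefix is a literal path.
            specs.append(("literal", line))
    return specs
```

A filename that itself starts with '#' is written as `literal:#name`, so the comment check never sees a leading '#'.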
Elijah Newren
771404d656 filter-repo: allow globs to match file or directory names
I added special code to filter-repo so that --path expressions could
match filenames or some leading directory name.  --path-regex, since it
does not implicitly add anchorings, can also match a leading path, and
can thus be used to match against directories.  --path-glob could not be
used to match a leading directory of a path, since fnmatch.fnmatch()
requires the full string to match.  But users like being able to specify
directory names, such as '*/bin', so let's take any glob expression and
treat it as two: '<glob>' and '<glob>/*' and try to match against either
one; this will allow it to match against file or directory names like
the other two types of path matching.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-20 08:31:53 -07:00
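The approach in 771404d656, treating one glob as both '<glob>' and '<glob>/*', can be illustrated with a short sketch. The helper name `glob_matches` is hypothetical; only `fnmatch.fnmatch()` comes from the commit message.

```python
import fnmatch

def glob_matches(path, glob):
    """True if glob matches the whole path or one of its leading directories."""
    # fnmatch.fnmatch() requires the full string to match, so also try the
    # glob with '/*' appended to cover everything beneath a matching directory.
    return fnmatch.fnmatch(path, glob) or fnmatch.fnmatch(path, glob + "/*")
```

For example, `glob_matches("tools/bin/run.sh", "*/bin")` succeeds through the second pattern, while a path such as `tools/binary/x` matches neither.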
Elijah Newren
25b226b1de t9390: make tests individually re-runnable
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-18 21:49:19 -07:00
Elijah Newren
eb9ea17629 INSTALL.md: fix missing trailing backquote
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-14 14:37:45 -07:00
Elijah Newren
34f761734b INSTALL.md: simplify manual installation instructions
Make use of `git --man-path` and `git --html-path` to simplify the
manual installation instructions a bit.  Also, there appears to be a
site.getsitepackages() call in python to give similar information about
where git_filter_repo.py can be installed.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-13 22:25:53 -07:00
Elijah Newren
a238e3b7e6 git-filter-repo.txt: discourage use of random clone flags
Flags like --local, --shared, --reference (and --dissociate), and
--origin would all mess up the fresh clone checker.  Attempting to
defend against all of them would not only be costly, but make it harder
to draw the line about guesses as to whether a repository is a fresh
clone or not.  --origin also has problems in that filter-repo has
special handling for the 'origin' remote that I don't want to apply to
other random remotes.

Flags like --depth, --single-branch, and --no-tags could prevent enough
data from being downloaded to do a full rewrite and result in a
partially rewritten or possibly even corrupt history (no idea how
shallow clones interact; probably badly).  --filter would also make the
repo start without enough info though it'd at least be downloaded on
demand; it'd still be a really slow way to do it, though, so it's a bad
idea.

filter-repo doesn't really provide an easy mechanism to rewrite a repo
and its submodule simultaneously, so recursing submodules seems useless
and unhelpful.  --shallow-submodules would be bad for at least the same
reasons --depth is for the parent module, assuming we handled
submodules.  --remote-submodules just provides a way to make the repo
dirty to start, which is counter-productive.  --jobs could be useful, if
recursing submodules was.

--no-checkout might be safe to use and --sparse might also be okay as
long as it only affects the working tree, but in both cases why not
go --bare or --mirror if you're doing that?  Likewise, --no-hardlinks is
useless given that we're already saying people need to use --no-local.

-b would be okay to use, but why wouldn't you just change the default
branch on the server rather than just within this one clone used for
rewriting the history?  Whether you push back to the original repository
or to a new repo, you'd have to take a separate step to change it in
that remote repo.  And if you really will use this new local repository
as the official source, then you can switch branches at the end of the
rewrite just as easily.

--separate-git-dir and --template might be okay to use, I haven't
tested.  If either doesn't work now, or breaks at any point in the
future, I feel much better being able to say, "I told you to only use
these three flags to git clone."

-u only affects the ability to receive the clone; it's fine to use.
Also, -q only affects the console output during the clone operation, so
you could use it.

There will probably be more flags added to git-clone over time.  Testing
against all of them is insanity.  Recommend people only use --no-local,
--bare, and --mirror, with the first only needed when cloning from a
local filesystem, and the other two never needed but allowed for those
that prefer.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-13 21:06:08 -07:00
Elijah Newren
49d6f02ff8 filter-repo: clarify interactions between path filtering and path renaming
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-13 10:20:56 -07:00
Elijah Newren
3e1bff264c Revert "filter-repo: fix ugly bug with mixing path filtering and renaming"
This reverts commit df6c8652a2.  The
motivating example was wrong; path renaming should not be involved in
path filtering, it only says how paths should be renamed if they happen
to be selected.  A subsequent commit will improve the documentation.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-13 10:10:53 -07:00
Elijah Newren
a4c12253a8 git-filter-repo.txt: briefly explain steps for pushing to original url
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-08 20:17:20 -07:00
Elijah Newren
b8ebda97dd contrib: avoid applying --replace-text to binary files in bfg-ish
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-08 19:51:21 -07:00
Elijah Newren
86569ee7ac Contributing.md: add a small clarification about line coverage
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-08 08:25:53 -07:00
Elijah Newren
23bec32283 contrib, docs: make discovery of code formatting and linting easier
The desire to format or lint code throughout history has arisen several
times.  It's more natural to do this in filter-branch since it somewhat
forces people to run external commands, but we have an example contrib
demo that shows how to run an external command on each file in history
that I created even before any of these requests came in and yet I still
periodically get requests about it.

Make lint-history ever-so-slightly easier to apply to a subset of
filenames, and include its usage as an extra cheat sheet comparison for
filter-branch-vs-filter-repo commands.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 11:54:28 -07:00
Elijah Newren
bd2c9c4d4d contrib: new simple no-op-example
The purpose of this example is to solely show what to import and run to
recover filter-repo's behavior as-is.  It doesn't modify any behavior,
but instead exists as an example so people can easily find a good
starting point for making their own modifications.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 11:19:35 -07:00
Elijah Newren
caa05d15b4 filter-repo: make default replacement text a variable
Allow external scripts that import git-filter-repo to change the value
of the default replacement text instead of having it hardcoded within
some function.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 10:51:41 -07:00
Elijah Newren
31f00a9ff8 filter-repo: avoid applying --replace-text to binary files
--replace-text is meant to replace _text_ throughout the repository, not
binary data.  Use the same scheme as the lint-history script uses to
avoid applying the changes to binary blob data.

Reported-by: Tobias Gruetzmacher <tobias-git@23.gs>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 10:27:48 -07:00
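A guard of the kind 31f00a9ff8 describes might look like the following sketch. The specific heuristic, a NUL byte within the first 8192 bytes marking a blob as binary, is an assumption made for illustration; the commit message only says the lint-history scheme is reused. The function name is hypothetical as well.

```python
def replace_text_if_not_binary(data, replacements):
    """Apply (old, new) byte replacements unless the blob looks binary."""
    # Assumed heuristic: a NUL byte near the start marks binary content.
    if b"\0" in data[:8192]:
        return data
    for old, new in replacements:
        data = data.replace(old, new)
    return data
```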
Elijah Newren
859e66ae1c converting-from-filter-branch.md: add a small clarification
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 10:08:56 -07:00
Elijah Newren
d32f6258a8 converting-from-bfg-repo-cleaner.md: add a small clarification
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 10:02:31 -07:00
Elijah Newren
d87b665ed4 git-filter-repo.txt: connect --no-local and fresh clones more thoroughly
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-01 08:16:50 -07:00
Elijah Newren
469a3e10f2 filter-repo (README): separate sections for different tools
Our showing of how to handle the simple example with different tools
combined three different tools into a single section which I think made
it slightly harder to read and follow.  It also concentrated almost
exclusively on filter-branch.  Provide a separate section for each tool,
and provide more details for BFG and fast-export/fast-import.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-29 17:49:03 -07:00
Elijah Newren
8ba3566119 filter-repo (README): link cheat sheets from usage section too
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-29 17:49:03 -07:00
Elijah Newren
cdb7b77f07 filter-repo: repack with --source or --target
When using --source or --target in combination with filtering paths,
users were surprised at how large the resulting repository was.  The
usage of --source and --target was turning off repacking; while we
don't want repacking for partial history rewrites and --source and
--target turn on some of the other features we want with partial history
rewrites, repacking is something that we still want turned on.

Reported-by: Alexey Volkov <alexey.volkov@ark-kun.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-28 08:16:12 -07:00
Elijah Newren
2bfb9cf261 git-filter-repo.txt: fix extraneous space
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-27 23:47:50 -07:00
Elijah Newren
7b18e6d7f5 filter-repo: fix --prune-degenerate=never with path filtering
When combining `--prune-degenerate never` with a `--path` specification,
we could end up trying to write a parent out to the fast-import stream
whose value was actually None.  The problem occurs when the parents of
a merge commit are filtered out by the path specification, leaving us
only with no-longer-extant parents.  In such a case, we need to filter
out these 'None' (i.e. invalid) parents.  The point of
`--prune-degenerate never` is to avoid removing parents that are either
the same as or an ancestor of another parent, not to avoid removing
non-existent parents.  Remove the non-existent parent(s).

Reported-by: Gaurav Kanoongo (@gauravkanoongo on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-27 07:04:17 -07:00
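The distinction drawn in 7b18e6d7f5 can be shown with a tiny sketch, assuming parents are held in a plain list in which filtered-out parents appear as None. The helper name is hypothetical.

```python
def drop_missing_parents(parents):
    """Remove non-existent (None) parents, leaving all other parents untouched."""
    # Duplicate or ancestor parents are deliberately kept here; only parents
    # that no longer exist are removed.
    return [p for p in parents if p is not None]
```

Note that `drop_missing_parents([3, 3, None])` still yields `[3, 3]`: redundant parents survive, and only the invalid one is dropped.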
Elijah Newren
df6c8652a2 filter-repo: fix ugly bug with mixing path filtering and renaming
There's also a fix in here to make sure to throw an error if users are
trying to rename paths and use --invert-paths; it's not clear at all
what that would even mean.  But that also becomes important later...

Due to the ability to either filter wanted paths (default), or to just
specify unwanted paths (with --invert-paths), I keep a special
args.inclusive variable to track whether a "match" means we want the
path or not.  There are some special cases, notably when there are no
filters present (meaning e.g. no --path specifications, at most there
are some --path-rename values provided).  When there are no filters
present, that means we should keep paths even if we don't "find a match"
against any of the filters.

Now, since the rename code was embedded in the same loop as the filter
checks, it unfortunately was also being checked against the
args.inclusive setting despite never setting whether it found a match.
That happened to work in the special case that there were no filtering
paths but only because of the special logic for that case.  Since
renaming only makes sense if --invert-paths is not specified, any path
we rename is one we always want to keep.  Make sure we do.

Reported-by: Nadège (@nagreme on GitHub)
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-25 12:35:34 -07:00
Elijah Newren
0375758806 filter-repo: fix possible deadlock in sanity_check_args
I'm a little surprised that stdout buffers must have filled up on MacOS X, but
either way we don't have to wait for the '-h' processes to finish before
attempting to read stdout.  In fact, since we weren't storing the returncode
attribute from calling p.wait(), there wasn't much point in doing so.  Trying
to read all stdout all at once is going to implicitly take until the process
finishes anyway, so just do that.

Reported-by: Benoit Lefèvre <contact@benoit-lefevre.org>
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-25 11:00:09 -07:00
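The deadlock discussed in 0375758806 is a general hazard with pipes: if the parent waits for the child before reading, a child that fills the pipe buffer blocks forever. A minimal sketch of the safe ordering follows; `read_all_stdout` is a hypothetical helper, not the code in sanity_check_args.

```python
import subprocess

def read_all_stdout(cmd):
    """Run cmd and return everything it writes to stdout."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    # Read first: reading to EOF drains the pipe and implicitly lasts until
    # the child closes stdout, so a full buffer can never block the child.
    output = p.stdout.read()
    p.stdout.close()
    # Waiting afterwards only reaps the process.
    p.wait()
    return output
```

Calling `p.wait()` before `p.stdout.read()` would risk a hang whenever the output exceeds the pipe buffer size, which varies by platform.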
Elijah Newren
15494bba8a filter-repo: make git version requirement error message more direct
Users won't know which versions of git have --mark-tags, --reencode, or
--combined-all-paths options for fast-export and diff-tree.  I didn't
either when I wrote those messages because it wasn't in a released
version of git.  Now that they are in released versions and have been
for a while, we can simplify the messages to just state which git
version is needed.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-19 16:44:27 -07:00
Elijah Newren
1e2d0e91cb Documentation: add more detailed explanation of safety checks and --force
I occasionally get people doing special things, or see people
recommending to others to just use --force.  Add some explanations
behind the safety checks so that those doing special things know when
it's okay, and to explain why it's a really bad idea to casually or
haphazardly recommend others use --force.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-19 14:52:51 -07:00
Elijah Newren
3dfaf3874e filter-repo: fix --no-local error when there is no remote
Commit 011c646ee8 (filter-repo: suggest --no-local when cloning local
repos, 2020-05-15) added an additional message to the error to make it
more clear what to do when cloning local repos.  However, if there was
no remote, then the code path would run os.path.isdir(None), triggering
a traceback.  Fix the logic.

Signed-off-by: Elijah Newren <newren@gmail.com>
2020-05-18 23:19:27 -07:00
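The kind of guard 3dfaf3874e describes could be sketched like this, assuming the remote URL is None when no remote is configured. The function name `is_local_remote` is hypothetical.

```python
import os

def is_local_remote(remote_url):
    """True only if a remote exists and points at a local directory."""
    # Check for None first so os.path.isdir() never receives a non-path value.
    return remote_url is not None and os.path.isdir(remote_url)
```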