Some projects have a strict --no-ff merging policy. With the default
behavior of --prune-degenerate, we can prune merge commits in a way that
transforms the history into a fast-forward merge. Consider this
example:
* There are two independent commits or branches, named B & C, which
are both built on top of A so that history looks like this diagram:
    A
     \ \
      \ B
       \
        -C
* Someone runs the following sequence of commands:
* git checkout A
* git merge --no-ff B
* git merge --no-ff C
* This will result in a history that looks like:
    A---AB---AC
     \ \ / /
      \ B /
       \ /
        -C-
* Later, someone comes along and runs filter-repo, specifying to
remove the only path(s) that were modified by B. That would
naturally remove commit B and the no-longer-necessary merge
commit AB. For someone using a strict no-ff policy, the desired
history is
    A---AC
     \ /
      C
However, the default handling for --prune-degenerate would
notice that AC merely merges C into its own ancestor A, whereas
the original AC merged C into something separate (namely, AB).
So, it would say that AC has become degenerate and prune it,
leaving the simple history of
    A
     \
      C
For projects not using a strict no-ff policy, this simpler history
is probably better, but for folks that want a strict no-ff policy,
it is unfortunate.
Provide a --no-ff option to tweak the --prune-degenerate behavior so
that it ignores the first parent being an ancestor of another parent
(leaving the first parent unpruned even if it is or becomes degenerate
in this fashion).
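To illustrate the rule, here is a minimal Python sketch. The names trim_parents, is_ancestor, and no_ff are hypothetical and are not taken from filter-repo's code; the sketch only models the "one parent is an ancestor of another parent" case described above.

```python
def trim_parents(parents, is_ancestor, no_ff=False):
    """Drop parents that are ancestors of another parent.

    parents:     list of commit names, first parent first
    is_ancestor: callable(a, b) -> True if a is an ancestor of b
    no_ff:       if True, never drop the first parent for this reason
    """
    kept = []
    for index, parent in enumerate(parents):
        redundant = any(is_ancestor(parent, other)
                        for other in parents if other != parent)
        if redundant and not (no_ff and index == 0):
            continue
        kept.append(parent)
    return kept


# After B and AB are pruned, AC has parents [A, C], and C is built on A.
known_ancestry = {("A", "C")}


def is_anc(a, b):
    return (a, b) in known_ancestry


default_parents = trim_parents(["A", "C"], is_anc)
no_ff_parents = trim_parents(["A", "C"], is_anc, no_ff=True)
```

With the default handling only C survives as a parent, so AC is no longer a merge and can be pruned if it has no file changes of its own. With no_ff=True both parents are kept and AC remains a merge commit.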
Signed-off-by: Elijah Newren <newren@gmail.com>
Prior to this commit, git-filter-repo could only be used with either the
--dry-run flag or the --debug flag, not both. When run in debug mode,
git-filter-repo expected to be able to read from the output stream,
which obviously isn't created when doing a dry run, so it stack traced
when it tried to use the non-existent output stream. This commit fixes
that bug with an equally simple sanity check for the existence of the
output stream when run in debug mode.
Signed-off-by: Karl Lenz <xorangekiller@gmail.com>
The mailmap format parsed by the "git shortlog" command allows for
matching mailmap entries with no email address. This is admittedly an
edge case, because most Git commits will have an email address
associated with them as well as a name, but technically the address
isn't required, and "git shortlog" accommodates that in its mailmap
format. This commit teaches git-filter-repo to do the same thing.
Signed-off-by: Karl Lenz <xorangekiller@gmail.com>
It's hard to be exhaustive, but if users try something like:
--path-rename foo/bar/baz:.
or
--path ../other-dir
then bad things happen. In the first case, filter-repo will try to
ask fast-import to create a directory named '.' and move everything
from foo/bar/baz/ into it, but of course '.' is a reserved directory
name so we can't create it. In the second case, they are probably
running from a subdirectory, but filter-repo doesn't work from a
subdirectory. I hard-coded the assumption that everything was in the
toplevel directory and all paths were relative from there pretty
early on. So, if the user tries to use any of these components
anywhere, just throw an early error.
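A minimal sketch of such an early check, assuming paths are handled as bytestrings split on '/'. The helper names find_bad_components and validate_path are hypothetical, not the names used in filter-repo.

```python
BAD_COMPONENTS = {b".", b".."}


def find_bad_components(path):
    """Return the '.' and '..' components found anywhere in path."""
    return [part for part in path.split(b"/") if part in BAD_COMPONENTS]


def validate_path(path):
    """Raise early instead of letting fast-import fail later."""
    bad = find_bad_components(path)
    if bad:
        raise ValueError("invalid path component(s) %r in %r" % (bad, path))
    return path


rename_target = find_bad_components(b".")
parent_escape = find_bad_components(b"../other-dir")
normal_path = find_bad_components(b"foo/bar/.hidden")
```

Note that only whole components are rejected; a name such as '.hidden' or 'a..b' is still accepted.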
Signed-off-by: Elijah Newren <newren@gmail.com>
In commit 509a624 (filter-repo: fix issue with pruning of empty commits,
2019-10-03), it was noted that when the first parent is pruned away,
then we need to generate a corrected list of file changes relative to
the new first parent. Unfortunately, we did not apply our set of file
filters to that new list of file changes, causing us to possibly
introduce many unwanted files from the second parent into the history.
The testcase added at the time was rather lax and totally missed this
problem (which possibly exacerbated the original bug being fixed rather
than helping). Tighten the testcase, and fix the error by filtering the
generated list of file changes.
Signed-off-by: Elijah Newren <newren@gmail.com>
Note that this isn't a version *number* or even the more generalized
version string that folks are used to seeing, but a version hash (or
leading portion thereof).
A few important points:
* These version hashes are not strictly monotonically increasing
values. Like I said, these aren't version numbers. If that
bothers you, read on...
* This scheme has incredibly nice semantics satisfying a pair of
properties that most version schemes would assume are mutually
incompatible:
This scheme works even if the user doesn't have a clone of
filter-repo and doesn't require any build step to inject the
version into the program; it works even if people just download
git-filter-repo.py off GitHub without any of the other sources.
And:
This scheme means that a user is running precisely version X of
the code, with the version not easily faked or misrepresented
when third parties edit the code.
Given the wonderful semantics provided by satisfying this pair of
properties that all other versioning schemes seem to miss out on, I
think I should name this scheme. How about "Semantic Versioning"?
(Hehe...)
* The version hash is super easy to use; I just go to my own clone of
filter-repo and run either:
git show $VERSION_HASH
or
git describe $VERSION_HASH
* A human consumable version might suggest to folks that this software
is something they might frequently use and upgrade. This program
should only be used in exceptional cases (because rewriting history
is not for the faint of heart).
* A human consumable version (i.e. a version number or even the
more relaxed version strings in more common use) might suggest to
folks that they can rely on strict backward compatibility. It's
nice to subtly undercut any such assumption.
* Despite all that, I will make releases (downloadable tarballs with
real version numbers in the tarball name; I'm just going to re-use
whatever version git is released with at the time). But those
version numbers won't be used by the --version option; instead the
version hash will.
Signed-off-by: Elijah Newren <newren@gmail.com>
In order to build the correct tree for a commit, git-fast-import always
takes a list of file changes for a merge commit relative to the first
parent.
When the entire first-parent history of a merge commit is pruned away
and the merge had paths with no difference relative to the first parent
but which differed relative to later parents, then we really need to
generate a new list of file changes in order to have one of those other
parents become the new first parent. An example might help clarify...
Let's say that there is a merge commit, and:
* it resolved differences in pathA between its two parents by taking
the version of pathA from the first parent.
* pathB was added in the history of the second parent (it is not
present in the first parent) and is NOT included in the merge commit
(either being deleted, or via rename treated as deleted and added as
something else)
For this merge commit, neither pathA nor pathB differ from the first
parent, and thus wouldn't appear in the list of file changes shown by
fast-export. However, when our filtering rules determine that the first
parent (and all its parents) should be pruned away, then the second
parent has to become the new first parent of the merge commit. But to
end up with the right files in the merge commit despite using a
different parent, we need a list of file changes that specifies the
changes for both pathA and pathB.
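The example can be sketched in Python by modelling each tree as a dict from path to blob id. The function name file_changes and the blob ids 'a1', 'a2', and 'b1' are made up for illustration; this is not how filter-repo represents trees internally.

```python
def file_changes(parent_tree, commit_tree):
    """List the changes needed to turn parent_tree into commit_tree."""
    changes = []
    for path in sorted(set(parent_tree) | set(commit_tree)):
        old = parent_tree.get(path)
        new = commit_tree.get(path)
        if old == new:
            continue
        if new is None:
            changes.append(("D", path))        # deleted in the commit
        else:
            changes.append(("M", path, new))   # added or modified
    return changes


first_parent = {"pathA": "a1"}
second_parent = {"pathA": "a2", "pathB": "b1"}
merge_commit = {"pathA": "a1"}   # took pathA from the first parent, no pathB

relative_to_first = file_changes(first_parent, merge_commit)
relative_to_second = file_changes(second_parent, merge_commit)
```

Relative to the first parent the list is empty, which is all fast-export would show. Relative to the second parent the list must modify pathA and delete pathB, which is the list we need once the second parent becomes the new first parent.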
Signed-off-by: Elijah Newren <newren@gmail.com>
Allow folks to periodically update the export of a live repo without
re-exporting from the beginning. This is a performance improvement, but
can also be important for collaboration. For example, for sensitivity
reasons, folks might want to export a subset of a repo and update the
export periodically. While this could be done by just re-exporting the
repository anew each time, there is a risk that the paths used to
specify the wanted subset might need to change in the future; making the
user verify that their paths (including globs or regexes) don't also
pick up anything from history that was previously excluded so that they
don't get a divergent history is not very user friendly. Allowing them
to just export stuff that is new since the last export works much better
for them.
Signed-off-by: Elijah Newren <newren@gmail.com>
Commit 346f2ba891 (filter-repo: make reencoding of commit messages
togglable, 2019-05-11) made reencoding of commit messages togglable but
forgot to add parsing and outputting of the encoding header itself. Add
such ability now.
Signed-off-by: Elijah Newren <newren@gmail.com>
When we prune a commit for being empty, there is no update to the branch
associated with the commit in the fast-import stream. If the parent
commit had been associated with a different branch, then the branch
associated with the pruned commit would not be updated without
additional measures. In the past, we resolved this by recording that
the branch needed an update in _seen_refs. While this works, it is a
bit more complicated than just issuing an immediate Reset. Also, note
that we need to avoid calling callbacks on that Reset because those
could rename branches (again, if the commit-callback already renamed
once) causing us to not update the intended branch.
There was actually one testcase where the old method didn't work: when a
branch was pruned away to nothing. A testcase accidentally encoded the
wrong behavior, hiding this problem. Fix the testcase to check for
correct behavior.
Signed-off-by: Elijah Newren <newren@gmail.com>
Add a flag allowing for specifying a file filled with blob-ids which
will be stripped from the repository.
Signed-off-by: Elijah Newren <newren@gmail.com>
Fix a few issues and add a token testcase for partial repo filtering.
Add a note about how I think this is not a particularly interesting or
core usecase for filter-repo, even if I have put some good effort into
the fast-export side to ensure it worked. If there is a core usecase
that can be addressed without causing usability problems (particularly
the "don't mix old and new history" edict for normal rewrites), then
I'll be happy to add more testcases, document it better, etc.
Signed-off-by: Elijah Newren <newren@gmail.com>
Make several fixes around --source and --target:
* Explain steps we skip when source or target locations are specified
* Only write reports to the target directory, never the source
* Query target git repo for final ref values, not the source
* Make sure --debug messages avoid throwing TypeErrors due to mixing
strings and bytes
* Make sure to include entries in ref-map that weren't in the original
target repo
* Don't:
* worry about mixing old and new history (i.e. nuking refs
that weren't updated, expiring reflogs, gc'ing)
* attempt to map refs/remotes/origin/* -> refs/heads/*
* disconnect origin remote
* Continue (but only in target repo):
* fresh-clone sanity checks
* writing replace refs
* doing a 'git reset --hard'
Signed-off-by: Elijah Newren <newren@gmail.com>
Add a flag for filtering out blobs based on their size, and allow the
size to be specified using 'K', 'M', or 'G' suffixes.
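A minimal sketch of parsing such a size. The name parse_size is hypothetical, and treating the suffixes as powers of 1024 is an assumption that the text above does not spell out.

```python
# Assumption: K, M and G are binary multiples (1024-based).
MULTIPLIERS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert '500', '10K', '5M' or '1G' into a number of bytes."""
    text = text.strip()
    if text and text[-1] in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[text[-1]]
    return int(text)   # raises ValueError for anything unparsable


plain = parse_size("500")
kilo = parse_size("10K")
mega = parse_size("5M")
giga = parse_size("1G")
```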
Signed-off-by: Elijah Newren <newren@gmail.com>
Imperative form sounds better than --empty-pruning and
--degenerate-pruning, and it probably works better with command line
completion.
Signed-off-by: Elijah Newren <newren@gmail.com>
The reset directive can specify a commit hash for the 'from' directive,
which can be used to reset to a specific commit, or, if the hash is all
zeros, then it can be used to delete the ref. Support such operations.
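For reference, a minimal sketch of what such reset directives look like in a fast-import stream. The helpers reset_directive and is_deletion are hypothetical, and the 40-digit hash length assumes SHA-1 object names.

```python
NULL_HASH = b"0" * 40   # assumption: SHA-1 sized object names


def reset_directive(ref, from_hash=None):
    """Format a 'reset' command, optionally with a 'from' line."""
    out = b"reset " + ref + b"\n"
    if from_hash is not None:
        out += b"from " + from_hash + b"\n"
    return out + b"\n"


def is_deletion(from_hash):
    """An all-zeros hash on the 'from' line means: delete the ref."""
    return len(from_hash) > 0 and from_hash.strip(b"0") == b""


delete_cmd = reset_directive(b"refs/heads/obsolete", NULL_HASH)
move_cmd = reset_directive(b"refs/heads/topic", b"1234abcd" * 5)
```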
Signed-off-by: Elijah Newren <newren@gmail.com>
This allows the user to put a whole bunch of paths they want to keep (or
want to remove) in a file and then just provide the path to it. They
can also use globs or regexes (similar to --replace-text) and can also
do renames. In fact, this allows regex renames, despite the fact that I
never added a --path-rename-regex option.
Signed-off-by: Elijah Newren <newren@gmail.com>
Using an exact path (file or directory) for --path-rename instead of a
prefix removes an ugly caveat from the documentation, makes it operate
similarly to --path, and will make it easier to reuse common code when I
add the --paths-from-file option. Switch over, and replace the
startswith() check by a call to filename_matches().
Signed-off-by: Elijah Newren <newren@gmail.com>
This new flag allows people to filter files solely based on their
basename rather than on their full path within the repo, making it
easier to e.g. remove all .DS_Store files or keep all README.md
files.
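A minimal sketch of basename-based matching, assuming paths are bytestrings that use '/' as the separator (as git does). The function name filter_by_basename and its invert parameter are hypothetical.

```python
def filter_by_basename(paths, basenames, invert=False):
    """Keep paths whose final component is in basenames (or the rest if invert)."""
    wanted = set(basenames)
    kept = []
    for path in paths:
        basename = path.rsplit(b"/", 1)[-1]
        if (basename in wanted) != invert:
            kept.append(path)
    return kept


paths = [b"README.md", b"docs/README.md", b"src/.DS_Store", b"src/main.c"]

only_readmes = filter_by_basename(paths, [b"README.md"])
without_ds_store = filter_by_basename(paths, [b".DS_Store"], invert=True)
```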
Signed-off-by: Elijah Newren <newren@gmail.com>
This adds the ability to automatically add new replacement refs for each
rewritten commit (as well as delete or update replacement refs that
existed before the run). This will allow users to use either new or old
commit hashes to reference commits locally, though old commit hashes
will need to be unabbreviated. The only requirement for this to work
is that the person who does the rewrite also needs to push the replace
refs up where other users can grab them, and users who want to use them
need to modify their fetch refspecs to grab the replace refs.
However, other tools external to git may not understand replace refs...
Tools like Gerrit and GitHub apparently do not yet natively understand
replace refs. Trying to view "commits" by the replacement ref will
yield various forms of "Not Found" in each tool. One has to instead try
to view it as a branch with an odd name (including "refs/replace/"), and
often branches are accessed via a different URL style than commits so it
becomes very non-obvious to users how to access the info associated with
an old commit hash.
* In Gerrit, instead of being able to search on the sha1sum or use a
pre-defined URL to search and auto-redirect to the appropriate code
review with
https://gerrit.SITE.COM/#/q/${OLD_SHA1SUM},n,z
one instead has to have a special plugin and go to a URL like
https://gerrit.SITE.COM/plugins/gitiles/ORG/REPO/+/refs/replace/${OLD_SHA1SUM}
but then the user isn't shown the actual code review and will need
to guess which link to click on to get to it (and it'll only be
there if the user included a Change-Id in the commit message).
* In GitHub, instead of being able to go to a URL like
https://github.SITE.COM/ORG/REPO/commit/${OLD_SHA1SUM}
one instead has to navigate based on branch using
https://github.SITE.COM/ORG/REPO/tree/refs/replace/${OLD_SHA1SUM}
but that will show a listing of commits instead of information about
a specific commit; the user has to manually click on the first commit
to get to the desired location.
For now, providing replace refs at least allows users to access
information locally using old IDs; perhaps in time, as other external
tools gain a better understanding of how to use replace refs, the
barrier to history rewrites will decrease enough that big projects that
really need it (e.g. those that have committed many sins by committing
stupidly large useless binary blobs) can at least seriously contemplate
the undertaking. History rewrites will always have some drawbacks and
pain associated with them, as they should, but when warranted it's nice
to have transition plans that are more smooth than a massive flag day.
Signed-off-by: Elijah Newren <newren@gmail.com>
We have a good default for pruning of empty commits and degenerate merge
commits: only pruning such commits that didn't start out that way (i.e.
that couldn't intentionally have been empty or degenerate). However,
users may have reasons to want to aggressively prune such commits (maybe
they used BFG repo filter or filter-branch previously and have lots of
cruft commits that they want removed), and we may as well allow them to
specify that they don't want pruning too, just to be flexible.
Signed-off-by: Elijah Newren <newren@gmail.com>
This is by far the largest python3 change; it consists basically of
* using b'<str>' instead of '<str>' in lots of places
* adding a .encode() if we really do work with a string but need to
get it converted to a bytestring
* replace uses of .format() with interpolation via the '%' operator,
since bytestrings don't have a .format() method.
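A small illustration of the three kinds of change, using made-up values; it assumes Python 3.5 or later, where the '%' operator works on bytestrings.

```python
branch_name = "refs/heads/main"          # a real (unicode) string
ref = branch_name.encode()               # converted to a bytestring

# '%' interpolation works on bytestrings; .format() does not exist for them.
reset_line = b"reset %s\n" % ref
mark_line = b"mark :%d\n" % 7

bytes_have_format = hasattr(b"", "format")
```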
Signed-off-by: Elijah Newren <newren@gmail.com>
Use UTF-8 chars in user names, filenames, branch names, tag names, and
file contents. Also include invalid UTF-8 in file contents; should be
able to handle binary data.
Signed-off-by: Elijah Newren <newren@gmail.com>
The sorting order of entries written to files in the analysis directory
didn't specify a secondary sort, thus making the order dependent on the
random-ish sorting order of dictionaries and making it inconsistent
between python versions. While the secondary order didn't matter much,
having a defined order makes it slightly easier to define a single
testcase that can work across versions.
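A minimal sketch of the idea with made-up data: sort by size, largest first, and break ties by name so the result no longer depends on dictionary ordering.

```python
sizes = {"b.txt": 10, "a.txt": 10, "c.txt": 99}

# Primary key: size, descending. Secondary key: name, ascending.
report_order = sorted(sizes.items(), key=lambda item: (-item[1], item[0]))
```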
Signed-off-by: Elijah Newren <newren@gmail.com>
Assuming filter-repo will be merged into git.git, use "git" for the
TEXTDOMAIN, and assume its build system will replace "@@LOCALEDIR@@"
appropriately.
Note that the xgettext command used to grab string translations is
nearly identical to the one for C files in git.git; just use
--language=python instead and add --join-existing to avoid overwriting
the po/git.pot file. In other words, use the command:
xgettext -o../git/po/git.pot --join-existing --force-po \
--add-comments=TRANSLATORS: \
--msgid-bugs-address="Git Mailing List <git@vger.kernel.org>" \
--from-code=UTF-8 --language=python \
--keyword=_ --keyword=N_ --keyword="Q_:1,2" \
git-filter-repo
To create or update the translation, go to git.git/po and run either of:
msginit --locale=XX
msgmerge --add-location --backup=off -U XX.po git.pot
Once you've updated the translation, within git.git just build as
normal. That's all that's needed.
Signed-off-by: Elijah Newren <newren@gmail.com>
The AncestryGraph setup assumed we had previously seen all commits which
would be used as parents; that interacted badly with doing an
incremental import. Add a function which can be used to record external
commits, each of which we'll treat like a root commit (i.e. depth 1 and
having no parents of its own). Add a test to prevent regressions.
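A simplified sketch of the idea. The class below is a stand-in written for illustration (its name, attributes, and error handling are assumptions), not the actual AncestryGraph implementation.

```python
class SimpleAncestryGraph:
    """Tracks the depth and parents of each commit seen so far."""

    def __init__(self):
        self.depth = {}
        self.parents = {}

    def record_external_commits(self, commits):
        """Treat commits from outside the stream like root commits."""
        for commit in commits:
            if commit not in self.depth:
                self.depth[commit] = 1
                self.parents[commit] = []

    def add_commit_and_parents(self, commit, parents):
        missing = [p for p in parents if p not in self.depth]
        if missing:
            raise KeyError("unknown parent(s): %r" % missing)
        self.parents[commit] = list(parents)
        self.depth[commit] = 1 + max((self.depth[p] for p in parents),
                                     default=0)


graph = SimpleAncestryGraph()
# In an incremental import, the parent was exported in an earlier run.
graph.record_external_commits(["old-tip"])
graph.add_commit_and_parents("new-1", ["old-tip"])
graph.add_commit_and_parents("new-2", ["new-1"])
```

Without the call to record_external_commits, adding "new-1" would fail because its parent was never seen in this run.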
Signed-off-by: Elijah Newren <newren@gmail.com>
There are a number of things not present in "normal" imports that we
nevertheless support and need to be tested:
* broken timezone adjustment (+051800->+0261; observed in the wild
in real repos, and adjustment prevents fast-import from dying)
* commits missing an author (observed in the wild in a real repo;
just sets author to committer)
* optional additional linefeeds in the input allowed by
git-fast-import but usually not written by git-fast-export
* progress and checkpoint objects
* progress, checkpoint, and 'everything' callbacks
Signed-off-by: Elijah Newren <newren@gmail.com>
The test does check the exact output for the report, meaning if the
output is changed at all this test will need to be updated, but it at
least makes sure we are getting all the right kinds of information. I
do not expect the output format will change very often.
Signed-off-by: Elijah Newren <newren@gmail.com>
Pruning of commits which become empty can result in a variety of
topology changes: a merge may have lost all its ancestors corresponding
to one (or more) of its parents, a merge may end up merging a commit
with itself, or a merge may end up merging a commit with its own
ancestor. Merging a commit with itself makes no sense, so we'd rather
prune down to one parent and hopefully prune the merge commit, but we do
need to worry about whether there are changes in the commit and whether
the original merge commit also merged something with itself. We have
similar cases for dealing with a merge of some commit with its own
ancestor: if the original topology did the same, or the merge commit has
additional file changes, then we cannot remove the commit. But,
otherwise, the commit can be pruned.
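The decision described above can be summarized in a small sketch. The function and parameter names are hypothetical; it only restates the rule in code form and leaves out how each input would be computed.

```python
def can_prune_merge(now_degenerate, originally_degenerate, has_file_changes):
    """Decide whether a merge commit may be dropped after filtering."""
    if not now_degenerate:
        return False   # still merges distinct, unrelated lines of history
    if originally_degenerate:
        return False   # it started out this way, possibly on purpose
    if has_file_changes:
        return False   # dropping it would lose content
    return True


became_degenerate = can_prune_merge(True, False, False)
always_degenerate = can_prune_merge(True, True, False)
carries_changes = can_prune_merge(True, False, True)
```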
Add testcases covering the variety of changes that can occur to make
sure we get them right.
Signed-off-by: Elijah Newren <newren@gmail.com>
There are several cases to worry about with commit pruning: commits
that start empty and had no parent, commits that start empty and
had a parent which may or may not get pruned, commits which had
changes but became empty, commits which were merges but lost a line
of ancestry and have no changes of their own, etc. Add testcases
covering these cases, though most topology related ones will be
deferred to a later set of tests.
Signed-off-by: Elijah Newren <newren@gmail.com>
Make it easy for users to search and replace text throughout the
repository history. Instead of inventing some new syntax, reuse the
same syntax used by BFG repo filter's --replace-text option, namely,
a file with one expression per line of the form
[regex:|glob:|literal:]$MATCH_EXPR[==>$REPLACEMENT_EXPR]
Where "$MATCH_EXPR" is by default considered to be literal text, but
could be a regex or a glob if the appropriate prefix is used. Also,
$REPLACEMENT_EXPR defaults to '***REMOVED***' if not specified. If
you want a literal '==>' to be part of your $MATCH_EXPR, then you
must also manually specify a replacement expression instead of taking
the default. Some examples:
sup3rs3kr3t
(replaces 'sup3rs3kr3t' with '***REMOVED***')
HeWhoShallNotBeNamed==>Voldemort
(replaces 'HeWhoShallNotBeNamed' with 'Voldemort')
very==>
(replaces 'very' with the empty string)
regex:(\d{2})/(\d{2})/(\d{4})==>\2/\1/\3
(replaces '05/17/2012' with '17/05/2012', and vice-versa)
The format for regex is as from
re.sub(<pattern>, <repl>, <string>) from
https://docs.python.org/2/library/re.html
The <string> comes from file contents of the repo, and you specify
the <pattern> and <repl>.
glob:Copy*t==>Cartel
(replaces 'Copyright' or 'Copyleft' or 'Copy my st' with 'Cartel')
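A minimal sketch of how one such expression line could be parsed and applied. The function names are hypothetical, the glob handling only supports '*', and splitting on the last '==>' is an assumption based on the note above about a literal '==>' in $MATCH_EXPR. It works on str for brevity, whereas real file contents would be bytes.

```python
import re

DEFAULT_REPLACEMENT = "***REMOVED***"


def parse_expression(line):
    """Return (compiled_pattern, replacement) for one expression line."""
    kind = "literal"
    for prefix in ("regex", "glob", "literal"):
        if line.startswith(prefix + ":"):
            kind, line = prefix, line[len(prefix) + 1:]
            break
    if "==>" in line:
        # Assumption: split on the last '==>' so the match may contain one.
        match, replacement = line.rsplit("==>", 1)
    else:
        match, replacement = line, DEFAULT_REPLACEMENT
    if kind == "regex":
        pattern = match
    else:
        pattern = re.escape(match)
        if kind == "glob":
            pattern = pattern.replace(r"\*", ".*")   # only '*' is handled
        # Keep backslashes in a non-regex replacement literal.
        replacement = replacement.replace("\\", "\\\\")
    return re.compile(pattern), replacement


def apply_expressions(text, lines):
    """Apply every expression line, in order, to text."""
    for line in lines:
        pattern, replacement = parse_expression(line)
        text = pattern.sub(replacement, text)
    return text


secret = apply_expressions("pw=sup3rs3kr3t", ["sup3rs3kr3t"])
renamed = apply_expressions("HeWhoShallNotBeNamed",
                            ["HeWhoShallNotBeNamed==>Voldemort"])
date = apply_expressions("05/17/2012",
                         [r"regex:(\d{2})/(\d{2})/(\d{4})==>\2/\1/\3"])
globbed = apply_expressions("Copyright", ["glob:Copy*t==>Cartel"])
```

Because '*' is translated to a greedy '.*', a glob applied to a longer line will match up to the last 't' on that line.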
Signed-off-by: Elijah Newren <newren@gmail.com>