Allow folks to periodically update the export of a live repo without
re-exporting from the beginning. This is a performance improvement, but
can also be important for collaboration. For example, for sensitivity
reasons, folks might want to export a subset of a repo and update the
export periodically. While this could be done by re-exporting the
repository anew each time, the paths used to specify the wanted subset
might need to change in the future. Users would then have to verify that
their new paths (including globs or regexes) don't also pick up anything
from history that was previously excluded, since that would give them a
divergent history; that is not very user friendly. Allowing them to
export only what is new since the last export works much better for
them.
Signed-off-by: Elijah Newren <newren@gmail.com>
When we prune a commit for being empty, there is no update to the branch
associated with the commit in the fast-import stream. If the parent
commit had been associated with a different branch, then the branch
associated with the pruned commit would not be updated without
additional measures. In the past, we resolved this by recording that
the branch needed an update in _seen_refs. While this works, it is a
bit more complicated than just issuing an immediate Reset. Also, note
that we need to avoid calling callbacks on that Reset, because those
could rename the branch a second time (if the commit-callback already
renamed it once), causing us to not update the intended branch.
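A minimal sketch of the idea, not the actual implementation: the helper
names format_reset and handle_pruned_commit are hypothetical, and the
only thing taken from git itself is the fast-import 'reset' command
syntax. The point is that the Reset is written straight to the output
without going through any user callback.

```python
from typing import List, Optional


def format_reset(ref: bytes, from_ref: Optional[bytes] = None) -> bytes:
    """Build a fast-import 'reset' command for ref (hypothetical helper)."""
    lines = [b'reset %s\n' % ref]
    if from_ref is not None:
        # Point the branch at an existing commit, e.g. a mark like b':42'.
        lines.append(b'from %s\n' % from_ref)
    lines.append(b'\n')
    return b''.join(lines)


def handle_pruned_commit(branch: bytes, parent_mark: int,
                         output: List[bytes]) -> None:
    """Update branch to the surviving parent when its commit is pruned.

    The reset is appended to the output directly; no callbacks are
    invoked, so nothing can rename the branch a second time.
    """
    output.append(format_reset(branch, b':%d' % parent_mark))


stream: List[bytes] = []
handle_pruned_commit(b'refs/heads/topic', 42, stream)
```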
There was actually one testcase where the old method didn't work: when a
branch was pruned away to nothing. A testcase accidentally encoded the
wrong behavior, hiding this problem. Fix the testcase to check for
correct behavior.
Signed-off-by: Elijah Newren <newren@gmail.com>
For other programs importing git-filter-repo as a library and passing a
blob, commit, tag, or reset callback to RepoFilter, pass a second
parameter to these functions with extra metadata they might find useful.
For simplicity of implementation, this technically changes the calling
signature of the --*-callback functions passed on the command line, but
we hide that behind a _do_not_use_this_variable parameter for now, leave
it undocumented, and encourage folks who want to use it to write an
actual python program that imports git-filter-repo. In the future, we
may modify the --*-callback functions to not pass this extra parameter,
or if it is deemed sufficiently useful, then we'll rename the second
parameter and document it.
As already noted in our API compatibility caveat near the top of
git-filter-repo, I am not guaranteeing API backwards compatibility.
That especially applies to this metadata argument, other than the fact
that it'll be a dict mapping strings to some kind of value. I might add
more keys, rename them, change the corresponding value, or even remove
keys that used to be part of metadata.
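As a rough illustration of what a two-parameter callback could look
like (a sketch only: run_callback is a stand-in for the caller, and
'example_key' is a made-up key, not one that metadata is promised to
contain), reading keys defensively copes with keys being added,
renamed, or removed:

```python
def describe_commit(commit, metadata):
    """Callback taking the object plus a dict of extra metadata."""
    # 'example_key' is hypothetical; use .get() with a default because
    # no particular key is guaranteed to exist.
    note = metadata.get('example_key', 'unknown')
    return '%s (example_key=%s)' % (commit, note)


def run_callback(callback, obj, metadata=None):
    """Stand-in for the caller: always passes a dict as second argument."""
    return callback(obj, metadata if metadata is not None else {})


with_key = run_callback(describe_commit, 'c1', {'example_key': 3})
without_key = run_callback(describe_commit, 'c1')
```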
Signed-off-by: Elijah Newren <newren@gmail.com>
Location of filtering logic was previously split in a confusing fashion
between FastExportFilter and RepoFilter. Move all filtering logic from
FastExportFilter into RepoFilter, and rename the former to
FastExportParser to reflect this change.
One downside of this change is that FastExportParser's _parse_commit
holds two pieces of information (orig_parents and had_file_changes)
which are not part of the commit object but which are now needed by
RepoFilter. Adding those bits of info to the commit object does not
make sense, so for now we pass an auxiliary dict with the
commit_callback that has these two fields. This information is not
passed along to external commit_callbacks passed to RepoFilter, which
seems suboptimal. To be fair, commit_callbacks to RepoFilter never had
access to this information, so this is not a new shortcoming; it just
seems more apparent now.
Signed-off-by: Elijah Newren <newren@gmail.com>
I introduced everything_callback over a decade ago thinking it would come in handy in
some special case, and the only place I used it was in a testcase that
existed almost solely to increase code coverage. Modify the testcase to
instead demonstrate how it is trivial to get the effects of the
everything_callback without it being present.
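One way this could look, as a sketch under assumptions: the dispatch
loop and the object stream below are stand-ins written for this
example, not the real parser. A single shared function is bound to each
per-type callback, which gives the same effect as one callback that
sees every object.

```python
from functools import partial

seen = []


def record(obj_type, obj):
    """Shared handler that every per-type callback forwards to."""
    seen.append((obj_type, obj))


# One callback per object type, all delegating to the same function.
callbacks = {kind: partial(record, kind)
             for kind in ('blob', 'commit', 'tag', 'reset')}

# Stand-in for objects coming out of a parsed stream.
objects = [('blob', 'b1'), ('commit', 'c1'), ('reset', 'r1'), ('tag', 't1')]
for kind, obj in objects:
    callbacks[kind](obj)
```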
Signed-off-by: Elijah Newren <newren@gmail.com>
This is by far the largest python3 change; it consists basically of
  * using b'<str>' instead of '<str>' in lots of places
  * adding a .encode() if we really do work with a string but need to
    get it converted to a bytestring
  * replacing uses of .format() with interpolation via the '%' operator,
    since bytestrings don't have a .format() method.
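The three kinds of change can be illustrated with a small sketch; the
function name reset_line and the sample values are made up for this
example and are not taken from the code being converted.

```python
def reset_line(branch_name: str) -> bytes:
    """Format a 'reset' line as bytes from a str branch name."""
    # We really do have a str here, so convert it with .encode().
    ref = b'refs/heads/' + branch_name.encode()
    # Bytestrings have no .format(); use the '%' operator instead.
    return b'reset %s\n' % ref


# Bytes literals instead of str literals, and %d works with integers.
mark_line = b'mark :%d\n' % 7
bytes_lack_format = not hasattr(b'', 'format')
```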
Signed-off-by: Elijah Newren <newren@gmail.com>
There are a number of things not present in "normal" imports that we
nevertheless support and that need to be tested:
  * broken timezone adjustment (+051800->+0261; observed in the wild
    in real repos, and adjustment prevents fast-import from dying)
  * commits missing an author (observed in the wild in a real repo;
    just sets author to committer)
  * optional additional linefeeds in the input allowed by
    git-fast-import but usually not written by git-fast-export
  * progress and checkpoint objects
  * progress, checkpoint, and 'everything' callbacks
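The first two items can be sketched roughly as follows. The function
names are hypothetical and only the single timezone mapping named above
is handled; this is an illustration of the behavior being tested, not
the parser's code.

```python
from typing import Optional


def fix_timezone(timezone: bytes) -> bytes:
    """Replace the one known-broken timezone with the adjusted value."""
    if timezone == b'+051800':
        return b'+0261'
    return timezone


def effective_author(author: Optional[bytes], committer: bytes) -> bytes:
    """Fall back to the committer when a commit has no author."""
    return committer if author is None else author
```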
Signed-off-by: Elijah Newren <newren@gmail.com>