mirror of
https://github.com/newren/git-filter-repo.git
synced 2024-11-17 03:26:08 +00:00
9282a33a02
Signed-off-by: Elijah Newren <newren@gmail.com>
1369 lines
58 KiB
Plaintext
// This file is NOT the documentation; it's the *source code* for it.
|
|
// Please follow the "user manual" link under
|
|
// https://github.com/newren/git-filter-repo#how-do-i-use-it
|
|
// to access the actual documentation.
|
|
|
|
git-filter-repo(1)
|
|
==================
|
|
|
|
NAME
|
|
----
|
|
git-filter-repo - Rewrite repository history
|
|
|
|
SYNOPSIS
|
|
--------
|
|
[verse]
|
|
'git filter-repo' --analyze
|
|
'git filter-repo' [<path_filtering_options>] [<content_filtering_options>]
|
|
[<ref_renaming_options>] [<commit_message_filtering_options>]
|
|
[<name_or_email_filtering_options>] [<parent_rewriting_options>]
|
|
[<generic_callback_options>] [<miscellaneous_options>]
|
|
|
|
DESCRIPTION
|
|
-----------
|
|
|
|
Rapidly rewrite entire repository history using user-specified filters.
|
|
This is a destructive operation which should not be used lightly; it
|
|
writes new commits, trees, tags, and blobs corresponding to (but
|
|
filtered from) the original objects in the repository, then deletes the
|
|
original history and leaves only the new. See <<DISCUSSION>> for more
|
|
details on the ramifications of using this tool. Several different
|
|
types of history rewrites are possible; examples include (but are not
|
|
limited to):
|
|
|
|
* stripping large files (or large directories or large extensions)
|
|
* stripping unwanted files by path
|
|
* extracting wanted paths and their history (stripping everything else)
|
|
* restructuring the file layout (such as moving all files into a
|
|
subdirectory in preparation for merging with another repo, making a
|
|
subdirectory become the new toplevel directory, or merging two
|
|
directories with independent filenames into one directory)
|
|
* renaming tags (also often in preparation for merging with another repo)
|
|
* replacing or removing sensitive text such as passwords
|
|
* making mailmap rewriting of user names or emails permanent
|
|
* making grafts or replacement refs permanent
|
|
* rewriting commit messages
|
|
|
|
Additionally, several concerns are handled automatically (many of these
|
|
can be overridden, but they are all on by default):
|
|
|
|
* rewriting (possibly abbreviated) hashes in commit messages to
|
|
refer to the new post-rewrite commit hashes
|
|
* pruning commits which become empty due to the above filters (also
|
|
handles edge cases like pruning of merge commits which become
|
|
degenerate and empty)
|
|
* creating replace-refs (see linkgit:git-replace[1]) for old commit
|
|
hashes, which if pushed and fetched will allow users to continue to
|
|
refer to new commits using (unabbreviated) old commit IDs
|
|
* stripping of original history to avoid mixing old and new history
|
|
* repacking the repository post-rewrite to shrink the repo for the
|
|
user
|
|
|
|
Also, it's worth noting that there is an important safety mechanism:
|
|
|
|
* abort if run from a repo that is not a fresh clone (to prevent
|
|
accidental data loss from rewriting local history that doesn't
|
|
exist anywhere else). See <<FRESHCLONE>>.
|
|
|
|
For those who know that there is large unwanted stuff in their history
|
|
and want help finding it, this command also
|
|
|
|
* provides an option to analyze a repository and generate reports that
|
|
can be useful in determining what to filter (or in determining
|
|
whether a separate filtering command was successful).
|
|
|
|
See also <<VERSATILITY>>, <<DISCUSSION>>, <<EXAMPLES>>, and
|
|
<<INTERNALS>>.
|
|
|
|
OPTIONS
|
|
-------
|
|
|
|
Analysis Options
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
--analyze::
|
|
Analyze repository history and create a report that may be
|
|
useful in determining what to filter in a subsequent run (or
|
|
in determining if a previous filtering command did what you
|
|
wanted). Will not modify your repo.
|
|
|
|
Filtering based on paths (see also --filename-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--invert-paths::
|
|
Invert the selection of files from the specified
|
|
--path-{match,glob,regex} options below, i.e. only select
|
|
files matching none of those options.
|
|
|
|
--path-match <dir_or_file>::
|
|
--path <dir_or_file>::
|
|
Exact paths (files or directories) to include in filtered
|
|
history. Multiple --path options can be specified to get a
|
|
union of paths.
|
|
|
|
--path-glob <glob>::
|
|
Glob of paths to include in filtered history. Multiple
|
|
--path-glob options can be specified to get a union of paths.
|
|
|
|
--path-regex <regex>::
|
|
Regex of paths to include in filtered history. Multiple
|
|
--path-regex options can be specified to get a union of paths.
|
|
|
|
--use-base-name::
|
|
Match on file base name instead of full path from the top of
|
|
the repo. Incompatible with --path-rename, and incompatible
|
|
with matching against directory names.
|
|
|
|
Renaming based on paths (see also --filename-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Note: if you combine path filtering with path renaming, be aware that
|
|
a rename directive does not select paths, it only says how to
|
|
rename paths that are selected with the filters.
|
|
|
|
--path-rename <old_name:new_name>::
|
|
--path-rename-match <old_name:new_name>::
|
|
Path to rename; if filename or directory matches <old_name>
|
|
rename to <new_name>. Multiple --path-rename options can be
|
|
specified.
|
|
|
|
Path shortcuts
|
|
~~~~~~~~~~~~~~
|
|
|
|
--paths-from-file <filename>::
|
|
Specify several path filtering and renaming directives, one
|
|
per line. Lines with `==>` in them specify path renames, and
|
|
lines can begin with `literal:` (the default), `glob:`, or
|
|
`regex:` to specify different matching styles. Blank lines
|
|
and lines starting with a `#` are ignored (if you have a
|
|
filename that you want to filter on that starts with
|
|
`literal:`, `#`, `glob:`, or `regex:`, then prefix the line
|
|
with 'literal:').
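
As an illustration of the line format only, the following Python sketch classifies a single line of such a file. It is a simplified model of the rules described above, not filter-repo's own parser; the function name `parse_directive` and the returned tuple layout are hypothetical, and treating the text after the last `==>` as the rename target is an assumption.

```python
def parse_directive(line):
    """Classify one line of a --paths-from-file style file.

    Returns None for ignored lines, else (match_type, pattern, rename_to),
    where rename_to is None for pure filtering directives.
    """
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None  # blank lines and comment lines are ignored

    # Assumption: the text after the last `==>` is the rename target.
    rename_to = None
    if "==>" in line:
        line, rename_to = line.rsplit("==>", 1)

    # `literal:` is the default matching style when no prefix is given.
    match_type = "literal"
    for prefix in ("literal", "glob", "regex"):
        if line.startswith(prefix + ":"):
            match_type = prefix
            line = line[len(prefix) + 1:]
            break
    return (match_type, line, rename_to)
```

Under this model, `literal:#notes` selects a file literally named `#notes`, which is why the explicit prefix is needed for such names.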
|
|
|
|
--subdirectory-filter <directory>::
|
|
Only look at history that touches the given subdirectory and
|
|
treat that directory as the project root. Equivalent to using
|
|
`--path <directory>/ --path-rename <directory>/:`
|
|
|
|
--to-subdirectory-filter <directory>::
|
|
Treat the project root as instead being under
|
|
<directory>. Equivalent to using `--path-rename :<directory>/`
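
The effect of these two shortcuts on path names can be modeled in a few lines of Python. This is only a sketch of the path arithmetic, under the assumption that paths are plain strings relative to the toplevel; the function names are hypothetical, and the sketch says nothing about which commits survive the rewrite.

```python
def subdirectory_filter(paths, directory):
    """Keep only paths under `directory` and strip that prefix from them."""
    prefix = directory.rstrip("/") + "/"
    return [p[len(prefix):] for p in paths if p.startswith(prefix)]


def to_subdirectory_filter(paths, directory):
    """Move every path underneath `directory`."""
    prefix = directory.rstrip("/") + "/"
    return [prefix + p for p in paths]


paths = ["README.md", "lib/core.py", "lib/util/io.py"]
narrowed = subdirectory_filter(paths, "lib")
nested = to_subdirectory_filter(paths, "vendor/")
```

Here `narrowed` contains `core.py` and `util/io.py`, while `nested` contains all three original paths prefixed with `vendor/`.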
|
|
|
|
Content editing filters (see also --blob-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--replace-text <expressions_file>::
|
|
A file with expressions that, if found, will be replaced. By
|
|
default, each expression is treated as literal text, but
|
|
`regex:` and `glob:` prefixes are supported. You can end the
|
|
line with `==>` and some replacement text to choose a
|
|
replacement choice other than the default of `***REMOVED***`.
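
A rough Python model of how such an expressions file behaves may help. It handles only literal and `regex:` lines (the `glob:` style is omitted), works on text strings rather than raw file contents, and assumes that the text after the last `==>` is the replacement; the function names and the sample expressions are made up for illustration and are not filter-repo's implementation.

```python
import re

DEFAULT_REPLACEMENT = "***REMOVED***"


def parse_expression(line):
    """Turn one expressions-file line into (kind, pattern, replacement)."""
    replacement = DEFAULT_REPLACEMENT
    if "==>" in line:
        line, replacement = line.rsplit("==>", 1)
    if line.startswith("regex:"):
        return ("regex", line[len("regex:"):], replacement)
    return ("literal", line, replacement)


def apply_expressions(text, expressions):
    """Apply every expression, in order, to `text`."""
    for kind, pattern, replacement in expressions:
        if kind == "regex":
            text = re.sub(pattern, replacement, text)
        else:
            text = text.replace(pattern, replacement)
    return text


# Hypothetical expressions, one per line of the file.
lines = ["p455w0rd", "foo==>bar", r"regex:\bdriver\b==>pilot"]
expressions = [parse_expression(line) for line in lines]
cleaned = apply_expressions("p455w0rd foo driver drivers", expressions)
```

In this example the literal `p455w0rd` receives the default replacement, `foo` becomes `bar`, and the word-bounded regex rewrites `driver` but leaves `drivers` alone.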
|
|
|
|
--strip-blobs-bigger-than <size>::
|
|
Strip blobs (files) bigger than specified size (e.g. `5M`,
|
|
`2G`, etc)
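
For a sense of what such a size means in bytes, here is a minimal Python sketch. It assumes binary multipliers (`K` = 1024, `M` = 1024^2, `G` = 1024^3) and that a bare number is a byte count; both points are assumptions of the sketch rather than something stated above, and `parse_size` is a hypothetical helper.

```python
def parse_size(size):
    """Convert a size such as '5M' or '2G' into a number of bytes."""
    # Assumption: suffixes are powers of 1024, not powers of 1000.
    multipliers = {"K": 1024, "M": 1024**2, "G": 1024**3}
    suffix = size[-1]
    if suffix in multipliers:
        return int(size[:-1]) * multipliers[suffix]
    return int(size)  # no suffix: treat the value as plain bytes
```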
|
|
|
|
--strip-blobs-with-ids <blob_id_filename>::
|
|
Read git object ids from each line of the given file, and
|
|
strip all of them from history
|
|
|
|
Renaming of refs (see also --refname-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--tag-rename <old:new>::
|
|
Rename tags starting with <old> to start with <new>. For example,
|
|
--tag-rename foo:bar will rename tag foo-1.2.3 to bar-1.2.3;
|
|
either <old> or <new> can be empty.
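
The prefix semantics can be stated precisely with a short Python sketch (the helper `rename_tag` is hypothetical and only models the rule above):

```python
def rename_tag(tag, old, new):
    """Replace the leading `old` of a tag name with `new`, if present."""
    if tag.startswith(old):
        return new + tag[len(old):]
    return tag  # tags that do not start with `old` are left alone
```

With an empty `old`, every tag gains the `new` prefix; with an empty `new`, the `old` prefix is simply stripped.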
|
|
|
|
Filtering of commit messages (see also --message-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--preserve-commit-hashes::
|
|
By default, since commits are rewritten and thus gain new
|
|
hashes, references to old commit hashes in commit messages are
|
|
replaced with new commit hashes (abbreviated to the same
|
|
length as the old reference). Use this flag to turn off
|
|
updating commit hashes in commit messages.
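
The default behavior can be pictured with the following Python sketch. It assumes that any run of 7 to 40 hexadecimal digits is a candidate hash and that unknown or ambiguous abbreviations are left untouched; these heuristics, the function name, and the sample hashes are illustrative assumptions, not a description of filter-repo's internals.

```python
import re

# Assumption: treat any run of 7-40 hex digits as a candidate commit hash.
HASH_RE = re.compile(r"\b[0-9a-f]{7,40}\b")


def rewrite_hashes(message, commit_map):
    """Replace (possibly abbreviated) old hashes found in `message`.

    `commit_map` maps full old hashes to full new hashes.
    """
    def replace(match):
        old = match.group(0)
        candidates = [new for full_old, new in commit_map.items()
                      if full_old.startswith(old)]
        if len(candidates) != 1:
            return old  # unknown or ambiguous: leave the text alone
        return candidates[0][:len(old)]  # keep the original abbreviation length

    return HASH_RE.sub(replace, message)


# Made-up hashes for demonstration.
commit_map = {
    "1234567890abcdef1234567890abcdef12345678":
        "fedcba0987654321fedcba0987654321fedcba09",
}
rewritten = rewrite_hashes("Revert 1234567890ab; see also deadbeef", commit_map)
```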
|
|
|
|
--preserve-commit-encoding::
|
|
Do not reencode commit messages into UTF-8. By default, if the
|
|
commit object specifies an encoding for the commit message,
|
|
the message is re-encoded into UTF-8.
|
|
|
|
Filtering of names & emails (see also --name-callback and --email-callback)
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--mailmap <filename>::
|
|
Use specified mailmap file (see linkgit:git-shortlog[1] for details
|
|
on the format) when rewriting author, committer, and tagger names
|
|
and emails. If the specified file is part of git history,
|
|
historical versions of the file will be ignored; only the current
|
|
contents are consulted.
|
|
|
|
--use-mailmap::
|
|
Same as: '--mailmap .mailmap'
|
|
|
|
Parent rewriting
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
--replace-refs {delete-no-add, delete-and-add, update-no-add, update-or-add, update-and-add}::
|
|
Replace refs (see linkgit:git-replace[1]) are used to rewrite
|
|
parents (unless turned off by the usual git mechanism); this
|
|
flag specifies what to do with those refs afterward. Replace
|
|
refs can either be deleted or updated to point at new commit
|
|
hashes. Also, new replace refs can be added for each commit
|
|
rewrite. With 'update-or-add', new replace refs are only
|
|
added for commit rewrites that aren't used to update an
|
|
existing replace ref. The default is 'update-and-add' if
|
|
$GIT_DIR/filter-repo/already_ran does not exist;
|
|
'update-or-add' otherwise.
|
|
|
|
--prune-empty {always, auto, never}::
|
|
Whether to prune empty commits. 'auto' (the default) means
|
|
only prune commits which become empty (not commits which were
|
|
empty in the original repo, unless their parent was
|
|
pruned). When the parent of a commit is pruned, the first
|
|
non-pruned ancestor becomes the new parent.
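
For a linear history, the reparenting rule amounts to walking up the chain of parents until a surviving commit is found. The Python sketch below models just that rule; it ignores merge commits, and the names `first_surviving_ancestor`, `parent_of`, and `pruned` are hypothetical.

```python
def first_surviving_ancestor(commit, parent_of, pruned):
    """Return the nearest ancestor of `commit` that was not pruned.

    `parent_of` maps each commit to its single parent (None for a root).
    Returns None when every ancestor was pruned, i.e. `commit` becomes a root.
    """
    parent = parent_of.get(commit)
    while parent is not None and parent in pruned:
        parent = parent_of.get(parent)
    return parent


# History A <- B <- C <- D, where B and C became empty and were pruned.
parent_of = {"D": "C", "C": "B", "B": "A", "A": None}
new_parent = first_surviving_ancestor("D", parent_of, pruned={"B", "C"})
```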
|
|
|
|
--prune-degenerate {always, auto, never}::
|
|
Since merge commits are needed for history topology, they are
|
|
typically exempt from pruning. However, they can become
|
|
degenerate with the pruning of other commits (having fewer
|
|
than two parents, having one commit serve as both parents, or
|
|
having one parent as the ancestor of the other.) If such merge
|
|
commits have no file changes, they can be pruned. The default
|
|
('auto') is to only prune empty merge commits which become
|
|
degenerate (not which started as such).
|
|
|
|
--no-ff::
|
|
Even if the first parent is or becomes an ancestor of another
|
|
parent, do not prune it. This modifies how --prune-degenerate
|
|
behaves, and may be useful in projects that always use merge
|
|
--no-ff.
|
|
|
|
Generic callback code snippets
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--filename-callback <function_body>::
|
|
Python code body for processing filenames; see <<CALLBACKS>>.
|
|
|
|
--message-callback <function_body>::
|
|
Python code body for processing messages (both commit messages and
|
|
tag messages); see <<CALLBACKS>>.
|
|
|
|
--name-callback <function_body>::
|
|
Python code body for processing names of people; see <<CALLBACKS>>.
|
|
|
|
--email-callback <function_body>::
|
|
Python code body for processing email addresses; see
|
|
<<CALLBACKS>>.
|
|
|
|
--refname-callback <function_body>::
|
|
Python code body for processing refnames; see <<CALLBACKS>>.
|
|
|
|
--blob-callback <function_body>::
|
|
Python code body for processing blob objects; see <<CALLBACKS>>.
|
|
|
|
--commit-callback <function_body>::
|
|
Python code body for processing commit objects; see <<CALLBACKS>>.
|
|
|
|
--tag-callback <function_body>::
|
|
Python code body for processing tag objects; see <<CALLBACKS>>.
|
|
|
|
--reset-callback <function_body>::
|
|
Python code body for processing reset objects; see <<CALLBACKS>>.
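
To make the phrase "code body" concrete, the following Python sketch shows one way a body string can be turned into a callable. It assumes the body is wrapped in a function whose single argument is named after the callback (here `filename`) and holds a bytestring; the helper `make_callback` and the sample body are hypothetical, and <<CALLBACKS>> remains the authoritative description of what each callback receives and returns.

```python
import textwrap


def make_callback(arg_name, body):
    """Wrap a code body in a one-argument function and return that function."""
    source = f"def callback({arg_name}):\n" + textwrap.indent(body, "    ")
    namespace = {}
    exec(source, namespace)  # defines `callback` inside `namespace`
    return namespace["callback"]


# Hypothetical body, as it might be passed to --filename-callback.
rename_src = make_callback(
    "filename", 'return filename.replace(b"src/", b"source/", 1)')
```

Because the body becomes the inside of a function, it uses `return` to hand back its result, as the sample body does.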
|
|
|
|
Location to filter from/to
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
NOTE: Specifying alternate source or target locations implies --partial
|
|
except that the normal default for --replace-refs is used. However, unlike
|
|
normal uses of --partial, this doesn't risk mixing old and new history
|
|
since the old and new histories are in different repositories.
|
|
|
|
--source <source>::
|
|
Git repository to read from
|
|
|
|
--target <target>::
|
|
Git repository to overwrite with filtered history
|
|
|
|
Miscellaneous options
|
|
~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
--help::
|
|
-h::
|
|
Show a help message and exit.
|
|
|
|
--force::
|
|
-f::
|
|
Ignore fresh clone checks and rewrite history (an irreversible
|
|
operation, especially since it by default ends with an
|
|
immediate pruning of reflogs and old objects). See
|
|
<<FRESHCLONE>>. Note that when cloning repos on a local
|
|
filesystem, it is better to pass `--no-local` to git clone
|
|
than passing `--force` to git-filter-repo.
|
|
|
|
--partial::
|
|
Do a partial history rewrite, resulting in a mixture of old and
|
|
new history. This implies a default of update-no-add for
|
|
--replace-refs, disables rewriting refs/remotes/origin/* to
|
|
refs/heads/*, disables removing of the 'origin' remote, disables
|
|
removing unexported refs, disables expiring the reflog, and
|
|
disables the automatic post-filter gc. Also, this modifies
|
|
--tag-rename and --refname-callback options such that instead of
|
|
replacing old refs with new refnames, it will instead create new
|
|
refs and keep the old ones around. Use with caution.
|
|
|
|
--refs <refs+>::
|
|
Limit history rewriting to the specified refs. Implies --partial.
|
|
In addition to the normal caveats of --partial (mixing old and new
|
|
history, no automatic remapping of refs/remotes/origin/* to
|
|
refs/heads/*, etc.), this also may cause problems for pruning of
|
|
degenerate empty merge commits when negative revisions are
|
|
specified.
|
|
|
|
--dry-run::
|
|
Do not change the repository. Run `git fast-export` and filter its
|
|
output, and save both the original and the filtered version for
|
|
comparison. This also disables rewriting commit messages due to
|
|
not knowing new commit IDs and disables filtering of some empty
|
|
commits due to inability to query the fast-import backend.
|
|
|
|
--debug::
|
|
Print additional information about operations being performed and
|
|
commands being run. (If used together with --dry-run, shows
|
|
extra information about what would be run).
|
|
|
|
--stdin::
|
|
Instead of running `git fast-export` and filtering its output,
|
|
filter the fast-export stream from stdin. The stdin must be in
|
|
the expected input format (e.g. it needs to include original-oid
|
|
directives).
|
|
|
|
--quiet::
|
|
Pass --quiet to other git commands called.
|
|
|
|
OUTPUT
|
|
------
|
|
|
|
Every time filter-repo is run, files are created in the `.git/filter-repo/`
|
|
directory. These files are overwritten unconditionally on every run.
|
|
|
|
Commit map
|
|
~~~~~~~~~~
|
|
|
|
The `.git/filter-repo/commit-map` file contains a mapping of how all
|
|
commits were (or were not) changed.
|
|
|
|
* The first line is a header containing the text "old" and "new"
|
|
* Commit mappings are in no particular order
|
|
* All commits in range of the rewrite will be listed, even commits
|
|
that are unchanged (e.g. because the commit pre-dated when the
|
|
large file(s) were introduced to the repo).
|
|
* An all-zeros hash, or null SHA, represents a nonexistent object.
|
|
When in the "new" column, this means the commit was removed
|
|
entirely.
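
A commit map in this format is easy to post-process. The Python sketch below assumes whitespace-separated columns and 40-character SHA-1 hashes (a repository using a different hash length would need a different null value); the function name and the sample data are hypothetical.

```python
NULL_SHA = "0" * 40  # assumption: SHA-1 repository


def parse_commit_map(text):
    """Parse commit-map contents into {old_hash: new_hash or None}."""
    mapping = {}
    for line in text.splitlines()[1:]:  # the first line is the header
        if not line.strip():
            continue
        old, new = line.split()
        mapping[old] = None if new == NULL_SHA else new
    return mapping


# Made-up contents; real data would come from .git/filter-repo/commit-map.
sample = "\n".join([
    "old" + " " * 38 + "new",
    "a" * 40 + " " + "b" * 40,   # rewritten commit
    "c" * 40 + " " + "c" * 40,   # unchanged commit
    "d" * 40 + " " + NULL_SHA,   # commit removed entirely
])
commit_map = parse_commit_map(sample)
removed = [old for old, new in commit_map.items() if new is None]
unchanged = [old for old, new in commit_map.items() if new == old]
```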
|
|
|
|
Reference map
|
|
~~~~~~~~~~~~~
|
|
|
|
The `.git/filter-repo/ref-map` file contains a mapping of which local
|
|
references were changed.
|
|
|
|
* The first line is a header containing the text "old" and "new"
|
|
* Reference mappings are in no particular order
|
|
* An all-zeros hash, or null SHA, represents a nonexistent object.
|
|
When in the "new" column, this means the ref was removed entirely.
|
|
|
|
[[FRESHCLONE]]
|
|
FRESH CLONE SAFETY CHECK AND --FORCE
|
|
------------------------------------
|
|
|
|
Since filter-repo does irreversible rewriting of history, it is
|
|
important to avoid making changes to a repo for which the user doesn't
|
|
have a good backup. The primary defense mechanism is to simply
|
|
educate users and rely on them to be good stewards of their data; thus
|
|
there are several warnings in the documentation about how filter-repo
|
|
rewrites history.
|
|
|
|
However, as a service to users, we would like to provide an additional
|
|
safety check beyond the documentation. There isn't a good way to
|
|
check if the user has a good backup, but we can ask a related question
|
|
that is an imperfect but quite reasonable proxy: "Is this repository a
|
|
fresh clone?" Unfortunately, that is also a question we can't get a
|
|
perfect answer to; git provides no way to answer that question.
|
|
However, there are approximately a dozen things that I found that seem
|
|
to always be true of brand new clones (assuming they are either clones
|
|
of remote repositories or are made with the `--no-local` flag), and I
|
|
check for all of those.
|
|
|
|
These checks can have both false positives and false negatives.
|
|
Someone might have a perfectly good backup of their repo without it
|
|
actually being a fresh clone -- but there's no way for filter-repo to
|
|
know that. Conversely, someone could look at all the things that
|
|
filter-repo checks for in its safety checks and then just tweak their
|
|
non-backed-up repository to satisfy those conditions (though it would
|
|
take a fair amount of effort, and it's astronomically unlikely that a
|
|
repo that isn't a fresh clone randomly happens to match all the
|
|
criteria). In practice, the safety checks filter-repo uses seem to be
|
|
really good at avoiding people accidentally running filter-repo on a
|
|
repository that they shouldn't be running it on. It even caught me
|
|
once when I did mean to run filter-repo but was in a different
|
|
directory than I thought I was.
|
|
|
|
In short, it's perfectly fine to use `--force` to override the safety
|
|
checks as long as you're okay with filter-repo irreversibly rewriting
|
|
the contents of the current repository. It is a really bad idea to
|
|
get in the habit of always specifying `--force`; if you do, one day
|
|
you will run one of your commands in the wrong directory like I did,
|
|
and you won't have the safety check anymore to bail you out. Also, it
|
|
is definitely NOT okay to recommend `--force` on forums, Q&A sites, or
|
|
in emails to other users without first carefully explaining that
|
|
`--force` means putting your repositories' data at risk. I am
|
|
especially bothered by people who suggest the flag when it clearly is
|
|
NOT needed; they are needlessly putting other people's data at risk.
|
|
|
|
[[VERSATILITY]]
|
|
VERSATILITY
|
|
-----------
|
|
|
|
filter-repo has a hierarchy of capabilities on the spectrum from easy to
|
|
use convenience flags that perform pre-defined types of filtering, to
|
|
choices that provide lots of flexibility in controlling how filtering
|
|
occurs. This spectrum includes the following:
|
|
|
|
* Convenience flags making common types of history rewriting simple (e.g.
|
|
--path, --strip-blobs-bigger-than, --replace-text, --mailmap)
|
|
* Options which are shorthand for others or which provide greater control
|
|
than others (e.g. --subdirectory-filter could just be written using
|
|
both a path selection (--path) and a path rename (--path-rename)
|
|
filter; --paths-from-file can handle all other --path* options and more
|
|
such as regex renaming of paths)
|
|
* Generic python callbacks for handling a certain type of data (the
|
|
filename, message, name, email, and refname callbacks)
|
|
* Generic python callbacks for handling fundamental git objects, allowing
|
|
greater control over the combination of data types the object holds
|
|
(the commit, tag, blob, and reset callbacks)
|
|
* The ability to import filter-repo as a module in a python program and
|
|
use its classes and functions for even greater control and flexibility
|
|
while still leveraging lots of basic capabilities. One can even use
|
|
this to write new tools with a completely different interface.
|
|
|
|
For more information about callbacks, see <<CALLBACKS>>. For examples on
|
|
writing python programs that import filter-repo as a module to create new
|
|
history rewriting tools, look at the contrib/filter-repo-demos/ directory.
|
|
That directory includes, among other examples, a reimplementation of
|
|
git-filter-branch which is faster than git-filter-branch, and a
|
|
reimplementation of BFG Repo Cleaner with several bug fixes and new
|
|
features.
|
|
|
|
[[DISCUSSION]]
|
|
DISCUSSION
|
|
----------
|
|
|
|
Using filter-repo is relatively simple, but rewriting history is part of
|
|
a larger discussion in terms of collaboration. When you rewrite
|
|
history, the old and new histories are no longer compatible; if you push
|
|
this history somewhere for others to view, it will look as though you've
|
|
done a rebase of all branches and tags. Make sure you are familiar with
|
|
the "RECOVERING FROM UPSTREAM REBASE" section of linkgit:git-rebase[1]
|
|
(and in particular, "The hard case") before proceeding, in addition to
|
|
this section.
|
|
|
|
Steps to use git-filter-repo as part of the bigger picture of doing a
|
|
history rewrite are roughly as follows:
|
|
|
|
1. Create a clone of your repository (if you created special refs outside
|
|
of refs/heads/ or refs/tags/, make sure to fetch those too). You may
|
|
pass `--bare` or `--mirror` to `git clone`, if you prefer. You should
|
|
pass `--no-local` if the repository you are cloning from is on the local
|
|
filesystem. Avoid other flags; some might confuse the fresh clone
|
|
check, and others could cause parts of the data to be missing that are
|
|
needed for the rewrite.
|
|
|
|
2. (Optional) Run `git filter-repo --analyze`. This will create a
|
|
directory of reports mentioning renames that have occurred in your
|
|
repo and also listing sizes of objects aggregated by
|
|
path/directory/extension/blob-id; this information may be useful in
|
|
choosing how to filter your repo. It can also be useful to re-run
|
|
--analyze after filtering to verify the changes look correct.
|
|
|
|
3. Run filter-repo with your desired filtering options. Many examples
|
|
are given below. For more complex cases, note that doing the
|
|
filtering in multiple steps (by running multiple filter-repo
|
|
invocations in a sequence) is supported. If anything goes wrong here,
|
|
simply delete your clone and restart.
|
|
|
|
4. Push your new repository to its new home (note that
|
|
refs/remotes/origin/* will have been moved to refs/heads/* as the
|
|
first part of filter-repo, so you can just deal with normal branches
|
|
instead of remote tracking branches). While you can force push this
|
|
to the same URL you cloned from, there are good reasons to consider
|
|
pushing to a different location instead:
|
|
|
|
* People who cloned from the original repo will have old history.
|
|
When they fetch the new history you force pushed up, unless they
|
|
do a `git reset --hard @{u}` on their branches or rebase their
|
|
local work, git will think they have hundreds or thousands of
|
|
commits with commit messages very similar to those upstream
|
|
(but which include files you wanted excised from history), and
|
|
allow the user to merge the two histories, resulting in what
|
|
looks like two copies of each commit. If they then push this
|
|
history back up, then everyone now has history with two copies of
|
|
each commit and the bad files have returned. You're more likely
|
|
to succeed in forcing people to get rid of the old history if
|
|
they have to clone a new URL.
|
|
|
|
* Rewriting history will rewrite tags; those who have already
|
|
downloaded tags will not get the updated tags by default (see the
|
|
"On Re-tagging" section of linkgit:git-tag[1]). Every user
|
|
trying to use an existing clone will have to forcibly delete all
|
|
tags and re-fetch them; it may be easier for them to just
|
|
re-clone, which they are more likely to do with a new clone URL.
|
|
|
|
* Rewriting history may delete some refs (e.g. branches that only
|
|
had files that you wanted excised from history); unless you run
|
|
git push with the `--mirror` or `--prune` options, those refs
|
|
will continue to exist on the server. If folks then merge these
|
|
branches into others, then people have started mixing old and new
|
|
history. If users had already cloned these branches, removing
|
|
them from the server isn't enough; you need all users to delete
|
|
any local branches based on these refs and run fetch with the
|
|
`--prune` option as well. Simply re-cloning from a new URL is
|
|
easier.
|
|
|
|
* The server may not allow you to force push over some refs.
|
|
For example, code review systems may have special ref
|
|
namespaces (e.g. refs/changes/, refs/pull/,
|
|
refs/merge-requests/) that they have locked down.
|
|
|
|
5. If you still want to push your rewritten history back to the
|
|
original URL despite my warnings above, you'll have to manage it
|
|
very carefully:
|
|
|
|
* git-filter-repo deletes the "origin" remote to help avoid people
|
|
accidentally repushing to the same repository, so you'll need to
|
|
remind git what origin's URL was. You'll have to look up the
|
|
command for that.
|
|
|
|
* You'll need to carefully synchronize with *everyone* who has
|
|
cloned the repository, and will also need to carefully
|
|
synchronize with *everything* (e.g. CI systems) that has cloned
|
|
it. Every single clone will either need to be thrown away and
|
|
re-cloned, or need to take all the steps outlined in item 4 as
|
|
well as follow the necessary steps from "RECOVERING FROM UPSTREAM
|
|
REBASE" section of linkgit:git-rebase[1]. If you miss fixing any
|
|
clones, you'll risk mixing old and new history and end up with an
|
|
even worse mess to clean up.
|
|
|
|
* Finally, you'll need to consult any documentation from your
|
|
hosting provider about how to remove any server-side references
|
|
to the old commits (example:
|
|
https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html[GitLab's
|
|
docs on reducing repository size]).
|
|
|
|
6. (Optional) Some additional considerations
|
|
|
|
* filter-repo by default creates replace refs (see
|
|
linkgit:git-replace[1]) for each rewritten commit ID, allowing
|
|
you to use old (unabbreviated) commit hashes to refer to the
|
|
newly rewritten commits. If you want to use these replace refs,
|
|
push them to the relevant clone URL and tell users to adjust
|
|
their fetch refspec (e.g. `git config --add remote.origin.fetch
|
|
+refs/replace/*:refs/replace/*`). Sadly, some existing git servers
|
|
(e.g. Gerrit, GitHub) do not yet understand replace refs, and
|
|
thus one can't use old commit hashes within their UI; this may
|
|
change in the future. But replace refs at least help users
|
|
locally within the git CLI.
|
|
|
|
* If you have a central repo, you may want to prevent people
|
|
from pushing old commit IDs, in order to avoid mixing old
|
|
and new history. Every repository manager does this
|
|
differently; some provide specialized commands
|
|
(e.g. https://gerrit-review.googlesource.com/Documentation/cmd-ban-commit.html),
|
|
others require you to write hooks.
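
Steps 1 through 4 can be summarized as a sequence of git commands. The Python sketch below only builds the argument lists and runs nothing; it assumes the source is a repository on the local filesystem (hence `--no-local`) and that the rewritten history is pushed to a new, empty repository. The function name, the example paths, and the example URL are hypothetical.

```python
def rewrite_workflow(source, clone_dir, filter_args, new_url):
    """Return the git commands for steps 1-4 as argv lists (nothing is run)."""
    return [
        # Step 1: fresh clone; --no-local because `source` is a local path.
        ["git", "clone", "--no-local", source, clone_dir],
        # Step 2 (optional): generate the analysis reports.
        ["git", "-C", clone_dir, "filter-repo", "--analyze"],
        # Step 3: the actual rewrite with the chosen filtering options.
        ["git", "-C", clone_dir, "filter-repo", *filter_args],
        # Step 4: filter-repo removed 'origin', so point it at the new home.
        ["git", "-C", clone_dir, "remote", "add", "origin", new_url],
        ["git", "-C", clone_dir, "push", "--all", "origin"],
        ["git", "-C", clone_dir, "push", "--tags", "origin"],
    ]


commands = rewrite_workflow(
    "/srv/git/project.git", "project-rewrite",
    ["--path", "src/"], "https://example.com/new/project.git")
```

Each entry could be passed to `subprocess.run(cmd, check=True)` once the list has been reviewed; keeping construction separate from execution makes it easy to inspect exactly what would be run before anything destructive happens.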
|
|
|
|
[[EXAMPLES]]
|
|
EXAMPLES
|
|
--------
|
|
|
|
Path based filtering
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
To only keep the 'README.md' file plus the directories 'guides' and
|
|
'tools/releases/':
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path README.md --path guides/ --path tools/releases
|
|
--------------------------------------------------
|
|
|
|
Directory names can be given with or without a trailing slash, and all
|
|
filenames are relative to the toplevel of the repo. To keep all files
|
|
except these paths, just add `--invert-paths`:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path README.md --path guides/ --path tools/releases --invert-paths
|
|
--------------------------------------------------
|
|
|
|
If you want to have both an inclusion filter and an exclusion filter, just
|
|
run filter-repo multiple times. For example, to keep the src/main
|
|
subdirectory but exclude files under src/main named 'data', run:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path src/main/
git filter-repo --path-glob 'src/*/data' --invert-paths
|
|
--------------------------------------------------
|
|
|
|
Note that the asterisk (`*`) will match across multiple directories, so the
|
|
second command would remove e.g. src/main/org/whatever/data. Also, the
|
|
second command by itself would also remove e.g. src/not-main/foo/data, but
|
|
since src/not-main/ was removed by the first command, that's not an issue.
|
|
Also, the use of quotes around the asterisk is sometimes important to avoid
|
|
glob expansion by the shell.
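
Python's `fnmatch` module treats `*` the same way, which makes it convenient for checking a glob before running a rewrite. The sketch below uses `fnmatchcase` on made-up paths; that this mirrors filter-repo's matching exactly is an assumption.

```python
from fnmatch import fnmatchcase

pattern = "src/*/data"
paths = [
    "src/main/data",
    "src/main/org/whatever/data",   # `*` spans several directories
    "src/not-main/foo/data",
    "src/data",                     # no directory between src/ and data
    "src/main/data.txt",
]
matched = [p for p in paths if fnmatchcase(p, pattern)]
```

Only the first three paths match: the asterisk happily crosses `/` boundaries, but it cannot make the literal `/data` suffix match `src/data` or `data.txt`.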
|
|
|
|
You can also select paths by regular expression (see
|
|
https://docs.python.org/3/library/re.html#regular-expression-syntax).
|
|
For example, to only include files from the repo whose name is in the
|
|
format YYYY-MM-DD.txt and is found at least two subdirectories deep:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path-regex '^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$'
|
|
--------------------------------------------------
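
Since the expressions use Python's regular expression syntax, a pattern can be tried out in Python first. The sketch below tests a date-file pattern against made-up paths; it escapes the dot before `txt` so that only a literal `.txt` suffix matches, and it assumes matching starts at the beginning of the path.

```python
import re

DATED_FILE = re.compile(r"^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$")

paths = [
    "notes/2020/2020-01-31.txt",   # two directories deep: selected
    "notes/2020-01-31.txt",        # only one directory deep
    "2020-01-31.txt",              # toplevel file
    "a/b/c/2020-01-31.txt",        # deeper than two levels: selected
    "a/b/2020-01-31.md",           # wrong extension
    "a/b/2020-1-31.txt",           # month is not two digits
]
selected = [p for p in paths if DATED_FILE.match(p)]
```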
|
|
|
|
If you want two directories to be renamed (and maybe merged if both are
|
|
renamed to the same location), use --path-rename; for example, to rename
|
|
both 'cmds/' and 'src/scripts/' to 'tools/':
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path-rename cmds:tools --path-rename src/scripts/:tools/
|
|
--------------------------------------------------
|
|
|
|
As with `--path`, directories can be specified with or without a
|
|
trailing slash for `--path-rename`.
|
|
|
|
If you do a `--path-rename` to something that was already in use, it will
|
|
be silently overwritten. However, if you try to rename multiple files to
|
|
the same location (e.g. src/scripts/run_release.sh and cmds/run_release.sh
|
|
both existed and had different content with the renames above), then you
|
|
will be given an error. If you have such a case, you may want to add
|
|
another rename command to move one of the paths somewhere else where it
|
|
won't collide:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path-rename cmds/run_release.sh:tools/do_release.sh \
                --path-rename cmds/:tools/ \
                --path-rename src/scripts/:tools/
|
|
--------------------------------------------------
|
|
|
|
Also, `--path-rename` brings up ordering issues; all path arguments are
|
|
applied in order. Thus, a command like
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path-rename sources/:src/main/ --path src/main/
|
|
--------------------------------------------------
|
|
|
|
would make sense but reversing the two arguments would not (src/main/ is
|
|
created by the rename so reversing the two would give you an empty repo).
|
|
Also, note that the rename of cmds/run_release.sh a couple of examples ago was
|
|
done before the other renames.
|
|
|
|
Note that path renaming does not do path filtering, thus the following
|
|
command
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path src/main/ --path-rename tools/:scripts/
|
|
--------------------------------------------------
|
|
|
|
would not result in the tools or scripts directories being present, because
|
|
the single filter selected only src/main/. It's likely that you would
|
|
instead want to run:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --path src/main/ --path tools/ --path-rename tools/:scripts/
|
|
--------------------------------------------------
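
The interaction between ordering, filtering, and renaming can be modeled with a small Python sketch. It is a simplified model under stated assumptions (literal paths only, no `--invert-paths`, and every path kept when only renames are given), not filter-repo's code; `matches` and `apply_path_args` are hypothetical names.

```python
def matches(prefix, path):
    """True if `path` equals `prefix` or lies underneath it as a directory."""
    if prefix == "":
        return True
    if not path.startswith(prefix):
        return False
    return (prefix.endswith("/") or len(path) == len(prefix)
            or path[len(prefix)] == "/")


def apply_path_args(path, args):
    """Apply ("path", p) and ("rename", old, new) arguments in order.

    Returns the final path, or None if no filter selected it.
    """
    has_filter = any(arg[0] == "path" for arg in args)
    wanted = not has_filter  # with renames only, every path is kept
    for arg in args:
        if arg[0] == "path":
            if matches(arg[1], path):
                wanted = True
        else:
            _, old, new = arg
            if matches(old, path):
                path = new + path[len(old):]
    return path if wanted else None


rename_then_filter = [("rename", "sources/", "src/main/"), ("path", "src/main/")]
filter_then_rename = list(reversed(rename_then_filter))
kept = apply_path_args("sources/app.py", rename_then_filter)
dropped = apply_path_args("sources/app.py", filter_then_rename)
```

In this model `kept` is `src/main/app.py` while `dropped` is `None`: the filter sees the path only as it exists at that point in the argument list. Likewise, a rename of `tools/` has no visible effect unless some `--path` argument also selects `tools/`.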
|
|
|
|
If you prefer to filter based solely on basename, use the `--use-base-name`
|
|
flag (though this is incompatible with `--path-rename`). For example, to
|
|
only include README.md and Makefile files from any directory:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --use-base-name --path README.md --path Makefile
|
|
--------------------------------------------------
|
|
|
|
If you wanted to delete all .DS_Store files in any directory, you could
|
|
either use:
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --invert-paths --path '.DS_Store' --use-base-name
|
|
--------------------------------------------------
|
|
|
|
or
|
|
|
|
--------------------------------------------------
|
|
git filter-repo --invert-paths --path-glob '*/.DS_Store' --path '.DS_Store'
|
|
--------------------------------------------------
|
|
|
|
(the `--path-glob` isn't sufficient by itself as it might miss a toplevel
|
|
.DS_Store file; further while something like `--path-glob '*.DS_Store'`
|
|
would workaround that problem it would also grab files named `foo.DS_Store`
|
|
or `bar/baz.DS_Store`)
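
The difference between these two globs can be checked with Python's
`fnmatch` module, assuming `--path-glob` follows the same matching rules.
The sample paths are made up for illustration.

```python
import fnmatch

paths = [b".DS_Store", b"docs/.DS_Store", b"foo.DS_Store", b"bar/baz.DS_Store"]

def matching(pattern):
    # fnmatchcase avoids platform-dependent case folding.
    return [p for p in paths if fnmatch.fnmatchcase(p, pattern)]

# Misses the toplevel .DS_Store because the pattern requires a "/".
print(matching(b"*/.DS_Store"))
# Catches the toplevel file, but also foo.DS_Store and bar/baz.DS_Store.
print(matching(b"*.DS_Store"))
```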

Finally, see also the `--filename-callback` from <<CALLBACKS>>.

Filtering based on many paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have a long list of files, directories, globs, or regular
expressions to filter on, you can stick them in a file and use
`--paths-from-file`; for example, with a file named stuff-i-want.txt with
contents of

--------------------------------------------------
# Blank lines and comment lines are ignored.
# Examples similar to --path:
README.md
guides/
tools/releases

# An example that is like --path-glob:
glob:*.py

# An example that is like --path-regex:
regex:^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}.txt$

# An example of renaming a path
tools/==>scripts/

# An example of using a regex to rename a path
regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt
--------------------------------------------------

then you could run

--------------------------------------------------
git filter-repo --paths-from-file stuff-i-want.txt
--------------------------------------------------

to get a repo containing only the toplevel README.md file, the guides/
and tools/releases/ directories, all python files, files whose name
was of the form YYYY-MM-DD.txt at least two subdirectories deep, and
would rename tools/ to scripts/ and rename files like foo/bar/baz.text
to bar/foo/baz.txt. Note the special line prefixes of `glob:` and
`regex:` and the special string `==>` denoting renames.
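
To make the line format concrete, here is a rough Python sketch of how one
line of such a file could be classified. It is an illustration only:
`parse_paths_line` and its return format are hypothetical and are not taken
from filter-repo itself.

```python
import re

def parse_paths_line(line):
    """Classify one line of a paths file (hypothetical helper).

    Returns None for blank and comment lines, otherwise a tuple
    (action, match_type, match, replacement).
    """
    line = line.rstrip(b"\n")
    if not line or line.startswith(b"#"):
        return None
    match_type = "literal"
    for prefix in ("glob", "regex"):
        marker = prefix.encode() + b":"
        if line.startswith(marker):
            match_type = prefix
            line = line[len(marker):]
            break
    if b"==>" in line:
        match, replacement = line.split(b"==>", 1)
        return ("rename", match_type, match, replacement)
    return ("filter", match_type, line, None)

print(parse_paths_line(b"glob:*.py\n"))
print(parse_paths_line(b"tools/==>scripts/\n"))

# The regex rename from the example file, applied with Python's re module:
_, _, pattern, replacement = parse_paths_line(
    rb"regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt")
print(re.sub(pattern, replacement, b"foo/bar/baz.text"))
```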

Sometimes you have a way of easily generating all the files you want.
For example, if you know that none of the currently tracked files have
any newlines or special characters in them (see core.quotePath from
`git config --help`) so that `git ls-files` would print all files
literally one per line, and you knew that you wanted to keep only the
files that are currently tracked (thus deleting from all commits in
history any files that only appear on other branches or that only
appear in older commits), then you could use a pair of commands such
as

--------------------------------------------------
git ls-files >../paths-i-want.txt
git filter-repo --paths-from-file ../paths-i-want.txt
--------------------------------------------------

Similarly, you could use `--paths-from-file` to delete many files. For
example, you could run `git filter-repo --analyze` to get reports,
look in one such as .git/filter-repo/analysis/path-deleted-sizes.txt
and copy all the filenames into a file such as
/tmp/files-i-dont-want-anymore.txt and then run

--------------------------------------------------
git filter-repo --invert-paths --paths-from-file /tmp/files-i-dont-want-anymore.txt
--------------------------------------------------

to delete them all.

Directory based shortcuts
~~~~~~~~~~~~~~~~~~~~~~~~~

Let's say you had a directory structure like the following:

   module/
      foo.c
      bar.c
   otherDir/
      blah.config
      stuff.txt
   zebra.jpg

If you wanted just the module/ directory and you wanted it to become the
new root so that your new directory structure looked like

   foo.c
   bar.c

then you could run:

--------------------------------------------------
git filter-repo --subdirectory-filter module/
--------------------------------------------------

If you wanted all the files from the original repo, but wanted to move
everything under a subdirectory named my-module/, so that your new
directory structure looked like

   my-module/
      module/
         foo.c
         bar.c
      otherDir/
         blah.config
         stuff.txt
      zebra.jpg

then you would instead run

--------------------------------------------------
git filter-repo --to-subdirectory-filter my-module/
--------------------------------------------------

Content based filtering
~~~~~~~~~~~~~~~~~~~~~~~

If you want to filter out all files bigger than a certain size, you can use
`--strip-blobs-bigger-than` with some size (K, M, and G suffixes are
recognized), e.g.:

--------------------------------------------------
git filter-repo --strip-blobs-bigger-than 10M
--------------------------------------------------

If you want to strip out all files with specified git object ids (hashes),
list the hashes in a file and run

--------------------------------------------------
git filter-repo --strip-blobs-with-ids FILE_WITH_GIT_BLOB_IDS
--------------------------------------------------

If you want to modify file contents, you can do so based on a list of
expressions in a file, one per line. For example, with a file named
expressions.txt containing

--------------------------------------------------
p455w0rd
foo==>bar
glob:*666*==>
regex:\bdriver\b==>pilot
literal:MM/DD/YYYY==>YYYY-MM-DD
regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2
--------------------------------------------------

then running

--------------------------------------------------
git filter-repo --replace-text expressions.txt
--------------------------------------------------

will go through and replace `p455w0rd` with `***REMOVED***`, `foo` with
`bar`, any line containing `666` with a blank line, the word `driver` with
`pilot` (but not if it has letters before or after; e.g. `drivers` will be
unmodified), replace the exact text `MM/DD/YYYY` with `YYYY-MM-DD` and
replace date strings of the form MM/DD/YYYY with ones of the form
YYYY-MM-DD. In the expressions file, there are a few things to note:

* Every line has a replacement, given by whatever is on the right of
  `==>`. If `==>` does not appear on the line, the default replacement
  is `***REMOVED***`.
* Lines can start with `literal:`, `glob:`, or `regex:` to specify
  whether to do literal string matches,
  globs (see https://docs.python.org/3/library/fnmatch.html), or regular
  expressions (see https://docs.python.org/3/library/re.html#regular-expression-syntax).
  If none of these are specified, `literal:` is assumed.
* If multiple matches are found, all are replaced.
* globs and regexes are applied to the entire file, but without any
  special flags turned on. Some folks may be interested in adding `(?m)`
  to the regex to turn on MULTILINE mode, so that `^` and `$` match the
  beginning and ends of lines rather than the beginning and end of file.
  See https://docs.python.org/3/library/re.html for details.
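
The rules in this list can be tried out on a bytestring with a few lines of
Python. The following is a minimal sketch under stated assumptions, not
filter-repo's implementation: `apply_expression` is a hypothetical helper,
it handles only `literal:` and `regex:` lines (glob handling is left out),
and the sample data is invented.

```python
import re

DEFAULT_REPLACEMENT = b"***REMOVED***"

def apply_expression(data, expression):
    """Apply one expressions-file line to data (literal and regex only)."""
    if b"==>" in expression:
        pattern, replacement = expression.split(b"==>", 1)
    else:
        pattern, replacement = expression, DEFAULT_REPLACEMENT
    if pattern.startswith(b"regex:"):
        return re.sub(pattern[len(b"regex:"):], replacement, data)
    if pattern.startswith(b"literal:"):
        pattern = pattern[len(b"literal:"):]
    # bytes.replace substitutes every occurrence, not just the first.
    return data.replace(pattern, replacement)

data = b"driver drivers p455w0rd 12/31/2020"
for expr in (b"p455w0rd",
             rb"regex:\bdriver\b==>pilot",
             rb"regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2"):
    data = apply_expression(data, expr)
print(data)

# Effect of (?m): without it, ^ and $ anchor to the whole file contents.
text = b"one\nfoo\ntwo\n"
without_flag = re.sub(rb"^foo$", b"bar", text)
with_flag = re.sub(rb"(?m)^foo$", b"bar", text)
print(without_flag == text, with_flag)
```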

See also the `--blob-callback` from <<CALLBACKS>>.

Refname based filtering
~~~~~~~~~~~~~~~~~~~~~~~

To rename tags, use `--tag-rename`, e.g.:

--------------------------------------------------
git filter-repo --tag-rename foo:bar
--------------------------------------------------

This will rename any tags starting with `foo` to now start with `bar`.
Either side of the colon could be blank, e.g.

--------------------------------------------------
git filter-repo --tag-rename '':'my-module-'
--------------------------------------------------

For more general refname modification, see `--refname-callback` from
<<CALLBACKS>>.
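
As a rough illustration of the prefix semantics described above, the rename
can be modeled on fully qualified refnames as follows. This is a sketch and
not the tool's own implementation; the helper name `rename_tag` is
hypothetical.

```python
def rename_tag(refname, old, new):
    """Swap the prefix old for new on tag refnames; leave other refs alone."""
    prefix = b"refs/tags/" + old
    if refname.startswith(prefix):
        return b"refs/tags/" + new + refname[len(prefix):]
    return refname

# Like --tag-rename foo:bar
print(rename_tag(b"refs/tags/foo-1.0", b"foo", b"bar"))
# Like --tag-rename '':'my-module-' (an empty prefix matches every tag)
print(rename_tag(b"refs/tags/v2.3", b"", b"my-module-"))
# Branches are not tags, so they are returned unchanged.
print(rename_tag(b"refs/heads/foo", b"foo", b"bar"))
```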

User and email based filtering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To modify username and emails of commits, you can create a mailmap
file in the format accepted by linkgit:git-shortlog[1]. For example,
if you have a file named my-mailmap you can run

--------------------------------------------------
git filter-repo --mailmap my-mailmap
--------------------------------------------------

and if the current contents of that file are as follows (if the
specified mailmap file is version controlled, historical versions of
the file are ignored):

--------------------------------------------------
Name For User <email@addre.ss>
<new@ema.il> <old1@ema.il>
New Name And <new@ema.il> <old2@ema.il>
New Name And <new@ema.il> Old Name And <old3@ema.il>
--------------------------------------------------

then we can update username and/or emails based on the specified
mapping.

See also the `--name-callback` and `--email-callback` from
<<CALLBACKS>>.

Parent rewriting
~~~~~~~~~~~~~~~~

To replace $commit_A with $commit_B (e.g. make all commits which had
$commit_A as a parent instead have $commit_B for that parent), and
rewrite history to make it permanent:

--------------------------------------------------
git replace $commit_A $commit_B
git filter-repo --force
--------------------------------------------------

To create a new commit with the same contents as $commit_A except with
different parent(s) and then replace $commit_A with the new commit,
and rewrite history to make it permanent:

--------------------------------------------------
git replace --graft $commit_A $new_parent_or_parents
git filter-repo --force
--------------------------------------------------

The reason to specify --force is two-fold: filter-repo will error out
if no arguments are specified, and the new graft commit would
otherwise trigger the not-a-fresh-clone check.

Partial history rewrites
~~~~~~~~~~~~~~~~~~~~~~~~

To rewrite the history on just one branch (which may cause it to no longer
share any common history with other branches), use `--refs`. For example,
to remove a file named 'extraneous.txt' from the 'master' branch:

--------------------------------------------------
git filter-repo --invert-paths --path extraneous.txt --refs master
--------------------------------------------------

To rewrite just some recent commits:

--------------------------------------------------
git filter-repo --invert-paths --path extraneous.txt --refs master~3..master
--------------------------------------------------

[[CALLBACKS]]
CALLBACKS
---------

For flexibility, filter-repo allows you to specify functions on the
command line to further filter all changes. Please note that there
are some API compatibility caveats associated with these callbacks
that you should be aware of before using them; see the "API BACKWARD
COMPATIBILITY CAVEAT" comment near the top of git-filter-repo source
code.

All callback functions are of the same general format. For a command line
argument like

--------------------------------------------------
--foo-callback 'BODY'
--------------------------------------------------

the following code will be compiled and called:

--------------------------------------------------
def foo_callback(foo):
  BODY
--------------------------------------------------

Thus, you just need to make sure your _BODY_ modifies and returns
_foo_ appropriately. One important thing to note for all callbacks is
that filter-repo uses bytestrings (see
https://docs.python.org/3/library/stdtypes.html#bytes) everywhere
instead of strings.

There are four callbacks that allow you to operate directly on raw
objects that contain data that's easy to write in
linkgit:fast-import[1] format:

--------------------------------------------------
--blob-callback
--commit-callback
--tag-callback
--reset-callback
--------------------------------------------------

We'll come back to these later because it is often the case that the
other callbacks are more convenient. The other callbacks operate on a
small piece of the raw objects or operate on pieces across multiple
types of raw object (e.g. author names and committer names and tagger
names across commits and tags, or refnames across commits, tags, and
resets, or messages across commits and tags). The convenience
callbacks are:

--------------------------------------------------
--filename-callback
--message-callback
--name-callback
--email-callback
--refname-callback
--------------------------------------------------

in each you are expected to simply return a new value based on the one
passed in. For example,

--------------------------------------------------
git-filter-repo --name-callback 'return name.replace(b"Wiliam", b"William")'
--------------------------------------------------

would result in the following function being called:

--------------------------------------------------
def name_callback(name):
  return name.replace(b"Wiliam", b"William")
--------------------------------------------------
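
To experiment with a callback body before running a real rewrite, something
like the following can wrap it in the same general shape. This is only a
sketch: `make_callback` is a hypothetical helper, and making `os` and `re`
available to the body is an assumption based on the examples in this
section using them without importing them.

```python
import os
import re
import textwrap

def make_callback(arg_name, body):
    """Build `def callback(<arg_name>): <body>` from a body string."""
    source = "def callback({}):\n{}".format(
        arg_name, textwrap.indent(body, "  "))
    namespace = {"os": os, "re": re}
    exec(source, namespace)
    return namespace["callback"]

name_callback = make_callback(
    "name", 'return name.replace(b"Wiliam", b"William")')
# Callbacks receive and return bytestrings, not str.
print(name_callback(b"Wiliam Shakespeare"))
```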

The email callback is quite similar:

--------------------------------------------------
git-filter-repo --email-callback 'return email.replace(b".cm", b".com")'
--------------------------------------------------

The refname callback is also similar, but note that the refname passed in
and returned are expected to be fully qualified (e.g. b"refs/heads/master"
instead of just b"master" and b"refs/tags/v1.0.7" instead of b"1.0.7"):

--------------------------------------------------
git-filter-repo --refname-callback '
  # Change e.g. refs/heads/master to refs/heads/prefix-master
  rdir,rpath = os.path.split(refname)
  return rdir + b"/prefix-" + rpath'
--------------------------------------------------

The message callback is quite similar to the previous three callbacks,
though it operates on a bytestring that is likely more than one line:

--------------------------------------------------
git-filter-repo --message-callback '
  if b"Signed-off-by:" not in message:
    message += b"\nSigned-off-by: Me My <self@and.eye>"
  return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)'
--------------------------------------------------

The filename callback is slightly more interesting. Returning None means
the file should be removed from all commits, returning the filename
unmodified marks the file to be kept, and returning a different name means
the file should be renamed. An example:

--------------------------------------------------
git-filter-repo --filename-callback '
  if b"/src/" in filename:
    # Remove all files with a directory named "src" in their path
    # (except when "src" appears at the toplevel).
    return None
  elif filename.startswith(b"tools/"):
    # Rename tools/ -> scripts/misc/
    return b"scripts/misc/" + filename[6:]
  else:
    # Keep the filename and do not rename it
    return filename
  '
--------------------------------------------------
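
Because the three outcomes (remove, rename, keep) are easy to mix up, it can
help to run the same body as an ordinary Python function on a few sample
paths first. The paths below are invented for illustration.

```python
def filename_callback(filename):
    if b"/src/" in filename:
        # Removed: a non-toplevel directory named "src" is in the path.
        return None
    elif filename.startswith(b"tools/"):
        # Renamed: strip the 6-byte b"tools/" prefix and add the new one.
        return b"scripts/misc/" + filename[6:]
    else:
        # Kept under its original name.
        return filename

for path in (b"lib/src/util.c", b"src/main.c", b"tools/build.sh", b"README.md"):
    print(path, "->", filename_callback(path))
```

Note that `src/main.c` is kept: the test looks for `/src/` with a leading
slash, which a toplevel `src` directory does not have.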

In contrast, the blob, reset, tag, and commit callbacks are not
expected to return a value, but are instead expected to modify the
object passed in. Major fields for these objects are (subject to API
backward compatibility caveats mentioned previously):

* Blob: `original_id` (original hash) and `data`
* Reset: `ref` (name of reference) and `from_ref` (hash or integer mark)
* Tag: `ref`, `from_ref`, `original_id`, `tagger_name`, `tagger_email`,
  `tagger_date`, `message`
* Commit: `branch`, `original_id`, `author_name`, `author_email`,
  `author_date`, `committer_name`, `committer_email`,
  `committer_date`, `message`, `file_changes` (list of
  FileChange objects, each containing a `type`, `filename`,
  `mode`, and `blob_id`), `parents` (list of hashes or integer
  marks)

An example of each:

--------------------------------------------------
git filter-repo --blob-callback '
  if len(blob.data) > 25:
    # Mark this blob for removal from all commits
    blob.skip()
  else:
    blob.data = blob.data.replace(b"Hello", b"Goodbye")
  '
--------------------------------------------------

--------------------------------------------------
git filter-repo --reset-callback 'reset.ref = reset.ref.replace(b"master", b"dev")'
--------------------------------------------------

--------------------------------------------------
git filter-repo --tag-callback '
  if tag.tagger_name == b"Jim Williams":
    # Omit this tag
    tag.skip()
  else:
    tag.message = tag.message + b"\n\nTag of %s by %s on %s" % (tag.ref, tag.tagger_email, tag.tagger_date)'
--------------------------------------------------

--------------------------------------------------
git filter-repo --commit-callback '
  # Remove executable files with three 6s in their name (including
  # from leading directories).
  # Also, undo deletion of sources/foo/bar.txt (change types are
  # either b"D" (deletion) or b"M" (add or modify); renames are
  # handled by deleting the old file and adding a new one)
  commit.file_changes = [
         change for change in commit.file_changes
         if not (change.mode == b"100755" and
                 change.filename.count(b"6") == 3) and
            not (change.type == b"D" and
                 change.filename == b"sources/foo/bar.txt")]
  # Mark all .sh files as executable; modes in git are always one of
  # 100644 (normal file), 100755 (executable), 120000 (symlink), or
  # 160000 (submodule)
  for change in commit.file_changes:
    if change.filename.endswith(b".sh"):
      change.mode = b"100755"
  '
--------------------------------------------------
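
The commit callback logic above can be exercised without a repository by
using stand-in objects. In this sketch, `SimpleNamespace` merely imitates
the `type`, `filename`, and `mode` fields listed earlier; it is an
assumption for illustration, not the real Commit or FileChange classes, and
the `None` mode on the deletion is likewise made up.

```python
from types import SimpleNamespace

def commit_callback(commit):
    # Same filtering logic as the --commit-callback body above.
    commit.file_changes = [
        change for change in commit.file_changes
        if not (change.mode == b"100755" and
                change.filename.count(b"6") == 3) and
           not (change.type == b"D" and
                change.filename == b"sources/foo/bar.txt")]
    for change in commit.file_changes:
        if change.filename.endswith(b".sh"):
            change.mode = b"100755"

def fake_change(type_, filename, mode):
    return SimpleNamespace(type=type_, filename=filename, mode=mode)

commit = SimpleNamespace(file_changes=[
    fake_change(b"M", b"bin/666", b"100755"),         # dropped: executable, three 6s
    fake_change(b"D", b"sources/foo/bar.txt", None),  # dropped: deletion is undone
    fake_change(b"M", b"run.sh", b"100644"),          # kept, becomes executable
    fake_change(b"M", b"notes666.txt", b"100644"),    # kept: not executable
])
commit_callback(commit)
result = [(c.filename, c.mode) for c in commit.file_changes]
print(result)
```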

[[INTERNALS]]
INTERNALS
---------

You probably don't need to read this section unless you are just very
curious or you are trying to do a very complex history rewrite.

How filter-repo works
~~~~~~~~~~~~~~~~~~~~~

Roughly, filter-repo works by running

--------------------------------------------------
git fast-export <options> | filter | git fast-import <options>
--------------------------------------------------

where filter-repo not only launches the whole pipeline but also serves as
the _filter_ in the middle. However, filter-repo does a few additional
things on top in order to make it into a well-rounded filtering tool. A
sequence that more accurately reflects what filter-repo runs is:

1. Verify we're in a fresh clone
2. `git fetch -u . refs/remotes/origin/*:refs/heads/*`
3. `git remote rm origin`
4. `git fast-export --show-original-ids --reference-excluded-parents --fake-missing-tagger --signed-tags=strip --tag-of-filtered-object=rewrite --use-done-feature --no-data --reencode=yes --mark-tags --all | filter | git -c core.ignorecase=false fast-import --date-format=raw-permissive --force --quiet`
5. `git update-ref --no-deref --stdin`, fed with a list of refs to nuke, and a list of replace refs to delete, create, or update.
6. `git reset --hard`
7. `git reflog expire --expire=now --all`
8. `git gc --prune=now`

Some notes or exceptions on each of the above:

1. If we're not in a fresh clone, users will not be able to recover if
   they used the wrong command or ran in the wrong repo. (Though
   `--force` overrides this check, and it's also off if you've already
   run filter-repo once in this repo.)
2. Technically, we actually use a `git update-ref` command fed with a lot
   of input due to the fact that users can use `--force` when local
   branches might not match remote branches. But this fetch command
   catches the intent rather succinctly.
3. We don't want users accidentally pushing back to the original repo, as
   discussed in <<DISCUSSION>>. It also reminds users that since history
   has been rewritten, this repo is no longer compatible with the
   original. Finally, another minor benefit is this allows users to push
   with the `--mirror` option to their new home without accidentally
   sending remote tracking branches.
4. Some of these flags are always used but others are actually
   conditional. For example, filter-repo's `--replace-text` and
   `--blob-callback` options need to work on blobs so `--no-data` cannot
   be passed to fast-export. But when we don't need to work on blobs,
   passing `--no-data` speeds things up. Also, other flags may change
   the structure of the pipeline as well (e.g. `--dry-run` and `--debug`).
5. We use this step to write replace refs for accessing the newly written
   commit hashes using their previous names. Also, if refs were renamed
   by various steps, we need to delete the old refnames in order to avoid
   mixing old and new history.
6. Users also have old versions of files in their working tree and index;
   we want those cleaned up to match the rewritten history as well. Note
   that this step is skipped in bare repos.
7. Reflogs will hold on to old history, so we need to expire them.
8. We need to gc to avoid mixing new and old history. Also, it shrinks
   the repository for users, so they don't have to do extra work. (Odds
   are that they've only rewritten trees and commits and maybe a few
   blobs, so `--aggressive` isn't needed and would be too slow.)

Information about these steps is printed out when `--debug` is passed
to filter-repo. When doing a `--partial` history rewrite, steps 2, 3,
7, and 8 are unconditionally skipped, step 5 is skipped if
`--replace-refs` is `update-no-add`, and just the nuke-unused-refs
portion of step 5 is skipped if `--replace-refs` is something else.

Limitations
~~~~~~~~~~~

Inherited limitations
^^^^^^^^^^^^^^^^^^^^^

Since git filter-repo calls fast-export and fast-import to do a lot of the
heavy lifting, it inherits limitations from those systems:

* extended commit headers, if any, are stripped
* commits get rewritten meaning they will have new hashes; therefore,
  signatures on commits and tags cannot continue to work and instead are
  just removed (thus signed tags become annotated tags)
* tags of commits are supported. Prior to git-2.24.0, tags of blobs and
  tags of tags are not supported (fast-export would die on such tags).
  Tags of trees are not supported in any git version (since fast-export
  ignores tags of trees with a warning and fast-import provides no way to
  import them).
* annotated and signed tags outside of the refs/tags/ namespace are not
  supported (their location will be mangled in weird ways)
* fast-import will die on various forms of invalid input, such as a
  timezone with more than four digits
* fast-export cannot reencode commit messages into UTF-8 if the commit
  message is not valid in its specified encoding (in such cases, it'll
  leave the commit message and the encoding header alone).
* commits without an author will be given one matching the committer
* tags without a tagger will be given a fake tagger
* references that include commit cycles in their history (which can be
  created with linkgit:git-replace[1]) will not be flagged to the user as
  an error but will be silently deleted by fast-export as though the
  branch or tag contained no interesting files

There are also some limitations due to the design of these systems:

* Trying to insert additional files into the stream can be tricky; since
  fast-export only lists file changes in a merge relative to its first
  parent, if you insert additional files into a commit that is in the
  second (or third or fourth) parent history of a merge, then you also
  need to add it to the merge manually. (Similarly, if you change which
  parent is the first parent in a merge commit, you need to manually
  update the list of file changes to be relative to the new first
  parent.)

* fast-export and fast-import work with exact file contents, not patches.
  (e.g. "Whatever the current contents of this file, update them to now
  have these contents") Because of this, removing the changes made in a
  single commit or inserting additional changes to a file in some commit
  and expecting them to propagate forward is not something that can be
  done with these tools. Use linkgit:git-rebase[1] for that.

Intrinsic limitations
^^^^^^^^^^^^^^^^^^^^^

Some types of filtering have limitations that would affect any tool
attempting to perform them; the most any tool can do is attempt to notify
the user when it detects an issue:

* When rewriting commit hashes in commit messages, there are a variety
  of cases when the hash will not be updated (whenever this happens, a
  note is written to `.git/filter-repo/suboptimal-issues`):
** if a commit hash does not correspond to a commit in the old repo
** if a commit hash corresponds to a commit that gets pruned
** if an abbreviated hash is not unique

* Pruning of empty commits can cause a merge commit to lose an entire
  ancestry line and become a non-merge. If the merge commit had no
  changes then it can be pruned too, but if it still has changes it needs
  to be kept. This might cause minor confusion since the commit will
  likely have a commit message that makes it sound like a merge commit
  even though it's not. (Whenever a merge commit becomes a non-merge
  commit, a note is written to `.git/filter-repo/suboptimal-issues`)

Issues specific to filter-repo
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Multiple repositories in the wild have been observed which use a bogus
  timezone (`+051800`); google will find you some reports. The intended
  timezone wasn't clear or wasn't always the same, so filter-repo replaces
  it with a different bogus timezone that fast-import will accept (`+0261`).

* `--path-rename` can result in pathname collisions; to avoid excessive
  memory requirements of tracking which files are in all commits or
  looking up what files exist with either every commit or every usage of
  `--path-rename`, we just tell the user that they might clobber other
  changes if they aren't careful. We can check if the clobbering comes
  from another `--path-rename` without much overhead. (Perhaps in the
  future it's worth adding a slow mode to `--path-rename` that will do the
  more exhaustive checks?)

* There is no mechanism for directly controlling which flags are passed
  to fast-export (or fast-import); only pre-defined flags can be turned
  on or off as a side-effect of other options. Direct control would make
  little sense because some options like `--full-tree` would require
  additional code in filter-repo (to parse new directives), and others
  such as `-M` or `-C` would break assumptions used in other places of
  filter-repo.

* Partial-repo filtering, while supported, runs counter to filter-repo's
  "avoid mixing old and new history" design. This support has required
  improvements to core git as well (e.g. it depends upon the
  `--reference-excluded-parents` option to fast-export that was added
  specifically for this usage within filter-repo). The `--partial` and
  `--refs` options will continue to be supported since there are people
  with usecases for them; however, I am concerned that this inconsistency
  about mixing old and new history seems likely to lead to user mistakes.
  For now, I just hope that long explanations of caveats in the
  documentation of these options suffice to curtail any such problems.

Comments on reversibility
^^^^^^^^^^^^^^^^^^^^^^^^^

Some people are interested in reversibility of a rewrite; e.g. rewrite
history, possibly add some commits, then unrewrite and get the original
history back plus a few new "unrewritten" commits. Obviously this is
impossible if your rewrite involves throwing away information
(e.g. filtering out files or replacing several different strings with
`***REMOVED***`), but may be possible with some rewrites. filter-repo is
likely to be a poor fit for this type of workflow for a few reasons:

* most of the limitations inherited from fast-export and fast-import
  are of a type that cause reversibility issues
* grafts and replace refs, if present, are used in the rewrite and made
  permanent
* rewriting of commit hashes will probably be reversible, but it is
  possible for rewritten abbreviated hashes to not be unique even if the
  original abbreviated hashes were.
* filter-repo defaults to several forms of unreversible rewriting that
  you may need to turn off (e.g. the last two bullet points above or
  reencoding commit messages into UTF-8); it's possible that additional
  forms of unreversible rewrites will be added in the future.
* I assume that people use filter-repo for one-shot conversions, not
  ongoing data transfers. I explicitly reserve the right to change any
  API in filter-repo based on this presumption (and a comment to this
  effect is found in multiple places in the code and examples). You
  have been warned.

SEE ALSO
--------
linkgit:git-rebase[1], linkgit:git-filter-branch[1]

GIT
---
Part of the linkgit:git[1] suite