git-filter-repo/Documentation/converting-from-filter-branch.md
Elijah Newren 23bec32283 contrib, docs: make discovery of code formatting and linting easier
Signed-off-by: Elijah Newren <newren@gmail.com>
2020-06-06 11:54:28 -07:00

Cheat Sheet: Converting from filter-branch

This document is aimed at folks who are familiar with filter-branch and want to learn how to convert over to using filter-repo.

Half-hearted conversions

You can switch nearly any git filter-branch command to use filter-repo under the covers by just replacing the git filter-branch part of the command with filter-lamely. The git.git regression test suite passes when I swap out the filter-branch script with filter-lamely, for example. (However, the filter-branch tests are not very comprehensive, so don't rely on that too much.)

Doing a half-hearted conversion has nearly all of the drawbacks of filter-branch and nearly none of the benefits of filter-repo, but it will make your command run a few times faster and makes for a very simple conversion.

You'll get a lot more performance, safety, and features by just switching to direct filter-repo commands.

Intention of "equivalent" commands

filter-branch and filter-repo have different defaults, as highlighted in the Basic Differences section below. As such, getting a command which behaves identically is not possible. Also, sometimes the filter-branch manpage lies, e.g. it says "suppose you want to...from all commits" and then uses a command line like "git filter-branch ... HEAD", which only operates on commits in the current branch rather than on all commits.

Rather than focusing on matching filter-branch output as exactly as possible, I treat the filter-branch examples as idiomatic ways to solve a certain type of problem with filter-branch, and express how one would idiomatically solve the same problem in filter-repo. Sometimes that means the results are not identical, but they are largely the same in each case.

Basic Differences

With git filter-branch, you have a git repository where every single commit (within the branches or revisions you specify) is checked out and then you run one or more shell commands to transform the working copy into your desired end state.

With git filter-repo, you are essentially given an editing tool to operate on the fast-export serialization of a repo. That means there is an input stream of all the contents of the repository, and rather than specifying filters in the form of commands to run, you usually employ a number of common pre-defined filters that provide various ways to slice, dice, or modify the repo based on its components (such as pathnames, file content, user names or emails, etc.). That makes common operations easier, even if the pre-defined filters are not as versatile as shell callbacks. For cases where more complexity or special casing is needed, filter-repo provides Python callbacks that can operate on the data structures populated from the fast-export stream to do just about anything you want.
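As a rough mental model of those callbacks, the string passed to a callback option can be thought of as the body of a Python function whose single argument is named after the option (`message`, `email`, `commit`, and so on). The sketch below imitates that idea in plain Python; it is a simplified illustration, not filter-repo's actual implementation, and the helper `make_callback` and the misspelled-name body are hypothetical.

```python
import re
import textwrap


def make_callback(arg_name, body):
    """Turn a callback body string into a one-argument Python function.

    Simplified model for illustration only; filter-repo's real wrapper
    may differ in its details.
    """
    indented = textwrap.indent(textwrap.dedent(body).strip("\n"), "    ")
    source = "def callback({}):\n{}\n".format(arg_name, indented)
    namespace = {"re": re}  # assumption: `re` is usable inside callback bodies
    exec(source, namespace)
    return namespace["callback"]


# Hypothetical body, as it might be passed to a --name-callback option.
name_callback = make_callback(
    "name",
    'return name.replace(b"Wiliam", b"William")',
)

print(name_callback(b"Wiliam Shakespeare"))
print(name_callback(b"Jane Doe"))
```

Note that the values are bytes rather than str, which is why the callback bodies in this cheat sheet use `b"..."` literals.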

filter-branch defaults to working on a subset of the repository, and requires you to specify a branch or branches, meaning you need to specify -- --all to modify all commits. filter-repo by contrast defaults to rewriting everything, and you need to specify --refs <rev-list-args> if you want to limit to just a certain set of branches or range of commits. (Though any <rev-list-args> that begin with a hyphen are not accepted by filter-repo as they look like the start of different options.)

filter-repo also takes care of additional concerns automatically, like rewriting commit messages that reference old commit IDs to instead reference the rewritten commit IDs, pruning commits which do not start empty but become empty due to the specified filters, and automatically shrinking and gc'ing the repo at the end of the filtering operation.

Cheat Sheet: Conversion of Examples from the filter-branch manpage

Removing a file

The filter-branch manual provided three different examples of removing a single file, based on different levels of ease vs. carefulness and performance:

  git filter-branch --tree-filter 'rm filename' HEAD
  git filter-branch --tree-filter 'rm -f filename' HEAD
  git filter-branch --index-filter 'git rm --cached --ignore-unmatch filename' HEAD

All of these just become

  git filter-repo --invert-paths --path filename

Extracting a subdirectory

Extracting a subdirectory via

  git filter-branch --subdirectory-filter foodir -- --all

is one of the easiest commands to convert; it just becomes

  git filter-repo --subdirectory-filter foodir

Moving the whole tree into a subdirectory

Keeping all files but placing them in a new subdirectory via

  git filter-branch --index-filter \
      'git ls-files -s | sed "s-\t\"*-&newsubdir/-" |
              GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
                      git update-index --index-info &&
       mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD

(which happens to be GNU-specific and will fail with BSD userland in very subtle ways) becomes

  git filter-repo --to-subdirectory-filter newsubdir

(which works fine regardless of GNU vs BSD userland differences.)

Re-grafting history

The filter-branch manual provided one example with three different commands that could be used to achieve it, though the first of them had limited applicability (only when the repo had a single initial commit). These three examples were:

  git filter-branch --parent-filter 'sed "s/^\$/-p <graft-id>/"' HEAD

  git filter-branch --parent-filter \
      'test $GIT_COMMIT = <commit-id> && echo "-p <graft-id>" || cat' HEAD

  git replace --graft $commit-id $graft-id
  git filter-branch $graft-id..HEAD

git-replace did not exist when the original two examples were written, but it is clear that the last example is far easier to understand. As such, filter-repo just uses the same mechanism:

  git replace --graft $commit-id $graft-id
  git filter-repo --force

NOTE: --force should usually be avoided unless you have taken care to make sure you have a backup of your repo (or are running in a fresh clone of it). It is needed in this case because filter-repo errors out when no arguments are specified, and because it usually first checks whether you are in a fresh clone before irrecoverably rewriting your repository (git-replace created a new graft and thus added something to your previously fresh clone).

Removing commits by a certain author

WARNING: This is a BAD example for BOTH filter-branch and filter-repo. It does not remove the changes the user made from the repo, it just removes the commit in question while smashing the changes from it into any subsequent commits as though the subsequent authors had been responsible for those changes as well. git rebase is likely to be a better fit for what you really want if you are looking at this example. (See also this explanation of the differences between rebase and filter-repo)

This filter-branch example

  git filter-branch --commit-filter '
      if [ "$GIT_AUTHOR_NAME" = "Darl McBribe" ];
      then
          skip_commit "$@";
      else
          git commit-tree "$@";
      fi' HEAD

becomes

  git filter-repo --commit-callback '
      if commit.author_name == b"Darl McBribe":
          commit.skip()
      '

Rewriting commit messages -- removing text

Removing git-svn-id: lines from commit messages via

  git filter-branch --msg-filter '
      sed -e "/^git-svn-id:/d"
      '

becomes

  git filter-repo --message-callback '
      return re.sub(b"^git-svn-id:.*\n", b"", message, flags=re.MULTILINE)
      '
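To see what that callback body does before running it against a real repository, the same `re.sub` call can be tried on a sample message in plain Python. This is only a sketch: the function name `strip_git_svn_id` and the sample message (including its URL and UUID placeholder) are made up for illustration.

```python
import re


def strip_git_svn_id(message: bytes) -> bytes:
    # Same expression as the callback body above: with re.MULTILINE, "^"
    # matches at the start of every line, so only whole lines that begin
    # with "git-svn-id:" are deleted.
    return re.sub(b"^git-svn-id:.*\n", b"", message, flags=re.MULTILINE)


sample = (
    b"Fix off-by-one in parser\n"
    b"\n"
    b"git-svn-id: https://svn.example.com/repo/trunk@1234 placeholder-uuid\n"
)

cleaned = strip_git_svn_id(sample)
print(cleaned)

# A mention of git-svn-id in the middle of a line is left untouched.
print(strip_git_svn_id(b"Document the git-svn-id: trailer\n"))
```

The blank line that separated the subject from the removed trailer stays in the result, just as it does with the sed version.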

Rewriting commit messages -- adding text

Adding Acked-by lines to the last ten commits via

  git filter-branch --msg-filter '
          cat &&
          echo "Acked-by: Bugs Bunny <bunny@bugzilla.org>"
      ' master~10..master

becomes

  git filter-repo --message-callback '
          return message + b"Acked-by: Bugs Bunny <bunny@bugzilla.org>\n"
      ' --refs master~10..master
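The callback body above simply concatenates bytes. If you want something slightly more defensive, a variant along the following lines could be used; this is a sketch rather than part of the original example, and the extra checks (a missing final newline, an already-present trailer) as well as the names `TRAILER` and `add_acked_by` are assumptions added for illustration.

```python
TRAILER = b"Acked-by: Bugs Bunny <bunny@bugzilla.org>\n"


def add_acked_by(message: bytes) -> bytes:
    # Assumption: make sure the trailer starts on its own line even if the
    # message lacks a final newline.
    if not message.endswith(b"\n"):
        message += b"\n"
    # Assumption: do not add the trailer a second time if it is already there.
    if TRAILER in message:
        return message
    return message + TRAILER


print(add_acked_by(b"Teach the parser about tabs\n"))
print(add_acked_by(b"Message without final newline"))
```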

Changing author/committer(/tagger?) information

  git filter-branch --env-filter '
      if test "$GIT_AUTHOR_EMAIL" = "root@localhost"
      then
              GIT_AUTHOR_EMAIL=john@example.com
      fi
      if test "$GIT_COMMITTER_EMAIL" = "root@localhost"
      then
              GIT_COMMITTER_EMAIL=john@example.com
      fi
      ' -- --all

becomes either

  # Ensure '<john@example.com> <root@localhost>' is a line in .mailmap, then:
  git filter-repo --use-mailmap

or

  git filter-repo --email-callback '
    return email if email != b"root@localhost" else b"john@example.com"
    '

(and as a bonus both filter-repo alternatives will fix tagger emails too, unlike the filter-branch example)
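When more than one address needs fixing, the same callback idea extends naturally to a lookup table. The following is a minimal sketch in plain Python; the second mapping entry and the names `EMAIL_MAP` and `fix_email` are hypothetical.

```python
# Old address -> new address. Keys and values are bytes, matching the
# b"..." literals used in the callback body above.
EMAIL_MAP = {
    b"root@localhost": b"john@example.com",
    b"admin@localhost": b"jane@example.com",  # hypothetical extra entry
}


def fix_email(email: bytes) -> bytes:
    # Addresses that are not in the table are returned unchanged.
    return EMAIL_MAP.get(email, email)


print(fix_email(b"root@localhost"))
print(fix_email(b"someone@example.org"))
```

For a long list of people, the .mailmap route shown first is probably easier to maintain than an inline dictionary.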

Restricting to a range

The partial examples

  git filter-branch ... C..H
  git filter-branch ... C..H ^D
  git filter-branch ... D..H ^C

become

  git filter-repo ... --refs C..H
  git filter-repo ... --refs C..H ^D
  git filter-repo ... --refs D..H ^C

Note that filter-branch accepts --not among the revision specifiers, but to filter-repo's Python argument parsing it looks like an option name, which breaks parsing. So, instead of e.g. --not C as we might use with filter-branch, we can specify ^C to filter-repo.

Cheat Sheet: Additional conversion examples

Running a code formatter or linter on each file with some extension

Running some program on a subset of files is relatively natural in filter-branch:

  git filter-branch --tree-filter '
      git ls-files -z "*.c" \
          | xargs -0 -n 1 clang-format -style=file -i
      '

filter-repo decided not to provide a way to run an external program to do filtering, because most filter-branch uses of this ability are riddled with safety problems and performance issues. However, in special cases like this it's fairly safe. One can write a script that uses filter-repo as a library to achieve this, while also gaining filter-repo's automatic handling of other concerns like rewriting commit IDs in commit messages or pruning commits that become empty. In fact, one of the contrib demos, lint-history, handles this exact type of situation already:

  lint-history --relevant 'return filename.endswith(b".c")' \
      clang-format -style=file -i
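The --relevant argument shown above is a small piece of Python that receives the filename as bytes. To cover more than one extension, the predicate could be written along these lines; this is a sketch that assumes the body is evaluated like an ordinary Python function body, and the name `is_relevant` and the choice of extensions are illustrative.

```python
def is_relevant(filename: bytes) -> bool:
    # bytes.endswith accepts a tuple of suffixes, so one call covers
    # several extensions.
    return filename.endswith((b".c", b".h"))


print(is_relevant(b"src/main.c"))
print(is_relevant(b"include/util.h"))
print(is_relevant(b"README.md"))
```

Under that assumption, the corresponding one-line body would be `return filename.endswith((b".c", b".h"))`.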