rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

Go to file

phiresky 9c285670fd shorten debug output on failure (#7 )		5 years ago
.vscode	pass around config object	5 years ago
ci	well that was stupid	5 years ago
doc	demo output	5 years ago
exampledir	shorten debug output on failure (#7 )	5 years ago
src	shorten debug output on failure (#7 )	5 years ago
.gitignore	(cargo-release) version 0.8.0	5 years ago
.travis.yml	(cargo-release) version 0.8.6	5 years ago
CHANGELOG.md	shorten debug output on failure (#7 )	5 years ago
Cargo.lock	(cargo-release) start next development iteration 0.9.1	5 years ago
Cargo.toml	(cargo-release) start next development iteration 0.9.1	5 years ago
LICENSE.md	update readme	5 years ago
README.md	Add installation instructions to README.md	5 years ago
appveyor.yml	remove copypasted flag from appveyor.yml	5 years ago
rustfmt.toml	initial working version	5 years ago

README.md

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, movie subtitles (mkv, mp4), etc.

For more detail, see this introductory blogpost: https://phiresky.github.io/blog/2019/rga--ripgrep-for-zip-targz-docx-odt-epub-jpg/

rga will recursively descend into archives and match text in every file type it knows.

Here is an example directory with different file types:

demo/
├── greeting.mkv
├── hello.odt
├── hello.sqlite3
└── somearchive.zip
├── dir
│ ├── greeting.docx
│ └── inner.tar.gz
│ └── greeting.pdf
└── greeting.epub

Available Adapters

rga --rga-list-adapters

Adapters:

ffmpeg

Uses ffmpeg to extract video metadata/chapters and subtitles

Extensions: .mkv, .mp4, .avi

pandoc

Uses pandoc to convert binary/unreadable text documents to plain markdown-like text

Extensions: .epub, .odt, .docx, .fb2, .ipynb

poppler

Uses pdftotext (from poppler-utils) to extract plain text from PDF files

Extensions: .pdf

zip

Reads a zip file as a stream and recurses down into its contents

Extensions: .zip

Mime Types: application/zip
decompress

Reads compressed file as a stream and runs a different extractor on the contents.

Extensions: .tgz, .tbz, .tbz2, .gz, .bz2, .xz, .zst

Mime Types: application/gzip, application/x-bzip, application/x-xz, application/zstd
tar

Reads a tar file as a stream and recurses down into its contents

Extensions: .tar

sqlite

Uses sqlite bindings to convert sqlite databases into a simple plain text format

Extensions: .db, .db3, .sqlite, .sqlite3

Mime Types: application/x-sqlite3

The following adapters are disabled by default, and can be enabled using '--rga-adapters=+pdfpages,tesseract':

pdfpages

Converts a pdf to it's individual pages as png files. Only useful in combination with tesseract

Extensions: .pdf

tesseract

Uses tesseract to run OCR on images to make them searchable. May need -j1 to prevent overloading the system. Make sure you have tesseract installed.

Extensions: .jpg, .png

INSTALLATION:

rga should compile with stable Rust. To build it, run the following (or the equivalent in your OS):

   ~$ apt install build-essential pandoc poppler-utils ffmpeg ripgrep cargo
   ~$ cargo install ripgrep_all
   ~$ rga --version    # this should work now

You could do cargo build, instead of cargo install ripgrep_all, to just build rga in the local tree.

USAGE:

rga [FLAGS] [OPTIONS] PATTERN [PATH ...]

FLAGS:

--rga-accurate

Use more accurate but slower matching by mime type

By default, rga will match files using file extensions. Some programs, such as sqlite3, don't care about the file extension at all, so users sometimes use any or no extension at all. With this flag, rga will try to detect the mime type of input files using the magic bytes (similar to the `file` utility), and use that to choose the adapter. Detection is only done on the first 8KiB of the file, since we can't always seek on the input (in archives).

-h, --help

Prints help information

--rga-list-adapters

List all known adapters

--rga-no-cache

Disable caching of results

By default, rga caches the extracted text to a database in ~/.cache/rga if it is small enough. This way, repeated searches on the same set of files will be much faster. If you pass this flag, all caching will be disabled.

--rg-help

Show help for ripgrep itself

--rg-version

Show version of ripgrep itself

-V, --version

Prints version information

OPTIONS:

--rga-adapters=<adapters>...

Change which adapters to use and in which priority order (descending)

"foo,bar" means use only adapters foo and bar. "-bar,baz" means use all default adapters except for bar and baz. "+bar,baz" means use all default adapters and also bar and baz.

--rga-cache-compression-level=<cache-compression-level>

default: 12

--rga-cache-max-blob-len <cache-max-blob-len>

Max compressed size to cache

Longest byte length (after compression) to store in cache. Longer adapter outputs will not be cached and recomputed every time.
default: 2000000

--rga-max-archive-recursion=<max-archive-recursion>

Maximum nestedness of archives to recurse into [default: 4]

-h shows a concise overview, --help shows more detail and advanced options.

All other options not shown here are passed directly to rg, especially

PATTERN

Development

To enable debug logging:

export RUST_LOG=debug
export RUST_BACKTRACE=1

Also rember to disable caching with --rga-no-cache or clear the cache in ~/.cache/rga to debug the adapters.