Archives/pixz - pixz - blob42 source forge

mirror of https://github.com/vasi/pixz synced 2024-10-30 15:21:41 +00:00


Dave Vasilevsky 28e0515d75 Start factoring out index decoding		2012-11-04 19:48:28 -05:00
.gitignore	Woops, forgot to add pixz.c, how embarassing!	2010-10-14 02:14:58 -04:00
common.c	Start factoring out index decoding	2012-11-04 19:48:28 -05:00
cpu.c	Dynamically determine the number of CPUs--now we're actually useful	2010-01-16 23:33:56 -05:00
endian.c	Combine linux and BSD endian code	2012-10-13 07:38:14 -04:00
LICENSE	Add license	2011-08-08 15:26:19 -04:00
list.c	We never use the argument to read_file_index	2012-10-14 06:13:23 -04:00
Makefile	Don't hardcode LIBPREFIX	2012-10-13 06:23:18 -04:00
pixz.c	It's ok to decompress a text file to a TTY	2012-10-14 07:33:33 -04:00
pixz.h	Use the read buffer	2012-10-20 21:54:17 -04:00
read.c	Multiple streams are supported	2012-10-20 23:30:07 -04:00
README	Document new flag	2012-10-14 02:01:11 -04:00
test.sh	test.sh: Make it general, make it work	2012-10-13 07:11:16 -04:00
TODO	Link to github	2012-10-13 07:12:48 -04:00
write.c	cleanup	2012-10-14 09:15:42 -04:00

README

Pixz (pronounced 'pixie') is a parallel, indexing version of XZ: https://github.com/vasi/pixz


The existing XZ Utils ( http://tukaani.org/xz/ ) provide great compression in the .xz file format, but they have two significant problems:

* They are single-threaded, while most users nowadays have multi-core computers.
* The .xz files they produce are just one big block of compressed data, rather than a collection of smaller blocks. This makes random access to the original data impossible.


Pixz solves both of these problems. The most useful commands:

$ pixz foo.tar foo.tpxz         # Compress and index a tarball, multi-core
$ pixz -l foo.tpxz              # Very quickly list the contents of the compressed tarball
$ pixz -d foo.tpxz foo.tar      # Decompress it, multi-core
$ pixz -x dir/file < foo.tpxz | tar x   # Very quickly extract a file, multi-core.
                                        # Also verifies that contents match index.

$ tar -Ipixz -cf foo.tpxz foo           # Create a tarball using pixz for multi-core compression

$ pixz bar bar.xz           # Compress a non-tarball, multi-core
$ pixz -d bar.xz bar        # Decompress it, multi-core
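
The compress and decompress commands above can be combined into a quick round-trip check. The following is a minimal sketch, not part of pixz: the function name roundtrip_ok and the PIXZ override variable are made up for illustration, and it relies only on the stdin/stdout forms documented under "Specifying input and output".

```shell
# roundtrip_ok FILE
# Compress FILE, decompress the result, and compare it byte for byte with
# the original. Returns 0 when the two match, 1 otherwise.
# PIXZ is a hypothetical override for the compressor binary (default: pixz).
roundtrip_ok() {
    rt_src=$1
    rt_tmp=$(mktemp -d) || return 2

    if "${PIXZ:-pixz}" < "$rt_src" > "$rt_tmp/packed" &&
       "${PIXZ:-pixz}" -d < "$rt_tmp/packed" > "$rt_tmp/unpacked" &&
       cmp -s "$rt_src" "$rt_tmp/unpacked"
    then
        rt_status=0
    else
        rt_status=1
    fi

    # Clean up the scratch directory before reporting the result.
    rm -rf "$rt_tmp"
    return "$rt_status"
}

# Example:
#   roundtrip_ok foo.tar && echo "round trip OK"
```

Because the check compares the decompressed bytes with the original, it would also catch the truncation case described in the WARNING in this README.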


Specifying input and output:

$ pixz < foo.tar > foo.tpxz     # Same as 'pixz foo.tar foo.tpxz'
$ pixz -i foo.tar -o foo.tpxz   # Ditto. These both work for -x, -d and -l too, e.g.:

$ pixz -x -i foo.tpxz -o foo.tar file1 file2 ... # Extract the files from foo.tpxz into foo.tar

$ pixz foo.tar                  # Compress it to foo.tpxz, removing the original
$ pixz -d foo.tpxz              # Decompress it to foo.tar, removing the original


Other flags:

$ pixz -1 foo.tar           # Faster, worse compression
$ pixz -9 foo.tar           # Better, slower compression
$ pixz -p 2 foo.tar         # Cap the number of threads at 2

$ pixz -t foo.tar           # Compress but don't treat it as a tarball (don't index it)
$ pixz -d -t foo.tpxz       # Decompress foo.tpxz, don't check that contents match index
$ pixz -l -t foo.tpxz       # List the xz blocks instead of files
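
The value passed to -p can also be computed rather than hardcoded. As a rough sketch (the helper name half_cores is hypothetical, and getconf _NPROCESSORS_ONLN is assumed to be available, which is typical on Linux), this leaves about half of the machine's cores free for other work:

```shell
# half_cores [CORES]
# Print half of CORES, rounded down but never less than 1. When CORES is
# omitted, ask getconf for the number of online processors, falling back to 1.
half_cores() {
    hc_cores=${1:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
    hc_half=$(( hc_cores / 2 ))
    if [ "$hc_half" -lt 1 ]; then
        hc_half=1
    fi
    echo "$hc_half"
}

# Example:
#   pixz -p "$(half_cores)" foo.tar
```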

WARNING: Unless you pass the -t flag, pixz treats the input as a tarball whenever it looks vaguely tarball-like. In particular, if the file starts with at least 1024 zero bytes, pixz will assume it is an empty tarball and truncate the output! If your input files aren't tarballs, run with -t or risk data loss.
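
A pre-flight check for the specific case named in this warning might look like the sketch below. It is not how pixz itself detects tarballs; the function name starts_with_zero_kib is hypothetical, and the sketch assumes head -c and tr are available.

```shell
# starts_with_zero_kib FILE
# Succeed if FILE begins with at least 1024 zero bytes, the pattern that
# the warning says pixz mistakes for an empty tarball.
starts_with_zero_kib() {
    # Bytes actually present in the first 1024 (fewer for short files).
    zk_have=$(( $(head -c 1024 "$1" | wc -c) ))
    # Of those bytes, how many are not NUL.
    zk_nonzero=$(( $(head -c 1024 "$1" | tr -d '\000' | wc -c) ))
    [ "$zk_have" -eq 1024 ] && [ "$zk_nonzero" -eq 0 ]
}

# Example:
#   starts_with_zero_kib bar && echo "bar: compress with -t" >&2
```

Passing -t for every non-tarball input remains the safer habit; this check only catches the one pattern the warning spells out.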


Compare to:
    plzip
        * About equally complex and efficient
        * lzip format seems less-used
        * Version 1 is theoretically indexable...I think
    ChopZip
        * Python, much simpler
        * More flexible, supports arbitrary compression programs
        * Uses streams instead of blocks, not indexable
        * Splits input and then combines output, much higher disk usage
    pxz
        * Simpler code
        * Uses OpenMP instead of pthreads
        * Uses streams instead of blocks, not indexable
        * Uses temp files and doesn't combine them until the whole file is compressed, high disk/memory usage

Comparable tools for other compression algorithms:
    pbzip2
        * Not indexable
        * Appears slow
        * bzip2 algorithm is non-ideal
    pigz
        * Not indexable
    dictzip
        * Not parallel


Requirements:
    * libarchive 2.8 or later
    * liblzma 4.999.9-beta-212 or later (from the xz distribution)
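
One way to check the libarchive requirement from a shell is sketched below. The helper name version_at_least is hypothetical, sort -V (version sort) is assumed to be available, and the pkg-config module name libarchive in the example is an assumption about how the library is packaged on your system.

```shell
# version_at_least HAVE WANT
# Succeed if version string HAVE is the same as or newer than WANT.
version_at_least() {
    # Version-sort the pair; HAVE is new enough when WANT sorts first.
    va_lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$va_lowest" = "$2" ]
}

# Example (module name is an assumption):
#   version_at_least "$(pkg-config --modversion libarchive)" 2.8 ||
#       echo "libarchive is older than 2.8" >&2
```

Pre-release strings such as 4.999.9-beta-212 may not order as intended under sort -V, so the liblzma requirement is probably best checked by hand.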