NAME
dictzip, dictunzip - compress (or expand) files, allowing random access
SYNOPSIS
dictzip [options] name
dictunzip [options] name
DESCRIPTION
dictzip compresses files using the
gzip(1) algorithm (LZ77) in a
manner which is completely compatible with the
gzip file format. An
extension to the
gzip file format (Extra Field, described in section 2.3.1.1 of
RFC 1952) allows extra data to be stored in the header of a compressed file.
Programs like
gzip and
zcat will ignore this extra data.
However,
dictd(8), the DICT protocol dictionary server, will make use of
this data to perform pseudo-random access on the file. Files in the
dictzip format should end in ".dz" so that they may be
distinguished from common
gzip files that do not contain the special
header information.
From RFC 1952, the extra field is specified as follows:
If the FLG.FEXTRA bit is set, an "extra field"
is present in the header, with total length XLEN bytes. It consists of a
series of subfields, each of the form:
+---+---+---+---+==================================+
|SI1|SI2| LEN |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+
SI1 and SI2 provide a subfield ID, typically two ASCII letters with some
mnemonic value. Jean-Loup Gailly <gzip@prep.ai.mit.edu> is maintaining a
registry of subfield IDs; please send him any subfield ID you wish to use.
Subfield IDs with SI2 = 0 are reserved for future use.
LEN gives the length of the subfield data, excluding the 4 initial bytes.
The
dictzip program uses 'R' for SI1, and 'A' for SI2 (i.e., "Random
Access"). After the LEN field, the data is arranged as follows:
+---+---+---+---+---+---+===============================+
| VER | CHLEN | CHCNT | ... CHCNT words of data ... |
+---+---+---+---+---+---+===============================+
As per RFC 1952, all data is stored least-significant byte first. For VER 1 of
the data, all values are unsigned 16-bit (2-byte) integers.
XLEN (which is specified earlier in the header) is a two-byte integer, so the
extra field can be 0xffff bytes long, 2 bytes of which are used for the
subfield ID (SI1 and SI2), and 2 bytes of which are used for the subfield
length (LEN). This leaves 0xfffb bytes (0x7ffd 2-byte entries or 0x3ffe 4-byte
entries). Given that the zip output buffer must be 10% + 12 bytes larger than
the input buffer, we can store 58969 bytes per entry, or about 1.8GB if the
2-byte entries are used. If this becomes a limiting factor, another format
version can be selected and defined for 4-byte entries.
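The size limits quoted above can be checked with a few lines of arithmetic. This is only a sketch that reuses the figures from the text (0xffff, the 4 subfield header bytes, and 58969 bytes per entry); the variable names are illustrative and nothing is read from a real file.

```python
# Figures taken from the text above; the names are illustrative only.
XLEN_MAX = 0xFFFF          # largest value a 2-byte XLEN field can hold
SUBFIELD_HEADER = 2 + 2    # SI1/SI2 plus the 2-byte LEN

payload = XLEN_MAX - SUBFIELD_HEADER   # bytes left for subfield data
entries_2byte = payload // 2           # 16-bit entries that fit
entries_4byte = payload // 4           # 32-bit entries that fit

BYTES_PER_ENTRY = 58969                # uncompressed bytes per entry, as quoted
max_input = BYTES_PER_ENTRY * entries_2byte

print(hex(payload), hex(entries_2byte), hex(entries_4byte))
print(f"{max_input / 2**30:.2f} GiB of input with 2-byte entries")
```

Note that this count, like the text, does not subtract the 6 bytes taken by VER, CHLEN, and CHCNT, so the number of entries usable in practice is slightly lower.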
For compression, the file is divided up into "chunks" of data. Each
chunk is less than 64kB, and can be compressed into an area that is also less
than 64kB long (taking incompressible data into account -- usually the data is
compressed into a block that is much smaller than the original). The CHLEN
field specifies the uncompressed length of a "chunk" of data. The CHCNT field
specifies how many chunks are present, and the CHCNT words of data specify
how long each chunk is after compression (i.e., in the current compressed
file).
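As an illustration of the layout described above, the following Python sketch walks the extra field of a gzip header and extracts the 'R' 'A' subfield. The function name parse_ra_subfield is hypothetical and is not part of dictzip or dictd; the sketch assumes the caller has already read enough header bytes to cover the whole extra field.

```python
import struct

def parse_ra_subfield(header: bytes):
    """Return (ver, chlen, sizes) from gzip header bytes, or None if absent."""
    # Fixed gzip header: ID1 ID2 CM FLG MTIME(4) XFL OS, then XLEN if FEXTRA.
    if len(header) < 12 or header[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip header")
    if not header[3] & 0x04:               # FLG.FEXTRA not set
        return None
    (xlen,) = struct.unpack_from("<H", header, 10)
    pos, end = 12, 12 + xlen
    while pos + 4 <= end:
        # Each subfield: SI1, SI2, 2-byte LEN, then LEN bytes of data.
        si1, si2, length = struct.unpack_from("<BBH", header, pos)
        data = header[pos + 4 : pos + 4 + length]
        if (si1, si2) == (ord("R"), ord("A")):
            # VER, CHLEN, CHCNT, then CHCNT compressed chunk sizes,
            # all little-endian unsigned 16-bit values (VER 1).
            ver, chlen, chcnt = struct.unpack_from("<HHH", data, 0)
            sizes = struct.unpack_from(f"<{chcnt}H", data, 6)
            return ver, chlen, list(sizes)
        pos += 4 + length
    return None
```

A plain gzip file without the FEXTRA flag yields None, which matches the statement that only ".dz" files carry the special header information.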
To perform random access on the data, the offset and length of the data are
provided to library routines. These routines determine the chunk in which the
desired data begins, and decompress that chunk. Subsequent chunks are
decompressed as necessary.
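The lookup described above can be sketched as follows. The function locate and its return tuple are hypothetical and do not reflect the actual library interface. The sketch assumes that every chunk except possibly the last holds exactly CHLEN uncompressed bytes and that the compressed chunks are stored back to back.

```python
from itertools import accumulate

def locate(offset, length, chlen, sizes):
    """Map an uncompressed (offset, length) onto the chunks that hold it.

    sizes is the list of compressed chunk lengths (the CHCNT words).
    Returns (first, last, comp_start, comp_len, skip): the chunk index
    range, the compressed byte range relative to the start of the
    compressed data, and the number of bytes to drop from the front of
    the decompressed output.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // chlen                   # chunk where the data begins
    last = (offset + length - 1) // chlen     # chunk where the data ends
    if last >= len(sizes):
        raise ValueError("range lies outside the file")
    # starts[i] is the compressed offset of chunk i; starts[-1] is the total.
    starts = [0, *accumulate(sizes)]
    comp_start = starts[first]
    comp_len = starts[last + 1] - comp_start
    skip = offset - first * chlen
    return first, last, comp_start, comp_len, skip
```

For example, with CHLEN 1000 and compressed sizes [100, 200, 50], a 20-byte read at offset 990 spans chunks 0 and 1, so 300 compressed bytes must be read and the first 990 decompressed bytes discarded.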
TRADEOFFS
- Speed
- True random file access is not realized, since any access, even for a
single byte, requires that a 64kB chunk be read and decompressed. This is
slower than accessing a flat text file, but is much, much faster than
performing serial access on a fully compressed file.
- Space
- For the textual dictionary databases we are working with, the use of 64kB
chunks and maximal LZ77 compression realizes a file which is only about 4%
larger than the same file compressed all at once.
OPTIONS
- -d or --decompress
- Decompress. This is the default if the executable is called
dictunzip.
- -c or --stdout
- Write output on standard output; keep original files unchanged. This is
only available when decompressing (because parts of the header must be
updated after a write when compressing).
- -f or --force
- Force compression or decompression even if the output file already
exists.
- -h or --help
- Display help.
- -k or --keep
- Do not delete the original file.
- -l or --list
- For each compressed file, list the following fields:
type: dzip, gzip, or text (includes files in unknown formats)
crc: CRC checksum
date and time: from header
chunks: number of chunks in file
size: size of each uncompressed chunk
compr.: compressed size
uncompr.: uncompressed size
ratio: compression ratio (0.0% if unknown)
name: name of uncompressed file
Unlike gzip, the compression method is not detected.
- -L or --license
- Display the dictzip license and quit.
- -t or --test
- Check the compressed file integrity. This option is not implemented.
Instead, it will list the header information.
- -v or --verbose
- Verbose. Display extra information during compression.
- -V or --version
- Version. Display the version number and compilation options then
quit.
- -s start or --start start
- Specify the offset to start decompression, using decimal numbers. The
default is at the beginning of the file.
- -e size or --size size
- Specify the size of the portion of the file to decompress, using decimal
numbers. The default is the whole file.
- -S start or --Start start
- Specify the offset to start decompression, using base64 numbers. The
default is at the beginning of the file.
- -E size or --Size size
- Specify the size of the portion of the file to decompress, using base64
numbers. The default is the whole file.
- -p prefilter or --pre prefilter
- Specify a shell command to execute as a filter before compression or
decompression of a chunk. The pre- and post-compression filters can be
used to provide additional compression or output formatting. The filters
may not increase the buffer size significantly. The pre- and
post-compression filters were designed to provide the most general
interface possible.
- -P postfilter or --post postfilter
- Specify a shell command to execute as a filter after compression or
decompression.
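This page does not define the "base64 numbers" accepted by -S and -E. The sketch below assumes they follow the convention used for offsets and lengths in DICT index files: digits drawn from the alphabet A-Z, a-z, 0-9, +, /, written most significant digit first. That is an assumption, not something stated above, and the helper names are hypothetical.

```python
# Assumed digit alphabet (value 0 is "A", value 63 is "/").
B64_DIGITS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_to_int(text):
    """Decode a base64 number, most significant digit first."""
    value = 0
    for ch in text:
        value = value * 64 + B64_DIGITS.index(ch)  # ValueError on a bad digit
    return value

def int_to_b64(value):
    """Encode a non-negative integer without leading zero digits."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = ""
    while True:
        value, rem = divmod(value, 64)
        digits = B64_DIGITS[rem] + digits
        if value == 0:
            return digits
```

Under this assumption, an offset taken from an index file could be passed to -S unchanged, or converted with b64_to_int for use with -s.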
CREDITS
dictzip was written by Rik Faith (faith@cs.unc.edu) and is distributed
under the terms of the GNU General Public License. If you need to distribute
under other terms, write to the author.
The main libraries used by this program (zlib, regex, libmaa) are distributed
under different terms, so you may be able to use the libraries for
applications which are incompatible with the GPL -- please see the copyright
notices and license information that come with the libraries for more
information, and consult with your attorney to resolve these issues.
SEE ALSO
dict(1),
dictd(8),
gzip(1),
gunzip(1),
zcat(1)