NAME
dirfile-encoding — dirfile database encoding schemes
DESCRIPTION
The
Dirfile Standards indicate that
RAW fields defined in the
database are accompanied by binary files containing the field data in the
specified simple data type. In certain situations, it may be advantageous to
convert the binary files in the database into a more convenient form. This is
accomplished by
encoding the binary file into the alternate form. A
common use-case for encoding a binary file is to compress it to save disk
space. Only data is modified by an encoding scheme. Database metadata is
unaffected.
Support for encoding schemes is optional. An implementation need not support any
particular encoding scheme, or may only support certain operations with it,
but should expect to encounter unknown encoding schemes and fail gracefully in
such situations.
Additionally, how a particular encoding is implemented is not specified by the
Dirfile Standards, but, for purposes of interoperability, all dirfile
implementations are encouraged to support the encoding implementation used by
the GetData dirfile reference implementation, elaborated below.
An encoding scheme is local to the particular
format specification
fragment in which it is indicated. This allows a single dirfile to have
binary files which are stored using multiple encodings, by having them defined
in multiple fragments.
The rest of this manual page discusses specifics of the encoding framework
implemented in the GetData library, and does not constitute part of the
Dirfile Standards.
THE GETDATA ENCODING FRAMEWORK
The GetData library provides an encoding framework which abstracts binary file
I/O, allowing for generic support for a wide variety of encoding schemes.
Functions which may make use of the encoding framework are:
- gd_add(3), gd_add_raw(3), gd_add_spec(3),
gd_alter_encoding(3), gd_alter_endianness(3),
gd_alter_frameoffset(3), gd_alter_entry(3),
gd_alter_raw(3), gd_alter_spec(3), gd_getdata(3),
gd_move(3), gd_nframes(3), gd_putdata(3),
and gd_rename(3).
Most of the encodings supported by GetData are implemented through external
libraries which handle the actual file I/O and data translation. All such
libraries are optional; a build of the library which omits an external library
will lack support for the associated encoding scheme. In this case, GetData
will still properly identify the encoding scheme, but attempts to use GetData
for file I/O via the encoding will fail with the
GD_E_UNSUPPORTED error
code.
GetData discovers the encoding scheme of a particular RAW field by noting the
filename extension of files associated with the field. Binary files which form
an unencoded dirfile have no file extension. The file extensions used by the
other encodings are noted below. Encoding discovery proceeds by searching for
files with the known list of file extensions (in an unspecified order) and
stopping when the first successful match is made. Because of this, when a
field has multiple data files with different, supported file extensions which
could legitimately be associated with it, the encoding scheme discovered by
GetData is not well defined.
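The discovery procedure described above can be sketched as follows. This is a minimal illustration and not GetData's own code: the function names and the mapping are hypothetical, and only the file extensions are taken from this manual page. The second helper shows one way to detect the ambiguous case in which several candidate files exist for the same field.

```python
import os

# Extensions as listed on this page; "" stands for unencoded data.
# The scheme names on the right are labels chosen for this sketch.
KNOWN_EXTENSIONS = {
    "": "none",
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
}


def discover_encoding(dirfile_path, field_name):
    """Return the scheme of the first matching data file, or None."""
    for ext, scheme in KNOWN_EXTENSIONS.items():
        if os.path.isfile(os.path.join(dirfile_path, field_name + ext)):
            return scheme  # stop at the first successful match
    return None


def candidate_encodings(dirfile_path, field_name):
    """Return every scheme with a matching file; more than one is ambiguous."""
    return sorted({
        scheme
        for ext, scheme in KNOWN_EXTENSIONS.items()
        if os.path.isfile(os.path.join(dirfile_path, field_name + ext))
    })
```

Because the search order is unspecified, a tool that cares about reproducibility could call something like candidate_encodings() first and refuse to proceed when it returns more than one entry.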
In addition to raw (unencoded) data, GetData supports five other encoding
schemes:
text encoding,
bzip2 encoding,
gzip encoding,
lzma encoding, and
slim encoding, all discussed below.
Text Encoding
The Text Encoding is unique among GetData encoding schemes in that it requires
no external library. As a result, all builds of the library contain full
support for this encoding. It is meant to serve as a reference encoding and
example of the encoding framework for work on other encoding schemes.
The Text Encoding replaces the binary data files with 7-bit ASCII files
containing a decimal text encoding of the data, one sample per line. All
operations are supported by the Text Encoding. The file extension of the Text
Encoding is
.txt.
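As a rough illustration of the one-sample-per-line layout, the sketch below converts a sequence of samples to decimal text and back. The function names are hypothetical, and the exact number formatting GetData uses (for example, the precision of floating-point values) is not described on this page, so the output here should be taken as an assumption.

```python
def encode_text(samples):
    """Render samples as ASCII decimal text, one sample per line."""
    return "".join(f"{s}\n" for s in samples)


def decode_text(text, convert=int):
    """Parse one sample per line; `convert` selects the sample type."""
    return [convert(line) for line in text.splitlines() if line.strip()]


encoded = encode_text([1, -2, 300])
assert decode_text(encoded) == [1, -2, 300]
```

Passing convert=float reads a file of floating-point samples instead of integers.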
BZip2 Encoding
The BZip2 Encoding compresses raw binary files using the Burrows-Wheeler block
sorting text compression algorithm and Huffman coding, as implemented in the
bzip2 format. GetData's BZip2 Encoding scheme is implemented through the
bzip2 compression library written by Julian Seward. GetData's BZip2
Encoding framework currently lacks write capabilities; as a result the BZip2
Encoding does not support functions which modify binary data.
GetData caches an uncompressed megabyte of data at a time to speed up access.
A call to
gd_nframes(3) requires decompression of the entire binary
file to determine its uncompressed size, and may take some time to complete.
The file extension of the BZip2 Encoding is
.bz2.
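The cost of counting frames in a bzip2-compressed file can be seen in the sketch below, which has to read the whole stream to learn its uncompressed length. It uses Python's standard bz2 module rather than GetData. The function names are hypothetical, and the sample size and samples-per-frame arguments are assumptions standing in for what the field's definition in the format specification would provide.

```python
import bz2

CHUNK = 1 << 20  # read one uncompressed mebibyte at a time


def uncompressed_size(path):
    """Decompress the whole file, chunk by chunk, and count the bytes."""
    total = 0
    with bz2.open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            total += len(block)
    return total


def frame_count(path, sample_size, samples_per_frame):
    """Number of complete frames held in the compressed data file."""
    return uncompressed_size(path) // (sample_size * samples_per_frame)
```

The running time grows with the size of the data because no part of the stream can be skipped.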
GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding (LZ77) as
implemented in the gzip format. GetData's GZip Encoding scheme is implemented
through the
zlib compression library written by Jean-loup Gailly
and Mark Adler. GetData's GZip Encoding framework currently lacks write
capabilities; as a result the GZip Encoding does not support functions which
modify binary data.
To speed the operation of
gd_nframes(3), the GZip Encoding takes the
uncompressed size of the file from the gzip footer, which contains the file's
uncompressed size in bytes, modulo 2^32. As a result, using a field with an
(uncompressed) binary file size larger than 4 GiB as the reference field
will result in the wrong number of frames being reported. The file extension
of the GZip Encoding is
.gz.
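The footer lookup and the resulting wrap-around can be illustrated as follows. In the gzip format the last four bytes of a member hold the uncompressed length modulo 2^32 as a little-endian unsigned integer. The function names and the bytes-per-frame parameter are hypothetical, and this is a sketch of the idea rather than GetData's implementation.

```python
import struct


def gzip_footer_size(path):
    """Return the size stored in the gzip footer (true size modulo 2**32)."""
    with open(path, "rb") as f:
        f.seek(-4, 2)  # 2 == os.SEEK_END: the last four bytes of the file
        (isize,) = struct.unpack("<I", f.read(4))
    return isize


def reported_frames(true_size, bytes_per_frame):
    """Frame count a footer-based reader would report for a given true size."""
    return (true_size % 2**32) // bytes_per_frame


# A 5 GiB field with 8 bytes per frame appears to hold only 1 GiB of data.
assert reported_frames(5 * 2**30, 8) == 2**30 // 8
```

The lookup takes constant time regardless of file size, which is the benefit bought at the cost of the 4 GiB limit.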
LZMA Encoding
The LZMA Encoding compresses raw binary files using the Lempel-Ziv Markov Chain
Algorithm (LZMA) as implemented in the xz container format. GetData's LZMA
Encoding scheme is implemented through the
lzma library, part of the
XZ Utils suite written by Lasse Collin, Ville Koskinen, and Igor
Pavlov. GetData's LZMA Encoding framework currently lacks write capabilities;
as a result the LZMA Encoding does not support functions which modify binary
data.
As with the BZip2 Encoding, GetData caches an uncompressed megabyte of data at a
time to speed up access. A call to
gd_nframes(3) requires
decompression of the entire binary file to determine its uncompressed size,
and may take some time to complete. The file extension of the LZMA Encoding is
.xz, or
.lzma.
Slim Encoding
The Slim Encoding compresses raw binary files using the slimlib compression
library written by Joseph Fowler. The slimlib library was developed at
Princeton University to compress dirfile-like data. GetData's Slim Encoding
framework currently lacks write capabilities; as a result, the Slim Encoding
does not support functions which modify binary data. The file extension of the
Slim Encoding is
.slm.
AUTHOR
This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.
SEE ALSO
dirfile(5),
dirfile-format(5),
bzip2(1),
gzip(1),
zlib(3).