NAME
dirfile-format — the dirfile database format specification file
DESCRIPTION
The
dirfile format specification fully specifies the raw and derived time
streams and auxiliary information for a
dirfile(5) database.
The format specification is contained in one or more case-sensitive text files
located in the dirfile tree. Each file is known as a
fragment. The
primary fragment is the file called
format located in the base dirfile
directory. This file may contain only part of the format specification, and
may reference other fragments (using the
/INCLUDE directive) containing
further format specification. This inclusion mechanism may be nested
arbitrarily deep.
The explicit text encoding of these files is not specified by these standards,
but must be 7-bit ASCII compatible. Examples of acceptable character encodings
include all the ISO 8859 character sets (i.e. Latin-1 through Latin-10, among
others), as well as the UTF-8 encoding of Unicode and UCS.
SYNTAX
The format specification is composed of field specification lines and directive
lines, optionally separated by blank lines or lines containing only
whitespace. Lines are separated by the line-feed character (0x0A). Unless
escaped (see below), the hash mark (#) is the comment delimiter; the
comment delimiter, and any text following it to the end of the line, is
ignored.
Tokens
Both field specification lines and directive lines consist of several tokens
separated by whitespace. Whitespace consists of one or more whitespace
characters. These are: space (0x20), horizontal tab (0x09), vertical tab
(0x0B), form-feed (0x0C), and carriage return (0x0D). The first token of a
directive line is always a
reserved word. The first token of a field
specification line is never a reserved word. Any amount of whitespace may
precede the first token on a line.
Since tokens are separated by whitespace, to include a whitespace character in a
token, it must either be escaped by preceding it with a backslash character
(\), or be replaced by a character escape sequence (see below),
or else the token must be enclosed in quotation marks ("). The
quotation marks themselves will be stripped from the token. The
null-token (that is, the token consisting of zero characters) may be
specified by a pair of quotation marks with nothing between them
(""). To include a literal quotation mark in a token, it must
be escaped (\"). Similarly, a hash mark may be included in a token
by including it in a quoted token or else by escaping it (\#);
otherwise the hash mark will be understood as the comment delimiter.
It is a syntax error to have a line which contains unmatched quotation marks, or
in which the last character is the backslash character.
Several characters when escaped by a preceding backslash character are
interpreted as special characters in tokens. The character escape sequences
are:
- \a
- an alert (bell) character (ASCII 0x07 / U+0007)
- \b
- a backspace character (ASCII 0x08 / U+0008)
- \e
- an escape character (ASCII 0x1B / U+001B)
- \f
- a form-feed character (ASCII 0x0C / U+000C)
- \n
- a line-feed character (ASCII 0x0A / U+000A)
- \r
- a carriage return character (ASCII 0x0D / U+000D)
- \t
- a horizontal tab character (ASCII 0x09 / U+0009)
- \v
- a vertical tab character (ASCII 0x0B / U+000B)
- \\
- a backslash character (ASCII 0x5C / U+005C)
- \ooo
- the single byte given by the octal number ooo.
- \xhh
- the single byte given by the hexadecimal number hh.
- \uhhhhhhh
- the UTF-8 byte sequence encoding the Unicode code point given by the
hexadecimal number hhhhhhh.
Any other character which is escaped is interpreted as the character itself
(i.e. \c is interpreted as c; also, as pointed out above, \" and \# are
interpreted as simply " and #, without their special meanings).
No token may contain the NULL character (ASCII 0x00 / U+0000). Furthermore,
although support is present to create UTF-8 byte sequences, tokens are not
required to be valid UTF-8 sequences. Any byte sequence not containing the
NULL character forms a valid token. However, there may be further restrictions
on allowed characters for a token in a particular situation (for example,
when used as a field name).
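The token rules above can be sketched as a small Python tokenizer. This is only an illustration, not the GetData parser: the function names are hypothetical, the line is assumed to have been decoded to a Python string with its line-feed already removed, quotation marks are assumed to toggle quoting anywhere in a token, and an octal escape larger than one byte or a \x or \u with no digits following is handled in a way the text above does not specify.

```python
SIMPLE_ESCAPES = {
    "a": b"\a", "b": b"\b", "e": b"\x1b", "f": b"\f", "n": b"\n",
    "r": b"\r", "t": b"\t", "v": b"\v", "\\": b"\\",
}
WHITESPACE = " \t\v\f\r"
OCT = "01234567"
HEX = "0123456789abcdefABCDEF"


def _run(line, i, alphabet, limit):
    """Return the run of at most `limit` characters from `alphabet` at line[i:]."""
    j = i
    while j < len(line) and j - i < limit and line[j] in alphabet:
        j += 1
    return line[i:j]


def tokenize(line):
    """Split one line (without its line-feed) into a list of bytes tokens."""
    tokens, cur, in_token, quoted = [], bytearray(), False, False
    i = 0
    while i < len(line):
        c = line[i]
        if c == "\\":
            if i + 1 == len(line):
                raise ValueError("line ends with a backslash")
            e = line[i + 1]
            i += 2
            in_token = True
            if e in SIMPLE_ESCAPES:
                cur += SIMPLE_ESCAPES[e]
            elif e in OCT:
                digits = e + _run(line, i, OCT, 2)
                i += len(digits) - 1
                if int(digits, 8) > 0xFF:
                    # Assumption: reject values that do not fit in one byte.
                    raise ValueError("octal escape does not fit in one byte")
                cur.append(int(digits, 8))
            elif e == "x" and _run(line, i, HEX, 2):
                digits = _run(line, i, HEX, 2)
                i += len(digits)
                cur.append(int(digits, 16))
            elif e == "u" and _run(line, i, HEX, 7):
                digits = _run(line, i, HEX, 7)
                i += len(digits)
                # chr() raises ValueError for code points above U+10FFFF.
                cur += chr(int(digits, 16)).encode("utf-8")
            else:
                cur += e.encode("utf-8")  # any other escaped character is itself
            continue
        if c == '"':
            quoted, in_token = not quoted, True
        elif c == "#" and not quoted:
            break  # comment delimiter: ignore the rest of the line
        elif c in WHITESPACE and not quoted:
            if in_token:
                tokens.append(bytes(cur))
                cur, in_token = bytearray(), False
        else:
            cur += c.encode("utf-8")
            in_token = True
        i += 1
    if quoted:
        raise ValueError("unmatched quotation mark")
    if in_token:
        tokens.append(bytes(cur))
    if any(0 in token for token in tokens):
        raise ValueError("tokens may not contain the NULL character")
    return tokens
```

Tokens are returned as byte strings because, as noted above, a token need not be a valid UTF-8 sequence.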
DIRECTIVES
There are eight
reserved words, which cannot be used as field names in
the dirfile. Instead, these specify directives. All reserved words start with
an initial forward slash (/), to distinguish them from field names.
Previous versions of the Standards permitted the omission of the slash. Like
the rest of the format specification, directives are case sensitive.
A number of the directives have
fragment scope. A directive with fragment
scope only applies to the fragment in which it is present, plus any
sub-fragments indicated by the
/INCLUDE directive, but only if those
sub-fragments don't have their own corresponding directive. Directives which
have fragment scope are:
/ENCODING,
/ENDIAN,
/FRAMEOFFSET, and
/PROTECT. Because of these scoping rules, different portions of the
dirfile may have different encodings, endiannesses, frame offsets, or
protection levels.
If a directive with fragment scope appears more than once in a fragment, only
the last such directive will be honoured, with the exception that the effect
of a directive will not be propagated to sub-fragments if the directive line
appears after the sub-fragment is included. The scoping rules of the remaining
directives are discussed below.
- /ENCODING
- The /ENCODING directive specifies the encoding scheme used to encode
binary files in the dirfile. The encoding scheme may be one of the
predefined names listed below, which are described in more detail in
dirfile-encoding(5), or any other site-specific encoding scheme.
The predefined scheme names are:
- none
- The dirfile is unencoded.
- bzip2
- The dirfile is compressed using the bzip2 compression scheme.
- gzip
- The dirfile is compressed using the gzip compression scheme.
- lzma
- The dirfile is compressed using the LZMA compression scheme.
- slim
- The dirfile is compressed using the slim compression scheme.
- text
- The dirfile is text encoded.
Implementations should fail gracefully when encountering an unknown encoding
scheme. If no encoding scheme is specified, behaviour is implementation
dependent. Syntax is:
- /ENCODING <scheme>
The /ENCODING directive has
fragment scope.
- /ENDIAN
- The /ENDIAN directive specifies the endianness of the raw data in the
database. The assumed endianness of raw data in dirfiles which omit this
directive is implementation dependent. Syntax is:
- /ENDIAN ( big | little ) [ arm ]
where the "arm" token should be included if double precision floating
point data are stored in the ARM middle-endian format. The /ENDIAN directive
has
fragment scope.
- /FRAMEOFFSET
- The /FRAMEOFFSET directive specifies the frame number of the first frame
for which data exists in binary files associated with RAW fields.
Syntax is:
- /FRAMEOFFSET <integer>
The /FRAMEOFFSET directive has
fragment scope.
- /INCLUDE
- The /INCLUDE directive specifies another file (called a fragment)
to parse for additional format specification for the dirfile. The
inclusion is treated as if the lines of the fragment were pasted verbatim
in place of the /INCLUDE directive line. The exception to this is that RAW
fields specified in the fragment are located in the directory containing
the fragment and not in the directory containing the parent fragment, and
the binary file encoding may be different for each fragment. The fragment
may be specified either with an absolute path, or else a relative path
from the current file. Syntax is:
- /INCLUDE <file>
The /INCLUDE directive has no scope: it is processed immediately and has no
long-term effect.
- /META
- The /META directive specifies a metafield attached to a particular parent
field. The field metadata may be of any allowed type except RAW.
Metafields are retrieved in exactly the same way as regular field data,
but the field code specified consists of the parent and metafield
names joined with a forward slash:
- <parent-field>/<meta-field>
META fields may not be specified before their parent field has been. Syntax is:
- /META <parent-field> {field specification line}
As an illustration of this concept,
- /META pfield meta CONST FLOAT64 3.291882
provides a scalar metadatum called
meta with value 3.291882 attached to
the field
pfield. This particular metafield may be referred to by the
field code "pfield/meta". Note that different parent fields
may have metafields with the same name, since all references to metafields
must include the parent field name. Metafields may not themselves have further
sub-metafields.
As an alternative to the /META directive, a metafield may be specified by a
standard field specification line, using
- <parent-field>/<meta-field>
as the field name. That is, the above example metafield could have also been
specified as:
- pfield/meta CONST FLOAT64 3.291882
The /META directive has no scope: it is processed immediately and has no
long-term effect.
- /PROTECT
- The /PROTECT directive specifies the advisory protection level of the
current fragment and of the RAW fields defined therein. The
protection level indicates whether writing to the fragment, or the binary
data on disk is permitted. Syntax is:
- /PROTECT <level>
Four advisory protection levels are defined:
- none
- No protection at all: data and metadata may be freely changed. This is the
default, if no /PROTECT directive is present.
- format
- The dirfile metadata is protected from change, but RAW data on disk
may be modified.
- data
- The RAW data on disk is protected from change, but metadata may be
modified.
- all
- Both metadata and data on disk are protected from change.
The /PROTECT directive has
fragment scope.
- /REFERENCE
- The /REFERENCE directive specifies the name of the field to use as the
dirfile's reference field (see dirfile(5)). If no /REFERENCE
directive is specified, the first RAW field encountered is used as
the reference field. The /REFERENCE directive must specify a RAW
field. Syntax is:
- /REFERENCE <field-code>
The /REFERENCE directive has
global scope: if multiple /REFERENCE
directives appear in the dirfile metadata, only the last such will be
honoured.
- /VERSION
- The /VERSION directive specifies the particular version of the Dirfile
Standards to which the dirfile format specification conforms. This
directive should occur before any version dependent syntax is encountered.
As of Standards Version 6, no such syntax exists, and this directive is
provided primarily to ease forward compatibility. Syntax is:
- /VERSION <integer>
The /VERSION directive has
immediate scope: its effect is immediate, and
it applies only to metadata below it, including and propagating downwards to
sub-fragments after the directive. Its effect will also propagate upwards back
to the parent fragment, and affect subsequent metadata.
FIELD SPECIFICATION LINES
Any line which does not start with a
reserved word is assumed to be a
field specification line. A field specification line consists of at least two
tokens. The first token is the
field name. The second token is the
field type. Subsequent tokens are field parameters. The meaning and
number of these parameters depend on the field type specified.
Field Names
The first token in a field specification line is the field name. The field name
consists of one or more characters, excluding both ASCII control characters
(the bytes 0x01 through 0x1F), and the characters
- & / ; < > | .
which are reserved (but see below for the use of
/ to specify
metafields). The field name may not be
INDEX, which is a special,
implicit field which contains the integer frame index. Field names are case
sensitive.
If the field name beginning a field specification line does contain a
/
character, the line is assumed to specify a metafield. See the
/META
directive above for further details.
Field Types
There are thirteen field types. Of these, ten are of vector type (BIT,
DIVIDE, LINCOM, LINTERP, MULTIPLY, PHASE, POLYNOM, RAW, RECIP, and SBIT)
and three are of scalar type (CONST, CARRAY, and STRING). The possible
field types are:
- BIT
- The BIT vector field type extracts one or more bits out of an input vector
field as an unsigned number. Syntax is:
- <field-name> BIT <input> <first-bit>
[<bits>]
which specifies
field-name to be the value of bits
first-bit
through
first-bit+
bits-1 of the input vector field
input,
when
input is converted from its native type to an (endianness
corrected) unsigned 64-bit integer. If
bits is omitted, it is assumed
to be 1. Both
first-bit and
bits may be either literal numbers,
or else the field code of a
CONST or
CARRAY field type
containing their values. The
SBIT field type is a signed version of
this field type.
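As a rough illustration of the BIT extraction described above, the following Python sketch pulls a run of bits out of an integer sample. The function name is hypothetical, and two points are assumptions not stated in this document: bit 0 is taken to be the least significant bit, and a negative input is converted to unsigned 64-bit by two's-complement wrap-around.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def extract_bits(value, first_bit, bits=1):
    """Return `bits` bits of `value`, starting at `first_bit`, as an unsigned int."""
    if bits < 1 or first_bit < 0 or first_bit + bits > 64:
        raise ValueError("bit range must lie within bits 0 through 63")
    as_u64 = value & MASK64  # assumed conversion to an unsigned 64-bit integer
    return (as_u64 >> first_bit) & ((1 << bits) - 1)
```

For example, extract_bits(0b10110100, 2, 3) selects bits 2 through 4 and yields 5.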
- CARRAY
- The CARRAY scalar field type is a list of constants fully specified in the
format specification metadata. Syntax is:
- <field-name> CARRAY <type> <value0>
<value1> <value2> ...
where
type may be any supported native data type (see the description of
the
RAW field type below), and
value0,
value1, &c.
are the values of successive elements in the scalar list interpreted as
indicated by
type. No limit is placed on the number of elements in a
CARRAY. (Note: despite being multivalued, this is not considered a
vector field since the elements of the
CARRAY are not indexed by
frames.)
- CONST
- The CONST scalar field type is a constant fully specified in the format
specification metadata. Syntax is:
- <field-name> CONST <type>
<value>
where
type may be any supported native data type (see the description of
the
RAW field type below), and
value is the numerical value of
the constant interpreted as indicated by
type.
- DIVIDE
- The DIVIDE vector field type is the quotient of two vector fields. Syntax
is:
- <field-name> DIVIDE <field1>
<field2>
The derived field will be computed as:
- field-name[n] = field1[n] / field2[n2]
with the index
n2 computed appropriately for the (potentially differing)
sample rates of the input fields. The resultant field will have the same
sample rate as
field1.
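The text above does not define how n2 is "computed appropriately". One plausible reading, sketched below in Python, maps sample n of field1 to the sample of field2 that covers the same position in the frame, using integer arithmetic on the two sample rates. The rule n2 = floor(n * spf2 / spf1) and both function names are assumptions made for illustration only.

```python
def matching_index(n, spf1, spf2):
    """Assumed rule: index into field2 aligned with sample n of field1."""
    return n * spf2 // spf1


def divide(field1, spf1, field2, spf2):
    """Compute field1[n] / field2[n2] at the sample rate of field1."""
    out = []
    for n, numerator in enumerate(field1):
        n2 = matching_index(n, spf1, spf2)
        # Python raises ZeroDivisionError here; the handling of a zero
        # divisor is not specified in this document.
        out.append(numerator / field2[n2])
    return out
```

With field1 at four samples per frame and field2 at one sample per frame, each value of field2 is reused for four consecutive samples of field1.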
- LINCOM
- The LINCOM vector field type is the linear combination of one, two or
three input vector fields. Syntax is:
- <field-name> LINCOM [<n>]
<field1> <a1> <b1> [<field2>
<a2> <b2> [<field3> <a3>
<b3>]]
where
n, if present, indicates the number of input vector fields (1, 2,
or 3). The derived field will be computed as:
- field-name[n] = (a1 * field1[n] + b1) + (a2 * field2[n2] + b2) + (a3 *
field3[n3] + b3)
with the
field2 and
field3 terms included only if specified and
the indices
n2 and
n3 computed appropriately for the
(potentially differing) sample rates of the input fields. The resultant field
will have the same sample rate as
field1. Each supplied co-efficient (a1, b1, a2, &c.) may be either a
literal number, or else the field code of a CONST or CARRAY
field type containing its value.
If
n is not specified, the number of fields is determined by looking at
the supplied parameters. Since it is possible to create a field code which is
identical to a literal number, the third token on the line is assumed to be
n if the entire token can be parsed as a literal number using the
rules outlined in
strtod(3). That is, if the field code specifying
field1 could be mistaken for a literal number,
n must be
specified to prevent ambiguity.
- LINTERP
- The LINTERP vector field type specifies a table look up based on another
vector field. Syntax is:
- <field-name> LINTERP <input>
<table>
where
input is the input vector field for the table lookup, and
table is the path to the lookup table file for the field. If this path
is relative, it is assumed to be relative to the directory containing the
fragment defining this field. The lookup table file is an ASCII text file with
two whitespace separated columns of
x and
y values. Values are
linearly interpolated between the points specified in the lookup table.
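A minimal Python sketch of such a table lookup follows. It assumes the table has at least two rows with distinct x values, sorts the rows by x, skips lines with fewer than two columns, and extrapolates linearly from the nearest segment when the input falls outside the table; none of these behaviours is specified in this document, and the function names are hypothetical.

```python
from bisect import bisect_right


def load_table(text):
    """Parse two whitespace-separated columns into (x, y) pairs sorted by x."""
    points = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) >= 2:
            points.append((float(cols[0]), float(cols[1])))
    return sorted(points)


def linterp(x, points):
    """Linearly interpolate y at x from a sorted list of (x, y) pairs."""
    xs = [p[0] for p in points]
    # Pick the segment containing x, clamped to the first or last segment
    # (assumption: out-of-range inputs are extrapolated).
    i = min(max(bisect_right(xs, x) - 1, 0), len(points) - 2)
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```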
- MULTIPLY
- The MULTIPLY vector field type is the product of two vector fields. Syntax
is:
- <field-name> MULTIPLY <field1>
<field2>
The derived field will be computed as:
- field-name[n] = field1[n] * field2[n2]
with the index
n2 computed appropriately for the (potentially differing)
sample rates of the input fields. The resultant field will have the same
sample rate as
field1.
- PHASE
- The PHASE vector field type shifts an input vector field by the specified
number of samples. Syntax is:
- <field-name> PHASE <input>
<shift>
which specifies
field-name to be the input vector field,
input,
shifted by
shift samples. A positive
shift indicates a forward
shift, towards the end-of-field. The result of shifting past the beginning- or
end-of-field is implementation dependent. The
shift parameter may be
either a literal number, or else the field code of a
CONST or
CARRAY field type containing its value.
- POLYNOM
- The POLYNOM vector field type specifies a polynomial function of a single
input vector field. Syntax is:
- <field-name> POLYNOM <input> <a0>
<a1>
[<a2> [<a3> [<a4> [<a5>]]]]
where
<input> is the input field code, and the order of the
computed polynomial is determined by how many co-efficients are present in the
specification. The derived field is computed as:
- field-name[n] = a0 + a1 * input[n] + a2 * input[n]**2 + a3 * input[n]**3 +
a4 * input[n]**4 + a5 * input[n]**5
where
** is the exponentiation operator, and the higher order terms are
computed only if the corresponding co-efficients (a2 through a5) are
specified. The
co-efficients, if specified, may be either literal numbers, or else the field
code of a
CONST or
CARRAY field type containing the value.
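The polynomial above can be evaluated with Horner's rule, as in the Python sketch below. The function name is hypothetical, and the co-efficients are assumed to have already been resolved to numbers (that is, any CONST or CARRAY field codes have been looked up).

```python
def polynom(samples, coeffs):
    """Evaluate a0 + a1*x + ... for each sample, given coeffs = [a0, a1, ...]."""
    if not 2 <= len(coeffs) <= 6:
        raise ValueError("POLYNOM takes between two and six co-efficients")
    out = []
    for x in samples:
        acc = 0
        for a in reversed(coeffs):  # Horner's rule, highest order term first
            acc = acc * x + a
        out.append(acc)
    return out
```

Omitted higher order co-efficients simply shorten the list, which matches the rule that those terms are not computed.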
- RECIP
- The RECIP vector field type computes the reciprocal of a single input
vector field. Syntax is:
- <field-name> RECIP <input>
<dividend>
where
<input> is the input field code and
<dividend>
is a scalar quantity. The derived field is computed as:
- field-name[n] = dividend / input[n].
The dividend may be either a literal number, or else the field
code of a
CONST or
CARRAY field type containing the value.
- RAW
- The RAW vector field type specifies raw time streams on disk. In this
case, the field name should correspond to the name of the file containing
the time stream. Syntax is:
- <field-name> RAW <type>
<sample-rate>
where
sample-rate is the number of samples per dirfile frame for the time
stream and
type is a token specifying the native data format type:
- UINT8
- unsigned 8-bit integer
- INT8
- signed (two's complement) 8-bit integer
- UINT16
- unsigned 16-bit integer
- INT16
- signed (two's complement) 16-bit integer
- UINT32
- unsigned 32-bit integer
- INT32
- signed (two's complement) 32-bit integer
- UINT64
- unsigned 64-bit integer
- INT64
- signed (two's complement) 64-bit integer
- FLOAT32 or FLOAT
- IEEE-754 standard 32-bit single precision floating point number
- FLOAT64 or DOUBLE
- IEEE-754 standard 64-bit double precision floating point number
- COMPLEX64
- a 64-bit complex number consisting of two IEEE-754 standard 32-bit single
precision floating point numbers representing the real and imaginary parts
of the complex number.
- COMPLEX128
- a 128-bit complex number consisting of two IEEE-754 standard 64-bit double
precision floating point numbers representing the real and imaginary parts
of the complex number.
For more information on the storage of complex valued data, see
dirfile(5).
For backwards compatibility, implementations should also recognise the following
single character type aliases in use prior to Standards Version 5:
- c
- UINT8
- u
- UINT16
- s
- INT16
- U
- UINT32
- i, S
- INT32
- f
- FLOAT32
- d
- FLOAT64
Types
INT8,
UINT64,
INT64,
COMPLEX64,
and
COMPLEX128 are not supported before Standards Version 5, so no
single character type aliases exist for these types. Standards Version 8
removed support for these single character type codes.
The
sample-rate parameter may be either a literal number, or else the
name of a
CONST or
CARRAY field type containing its
value.
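To make the type names above concrete, the following Python sketch decodes the bytes of an unencoded RAW file using the standard struct module. The function names are hypothetical. The sketch assumes that complex samples are stored as the real part followed by the imaginary part (see dirfile(5) for the actual layout), ignores any trailing partial sample, and does not handle the ARM middle-endian format.

```python
import struct

# Mapping from dirfile type names to struct format characters.
STRUCT_CODES = {
    "UINT8": "B", "INT8": "b", "UINT16": "H", "INT16": "h",
    "UINT32": "I", "INT32": "i", "UINT64": "Q", "INT64": "q",
    "FLOAT32": "f", "FLOAT": "f", "FLOAT64": "d", "DOUBLE": "d",
}


def _unpack(data, order, code):
    """Unpack as many whole items of type `code` as `data` contains."""
    size = struct.calcsize(order + code)
    count = len(data) // size
    return struct.unpack(f"{order}{count}{code}", data[: count * size])


def decode_raw(data, type_name, big_endian=False):
    """Decode raw bytes into a list of Python numbers."""
    order = ">" if big_endian else "<"
    if type_name in ("COMPLEX64", "COMPLEX128"):
        code = "f" if type_name == "COMPLEX64" else "d"
        parts = _unpack(data, order, code)
        # Assumption: real part first, then imaginary part.
        return [complex(real, imag) for real, imag in zip(parts[0::2], parts[1::2])]
    return list(_unpack(data, order, STRUCT_CODES[type_name]))
```

The number of complete frames in the decoded list is its length divided (rounding down) by the field's sample-rate.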
- SBIT
- The SBIT vector field type extracts one or more bits out of an input
vector field as a signed number. Syntax is:
- <field-name> SBIT <input> <first-bit>
[<bits>]
which specifies
field-name to be the value of bits
first-bit
through
first-bit+
bits-1 of the input vector field
input,
when
input is converted from its native type to a (endianness
corrected) signed 64-bit integer. If
bits is omitted, it is assumed to
be 1. Both
first-bit and
bits may be either literal numbers, or
else the field code of a
CONST or
CARRAY field type containing
their values. The
BIT field type is an unsigned version of this field
type.
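The signed extraction can be illustrated by sign-extending the selected bits, as in the Python sketch below. The function name is hypothetical, and the sketch assumes that bit 0 is the least significant bit and that the extracted bits are interpreted as a two's-complement number of width bits; under that assumption a single-bit SBIT field takes the values 0 and -1.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def extract_signed_bits(value, first_bit, bits=1):
    """Return `bits` bits of `value`, starting at `first_bit`, as a signed int."""
    if bits < 1 or first_bit < 0 or first_bit + bits > 64:
        raise ValueError("bit range must lie within bits 0 through 63")
    raw = ((value & MASK64) >> first_bit) & ((1 << bits) - 1)
    sign = 1 << (bits - 1)
    return (raw ^ sign) - sign  # sign-extend from `bits` bits
```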
- STRING
- The STRING scalar field type is a character string fully specified in the
format file metadata. Syntax is:
- <field-name> STRING <value>
where value is the string value of the field. Note that value is a
single token. To include whitespace in the string, enclose value in
quotation marks ("), or else escape the whitespace with the
backslash character (\).
Field Parameters
All input vector field parameters should be
field codes (see below).
Additionally, some of the numerical field parameters may be either literal
numbers or else the field code of a CONST field containing the
value, or the field code of a CARRAY followed by a left angle
bracket (<), then a non-negative integer used as the CARRAY
element index, then a right angle bracket (>), that is:
- field_code<n>
Parameters which allow non-literal values are indicated above. If the angle
brackets and element index are omitted from a
CARRAY field code used as
a parameter, the first element in the field (index zero) is assumed.
Since it is possible to create a field code which is identical to a literal
number, a parameter is assumed to be the field code of a scalar field only if
the entire token cannot be parsed as a literal number using the rules outlined
in
strtod(3). For example, a
CONST field whose field code
consists solely of digits can never be used as a parameter in a field
specification line.
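The precedence of literal numbers over field codes can be sketched in Python as follows. The function name is hypothetical, and Python's float() is used here only as a stand-in for strtod(3): the two do not accept exactly the same inputs (for instance, strtod accepts hexadecimal floating point numbers, while float() accepts underscores between digits), so this is an approximation of the rule and not a faithful implementation. It handles real-valued literals only.

```python
import re

# A field code followed by <n>, where n is a non-negative integer.
_CARRAY_INDEX = re.compile(r"^(?P<code>.+)<(?P<index>[0-9]+)>$")


def classify_parameter(token):
    """Classify a scalar parameter token as a literal or a field reference."""
    try:
        return ("literal", float(token))
    except ValueError:
        pass  # not a number, so treat it as a field code
    match = _CARRAY_INDEX.match(token)
    if match:
        return ("field", match.group("code"), int(match.group("index")))
    # No element index given: element zero is assumed for a CARRAY
    # (the index is irrelevant for a CONST).
    return ("field", token, 0)
```

A token such as 123 is always classified as a literal, which is why a CONST field whose field code consists solely of digits can never be used as a parameter.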
A literal complex number is specified as two real (floating point) numbers
separated by a semicolon (;) with no intervening whitespace. So, for
example, the tokens
- 1;0 0;1 4;0 0;5 9.313e2;74.1
represent, respectively, the real unit, the imaginary unit, the real number
four, the imaginary number 5i, and the complex number 931.3 + 74.1i.
Because the semicolon character cannot be used in field names, a
complex valued literal can never be mistaken for a field code. This allows,
among other things, the composition of complex valued fields from purely real
input fields. For example, a complex valued field,
z, may be created
from a real valued field
re, representing the real part of the complex
number, and the real valued field
im, representing the imaginary part
of the complex number, with the following
LINCOM specification:
- z LINCOM re 1 0 im 0;1 0
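A complex literal of this form can be parsed by splitting the token at the semicolon, as in the Python sketch below. The function name is hypothetical, and a token without a semicolon is assumed to denote a purely real number.

```python
def parse_complex_literal(token):
    """Parse 'real;imag' (or a bare real number) into a Python complex."""
    real, sep, imag = token.partition(";")
    if not sep:
        return complex(float(real), 0.0)
    return complex(float(real), float(imag))
```

Applied to the LINCOM example above, each sample of z is re[n] * parse_complex_literal("1") + im[n] * parse_complex_literal("0;1"), which places re in the real part and im in the imaginary part.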
Field Codes
When specifying the input to a field, either as a scalar parameter, or as an
input vector field to a non-
RAW vector field,
field codes are
used. A
field code is one of:
- •
- a simple field name, indicating a vector or scalar field
- •
- a parent field name, followed by a forward slash, followed by a metafield
name, indicating a metafield. See the description of the /META
directive above for further details.
- •
- either of the above, followed by a period, followed by a representation
suffix, but only if the field or metafield specified is not a
STRING type field.
A
representation suffix may be used to extract a real number from a
complex value. The available suffixes and their meanings are:
- .a
- This representation indicates the angle (in radians) between the positive
real axis and the value (i.e. the complex argument). The argument is in the
range [-pi, pi], and a branch cut exists along the negative real axis. At
the branch cut, -pi is returned if the imaginary part is -0, and pi is
returned if the imaginary part is +0. If z=0, zero is returned.
- .i
- This representation indicates the projection of the value onto the
imaginary axis (i.e. the imaginary part of the number).
- .m
- This representation indicates the modulus of the value (i.e. its absolute
value).
- .r
- This representation indicates the projection of the value onto the real
axis (i.e. the real part of the number).
If the specified field is purely real, the representations are calculated as if
the imaginary part were equal to +0. For example, given a complex valued
vector,
z, a vector containing the real part of
z,
re_z, could be produced with:
- re_z PHASE z.r 0
and similarly for the complex field's imaginary part, argument, and absolute
value. (Such a simple example is not strictly necessary, since z.r could be
used wherever re_z would be.)
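The four representations correspond closely to functions in Python's standard cmath module, as the sketch below shows. The function names are hypothetical. Because the period is reserved in field names, the sketch assumes that the last period in a field code, if any, introduces the representation suffix.

```python
import cmath


def split_field_code(code):
    """Split 'name.suffix' into (name, suffix); suffix is None if absent."""
    name, dot, suffix = code.rpartition(".")
    return (name, suffix) if dot else (code, None)


def representation(value, suffix):
    """Apply a representation suffix ('a', 'i', 'm' or 'r') to a value."""
    z = complex(value)  # a purely real value gets an imaginary part of +0
    if suffix == "r":
        return z.real
    if suffix == "i":
        return z.imag
    if suffix == "m":
        return abs(z)
    if suffix == "a":
        # cmath.phase honours the sign of a zero imaginary part on the
        # negative real axis and returns 0.0 for z = 0.
        return cmath.phase(z)
    raise ValueError("unknown representation suffix: " + repr(suffix))
```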
STANDARDS VERSIONS
This document describes Version 8 of the Dirfile Standards.
Version 8 of the Standards (November 2010) added the
DIVIDE,
RECIP, and
CARRAY field types, made the forward slash on
reserved words mandatory, and prohibited using the single character data type
aliases in the specification of
RAW fields. It also introduced the
optional second (arm) token to the /ENDIAN directive.
Version 7 of the Standards (October 2009) added the
SBIT and
POLYNOM field types, and the directive-less method of specifying
metafields. It also introduced the data types
COMPLEX128 and
COMPLEX64, along with the notion of
representations. Finally, it
made the number of fields parameter for
LINCOM optional.
Version 6 of the Standards (October 2008) added the
/ENCODING,
/META,
/PROTECT, and
/REFERENCE directives, and the
CONST and
STRING field
types. It permitted whitespace in tokens and introduced the character escape
sequences. It allowed
CONST fields to be used as parameters in field
specification lines. It also removed FILEFRAM as an alias for
INDEX, and prohibited . but allowed # and \ in field names.
Version 5 of the Standards (August 2008) added
VERSION and
ENDIAN,
slash demarcation of reserved words, and removed the restriction on field name
length. It introduced the data types
INT8,
INT64, and
UINT64, the new-style type specifiers, and increased the range of the
BIT field type from 32 to 64 bits. It also prohibited the characters
&;<>\| in field names.
Version 4 of the Standards (October 2006) added the
PHASE field type.
Version 3 of the Standards (January 2006) added
INCLUDE and increased the
allowed length of a field name from 16 to 50 characters.
Version 2 of the Standards (September 2005) added the
MULTIPLY field
type.
Version 1 of the Standards (November 2004) added
FRAMEOFFSET and the
optional fourth argument to the
BIT field type.
Version 0 of the Standards (before March 2003) refers to the dirfile standards
supported by the
getdata(3) library originally introduced into the
kst(1) sources, which contained support for all other features covered
by this document.
AUTHORS
The dirfile specification was developed by C. B. Netterfield
<netterfield@astro.utoronto.ca>.
Since Standards Version 3, the dirfile specification has been maintained by D.
V. Wiebe <getdata@ketiltrout.net>.
SEE ALSO
dirfile(5),
dirfile-encoding(5)