NAME¶
Funtext - Support for Column\-based Text Files
SYNOPSIS¶
This document contains a summary of the options for processing column-based text
files.
DESCRIPTION¶
Funtools will automatically sense and process "standard" column-based
text files as if they were FITS binary tables without any change in Funtools
syntax. In particular, you can filter text files using the same syntax as FITS
binary tables:
fundisp foo.txt'[cir 512 512 .1]'
fundisp -T foo.txt > foo.rdb
funtable foo.txt'[pha=1:10,cir 512 512 10]' foo.fits
The first example displays a filtered selection of a text file. The second
example converts a text file to an RDB file. The third example converts a
filtered selection of a text file to a FITS binary table.
Text files can also be used in Funtools image programs. In this case, you must
provide binning parameters (as with raw event files), using the bincols
keyword specifier:
bincols=([xname[:tlmin[:tlmax:[binsiz]]]],[yname[:tlmin[:tlmax[:binsiz]]]
For example:
funcnts foo'[bincols=(x:1024,y:1024)]' "ann 512 512 0 10 n=10"
Standard Text Files
Standard text files have the following characteristics:
- •
- Optional comment lines start with #
- •
- Optional blank lines are considered comments
- •
- An optional table header consists of the following (in order):
- •
- a single line of alpha-numeric column names
- •
- an optional line of unit strings containing the same number of cols
- •
- an optional line of dashes containing the same number of cols
- •
- Data lines follow the optional header and (for the present) consist of
the same number of columns as the header.
- •
- Standard delimiters such as space, tab, comma, semi\-colon, and bar.
Examples:
# rdb file
foo1 foo2 foo3 foos
---- ---- ---- ----
1 2.2 3 xxxx
10 20.2 30 yyyy
# multiple consecutive whitespace and dashes
foo1 foo2 foo3 foos
--- ---- ---- ----
1 2.2 3 xxxx
10 20.2 30 yyyy
# comma delims and blank lines
foo1,foo2,foo3,foos
1,2.2,3,xxxx
10,20.2,30,yyyy
# bar delims with null values
foo1⎪foo2⎪foo3⎪foos
1⎪⎪3⎪xxxx
10⎪20.2⎪⎪yyyy
# header-less data
1 2.2 3 xxxx
10 20.2 30 yyyy
The default set of token delimiters consists of spaces, tabs, commas,
semi\-colons, and vertical bars. Several parsers are used simultaneously to
analyze a line of text in different ways. One way of analyzing a line is to
allow a combination of spaces, tabs, and commas to be squashed into a single
delimiter (no null values between consecutive delimiters). Another way is to
allow tab, semi\-colon, and vertical bar delimiters to support null values,
i.e. two consecutive delimiters implies a null value (e.g. RDB file). A
successful parser is one which returns a consistent number of columns for all
rows, with each column having a consistent data type. More than one parser can
be successful. For now, it is assumed that successful parsers all return the
same tokens for a given line. (Theoretically, there are pathological cases,
which will be taken care of as needed). Bad parsers are discarded on the fly.
If the header does not exist, then names "col1", "col2",
etc. are assigned to the columns to allow filtering. Furthermore, data types
for each column are determined by the data types found in the columns of the
first data line, and can be one of the following: string, int, and double.
Thus, all of the above examples return the following display:
fundisp foo'[foo1>5]'
FOO1 FOO2 FOO3 FOOS
---------- --------------------- ---------- ------------
10 20.20000000 30 yyyy
Comments Convert to Header Params
Comments which precede data rows are converted into header parameters and will
be written out as such using funimage or funhead. Two styles of comments are
recognized:
1. FITS-style comments have an equal sign "=" between the keyword and
value and an optional slash "/" to signify a comment. The strict
FITS rules on column positions are not enforced. In addition, strings only
need to be quoted if they contain whitespace. For example, the following are
valid FITS-style comments:
# fits0 = 100
# fits1 = /usr/local/bin
# fits2 = "/usr/local/bin /opt/local/bin"
# fits3c = /usr/local/bin /opt/local/bin /usr/bin
# fits4c = "/usr/local/bin /opt/local/bin" / path dir
Note that the fits3c comment is not quoted and therefore its value is the single
token "/usr/local/bin" and the comment is "opt/local/bin
/usr/bin". This is different from the quoted comment in fits4c.
2. Free-form comments can have an optional colon separator between the keyword
and value. In the absence of quote, all tokens after the keyword are part of
the value, i.e. no comment is allowed. If a string is quoted, then slash
"/" after the string will signify a comment. For example:
# com1 /usr/local/bin
# com2 "/usr/local/bin /opt/local/bin"
# com3 /usr/local/bin /opt/local/bin /usr/bin
# com4c "/usr/local/bin /opt/local/bin" / path dir
# com11: /usr/local/bin
# com12: "/usr/local/bin /opt/local/bin"
# com13: /usr/local/bin /opt/local/bin /usr/bin
# com14c: "/usr/local/bin /opt/local/bin" / path dir
Note that com3 and com13 are not quoted, so the whole string is part of the
value, while comz4c and com14c are quoted and have comments following the
values.
Some text files have column name and data type information in the header. You
can specify the format of column information contained in the header using the
"hcolfmt=" specification. See below for a detailed description.
Multiple Tables in a Single File
Multiple tables are supported in a single file. If an RDB-style file is sensed,
then a ^L (vertical tab) will signify end of table. Otherwise, an end of table
is sensed when a new header (i.e., all alphanumeric columns) is found. (Note
that this heuristic does not work for single column tables where the column
type is ASCII and the table that follows also has only one column.) You also
can specify characters that signal an end of table condition using the
eot= keyword. See below for details.
You can access the nth table (starting from 1) in a multi-table file by
enclosing the table number in brackets, as with a FITS extension:
fundisp foo'[2]'
The above example will display the second table in the file. (Index values start
at 1 in oder to maintain logical compatibility with FITS files, where
extension numbers also start at 1).
TEXT() Specifier
As with
ARRAY() and
EVENTS() specifiers for raw image arrays and
raw event lists respectively, you can use
TEXT() on text files to pass
key=value options to the parsers. An empty set of keywords is equivalent to
not having
TEXT() at all, that is:
fundisp foo
fundisp foo'[TEXT()]'
are equivalent. A multi-table index number is placed before the
TEXT()
specifier as the first token, when indexing into a multi\-table:
fundisp foo'[2,TEXT(...)]'
The filter specification is placed after the
TEXT() specifier, separated
by a comma, or in an entirely separate bracket:
fundisp foo'[TEXT(...),circle 512 512 .1]'
fundisp foo'[2,TEXT(...)][circle 512 512 .1]'
Text() Keyword Options
The following is a list of keywords that can be used within the
TEXT()
specifier (the first three are the most important):
- •
- delims="[delims]"
Specify token delimiters for this file. Only a single parser having these
delimiters will be used to process the file.
fundisp foo.fits'[TEXT(delims="!")]'
fundisp foo.fits'[TEXT(delims="\t%")]'
- •
- comchars="[comchars]"
Specify comment characters. You must include "\n" to allow blank
lines. These comment characters will be used for all standard parsers
(unless delims are also specified).
fundisp foo.fits'[TEXT(comchars="!\n")]'
- •
- cols="[name1:type1 ...]"
Specify names and data type of columns. This overrides header names and/or
data types in the first data row or default names and data types for
header-less tables.
fundisp foo.fits'[TEXT(cols="x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e")]'
If the column specifier is the only keyword, then the cols= is not required
(in analogy with EVENTS()):
fundisp foo.fits'[TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]'
Of course, an index is allowed in this case:
fundisp foo.fits'[2,TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]'
- •
- eot="[eot delim]"
Specify end of table string specifier for multi-table files. RDB files
support ^L. The end of table specifier is a string and the whole string
must be found alone on a line to signify EOT. For example:
fundisp foo.fits'[TEXT(eot="END")]'
will end the table when a line contains "END" is found. Multiple
lines are supported, so that:
fundisp foo.fits'[TEXT(eot="END\nGAME")]'
will end the table when a line contains "END" followed by a line
containing "GAME".
In the absence of an EOT delimiter, a new table will be sensed when a new
header (all alphanumeric columns) is found.
- •
- null1="[datatype]"
Specify data type of a single null value in row 1. Since column data types
are determined by the first row, a null value in that row will result in
an error and a request to specify names and data types using cols=. If you
only have a one null in row 1, you don't need to specify all names and
columns. Instead, use null1="type" to specify its data
type.
- •
- alen=[n]
Specify size in bytes for ASCII type columns. FITS binary tables only
support fixed length ASCII columns, so a size value must be specified. The
default is 16 bytes.
- •
- nullvalues=["true"⎪"false"]
Specify whether to expect null values. Give the parsers a hint as to whether
null values should be allowed. The default is to try to determine this
from the data.
- •
- whitespace=["true"⎪"false"]
Specify whether surrounding white space should be kept as part of string
tokens. By default surrounding white space is removed from tokens.
- •
- header=["true"⎪"false"]
Specify whether to require a header. This is needed by tables containing all
string columns (and with no row containing dashes), in order to be able to
tell whether the first row is a header or part of the data. The default is
false, meaning that the first row will be data. If a row dashes are
present, the previous row is considered the column name row.
- •
- units=["true"⎪"false"]
Specify whether to require a units line. Give the parsers a hint as to
whether a row specifying units should be allowed. The default is to try to
determine this from the data.
- •
- i2f=["true"⎪"false"]
Specify whether to allow int to float conversions. If a column in row 1
contains an integer value, the data type for that column will be set to
int. If a subsequent row contains a float in that same column, an error
will be signaled. This flag specifies that, instead of an error, the float
should be silently truncated to int. Usually, you will want an error to be
signaled, so that you can specify the data type using cols= (or by
changing the value of the column in row 1).
- •
- comeot=["true"⎪"false"⎪0⎪1⎪2]
Specify whether comment signifies end of table. If comeot is 0 or false,
then comments do not signify end of table and can be interspersed with
data rows. If the value is true or 1 (the default for standard parsers),
then non-blank lines (e.g. lines beginning with '#') signify end of table
but blanks are allowed between rows. If the value is 2, then all comments,
including blank lines, signify end of table.
- •
- lazyeot=["true"⎪"false"]
Specify whether "lazy" end of table should be permitted (default
is true for standard formats, except rdb format where explicit ^L is
required between tables). A lazy EOT can occur when a new table starts
directly after an old one, with no special EOT delimiter. A check for this
EOT condition is begun when a given row contains all string tokens. If, in
addition, there is a mismatch between the number of tokens in the previous
row and this row, or a mismatch between the number of string tokens in the
prev row and this row, a new table is assumed to have been started. For
example:
ival1 sval3
----- -----
1 two
3 four
jval1 jval2 tval3
----- ----- ------
10 20 thirty
40 50 sixty
Here the line "jval1 ..." contains all string tokens. In addition,
the number of tokens in this line (3) differs from the number of tokens in
the previous line (2). Therefore a new table is assumed to have started.
Similarly:
ival1 ival2 sval3
----- ----- -----
1 2 three
4 5 six
jval1 jval2 tval3
----- ----- ------
10 20 thirty
40 50 sixty
Again, the line "jval1 ..." contains all string tokens. The number
of string tokens in the previous row (1) differs from the number of tokens
in the current row(3). We therefore assume a new table as been
started. This lazy EOT test is not performed if lazyeot is explicitly set
to false.
- •
- hcolfmt=[header column format]
Some text files have column name and data type information in the header.
For example, VizieR catalogs have headers containing both column names and
data types:
#Column e_Kmag (F6.3) ?(k_msigcom) K total magnitude uncertainty (4) [ucd=ERROR]
#Column Rflg (A3) (rd_flg) Source of JHK default mag (6) [ucd=REFER_CODE]
#Column Xflg (I1) [0,2] (gal_contam) Extended source contamination (10) [ucd=CODE_MISC]
while Sextractor files have headers containing column names alone:
# 1 X_IMAGE Object position along x [pixel]
# 2 Y_IMAGE Object position along y [pixel]
# 3 ALPHA_J2000 Right ascension of barycenter (J2000) [deg]
# 4 DELTA_J2000 Declination of barycenter (J2000) [deg]
The hcolfmt specification allows you to describe which header lines contain
column name and data type information. It consists of a string defining
the format of the column line, using "$col" (or
"$name") to specify placement of the column name,
"$fmt" to specify placement of the data format, and
"$skip" to specify tokens to ignore. You also can specify tokens
explicitly (or, for those users familiar with how sscanf works, you can
specify scanf skip specifiers using "%*"). For example, the
VizieR hcolfmt above might be specified in several ways:
Column $col ($fmt) # explicit specification of "Column" string
$skip $col ($fmt) # skip one token
%*s $col ($fmt) # skip one string (using scanf format)
while the Sextractor format might be specified using:
$skip $col # skip one token
%*d $col # skip one int (using scanf format)
You must ensure that the hcolfmt statement only senses actual column
definitions, with no false positives or negatives. For example, the first
Sextractor specification, "$skip $col", will consider any header
line containing two tokens to be a column name specifier, while the second
one, "%*d $col", requires an integer to be the first token. In
general, it is preferable to specify formats as explicitly as possible.
Note that the VizieR-style header info is sensed automatically by the
funtools standard VizieR-like parser, using the hcolfmt "Column $col
($fmt)". There is no need for explicit use of hcolfmt in this
case.
- •
- debug=["true"⎪"false"]
Display debugging information during parsing.
Environment Variables
Environment variables are defined to allow many of these
TEXT() values to
be set without having to include them in
TEXT() every time a file is
processed:
keyword environment variable
------- --------------------
delims TEXT_DELIMS
comchars TEXT_COMCHARS
cols TEXT_COLUMNS
eot TEXT_EOT
null1 TEXT_NULL1
alen TEXT_ALEN
bincols TEXT_BINCOLS
hcolfmt TEXT_HCOLFMT
Restrictions and Problems
As with raw event files, the '+' (copy extensions) specifier is not supported
for programs such as funtable.
String to int and int to string data conversions are allowed by the text
parsers. This is done more by force of circumstance than by conviction: these
transitions often happens with VizieR catalogs, which we want to support
fully. One consequence of allowing these transitions is that the text parsers
can get confused by columns which contain a valid integer in the first row and
then switch to a string. Consider the following table:
xxx yyy zzz
---- ---- ----
111 aaa bbb
ccc 222 ddd
The xxx column has an integer value in row one a string in row two, while the
yyy column has the reverse. The parser will erroneously treat the first column
as having data type int:
fundisp foo.tab
XXX YYY ZZZ
---------- ------------ ------------
111 'aaa' 'bbb'
1667457792 '222' 'ddd'
while the second column is processed correctly. This situation can be avoided in
any number of ways, all of which force the data type of the first column to be
a string. For example, you can edit the file and explicitly quote the first
row of the column:
xxx yyy zzz
---- ---- ----
"111" aaa bbb
ccc 222 ddd
[sh] fundisp foo.tab
XXX YYY ZZZ
------------ ------------ ------------
'111' 'aaa' 'bbb'
'ccc' '222' 'ddd'
You can edit the file and explicitly set the data type of the first column:
xxx:3A yyy zzz
------ ---- ----
111 aaa bbb
ccc 222 ddd
[sh] fundisp foo.tab
XXX YYY ZZZ
------------ ------------ ------------
'111' 'aaa' 'bbb'
'ccc' '222' 'ddd'
You also can explicitly set the column names and data types of all columns,
without editing the file:
[sh] fundisp foo.tab'[TEXT(xxx:3A,yyy:3A,zzz:3a)]'
XXX YYY ZZZ
------------ ------------ ------------
'111' 'aaa' 'bbb'
'ccc' '222' 'ddd'
The issue of data type transitions (which to allow and which to disallow) is
still under discussion.
SEE ALSO¶
See
funtools(7) for a list of Funtools help pages