STILTS-TGROUP(1) | Stilts commands | STILTS-TGROUP(1) |
NAME¶
stilts-tgroup - Calculates aggregate functions on groups of rows
SYNOPSIS¶
stilts tgroup [ifmt=<in-format>] [istream=true|false] [in=<table>] [icmd=<cmds>] [ocmd=<cmds>] [omode=out|meta|stats|count|checksum|cgi|discard|topcat|samp|tosql|gui] [out=<out-table>] [ofmt=<out-format>] [keys=<expr> ...] [aggcols=<expr>;<aggregator>[;<name>] ...] [runner=sequential|parallel|parallel<n>|partest] [sort=true|false] [cache=true|false]
DESCRIPTION¶
tgroup identifies groups of rows in a table based on the values in a given column or columns, and calculates statistical quantities or otherwise collapses down the multiple values from other columns into single values representing each group. It does the same job as a SELECT ... GROUP BY statement with aggregate functions in ADQL/SQL.
The keys parameter defines how input rows are grouped, and the aggcols parameter defines what quantities to aggregate from the rows in each group. keys specifies one or more values (column names or expresssions) that must be the same for rows grouped together, while aggcols specifies zero or more columns to be added based on the content of rows in each group. The output table therefore contains one column for each entry in keys and one column for each entry in aggcols, and has one row for each group identified.
This command can therefore be used to count rows or calculate statistical quantities per group. A number of statistical aggregation methods are provided such as mean, median, minimum, maximum etc. For more specialised requirements, for instance quantiles or custom statistics, you can also use the array aggregators which generate an array containing all of the values in the group, and operate on the resulting column using one of the functions in the Arrays class.
By way of comparison, the tgroup invocation: stilts tgroup in=t keys="year detector" aggcols="0;count;num gmag;min;min_gmag gmag;mean" corresponds roughly to the ADQL query: SELECT COUNT(*) AS num, MIN(gmag) AS min_gmag, MEAN(gmag), FROM t GROUP BY year, detector
See also the tgridmap and tskymap commands, which provide similar functionality where the grouping is over evenly spaced numeric/coordinate values.
OPTIONS¶
- A filename.
- A URL.
- The special value "-", meaning standard input. In this case the input format must be given explicitly using the ifmt parameter. Note that not all formats can be streamed in this way.
- A scheme specification of the form :<scheme-name>:<scheme-args>.
- A system command line with either a "<" character at the start, or a "|" character at the end ("<syscmd" or "syscmd|"). This executes the given pipeline and reads from its standard output. This will probably only work on unix-like systems.
In any case, compressed data in one of the supported compression formats (gzip, Unix compress or bzip2) will be decompressed transparently.
Commands may alteratively be supplied in an external file, by using the indirection character '@'. Thus a value of "@filename" causes the file filename to be read for a list of filter commands to execute. The commands in the file may be separated by newline characters and/or semicolons, and lines which are blank or which start with a '#' character are ignored.
Commands may alteratively be supplied in an external file, by using the indirection character '@'. Thus a value of "@filename" causes the file filename to be read for a list of filter commands to execute. The commands in the file may be separated by newline characters and/or semicolons, and lines which are blank or which start with a '#' character are ignored.
Possible values are
- out
- meta
- stats
- count
- checksum
- cgi
- discard
- topcat
- samp
- tosql
- gui
Use the help=omode flag or see SUN/256 for more information.
This parameter must only be given if omode has its default value of "out".
This parameter must only be given if omode has its default value of "out".
Each entry is composed of two or three tokens, separated by semicolon (";") characters:
- <expr>: (required) column name, or expression using the expression language, for the quantity to be aggregated
- <aggregator>: (required) aggregation method
- <name>: (optional) name of output column; if omitted, a name based on the <expr> value will be used
The available <aggregator> values are as follows:
- count: counts the number of rows
- ngood: counts the number of non-blank items
- sum: the sum of all the combined values per bin
- mean: the mean of the combined values
- median: the median
- stdev: the sample standard deviation of the combined values
- stdev-pop: the population standard deviation of the combined values
- max: records the maximum value
- min: records the minimum value
- array: collects all non-blank values into an array
- array-withblanks: collects all values into an array; blank values are represented as zero for integers
- count-long: counts the number of rows, works for >2 billion
- ngood-long: counts the number of non-blank items, works for >2 billion
- sequential: runs using only a single thread
- parallel: runs using multiple threads for large tables, with parallelism given by the number of available processors
- parallel<n>: runs using multiple threads for large tables, with parallelism given by the supplied value <n>
- partest: runs using multiple threads even when tables are small (only intended for testing purposes)
Using parallel processing can speed up execution considerably; however, depending on the I/O operations required, it can also slow it down by disrupting patterns of disk access. If the content of a file is on a solid state disk, or is already in cache for instance because a similar command has been run recently, then parallel will probably be faster. However, if the data is being read directly from a spinning disk, for instance because the file is too large to fit in RAM, then sequential or parallel<n> with a small <n> may be faster.
The value of this parameter should make only very tiny differences to the output table. If you notice significant discrepancies please report them.
In most cases such sorting will be a small overhead on the rest of the work done by this task, so the default is true but if ordering by key is not useful you may save some resources by setting it false. If no sorting is done, the output row order is undefined.
SEE ALSO¶
If the package stilts-doc is installed, the full documentation
SUN/256 is available in HTML format:
file:///usr/share/doc/stilts/sun256/index.html
VERSION¶
STILTS version 3.4.9-debian
This is the Debian version of Stilts, which lack the support of
some file formats and network protocols. For differences see
file:///usr/share/doc/stilts/README.Debian
AUTHOR¶
Mark Taylor (Bristol University)
Mar 2017 |