DATAMASH(1)

User Commands

NAME¶

datamash - command-line calculations

SYNOPSIS¶

datamash [OPTION] op [col] [op col ...]

DESCRIPTION¶

Performs numeric/string operations on input from stdin.

'op' is the operation to perform; for grouping operations, 'col' is the input field to use.

File operations:¶

transpose, reverse

Numeric Grouping operations:¶

sum, min, max, absmin, absmax

Textual/Numeric Grouping operations:¶

count, first, last, rand, unique, collapse, countunique

Statistical Grouping operations:¶

mean, median, q1, q3, iqr, mode, antimode, pstdev, sstdev, pvar, svar, mad, madraw, pskew, sskew, pkurt, skurt, dpo, jarque

OPTIONS¶

Grouping Options:¶

-f, --full: print entire input line before op results (default: print only the grouped keys)

-g, --group=X[,Y,Z]: group via fields X,[Y,Z]

--header-in: first input line is column headers

--header-out: print column headers as first line

-H, --headers: same as '--header-in --header-out'

-i, --ignore-case: ignore upper/lower case when comparing text; this affects grouping and string operations

-s, --sort: sort the input before grouping; this removes the need to manually pipe the input through 'sort'

File Operation Options:¶

--no-strict: allow lines with varying number of fields

--filler=X: fill missing values with X (default N/A)
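As a rough model of how --no-strict and --filler interact for 'transpose', the Python sketch below pads short lines with a filler value. It is only an illustration, not datamash's implementation; the function name, the error message, and the filler value "N/A" are assumptions made for this example.

```python
from itertools import zip_longest

def transpose(rows, strict=True, filler="N/A"):
    """Transpose a list of rows, where each row is a list of field strings."""
    widths = {len(row) for row in rows}
    if strict and len(widths) > 1:
        # Models the default (strict) behaviour: ragged input is rejected.
        raise ValueError("lines have a varying number of fields")
    # zip_longest pads the shorter rows with the filler value.
    return [list(col) for col in zip_longest(*rows, fillvalue=filler)]

rows = [["a", "b", "c"], ["d", "e"]]
print(transpose(rows, strict=False))
```

With strict=True (the default in this sketch) the same ragged input raises an error instead of being padded.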

General Options:¶

-t, --field-separator=X: use X instead of TAB as field delimiter

-W, --whitespace: use whitespace (one or more spaces and/or tabs) for field delimiters

-z, --zero-terminated: end lines with 0 byte, not newline

--help: display this help and exit

--version: output version information and exit

AVAILABLE OPERATIONS¶

File operations:¶

transpose: transpose rows, columns of the input file

reverse: reverse field order in each line

Numeric Grouping operations¶

sum: sum of the values

min: minimum value

max: maximum value

absmin: minimum of the absolute values

absmax: maximum of the absolute values

Textual/Numeric Grouping operations¶

count: count number of elements in the group

first: the first value of the group

last: the last value of the group

rand: one random value from the group

unique: comma-separated sorted list of unique values

collapse: comma-separated list of all input values

countunique: number of unique/distinct values

Statistical Grouping operations¶

mean: mean of the values

median: median value

q1: 1st quartile value

q3: 3rd quartile value

iqr: inter-quartile range

mode: mode value (most common value)

antimode: anti-mode value (least common value)

pstdev: population standard deviation

sstdev: sample standard deviation

pvar: population variance

svar: sample variance
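The population and sample variants differ only in the divisor: n for the population forms, n - 1 for the sample forms. The Python sketch below uses the textbook definitions to show the difference; the function names simply echo the operation names and this is not datamash's own code.

```python
import math

def pvar(values):
    """Population variance: sum of squared deviations divided by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def svar(values):
    """Sample variance: the same sum of squares divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(pvar(data), math.sqrt(pvar(data)))  # population variance and standard deviation
print(svar(data), math.sqrt(svar(data)))  # sample variance and standard deviation
```

The standard deviations (pstdev, sstdev) are the square roots of the corresponding variances.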

mad: median absolute deviation, scaled by constant 1.4826 for normal distributions

madraw: median absolute deviation, unscaled
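The relationship between 'madraw' and 'mad' can be sketched in a few lines of Python. This follows the usual definition of the median absolute deviation and the 1.4826 constant quoted above; it is an illustration, not datamash's implementation.

```python
from statistics import median

def madraw(values):
    """Unscaled median absolute deviation from the median."""
    med = median(values)
    return median(abs(v - med) for v in values)

def mad(values, scale=1.4826):
    """Median absolute deviation scaled for normally distributed data."""
    return scale * madraw(values)

data = [1, 2, 3, 4, 100]
print(madraw(data), mad(data))
```

Note that the outlier (100) barely affects the result, which is why the MAD is often used as a robust alternative to the standard deviation.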

sskew: skewness of the (sample) group

pskew: skewness of the (population) group

Values x reported by 'sskew' and 'pskew' operations:

          x > 0       -  positively skewed / skewed right
      0 > x           -  negatively skewed / skewed left
          x > 1       -  highly skewed right
      1 > x >  0.5    -  moderately skewed right
    0.5 > x > -0.5    -  approximately symmetric
   -0.5 > x > -1      -  moderately skewed left
     -1 > x           -  highly skewed left
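The interpretation table above can be expressed as a small Python helper, shown here together with the textbook population skewness (third central moment divided by the variance raised to 1.5). Both functions are illustrative sketches with assumed names; datamash's exact estimators may differ. The table uses strict inequalities, so this sketch assigns boundary values (such as exactly 0.5) to the band nearer zero.

```python
def pskew(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def describe_skew(x):
    """Map a skewness value to the wording used in the table."""
    side = "right" if x > 0 else "left"
    size = abs(x)
    if size > 1:
        return f"highly skewed {side}"
    if size > 0.5:
        return f"moderately skewed {side}"
    return "approximately symmetric"

for data in ([1, 2, 3], [1, 1, 1, 10]):
    x = pskew(data)
    print(data, round(x, 4), describe_skew(x))
```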

skurt: excess kurtosis of the (sample) group

pkurt: excess kurtosis of the (population) group

jarque: p-value of the Jarque-Bera test for normality

dpo: p-value of the D'Agostino-Pearson Omnibus test for normality

For the 'jarque' and 'dpo' operations, the null hypothesis is normality: low p-values indicate non-normal data; high p-values indicate that the null hypothesis cannot be rejected.
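To show where such a p-value comes from, here is a minimal Python sketch of the textbook Jarque-Bera test. It assumes population (biased) skewness and excess kurtosis and the asymptotic chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x/2). The function name is hypothetical, and datamash's own computation may differ in its estimators.

```python
import math

def jarque_bera(values):
    """Return (statistic, p_value) for the textbook Jarque-Bera test."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3
    jb = n / 6 * (skew ** 2 + excess_kurt ** 2 / 4)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2)
    return jb, p_value

jb, p = jarque_bera([1, 2, 3, 4, 5])
print(jb, p)
```

A large p-value, as for this small symmetric sample, means normality cannot be rejected; it does not prove that the data are normal.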

EXAMPLES¶

Print the sum and the mean of values from column 1:

$ seq 10 | datamash sum 1 mean 1
55  5.5

Group input based on field 1, and sum values (per group) on field 2:

$ cat example.txt
A  10
A  5
B  9
B  11
$ datamash -g 1 sum 2 < example.txt
A  15
B  20

Unsorted input must be sorted (with '-s'):

$ cat example.txt
A  10
C  4
B  9
C  1
A  5
B  11
$ datamash -s -g1 sum 2 < example.txt
A  15
B  20
C  5

Which is equivalent to:

$ cat example.txt | sort -k1,1 | datamash -g 1 sum 2
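Why sorting matters can be modelled in Python with itertools.groupby, which, like the grouping described here, only merges adjacent items that share a key. This sketch assumes that datamash groups consecutive lines only (which is what the need for '-s' suggests); the function name is hypothetical.

```python
from itertools import groupby

def group_sum(rows, presort=False):
    """Sum the second field for each run of equal first-field keys."""
    if presort:
        # Stable sort on the key only, comparable to 'sort -k1,1'.
        rows = sorted(rows, key=lambda r: r[0])
    return [(key, sum(value for _, value in grp))
            for key, grp in groupby(rows, key=lambda r: r[0])]

rows = [("A", 10), ("C", 4), ("B", 9), ("C", 1), ("A", 5), ("B", 11)]
print(group_sum(rows))                # unsorted: every run becomes its own group
print(group_sum(rows, presort=True))  # sorted: one group per key
```

Without sorting, each key shows up once per run rather than once overall, so the per-group sums are split across several output lines.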

Use -H (--headers) if the input file has a header line:

# Given a file with student name, field, test score...
$ head -n5 scores_h.txt
Name           Major            Score
Shawn          Engineering      47
Caleb          Business         87
Christian      Business         88
Derek          Arts             60

# Calculate the mean and standard deviation for each major
$ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt
 (or use short form)
$ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt
GroupBy(Major)    mean(Score)   pstdev(Score)
Arts              68.9          10.1
Business          87.3           4.9
Engineering       66.5          19.1
Health-Medicine   90.6           8.8
Life-Sciences     55.3          19.7
Social-Sciences   60.2          16.6

Reverse field order in each line:

$ seq 6 | paste - - | datamash reverse
2    1
4    3
6    5

Transpose rows, columns:

$ seq 6 | paste - - | datamash transpose
1    3    5
2    4    6

ADDITIONAL INFORMATION¶

See the GNU Datamash website (http://www.gnu.org/software/datamash)

AUTHOR¶

Written by Assaf Gordon.

COPYRIGHT¶

This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

Source file:	datamash.1.en.gz (from datamash 1.0.6-2)
Source last updated:	2014-07-31T02:59:41Z
Converted to HTML:	2018-12-30T04:22:52Z
