mcxload(1) | USER COMMANDS | mcxload(1) |
NAME
mcxload - load matrices and tab files from label format
SYNOPSIS
mcxload -abc <fname> (label file) -o <fname> (output file)
    [-abc <fname> (label file)]
    [-123 <fname> (identifier file)]
    [-o <fname> (output file)]
    [--stream-mirror (symmetrify, same domain)]
    [--stream-split (assume different domains)]
    [-re <mode> (edge deduplication mode)]
    [-ri <mode> (image symmetrification mode)]
    [-sif <fname> (SIF label file)]
    [-etc <fname> ('etc' label file)]
    [-etc-ai <fname> (leaderless 'etc' label file)]
    [--expect-values (expect label:weight format)]
    [-235 <fname> (leader '235' label file)]
    [-235-ai <fname> (leaderless '235' label file)]
    [-packed <fname> (file/stream in binary format)]
    [-pack-cnum <num> (set column range)]
    [-pack-rnum <num> (set row range)]
    [-123-max <int> (set domain range)]
    [-123-maxc <int> (set column range)]
    [-123-maxr <int> (set row range)]
    [-write-tab <fname> (save domain tab)]
    [-write-tabc <fname> (save column tab)]
    [-write-tabr <fname> (save row tab)]
    [-strict-tab <fname> (tab universe)]
    [-strict-tabc <fname> (tabc universe)]
    [-strict-tabr <fname> (tabr universe)]
    [-restrict-tab <fname> (tab world)]
    [-restrict-tabc <fname> (tabc world)]
    [-restrict-tabr <fname> (tabr world)]
    [-extend-tab <fname> (tab launch)]
    [-extend-tabc <fname> (tabc launch)]
    [-extend-tabr <fname> (tabr launch)]
    [--stream-log (log transform stream values)]
    [--stream-neg-log (negative log transform stream values)]
    [--stream-neg-log10 (negative log-10 transform stream values)]
    [-stream-tf (transform stream values)]
    [-tf <tf-spec> (transform (not so) final matrix)]
    [--transpose (transpose)]
    [--write-binary (output binary format)]
    [--debug (debug)]
    [-h (print synopsis, exit)]
    [--apropos (print synopsis, exit)]
    [--version (print version, exit)]
GETTING STARTED
mcxload --stream-mirror -abc data1.txt -o data1.mci -write-tab data1.tab
mcxload --stream-mirror -etc data2.txt -o data2.mci -write-tab data2.tab
mcxload --stream-mirror -sif data3.txt -o data3.mci -write-tab data3.tab
When the output should be an undirected graph it is safest to always use the --stream-mirror option. Edges are stored bidirectionally as two arcs, and this option instructs mcxload to ensure that both arcs are present.
The above examples read three different input formats. In all formats, the basic unit of specification is an arc, given by a source node, a destination node, and optionally a weight. All formats are line based, with -abc specifying a single arc per line and -etc and -sif specifying multiple arcs that share a source node. For -abc the format is
<source-label> <destination-label> [<weight>]
The last field, specifying the arc weight, is optional. If not present the arc weight will be set to the default weight of 1.0. For -sif the format is
<source-label> <relation-type> <destination-label> <destination-label> ...
There can be an arbitrary number of destination labels. The relation type field in the second column is required but will be ignored. As an extension it is possible to specify weights, requiring the use of the --expect-values option. Weights are specified by tagging them onto the destination label separated by a colon:
<source-label> <relation-type> <destination-label>:<weight> <destination-label>:<weight> ...
Finally, the format for the -etc option is the same, except that the relation type column is dropped.
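The three line layouts above can be illustrated with a short Python sketch. It is not how mcxload itself is implemented: the function name parse_arcs and its parameters are hypothetical, and the sketch assumes that a destination without an explicit weight gets the default 1.0 in all three formats (the text above states this only for -abc).

```python
def parse_arcs(line, fmt, expect_values=False):
    """Return the (source, destination, weight) arcs described by one line.

    fmt is "abc", "sif" or "etc". Illustrative only; the names and the
    handling of missing weights are assumptions, not mcxload behaviour.
    """
    fields = line.split()
    if not fields:
        return []
    if fmt == "abc":
        # <source-label> <destination-label> [<weight>]
        if len(fields) < 2:
            raise ValueError("abc lines need a source and a destination label")
        weight = float(fields[2]) if len(fields) > 2 else 1.0
        return [(fields[0], fields[1], weight)]
    if fmt not in ("sif", "etc"):
        raise ValueError("unknown format: " + fmt)
    source = fields[0]
    # sif has a relation type in the second column, which is ignored;
    # etc has the same layout without that column.
    targets = fields[2:] if fmt == "sif" else fields[1:]
    arcs = []
    for target in targets:
        if expect_values and ":" in target:
            # <destination-label>:<weight>, split on the last colon
            label, _, value = target.rpartition(":")
            arcs.append((source, label, float(value)))
        else:
            # Assumption: no weight given means the default weight 1.0
            arcs.append((source, target, 1.0))
    return arcs
```

For instance, parse_arcs("a pp b c", "sif") yields two arcs from a, one to b and one to c, each with weight 1.0.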
DESCRIPTION
mcxload reads label input from a file. The format of the file should be line-based, each line containing two white-space separated strings (labels) and optionally a number separated from the second label by whitespace. In the absence of a value, mcxload will use the default value 1.0. If a tab is present on an input line, mcxload will assume that the tab character is the separator for that line. Lines for which the first non-whitespace character is an octothorpe ('#') are skipped.
mcxload will transform the labels into mcl numerical identifiers and the pairs of labels into graph edges or equivalently matrix entries. The weight of an edge is the value given for the corresponding pair of labels. mcxload constructs dictionaries (sometimes just one) that map labels onto mcl identifiers as it goes along. It can optionally write these to file. In MCL (family) parlance, such a dictionary written to file is called a tab file.
It is possible to specify numerical identifiers directly with the -123 option. In this case mcxload assumes a canonical domain (cf mcxio(5)) and will create the minimal canonical domain that supports the data. Also bear in mind the caveat further below.
It is possible to effectively predeclare labels and thus enforce an a-priori known mapping of labels onto numerical identifiers. Labels receive an identifier in the order in which they occur in the input. Predeclaring labels can be achieved by having them appear in the desired order and setting the edge weight to zero.
A major mcxload modality is whether the input refers to a single domain or to two separate domains. An example of the first is where labels are names of people and the value is the extent to which they like one another. This encodes a likability graph where all the nodes represent people. The reasonable thing to do in this case is to create a single dictionary with all names wherever they occur. All tab options (as opposed to tabc and tabr) pertain to this scenario and likewise for the options --graph and --stream-mirror.
An example of the second mode is where the first label is again the name of a person, the second label is the name of an animal species, and the value is the extent to which that person appreciates the species. In this case, the reasonable thing to do is to create two dictionaries, one for persons and one for species. All tabc and tabr options pertain to this scenario. The tabc options always refer to the first label and the tabr options always refer to the second label. The letters c and r refer to column and row respectively. The latter are the names of the matrix domains corresponding to the input domains. Refer to mcxio(5).
A further mcxload modality is whether it constructs dictionaries on the fly, or whether it proceeds from a tab file already available. By default mcxload will construct dictionaries on the fly. You need to save them with the appropriate -write option(s).
All the strict options read a tab file and require any labels in the -abc label input to be present in the corresponding tab file. mcxload will then fail in the face of absent labels. All the restrict options simply ignore labels that are not found in the corresponding tab file. The extend options extend the existing tab file with labels that are not found. It presumably only makes sense to do so if the corresponding -write options are used as well.
The input stream is deduplicated on a per-node neighbourhood basis using the -re option.
mcxload has a few options to transform or select based on the values in the input stream and the values in the constructed matrix. These are --stream-log, --stream-neg-log, --stream-neg-log10, -stream-tf and -tf. Refer to mcxio(5) for a description of the syntax accepted by the latter two options; it is a syntax accepted by a few more mcl siblings.
Finally it is possible to transpose the final result using the --transpose option. Keep in mind that mcxload does not accordingly change its idea of row and column domains. The final matrix can be symmetrified using the -ri option.
The -etc, -235 and -sif options assume a format where all entries for a given column (or equivalently all neighbours for a given node) are joined onto a single line. This can be useful e.g. to read in externally generated clusterings. The -etc and -sif options expect label input, whereas the -235 option expects numbers in the input that are mapped directly onto mcl numerical identifiers. The SIF format expected by -sif requires a relationship type in the second field on each line; this is ignored. As an extension to the SIF format, weights may optionally follow the labels, separated from them by a colon character.
CAVEAT
i Read the input stream, apply -stream-tf transformation specification, and optionally push reverse elements (--stream-mirror).
ii Deduplicate edges in the context of all edges/arcs originating from a given node according to the -re option.
iii Apply transpose symmetrification according to the -ri option, if used.
iv Apply -tf transformation specification.
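The four steps above can be sketched in Python to show the order in which values are touched. This is an illustration under stated assumptions, not mcxload's code: build_matrix and its parameters are hypothetical names, the -stream-tf and -tf specifications are modelled as plain functions on a single value, an absent transposed entry is counted as 0.0 during symmetrification, and self-loops get no special treatment.

```python
# Ways to combine two values, named after the -re modes.
# The -ri option is documented with max, add and mul only.
REDUCERS = {
    "max": max,
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "first": lambda a, b: a,
    "last": lambda a, b: b,
}


def build_matrix(arcs, re_mode, mirror=False, stream_tf=None,
                 ri_mode=None, tf=None):
    """Turn (source, destination, weight) arcs into a dict keyed by
    (source, destination), following steps i to iv in order."""
    # Step i: transform stream values and optionally push reverse arcs.
    stream = []
    for src, dst, weight in arcs:
        if stream_tf is not None:
            weight = stream_tf(weight)
        stream.append((src, dst, weight))
        if mirror:
            stream.append((dst, src, weight))

    # Step ii: deduplicate arcs that share source and destination.
    combine = REDUCERS[re_mode]
    matrix = {}
    for src, dst, weight in stream:
        key = (src, dst)
        matrix[key] = combine(matrix[key], weight) if key in matrix else weight

    # Step iii: combine each entry with its transposed counterpart.
    if ri_mode is not None:
        if ri_mode not in ("max", "add", "mul"):
            raise ValueError("ri_mode must be max, add or mul")
        combine = REDUCERS[ri_mode]
        keys = set(matrix) | {(dst, src) for (src, dst) in matrix}
        # Assumption: a missing transposed entry counts as 0.0.
        matrix = {
            (src, dst): combine(matrix.get((src, dst), 0.0),
                                matrix.get((dst, src), 0.0))
            for (src, dst) in keys
        }

    # Step iv: transform the values of the resulting matrix.
    if tf is not None:
        matrix = {key: tf(value) for key, value in matrix.items()}
    return matrix
```

With the arcs a-b (2.0), a-b (3.0) and b-c (1.0), calling build_matrix(arcs, "max", mirror=True) collapses the duplicate a-b arcs to weight 3.0 and stores every edge in both directions.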
OPTIONS
-abc <fname> (label file)
-123 <fname> (identifier file)
-o <fname> (output file)
--stream-mirror (symmetrify, same domain)
--stream-split (assume different domains)
-re <max|add|mul|first|last> (deduplication mode)
-write-tab <fname> (save domain tab)
-write-tabc <fname> (save column tab)
-write-tabr <fname> (save row tab)
-strict-tab <fname> (tab universe)
-strict-tabc <fname> (tabc universe)
-strict-tabr <fname> (tabr universe)
-restrict-tab <fname> (tab world)
-restrict-tabc <fname> (tabc world)
-restrict-tabr <fname> (tabr world)
-extend-tab <fname> (tab launch)
-extend-tabc <fname> (tabc launch)
-extend-tabr <fname> (tabr launch)
-123-max <int> (set domain range)
-123-maxc <int> (set column range)
-123-maxr <int> (set row range)
--stream-log (log transform stream values)
--stream-neg-log (negative log transform stream values)
--stream-neg-log10 (negative log-10 transform stream values)
-stream-tf (transform stream values)
-tf <tf-spec> (transform (not so) final matrix)
-ri <max|add|mul> (image symmetrification mode)
--transpose (transpose)
-etc <fname> ('etc' label file)
-etc-ai <fname> (leaderless 'etc' label file)
-235 <fname> ('235' label file)
-235-ai <fname> (leaderless '235' label file)
-sif <fname> (SIF label file)
--expect-values (expect label:weight format)
-packed <fname> (file/stream in binary format)
-pack-cnum <num> (set column range)
-pack-rnum <num> (set row range)
--write-binary (output binary format)
--debug (debug)
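The strict, restrict and extend tab options listed above differ only in how they treat a label that is missing from an existing tab file, as explained in DESCRIPTION. The Python sketch below illustrates that difference. The function resolve_label is hypothetical, the tab is modelled as a dict from label to identifier, and giving a new label the next unused identifier is an assumption of this sketch rather than documented mcxload behaviour.

```python
def resolve_label(label, tab, mode):
    """Look up label in tab (a dict mapping label to identifier).

    mode "strict":   an unknown label is an error.
    mode "restrict": an unknown label is ignored (None is returned).
    mode "extend":   an unknown label is added to the tab.
    """
    if label in tab:
        return tab[label]
    if mode == "strict":
        raise KeyError("label not found in tab: " + label)
    if mode == "restrict":
        return None
    if mode == "extend":
        # Assumption: the new label gets the next unused identifier.
        tab[label] = max(tab.values(), default=-1) + 1
        return tab[label]
    raise ValueError("unknown mode: " + mode)
```

Starting from an empty dict and using mode "extend" mimics building a dictionary on the fly, where labels receive identifiers in the order in which they occur in the input.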
AUTHOR
Stijn van Dongen.
SEE ALSO
mcxio(5), mcxdump(1), mcl(1), mclfaq(7), and mclfamily(7) for an overview of all the documentation and the utilities in the mcl family.
16 May 2014 | mcxload 14-137 |