mcxassemble(1) | USER COMMANDS | mcxassemble(1) |
NAME
mcxassemble - transform raw cooccurrence data to mcl matrix format.
SYNOPSIS
mcxassemble -b base (base name) [-o fname (write to file fname)] [--write-binary (write output in binary format)] [--map (apply base.map)] [-raw-tf <tf-spec> (apply transform spec to input)] [-rv MODE (repeated vectors)] [-re MODE (repeated entries)] [-ri MODE (adding mirror image)] [-r MODE (repeated entries/vectors/images)] [-prm-tf <tf-spec> (apply transform spec to primary matrix)] [-sym-tf <tf-spec> (apply transform spec to symmetrified matrix)] [-q (quiet mode)]
The options above embody the default setup when using mcxassemble. There are many more options, which mostly provide subtly different ways of doing input/output, set warning levels, or regulate how repeated entries and vectors should be treated. The full list of options is shown below. Read DESCRIPTION to learn about mcxassemble input/output and the functionality it provides.
NOTE
DESCRIPTION
mcxassemble enables easy matrix creation from an intermediate raw matrix format that can easily be constructed from a one-pass parse of cooccurrence data. The basic setup is as follows.
• Parse cooccurrence data from some external format.
• Transform cooccurrence data to raw mcl data as you parse.
• When done, write out required header and domain information to a separate file. The domain information can be built during the parsing stage.
• Use mcxassemble to construct a valid matrix from the raw data and the header information.
• Nodes can optionally be relabeled by writing a separate map file to be read by mcxassemble, which takes the form of a very thin matrix file.
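The first two steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name cooccurrences_to_raw and the (label, label, weight) input triples are hypothetical, and the sketch produces only raw vector lines of the form "column row:value ... $". The header and domain information would still have to be written separately; the returned label table is what a parser could build it from.

```python
from collections import defaultdict


def cooccurrences_to_raw(pairs):
    """Turn (label_a, label_b, weight) triples into raw vector lines.

    IDs are handed out on the fly, in order of first appearance.
    Returns (raw_lines, label_to_id). Hypothetical helper, not part of mcl.
    """
    label_to_id = {}
    columns = defaultdict(list)
    for a, b, weight in pairs:
        # setdefault assigns the next free ID only if the label is new
        col = label_to_id.setdefault(a, len(label_to_id))
        row = label_to_id.setdefault(b, len(label_to_id))
        columns[col].append((row, weight))
    raw_lines = [
        f"{col} " + " ".join(f"{row}:{w:g}" for row, w in entries) + " $"
        for col, entries in columns.items()
    ]
    return raw_lines, label_to_id


# Made-up cooccurrence data; the repeated (mno, ghi) pair is left in on purpose.
pairs = [
    ("mno", "ghi", 2.5),
    ("mno", "def", 1.0),
    ("ghi", "mno", 2.5),
    ("mno", "ghi", 0.5),
]
raw_lines, ids = cooccurrences_to_raw(pairs)
print("\n".join(raw_lines))
```

Repeated entries are deliberately not merged here, since deciding how to combine them is left to mcxassemble.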
The easiest thing to do is to group all input/output files under the same base name, say base. A standard way of proceeding, which will lead to a concise mcxassemble command line, is by creating the input files base.raw and base.hdr, and optionally the file base.map. The default behaviour of mcxassemble is then to create base.sym as the resulting matrix file, containing the symmetrized matrix constructed from the raw input.
Example
mcxdeblast --score=e --sort=a blastresult
mcxassemble -b blastresult -r max --map
mcxdeblast will generate the files blastresult.hdr, blastresult.raw, and blastresult.map. The --sort=a option will create a map file corresponding with alphabetic ordering. These files are processed by mcxassemble and it will generate the file blastresult.sym. The -r option tells mcxassemble that repeated entries should be maxed; each time the largest entry seen thus far will be taken.
Header file
2 4:0.34 1:2.8838 4:2.328 1:4.238 1:12 $
1 2:7.8 $
2 1:0.01 4:20.3 3:2 $
The listing of vectors need not be sorted, and neither does a vector itself need to be sorted. The mcl generic matrix format is actually no different in this respect. Furthermore, duplicate entries and duplicate vectors are allowed. This is in fact also allowed in the generic format, except that applications expecting generic format will issue warnings and disregard duplicate entries. mcxassemble allows customizable behaviour dictating how to merge repeated entries. Refer to the -re, -rv, -r options below.
The vectors read by mcxassemble do have to match the domains specified in the header file. The leading index that specifies the column index has to be present in the column domain; all subsequent indices that specify column entries have to be present in the row domain.
If one concatenates the contents of the header file and the data file, the result is almost but not quite a file containing a matrix in syntactically correct mcl generic matrix format. The parts missing are the (mclmatrix introduction token, followed by the begin token, and the closing ) token.
Map file
• The column domain and row domain are of the same cardinality.
• Each column has exactly one entry.
• Each row domain index occurs in exactly one column.
Such a matrix is used to relabel the nodes as found in the raw data. A situation that might occur when parsing some external format (and producing raw matrix format) is that IDs (indices) are handed out on the fly during the parse. Afterwards, one may want to relabel the IDs such that they correspond with an alphabetic listing of the quantity that is represented by the node domain, or with some other sort criterion. A map file is then typically generated by the parser, as that is the utility in charge of the IDs. A small example of a map file for a graph containing five nodes is the following:
(mclheader
mcltype matrix
dimensions 5x5
)
(mclmatrix
begin
0 4 $ # mno
1 2 $ # ghi
2 1 $ # def
3 3 $ # jkl
4 0 $ # abc
)
This corresponds to a relabeling such that the associated strings will be ordered alphabetically. Note that comments can be used to link string identifiers with indices. This map file says e.g. that the string identifier "mno" is represented by index 0 in the raw data, and by index 4 in the matrix output by mcxassemble.
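The relabeling can be illustrated with a small Python sketch. It is not how mcxassemble itself is implemented; parse_map and relabel are hypothetical helper names, and the parser assumes one column per line with "#" starting a comment, as in the example map file. It checks that each column has exactly one entry and that no row index is used twice.

```python
MAP_TEXT = """\
(mclheader
mcltype matrix
dimensions 5x5
)
(mclmatrix
begin
0 4 $ # mno
1 2 $ # ghi
2 1 $ # def
3 3 $ # jkl
4 0 $ # abc
)
"""


def parse_map(text):
    """Return {raw_index: new_index} read from a map matrix (hypothetical helper)."""
    mapping = {}
    in_body = False
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line == "begin":
            in_body = True
            continue
        if not in_body or not line or line == ")":
            continue
        tokens = line.split()
        # Each column must hold exactly one entry: "col row $"
        if len(tokens) != 3 or tokens[2] != "$":
            raise ValueError(f"expected exactly one entry per column: {line!r}")
        col = int(tokens[0])
        row = int(tokens[1].split(":")[0])  # tolerate an explicit ":value"
        if col in mapping:
            raise ValueError(f"column {col} listed twice")
        mapping[col] = row
    # Each row index may occur in exactly one column
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("a row index occurs in more than one column")
    return mapping


def relabel(col, entries, mapping):
    """Apply the map to one raw vector given as (col, [(row, value), ...])."""
    return mapping[col], [(mapping[row], val) for row, val in entries]


mapping = parse_map(MAP_TEXT)
print(mapping[0])  # raw index 0 ("mno") is relabeled to 4
print(relabel(0, [(1, 2.5), (4, 1.0)], mapping))
```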
OPTIONS
-b base (base name)
-hdr fname (read header file)
-raw fname (read raw file)
--map (apply base.map)
--cmap (apply base.cmap)
--rmap (apply base.rmap)
-map fname (apply fname)
-rmap fname (apply fname)
-cmap fname (apply fname)
-tag tag (apply base.tag)
-rtag tag (apply base.tag)
-ctag tag (apply base.tag)
-skw fname (write skew matrix)
-prm fname (write primary result matrix)
--prm (write base.prm)
--skw (write base.skw)
-n (do not write default symmetrized result)
-xo suf (write base.suf)
-o fname (write to file fname)
-i (read from single data file)
-digits int (digits width)
--write-binary (write output in binary format)
-s (check for symmetry)
-raw-tf <tf-spec> (apply transform spec to input)
-prm-tf <tf-spec> (apply transform spec to primary matrix)
-sym-tf <tf-spec> (apply transform spec to symmetrified matrix)
-rv add|max|min|mul|left|right (action for repeated vectors)
-re add|max|min|mul|left|right (action for repeated entries)
-ri add|max|min|mul (adding mirror image)
-r add|max|min|mul|left|right (same for entries and vectors)
Suppose the vector
0 1:30 1:50 2:60 4:70 3:20 1:40 2:40 $
is encountered in the input, listing entries for the vector labeled with index 0. If -re add or -r add is used, it will transform to the vector
0 1:120 2:100 3:20 4:70 $
If -re max or -r max is used instead, it will transform to the vector
0 1:50 2:60 3:20 4:70 $
Suppose add mode is used, and that later on another vector specification for the index 0 is found, leading to this transformed vector:
0 1:60 2:80 4:40 $
If -rv max was specified, this new vector is combined with the previous vector by taking the entry wise maximum:
0 1:120 2:100 3:20 4:70 $ # first (transformed) vector
0 1:60 2:80 4:40 $ # second vector
0 1:120 2:100 3:20 4:70 $ # entry wise maximum
Finally, suppose that somewhere one or more vector listings were specified for index 3, which eventually led to an entry 0:50. The final symmetrization step will take the [0,3] entry of weight 20 and combine it with the [3,0] entry of weight 50. The resulting matrix will then have the [0,3] and the [3,0] entry both equal to the sum, the maximum, the minimum, or the product of the two quantities 50 and 20, depending on the -ri mode.
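The three merge stages can be mimicked in a few lines of Python. This is a sketch under assumptions, not the mcxassemble implementation: the helper names are hypothetical, only the add, max, min and mul modes are covered, and an entry without a mirror image is assumed to be copied unchanged to both positions. The -re and -rv parts use a small made-up vector for index 7; the symmetrization part uses the weights 20 and 50 from the text.

```python
import operator

# Combining functions for the add|max|min|mul modes
MERGE = {"add": operator.add, "max": max, "min": min, "mul": operator.mul}


def parse_vector(line):
    """Parse one raw vector 'col idx:val ... $' into (col, [(idx, val), ...])."""
    tokens = line.split()
    if not tokens or tokens[-1] != "$":
        raise ValueError("a raw vector must end with '$'")
    entries = []
    for tok in tokens[1:-1]:
        idx, _, val = tok.partition(":")
        entries.append((int(idx), float(val) if val else 1.0))
    return int(tokens[0]), entries


def merge_entries(entries, mode):
    """Like -re: combine repeated entries within a single vector."""
    op = MERGE[mode]
    merged = {}
    for idx, val in entries:
        merged[idx] = op(merged[idx], val) if idx in merged else val
    return merged


def merge_vectors(first, second, mode):
    """Like -rv: combine two vectors that share the same leading index."""
    op = MERGE[mode]
    out = dict(first)
    for idx, val in second.items():
        out[idx] = op(out[idx], val) if idx in out else val
    return out


def symmetrize(matrix, mode):
    """Like -ri: combine each entry of {col: {row: val}} with its mirror image."""
    op = MERGE[mode]
    sym = {}
    for col, vec in matrix.items():
        for row, val in vec.items():
            mirror = matrix.get(row, {}).get(col)
            # Assumption: an entry without a mirror image is kept as is
            combined = val if mirror is None else op(val, mirror)
            sym.setdefault(col, {})[row] = combined
            sym.setdefault(row, {})[col] = combined
    return sym


_, entries = parse_vector("7 1:3 1:5 2:6 1:4 $")
added = merge_entries(entries, "add")    # index 1: 3 + 5 + 4
maxed = merge_entries(entries, "max")    # index 1: max(3, 5, 4)
_, later = parse_vector("7 1:2 2:9 $")
combined = merge_vectors(added, merge_entries(later, "add"), "max")

sym_max = symmetrize({0: {3: 20.0}, 3: {0: 50.0}}, "max")
sym_add = symmetrize({0: {3: 20.0}, 3: {0: 50.0}}, "add")
print(added, maxed, combined, sym_max, sym_add)
```

With max as the mirror mode both the [0,3] and [3,0] entries end up at 50; with add they end up at 70.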
--quiet-re (quiet for repeated entries)
--quiet-rv (quiet for repeated vectors)
-q (the two above combined)
AUTHOR
Stijn van Dongen.
SEE ALSO
mcxio(5), mcl(1), mcxload(1) and mclfamily(7) for an overview of all the documentation and the utilities in the mcl family.
16 May 2014 | mcxassemble 14-137 |