.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "MAP2SLIM 1p"
.TH MAP2SLIM 1p "2023-12-18" "perl v5.36.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.
.\" Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
map2slim \- maps gene associations to a 'slim' ontology
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 2
\&  cd go
\&  map2slim GO_slims/goslim_generic.obo ontology/gene_ontology.obo gene\-associations/gene_association.fb
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Given a \s-1GO\s0 slim file, and a current ontology (in one or more
files), this script will map a gene association file (containing
annotations to the full \s-1GO\s0) to the terms in the \s-1GO\s0 slim.
.PP
The script can be used in two ways: to create a new gene association
file containing the most pertinent \s-1GO\s0 slim accessions, or in count
mode, in which case it gives distinct gene product counts for each slim
term.
.PP
The association file format is described here:
.PP
.SH "ARGUMENTS"
.IX Header "ARGUMENTS"
.IP "\-b \fBbucket slim file\fR" 4
.IX Item "-b bucket slim file"
This argument adds \fBbucket terms\fR to the slim ontology; see the
documentation below for an explanation.
The new slim ontology file, including bucket terms, will be written to
\&\fBbucket slim file\fR.
.IP "\-outmap \fBslim mapping file\fR" 4
.IX Item "-outmap slim mapping file"
This will generate a mapping file for every term in the full ontology,
showing both the most pertinent slim term and all slim terms that are
ancestors.
If you use this option, do \s-1NOT\s0 supply a gene-associations file.
.IP "shownames" 4
.IX Item "shownames"
(Only works with \-outmap)
.Sp
Show the names of the terms in the slim mapping file.
.IP "\-c" 4
.IX Item "-c"
This will force map2slim to give counts of the assoc file, rather than
map it.
.IP "\-t" 4
.IX Item "-t"
When used in conjunction with \fB\-c\fR, this will tab the output so that
the indentation reflects the tree hierarchy in the slim file.
.IP "\-o \fBout file\fR" 4
.IX Item "-o out file"
This will write the mapped assocs (or counts) to the specified file,
rather than to the screen.
.SH "DOWNLOAD"
.IX Header "DOWNLOAD"
This script is part of the \fBgo-perl\fR package, available from \s-1CPAN\s0
.PP
.PP
This script will not work without installing go-perl.
.SS "\s-1MAPPING ALGORITHM\s0"
.IX Subsection "MAPPING ALGORITHM"
\&\s-1GO\s0 is a \s-1DAG,\s0 not a tree.
This means that there is often more than one path from a \s-1GO\s0 term up
to the root Gene_Ontology node; these paths may intersect multiple terms
in the slim ontology \- which means that one annotation can map to
multiple slim terms!
.PP
(\fBnote\fR you need to view this online to see the image below \- if you
are not viewing this on the http://www.geneontology.org site, you can
look at the following \s-1URL:\s0 )
.PP
A hypothetical example: blue circles show terms in the \s-1GO\s0 slim, and
yellow circles show terms in the full ontology.
The full ontology subsumes the slim, so the blue terms are also in the
ontology.
.PP
.Vb 8
\&  GO ID   MAPS TO SLIM ID   ALL SLIM ANCESTORS
\&  =====   ===============   ==================
\&  5       2+3               2,3,1
\&  6       3 only            3,1
\&  7       4 only            4,3,1
\&  8       3 only            3,1
\&  9       4 only            4,3,1
\&  10      2+3               2,3,1
.Ve
.PP
The 2nd column shows the most pertinent \s-1ID\s0(s) in the slim, that is,
the direct mapping.
The 3rd column shows all ancestors in the slim.
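.PP
The two columns in the table can be derived mechanically from the graph.
The Python sketch below is illustrative only and is not part of map2slim.
The figure is not reproduced in this page, so the edge list PARENTS is an
assumption chosen to agree with the table, and the function names are
hypothetical.
It collects the first slim term on every path to the root, then drops any
collected term that is an ancestor of another collected term.
.Vb 6
```python
# Hypothetical child -> parents edges. The original figure is not shown,
# so this graph is an assumption that merely agrees with the table above.
PARENTS = {
    1: [], 2: [1], 3: [1], 4: [3],
    5: [2, 3], 6: [3], 7: [4], 8: [3], 9: [7, 8], 10: [5],
}
SLIM = {1, 2, 3, 4}  # slim terms (the blue circles in the description)


def ancestors(term):
    """Return every proper ancestor of term in the full ontology."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def first_slim_terms(term):
    """Follow every path towards the root; keep the first slim term on each."""
    if term in SLIM:
        return {term}
    found = set()
    for parent in PARENTS[term]:
        found |= first_slim_terms(parent)
    return found


def map_to_slim(term):
    """Return (most pertinent slim terms, all slim ancestors) for one term."""
    candidates = first_slim_terms(term)
    # A candidate is redundant if it is an ancestor of another candidate.
    redundant = set()
    for candidate in candidates:
        redundant |= ancestors(candidate) & candidates
    direct = candidates - redundant
    all_slim = (ancestors(term) | {term}) & SLIM
    return direct, all_slim


for go_id in range(5, 11):
    direct, all_slim = map_to_slim(go_id)
    print(go_id, sorted(direct), sorted(all_slim))
```
.Ve
.PP
With these assumed edges, term 9 reaches slim terms 3 and 4 by different
paths and keeps only 4, while term 10 keeps both 2 and 3.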
.PP
Note in particular the mapping of \s-1ID 9\s0: although this has two paths
to the root through the slim, via 3 and 4, 3 is discarded because it is
subsumed by 4.
.PP
On the other hand, 10 maps to both 2 and 3 because these are both the
first slim \s-1ID\s0 in the two valid paths to the root, and neither
subsumes the other.
.PP
The algorithm used is:
.PP
to map any one term in the full ontology: find all valid paths through
to the root node in the full ontology
.PP
for each path, take the first slim term encountered in the path
.PP
discard any redundant slim terms in this set, i.e. slim terms subsumed
by other slim terms in the set
.SS "\s-1BUCKET TERMS\s0"
.IX Subsection "BUCKET TERMS"
If you run the script with the \-b option, bucket terms will be added.
For any term P in the slim, if P has at least one child C, a bucket term
P' will be created under P.
This is a catch-all term for mapping any term in the full ontology that
is a descendant of P, but \s-1NOT\s0 a descendant of any child of P in the
slim ontology.
.PP
For example, the slim generic.0208 has the following terms and
structure:
.PP
.Vb 3
\&  %DNA binding ; GO:0003677
\&   %chromatin binding ; GO:0003682
\&   %transcription factor activity ; GO:0003700, GO:0000130
.Ve
.PP
After adding bucket terms, it will look like this:
.PP
.Vb 4
\&  %DNA binding ; GO:0003677
\&   %chromatin binding ; GO:0003682
\&   %transcription factor activity ; GO:0003700 ; synonym:GO:0000130
\&   @bucket:Z\-OTHER\-DNA binding ; slim_temp_id:12
.Ve
.PP
Terms from the full ontology that are other children of \s-1DNA\s0 binding,
such as single-stranded \s-1DNA\s0 binding and its descendants, will map to
the bucket term.
.PP
The bucket term has a slim \s-1ID\s0 which is transient and is there only
to facilitate the mapping.
It should not be used externally.
.PP
The bucket term has the prefix Z\-OTHER; the Z is a hack to make sure
that the term is always listed last in the alphabetic ordering.
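.PP
The bucket rule can be sketched in a few lines of Python.
This is an illustration only, not the go-perl implementation: the
dictionary layout and the function names add_bucket_terms and slim_target
are hypothetical, and the mapping helper assumes a single path from the
term to the root, which sidesteps the multi-path handling described under
\&\s-1MAPPING ALGORITHM\s0.
.Vb 6
```python
# Hypothetical slim hierarchy (parent -> children) modelled on the
# DNA binding example above; the dictionary layout is an assumption.
SLIM_CHILDREN = {
    "DNA binding": ["chromatin binding", "transcription factor activity"],
    "chromatin binding": [],
    "transcription factor activity": [],
}


def add_bucket_terms(slim_children):
    """Return a copy of the slim with a bucket under each term with children."""
    result = {term: list(kids) for term, kids in slim_children.items()}
    for term, kids in slim_children.items():
        if kids:  # only a term with at least one child gets a bucket
            bucket = "Z-OTHER-" + term
            result[term].append(bucket)
            result[bucket] = []
    return result


def slim_target(lineage, slim):
    """Choose a slim target for a term that has a single path to the root.

    lineage lists the term first, then its ancestors, nearest first.
    """
    for depth, name in enumerate(lineage):
        if name in slim:
            bucket = "Z-OTHER-" + name
            # A proper descendant that reached this slim term without
            # passing through a slim child falls into the bucket.
            if depth > 0 and bucket in slim:
                return bucket
            return name
    return None


slim = add_bucket_terms(SLIM_CHILDREN)
print(slim_target(["single-stranded DNA binding", "DNA binding"], slim))
print(slim_target(["chromatin binding", "DNA binding"], slim))
```
.Ve
.PP
In this sketch a term annotated directly to \s-1DNA\s0 binding stays on
\&\s-1DNA\s0 binding; only descendants that bypass every slim child land in
the bucket.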
.PP
The algorithm is slightly modified if bucket terms are used.
The bucket term has an implicit relationship to all \s-1OTHER\s0 siblings
not in the slim.
.PP
\&\fIDo I need bucket terms?\fR
.IX Subsection "Do I need bucket terms?"
.PP
Nowadays most slim files are entirely or nearly 'complete', that is,
there are no gaps.
This means that the \-b option will not produce noticeably different
results.
For example, you may see a bucket term OTHER-binding created with
nothing annotated to it, because all the children of binding in the
\&\s-1GO\s0 are represented in the slim file.
.PP
The bucket option is really only necessary for some of the older
archived slim files, which are static and were generated in a fairly
ad-hoc way; they tend to accumulate 'gaps' over time (e.g. \s-1GO\s0 will
add a new child of binding, but the static slim file won't be up to
date, so any gene products annotated to this new term will map to
OTHER-binding in the slim).
.SS "\s-1GRAPH MISMATCHES\s0"
.IX Subsection "GRAPH MISMATCHES"
Note that the slim ontology file(s) may be out of date with respect to
the current ontology.
.PP
Currently map2slim does not flag graph mismatches between the slim graph
and the graph in the full ontology file; it takes the full ontology as
being the real graph.
However, the slim ontology will be used to format the results if you
select \fB\-t \-c\fR as options.
.SS "\s-1OUTPUT\s0"
.IX Subsection "OUTPUT"
In normal mode, a standard format gene-association file will be written.
The \s-1GO ID\s0 column (5) will contain \s-1GO\s0 slim IDs.
The mapping corresponds to the 2nd column in the table above.
Note that the output file may contain more lines than the input file.
This is because some full \s-1GO\s0 IDs have more than one pertinent slim
\&\s-1ID.\s0
.PP
\&\fI\s-1COUNT MODE\s0\fR
.IX Subsection "COUNT MODE"
.PP
map2slim can be run with the \-c option, which gives the counts of
distinct gene products mapped to each slim term.
The columns are as follows:
.IP "\s-1GO\s0 Term" 4
.IX Item "GO Term"
The first column is the \s-1GO ID\s0 followed by the term name (the term
name is provided as it is found in both the full \s-1GO\s0 and slim
ontologies \- these will usually be the same but occasionally the slim
file will lag behind changes in the \s-1GO\s0 file).
.IP "Count of gene products for which this is the most relevant slim term" 4
.IX Item "Count of gene products for which this is the most relevant slim term"
The number of distinct gene products for which this is the most
pertinent/direct slim \s-1ID.\s0
By most direct we mean that either the association is made directly to
this term, \s-1OR\s0 the association is made to a child of this slim term
\&\s-1AND\s0 there is no child slim term which the association maps to.
.Sp
For most slims, this count will be equivalent to the number of
associations directly mapped to this slim term.
However, some older slim files are \*(L"spotty\*(R" in that they admit
\&\*(L"gaps\*(R".
For example, if the slim has all children of \*(L"biological process\*(R"
with the exception of \*(L"behavior\*(R", then all annotations to
\&\*(L"behavior\*(R" or its children will be counted here.
.Sp
see example below
.IP "Count of gene products inferred to be associated with slim term" 4
.IX Item "Count of gene products inferred to be associated with slim term"
The number of distinct gene products which are annotated to any
descendant of this slim \s-1ID\s0 (or annotated directly to the slim
\&\s-1ID\s0).
.IP "obsoletion flag" 4
.IX Item "obsoletion flag"
.PD 0
.IP "\s-1GO\s0 ontology" 4
.IX Item "GO ontology"
.PD
.PP
To take an example, if we use \-t and \-c like this:
.PP
.Vb 1
\&  map2slim \-t \-c GO_slims/goslim_generic.obo ontology/gene_ontology.obo gene\-associations/gene_association.fb
.Ve
.PP
Then part of the results may look like this:
.PP
.Vb 5
\&  GO:0008150 biological_process (biological_process)   34   10025   biological_process
\&   GO:0007610 behavior (behavior)   632   632   biological_process
\&   GO:0000004 biological process unknown (biological process unknown)   832   832   biological_process
\&   GO:0007154 cell communication (cell communication)   333   1701   biological_process
\&    GO:0008037 cell recognition (cell recognition)   19   19   biological_process
.Ve
.PP
19 products were mapped to \s-1GO:0008037\s0 or one of its children.
(\s-1GO:0008037\s0 is a leaf node in the slim, so the two counts are
identical.)
.PP
On the other hand, \s-1GO:0008150\s0 only gets 34 products for which this
is the most relevant term.
This is because most annotations would map to some child of
\&\s-1GO:0008150\s0 in the slim, such as \s-1GO:0007610\s0 (behavior).
These 34 gene products are either annotated directly to
\&\s-1GO:0008150,\s0 or to some child of this term which is not in the slim.
This can point to 'gaps' in the slim.
Note that running map2slim with the \-b option will 'plug' these gaps
with artificial filler terms.
.SH "AUTHOR"
.IX Header "AUTHOR"
Chris Mungall \s-1BDGP\s0
.SH "SEE ALSO"
.IX Header "SEE ALSO"
http://www.godatabase.org/dev
.PP
GO::Parser
.PP
GO::Model::Graph