.TH XTRACT 1 2019-02-26 NCBI "NCBI Entrez Direct User's Manual"
.SH NAME
xtract \- convert XML into a table of data values
.SH SYNOPSIS
\fBxtract\fP
[\|\fB\-help\fP\|]
[\|\fB\-strict\fP\|]
[\|\fB\-mixed\fP\|]
[\|\fB\-accent\fP\|]
[\|\fB\-ascii\fP\|]
[\|\fB\-compress\fP\|]
[\|\fB\-stops\fP\|]
[\|\fB\-input\fP\ \fIfilename\fP\|]
[\|\fB\-transform\fP\ \fIfilename\fP\|]
[\|\fB\-pattern\fP\ \fIexpr\fP\|]
[\|\fB\-group\fP\ \fIexpr\fP\|]
[\|\fB\-block\fP\ \fIexpr\fP\|]
[\|\fB\-subset\fP\ \fIexpr\fP\|]
[\|\fB\-if\fP\ \fIexpr\fP\ [\|\fIconstraint\fP\|]\|]
[\|\fB\-unless\fP\ \fIexpr\fP\ [\|\fIconstraint\fP\|]\|]
[\|\fB\-and\fP\ \fIcondition\fP\|]
[\|\fB\-or\fP\ \fIcondition\fP\|]
[\|\fB\-else\fP\|]
[\|\fB\-position\fP\ \fIpos\fP\|]
[\|\fB\-select\fP\ \fIcondition\fP\|]
[\|\fB\-equals\fP\ \fIstr\fP\|]
[\|\fB\-contains\fP\ \fIstr\fP\|]
[\|\fB\-is\-within\fP\ \fIstr\fP\|]
[\|\fB\-starts\-with\fP\ \fIstr\fP\|]
[\|\fB\-ends\-with\fP\ \fIstr\fP\|]
[\|\fB\-is\-not\fP\ \fIstr\fP\|]
[\|\fB\-gt\fP\ \fIN\fP\|]
[\|\fB\-ge\fP\ \fIN\fP\|]
[\|\fB\-lt\fP\ \fIN\fP\|]
[\|\fB\-le\fP\ \fIN\fP\|]
[\|\fB\-eq\fP\ \fIN\fP\|]
[\|\fB\-ne\fP\ \fIN\fP\|]
[\|\fB\-ret\fP\ \fIstr\fP\|]
[\|\fB\-tab\fP\ \fIstr\fP\|]
[\|\fB\-sep\fP\ \fIstr\fP\|]
[\|\fB\-pfx\fP\ \fIstr\fP\|]
[\|\fB\-sfx\fP\ \fIstr\fP\|]
[\|\fB\-plg\fP\ \fIstr\fP\|]
[\|\fB\-elg\fP\ \fIstr\fP\|]
[\|\fB\-rst\fP\|]
[\|\fB\-clr\fP\|]
[\|\fB\-pfc\fP\ \fIstr\fP\|]
[\|\fB\-deq\fP\ \fIstr\fP\|]
[\|\fB\-wrp\fP\ \fItag\fP\|]
[\|\fB\-def\fP\ \fIstr\fP\|]
[\|\fB\-lbl\fP\ \fIstr\fP\|]
[\|\fB\-element\fP\ \fIelement\fP\|]
[\|\fB\-first\fP\ \fIelement\fP\|]
[\|\fB\-last\fP\ \fIelement\fP\|]
[\|\fB\-\fP\fINAME\fP\|]
[\|\fB\-num\fP\ \fIelement\fP\|]
[\|\fB\-len\fP\ \fIelement\fP\|]
[\|\fB\-sum\fP\ \fIelement\fP\|]
[\|\fB\-min\fP\ \fIelement\fP\|]
[\|\fB\-max\fP\ \fIelement\fP\|]
[\|\fB\-inc\fP\ \fIelement\fP\|]
[\|\fB\-dec\fP\ \fIelement\fP\|]
[\|\fB\-sub\fP\ \fIelement\fP\|]
[\|\fB\-avg\fP\ \fIelement\fP\|]
[\|\fB\-dev\fP\ \fIelement\fP\|]
[\|\fB\-med\fP\ \fIelement\fP\|]
[\|\fB\-bin\fP\ \fIelement\fP\|]
[\|\fB\-bit\fP\ \fIelement\fP\|]
[\|\fB\-encode\fP\ \fIelement\fP\|]
[\|\fB\-upper\fP\ \fIelement\fP\|]
[\|\fB\-lower\fP\ \fIelement\fP\|]
[\|\fB\-title\fP\ \fIelement\fP\|]
[\|\fB\-year\fP\ \fIelement\fP\|]
[\|\fB\-translate\fP\ \fIelement\fP\|]
[\|\fB\-terms\fP\ \fIelement\fP\|]
[\|\fB\-words\fP\ \fIelement\fP\|]
[\|\fB\-pairs\fP\ \fIelement\fP\|]
[\|\fB\-reverse\fP\ \fIelement\fP\|]
[\|\fB\-letters\fP\ \fIelement\fP\|]
[\|\fB\-clauses\fP\ \fIelement\fP\|]
[\|\fB\-indices\fP\ \fIelement\fP\|]
[\|\fB\-e2index\fP\|]
[\|\fB\-revcomp\fP\|]
[\|\fB\-nucleic\fP\|]
[\|\fB\-0\-based\fP\ \fIelement\fP\|]
[\|\fB\-1\-based\fP\ \fIelement\fP\|]
[\|\fB\-ucsc\-based\fP\ \fIelement\fP\|]
[\|\fB\-insd\fP\ \fIarg\fP\ ...\|]
[\|\fB\-head\fP\ \fIstr\fP\|]
[\|\fB\-tail\fP\ \fIstr\fP\|]
[\|\fB\-hd\fP\ \fIstr\fP\|]
[\|\fB\-tl\fP\ \fIstr\fP\|]
[\|\fB\-format\fP\ \fIfmt\fP\|]
[\|\fB\-unicode\fP\ \fIstyle\fP\|]
[\|\fB\-script\fP\ \fIstyle\fP\|]
[\|\fB\-mathml\ terse\fP\|]
[\|\fB\-filter\fP\ \fIelement\fP\ \fIaction\fP\ \fItarget\fP\|]
[\|\fB\-verify\fP\|]
[\|\fB\-outline\fP\|]
[\|\fB\-synopsis\fP\|]
[\|\fB\-skip\fP\ \fIfilename\fP\|]
[\|\fB\-examples\fP\|]
[\|\fB\-version\fP\|]
.SH DESCRIPTION
\fBxtract\fP converts an XML document into a table of data values
according to user\-specified rules.
.SH OPTIONS
.SS Processing Flags
.TP
\fB\-strict\fP
Remove HTML and MathML tags.
.TP
\fB\-mixed\fP
Allow mixed content XML.
.TP
\fB\-accent\fP
Delete Unicode accents and diacritical marks.
.TP
\fB\-ascii\fP
Convert Unicode to numeric HTML character entities.
.TP
\fB\-compress\fP
Compress runs of spaces.
.TP
\fB\-stops\fP
Retain stop words in selected phrases.
.SS Data Source
.TP
\fB\-input\fP\ \fIfilename\fP
Read XML from file instead of standard input.
.TP
\fB\-transform\fP\ \fIfilename\fP
File of substitutions for \fB\-translate\fP.
.SS Exploration Argument Hierarchy
.PD 0
.TP
\fB\-pattern\fP\ \fIexpr\fP
.TP
\fB\-group\fP\ \fIexpr\fP
.TP
\fB\-block\fP\ \fIexpr\fP
.TP
\fB\-subset\fP\ \fIexpr\fP
Name of record within set.
Use of different argument names allows command-line control of nested looping.
.PD
.SS Exploration Constructs
.PD 0
.IP Object 15
\fBDateRevised\fP
.IP Parent/Child 15
\fBBook/AuthorList\fP
.IP Heterogeneous 15
\fB"PubmedArticleSet/*"\fP
.IP Exhaustive 15
\fB"History/**"\fP
.IP Nested 15
\fB"*/Taxon"\fP
.IP Recursive 15
\fB"**/Gene-commentary"\fP
.PD
.SS Conditional Execution
.TP
\fB\-if\fP\ \fIexpr\fP\ [\|\fIconstraint\fP\|]
Element (or \fB@\fP\fIattribute\fP) must exist
and satisfy any specified constraint.
.TP
\fB\-unless\fP\ \fIexpr\fP\ [\|\fIconstraint\fP\|]
Skip if element matches.
.TP
\fB\-and\fP\ \fIcondition\fP
Preceding and following tests must both pass.
.TP
\fB\-or\fP\ \fIcondition\fP
Any passing test suffices.
.TP
\fB\-else\fP
Execute if conditional test failed.
.TP
\fB\-position\fP\ \fIpos\fP
.BR first / last / outer / inner / even / odd / all .
.TP
\fB\-select\fP\ \fIcondition\fP
Select record subset by conditions.
.SS String Constraints
.TP
\fB\-equals\fP\ \fIstr\fP
String must match exactly.
.TP
\fB\-contains\fP\ \fIstr\fP
Substring must be present.
.TP
\fB\-is\-within\fP\ \fIstr\fP
String must be present.
.TP
\fB\-starts\-with\fP\ \fIstr\fP
Substring must be at beginning.
.TP
\fB\-ends\-with\fP\ \fIstr\fP
Substring must be at end.
.TP
\fB\-is\-not\fP\ \fIstr\fP
String must not match.
.SS Numeric Constraints
.TP
\fB\-gt\fP\ \fIN\fP
Greater than.
.TP
\fB\-ge\fP\ \fIN\fP
Greater than or equal to.
.TP
\fB\-lt\fP\ \fIN\fP
Less than.
.TP
\fB\-le\fP\ \fIN\fP
Less than or equal to.
.TP
\fB\-eq\fP\ \fIN\fP
Equal to.
.TP
\fB\-ne\fP\ \fIN\fP
Not equal to.
.SS Format Customization
.TP
\fB\-ret\fP\ \fIstr\fP
Override line break between patterns.
.TP
\fB\-tab\fP\ \fIstr\fP
Replace tab character between fields.
.TP
\fB\-sep\fP\ \fIstr\fP
Separator between group members.
.TP
\fB\-pfx\fP\ \fIstr\fP
Prefix to print before group.
.TP
\fB\-sfx\fP\ \fIstr\fP
Suffix to print after group.
.TP
\fB\-plg\fP\ \fIstr\fP
Prologue to print once before elements.
.TP
\fB\-elg\fP\ \fIstr\fP
Epilogue to print once after elements.
.TP
\fB\-rst\fP
Reset \fB\-sep\fP through \fB\-elg\fP.
.TP
\fB\-clr\fP
Clear queued tab separator.
.TP
\fB\-pfc\fP\ \fIstr\fP
Preface combines \fB\-clr\fP and \fB\-pfx\fP.
.TP
\fB\-deq\fP\ \fIstr\fP
Delete and replace queued tab separator.
.TP
\fB\-wrp\fP\ \fItag\fP
Wrap elements in XML object.
.TP
\fB\-def\fP\ \fIstr\fP
Default placeholder for missing fields.
.TP
\fB\-lbl\fP\ \fIstr\fP
Insert arbitrary text.
.SS Element Selection
.TP
\fB\-element\fP\ \fIelement\fP
Print all items that match tag name.
.TP
\fB\-first\fP\ \fIelement\fP
Only print value of first item.
.TP
\fB\-last\fP\ \fIelement\fP
Only print value of last item.
.TP
\fB\-\fP\fINAME\fP
Record value in named variable.
.SS \-element Constructs
.PD 0
.IP Tag 15
\fBCaption\fP
.IP Group 15
\fBInitials,LastName\fP
.IP Parent/Child 15
\fBMedlineCitation/PMID\fP
.IP Recursive 15
\fB"**/Gene-commentary_accession"\fP
.IP Unrestricted 15
\fBPubDate/*\fP
.IP Attribute 15
\fBDescriptorName@MajorTopicYN\fP
.IP Range 15
\fBMedlineDate[1:4]\fP
.IP Substring 15
\fB"Title[phospholipase | rattlesnake]"\fP
.IP "Object Count" 15
\fB"#Author"\fP
.IP "Item Length" 15
\fB"%Title"\fP
.IP "Element Depth" 15
\fB"^PMID"\fP
.IP Variable 15
\fB"&NAME"\fP
.PD
.SS Special \-element Operations
.PD 0
.IP "Parent Index" 15
\fB"+"\fP
.IP "Object Name" 15
\fB"?"\fP
.IP "XML Subtree" 15
\fB"*"\fP
.IP Children 15
\fB"$"\fP
.IP Attributes 15
\fB"@"\fP
.PD
.SS Numeric Processing
.TP
\fB\-num\fP\ \fIelement\fP
Count.
.TP
\fB\-len\fP\ \fIelement\fP
Length.
.TP
\fB\-sum\fP\ \fIelement\fP
Sum.
.TP
\fB\-min\fP\ \fIelement\fP
Minimum.
.TP
\fB\-max\fP\ \fIelement\fP
Maximum.
.TP
\fB\-inc\fP\ \fIelement\fP
Increment.
.TP
\fB\-dec\fP\ \fIelement\fP
Decrement.
.TP
\fB\-sub\fP\ \fIelement\fP
Difference.
.TP
\fB\-avg\fP\ \fIelement\fP
Average.
.TP
\fB\-dev\fP\ \fIelement\fP
Deviation.
.TP
\fB\-med\fP\ \fIelement\fP
Median.
.TP
\fB\-bin\fP\ \fIelement\fP
Binary.
.TP
\fB\-bit\fP\ \fIelement\fP
Bit count.
.SS String Processing
.TP
\fB\-encode\fP\ \fIelement\fP
URL\-encode \fB<\fP, \fB>\fP, \fB&\fP, \fB\(dq\fP, and \fB\[aq]\fP characters.
.TP
\fB\-upper\fP\ \fIelement\fP
Convert text to uppercase.
.TP
\fB\-lower\fP\ \fIelement\fP
Convert text to lowercase.
.TP
\fB\-title\fP\ \fIelement\fP
Capitalize initial letters of words.
.TP
\fB\-year\fP\ \fIelement\fP
Extract first 4-digit year from string.
.TP
\fB\-translate\fP\ \fIelement\fP
Substitute values with \fB\-transform\fP table.
.SS Text Processing
.TP
\fB\-terms\fP\ \fIelement\fP
Partition text at spaces.
.TP
\fB\-words\fP\ \fIelement\fP
Split at punctuation marks.
.TP
\fB\-pairs\fP\ \fIelement\fP
Adjacent informative words.
.TP
\fB\-reverse\fP\ \fIelement\fP
Reverse words in string.
.TP
\fB\-letters\fP\ \fIelement\fP
Separate individual letters.
.TP
\fB\-clauses\fP\ \fIelement\fP
Break at phrase separators.
.TP
\fB\-indices\fP\ \fIelement\fP
Word pair index generation.
.TP
\fB\-e2index\fP
Create Entrez index XML.
.SS Sequence Processing
.TP
\fB\-revcomp\fP
Reverse\-complement nucleotide sequence.
.TP
\fB\-nucleic\fP
Subrange determines forward or revcomp.
.SS Sequence Coordinates
.TP
\fB\-0\-based\fP\ \fIelement\fP
Zero\-based.
.TP
\fB\-1\-based\fP\ \fIelement\fP
One\-based.
.TP
\fB\-ucsc\-based\fP\ \fIelement\fP
Half\-open.
.SS Command Generator
.TP
\fB\-insd\fP\ \fIarg\fP\ ...
Generate INSDSeq extraction commands.
Print them if invoked standalone;
run them if invoked as part of a pipeline.
Requires one or more arguments,
which may appear in the following order:
.RS
.\".PD 0
.IP Descriptor(s) 15
.BR INSDSeq_sequence / INSDSeq_definition / INSDSeq_division "/... [\|...\|]"
.IP Completeness 15
.BR complete / partial
.IP Feature(s) 15
.BR CDS / mRNA /...[\| , ...\|]
.IP Qualifier(s) 15
.BR INSDFeature_key / \(dq#INSDInterval\(dq / gene / product "/... [\|...\|]"
.\".PD
.RE
.SS Miscellaneous
.TP
\fB\-head\fP\ \fIstr\fP
Print before everything else.
.TP
\fB\-tail\fP\ \fIstr\fP
Print after everything else.
.TP
\fB\-hd\fP\ \fIstr\fP
Print before each record.
.TP
\fB\-tl\fP\ \fIstr\fP
Print after each record.
.PD
.SS Phrase Filtering
.TP
\fB\-require\fP\ \fIstr\fP
Keep records that contain a given phrase.
.TP
\fB\-exclude\fP\ \fIstr\fP
Keep records that do not contain a given phrase.
.SS Reformatting
.TP
\fB\-format\fP\ \fIfmt\fP
.PD 0
.RS
.IP \fBclean\fP 9
.IP \fBcopy\fP 9
Fast block copy (still applies processing flags).
.IP \fBcompact\fP 9
Compress runs of spaces.
.IP \fBflush\fP 9
Suppress line indentation.
.IP \fBindent\fP 9
Indent according to nesting depth.
.IP \fBexpand\fP 9
Place each attribute on a separate line.
.RE
.PD
.TP
\fB\-unicode\fP\ \fIstyle\fP
How to handle Unicode superscript and subscript digits
(first converted to ASCII form in all cases).
.PD 0
.RS
.IP \fBfuse\fP 9
Run them all together, with no additional markup.
.IP \fBspace\fP 9
Add spaces between digits in different positions.
.IP \fBperiod\fP 9
Add periods between digits in different positions.
.IP \fBbrackets\fP 9
Surround superscripts by square brackets
and subscripts by parentheses.
.IP \fBmarkdown\fP 9
Surround superscripts with carets
and subscripts with tildes.
.IP \fBslash\fP 9
Add backslashes when going up in height
and forward slashes when going down.
.IP \fBtag\fP 9
Put superscripts in XML \fBsup\fP elements
and subscripts in \fBsub\fP elements.
.RE
.PD
.TP
\fB\-script\fP\ \fIstyle\fP
How to handle XML \fBsup\fP and \fBsub\fP elements
(denoting superscripts and subscripts, respectively).
.PD 0
.RS
.IP \fBbrackets\fP 9
Surround superscripts by square brackets
and subscripts by parentheses.
.IP \fBmarkdown\fP 9
Surround superscripts with carets
and subscripts with tildes.
.RE
.PD
.TP
\fB\-mathml\ terse\fP
Flatten MathML markup tersely.
.SS Modification
.TP
\fB\-filter\fP\ \fIelement\fP\ \fIaction\fP\ \fItarget\fP
Actions:
.PD 0
.RS
.IP \fBretain\fP 12
Keep matching elements (no\-op).
.IP \fBremove\fP 12
Remove matching elements.
.IP \fBencode\fP 12
HTML\-escape special characters.
.IP \fBdecode\fP 12
Decode HTML escapes.
.IP \fBshrink\fP 12
Compress runs of spaces.
.IP \fBexpand\fP 12
Place each attribute on a separate line.
.IP \fBaccent\fP 12
Strip off Unicode accents.
.PD
.P
Targets:
.PD 0
.IP \fBcontent\fP 12
Plain\-text content.
.IP \fBcdata\fP 12
\fBCDATA\fP blocks.
.IP \fBcomment\fP 12
Comments.
.IP \fBobject\fP 12
The whole object.
.IP \fBattributes\fP 12
Attributes.
.IP \fBcontainer\fP 12
Start and end tags.
.RE
.PD
.SS Summary
.TP
\fB\-outline\fP
Display outline of XML structure.
.TP
\fB\-synopsis\fP
Display count of unique XML paths.
.SS Documentation
.TP
\fB\-help\fP
Print usage information and some example argument combinations.
.TP
\fB\-examples\fP
Complete examples of \fBedirect\fP(1) and \fBxtract\fP usage.
.TP
\fB\-version\fP
Print version number.
.SH NOTES
String constraints use case\-insensitive comparisons.
Numeric constraints and selection arguments use integer values.
\fB\-num\fP and \fB\-len\fP selections are synonyms
for Object Count (\fB#\fP) and Item Length (\fB%\fP).
\fB\-words\fP, \fB\-pairs\fP, and \fB\-indices\fP convert to lower case.
.SH SEE ALSO
.BR edirect (1),
.BR pm\-index (1),
.BR pm\-invert (1),
.BR pm\-stash (1),
.BR rchive (1),
.BR transmute (1),
.BR xy\-plot (1).
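.SH EXAMPLES
The commands below are a minimal sketch of how the exploration,
conditional, and formatting arguments listed under OPTIONS can be combined.
They assume a hypothetical file \fIrecords.xml\fP that holds a PubMed
\fBPubmedArticleSet\fP; the file name and the threshold of five authors
are illustrative only and are not part of \fBxtract\fP.
.PP
.nf
# Inspect the nesting of elements before writing a query.
xtract \-outline < records.xml

# One row per PubmedArticle: the PMID, then each author
# printed as "Initials LastName".
xtract \-input records.xml \-pattern PubmedArticle \e
  \-element MedlineCitation/PMID \e
  \-block Author \-sep " " \-element Initials,LastName

# The same rows, limited to records with more than five Author objects.
xtract \-input records.xml \-pattern PubmedArticle \e
  \-if "#Author" \-gt 5 \e
  \-element MedlineCitation/PMID \e
  \-block Author \-sep " " \-element Initials,LastName
.fi
.PP
Here \fB\-pattern\fP sets the record that produces each output row,
and \fB\-block\fP loops over the \fBAuthor\fP objects inside that record.
\fB\-sep\fP changes only the separator between the members of the
\fBInitials,LastName\fP group;
\fB\-tab\fP and \fB\-ret\fP would change the separators between fields
and between patterns.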