XTRACT(1) | NCBI Entrez Direct User's Manual | XTRACT(1) |
NAME¶
xtract - NCBI Entrez Direct XML conversion and transformation tool
SYNOPSIS¶
xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-aliases filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-includes str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-element element] [-first element] [-last element] [-backward element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-acc element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-bin element] [-oct element] [-hex element] [-bit element] [-pad element] [-encode element] [-upper element] [-lower element] [-chain element] [-title element] [-mirror element] [-alnum element] [-basic element] [-plain element] [-simple element] [-author element] [-prose element] [-terms element] [-words element] [-pairs element] [-order element] [-reverse element] [-letters element] [-clauses element] [-year element] [-month element] [-date element] [-page element] [-auth element] [-initials element] [-jour element] [-trim element] [-wct element] [-doi element] [-translate element] [-classify element] [-replace -reg target -exp replacement] [-revcomp] [-nucleic] [-fasta] [-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] 
[-histogram] [-e2index [extras]] [-indices element] [-article element] [-abstract element] [-paragraph element] [-stemmed element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort[-fwd] element] [-sort-rev element] [-format fmt [-unicode style]] [-verify] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]
DESCRIPTION¶
xtract converts an XML document into a table of data values according to user-specified rules.
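As a rough illustration of this conversion, the Python sketch below imitates what a command along the lines of xtract -pattern Rec -element Id Name is meant to produce: one tab-delimited row per record. The tag names (Rec, Id, Name), the sample data, and the helper xml_to_rows are invented for the example; this is an analogue, not xtract's implementation.

```python
import xml.etree.ElementTree as ET


def xml_to_rows(xml_text, pattern, elements):
    """Return one tab-delimited row per <pattern> record (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):
        fields = []
        for name in elements:
            # Collect every node with this tag inside the record, in document order.
            fields.extend(node.text or "" for node in record.iter(name))
        rows.append("\t".join(fields))
    return rows


# Made-up input: a set containing two records.
SAMPLE = (
    "<Set>"
    "<Rec><Id>1</Id><Name>alpha</Name></Rec>"
    "<Rec><Id>2</Id><Name>beta</Name></Rec>"
    "</Set>"
)

for row in xml_to_rows(SAMPLE, "Rec", ["Id", "Name"]):
    print(row)
```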
OPTIONS¶
Processing Flags¶
- -strict
- Remove HTML and MathML tags.
- -mixed
- Allow mixed content XML.
- -self
- Allow detection of empty self-closing tags.
- -accent
- Delete Unicode accents and diacritical marks.
- -ascii
- Convert Unicode to numeric HTML character entities.
- -compress
- Compress runs of spaces.
- -stops
- Retain stop words in selected phrases.
Data Source¶
- -input filename
- Read XML from file instead of standard input.
- -transform filename
- File of substitutions for -translate.
- -aliases filename
- Mappings file for -classify operation.
Exploration Argument Hierarchy¶
- -pattern expr
- -group expr
- -block expr
- -subset expr
- Name of record within set. Use of different argument names allows command-line control of nested looping.
Path Navigation¶
- -path path
- Explore by list of adjacent object names.
Exploration Constructs¶
- Object
- DateRevised
- Parent/Child
- Book/AuthorList
- Path
- MedlineCitation/Article/Journal/JournalIssue/PubDate
- Heterogeneous
- "PubmedArticleSet/*"
- Exhaustive
- "History/**"
- Nested
- "*/Taxon"
Conditional Execution¶
- -if expr [constraint]
- Element (or @attribute) must exist and satisfy any specified constraint.
- -unless expr [constraint]
- Skip if element matches.
- -and condition
- Preceding and following tests must both pass.
- -or condition
- Any passing test suffices.
- -else
- Execute if conditional test failed.
- -position pos
- Restrict by position: first, last, outer, inner, even, odd, or all.
String Constraints¶
- -equals str
- String must match exactly.
- -contains str
- Substring must be present.
- -includes str
- Substring must match at word boundaries.
- -is-within str
- String must be present.
- -starts-with str
- Substring must be at beginning.
- -ends-with str
- Substring must be at end.
- -is-not str
- String must not match.
- -is-before str
- First string < second string.
- -is-after str
- First string > second string.
- -matches str
- Matches without commas or semicolons.
- -resembles str
- Requires all words, but in any order.
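The differences among several of these constraints can be sketched in Python. The predicates below are approximations written for this illustration (the function names are not part of xtract); they fold case because the NOTES section states that string constraints are case-insensitive, and the word-boundary rule for -includes is assumed to behave like a regular-expression \b.

```python
import re


def equals(value, s):
    # Whole string must match, ignoring case.
    return value.casefold() == s.casefold()


def contains(value, s):
    # Substring may appear anywhere, even inside a longer word.
    return s.casefold() in value.casefold()


def includes(value, s):
    # Substring must start and end at word boundaries (assumed \b semantics).
    return re.search(r"\b" + re.escape(s) + r"\b", value, re.IGNORECASE) is not None


def starts_with(value, s):
    return value.casefold().startswith(s.casefold())


def ends_with(value, s):
    return value.casefold().endswith(s.casefold())


def resembles(value, s):
    # Every word of s must occur in value; order does not matter.
    return set(s.casefold().split()) <= set(value.casefold().split())


print(contains("Phospholipase A2", "lipase"))   # substring inside a word
print(includes("Phospholipase A2", "lipase"))   # not at a word boundary
print(resembles("venom of the rattlesnake", "rattlesnake venom"))
```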
Object Constraints¶
- -is-equal-to expr
- Object values must match.
- -differs-from expr
- Object values must differ.
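Numeric Constraints¶
- -gt N
- Greater than.
- -ge N
- Greater than or equal to.
- -lt N
- Less than.
- -le N
- Less than or equal to.
- -eq N
- Equal to.
- -ne N
- Not equal to.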
Format Customization¶
- -ret str
- Override line break between patterns.
- -tab str
- Replace tab character between fields.
- -sep str
- Separator between group members.
- -pfx str
- Prefix to print before group.
- -sfx str
- Suffix to print after group.
- -rst
- Reset -sep through -elg.
- -clr
- Clear queued tab separator.
- -pfc str
- Preface combines -clr and -pfx.
- -deq str
- Delete and replace queued tab separator.
- -def str
- Default placeholder for missing fields.
- -lbl str
- Insert arbitrary text.
XML Generation¶
- -set tag
- XML tag for entire set.
- -rec tag
- XML tag for each record.
- -wrp tag
- Wrap elements in XML object.
- -enc tag
- Encase instance in XML object.
- -plg str
- Prologue to print before instance.
- -elg str
- Epilogue to print after instance.
- -pkg tag
- Package subset in XML object.
- -fwd str
- Foreword to print before subset.
- -awd str
- Afterword to print after subset.
Element Selection¶
- -element element
- Print all items that match tag name.
- -first element
- Only print value of first item.
- -last element
- Only print value of last item.
- -backward element
- Print values in reverse order.
- -NAME
- Record value in named variable.
- --STATS
- Accumulate values into variable.
-element Constructs¶
- Tag
- Caption
- Group
- Initials,LastName
- Parent/Child
- MedlineCitation/PMID
- Recursive
- "**/Gene-commentary_accession"
- Unrestricted
- PubDate/*
- Attribute
- DescriptorName@MajorTopicYN
- Range
- MedlineDate[1:4]
- Substring
- "Title[phospholipase | rattlesnake]"
- Object Count
- "#Author"
- Item Length
- "%Title"
- Element Depth
- "^PMID"
- Variable
- "&NAME"
Special -element Operations¶
- Parent Index
- "+"
- Object Name
- "?"
- Object Value
- "~"
- XML Subtree
- "*"
- Children
- "$"
- Attributes
- "@"
- ASN.1 Record
- "."
- JSON Record
- "%"
Numeric Processing¶
- -num element
- Count.
- -len element
- Length.
- -sum element
- Sum.
- -acc element
- Accumulator.
- -min element
- Minimum.
- -max element
- Maximum.
- -inc element
- Increment.
- -dec element
- Decrement.
- -sub element
- Difference.
- -avg element
- Average.
- -dev element
- Deviation.
- -med element
- Median.
- -mul element
- Product.
- -div element
- Quotient.
- -mod element
- Remainder.
- -bin element
- Binary.
- -oct element
- Octal.
- -hex element
- Hexadecimal.
- -bit element
- Bit count.
- -pad element
- Zero-pad to eight digits.
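A few of these operations have direct Python counterparts, sketched below under stated assumptions: -bit is taken to mean the number of set bits, hexadecimal digits are shown in lowercase, and the helper names summarize and render are invented here. The output format of the real tool may differ.

```python
import statistics


def summarize(values):
    """Aggregate a list of integers, loosely mirroring -sum, -min, -max and -med."""
    return {
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "med": statistics.median(values),
    }


def render(n):
    """Render one non-negative integer, loosely mirroring -bin, -oct, -hex, -bit and -pad."""
    return {
        "bin": format(n, "b"),
        "oct": format(n, "o"),
        "hex": format(n, "x"),       # lowercase is an arbitrary choice for this sketch
        "bit": bin(n).count("1"),    # assumes "bit count" means number of 1 bits
        "pad": format(n, "08d"),     # zero-pad to eight digits
    }


print(summarize([3, 1, 4, 1, 5]))
print(render(10))
```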
Character Processing¶
- -encode element
- XML-encode <, >, &, ", and ' characters.
- -upper element
- Convert text to uppercase.
- -lower element
- Convert text to lowercase.
- -chain element
- Change spaces to underscores.
- -title element
- Capitalize initial letters of words.
- -mirror element
- Reverse order of letters.
- -alnum element
- Non-alphanumeric characters to space.
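The character-level conversions listed above are simple enough to approximate in a few lines of Python. The functions below are illustrative stand-ins, not the tool's code; in particular, whether -title also lowercases the remaining letters and exactly which characters -alnum treats as alphanumeric are assumptions.

```python
import re
from xml.sax.saxutils import escape


def encode(s):
    # Escape the five XML special characters.
    return escape(s, {'"': "&quot;", "'": "&apos;"})


def chain(s):
    return s.replace(" ", "_")


def title(s):
    # Capitalize the first letter of each space-separated word, leaving the rest as is.
    return " ".join(w[:1].upper() + w[1:] for w in s.split(" "))


def mirror(s):
    return s[::-1]


def alnum(s):
    # Replace anything that is not an ASCII letter or digit with a space.
    return re.sub(r"[^0-9A-Za-z]", " ", s)


print(encode("a<b & 'c'"))
print(chain("heat shock protein"))
print(alnum("IL-2/beta"))
```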
String Processing¶
- -basic element
- Convert superscripts and subscripts.
- -plain element
- Remove embedded mixed-content markup tags.
- -simple element
- Normalize accented letters; spell Greek letters.
- -author element
- Multi-step author cleanup.
- -prose element
- Text conversion to ASCII.
Text Processing¶
- -terms element
- Partition text at spaces.
- -words element
- Split at punctuation marks.
- -pairs element
- Adjacent informative words.
- -order element
- Rearrange words in sorted order.
- -reverse element
- Reverse words in string.
- -letters element
- Separate individual letters.
- -clauses element
- Break at phrase separators.
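To make the distinction between -terms and -words concrete, here is a hedged Python approximation of several of these splitters. The function names are invented; the lowercasing in words() follows the NOTES section, while the exact tokenization rules of the real tool are assumed rather than documented here.

```python
import re


def terms(s):
    # Partition at whitespace only; punctuation stays attached.
    return s.split()


def words(s):
    # Split at punctuation (and spaces) and convert to lower case.
    return re.findall(r"[0-9a-z]+", s.lower())


def order(s):
    # Rearrange words in sorted order.
    return " ".join(sorted(s.split()))


def reverse(s):
    # Reverse the order of words, not of letters.
    return " ".join(reversed(s.split()))


def letters(s):
    # Separate individual letters, skipping whitespace.
    return [c for c in s if not c.isspace()]


print(terms("Heat-shock protein, human"))
print(words("Heat-shock protein, human"))
```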
Citation Functions¶
- -year element
- Extract first 4-digit year from string.
- -month element
- Match first month name and return a corresponding integer.
- -date element
- YYYY/MM/DD from -unit "PubDate" -date "*"
- -page element
- Get digits (and letters) of first page number.
- -auth element
- Change GenBank authors to Medline form.
- -initials element
- Parse initials from forename or given name.
- -jour element
- Clean up journal name punctuation.
- -trim element
- Remove extra spaces and leading zeros.
- -wct element
- Count number of -words in a string.
- -doi element
- Add https://doi.org/ prefix, URL encode.
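Three of these functions can be approximated as follows. This is a sketch under assumptions: the helper names are invented, month matching is limited to English names and their three-letter abbreviations, a missing month yields None, and the set of characters that -doi escapes is taken from Python's urllib defaults rather than from xtract.

```python
import re
from urllib.parse import quote

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def year(s):
    # First run of exactly four digits.
    m = re.search(r"(?<!\d)\d{4}(?!\d)", s)
    return m.group(0) if m else ""


def month(s):
    # First month name (or abbreviation) mapped to 1..12; None if absent.
    m = re.search(r"\b(" + "|".join(MONTHS) + r")", s.lower())
    return MONTHS.index(m.group(1)) + 1 if m else None


def doi(s):
    # Add the resolver prefix and percent-encode unsafe characters.
    return "https://doi.org/" + quote(s, safe="/")


print(year("2019 Mar-Apr"), month("2019 Mar-Apr"))
print(doi("10.1000/abc 123"))
```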
Value Transformation¶
- -translate element
- Substitute values with -transform table.
- -classify element
- Substring word or phrase matches to -aliases table.
Regular Expression¶
- -replace
- Substitute text using regular expressions.
- -reg target
- Target expression.
- -exp replacement
- Replacement pattern.
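Conceptually, the target expression and the replacement pattern play the same roles as the first two arguments of a regular-expression substitution. The Python sketch below shows that relationship only; the function name is invented, and Python's re syntax is used here, which may differ from the dialect xtract accepts.

```python
import re


def replace(value, reg, exp):
    """Substitute every match of the target expression with the replacement."""
    return re.sub(reg, exp, value)


# Remove whitespace, then reorder two captured groups.
print(replace("PMC 12345", r"\s+", ""))
print(replace("Smith, John", r"(\w+), (\w+)", r"\2 \1"))
```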
Sequence Processing¶
- -revcomp
- Reverse complement nucleotide sequence.
- -nucleic
- Subrange determines forward or revcomp.
- -fasta
- Split sequence into blocks of 70 uppercase letters.
- -ncbi2na
- Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
- -ncbi4na
- Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
- -molwt
- Calculate molecular weight of peptide.
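The first and third operations are easy to picture in code. The following Python sketch is an approximation: the function names are invented, only the four unambiguous bases are complemented (IUPAC ambiguity codes pass through unchanged), and the block width of 70 comes from the -fasta description above.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    # Complement each base, then reverse the whole sequence.
    return seq.translate(_COMPLEMENT)[::-1]


def fasta_blocks(seq, width=70):
    # Uppercase the sequence and cut it into fixed-width lines.
    seq = seq.upper()
    return [seq[i:i + width] for i in range(0, len(seq), width)]


print(revcomp("ATGC"))
for line in fasta_blocks("acgt" * 40):
    print(line)
```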
Sequence Coordinates¶
- -0-based element
- Zero-based.
- -1-based element
- One-based.
- -ucsc-based element
- Half-open.
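The three conventions describe the same interval with different numbers. The Python sketch below expresses one interval in each of them; it assumes a one-based inclusive interval as the starting point and uses an invented helper name. It illustrates the conventions themselves and does not claim which direction each xtract flag converts.

```python
def conventions(start, stop):
    """Express a 1-based inclusive interval in all three conventions."""
    return {
        "1-based": (start, stop),           # first base is 1, stop is included
        "0-based": (start - 1, stop - 1),   # first base is 0, stop is included
        "ucsc-based": (start - 1, stop),    # 0-based start, stop is excluded
    }


# Bases 11 through 20 of a sequence (ten bases in total).
print(conventions(11, 20))
```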
Command Generator¶
- -insd arg ...
- Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
- Descriptor(s)
- INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
- Completeness
- complete/partial
- Feature(s)
- CDS/mRNA/...[,...]
- Qualifier(s)
- INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]
Frequency Table¶
- -histogram
- Collects data for sort-uniq-count(1) on entire set of records.
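The result is comparable to counting how often each value occurs across all records. A minimal Python sketch of that idea follows; the function name is invented, and ordering by value (as a plain sort followed by a unique count would give) is an assumption about the output layout.

```python
from collections import Counter


def histogram(values):
    """Return (count, value) pairs over the whole input, ordered by value."""
    counts = Counter(values)
    return [(counts[v], v) for v in sorted(counts)]


# Made-up language codes gathered from five records.
for count, value in histogram(["eng", "fre", "eng", "ger", "eng"]):
    print(f"{count}\t{value}")
```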
Entrez Indexing¶
- -e2index [extras]
- Create Entrez index XML. extras (true or false; false by default) indicates whether to index extra fields.
- -indices element
- Index normalized words.
- -article element
- Title positional index.
- -abstract element
- Abstract positional index.
- -paragraph element
- Index text paragraphs.
- -stemmed element
- Apply Porter2 algorithm.
Output Organization¶
Record Selection¶
- -select condition
- Select record subset by conditions.
- -in filename
- File of identifiers to use for selection.
Record Rearrangement¶
- -sort[-fwd] element
- Element to use as sort key.
- -sort-rev element
- Sort records in reverse order.
Reformatting¶
Validation¶
- -verify
- Report XML data integrity problems.
Summary¶
- -outline
- Display outline of XML structure.
- -synopsis
- Display individual XML paths.
- -contour [delimiter]
- Display XML paths to leaf nodes (delimited by / by default).
Documentation¶
NOTES¶
String constraints use case-insensitive comparisons.
Numeric constraints and selection arguments use integer values.
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
-words, -pairs, and -indices convert to lower case.
SEE ALSO¶
archive-pmc(1), archive-pubmed(1), custom-index(1), disambiguate-nucleotides(1), download-ncbi-data(1), ds2pme(1), esample(1), fetch-pmc(1), fetch-pubmed(1), find-in-gene(1), fuse-segments(1), gene2range(1), hgvs2spdi(1), index-extras(1), index-pubmed(1), pma2pme(1), rchive(1), snp2hgvs(1), snp2tbl(1), sort-uniq-count(1), spdi2tbl(1), tbl2prod(1), transmute(1), uniq-table(1), xml2fsa(1), xml2tbl(1), xy-plot(1).
2023-02-26 | NCBI |