
XTRACT(1) NCBI Entrez Direct User's Manual XTRACT(1)

NAME

xtract - convert XML into a table of data values

SYNOPSIS

xtract [-help] [-strict] [-mixed] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-select condition] [-equals str] [-contains str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-plg str] [-elg str] [-rst] [-clr] [-pfc str] [-deq str] [-wrp tag] [-def str] [-lbl str] [-element element] [-first element] [-last element] [-NAME] [-num element] [-len element] [-sum element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-bin element] [-bit element] [-encode element] [-upper element] [-lower element] [-title element] [-year element] [-translate element] [-terms element] [-words element] [-pairs element] [-reverse element] [-letters element] [-clauses element] [-indices element] [-e2index] [-revcomp] [-nucleic] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] [-head str] [-tail str] [-hd str] [-tl str] [-format fmt] [-unicode style] [-script style] [-mathml terse] [-filter element action target] [-verify] [-outline] [-synopsis] [-skip filename] [-examples] [-version]

DESCRIPTION

xtract converts an XML document into a table of data values according to user-specified rules.

OPTIONS

Processing Flags

-strict
Remove HTML and MathML tags.
-mixed
Allow mixed content XML.
-accent
Delete Unicode accents and diacritical marks.
-ascii
Convert Unicode to numeric HTML character entities.
-compress
Compress runs of spaces.
-stops
Retain stop words in selected phrases.

Data Source

-input filename
Read XML from file instead of standard input.
-transform filename
File of substitutions for -translate.

Exploration Argument Hierarchy

-pattern expr
-group expr
-block expr
-subset expr
Name of record within set. Use of different argument names allows command-line control of nested looping.
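
The nested looping described above can be pictured with a short Python sketch. It is not how xtract is implemented. The sample XML, the function name explore, and the hard-coded Id field are assumptions made only for illustration. The outer loop stands in for -pattern (one output row per record) and the inner loop stands in for -block.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from any NCBI record.
SAMPLE = """
<Set>
  <Rec><Id>1</Id><List><Item>a</Item><Item>b</Item></List></Rec>
  <Rec><Id>2</Id><List><Item>c</Item></List></Rec>
</Set>
"""

def explore(xml_text, pattern, block, element):
    """Return one tab-separated row per `pattern` record."""
    rows = []
    root = ET.fromstring(xml_text)
    for record in root.iter(pattern):        # outer loop, like -pattern
        fields = [record.findtext("Id", "")]  # "Id" is specific to SAMPLE
        for blk in record.iter(block):       # inner loop, like -block
            fields.extend(e.text or "" for e in blk.iter(element))
        rows.append("\t".join(fields))
    return rows

for row in explore(SAMPLE, "Rec", "List", "Item"):
    print(row)
```

Each record yields exactly one line, and the inner loop only ever sees objects that belong to the current record.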

Exploration Constructs

Object
DateRevised
Parent/Child
Book/AuthorList
Heterogeneous
"PubmedArticleSet/*"
Exhaustive
"History/**"
Nested
"*/Taxon"
Recursive
"**/Gene-commentary"

Conditional Execution

-if expr [constraint]
Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [constraint]
Skip if element matches.
-and condition
Preceding and following tests must both pass.
-or condition
Any passing test suffices.
-else
Execute if conditional test failed.
-position pos
Restrict to objects at the given position: first, last, outer, inner, even, odd, or all.
-select condition
Select record subset by conditions.

String Constraints

-equals str
String must match exactly.
-contains str
Substring must be present.
-is-within str
Element value must be present within str.
-starts-with str
Substring must be at beginning.
-ends-with str
Substring must be at end.
-is-not str
String must not match.
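
As a rough model of these tests, the Python sketch below writes five of the constraints as predicates. It is an illustration, not xtract's code: the names STRING_CONSTRAINTS and check are hypothetical, and -is-within is left out. Comparisons ignore case, as stated under NOTES.

```python
def _fold(s):
    """Normalize case so that comparisons are case-insensitive."""
    return s.casefold()

# Each predicate takes the element value and the constraint string.
STRING_CONSTRAINTS = {
    "-equals":      lambda value, s: _fold(value) == _fold(s),
    "-contains":    lambda value, s: _fold(s) in _fold(value),
    "-starts-with": lambda value, s: _fold(value).startswith(_fold(s)),
    "-ends-with":   lambda value, s: _fold(value).endswith(_fold(s)),
    "-is-not":      lambda value, s: _fold(value) != _fold(s),
}

def check(value, flag, s):
    """Apply the named constraint to an element value."""
    return STRING_CONSTRAINTS[flag](value, s)

print(check("Homo sapiens", "-starts-with", "HOMO"))
print(check("Homo sapiens", "-equals", "homo"))
```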

Numeric Constraints

-gt N
Greater than.
-ge N
Greater than or equal to.
-lt N
Less than.
-le N
Less than or equal to.
-eq N
Equal to.
-ne N
Not equal to.
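
The six numeric tests map directly onto integer comparisons, as the Python sketch below shows. The names NUMERIC_CONSTRAINTS and passes are hypothetical, and treating non-integer text as a failed test is an assumption rather than documented xtract behavior.

```python
import operator

NUMERIC_CONSTRAINTS = {
    "-gt": operator.gt,
    "-ge": operator.ge,
    "-lt": operator.lt,
    "-le": operator.le,
    "-eq": operator.eq,
    "-ne": operator.ne,
}

def passes(text, flag, n):
    """True if the element text, read as an integer, satisfies the test."""
    try:
        value = int(text.strip())
    except ValueError:
        return False  # assumption: text that is not an integer fails
    return NUMERIC_CONSTRAINTS[flag](value, n)

print(passes("2019", "-ge", 2015))
print(passes("2019", "-lt", 2015))
```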

Format Customization

-ret str
Override line break between patterns.
-tab str
Replace tab character between fields.
-sep str
Separator between group members.
-pfx str
Prefix to print before group.
-sfx str
Suffix to print after group.
-plg str
Prologue to print once before elements.
-elg str
Epilogue to print once after elements.
-rst
Reset -sep through -elg.
-clr
Clear queued tab separator.
-pfc str
Preface combines -clr and -pfx.
-deq str
Delete and replace queued tab separator.
-wrp tag
Wrap elements in XML object.
-def str
Default placeholder for missing fields.
-lbl str
Insert arbitrary text.
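
The Python sketch below models how -tab, -sep, -pfx, -sfx, -def, and -ret might combine when a row is assembled. It is a simplified picture under stated assumptions: the function names are hypothetical, the tab default for -sep is assumed, and queued-separator behavior (-clr, -deq, -pfc) is not modeled.

```python
def format_row(groups, tab="\t", sep="\t", pfx="", sfx="", default=None):
    """Join the groups of one record into a single line (without -ret)."""
    cells = []
    for members in groups:
        if members:
            # Members of one group are joined by sep and wrapped in pfx/sfx.
            cells.append(pfx + sep.join(members) + sfx)
        elif default is not None:
            cells.append(default)  # like -def: stand-in for a missing field
    return tab.join(cells)

def format_table(records, ret="\n", **options):
    """Format every record and end each one with ret."""
    return "".join(format_row(groups, **options) + ret for groups in records)

print(format_row([["12345"], ["Smith", "Jones"]], sep=", "))
print(format_row([["12345"], []], default="-"))
```

Without a default, a missing field simply produces no column, which is why a placeholder helps keep columns aligned.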

Element Selection

-element element
Print all items that match tag name.
-first element
Only print value of first item.
-last element
Only print value of last item.
-NAME
Record value in named variable.

-element Constructs

Tag
Caption
Group
Initials,LastName
Parent/Child
MedlineCitation/PMID
Recursive
"**/Gene-commentary_accession"
Unrestricted
"PubDate/*"
Attribute
DescriptorName@MajorTopicYN
Range
MedlineDate[1:4]
Substring
"Title[phospholipase | rattlesnake]"
Object Count
"#Author"
Item Length
"%Title"
Element Depth
"^PMID"
Variable
"&NAME"

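Several of the -element constructs can be imitated with Python's xml.etree module, as in the sketch below. The record is a made-up fragment, the helper names are hypothetical, and two points are assumptions: that Parent/Child matches only direct children, and that Range positions are 1-based and inclusive.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, loosely shaped like a PubMed citation.
RECORD = ET.fromstring(
    "<PubmedArticle><MedlineCitation><PMID>12345</PMID>"
    "<MedlineDate>1998 Dec-1999 Jan</MedlineDate>"
    "<Author><LastName>Smith</LastName></Author>"
    "<Author><LastName>Jones</LastName></Author>"
    '<DescriptorName MajorTopicYN="Y">Venoms</DescriptorName>'
    "</MedlineCitation></PubmedArticle>"
)

def parent_child(rec, parent, child):
    """Parent/Child: text of `child` elements directly under `parent`."""
    return [c.text for p in rec.iter(parent) for c in p.findall(child)]

def attribute(rec, tag, name):
    """Attribute: value of attribute `name` on each `tag` element."""
    return [e.get(name) for e in rec.iter(tag) if e.get(name) is not None]

def char_range(rec, tag, first, last):
    """Range: characters first..last (assumed 1-based, inclusive)."""
    return [(e.text or "")[first - 1:last] for e in rec.iter(tag)]

def object_count(rec, tag):
    """Object Count: number of `tag` elements in the record."""
    return sum(1 for _ in rec.iter(tag))

def item_length(rec, tag):
    """Item Length: length of the text of each `tag` element."""
    return [len(e.text or "") for e in rec.iter(tag)]

print(parent_child(RECORD, "MedlineCitation", "PMID"))
print(char_range(RECORD, "MedlineDate", 1, 4))
print(object_count(RECORD, "Author"))
```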
Special -element Operations

Parent Index
"+"
Object Name
"?"
XML Subtree
"*"
Children
"$"
Attributes
"@"

Numeric Processing

-num element
Count.
-len element
Length.
-sum element
Sum.
-min element
Minimum.
-max element
Maximum.
-inc element
Increment.
-dec element
Decrement.
-sub element
Difference.
-avg element
Average.
-dev element
Deviation.
-med element
Median.
-bin element
Binary.
-bit element
Bit count.
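
The terse descriptions above correspond to familiar operations, sketched here in Python. This is an illustration only: the function names are hypothetical, the output format of -bin is assumed, and because xtract works with integer values its rounding for -avg and -med may differ from the statistics module.

```python
import statistics

def summarize(texts):
    """Aggregate the integer values of all matching elements."""
    values = [int(t) for t in texts]
    return {
        "-num": len(values),
        "-sum": sum(values),
        "-min": min(values),
        "-max": max(values),
        "-avg": statistics.mean(values),
        "-med": statistics.median(values),
    }

def binary(text):
    """Like -bin: binary representation of an integer."""
    return format(int(text), "b")

def bit_count(text):
    """Like -bit: number of 1 bits in an integer."""
    return bin(int(text)).count("1")

print(summarize(["3", "5", "10"]))
print(binary("10"), bit_count("10"))
```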

String Processing

-encode element
Escape <, >, &, ", and ' characters as HTML entities.
-upper element
Convert text to uppercase.
-lower element
Convert text to lowercase.
-title element
Capitalize initial letters of words.
-year element
Extract first 4-digit year from string.
-translate element
Substitute values with -transform table.
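
Two of these operations are sketched below in Python. The function names are hypothetical. Requiring the year to be a run of exactly four digits, and lowercasing the remainder of each word when capitalizing, are assumptions rather than documented behavior.

```python
import re

def first_year(text):
    """Like -year: first run of exactly four digits, or "" if none."""
    match = re.search(r"(?<!\d)\d{4}(?!\d)", text)
    return match.group(0) if match else ""

def title_case(text):
    """Like -title: capitalize the first letter of each word."""
    return " ".join(w[:1].upper() + w[1:].lower() for w in text.split(" "))

print(first_year("1998 Dec-1999 Jan"))
print(title_case("phospholipase A2 in rattlesnake VENOM"))
```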

Text Processing

-terms element
Partition text at spaces.
-words element
Split at punctuation marks.
-pairs element
Adjacent informative words.
-reverse element
Reverse words in string.
-letters element
Separate individual letters.
-clauses element
Break at phrase separators.
-indices element
Word pair index generation.
-e2index
Create Entrez index XML.
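
The simpler text operations can be approximated in Python as follows. The function names are hypothetical and the exact set of characters xtract treats as punctuation is an assumption. Stop-word handling (-pairs, -stops) and index generation are not modeled.

```python
import re

def terms(text):
    """Like -terms: partition at spaces, keeping punctuation attached."""
    return text.split()

def words(text):
    """Like -words: split at punctuation and convert to lower case."""
    return [w for w in re.split(r"[^0-9A-Za-z]+", text.lower()) if w]

def letters(text):
    """Like -letters: separate the individual non-space characters."""
    return [c for c in text if not c.isspace()]

def reverse_words(text):
    """Like -reverse: reverse the order of words in the string."""
    return " ".join(reversed(text.split()))

sample = "Snake venom, phospholipase A2."
print(terms(sample))
print(words(sample))
print(reverse_words("one two three"))
```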

Sequence Processing

-revcomp
Reverse-complement nucleotide sequence.
-nucleic
Direction of the subrange determines whether the forward or reverse-complemented sequence is returned.
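
Reverse complementation is a standard operation and can be sketched in a few lines of Python. The function names are hypothetical. The nucleic helper assumes 1-based inclusive positions and that a start greater than the stop means the reverse strand, which is one plausible reading of the description above rather than documented behavior.

```python
# Complement table for A, C, G, T and common IUPAC ambiguity codes.
# Symbols not listed here (such as N, S, W) are left unchanged.
_COMPLEMENT = str.maketrans(
    "ACGTRYKMBVDHacgtrykmbvdh",
    "TGCAYRMKVBHDtgcayrmkvbhd",
)

def revcomp(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def nucleic(seq, start, stop):
    """Return seq[start..stop]; reverse-complement if start > stop."""
    if start <= stop:
        return seq[start - 1:stop]
    return revcomp(seq[stop - 1:start])

print(revcomp("ATGC"))
print(nucleic("AAACCCGGG", 4, 6))
print(nucleic("AAACCCGGG", 6, 1))
```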

Sequence Coordinates

-0-based element
Zero-based.
-1-based element
One-based.
-ucsc-based element
Half-open.
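
The three conventions differ only in where counting starts and whether the end position is included. The Python sketch below converts an interval given as 1-based inclusive positions into the other two forms. The function names are hypothetical, and how xtract derives the positions from an element is not covered here.

```python
def to_one_based(start, stop):
    """1-based, inclusive: the input convention, returned unchanged."""
    return (start, stop)

def to_zero_based(start, stop):
    """0-based, inclusive: both ends shift down by one."""
    return (start - 1, stop - 1)

def to_ucsc(start, stop):
    """Half-open: 0-based start, end position excluded."""
    return (start - 1, stop)

# Bases 10 through 20 of a sequence, 11 bases in total.
print(to_one_based(10, 20))
print(to_zero_based(10, 20))
print(to_ucsc(10, 20))
```

In the half-open form the length of the interval is simply the end minus the start.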

Command Generator

-insd arg ...
Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
Descriptor(s)
INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
Completeness
complete/partial
Feature(s)
CDS/mRNA/...[,...]
Qualifier(s)
INSDFeature_key/"#INSDInterval"/gene/product/... [...]

Miscellaneous

-head str
Print before everything else.
-tail str
Print after everything else.
-hd str
Print before each record.
-tl str
Print after each record.

Phrase Filtering

-require str
Keep records that contain a given phrase.
-exclude str
Keep records that do not contain a given phrase.

Reformatting

-format fmt
clean
copy
Fast block copy (still applies processing flags).
compact
Compress runs of spaces.
flush
Suppress line indentation.
indent
Indent according to nesting depth.
expand
Place each attribute on a separate line.
-unicode style
How to handle Unicode superscript and subscript digits (first converted to ASCII form in all cases).
fuse
Run them all together, with no additional markup.
space
Add spaces between digits in different positions.
period
Add periods between digits in different positions.
brackets
Surround superscripts by square brackets and subscripts by parentheses.
markdown
Surround superscripts with carets and subscripts with tildes.
slash
Add backslashes when going up in height and forward slashes when going down.
tag
Put superscripts in XML sup elements and subscripts in sub elements.
-script style
How to handle XML sup and sub elements (denoting superscripts and subscripts, respectively).
brackets
Surround superscripts by square brackets and subscripts by parentheses.
markdown
Surround superscripts with carets and subscripts with tildes.
-mathml terse
Flatten MathML markup tersely.
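
The brackets and markdown styles for -script can be illustrated with a small Python sketch that rewrites sup and sub elements in a text fragment. The function name render_script is hypothetical, and the regular expression handles only simple, non-nested elements without attributes.

```python
import re

# Opening and closing marks for each style, as described above.
_STYLES = {
    "brackets": {"sup": ("[", "]"), "sub": ("(", ")")},
    "markdown": {"sup": ("^", "^"), "sub": ("~", "~")},
}

def render_script(text, style):
    """Replace <sup>..</sup> and <sub>..</sub> with the style's marks."""
    marks = _STYLES[style]

    def replace(match):
        left, right = marks[match.group(1)]
        return left + match.group(2) + right

    return re.sub(r"<(sup|sub)>(.*?)</\1>", replace, text)

fragment = "H<sub>2</sub>O and x<sup>2</sup>"
print(render_script(fragment, "brackets"))
print(render_script(fragment, "markdown"))
```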

Modification

-filter element action target
Actions:
retain
Keep matching elements (no-op).
remove
Remove matching elements.
encode
HTML-escape special characters.
decode
Decode HTML escapes.
shrink
Compress runs of spaces.
expand
Place each attribute on a separate line.
accent
Strip off Unicode accents.

Targets:

content
Plain-text content.
cdata
CDATA blocks.
comment
Comments.
object
The whole object.
attributes
Attributes.
container
Start and end tags.

Summary

-outline
Display outline of XML structure.
-synopsis
Display count of unique XML paths.

Documentation

-help
Print usage information and some example argument combinations.
-examples
Complete examples of edirect(1) and xtract usage.
-version
Print version number.

NOTES

String constraints use case-insensitive comparisons.

Numeric constraints and selection arguments use integer values.

-num and -len selections are synonyms for Object Count (#) and Item Length (%).

-words, -pairs, and -indices convert to lower case.

SEE ALSO

edirect(1), pm-index(1), pm-invert(1), pm-stash(1), rchive(1), transmute(1), xy-plot(1).
2019-02-26 NCBI