NAME
hxpipe - convert XML file to a format easier to parse with Perl or AWK
SYNOPSIS
hxpipe [ -l ] [ -- ] [ file-or-URL ]
DESCRIPTION
hxpipe parses an HTML or XML file and outputs a line-oriented
representation of it that is well suited to further processing with AWK or
similar tools. The format is similar to the ESIS (Element Structure
Information Set) that is output by nsgmls/onsgmls.
The reverse operation, converting back to mark-up, is performed by the
hxunpipe program.
The output format is as follows:
- <!--comment-->
- Comments are output as
*comment
I.e., a single line starting with "*" followed by the text of the
comment. Line feeds, carriage returns and tabs in the text are written as
"\n", "\r" and "\t", respectively. Text that
looks like a numerical character entity is written with the
"&" replaced by "\". The line ends with a line
feed.
-
- Note that onsgmls outputs comments starting with a "_" instead
of a "*" and doesn't replace the "&" of numerical
character entities by "\" (and by default it omits comments
altogether).
- <?processing instruction>
- Processing instructions are output as
?processing instruction
I.e., a single line starting with a "?" followed by the text of
the processing instruction. The text is escaped as for comments (see
above).
- <!DOCTYPE root PUBLIC "-//foo//DTD bar//EN"
"http://example.org/dtd">
- DOCTYPEs are output as one of the following:
!root "-//foo//DTD bar//EN" http://example.org/dtd
!root "-//foo//DTD bar//EN"
!root "" http://example.org/dtd
!root ""
for respectively: a DOCTYPE with (1) both a public and a system identifier,
(2) only a public identifier, (3) only a system identifier, or (4) neither
of the two. I.e., a single line starting with a "!" and the name of the
root element, followed by a space and a possibly empty quoted string,
optionally followed by a space and arbitrary text. Note the quotes for the
public identifier and the absence of quotes for the system identifier.
- <elt att1="value1" att2="value2">
- A start tag is output as
Aatt1 CDATA value1
Aatt2 CDATA value2
(elt
I.e., as zero or more lines for the attributes and one line for the element
type. Each line for an attribute starts with "A" followed by the
name of the attribute, a space, the literal string "CDATA",
another space, and the attribute value. The text of the attribute value is
escaped as for comments (see above). The line for the element type starts
with "(" followed by the element type.
-
- hxpipe does not read DTDs and assumes that attributes are always
CDATA. It never generates other types (IMPLIED, TOKEN, ID, etc.), unlike
onsgmls.
- </elt>
- End tags are output as
)elt
I.e., as a line starting with ")" followed by the element
type.
- <empty att1="val1" att2="val2"/>
- Empty elements (in XML) are output as
Aatt1 CDATA val1
Aatt2 CDATA val2
|empty
I.e., as zero or more lines for attributes and one line starting with
"|" followed by the element type.
-
- Note that onsgmls never outputs "|". (However, it can
optionally output a line consisting of a single "e" just before
the "(" line, to indicate that the element is empty.)
- text
- Text is output as
-text
I.e., as a single line starting with a "-". The text is escaped as
for comments (see above).
- line numbers
- When the -l option is in effect, hxpipe will intersperse the
output with lines of the form
L12
where "12" is replaced with the line number in the source where
the next output came from.
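The record types above can be decoded with a few lines of code. The following is a minimal sketch in Python, not part of hxpipe itself: the helper names unescape and parse_line are hypothetical, and only the escapes described above are reversed. How hxpipe writes a literal backslash is not described here, so any other backslash is left untouched.

```python
import re

# Escapes documented above: \n, \r, \t, and "\#" standing for "&#".
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "#": "&#"}
# Root name, quoted (possibly empty) public identifier, optional system identifier.
_DOCTYPE = re.compile(r'(\S+) "([^"]*)"(?: (.*))?')
_KINDS = {"*": "comment", "?": "pi", "-": "text",
          "(": "start", ")": "end", "|": "empty"}


def unescape(text):
    # Reverse only the documented escapes (assumption: nothing else is escaped).
    return re.sub(r"\\([nrt#])", lambda m: _ESCAPES[m.group(1)], text)


def parse_line(line):
    # Return a (kind, payload) pair for one line of hxpipe output.
    line = line.rstrip("\n")
    if not line:
        raise ValueError("empty record")
    tag, rest = line[0], line[1:]
    if tag == "A":
        # "Aname CDATA value": the value may itself contain spaces.
        name, _, tail = rest.partition(" ")
        atype, _, value = tail.partition(" ")
        return ("attribute", (name, atype, unescape(value)))
    if tag == "!":
        match = _DOCTYPE.fullmatch(rest)
        if match is None:
            raise ValueError("malformed DOCTYPE record: " + line)
        # (root, public identifier, system identifier or None)
        return ("doctype", match.groups())
    if tag == "L":
        return ("line", int(rest))
    if tag in ("*", "?", "-"):
        return (_KINDS[tag], unescape(rest))
    if tag in _KINDS:
        return (_KINDS[tag], rest)
    raise ValueError("unknown record type: " + line)
```

Because attribute lines precede the "(" or "|" line they belong to, a consumer would typically collect "attribute" records in a list and attach them to the next "start" or "empty" record.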
hxpipe does not normalize the input and does not add missing tags. It is
thus possible that there are unequal numbers of "(" and
")" lines. If it is important that every start tag is matched by an
end tag, pipe the input through hxnormalize -x first.
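To find out whether a given output needs that extra normalization step, the "(" and ")" lines can simply be counted per element type. The sketch below assumes the output has been read into a list of lines; the function name tag_imbalance is hypothetical. It only compares counts and does not check that tags are properly nested.

```python
from collections import Counter


def tag_imbalance(lines):
    # Count "(" lines as +1 and ")" lines as -1 for each element type.
    balance = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("("):
            balance[line[1:]] += 1
        elif line.startswith(")"):
            balance[line[1:]] -= 1
    # Keep only the element types whose start and end tags do not add up.
    return {name: count for name, count in balance.items() if count != 0}
```

A positive count means start tags without end tags, a negative count means the opposite, and an empty result means the numbers match.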
OPTIONS
The following options are supported:
- -l
- Add "L" lines to the output to indicate the line numbers in the
source.
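A consumer of the -l output usually wants each record paired with the source line it came from. The following sketch shows one way to do that; the function name with_line_numbers is hypothetical, and it assumes that a record not directly preceded by an "L" line belongs to the most recently reported line number.

```python
def with_line_numbers(lines):
    # Yield (source_line, record) pairs; source_line is None until
    # the first "L" line has been seen.
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("L") and line[1:].isdigit():
            current = int(line[1:])  # remember the number, emit nothing
        else:
            yield current, line
```

No other record type starts with "L" (text lines start with "-" and attribute lines with "A"), so the "L" lines can be filtered out without ambiguity.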
OPERANDS
The following operand is supported:
- file-or-URL
- The name or URL of an HTML or XML file. If absent, standard input is read
instead.
EXIT STATUS
The following exit values are returned:
- 0
- Successful completion.
- > 0
- An error occurred in the parsing of the HTML file. hxpipe will try
to correct the error and produce output anyway.
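Since output is still produced when the exit status is non-zero, a calling script may prefer to treat that status as a warning rather than discard the output. A minimal sketch, assuming Python 3.7 or later; the function name run_tolerant is hypothetical:

```python
import subprocess


def run_tolerant(cmd, data=None):
    # Run cmd, optionally feeding it data on standard input. Return the
    # captured standard output together with a flag that is True only
    # when the exit status was 0.
    proc = subprocess.run(cmd, input=data, capture_output=True, text=True)
    return proc.stdout, proc.returncode == 0
```

A call such as run_tolerant(["hxpipe", "-l", "page.html"]), where page.html is a placeholder file name, would return the line-oriented output even for a file with parse errors, with the flag set to False.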
ENVIRONMENT
To use a proxy to retrieve remote files, set the environment variables
http_proxy and ftp_proxy. E.g.,
http_proxy="http://localhost:8080/"
BUGS
The error recovery for incorrect HTML is primitive.
hxpipe can currently only retrieve remote files over HTTP. It doesn't
handle password-protected files, nor files whose content depends on HTTP
"cookies."
SEE ALSO
hxunpipe(1),
onsgmls(1).