NAME
tagsoup - convert nasty, ugly HTML to clean XHTML
SYNOPSIS
java -jar /usr/share/java/tagsoup.jar [ options ] [ files ]
DESCRIPTION
Rectify arbitrary HTML into clean XHTML, using a tailored description of HTML.
The output will be well-formed XML, but not necessarily
valid XHTML.
- --files
- multiple input files should be processed into corresponding output
files
- --encoding=encoding
- specifies the encoding of input files
- --output-encoding=encoding
- specifies the encoding of the output (if the encoding name begins with
``utf'', the output will not contain character entities; otherwise, all
non-ASCII characters are represented as entities)
- --html
- output rectified HTML rather than XML, omitting the XML declaration and
any namespace declarations
- --method=html
- output rectified HTML rather than XML (end-tags are omitted for empty
elements, and no character escaping is done in script and style
elements)
- --omit-xml-declaration
- omit the XML declaration
- --lexical
- output lexical features (specifically comments and any DOCTYPE
declaration)
- --nons
- suppress namespaces in output
- --nobogons
- suppress unknown non-HTML elements in output
- --nodefaults
- suppress default attribute values
- --nocolons
- change explicit colons in element and attribute names to underscores
- --norestart
- don't restart any restartable elements
- --ignorable
- pass through ignorable whitespace (whitespace in element-only content) via
the SAX ignorableWhitespace handler method
- --any
- treat unknown non-HTML elements as allowing any content (default)
- --emptybogons
- treat unknown non-HTML elements as empty elements
- --norootbogons
- don't allow unknown non-HTML elements to be root elements
- --doctype-system=system-id
- force DOCTYPE declaration to be output with specified system
identifier
- --doctype-public=public-id
- force DOCTYPE declaration to be output with specified public
identifier
- --standalone=[yes|no]
- specify standalone pseudo-attribute in output XML declaration
- --version=version
- specify version pseudo-attribute in output XML declaration (does not
affect actual version of XML output)
- --nocdata
- treat the CDATA-content elements script and style as
ordinary elements (mostly for testing)
- --pyx
- output PYX format rather than XML (mostly for testing)
- --pyxin
- input is PYX-format HTML (mostly for testing)
- --reuse
- reuse the same Parser object internally (for testing only)
- --help
- output basic help
- --version
- output version number
TagSoup is a parser and reformatter for nasty, ugly HTML. Its normal
processing mode is to accept HTML files on the command line, or from the
standard input if none are given, and output them as clean XML to the standard
output. By default, the input is assumed to be in the platform-local encoding
and the output is UTF-8; the --encoding and --output-encoding options override
these defaults.
When the --files option is given, each input file is processed into an
output file of the corresponding name, with the extension changed to xhtml.
If the extension is already xhtml, it is changed to xhtml_.
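The naming rule for --files can be sketched as a small shell function. This is only an illustration of the rule as described above, not TagSoup's own code; the function name tagsoup_output_name is hypothetical, and the handling of a file with no extension (appending .xhtml) is an assumption that this page does not confirm.

```sh
# Hypothetical helper: print the output name that the --files rule
# described above would give for one input path.
tagsoup_output_name() (
    # Keep any directory part untouched so dots in directory names are ignored.
    case $1 in
        */*) prefix=${1%/*}/ ;;
        *)   prefix= ;;
    esac
    base=${1##*/}

    case $base in
        *.xhtml) out=${base}_ ;;             # already xhtml: becomes xhtml_
        *.*)     out=${base%.*}.xhtml ;;     # replace the last extension
        *)       out=$base.xhtml ;;          # ASSUMPTION: no extension, append one
    esac
    printf '%s\n' "$prefix$out"
)
```

For instance, tagsoup_output_name index.html prints index.xhtml, and tagsoup_output_name page.xhtml prints page.xhtml_.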
TagSoup will repair, by whatever means necessary, violations of XML
well-formedness. In particular, it will fix up malformed attribute names and
supply missing attribute-value quotation marks. More significantly, it
supplies end-tags where HTML allows them to be omitted, and sometimes where it
doesn't. It will even supply start-tags where necessary; for example, if a
document begins with a <li> tag, TagSoup will automatically prefix it
with <html><body><ul>.
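The start-tag repair described above can be tried by feeding a bare list fragment on standard input. The following is a minimal sketch: the jar path is the one shown in the SYNOPSIS and may differ on your system, and the names TAGSOUP_JAR, run_tagsoup and tagsoup_li_demo are hypothetical helpers introduced here for illustration.

```sh
# Assumed install path, taken from the SYNOPSIS; override via the environment.
TAGSOUP_JAR=${TAGSOUP_JAR:-/usr/share/java/tagsoup.jar}

# Hypothetical wrapper around the command line shown in the SYNOPSIS.
run_tagsoup() {
    java -jar "$TAGSOUP_JAR" "$@"
}

# Send a fragment that begins with <li> and omits every end-tag.
# With no file arguments, TagSoup reads standard input and writes
# the repaired document to standard output.
tagsoup_li_demo() {
    printf '%s\n' '<li>first<li>second' | run_tagsoup
}
```

Calling tagsoup_li_demo should print well-formed XML in which the list items sit inside html, body and ul elements. The exact serialization is not reproduced here because it depends on the TagSoup version and the options in effect.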
BUGS
TagSoup can be fooled by missing close quotes after attribute values, and by
incorrect character encodings (it does not contain an encoding guesser).
TagSoup doesn't understand namespace declarations, which are not properly part
of HTML. Instead, any element or attribute name beginning foo: will be
put into the artificial namespace urn:x-prefix:foo.
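The prefix-to-namespace mapping can be illustrated with a short shell function. It is a sketch of the rule as stated on this page, not TagSoup's implementation; the function name tagsoup_prefix_namespace is hypothetical, and using the text before the first colon as the prefix is an assumption for names that contain more than one colon. The xml prefix is mapped to the standard XML namespace URI, matching the exception noted for explicit xml: attributes.

```sh
# Hypothetical helper: print the namespace name that a prefixed element or
# attribute name would be placed in, following the rule described above.
# Returns non-zero for names without a prefix, which this sketch does not cover.
tagsoup_prefix_namespace() {
    case $1 in
        xml:*) printf '%s\n' 'http://www.w3.org/XML/1998/namespace' ;;
        *:*)   printf 'urn:x-prefix:%s\n' "${1%%:*}" ;;   # ASSUMPTION: prefix ends at first colon
        *)     return 1 ;;
    esac
}
```

For instance, tagsoup_prefix_namespace foo:bar prints urn:x-prefix:foo.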
For the same reasons, namespace-qualified attributes like xml:space can't be
returned as default values, though an explicit attribute in the xml namespace
will be returned with the proper namespace URI.
AUTHOR
John Cowan <cowan@ccil.org>
COPYRIGHT
Copyright © 2002-2008 John Cowan
TagSoup is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.