\# This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.
\#
\# TagSoup is licensed under the Apache License,
\# Version 2.0.  You may obtain a copy of this license at
\# http://www.apache.org/licenses/LICENSE-2.0 .  You may also have
\# additional legal rights not granted by this license.
\#
\# TagSoup is distributed in the hope that it will be useful, but
\# unless required by applicable law or agreed to in writing, TagSoup
\# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
\# OF ANY KIND, either express or implied; not even the implied warranty
\# of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
\#
.TH TAGSOUP "1" "January 2008" "TagSoup 1.2.1" "User Commands"
.SH NAME
tagsoup \- convert nasty, ugly HTML to clean XHTML
.SH SYNOPSIS
.B java -jar /usr/share/java/tagsoup.jar
[
.I options
] [
.I files
]
.SH DESCRIPTION
.\" Add any additional description here
.PP
Rectify arbitrary HTML into clean XHTML,
using a tailored description of HTML.
The output will be well-formed XML, but not necessarily
.I valid
XHTML.
.PP
.TP
.B --files
multiple input
.I files
should be processed into corresponding output files
.TP
.BI --encoding= encoding
specifies the encoding of input files
.TP
.BI --output-encoding= encoding
specifies the encoding of the output
(if the encoding name begins with ``utf'',
the output will not contain character entities;
otherwise, all non-ASCII characters are
represented as entities)
.TP
.B --html
output rectified HTML rather than XML,
omitting the XML declaration
and any namespace declarations
.TP
.B --method=html
output rectified HTML rather than XML
(end-tags are omitted for empty elements,
and no character escaping is done in
script and style elements)
.TP
.B --omit-xml-declaration
omit the XML declaration
.TP
.B --lexical
output lexical features (specifically comments and any DOCTYPE declaration)
.TP
.B --nons
suppress namespaces in output
.TP
.B --nobogons
suppress unknown non-HTML elements in output
.TP
.B --nodefaults
suppress default attribute values
.TP
.B --nocolons
change explicit colons
in element and attribute names
to underscores
.TP
.B --norestart
don't restart any restartable elements
.TP
.B --ignorable
pass through ignorable whitespace
(whitespace in element-only content)
via SAX method handler ignorableWhitespace
.TP
.B --any
treat unknown non-HTML elements as allowing any content (default)
.TP
.B --emptybogons
treat unknown non-HTML elements as empty elements
.TP
.B --norootbogons
don't allow unknown non-HTML elements to be root elements
.TP
.BI --doctype-system= system-id
force DOCTYPE declaration to be output with specified system identifier
.TP
.BI --doctype-public= public-id
force DOCTYPE declaration to be output with specified public identifier
.TP
.B --standalone=[yes|no]
specify standalone pseudo-attribute in output XML declaration
.TP
.BI --version= version
specify version pseudo-attribute in output XML declaration
(does not affect actual version of XML output)
.TP
.B --nocdata
treat the CDATA-content elements
.I script
and
.I style
as ordinary elements
(mostly for testing)
.TP
.B --pyx
output PYX format rather than XML
(mostly for testing)
.TP
.B --pyxin
input is PYX-format HTML
(mostly for testing)
.TP
.B --reuse
reuse the same Parser object internally
(for testing only)
.TP
.B --help
output basic help
.TP
.B --version
output version number
.PP
.B TagSoup
is a parser and reformatter for nasty, ugly HTML.
Its normal processing mode is to accept HTML files on the command line,
or from the standard input if none are given, and output them
as clean XML
to the standard output.  The encoding is assumed to be the platform-local
encoding on input, and is always UTF-8 on output.
.PP
When the
.B --files
option is given, each input file is processed into an output file of the
corresponding name, with the extension changed to
.IR xhtml .
If the extension is already
.IR xhtml ,
it is changed to
.IR xhtml_ .
.PP
TagSoup will repair, by whatever means necessary,
violations of XML well-formedness.  In particular, it will fix up
malformed attribute names and supply missing attribute-value quotation marks.
More significantly, it supplies end-tags where HTML allows them
to be omitted, and sometimes where it doesn't.  It will even supply
start-tags where necessary; for example, if a document begins with a
<li> tag, TagSoup will automatically prefix it with <html><body><ul>.
.PP
.SH BUGS
TagSoup can be fooled by missing close quotes after attribute values, and by
incorrect character encodings (it does not contain an encoding guesser).
.PP
TagSoup doesn't understand namespace declarations, which are not properly
part of HTML.  Instead, any element or attribute name beginning
.IR foo :
will be put into the artificial namespace
.RI urn:x-prefix: foo .
.PP
For the same reasons, namespace-qualified attributes like
xml:space
can't be returned as default values,
though an explicit attribute in the xml namespace
will be returned with the proper namespace URI.
.SH AUTHOR
John Cowan <cowan@ccil.org>
.SH COPYRIGHT
Copyright \(co 2002-2008 John Cowan
.br
TagSoup is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.