.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "HTML::Encoding 3pm"
.TH HTML::Encoding 3pm "2022-12-06" "perl v5.36.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
HTML::Encoding \- Determine the encoding of HTML/XML/XHTML documents
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 3
\&  use HTML::Encoding \*(Aqencoding_from_http_message\*(Aq;
\&  use LWP::UserAgent;
\&  use Encode;
\&  
\&  my $resp = LWP::UserAgent\->new\->get(\*(Aqhttp://www.example.org\*(Aq);
\&  my $enco = encoding_from_http_message($resp);
\&  my $utf8 = decode($enco => $resp\->content);
.Ve
.SH "WARNING"
.IX Header "WARNING"
The interface and implementation are guaranteed to change before this
module reaches version 1.00! Please send feedback to the author of
this module.
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
HTML::Encoding helps to determine the encoding of \s-1HTML\s0 and \s-1XML/XHTML\s0
documents...
.SH "DEFAULT ENCODINGS"
.IX Header "DEFAULT ENCODINGS"
Most routines need to know some suspected character encodings which
can be provided through the \f(CW\*(C`encodings\*(C'\fR option. This option always
defaults to the \f(CW$HTML::Encoding::DEFAULT_ENCODINGS\fR array reference
which means the following encodings are considered by default:
.PP
.Vb 6
\&  * ISO\-8859\-1
\&  * UTF\-16LE
\&  * UTF\-16BE
\&  * UTF\-32LE
\&  * UTF\-32BE
\&  * UTF\-8
.Ve
.PP
If you change the values or pass custom values to the routines, note
that Encode must support them in order for this module to work
correctly.
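.PP
As a rough sketch, a custom list could be filtered through Encode before
it is passed on. The variable names \f(CW@try\fR and \f(CW$octets\fR and the chosen
encoding names are assumptions made for this example only:
.PP
.Vb 2
\&  use Encode ();
\&  use HTML::Encoding \*(Aqencoding_from_byte_order_mark\*(Aq;
\&
\&  # Keep only the names that Encode can resolve on this system.
\&  my @try = grep { defined Encode::find_encoding($_) }
\&            qw/UTF\-8 UTF\-16LE UTF\-16BE/;
\&
\&  # The encodings option takes an array reference.
\&  my $octets = "\exEF\exBB\exBF<p>...</p>";
\&  my $enc    = encoding_from_byte_order_mark($octets, encodings => \e@try);
.Ve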
.SH "ENCODING SOURCES"
.IX Header "ENCODING SOURCES"
\&\f(CW\*(C`encoding_from_xml_document\*(C'\fR, \f(CW\*(C`encoding_from_html_document\*(C'\fR, and
\&\f(CW\*(C`encoding_from_http_message\*(C'\fR return the encoding source and the
encoding name in list context. Possible encoding sources are:
.PP
.Vb 6
\&  * protocol         (Content\-Type: text/html;charset=encoding)
\&  * bom              (leading U+FEFF)
\&  * xml              (<?xml version=\*(Aq1.0\*(Aq encoding=\*(Aqencoding\*(Aq?>)
\&  * meta             (<meta http\-equiv=...)
\&  * default          (default fallback value)
\&  * protocol_default (protocol default)
.Ve
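.PP
For instance, a caller interested in where the encoding was declared
could unpack both values. This is a minimal sketch that reuses the
request from the \s-1SYNOPSIS\s0; the variable names are placeholders:
.PP
.Vb 2
\&  use HTML::Encoding \*(Aqencoding_from_http_message\*(Aq;
\&  use LWP::UserAgent;
\&
\&  my $resp = LWP::UserAgent\->new\->get(\*(Aqhttp://www.example.org\*(Aq);
\&
\&  # List context: encoding source first, then the encoding name.
\&  my ($source, $name) = encoding_from_http_message($resp);
\&  print "$name (declared via $source)\en" if defined $name;
.Ve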
.SH "ROUTINES"
.IX Header "ROUTINES"
Routines exported by this module at user option. By default, nothing
is exported.
.IP "encoding_from_content_type($content_type)" 2
.IX Item "encoding_from_content_type($content_type)"
Takes a byte string and uses HTTP::Headers::Util to extract the
charset parameter from the \f(CW\*(C`Content\-Type\*(C'\fR header value. Returns
the parameter value, or \f(CW\*(C`undef\*(C'\fR (an empty list in list context) if
there is no such value. Only the first component is examined
(\s-1HTTP/1.1\s0 only allows for one component). Backslash escapes in
strings are unescaped, leading and trailing quote marks and
white-space characters are removed, runs of white-space are
collapsed to a single space, and empty charset values are ignored.
No case folding is performed.
.Sp
Examples:
.Sp
.Vb 11
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+
\&  | encoding_from_content_type(...)         | returns   |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+
\&  | "text/html"                             | undef     |
\&  | "text/html,text/plain;charset=utf\-8"    | undef     |
\&  | "text/html;charset="                    | undef     |
\&  | "text/html;charset=\e"\e\eu\e\et\e\ef\e\e\-\e\e8\e"" | \*(Aqutf\-8\*(Aq   |
\&  | "text/html;charset=utf\e\e\-8"             | \*(Aqutf\e\e\-8\*(Aq |
\&  | "text/html;charset=\*(Aqutf\-8\*(Aq"             | \*(Aqutf\-8\*(Aq   |
\&  | "text/html;charset=\e" UTF\-8 \e""         | \*(AqUTF\-8\*(Aq   |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+
.Ve
.Sp
If you pass a string with the \s-1UTF\-8\s0 flag turned on, the string will
be converted to bytes before it is passed to HTTP::Headers::Util.
The return value will thus never have the \s-1UTF\-8\s0 flag turned on (this
might change in future versions).
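.Sp
A short usage sketch, based on two inputs from the table above; the
variable names are placeholders:
.Sp
.Vb 1
\&  use HTML::Encoding \*(Aqencoding_from_content_type\*(Aq;
\&
\&  # Quoted value with surrounding white\-space: \*(AqUTF\-8\*(Aq per the table.
\&  my $enc = encoding_from_content_type(\*(Aqtext/html;charset=" UTF\-8 "\*(Aq);
\&
\&  # No charset parameter: undef in scalar context, so test with defined.
\&  my $none = encoding_from_content_type(\*(Aqtext/html\*(Aq);
\&  print "no charset declared\en" unless defined $none;
.Ve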
.ie n .IP "encoding_from_byte_order_mark($octets [, %options])" 2
.el .IP "encoding_from_byte_order_mark($octets [, \f(CW%options\fR])" 2
.IX Item "encoding_from_byte_order_mark($octets [, %options])"
Takes a sequence of octets and attempts to read a byte order mark
at the beginning of the octet sequence. It will go through the list
of \f(CW$options\fR{encodings} or the list of default encodings if no
encodings are specified and match the beginning of the string against
any byte order mark octet sequence found.
.Sp
The result can be ambiguous, for example qq(\exFF\exFE\ex00\ex00) could
be either a complete \s-1BOM\s0 in \s-1UTF\-32LE\s0 or a \s-1UTF\-16LE BOM\s0 followed by a
U+0000 character. It is also possible that \f(CW$octets\fR starts with
something that looks like a byte order mark but actually is not.
.Sp
encoding_from_byte_order_mark sorts the list of possible encodings
by the length of their \s-1BOM\s0 octet sequence and returns in scalar
context only the encoding with the longest match, and all encodings
ordered by length of their \s-1BOM\s0 octet sequence in list context.
.Sp
Examples:
.Sp
.Vb 12
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+
\&  | Input                   | Encodings  | Result                |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+
\&  | "\exFF\exFE\ex00\ex00"      | default    | qw(UTF\-32LE)          |
\&  | "\exFF\exFE\ex00\ex00"      | default    | qw(UTF\-32LE UTF\-16LE) |
\&  | "\exEF\exBB\exBF"          | default    | qw(UTF\-8)             |
\&  | "Hello World!"          | default    | undef                 |
\&  | "\exDD\ex73\ex66\ex73"      | default    | undef                 |
\&  | "\exDD\ex73\ex66\ex73"      | UTF\-EBCDIC | qw(UTF\-EBCDIC)        |
\&  | "\ex2B\ex2F\ex76\ex38\ex2D"  | default    | undef                 |
\&  | "\ex2B\ex2F\ex76\ex38\ex2D"  | UTF\-7      | qw(UTF\-7)             |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+
.Ve
.Sp
Note, however, that for \s-1UTF\-7\s0 it is in theory possible that the U+FEFF
combines with other characters, in which case such detection would fail.
For example, consider:
.Sp
.Vb 6
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+
\&  | Input                                | Encodings | Result    |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+
\&  | "\ex2B\ex2F\ex76\ex38\ex41\ex39\ex67\ex2D"   | default   | undef     |
\&  | "\ex2B\ex2F\ex76\ex38\ex41\ex39\ex67\ex2D"   | UTF\-7     | undef     |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+
.Ve
.Sp
This might change in future versions, although this is not very
relevant for most applications as there should never be a need to use
\&\s-1UTF\-7\s0 in the encoding list for existing documents.
.Sp
If no \s-1BOM\s0 can be found it returns \f(CW\*(C`undef\*(C'\fR in scalar context and an
empty list in list context. This routine should not be used with
strings with the \s-1UTF\-8\s0 flag turned on.
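.Sp
The difference between the two calling contexts can be sketched as
follows, using the ambiguous input from the first table above. The
variable names are placeholders:
.Sp
.Vb 1
\&  use HTML::Encoding \*(Aqencoding_from_byte_order_mark\*(Aq;
\&
\&  my $octets = "\exFF\exFE\ex00\ex00";
\&
\&  # Scalar context: only the longest match (UTF\-32LE per the table).
\&  my $best = encoding_from_byte_order_mark($octets);
\&
\&  # List context: all candidates, longest BOM octet sequence first.
\&  my @all = encoding_from_byte_order_mark($octets);
.Ve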
.IP "encoding_from_xml_declaration($declaration)" 2
.IX Item "encoding_from_xml_declaration($declaration)"
Attempts to extract the value of the encoding pseudo-attribute in an \s-1XML\s0
declaration or text declaration in the character string \f(CW$declaration\fR. If
there does not appear to be such a value it returns nothing. This would
typically be used with the return values of xml_declaration_from_octets.
Normalizes white-space in the same way as encoding_from_content_type.
.Sp
Examples:
.Sp
.Vb 10
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-+
\&  | encoding_from_xml_declaration(...)        | Result  |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-+
\&  | "<?xml version=\*(Aq1.0\*(Aq encoding=\*(Aqutf\-8\*(Aq?>"  | \*(Aqutf\-8\*(Aq |
\&  | "<?xml encoding=\*(Aqutf\-8\*(Aq?>"                | \*(Aqutf\-8\*(Aq |
\&  | "<?xml encoding=\e"utf\-8\e"?>"              | \*(Aqutf\-8\*(Aq |
\&  | "<?xml foo=\*(Aqbar\*(Aq encoding=\*(Aqutf\-8\*(Aq?>"      | \*(Aqutf\-8\*(Aq |
\&  | "<?xml encoding=\*(Aqa\*(Aq encoding=\*(Aqb\*(Aq?>"       | \*(Aqa\*(Aq     |
\&  | "<?xml encoding=\*(Aq a    b \*(Aq?>"             | \*(Aqa b\*(Aq   |
\&  | "<?xml\-stylesheet encoding=\*(Aqutf\-8\*(Aq?>"     | undef   |
\&  | " <?xml encoding=\*(Aqutf\-8\*(Aq?>"               | undef   |
\&  | "<?xml encoding =\ex{2028}\*(Aqutf\-8\*(Aq?>"       | \*(Aqutf\-8\*(Aq |
\&  | "<?xml version=\*(Aq1.0\*(Aq encoding=utf\-8?>"    | undef   |
\&  | "<?xml x=\*(Aqencoding=\e"a\e"\*(Aq encoding=\*(Aqb\*(Aq?>" | \*(Aqa\*(Aq     |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-+
.Ve
.Sp
Note that \fBencoding_from_xml_declaration()\fR determines the encoding even
if the \s-1XML\s0 declaration is not well-formed or violates other requirements
of the relevant \s-1XML\s0 specification as long as it can find an encoding
pseudo-attribute in the provided string. This means \s-1XML\s0 processors must
apply further checks to determine whether the entity is well-formed, etc.
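.Sp
A minimal sketch of feeding the result of xml_declaration_from_octets
into this routine. \f(CW$octets\fR is a placeholder for the raw document
bytes, and the sample document is made up for illustration:
.Sp
.Vb 2
\&  use HTML::Encoding qw(xml_declaration_from_octets
\&                        encoding_from_xml_declaration);
\&
\&  my $octets = "<?xml version=\*(Aq1.0\*(Aq encoding=\*(Aqutf\-8\*(Aq?><doc/>";
\&
\&  # Scalar context gives the best\-matching declaration, if any.
\&  my $decl = xml_declaration_from_octets($octets);
\&
\&  # Only look for the pseudo\-attribute when a declaration was found.
\&  my $enc = defined $decl ? encoding_from_xml_declaration($decl) : undef;
.Ve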
.ie n .IP "xml_declaration_from_octets($octets [, %options])" 2
.el .IP "xml_declaration_from_octets($octets [, \f(CW%options\fR])" 2
.IX Item "xml_declaration_from_octets($octets [, %options])"
Attempts to find a \*(L">\*(R" character in the byte string \f(CW$octets\fR using the
encodings in \f(CW$encodings\fR and upon success attempts to find a preceding
\&\*(L"<\*(R" character. Returns all the strings found this way in the order of
number of successful matches in list context and the best match in
scalar context. Should probably be combined with the only user of this
routine, encoding_from_xml_declaration... You can modify the list of
suspected encodings using \f(CW$options\fR{encodings}.
.ie n .IP "encoding_from_first_chars($octets [, %options])" 2
.el .IP "encoding_from_first_chars($octets [, \f(CW%options\fR])" 2
.IX Item "encoding_from_first_chars($octets [, %options])"
Assuming that documents start with \*(L"<\*(R" optionally preceded by whitespace
characters, encoding_from_first_chars attempts to determine an encoding
by matching \f(CW$octets\fR against something like /^[@{$options{whitespace}}]*</
in the various suspected \f(CW$options\fR{encodings}.
.Sp
This is useful to distinguish e.g. \s-1UTF\-16LE\s0 from \s-1UTF\-8\s0 if the byte string
does not start with a byte order mark nor an \s-1XML\s0 declaration (e.g. if the
document is a \s-1HTML\s0 document) to get at least a base encoding which can be
used to decode enough of the document to find <meta> elements using
encoding_from_meta_element. \f(CW$options\fR{whitespace} defaults to qw/CR \s-1LF SP TB/.\s0
Returns nothing if unsuccessful. Returns the matching encodings in order
of the number of octets matched in list context and the best match in
scalar context.
.Sp
Examples:
.Sp
.Vb 10
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+
\&  | String        | Encoding | Result              |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+
\&  | \*(Aq<!DOCTYPE \*(Aq  | UTF\-16LE | UTF\-16LE            |
\&  | \*(Aq <!DOCTYPE \*(Aq | UTF\-16LE | UTF\-16LE            |
\&  | \*(Aq...\*(Aq         | UTF\-16LE | undef               |
\&  | \*(Aq...<\*(Aq        | UTF\-16LE | undef               |
\&  | \*(Aq<\*(Aq           | UTF\-8    | ISO\-8859\-1 or UTF\-8 |
\&  | "<!\-\-\exF6\-\->" | UTF\-8    | ISO\-8859\-1 or UTF\-8 |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+
.Ve
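.Sp
The two-step approach described above might look roughly like this.
\f(CW$octets\fR is a placeholder and the sample document is made up for
illustration:
.Sp
.Vb 2
\&  use HTML::Encoding qw(encoding_from_first_chars
\&                        encoding_from_meta_element);
\&
\&  my $octets = "<!DOCTYPE html><meta http\-equiv=Content\-Type"
\&             . " content=\*(Aqtext/html;charset=utf\-8\*(Aq><title></title>";
\&
\&  # Step 1: guess a base encoding from the leading "<".
\&  my $base = encoding_from_first_chars($octets);
\&
\&  # Step 2: decode with that guess and look for <meta> declarations.
\&  my $enc = defined $base
\&          ? encoding_from_meta_element($octets, $base)
\&          : undef;
.Ve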
.ie n .IP "encoding_from_meta_element($octets, $encname [, %options])" 2
.el .IP "encoding_from_meta_element($octets, \f(CW$encname\fR [, \f(CW%options\fR])" 2
.IX Item "encoding_from_meta_element($octets, $encname [, %options])"
Attempts to find <meta> elements in the document using HTML::Parser.
It will attempt to decode chunks of the byte string using \f(CW$encname\fR
to characters before passing the data to HTML::Parser. An optional
\&\f(CW%options\fR hash can be provided which will be passed to the HTML::Parser
constructor. It will stop processing the document if it encounters
.Sp
.Vb 4
\&  * </head>
\&  * encoding errors
\&  * the end of the input
\&  * ... (see todo)
.Ve
.Sp
If relevant <meta> elements, i.e. something like
.Sp
.Vb 1
\&  <meta http\-equiv=Content\-Type content=\*(Aq...\*(Aq>
.Ve
.Sp
are found, uses encoding_from_content_type to extract the charset
parameter. It returns all such encodings it could find in document
order in list context or the first encoding in scalar context (it
will currently look for others regardless of calling context) or
nothing if that fails for some reason.
.Sp
Note that there are many edge cases where this does not yield
\&\*(L"proper\*(R" results, depending on the capabilities of the HTML::Parser
version and the options you pass to it, for example,
.Sp
.Vb 6
\&  <!DOCTYPE html PUBLIC "\-//W3C//DTD HTML 4.01//EN" [
\&    <!ENTITY content_type "text/html;charset=utf\-8">
\&  ]>
\&  <meta http\-equiv="Content\-Type" content="&content_type;">
\&  <title></title>
\&  <p>...</p>
.Ve
.Sp
This would likely not detect the \f(CW\*(C`utf\-8\*(C'\fR value if HTML::Parser
does not resolve the entity. This should however only be a concern
for documents specifically crafted to break the encoding detection.
.ie n .IP "encoding_from_xml_document($octets, [, %options])" 2
.el .IP "encoding_from_xml_document($octets, [, \f(CW%options\fR])" 2
.IX Item "encoding_from_xml_document($octets, [, %options])"
Uses encoding_from_byte_order_mark to detect the encoding using a
byte order mark in the byte string and returns the return value of
that routine if it succeeds. Uses xml_declaration_from_octets and
encoding_from_xml_declaration and returns the encoding for which
the latter routine found most matches in scalar context, and all
encodings ordered by number of occurrences in list context. It
does not return a value if neither a byte order mark nor an inbound
declaration declares a character encoding.
.Sp
Examples:
.Sp
.Vb 10
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+
\&  | Input                      | Encoding | Encodings | Result   |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+
\&  | "<?xml?>"                  | UTF\-16   | default   | UTF\-16BE |
\&  | "<?xml?>"                  | UTF\-16LE | default   | undef    |
\&  | "<?xml encoding=\*(Aqutf\-8\*(Aq?>" | UTF\-16LE | default   | utf\-8    |
\&  | "<?xml encoding=\*(Aqutf\-8\*(Aq?>" | UTF\-16   | default   | UTF\-16BE |
\&  | "<?xml encoding=\*(Aqcp37\*(Aq?>"  | CP37     | default   | undef    |
\&  | "<?xml encoding=\*(Aqcp37\*(Aq?>"  | CP37     | CP37      | cp37     |
\&  +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+
.Ve
.Sp
Lacking a return value from this routine and higher-level protocol
information (such as protocol encoding defaults) processors would
be required to assume that the document is \s-1UTF\-8\s0 encoded.
.Sp
Note however that the return value depends on the set of suspected
encodings you pass to it. For example, by default, \s-1EBCDIC\s0 encodings
would not be considered and thus for
.Sp
.Vb 1
\&  <?xml version=\*(Aq1.0\*(Aq encoding=\*(Aqcp37\*(Aq?>
.Ve
.Sp
this routine would return the undefined value. You can modify the
list of suspected encodings using \f(CW$options\fR{encodings}.
.ie n .IP "encoding_from_html_document($octets, [, %options])" 2
.el .IP "encoding_from_html_document($octets, [, \f(CW%options\fR])" 2
.IX Item "encoding_from_html_document($octets, [, %options])"
Uses encoding_from_xml_document and encoding_from_meta_element to
determine the encoding of \s-1HTML\s0 documents. If \f(CW$options\fR{xhtml} is
set to a false value uses encoding_from_byte_order_mark and 
encoding_from_meta_element to determine the encoding. The xhtml
option is on by default. The \f(CW$options\fR{encodings} can be used to
modify the suspected encodings and \f(CW$options\fR{parser_options} can
be used to modify the HTML::Parser options in
encoding_from_meta_element (see the relevant documentation).
.Sp
Returns nothing if no declaration could be found, the winning
declaration in scalar context and a list of encoding source
and encoding name in list context, see \s-1ENCODING SOURCES.\s0
.Sp
\&...
.Sp
Other problems arise from differences between \s-1HTML\s0 and \s-1XHTML\s0 syntax
and encoding detection rules, for example, the input could be
.Sp
.Vb 1
\&  Content\-Type: text/html
\&
\&  <?xml version=\*(Aq1.0\*(Aq encoding=\*(Aqutf\-8\*(Aq?>
\&  <!DOCTYPE html PUBLIC "\-//W3C//DTD HTML 4.01//EN"
\&  "http://www.w3.org/TR/html4/strict.dtd">
\&  <meta http\-equiv = "Content\-Type"
\&           content = "text/html;charset=iso\-8859\-2">
\&  <title></title>
\&  <p>...</p>
.Ve
.Sp
This is a perfectly legal \s-1HTML 4.01\s0 document and implementations
might be expected to consider the document \s-1ISO\-8859\-2\s0 encoded as
\&\s-1XML\s0 rules for encoding detection do not apply to \s-1HTML\s0 documents.
This module attempts to avoid making decisions which rules apply
for a specific document and would thus by default return 'utf\-8'
for this input.
.Sp
On the other hand, if the input omits the encoding declaration,
.Sp
.Vb 1
\&  Content\-Type: text/html
\&
\&  <?xml version=\*(Aq1.0\*(Aq?>
\&  <!DOCTYPE html PUBLIC "\-//W3C//DTD HTML 4.01//EN"
\&  "http://www.w3.org/TR/html4/strict.dtd">
\&  <meta http\-equiv = "Content\-Type"
\&           content = "text/html;charset=iso\-8859\-2">
\&  <title></title>
\&  <p>...</p>
.Ve
.Sp
It would return 'iso\-8859\-2'. Similar problems would arise from
other differences between \s-1HTML\s0 and \s-1XHTML,\s0 for example consider
.Sp
.Vb 1
\&  Content\-Type: text/html
\&
\&  <?foo >
\&  <!DOCTYPE html PUBLIC "\-//W3C//DTD XHTML 1.0 Strict//EN"
\&      "http://www.w3.org/TR/xhtml1/DTD/xhtml1\-strict.dtd">
\&  <html ...
\&  ?>
\&  ...
\&  <!DOCTYPE html PUBLIC "\-//W3C//DTD HTML 4.01//EN">
\&  ...
.Ve
.Sp
If this is processed using \s-1HTML\s0 rules, the first > will end the
processing instruction and the \s-1XHTML\s0 document type declaration
would be the relevant declaration for the document. If it is
processed using \s-1XHTML\s0 rules, the ?> will end the processing
instruction and the \s-1HTML\s0 document type declaration would be the
relevant declaration.
.Sp
In other words, an application would need to assume a certain character
encoding (family) to process enough of the document to determine
whether it is \s-1XHTML\s0 or \s-1HTML\s0 and the result of this detection would
depend on which processing rules are assumed in order to process it.
It is thus in essence not possible to write a \*(L"perfect\*(R" detection
algorithm, which is why this routine attempts to avoid making any
decisions on this matter.
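.Sp
A caller who knows which rules should apply can say so explicitly. The
following sketch turns the xhtml option off so that only the byte order
mark and <meta> elements are consulted; \f(CW$octets\fR is a placeholder and
the sample document is made up for illustration:
.Sp
.Vb 1
\&  use HTML::Encoding \*(Aqencoding_from_html_document\*(Aq;
\&
\&  my $octets = "<?xml version=\*(Aq1.0\*(Aq encoding=\*(Aqutf\-8\*(Aq?>"
\&             . "<meta http\-equiv=\*(AqContent\-Type\*(Aq"
\&             . " content=\*(Aqtext/html;charset=iso\-8859\-2\*(Aq>";
\&
\&  # Scalar context: the winning declaration, or undef if none is found.
\&  my $enc = encoding_from_html_document($octets, xhtml => 0);
\&
\&  # List context: encoding source and encoding name.
\&  my ($source, $name) = encoding_from_html_document($octets, xhtml => 0);
.Ve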
.ie n .IP "encoding_from_http_message($message [, %options])" 2
.el .IP "encoding_from_http_message($message [, \f(CW%options\fR])" 2
.IX Item "encoding_from_http_message($message [, %options])"
Determines the encoding of \s-1HTML / XML / XHTML\s0 documents enclosed
in an \s-1HTTP\s0 message. \f(CW$message\fR is an object compatible with HTTP::Message,
e.g. an HTTP::Response object. \f(CW%options\fR is a hash with the following
possible entries:
.RS 2
.IP "encodings" 2
.IX Item "encodings"
Array reference of suspected character encodings, defaults to
\&\f(CW$HTML::Encoding::DEFAULT_ENCODINGS\fR.
.IP "is_html" 2
.IX Item "is_html"
Regular expression matched against the content_type of the message
to determine whether to use \s-1HTML\s0 rules for the entity body, defaults
to \f(CW\*(C`qr{^text/html$}i\*(C'\fR.
.IP "is_xml" 2
.IX Item "is_xml"
Regular expression matched against the content_type of the message
to determine whether to use \s-1XML\s0 rules for the entity body, defaults
to \f(CW\*(C`qr{^.+/(?:.+\e+)?xml$}i\*(C'\fR.
.IP "is_text_xml" 2
.IX Item "is_text_xml"
Regular expression matched against the content_type of the message
to determine whether to use text/xml rules for the message, defaults
to \f(CW\*(C`qr{^text/(?:.+\e+)?xml$}i\*(C'\fR. This will only be checked if is_xml
matches as well.
.IP "html_default" 2
.IX Item "html_default"
Default encoding for documents determined (by is_html) as \s-1HTML,\s0
defaults to \f(CW\*(C`ISO\-8859\-1\*(C'\fR.
.IP "xml_default" 2
.IX Item "xml_default"
Default encoding for documents determined (by is_xml) as \s-1XML,\s0
defaults to \f(CW\*(C`UTF\-8\*(C'\fR.
.IP "text_xml_default" 2
.IX Item "text_xml_default"
Default encoding for documents determined (by is_text_xml) as text/xml,
defaults to \f(CW\*(C`undef\*(C'\fR in which case the default is ignored. This should
be set to \f(CW\*(C`US\-ASCII\*(C'\fR if desired as this module is by default
inconsistent with \s-1RFC 3023\s0 which requires that for text/xml documents
without a charset parameter in the \s-1HTTP\s0 header \f(CW\*(C`US\-ASCII\*(C'\fR is assumed.
.Sp
This requirement is inconsistent with \s-1RFC 2616\s0 (\s-1HTTP/1.1\s0), which requires
assuming \f(CW\*(C`ISO\-8859\-1\*(C'\fR; it has been widely ignored and is thus disabled by
default.
.IP "xhtml" 2
.IX Item "xhtml"
Whether the routine should look for an encoding declaration in the
\&\s-1XML\s0 declaration of the document (if any), defaults to \f(CW1\fR.
.IP "default" 2
.IX Item "default"
Whether the relevant default value should be returned when no other
information can be determined, defaults to \f(CW1\fR.
.RE
.RS 2
.Sp
This is further possibly inconsistent with \s-1XML MIME\s0 types that differ
in other ways from application/xml, for example if the \s-1MIME\s0 Type does
not allow for a charset parameter in which case applications might be
expected to ignore the charset parameter if erroneously provided.
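.Sp
As an illustration, options are passed as key/value pairs after the
message. This sketch opts into the \s-1RFC 3023\s0 default for text/xml and
reuses the request from the \s-1SYNOPSIS\s0; the variable names are
placeholders:
.Sp
.Vb 2
\&  use HTML::Encoding \*(Aqencoding_from_http_message\*(Aq;
\&  use LWP::UserAgent;
\&
\&  my $resp = LWP::UserAgent\->new\->get(\*(Aqhttp://www.example.org\*(Aq);
\&
\&  # Assume US\-ASCII for text/xml when nothing else declares an encoding.
\&  my $enc = encoding_from_http_message($resp,
\&    text_xml_default => \*(AqUS\-ASCII\*(Aq,
\&  );
.Ve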
.RE
.SH "EBCDIC SUPPORT"
.IX Header "EBCDIC SUPPORT"
By default, this module does not support \s-1EBCDIC\s0 encodings. To enable
support for \s-1EBCDIC\s0 encodings you can either change the
\&\f(CW$HTML::Encoding::DEFAULT_ENCODINGS\fR array reference or pass the
encodings to the routines you use using the encodings option, for
example
.PP
.Vb 2
\&  my @try = qw/UTF\-8 UTF\-16LE cp500 posix\-bc .../;
\&  my $enc = encoding_from_xml_document($doc, encodings => \e@try);
.Ve
.PP
Note that there are some subtle differences between various \s-1EBCDIC\s0
encodings, for example \f(CW\*(C`!\*(C'\fR is mapped to 0x5A in \f(CW\*(C`posix\-bc\*(C'\fR and
to 0x4F in \f(CW\*(C`cp500\*(C'\fR; these differences might affect processing in
as yet undetermined ways.
.SH "TODO"
.IX Header "TODO"
.Vb 4
\&  * bundle with test suite
\&  * optimize some routines to give up once successful
\&  * avoid transcoding for HTML::Parser if e.g. ISO\-8859\-1
\&  * consider adding a "HTML5" modus of operation?
.Ve
.SH "SEE ALSO"
.IX Header "SEE ALSO"
.Vb 12
\&  * http://www.w3.org/TR/REC\-xml/#charencoding
\&  * http://www.w3.org/TR/REC\-xml/#sec\-guessing
\&  * http://www.w3.org/TR/xml11/#charencoding
\&  * http://www.w3.org/TR/xml11/#sec\-guessing
\&  * http://www.w3.org/TR/html4/charset.html#h\-5.2.2
\&  * http://www.w3.org/TR/xhtml1/#C_9
\&  * http://www.ietf.org/rfc/rfc2616.txt
\&  * http://www.ietf.org/rfc/rfc2854.txt
\&  * http://www.ietf.org/rfc/rfc3023.txt
\&  * perlunicode
\&  * Encode
\&  * HTML::Parser
.Ve
.SH "AUTHOR / COPYRIGHT / LICENSE"
.IX Header "AUTHOR / COPYRIGHT / LICENSE"
.Vb 2
\&  Copyright (c) 2004\-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
\&  This module is licensed under the same terms as Perl itself.
.Ve