NAME
HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
SYNOPSIS
use HTML::Encoding 'encoding_from_http_message';
use LWP::UserAgent;
use Encode;
my $resp = LWP::UserAgent->new->get('http://www.example.org');
my $enco = encoding_from_http_message($resp);
my $utf8 = decode($enco => $resp->content);
WARNING
The interface and implementation are guaranteed to change before this module
reaches version 1.00! Please send feedback to the author of this module.
DESCRIPTION
HTML::Encoding helps to determine the encoding of HTML and XML/XHTML
documents...
DEFAULT ENCODINGS
Most routines need a list of suspected character encodings, which can be
provided through the "encodings" option. This option defaults
to the $HTML::Encoding::DEFAULT_ENCODINGS array reference, which means the
following encodings are considered by default:
* ISO-8859-1
* UTF-16LE
* UTF-16BE
* UTF-32LE
* UTF-32BE
* UTF-8
If you change these values or pass custom values to the routines, note that
Encode must support them in order for this module to work correctly.
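As a sketch (the $octets variable is assumed to hold the raw document bytes),
the default list can be replaced globally, or overridden per call via the
encodings option:

  use HTML::Encoding 'encoding_from_byte_order_mark';

  # Replace the global default list; use only names Encode supports
  $HTML::Encoding::DEFAULT_ENCODINGS = [qw/UTF-8 UTF-16LE UTF-16BE/];

  # Or override per call without touching the global default
  my $enc = encoding_from_byte_order_mark($octets,
      encodings => [qw/UTF-8 UTF-16LE UTF-16BE/]);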
ENCODING SOURCES
"encoding_from_xml_document", "encoding_from_html_document",
and "encoding_from_http_message" return, in list context, the encoding
source and the encoding name. Possible encoding sources are:
* protocol (Content-Type: text/html;charset=encoding)
* bom (leading U+FEFF)
* xml (<?xml version='1.0' encoding='encoding'?>)
* meta (<meta http-equiv=...)
* default (default fallback value)
* protocol_default (protocol default)
ROUTINES
Routines exported by this module at user option. By default, nothing is
exported.
- encoding_from_content_type($content_type)
- Takes a byte string and uses HTTP::Headers::Util to extract
the charset parameter from the "Content-Type" header value, returning
its value, or "undef" (or an empty list in list context)
if there is no such value. Only the first component is examined
(HTTP/1.1 allows only one component); backslash escapes in strings
are unescaped, leading and trailing quote marks and white-space
characters are removed, internal white-space is collapsed to a single
space, empty charset values are ignored, and no case folding is
performed.
Examples:
+-----------------------------------------+-----------+
| encoding_from_content_type(...) | returns |
+-----------------------------------------+-----------+
| "text/html" | undef |
| "text/html,text/plain;charset=utf-8" | undef |
| "text/html;charset=" | undef |
| "text/html;charset=\"\\u\\t\\f\\-\\8\"" | 'utf-8' |
| "text/html;charset=utf\\-8" | 'utf\\-8' |
| "text/html;charset='utf-8'" | 'utf-8' |
| "text/html;charset=\" UTF-8 \"" | 'UTF-8' |
+-----------------------------------------+-----------+
If you pass a string with the UTF-8 flag turned on, it will be
converted to bytes before being passed to HTTP::Headers::Util. The return
value will thus never have the UTF-8 flag turned on (this might change in
future versions).
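For example, extracting the charset parameter from the last header value shown
in the table above:

  use HTML::Encoding 'encoding_from_content_type';

  my $charset = encoding_from_content_type('text/html;charset=" UTF-8 "');
  # 'UTF-8' per the table above; surrounding quotes and white-space
  # are removed, and no case folding is performed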
- encoding_from_byte_order_mark($octets [, %options])
- Takes a sequence of octets and attempts to read a byte
order mark at the beginning of the octet sequence. It will go through the
list of $options{encodings} (or the list of default encodings if none
are specified) and match the beginning of the string against each
encoding's byte order mark octet sequence.
The result can be ambiguous; for example, qq(\xFF\xFE\x00\x00) could be either
a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a U+0000
character. It is also possible that $octets starts with something that
looks like a byte order mark but actually is not.
encoding_from_byte_order_mark sorts the list of possible encodings by the
length of their BOM octet sequence and returns in scalar context only the
encoding with the longest match, and all encodings ordered by length of
their BOM octet sequence in list context.
Examples (the two "\xFF\xFE\x00\x00" rows show the scalar-context and
list-context results, respectively):
+-------------------------+------------+-----------------------+
| Input | Encodings | Result |
+-------------------------+------------+-----------------------+
| "\xFF\xFE\x00\x00" | default | qw(UTF-32LE) |
| "\xFF\xFE\x00\x00" | default | qw(UTF-32LE UTF-16LE) |
| "\xEF\xBB\xBF" | default | qw(UTF-8) |
| "Hello World!" | default | undef |
| "\xDD\x73\x66\x73" | default | undef |
| "\xDD\x73\x66\x73" | UTF-EBCDIC | qw(UTF-EBCDIC) |
| "\x2B\x2F\x76\x38\x2D" | default | undef |
| "\x2B\x2F\x76\x38\x2D" | UTF-7 | qw(UTF-7) |
+-------------------------+------------+-----------------------+
Note, however, that for UTF-7 it is in theory possible for the U+FEFF
to combine with other characters, in which case such detection would fail;
for example, consider:
+--------------------------------------+-----------+-----------+
| Input | Encodings | Result |
+--------------------------------------+-----------+-----------+
| "\x2B\x2F\x76\x38\x41\x39\x67\x2D" | default | undef |
| "\x2B\x2F\x76\x38\x41\x39\x67\x2D" | UTF-7 | undef |
+--------------------------------------+-----------+-----------+
This might change in future versions, although this is not very relevant for
most applications, as there should never be a need to include UTF-7 in the
encoding list for existing documents.
If no BOM can be found, it returns "undef" in scalar context and an
empty list in list context. This routine should not be used with strings
that have the UTF-8 flag turned on.
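The scalar/list context difference described above can be sketched as follows:

  use HTML::Encoding 'encoding_from_byte_order_mark';

  # Scalar context: only the encoding with the longest BOM match
  my $enc  = encoding_from_byte_order_mark("\xFF\xFE\x00\x00");
  # 'UTF-32LE'

  # List context: all matches, ordered by BOM length
  my @encs = encoding_from_byte_order_mark("\xFF\xFE\x00\x00");
  # ('UTF-32LE', 'UTF-16LE')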
- encoding_from_xml_declaration($declaration)
- Attempts to extract the value of the encoding
pseudo-attribute in an XML declaration or text declaration in the
character string $declaration. If there does not appear to be such a value,
it returns nothing. This would typically be used with the return values of
xml_declaration_from_octets. Whitespace is normalized as in
encoding_from_content_type.
Examples:
+-------------------------------------------+---------+
| encoding_from_xml_declaration(...) | Result |
+-------------------------------------------+---------+
| "<?xml version='1.0' encoding='utf-8'?>" | 'utf-8' |
| "<?xml encoding='utf-8'?>" | 'utf-8' |
| "<?xml encoding=\"utf-8\"?>" | 'utf-8' |
| "<?xml foo='bar' encoding='utf-8'?>" | 'utf-8' |
| "<?xml encoding='a' encoding='b'?>" | 'a' |
| "<?xml encoding=' a b '?>" | 'a b' |
| "<?xml-stylesheet encoding='utf-8'?>" | undef |
| " <?xml encoding='utf-8'?>" | undef |
| "<?xml encoding =\x{2028}'utf-8'?>" | 'utf-8' |
| "<?xml version='1.0' encoding=utf-8?>" | undef |
| "<?xml x='encoding=\"a\"' encoding='b'?>" | 'a' |
+-------------------------------------------+---------+
Note that encoding_from_xml_declaration() determines the encoding
even if the XML declaration is not well-formed or violates other
requirements of the relevant XML specification as long as it can find an
encoding pseudo-attribute in the provided string. This means XML
processors must apply further checks to determine whether the entity is
well-formed, etc.
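A minimal usage sketch, mirroring the first row of the table above:

  use HTML::Encoding 'encoding_from_xml_declaration';

  my $enc = encoding_from_xml_declaration(
      "<?xml version='1.0' encoding='utf-8'?>");
  # 'utf-8'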
- xml_declaration_from_octets($octets [, %options])
- Attempts to find a ">" character in the byte
string $octets using each of the suspected encodings, and upon success attempts
to find a preceding "<" character. Returns all the strings
found this way, ordered by number of successful matches, in list
context, and the best match in scalar context. Should probably be combined
with the only user of this routine, encoding_from_xml_declaration. You
can modify the list of suspected encodings using $options{encodings}.
- encoding_from_first_chars($octets [, %options])
- Assuming that documents start with "<"
optionally preceded by whitespace characters, encoding_from_first_chars
attempts to determine an encoding by matching $octets against something
like /^[@{$options{whitespace}}]*</ in the various suspected
$options{encodings}.
This is useful to distinguish, for example, UTF-16LE from UTF-8 if the byte
string starts with neither a byte order mark nor an XML declaration (e.g. if
the document is an HTML document), in order to get at least a base encoding
that can be used to decode enough of the document to find <meta> elements
using encoding_from_meta_element. $options{whitespace} defaults to qw/CR LF SP
TB/. Returns nothing if unsuccessful. Returns the matching encodings in
order of the number of octets matched in list context and the best match
in scalar context.
Examples:
+---------------+----------+---------------------+
| String | Encoding | Result |
+---------------+----------+---------------------+
| '<!DOCTYPE ' | UTF-16LE | UTF-16LE |
| ' <!DOCTYPE ' | UTF-16LE | UTF-16LE |
| '...' | UTF-16LE | undef |
| '...<' | UTF-16LE | undef |
| '<' | UTF-8 | ISO-8859-1 or UTF-8 |
| "<!--\xF6-->" | UTF-8 | ISO-8859-1 or UTF-8 |
+---------------+----------+---------------------+
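The documented use case, obtaining a base encoding first and then looking for
<meta> elements, might look like this (a sketch; $octets is assumed to hold
the raw document bytes):

  use HTML::Encoding qw(encoding_from_first_chars
                        encoding_from_meta_element);

  # Get a base encoding good enough to decode the document prologue
  my $base = encoding_from_first_chars($octets)
      or die "could not determine a base encoding";

  # Use it to decode enough of the document to find <meta> elements
  my $enc = encoding_from_meta_element($octets, $base);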
- encoding_from_meta_element($octets, $encname [, %options])
- Attempts to find <meta> elements in the document
using HTML::Parser. It will attempt to decode chunks of the byte string
using $encname to characters before passing the data to HTML::Parser. An
optional %options hash can be provided which will be passed to the
HTML::Parser constructor. It will stop processing the document if it
encounters
* </head>
* encoding errors
* the end of the input
* ... (see todo)
If relevant <meta> elements, i.e. something like
<meta http-equiv=Content-Type content='...'>
are found, uses encoding_from_content_type to extract the charset parameter.
It returns all such encodings it could find in document order in list
context or the first encoding in scalar context (it will currently look
for others regardless of calling context) or nothing if that fails for
some reason.
Note that there are many edge cases where this does not yield
"proper" results, depending on the capabilities of the
HTML::Parser version and the options you pass to it; for example,
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY content_type "text/html;charset=utf-8">
]>
<meta http-equiv="Content-Type" content="&content_type;">
<title></title>
<p>...</p>
This would likely not detect the "utf-8" value if HTML::Parser
does not resolve the entity. This should however only be a concern for
documents specifically crafted to break the encoding detection.
- encoding_from_xml_document($octets [, %options])
- Uses encoding_from_byte_order_mark to detect the encoding
using a byte order mark in the byte string and returns the return value of
that routine if it succeeds. Uses xml_declaration_from_octets and
encoding_from_xml_declaration, and returns the encoding for which the
latter routine found the most matches in scalar context, and all encodings
ordered by number of occurrences in list context. It does not return a
value if neither a byte order mark nor an in-document declaration declares a
character encoding.
Examples:
+----------------------------+----------+-----------+----------+
| Input | Encoding | Encodings | Result |
+----------------------------+----------+-----------+----------+
| "<?xml?>" | UTF-16 | default | UTF-16BE |
| "<?xml?>" | UTF-16LE | default | undef |
| "<?xml encoding='utf-8'?>" | UTF-16LE | default | utf-8 |
| "<?xml encoding='utf-8'?>" | UTF-16 | default | UTF-16BE |
| "<?xml encoding='cp37'?>" | CP37 | default | undef |
| "<?xml encoding='cp37'?>" | CP37 | CP37 | cp37 |
+----------------------------+----------+-----------+----------+
Lacking a return value from this routine and higher-level protocol
information (such as protocol encoding defaults) processors would be
required to assume that the document is UTF-8 encoded.
Note however that the return value depends on the set of suspected encodings
you pass to it. For example, by default, EBCDIC encodings would not be
considered and thus for
<?xml version='1.0' encoding='cp37'?>
this routine would return the undefined value. You can modify the list of
suspected encodings using $options{encodings}.
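Lacking higher-level protocol information, a caller might fall back to UTF-8
as described above; a sketch ($octets is assumed to hold the document bytes):

  use HTML::Encoding 'encoding_from_xml_document';

  # Scalar context: the best match, e.g. 'UTF-16BE' for a "<?xml?>"
  # document encoded as UTF-16 (see the table above)
  my $enc = encoding_from_xml_document($octets);
  $enc = 'UTF-8' unless defined $enc;  # processor fallback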
- encoding_from_html_document($octets [, %options])
- Uses encoding_from_xml_document and
encoding_from_meta_element to determine the encoding of HTML documents. If
$options{xhtml} is set to a false value, it uses encoding_from_byte_order_mark
and encoding_from_meta_element instead. The xhtml option
is on by default. The $options{encodings} can be used to modify the
suspected encodings and $options{parser_options} can be used to modify the
HTML::Parser options in encoding_from_meta_element (see the relevant
documentation).
Returns nothing if no declaration could be found, the winning declaration in
scalar context, and a list of encoding source and encoding name in list
context; see ENCODING SOURCES.
...
Other problems arise from differences between HTML and XHTML syntax and
encoding detection rules, for example, the input could be
Content-Type: text/html
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv = "Content-Type"
content = "text/html;charset=iso-8859-2">
<title></title>
<p>...</p>
This is a perfectly legal HTML 4.01 document and implementations might be
expected to consider the document ISO-8859-2 encoded as XML rules for
encoding detection do not apply to HTML documents. This module attempts to
avoid making decisions which rules apply for a specific document and would
thus by default return 'utf-8' for this input.
On the other hand, if the input omits the encoding declaration,
Content-Type: text/html
<?xml version='1.0'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv = "Content-Type"
content = "text/html;charset=iso-8859-2">
<title></title>
<p>...</p>
the routine would return 'iso-8859-2'. Similar problems would arise from other
differences between HTML and XHTML; for example, consider
Content-Type: text/html
<?foo >
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html ...
?>
...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
...
If this is processed using HTML rules, the first ">" ends the
processing instruction, and the XHTML document type declaration is the
relevant declaration for the document. If it is processed using XHTML
rules, the "?>" ends the processing instruction, and the HTML document
type declaration is the relevant one.
In other words, an application would need to assume a certain character
encoding (family) to process enough of the document to determine whether it
is XHTML or HTML, and the result of this detection would depend on which
processing rules are assumed in order to process it. It is thus in essence
not possible to write a "perfect" detection algorithm, which is
why this routine attempts to avoid making any decisions on this
matter.
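In list context the routine reports where the winning declaration came from
(see ENCODING SOURCES); a sketch, with $octets holding the document bytes:

  use HTML::Encoding 'encoding_from_html_document';

  my ($source, $name) = encoding_from_html_document($octets);
  # $source is one of the ENCODING SOURCES, e.g. 'xml' or 'meta',
  # and $name is the declared encoding, e.g. 'utf-8'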
- encoding_from_http_message($message [, %options])
- Determines the encoding of HTML / XML / XHTML documents
enclosed in an HTTP message. $message is an object compatible with
HTTP::Message, e.g. an HTTP::Response object. %options is a hash with the
following possible entries:
- encodings
- An array reference of suspected character encodings; defaults
to $HTML::Encoding::DEFAULT_ENCODINGS.
- is_html
- Regular expression matched against the content_type of the
message to determine whether to use HTML rules for the entity body,
defaults to "qr{^text/html$}i".
- is_xml
- Regular expression matched against the content_type of the
message to determine whether to use XML rules for the entity body,
defaults to "qr{^.+/(?:.+\+)?xml$}i".
- is_text_xml
- Regular expression matched against the content_type of the
message to determine whether to use text/xml rules for the message;
defaults to "qr{^text/(?:.+\+)?xml$}i". This will only be
checked if is_xml matches as well.
- html_default
- Default encoding for documents determined (by is_html) as
HTML, defaults to "ISO-8859-1".
- xml_default
- Default encoding for documents determined (by is_xml) as
XML, defaults to "UTF-8".
- text_xml_default
- Default encoding for documents determined (by is_text_xml)
as text/xml; defaults to "undef", in which case this default is
ignored. Set this to "US-ASCII" if desired, as this
module is by default inconsistent with RFC 3023, which requires that
text/xml documents without a charset parameter in the HTTP header
be assumed to be "US-ASCII".
That requirement is in turn inconsistent with RFC 2616 (HTTP/1.1), which
requires assuming "ISO-8859-1"; it has been widely ignored and is thus
disabled by default.
- xhtml
- Whether the routine should look for an encoding declaration
in the XML declaration of the document (if any), defaults to 1.
- default
- Whether the relevant default value should be returned when
no other information can be determined, defaults to 1.
This is further possibly inconsistent with XML MIME types that differ in other
ways from application/xml, for example if the MIME type does not allow for a
charset parameter, in which case applications might be expected to ignore a
charset parameter that is erroneously provided.
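Building on the SYNOPSIS, the options above might be used like this (a sketch;
here opting in to the RFC 3023 default for text/xml):

  use HTML::Encoding 'encoding_from_http_message';
  use LWP::UserAgent;

  my $resp = LWP::UserAgent->new->get('http://www.example.org');
  my ($source, $name) = encoding_from_http_message($resp,
      text_xml_default => 'US-ASCII',   # honor RFC 3023
  );
  # $source might be 'protocol', 'bom', 'xml', 'meta', 'default',
  # or 'protocol_default'; see ENCODING SOURCES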
EBCDIC SUPPORT
By default, this module does not support EBCDIC encodings. To enable support
for EBCDIC encodings you can either change the
$HTML::Encoding::DEFAULT_ENCODINGS array reference or pass the encodings to
the routines you use via the encodings option, for example
my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
my $enc = encoding_from_xml_document($doc, encodings => \@try);
Note that there are some subtle differences between various EBCDIC encodings,
for example "!" is mapped to 0x5A in "posix-bc" and to
0x4F in "cp500"; these differences might affect processing in yet
undetermined ways.
TODO
* bundle with test suite
* optimize some routines to give up once successful
* avoid transcoding for HTML::Parser if e.g. ISO-8859-1
* consider adding an "HTML5" mode of operation?
SEE ALSO
* http://www.w3.org/TR/REC-xml/#charencoding
* http://www.w3.org/TR/REC-xml/#sec-guessing
* http://www.w3.org/TR/xml11/#charencoding
* http://www.w3.org/TR/xml11/#sec-guessing
* http://www.w3.org/TR/html4/charset.html#h-5.2.2
* http://www.w3.org/TR/xhtml1/#C_9
* http://www.ietf.org/rfc/rfc2616.txt
* http://www.ietf.org/rfc/rfc2854.txt
* http://www.ietf.org/rfc/rfc3023.txt
* perlunicode
* Encode
* HTML::Parser
AUTHOR / COPYRIGHT / LICENSE
Copyright (c) 2004-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
This module is licensed under the same terms as Perl itself.