NAME
Test::utf8 - handy utf8 tests
SYNOPSIS
use Test::utf8;

# check the string is good
is_valid_string($string); # check the string is valid
is_sane_utf8($string); # check not double encoded
# check the string has certain attributes
is_flagged_utf8($string1); # has utf8 flag set
is_within_ascii($string2); # only has ascii chars in it
isnt_within_ascii($string3); # has chars outside the ascii range
is_within_latin_1($string4); # only has latin-1 chars in it
isnt_within_latin_1($string5); # has chars outside the latin-1 range
DESCRIPTION
This module is a collection of tests useful for dealing with utf8 strings in
Perl.
This module has two types of tests: the validity tests check if a string is
valid and not corrupt, whereas the characteristics tests check that a
string has a given set of characteristics.
Validity Tests
- is_valid_string($string, $testname)
- Checks if the string is "valid", i.e. this passes
and returns true unless the internal utf8 flag has been set on a scalar
that isn't made up of a valid utf-8 byte sequence.
This should never happen and, in theory, this test should always
pass. Unless you (or a module you use) go monkeying around inside a
scalar using Encode's private functions or XS code, you shouldn't ever end
up in a situation where you've got a corrupt scalar. But if you do,
this function should help you detect the problem.
To be clear, here's an example of the error case this can detect:
my $mark = "Mark";
my $leon = "L\x{e9}on";
is_valid_string($mark); # passes, not utf-8
is_valid_string($leon); # passes, not utf-8
my $iloveny = "I \x{2665} NY";
is_valid_string($iloveny); # passes, proper utf-8
my $acme = "L\x{c3}\x{a9}on";
Encode::_utf8_on($acme); # (please don't do things like this)
is_valid_string($acme); # passes, proper utf-8 byte sequence upgraded
Encode::_utf8_on($leon); # (this is why you don't do things like this)
is_valid_string($leon); # fails! the byte \x{e9} isn't valid utf-8
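The corrupt-scalar case above can also be observed with core functions alone. The following is a minimal sketch, not Test::utf8's implementation: is_internally_consistent is a hypothetical helper that wraps perl's own utf8::valid, which reports whether a scalar's utf8 flag agrees with the bytes it holds.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Test::utf8: reports whether perl
# considers the scalar's utf8 flag and its stored bytes to agree.
sub is_internally_consistent {
    my ($string) = @_;
    return utf8::valid($string) ? 1 : 0;
}

my $leon    = "L\x{e9}on";       # plain latin-1 bytes, flag off
my $iloveny = "I \x{2665} NY";   # perl sets the flag itself
my $corrupt = "L\x{e9}on";
Encode::_utf8_on($corrupt);      # flag forced on; \x{e9} alone is not valid utf-8

for my $case ([leon => $leon], [iloveny => $iloveny], [corrupt => $corrupt]) {
    my ($label, $string) = @$case;
    printf "%-8s %s\n", $label,
        is_internally_consistent($string) ? "consistent" : "corrupt";
}
```

Run as a script, this is expected to report the first two strings as consistent and the last one as corrupt.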
- is_sane_utf8($string, $name)
- This test fails if the string contains something that looks
like it might be dodgy utf8, i.e. it contains something that looks like the
multi-byte sequence for a latin-1 character that perl hasn't been
instructed to treat as such. Strings that contain nothing resembling a
utf8 byte sequence always pass.
Some examples may help:
# This will pass as it's a normal latin-1 string
is_sane_utf8("Hello L\x{e9}on");
# this will fail because the \x{c3}\x{a9} looks like the
# utf8 byte sequence for e-acute
my $string = "Hello L\x{c3}\x{a9}on";
is_sane_utf8($string);
# this will pass because the utf8 is correctly interpreted as utf8
Encode::_utf8_on($string);
is_sane_utf8($string);
Obviously this isn't a hundred percent reliable. The edge case where this
will fail is where you have "\x{c2}" (which is "LATIN
CAPITAL LETTER A WITH CIRCUMFLEX") or "\x{c3}" (which is
"LATIN CAPITAL LETTER A WITH TILDE") followed by one of the
latin-1 punctuation symbols.
# a capital letter A with circumflex surrounded by smart quotes
# this will fail because it'll see the "\x{c2}\x{94}" and think
# it's actually the utf8 sequence for the end smart quote
is_sane_utf8("\x{93}\x{c2}\x{94}");
However, since this rarely comes up, this test is reasonably reliable in most
cases. Still, take care where dynamic data is placed next to latin-1
punctuation, to avoid spurious failures.
There are two situations that cause this test to fail. Either the string
contains utf8 byte sequences and hasn't been flagged as utf8 (this
normally means that you got it from an external source like a C library;
when Perl needs to store a string internally as utf8 it does its own
encoding and flagging transparently), or a utf8 flagged string contains
byte sequences that, when translated to characters, themselves look like a
utf8 byte sequence. The test diagnostics tell you which is the case.
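To make the two failure situations concrete, here is a hedged sketch using only the core Encode module. looks_double_encoded is a hypothetical helper, and its pattern is an assumption about what "looks like" a utf8 sequence for a latin-1 character means (a \x{c2} or \x{c3} followed by a character in the range \x{80}-\x{bf}); it is not Test::utf8's actual code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper, not Test::utf8's implementation: true when the
# string contains a pair of characters that reads like the utf8 byte
# sequence for a latin-1 character in the range \x{80}-\x{ff}.
sub looks_double_encoded {
    my ($string) = @_;
    return $string =~ /[\xc2\xc3][\x80-\xbf]/ ? 1 : 0;
}

my $chars = "L\x{e9}on";
my $once  = encode("UTF-8", $chars);   # bytes L \xc3 \xa9 o n, flag off
my $twice = encode("UTF-8", $once);    # the same bytes encoded a second time

my @cases = (
    [ "characters"                   => $chars ],
    [ "bytes, never decoded"         => $once ],                    # first situation
    [ "encoded once, decoded once"   => decode("UTF-8", $once) ],
    [ "encoded twice, decoded once"  => decode("UTF-8", $twice) ],  # second situation
);

for my $case (@cases) {
    my ($label, $string) = @$case;
    printf "%-28s %s\n", $label,
        looks_double_encoded($string) ? "suspicious" : "looks sane";
}
```

The same pattern also flags "\x{93}\x{c2}\x{94}", which is the kind of spurious failure the edge case above describes.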
String Characteristic Tests
These routines allow you to check the range of characters in a string. Note that
these routines are blind to the actual encoding perl internally uses to store
the characters; they just check if the string contains only characters that
can be represented in the named encoding:
- is_within_ascii
- Tests that a string only contains characters that are in
the ASCII character set.
- is_within_latin_1
- Tests that a string only contains characters that are in
latin-1.
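As an illustration of what "blind to the internal encoding" means, the sketch below checks character ranges with plain regular expressions. within_ascii and within_latin_1 are hypothetical helpers written for this example, not the module's implementation; utf8::upgrade is the core function that changes only the internal storage of a string.

```perl
use strict;
use warnings;

# Hypothetical helpers, not Test::utf8's implementation: they look only
# at character values, never at how perl stores the string.
sub within_ascii   { my ($s) = @_; return $s =~ /\A[\x00-\x7f]*\z/ ? 1 : 0 }
sub within_latin_1 { my ($s) = @_; return $s =~ /\A[\x00-\xff]*\z/ ? 1 : 0 }

my $leon     = "L\x{e9}on";
my $upgraded = $leon;
utf8::upgrade($upgraded);   # same characters, now stored internally as utf8

for my $case ([ "Mark"      => "Mark" ],
              [ "leon"      => $leon ],
              [ "upgraded"  => $upgraded ],
              [ "iloveny"   => "I \x{2665} NY" ]) {
    my ($label, $string) = @$case;
    printf "%-9s ascii=%d latin-1=%d\n",
        $label, within_ascii($string), within_latin_1($string);
}
```

The upgraded copy should give the same answers as the original, since only its storage changed.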
These tests simply check whether a scalar is or isn't flagged as utf8 by perl's internals:
- is_flagged_utf8($string, $name)
- Passes if the string is flagged by perl's internals as
utf8, fails if it's not.
- isnt_flagged_utf8($string, $name)
- The opposite of "is_flagged_utf8", passes if and
only if the string isn't flagged as utf8 by perl's internals.
Note: you can refer to this function as "isn't_flagged_utf8" if
you really want to.
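The flag these tests look at can be inspected directly with the core function utf8::is_utf8. The sketch below is only an illustration with a hypothetical flag_state helper; it shows that two strings can compare equal while differing in whether the flag is set.

```perl
use strict;
use warnings;

# Hypothetical helper for this example: describes the internal utf8 flag.
sub flag_state {
    my ($string) = @_;
    return utf8::is_utf8($string) ? "flagged" : "unflagged";
}

my $leon = "L\x{e9}on";     # fits in latin-1, so perl stores it as bytes
my $copy = $leon;
utf8::upgrade($copy);       # same characters, internal utf8 storage

printf "leon:    %s\n", flag_state($leon);
printf "copy:    %s\n", flag_state($copy);
printf "iloveny: %s\n", flag_state("I \x{2665} NY");
printf "leon eq copy: %s\n", $leon eq $copy ? "yes" : "no";
```

Because the flag describes storage rather than content, a flag test is best paired with one of the characteristic tests when the content matters.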
AUTHOR
Written by Mark Fowler
mark@twoshortplanks.com
COPYRIGHT
Copyright Mark Fowler 2004,2012. All rights reserved.
This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.
BUGS
None known. Please report any to me via the CPAN RT system. See
http://rt.cpan.org/ for more details.
SEE ALSO
Test::DoubleEncodedEntities for testing for double encoded HTML entities.