.\" Automatically generated by Pod::Man 4.09 (Pod::Simple 3.35)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
. ds C`
. ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.if !\nF .nr F 0
.if \nF>0 \{\
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. if !\nF==2 \{\
. nr % 0
. nr F 2
. \}
.\}
.\" ========================================================================
.\"
.IX Title "Encoding::FixLatin 3pm"
.TH Encoding::FixLatin 3pm "2017-09-06" "perl v5.26.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.
.\" Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
Encoding::FixLatin \- takes mixed encoding input and produces UTF\-8 output
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\& use Encoding::FixLatin qw(fix_latin);
\&
\& my $utf8_string = fix_latin($mixed_encoding_string);
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Most encoding conversion tools take input in one encoding and produce output
in another encoding.  This module takes input which may contain characters in
more than one encoding and makes a best effort to convert them all to \s-1UTF\-8\s0
output.
.SH "EXPORTS"
.IX Header "EXPORTS"
Nothing is exported by default.  The only public function is \f(CW\*(C`fix_latin\*(C'\fR which
will be exported on request (as per \s-1SYNOPSIS\s0).
.SH "FUNCTIONS"
.IX Header "FUNCTIONS"
.SS "fix_latin( string, options ... )"
.IX Subsection "fix_latin( string, options ... )"
Decodes the supplied 'string' and returns a \s-1UTF\-8\s0 version of the string.  The
following rules are used:
.IP "\(bu" 4
\&\s-1ASCII\s0 characters (single bytes in the range 0x00 \- 0x7F) are passed through
unchanged.
.IP "\(bu" 4
Well-formed \s-1UTF\-8\s0 multi-byte characters are also passed through unchanged.
.IP "\(bu" 4
\&\s-1UTF\-8\s0 multi-byte characters which are over-long but otherwise well-formed are
converted to the shortest \s-1UTF\-8\s0 normal form.
.IP "\(bu" 4
Bytes in the range 0xA0 \- 0xFF are assumed to be Latin\-1 characters (\s-1ISO8859\-1\s0
encoded) and are converted to \s-1UTF\-8.\s0
.IP "\(bu" 4
Bytes in the range 0x80 \- 0x9F are assumed to be Win\-Latin\-1 characters (\s-1CP1252\s0
encoded) and are converted to \s-1UTF\-8,\s0 except for the five bytes in this range
which are not defined in \s-1CP1252\s0 (see the \f(CW\*(C`ascii_hex\*(C'\fR option below).
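.PP
The rules above can be sketched roughly as follows.  This is a simplified
illustration built on the core Encode module, not the module's actual
implementation: the helper name sketch_fix_latin is hypothetical, the input is
assumed to be a byte string, only two-byte UTF-8 sequences are recognised, and
over-long sequences are not handled at all.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, NOT part of Encoding::FixLatin: a simplified sketch of
# the rules listed above. Assumptions: input is a byte string, only two-byte
# UTF-8 sequences are recognised, and over-long sequences are not handled.
my %undefined_cp1252 = map { $_ => 1 } (0x81, 0x8D, 0x8F, 0x90, 0x9D);

sub sketch_fix_latin {
    my ($bytes) = @_;
    my $out = '';
    while (length $bytes) {
        if ($bytes =~ s/\A([\x00-\x7F]+)//) {
            $out .= $1;                              # ASCII: passed through
        }
        elsif ($bytes =~ s/\A([\xC2-\xDF][\x80-\xBF])//) {
            my $seq = $1;
            $out .= decode('UTF-8', $seq);           # well-formed 2-byte UTF-8
        }
        elsif ($bytes =~ s/\A([\xA0-\xFF])//) {
            my $byte = $1;
            $out .= decode('iso-8859-1', $byte);     # treated as Latin-1
        }
        else {
            my $byte = substr($bytes, 0, 1, '');     # remaining case: 0x80-0x9F
            $out .= $undefined_cp1252{ ord($byte) }
                ? sprintf('%%%02X', ord($byte))      # undefined in CP1252: %XX form
                : decode('cp1252', $byte);           # treated as CP1252
        }
    }
    return $out;
}

# One Latin-1 encoded and one UTF-8 encoded "e acute" in the same input.
my $fixed = sketch_fix_latin("Caf\xE9 and Caf\xC3\xA9");
```

Note that the two-byte UTF-8 test is tried before the Latin\-1 test, so a byte
pair that happens to form a valid UTF-8 sequence is always read as UTF-8.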
.PP
The Achilles heel of these rules is that it's possible for certain combinations
of two consecutive Latin\-1 characters to be misinterpreted as a single \s-1UTF\-8\s0
character, i.e. there is some risk of data corruption.  See the '\s-1LIMITATIONS\s0'
section below to quantify this risk for the type of data you're working with.
.PP
If you pass in a string that is already a \s-1UTF\-8\s0 character string (the utf8 flag
is set on the Perl scalar) then the string will simply be returned unchanged.
However, if the 'bytes_only' option is specified (see below), the returned string
will be a byte string rather than a character string.  The rules described
above will not be applied in either case.
.PP
The \f(CW\*(C`fix_latin\*(C'\fR function accepts options as name => value pairs.  Recognised
options are:
.IP "bytes_only => 1/0" 4
.IX Item "bytes_only => 1/0"
The value returned by fix_latin is normally a Perl character string and will
have the utf8 flag set if it contains non-ASCII characters.  If you set the
\&\f(CW\*(C`bytes_only\*(C'\fR option to a true value, the returned string will be a binary
string of \s-1UTF\-8\s0 bytes.  The utf8 flag will not be set.  This is useful if you're
going to immediately use the string in an \s-1IO\s0 operation and wish to avoid the
overhead of converting to and from Perl's internal representation.
.IP "ascii_hex => 1/0" 4
.IX Item "ascii_hex => 1/0"
Bytes in the range 0x80\-0x9F are assumed to be \s-1CP1252;\s0 however, \s-1CP1252\s0 does not
define a mapping for 5 of these bytes (0x81, 0x8D, 0x8F, 0x90 and 0x9D).  Use
this option to specify how they should be handled:
.RS 4
.IP "\(bu" 4
If the ascii_hex option is set to true (the default), these bytes will be
converted to 3 character \s-1ASCII\s0 hex strings of the form \f(CW%XX\fR.  For example
the byte 0x81 will become \f(CW%81\fR.
.IP "\(bu" 4
If the ascii_hex option is set to false, these bytes will be treated as Latin\-1
control characters and converted to the equivalent \s-1UTF\-8\s0 multi-byte sequences.
.RE
.RS 4
.Sp
When processing text strings you will almost certainly never encounter these
bytes at all.  The most likely reason you would see them is if a malicious
attacker were feeding random bytes to your application.  It is difficult to
conceive of a scenario in which it makes sense to change this option from its
default setting.
.RE
.IP "overlong_fatal => 1/0" 4
.IX Item "overlong_fatal => 1/0"
An over-long \s-1UTF\-8\s0 byte sequence is one which uses more than the minimum number
of bytes required to represent the character.  Use this option to specify how
over-long sequences should be handled.
.RS 4
.IP "\(bu" 4
If the overlong_fatal option is set to false (the default), over-long sequences
will be converted to the shortest normal \s-1UTF\-8\s0 sequence.  For example the input
byte string \*(L"\exC0\exBCscript>\*(R" would be converted to \*(L"