.\" Automatically generated by Pod::Man 4.07 (Pod::Simple 3.32) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is >0, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .if !\nF .nr F 0 .if \nF>0 \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{\ . nr % 0 . nr F 2 . \} .\} .\" ======================================================================== .\" .IX Title "Unicode::Japanese 3pm" .TH Unicode::Japanese 3pm "2015-06-02" "perl v5.24.1" "User Contributed Perl Documentation" .\" For nroff, turn off justification. 
Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" Unicode::Japanese \- Convert encoding of Japanese text .SH "SYNOPSIS" .IX Header "SYNOPSIS" .Vb 2 \& use Unicode::Japanese; \& use Unicode::Japanese qw(unijp); \& \& # convert utf8 \-> sjis \& \& print Unicode::Japanese\->new($str)\->sjis; \& print unijp($str)\->sjis; # same as above. \& \& # convert sjis \-> utf8 \& \& print Unicode::Japanese\->new($str,\*(Aqsjis\*(Aq)\->get; \& \& # convert sjis (imode_EMOJI) \-> utf8 \& \& print Unicode::Japanese\->new($str,\*(Aqsjis\-imode\*(Aq)\->get; \& \& # convert zenkaku (utf8) \-> hankaku (utf8) \& \& print Unicode::Japanese\->new($str)\->z2h\->get; .Ve .SH "DESCRIPTION" .IX Header "DESCRIPTION" The Unicode::Japanese module converts Japanese text from one encoding to another. .SS "\s-1FEATURES\s0" .IX Subsection "FEATURES" .IP "\(bu" 2 An instance of Unicode::Japanese internally holds a string in \s-1UTF\-8.\s0 .IP "\(bu" 2 This module is implemented in two ways: \s-1XS\s0 and pure perl. If efficiency is important to you, you should build and install the \s-1XS\s0 module. If you don't want to, or if you can't build the \s-1XS\s0 module, you may use the pure perl module instead. In that case, all you have to do is copy Japanese.pm into a directory listed in \f(CW@INC\fR. .IP "\(bu" 2 This module can convert characters from zenkaku (full-width) form to hankaku (half-width) form, and vice versa. Conversion between hiragana and katakana (the two Japanese phonetic syllabaries) is also supported. .IP "\(bu" 2 This module has mapping tables for emoji (graphic characters) defined by various Japanese mobile phones: DoCoMo i\-mode, \s-1ASTEL\s0 dot-i and J\-PHONE J\-Sky. 
Those characters are mapped to the Unicode Private Use Area, so the Unicode strings the module outputs are still valid even if they contain emoji, and you can safely pass them to other software that can handle Unicode. .IP "\(bu" 2 This module can map some emoji from one set to another. Different mobile phones define different sets of emoji, so a mapping between sets is not always possible. But since some emoji exist in two or more sets with similar appearance, this module considers those emoji to be the same. .IP "\(bu" 2 This module uses the mapping table for \s-1MS\-CP932\s0 instead of the standard Shift_JIS. The Shift_JIS encoding used by MS-Windows (\s-1MS\-SJIS/MS\-CP932\s0) differs slightly from the standard. .IP "\(bu" 2 When the module converts strings from Unicode to Shift_JIS, EUC-JP or \&\s-1ISO\-2022\-JP,\s0 Unicode characters which can't be represented in those encodings will be encoded in \*(L"&#dddd;\*(R" form (decimal character reference). Note, however, that characters in the Unicode Private Use Area will be replaced with a '?' mark ('\s-1QUESTION MARK\s0'; U+003F) instead of being encoded. In addition, encoding to character sets for mobile phones replaces every unrepresentable character with a '?' mark. .IP "\(bu" 2 On perl\-5.8.0 or later, this module handles the \s-1UTF\-8\s0 flag: the method \fIutf8()\fR returns a \s-1UTF\-8 \s0\fIbyte\fR string, and the method \fIgetu()\fR returns a \s-1UTF\-8 \s0\fIcharacter\fR string. .Sp Currently the method \fIget()\fR returns a \s-1UTF\-8 \s0\fIbyte\fR string, but this behavior may be changed in the future. .Sp Methods like \fIsjis()\fR, \fIjis()\fR, \fIutf8()\fR, and the like return a \fIbyte\fR string. The \fInew()\fR, \&\fIset()\fR, and \fIgetcode()\fR methods ignore the \s-1UTF\-8\s0 flag of the strings they take. .SH "REQUIREMENT" .IX Header "REQUIREMENT" .IP "\(bu" 4 perl 5.10.x, 5.8.x, etc. (5.004 and later) .IP "\(bu" 4 (optional) C Compiler. This module supports both \s-1XS\s0 and Pure Perl. 
If you have no C compiler, Unicode::Japanese will be installed as a Pure Perl module. .IP "\(bu" 4 (optional) Test.pm and Test::More for testing. .PP No other modules are required at run time. .SH "METHODS" .IX Header "METHODS" .ie n .IP "$s = Unicode::Japanese\->new($str [, $icode [, $encode]])" 4 .el .IP "\f(CW$s\fR = Unicode::Japanese\->new($str [, \f(CW$icode\fR [, \f(CW$encode\fR]])" 4 .IX Item "$s = Unicode::Japanese->new($str [, $icode [, $encode]])" Create a new instance of Unicode::Japanese. .Sp Any parameters given are passed internally to the method \*(L"set\*(R"(). .ie n .IP "$s = unijp($str [, $icode [, $encode]])" 4 .el .IP "\f(CW$s\fR = unijp($str [, \f(CW$icode\fR [, \f(CW$encode\fR]])" 4 .IX Item "$s = unijp($str [, $icode [, $encode]])" Same as Unicode::Japanese\->new(...). .ie n .IP "$s\->set($str [, $icode [, $encode]])" 4 .el .IP "\f(CW$s\fR\->set($str [, \f(CW$icode\fR [, \f(CW$encode\fR]])" 4 .IX Xref "set" .IX Item "$s->set($str [, $icode [, $encode]])" .RS 4 .PD 0 .ie n .IP "$str: string" 2 .el .IP "\f(CW$str:\fR string" 2 .IX Item "$str: string" .ie n .IP "$icode: optional character encoding (default: 'utf8')" 2 .el .IP "\f(CW$icode:\fR optional character encoding (default: 'utf8')" 2 .IX Item "$icode: optional character encoding (default: 'utf8')" .ie n .IP "$encode: optional binary encoding (default: no binary encodings are assumed)" 2 .el .IP "\f(CW$encode:\fR optional binary encoding (default: no binary encodings are assumed)" 2 .IX Item "$encode: optional binary encoding (default: no binary encodings are assumed)" .RE .RS 4 .PD .Sp Store a string in the instance. 
.Sp Possible character encodings are: .Sp .Vb 10 \& auto \& utf8 ucs2 ucs4 \& utf16\-be utf16\-le utf16 \& utf32\-be utf32\-le utf32 \& sjis cp932 euc euc\-jp jis \& sjis\-imode sjis\-imode1 sjis\-imode2 \& utf8\-imode utf8\-imode1 utf8\-imode2 \& sjis\-doti sjis\-doti1 \& sjis\-jsky sjis\-jsky1 sjis\-jsky2 \& jis\-jsky jis\-jsky1 jis\-jsky2 \& utf8\-jsky utf8\-jsky1 utf8\-jsky2 \& sjis\-au sjis\-au1 sjis\-au2 \& jis\-au jis\-au1 jis\-au2 \& sjis\-icon\-au sjis\-icon\-au1 sjis\-icon\-au2 \& euc\-icon\-au euc\-icon\-au1 euc\-icon\-au2 \& jis\-icon\-au jis\-icon\-au1 jis\-icon\-au2 \& utf8\-icon\-au utf8\-icon\-au1 utf8\-icon\-au2 \& ascii binary .Ve .Sp (see also \*(L"\s-1SUPPORTED ENCODINGS\*(R"\s0.) .Sp If you want Unicode::Japanese to detect the character encoding of the string, you must explicitly specify 'auto' as the second argument. In that case, the given string will be passed to the method \fIgetcode()\fR to guess the encoding. .Sp For binary encodings, only 'base64' is currently supported. If you specify \&'base64' as the third argument, the given string will be decoded using a Base64 decoder. .Sp Specify 'binary' as the second argument if you want your string to be stored without modification. .Sp When you specify 'sjis\-imode' or 'sjis\-doti' as the character encoding, any occurrences of '&#dddd;' (decimal character reference) in the string will be interpreted and decoded as emoji code points, just like emoji embedded in the string in binary form. .Sp Because the encoded forms of strings in different encodings are not always clearly distinguishable from one another, it is not always possible to detect which encoding is used for a given string. .Sp When a given string can be interpreted as either a Shift_JIS or a \s-1UTF\-8\s0 string, this module considers it to be encoded in Shift_JIS. If the encoding cannot be distinguished between 'sjis\-au' and 'sjis\-doti', this module considers it 'sjis\-au'. 
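.Sp
The options described above can be sketched as follows. This is only a minimal illustration: the sample byte strings and the variable names are assumptions made for the example, not part of the module.
.Sp
.Vb 15
\&    use Unicode::Japanese;
\&
\&    # Assumed sample data: HIRAGANA LETTER A as Shift_JIS bytes,
\&    # and the same two bytes encoded in Base64.
\&    my $sjis_bytes = "\ex82\exa0";
\&    my $sjis_b64   = "gqA=";
\&
\&    # Ask the module to guess the encoding (via getcode()).
\&    my $s = Unicode::Japanese\->new($sjis_bytes, \*(Aqauto\*(Aq);
\&
\&    # Replace the contents with Base64\-encoded Shift_JIS input.
\&    $s\->set($sjis_b64, \*(Aqsjis\*(Aq, \*(Aqbase64\*(Aq);
\&
\&    # Store the bytes as they are, without any conversion.
\&    $s\->set($sjis_bytes, \*(Aqbinary\*(Aq);
.Ve
.Sp
When the encoding of the input is already known, naming it explicitly avoids the ambiguities of guessing described above.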
.RE .ie n .IP "$str = $s\->get" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->get" 4 .IX Item "$str = $s->get" .RS 4 .PD 0 .ie n .IP "$str: string (\s-1UTF\-8\s0)" 2 .el .IP "\f(CW$str:\fR string (\s-1UTF\-8\s0)" 2 .IX Item "$str: string (UTF-8)" .RE .RS 4 .PD .Sp Get the internal string in \s-1UTF\-8.\s0 .Sp This method currently returns a byte string (whose \s-1UTF\-8\s0 flag is turned off), but this behavior may be changed in the future. .Sp If you need a byte string, use the method \fIutf8()\fR instead. If you want a character string (whose \s-1UTF\-8\s0 flag is turned on), use the method \fIgetu()\fR. .RE .ie n .IP "$str = $s\->getu" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->getu" 4 .IX Item "$str = $s->getu" .RS 4 .PD 0 .ie n .IP "$str: string (\s-1UTF\-8\s0)" 2 .el .IP "\f(CW$str:\fR string (\s-1UTF\-8\s0)" 2 .IX Item "$str: string (UTF-8)" .RE .RS 4 .PD .Sp Get the internal string in \s-1UTF\-8.\s0 .Sp On perl\-5.8.0 or later, this method returns a character string with its \s-1UTF\-8\s0 flag turned on. .RE .ie n .IP "$code = $s\->getcode($str)" 4 .el .IP "\f(CW$code\fR = \f(CW$s\fR\->getcode($str)" 4 .IX Item "$code = $s->getcode($str)" .RS 4 .PD 0 .ie n .IP "$str: string" 2 .el .IP "\f(CW$str:\fR string" 2 .IX Item "$str: string" .ie n .IP "$code: name of character encoding" 2 .el .IP "\f(CW$code:\fR name of character encoding" 2 .IX Item "$code: name of character encoding" .RE .RS 4 .PD .Sp Detect the character encoding of the given string. .Sp Note that this method, unlike the others, does not operate on the internal string of the instance. .Sp To guess the encoding, the following algorithm is used: .Sp (For pure perl implementation) .IP "1." 4 If the string has a \s-1UTF\-32 BOM,\s0 its encoding is 'utf32'. .IP "2." 4 If it has a \s-1UTF\-16 BOM,\s0 its encoding is 'utf16'. .IP "3." 4 If it is valid for \s-1UTF\-32BE,\s0 its encoding is 'utf32\-be'. .IP "4." 4 If it is valid for \s-1UTF\-32LE,\s0 its encoding is 'utf32\-le'. .IP "5." 
4 If it contains neither \s-1ESC\s0 characters nor bytes whose eighth bit is set, its encoding is 'ascii'. All \s-1ASCII\s0 control characters (0x00\-0x1F and 0x7F) except \s-1ESC \&\s0(0x1B) are considered to be in the range of 'ascii'. .IP "6." 4 If it contains escape sequences of \s-1ISO\-2022\-JP,\s0 its encoding is 'jis'. .IP "7." 4 If it contains any emoji defined for J\-PHONE, its encoding is 'sjis\-jsky'. .IP "8." 4 If it is valid for EUC-JP, its encoding is 'euc'. .IP "9." 4 If it is valid for Shift_JIS, its encoding is 'sjis'. .IP "10." 4 If it contains any emoji defined for au, and everything else is valid for Shift_JIS, its encoding is 'sjis\-au'. .IP "11." 4 If it contains any emoji defined for i\-mode, and everything else is valid for Shift_JIS, its encoding is 'sjis\-imode'. .IP "12." 4 If it contains any emoji defined for dot-i, and everything else is valid for Shift_JIS, its encoding is 'sjis\-doti'. .IP "13." 4 If it is valid for \s-1UTF\-8,\s0 its encoding is 'utf8'. .IP "14." 4 If no conditions above are fulfilled, its encoding is 'unknown'. .RE .RS 4 .Sp (For \s-1XS\s0 implementation) .IP "1." 4 If the string has a \s-1UTF\-32 BOM,\s0 its encoding is 'utf32'. .IP "2." 4 If it has a \s-1UTF\-16 BOM,\s0 its encoding is 'utf16'. .IP "3." 4 Find all possible encodings that might have been applied to the string from the following: .Sp ascii / euc / sjis / jis / utf8 / utf32\-be / utf32\-le / sjis-jsky / sjis-imode / sjis-au / sjis-doti .IP "4." 4 If any candidate encodings have been found, this module picks the one with the highest priority among them. The priority order is as follows: .Sp utf32\-be / utf32\-le / ascii / jis / euc / sjis / sjis-jsky / sjis-imode / sjis-au / sjis-doti / utf8 .IP "5." 4 If no conditions above are fulfilled, its encoding is 'unknown'. .RE .RS 4 .Sp Pay attention to the following pitfalls in the above algorithm: .IP "\(bu" 2 \&\s-1UTF\-8\s0 strings might be accidentally considered to be encoded in Shift_JIS. 
.IP "\(bu" 2 \&\s-1UCS\-2\s0 strings (sequences of raw \s-1UCS\-2\s0 characters in big-endian order; each character always has 2 bytes) can't be detected because they look like nothing but sequences of random bytes whose length is an even number. .IP "\(bu" 2 \&\s-1UTF\-16\s0 strings must have a \s-1BOM\s0 to be detected. .IP "\(bu" 2 Emoji are recognized only if they are embedded in the string in binary form. If they are written in '&#dddd;' form, they aren't considered to be emoji. .RE .RS 4 .Sp Since the \s-1XS\s0 and pure perl implementations use different algorithms to guess the encoding, they may guess differently for the same string. In particular, the pure perl implementation recognizes Shift_JIS strings containing an \s-1ESC\s0 character (0x1B) as Shift_JIS, but the \s-1XS\s0 implementation doesn't. This is because such strings can hardly be distinguished from 'sjis\-jsky'. The \s-1XS\s0 implementation also rejects EUC-JP strings containing an \s-1ESC\s0 character, for the same reason. .RE .ie n .IP "$code = $s\->getcodelist($str)" 4 .el .IP "\f(CW$code\fR = \f(CW$s\fR\->getcodelist($str)" 4 .IX Item "$code = $s->getcodelist($str)" .RS 4 .PD 0 .ie n .IP "$str: string" 2 .el .IP "\f(CW$str:\fR string" 2 .IX Item "$str: string" .ie n .IP "$code: name of character encodings" 2 .el .IP "\f(CW$code:\fR name of character encodings" 2 .IX Item "$code: name of character encodings" .RE .RS 4 .PD .Sp Detect the character encoding of the given string. .Sp Unlike the method \fIgetcode()\fR, \fIgetcodelist()\fR returns a list of possible encodings. 
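.Sp
As a rough sketch of how \fIgetcode()\fR and \fIgetcodelist()\fR might be used together: the sample bytes and variable names below are assumptions made for illustration, and \fIgetcodelist()\fR is assumed to be called in list context.
.Sp
.Vb 16
\&    use Unicode::Japanese;
\&
\&    my $s     = Unicode::Japanese\->new(\*(Aq\*(Aq);
\&    my $bytes = "\ex82\exa0";    # assumed sample: Shift_JIS bytes
\&
\&    my $code  = $s\->getcode($bytes);        # single best guess
\&    my @codes = $s\->getcodelist($bytes);    # all candidates (list context assumed)
\&
\&    if ($code eq \*(Aqunknown\*(Aq) {
\&        warn "could not detect the encoding\en";
\&    }
\&    else {
\&        print "guess: $code (candidates: @codes)\en";
\&        # Feed the guess back in as the input encoding.
\&        my $utf8 = Unicode::Japanese\->new($bytes, $code)\->utf8;
\&    }
.Ve
.Sp
Checking for 'unknown' before converting keeps an undetectable string from being decoded with a wrong encoding.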
.RE .ie n .IP "$str = $s\->conv($ocode, $encode)" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->conv($ocode, \f(CW$encode\fR)" 4 .IX Item "$str = $s->conv($ocode, $encode)" .RS 4 .PD 0 .ie n .IP "$ocode: character encoding (possible encodings are:)" 2 .el .IP "\f(CW$ocode:\fR character encoding (possible encodings are:)" 2 .IX Item "$ocode: character encoding (possible encodings are:)" .PD .Vb 10 \& utf8 ucs2 ucs4 utf16 \& sjis cp932 euc euc\-jp jis \& sjis\-imode sjis\-imode1 sjis\-imode2 \& utf8\-imode utf8\-imode1 utf8\-imode2 \& sjis\-doti sjis\-doti1 \& sjis\-jsky sjis\-jsky1 sjis\-jsky2 \& jis\-jsky jis\-jsky1 jis\-jsky2 \& utf8\-jsky utf8\-jsky1 utf8\-jsky2 \& sjis\-au sjis\-au1 sjis\-au2 \& jis\-au jis\-au1 jis\-au2 \& sjis\-icon\-au sjis\-icon\-au1 sjis\-icon\-au2 \& euc\-icon\-au euc\-icon\-au1 euc\-icon\-au2 \& jis\-icon\-au jis\-icon\-au1 jis\-icon\-au2 \& utf8\-icon\-au utf8\-icon\-au1 utf8\-icon\-au2 \& binary .Ve .Sp (see also \*(L"\s-1SUPPORTED ENCODINGS\*(R"\s0.) .Sp Some encodings for mobile phones have a trailing digit, like 'sjis\-au2'. The digit is the version number of the encoding. Each such encoding also has a variant with no trailing digit, like 'sjis\-au', which is the same as the latest version among its variants. .ie n .IP "$encode: optional binary encoding" 2 .el .IP "\f(CW$encode:\fR optional binary encoding" 2 .IX Item "$encode: optional binary encoding" .PD 0 .ie n .IP "$str: string" 2 .el .IP "\f(CW$str:\fR string" 2 .IX Item "$str: string" .RE .RS 4 .PD .Sp Get the internal string of the instance, encoded in the given character encoding. .Sp If you want the resulting string to be encoded in Base64, specify 'base64' as the second argument. .Sp On perl\-5.8.0 or later, the \s-1UTF\-8\s0 flag of the resulting string is turned off even if you specify 'utf8' as the first argument. 
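.Sp
A minimal sketch of \fIconv()\fR, assuming a short \s-1UTF\-8\s0 sample string; the variable names are illustrative only.
.Sp
.Vb 12
\&    use Unicode::Japanese;
\&
\&    # Assumed sample: HIRAGANA LETTER A as UTF\-8 bytes.
\&    my $s = Unicode::Japanese\->new("\exe3\ex81\ex82");
\&
\&    # Encode the internal string in Shift_JIS.
\&    my $sjis = $s\->conv(\*(Aqsjis\*(Aq);
\&
\&    # Encode in EUC\-JP, then wrap the result in Base64.
\&    my $euc_b64 = $s\->conv(\*(Aqeuc\*(Aq, \*(Aqbase64\*(Aq);
\&
\&    # No trailing digit: the latest version of the au variant.
\&    my $au = $s\->conv(\*(Aqsjis\-au\*(Aq);
.Ve
.Sp
Passing the encoding name as a string is convenient when the target encoding is chosen at run time.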
.RE .ie n .IP "$s\->tag2bin" 4 .el .IP "\f(CW$s\fR\->tag2bin" 4 .IX Item "$s->tag2bin" Interpret decimal character references (&#dddd;) in the instance, and replace them with the single characters they represent. .ie n .IP "$s\->z2h" 4 .el .IP "\f(CW$s\fR\->z2h" 4 .IX Item "$s->z2h" Replace zenkaku (full-width) letters in the instance with hankaku (half-width) letters. .ie n .IP "$s\->h2z" 4 .el .IP "\f(CW$s\fR\->h2z" 4 .IX Item "$s->h2z" Replace hankaku (half-width) letters in the instance with zenkaku (full-width) letters. .ie n .IP "$s\->hira2kata" 4 .el .IP "\f(CW$s\fR\->hira2kata" 4 .IX Item "$s->hira2kata" Replace any hiragana in the instance with katakana. .ie n .IP "$s\->kata2hira" 4 .el .IP "\f(CW$s\fR\->kata2hira" 4 .IX Item "$s->kata2hira" Replace any katakana in the instance with hiragana. .ie n .IP "$str = $s\->jis" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->jis" 4 .IX Item "$str = $s->jis" \&\f(CW$str:\fR byte string in \s-1ISO\-2022\-JP\s0 .Sp Get the internal string of the instance, encoded in \s-1ISO\-2022\-JP.\s0 .ie n .IP "$str = $s\->euc" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->euc" 4 .IX Item "$str = $s->euc" \&\f(CW$str:\fR byte string in EUC-JP .Sp Get the internal string of the instance, encoded in EUC-JP. .ie n .IP "$str = $s\->utf8" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->utf8" 4 .IX Item "$str = $s->utf8" \&\f(CW$str:\fR byte string in \s-1UTF\-8\s0 .Sp Get the internal \s-1UTF\-8\s0 string of the instance. .Sp On perl\-5.8.0 or later, the \s-1UTF\-8\s0 flag of the resulting string is turned off. .ie n .IP "$str = $s\->ucs2" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->ucs2" 4 .IX Item "$str = $s->ucs2" \&\f(CW$str:\fR byte string in \s-1UCS\-2\s0 .Sp Get the internal string of the instance as a sequence of raw \s-1UCS\-2\s0 characters in big-endian order. Note that this is different from \s-1UTF\-16BE,\s0 as a raw \s-1UCS\-2\s0 sequence has no concept of surrogate pairs. 
.ie n .IP "$str = $s\->ucs4" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->ucs4" 4 .IX Item "$str = $s->ucs4" \&\f(CW$str:\fR byte string in \s-1UCS\-4\s0 .Sp Get the internal string of the instance as a sequence of raw \s-1UCS\-4\s0 characters in big-endian order. This is practically the same as \s-1UTF\-32BE.\s0 .ie n .IP "$str = $s\->utf16" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->utf16" 4 .IX Item "$str = $s->utf16" \&\f(CW$str:\fR byte string in \s-1UTF\-16\s0 .Sp Get the internal string of the instance, encoded in big-endian \s-1UTF\-16\s0 with no \s-1BOM\s0 prepended. .ie n .IP "$str = $s\->sjis" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->sjis" 4 .IX Item "$str = $s->sjis" \&\f(CW$str:\fR byte string in Shift_JIS .Sp Get the internal string of the instance, encoded in Shift_JIS (MS-SJIS / \&\s-1MS\-CP932\s0). .ie n .IP "$str = $s\->sjis_imode" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->sjis_imode" 4 .IX Item "$str = $s->sjis_imode" \&\f(CW$str:\fR byte string in 'sjis\-imode' .Sp Get the internal string of the instance, encoded in 'sjis\-imode'. .ie n .IP "$str = $s\->sjis_imode1" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->sjis_imode1" 4 .IX Item "$str = $s->sjis_imode1" \&\f(CW$str:\fR byte string in 'sjis\-imode1' .Sp Get the internal string of the instance, encoded in 'sjis\-imode1'. .ie n .IP "$str = $s\->sjis_imode2" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->sjis_imode2" 4 .IX Item "$str = $s->sjis_imode2" \&\f(CW$str:\fR byte string in 'sjis\-imode2' .Sp Get the internal string of the instance, encoded in 'sjis\-imode2'. .ie n .IP "$str = $s\->sjis_doti" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->sjis_doti" 4 .IX Item "$str = $s->sjis_doti" \&\f(CW$str:\fR byte string in 'sjis\-doti' .Sp Get the internal string of the instance, encoded in 'sjis\-doti'. 
.ie n .IP "$str = $s\->sjis_jsky" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->sjis_jsky" 4 .IX Item "$str = $s->sjis_jsky" \&\f(CW$str:\fR byte string in 'sjis\-jsky' .Sp Get the internal string of the instance, encoded in 'sjis\-jsky'. .ie n .IP "$str = $s\->sjis_jsky1" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->sjis_jsky1" 4 .IX Item "$str = $s->sjis_jsky1" \&\f(CW$str:\fR byte string in 'sjis\-jsky1' .Sp Get the internal string of the instance, encoded in 'sjis\-jsky1'. .ie n .IP "$str = $s\->sjis_icon_au" 4 .el .IP "\f(CW$str\fR = \f(CW$s\fR\->sjis_icon_au" 4 .IX Item "$str = $s->sjis_icon_au" \&\f(CW$str:\fR byte string in 'sjis\-icon\-au' .Sp Get the internal string of the instance, encoded in 'sjis\-icon\-au'. .ie n .IP "$str_arrayref = $s\->strcut($len)" 4 .el .IP "\f(CW$str_arrayref\fR = \f(CW$s\fR\->strcut($len)" 4 .IX Item "$str_arrayref = $s->strcut($len)" .RS 4 .PD 0 .ie n .IP "$len: maximum length of each chunk (in number of full-width characters)" 2 .el .IP "\f(CW$len:\fR maximum length of each chunk (in number of full-width characters)" 2 .IX Item "$len: maximum length of each chunk (in number of full-width characters)" .ie n .IP "$str_arrayref: reference to array of strings" 2 .el .IP "\f(CW$str_arrayref:\fR reference to array of strings" 2 .IX Item "$str_arrayref: reference to array of strings" .RE .RS 4 .PD .Sp Split the internal string of the instance into chunks of a given length. .Sp On perl\-5.8.0 or later, the \s-1UTF\-8\s0 flag of each chunk is turned on. .RE .ie n .IP "$len = $s\->strlen" 4 .el .IP "\f(CW$len\fR = \f(CW$s\fR\->strlen" 4 .IX Item "$len = $s->strlen" \&\f(CW$len:\fR character width of the internal string .Sp Calculate the character width of the internal string. 
Half-width characters have a width of one unit, and full-width characters have a width of two units. .ie n .IP "$s\->join_csv(@values);" 4 .el .IP "\f(CW$s\fR\->join_csv(@values);" 4 .IX Item "$s->join_csv(@values);" \&\f(CW@values:\fR array of strings .Sp Build a line of \s-1CSV\s0 from the arguments, and store it in the instance. The resulting line has a trailing line break (\*(L"\en\*(R"). .ie n .IP "@values = $s\->split_csv;" 4 .el .IP "\f(CW@values\fR = \f(CW$s\fR\->split_csv;" 4 .IX Item "@values = $s->split_csv;" \&\f(CW@values:\fR array of strings .Sp Parse a line of \s-1CSV\s0 in the instance and return its columns. The line will be \&\fIchomp()\fRed before being parsed. .Sp If the internal string was decoded from 'binary' encoding (see methods \fInew()\fR and \&\fIset()\fR), the \s-1UTF\-8\s0 flags of the resulting array of strings are turned off. Otherwise the flags are turned on. .SH "SUPPORTED ENCODINGS" .IX Header "SUPPORTED ENCODINGS" .Vb 10 \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |encoding | in | out | guess | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |auto : OK : \-\- | \-\-\-\-\- | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |utf8 : OK : OK | OK | \& |ucs2 : OK : OK | \-\-\-\-\- | \& |ucs4 : OK : OK | \-\-\-\-\- | \& |utf16\-be : OK : \-\- | \-\-\-\-\- | \& |utf16\-le : OK : \-\- | \-\-\-\-\- | \& |utf16 : OK : OK | OK(#) | \& |utf32\-be : OK : \-\- | OK | \& |utf32\-le : OK : \-\- | OK | \& |utf32 : OK : \-\- | OK(#) | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |sjis : OK : OK | OK | \& |cp932 : OK : OK | \-\-\-\-\- | \& |euc : OK : OK | OK | \& |euc\-jp : OK : OK | \-\-\-\-\- | \& |jis : OK : OK | OK | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |sjis\-imode : OK : OK | OK | \& |sjis\-imode1 : OK : OK | \-\-\-\-\- | \& |sjis\-imode2 : OK : OK | \-\-\-\-\- | \& |utf8\-imode : OK : OK | 
\-\-\-\-\- | \& |utf8\-imode1 : OK : OK | \-\-\-\-\- | \& |utf8\-imode2 : OK : OK | \-\-\-\-\- | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |sjis\-doti : OK : OK | OK | \& |sjis\-doti1 : OK : OK | \-\-\-\-\- | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |sjis\-jsky : OK : OK | OK | \& |sjis\-jsky1 : OK : OK | \-\-\-\-\- | \& |sjis\-jsky2 : OK : OK | \-\-\-\-\- | \& |jis\-jsky : OK : OK | \-\-\-\-\- | \& |jis\-jsky1 : OK : OK | \-\-\-\-\- | \& |jis\-jsky2 : OK : OK | \-\-\-\-\- | \& |utf8\-jsky : OK : OK | \-\-\-\-\- | \& |utf8\-jsky1 : OK : OK | \-\-\-\-\- | \& |utf8\-jsky2 : OK : OK | \-\-\-\-\- | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |sjis\-au : OK : OK | OK | \& |sjis\-au1 : OK : OK | \-\-\-\-\- | \& |sjis\-au2 : OK : OK | \-\-\-\-\- | \& |jis\-au : OK : OK | \-\-\-\-\- | \& |jis\-au1 : OK : OK | \-\-\-\-\- | \& |jis\-au2 : OK : OK | \-\-\-\-\- | \& |sjis\-icon\-au : OK : OK | \-\-\-\-\- | \& |sjis\-icon\-au1 : OK : OK | \-\-\-\-\- | \& |sjis\-icon\-au2 : OK : OK | \-\-\-\-\- | \& |euc\-icon\-au : OK : OK | \-\-\-\-\- | \& |euc\-icon\-au1 : OK : OK | \-\-\-\-\- | \& |euc\-icon\-au2 : OK : OK | \-\-\-\-\- | \& |jis\-icon\-au : OK : OK | \-\-\-\-\- | \& |jis\-icon\-au1 : OK : OK | \-\-\-\-\- | \& |jis\-icon\-au2 : OK : OK | \-\-\-\-\- | \& |utf8\-icon\-au : OK : OK | \-\-\-\-\- | \& |utf8\-icon\-au1 : OK : OK | \-\-\-\-\- | \& |utf8\-icon\-au2 : OK : OK | \-\-\-\-\- | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& |ascii : OK : \-\- | OK | \& |binary : OK : OK | \-\-\-\-\- | \& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-+ \& (#): guessed when it has bom. .Ve .SS "\s-1GUESSING ORDER\s0" .IX Subsection "GUESSING ORDER" .Vb 10 \& 1. utf32 (#) \& 2. utf16 (#) \& 3. utf32\-be \& 4. utf32\-le \& 5. ascii \& 6. jis \& 7. sjis\-jsky (pp) \& 8. euc \& 9. sjis \& 10. sjis\-jsky (xs) \& 11. sjis\-au \& 12. sjis\-imode \& 13. 
sjis\-doti \& 14. utf8 \& 15. unknown .Ve .SH "DESCRIPTION OF UNICODE MAPPING" .IX Header "DESCRIPTION OF UNICODE MAPPING" Transcoding between Unicode and other encodings is performed as follows: .IP "Shift_JIS" 2 .IX Item "Shift_JIS" This module uses the mapping table of \s-1MS\-CP932.\s0 .Sp When the module converts a Unicode string to Shift_JIS, it represents most characters that aren't available in Shift_JIS as decimal character references ('&#dddd;'). There is one exception to this: graphic characters for mobile phones are replaced with a '?' mark. .Sp For variants of Shift_JIS defined for mobile phones, every unrepresentable character is replaced with a '?' mark, unlike plain Shift_JIS. .IP "\s-1EUC\-JP/ISO\-2022\-JP\s0" 2 .IX Item "EUC-JP/ISO-2022-JP" This module doesn't convert Unicode strings directly to or from EUC-JP or \&\s-1ISO\-2022\-JP:\s0 it converts by way of Shift_JIS as an intermediate step. Characters that aren't available in Shift_JIS therefore cannot be translated properly. .IP "DoCoMo i\-mode" 2 .IX Item "DoCoMo i-mode" This module maps emoji in the range of F800 \- F9FF to U+0FF800 \- U+0FF9FF. .IP "\s-1ASTEL\s0 dot-i" 2 .IX Item "ASTEL dot-i" This module maps emoji in the range of F000 \- F4FF to U+0FF000 \- U+0FF4FF. .IP "J\-PHONE J\-SKY" 2 .IX Item "J-PHONE J-SKY" The encoding method defined by J\-SKY is as follows: first comes an escape sequence \&\*(L"\ee\e$\*(R" to indicate the beginning of emoji, then the first byte of an emoji, then the second bytes of one or more emoji, and finally \*(L"\ex0f\*(R" to indicate the end of emoji. If a string contains a series of emoji whose first bytes are identical, the sequence can be compressed by appending their second bytes to a single first byte. .Sp This module considers each pair of those first and second bytes to be one character, and maps them from 4500 \- 47FF to U+0FFB00 \- U+0FFDFF. 
.Sp When the module encodes J\-SKY emoji, it performs the compression automatically. .IP "\s-1AU\s0" 2 .IX Item "AU" This module maps \s-1AU\s0 emoji to U+0FF500 \- U+0FF6FF. .SH "PurePerl mode" .IX Header "PurePerl mode" .Vb 1 \& use Unicode::Japanese qw(PurePerl); .Ve .PP If you want to explicitly use the pure perl implementation, pass \&\f(CW\*(AqPurePerl\*(Aq\fR as an argument to the \f(CW\*(C`use\*(C'\fR statement. .SH "BUGS" .IX Header "BUGS" Please report bugs and requests to \f(CW\*(C`bug\-unicode\-japanese at rt.cpan.org\*(C'\fR or . If you report them through the web interface, you will be notified automatically of any progress on your report. .IP "\(bu" 2 This module doesn't convert Unicode strings directly to or from EUC-JP or \&\s-1ISO\-2022\-JP:\s0 it converts by way of Shift_JIS as an intermediate step. Characters that aren't available in Shift_JIS therefore cannot be translated properly. .IP "\(bu" 2 The \s-1XS\s0 implementation of \fIgetcode()\fR fails to detect the encoding when the given string contains \ee while its encoding is EUC-JP or Shift_JIS. .IP "\(bu" 2 Japanese.pm consists of Perl source text and a binary character conversion table. If you transfer it over \s-1FTP\s0 in \s-1ASCII\s0 mode, the file will be corrupted. .SH "SUPPORT" .IX Header "SUPPORT" You can find documentation for this module with the perldoc command. .PP .Vb 1 \& perldoc Unicode::Japanese .Ve .PP You can find more information at: .IP "\(bu" 4 AnnoCPAN: Annotated \s-1CPAN\s0 documentation .Sp .IP "\(bu" 4 \&\s-1CPAN\s0 Ratings .Sp .IP "\(bu" 4 \&\s-1RT: CPAN\s0's request tracker .Sp .IP "\(bu" 4 Search \s-1CPAN\s0 .Sp .SH "CREDITS" .IX Header "CREDITS" Thanks very much to: .PP \&\s-1NAKAYAMA\s0 Nao .PP \&\s-1SUGIURA\s0 Tatsuki & Debian \s-1JP\s0 Project .SH "COPYRIGHT & LICENSE" .IX Header "COPYRIGHT & LICENSE" Copyright 2001\-2008 \&\s-1SANO\s0 Taku (\s-1SAWATARI\s0 Mikage) and \s-1YAMASHINA\s0 Hio, all rights reserved. 
.PP This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.