Unicode::Collate(3perl) | Perl Programmers Reference Guide | Unicode::Collate(3perl) |
NAME¶
Unicode::Collate - Unicode Collation Algorithm
SYNOPSIS¶
    use Unicode::Collate;

    # construct
    $Collator = Unicode::Collate->new(%tailoring);

    # sort
    @sorted = $Collator->sort(@not_sorted);

    # compare
    $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

Note: Strings in @not_sorted, $a and $b are interpreted according to Perl's Unicode support. See perlunicode, perluniintro, perlunitut, perlunifaq, utf8. If your strings are not yet Perl character strings, either decode them beforehand or supply a "preprocess" routine.
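As a runnable illustration of the synopsis, the following sketch contrasts Perl's built-in code point sort with the collator's sort. The word list and variable names are made up for the example, and it assumes the default collation table that ships with the module is available.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Default collation, no tailoring.
my $collator = Unicode::Collate->new();

# "\x{E9}" is LATIN SMALL LETTER E WITH ACUTE, written as an escape so the
# example does not depend on the encoding of the source file.
my @words = ("zebra", "\x{E9}clair", "apple", "Eclair");

# Built-in sort compares code points: uppercase first, U+00E9 after "z".
my @by_codepoint = sort @words;

# The collator files the accented word with the other "e" words.
my @by_uca = $collator->sort(@words);

binmode(STDOUT, ':encoding(UTF-8)');
print "code point: @by_codepoint\n";   # Eclair apple zebra éclair
print "collated:   @by_uca\n";         # apple Eclair éclair zebra
```

The input list is not modified; "sort" returns a new list containing the original strings.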
DESCRIPTION¶
This module is an implementation of Unicode Technical Standard #10 (a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).
Constructor and Tailoring¶
The "new" method returns a collator object. If new() is called with no parameters, the collator performs the default collation.
    $Collator = Unicode::Collate->new(
        UCA_Version => $UCA_Version,
        alternate => $alternate, # alias for 'variable'
        backwards => $levelNumber, # or \@levelNumbers
        entry => $element,
        hangul_terminator => $term_primary_weight,
        ignoreName => qr/$ignoreName/,
        ignoreChar => qr/$ignoreChar/,
        katakana_before_hiragana => $bool,
        level => $collationLevel,
        normalization => $normalization_form,
        overrideCJK => \&overrideCJK,
        overrideHangul => \&overrideHangul,
        preprocess => \&preprocess,
        rearrange => \@charList,
        suppress => \@charList,
        table => $filename,
        undefName => qr/$undefName/,
        undefChar => qr/$undefChar/,
        upper_before_lower => $bool,
        variable => $variable,
    );
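A minimal sketch of how constructor parameters change the result, using only the "level" and "upper_before_lower" keys from the list above. The sample strings and variable names are illustrative, and the default collation table is assumed to be installed.

```perl
use strict;
use warnings;
use Unicode::Collate;

# No parameters: default collation.
my $default = Unicode::Collate->new();

# level => 1 compares base letters only, so accents and case are ignored.
my $primary = Unicode::Collate->new(level => 1);

# upper_before_lower => 1 places uppercase before lowercase.
my $upper_first = Unicode::Collate->new(upper_before_lower => 1);

# "R\x{E9}sum\x{E9}" is "Résumé" written with escapes.
my $same_default = $default->eq("resume", "R\x{E9}sum\x{E9}") ? 1 : 0;
my $same_primary = $primary->eq("resume", "R\x{E9}sum\x{E9}") ? 1 : 0;

my @default_order     = $default->sort("A", "a");
my @upper_first_order = $upper_first->sort("a", "A");

print "equal by default: $same_default\n";          # 0
print "equal at level 1: $same_primary\n";          # 1
print "default order:     @default_order\n";        # a A
print "upper first order: @upper_first_order\n";    # A a
```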
- UCA_Version
- If the revision (previously "tracking version") number of UCA is given, the behavior of that revision is emulated when collating. If omitted, the return value of "UCA_Version()" is used.
    UCA    Unicode Standard          DUCET (@version)
    -------------------------------------------------------
     8     3.1                       3.0.1 (3.0.1d9)
     9     3.1 with Corrigendum 3    3.1.1 (3.1.1)
    11     4.0                       4.0.0 (4.0.0)
    14     4.1.0                     4.1.0 (4.1.0)
    16     5.0                       5.0.0 (5.0.0)
    18     5.1.0                     5.1.0 (5.1.0)
    20     5.2.0                     5.2.0 (5.2.0)
    22     6.0.0                     6.0.0 (6.0.0)
- alternate
- -- see 3.2.2 Alternate Weighting, version 8 of UTS #10
- backwards
- -- see 3.1.2 French Accents, UTS #10.
backwards => $levelNumber or \@levelNumbers
- entry
- -- see 3.1 Linguistic Features; 3.2.1 File Format, UTS #10.
    entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
    0063 0068 ; [.0E6A.0020.0002.0063] # ch
    0043 0068 ; [.0E6A.0020.0007.0043] # Ch
    0043 0048 ; [.0E6A.0020.0008.0043] # CH
    006C 006C ; [.0F4C.0020.0002.006C] # ll
    004C 006C ; [.0F4C.0020.0007.004C] # Ll
    004C 004C ; [.0F4C.0020.0008.004C] # LL
    00F1      ; [.0F7B.0020.0002.00F1] # n-tilde
    006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
    00D1      ; [.0F7B.0020.0008.00D1] # N-tilde
    004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
    ENTRY

    entry => <<'ENTRY', # for DUCET v4.0.0 (allkeys-4.0.0.txt)
    00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as <a><e>
    00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as <A><E>
    ENTRY
- hangul_terminator
- -- see 7.1.4 Trailing Weights, UTS #10.
- ignoreChar
- ignoreName
- -- see 3.2.2 Variable Weighting, UTS #10.
- katakana_before_hiragana
- -- see 7.3.1 Tertiary Weight Table, UTS #10.
- level
- -- see 4.3 Form Sort Key, UTS #10.
    Level 1: alphabetic ordering
    Level 2: diacritic ordering
    Level 3: case ordering
    Level 4: tie-breaking (e.g. in the case when variable is 'shifted')

    ex. level => 2,
- normalization
- -- see 4.1 Normalize, UTS #10.
- overrideCJK
- -- see 7.1 Derived Collation Elements, UTS #10.
    In the CJK Unified Ideographs block:
        U+4E00..U+9FA5 if UCA_Version is 8 to 11.
        U+4E00..U+9FBB if UCA_Version is 14 to 16.
        U+4E00..U+9FC3 if UCA_Version is 18.
        U+4E00..U+9FCB if UCA_Version is 20 or greater.

    In the CJK Unified Ideographs Extension blocks:
        Ext.A (U+3400..U+4DB5) and Ext.B (U+20000..U+2A6D6) in any UCA_Version.
        Ext.C (U+2A700..U+2B734) if UCA_Version is 20 or greater.
        Ext.D (U+2B740..U+2B81D) if UCA_Version is 22 or greater.
    overrideCJK => sub {
        my $u = shift;             # get a Unicode codepoint
        my $b = pack('n', $u);     # to UTF-16BE
        my $s = your_unicode_to_sjis_converter($b); # convert
        my $n = unpack('n', $s);   # convert sjis to short
        [ $n, 0x20, 0x2, $u ];     # return the collation element
    },

    overrideCJK => sub {
        my $u = shift;             # get a Unicode codepoint
        my $b = pack('n', $u);     # to UTF-16BE
        my $s = your_unicode_to_sjis_converter($b); # convert
        my $n = unpack('n', $s);   # convert sjis to short
        return $n;                 # return the primary weight
    },

    overrideCJK => sub {()}, # CODEREF returning empty list

    # where ->eq("Pe\x{4E00}rl", "Perl") is true
    # as U+4E00 is a CJK unified ideograph and to be ignorable.
- overrideHangul
- -- see 7.1 Derived Collation Elements, UTS #10.
- preprocess
- -- see 5.1 Preprocessing, UTS #10.
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;
        return $str;
    },

    $sjis_collator = Unicode::Collate->new(
        preprocess => \&your_shiftjis_to_unicode_decoder,
    );
    @result = $sjis_collator->sort(@shiftjis_strings);
- rearrange
- -- see 3.1.3 Rearrangement, UTS #10.
rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],
- suppress
- -- see suppress contractions in 5.14.11 Special-Purpose
Commands, UTS #35 (LDML).
suppress => [0x0400..0x0417, 0x041A..0x0437, 0x043A..0x045F],
- table
- -- see 3.2 Default Unicode Collation Element Table, UTS
#10.
    $onlyABC = Unicode::Collate->new(
        table => undef,
        entry => << 'ENTRIES',
    0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
    0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
    0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
    0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
    0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
    0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
    ENTRIES
    );
- undefChar
- undefName
- -- see 6.3.4 Reducing the Repertoire, UTS #10.
undefChar => qr/[^\0-\x{fffd}]/,
- upper_before_lower
- -- see 6.6 Case Comparisons, UTS #10.
- variable
- -- see 3.2.2 Variable Weighting, UTS #10.
    variable => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.

    'Blanked'        Variable elements are made ignorable at levels 1 through 3;
                     considered at the 4th level.

    'Non-Ignorable'  Variable elements are not reset to ignorable.

    'Shifted'        Variable elements are made ignorable at levels 1 through 3;
                     their level 4 weight is replaced by the old level 1 weight.
                     Level 4 weight for Non-Variable elements is 0xFFFF.

    'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
                     are trimmed.
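Two of the tailoring parameters above, "backwards" and "variable", are easier to follow with concrete input. The sketch below is an illustration only: the French word list (the classic cote/coté/côte/côté example) and the variable names are chosen for the example, and the default collation table is assumed to be installed.

```perl
use strict;
use warnings;
use Unicode::Collate;

# cote, coté, côte, côté, written with escapes (U+00F4 = ô, U+00E9 = é).
my @words = ("c\x{F4}t\x{E9}", "cot\x{E9}", "c\x{F4}te", "cote");

# Default: accents are compared from the start of the string.
my $forward = Unicode::Collate->new();
# backwards => 2: level 2 (accents) is compared from the end of the string.
my $french  = Unicode::Collate->new(backwards => 2);

my @forward_order = $forward->sort(@words);
my @french_order  = $french->sort(@words);

# 'shifted' makes spaces and punctuation ignorable at levels 1 through 3;
# 'non-ignorable' keeps their primary weights.
my $shifted = Unicode::Collate->new(variable => 'shifted',       level => 3);
my $non_ign = Unicode::Collate->new(variable => 'non-ignorable', level => 3);

my $eq_shifted = $shifted->eq("de luca", "deluca") ? 1 : 0;
my $eq_non_ign = $non_ign->eq("de luca", "deluca") ? 1 : 0;

binmode(STDOUT, ':encoding(UTF-8)');
print "forward:   @forward_order\n";   # cote coté côte côté
print "backwards: @french_order\n";    # cote côte coté côté
print "shifted eq: $eq_shifted, non-ignorable eq: $eq_non_ign\n";   # 1, 0
```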
Methods for Collation¶
- "@sorted = $Collator->sort(@not_sorted)"
- Sorts a list of strings.
- "$result = $Collator->cmp($a, $b)"
- Returns 1 (when $a is greater than $b), 0 (when $a is equal to $b), or -1 (when $a is less than $b).
- "$result = $Collator->eq($a, $b)"
- "$result = $Collator->ne($a, $b)"
- "$result = $Collator->lt($a, $b)"
- "$result = $Collator->le($a, $b)"
- "$result = $Collator->gt($a, $b)"
- "$result = $Collator->ge($a, $b)"
- They work like the Perl operators of the same names.
    eq : whether $a is equal to $b.
    ne : whether $a is not equal to $b.
    lt : whether $a is less than $b.
    le : whether $a is less than $b or equal to $b.
    gt : whether $a is greater than $b.
    ge : whether $a is greater than $b or equal to $b.
- "$sortKey = $Collator->getSortKey($string)"
- -- see 4.3 Form Sort Key, UTS #10.
    $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

is equivalent to

    $Collator->cmp($a, $b)
- "$sortKeyForm = $Collator->viewSortKey($string)"
- Returns the sort key of $string converted into a human-readable form. If "UCA_Version" is 8, the output is slightly different.
    use Unicode::Collate;
    my $c = Unicode::Collate->new();
    print $c->viewSortKey("Perl"),"\n";

    # output:
    # [0B67 0A65 0B7F 0B03 | 0020 0020 0020 0020 | 0008 0002 0002 0002 | FFFF FFFF FFFF FFFF]
    #  Level 1               Level 2               Level 3               Level 4
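The comparison methods and "getSortKey" can be combined as sketched below. Precomputing one sort key per string and comparing the keys with plain "cmp" is a common way to avoid recomputing keys inside a sort block. The names list and variable names are illustrative, and the default collation table is assumed to be installed.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# cmp returns -1, 0 or 1; lt and friends return true or false.
my $cmp_result = $collator->cmp("a", "B");           # -1: "a" sorts before "B"
my $lower_first = $collator->lt("a", "A") ? 1 : 0;   # 1: lowercase first by default

# "Zo\x{EB}" is "Zoë" written with an escape.
my @names = ("Zo\x{EB}", "zoe", "Adam", "adam");

# Compute each sort key once, sort on the keys, then drop the keys again.
my @by_key =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ $collator->getSortKey($_), $_ ] } @names;

# The sort method gives the same order.
my @direct = $collator->sort(@names);

binmode(STDOUT, ':encoding(UTF-8)');
print "cmp: $cmp_result, lt: $lower_first\n";
print "by key: @by_key\n";   # adam Adam zoe Zoë
print "direct: @direct\n";   # adam Adam zoe Zoë
```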
Methods for Searching¶
DISCLAIMER: If the "preprocess" or "normalization" parameter is true for $Collator, calling these methods ("index", "match", "gmatch", "subst", "gsubst") croaks, as the position and the length might differ from those on the specified string. (The "rearrange" and "hangul_terminator" parameters are ignored.)
The "match", "gmatch", "subst", "gsubst" methods work like "m//", "m//g", "s///", "s///g", respectively, but they match only a literal substring, not a pattern.
- "$position = $Collator->index($string, $substring[, $position])"
- "($position, $length) = $Collator->index($string, $substring[, $position])"
- If $substring matches a part of $string, returns the
position of the first occurrence of the matching part in scalar context;
in list context, returns a two-element list of the position and the length
of the matching part.
    my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                   # (normalization => undef) is REQUIRED.
    my $str = "Ich mu\x{DF} studieren Perl.";   # "muß", U+00DF
    my $sub = "M\x{DC}SS";                      # "MÜSS", U+00DC
    my $match;
    if (my($pos,$len) = $Collator->index($str, $sub)) {
        $match = substr($str, $pos, $len);
    }
- "$match_ref = $Collator->match($string, $substring)"
- "($match) = $Collator->match($string, $substring)"
- If $substring matches a part of $string, in scalar context,
returns a reference to the first occurrence of the matching part
($match_ref is always true if matches, since every reference is
true); in list context, returns the first occurrence of the
matching part.
    if ($match_ref = $Collator->match($str, $sub)) { # scalar context
        print "matches [$$match_ref].\n";
    } else {
        print "doesn't match.\n";
    }

or

    if (($match) = $Collator->match($str, $sub)) { # list context
        print "matches [$match].\n";
    } else {
        print "doesn't match.\n";
    }
- "@match = $Collator->gmatch($string, $substring)"
- If $substring matches a part of $string, returns all the
matching parts (or matching count in scalar context).
- "$count = $Collator->subst($string, $substring, $replacement)"
- If $substring matches a part of $string, the first occurrence of the matching part is replaced by $replacement ($string is modified) and $count (always equal to 1) is returned.
- "$count = $Collator->gsubst($string, $substring, $replacement)"
- If $substring matches a part of $string, all the occurrences of the matching part are replaced by $replacement ($string is modified) and $count is returned.
    my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                   # (normalization => undef) is REQUIRED.
    my $str = "Camel donkey zebra came\x{301}l CAMEL horse cAm\0E\0L...";
    $Collator->gsubst($str, "camel", sub { "<b>$_[0]</b>" });

    # now $str is "<b>Camel</b> donkey zebra <b>came\x{301}l</b> <b>CAMEL</b> horse <b>cAm\0E\0L</b>...";
    # i.e., all the camels are made bold-faced.
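The search methods can be tried together in one short script. This is a sketch under the conditions stated in the disclaimer above ("normalization => undef", no "preprocess"); the sample text and variable names are made up for the example, and the default collation table is assumed to be installed.

```perl
use strict;
use warnings;
use Unicode::Collate;

# normalization => undef is required for the search methods;
# level => 1 makes the search ignore case and accents.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

my $text = "Camel, camel and CAMEL";

# index in list context: position and length of the first match.
my ($pos, $len) = $collator->index($text, "camel");
my $first = substr($text, $pos, $len);

# gmatch: all matching parts in list context, their count in scalar context.
my @all   = $collator->gmatch($text, "camel");
my $count = $collator->gmatch($text, "camel");

# subst modifies its first argument in place, so work on a copy.
my $copy = $text;
$collator->subst($copy, "camel", "Llama");

print "first match: [$first] at $pos, length $len\n";   # [Camel] at 0, length 5
print "all matches: @all ($count)\n";                   # Camel camel CAMEL (3)
print "after subst: $copy\n";                           # Llama, camel and CAMEL
```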
Other Methods¶
- "%old_tailoring = $Collator->change(%new_tailoring)"
- "$modified_collator = $Collator->change(%new_tailoring)"
- Changes the values of the specified keys and returns the changed part.
    $Collator = Unicode::Collate->new(level => 4);

    $Collator->eq("perl", "PERL"); # false

    %old = $Collator->change(level => 2); # returns (level => 4).

    $Collator->eq("perl", "PERL"); # true

    $Collator->change(%old); # returns (level => 2).

    $Collator->eq("perl", "PERL"); # false

In scalar context, the modified collator itself is returned, so calls can be chained:

    $Collator->change(level => 2)->eq("perl", "PERL"); # true

    $Collator->eq("perl", "PERL"); # true; now max level is 2nd.

    $Collator->change(level => 4)->eq("perl", "PERL"); # false
- "$version = $Collator->version()"
- Returns the version number (a string) of the Unicode Standard which the "table" file used by the collator object is based on. If the table does not include a version line (starting with @version), returns "unknown".
- "UCA_Version()"
- Returns the revision number of UTS #10 this module consults, which should correspond with the DUCET incorporated.
- "Base_Unicode_Version()"
- Returns the version number of UTS #10 this module consults, which should correspond with the DUCET incorporated.
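The three version queries can be printed side by side, as in the sketch below. Since nothing is exported, "UCA_Version" and "Base_Unicode_Version" are called here by their fully qualified names. The values printed depend on the installed module and table, so no output is shown.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Version recorded in the table's @version line, or "unknown".
my $table_version = $collator->version();

# Module-level values: not tied to a particular collator object.
my $uca_revision = Unicode::Collate::UCA_Version();
my $base_version = Unicode::Collate::Base_Unicode_Version();

print "table version:        $table_version\n";
print "UCA revision:         $uca_revision\n";
print "base Unicode version: $base_version\n";
```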
EXPORT¶
No method will be exported.
INSTALL¶
Though this module can be used without any "table" file, to use this module easily, it is recommended to install a table file in the UCA format, by copying it under the directory <a place in @INC>/Unicode/Collate.
The most preferable one is "The Default Unicode Collation Element Table" (aka DUCET), available from the Unicode Consortium's website:
    http://www.unicode.org/Public/UCA/
    http://www.unicode.org/Public/UCA/latest/allkeys.txt (latest version)
If DUCET is not installed, it is recommended to copy the file from http://www.unicode.org/Public/UCA/latest/allkeys.txt to <a place in @INC>/Unicode/Collate/allkeys.txt manually.
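To check whether a table file is already present at the location described above, one can walk @INC and look for Unicode/Collate/allkeys.txt. The helper name "find_collation_tables" is hypothetical and not part of the module. Finding nothing does not by itself mean the module is unusable, since some installations may provide the table data in another form.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return every Unicode/Collate/allkeys.txt found in @INC.
sub find_collation_tables {
    my @found;
    for my $dir (grep { !ref } @INC) {   # skip code-reference hooks in @INC
        my $path = File::Spec->catfile($dir, 'Unicode', 'Collate', 'allkeys.txt');
        push @found, $path if -f $path;
    }
    return @found;
}

my @tables = find_collation_tables();
if (@tables) {
    print "found: $_\n" for @tables;
} else {
    print "no allkeys.txt found under \@INC\n";
}
```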
CAVEATS¶
- Normalization
- Use of the "normalization" parameter requires the
Unicode::Normalize module (see Unicode::Normalize).
- Conformance Test
- The Conformance Test for the UCA is available under
<http://www.unicode.org/Public/UCA/>.
AUTHOR, COPYRIGHT AND LICENSE¶
The Unicode::Collate module for perl was written by SADAHIRO Tomoyuki, <SADAHIRO@cpan.org>. This module is Copyright(C) 2001-2011, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The file Unicode/Collate/allkeys.txt was copied verbatim from <http://www.unicode.org/Public/UCA/6.0.0/allkeys.txt>. This file is Copyright (c) 1991-2010 Unicode, Inc. All rights reserved. Distributed under the Terms of Use in <http://www.unicode.org/copyright.html>.
SEE ALSO¶
- Unicode Collation Algorithm - UTS #10
- <http://www.unicode.org/reports/tr10/>
- The Default Unicode Collation Element Table (DUCET)
- <http://www.unicode.org/Public/UCA/latest/allkeys.txt>
- The conformance test for the UCA
- <http://www.unicode.org/Public/UCA/latest/CollationTest.html>
- Hangul Syllable Type
- <http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt>
- Unicode Normalization Forms - UAX #15
- <http://www.unicode.org/reports/tr15/>
- Unicode Locale Data Markup Language (LDML) - UTS #35
- <http://www.unicode.org/reports/tr35/>
2011-09-26 | perl v5.14.2 |