.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.40)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "Unicode::GCString 3pm"
.TH Unicode::GCString 3pm "2020-11-09" "perl v5.32.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
Unicode::GCString \- String as Sequence of UAX #29 Grapheme Clusters
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 2
\&    use Unicode::GCString;
\&    $gcstring = Unicode::GCString\->new($string);
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Unicode::GCString treats a Unicode string as a sequence of
\&\fIextended grapheme clusters\fR as defined by Unicode Standard Annex #29 [\s-1UAX\s0 #29].
.PP
A \fBgrapheme cluster\fR is a sequence of one or more Unicode characters that
consists of one \fBgrapheme base\fR and optional \fBgrapheme extender\fR and/or
\&\fB“prepend” character\fR.
It is close to what people consider a \fIcharacter\fR.
.SS "Public Interface"
.IX Subsection "Public Interface"
\fIConstructors\fR
.IX Subsection "Constructors"
.IP "new (\s-1STRING,\s0 [\s-1KEY\s0 => \s-1VALUE, ...\s0])" 4
.IX Item "new (STRING, [KEY => VALUE, ...])"
.PD 0
.IP "new (\s-1STRING,\s0 [\s-1LINEBREAK\s0])" 4
.IX Item "new (STRING, [LINEBREAK])"
.PD
\&\fIConstructor\fR.
Creates a new grapheme cluster string (Unicode::GCString object) from the
Unicode string \s-1STRING.\s0
.Sp
For the optional \s-1KEY\s0 => \s-1VALUE\s0 pairs, see \*(L"Options\*(R" in Unicode::LineBreak.
In the second form, the Unicode::LineBreak object \s-1LINEBREAK\s0 controls
breaking features.
.Sp
\&\fBNote\fR:
The first form was introduced in release 2012.10.
.IP "copy" 4
.IX Item "copy"
\&\fICopy constructor\fR.
Creates a copy of the grapheme cluster string.
The next position of the new string is set to the beginning.
.PP
\fISizes\fR
.IX Subsection "Sizes"
.IP "chars" 4
.IX Item "chars"
\&\fIInstance method\fR.
Returns the number of Unicode characters the grapheme cluster string includes, i.e.
its length as a Unicode string.
.IP "columns" 4
.IX Item "columns"
\&\fIInstance method\fR.
Returns the total number of columns of the grapheme clusters, as defined by
the built-in character database.
For more details see \*(L"\s-1DESCRIPTION\*(R"\s0 in Unicode::LineBreak.
.IP "length" 4
.IX Item "length"
\&\fIInstance method\fR.
Returns the number of grapheme clusters contained in the grapheme cluster string.
.PP
\fIOperations as String\fR
.IX Subsection "Operations as String"
.IP "as_string" 4
.IX Item "as_string"
.PD 0
.ie n .IP """""""\s-1OBJECT\s0""""""" 4
.el .IP "\f(CW``\fR\s-1OBJECT\s0\f(CW''\fR" 4
.IX Item """OBJECT"""
.PD
\&\fIInstance method\fR.
Converts the grapheme cluster string to a Unicode string explicitly.
.IP "cmp (\s-1STRING\s0)" 4
.IX Item "cmp (STRING)"
.PD 0
.ie n .IP "\s-1STRING\s0 ""cmp"" \s-1STRING\s0" 4
.el .IP "\s-1STRING\s0 \f(CWcmp\fR \s-1STRING\s0" 4
.IX Item "STRING cmp STRING"
.PD
\&\fIInstance method\fR.
Compares strings. There are no oddities.
One of the two STRINGs may be a plain Unicode string.
.IP "concat (\s-1STRING\s0)" 4
.IX Item "concat (STRING)"
.PD 0
.ie n .IP "\s-1STRING\s0 ""."" \s-1STRING\s0" 4
.el .IP "\s-1STRING\s0 \f(CW.\fR \s-1STRING\s0" 4
.IX Item "STRING . STRING"
.PD
\&\fIInstance method\fR.
Concatenates STRINGs. One of the two STRINGs may be a plain Unicode string.
Note that the number of columns (see \fBcolumns()\fR) or grapheme clusters
(see \fBlength()\fR) of the resulting string is not always equal to the sum
of those of both strings.
The next position of the new string is the one set on the left value.
.IP "join ([\s-1STRING, ...\s0])" 4
.IX Item "join ([STRING, ...])"
\&\fIInstance method\fR.
Joins STRINGs, inserting the grapheme cluster string between them.
Any of the STRINGs may be a Unicode string.
.IP "substr (\s-1OFFSET,\s0 [\s-1LENGTH,\s0 [\s-1REPLACEMENT\s0]])" 4
.IX Item "substr (OFFSET, [LENGTH, [REPLACEMENT]])"
\&\fIInstance method\fR.
Returns a substring of the grapheme cluster string.
\&\s-1OFFSET\s0 and \s-1LENGTH\s0 are based on grapheme clusters.
If \s-1REPLACEMENT\s0 is specified, the substring is replaced by it.
\&\s-1REPLACEMENT\s0 may be a Unicode string.
.Sp
Note:
This method cannot return an lvalue, unlike the built-in \fBsubstr()\fR.
.PP
\fIOperations as Sequence of Grapheme Clusters\fR
.IX Subsection "Operations as Sequence of Grapheme Clusters"
.IP "as_array" 4
.IX Item "as_array"
.PD 0
.ie n .IP """@{""\s-1OBJECT\s0""}""" 4
.el .IP "\f(CW@{\fR\s-1OBJECT\s0\f(CW}\fR" 4
.IX Item "@{OBJECT}"
.IP "as_arrayref" 4
.IX Item "as_arrayref"
.PD
\&\fIInstance method\fR.
Converts the grapheme cluster string to an array of grapheme clusters.
.IP "eos" 4
.IX Item "eos"
\&\fIInstance method\fR.
Tests whether the current position is at the end of the grapheme cluster string.
.IP "item ([\s-1OFFSET\s0])" 4
.IX Item "item ([OFFSET])"
\&\fIInstance method\fR.
Returns the OFFSET-th grapheme cluster.
If \s-1OFFSET\s0 is not specified, returns the next grapheme cluster.
.IP "next" 4
.IX Item "next"
.PD 0
.ie n .IP """<""\s-1OBJECT\s0"">""" 4
.el .IP "\f(CW<\fR\s-1OBJECT\s0\f(CW>\fR" 4
.IX Item "<OBJECT>"
.PD
\&\fIInstance method\fR, iterative.
Returns the next grapheme cluster and increments the next position.
.IP "pos ([\s-1OFFSET\s0])" 4
.IX Item "pos ([OFFSET])"
\&\fIInstance method\fR.
If the optional \s-1OFFSET\s0 is specified, sets the next position to it.
Returns the next position of the grapheme cluster string.
.PP
\fIMiscellaneous\fR
.IX Subsection "Miscellaneous"
.IP "lbc" 4
.IX Item "lbc"
\&\fIInstance method\fR.
Returns the Line Breaking Class (see Unicode::LineBreak) of the first
character of the first grapheme cluster.
.IP "lbcext" 4
.IX Item "lbcext"
\&\fIInstance method\fR.
Returns the Line Breaking Class (see Unicode::LineBreak) of the last
grapheme extender of the last grapheme cluster.
If there are no grapheme extenders or its class is \s-1CM,\s0 the value of
the last grapheme base is returned.
.SH "CAVEATS"
.IX Header "CAVEATS"
.IP "\(bu" 4
The grapheme cluster should not be referred to as \*(L"grapheme\*(R"
even though Larry does.
.IP "\(bu" 4
On Perl versions around 5.10.1, implicit conversion from a Unicode::GCString
object to a Unicode string sometimes confuses the
\&\f(CW"utf8_mg_pos_cache_update"\fR cache.
.Sp
For example, instead of doing
.Sp
.Vb 1
\&    $sub = substr($gcstring, $i, $j);
.Ve
.Sp
do either of the following:
.Sp
.Vb 1
\&    $sub = substr("$gcstring", $i, $j);
\&
\&    $sub = substr($gcstring\->as_string, $i, $j);
.Ve
.IP "\(bu" 4
This module implements the \fIdefault\fR algorithm for determining grapheme
cluster boundaries.
A tailoring mechanism is not supported yet.
.SH "VERSION"
.IX Header "VERSION"
Consult the \f(CW$VERSION\fR variable.
.SS "Incompatible Changes"
.IX Subsection "Incompatible Changes"
.IP "Release 2013.10" 4
.IX Item "Release 2013.10"
.RS 4
.PD 0
.IP "\(bu" 4
.PD
The \fBnew()\fR method can take a non-Unicode string argument.
In this case the string is decoded as iso\-8859\-1 (Latin 1).
In former releases this method would die on non-Unicode input.
.RE
.RS 4
.RE
.SH "SEE ALSO"
.IX Header "SEE ALSO"
[\s-1UAX\s0 #29]
Mark Davis (ed.) (2009\-2013).
\&\fIUnicode Standard Annex #29: Unicode Text Segmentation\fR, Revisions 15\-23.
.
.SH "AUTHOR"
.IX Header "AUTHOR"
Hatuka*nezumi \- \s-1IKEDA\s0 Soji
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright (C) 2009\-2013 Hatuka*nezumi \- \s-1IKEDA\s0 Soji.
.PP
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
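.SH "EXAMPLES"
.IX Header "EXAMPLES"
The following is a minimal sketch, not part of the original documentation,
of how the size and iteration methods described under \*(L"Public Interface\*(R"
might be combined.
The sample string and the variable names \f(CW$gcstring\fR, \f(CW$count\fR
and \f(CW$sub\fR are assumptions chosen for illustration only.
.PP
.Vb 4
\&    use strict;
\&    use warnings;
\&    use feature "say";
\&    use Unicode::GCString;
\&
\&    # Assumed sample: "e", then U+0301 COMBINING ACUTE ACCENT, then "a".
\&    my $gcstring = Unicode::GCString\->new("e" . chr(0x301) . "a");
\&
\&    say $gcstring\->chars;      # number of Unicode characters
\&    say $gcstring\->length;     # number of grapheme clusters
\&    say $gcstring\->columns;    # number of columns
\&
\&    # Walk the string one grapheme cluster at a time and count the steps.
\&    my $count = 0;
\&    $gcstring\->pos(0);
\&    until ($gcstring\->eos) {
\&        $gcstring\->next;
\&        $count++;
\&    }
\&    say $count;
\&
\&    # Convert explicitly before calling the built\-in substr (see CAVEATS).
\&    my $sub = substr($gcstring\->as_string, 0, 2);
.Ve
.PP
Because a combining mark attaches to its base, the values reported by
\&\fBchars()\fR and \fBlength()\fR may differ for such input; the loop count
is expected to match \fBlength()\fR.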