'\" t
.\" Title: yaz-icu
.\" Author: Index Data
.\" Generator: DocBook XSL Stylesheets v1.79.1
.\" Date: 03/25/2020
.\" Manual: Commands
.\" Source: YAZ 5.29.0
.\" Language: English
.\"
.TH "YAZ\-ICU" "1" "03/25/2020" "YAZ 5.29.0" "Commands"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
yaz-icu \- YAZ ICU utility
.SH "SYNOPSIS"
.HP \w'\fByaz\-icu\fR\ 'u
\fByaz\-icu\fR [\-c\ \fIconfig\fR] [\-p\ \fIopt\fR] [\-s] [\-x] [infile]
.SH "DESCRIPTION"
.PP
\fByaz\-icu\fR
is a utility which demonstrates the ICU chain module of yaz\&. (yaz/icu\&.h)\&.
.PP
The utility can be used in two ways\&. It may read some text using an XML configuration for configuring ICU and show text analysis\&. This mode is triggered by option
\-c
which specifies the configuration to be used\&. The input file is read from standard input or from a file if
infile
is specified\&.
.PP
The utility may also show ICU information\&. This is triggered by option
\-p\&.
.SH "OPTIONS"
.PP
\-c \fIconfig\fR
.RS 4
Specifies the file containing ICU chain configuration which is XML based\&.
.RE
.PP
\-p \fItype\fR
.RS 4
Specifies extra information to be printed about the ICU system\&. If
\fItype\fR
is
c
then ICU converters are printed\&. If
\fItype\fR
is
l, then available locales are printed\&. If
\fItype\fR
is
t, then available transliterators are printed\&.
.RE
.PP
\-s
.RS 4
Specifies that output should include sort key as well\&. Note that sort key differs between ICU versions\&.
.RE
.PP
\-x
.RS 4
Specifies that output should be XML based rather than "text" based\&.
.RE
.SH "ICU CHAIN CONFIGURATION"
.PP
The ICU chain configuration specifies one or more rules to convert text data into tokens\&. The configuration format is XML based\&.
.PP
The toplevel element must be named
icu_chain\&. The
icu_chain
element has one required attribute
locale
which specifies the ICU locale to be used in the conversion steps\&.
.PP
The
icu_chain
element must include elements where each element specifies a conversion step\&. The conversion is performed in the order in which the conversion steps are specified\&. Each conversion element takes one attribute:
rule
which serves as argument to the conversion step\&.
.PP
The following conversion elements are available:
.PP
casemap
.RS 4
Converts case (and rule specifies how):
.PP
l
.RS 4
Lower case using ICU function u_strToLower\&.
.RE
.PP
u
.RS 4
Upper case using ICU function u_strToUpper\&.
.RE
.PP
t
.RS 4
To title using ICU function u_strToTitle\&.
.RE
.PP
f
.RS 4
Fold case using ICU function u_strFoldCase\&.
.RE
.sp
.RE
.PP
display
.RS 4
This is a meta step which specifies that a term/token is to be displayed\&. This term is retrieved in an application using function icu_chain_token_display (yaz/icu\&.h)\&.
.RE
.PP
transform
.RS 4
Specifies an ICU transform rule using a transliterator Identifier\&. The rule attribute is the transliterator Identifier\&. See
\m[blue]\fBICU Transforms\fR\m[]\&\s-2\u[1]\d\s+2
for more information\&.
.RE
.PP
transliterate
.RS 4
Specifies a rule\-based transliterator\&. The rule attribute is the custom transformation rule to be used\&. See
\m[blue]\fBICU Transforms\fR\m[]\&\s-2\u[1]\d\s+2
for more information\&.
.RE
.PP
tokenize
.RS 4
Breaks / tokenizes a string into components using ICU functions ubrk_open, ubrk_setText, \&.\&. \&. The rule is one of:
.PP
l
.RS 4
Line\&. ICU: UBRK_LINE\&.
.RE
.PP
s
.RS 4
Sentence\&. ICU: UBRK_SENTENCE\&.
.RE
.PP
w
.RS 4
Word\&. ICU: UBRK_WORD\&.
.RE
.PP
c
.RS 4
Character\&. ICU: UBRK_CHARACTER\&.
.RE
.PP
t
.RS 4
Title\&. ICU: UBRK_TITLE\&.
.RE
.sp
.RE
.PP
join
.RS 4
Joins tokens into one string\&. The rule attribute is the joining string, which may be empty\&. The join conversion element was added in YAZ 4\&.2\&.49\&.
.RE
.SH "EXAMPLES"
.PP
The following command analyzes text in file
text
using ICU chain configuration
chain\&.xml:
.sp
.if n \{\
.RS 4
.\}
.nf
cat text | yaz\-icu \-c chain\&.xml
.fi
.if n \{\
.RE
.\}
.sp
The chain\&.xml might look as follows:
.sp
.if n \{\
.RS 4
.\}
.nf
.fi
.if n \{\
.RE
.\}
.sp
.SH "SEE ALSO"
.PP
\fByaz\fR(7)
.PP
\m[blue]\fBICU Home\fR\m[]\&\s-2\u[2]\d\s+2
.PP
\m[blue]\fBICU Transforms\fR\m[]\&\s-2\u[1]\d\s+2
.SH "AUTHORS"
.PP
\fBIndex Data\fR
.SH "NOTES"
.IP " 1." 4
ICU Transforms
.RS 4
\%http://userguide.icu-project.org/transforms/general
.RE
.IP " 2." 4
ICU Home
.RS 4
\%http://www.icu-project.org/
.RE