.TH "TRAIN-KYTEA" "1"
.SH "NAME"
train\-kytea \- train models for KyTea, a word segmentation/pronunciation estimation tool
.SH "SYNOPSIS"
.PP
\fBtrain\-kytea\fR [\fBoptions\fP]
.SH "DESCRIPTION"
.PP
This manual page briefly documents the \fBtrain\-kytea\fR command.
.PP
This manual page was written for the \fBDebian\fP distribution because the original program does not have a manual page. Instead, it has documentation in the GNU \fBInfo\fP format.
.PP
\fBkytea\fR is a morphological analysis system based on pointwise predictors. It separates sentences into words, tags them, and predicts their pronunciations. KyTea is pronounced the same as "cutie".
.SH "OPTIONS"
.PP
A summary of options is included below.
.SS "Input/Output Options"
.IP "\fB\-encode\fP" 11
The text encoding to be used (utf8/euc/sjis; default: utf8)
.IP "\fB\-full\fP" 11
A fully annotated training corpus (multiple possible)
.IP "\fB\-tok\fP" 11
A training corpus that is tokenized with no tags (multiple possible)
.IP "\fB\-part\fP" 11
A partially annotated training corpus (multiple possible)
.IP "\fB\-conf\fP" 11
A confidence annotated training corpus (multiple possible)
.IP "\fB\-feat\fP" 11
A file containing features generated by \-featout
.IP "\fB\-dict\fP" 11
A dictionary file (one 'word/pron' entry per line, multiple possible)
.IP "\fB\-subword\fP" 11
A file of subword units. This enables pronunciation estimation (PE) for unknown words.
.IP "\fB\-model\fP" 11
The file to write the trained model to
.IP "\fB\-modtext\fP" 11
Print a text model (instead of the default binary)
.IP "\fB\-featout\fP" 11
Write the features used in training the model to this file
.SS "Model Training Options (basic)"
.IP "\fB\-nows\fP" 11
Don't train a word segmentation model
.IP "\fB\-notags\fP" 11
Skip tagger training and train only word segmentation
.IP "\fB\-global\fP" 11
Train the nth tag with a global model (good for part\-of\-speech tagging, bad for pronunciation estimation)
.IP "\fB\-debug\fP" 11
The debugging level during training (0=silent, 1=normal, 2=detailed)
.SS "Model Training Options (for advanced users)"
.IP "\fB\-charw\fP" 11
The character window to use for word segmentation (default: 3)
.IP "\fB\-charn\fP" 11
The character n\-gram length to use for word segmentation (default: 3)
.IP "\fB\-typew\fP" 11
The character type window to use for word segmentation (default: 3)
.IP "\fB\-typen\fP" 11
The character type n\-gram length to use for word segmentation (default: 3)
.IP "\fB\-dictn\fP" 11
Dictionary words longer than \-dictn characters will be grouped together (default: 4)
.IP "\fB\-unkn\fP" 11
Language model n\-gram order for unknown words (default: 3)
.IP "\fB\-eps\fP" 11
The epsilon stopping criterion for classifier training
.IP "\fB\-cost\fP" 11
The cost hyperparameter for classifier training
.IP "\fB\-nobias\fP" 11
Don't use a bias value in classifier training
.IP "\fB\-solver\fP" 11
The solver (1=SVM, 7=logistic regression, etc.; default: 1; see the LIBLINEAR documentation for more details)
.SS "Format Options (for advanced users)"
.IP "\fB\-wordbound\fP" 11
The separator for words in full annotation (default: " ")
.IP "\fB\-tagbound\fP" 11
The separator for tags in full/partial annotation (default: "/")
.IP "\fB\-elembound\fP" 11
The separator for candidates in full/partial annotation (default: "&")
.IP "\fB\-unkbound\fP" 11
Indicates unannotated boundaries in partial annotation (default: " ")
.IP "\fB\-skipbound\fP" 11
Indicates skipped boundaries in partial annotation (default: "?")
.IP "\fB\-nobound\fP" 11
Indicates non\-existence of boundaries in partial annotation (default: "\-")
.IP "\fB\-hasbound\fP" 11
Indicates existence of boundaries in partial annotation (default: "|")
.PP
.RE
.SH "AUTHOR"
.PP
This manual page was written by Koichi Akabe <vbkaisetsu@gmail.com> for the \fBDebian\fP system (and may be used by others). Permission is granted to copy, distribute and/or modify this document under the terms of the GNU General Public License, Version 2 or any later version published by the Free Software Foundation.
.PP
On Debian systems, the complete text of the GNU General Public License can be found in /usr/share/common\-licenses/GPL.
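.SH "EXAMPLES"
.PP
The default separators for full annotation (\-wordbound " ", \-tagbound "/", \-elembound "&") can be illustrated with a short reader\-side sketch in Python. This is not KyTea's own parser: the function name, the romanized sample line, and the assumption that no escaped separators occur are all hypothetical and serve only to show how a fully annotated line breaks down into words, tag levels, and candidates.
.PP
.nf
```python
# Reader-side sketch only; this is not KyTea's own parser.
# Defaults mirror the documented separators:
#   -wordbound " ", -tagbound "/", -elembound "&"

def parse_full_line(line, wordbound=" ", tagbound="/", elembound="&"):
    """Split one fully annotated line into (surface, tags) pairs.

    Each tag level is returned as a list of candidates.
    Escaped separators are not handled (assumption: none occur).
    """
    words = []
    for token in line.strip().split(wordbound):
        if not token:
            continue  # ignore empty pieces from repeated separators
        surface, *tags = token.split(tagbound)
        words.append((surface, [tag.split(elembound) for tag in tags]))
    return words


# Hypothetical romanized sample; real corpora would contain Japanese text.
sample = "kore/pron/kore wa/particle/wa hon/noun/hon&moto"
for surface, tags in parse_full_line(sample):
    print(surface, tags)
```
.fi
.PP
Passing different values for the three keyword arguments corresponds to training with non\-default \-wordbound, \-tagbound, or \-elembound settings.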