NAME
kytea — a word segmentation/pronunciation estimation tool
SYNOPSIS
train-kytea [options]
DESCRIPTION
This manual page briefly documents the train-kytea command.
This manual page was written for the Debian distribution because the original
program does not have a manual page. Instead, it has documentation in the GNU
Info format; see below.
KyTea is a morphological analysis system based on pointwise predictors. It
separates sentences into words, tags them, and predicts their pronunciations.
"KyTea" is pronounced the same as "cutie".
OPTIONS
A summary of options is included below.
- -encode
- The text encoding to be used (utf8/euc/sjis; default: utf8)
- -full
- A fully annotated training corpus (multiple possible)
- -tok
- A training corpus that is tokenized with no tags (multiple possible)
- -part
- A partially annotated training corpus (multiple possible)
- -conf
- A confidence annotated training corpus (multiple possible)
- -feat
- A file containing features generated by -featout
- -dict
- A dictionary file (one 'word/pron' entry per line, multiple possible)
- -subword
- A file of subword units. This enables pronunciation estimation (PE) for unknown words.
- -model
- The file to write the trained model to
- -modtext
- Print a text model (instead of the default binary)
- -featout
- Write the features used in training the model to this file
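As an illustration of how the options above combine, the following Python sketch assembles a train-kytea command line from a fully annotated corpus, an optional dictionary, and an output model path. It only builds and prints the command and does not run it. The file names are hypothetical, and the sketch assumes that "multiple possible" means the option may be repeated once per file.

```python
import shlex


def build_train_command(full_corpora, model_path, dictionaries=(), encoding="utf8"):
    """Assemble an argv list for train-kytea from the options listed above."""
    argv = ["train-kytea", "-encode", encoding]
    for path in full_corpora:
        # Assumption: -full is repeated once per corpus file.
        argv += ["-full", path]
    for path in dictionaries:
        # Assumption: -dict is repeated once per dictionary file.
        argv += ["-dict", path]
    argv += ["-model", model_path]
    return argv


# Hypothetical file names, for illustration only.
command = build_train_command(
    ["corpus.full"], "model.bin", dictionaries=["words.dict"]
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run on a system where train-kytea is installed.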
Model Training Options (basic)
- -nows
- Don't train a word segmentation model
- -notags
- Skip the training of tagging, do only word segmentation
- -global
- Train the nth tag with a global model (good for POS, bad for PE)
- -debug
- The debugging level during training (0=silent, 1=normal, 2=detailed)
Model Training Options (for advanced users)
- -charw
- The character window to use for WS (3)
- -charn
- The character n-gram length to use for WS (3)
- -typew
- The character type window to use for WS (3)
- -typen
- The character type n-gram length to use for WS (3)
- -dictn
- Dictionary words longer than -dictn characters will be grouped together (4)
- -unkn
- Language model n-gram order for unknown words (3)
- -eps
- The epsilon stopping criterion for classifier training
- -cost
- The cost hyperparameter for classifier training
- -nobias
- Don't use a bias value in classifier training
- -solver
- The solver (1=SVM, 7=logistic regression, etc.; default 1, see LIBLINEAR documentation for more details)
- -wordbound
- The separator for words in full annotation (" ")
- -tagbound
- The separator for tags in full/partial annotation ("/")
- -elembound
- The separator for candidates in full/partial annotation ("&")
- -unkbound
- Indicates unannotated boundaries in partial annotation (" ")
- -skipbound
- Indicates skipped boundaries in partial annotation ("?")
- -nobound
- Indicates non-existence of boundaries in partial annotation ("-")
- -hasbound
- Indicates existence of boundaries in partial annotation ("|")
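To make the separator options above more concrete, the following Python sketch formats one fully annotated line and one partially annotated line using the default separators listed here. The data layout (a surface form followed by one list of candidates per tag, and one boundary marker between each pair of adjacent characters) is an assumption about the corpus format, not something this page specifies, and the helper names are hypothetical.

```python
# Default markers taken from -hasbound, -nobound, -unkbound and -skipbound.
BOUNDARY_MARKERS = {"has": "|", "no": "-", "unk": " ", "skip": "?"}


def format_full(words, wordbound=" ", tagbound="/", elembound="&"):
    """Format (surface, [candidates per tag]) pairs as one fully annotated line."""
    items = []
    for surface, tags in words:
        # Candidates for one tag are joined with elembound, tags with tagbound.
        fields = [surface] + [elembound.join(candidates) for candidates in tags]
        items.append(tagbound.join(fields))
    return wordbound.join(items)


def format_partial(chars, boundaries, markers=BOUNDARY_MARKERS):
    """Interleave characters with one boundary marker per adjacent pair."""
    if len(boundaries) != len(chars) - 1:
        raise ValueError("need exactly one boundary label between adjacent characters")
    pieces = [chars[0]]
    for char, label in zip(chars[1:], boundaries):
        pieces.append(markers[label])
        pieces.append(char)
    return "".join(pieces)


full_line = format_full([
    ("京都", [["名詞"], ["きょうと"]]),
    ("に", [["助詞"], ["に"]]),
])
partial_line = format_partial("これは文", ["no", "has", "unk"])
print(full_line)
print(partial_line)
```

If a corpus uses different separators, the corresponding keyword arguments play the same role as -wordbound, -tagbound and -elembound.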
AUTHOR
This manual page was written by Koichi Akabe <vbkaisetsu@gmail.com> for the
Debian system (and may be used by others). Permission is granted to
copy, distribute and/or modify this document under the terms of the GNU
General Public License, Version 2 or any later version published by the Free
Software Foundation.
On Debian systems, the complete text of the GNU General Public License can be
found in /usr/share/common-licenses/GPL.