NAME¶

kytea — a word segmentation/pronunciation estimation tool

SYNOPSIS¶

train-kytea [options]

DESCRIPTION¶

This manual page briefly documents the train-kytea command.

This manual page was written for the Debian distribution because the original program does not have a manual page. Instead, it has documentation in the GNU Info format; see below.

KyTea is a morphological analysis system based on pointwise predictors. It segments sentences into words, tags them, and predicts their pronunciations. The name KyTea is pronounced like "cutie".

OPTIONS¶

A summary of options is included below.

Input/Output Options:¶

-encode: The text encoding to be used (utf8/euc/sjis; default: utf8)

-full: A fully annotated training corpus (multiple possible)

-tok: A training corpus that is tokenized with no tags (multiple possible)

-part: A partially annotated training corpus (multiple possible)

-conf: A confidence annotated training corpus (multiple possible)

-feat: A file containing features generated by -featout

-dict: A dictionary file (one 'word/pron' entry per line, multiple possible)

-subword: A file of subword units. This enables pronunciation estimation (PE) for unknown words.

-model: The file to write the trained model to

-modtext: Print a text model (instead of the default binary)

-featout: Write the features used in training the model to this file
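
As an illustration of how the input/output options combine, the following Python sketch assembles a train-kytea command line from a fully annotated corpus, a dictionary, and an output model path. The file names and the helper build_train_command are hypothetical; only the option names come from the list above. The command is printed rather than executed, so the sketch does not require KyTea to be installed.

```python
import shlex


def build_train_command(full_corpora, model_path, dictionaries=(), encoding="utf8"):
    """Assemble an argv list for train-kytea (hypothetical helper)."""
    argv = ["train-kytea", "-encode", encoding]
    # -full and -dict may each be given several times, once per file.
    for path in full_corpora:
        argv += ["-full", path]
    for path in dictionaries:
        argv += ["-dict", path]
    argv += ["-model", model_path]
    return argv


# Hypothetical file names, used only for illustration.
command = build_train_command(["corpus.full"], "model.bin", dictionaries=["dict.txt"])
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run on a system where train-kytea is available.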

Model Training Options (basic):¶

-nows: Don't train a word segmentation model

-notags: Don't train tagging models; train word segmentation only

-global: Train the nth tag with a global model (good for POS, bad for PE)

-debug: The debugging level during training (0=silent, 1=normal, 2=detailed)

Model Training Options (for advanced users):¶

-charw: The character window to use for WS (3)

-charn: The character n-gram length to use for WS (3)

-typew: The character type window to use for WS (3)

-typen: The character type n-gram length to use for WS (3)

-dictn: Dictionary words longer than -dictn characters will be grouped together (4)

-unkn: Language model n-gram order for unknown words (3)

-eps: The epsilon stopping criterion for classifier training

-cost: The cost hyperparameter for classifier training

-nobias: Don't use a bias value in classifier training

-solver: The solver (1=SVM, 7=logistic regression, etc.; default 1, see LIBLINEAR documentation for more details)
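
To make the window and n-gram length options more concrete, the Python sketch below collects character n-grams from a window around one candidate word boundary. It is an interpretation for illustration only, not KyTea's actual feature extractor: it assumes that the window counts characters on each side of the boundary and that n-grams of every length up to the n-gram length are used. The function name window_ngrams and its parameters are hypothetical.

```python
def window_ngrams(text, boundary, window=3, max_n=3):
    """Collect character n-grams around the gap before text[boundary].

    Returns (offset, ngram) pairs, where offset is the n-gram's start
    position relative to the boundary (negative means left of it).
    """
    # Assumption: the window spans `window` characters on each side.
    left = max(0, boundary - window)
    right = min(len(text), boundary + window)
    features = []
    for n in range(1, max_n + 1):
        for start in range(left, right - n + 1):
            features.append((start - boundary, text[start:start + n]))
    return features


# Candidate boundary between "abc" and "def", with a smaller window
# and n-gram length than the defaults to keep the output short.
feats = window_ngrams("abcdef", boundary=3, window=2, max_n=2)
print(feats)
```

Under these assumptions, larger values of -charw and -charn would give the classifier more context per boundary at the cost of more features.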

Format Options (for advanced users):¶

-wordbound: The separator for words in full annotation (" ")

-tagbound: The separator for tags in full/partial annotation ("/")

-elembound: The separator for candidates in full/partial annotation ("&")

-unkbound: Indicates unannotated boundaries in partial annotation (" ")

-skipbound: Indicates skipped boundaries in partial annotation ("?")

-nobound: Indicates non-existence of boundaries in partial annotation ("-")

-hasbound: Indicates existence of boundaries in partial annotation ("|")
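
The default separators above can be illustrated with a short Python sketch that writes one fully annotated line and one partially annotated line. The layout is an assumption based on the option descriptions: each word is followed by its tags, joined by the tag separator, and a partially annotated line carries one boundary mark between each adjacent pair of characters. The helper names and the sample data are hypothetical.

```python
# Default separators, taken from the format options above.
WORD_BOUND = " "
TAG_BOUND = "/"
HAS_BOUND = "|"
NO_BOUND = "-"
UNK_BOUND = " "


def full_line(words):
    """Join (word, tags) pairs into one fully annotated line."""
    return WORD_BOUND.join(TAG_BOUND.join([word, *tags]) for word, tags in words)


def partial_line(chars, boundaries):
    """Interleave characters with one boundary mark per adjacent pair.

    boundaries[i] describes the gap between chars[i] and chars[i + 1]:
    True = boundary, False = no boundary, None = unannotated.
    """
    if len(boundaries) != len(chars) - 1:
        raise ValueError("need exactly one boundary label per adjacent pair")
    marks = {True: HAS_BOUND, False: NO_BOUND, None: UNK_BOUND}
    pieces = [chars[0]]
    for mark, ch in zip(boundaries, chars[1:]):
        pieces.append(marks[mark] + ch)
    return "".join(pieces)


print(full_line([("new", ["JJ"]), ("words", ["NNS"])]))
print(partial_line("abcd", [False, True, None]))
```

If a corpus already uses one of these characters inside words or tags, the corresponding option can be set to a different separator.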

AUTHOR¶

This manual page was written by Koichi Akabe <vbkaisetsu@gmail.com> for the Debian system (and may be used by others). Permission is granted to copy, distribute and/or modify this document under the terms of the GNU General Public License, Version 2 or any later version published by the Free Software Foundation.

On Debian systems, the complete text of the GNU General Public License can be found in /usr/share/common-licenses/GPL.

Source file:	train-kytea.1.en.gz (from kytea 0.4.6+dfsg-2)
Source last updated:	2013-09-15T21:38:34Z
Converted to HTML:	2018-08-09T23:11:08Z