.TH PHONETISAURUS "1" "February 2013" "phonetisaurus 0.7.8" "User Commands"
.SH NAME
phonetisaurus\-align \- Dictionary aligner
.SH SYNOPSIS
\fBphonetisaurus\-align\fR \-\-input=\fIdictionary.bsf\fR \-\-ofile=\fItraining.corpus\fR [\fIOPTIONS\fR]
.SH DESCRIPTION
\fBphonetisaurus\-align\fR reads an input dictionary and produces an aligned corpus that can be used to train a model for Grapheme\-to\-Phoneme conversion.
.SH INPUT FORMAT
The input is a two-column plain-text file.
The first column contains a grapheme sequence (e.g., the orthographic form of a word).
The second column contains the corresponding phoneme sequence.
.PP
By default the two columns are separated by a TAB character (the separator can be changed with the \fB\-\-delim\fR option).
Each character of the first column is treated as one grapheme unless a grapheme separator is specified with \fB\-\-s1_char_delim\fR.
Phonemes in the second column are separated by spaces (the phoneme separator can be changed with \fB\-\-s2_char_delim\fR).
.PP
Input example:
.IP
ABBREVIATE	AH B R IY V IY EY T
.SH OPTIONS
.HP
\fB\-\-help\fR=<\fIbool\fR> (default: false)
.IP
show usage information
.HP
\fB\-\-helpshort\fR=<\fIbool\fR> (default: false)
.IP
show brief usage information
.HP
\fB\-\-tmpdir\fR=<\fIstring\fR> (default: "/tmp/")
.IP
temporary directory
.HP
\fB\-\-v\fR=<\fIint32\fR> (default: 0)
.IP
verbose level
.HP
\fB\-\-fst_align\fR=<\fIbool\fR> (default: false)
.IP
Write FST data aligned where appropriate
.HP
\fB\-\-fst_default_cache_gc\fR=<\fIbool\fR> (default: true)
.IP
Enable garbage collection of cache
.HP
\fB\-\-fst_default_cache_gc_limit\fR=<\fIint64\fR> (default: 1048576)
.IP
Cache byte size that triggers garbage collection
.HP
\fB\-\-fst_verify_properties\fR=<\fIbool\fR> (default: false)
.IP
Verify fst properties queried by TestProperties
.HP
\fB\-\-fst_weight_parentheses\fR=<\fIstring\fR> (default: "")
.IP
Characters enclosing the first weight of a printed composite weight (e.g. pair weight, tuple weight and derived classes) to ensure proper I/O of nested composite weights; must have size 0 (none) or 2 (open and close parenthesis)
.HP
\fB\-\-fst_weight_separator\fR=<\fIstring\fR> (default: "")
.IP
Character separator between printed composite weights; must be a single character
.HP
\fB\-\-save_relabel_ipairs\fR=<\fIstring\fR> (default: "")
.IP
Save input relabel pairs to file
.HP
\fB\-\-save_relabel_opairs\fR=<\fIstring\fR> (default: "")
.IP
Save output relabel pairs to file
.HP
\fB\-\-delim\fR=<\fIstring\fR> (default: " ")
.IP
Delimiter used to separate input and output tokens.
.HP
\fB\-\-eps\fR=<\fIstring\fR> (default: "")
.IP
Epsilon symbol.
.HP
\fB\-\-fb\fR=<\fIbool\fR> (default: false)
.IP
Use forward\-backward pruning for the alignment lattices.
.HP
\fB\-\-input\fR=<\fIstring\fR> (default: "")
.IP
Two\-column input file to align.
.HP
\fB\-\-iter\fR=<\fIint32\fR> (default: 11)
.IP
Maximum number of EM iterations to perform.
.HP
\fB\-\-lattice\fR=<\fIbool\fR> (default: false)
.IP
Write out the alignment lattices as an fst archive (.far).
.HP
\fB\-\-model\fR=<\fIbool\fR> (default: true)
.IP
Load a pre\-trained model for use.
.HP
\fB\-\-mbr\fR=<\fIbool\fR> (default: false)
.IP
Use the LMBR decoder (not yet implemented).
.HP
\fB\-\-model_file\fR=<\fIstring\fR> (default: "")
.IP
FST\-format alignment model to load.
.HP
\fB\-\-nbest\fR=<\fIint32\fR> (default: 1)
.IP
Output the N\-best alignments given the model.
.HP
\fB\-\-ofile\fR=<\fIstring\fR> (default: "")
.IP
Output file to write the aligned dictionary to.
.HP
\fB\-\-penalize\fR=<\fIbool\fR> (default: true)
.IP
Penalize scores.
.HP
\fB\-\-penalize_em\fR=<\fIbool\fR> (default: false)
.IP
Penalize links during EM training.
.HP
\fB\-\-pthresh\fR=<\fIdouble\fR> (default: \-99)
.IP
Pruning threshold. Use to prune unlikely N\-best candidates when using multiple alignments.
.HP
\fB\-\-restrict\fR=<\fIbool\fR> (default: true)
.IP
Restrict links to M\-1, 1\-N during initialization.
.HP
\fB\-\-s1_char_delim\fR=<\fIstring\fR> (default: "")
.IP
Sequence one input delimiter.
.HP
\fB\-\-s1s2_sep\fR=<\fIstring\fR> (default: "}")
.IP
Token used to separate input\-output subsequences in the g2p model.
.HP
\fB\-\-s2_char_delim\fR=<\fIstring\fR> (default: " ")
.IP
Sequence two input delimiter.
.HP
\fB\-\-seq1_del\fR=<\fIbool\fR> (default: true)
.IP
Allow deletions in sequence one.
.HP
\fB\-\-seq1_max\fR=<\fIint32\fR> (default: 2)
.IP
Maximum subsequence length for sequence one.
.HP
\fB\-\-seq1_sep\fR=<\fIstring\fR> (default: "|")
.IP
Multi\-token separator for input tokens.
.HP
\fB\-\-seq2_del\fR=<\fIbool\fR> (default: true)
.IP
Allow deletions in sequence two.
.HP
\fB\-\-seq2_max\fR=<\fIint32\fR> (default: 2)
.IP
Maximum subsequence length for sequence two.
.HP
\fB\-\-seq2_sep\fR=<\fIstring\fR> (default: "|")
.IP
Multi\-token separator for output tokens.
.HP
\fB\-\-skip\fR=<\fIstring\fR> (default: "_")
.IP
Skip token used to represent null transitions.
Distinct from epsilon.
.HP
\fB\-\-thresh\fR=<\fIdouble\fR> (default: 1e\-10)
.IP
Delta threshold for EM training termination.
.HP
\fB\-\-write_model\fR=<\fIstring\fR> (default: "")
.IP
Write out the alignment model in OpenFst format to filename.
.HP
\fB\-\-fst_compat_symbols\fR=<\fIbool\fR> (default: true)
.IP
Require symbol tables to match when appropriate
.HP
\fB\-\-fst_field_separator\fR=<\fIstring\fR> (default: " ")
.IP
Set of characters used as a separator between printed fields
.HP
\fB\-\-fst_error_fatal\fR=<\fIbool\fR> (default: true)
.IP
FST errors are fatal; o.w. return objects flagged as bad: e.g., FSTs \- kError prop. true, FST weights \- not a Member()
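.SH EXAMPLES
The invocations below are illustrative sketches assembled only from the synopsis and the options documented above.
The file names \fIdictionary.bsf\fR and \fItraining.corpus\fR are the placeholders used in the synopsis, \fIalign.model.fst\fR is a hypothetical name chosen for illustration, and the resulting alignments depend on the input dictionary.
.PP
Align a TAB\-separated dictionary with the default settings:
.IP
\fBphonetisaurus\-align\fR \-\-input=\fIdictionary.bsf\fR \-\-ofile=\fItraining.corpus\fR
.PP
Raise the maximum subsequence length for sequence one to three and disable deletions in sequence one:
.IP
\fBphonetisaurus\-align\fR \-\-input=\fIdictionary.bsf\fR \-\-ofile=\fItraining.corpus\fR \-\-seq1_max=3 \-\-seq1_del=false
.PP
Also write the alignment model to a file, assuming the hypothetical output name \fIalign.model.fst\fR:
.IP
\fBphonetisaurus\-align\fR \-\-input=\fIdictionary.bsf\fR \-\-ofile=\fItraining.corpus\fR \-\-write_model=\fIalign.model.fst\fR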