ucto(1) | General Commands Manual | ucto(1) |
NAME
ucto - Unicode Tokenizer
SYNOPSIS
ucto [options] [input‐file] [output‐file]
DESCRIPTION
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
OPTIONS
-c configfile
read settings from a file
-d value
set debug mode to 'value'
-e value
set input encoding (default: UTF8)
-N value
set UTF8 output normalization (default: NFC)
-f
disable filtering of special characters
-L language
Automatically select a configuration file by language code.
The language code is generally a three-letter ISO 639-3 code. For example,
'fra' will select the file tokconfig-fra from the installation
directory.
-l
Convert to all lowercase
-u
Convert to all uppercase
-n
Emit one sentence per line on output
-m
Assume one sentence per line on input
--passthru
Don't tokenize, but perform input decoding and simple
token role detection
--filterpunct
Remove most of the punctuation from the output (but not from
abbreviations).
-P
Disable Paragraph Detection
-Q
Enable Quote Detection. (this is experimental and may
lead to unexpected results)
-S
Disable Sentence Detection
-s <string>
Set End‐of‐sentence marker. (Default
<utt>)
-V
Show version information
-v
set Verbose mode
-F
Read a FoLiA XML document, tokenize it, and output the
modified doc. (this disables usage of most other options: -nulPQvsS)
--textclass cls
When tokenizing a FoLiA XML document, search for text
nodes of class 'cls'
-X
Output FoLiA XML. (this disables usage of most other
options: -nulPQvsS)
--id <DocId>
Use the specified Document ID for the FoLiA XML
-x <DocId> (obsolete)
Output FoLiA XML, use the specified Document ID. (this
disables usage of most other options: -nulPQvsS)
Obsolete: use -X and --id instead.
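The options above can be combined on one command line. The following is a minimal sketch in Python of how such command lines might be assembled. It uses only flags listed under OPTIONS. The helper name build_ucto_command, its parameters, and the file names are hypothetical and not part of ucto. It also assumes that the value of --id is passed as a separate argument, as the option list suggests. The commands are printed rather than executed, so the sketch does not need ucto to be installed.

```python
import shlex


def build_ucto_command(input_file, output_file=None, *, language=None,
                       sentence_per_line=False, folia_doc_id=None):
    """Return an argv list for ucto using flags from the OPTIONS list.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    argv = ["ucto"]
    if language is not None:
        # -L selects a configuration file by language code, e.g. 'fra'.
        argv += ["-L", language]
    if folia_doc_id is not None:
        # -X requests FoLiA XML output; the option list says it disables -n,
        # so -n is not added in this branch.
        argv += ["-X", "--id", folia_doc_id]
    elif sentence_per_line:
        # -n emits one sentence per line on output.
        argv.append("-n")
    argv.append(input_file)
    if output_file is not None:
        argv.append(output_file)
    return argv


if __name__ == "__main__":
    # Plain-text output, one sentence per line, French rules.
    print(shlex.join(build_ucto_command(
        "input.txt", "output.txt", language="fra", sentence_per_line=True)))
    # FoLiA XML output with an explicit document ID (illustrative value).
    print(shlex.join(build_ucto_command(
        "input.txt", "output.xml", language="fra", folia_doc_id="doc1")))
```

To actually run one of these commands, the returned list could be passed to subprocess.run, assuming ucto is on the PATH.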
BUGS
likely
AUTHORS
Maarten van Gompel proycon@anaproy.nl
Ko van der Sloot Timbl@uvt.nl
2014 december 2 |