General Commands Manual

NAME¶

ucto - Unicode Tokenizer

SYNOPSIS¶

ucto [options] [input-file] [output-file]

DESCRIPTION¶

ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
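As a minimal sketch of a basic invocation, the shell function below wraps the call shown in the synopsis. The function name tokenize_file, its argument order, and the file names in the example call are assumptions made for illustration; only the -L option and the input/output positions come from this page.

```shell
# tokenize_file LANG INPUT OUTPUT
# Hypothetical helper: tokenizes INPUT into OUTPUT using the
# preconfigured rules selected by language code LANG (see -L).
tokenize_file() {
    if [ "$#" -ne 3 ]; then
        echo "usage: tokenize_file LANG INPUT OUTPUT" >&2
        return 2
    fi
    ucto -L "$1" "$2" "$3"
}

# Example call (file names are placeholders):
# tokenize_file fr input.txt output.txt
```

The wrapper only checks the argument count; whether a configuration exists for a given language code depends on the local installation.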

OPTIONS¶

-c configfile

Read settings from a file

-d value

Set debug mode to 'value'

-e value

Set input encoding (default: UTF8)

-f

Disable filtering of special characters

-L language

Automatically select a configuration file by language code, e.g. 'fr' selects the file tokconfig-fr from the installation directory

-l

Convert to all lowercase

-u

Convert to all uppercase

-n

Emit one sentence per line on output

-m

Assume one sentence per line on input

--passthru

Don't tokenize, but perform input decoding and simple token role detection

-P

Disable Paragraph Detection

-Q

Enable Quote Detection (experimental; may lead to unexpected results)

-S

Disable Sentence Detection

-s <string>

Set the end-of-sentence marker (default: <utt>)

-V

Show version information

-v

Set verbose mode

-F

Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: -nulPQvsS)

--textclass cls

When tokenizing a FoLiA XML document, search for text nodes of class 'cls'

-X

Output FoLiA XML. (this disables usage of most other options: -nulPQvsS)

--id <DocId>

Use the specified Document ID for the FoLiA XML

-x <DocId> (obsolete)

Output FoLiA XML, use the specified Document ID. (this disables usage of most other options: -nulPQvsS)

This option is obsolete. Use -X and --id instead.
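The options above can be combined on one command line. The sketch below shows two combinations, under the assumption that they behave as their individual descriptions suggest. The function names and all file names and IDs are hypothetical; the flags themselves (-L, -n, -l, -X, --id) are the ones documented in this section.

```shell
# Hypothetical helpers; names and arguments are illustrative only.

# tokenize_lines LANG INPUT OUTPUT
# Plain-text output, one sentence per line (-n), lowercased (-l).
tokenize_lines() {
    ucto -L "$1" -n -l "$2" "$3"
}

# tokenize_folia LANG DOCID INPUT OUTPUT
# FoLiA XML output (-X) with an explicit document ID (--id),
# used in place of the obsolete -x option.
# -n and -l are left out here because -X disables them.
tokenize_folia() {
    ucto -L "$1" -X --id "$2" "$3" "$4"
}

# Example calls (file names and the document ID are placeholders):
# tokenize_lines fr input.txt sentences.txt
# tokenize_folia fr mydoc input.txt output.xml
```

Keeping the plain-text and FoLiA variants in separate functions avoids passing options such as -n or -l alongside -X, where they would have no effect.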

BUGS¶

likely

AUTHORS¶

Maarten van Gompel proycon@anaproy.nl

Ko van der Sloot Timbl@uvt.nl

2013 March 6

Source file:	ucto.1.en.gz (from ucto 0.5.3-3.1+b1)
Source last updated:	2013-12-07T11:36:26Z