


hfst-tokenize - perform matching/lookup on text streams


hfst-tokenize [--segment | --xerox | --cg | --giella-cg] [OPTIONS...] RULESET


perform matching/lookup on text streams

Common options:

Print help message
Print version info
Print verbosely while processing
Only print fatal errors and requested output
Alias of --quiet
Newline as input separator (default is blank line)
Print nonmatching text
Print weights (overrides earlier -W option)
Don't print weights (default; overrides an earlier -w option, including a -w implied by -g)
(by default, only one UTF-8 character is tokenized at a time, regardless of what is present in the alphabet)
Output only analyses whose weight is within B from best result
Limit search after having used S seconds per input
Output no more than N best weight classes (where analyses with equal weight constitute a class)
Remove duplicate analyses
Segmenting / tokenization mode (default)
Tokenization with one sentence per line, space-separated tokens
Xerox output
Constraint Grammar output
Ignore contents of unescaped [] (cf. apertium-destxt); flush on NUL
CG format used in Giella infrastructure (implies -w and -l2, treats @PMATCH_INPUT_MARK@ as subreading separator, expects tags to be Multichar_symbols, flush on NUL)
CoNLL-U format
FinnPos output
VISL input and output (implies -W, handles <s> as blocks and <STYLE> inline)
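
The beam, weight-class and duplicate-removal options described above can be illustrated with a small Python sketch. This is not the hfst-tokenize implementation; it only mirrors the wording of the option descriptions. The (analysis, weight) pair representation and the function names are hypothetical, and the sketch assumes that a lower weight is better, as is conventional for HFST weights.

```python
from typing import List, Tuple

# Hypothetical representation: (analysis string, weight).
# Assumption: a lower weight means a better analysis.
Analysis = Tuple[str, float]


def within_beam(analyses: List[Analysis], beam: float) -> List[Analysis]:
    """Keep analyses whose weight is within `beam` of the best weight."""
    if not analyses:
        return []
    best = min(weight for _, weight in analyses)
    return [(a, w) for a, w in analyses if w - best <= beam]


def best_weight_classes(analyses: List[Analysis], n: int) -> List[Analysis]:
    """Keep the n best weight classes; equal weights form one class."""
    kept = set(sorted({weight for _, weight in analyses})[:n])
    return [(a, w) for a, w in analyses if w in kept]


def remove_duplicates(analyses: List[Analysis]) -> List[Analysis]:
    """Drop repeated (analysis, weight) pairs, keeping first occurrences."""
    seen = set()
    unique = []
    for item in analyses:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

With the weights 0.0, 0.0, 1.5 and 4.0, a beam of 2.0 keeps the first three analyses, and a limit of two weight classes keeps the same three, because the two analyses with weight 0.0 count as a single class.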

Use standard streams for input and output (for now).
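
Because the tool reads from standard input and writes to standard output, it can be driven from a script by piping text through it. The following Python sketch shows one way to do that, using only the mode flags listed in the synopsis. The helper names and the ruleset path are hypothetical, and the sketch assumes hfst-tokenize is installed and on the PATH.

```python
import subprocess
from typing import List

# Output-mode flags taken from the synopsis line.
MODES = {"--segment", "--xerox", "--cg", "--giella-cg"}


def build_command(ruleset: str, mode: str = "--segment") -> List[str]:
    """Build the argument list: hfst-tokenize MODE RULESET."""
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return ["hfst-tokenize", mode, ruleset]


def tokenize(text: str, ruleset: str, mode: str = "--segment") -> str:
    """Send `text` to hfst-tokenize on stdin and return its stdout."""
    result = subprocess.run(
        build_command(ruleset, mode),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

A call such as `tokenize("This is a test.\n\n", "tokeniser.pmhfst", "--cg")` would request Constraint Grammar output; the file name is a placeholder for a ruleset you supply, and the trailing blank line matches the default input separator.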


Report bugs to <> or directly to our bug tracker at: <>

hfst-tokenize home page: <>
General help using HFST software: <>


Copyright © 2017 University of Helsinki, License GPLv3: GNU GPL version 3 <>
This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

August 2018 HFST