NAME
lt-proc - This application is part of the lexical processing modules and tools
(lttoolbox).
This tool is part of the Apertium machine translation architecture:
http://www.apertium.org.
SYNOPSIS
lt-proc [ -a | -c | -g | -n | -p | -s | -t | -v | -h ] fst_file [input_file [output_file]]
lt-proc [ --analysis | --case-sensitive | --generation | --non-marked-gen | --post-generation | --sao | --transliteration | --version | --help ] fst_file [input_file [output_file]]
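As an illustration of the synopsis, the following Python sketch assembles an lt-proc command line and optionally runs it. The helper names `lt_proc_argv` and `run_lt_proc` are hypothetical, only the short options listed above are used, and running the command assumes that lt-proc is installed, is on the PATH, and reads standard input when no input_file is given.

```python
import subprocess

# Short options taken from the synopsis above (-v and -h are left out
# because they do not process a dictionary).
MODES = {"-a", "-c", "-g", "-n", "-p", "-s", "-t"}


def lt_proc_argv(mode, fst_file, input_file=None, output_file=None):
    """Build the list: lt-proc MODE fst_file [input_file [output_file]]."""
    if mode not in MODES:
        raise ValueError(f"unsupported option: {mode}")
    if output_file is not None and input_file is None:
        # The synopsis only allows output_file after input_file.
        raise ValueError("output_file requires input_file")
    argv = ["lt-proc", mode, fst_file]
    if input_file is not None:
        argv.append(input_file)
        if output_file is not None:
            argv.append(output_file)
    return argv


def run_lt_proc(mode, fst_file, text):
    """Hypothetical helper: send text on stdin and return stdout.

    Assumption: lt-proc reads standard input when input_file is omitted.
    """
    result = subprocess.run(
        lt_proc_argv(mode, fst_file),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `lt_proc_argv("-a", "es.bin", "in.txt")` returns `["lt-proc", "-a", "es.bin", "in.txt"]`, where `es.bin` and `in.txt` are placeholder file names.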
DESCRIPTION
lt-proc is the application responsible for providing the four lexical
processing functionalities:
• morphological analyser (option -a)
• lexical transfer (option -n)
• morphological generator (option -g)
• post-generator (option -p)
It accomplishes these tasks by reading binary files containing a compact and
efficient representation of dictionaries (a class of finite-state transducers
called augmented letter transducers). These files are generated by
lt-comp(1).
It is worth mentioning that some characters (`[', `]', `$', `^', `/', `+')
are special characters used for format and encapsulation. They should be
escaped if they have to be used literally. For instance, text enclosed in
`['...`]' is ignored, and a lexical unit is delimited as `^...$'.
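A minimal sketch of escaping literal text before it is sent to lt-proc. This page does not state the escape character, so the backslash used here is an assumption to check against your lttoolbox version, and the function name `escape_reserved` is hypothetical.

```python
# Reserved characters listed in the paragraph above.
RESERVED = set("[]$^/+")


def escape_reserved(text, escape_char="\\"):
    """Prefix each reserved character, and the escape character itself,
    with escape_char (assumed to be a backslash)."""
    out = []
    for ch in text:
        if ch in RESERVED or ch == escape_char:
            out.append(escape_char)
        out.append(ch)
    return "".join(out)


print(escape_reserved("1/2 + [x]"))  # prints 1\/2 \+ \[x\]
```

Text without reserved characters passes through unchanged.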
OPTIONS
- -a, --analysis
- Tokenizes the text in surface forms (lexical units as they appear in
texts) and delivers, for each surface form, one or more lexical forms
consisting of lemma, lexical category and morphological inflection
information. Tokenization is not straightforward due to the existence, on
the one hand, of contractions, and, on the other hand, of multi-word
lexical units. For contractions, the system reads in a single surface form
and delivers the corresponding sequence of lexical forms. Multi-word
surface forms are analysed in a left-to-right, longest-match fashion.
Multi-word surface forms may be invariable (such as a multi-word
preposition or conjunction) or inflected (for example, in es,
"echaban de menos", "they missed", is a form of
the imperfect indicative tense of the verb "echar de
menos", "to miss"). Limited support for some kinds
of discontinuous multi-word units is also available. The analysis of
single-word surface forms produces output like these examples:
"cantar" -> `^cantar/cantar<vblex><inf>$' and
"daba" -> `^daba/dar<vblex><pii><p1><sg>/dar<vblex><pii><p3><sg>$'.
- -c, --case-sensitive
- Use the literal case of the incoming characters.
- -g, --generation
- Delivers a target-language surface form for each target-language lexical
form, by suitably inflecting it.
- -n, --non-marked-gen
- Morphological generation (like -g) but without unknown word marks
(asterisk `*').
- -p, --post-generation
- Performs orthographical operations such as contractions and
apostrophations. The post-generator is usually dormant (just copies
the input to the output) until a special alarm symbol contained in
some target-language surface forms wakes it up to perform a
particular string transformation if necessary; then it goes back to
sleep.
- -s, --sao
- Processes input in the orthoepikon (previously `sao') annotation system
format: http://orthoepikon.sf.net.
- -t, --transliteration
- Apply a transliteration dictionary.
- -v, --version
- Display the version number.
- -h, --help
- Display this help.
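The analysis output shown under -a can be taken apart with a few lines of Python. This is a sketch under stated assumptions: it handles a single `^...$` unit that contains no escaped characters, and the names `split_tags` and `parse_lexical_unit` are hypothetical, not part of lttoolbox.

```python
import re

# Matches one tag such as <vblex> and captures its name.
TAG = re.compile(r"<([^<>]+)>")


def split_tags(lexical_form):
    """Split 'dar<vblex><pii>' into ('dar', ['vblex', 'pii'])."""
    lemma = lexical_form.split("<", 1)[0]
    return lemma, TAG.findall(lexical_form)


def parse_lexical_unit(unit):
    """Parse '^surface/form1/form2$' into (surface, [(lemma, tags), ...])."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit delimited by ^ and $")
    surface, *forms = unit[1:-1].split("/")
    return surface, [split_tags(form) for form in forms]


surface, analyses = parse_lexical_unit(
    "^daba/dar<vblex><pii><p1><sg>/dar<vblex><pii><p3><sg>$"
)
print(surface)  # daba
for lemma, tags in analyses:
    print(lemma, tags)
```

For the "daba" example this yields the surface form `daba` and two lexical forms of the lemma `dar`, which differ only in the person tag (`p1` versus `p3`).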
FILES
fst_file The input compiled dictionary.
SEE ALSO
lt-expand(1), lt-comp(1), apertium-tagger(1),
apertium-translator(1).
BUGS
Lots of...lurking in the dark and waiting for you!
AUTHOR
(c) 2005,2006 Universitat d'Alacant / Universidad de Alicante. All rights
reserved.