MMORPH(5) | File Formats Manual | MMORPH(5) |
NAME
mmorph - MULTEXT morphology tool formalism syntax
DESCRIPTION
A mmorph morphology description file is divided into declaration sections. Each section starts with a section header (`@ Alphabets', `@ Attributes', etc.) followed by a sequence of declarations. Each declaration starts with a name, followed by a colon (`:') and the definition associated with the name. Here is a brief description of each section:
@ Alphabets
In this section the lexical and surface alphabets are declared. All symbols forming each alphabet have to be listed. A symbol may appear in both the lexical and the surface alphabet definition, in which case it is considered a bi-level symbol; otherwise it is a lexical-only or surface-only symbol. Symbols are usually letters (e.g. a, b, c), but may also consist of longer names (beta, schwa). Symbol names consisting of one special character (`:' or `(') may be specified by enclosing them in double quotes (`":"' or `"("').
    Lexical : a b c d e f g h i j k l m n o p q r s t u v w x y z
              "-" "." "," "?" "!"
              "\"" "'" ":" ";" "("
              ")" strong_e
    Surface : a b c d e f g h i j k l m n o p q r s t u v w x y z
              "-" "." "," "?" "!"
              "\"" "'" ":" ";" "("
              ")" " "
@ Attributes
In this section, the names of attributes (sometimes called features) and their associated value sets are declared. At most 32 different values may be declared for an attribute.
    Gender     : feminine masculine neuter
    Number     : singular plural
    Person     : 1st 2nd 3rd
    Transitive : yes no
    Inflection : base intermediate final
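The declaration format and the 32-value limit can be illustrated with a small Python sketch. The helper `parse_attribute` is hypothetical and not part of mmorph. Giving each value one bit of a 32-bit word is only an assumption about why the limit might exist; the text above does not say how values are stored.

```python
# Hypothetical helper, not part of mmorph: parse one "Name : v1 v2 ..."
# attribute declaration and check it against the documented limit.
MAX_VALUES = 32  # limit stated in the Attributes section


def parse_attribute(declaration):
    """Return (name, {value: bit}) for one attribute declaration."""
    name, colon, rest = declaration.partition(":")
    name = name.strip()
    values = rest.split()
    if not colon or not name or not values:
        raise ValueError("expected 'Name : value ...'")
    if len(values) > MAX_VALUES:
        raise ValueError(f"attribute {name!r} has more than {MAX_VALUES} values")
    if len(set(values)) != len(values):
        raise ValueError(f"attribute {name!r} repeats a value")
    # Assumption: one bit per value, so a value set fits in a 32-bit word.
    return name, {value: 1 << index for index, value in enumerate(values)}


name, bits = parse_attribute("Gender : feminine masculine neuter")
print(name, bits)
```

With such an encoding a set of values becomes the bitwise OR of its members, which is one plausible reason for a fixed upper bound.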
@ Types
In this section, the different types of feature structures are declared. The attributes allowed for each type are listed. Attributes that are only used within the scope of the tool and have no meaning outside can be listed after a bar (`|'). The values of these local attributes are not stored in the database or written on the final output of the program.
    Noun : Gender Number
    Verb : Tense Person Gender Number Transitive | Inflection
Typed feature structures
Typed feature structures are used in the grammar and spelling rules. A typed feature structure is the specification of a type and of the values of some associated attributes. The list of attribute specifications is enclosed in square brackets (`[' and `]').
    Noun[ Gender=feminine Number=singular ]
    Noun[ Gender=masculine|neuter ]
    Noun[ Gender!=feminine ]
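The value specifications in these examples (`=' with alternatives separated by `|', and `!=') can be read as sets of admitted values. The Python sketch below shows one such reading. The function `allowed_values` and the table `DECLARED` are hypothetical, and treating `!=' as the complement within the declared value set is an assumption, not something stated above. Variables are not handled.

```python
# Hypothetical sketch, not mmorph code: interpret "Attr=v1|v2" and
# "Attr!=v1" against the value set declared for the attribute.
DECLARED = {
    "Gender": ["feminine", "masculine", "neuter"],
    "Number": ["singular", "plural"],
}


def allowed_values(spec, declared=DECLARED):
    """Return (attribute, set of values) admitted by one specification."""
    if "!=" in spec:
        name, _, rhs = spec.partition("!=")
        negated = True
    else:
        name, _, rhs = spec.partition("=")
        negated = False
    name = name.strip()
    universe = set(declared[name])
    listed = {value.strip() for value in rhs.split("|")}
    unknown = listed - universe
    if unknown:
        raise ValueError(f"undeclared value(s) for {name}: {sorted(unknown)}")
    # Assumption: "!=" admits every declared value that is not listed.
    return name, (universe - listed) if negated else listed


print(allowed_values("Gender=masculine|neuter"))
print(allowed_values("Gender!=feminine"))
```

Under this reading the last two `Noun' examples admit the same genders.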
@ Grammar
This section contains the rules that specify the structure of words. It has the general shape of a context-free grammar over typed feature structures. There are three basic types of rules: binary, goal and affix rules. Binary rules specify the result of the concatenation of two elements. This is written as:
    Rule_name : Lhs <- Rhs1 Rhs2
Example:
    Rule_1 : Noun[ Gender=feminine Number=singular ]
             <- Noun[ Gender=feminine Number=singular ]
                NounSuffix[ Gender=feminine ]
An attribute value may be given as a variable, written with a leading `$' and optionally followed by a value specification:
    Rule_2 : Noun[ Gender=$A Number=$number ]
             <- Noun[ Gender=$A Number=$number ]
                NounSuffix[ Gender=$A ]
    Rule_3 : Noun[ Gender=$A Number=$number ]
             <- Noun[ Gender=$A Number=$number ]
                NounSuffix[ Gender=$A=masculine|neuter ]
Affix rules pair a lexical string with a typed feature structure:
    Plural_s   : "s" NounSuffix[ Number=plural ]
    Feminine_e : "e" NounSuffix[ Gender=feminine ]
    ing        : "ing" VerbSuffix[ Tense=present_participle ]
Goal rules consist of a single typed feature structure:
    Goal_1 : Noun[]
    Goal_2 : Verb[ inflection=final ]
A rule may also be unary, with a single element on the right hand side:
    Rule_4 : Noun[ gender=$G number=plural ]
             <- Noun[ gender=$G number=singular invariant=yes ]
Prefix and suffix rules give the affix directly on the right hand side:
    Append_e : Noun[ Gender=feminine Number=$number ]
               <- Noun[ Gender=feminine Number=$number ]
                  "e" NounSuffix[ Gender=feminine ]
    anti     : Noun[ Gender=$gender Number=$number ]
               <- "anti" NounPrefix[]
                  Noun[ Gender=$gender Number=$number ]
@ Classes
This optional section contains the definition of symbol classes. Each class is defined as a set of symbols or other classes. If the class contains only bi-level elements it is a bi-level class; otherwise it is a lexical or surface class.
    Dental    : d t
    Vowel     : a e i o u
    Vowel_y   : Vowel y
    Consonant : b c d f g h j k l m n p q r s t v w x z
@ Pairs
This optional section contains the definition of pair disjunctions. Each disjunction is defined as a set of pairs. Explicit pairs specify a sequence of surface symbols and a sequence of zero or one lexical symbol, one of them possibly empty. A sequence is enclosed between angle brackets `<' and `>'. The empty sequence is indicated with `<>'. In the current implementation only the surface part of a pair can be a sequence of more than one element. The special symbol `?' stands for the class of all possible symbols, including the morpheme and word boundary.
    s_x_z_1       : s/s x/x z/z
    VowelPair1    : a/a e/e i/i o/o u/u
    VowelPair2    : Vowel/Vowel
    ie.y          : <i e>/y
    Delete_e      : <>/e
    Insert_d      : d/<>
    Surface_Vowel : Vowel/?
    Lexical_s     : ?/s
    DoubleConsonant : <b b>/b <d d>/d
                      <f f>/f <g g>/g <k k>/k
                      <m m>/m <p p>/p <s s>/s
                      <t t>/t <v v>/v <z z>/z
A pair may also be written as the name of a bi-level symbol or bi-level class alone:
    s_x_z_2    : s x z
    VowelPair3 : Vowel
@ Spelling
In this section the two-level spelling rules are declared. A spelling rule consists of a kind indicator followed by a left context, a focus and a right context. The kind indicator is `=>' if the rule is optional, `<=>' if it is obligatory and `<=' if it is a surface coercion rule. The contexts may be empty. The focus is surrounded by two `-'. The contexts and the focus consist of a sequence of pairs or pair disjunctions declared in the `@ Pairs' section. A morpheme boundary is indicated by a `+' or a `*'; a word boundary is indicated by a `~'.
    Sibilant_s    : <=> s_x_z_1 * - e/<> - s
    Gemination    : <=> Consonant Vowel - DoubleConsonant - * Vowel
    i_y_optionnel : => a - i/y - * ?/e
A rule may be followed by constraints, each given as a typed feature structure:
    Sibilant_s : <=> s_x_z_1 * - e/<> - s NounSuffix[ Number=plural ]
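The layout of a spelling rule (kind indicator, left context, focus between two `-', right context) can be made concrete with a short Python sketch. The function `split_spelling_rule` is hypothetical. It assumes that items are separated by white space and that no constraints follow the right context; it does not reflect how mmorph itself parses rules.

```python
# Kind indicators as described in the Spelling section.
KINDS = {"=>": "optional", "<=>": "obligatory", "<=": "surface coercion"}


def split_spelling_rule(rule):
    """Split 'Name: kind left - focus - right' into its parts.

    Assumes white-space separated items and no trailing constraints.
    """
    name, colon, body = rule.partition(":")
    items = body.split()
    if not colon or not items or items[0] not in KINDS:
        raise ValueError("expected 'Name: <kind indicator> ...'")
    rest = items[1:]
    marks = [index for index, item in enumerate(rest) if item == "-"]
    if len(marks) != 2:
        raise ValueError("the focus must be surrounded by two '-'")
    first, second = marks
    focus = rest[first + 1:second]
    if not focus:
        raise ValueError("the focus may not be empty")
    return {
        "name": name.strip(),
        "kind": KINDS[items[0]],
        "left": rest[:first],      # may be empty
        "focus": focus,
        "right": rest[second + 1:],  # may be empty
    }


parts = split_spelling_rule(
    "Gemination: <=> Consonant Vowel - DoubleConsonant - * Vowel")
print(parts["kind"], parts["left"], parts["focus"], parts["right"])
```

For the `Gemination' rule this yields the left context `Consonant Vowel', the focus `DoubleConsonant' and the right context `* Vowel'.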
@ Lexicon
This section is optional and can also be repeated. It lists all the lexical entries of the morphological description. Unlike the other sections, definitions do not have a name. A definition consists of a typed feature structure followed by a list of lexical stems that share that feature structure. A lexical stem consists of the string used in the concatenation specified by the grammar rules, followed by `=' and a reference string. The reference string can be anything and is usually used to indicate the canonical form of the word or an identifier of an external database entry.
    Noun[ Number=singular ] "table" = "table" "chair" = "chair"
    Verb[ Transitive=yes|no Inflection=base ] "bow" = "bow1"
    Noun[ Number=singular ] "bow" = "bow2"
The `=' and the reference string may be omitted:
    Noun[ Number=singular ] "table" "chair"
FORMAL SYNTAX
The formal syntax description below is in Backus Naur Form (BNF). The following conventions apply:
    <id>    is a non-terminal symbol (within angle brackets).
    ID      is a token (terminal symbol, all uppercase).
    <id>?   means zero or one occurrence of <id> (i.e. <id> is optional).
    <id>*   is zero or more occurrences of <id>.
    <id>+   is one or more occurrences of <id>.
    ::=     separates a non-terminal symbol and its expansion.
    |       indicates an alternative expansion.
    ;       starts a comment (not part of the definition).
The start symbol corresponding to a complete description is named <Start>. Symbols that parse but do nothing are marked with `; not operational'.
    <Start>           ::= <AlphabetDecl> <AttDecl> <TypeDecl> <GramDecl>
                          <ClassDecl>? <PairDecl>? <SpellDecl>? <LexDecl>*
    <AlphabetDecl>    ::= ALPHABETS <LexicalDef> <SurfaceDef>
    <LexicalDef>      ::= <LexicalName> COLON <LexicalSymbol>+
    <SurfaceDef>      ::= <SurfaceName> COLON <SurfaceSymbol>+
    <LexicalSymbol>   ::= <LexicalSymbolName>     ; lexical only
                        | <BiLevelSymbolName>     ; both lexical and surface
    <SurfaceSymbol>   ::= <SurfaceSymbolName>     ; surface only
                        | <BiLevelSymbolName>     ; both lexical and surface
    <AttDecl>         ::= ATTRIBUTES <AttDef>+
    <AttDef>          ::= <AttName> COLON <ValName>+
    <TypeDecl>        ::= TYPES <TypeDef>+
    <TypeDef>         ::= <TypeName> COLON <AttName>+ <NoProjAtt>?
    <NoProjAtt>       ::= BAR <AttName>+
    <LexDecl>         ::= LEXICON <LexDef>+
    <LexDef>          ::= <Tfs> <Lexical>+
    <Lexical>         ::= LEXICALSTRING <BaseForm>?
    <BaseForm>        ::= EQUAL LEXICALSTRING
    <Tfs>             ::= <TypeName> <AttSpec>?
    <VarTfs>          ::= <TypeName> <VarAttSpec>?
    <AttSpec>         ::= LBRA <AttVal>* RBRA
    <VarAttSpec>      ::= LBRA <VarAttVal>* RBRA
    <AttVal>          ::= <AttName> <ValSpec>
    <VarAttVal>       ::= <AttName> <VarValSpec>
    <ValSpec>         ::= EQUAL <ValSet>
                        | NOTEQUAL <ValSet>
    <VarValSpec>      ::= <ValSpec>
                        | EQUAL DOLLAR <VarName>
                        | EQUAL DOLLAR <VarName> <ValSpec>
    <ValSet>          ::= <ValName> <ValSetRest>*
    <ValSetRest>      ::= BAR <ValName>
    <GramDecl>        ::= GRAMMAR <RuleDef>+
    <RuleDef>         ::= <RuleName> COLON <RuleBody>
    <RuleBody>        ::= <VarTfs> LARROW <Rhs>
                        | <Tfs>                         ; goal rule
                        | LEXICALSTRING <Tfs>           ; lexical affix
    <Rhs>             ::= <VarTfs>                      ; unary rule
                        | <VarTfs> <VarTfs>             ; binary rule
                        | LEXICALSTRING <Tfs> <VarTfs>  ; prefix rule
                        | <VarTfs> <Tfs> LEXICALSTRING  ; suffix rule
    <ClassDecl>       ::= CLASSES <ClassDef>+
    <ClassDef>        ::= <LexicalClassName> COLON <LexicalClass>+
                        | <SurfaceClassName> COLON <SurfaceClass>+
                        | <BiLevelClassName> COLON <BiLevelClass>+
    <LexicalClass>    ::= <LexicalSymbol>
                        | <LexicalClassName>
                        | <BiLevelClassName>
    <SurfaceClass>    ::= <SurfaceSymbol>
                        | <SurfaceClassName>
                        | <BiLevelClassName>
    <BiLevelClass>    ::= <BiLevelSymbolName>
                        | <BiLevelClassName>
    <PairDecl>        ::= PAIRS <PairDef>+
    <PairDef>         ::= <PairName> COLON <Pair>+
    <Pair>            ::= <SurfaceSequence> SLASH <LexicalSequence>
                        | <PairName>
                        | <BiLevelClassName>
                        | <BiLevelSymbolName>
    <SurfaceSequence> ::= LANGLE <SurfaceSymbol>* RANGLE
                        | SURFACESTRING
                        | <SurfaceClass>
                        | ANY
    <LexicalSequence> ::= LANGLE <LexicalSymbol>* RANGLE
                        | LEXICALSTRING
                        | <LexicalClass>
                        | ANY
    <SpellDecl>       ::= SPELLING <SpellDef>+
    <SpellDef>        ::= <SpellName> COLON <Arrow> <LeftContext> <Focus>
                          <RightContext> <Constraint>*
    <LeftContext>     ::= <Pattern>*
    <RightContext>    ::= <Pattern>*
    <Focus>           ::= CONTEXTBOUNDARY <Pattern>+ CONTEXTBOUNDARY
    <Pattern>         ::= <Pair>
                        | MORPHEMEBOUNDARY
                        | WORDBOUNDARY
                        | CONCATBOUNDARY
    <Constraint>      ::= <Tfs>
    <Arrow>           ::= RARROW | BIARROW | COERCEARROW
    <AttName>            ::= NAME
    <BiLevelClassName>   ::= NAME
    <BiLevelSymbolName>  ::= NAME | SYMBOLSTRING
    <LexicalClassName>   ::= NAME
    <LexicalName>        ::= NAME
    <LexicalSymbolName>  ::= NAME | SYMBOLSTRING
    <PairName>           ::= NAME
    <RuleName>           ::= NAME
    <SpellName>          ::= NAME
    <SurfaceClassName>   ::= NAME
    <SurfaceName>        ::= NAME
    <SurfaceSymbolName>  ::= NAME | SYMBOLSTRING
    <TypeName>           ::= NAME
    <ValName>            ::= NAME
    <VarName>            ::= NAME
Simple tokens
Simple tokens of the BNF above are defined as follows. The token name on the left corresponds to the literal character or characters on the right:
    ANY               ?
    BAR               |
    BIARROW           <=>
    COERCEARROW       <=
    COLON             :
    CONCATBOUNDARY    *
    CONTEXTBOUNDARY   -
    DOLLAR            $
    EQUAL             =
    LANGLE            <
    LARROW            <-
    LBRA              [
    MORPHEMEBOUNDARY  +
    NOTEQUAL          !=
    RARROW            =>
    RANGLE            >
    RBRA              ]
    SLASH             /
    WORDBOUNDARY      ~
    ALPHABETS         @Alphabets
    ATTRIBUTES        @Attributes
    CLASSES           @Classes
    GRAMMAR           @Grammar
    LEXICON           @Lexicon
    PAIRS             @Pairs
    SPELLING          @Spelling
    TYPES             @Types
In the section header tokens above, spaces may separate the `@' from the reserved word.
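Several of these tokens share a prefix (`<', `<-', `<=', `<=>'), so a scanner has to prefer the longest match, which is the default behaviour of flex(1). The Python sketch below illustrates this for a subset of the tokens. The function `scan` is hypothetical and is not taken from the mmorph sources; only the token names and spellings come from the table.

```python
import re

# Subset of the simple tokens; names and spellings follow the table above.
TOKENS = {
    "BIARROW": "<=>",
    "COERCEARROW": "<=",
    "LARROW": "<-",
    "RARROW": "=>",
    "NOTEQUAL": "!=",
    "EQUAL": "=",
    "LANGLE": "<",
    "CONTEXTBOUNDARY": "-",
    "COLON": ":",
}

# Try longer spellings first so that "<=>" is not cut short at "<=".
_by_length = sorted(TOKENS.items(), key=lambda item: len(item[1]), reverse=True)
_pattern = re.compile(
    "|".join(f"(?P<{name}>{re.escape(text)})" for name, text in _by_length))


def scan(text):
    """Return the token names found in text, skipping white space."""
    tokens = []
    position = 0
    while position < len(text):
        if text[position].isspace():
            position += 1
            continue
        match = _pattern.match(text, position)
        if match is None:
            raise ValueError(
                f"unexpected character {text[position]!r} at offset {position}")
        tokens.append(match.lastgroup)
        position = match.end()
    return tokens


print(scan("<=> <= <- <"))
```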
Complex tokens
NAME
    Examples:
        category 33 Rule_9 __2__ Proper.Noun
LEXICALSTRING
SURFACESTRING
    is a string of surface symbols
SYMBOLSTRING
Examples of strings:
    "table" "," ""
    "double quote is \" and backslash is \\"
    "&strong_e;"
    "escape like in C : \t is ASCII tab"
    "escape with octal code: \011 is ASCII tab"
The directive
    #include "verb.entries"
will splice in the content of the file verb.entries at the point where this directive occurs. The `#' should be the first character on the line. Tabs or spaces may separate `#' and `include'. The file name must be quoted. Only tabs or spaces may occur on the rest of the line. Inclusion can be nested up to 10 levels.
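The inclusion rules can be summarised in a short Python sketch. The function `expand` and its `read` callback are hypothetical and do not describe mmorph's own implementation. Two points are assumptions: white space is allowed between `include' and the quoted file name, and the top-level file counts as level 0, so that an include at an eleventh level raises an error.

```python
import re

MAX_DEPTH = 10  # nesting limit stated above

# '#' in the first column, optional tabs or spaces, 'include', a quoted
# file name, then nothing but tabs or spaces up to the end of the line.
INCLUDE = re.compile(r'^#[ \t]*include[ \t]*"([^"]+)"[ \t]*$')


def expand(name, read, depth=0):
    """Return the lines of file `name` with include directives spliced in.

    `read` is a caller-supplied function mapping a file name to its text.
    """
    lines = []
    for line in read(name).splitlines():
        match = INCLUDE.match(line)
        if match is None:
            lines.append(line)
            continue
        if depth + 1 > MAX_DEPTH:
            raise RecursionError(f"include nested deeper than {MAX_DEPTH} levels")
        lines.extend(expand(match.group(1), read, depth + 1))
    return lines


files = {
    "main": '@ Lexicon\n#include "verb.entries"\nNoun[] "table"',
    "verb.entries": 'Verb[] "bow"',
}
print(expand("main", files.__getitem__))
```

Passing a `read` function instead of opening files directly keeps the sketch independent of the file system; a real caller could supply a function that opens the named file and returns its text.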
SEE ALSO
mmorph(1).
G. Russell and D. Petitpierre, MMORPH - The Multext Morphology Program, Version 2.3, October 1995, MULTEXT deliverable report for task 2.3.1.
AUTHOR
Dominique Petitpierre, ISSCO, <petitp@divsun.unige.ch>
COMMENTS
The parser for the morphology description formalism above was written using yacc(1) and flex(1). Flex was written by Vern Paxson, <vern@ee.lbl.gov>, and is distributed in the framework of the GNU project under the conditions of the GNU General Public License.
Version 2.3, October 1995