NAME¶
index.noun, data.noun, index.verb, data.verb, index.adj, data.adj, index.adv,
data.adv - WordNet database files
noun.exc, verb.exc, adj.exc, adv.exc - morphology exception lists
sentidx.vrb, sents.vrb - files used by search code to display sentences
illustrating the use of some specific verbs
DESCRIPTION¶
For each syntactic category, two files are needed to represent the contents of
the WordNet database: index.pos and data.pos, where pos is noun, verb, adj, or
adv. The other auxiliary files are used by the WordNet library's searching
functions and are needed to run the various WordNet browsers.
Each index file is an alphabetized list of all the words found in WordNet in the
corresponding part of speech. On each line, following the word, is a list of
byte offsets (synset_offsets) in the corresponding data file, one for each
synset containing the word. Words in the index file are in lower case only,
regardless of how they were entered in the lexicographer files. This folds
various orthographic representations of the word into one line, enabling
database searches to be case insensitive. See wninput(5WN) for a detailed
description of the lexicographer files.
A data file for a syntactic category contains information corresponding to the
synsets that were specified in the lexicographer files, with relational
pointers resolved to synset_offsets. Each line corresponds to a synset.
Pointers are followed and hierarchies traversed by moving from one synset to
another via the synset_offsets.
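As a minimal sketch of moving between synsets by offset, the Python below seeks to a synset_offset in a data file opened in binary mode and reads one line. The helper names read_synset_line and build_toy_file, the two toy synsets, and their glosses are made up for illustration; only the line layout follows this page.

```python
import io


def read_synset_line(data_file, synset_offset):
    """Return the data-file line that starts at byte synset_offset."""
    data_file.seek(synset_offset)
    line = data_file.readline().decode("ascii").rstrip("\n")
    # The first field of a data line repeats its own byte offset, so a
    # mismatch means the offset did not point at the start of a synset.
    if not line.startswith(f"{synset_offset:08d} "):
        raise ValueError(f"no synset starts at byte {synset_offset}")
    return line


def build_toy_file():
    """Build a made-up, two-synset data file; return (file, off1, off2)."""
    header = "  1 made-up header line\n"
    off1 = len(header)
    line1 = f"{off1:08d} 03 n 01 entity 0 000 | toy gloss\n"
    off2 = off1 + len(line1)
    # The second synset carries one pointer (@) whose target is the first.
    line2 = f"{off2:08d} 03 n 01 thing 0 001 @ {off1:08d} n 0000 | toy gloss\n"
    return io.BytesIO((header + line1 + line2).encode("ascii")), off1, off2


toy, off1, off2 = build_toy_file()
fields = read_synset_line(toy, off2).split()
target_offset = int(fields[8])  # synset_offset field of the single pointer
print(read_synset_line(toy, target_offset))
```

Binary mode matters here because the offsets count bytes; a text-mode file object may not accept arbitrary byte positions.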
The exception list files, pos.exc, are used to help the morphological processor
find base forms from irregular inflections.
The files sentidx.vrb and sents.vrb contain sentences illustrating the use of
specific senses of some verbs. These files are used by the searching software
in response to a request for verb sentence frames. Generic sentence frames are
displayed when an illustrative sentence is not present.
The various database files are in ASCII formats that are easily read by both
humans and machines. All fields, unless otherwise noted, are separated by one
space character, and all lines are terminated by a newline character. Fields
enclosed in italicized square brackets may not be present.
See wngloss(7WN) for a glossary of WordNet terminology and a discussion
of the database's content and logical organization.
Index File Format¶
Each index file begins with several lines containing a copyright notice, version
number and license agreement. These lines all begin with two spaces and the
line number so they do not interfere with the binary search algorithm that is
used to look up entries in the index files. All other lines are in the
following format. In the field descriptions, number always refers to a
decimal integer unless otherwise defined.
lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]
- lemma
- lower case ASCII text of word or collocation. Collocations are formed by
joining individual words with an underscore ( _) character.
- pos
- Syntactic category: n for noun files, v for verb files,
a for adjective files, r for adverb files.
All remaining fields are with respect to senses of lemma in pos.
- synset_cnt
- Number of synsets that lemma is in. This is the number of senses of
the word in WordNet. See Sense Numbers below for a
discussion of how sense numbers are assigned and the order of
synset_offsets in the index files.
- p_cnt
- Number of different pointers that lemma has in all synsets
containing it.
- ptr_symbol
- A space separated list of p_cnt different types of pointers that
lemma has in all synsets containing it. See wninput(5WN) for
a list of pointer_symbols. If all senses of lemma have no
pointers, this field is omitted and p_cnt is 0.
- sense_cnt
- Same as synset_cnt above. This is redundant, but the field was
preserved for compatibility reasons.
- tagsense_cnt
- Number of senses of lemma that are ranked according to their
frequency of occurrence in semantic concordance texts.
- synset_offset
- Byte offset in data.pos file of a synset containing
lemma. Each synset_offset in the list corresponds to a
different sense of lemma in WordNet. synset_offset is an 8
digit, zero-filled decimal integer that can be used with fseek(3)
to read a synset from the data file. When passed to
read_synset(3WN) along with the syntactic category, a data
structure containing the parsed synset is returned.
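The field layout above can be sketched as a small Python parser. This is an illustration under stated assumptions, not the WordNet library's own code: the names IndexEntry and parse_index_line are hypothetical, and the sample line is made up to fit the format rather than copied from a real index file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[int]


def parse_index_line(line: str) -> IndexEntry:
    """Parse one non-header line of an index.pos file."""
    # Header lines start with two spaces; callers are expected to skip them.
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # p_cnt tells us how many ptr_symbol fields follow (possibly zero).
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]
    sense_cnt, tagsense_cnt = int(rest[0]), int(rest[1])
    synset_offsets = [int(offset) for offset in rest[2:]]
    if len(synset_offsets) != synset_cnt:
        raise ValueError("synset_cnt does not match the number of offsets")
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols,
                      sense_cnt, tagsense_cnt, synset_offsets)


# Made-up sample line in the documented layout.
entry = parse_index_line("dog n 2 3 @ ~ #m 2 1 02084071 10114209")
print(entry.lemma, entry.ptr_symbols, entry.synset_offsets)
```

Converting each synset_offset to an integer drops the zero padding; format it back with `f"{offset:08d}"` if the 8 digit form is needed.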
Data File Format¶
Each data file begins with several lines containing a copyright notice, version
number and license agreement. These lines all begin with two spaces and the
line number. All other lines are in the following format. Integer fields are
of fixed length, and are zero-filled.
synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
- synset_offset
- Current byte offset in the file represented as an 8 digit decimal
integer.
- lex_filenum
- Two digit decimal integer corresponding to the lexicographer file name
containing the synset. See lexnames(5WN) for the list of filenames
and their corresponding numbers.
- ss_type
- One character code indicating the synset type:
n NOUN
v VERB
a ADJECTIVE
s ADJECTIVE SATELLITE
r ADVERB
- w_cnt
- Two digit hexadecimal integer indicating the number of words in the
synset.
- word
- ASCII form of a word as entered in the synset by the lexicographer, with
spaces replaced by underscore characters ( _). The text of the word
is case sensitive, in contrast to its form in the corresponding
index.pos file, which contains only lower-case forms. In
data.adj, a word is followed by a syntactic marker if one
was specified in the lexicographer file. A syntactic marker is appended,
in parentheses, onto word without any intervening spaces. See
wninput(5WN) for a list of the syntactic markers for
adjectives.
- lex_id
- One digit hexadecimal integer that, when appended onto lemma,
uniquely identifies a sense within a lexicographer file. lex_id
numbers usually start with 0, and are incremented as additional
senses of the word are added to the same file, although there is no
requirement that the numbers be consecutive or begin with 0. Note
that a value of 0 is the default, and therefore is not present in
lexicographer files.
- p_cnt
- Three digit decimal integer indicating the number of pointers from this
synset to other synsets. If p_cnt is 000 the synset has no
pointers.
- ptr
- A pointer from this synset to another. ptr is of the form:
pointer_symbol synset_offset pos source/target
where synset_offset is the byte offset of the target synset in the
data file corresponding to pos.
The source/target field distinguishes lexical and semantic pointers.
It is a four byte field, containing two two-digit hexadecimal integers.
The first two digits indicate the word number in the current (source)
synset, the last two digits indicate the word number in the target synset.
A value of 0000 means that pointer_symbol represents a
semantic relation between the current (source) synset and the target
synset indicated by synset_offset.
A lexical relation between two words in different synsets is represented by
non-zero values in the source and target word numbers. The first and last
two bytes of this field indicate the word numbers in the source and target
synsets, respectively, between which the relation holds. Word numbers are
assigned to the word fields in a synset, from left to right,
beginning with 1.
See wninput(5WN) for a list of pointer_symbols, and semantic
and lexical pointer classifications.
- frames
- In data.verb only, a list of numbers corresponding to the generic
verb sentence frames for words in the synset. frames is of
the form:
f_cnt + f_num w_num [ + f_num w_num...]
where f_cnt is a two digit decimal integer indicating the number of
generic frames listed, f_num is a two digit decimal integer frame
number, and w_num is a two digit hexadecimal integer indicating the
word in the synset that the frame applies to. As with pointers, if this
number is 00, f_num applies to all words in the
synset. If non-zero, it is applicable only to the word indicated. Word
numbers are assigned as described for pointers. Each f_num w_num
pair is preceded by a +. See wninput(5WN) for the text of
the generic sentence frames.
- gloss
- Each synset contains a gloss. A gloss is represented as a vertical
bar ( |), followed by a text string that continues until the end of
the line. The gloss may contain a definition, one or more example
sentences, or both.
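The data line layout, including the hexadecimal fields and the split source/target field, can be sketched in Python as follows. The function name parse_data_line and the dictionary keys are hypothetical, the sample line is made up to fit the format, and syntactic markers on adjectives are left attached to the word rather than parsed.

```python
def parse_data_line(line):
    """Parse one non-header line of a data.pos file into a dict."""
    # Everything after the first vertical bar is the gloss.
    body, _, gloss = line.partition("|")
    f = body.split()
    synset_offset, lex_filenum, ss_type = int(f[0]), int(f[1]), f[2]
    w_cnt = int(f[3], 16)            # two digit hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((f[i], int(f[i + 1], 16)))   # (word, lex_id)
        i += 2
    p_cnt = int(f[i])                # three digit decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, target, pos, src_tgt = f[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "synset_offset": int(target),
            "pos": pos,
            # 0 in both halves marks a semantic (synset to synset) pointer.
            "source_word": int(src_tgt[:2], 16),
            "target_word": int(src_tgt[2:], 16),
        })
        i += 4
    frames = []
    if ss_type == "v" and i < len(f):
        f_cnt = int(f[i])            # two digit decimal
        i += 1
        for _ in range(f_cnt):
            # f[i] is the "+" that precedes each f_num w_num pair.
            frames.append((int(f[i + 1]), int(f[i + 2], 16)))
            i += 3
    return {"synset_offset": synset_offset, "lex_filenum": lex_filenum,
            "ss_type": ss_type, "words": words, "pointers": pointers,
            "frames": frames, "gloss": gloss.strip()}


# Made-up sample line in the documented layout.
sample = ("00001740 29 v 01 breathe 0 002 @ 00005041 v 0000 "
          "+ 04250850 n 0101 02 + 02 00 + 08 00 | toy gloss text")
synset = parse_data_line(sample)
print(synset["words"], synset["frames"])
```

Counting fields from p_cnt and f_cnt, rather than scanning for symbols, avoids confusing a + pointer symbol with the + that introduces each frame pair.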
Sense Numbers¶
Senses in WordNet are generally ordered from most to least frequently used, with
the most common sense numbered 1. Frequency of use is determined by the
number of times a sense is tagged in the various semantic concordance texts.
Senses that are not semantically tagged follow the ordered senses. The
tagsense_cnt field for each entry in the index.pos files
indicates how many of the senses in the list have been tagged.
The cntlist(5WN) file provided with the database lists the number of
times each sense is tagged in the semantic concordances. The data from
cntlist is used by grind(1WN) to order the senses of each word.
When the index.pos files are generated, the
synset_offsets are output in sense number order, with sense 1 first in
the list. Senses with the same number of semantic tags are assigned unique but
consecutive sense numbers. The WordNet OVERVIEW search
displays all senses of the specified word, in all syntactic categories, and
indicates which of the senses are represented in the semantically tagged
texts.
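Because the offsets in an index entry are already in sense number order, sense numbers can be recovered by position alone. The sketch below assumes the offsets and tagsense_cnt have already been read from an index line; the function name number_senses and the sample values are made up.

```python
def number_senses(synset_offsets, tagsense_cnt):
    """Return (sense_number, synset_offset, is_tagged) for each sense.

    Sense 1 is the first offset in the list. Tagged senses come first,
    so the first tagsense_cnt entries are the tagged ones.
    """
    return [(number, offset, number <= tagsense_cnt)
            for number, offset in enumerate(synset_offsets, start=1)]


# Made-up offsets: three senses, of which the first two are tagged.
for sense in number_senses([2084071, 10114209, 7676602], tagsense_cnt=2):
    print(sense)
```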
Exception List File Format¶
Exception lists are alphabetized lists of inflected forms of words and their
base forms. The first field of each line is an inflected form, followed by a
space separated list of one or more base forms of the word. There is one
exception list file for each syntactic category.
Note that the noun and verb exception lists were automatically generated from a
machine-readable dictionary, and contain many words that are not in WordNet.
Also, for many of the inflected forms, base forms could be easily derived
using the standard rules of detachment programmed into Morphy (see
morphy(7WN)). These anomalies are allowed to remain in the exception
list files, as they do no harm.
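A minimal sketch of reading an exception list into a lookup table, assuming the lines have already been read from a pos.exc file. The function name load_exceptions is hypothetical and the sample lines are illustrative entries in the documented layout.

```python
def load_exceptions(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        fields = line.split()
        # First field is the inflected form; the rest are base forms.
        if len(fields) >= 2:
            table[fields[0]] = fields[1:]
    return table


# Illustrative lines in the documented layout.
exceptions = load_exceptions(["axes ax axis", "geese goose"])
print(exceptions.get("axes"))
print(exceptions.get("dogs"))  # None: regular forms are left to Morphy's rules
```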
Verb Example Sentences¶
For some verb senses, example sentences illustrating the use of the verb sense
can be displayed. Each line of the file sentidx.vrb contains a
sense_key followed by a space and a comma separated list of example
sentence template numbers, in decimal. The file sents.vrb lists all of
the example sentence templates. Each line begins with the template number
followed by a space. The rest of the line is the text of a template example
sentence, with %s used as a placeholder in the text for the verb. Both
files are sorted alphabetically so that the sense_key and template
sentence number can be used as indices, via binsrch(3WN), into the
appropriate file.
When a request for FRAMES is made, the WordNet search code
looks for the sense in sentidx.vrb. If found, the sentence template(s)
listed are retrieved from sents.vrb, and the %s is replaced with
the verb. If the sense is not found, the applicable generic sentence frame(s)
listed in frames are displayed.
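The lookup described above can be sketched in Python as follows. The function names, the sense_key, and the template sentences are all made up for illustration, and the files are modelled as in-memory dictionaries rather than searched with binsrch(3WN).

```python
def parse_sentidx_line(line):
    """Split a sentidx.vrb line into (sense_key, [template numbers])."""
    sense_key, _, numbers = line.strip().partition(" ")
    return sense_key, [int(n) for n in numbers.split(",") if n]


def parse_sents_line(line):
    """Split a sents.vrb line into (template number, template text)."""
    number, _, template = line.rstrip("\n").partition(" ")
    return int(number), template


def example_sentences(sense_key, verb, sentidx, sents):
    """Fill in the templates for a sense; empty list if it has none."""
    # str.replace is used instead of the % operator because only the
    # literal placeholder %s should be substituted.
    return [sents[n].replace("%s", verb) for n in sentidx.get(sense_key, [])]


# Made-up file contents in the documented layout.
sentidx = dict([parse_sentidx_line("run%2:38:00:: 1,2")])
sents = dict([parse_sents_line("1 The children %s to the playground"),
              parse_sents_line("2 The horses %s across the meadow")])
print(example_sentences("run%2:38:00::", "run", sentidx, sents))
```

An empty result is the cue for a caller to fall back to the generic sentence frames.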
NOTES¶
Information in the data.pos and index.pos files
represents all of the word senses and synsets in the WordNet database. The
word, lex_id, and lex_filenum fields together uniquely
identify each word sense in WordNet. These can be encoded in a
sense_key as described in senseidx(5WN). Each synset in the
database can be uniquely identified by combining the synset_offset for
the synset with a code for the syntactic category (since it is possible for
synsets in different data.pos files to have the same synset_offset).
The WordNet system provides both command line and window-based browser interfaces
to the database. Both interfaces utilize a common library of search and
morphology code. The source code for the library and interfaces is included in
the WordNet package. See wnintro(3WN) for an overview of the WordNet
source code.
ENVIRONMENT VARIABLES (UNIX)¶
- WNHOME
- Base directory for WordNet. Default is /usr/local/WordNet-3.0.
- WNSEARCHDIR
- Directory in which the WordNet database has been installed. Default is
WNHOME/dict.
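The defaults above imply a simple resolution order, sketched here in Python. The function name wordnet_search_dir is hypothetical, and this mirrors only what this section states, not necessarily every detail of the library's own lookup.

```python
import os


def wordnet_search_dir(environ=None):
    """Return the database directory implied by WNSEARCHDIR and WNHOME."""
    env = os.environ if environ is None else environ
    # An explicit WNSEARCHDIR wins; otherwise fall back to WNHOME/dict,
    # with WNHOME itself defaulting to /usr/local/WordNet-3.0.
    if env.get("WNSEARCHDIR"):
        return env["WNSEARCHDIR"]
    home = env.get("WNHOME", "/usr/local/WordNet-3.0")
    return home.rstrip("/") + "/dict"


print(wordnet_search_dir({}))
print(wordnet_search_dir({"WNHOME": "/opt/wordnet"}))
```

Passing a plain dictionary instead of the real environment keeps the function easy to exercise in isolation.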
REGISTRY (WINDOWS)¶
- HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
- Base directory for WordNet. Default is C:\Program Files\WordNet\3.0.
FILES¶
- index.pos
- database index files
- data.pos
- database data files
- *.vrb
- files of sentences illustrating the use of verbs
- pos.exc
- morphology exception lists
SEE ALSO¶
grind(1WN),
wn(1WN),
wnb(1WN),
wnintro(3WN),
binsrch(3WN),
wnintro(5WN),
cntlist(5WN),
lexnames(5WN),
senseidx(5WN),
wninput(5WN),
morphy(7WN),
wngloss(7WN),
wngroups(7WN),
wnstats(7WN).