NAME
hspell - Hebrew spellchecker (C API)
SYNOPSIS
#include <hspell.h>
int hspell_init(struct dict_radix **dictp, int flags);
void hspell_uninit(struct dict_radix *dictp);
int hspell_check_word(struct dict_radix *dict, const char *word, int *preflen);
void hspell_trycorrect(struct dict_radix *dict, const char *word, struct corlist *cl);
int corlist_init(struct corlist *cl);
int corlist_free(struct corlist *cl);
int corlist_n(struct corlist *cl);
char *corlist_str(struct corlist *cl, int i);
int hspell_is_canonic_gimatria(const char *word);
typedef int hspell_word_split_callback_func(const char *word, const char *baseword, int preflen, int prefspec);
int hspell_enum_splits(struct dict_radix *dict, const char *word, hspell_word_split_callback_func *enumf);
void hspell_set_dictionary_path(const char *path);
const char *hspell_get_dictionary_path(void);
DESCRIPTION
This manual describes the C API of the Hspell Hebrew spellchecker. Please refer
to hspell(1) for a description of the Hspell project, its spelling standard, and
how it works.
The hspell_init() function must be called first to initialize the Hspell
library. It sets up some global structures (see the CAVEATS section) and then
reads the necessary dictionary files (whose default locations are fixed when
the library is built). The 'dictp' parameter is a pointer to a
struct dict_radix* object, which is modified to point to a newly allocated
dictionary. A typical hspell_init() call therefore looks like
struct dict_radix *dict;
hspell_init(&dict, flags);
Note that the (struct dict_radix*) type is an opaque pointer - the library user
has no access to the separate fields in this structure.
The 'flags' parameter can contain a bitwise OR of several flags that modify
Hspell's default behavior. Turning on HSPELL_OPT_HE_SHEELA allows Hspell to
recognize the interrogative He prefix (he ha-she'ela). HSPELL_OPT_DEFAULT is a
synonym for turning on no special flag, i.e., it evaluates to 0.
hspell_init() returns 0 on success, or negative numbers on errors.
Currently, the only error is -1, meaning the dictionary files could not be
read.
The hspell_uninit() function undoes the effects of hspell_init(), freeing any
memory that was allocated during initialization.
The hspell_check_word() function checks whether a certain word is a correct
Hebrew word (possibly with prefix particles attached in a syntactically correct
manner). It returns 1 if the word is correct, or 0 if it is incorrect.
The 'word' parameter should be a single Hebrew word, in the iso8859-8 encoding,
possibly containing the ASCII quote or double-quote characters (signifying the
geresh and gershayim used in Hebrew for abbreviations, acronyms, and a few
foreign sounds). If the calling program works with other encodings, it must
convert the word to iso8859-8 first. In particular, the cp1255 (MS-Windows
Hebrew encoding) extensions to iso8859-8, such as niqqud characters, geresh, or
gershayim, are currently not recognized and must be removed from the word
prior to calling hspell_check_word().
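For callers that hold text as UTF-8, the conversion described above could be
sketched as follows. This is only an illustration under stated assumptions:
the function name utf8_to_iso8859_8 is invented here and is not part of
Hspell, and the sketch handles only the Hebrew letters, a few common niqqud
points (which it drops), and the Unicode geresh and gershayim (which it turns
into the ASCII quote and double-quote). Anything else is rejected.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Hspell): converts a UTF-8 Hebrew word to
 * ISO-8859-8. Returns 0 on success, or -1 if the input contains a character
 * this sketch does not map or if 'out' is too small. */
static int utf8_to_iso8859_8(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    if (outsize == 0)
        return -1;
    while (*p) {
        unsigned char c;
        if (*p < 0x80) {
            /* ASCII passes through, including ' and ". */
            c = *p++;
        } else if (p[0] == 0xD7 && p[1] >= 0x90 && p[1] <= 0xAA) {
            /* U+05D0..U+05EA (alef..tav) map to 0xE0..0xFA. */
            c = (unsigned char)(0xE0 + (p[1] - 0x90));
            p += 2;
        } else if (p[0] == 0xD7 && p[1] == 0xB3) {
            c = '\'';               /* U+05F3 geresh */
            p += 2;
        } else if (p[0] == 0xD7 && p[1] == 0xB4) {
            c = '"';                /* U+05F4 gershayim */
            p += 2;
        } else if ((p[0] == 0xD6 && p[1] >= 0xB0 && p[1] <= 0xBF &&
                    p[1] != 0xBE) ||
                   (p[0] == 0xD7 &&
                    (p[1] == 0x81 || p[1] == 0x82 || p[1] == 0x87))) {
            /* Points U+05B0..U+05BD, U+05BF, U+05C1, U+05C2, U+05C7:
             * dropped, since they are not recognized. */
            p += 2;
            continue;
        } else {
            return -1;              /* no mapping in this sketch */
        }
        if (n + 1 >= outsize)
            return -1;              /* no room for c plus the final NUL */
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return 0;
}
```

The converted buffer can then be passed as the 'word' argument. A program
that already depends on iconv(3) may prefer to use that instead and strip the
unsupported characters separately.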
Into the 'preflen' parameter, the function writes back the number of characters
it recognized as a prefix particle - the rest of the 'word' is a stand-alone
word. Because Hebrew words typically can be read in several different ways,
this feature (of getting just one prefix from one possible reading) is usually
not very useful, and it is likely to be removed in a future version.
The hspell_enum_splits() function provides a way to get all possible splits of
the given 'word' into an optional prefix particle and a stand-alone word. For
each possible (and legal, as some words cannot accept certain prefixes) split,
a user-defined callback function is called. This callback function is given
the whole word, the stand-alone word, the length of the prefix, and a bitfield
which describes what types of words this prefix can get. Note that in some
cases, a word beginning with the letter waw gets this waw doubled before a
prefix, so sometimes strlen(word)!=strlen(baseword)+preflen.
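Because the callback receives no user-data pointer, a caller that wants to
keep the splits has to store them in file-scope state. The sketch below shows
one way to do that. The names collect_split, struct split, splits and
n_splits are invented for this illustration; the meaning of the callback's
return value is not documented on this page, so returning 0 is an assumption.

```c
#include <stdio.h>

#define MAX_SPLITS 16

/* Hypothetical storage for the splits seen so far (not part of Hspell). */
struct split {
    char baseword[64];
    int preflen;
    int prefspec;
};
static struct split splits[MAX_SPLITS];
static int n_splits = 0;

/* Has the same signature as hspell_word_split_callback_func. With
 * <hspell.h> included and an initialized dictionary it would be passed as:
 *     n_splits = 0;
 *     hspell_enum_splits(dict, word, collect_split);
 */
static int collect_split(const char *word, const char *baseword,
                         int preflen, int prefspec)
{
    (void)word; /* the whole word is not needed in this sketch */
    if (n_splits < MAX_SPLITS) {
        struct split *s = &splits[n_splits++];
        /* snprintf truncates safely if the base word is too long. */
        snprintf(s->baseword, sizeof s->baseword, "%s", baseword);
        s->preflen = preflen;
        s->prefspec = prefspec;
    }
    return 0; /* assumption: the return value's meaning is undocumented */
}
```

The shared array makes this sketch unsuitable for concurrent calls from
several threads unless the caller adds its own locking.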
The hspell_trycorrect() function tries to find a list of possible corrections
for an incorrect word. Because in Hebrew the word density is high (a random
string of letters, especially if short, has a high probability of being a
correct word), this function attempts to try corrections based on the
assumption of a spelling error (replacement of letters that sound alike,
missing or spurious immot qri'a), not a typo (slipped finger on the keyboard,
etc.) - see also CAVEATS.
hspell_trycorrect() returns the correction list in a structure of type
struct corlist. This structure must first be initialized with a call to
corlist_init() and subsequently freed with corlist_free(). The corlist_n()
macro returns the number of words held in an initialized corlist, and
corlist_str() returns the i'th word. Accordingly, here is an example usage of
hspell_trycorrect():
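struct corlist cl;
int i;
printf ("Found misspelled word %s. Possible corrections:\n", w);
corlist_init (&cl);
hspell_trycorrect (dict, w, &cl);
/* print every suggestion held in the list */
for (i=0; i<corlist_n(&cl); i++) {
    printf ("%s\n", corlist_str(&cl, i));
}
/* release the list once the suggestions are no longer needed */
corlist_free (&cl);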
The hspell_is_canonic_gimatria() function checks whether the given word is a
canonic gimatria - i.e., the proper way to write in gimatria the number it
represents. The caller might want to accept canonic gimatria as proper Hebrew
words, even if hspell_check_word() previously reported such a word to be a
non-existent word. hspell_is_canonic_gimatria() returns the number represented
as gimatria in 'word' if it is indeed proper gimatria (in canonic form), or 0
otherwise.
hspell_init() normally reads the dictionary files from a path compiled into the
library. This makes sense when the library's code and the dictionaries are
distributed together, but in some scenarios the library user might want to use
the Hspell dictionaries that are already present on the system in an arbitrary
path. The function hspell_set_dictionary_path() can be used to set this path,
and should be called before hspell_init(). The given path is that of the word
list, and the other input files have that path with an appended suffix.
hspell_get_dictionary_path() can be used to find the current path. On many
installations, this defaults to "/usr/local/share/hspell/hebrew.wgz".
LINKING
On most systems, the Hspell library is compiled to use the Zlib library for
reading the compressed dictionaries. Therefore, a program linking with the
Hspell library must also be linked with the Zlib library (usually, by adding
"-lz" to the compilation line).
Programs that use autoconf to search for the Hspell library should remember to
tell AC_CHECK_LIB to also link with the -lz library when checking for
-lhspell.
CAVEATS
While the API described here has been stable for years, it may change in the
future. Users are encouraged to compare the values of the integer macros
HSPELL_VERSION_MAJOR and HSPELL_VERSION_MINOR to those expected by the writer
of the program. A third macro, HSPELL_VERSION_EXTRA, contains a string which
can describe subrelease modifications (e.g., beta versions).
The current Hspell C API is very low-level, in the sense that it leaves the user
to implement many features that some users take for granted in a
spell-checker. For example, it doesn't provide any facilities for a
user-defined personal dictionary. It also has separate functions for checking
valid Hebrew words and valid gimatria, and no function to do both. It is
assumed that the caller - a bigger spell-checking library or word processor
(for example) - will already have these facilities. If not, you may wish to
look at the sources of hspell(1) for an example implementation.
Currently there is no concept of separate Hspell "contexts" in an
application. Some of the context is now global for the entire application:
currently, a single list of legal prefix-particles is kept, and the dictionary
read by hspell_init() is always read from the global default place. This may
be solved in a later version, e.g., by switching to an API like:
context = hspell_new_context();
hspell_set_dictionary_path(context, "/some/path/hebrew.wgz");
hspell_init(context, flags);
...
hspell_check_word(context, word, preflenp);
Note that despite the global context mentioned above, after initialization all
functions described here are thread-safe, because they only read the
dictionary data and do not write to it.
hspell_trycorrect() is not as powerful as it could have been, with typos or
certain kinds of spelling mistakes not giving useful correction suggestions.
Along with more types of corrections, hspell_trycorrect() needs a better way
to order the likelihood of the corrections, as an unordered list of 100
corrections would be just as useful (or rather, useless) as none.
In some cases of errors during hspell_init(), warning messages are printed to
the standard error. This is a bad thing for a library to do.
There are too many CAVEATS in this manual.
VERSION
The version of hspell described by this manual page is 1.1 (December 31,
2009).
COPYRIGHT
Copyright (C) 2000-2009, Nadav Har'El <nyh@math.technion.ac.il> and Dan
Kenigsberg <danken@cs.technion.ac.il>.
Hspell is free software, released under the GNU General Public License (GPL).
Note that not only the programs in the distribution, but also the dictionary
files and the generated word lists, are licensed under the GPL. There is no
warranty of any kind.
See the LICENSE file for more information and the exact license terms.
The latest version of this software can be found at
http://hspell.ivrix.org.il/
SEE ALSO
hspell(1)