NAME
unac - remove accents from string or character
SYNOPSIS
#include <unac.h>
const char* unac_version();
int unac_string(const char* charset,
                const char* in, size_t in_length,
                char** out, size_t* out_length);
int unac_string_utf16(const char* in, size_t in_length,
                      char** out, size_t* out_length);
/* MACRO: side effect on unaccented and length arguments */
unac_char_utf16(unsigned short c,
unsigned short* unaccented,
int length);
/*
* The level argument can be one of:
* UNAC_DEBUG_NONE UNAC_DEBUG_LOW UNAC_DEBUG_HIGH
*/
void unac_debug(int level);
typedef void (*unac_debug_print_t)(const char* message, void* data);
void unac_debug_callback(int level, unac_debug_print_t function, void* data);
DESCRIPTION
unac is a C library that removes accents from characters, regardless of
the character set (ISO-8859-15, ISO-CELTIC, KOI8-RU...) as long as
iconv(3) is able to convert it into UTF-16 (Unicode).
The
unac_string function is given a charset (ISO-8859-15 for instance)
and a string. It converts the string into UTF-16 and calls the
unac_string_utf16 function to remove all accents from the UTF-16
version. The unaccented string is then converted into the original charset
(ISO-8859-15 for instance) and returned to the caller of
unac_string.
unac does a little more than remove accents: every character that is
made of two characters, such as æ (ISO-8859-15 octal code 346), is
expanded into those two characters, a and e. Should a character be
made of three characters, it would be decomposed in the same way.
The conversion from and to UTF-16 is done with iconv(3). The iconv -l
command lists all available charsets. Using UTF-16 as a pivot implies an
overhead but ensures that accents can be removed from every character
that has an equivalent in Unicode.
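The first half of the pivot can be sketched with iconv(3) alone. This is a minimal illustration, not unac's own code: to_utf16be is a hypothetical helper, and the caller is assumed to supply an output buffer that is large enough.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert in_length bytes from `charset` into UTF-16BE.
 * Returns the number of bytes written to `out`, or (size_t)-1 on error,
 * in which case errno is set by iconv_open(3) or iconv(3).
 * to_utf16be is a hypothetical helper, not part of unac.
 */
static size_t to_utf16be(const char* charset, const char* in, size_t in_length,
                         char* out, size_t out_size)
{
    iconv_t cd = iconv_open("UTF-16BE", charset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char* src = (char*)in;       /* iconv(3) takes a non-const pointer */
    char* dst = out;
    size_t src_left = in_length;
    size_t dst_left = out_size;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    int saved = errno;           /* iconv_close may overwrite errno */
    iconv_close(cd);
    if (rc == (size_t)-1) {
        errno = saved;
        return (size_t)-1;
    }
    return out_size - dst_left;
}
```

Converting back to the original charset would use the same pattern with the two arguments of iconv_open swapped.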
unac_char_utf16 is a CPP macro that returns a pointer to the unaccented
equivalent of a given UTF-16 character. It is the basic building block of
unac.
unac_string_utf16 repeatedly applies the unac_char_utf16 macro to
each character of a UTF-16 string.
FUNCTIONS
- int unac_string(const char* charset, const char* in,
size_t in_length, char** out, size_t* out_length)
-
Returns the unaccented equivalent of the string 'in' of length
'in_length' bytes. The returned string is stored in the pointer
pointed to by the 'out' argument, and its length, in bytes, is stored
in the integer pointed to by the 'out_length' argument. If the '*out'
pointer is not null, it must point to an area allocated by malloc(3),
and the size of that area must be given in '*out_length'. On success,
both '*out' and '*out_length' are replaced with the return values.
The unac_string function may reallocate the area (using realloc(3)),
so there is no guarantee that '*out' is identical to the value given
by the caller, and the pointer originally provided as '*out' may no
longer be usable when the function returns, whether it succeeds or
fails. If the '*out' pointer is null, the unac_string function
allocates a new memory block using malloc(3). It is the responsibility
of the caller to deallocate the area returned in the '*out' pointer.
The return value of unac_string is 0 on success and -1 on error, in
which case the errno variable is set to the corresponding error code. See
the ERRORS section below for more information. The iconv(3) manual
page may also help.
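The buffer contract is easy to get wrong, so here is a sketch of the caller side. To stay compilable without the library it uses demo_copy, a hypothetical stand-in that follows the contract described above but only copies its input; it is not unac_string and removes no accents. demo_reuse is likewise hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in that follows the buffer contract described above.
 * It only copies its input; it is NOT unac_string.
 */
static int demo_copy(const char* in, size_t in_length,
                     char** out, size_t* out_length)
{
    /* A null *out means "allocate for me"; otherwise *out_length is its size. */
    size_t capacity = (*out != NULL) ? *out_length : 0;

    if (capacity < in_length + 1) {
        /* realloc(NULL, n) behaves like malloc(n); the block may move. */
        char* grown = realloc(*out, in_length + 1);
        if (grown == NULL)
            return -1;
        *out = grown;
    }
    memcpy(*out, in, in_length);
    (*out)[in_length] = '\0';
    *out_length = in_length;     /* now holds the result length in bytes */
    return 0;
}

/* Caller side: reuse one buffer for two calls and hand it over once. */
static int demo_reuse(const char* first, const char* second, char** result)
{
    char* out = NULL;            /* let the first call allocate */
    size_t out_length = 0;

    if (demo_copy(first, strlen(first), &out, &out_length) != 0)
        return -1;
    /* Pass the same variables again; never keep a copy of the old pointer. */
    if (demo_copy(second, strlen(second), &out, &out_length) != 0) {
        free(out);
        return -1;
    }
    *result = out;               /* the caller of demo_reuse must free it */
    return 0;
}
```

The point of the pattern is that the caller only ever reads the pointer back from the 'out' variable after each call, and frees it exactly once at the end.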
- int unac_string_utf16(const char* in, size_t in_length,
char** out, size_t* out_length)
-
Has the same effect as unac_string("UTF-16", in, in_length,
out, out_length). Because unac_string_utf16 is the backend of
unac_string, calling it directly is more efficient: no charset
conversion of the input string (from and to UTF-16) is necessary.
- unac_char_utf16(const unsigned short c, unsigned short*
p, int l)
-
Warning: this is a macro, so each argument may be evaluated more than
once. Stores in the pointer 'p' the unaccented equivalent of the UTF-16
character 'c'. The length of the unsigned short array pointed to by
'p' is stored in the 'l' argument.
- const char* unac_version()
-
Return the version number of unac.
- void unac_debug(int level)
- Set the debug level of the unac library to 'level'.
Possible values are: UNAC_DEBUG_NONE for no debug at all, UNAC_DEBUG_LOW
for terse human readable information, UNAC_DEBUG_HIGH for very detailed
information only usable when translating a few strings.
unac_debug_callback with anything but UNAC_DEBUG_NONE is not thread safe.
- void unac_debug_callback(int level, unac_debug_print_t
function, void* data)
-
Set the debug level and define a printing function callback. The
'level' is the same as in unac_debug. The 'function' is in
charge of dealing with the debug messages, presumably to print them to the
user. The 'data' is an opaque pointer that is passed along to
function, should it need to manage a persistent context.
The prototype of 'function' accepts two arguments. The first is the
debug message (const char*), the second is the opaque pointer given
as 'data' argument to unac_debug_callback.
If 'function' is NULL, messages are printed on the standard error
output using fprintf(stderr...).
unac_debug_callback with anything but UNAC_DEBUG_NONE is not thread
safe.
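A callback with a persistent context might look like the sketch below. The typedef is copied from the SYNOPSIS so that the fragment compiles without unac.h (drop it when including the header); struct debug_log and collect_message are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* Same prototype as in the SYNOPSIS, repeated so this compiles standalone. */
typedef void (*unac_debug_print_t)(const char* message, void* data);

/* Hypothetical context: accumulates messages in a fixed-size buffer. */
struct debug_log {
    char text[256];
    size_t used;
};

/* Appends each message to the log passed as the opaque 'data' pointer. */
static void collect_message(const char* message, void* data)
{
    struct debug_log* log = data;
    size_t room = sizeof log->text - log->used;   /* always at least 1 */
    int n = snprintf(log->text + log->used, room, "%s", message);
    if (n > 0)
        log->used += ((size_t)n < room) ? (size_t)n : room - 1;
}
```

With the library available, the callback would be registered with unac_debug_callback(UNAC_DEBUG_LOW, collect_message, &log), where log is a struct debug_log that outlives the calls being traced.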
ERRORS
- EINVAL
-
The requested conversion pair is not available. For instance,
specifying the (imaginary) ISO-0000 charset fails because it is not
possible to convert from ISO-0000 to UTF-16.
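Since the conversion relies on iconv(3), a charset can be probed beforehand with iconv_open(3), which fails with EINVAL when the pair is not supported. A minimal sketch; charset_is_supported is a hypothetical helper, and it assumes the system provides UTF-16BE.

```c
#include <errno.h>
#include <iconv.h>

/*
 * Returns 1 if iconv(3) can convert `charset` to UTF-16BE, 0 otherwise.
 * On failure errno is left as set by iconv_open(3), EINVAL when the
 * conversion pair is unavailable. Hypothetical helper, not part of unac.
 */
static int charset_is_supported(const char* charset)
{
    iconv_t cd = iconv_open("UTF-16BE", charset);
    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}
```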
EXAMPLES
Convert the
été string into
ete.
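#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unac.h>
int main(void)
{
    const char* in = "\351t\351"; /* "été" encoded in ISO-8859-1 */
    char* out = 0;
    size_t out_length = 0;
    if(unac_string("ISO-8859-1", in, strlen(in), &out, &out_length)) {
        perror("unac_string");
        return 1;
    }
    printf("%.*s\n", (int)out_length, out);
    free(out);
    return 0;
}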
IMPLEMENTATION NOTES
The endianness of the UTF-16 strings manipulated by unac must always be
big endian. When using iconv(3) to translate strings, UTF-16BE should
be used instead of UTF-16 to make sure the result is big endian (BE). On
systems where UTF-16BE is not available, unac relies on UTF-16 and hopes
it is properly big endian encoded. For more information see RFC 2781
(UTF-16, an encoding of ISO 10646) at
http://www.faqs.org/rfcs/rfc2781.html.
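Because the strings are passed as char* but hold big endian 16-bit units, reading or writing one unit by hand means assembling the two bytes explicitly instead of casting to unsigned short*. A small sketch; utf16be_get and utf16be_put are hypothetical helper names.

```c
#include <stddef.h>

/* Read the i-th 16-bit code unit of a UTF-16BE buffer (high byte first). */
static unsigned short utf16be_get(const char* s, size_t i)
{
    const unsigned char* p = (const unsigned char*)s + 2 * i;
    return (unsigned short)((p[0] << 8) | p[1]);
}

/* Write code unit c as the i-th unit of a UTF-16BE buffer. */
static void utf16be_put(char* s, size_t i, unsigned short c)
{
    unsigned char* p = (unsigned char*)s + 2 * i;
    p[0] = (unsigned char)(c >> 8);
    p[1] = (unsigned char)(c & 0xFF);
}
```

Working byte by byte gives the same result on little endian and big endian hosts.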
The
unac library uses the Unicode database to map accented letters to
their unaccented equivalent. Mapping tables are generated from the
UnicodeData-4.0.0.txt file (as found at
http://www.unicode.org/Public/4.0-Update/) by the
builder perl script.
The builder script inserts these tables in the unac.h and unac.c files,
replacing the existing ones. Searching for the 'Generated by builder'
string in the unac.[ch] files reveals the parts handled by the builder
script.
Some desirable decompositions, such as AE, may not be included in the
UnicodeData file. To complement the standard decompositions for the
purpose of the unac library, the unaccent-local-map.perl script is used.
It maps character names (such as LATIN SMALL LETTER AE) to an array of
character names into which they will be decomposed. This script is used
by the builder script and has precedence over the decomposition rules
defined in the Unicode data file.
The library data occupies 30 Kbytes, where a naive table would occupy
around 512 Kbytes. The idea used to compress the tables is that many
Unicode characters do not have an unaccented equivalent. Instead of
relying on a table mapping each Unicode character to the corresponding
unaccented character, an intermediate array of pointers is created. In
the drawing below, the range of UTF-16 characters is not accurate but
illustrates the method. The unac_data_table points to a set of
unac_dataXX arrays. Each pointer covers a range of UTF-16 characters (4
in the example below). When a range of characters does not contain any
accented character, unac_data_table always points to the same array:
unac_data0. Since there are many characters without accents, this is
enough to achieve a good compression.
       unac_data15                                  unac_data16
 [ NULL, NULL, NULL, e ] <----       /------> [ a, NULL, NULL, NULL ]
                             |       |
                             |       |
                             ^       ^
          |-----| |-----| |-----| |-----| |-----| |-----|
    [ ... a b c d e f g h i j k é à 0 1 2 3 4 5 6 7 8 9 A... ] unac_data_table
          |-----| |-----| |-----| |-----| |-----| |-----|
             v       v                       v       v
             |       |                       |       |
             |       |                       |       |
             ----------------------------------------/
                                 |
                                 V
                    [ NULL, NULL, NULL, NULL ]
                            unac_data0
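The block sharing shown in the drawing can be sketched as a toy two-level lookup. Everything here is an assumption made for illustration: the block size of 4 comes from the drawing, only U+0000..U+00FF is covered, the first level holds small block indices rather than raw pointers so that unlisted ranges default to the shared empty block, and the toy_ names are not unac's.

```c
#define TOY_BLOCK_SIZE 4   /* block size of the drawing, not unac's real one */

/* 0 plays the role of NULL in the drawing: no unaccented equivalent. */
static const unsigned short toy_data0[TOY_BLOCK_SIZE] = { 0, 0, 0, 0 };
/* U+00E0..U+00E3: à á â ã */
static const unsigned short toy_data1[TOY_BLOCK_SIZE] = { 'a', 'a', 'a', 'a' };
/* U+00E8..U+00EB: è é ê ë */
static const unsigned short toy_data2[TOY_BLOCK_SIZE] = { 'e', 'e', 'e', 'e' };

static const unsigned short* const toy_blocks[] = {
    toy_data0, toy_data1, toy_data2
};

/* First level: one entry per block of 4 code points in U+0000..U+00FF.
 * Entries not listed are 0 and therefore share toy_data0. */
static const unsigned char toy_block_of[0x100 / TOY_BLOCK_SIZE] = {
    [0xE0 / TOY_BLOCK_SIZE] = 1,
    [0xE8 / TOY_BLOCK_SIZE] = 2,
};

/* Returns the unaccented character, or c itself when there is none. */
static unsigned short toy_unaccent_char(unsigned short c)
{
    if (c > 0xFF)
        return c;                        /* outside the toy range */
    const unsigned short* block = toy_blocks[toy_block_of[c / TOY_BLOCK_SIZE]];
    unsigned short r = block[c % TOY_BLOCK_SIZE];
    return r ? r : c;
}
```

In this toy, 62 of the 64 first-level entries share the single empty block, which is where the space saving comes from.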
Besides this simple optimization, a table (unac_positions) listing the
actual position of the unaccented replacement within a block
(unac_dataXX) is necessary because replacements are not of fixed length.
Some characters, such as æ, are replaced by two characters (a and e),
so unac_dataXX has a variable size.
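A positions table for one variable-size block can be sketched as follows. This is a toy for the single block U+00E4..U+00E7 (ä å æ ç); the names and layout are assumptions for illustration, with one more position than there are slots so that each length is the difference between two neighbours.

```c
#include <stddef.h>

/* Replacements for U+00E4..U+00E7 stored back to back: a, a, ae, c. */
static const unsigned short toy_data[] = { 'a', 'a', 'a', 'e', 'c' };

/* toy_positions[i] is where slot i starts; slot i ends at toy_positions[i+1]. */
static const unsigned char toy_positions[5] = { 0, 1, 2, 4, 5 };

/*
 * Stores in *p the start of the replacement for c and returns its length
 * in 16-bit units. Returns 0 and sets *p to NULL outside the toy block.
 */
static int toy_lookup(unsigned short c, const unsigned short** p)
{
    if (c < 0x00E4 || c > 0x00E7) {
        *p = NULL;
        return 0;
    }
    int slot = c - 0x00E4;
    *p = toy_data + toy_positions[slot];
    return toy_positions[slot + 1] - toy_positions[slot];
}
```

A slot whose two neighbouring positions are equal would describe a character with no replacement at all.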
The unaccented equivalent of a UTF-16 character is calculated by
applying a compatibility decomposition and then stripping all characters
that belong to the mark category. For a precise definition see the
Unicode-4.0 normalization forms at
http://www.unicode.org/unicode/reports/tr15/.
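The two steps can be illustrated on a handful of characters. This sketch hard-codes three decompositions taken from the Unicode data and approximates the mark category by the Combining Diacritical Marks block (U+0300..U+036F); the real category is larger, and the toy_ names are hypothetical.

```c
#include <stddef.h>

struct toy_decomposition {
    unsigned short c;          /* character to decompose */
    unsigned short parts[3];   /* its decomposition */
    int n;                     /* number of parts used */
};

static const struct toy_decomposition toy_rules[] = {
    { 0x00E9, { 0x0065, 0x0301 }, 2 },  /* é -> e + COMBINING ACUTE ACCENT */
    { 0x00E7, { 0x0063, 0x0327 }, 2 },  /* ç -> c + COMBINING CEDILLA */
    { 0xFB01, { 0x0066, 0x0069 }, 2 },  /* fi ligature -> f + i (compatibility) */
};

/* Approximation: only the Combining Diacritical Marks block counts as a mark. */
static int toy_is_mark(unsigned short c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Decomposes c, drops the marks, writes up to 3 units to out, returns the count. */
static int toy_strip(unsigned short c, unsigned short* out)
{
    size_t i;
    int k, n = 0;
    for (i = 0; i < sizeof toy_rules / sizeof toy_rules[0]; i++) {
        if (toy_rules[i].c != c)
            continue;
        for (k = 0; k < toy_rules[i].n; k++)
            if (!toy_is_mark(toy_rules[i].parts[k]))
                out[n++] = toy_rules[i].parts[k];
        return n;
    }
    out[0] = c;                /* no rule: the character is kept as is */
    return 1;
}
```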
All original Unicode data files were taken from
http://www.unicode.org/Public
and are subject to the
UCD Terms of Use.
http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html#UCD_Terms
Disclaimer
The Unicode Character Database is provided as is by Unicode, Inc. No claims are
made as to fitness for any particular purpose. No warranties of any kind are
expressed or implied. The recipient agrees to determine applicability of
information provided. If this file has been purchased on magnetic or optical
media from Unicode, Inc., the sole remedy for any claim will be exchange of
defective media within 90 days of receipt.
This disclaimer is applicable for all other data files accompanying the Unicode
Character Database, some of which have been compiled by the Unicode
Consortium, and some of which have been supplied by other sources.
Limitations on Rights to Redistribute This Data
Recipient is granted the right to make copies in any form for internal
distribution and to freely use the information supplied in the creation of
products supporting the Unicode(TM) Standard. The files in the Unicode Character
Database can be redistributed to third parties or other organizations (whether
for profit or not) as long as this notice and the disclaimer notice are
retained. Information can be extracted from these files and used in
documentation or programs, as long as there is an accompanying notice
indicating the source.
The file Unihan.txt contains older and inconsistent Terms of Use. That language
is overridden by these terms.
BUGS
The input string must not contain partially formed characters; this
case is not supported.
UTF-16 surrogates are not handled.
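A caller working with UTF-16BE input can screen for both situations before handing the string over. A minimal sketch with hypothetical helper names; the odd-length test only catches the simplest form of a partially formed UTF-16 character.

```c
#include <stddef.h>

/* Returns 1 if the byte length cannot hold whole 16-bit units. */
static int utf16be_is_truncated(size_t length)
{
    return length % 2 != 0;
}

/*
 * Returns 1 if the UTF-16BE buffer contains a surrogate code unit
 * (U+D800..U+DFFF), recognised by a high byte in 0xD8..0xDF.
 */
static int utf16be_has_surrogate(const char* s, size_t length)
{
    size_t i;
    for (i = 0; i + 1 < length; i += 2) {
        unsigned char high = (unsigned char)s[i];
        if (high >= 0xD8 && high <= 0xDF)
            return 1;
    }
    return 0;
}
```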
Unicode may contain bugs in the decomposition of characters. When you suspect
such a bug on a given string, add a test case with the faulty string in the
t_unac.in test script (you will find it in the source distribution) and
run
make check. It will describe, in a very verbose way, how the string
was unaccented. You may then fix the UnicodeData-4.0.0.txt file and run
make check again to make sure the problem is solved. Please send such
fixes to the author and to the Unicode consortium.
SEE ALSO
unaccent(1),
iconv(3)
http://www.unicode.org/
http://oss.software.ibm.com/icu/
http://www.gnu.org/manual/glibc-2.2.5/libc.html
AUTHOR
Loic Dachary loic@senga.org
http://www.senga.org/unac/