NAME¶
Text::Unidecode -- plain ASCII transliterations of Unicode text
SYNOPSIS¶
use utf8;
use Text::Unidecode;
print unidecode(
"北亰\n"
# Chinese characters for Beijing (U+5317 U+4EB0)
);
# That prints: Bei Jing
DESCRIPTION¶
It often happens that you have non-Roman text data in Unicode, but you can't
display it-- usually because you're trying to show it to a user via an
application that doesn't support Unicode, or because the fonts you need aren't
accessible. You could represent the Unicode characters as "???????"
or "\15BA\15A0\1610...", but that's nearly useless to the user who
actually wants to read what the text says.
What Text::Unidecode provides is a function, "unidecode(...)", that
takes Unicode data and tries to represent it in US-ASCII characters (i.e., the
universally displayable characters between 0x00 and 0x7F). The representation
is almost always an attempt at
transliteration-- i.e., conveying, in
Roman letters, the pronunciation expressed by the text in some other writing
system. (See the example in the synopsis.)
NOTE:
To make sure your perldoc/Pod viewing setup for viewing this page is working:
The six-letter word "résumé" should look like "resume"
with an "/" accent on each "e".
For further tests, and help if that doesn't work, see below, "A POD
ENCODING TEST".
DESIGN PHILOSOPHY¶
Unidecode's ability to transliterate from a given language is limited by two
factors:
- •
- The amount and quality of data in the written form of the original
language
So if you have Hebrew data that has no vowel points in it, then Unidecode
cannot guess what vowels should appear in a pronunciation. S f y hv n vwls
n th npt, y wn't gt ny vwls n th tpt. (This is a specific application of
the general principle of "Garbage In, Garbage Out".)
- •
- Basic limitations in the Unidecode design
Writing a real and clever transliteration algorithm for any single language
usually requires a lot of time, and at least a passable knowledge of the
language involved. But Unicode text can convey more languages than I could
possibly learn (much less create a transliterator for) in the entire rest
of my lifetime. So I put a cap on how intelligent Unidecode could be, by
insisting that it support only context-insensitive
transliteration. That means missing the finer details of any given writing
system, while still hopefully being useful.
Unidecode, in other words, is quick and dirty. Sometimes the output is not so
dirty at all: Russian and Greek seem to work passably; and while Thaana
(Divehi, AKA Maldivian) is a definitely non-Western writing system, setting up
a mapping from it to Roman letters seems to work pretty well. But sometimes
the output is
very dirty: Unidecode does quite badly on Japanese
and Thai.
If you want a smarter transliteration for a particular language than Unidecode
provides, then you should look for (or write) a transliteration algorithm
specific to that language, and apply it instead of (or at least before)
applying Unidecode.
In other words, Unidecode's approach is broad (knowing about dozens of writing
systems), but shallow (not being meticulous about any of them).
FUNCTIONS¶
Text::Unidecode provides one function, "unidecode(...)", which is
exported by default. It can be used in a variety of calling contexts:
- "$out = unidecode( $in );" # scalar context
- This returns a copy of $in, transliterated.
- "$out = unidecode( @in );" # scalar context
- This is the same as "$out = unidecode(join "",
@in);"
- "@out = unidecode( @in );" # list context
- This returns a list consisting of copies of @in, each transliterated. This
is the same as "@out = map scalar(unidecode($_)), @in;"
- "unidecode( @items );" # void context
- "unidecode( @bar, $foo, @baz );" # void context
- Each item on input is replaced with its transliteration. This is the same
as "for(@bar, $foo, @baz) { $_ = unidecode($_) }"
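The calling contexts listed above can be tried side by side. This is a minimal sketch, assuming Text::Unidecode is installed; the variable names are made up for illustration, and the two input characters are the ones from the synopsis.

```perl
use strict;
use warnings;
use utf8;                 # this source contains literal non-ASCII text
use Text::Unidecode;      # exports unidecode() by default

my @in = ("北", "亰");    # U+5317, U+4EB0

# Scalar context, one argument: returns a transliterated copy.
my $one = unidecode($in[0]);

# Scalar context, several arguments: same as unidecode(join "", @in).
my $joined = unidecode(@in);

# List context: one transliterated copy per input item.
my @each = unidecode(@in);

# Void context: the arguments themselves are replaced in place.
my @items = @in;
unidecode(@items);

print "one:    '$one'\n";
print "joined: '$joined'\n";
print "each:   '$_'\n" for @each;
print "items:  '$_'\n" for @items;
```

Only the void-context call changes its arguments; in the other contexts @in is left untouched.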
You should make a minimum of assumptions about the output of
"unidecode(...)". For example, if you assume an all-alphabetic
(Unicode) string passed to "unidecode(...)" will return an
all-alphabetic string, you're wrong-- some alphabetic Unicode characters are
transliterated as strings containing punctuation (e.g., the Armenian letter
"Թ" (U+0539) currently transliterates as "T`", a capital T
followed by a backtick).
However, these are the assumptions you
can make:
- •
- Each character 0x0000 - 0x007F transliterates as itself. That is,
"unidecode(...)" is 7-bit pure.
- •
- The output of "unidecode(...)" always consists entirely of
US-ASCII characters-- i.e., characters 0x0000 - 0x007F.
- •
- All Unicode characters translate to a sequence of (any number of)
characters that are newline ("\n") or in the range
0x0020-0x007E. That is, no Unicode character translates to
"\x01", for example. (Altho if you have a "\x01" on
input, you'll get a "\x01" in output.)
- •
- Yes, some transliterations produce a "\n" but it's just a few,
and only with good reason. Note that the value of newline ("\n")
varies from platform to platform-- see perlport.
- •
- Some Unicode characters may transliterate to nothing (i.e., empty
string).
- •
- Very many Unicode characters transliterate to multi-character sequences.
E.g., Unihan character U+5317, "北", transliterates as the
four-character string "Bei ".
- •
- Within these constraints, I may change the transliteration of
characters in future versions. For example, if someone convinces me that
the Armenian letter "Թ", currently transliterated as
"T`", would be better transliterated as "D", I
may well make that change.
- •
- Unfortunately, there are many characters that Unidecode doesn't know a
transliteration for. This is generally because the character has been
added since I last revised the Unidecode data tables. I'm always catching
up!
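A short script can check the first two guarantees in the list above against whatever version you have installed. This is only a sketch: the sample strings are arbitrary, and the last check merely reports whether non-letters showed up, since individual transliterations may change between versions.

```perl
use strict;
use warnings;
use utf8;
use Text::Unidecode;

my $ascii_in = "plain ASCII, including \x01 and \t";
my $mixed_in = "Թ and 北";      # Armenian U+0539 and Unihan U+5317

my $ascii_out = unidecode($ascii_in);
my $mixed_out = unidecode($mixed_in);

# 7-bit pure: ASCII input comes back unchanged, control characters included.
print "ASCII input unchanged\n" if $ascii_out eq $ascii_in;

# Output never contains anything outside 0x00-0x7F.
print "output is US-ASCII only\n" unless $mixed_out =~ /[^\x00-\x7F]/;

# Letters in does not mean letters out, so test rather than assume.
print "output contains non-letters\n" if $mixed_out =~ /[^A-Za-z ]/;
```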
DESIGN GOALS AND CONSTRAINTS¶
Text::Unidecode is meant to be a transliterator of last resort, to be used once
you've decided that you can't just display the Unicode data as is,
and once
you've decided you don't have a more clever, language-specific
transliterator available, or once you've
already applied smarter
algorithms or mappings that you prefer and you now just want Unidecode to do
cleanup.
Unidecode transliterates context-insensitively-- that is, a given character is
replaced with the same US-ASCII (7-bit ASCII) character or characters, no
matter what the surrounding characters are.
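To make "context-insensitive" concrete, here is a toy sketch of the idea in core Perl. It is not Unidecode's actual code or data: the three-entry table and the function name toy_transliterate are hypothetical, and dropping unmapped characters is just a choice made for this example.

```perl
use strict;
use warnings;

# Hypothetical three-entry table, for illustration only.
my %TOY_TABLE = (
    "\x{E9}"   => "e",      # e with acute accent
    "\x{DF}"   => "ss",     # German sharp s
    "\x{5317}" => "Bei ",   # the Unihan character from the synopsis
);

sub toy_transliterate {
    my ($text) = @_;
    # Each non-ASCII character is looked up on its own; its neighbors are
    # never consulted. Unmapped characters are dropped in this toy version.
    $text =~ s/([^\x00-\x7F])/exists $TOY_TABLE{$1} ? $TOY_TABLE{$1} : ""/ge;
    return $text;
}

print toy_transliterate("r\x{E9}sum\x{E9}"), "\n";   # prints: resume
```

Because the replacement for a character depends only on that character, the same input character always yields the same output, wherever it appears.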
The main reason I'm making Text::Unidecode work with only context-insensitive
substitution is that it's fast, dumb, and straightforward enough to be
feasible. It doesn't tax my (quite limited) knowledge of world languages. It
doesn't require me writing a hundred lines of code to get the Thai
syllabification right (and never knowing whether I've gotten it wrong, because
I don't know Thai), or spending a year trying to get Text::Unidecode to use
the ChaSen algorithm for Japanese, or trying to write heuristics for telling
the difference between Japanese, Chinese, or Korean, so it knows how to
transliterate any given Uni-Han glyph. And moreover, context-insensitive
substitution is still mostly useful, but still clearly couldn't be mistaken
for authoritative.
Text::Unidecode is an example of the 80/20 rule in action-- you get 80% of the
usefulness using just 20% of a "real" solution.
A "real" approach to transliteration for any given language can
involve such increasingly tricky contextual factors as these:
- The preceding / following character(s)
- What a given symbol "X" means, could depend on whether it's
followed by a consonant, or by vowel, or by some diacritic character.
- Syllables
- A character "X" at end of a syllable could mean something
different from when it's at the start-- which is especially problematic
when the language involved doesn't explicitly mark where one syllable
stops and the next starts.
- Parts of speech
- What "X" sounds like at the end of a word, depends on whether
that word is a noun, or a verb, or what.
- Meaning
- By semantic context, you can tell that this ideogram "X" means
"shoe" (pronounced one way) and not "time" (pronounced
another), and that's how you know to transliterate it one way instead of
the other.
- Origin of the word
- "X" means one thing in loanwords and/or placenames (and
derivatives thereof), and another in native words.
- "It's just that way"
- "X" normally makes the /X/ sound, except for this list of
seventy exceptions (and words based on them, sometimes indirectly). Or:
you never can tell which of the three ways to pronounce "X" this
word actually uses; you just have to know which it is, so keep a
dictionary on hand!
- Language
- The character "X" is actually used in several different
languages, and you have to figure out which you're looking at before you
can determine how to transliterate it.
Out of a desire to avoid being mired in
any of these kinds of contextual
factors, I chose to exclude
all of them and just stick with
context-insensitive replacement.
A POD ENCODING TEST¶
- •
- "Brontë" is six characters that should look like
"Bronte", but with double-dots on the "e"
character.
- •
- "Résumé" is six characters that should look like
"Resume", but with /-shaped accents on the "e"
characters.
- •
- "læti" should be four letters long-- the second letter
should not be two letters "ae", but should be a single letter
that looks like an "a" entirely fused with an
"e".
- •
- "χρονος" is six Greek characters that should look kind of like:
xpovoc
- •
- "КАК ВАС ЗОВУТ" is three short Russian words that should look a
lot like: KAK BAC 3OBYT
- •
- "ടധ" is two Malayalam characters that should look like: sw
- •
- "丫二十一" is four Chinese characters that should look like:
"Y=+-"
If all of those come out right, your Pod viewing setup is working fine-- welcome
to the 2010s! If those are full of garbage characters, consider viewing this
page as HTML at <https://metacpan.org/pod/Text::Unidecode> or
<http://search.cpan.org/perldoc?Text::Unidecode>.
If things look mostly okay, but the Malayalam and/or the Chinese are just
question-marks or empty boxes, it's probably just that your computer lacks the
fonts for those.
TODO¶
Lots:
* Rebuild the Unihan database. (Talk about hitting a moving target!)
* Add tone-numbers for Mandarin hanzi? Namely: In Unihan, when tone marks are
present (like in "kMandarin: dào"), should I continue to
transliterate as just "Dao", or should I put in the tone number:
"Dao4"? It would be pretty jarring to have digits appear where
previously there was just alphabetic stuff-- but tone numbers make Chinese
more readable.
* Start dealing with characters over U+FFFF.
* Fill in all the little characters that've crept into the Misc Symbols Etc
blocks.
* More things that need tending to are detailed in the TODO.txt file, included
in this distribution. Normal installs probably don't leave the TODO.txt lying
around, but if nothing else, you can see it at
<http://search.cpan.org/search?dist=Text::Unidecode>
MOTTO¶
The Text::Unidecode motto is:
It's better than nothing!
...in
both meanings: 1) seeing the output of "unidecode(...)"
is better than just having all font-unavailable Unicode characters replaced
with "?"'s, or rendered as gibberish; and 2) it's the worst, i.e.,
there's nothing that Text::Unidecode's algorithm is better than. All sensible
transliteration algorithms (like for German, see below) are going to be
smarter than Unidecode's.
WHEN YOU DON'T LIKE WHAT UNIDECODE DOES¶
I will repeat the above, because some people miss it:
Text::Unidecode is meant to be a transliterator of
last resort, to be
used once you've decided that you can't just display the Unicode data as is,
and once you've decided you don't have a more clever,
language-specific transliterator available-- or once you've
already
applied a smarter algorithm and now just want Unidecode to do cleanup.
In other words, when you don't like what Unidecode does,
do it
yourself. Really, that's what the above says. Here's how you would do
this for German, for example:
In German, there's the typographical convention that an umlaut (the double-dots
on: ä ö ü) can be written as an "-e", like with
"Schön" becoming "Schoen". But Unidecode doesn't do
that-- I have Unidecode simply drop the umlaut accent and give back
"Schon".
(I chose this not because I'm a big meanie, but because generally
changing "ü" to "ue" is disastrous for all text that's
not in German. Finnish "Hyvää päivää" would turn into
"Hyvaeae paeivaeae". And I discourage you from being yet
another German who emails me, trying to impel me to consider a
typographical nicety of German to be more important than all other
languages.)
If you know that the text you're handling is probably in German, and you want to
apply the "umlaut becomes -e" rule, here's how to do it for yourself
(and then use Unidecode as
the fallback afterwards):
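use utf8;  # needed: this source contains literal non-ASCII characters

# Each umlauted letter (and sharp s) mapped to its "-e" spelling.
our( %German_Characters ) = qw(
  Ä AE   ä ae
  Ö OE   ö oe
  Ü UE   ü ue
  ß ss
);

use Text::Unidecode qw(unidecode);

sub german_to_ascii {
  my($german_text) = @_;

  # Apply the German-specific rule first...
  $german_text =~
    s/([ÄäÖöÜüß])/$German_Characters{$1}/g;

  # And now, as a *fallthrough*:
  $german_text = unidecode( $german_text );
  return $german_text;
}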
To pick another example, here's something that's not about a specific language,
but simply having a preference that may or may not agree with Unidecode's
(i.e., mine). Consider the "¥" symbol. Unidecode changes that to
"Y=". If you want "¥" as "YEN", then...
use utf8;  # needed: this source contains literal non-ASCII characters
use Text::Unidecode qw(unidecode);

sub my_favorite_unidecode {
  my($text) = @_;

  $text =~ s/¥/YEN/g;
  # ...and anything else you like, such as:
  $text =~ s/€/Euro/g;

  # And then, as a fallback,...
  $text = unidecode($text);
  return $text;
}
Then if you do:
print my_favorite_unidecode("You just won ¥250,000 and €40,000!!!");
...you'll get:
You just won YEN250,000 and Euro40,000!!!
...just as you like it.
(By the way, the reason I don't have Unidecode just turn "¥"
into "YEN" is that the same symbol also stands for yuan, the Chinese
currency. A "Y=" is nicely, safely neutral as to whether
we're talking about yen or yuan-- Japan, or China.)
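Both examples above follow the same pattern: apply your own replacements first, then let Unidecode handle whatever is left. That pattern can be wrapped in a small helper. This is a sketch only; make_transliterator is a hypothetical name and is not part of Text::Unidecode.

```perl
use strict;
use warnings;
use utf8;
use Text::Unidecode qw(unidecode);

# Hypothetical helper: takes your preferred replacements and returns a
# function that applies them before falling back to unidecode().
sub make_transliterator {
    my (%override) = @_;

    # Longest keys first, so multi-character overrides beat shorter ones.
    my $alternation = join "|",
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b } keys %override;

    return sub {
        my ($text) = @_;
        $text =~ s/($alternation)/$override{$1}/g if length $alternation;
        return unidecode($text);    # fallback for everything else
    };
}

my $to_ascii = make_transliterator( "¥" => "YEN", "€" => "Euro" );
print $to_ascii->("You just won ¥250,000 and €40,000!!!"), "\n";
```

The overrides run before unidecode() so that your choices win; once a character has been turned into ASCII, Unidecode leaves it alone.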
Another example: for hanzi/kanji/hanja, I have designed Unidecode to
transliterate according to the value that that character has in Mandarin
(otherwise Cantonese,...). Some users have complained that applying Unidecode
to Japanese produces gibberish.
To make a long story short: transliterating from Japanese is
difficult
and it requires a
lot of context-sensitivity. If you have text that
you're fairly sure is in Japanese, you're going to have to use a
Japanese-specific algorithm to transliterate Japanese into ASCII. (And then
you can call Unidecode on the output from that-- it is useful for, for
example, turning ｆｕｌｌｗｉｄｔｈ characters into their normal (ASCII) forms.)
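As a small illustration of that cleanup role, assume the characters in question are fullwidth Latin letters (U+FF41 and its neighbors), as they often appear in Japanese text. The sketch below passes such a string through unidecode(); the variable names are made up for the example.

```perl
use strict;
use warnings;
use utf8;
use Text::Unidecode;

# Fullwidth Latin letters, as commonly found in Japanese text.
my $fullwidth = "ｆｕｌｌｗｉｄｔｈ";

# No context is needed here: each fullwidth letter has one obvious
# ASCII counterpart, which suits Unidecode's design.
my $ascii = unidecode($fullwidth);

print "$ascii\n";   # should print: fullwidth
```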
CAVEATS¶
If you get really implausible nonsense out of "unidecode(...)", make
sure that the input data really is a utf8 string. See perlunicode and
perlunitut.
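A common cause of such nonsense is handing "unidecode(...)" raw UTF-8 bytes instead of a decoded character string. The sketch below, which assumes the input arrived as undecoded bytes (say, from a file read without an encoding layer), decodes with the core Encode module first.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Three raw bytes: the UTF-8 encoding of the single character U+5317.
my $bytes = "\xE5\x8C\x97";

# Decode the bytes into a Perl character string before transliterating.
my $chars = decode("UTF-8", $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
print unidecode($chars), "\n";   # prints "Bei " (with its trailing space)
```

Without the decode step, each byte would be treated as a separate character, and the result would have nothing to do with the original text.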
THANKS¶
Thanks to (in only the sloppiest of sorta-chronological order): Jordan Lachler,
Harald Tveit Alvestrand, Melissa Axelrod, Abhijit Menon-Sen, Mark-Jason
Dominus, Joe Johnston, Conrad Heiney, fileformat.info, Philip Newton, XX,
Tomaž Šolc, Mike Doherty, JT Smith and the MadMongers, Arden Ogg, Craig
Copris, and
many other pals in Unicode's behind-the-scenes F5 tornado
underlying its code.
SEE ALSO¶
An article I wrote for The Perl Journal about Unidecode:
<http://interglacial.com/tpj/22/> (READ IT!)
Unicode Consortium: <http://www.unicode.org/>
Searchable Unihan database:
<http://www.unicode.org/cgi-bin/GetUnihanData.pl>
Geoffrey Sampson. 1990. Writing Systems: A Linguistic Introduction.
ISBN: 0804717567
Randall K. Barry (editor). 1997. ALA-LC Romanization Tables:
Transliteration Schemes for Non-Roman Scripts. ISBN: 0844409405 [ALA is
the American Library Association; LC is the Library of Congress.]
Rupert Snell. 2000. Beginner's Hindi Script (Teach Yourself Books).
ISBN: 0658009109
LICENSE¶
Copyright (c) 2001, 2014 Sean M. Burke.
Unidecode is distributed under the Perl Artistic License ( perlartistic ),
namely:
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but without any
warranty; without even the implied warranty of merchantability or fitness for
a particular purpose.
DISCLAIMER¶
Much of Text::Unidecode's internal data is based on data from The Unicode
Consortium, with which I am unaffiliated.
The views and conclusions contained in the software and documentation are those
of the authors/contributors and should not be interpreted as representing
official policies, either expressed or implied, of The Unicode Consortium.
AUTHOR¶
Sean M. Burke "sburke@cpan.org"