.\" Automatically generated by Pod::Man v1.37, Pod::Parser v1.32
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sh \" Subsection heading
.br
.if t .Sp
.ne 5
.PP
\fB\\$1\fR
.PP
..
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
'br\}
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.Sh), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.if \nF \{\
.    de IX
.    tm Index:\\$1\t\\n%\t"\\$2"
..
.    nr % 0
.    rr F
.\}
.\"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.hy 0
.if n .na
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "Text::Unidecode 3pm"
.TH Text::Unidecode 3pm "2008-03-01" "perl v5.8.8" "User Contributed Perl Documentation"
.SH "NAME"
Text::Unidecode \-\- US\-ASCII transliterations of Unicode text
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 6
\&  use utf8;
\&  use Text::Unidecode;
\&  print unidecode(
\&    "\ex{5317}\ex{4EB0}\en"
\&     # those are the Chinese characters for Beijing
\&  );
.Ve
.PP
.Vb 1
\&  # That prints: Bei Jing
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
It often happens that you have non-Roman text data in Unicode, but
you can't display it \*(-- usually because you're trying to
show it to a user via an application that doesn't support Unicode,
or because the fonts you need aren't accessible.  You could
represent the Unicode characters as \*(L"???????\*(R" or
\&\*(L"\e15BA\e15A0\e1610...\*(R", but that's nearly useless to the user who
actually wants to read what the text says.
.PP
Text::Unidecode provides a function, \f(CW\*(C`unidecode(...)\*(C'\fR, that
takes Unicode data and tries to represent it in US-ASCII characters
(i.e., the universally displayable characters between 0x00 and
0x7F).  The representation is
almost always an attempt at \fItransliteration\fR \*(-- i.e., conveying,
in Roman letters, the pronunciation expressed by the text in
some other writing system.  (See the example in the synopsis.)
.PP
Unidecode's ability to transliterate is limited by two factors:
.IP "* The amount and quality of data in the original" 4
.IX Item "The amount and quality of data in the original"
If you have Hebrew data
that has no vowel points in it, then Unidecode cannot guess what
vowels should appear in a pronunciation.
S f y hv n vwls n th npt, y wn't gt ny vwls
n th tpt.  (This is a specific application of the general principle
of \*(L"Garbage In, Garbage Out\*(R".)
.IP "* Basic limitations in the Unidecode design" 4
.IX Item "Basic limitations in the Unidecode design"
Writing a real and clever transliteration algorithm for any single
language usually requires a lot of time, and at least a passable
knowledge of the language involved.  But Unicode text can convey
more languages than I could possibly learn (much less create a
transliterator for) in the entire rest of my lifetime.  So I put
a cap on how intelligent Unidecode could be, by insisting that
it support only context\-\fIin\fRsensitive transliteration.  That means
missing the finer details of any given writing system,
while still hopefully being useful.
.PP
Unidecode, in other words, is quick and
dirty.  Sometimes the output is not so dirty at all:
Russian and Greek seem to work passably; and
while Thaana (Divehi, \s-1AKA\s0 Maldivian) is a definitely non-Western
writing system, setting up a mapping from it to Roman letters
seems to work pretty well.  But sometimes the output is \fIvery
dirty:\fR Unidecode does quite badly on Japanese and Thai.
.PP
If you want a smarter transliteration for a particular language
than Unidecode provides, then you should look for (or write)
a transliteration algorithm specific to that language, and apply
it instead of (or at least before) applying Unidecode.
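.PP
As a rough sketch of the "before" option, the example below runs a small
language-specific pass and then hands the result to
\&\f(CW\*(C`unidecode(...)\*(C'\fR as the fallback.  The \f(CW%german\fR table and the
\&\f(CW\*(C`translit_german\*(C'\fR function are hypothetical names made up for this
illustration; they are not part of Text::Unidecode.
.PP
.Vb 24
```perl
use strict;
use warnings;
use feature 'say';
use Text::Unidecode;

# Hypothetical language-specific rules: a common German convention.
my %german = (
    chr(0x00E4) => 'ae',   # a with diaeresis
    chr(0x00F6) => 'oe',   # o with diaeresis
    chr(0x00FC) => 'ue',   # u with diaeresis
    chr(0x00DF) => 'ss',   # sharp s
);

sub translit_german {
    my ($text) = @_;
    # Language-specific pass first.
    my $pre = join '',
        map { exists $german{$_} ? $german{$_} : $_ } split //, $text;
    # Then Unidecode handles whatever non-ASCII characters are left.
    return scalar unidecode($pre);
}

say translit_german('M' . chr(0x00FC) . 'nchen');   # prints "Muenchen"
```
.Ve
.PP
Because the first pass already turned the input into plain US-ASCII, the
final call leaves it unchanged; characters the table does not cover would
still get Unidecode's generic treatment.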
.PP
In other words, Unidecode's
approach is broad (knowing about dozens of writing systems), but
shallow (not being meticulous about any of them).
.SH "FUNCTIONS"
.IX Header "FUNCTIONS"
Text::Unidecode provides one function, \f(CW\*(C`unidecode(...)\*(C'\fR, which
is exported by default.  It can be used in a variety of calling contexts:
.ie n .IP """$out = unidecode($in);"" # scalar context" 4
.el .IP "\f(CW$out = unidecode($in);\fR # scalar context" 4
.IX Item "$out = unidecode($in); # scalar context"
This returns a copy of \f(CW$in\fR, transliterated.
.ie n .IP """$out = unidecode(@in);"" # scalar context" 4
.el .IP "\f(CW$out = unidecode(@in);\fR # scalar context" 4
.IX Item "$out = unidecode(@in); # scalar context"
This is the same as \f(CW\*(C`$out = unidecode(join '', @in);\*(C'\fR.
.ie n .IP """@out = unidecode(@in);"" # list context" 4
.el .IP "\f(CW@out = unidecode(@in);\fR # list context" 4
.IX Item "@out = unidecode(@in); # list context"
This returns a list consisting of copies of \f(CW@in\fR, each transliterated.  This
is the same as \f(CW\*(C`@out = map scalar(unidecode($_)), @in;\*(C'\fR.
.ie n .IP """unidecode(@items);"" # void context" 4
.el .IP "\f(CWunidecode(@items);\fR # void context" 4
.IX Item "unidecode(@items); # void context"
.PD 0
.ie n .IP """unidecode(@bar, $foo, @baz);"" # void context" 4
.el .IP "\f(CWunidecode(@bar, $foo, @baz);\fR # void context" 4
.IX Item "unidecode(@bar, $foo, @baz); # void context"
.PD
Each item on input is replaced with its transliteration.  This
is the same as \f(CW\*(C`for(@bar, $foo, @baz) { $_ = unidecode($_) }\*(C'\fR.
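.PP
The calling contexts listed above can be tried side by side with a short
script such as the following sketch.  The variable names are illustrative
only, and the input is built with \f(CW\*(C`chr\*(C'\fR so that the script itself
stays plain US-ASCII.
.PP
.Vb 26
```perl
use strict;
use warnings;
use feature 'say';
use Text::Unidecode;

# The two Han characters for Beijing, as in the synopsis.
my $beijing = chr(0x5317) . chr(0x4EB0);
my @in = ($beijing, 'plain ASCII');

# Scalar context, one argument: a transliterated copy.
my $copy = unidecode($beijing);

# Scalar context, several arguments: joined, then transliterated.
my $joined = unidecode(@in);

# List context: one transliterated copy per input item.
my @out = unidecode(@in);

# Void context: each argument is replaced in place.
my @items = @in;
unidecode(@items);

say for $copy, $joined, @out, @items;
```
.Ve
.PP
After the void-context call, \f(CW@items\fR holds the same strings as
\&\f(CW@out\fR, while \f(CW@in\fR is untouched because only the copy was passed.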
.PP
You should make a minimum of assumptions about the output of
\&\f(CW\*(C`unidecode(...)\*(C'\fR.  For example, if you assume an all-alphabetic
(Unicode) string passed to \f(CW\*(C`unidecode(...)\*(C'\fR will return an all-alphabetic
string, you're wrong \*(-- some alphabetic Unicode characters are
transliterated as strings containing punctuation (e.g., the
Armenian letter at 0x0539 currently transliterates as \f(CW\*(C`T`\*(C'\fR).
.PP
However, these are the assumptions you \fIcan\fR make:
.IP "\(bu" 4
Each character 0x0000 \- 0x007F transliterates as itself.  That is,
\&\f(CW\*(C`unidecode(...)\*(C'\fR is 7\-bit pure.
.IP "\(bu" 4
The output of \f(CW\*(C`unidecode(...)\*(C'\fR always consists entirely of US-ASCII
characters \*(-- i.e., characters 0x0000 \- 0x007F.
.IP "\(bu" 4
All Unicode characters translate to a sequence of (any number of)
characters that are newline (\*(L"\en\*(R") or in the range 0x0020\-0x007E.  That
is, no Unicode character translates to \*(L"\ex01\*(R", for example.  (Although if
you have a \*(L"\ex01\*(R" on input, you'll get a \*(L"\ex01\*(R" in output.)
.IP "\(bu" 4
Yes, some transliterations produce a \*(L"\en\*(R" \*(-- but just a few, and only
with good reason.  Note that the value of newline (\*(L"\en\*(R") varies
from platform to platform \*(-- see \*(L"perlport\*(R" in perlport.
.IP "\(bu" 4
Some Unicode characters may transliterate to nothing (i.e., empty string).
.IP "\(bu" 4
Very many Unicode characters transliterate to multi-character sequences.
E.g., Han character 0x5317 transliterates as the four-character string
\&\*(L"Bei \*(R".
.IP "\(bu" 4
Within these constraints, I may change the transliteration of characters
in future versions.  For example, if someone convinces me that
the Armenian letter at 0x0539, currently transliterated as \*(L"T`\*(R", would
be better transliterated as \*(L"D\*(R", I may well make that change.
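.PP
The guarantees above lend themselves to a quick spot check.  The sketch below
is only that: \f(CW\*(C`is_plain_output\*(C'\fR is a hypothetical helper, the sample
code points are arbitrary picks, and newline is assumed to be character 10,
which may not hold on every platform.
.PP
.Vb 29
```perl
use strict;
use warnings;
use feature 'say';
use Text::Unidecode;

# True if every character is newline or in the range 0x20-0x7E.
# Newline is taken to be chr(10) here, which is an assumption.
sub is_plain_output {
    my ($text) = @_;
    for my $code (map { ord($_) } split //, $text) {
        next if $code == 10;
        return 0 if $code < 0x20 || $code > 0x7E;
    }
    return 1;
}

# Characters 0x00-0x7F should come back unchanged.
my $ascii = join '', map { chr($_) } 0x00 .. 0x7F;
say 'ASCII unchanged: ', (unidecode($ascii) eq $ascii ? 'yes' : 'no');

# Arbitrary sample code points; this is a spot check, not a proof.
for my $cp (0x00E9, 0x0416, 0x0539, 0x5317) {
    my $out = unidecode(chr $cp);
    say sprintf 'U+%04X -> %d character(s), plain output: %s',
        $cp, length($out), is_plain_output($out) ? 'yes' : 'no';
}
```
.Ve
.PP
The lengths it reports may differ between versions of the module, since the
transliterations themselves are allowed to change within these constraints.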
.SH "DESIGN GOALS AND CONSTRAINTS"
.IX Header "DESIGN GOALS AND CONSTRAINTS"
Text::Unidecode is meant to be a transliterator of last resort,
to be used once you've decided that you can't just display the
Unicode data as is, and once you've decided you don't have a
more clever, language-specific transliterator available.  It
transliterates context-insensitively \*(-- that is, a given character is
replaced with the same US-ASCII (7\-bit \s-1ASCII\s0) character or characters,
no matter what the surrounding characters are.
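.PP
To make "context-insensitive" concrete, here is a toy transliterator in
plain Perl.  It is not Text::Unidecode's implementation: the three-entry
\&\f(CW%table\fR, the \f(CW\*(C`toy_translit\*(C'\fR name, and the "?" placeholder for
unknown characters are all assumptions made for this sketch.  The point is
only that each character is looked up on its own, without regard to its
neighbors.
.PP
.Vb 25
```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical three-entry table, keyed by code point.
my %table = (
    0x00E9 => 'e',     # LATIN SMALL LETTER E WITH ACUTE
    0x00DF => 'ss',    # LATIN SMALL LETTER SHARP S
    0x0416 => 'Zh',    # CYRILLIC CAPITAL LETTER ZHE
);

sub toy_translit {
    my ($text) = @_;
    return join '', map {
        my $code = ord $_;
        $code < 0x80         ? $_              # ASCII passes through
      : exists $table{$code} ? $table{$code}   # fixed replacement
      :                        '?';            # not in the toy table
    } split //, $text;
}

say toy_translit('caf' . chr(0x00E9));   # prints "cafe"
```
.Ve
.PP
A given character always yields the same output here, whether it starts a
word, ends one, or stands alone, which is exactly the limitation described
in this section.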
.PP
The main reason I'm making Text::Unidecode work with only
context-insensitive substitution is that it's fast, dumb, and
straightforward enough to be feasible.  It doesn't tax my
(quite limited) knowledge of world languages.  It doesn't require
me to write a hundred lines of code to get the Thai syllabification
right (and never knowing whether I've gotten it wrong, because I
don't know Thai), or spending a year trying to get Text::Unidecode
to use the ChaSen algorithm for Japanese, or trying to write heuristics
for telling the difference between Japanese, Chinese, or Korean, so
it knows how to transliterate any given Uni-Han glyph.  And
moreover, context-insensitive substitution is still useful most of the time,
yet clearly couldn't be mistaken for authoritative.
.PP
Text::Unidecode is an example of the 80/20 rule in
action \*(-- you get 80% of the usefulness using just 20% of a
\&\*(L"real\*(R" solution.
.PP
A \*(L"real\*(R" approach to transliteration for any given language can
involve such increasingly tricky contextual factors as these:
.IP "The preceding / following character(s)" 4
.IX Item "The preceding / following character(s)"
What a given symbol \*(L"X\*(R" means could
depend on whether it's preceded or followed by a consonant, by a vowel, or
by some diacritic character.
.IP "Syllables" 4
.IX Item "Syllables"
A character \*(L"X\*(R" at end of a syllable could mean something
different from when it's at the start \*(-- which is especially problematic
when the language involved doesn't explicitly mark where one syllable
stops and the next starts.
.IP "Parts of speech" 4
.IX Item "Parts of speech"
What \*(L"X\*(R" sounds like at the end of a word
depends on whether that word is a noun, or a verb, or what.
.IP "Meaning" 4
.IX Item "Meaning"
By semantic context, you can tell that this ideogram \*(L"X\*(R" means \*(L"shoe\*(R"
(pronounced one way) and not \*(L"time\*(R" (pronounced another),
and that's how you know to transliterate it one way instead of the other.
.IP "Origin of the word" 4
.IX Item "Origin of the word"
\&\*(L"X\*(R" means one thing in loanwords and/or placenames (and
derivatives thereof), and another in native words.
.ie n .IP """It's just that way""" 4
.el .IP "``It's just that way''" 4
.IX Item "It's just that way"
\&\*(L"X\*(R" normally makes
the /X/ sound, except for this list of seventy exceptions (and words based
on them, sometimes indirectly).  Or: you never can tell which of the three
ways to pronounce \*(L"X\*(R" this word actually uses; you just have to know
which it is, so keep a dictionary on hand!
.IP "Language" 4
.IX Item "Language"
The character \*(L"X\*(R" is actually used in several different languages, and you
have to figure out which you're looking at before you can determine how
to transliterate it.
.PP
Out of a desire to avoid being mired in \fIany\fR of these kinds of
contextual factors, I chose to exclude \fIall of them\fR and just stick
with context-insensitive replacement.
.SH "TODO"
.IX Header "TODO"
Things that need tending to are detailed in the \s-1TODO\s0.txt file, included
in this distribution.  Normal installs probably don't leave the \s-1TODO\s0.txt
lying around, but if nothing else, you can see it at
http://search.cpan.org/search?dist=Text::Unidecode
.SH "MOTTO"
.IX Header "MOTTO"
The Text::Unidecode motto is:
.PP
.Vb 1
\&  It\(aqs better than nothing!
.Ve
.PP
\&...in both meanings: 1) seeing the output of \f(CW\*(C`unidecode(...)\*(C'\fR is
better than just having all font-unavailable Unicode characters
replaced with \*(L"?\*(R"'s, or rendered as gibberish; and 2) it's the
worst, i.e., there's nothing that Text::Unidecode's algorithm is
better than.
.SH "CAVEATS"
.IX Header "CAVEATS"
If you get really implausible nonsense out of \f(CW\*(C`unidecode(...)\*(C'\fR, make
sure that the input data really is a utf8 string (that is, decoded
characters rather than still-encoded bytes).  See
\&\*(L"perlunicode\*(R" in perlunicode.
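.PP
One common cause of such nonsense is passing raw encoded bytes instead of
decoded characters.  The sketch below assumes the data arrives as UTF-8
bytes and uses the core Encode module to decode it first; the byte values
are simply the UTF-8 encoding of the Han character 0x5317 from the synopsis.
.PP
.Vb 18
```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(decode);
use Text::Unidecode;

# Raw UTF-8 bytes for U+5317, as they might arrive from a file or socket.
my $bytes = pack 'C*', 0xE5, 0x8C, 0x97;

# Decode the bytes into a Perl character string before transliterating.
my $chars = decode('UTF-8', $bytes);

say length $bytes;        # prints 3 (bytes)
say length $chars;        # prints 1 (character)
say unidecode($chars);    # transliteration of the single Han character
```
.Ve
.PP
Passing \f(CW$bytes\fR directly would make each byte look like a separate
character, and the result would bear no relation to the intended text.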
.SH "THANKS"
.IX Header "THANKS"
Thanks to Harald Tveit Alvestrand,
Abhijit Menon\-Sen, and Mark-Jason Dominus.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
Unicode Consortium: http://www.unicode.org/
.PP
Geoffrey Sampson.  1990.  \fIWriting Systems: A Linguistic Introduction.\fR
\&\s-1ISBN:\s0 0804717567
.PP
Randall K. Barry (editor).  1997.  \fIALA-LC Romanization Tables:
Transliteration Schemes for Non-Roman Scripts.\fR
\&\s-1ISBN:\s0 0844409405
[\s-1ALA\s0 is the American Library Association; \s-1LC\s0 is the Library of
Congress.]
.PP
Rupert Snell.  2000.  \fIBeginner's Hindi Script (Teach Yourself
Books).\fR  \s-1ISBN:\s0 0658009109
.SH "COPYRIGHT AND DISCLAIMERS"
.IX Header "COPYRIGHT AND DISCLAIMERS"
Copyright (c) 2001 Sean M. Burke. All rights reserved.
.PP
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
.PP
This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.
.PP
Much of Text::Unidecode's internal data is based on data from The
Unicode Consortium, with which I am unaffiliated.
.SH "AUTHOR"
.IX Header "AUTHOR"
Sean M. Burke \f(CW\*(C`sburke@cpan.org\*(C'\fR