NAME
Encode::Arabic::Buckwalter - Tim Buckwalter's transliteration of Arabic
SYNOPSIS
use Encode::Arabic::Buckwalter; # imports just like 'use Encode' would, plus more
while ($line = <>) { # Tim Buckwalter's mapping into the Arabic script
print encode 'utf8', decode 'buckwalter', $line; # 'Buckwalter' alias 'Tim'
}
# shell filter of data, e.g. in *n*x systems instead of viewing the Arabic script proper
% perl -MEncode::Arabic::Buckwalter -pe '$_ = encode "buckwalter", decode "utf8", $_'
# employing the modes of conversion for filtering and trimming
Encode::Arabic::enmode 'buckwalter', 'nosukuun', '>&< xml';
Encode::Arabic::Buckwalter->demode(undef, undef, 'strip _');
$decode = "Aiqora>o h`*aA {l_n~a_S~a bi___{notibaAhK.";
$encode = encode 'buckwalter', decode 'buckwalter', $decode;
# $encode eq "AiqraO h`*aA Aln~aS~a biAntibaAhK."
DESCRIPTION
Tim Buckwalter's notation is a one-to-one transliteration of the Arabic script
for Modern Standard Arabic, using lower ASCII characters to encode the
graphemes of the original script. This system has been very popular in Natural
Language Processing; however, its applicability is limited by the numerous
non-alphabetic codes involved.
IMPLEMENTATION
The module takes care of the Encode::Encoding programming interface, while the
effective code is Tim Buckwalter's "tr"ick:
$encode =~ tr[\x{060C}\x{061B}\x{061F}\x{0621}-\x{063A}\x{0640}-\x{0652} # !! no break in true perl !!
\x{0670}\x{0671}\x{067E}\x{0686}\x{0698}\x{06A4}\x{06AF}\x{0660}-\x{0669}]
[,;?'|>&<}AbptvjHxd*rzs$SDTZEg_fqklmnhwYyFNKaui~o`{PJRVG0-9];
$decode =~ tr[,;?'|>&<}AbptvjHxd*rzs$SDTZEg_fqklmnhwYyFNKaui~o`{PJRVG0-9]
[\x{060C}\x{061B}\x{061F}\x{0621}-\x{063A}\x{0640}-\x{0652} # !! no break in true perl !!
\x{0670}\x{0671}\x{067E}\x{0686}\x{0698}\x{06A4}\x{06AF}\x{0660}-\x{0669}];
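Because the listing above is wrapped for display, a runnable form has to keep each character list unbroken. The following is a minimal sketch, not the module's actual source: the helper names to_buckwalter and from_buckwalter are hypothetical, and the character lists are copied from the statements above with the line breaks removed.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; they are not part of the module's API.
# Each character list stays on a single line, as the comments above require.

sub to_buckwalter {
    my ($text) = @_;    # a string of Perl characters, not UTF-8 bytes
    $text =~ tr[\x{060C}\x{061B}\x{061F}\x{0621}-\x{063A}\x{0640}-\x{0652}\x{0670}\x{0671}\x{067E}\x{0686}\x{0698}\x{06A4}\x{06AF}\x{0660}-\x{0669}]
               [,;?'|>&<}AbptvjHxd*rzs$SDTZEg_fqklmnhwYyFNKaui~o`{PJRVG0-9];
    return $text;
}

sub from_buckwalter {
    my ($text) = @_;    # a string in the Buckwalter notation
    $text =~ tr[,;?'|>&<}AbptvjHxd*rzs$SDTZEg_fqklmnhwYyFNKaui~o`{PJRVG0-9]
               [\x{060C}\x{061B}\x{061F}\x{0621}-\x{063A}\x{0640}-\x{0652}\x{0670}\x{0671}\x{067E}\x{0686}\x{0698}\x{06A4}\x{06AF}\x{0660}-\x{0669}];
    return $text;
}

# A four-letter word spelled kaf, teh, alef, beh.
my $arabic  = "\x{0643}\x{062A}\x{0627}\x{0628}";
my $ascii   = to_buckwalter($arabic);
my $restore = from_buckwalter($ascii);

print "$ascii\n";
print $restore eq $arabic ? "round trip ok\n" : "round trip failed\n";
```

The sketch works on decoded character strings only; turning bytes into characters and back is the part that the Encode::Encoding interface of the module handles.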
EXPORTS & MODES
If the first element in the list passed to "use" is ":xml", an alternative
mapping that suits the XML etiquette is used. This option only replaces the
reserved characters ">&<" with "OWI", so the notation stays one-to-one. No
XML parsing is involved, and any markup would get distorted if subjected to
"decode"!
$using_xml = eval q { use Encode::Arabic::Buckwalter ':xml'; decode 'buckwalter', 'OWI' };
$classical = eval q { use Encode::Arabic::Buckwalter; decode 'buckwalter', '>&<' };
# $classical eq $using_xml and $classical eq "\x{0623}\x{0624}\x{0625}"
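As a rough illustration of what the alternative mapping changes on the ASCII side, the sketch below swaps the three reserved characters with plain "tr". It is a stand-in under stated assumptions, not the module's code: the variable names are hypothetical, and it relies on "O", "W" and "I" being otherwise unused in the notation, which the character list under IMPLEMENTATION suggests.

```perl
use strict;
use warnings;

# Hypothetical variable names; only the ">&<" and "OWI" pairing comes from the text.
my $classical = '>&<';

(my $xml_safe = $classical) =~ tr/>&</OWI/;    # classical to XML-friendly
(my $restored = $xml_safe)  =~ tr/OWI/>&</;    # and back again

print join(' ', $classical, $xml_safe, $restored), "\n";
```

Since each replaced character has exactly one counterpart, the swap can be undone without loss, which is what keeps the notation one-to-one.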
The module exports as if "use Encode" also appeared in the package.
Any other "import" options are simply delegated to Encode, and the imports
are performed accordingly.
The conversion modes of this module allow overriding the setting of the
":xml" option, in addition to filtering out diacritical marks and stripping
off kashida. The modes and their aliases relate like this:
our %Encode::Arabic::Buckwalter::modemap = (
'default' => 0, 'undef' => 0,
'fullvocalize' => 0, 'full' => 0,
'nowasla' => 4,
'vocalize' => 3, 'nosukuun' => 3,
'novocalize' => 2, 'novowels' => 2, 'none' => 2,
'noshadda' => 1, 'noneplus' => 1,
);
- enmode ($obj, $mode, $xml, $kshd)
- demode ($obj, $mode, $xml, $kshd)
  These methods can be invoked directly or through the respective functions
  of Encode::Arabic. The meaning of the extra parameters follows from the
  examples of usage.
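The SYNOPSIS example gives a concrete idea of what the extra parameters do. The sketch below reproduces that one documented result in plain Perl. It is not the module's implementation, and labelling the steps as "strip kashida", "drop sukuun", "plain alif for wasla" and "XML-friendly characters" is an interpretation of that example rather than a specification of the modes.

```perl
use strict;
use warnings;

# Input and expected output are taken from the SYNOPSIS.
my $input    = 'Aiqora>o h`*aA {l_n~a_S~a bi___{notibaAhK.';
my $expected = 'AiqraO h`*aA Aln~aS~a biAntibaAhK.';

(my $output = $input) =~ tr/_//d;    # strip kashida "_"
$output =~ tr/o//d;                  # drop sukuun "o"
$output =~ tr/{/A/;                  # write wasla "{" as plain alif "A"
$output =~ tr/>&</OWI/;              # use the XML-friendly characters

print "$output\n";
print $output eq $expected ? "matches the SYNOPSIS\n" : "differs from the SYNOPSIS\n";
```

With the module itself, the same effect is requested through the enmode and demode calls shown in the SYNOPSIS instead of hand-written "tr" statements.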
SEE ALSO
Encode::Arabic, Encode, Encode::Encoding
Tim Buckwalter's Qamus <http://www.qamus.org/>
Buckwalter Arabic Morphological Analyzer
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49>
AUTHOR
Otakar Smrz "<otakar-smrz users.sf.net>",
<http://otakar-smrz.users.sf.net/>
COPYRIGHT AND LICENSE
Copyright (C) 2003-2012 Otakar Smrz
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.