NAME¶
Locale::Po4a::Xml - convert XML documents and derivatives from/to PO files
DESCRIPTION¶
The goal of the po4a (PO for anything) project is to ease translations (and,
more interestingly, the maintenance of translations) using gettext tools in
areas where they were not expected, such as documentation.
Locale::Po4a::Xml is a module to help the translation of XML documents into
other [human] languages. It can also be used as a base to build modules for
XML-based documents.
TRANSLATING WITH PO4A::XML¶
This module can be used directly to handle generic XML documents. By default
it extracts the content of all tags, and no attributes, since that is where
the text is written in most XML-based documents.
There are some options (described in the next section) that can customize this
behavior. If this does not fit your document format, you are encouraged to
write your own module derived from this one to describe your format's details.
See the section WRITING DERIVATE MODULES below for a description of the
process.
OPTIONS ACCEPTED BY THIS MODULE¶
The global debug option causes this module to show the excluded strings, in
order to see if it skips something important.
These are this module's particular options:
- nostrip
- Prevents the module from stripping the spaces around the extracted
strings.
- wrap
- Canonicalizes the strings to translate, treating whitespace as
insignificant, and wraps the translated document. This
option can be overridden by custom tag options. See the "tags"
option below.
- caseinsensitive
- Makes the search for tags and attributes case insensitive.
If it is defined, <BooK>laNG and <BOOK>Lang are both treated as
<book>lang.
- includeexternal
- When defined, external entities are included in the
generated (translated) document, and for the extraction of strings. If
it's not defined, you will have to translate external entities separately
as independent documents.
- ontagerror
- This option defines the behavior of the module when it
encounters invalid XML syntax (a closing tag which does not match the
last opening tag, or a tag attribute without a value). It can take the
following values:
- fail
- This is the default value. The module will exit with an
error.
- warn
- The module will continue, and will issue a warning.
- silent
- The module will continue without any warnings.
Be careful when using this option. It is generally recommended to fix the input
file.
- tagsonly
- Extracts only the specified tags in the "tags"
option. Otherwise, it will extract all the tags except the ones specified.
Note: This option is deprecated.
- doctype
- String to match against the first line of the
document's doctype (if defined). If it does not match, a warning indicates
that the document might be of the wrong type.
- addlang
- String indicating the path (e.g. <bbb><aaa>) of
a tag where a lang="..." attribute shall be added. The language
will be defined as the basename of the PO file without any .po
extension.
- tags
- Space-separated list of tags you want to translate or skip.
By default, the specified tags will be excluded, but if you use the
"tagsonly" option, the specified tags will be the only ones
included. The tags must be in the form <aaa>, but you can join some
(<bbb><aaa>) to say that the content of the tag <aaa>
will only be translated when it is inside a <bbb> tag.
You can also specify some tag options putting some characters in front of
the tag hierarchy. For example, you can put 'w' (wrap) or 'W' (don't wrap)
to override the default behavior specified by the global "wrap"
option.
Example: W<chapter><title>
Note: This option is deprecated. You should use the translated and
untranslated options instead.
- attributes
- Space-separated list of tag attributes you want to
translate. You can specify the attributes by their name (for example,
"lang"), but you can prefix it with a tag hierarchy to specify
that this attribute will only be translated when it is inside the specified
tag. For example: <bbb><aaa>lang specifies that the lang
attribute will only be translated if it is in an <aaa> tag that is
itself inside a <bbb> tag.
- foldattributes
- Do not translate attributes in inline tags. Instead,
replace all attributes of a tag by po4a-id=<id>.
This is useful when attributes shall not be translated, as this simplifies
the strings for translators, and avoids typos.
- customtag
- Space-separated list of tags which should not be treated as
tags. These tags are treated as inline, and do not need to be closed.
- break
- Space-separated list of tags which should break the
sequence. By default, all tags break the sequence.
The tags must be in the form <aaa>, but you can join some
(<bbb><aaa>), if a tag (<aaa>) should only be considered
when it is inside another tag (<bbb>).
- inline
- Space-separated list of tags which should be treated as
inline. By default, all tags break the sequence.
The tags must be in the form <aaa>, but you can join some
(<bbb><aaa>), if a tag (<aaa>) should only be considered
when it is inside another tag (<bbb>).
- placeholder
- Space-separated list of tags which should be treated as
placeholders. Placeholders do not break the sequence, but the content of
placeholders is translated separately.
The location of the placeholder in its block will be marked with a string
similar to:
<placeholder type=\"footnote\" id=\"0\"/>
The tags must be in the form <aaa>, but you can join some
(<bbb><aaa>), if a tag (<aaa>) should only be considered
when it is inside another tag (<bbb>).
- nodefault
- Space-separated list of tags that the module should not try
to set by default in any category.
- cpp
- Support C preprocessor directives. When this option is set,
po4a will consider preprocessor directives as paragraph separators. This
is important if the XML file must be preprocessed, because otherwise the
directives may be inserted in the middle of lines if po4a considers that
they belong to the current paragraph, and they will not be recognized by the
preprocessor. Note: the preprocessor directives must only appear between
tags (they must not break a tag).
- translated
- Space-separated list of tags you want to translate.
The tags must be in the form <aaa>, but you can join some
(<bbb><aaa>), if a tag (<aaa>) should only be considered
when it is inside another tag (<bbb>).
You can also specify some tag options by putting some characters in front of
the tag hierarchy. For example, you can put 'w' (wrap) or 'W' (don't wrap)
to override the default behavior specified by the global "wrap"
option.
Example: W<chapter><title>
- untranslated
- Space-separated list of tags you do not want to translate.
The tags must be in the form <aaa>, but you can join some
(<bbb><aaa>), if a tag (<aaa>) should only be considered
when it is inside another tag (<bbb>).
- defaulttranslateoption
- The default categories for tags that are not listed in any of the
translated, untranslated, break, inline, or placeholder options.
This is a set of letters:
- w
- Tags should be translated and content can be
re-wrapped.
- W
- Tags should be translated and content should not be
re-wrapped.
- i
- Tags should be translated inline.
- p
- Tags should be translated as placeholders.
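Several of the options above (tags, translated, untranslated, break, inline,
placeholder) take tag specifications that combine optional option letters with
a tag hierarchy, such as W<chapter><title>. As a rough illustration, the
following self-contained sketch splits such a specification into its parts.
The helper name parse_tag_spec is hypothetical and is not part of
Locale::Po4a::Xml; the module's own parsing may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Locale::Po4a::Xml): split a tag
# specification such as "W<chapter><title>" into its option letters
# and the list of tag names in the hierarchy.
sub parse_tag_spec {
    my ($spec) = @_;

    # Option letters come first, then one or more <tag> elements.
    my ($opts, $path) = $spec =~ /^([^<]*)((?:<[^<>]+>)+)$/
        or return;

    my @tags = $path =~ /<([^<>]+)>/g;
    return { options => $opts, path => $path, tags => \@tags };
}

for my $spec ('W<chapter><title>', '<p>', 'w<bbb><aaa>') {
    my $parsed = parse_tag_spec($spec);
    printf "%-20s options='%s' tags=%s\n",
        $spec, $parsed->{options}, join('/', @{ $parsed->{tags} });
}
```

A specification without any <tag> element is rejected (the helper returns
nothing), which mirrors the requirement that tags be given in the form <aaa>.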
WRITING DERIVATE MODULES¶
The simplest customization is to define which tags and attributes you want the
parser to translate. This should be done in the initialize function. First
call the main initialize to get the command-line options, and then append
your custom definitions to the options hash. If you want to handle new
options from the command line, define them before calling the main
initialize:
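sub initialize {
    my $self = shift;
    my %options = @_;
    # Declare new options before calling the main initialize.
    $self->{options}{'new_option'}='';
    $self->SUPER::initialize(%options);
    # Append the custom definitions to the options hash.
    $self->{options}{'_default_translated'}.=' <p> <head><title>';
    $self->{options}{'attributes'}.=' <p>lang id';
    $self->{options}{'_default_inline'}.=' <br>';
    $self->treat_options;
}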
You should use the _default_inline, _default_break, _default_placeholder,
_default_translated, _default_untranslated, and _default_attributes options
in derived modules. This allows users to override the default behavior defined
in your module with command-line options.
OVERRIDING THE found_string FUNCTION¶
Another simple step is to override the function "found_string", which
receives the extracted strings from the parser in order to translate them.
There you can control which strings you want to translate, and transform
them before or after the translation itself.
It receives the extracted text, a reference to where it was found, and a hash
that contains extra information to control which strings to translate, how to
translate them, and how to generate the comment.
The content of these options depends on the kind of string it is (specified in
an entry of this hash):
- type="tag"
- The found string is the content of a translatable tag. The
entry "tag_options" contains the option characters in front of
the tag hierarchy in the module "tags" option.
- type="attribute"
- Means that the found string is the value of a translatable
attribute. The entry "attribute" has the name of the
attribute.
It must return the text that will replace the original in the translated
document. Here's a basic example of this function:
sub found_string {
    my ($self,$text,$ref,$options)=@_;
    $text = $self->translate($text,$ref,"type ".$options->{'type'},
                             'wrap'=>$self->{options}{'wrap'});
    return $text;
}
There's another simple example in the new Dia module, which only filters some
strings.
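A filtering variant could look like the following sketch. It is only an
illustration under stated assumptions: the packages My::StubXml and My::Filter
are hypothetical, and the stub's translate method merely uppercases the text
so that the example runs without po4a installed. In a real module the base
class would be Locale::Po4a::Xml, and translate would be inherited from it.
The rule "leave strings without any letter untouched" is likewise an
assumption chosen for illustration.

```perl
use strict;
use warnings;

# Stand-in for Locale::Po4a::Xml so this sketch runs on its own.
# Its translate() just uppercases the text to make the effect visible.
package My::StubXml;
sub new { my ($class) = @_; return bless { options => { wrap => 0 } }, $class }
sub translate { my ($self, $text, $ref, $comment, %opts) = @_; return uc $text }

# Hypothetical derived module that filters strings in found_string.
package My::Filter;
our @ISA = ('My::StubXml');

sub found_string {
    my ($self, $text, $ref, $options) = @_;

    # Assumption for illustration: strings without any letter
    # (numbers, punctuation) are returned untouched.
    return $text unless $text =~ /\p{L}/;

    # Attribute values and tag content get different comments.
    my $comment = $options->{'type'} eq 'attribute'
        ? "attribute " . $options->{'attribute'}
        : "type " . $options->{'type'};

    return $self->translate($text, $ref, $comment,
                            'wrap' => $self->{options}{'wrap'});
}

package main;

my $parser = My::Filter->new;
print $parser->found_string('Hello', 'doc.xml:3', { type => 'tag' }), "\n";
print $parser->found_string('42',    'doc.xml:4', { type => 'tag' }), "\n";
```

Returning the original text for skipped strings keeps them in the translated
document while leaving them out of the translation step.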
MODIFYING TAG TYPES (TODO)¶
This is a more complex approach, but it enables (almost) total customization.
It is based on a list of hashes, each one defining a tag type's behavior. The
list should be sorted so that the most general tags come after the most
concrete ones (sorted first by the beginning and then by the end keys). To
define a tag type you have to make a hash with the following keys:
- beginning
- Specifies the beginning of the tag, after the
"<".
- end
- Specifies the end of the tag, before the
">".
- breaking
- It says if this is a breaking tag class. A non-breaking
(inline) tag is one that can be taken as part of the content of another
tag. It can take the values false (0), true (1) or undefined. If you leave
this undefined, you'll have to define the f_breaking function that will
say whether a concrete tag of this class is a breaking tag or not.
- f_breaking
- It's a function that will tell if the next tag is a
breaking one or not. It should be defined if the breaking option is
not.
- f_extract
- If you leave this key undefined, the generic extraction
function will have to extract the tag itself. Defining it is useful for tags
that can contain other tags or special structures, so that the main parser
does not get confused. This function receives a boolean that says whether the
tag should be removed from the input stream.
- f_translate
- This function receives the tag (in the
get_string_until() format) and returns the translated tag
(translated attributes or all needed transformations) as a single
string.
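To make the structure more concrete, here is a minimal sketch of such a list
and of a lookup over it. Everything in it is illustrative: the entries, the
handler subs, and the find_tag_type and is_breaking helpers are hypothetical
and only mirror the keys described above; the module's real list and lookup
logic may differ.

```perl
use strict;
use warnings;

# Hypothetical handlers; a real f_translate would translate attributes.
sub keep_as_is     { my (@tag) = @_; return join '', @tag }
sub is_block_level { my ($name) = @_; return $name =~ /^(?:p|div|title)$/ ? 1 : 0 }

# Most concrete entries first, most general entry last.
my @tag_types = (
    { beginning => '!--', end => '--', breaking => 0, f_translate => \&keep_as_is },
    { beginning => '?',   end => '?',  breaking => 1, f_translate => \&keep_as_is },
    # "breaking" is left undefined here, so f_breaking decides per tag.
    { beginning => '',    end => '',   f_breaking => \&is_block_level,
      f_translate => \&keep_as_is },
);

# Hypothetical lookup: return the index of the first entry whose
# beginning and end match the text found between "<" and ">".
sub find_tag_type {
    my ($inner) = @_;
    for my $i (0 .. $#tag_types) {
        my ($begin, $end) = @{ $tag_types[$i] }{qw(beginning end)};
        return $i if $inner =~ /^\Q$begin\E/ && $inner =~ /\Q$end\E$/;
    }
    return -1;
}

# Use the fixed "breaking" value when defined, else ask f_breaking.
sub is_breaking {
    my ($type, $name) = @_;
    return defined $type->{breaking}
        ? $type->{breaking}
        : $type->{f_breaking}->($name);
}

for my $inner ('!-- a comment --', '?xml version="1.0"?', 'p class="x"') {
    print "$inner => type ", find_tag_type($inner), "\n";
}
```

The ordering matters in this sketch: the generic entry with empty beginning
and end matches anything, so it has to come last.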
INTERNAL FUNCTIONS used to write derived parsers¶
- get_path()
- This function returns the path to the current tag from the
document's root, in the form <html><body><p>.
An additional array of tags (without brackets) can be passed as argument.
These path elements are added to the end of the current path.
- tag_type()
- This function returns the index from the tag_types list
that fits to the next tag in the input stream, or -1 if it's at the end of
the input file.
- extract_tag($$)
- This function returns the next tag from the input stream
without the beginning and end, in an array form, to maintain the
references from the input file. It has two parameters: the type of the tag
(as returned by tag_type) and a boolean, that indicates if it should be
removed from the input stream.
- get_tag_name(@)
- This function returns the name of the tag passed as an
argument, in the array form returned by extract_tag.
- breaking_tag()
- This function returns a boolean that says if the next tag
in the input stream is a breaking tag or not (inline tag). It leaves the
input stream intact.
- treat_tag()
- This function translates the next tag from the input
stream, using each tag type's custom translation functions.
- tag_in_list($@)
- This function returns a string value that says whether the first
argument (a tag hierarchy) matches any of the tags from the second
argument (a list of tags or tag hierarchies). If it does not match, it
returns 0. Otherwise, it returns the matched tag's options (the characters in
front of the tag) or 1 (if that tag has no options).
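The return values of tag_in_list can be illustrated with a standalone sketch.
The function name match_tag_in_list is hypothetical, and the matching rule
used here (a list entry matches when the current path ends with the entry's
hierarchy) is an assumption; it is not taken from the module's source.

```perl
use strict;
use warnings;

# Standalone sketch of the documented behavior; the name
# match_tag_in_list is hypothetical. Assumption: a list entry
# matches when the current path ends with the entry's hierarchy.
sub match_tag_in_list {
    my ($path, @list) = @_;
    for my $entry (@list) {
        # Separate the option characters from the tag hierarchy.
        my ($opts, $tags) = $entry =~ /^([^<]*)(<.*>)$/ or next;
        if ($path =~ /\Q$tags\E$/) {
            return length($opts) ? $opts : 1;
        }
    }
    return 0;
}

my @translated = ('W<chapter><title>', '<p>');
print match_tag_in_list('<book><chapter><title>', @translated), "\n";
print match_tag_in_list('<book><chapter><p>',     @translated), "\n";
print match_tag_in_list('<book><title>',          @translated), "\n";
```

Because a match always yields either the option characters or 1, and a miss
yields 0, the result can be used directly in a boolean test.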
WORKING WITH ATTRIBUTES¶
- treat_attributes(@)
- This function handles the translation of the tags'
attributes. It receives the tag without the beginning / end marks, and
then it finds the attributes, and it translates the translatable ones
(specified by the module option "attributes"). This returns a
plain string with the translated tag.
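The general idea can be sketched in a few lines. This is not the module's
implementation: translate_attributes and fake_translate are hypothetical
names, the tiny catalog stands in for a PO file, and the regular expression
only handles simple quoted attribute values.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the translation step.
my %catalog = ('A red car' => 'Une voiture rouge');
sub fake_translate { my ($text) = @_; return $catalog{$text} // $text }

# Rewrites the values of the listed attributes in a tag given without
# its "<" and ">" marks; all other text is copied unchanged.
sub translate_attributes {
    my ($tag, %translatable) = @_;
    $tag =~ s{([\w:-]+)(\s*=\s*)(["'])(.*?)\3}{
        my ($name, $eq, $quote, $value) = ($1, $2, $3, $4);
        $value = fake_translate($value) if $translatable{$name};
        "$name$eq$quote$value$quote";
    }gse;
    return $tag;
}

print translate_attributes('img src="car.png" alt="A red car"', alt => 1), "\n";
```

Only the attributes named in the second argument are passed to the
translation step, which corresponds to the role of the "attributes" option.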
WORKING WITH THE MODULE OPTIONS¶
- treat_options()
- This function fills the internal structures that contain
the tags, attributes and inline data with the options of the module
(specified in the command-line or in the initialize function).
GETTING TEXT FROM THE INPUT DOCUMENT¶
- get_string_until($%)
- This function returns an array with the lines (and
references) from the input document until it finds the first argument. The
second argument is an options hash. Value 0 means disabled (the default)
and 1, enabled.
The valid options are:
- include
- Makes the returned array contain the searched
text
- remove
- This removes the returned stream from the input
- unquoted
- This ensures that the searched text is outside any
quotes
- skip_spaces(\@)
- This function receives as argument the reference to a
paragraph (in the format returned by get_string_until), skips its leading
spaces and returns them as a simple string.
- join_lines(@)
- This function returns a simple string with the text from
the argument array (discarding the references).
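The paragraph format can be illustrated with a small standalone sketch. It
assumes that the array alternates text and reference entries (text, reference,
text, reference, ...). The subs join_text and skip_leading_spaces are
hypothetical plain-function equivalents of join_lines and skip_spaces, written
only for illustration.

```perl
use strict;
use warnings;

# Assumed layout: (text, reference, text, reference, ...).
my @paragraph = ("  \n",      'doc.xml:1',
                 "  Hello\n", 'doc.xml:2',
                 "world\n",   'doc.xml:3');

# Hypothetical equivalent of join_lines: keep the text, drop the references.
sub join_text {
    my (@lines) = @_;
    my $text = '';
    for (my $i = 0; $i < @lines; $i += 2) {
        $text .= $lines[$i];
    }
    return $text;
}

# Hypothetical equivalent of skip_spaces: remove leading whitespace from
# the paragraph (modifying it in place) and return what was removed.
sub skip_leading_spaces {
    my ($ref) = @_;
    my $spaces = '';
    while (@$ref) {
        $ref->[0] =~ s/^(\s+)// and $spaces .= $1;
        last if length $ref->[0];   # non-space text reached
        splice @$ref, 0, 2;         # drop the emptied line and its reference
    }
    return $spaces;
}

my $skipped = skip_leading_spaces(\@paragraph);
print "skipped ", length($skipped), " whitespace characters\n";
print join_text(@paragraph);
```

Keeping the references next to each line is what lets later steps report the
position of a string in the input file.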
STATUS OF THIS MODULE¶
This module can translate tags and attributes.
TODO LIST¶
DOCTYPE (ENTITIES)
There is minimal support for the translation of entities. They are translated
as a whole, and tags are not taken into account. Multiline entities are not
supported, and entities are always rewrapped during the translation.
MODIFY TAG TYPES FROM INHERITED MODULES (move the tag_types structure inside the
$self hash?)
SEE ALSO¶
Locale::Po4a::TransTractor(3pm),
po4a(7)
AUTHORS¶
Jordi Vilalta <jvprat@gmail.com>
Nicolas Francois <nicolas.francois@centraliens.net>
COPYRIGHT AND LICENSE¶
Copyright (c) 2004 by Jordi Vilalta <jvprat@gmail.com>
Copyright (c) 2008-2009 by Nicolas Francois <nicolas.francois@centraliens.net>
This program is free software; you may redistribute it and/or modify it under
the terms of GPL (see the COPYING file).