NAME
XMLTV - Perl extension to read and write TV listings in XMLTV format
SYNOPSIS
use XMLTV;
my $data = XMLTV::parsefiles('tv.xml');
my ($encoding, $credits, $ch, $progs) = @$data;
my $langs = [ 'en', 'fr' ];
print 'source of listings is: ', $credits->{'source-info-name'}, "\n"
if defined $credits->{'source-info-name'};
foreach (values %$ch) {
my ($text, $lang) = @{XMLTV::best_name($langs, $_->{'display-name'})};
print "channel $_->{id} has name $text\n";
print "...in language $lang\n" if defined $lang;
}
foreach (@$progs) {
print "programme on channel $_->{channel} at time $_->{start}\n";
next if not defined $_->{desc};
foreach (@{$_->{desc}}) {
my ($text, $lang) = @$_;
print "has description $text\n";
print "...in language $lang\n" if defined $lang;
}
}
The value of $data will be something a bit like:
[ 'UTF-8',
{ 'source-info-name' => 'Ananova', 'generator-info-name' => 'XMLTV' },
{ 'radio-4.bbc.co.uk' => { 'display-name' => [ [ 'BBC Radio 4', 'en' ],
[ 'Radio 4', 'en' ],
[ '4', undef ] ],
'id' => 'radio-4.bbc.co.uk' },
... },
[ { start => '200111121800', title => [ [ 'Simpsons', 'en' ] ],
channel => 'radio-4.bbc.co.uk' },
... ] ]
DESCRIPTION
This module provides an interface to read and write files in XMLTV format (a TV
listings format defined by xmltv.dtd). In general element names in the XML
correspond to hash keys in the Perl data structure. You can think of this
module as a bit like XML::Simple, but specialized to the XMLTV file format.
The Perl data structure corresponding to an XMLTV file has four elements. The
first gives the character encoding used for text data, typically UTF-8 or
ISO-8859-1. (The encoding value could also be undef meaning 'unknown', when
the library can't work out what it is.) The second element gives the
attributes of the root <tv> element, which give information about the
source of the TV listings. The third element holds the channels: a hash keyed
by channel id, each value being a hash corresponding to one <channel> element.
The fourth element is a list of programmes, each a hash corresponding to one
<programme> element. More details about the data
structure are given later. The easiest way to find out what it looks like is
to load some small XMLTV files and use Data::Dumper to print out the
resulting structure.
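As a minimal sketch of that suggestion, the following parses a small inline document with parse() and dumps the result. The document, channel id and title are invented for illustration, and the encoding returned depends on your installation, so no particular output is shown.

```perl
use strict;
use warnings;
use XMLTV;
use Data::Dumper;

# A tiny, made-up XMLTV document (ids and titles are illustrative only).
my $xml = <<'END_XML';
<?xml version="1.0" encoding="UTF-8"?>
<tv generator-info-name="example">
  <channel id="test-channel">
    <display-name lang="en">Test</display-name>
  </channel>
  <programme start="200203161500" channel="test-channel">
    <title lang="en">News</title>
  </programme>
</tv>
END_XML

# parse() returns a listref with four elements.
my $data = XMLTV::parse($xml);
my ($encoding, $credits, $channels, $programmes) = @$data;

# Sorted keys and light indentation make the dump easier to read.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;
print Dumper($data);
```

Running this against one of your own listings files (via parsefiles()) is a quick way to see which keys your data source actually fills in.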
USAGE
- parse(document)
- Takes an XMLTV document (a string) and returns the Perl data structure. It
is assumed that the document is valid XMLTV; if not the routine may
die() with an error (although the current implementation just warns
and continues for most small errors).
The first element of the listref returned, the encoding, may vary according
to the encoding of the input document, the versions of perl and
"XML::Parser" installed, the configuration of the XMLTV library
and other factors including, but not limited to, the phase of the moon.
With luck it should always be either the encoding of the input file or
UTF-8.
Attributes and elements in the XML file whose names begin with 'x-' are
skipped silently. You can use these to include information which is not
currently handled by the XMLTV format, or by this module.
- parsefiles(filename...)
- Like "parse()" but takes one or more filenames instead of a
string document. The data returned is the merging of those file contents:
the programmes will be concatenated in their original order, the channels
just put together in arbitrary order (ordering of channels should not
matter).
Each file must have the same character encoding; if not, an
exception is thrown. Ideally the credits information would also be the
same between all the files, since there is no obvious way to merge it -
but if the credits information differs from one file to the next, one file
is picked arbitrarily to provide credits and a warning is printed. If two
files give differing channel definitions for the same XMLTV channel id,
then one is picked arbitrarily and a warning is printed.
In the simple case, with just one file, you needn't worry about mismatching
of encodings, credits or channels.
The deprecated function "parsefile()" is a wrapper allowing just
one filename.
- parse_callback(document, encoding_callback, credits_callback,
channel_callback, programme_callback)
- An alternative interface. Whereas "parse()" reads the whole
document and then returns a finished data structure, with this routine you
specify a subroutine to be called as each <channel> element is read
and another for each <programme> element.
The first argument is the document to parse. The remaining arguments are
code references, one for each part of the document.
The callback for encoding will be called once with a string giving the
encoding. In present releases of this module, it is also possible for the
value to be undefined meaning 'unknown', but it's hoped that future
releases will always be able to figure out the encoding used.
The callback for credits will be called once with a hash reference. For
channels and programmes, the appropriate function will be called zero or
more times depending on how many channels / programmes are found in the
file.
The four subroutines will be called in order, that is, the encoding and
credits will be done before the channel handler is called and all the
channels will be dealt with before the first programme handler is called.
If any of the code references is undef, nothing is called for that part of
the file.
For backwards compatibility, if the value for 'encoding callback' is not a
code reference but a scalar reference, then the encoding found will be
stored in that scalar. Similarly if the 'credits callback' is a scalar
reference, the scalar it points to will be set to point to the hash of
credits. This style of interface is deprecated: new code should just use
four callbacks.
For example:
my $document = '<tv>...</tv>';
my $encoding;
sub encoding_cb( $ ) { $encoding = shift }
my $credits;
sub credits_cb( $ ) { $credits = shift }
# The callback for each channel populates this hash.
my %channels;
sub channel_cb( $ ) {
my $c = shift;
$channels{$c->{id}} = $c;
}
# The callback for each programme. We know that channels are
# always read before programmes, so the %channels hash will be
# fully populated.
#
sub programme_cb( $ ) {
my $p = shift;
print "got programme: $p->{title}->[0]->[0]\n";
my $c = $channels{$p->{channel}};
print 'channel name is: ', $c->{'display-name'}->[0]->[0], "\n";
}
# Let's go.
XMLTV::parse_callback($document, \&encoding_cb, \&credits_cb,
\&channel_cb, \&programme_cb);
- parsefiles_callback(encoding_callback, credits_callback, channel_callback,
programme_callback, filenames...)
- As "parse_callback()" but takes one or more filenames to open,
merging their contents in the same manner as "parsefiles()".
Note that the reading is still gradual - you get the channels and
programmes one at a time, as they are read.
Note that the same <channel> may be present in more than one file, so
the channel callback will get called more than once. It's your
responsibility to weed out duplicate channel elements (since writing them
out again requires that each have a unique id).
For compatibility, there is an alias "parsefile_callback()" which
is the same but takes only a single filename, before the callback
arguments. This is deprecated.
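One way to weed out duplicates is to have the channel callback remember which ids it has already seen. This is only a sketch: the callback is invoked by hand on invented channel hashes so that it runs without any input files, and keeping the first definition is an arbitrary choice. In real use you would pass \&channel_cb as the channel callback.

```perl
use strict;
use warnings;

my %channels;   # first definition seen for each channel id
my @duplicates; # ids that turned up again later

# Channel callback: keep the first definition, note any repeats.
sub channel_cb {
    my $c = shift;
    if (exists $channels{ $c->{id} }) {
        push @duplicates, $c->{id};
        return;
    }
    $channels{ $c->{id} } = $c;
}

# Simulate the callback firing for channels read from two files.
channel_cb($_) for (
    { id => 'radio-4.bbc.co.uk', 'display-name' => [ [ 'BBC Radio 4', 'en' ] ] },
    { id => 'test-channel',      'display-name' => [ [ 'Test', 'en' ] ] },
    { id => 'radio-4.bbc.co.uk', 'display-name' => [ [ 'Radio 4', 'en' ] ] },
);

print "kept: ", join(', ', sort keys %channels), "\n";
print "duplicates skipped: @duplicates\n";
```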
- write_data(data, options...)
- Takes a data structure and writes it as XML to standard output. Any extra
arguments are passed on to XML::Writer's constructor, for example
my $f = new IO::File '>out.xml'; die if not $f;
write_data($data, OUTPUT => $f);
The encoding used for the output is given by the first element of the data.
Normally, there will be a warning for any Perl data which is not understood
and cannot be written as XMLTV, such as strange keys in hashes. But as an
exception, any hash key beginning with an underscore will be skipped over
silently. You can store 'internal use only' data this way.
If a programme or channel hash contains a key beginning with 'debug', this
key and its value will be written out as a comment inside the
<programme> or <channel> element. This lets you include small
debugging messages in the XML output.
- best_name(languages, pairs [, comparator])
- The XMLTV format contains many places where human-readable text is given
an optional 'lang' attribute, to allow mixed languages. This is
represented in Perl as a pair [ text, lang ], although the second element
may be missing or undef if the language is unknown. When several
alternatives for an element (such as <title>) can be given, the
representation is a list of [ text, lang ] pairs. Given such a list, what
is the best text to use? It depends on the user's preferred language.
This function takes a list of acceptable languages and a list of [string,
language] pairs, and finds the best one to use. This means first finding
the appropriate language and then picking the 'best' string in that
language.
The best is normally defined as the first one found in a usable language,
since the XMLTV format puts the most canonical versions first. But you can
pass in your own comparison function, for example if you want to choose
the shortest piece of text that is in an acceptable language.
The acceptable languages should be a reference to a list of language codes
looking like 'ru', or like 'de_DE'. The text pairs should be a reference
to a list of pairs [ string, language ]. (As a special case if this list
is empty or undef, that means no text is present, and the result is
undef.) The third argument if present should be a cmp-style function that
compares two strings of text and returns 1 if the first argument is
better, -1 if the second better, 0 if they're equally good.
Returns: [s, l] pair, where s is the best of the strings to use and l is its
language. This pair is 'live' - it is one of those from the list passed
in. So you can use "best_name()" to find the best pair from a
list and then modify the content of that pair.
(This routine depends on the "Lingua::Preferred" module being
installed; if that module is missing then the first available language is
always chosen.)
Example:
my $langs = [ 'de', 'fr' ]; # German or French, please
# Say we found the following under $p->{title} for a programme $p.
my $pairs = [ [ 'La Cité des enfants perdus', 'fr' ],
[ 'The City of Lost Children', 'en_US' ] ];
my $best = best_name($langs, $pairs);
print "chose title $best->[0]\n";
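The comparator described above can be sketched on its own. The function name shorter_is_better is hypothetical; it follows the stated convention of returning 1 when the first text is better, -1 when the second is better and 0 when they are equally good, with "better" taken to mean "shorter".

```perl
use strict;
use warnings;

# cmp-style comparator: 1 if the first text is better (shorter),
# -1 if the second is better, 0 if they are equally good.
sub shorter_is_better {
    my ($first, $second) = @_;
    return length($second) <=> length($first);
}

print shorter_is_better('Radio 4', 'BBC Radio 4'), "\n";
print shorter_is_better('BBC Radio 4', 'Radio 4'), "\n";
print shorter_is_better('News', 'Film'), "\n";
```

It would be passed as the optional third argument, as in best_name($langs, $pairs, \&shorter_is_better).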
- list_channel_keys(), list_programme_keys()
- Some users of this module may wish to enquire at runtime about which keys
a programme or channel hash can contain. The data in the hash comes from
the attributes and subelements of the corresponding element in the XML.
The values of attributes are simply stored as strings, while subelements
are processed with a handler which may return a complex data structure.
These subroutines return a hash mapping key to handler name and
multiplicity. This lets you know what data types can be expected under
each key. For keys which come from attributes rather than subelements, the
handler is set to 'scalar', just as for subelements which give a simple
string. See "DATA STRUCTURE" for details on what the different
handler names mean.
It is not possible to find out which keys are mandatory and which optional,
only a list of all those which might possibly be present. An example use
of these routines is the tv_grep program, which creates its allowed
command line arguments from the names of programme subelements.
- catfiles(w_args, filename...)
- Concatenate several listings files, writing the output to somewhere
specified by "w_args". Programmes are catenated together,
channels are merged, for credits we just take the first and warn if the
others differ.
The first argument is a hash reference giving information to pass to
"XMLTV::Writer"'s constructor. But do not specify encoding, this
will be taken from the input files. Currently "catfiles()" will
fail if the input files have different encodings.
- cat(data, ...)
- Concatenate (and merge) listings data. Programmes are catenated together,
channels are merged, for credits we just take the first and warn if the
others differ (except that the 'date' of the result is the latest date of
all the inputs).
Whereas "catfiles()" reads and writes files, this function takes
already-parsed listings data and returns some more listings data. It is
much more memory-hungry.
- cat_noprogrammes
- Like "cat()" but ignores the programme data and just returns
encoding, credits and channels. This is useful when, for scalability
reasons, you want to handle programmes individually but still merge the
smaller data.
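A minimal sketch of merging already-parsed data with cat(), assuming two hand-built listings that share one channel. All ids, titles and times are invented; both inputs use the same encoding and credits so that nothing needs to be reconciled.

```perl
use strict;
use warnings;
use XMLTV;

# The same channel definition appears in both inputs.
my %channel = (id => 'test-channel', 'display-name' => [ [ 'Test', 'en' ] ]);

my $monday  = [ 'UTF-8', { 'generator-info-name' => 'example' },
                { 'test-channel' => { %channel } },
                [ { channel => 'test-channel', start => '200203181500',
                    title => [ [ 'News', 'en' ] ] } ] ];
my $tuesday = [ 'UTF-8', { 'generator-info-name' => 'example' },
                { 'test-channel' => { %channel } },
                [ { channel => 'test-channel', start => '200203191500',
                    title => [ [ 'Weather', 'en' ] ] } ] ];

# Programmes are concatenated; the shared channel is merged into one entry.
my $merged = XMLTV::cat($monday, $tuesday);
my ($encoding, $credits, $channels, $programmes) = @$merged;
printf "%d channel(s), %d programme(s)\n",
    scalar(keys %$channels), scalar(@$programmes);
```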
DATA STRUCTURE
For completeness, we describe more precisely how channels and programmes are
represented in Perl. Each channel is a hashref corresponding to one
<channel> element, and each programme is a hashref corresponding to one
<programme> element. The
possible keys of a channel (programme) hash are the names of attributes or
subelements of <channel> (<programme>).
The values for attributes are not processed in any way; an attribute
"fred="jim"" in the XML will become a hash element with
key 'fred', value 'jim'.
But for subelements, there is further processing needed to turn the XML content
of a subelement into Perl data. What is done depends on what type of data is
stored under that subelement. Also, if a certain element can appear several
times then the hash key for that element points to a list of values rather
than just one.
The conversion of a subelement's content to and from Perl data is done by a
handler. The most common handler is with-lang, used for human-readable
text content plus an optional 'lang' attribute. There are other handlers for
other data structures in the file format. Often two subelements will share the
same handler, since they hold the same type of data. The handlers defined are
as follows; note that many of them will silently strip leading and trailing
whitespace in element content. Look at the DTD itself for an explanation of
the whole file format.
Unless specified otherwise, it is not allowed for an element expected to contain
text to have empty content, nor for the text to contain newline characters.
- credits
- Turns a list of credits (for director, actor, writer, etc.) into a hash
mapping 'role' to a list of names. The names in each role are kept in the
same order.
- scalar
- Reads and writes a simple string as the content of the XML element.
- length
- Converts the content of a <length> element into a number of seconds
(so <length units="minutes">5</length> would be
returned as 300). On writing out again tries to convert a number of
seconds to a time in minutes or hours if that would look better.
- episode-num
- The representation in Perl of XMLTV's odd episode numbers is as a pair of
[ content, system ]. As specified by the DTD, if the system is not given
in the file then 'onscreen' is assumed. Whitespace in the 'xmltv_ns'
system is unimportant, so on reading it is normalized to a single space on
either side of each dot.
- video
- The <video> section is converted to a hash. The <present>
subelement corresponds to the key 'present' of this hash, 'yes' and 'no'
are converted to Booleans. The same applies to <colour>. The content
of the <aspect> subelement is stored under the key 'aspect'. These
keys can be missing in the hash just as the subelements can be missing in
the XML.
- audio
- This is similar to video. <present> is a Boolean value, while
the content of <stereo> is stored unchanged.
- previously-shown
- The 'start' and 'channel' attributes are converted to keys in a hash.
- presence
- The content of the element is ignored: it signifies something by its very
presence. So the conversion from XML to Perl is a constant true value
whenever the element is found; the conversion from Perl to XML is to write
out the element if true, don't write anything if false.
- subtitles
- The 'type' attribute and the 'language' subelement (both optional) become
keys in a hash. But see language for what to pass as the value of
that element.
- rating
- The rating is represented as a tuple of [ rating, system, icons ]. The
last element is itself a listref of structures returned by the icon
handler.
- star-rating
- In XML this is a string 'X/Y' plus a list of icons. In Perl represented as
a pair [ rating, icons ] similar to rating.
Multiple star ratings are now supported. For backward compatibility, you may
specify a single [rating,icon] or the preferred double array
[[rating,system,icon],[rating2,system2,icon2]] (like 'ratings').
- icon
- An icon in XMLTV files is like the <img> element in HTML. It is
represented in Perl as a hashref with 'src' and optionally 'width' and
'height' keys.
- with-lang
- In XML something like title can be either <title>Foo</title>
or <title lang="en">Foo</title>. In Perl these are
stored as [ 'Foo' ] and [ 'Foo', 'en' ]. For the former [ 'Foo', undef ]
would also be okay.
This handler also has two modifiers which may be added to the name after
'/'. /e means that empty text is allowed, and will be returned as
the empty tuple [], to mean that the element is present but has no text.
When writing with /e, undef will also be understood as
present-but-empty. You cannot however specify a language if the text is
empty.
The modifier /m means that the text is allowed to span multiple
lines.
So for example with-lang/em is a handler for text with language,
where the text may be empty and may contain newlines. Note that the
with-lang-or-empty of earlier releases has been replaced by
with-lang/e.
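To make the handler descriptions concrete, here is a hand-written programme hash showing how several of them would appear in Perl. Every value is invented for illustration, the Booleans are written as 1, and star-rating is left out; treat it as a sketch of the shapes described above and not as output captured from the module.

```perl
use strict;
use warnings;

# One programme hash as a reader might return it (all values invented).
my %prog = (
    channel => 'test-channel',        # attribute: stored as a plain string
    start   => '200203161500',        # attribute: stored as a plain string
    title   => [ [ 'News', 'en' ] ],  # with-lang: list of [ text, lang ] pairs
    desc    => [ [ "Headlines.\nThen weather.", 'en' ] ],  # with-lang/m
    credits => {                      # credits: role => list of names, in order
        director => [ 'A. Director' ],
        actor    => [ 'First Actor', 'Second Actor' ],
    },
    length  => 300,                   # length: seconds (5 minutes in the XML)
    'episode-num' => [ [ '2 . 9 . 0/1', 'xmltv_ns' ] ],    # [ content, system ]
    video   => { present => 1, colour => 1, aspect => '16:9' },
    audio   => { present => 1, stereo => 'stereo' },
    'previously-shown' => { start => '200203091500', channel => 'test-channel' },
    new     => 1,                     # presence: true because <new> was found
    rating  => [ [ 'PG', 'MPAA',      # [ rating, system, icons ]
                   [ { src => 'pg.png', width => 32, height => 32 } ] ] ],
);

printf "%s runs for %d minutes\n", $prog{title}[0][0], $prog{length} / 60;
print "directed by $prog{credits}{director}[0]\n";
print "rating icon: $prog{rating}[0][2][0]{src}\n";
```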
Now, which handlers are used for which subelements (keys) of channels and
programmes? And what is the multiplicity (should you expect a single value or
a list of values)?
The following tables map subelements of <channel> and of <programme>
to the handlers used to read and write them. Many elements have their own
handler with the same name, and most of the others use with-lang. The
third column specifies the multiplicity of the element: * (any number)
will give a list of values in Perl, + (one or more) will give a nonempty
list, ? (maybe one) will give a scalar, and 1 (exactly one) will give a
scalar which is not undef.
Handlers for <channel>
- display-name, with-lang, +
- icon, icon, *
- url, scalar, *
Handlers for <programme>
- title, with-lang, +
- sub-title, with-lang, *
- desc, with-lang/m, *
- credits, credits, ?
- date, scalar, ?
- category, with-lang, *
- language, with-lang, ?
- orig-language, with-lang, ?
- length, length, ?
- icon, icon, *
- url, scalar, *
- country, with-lang, *
- episode-num, episode-num, *
- video, video, ?
- audio, audio, ?
- previously-shown, previously-shown, ?
- premiere, with-lang/em, ?
- last-chance, with-lang/em, ?
- new, presence, ?
- subtitles, subtitles, *
- rating, rating, *
- star-rating, star-rating, *
At present, no parsing or validation on dates is done because dates may be
partially specified in XMLTV. For example '2001' means that the year is known
but not the month, day or time of day. Maybe in the future dates will be
automatically converted to and from Date::Manip objects. For now they
just use the scalar handler. Similar remarks apply to URLs.
WRITING
When reading a file you have the choice of using "parse()" to gulp the
whole file and return a data structure, or using "parse_callback()"
to get the programmes one at a time, although channels and other data are
still read all at once.
There is a similar choice when writing data: the "write_data()"
routine prints a whole XMLTV document at once, but if you want to write an
XMLTV document incrementally you can manually create an
"XMLTV::Writer" object and call methods on it. Synopsis:
use XMLTV;
my $w = new XMLTV::Writer();
$w->comment("Hello from XML::Writer's comment() method");
$w->start({ 'generator-info-name' => 'Example code in pod' });
my %ch = (id => 'test-channel', 'display-name' => [ [ 'Test', 'en' ] ]);
$w->write_channel(\%ch);
my %prog = (channel => 'test-channel', start => '200203161500',
title => [ [ 'News', 'en' ] ]);
$w->write_programme(\%prog);
$w->end();
XMLTV::Writer inherits from XML::Writer, and provides the following extra or
overridden methods:
- new(), the constructor
- Creates an XMLTV::Writer object and starts writing an XMLTV file, printing
the DOCTYPE line. Arguments are passed on to XML::Writer's constructor,
except for the following:
the 'encoding' key if present gives the XML character encoding. For example:
my $w = new XMLTV::Writer(encoding => 'ISO-8859-1');
If encoding is not specified, XML::Writer's default is used (currently
UTF-8).
XMLTV::Writer can also filter out specific days from the data. This is
useful if the datasource provides data for periods of time that do not
match the days that the user has asked for. The filtering is controlled
with the days, offset and cutoff arguments:
my $w = new XMLTV::Writer(
offset => 1,
days => 2,
cutoff => "050000" );
In this example, XMLTV::Writer will keep only entries whose start time is
at or after 05:00 tomorrow and before 05:00 two days after tomorrow; all
other entries are discarded. The time offset is stripped off the start time
before the comparison is made.
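The window described above can be sketched in plain Perl. This is not XMLTV::Writer's own code: the helper names window_bounds and in_window are hypothetical, and UTC is used purely to keep the sketch deterministic, whereas a real filter would work in the user's local time.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Compute the [from, to) bounds as YYYYMMDDhhmmss strings.
# $now is an epoch time; $cutoff is a six-digit hhmmss string.
sub window_bounds {
    my ($now, $offset, $days, $cutoff) = @_;
    my $day  = 24 * 60 * 60;
    my $from = strftime('%Y%m%d', gmtime($now + $offset * $day)) . $cutoff;
    my $to   = strftime('%Y%m%d', gmtime($now + ($offset + $days) * $day)) . $cutoff;
    return ($from, $to);
}

# Return 1 if a programme start time falls inside the window, else 0.
sub in_window {
    my ($start, $from, $to) = @_;
    (my $bare = $start) =~ s/\s*[+-]\d{4}\s*$//;  # strip a trailing "+0100"-style offset
    $bare = substr($bare . '0' x 14, 0, 14);      # pad short times to 14 digits
    return ($bare ge $from && $bare lt $to) ? 1 : 0;
}

my ($from, $to) = window_bounds(time, 1, 2, '050000');
print "keeping programmes from $from up to (not including) $to\n";
```

Because the timestamps are fixed-width digit strings, plain string comparison is enough once the offset has been removed.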
- start()
- Write the start of the <tv> element. Parameter is a hashref which
gives the attributes of this element.
- write_channels()
- Write several channels at once. Parameter is a reference to a hash mapping
channel id to channel details. They will be written sorted by id, which is
reasonable since the order of channels in an XMLTV file isn't
significant.
- write_channel()
- Write a single channel. You can call this routine if you want, but most of
the time "write_channels()" is a better interface.
- write_programme()
- Write details for a single programme as XML.
- end()
- Say you've finished writing programmes. This ends the <tv> element
and the file.
AUTHOR
Ed Avis, ed@membled.com
SEE ALSO
The file format is defined by the DTD xmltv.dtd, which is included in the xmltv
package along with this module. It should be installed in your system's
standard place for SGML and XML DTDs.
The xmltv package has a web page at <http://xmltv.org/> which carries
information about the file format and the various tools and apps which are
distributed with this module.