NAME¶
EBook::Tools::Mobipocket - Palm::PDB handler for manipulating the Mobipocket
format.
SYNOPSIS¶
use EBook::Tools::Mobipocket qw(:all);
my $mobi = EBook::Tools::Mobipocket->new();
$mobi->Load('filename.prc');
print "Title: ",$mobi->{title},"\n";
print "Author: ",$mobi->{header}{exth}{author},"\n";
print "Language: ",$mobi->{header}{mobi}{language},"\n";
my $mobigen = find_mobigen();
system_mobigen('myfile.opf');
DEPENDENCIES¶
- •
- "Bit::Vector"
- •
- "Compress::Zlib"
- •
- "HTML::Tree"
- •
- "Image::Size"
- •
- "List::MoreUtils"
- •
- "P5-Palm"
- •
- "String::CRC32"
CONSTRUCTOR¶
"new()"¶
Instantiates a new EBook::Tools::Mobipocket object.
ACCESSOR METHODS¶
"drm()"¶
Returns 1 if the "drmoffset" header value is neither 0 nor 0xffffffff.
Returns undef if "drmoffset" is undefined. Returns 0 otherwise.
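The three-way return value described above can be sketched as a stand-alone helper. This is only an illustration of the stated rule; the sub name "drm_status" is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the rule documented for drm():
# undef when the header value is missing, 0 when it holds one of
# the two "no DRM" sentinels, and 1 for any other value.
sub drm_status {
    my ($drmoffset) = @_;
    return unless defined $drmoffset;    # undef in scalar context
    return 0 if $drmoffset == 0 || $drmoffset == 0xffffffff;
    return 1;
}
```

Callers that only care whether DRM is present can test the result for truth, while callers that need to distinguish "no DRM" from "header too short to tell" should test it with defined.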
"text()"¶
Returns the text of the file.
"write_images()"¶
Writes each image record to the disk.
Returns the number of images written.
"write_text($filename)"¶
Writes the book text to disk with the given filename. This filename must match
the filename given to "fix_html()" for the internal links to be consistent.
Croaks if $filename is not specified.
Returns 1 on success, or undef if there was no text to write.
"write_unknown_records()"¶
Writes each unidentified record to disk with a filename in the format of
'raw-record-####', where #### is the record number (not the record ID).
Returns the number of records written.
MODIFIER METHODS¶
These methods have two naming/capitalization schemes -- methods directly related
to the subclassing of Palm::PDB use its MethodName capitalization style. Any
other methods are lowercase_with_underscores for consistency with the rest of
EBook::Tools.
"Load($filename)"¶
Sets "$self->{filename}" and then loads and parses the file
specified by $filename, calling "ParseRecord(%record)" on every
record found.
If DictionaryHuffman compression is detected, text records will be left
untouched during the ParseRecord pass, and
"uncompress_dictionaryhuffman_records()" will be called after the
initial parsing pass is complete.
"ParseRecord(%record)"¶
Parses PDB records, updating the object attributes. This method is called
automatically on every database record during "Load()".
"ParseRecord0($data)"¶
Parses the header record and places the parsed values into the hashrefs
"$self->{header}{palm}", "$self->{header}{mobi}", and
"$self->{header}{exth}" by calling "parse_palmdoc_header()",
"parse_mobi_header()", and "parse_mobi_exth()" respectively.
"ParseRecordCDIC(\$data)"¶
Parses a CDIC record. Takes as a sole argument a reference to the data of the
record.
Record format
- •
- Offset 0: Record identifier
4 bytes, always 'CDIC'
- •
- Offset 4: Header length
4 bytes, big-endian long int, always = 16
- •
- Offset 8: Index count
4 bytes, big-endian long int, marks the number of big-endian short ints
immediately following the header used as index points into the dictionary
data
- •
- Offset 12: Codelength
4 bytes, big-endian long int, number of code bits
- •
- Offset 16: Indexes
A number of big-endian short ints used as index points into the dictionary
data
- •
- Offset ??: Dictionary data
Dictionary result strings immediately following the indexes
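The layout above maps directly onto an unpack template. The following is a minimal sketch under that layout only; the sub name "parse_cdic_sketch" and the keys of the returned hash are hypothetical, and the index values are returned uninterpreted because the text does not say what they are relative to.

```perl
use strict;
use warnings;

# Hypothetical sketch: split a CDIC record into its documented parts.
sub parse_cdic_sketch {
    my ($dataref) = @_;
    die "expected a scalar reference\n" unless ref($dataref) eq 'SCALAR';
    die "record too short\n" if length($$dataref) < 16;

    # Offsets 0-15: identifier, header length, index count, code length.
    my ($id, $hdrlen, $indexcount, $codelength) = unpack 'a4 N N N', $$dataref;
    die "not a CDIC record\n" unless $id eq 'CDIC';
    die "unexpected header length $hdrlen\n" unless $hdrlen == 16;

    my $dictstart = $hdrlen + 2 * $indexcount;
    die "index table truncated\n" if length($$dataref) < $dictstart;

    # Big-endian short ints immediately following the header.
    my @indexes = unpack "x$hdrlen n$indexcount", $$dataref;

    return {
        indexcount => $indexcount,
        codelength => $codelength,
        indexes    => \@indexes,
        dictionary => substr($$dataref, $dictstart),
    };
}
```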
"ParseRecordHUFF(\$data)"¶
Parses a HUFF record. Takes as a sole argument a reference to the data of the
record.
Record format
- •
- Offset 0: Record identifier
4 bytes, always 'HUFF'
- •
- Offset 4: Header length
4 bytes, big-endian long int, always = 24
- •
- Offset 8: Cache table (big-endian) offset
4 bytes, big-endian long int, always = 24
- •
- Offset 12: Base table (big-endian) offset
4 bytes, big-endian long int, always = 1048
- •
- Offset 16: Cache table (little-endian) offset
4 bytes, big-endian long int, always = 1304
- •
- Offset 20: Base table (little-endian) offset
4 bytes, big-endian long int, always = 2328
- •
- Offset 24: Cache table (big-endian)
1024 bytes, 256 big-endian long ints
This is a look up table for the length and decoding of short codewords. If
the codeword represented by the 8 bits is unique, then bit 7 (0x80) will
be set, and the low 5 bits are the length in bits of the code. The high
three bytes partially represent the final symbol.
If bit 7 is clear, then the code is looked up in the base table.
- •
- Offset 1048: Base table (big-endian)
256 bytes, 64 big-endian long ints
This is where the codeword is looked up if it isn't found in the cache
table.
- •
- Offset 1304: Cache table (little-endian)
1024 bytes, 256 little-endian long ints.
This contains exactly the same data as in the cache table at offset 24,
except that all of the values are stored in little-endian format instead
of big-endian.
Presumably this is for a speed advantage on slow little-endian processors.
This module uses only the big-endian tables.
- •
- Offset 2328: Base table (little-endian)
256 bytes, 64 little-endian long ints
This contains exactly the same data as in the base table at offset 1048,
except that all of the values are stored in little-endian format instead
of big-endian.
Presumably this is for a speed advantage on slow little-endian processors.
This module uses only the big-endian tables.
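As an illustration of the layout above, the big-endian tables can be pulled out with unpack and a cache entry split into the fields the text describes. This is a sketch only: the sub names and hash keys are hypothetical, and "high_bytes" is returned raw because the text says only that those bytes "partially represent" the final symbol.

```perl
use strict;
use warnings;

# Hypothetical sketch: extract the two big-endian tables of a HUFF record.
sub parse_huff_sketch {
    my ($dataref) = @_;
    die "expected a scalar reference\n" unless ref($dataref) eq 'SCALAR';
    die "record too short\n" if length($$dataref) < 24;

    # Identifier, header length, then the two big-endian table offsets.
    my ($id, $hdrlen, $cache_be, $base_be) = unpack 'a4 N3', $$dataref;
    die "not a HUFF record\n" unless $id eq 'HUFF';
    die "tables truncated\n"
        if length($$dataref) < $cache_be + 1024
        || length($$dataref) < $base_be + 256;

    my @cache = unpack "x$cache_be N256", $$dataref;   # 256 long ints
    my @base  = unpack "x$base_be N64",   $$dataref;   # 64 long ints
    return { cache => \@cache, base => \@base };
}

# Split one cache table entry as described in the record format.
sub decode_cache_entry {
    my ($entry) = @_;
    return {
        terminal   => ($entry & 0x80) ? 1 : 0,   # bit 7: codeword is unique
        codelength => $entry & 0x1f,             # low 5 bits: length in bits
        high_bytes => $entry >> 8,               # high three bytes, uninterpreted
    };
}
```

When "terminal" is 0, the text says the code must instead be looked up in the base table; that lookup is not sketched here because its details are not given.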
"ParseRecordImage(\$dataref)"¶
Parses image records, updating object attributes, most notably adding the image
data to the hash "$self->{imagedata}", adding the image filename
to "$self->{recindexlinks}", and incrementing
"$self->{recindex}".
Takes as an argument a reference to the record data. Croaks if it isn't
provided, or isn't a reference.
This is called automatically by "ParseRecord()" and "ParseResource()" as
needed.
"ParseRecordText(\$dataref)"¶
Parses text records, updating object attributes, most notably appending text to
"$self->{text}". Takes as an argument a reference to the record
data.
This is called automatically by "ParseRecord()" and "ParseResource()" as
needed.
"fix_html(%args)"¶
Takes raw Mobipocket text and replaces the custom tags and file position anchors
with standard HTML equivalents.
Arguments
- •
- "filename"
The name of the output HTML file (used in generating hrefs). The procedure
croaks if this is not supplied.
- •
- "nonewlines" (optional)
If this is set to true, the procedure will not attempt to insert newlines
for readability. This will leave the output in a single unreadable line,
but has the advantage of reducing the processing time, especially useful
if tidy is going to be run on the output anyway.
"fix_html_filepos()"¶
Takes the raw HTML text of the object and replaces the filepos anchors. This has
to be called before any other action that modifies the text, or the filepos
positions will not be valid.
Returns 1 if successful, undef if there was no text to fix.
This is called automatically by "fix_html()".
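The ordering constraint exists because filepos values are byte offsets into the unmodified text. The sketch below shows one way such a pass could work; it is not the module's implementation. It assumes links of the form <a filepos=NNN>, and the sub name and the generated "fileposNNN" ids are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch: turn filepos byte offsets into named anchors.
sub fix_filepos_sketch {
    my ($text, $filename) = @_;

    # Collect every referenced offset before the text is touched.
    my %seen;
    $seen{ $1 + 0 } = 1 while $text =~ /filepos="?(\d+)/g;

    # Insert anchors from the highest offset downwards, so that each
    # insertion leaves all lower offsets still pointing at the right byte.
    for my $pos (sort { $b <=> $a } keys %seen) {
        next if $pos > length($text);
        substr($text, $pos, 0, qq{<a id="filepos$pos"></a>});
    }

    # Only now rewrite the links themselves to point at the new anchors.
    $text =~ s/filepos="?0*(\d+)"?/'href="' . $filename . '#filepos' . ($1 + 0) . '"'/ge;
    return $text;
}
```

Any edit made before the offsets are collected (even inserting a newline) would shift the bytes and leave the anchors in the wrong place, which is why this step has to run first.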
"uncompress_dictionaryhuffman_records()"¶
Uncompresses all text records using "uncompress_dictionaryhuffman()". This
destroys the existing contents of "$self->{text}", if any.
This method is called automatically at the end of "Load()" if
DictionaryHuffman encoding is detected.
PROCEDURES¶
All procedures are exportable, but none are exported by default. All procedures
can be exported by using the ":all" tag.
"find_mobidedrm()"¶
Attempts to locate a copy of the MobiDeDrm script by searching PATH and looking
in the EBook::Tools user configuration directory (see "userconfigdir()" in
EBook::Tools).
Returns the complete path to the script, or undef if nothing was found.
This will use package variable $mobidedrm_cmd as its first guess, and set that
variable to the return value as well.
"find_mobigen()"¶
Attempts to locate the mobigen executable by making a test execution on
predicted locations (including just checking PATH) and looking in the
EBook::Tools user configuration directory (see "userconfigdir()" in
EBook::Tools).
Returns the system command used for a successful invocation, or undef if nothing
worked.
This will use package variable $mobigen_cmd as its first guess, and set that
variable to the return value as well.
"parse_mobi_exth($headerdata)"¶
Takes as an argument a scalar containing the variable-length Mobipocket EXTH
data from the first record. Returns an array of hashes, each hash containing
the data from one EXTH record with values from that data keyed to recognizable
names.
If $headerdata doesn't appear to be an EXTH header, carps a warning and returns
an empty list.
See:
http://wiki.mobileread.com/wiki/MOBI
Hash keys
- •
- "type"
A numeric value indicating the type of EXTH data in the record. See package
variable %exthtypes.
- •
- "length"
The length of the "data" value in bytes
- •
- "data"
The data of the record.
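A rough sketch of how such records could be walked follows. The byte layout used here is an assumption taken from the MobileRead wiki page linked above, not from this document: a 12-byte preamble ('EXTH', header length, record count), then records consisting of a 4-byte type, a 4-byte length that includes those 8 bytes, and the data. The sub name is hypothetical.

```perl
use strict;
use warnings;
use Carp qw(carp);

# Hypothetical sketch: split EXTH data into type/length/data hashes.
sub parse_exth_sketch {
    my ($headerdata) = @_;
    my ($id, $hdrlen, $count) = unpack 'a4 N N', $headerdata;
    unless (defined $id && $id eq 'EXTH' && defined $count) {
        carp "data does not look like an EXTH header";
        return;    # empty list
    }

    my @records;
    my $offset = 12;    # assumed size of the preamble
    for (1 .. $count) {
        last if $offset + 8 > length($headerdata);
        my ($type, $reclen) = unpack "x$offset N N", $headerdata;
        # Assumed: on-disk length counts the 8 bytes of type and length.
        last if $reclen < 8 || $offset + $reclen > length($headerdata);
        push @records, {
            type   => $type,
            length => $reclen - 8,
            data   => substr($headerdata, $offset + 8, $reclen - 8),
        };
        $offset += $reclen;
    }
    return @records;
}
```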
"parse_mobi_header($headerdata)"¶
Takes as an argument a scalar containing the variable-length Mobipocket-specific
header data from the first record. Returns a hash containing values from that
data keyed to recognizable names.
See:
http://wiki.mobileread.com/wiki/MOBI
keys
The returned hash will have the following keys (documented in the order in which
they are encountered in the header):
- "identifier"
- This should always be the string 'MOBI'. If it isn't, the procedure
croaks.
- "headerlength"
- This is the size of the complete header. If this value is different from
the length of the argument, the procedure croaks.
- "type"
- A numeric code indicating what category of Mobipocket file this is.
- "encoding"
- A numeric code representing the encoding. Expected values are '1252' (for
Windows-1252) and '65001' (for UTF-8).
The procedure carps a warning if an unexpected value is encountered.
- "uniqueid"
- This is thought to be a unique ID for the book, but its actual use is
unknown.
Use with caution. This key may be renamed in the future if more information
is found.
- "version"
- This is thought to be the Mobipocket format version. A second version code
shows up again later as "version2" which is usually the same on
unprotected books but different on DRMd books.
Use with caution. This key may be renamed in the future if more information
is found.
- "reserved"
- 40 bytes of reserved data.
Use with caution. This key may be renamed in the future if more information
is found.
- "indxrecord"
- This is thought to be the record offset to the first 'INDX' record, so
named for its first four letters.
Use with caution. This key may be renamed in the future if more information
is found.
- "titleoffset"
- Offset in record 0 (not from start of file) of the full title of the
book.
- "titlelength"
- Length in bytes of the full title of the book
- "languageunknown"
- 16 bits of unknown data thought to be related to the book language.
Use with caution. This key may be renamed in the future if more information
is found.
- "language"
- A pseudo-IANA language code string representing the main book language
(i.e. the value of <dc:language>). See %mobilangcodes for an exact
map of raw values to this string and notes on non-compliant results.
- "dilanguageunknown"
- 16 bits of unknown data thought to be related to the dictionary input
language.
Use with caution. This key may be renamed in the future if more information
is found.
- "dilanguage"
- A pseudo-IANA language code string for the DictionaryInLanguage element.
See %mobilangcodes for an exact map of raw values to this string and notes
on non-compliant results.
- "dolanguageunknown"
- 16 bits of unknown data thought to be related to the dictionary output
language.
Use with caution. This key may be renamed in the future if more information
is found.
- "dolanguage"
- A pseudo-IANA language code string for the DictionaryOutLanguage element.
See %mobilangcodes for an exact map of raw values to this string and notes
on non-compliant results.
- "version2"
- This is another Mobipocket format version related to DRM. If no DRM is
present, it should be the same as "version".
Use with caution. This key may be renamed in the future if more information
is found.
- "firstimagerecord"
- This is thought to be an index to the first record containing image data.
If there are no images in the book, this value will be 4294967295
(0xffffffff).
Use with caution. This key may be renamed in the future if more information
is found.
- "huffrecord"
- This is thought to be the record offset to the 'HUFF' record, used in
HUFF/CDIC decompression.
Use with caution. This key may be renamed in the future if more information
is found.
- "huffreccnt"
- This is thought to be the number of HUFF and CDIC records, starting at
"huffrecord".
Use with caution. This key may be renamed in the future if more information
is found.
- "datprecord"
- This is thought to be the record offset to the first 'DATP' record, so
named for its first four letters.
Use with caution. This key may be renamed in the future if more information
is found.
- "datpreccnt"
- This is thought to be the number of 'DATP' records present.
Use with caution. This key may be renamed in the future if more information
is found.
- "exthflags"
- A 32-bit bitfield related to the Mobipocket EXTH data. If bit 6 (0x40) is
set, then there is at least one EXTH record.
- "unknown116"
- 36 bytes of unknown data at offset 116. This value will be undefined if
the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "drmoffset"
- A number thought to be the byte offset inside of the record 0 data in
which DRM data can be found. If present and no DRM is set, contains either
the value 0xFFFFFFFF (normal books) or 0x00000000 (samples). This value
will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "drmcount"
- A number thought to be related to DRM.
This value will be undefined if the header data was not long enough to
contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "drmsize"
- A number thought to be the size of the data in bytes after
"drmoffset" containing DRM keys.
This value will be undefined if the header data was not long enough to
contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "drmflags"
- A number thought to be related to DRM.
This value will be undefined if the header data was not long enough to
contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "unknown168"
- 32 bits of unknown data at offset 168, usually zeroes. This value will be
undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "unknown172"
- 32 bits of unknown data at offset 172, usually zeroes. This value will be
undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "unknown176"
- 16 bits of unknown data at offset 176. This value will be undefined if the
header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "lastimagerecord"
- This is thought to be an index to the last record containing image data.
If there are no images in the book, this value will be 65535 (0xffff).
Use with caution. This key may be renamed in the future if more information
is found.
- "unknown180"
- 32 bits of unknown data at offset 180. This value will be undefined if the
header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "fcisrecord"
- This is thought to be an index to a 'FCIS' record, so named because those
are always the first four characters when the record data is decompressed
using uncompress_palmdoc().
This value will be undefined if the header data was not long enough to
contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "unknown188"
- 32 bits of unknown data at offset 188. This value will be undefined if the
header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "flisrecord"
- This is thought to be an index to a 'FLIS' record, so named because those
are always the first four characters when the record data is decompressed
using uncompress_palmdoc().
This value will be undefined if the header data was not long enough to
contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "unknown196"
- 32 bits of unknown data at offset 196. This value will be undefined if the
header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information
is found.
- "unknown200"
- Unknown data of unknown length running to the end of the header. This
value will be undefined if the header data was not long enough to contain
it.
Use with caution. This key may be renamed in the future if more information
is found.
- "extradataflags"
- Two bytes sometimes found inside of "unknown200", used to
determine if extra data has been appended to each text record that should
not be used in decompression.
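The checks described for the first few keys, and the "exthflags" bit test, can be sketched as follows. The field widths are an assumption (each of the first six fields is taken to be 4 bytes, numeric ones big-endian, as on the MobileRead wiki), and both sub names are hypothetical. Only the leading fields are unpacked.

```perl
use strict;
use warnings;
use Carp qw(carp croak);

# Hypothetical sketch: unpack and validate the leading MOBI header fields.
sub parse_mobi_header_start {
    my ($headerdata) = @_;
    croak "header too short" if length($headerdata) < 24;

    my %header;
    @header{qw(identifier headerlength type encoding uniqueid version)}
        = unpack 'a4 N5', $headerdata;    # assumed 32-bit fields

    croak "identifier is not 'MOBI'" unless $header{identifier} eq 'MOBI';
    croak "header length mismatch"
        unless $header{headerlength} == length($headerdata);
    carp "unexpected encoding $header{encoding}"
        unless $header{encoding} == 1252 || $header{encoding} == 65001;

    return %header;
}

# Bit 6 (0x40) of exthflags signals at least one EXTH record.
sub has_exth {
    my ($exthflags) = @_;
    return ($exthflags & 0x40) ? 1 : 0;
}
```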
"parse_mobi_language($languagecode, $regioncode)"¶
Takes the integer values $languagecode and $regioncode unpacked from the
Mobipocket header and returns a language string mostly (but not entirely)
conformant to the IANA language subtag registry codes.
Croaks if $languagecode is not provided. If $languagecode is not recognized, a
warning is carped and the sub returns undef. Note that 0,0 is a recognized
code returning an empty string.
If $regioncode is not provided or not recognized, it is disregarded and the base
language string (with no region or script) is returned.
See %mobilangcodes for an exact map of values. Note that the bottom two bits
of the region code appear to be unused (i.e. the values are all multiples of
4).
"pid_append_checksum($pid)"¶
Computes the Mobipocket PID checksum used as the final two bytes of the PID and
appends them to $pid, returning the merged string.
Used by "pid_is_valid($pid)".
"pid_is_valid($pid)"¶
Returns 1 if the PID is a valid Mobipocket/Kindle PID and 0 otherwise.
This is determined by first ensuring that $pid is exactly ten bytes long, and
then stripping the final two bytes normally used as a checksum and recomputing
them, returning 1 only if they are recomputed correctly.
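The validate-by-recomputing pattern described above can be sketched without knowing the checksum itself. In the sketch below the checksum is passed in as a code reference, and "toy_checksum" is a deliberately fake placeholder: it is NOT the Mobipocket PID checksum, which this text does not describe. Both sub names are hypothetical.

```perl
use strict;
use warnings;

# Placeholder only: NOT the real Mobipocket algorithm. It exists so the
# validation pattern below can be exercised.
sub toy_checksum {
    my ($base) = @_;
    my $sum = 0;
    $sum += ord($_) for split //, $base;
    return sprintf '%02d', $sum % 100;    # always two characters
}

# Hypothetical sketch of the documented validation steps.
sub pid_is_valid_sketch {
    my ($pid, $checksum) = @_;
    return 0 unless defined $pid && length($pid) == 10;
    my $base = substr($pid, 0, 8);        # strip the final two bytes
    return (($base . $checksum->($base)) eq $pid) ? 1 : 0;
}
```

Keeping the append step and the validity check built on the same checksum routine, as the text describes for "pid_append_checksum()" and "pid_is_valid()", means the two can never disagree about what a valid PID looks like.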
"pukall_cipher_1(%args)"¶
This is a COMPLETELY UNTESTED implementation of the Pukall Cipher 1 algorithm
used for encryption and decryption in Mobipocket files. It is a 128-bit stream
cipher. For more information and alternate implementations, see
<http://membres.lycos.fr/pc1/>.
Use at your own risk. Bug reports appreciated.
Arguments
- •
- "key"
16-byte encryption key. This must be provided, and must be exactly 16 bytes,
or the procedure will croak.
- •
- "input"
Input data to be either encrypted or decrypted. If this is not provided, the
procedure croaks.
- •
- "encrypt" (optional)
If set to true, the cipher will be used to encrypt the input data. If not
set, or set to false, the cipher will be used to decrypt the input
data.
This checks the end of a text record for extra data that should not be made part
of decompression and returns the total size of all data fields.
Arguments
- •
- "dataref"
A reference to the record data
- •
- "extradataflags"
16 bits worth of flags indicating which extra data fields are present.
"system_mobidedrm(%args)"¶
Runs python on a copy of "MobiDeDrm.py" if it is available (not
included with this distribution) to downconvert a Mobipocket file.
Returns the output filename on success, or undef otherwise.
Arguments
- •
- "infile"
The input filename. If not specified or invalid, the procedure returns
undef.
- •
- "outfile"
The output filename. If not specified, the program will use a name based on
the input file, appending '-nodrm' to the basename and keeping the
extension. In the special case of Mobipocket files ending in '-sm', the
'-sm' portion of the basename is simply removed, and nothing else is
appended.
- •
- "pid"
The PID to use to decrypt the file. If not specified or invalid, the
procedure returns undef.
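The default output naming rule for "outfile" can be illustrated with a small stand-alone sub. This is a sketch of the rule as described, not the module's code; the sub name is hypothetical, and the extension is taken to be everything from the last dot of the final path component.

```perl
use strict;
use warnings;

# Hypothetical sketch of the default output filename rule.
sub nodrm_filename_sketch {
    my ($infile) = @_;

    # Split off the extension (if any) of the final path component.
    my ($stem, $ext) = $infile =~ m{\A(.*?)(\.[^./\\]*)?\z}s;
    $ext = '' unless defined $ext;

    if ($stem =~ s/-sm\z//) {
        # Special case: '-sm' is simply removed, nothing appended.
    }
    else {
        $stem .= '-nodrm';
    }
    return $stem . $ext;
}
```

For example, under these assumptions 'book.prc' becomes 'book-nodrm.prc', while 'book-sm.prc' becomes 'book.prc'.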
"system_mobigen(%args)"¶
Runs "mobigen" to convert OPF, HTML, or ePub input into a Mobipocket
.prc/.mobi book. The procedure
find_mobigen() is called to locate the
executable.
Returns the return value from mobigen, or undef if no filename was specified or
the file did not exist. Also returns undef if mobigen could not be found.
Arguments
- •
- "infile"
The input filename. If not specified or invalid, the procedure returns
undef.
- •
- "outfile"
The output filename. The mobigen executable will choose its own filename for
direct output, but if this argument is specified, the output file will be
renamed to the specified filename instead.
If not specified, the default output will be left in place.
- •
- "dir"
The directory in which to place the output file. The mobigen executable
itself will always place its output into the current working directory,
but if this argument is specified, the output file will be moved into the
specified directory, creating that directory if necessary.
- •
- "compression"
Compression level from 0-2, where 0 is no compression, 1 is PalmDoc
compression, and 2 is HUFF/CDIC compression. If not specified, defaults to
1 (PalmDoc compression).
"uncompress_dictionaryhuffman(%args)"¶
Uncompresses text compressed with the DictionaryHuffman compression scheme.
Arguments
- •
- "data"
A scalar containing the compressed data to uncompress.
- •
- "huff"
A hashref pointing to the HUFF record data
- •
- "cdics"
An arrayref pointing to the CDIC record data
- •
- "depth"
The current depth of the huffman tree, currently only used in
debugging.
"unpack_mobi_language($data)"¶
Takes as an argument 4 bytes of data. If fewer bytes are provided, the sub croaks.
If more are provided, a debug warning is issued, but the sub continues.
In scalar context returns a language string mostly (but not entirely) conformant
to the IANA language subtag registry codes.
In list context, returns the language string, an unknown code integer, a region
code integer, and a language code integer, with the last three being directly
unpacked values.
See %mobilangcodes for an exact map of values. Note that the bottom two bits of
the region code appear to be unused (i.e. the values are all multiples of 4).
The unknown code integer appears to be unused, and is generally zero.
The original implementation by Mobipocket may have been via Microsoft's .NET
CultureInfo class. See:
<http://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo(VS.71).aspx>
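The scalar/list behaviour and the region fallback can be sketched as below. Several things here are assumptions rather than facts from this document: the byte order (16 unknown bits, then one region byte, then one language byte), the tiny lookup table (a stand-in for %mobilangcodes, with entries modelled on Windows locale IDs such as 0x0409), and both sub names.

```perl
use strict;
use warnings;
use Carp qw(carp croak);

# Illustrative stand-in for %mobilangcodes; not the module's real table.
my %lang_sketch = (
    0 => { base => '',   region => {} },
    7 => { base => 'de', region => { 4 => 'de-DE' } },
    9 => { base => 'en', region => { 4 => 'en-US', 8 => 'en-GB' } },
);

sub lookup_language_sketch {
    my ($languagecode, $regioncode) = @_;
    croak "language code required" unless defined $languagecode;
    my $entry = $lang_sketch{$languagecode};
    unless ($entry) {
        carp "unrecognized language code $languagecode";
        return;
    }
    # Unknown or missing region: fall back to the base language string.
    return $entry->{region}{$regioncode}
        if defined $regioncode && exists $entry->{region}{$regioncode};
    return $entry->{base};
}

sub unpack_language_sketch {
    my ($data) = @_;
    croak "need 4 bytes of data" if !defined $data || length($data) < 4;
    # Assumed layout: unknown (16 bits), region (8 bits), language (8 bits).
    my ($unknown, $region, $language) = unpack 'n C C', $data;
    my $string = lookup_language_sketch($language, $region);
    return wantarray ? ($string, $unknown, $region, $language) : $string;
}
```

The region keys in the table are all multiples of 4, matching the observation above that the bottom two bits of the region code appear to be unused.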
BUGS AND LIMITATIONS¶
- •
- Unpacking DRM-protected text isn't supported. Although infrastructure may
be added later to make use of external helpers and plugins, direct DRM
support will never be added to the main code for legal reasons.
- •
- Repacking a .prc without fully extracting to OPF and completely converting
back isn't supported. This will have to be implemented before an interface
to perform minor metadata alterations can be implemented.
- •
- Mobipocket HUFF/CDIC decoding (used mostly on dictionaries) isn't well
documented.
- •
- Not all Mobipocket data is understood, so a conversion from OPF to
Mobipocket .prc back to OPF will not result in all data being retained.
Patches welcome.
- •
- Mobipocket INDX, DATP, FCIS, and FLIS records are not understood and are
completely ignored.
- •
- Mobipocket EXTH subjectcode records may not end up attached to the correct
subject element if the number of subject records differs from the number
of subjectcode records. This is because the Mobipocket format leaves the
EXTH subjectcode records completely unlinked from the subject records, and
there is no way to detect if a subject with no associated subjectcode
comes before a subject with an associated subjectcode.
Fortunately, this should rarely be a problem with real data, as Mobipocket
Creator only allows a single subject to be set, and the only other way to
have a subjectcode attached to a subject is to manually edit the OPF file
and insert an additional dc:Subject element with a BASICCode attribute.
Mobipocket has indicated that they may move data currently in their custom
elements and attributes to the standard <meta> elements in a future
release, so this problem may become moot then.
AUTHOR¶
Zed Pobre <zed@debian.org>
LICENSE AND COPYRIGHT¶
Copyright 2008 Zed Pobre
Licensed to the public under the terms of the GNU GPL, version 2