NAME
HTML::TableParser - Extract data from an HTML table
SYNOPSIS
use HTML::TableParser;

@reqs = (
    {
        id    => 1.1,                   # id for embedded table
        hdr   => \&header,              # function callback
        row   => \&row,                 # function callback
        start => \&start,               # function callback
        end   => \&end,                 # function callback
        udata => { Snack => 'Food' },   # arbitrary user data
    },
    {
        id   => 1,                      # table id
        cols => [ 'Object Type',
                  qr/object/ ],         # column name matches
        obj  => $obj,                   # method callbacks
    },
);

# create parser object
$p = HTML::TableParser->new( \@reqs,
                             { Decode => 1, Trim => 1, Chomp => 1 } );
$p->parse_file( 'foo.html' );

# function callbacks
sub start {
    my ( $id, $line, $udata ) = @_;
    #...
}

sub end {
    my ( $id, $line, $udata ) = @_;
    #...
}

sub header {
    my ( $id, $line, $cols, $udata ) = @_;
    #...
}

sub row {
    my ( $id, $line, $cols, $udata ) = @_;
    #...
}
DESCRIPTION
HTML::TableParser uses HTML::Parser to extract data from an HTML table. The
data is returned via a series of user-defined callback functions or methods.
Specific tables may be selected either by matching a unique table id or by
matching against the column names. Multiple (even nested) tables may be parsed
in a document in one pass.
Table Identification
Each table is given a unique id, relative to its parent, based upon its order
and nesting. The first top level table has id 1, the second 2, etc. The first
table nested in table 1 has id 1.1, the second 1.2, etc. The first table
nested in table 1.1 has id 1.1.1, etc. These, as well as the tables' column
names, may be used to identify which tables to parse.
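The numbering scheme can be illustrated without the parser. The following is a minimal sketch, not the module's implementation: the table_ids helper and its list of 'open' and 'close' events are hypothetical stand-ins for the <table> and </table> tags met in document order.

```perl
use strict;
use warnings;

# Hypothetical helper: assign ids to tables, given a list of 'open' and
# 'close' events in document order.
sub table_ids {
    my @events   = @_;
    my @counters = (0);    # one counter per nesting level; the last is the current level
    my @ids;

    for my $event (@events) {
        if ( $event eq 'open' ) {
            $counters[-1]++;                   # next table at this level
            push @ids, join '.', @counters;    # the id is the path of counters
            push @counters, 0;                 # start counting nested tables
        }
        else {
            pop @counters;                     # leave this table
        }
    }
    return @ids;
}

# Two top-level tables follow one another at the end; the first top-level
# table holds two nested tables, and the first of those holds one more.
my @ids = table_ids(
    qw( open open open close close open close close open close ) );

print "$_\n" for @ids;
```

Each id is simply the chain of per-level counters at the moment the table is opened, which is why an id is only unique relative to the table's parent chain.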
Data Extraction
As the parser traverses a selected table, it will pass data to user provided
callback functions or methods after it has digested particular structures in
the table. All functions are passed the table id (as described above), the
line number in the HTML source where the table was found, and a reference to
any table specific user provided data.
- Table Start
- The start callback is invoked when a matched table
has been found.
- Table End
- The end callback is invoked after a matched table
has been parsed.
- Header
- The hdr callback is invoked after the table header
has been read in. Some tables do not use the <th> tag to
indicate a header, so this function may not be called. It is passed the
column names.
- Row
- The row callback is invoked after a row in the table
has been read. It is passed the column data.
- Warn
- The warn callback is invoked when a non-fatal error
occurs during parsing. Fatal errors croak.
- New
- This is the class method to call to create a new object
when HTML::TableParser is supposed to create new objects upon table
start.
Callback API
Callbacks may be functions or methods or a mixture of both. In the latter case,
an object must be passed to the constructor. (More on that later.)
The callbacks are invoked as follows:
start( $tbl_id, $line_no, $udata );
end( $tbl_id, $line_no, $udata );
hdr( $tbl_id, $line_no, \@col_names, $udata );
row( $tbl_id, $line_no, \@data, $udata );
warn( $tbl_id, $line_no, $message, $udata );
new( $tbl_id, $udata );
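As an illustration of these signatures, here is a minimal sketch that collects the header and rows of a small table held in a string. The HTML snippet and the variable names are made up for the example; parse() and eof() are the methods HTML::TableParser inherits from HTML::Parser.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Made-up input for the example.
my $html = <<'HTML';
<table>
<tr><th>Name</th><th>Qty</th></tr>
<tr><td>apple</td><td>3</td></tr>
<tr><td>pear</td><td>5</td></tr>
</table>
HTML

my ( @names, @rows );

my @reqs = (
    {
        id  => 1,    # first top-level table
        hdr => sub {
            my ( $id, $line, $cols, $udata ) = @_;
            @names = @$cols;
        },
        row => sub {
            my ( $id, $line, $cols, $udata ) = @_;
            push @rows, [@$cols];    # keep a copy of the row data
        },
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse($html);
$p->eof;    # tell the underlying HTML::Parser that the input is complete

print join( "\t", @names ), "\n";
print join( "\t", @$_ ),    "\n" for @rows;
```

Copying the array in the row callback is a defensive choice made for this sketch, so that the stored rows do not depend on how the parser manages the array it passes in.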
Data Cleanup
There are several cleanup operations that may be performed automatically:
- Chomp
- chomp() the data
- Decode
- Run the data through HTML::Entities::decode.
- DecodeNBSP
- Normally HTML::Entities::decode changes a
non-breaking space into a character which doesn't seem to be matched by
Perl's whitespace regexp. Setting this attribute changes the HTML
"nbsp" character to a plain ol' blank.
- Trim
- remove leading and trailing white space.
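The effect of these operations on a single cell can be sketched in plain Perl. This is an illustration only: the clean_cell helper is hypothetical, the order in which it applies the operations is an assumption, and Decode is left out so that the example needs no extra modules (the input is written as it would look after entity decoding).

```perl
use strict;
use warnings;

# Hypothetical helper mimicking DecodeNBSP, Chomp and Trim on one cell value.
# The order of the operations is an assumption made for this sketch.
sub clean_cell {
    my ( $text, %opt ) = @_;

    $text =~ s/\xA0/ /g     if $opt{DecodeNBSP};    # non-breaking space -> plain blank
    chomp $text             if $opt{Chomp};         # drop one trailing newline
    $text =~ s/^\s+|\s+$//g if $opt{Trim};          # strip white space at both ends

    return $text;
}

# "\xA0" is what a decoded &nbsp; looks like.
my $raw   = "\xA0 3.14 \n";
my $clean = clean_cell( $raw, DecodeNBSP => 1, Chomp => 1, Trim => 1 );

print "[$clean]\n";
```

Converting the non-breaking space first matters here: once it is a plain blank, the Trim step removes it along with the other surrounding white space.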
Data Organization
Column names are derived from cells delimited by the <th> and </th> tags. Some
tables have header cells which span one or more columns or rows to make things
look nice. HTML::TableParser determines the actual number of columns used and
provides column names for each column, repeating names for spanned columns and
concatenating spanned rows and columns. For example, if the table header looks
like this:
+----+--------+----------+-------------+-------------------+
|    |        | Eq J2000 |             | Velocity/Redshift |
| No | Object |----------| Object Type |-------------------|
|    |        | RA | Dec |             | km/s |  z  | Qual |
+----+--------+----------+-------------+-------------------+
The columns will be:
No
Object
Eq J2000 RA
Eq J2000 Dec
Object Type
Velocity/Redshift km/s
Velocity/Redshift z
Velocity/Redshift Qual
Row data are derived from cells delimited by the <td> and </td> tags. Cells
which span more than one column or row are handled correctly, i.e. the values
are duplicated in the appropriate places.
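The way spanned header cells turn into column names can be sketched as follows. This is not the module's code: the expand_header helper and its [ text, colspan, rowspan ] cell format are hypothetical, and the input reproduces the two header rows of the example table above.

```perl
use strict;
use warnings;

# Hypothetical helper: build column names from header rows.  Each row is an
# arrayref of cells, each cell being [ text, colspan, rowspan ].
sub expand_header {
    my @rows = @_;
    my @grid;    # $grid[$row][$col] holds the text contributed to that slot

    for my $r ( 0 .. $#rows ) {
        my $c = 0;
        for my $cell ( @{ $rows[$r] } ) {
            my ( $text, $colspan, $rowspan ) = @$cell;
            $colspan ||= 1;
            $rowspan ||= 1;

            # skip slots already claimed by a cell spanning down from above
            $c++ while defined $grid[$r][$c];

            for my $dr ( 0 .. $rowspan - 1 ) {
                for my $dc ( 0 .. $colspan - 1 ) {
                    # the name is repeated across columns, but contributes
                    # only once down the rows it spans
                    $grid[ $r + $dr ][ $c + $dc ] = $dr == 0 ? $text : '';
                }
            }
            $c += $colspan;
        }
    }

    my $ncols = 0;
    for my $row (@grid) {
        $ncols = @$row if @$row > $ncols;
    }

    # concatenate the non-empty pieces of each column, top to bottom
    my @names;
    for my $c ( 0 .. $ncols - 1 ) {
        push @names, join ' ',
          grep { defined $_ && length $_ } map { $_->[$c] } @grid;
    }
    return @names;
}

my @names = expand_header(
    [
        [ 'No', 1, 2 ], [ 'Object', 1, 2 ], [ 'Eq J2000', 2, 1 ],
        [ 'Object Type', 1, 2 ], [ 'Velocity/Redshift', 3, 1 ],
    ],
    [ ['RA'], ['Dec'], ['km/s'], ['z'], ['Qual'] ],
);

print "$_\n" for @names;
```

The sketch prints the same eight names as listed above; row data with spans could be handled along the same lines, duplicating the cell value instead of blanking it.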
METHODS
- new
-
$p = HTML::TableParser->new( \@reqs, \%attr );
This is the class constructor. It is passed a list of table requests as well
as attributes which specify defaults for common operations. Table requests
are documented in "Table Requests".
The %attr hash provides default values for some of the table request
attributes, namely the data cleanup operations ( "Chomp",
"Decode", "Trim" ) and the multi-match attribute
"MultiMatch". For example,
$p = HTML::TableParser->new( \@reqs, { Chomp => 1 } );
will set Chomp on for all of the table requests, unless overridden by
them. The data cleanup operations are documented above;
"MultiMatch" is documented in "Table Requests".
Decode defaults to on; all of the others default to off.
- parse_file
- This is the same function as in HTML::Parser.
- parse
- This is the same function as in HTML::Parser.
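The relationship between the built-in defaults, the %attr hash passed to new, and a table request's own settings can be pictured as a hash merge in which later entries win. The sketch below is only a way to think about the precedence described for new; it is not the module's code, and the variable names are made up.

```perl
use strict;
use warnings;

# Built-in defaults as stated above: Decode on, everything else off.
my %builtin = ( Chomp => 0, Decode => 1, Trim => 0, MultiMatch => 0 );

# Second argument to new(): defaults for every table request.
my %attr = ( Chomp => 1 );

# One table request which overrides Chomp and turns on Trim.
my %request = ( id => 1, Chomp => 0, Trim => 1 );

# Later entries win, so the request has the final say.
my %effective = ( %builtin, %attr, %request );

print "$_ => $effective{$_}\n" for sort keys %effective;
```

Here the request ends up with Chomp off (its own setting), Decode on (the built-in default) and Trim on (its own setting).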
Table Requests
A table request is a hash used by
HTML::TableParser to determine which
tables are to be parsed, the callbacks to be invoked, and any data cleanup.
There may be multiple requests processed by one call to the parser; each table
is associated with a single request (even if several requests match the
table).
A single request may match several tables; however, unless the MultiMatch
attribute is specified for that request, it will be used for the first
matching table only.
A table request which matches a table id of "DEFAULT" will be used as
a catch-all request, and will match all tables not matched by other requests.
Please note that tables are compared to the requests in the order that the
latter are passed to the new() method; place the DEFAULT request last for
proper behavior.
Identifying tables to parse
HTML::TableParser needs to be told which tables to parse. This can be
done by matching table ids or column names, or a combination of both. The
table request hash elements dedicated to this are:
- id
- This indicates a match on table id. It can take one of
these forms:
- exact match
-
id => $match
id => '1.2'
Here $match is a scalar which is compared directly to the table id.
- regular expression
-
id => $re
id => qr/1\.\d+\.2/
$re is a regular expression, which must be constructed with the
"qr//" operator.
- subroutine
-
id => \&my_match_subroutine
id => sub { my ( $id, $oids ) = @_ ;
$oids->[0] > 3 && $oids->[1] < 2 }
Here "id" is assigned a coderef to a subroutine which returns true
if the table matches, false if not. The subroutine is passed two
arguments: the table id as a scalar string ( e.g. 1.2.3) and the table id
as an arrayref (e.g. "$oids = [ 1, 2, 3]").
"id" may be passed an array containing any combination of the above:
id => [ '1.2', qr/1\.\d+\.2/, sub { ... } ]
Elements in the array may be preceded by a modifier indicating the action to be
taken if the table matches on that element. The modifiers and their meanings
are:
- "-"
- If the id matches, it is explicitly excluded from being
processed by this request.
- "--"
- If the id matches, it is skipped by all
requests.
- "+"
- If the id matches, it will be processed by this request.
This is the default action.
An example:
id => [ '-', '1.2', 'DEFAULT' ]
indicates that this request should be used for all tables, except for table 1.2.
id => [ '--', '1.2' ]
indicates that table 1.2 is just plain skipped altogether, by every request.
- cols
- This indicates a match on column names. It can take one of
these forms:
- exact match
-
cols => $match
cols => 'Snacks01'
Here $match is a scalar which is compared directly to the column names. If
any column matches, the table is processed.
- regular expression
-
cols => $re
cols => qr/Snacks\d+/
$re is a regular expression, which must be constructed with the
"qr//" operator. Again, a successful match against any column
name causes the table to be processed.
- subroutine
-
cols => \&my_match_subroutine
cols => sub { my ( $id, $oids, $cols ) = @_ ;
... }
Here "cols" is assigned a coderef to a subroutine which returns
true if the table matches, false if not. The subroutine is passed three
arguments: the table id as a scalar string ( e.g. 1.2.3), the table id as
an arrayref (e.g. "$oids = [ 1, 2, 3]"), and the column names,
as an arrayref (e.g. "$cols = [ 'col1', 'col2' ]"). This option
gives the calling routine the ability to make arbitrary selections based
upon table id and columns.
"cols" may be passed an arrayref containing any combination of the
above:
cols => [ 'Snacks01', qr/Snacks\d+/, sub { ... } ]
Elements in the array may be preceded by a modifier indicating the action to be
taken if the table matches on that element. They are the same as the table id
modifiers mentioned above.
- colre
- This is deprecated, and is present for backwards
compatibility only. It takes an arrayref containing the regular expressions
to match, or a scalar containing a single regular expression.
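The matching forms and modifiers described for "id" can be sketched in plain Perl. This is an illustration of the documented behaviour, not the module's implementation: the check_id helper and its return values are hypothetical, a modifier is assumed to apply only to the element that follows it, and "DEFAULT" is simplified to match any id.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what one request's "id" list says about a
# table id.  Returns 'use', 'exclude', 'skip' or 'nomatch'.
sub check_id {
    my ( $id, @spec ) = @_;

    my @oids   = split /\./, $id;    # '1.2.3' -> ( 1, 2, 3 )
    my %result = ( '+' => 'use', '-' => 'exclude', '--' => 'skip' );
    my $action = '+';                # the default modifier

    for my $item (@spec) {

        # a modifier applies to the element which follows it (assumption)
        if ( !ref($item) && exists $result{$item} ) {
            $action = $item;
            next;
        }

        my $hit =
            ref($item) eq 'CODE'   ? $item->( $id, \@oids )
          : ref($item) eq 'Regexp' ? $id =~ $item
          : $item eq 'DEFAULT'     ? 1                # simplification
          :                          $id eq $item;

        return $result{$action} if $hit;
        $action = '+';               # reset for the next element
    }
    return 'nomatch';
}

my @spec = (
    '-', '1.2',
    qr/^1\.\d+$/,
    sub { my ( $id, $oids ) = @_; $oids->[0] > 3 },
);

printf "%-4s %s\n", $_, check_id( $_, @spec ) for qw( 1.2 1.7 4 2 );
```

Note that the subroutine receives the id components as an arrayref, so they are accessed as $oids->[0], $oids->[1], and so on.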
More than one of these may be used for a single table request. A request may
match more than one table. By default a request is used only once (even the
"DEFAULT" id match!). Set the "MultiMatch" attribute to
enable multiple matches per request.
When attempting to match a table, the following steps are taken:
- 1.
- The table id is compared to the requests which contain an
id match. The first such match is used (in the order given in the passed
array).
- 2.
- If no explicit id match is found, column name matches are
attempted. The first such match is used (in the order given in the passed
array).
- 3.
- If no column name match is found (or there were none
requested), the first request which matches an id of
"DEFAULT" is used.
Specifying the data callbacks
Callback functions are specified with the callback attributes "start",
"end", "hdr", "row", and "warn". They
should be set to code references, i.e.
%table_req = ( ..., start => \&start_func, end => \&end_func )
To use methods, specify the object with the "obj" key, and the method
names via the callback attributes, which should be set to strings. If you
don't specify method names they will default to (you guessed it)
"start", "end", "hdr", "row", and
"warn".
$obj = SomeClass->new();
# ...
%table_req_1 = ( ..., obj => $obj );
%table_req_2 = ( ..., obj => $obj, start => 'start',
end => 'end' );
You can also have HTML::TableParser create a new object for you for each
table by specifying the "class" attribute. By default the constructor is
assumed to be the class's new() method; if not, specify it using the "new"
attribute:
use MyClass;
%table_req = ( ..., class => 'MyClass', new => 'mynew' );
To use a function instead of a method for a particular callback, set the
callback attribute to a code reference:
%table_req = ( ..., obj => $obj, end => \&end_func );
You don't have to provide all the callbacks. You should not use both
"obj" and "class" in the same table request.
HTML::TableParser automatically determines if your object or class has
one of the required methods. If you do not want it to use a particular
method, set the corresponding attribute to "undef". For example
%table_req = ( ..., obj => $obj, end => undef )
indicates the object's end method should not be called, even if it
exists.
You can specify arbitrary data to be passed to the callback functions via the
"udata" attribute:
%table_req = ( ..., udata => \%hash_of_my_special_stuff )
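Putting "obj" and "udata" together, a minimal sketch with method callbacks might look like this. The RowCollector class, its fields, and the HTML snippet are made up for the example; the class provides only a row method, relying on the statement above that not all callbacks have to be provided.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Hypothetical collector class; its method name follows the default "row".
package RowCollector;

sub new {
    my $class = shift;
    return bless { rows => [] }, $class;
}

sub row {
    my ( $self, $id, $line, $cols, $udata ) = @_;
    push @{ $self->{rows} },
      { table => $id, label => $udata->{label}, data => [@$cols] };
}

package main;

my $collector = RowCollector->new;

my @reqs = (
    {
        id    => 1,
        obj   => $collector,             # methods are looked up on this object
        udata => { label => 'fruit' },   # handed to every callback
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse('<table><tr><td>apple</td><td>3</td></tr></table>');
$p->eof;

for my $r ( @{ $collector->{rows} } ) {
    print "$r->{label} (table $r->{table}): @{ $r->{data} }\n";
}
```

Keeping the collected rows inside the object avoids global variables, which is the main reason to prefer method callbacks over plain functions when several tables are handled.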
Specifying Data cleanup operations
Data cleanup operations may be specified uniquely for each table. The available
keys are "Chomp", "Decode", "Trim". They should
be set to a non-zero value if the operation is to be performed.
Other Attributes
The "MultiMatch" key is used when a request is capable of handling
multiple tables in the document. Ordinarily, a request will process a single
table only (even "DEFAULT" requests). Set it to a non-zero value to
allow the request to handle more than one table.
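A catch-all request combined with "MultiMatch" can be sketched as follows. The HTML snippet and the %seen hash are made up for the example, which simply counts the rows seen in each table.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Made-up input: three top-level tables.
my $html = <<'HTML';
<table><tr><td>a</td></tr></table>
<table><tr><td>b</td></tr></table>
<table><tr><td>c</td></tr></table>
HTML

my %seen;    # table id => number of rows seen

my @reqs = (
    {
        id         => 'DEFAULT',
        MultiMatch => 1,    # reuse this request for every table
        row        => sub { my ($id) = @_; $seen{$id}++ },
    },
);

my $p = HTML::TableParser->new( \@reqs, {} );
$p->parse($html);
$p->eof;

print join( ', ', sort keys %seen ), "\n";
```

Without "MultiMatch" the request would be used up by the first table, and the remaining two would go unparsed.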
LICENSE
This software is released under the GNU General Public License. You may find a
copy at
http://www.fsf.org/copyleft/gpl.html
AUTHOR
Diab Jerius (djerius@cpan.org)
SEE ALSO
HTML::Parser, HTML::TableExtract.