.\" Automatically generated by Pod::Man 4.09 (Pod::Simple 3.35)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
. ds C`
. ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.if !\nF .nr F 0
.if \nF>0 \{\
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. if !\nF==2 \{\
. nr % 0
. nr F 2
. \}
.\}
.\" ========================================================================
.\"
.IX Title "HTML::TableParser 3pm"
.TH HTML::TableParser 3pm "2018-03-25" "perl v5.26.1" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
HTML::TableParser \- Extract data from an HTML table
.SH "VERSION"
.IX Header "VERSION"
version 0.43
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\& use HTML::TableParser;
\&
\& @reqs = (
\& {
\& id => 1.1, # id for embedded table
\& hdr => \e&header, # function callback
\& row => \e&row, # function callback
\& start => \e&start, # function callback
\& end => \e&end, # function callback
\& udata => { Snack => \*(AqFood\*(Aq }, # arbitrary user data
\& },
\& {
\& id => 1, # table id
\& cols => [ \*(AqObject Type\*(Aq,
\& qr/object/ ], # column name matches
\& obj => $obj, # method callbacks
\& },
\& );
\&
\& # create parser object
\& $p = HTML::TableParser\->new( \e@reqs,
\& { Decode => 1, Trim => 1, Chomp => 1 } );
\& $p\->parse_file( \*(Aqfoo.html\*(Aq );
\&
\&
\& # function callbacks
\& sub start {
\& my ( $id, $line, $udata ) = @_;
\& #...
\& }
\&
\& sub end {
\& my ( $id, $line, $udata ) = @_;
\& #...
\& }
\&
\& sub header {
\& my ( $id, $line, $cols, $udata ) = @_;
\& #...
\& }
\&
\& sub row {
\& my ( $id, $line, $cols, $udata ) = @_;
\& #...
\& }
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
\&\fBHTML::TableParser\fR uses \fBHTML::Parser\fR to extract data from an \s-1HTML\s0
table. The data is returned via a series of user defined callback
functions or methods. Specific tables may be selected either by
matching a unique table id or by matching against the column names.
Multiple (even nested) tables may be parsed in a document in one pass.
.SS "Table Identification"
.IX Subsection "Table Identification"
Each table is given a unique id, relative to its parent, based upon its
order and nesting. The first top level table has id \f(CW1\fR, the second
\&\f(CW2\fR, etc. The first table nested in table \f(CW1\fR has id \f(CW1.1\fR, the
second \f(CW1.2\fR, etc. The first table nested in table \f(CW1.1\fR has id
\&\f(CW1.1.1\fR, etc. These, as well as the tables' column names, may
be used to identify which tables to parse.
.SS "Data Extraction"
.IX Subsection "Data Extraction"
As the parser traverses a selected table, it will pass data to user
provided callback functions or methods after it has digested
particular structures in the table. All functions are passed the
table id (as described above), the line number in the \s-1HTML\s0 source
where the table was found, and a reference to any table specific user
provided data.
.IP "Table Start" 8
.IX Item "Table Start"
The \fBstart\fR callback is invoked when a matched table has been found.
.IP "Table End" 8
.IX Item "Table End"
The \fBend\fR callback is invoked after a matched table has been parsed.
.IP "Header" 8
.IX Item "Header"
The \fBhdr\fR callback is invoked after the table header has been read in.
Some tables do not use the \fB<th>\fR tag to indicate a header, so this
function may not be called. It is passed the column names.
.IP "Row" 8
.IX Item "Row"
The \fBrow\fR callback is invoked after a row in the table has been read.
It is passed the column data.
.IP "Warn" 8
.IX Item "Warn"
The \fBwarn\fR callback is invoked when a non-fatal error occurs during
parsing. Fatal errors croak.
.IP "New" 8
.IX Item "New"
This is the class method to call to create a new object when
\&\fBHTML::TableParser\fR is supposed to create new objects upon table
start.
.SS "Callback \s-1API\s0"
.IX Subsection "Callback API"
Callbacks may be functions, methods, or a mixture of both.
If methods are used, an object (or a class from which to create one)
must be supplied in the table request passed to the constructor; see
\&\*(L"Specifying the data callbacks\*(R".
.PP
The callbacks are invoked as follows:
.PP
.Vb 1
\& start( $tbl_id, $line_no, $udata );
\&
\& end( $tbl_id, $line_no, $udata );
\&
\& hdr( $tbl_id, $line_no, \e@col_names, $udata );
\&
\& row( $tbl_id, $line_no, \e@data, $udata );
\&
\& warn( $tbl_id, $line_no, $message, $udata );
\&
\& new( $tbl_id, $udata );
.Ve
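.PP
As a minimal sketch of how these signatures might be used, the function
callbacks below record the column names and rows of the first top level
table. The sample \s-1HTML\s0 and the variables \f(CW@names\fR and \f(CW@rows\fR are
illustrative assumptions, not part of the module.
.PP
.Vb 10
\& use strict;
\& use warnings;
\& use HTML::TableParser;
\&
\& # hypothetical input document
\& my $html = q{
\& <table>
\& <tr><th>Name</th><th>Qty</th></tr>
\& <tr><td>Apple</td><td>3</td></tr>
\& <tr><td>Pear</td><td>5</td></tr>
\& </table>
\& };
\&
\& my ( @names, @rows );
\&
\& my $p = HTML::TableParser\->new(
\&     [ {
\&         id  => 1,
\&         # hdr and row receive ( $tbl_id, $line_no, \e@cols, $udata )
\&         hdr => sub { my ( $id, $line, $cols, $udata ) = @_; @names = @$cols },
\&         row => sub { my ( $id, $line, $cols, $udata ) = @_; push @rows, [@$cols] },
\&     } ],
\&     { Trim => 1 },
\& );
\& $p\->parse($html);
\& $p\->eof;
\&
\& print "columns: @names\en";
\& print "rows read: ", scalar @rows, "\en";
.Ve
.PP
Copying the array (\f(CW\*(C`[@$cols]\*(C'\fR) is a defensive choice here, in case the
array passed to the callback is reused between calls.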
.SS "Data Cleanup"
.IX Subsection "Data Cleanup"
There are several cleanup operations that may be performed automatically:
.IP "Chomp" 8
.IX Item "Chomp"
\&\fB\f(BIchomp()\fB\fR the data
.IP "Decode" 8
.IX Item "Decode"
Run the data through \fBHTML::Entities::decode\fR.
.IP "DecodeNBSP" 8
.IX Item "DecodeNBSP"
Normally \fBHTML::Entities::decode\fR changes a non-breaking space into
a character which doesn't seem to be matched by Perl's whitespace
regexp. Setting this attribute changes the \s-1HTML\s0 \f(CW\*(C`nbsp\*(C'\fR character to
a plain ol' blank.
.IP "Trim" 8
.IX Item "Trim"
Remove leading and trailing white space.
.SS "Data Organization"
.IX Subsection "Data Organization"
Column names are derived from cells delimited by the \fB<th>\fR and
\&\fB</th>\fR tags. Some tables have header cells which span one or
more columns or rows to make things look nice. \fBHTML::TableParser\fR
determines the actual number of columns used and provides column
names for each column, repeating names for spanned columns and
concatenating spanned rows and columns. For example, if the
table header looks like this:
.PP
.Vb 5
\& +\-\-\-\-+\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+
\& | | | Eq J2000 | | Velocity/Redshift |
\& | No | Object |\-\-\-\-\-\-\-\-\-\-| Object Type |\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-|
\& | | | RA | Dec | | km/s | z | Qual |
\& +\-\-\-\-+\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+
.Ve
.PP
The columns will be:
.PP
.Vb 8
\& No
\& Object
\& Eq J2000 RA
\& Eq J2000 Dec
\& Object Type
\& Velocity/Redshift km/s
\& Velocity/Redshift z
\& Velocity/Redshift Qual
.Ve
.PP
Row data are derived from cells delimited by the \fB<td>\fR and
\&\fB</td>\fR tags. Cells which span more than one column or row are
handled correctly, i.e. the values are duplicated in the appropriate
places.
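.PP
Because each row supplies one value per column name, the two can be
combined into one record per row. The following is a minimal sketch of
that idea. The callbacks are invoked by hand with made-up values (a
subset of the header shown above) so that it runs without an \s-1HTML\s0
document; \f(CW@names\fR, \f(CW@records\fR, and the sample values are assumptions.
.PP
.Vb 10
\& use strict;
\& use warnings;
\&
\& my ( @names, @records );
\&
\& # hdr callback: remember the column names
\& sub header {
\&     my ( $id, $line, $cols, $udata ) = @_;
\&     @names = @$cols;
\& }
\&
\& # row callback: pair each value with its column name
\& sub row {
\&     my ( $id, $line, $cols, $udata ) = @_;
\&     my %rec;
\&     @rec{@names} = @$cols;    # hash slice
\&     push @records, \e%rec;
\& }
\&
\& # simulate what the parser would pass to the callbacks
\& header( 1, 1, [ \*(AqNo\*(Aq, \*(AqObject\*(Aq, \*(AqEq J2000 RA\*(Aq, \*(AqEq J2000 Dec\*(Aq ] );
\& row( 1, 5, [ 1, \*(AqM31\*(Aq, \*(Aq00h42m\*(Aq, \*(Aq+41d16m\*(Aq ] );
\&
\& print $records[0]{\*(AqEq J2000 RA\*(Aq}, "\en";    # prints 00h42m
.Ve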
.SH "METHODS"
.IX Header "METHODS"
.IP "new" 8
.IX Item "new"
.Vb 1
\& $p = HTML::TableParser\->new( \e@reqs, \e%attr );
.Ve
.Sp
This is the class constructor. It is passed a list of table requests
as well as attributes which specify defaults for common operations.
Table requests are documented in \*(L"Table Requests\*(R".
.Sp
The \f(CW%attr\fR hash provides default values for some of the table
request attributes, namely the data cleanup operations ( \f(CW\*(C`Chomp\*(C'\fR,
\&\f(CW\*(C`Decode\*(C'\fR, \f(CW\*(C`Trim\*(C'\fR ), and the multi match attribute \f(CW\*(C`MultiMatch\*(C'\fR.
For example,
.Sp
.Vb 1
\& $p = HTML::TableParser\->new( \e@reqs, { Chomp => 1 } );
.Ve
.Sp
will set \fBChomp\fR on for all of the table requests, unless overridden
by them. The data cleanup operations are documented above; \f(CW\*(C`MultiMatch\*(C'\fR
is documented in \*(L"Table Requests\*(R".
.Sp
\&\fBDecode\fR defaults to on; all of the others default to off.
.IP "parse_file" 8
.IX Item "parse_file"
This is the same function as in \fBHTML::Parser\fR.
.IP "parse" 8
.IX Item "parse"
This is the same function as in \fBHTML::Parser\fR.
.SH "Table Requests"
.IX Header "Table Requests"
A table request is a hash used by \fBHTML::TableParser\fR to determine
which tables are to be parsed, the callbacks to be invoked, and any
data cleanup. There may be multiple requests processed by one call to
the parser; each table is associated with a single request (even if
several requests match the table).
.PP
A single request may match several tables; however, unless the
\&\fBMultiMatch\fR attribute is specified for that request, it will be used
for the first matching table only.
.PP
A table request which matches a table id of \f(CW\*(C`DEFAULT\*(C'\fR will be used as
a catch-all request, and will match all tables not matched by other
requests. Please note that tables are compared to the requests in the
order that the latter are passed to the \fB\f(BInew()\fB\fR method; place the
\&\fB\s-1DEFAULT\s0\fR request last for proper behavior.
.SS "Identifying tables to parse"
.IX Subsection "Identifying tables to parse"
\&\fBHTML::TableParser\fR needs to be told which tables to parse. This can
be done by matching table ids or column names, or a combination of
both. The table request hash elements dedicated to this are:
.IP "id" 8
.IX Item "id"
This indicates a match on table id. It can take one of these forms:
.RS 8
.IP "exact match" 8
.IX Item "exact match"
.Vb 2
\& id => $match
\& id => \*(Aq1.2\*(Aq
.Ve
.Sp
Here \f(CW$match\fR is a scalar which is compared directly to the table id.
.IP "regular expression" 8
.IX Item "regular expression"
.Vb 2
\& id => $re
\& id => qr/1\e.\ed+\e.2/
.Ve
.Sp
\&\f(CW$re\fR is a regular expression, which must be constructed with the
\&\f(CW\*(C`qr//\*(C'\fR operator.
.IP "subroutine" 8
.IX Item "subroutine"
.Vb 3
\& id => \e&my_match_subroutine
\& id => sub { my ( $id, $oids ) = @_ ;
\& $oids\->[0] > 3 && $oids\->[1] < 2 }
.Ve
.Sp
Here \f(CW\*(C`id\*(C'\fR is assigned a coderef to a subroutine which returns
true if the table matches, false if not. The subroutine is passed
two arguments: the table id as a scalar string ( e.g. \f(CW1.2.3\fR) and the
table id as an arrayref (e.g. \f(CW\*(C`$oids = [ 1, 2, 3]\*(C'\fR).
.RE
.RS 8
.Sp
\&\f(CW\*(C`id\*(C'\fR may be passed an array containing any combination of the
above:
.Sp
.Vb 1
\& id => [ \*(Aq1.2\*(Aq, qr/1\e.\ed+\e.2/, sub { ... } ]
.Ve
.Sp
Elements in the array may be preceded by a modifier indicating
the action to be taken if the table matches on that element.
The modifiers and their meanings are:
.ie n .IP """\-""" 8
.el .IP "\f(CW\-\fR" 8
.IX Item "-"
If the id matches, it is explicitly excluded from being processed
by this request.
.ie n .IP """\-\-""" 8
.el .IP "\f(CW\-\-\fR" 8
.IX Item "--"
If the id matches, it is skipped by \fBall\fR requests.
.ie n .IP """+""" 8
.el .IP "\f(CW+\fR" 8
.IX Item "+"
If the id matches, it will be processed by this request. This
is the default action.
.RE
.RS 8
.Sp
An example:
.Sp
.Vb 1
\& id => [ \*(Aq\-\*(Aq, \*(Aq1.2\*(Aq, \*(AqDEFAULT\*(Aq ]
.Ve
.Sp
indicates that this request should be used for all tables,
except for table 1.2.
.Sp
.Vb 1
\& id => [ \*(Aq\-\-\*(Aq, \*(Aq1.2\*(Aq ]
.Ve
.Sp
Table 1.2 is just plain skipped altogether.
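.Sp
As a further sketch, a matcher subroutine can be exercised on its own
before it is handed to the parser. The matcher below is hypothetical: it
accepts only tables nested directly inside table 1. The loop builds the
two arguments by hand, in the form described above for subroutine
matches.
.Sp
.Vb 10
\& use strict;
\& use warnings;
\&
\& # hypothetical matcher: true for ids of the form 1.N
\& my $match = sub {
\&     my ( $id, $oids ) = @_;
\&     return @$oids == 2 && $oids\->[0] == 1;
\& };
\&
\& for my $id ( \*(Aq1\*(Aq, \*(Aq1.1\*(Aq, \*(Aq1.2\*(Aq, \*(Aq2.1\*(Aq, \*(Aq1.1.1\*(Aq ) {
\&     my @oids = split /\e./, $id;
\&     printf "%\-6s %s\en", $id,
\&         $match\->( $id, \e@oids ) ? \*(Aqmatch\*(Aq : \*(Aqno match\*(Aq;
\& }
\&
\& # the same coderef may then be used in a request:
\& #   id => $match
.Ve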
.RE
.IP "cols" 8
.IX Item "cols"
This indicates a match on column names. It can take one of these forms:
.RS 8
.IP "exact match" 8
.IX Item "exact match"
.Vb 2
\& cols => $match
\& cols => \*(AqSnacks01\*(Aq
.Ve
.Sp
Here \f(CW$match\fR is a scalar which is compared directly to the column names.
If any column matches, the table is processed.
.IP "regular expression" 8
.IX Item "regular expression"
.Vb 2
\& cols => $re
\& cols => qr/Snacks\ed+/
.Ve
.Sp
\&\f(CW$re\fR is a regular expression, which must be constructed with the
\&\f(CW\*(C`qr//\*(C'\fR operator. Again, a successful match against any column name
causes the table to be processed.
.IP "subroutine" 8
.IX Item "subroutine"
.Vb 3
\& cols => \e&my_match_subroutine
\& cols => sub { my ( $id, $oids, $cols ) = @_ ;
\& ... }
.Ve
.Sp
Here \f(CW\*(C`cols\*(C'\fR is assigned a coderef to a subroutine which returns
true if the table matches, false if not. The subroutine is passed
three arguments: the table id as a scalar string ( e.g. \f(CW1.2.3\fR), the
table id as an arrayref (e.g. \f(CW\*(C`$oids = [ 1, 2, 3]\*(C'\fR), and the column
names, as an arrayref (e.g. \f(CW\*(C`$cols = [ \*(Aqcol1\*(Aq, \*(Aqcol2\*(Aq ]\*(C'\fR). This
option gives the calling routine the ability to make arbitrary
selections based upon table id and columns.
.RE
.RS 8
.Sp
\&\f(CW\*(C`cols\*(C'\fR may be passed an arrayref containing any combination of the
above:
.Sp
.Vb 1
\& cols => [ \*(AqSnacks01\*(Aq, qr/Snacks\ed+/, sub { ... } ]
.Ve
.Sp
Elements in the array may be preceded by a modifier indicating
the action to be taken if the table matches on that element.
They are the same as the table id modifiers mentioned above.
.RE
.IP "colre" 8
.IX Item "colre"
\&\fBThis is deprecated, and is present for backwards compatibility only.\fR
An arrayref containing the regular expressions to match, or a scalar
containing a single regular expression.
.PP
More than one of these may be used for a single table request. A
request may match more than one table. By default a request is used
only once (even the \f(CW\*(C`DEFAULT\*(C'\fR id match!). Set the \f(CW\*(C`MultiMatch\*(C'\fR
attribute to enable multiple matches per request.
.PP
When attempting to match a table, the following steps are taken:
.IP "1." 8
The table id is compared to the requests which contain an id match.
The first such match is used (in the order given in the passed array).
.IP "2." 8
If no explicit id match is found, column name matches are attempted.
The first such match is used (in the order given in the passed array).
.IP "3." 8
If no column name match is found (or there were none requested),
the first request which matches an \fBid\fR of \f(CW\*(C`DEFAULT\*(C'\fR is used.
.SS "Specifying the data callbacks"
.IX Subsection "Specifying the data callbacks"
Callback functions are specified with the callback attributes
\&\f(CW\*(C`start\*(C'\fR, \f(CW\*(C`end\*(C'\fR, \f(CW\*(C`hdr\*(C'\fR, \f(CW\*(C`row\*(C'\fR, and \f(CW\*(C`warn\*(C'\fR. They should be set to
code references, e.g.
.PP
.Vb 1
\& %table_req = ( ..., start => \e&start_func, end => \e&end_func )
.Ve
.PP
To use methods, specify the object with the \f(CW\*(C`obj\*(C'\fR key, and
the method names via the callback attributes, which should be set
to strings. If you don't specify method names they will default to (you
guessed it) \f(CW\*(C`start\*(C'\fR, \f(CW\*(C`end\*(C'\fR, \f(CW\*(C`hdr\*(C'\fR, \f(CW\*(C`row\*(C'\fR, and \f(CW\*(C`warn\*(C'\fR.
.PP
.Vb 5
\& $obj = SomeClass\->new();
\& # ...
\& %table_req_1 = ( ..., obj => $obj );
\& %table_req_2 = ( ..., obj => $obj, start => \*(Aqstart\*(Aq,
\& end => \*(Aqend\*(Aq );
.Ve
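.PP
As a minimal sketch of the method form, the example below assumes a
hypothetical class \f(CW\*(C`RowCounter\*(C'\fR which implements only a \f(CW\*(C`row\*(C'\fR method.
It further assumes that a method callback receives the same arguments as
the corresponding function callback, preceded by the object. The class,
its \f(CW\*(C`n\*(C'\fR field, and the sample \s-1HTML\s0 are illustrative only.
.PP
.Vb 10
\& use strict;
\& use warnings;
\& use HTML::TableParser;
\&
\& package RowCounter;    # hypothetical class
\&
\& sub new { return bless { n => 0 }, shift }
\&
\& # called as a method, so the object comes first
\& sub row {
\&     my ( $self, $id, $line, $cols, $udata ) = @_;
\&     $self\->{n}++;
\& }
\&
\& package main;
\&
\& my $counter = RowCounter\->new;
\&
\& # no callback names given: the default method names are looked up
\& my $p = HTML::TableParser\->new( [ { id => 1, obj => $counter } ] );
\& $p\->parse( q{<table><tr><td>a</td></tr><tr><td>b</td></tr></table>} );
\& $p\->eof;
\&
\& print "rows counted: $counter\->{n}\en";
.Ve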
.PP
You can also have \fBHTML::TableParser\fR create a new object for you
for each table by specifying the \f(CW\*(C`class\*(C'\fR attribute. By default
the constructor is assumed to be the class \fB\f(BInew()\fB\fR method; if not,
specify it using the \f(CW\*(C`new\*(C'\fR attribute:
.PP
.Vb 2
\& use MyClass;
\& %table_req = ( ..., class => \*(AqMyClass\*(Aq, new => \*(Aqmynew\*(Aq );
.Ve
.PP
To use a function instead of a method for a particular callback,
set the callback attribute to a code reference:
.PP
.Vb 1
\& %table_req = ( ..., obj => $obj, end => \e&end_func );
.Ve
.PP
You don't have to provide all the callbacks. You should not use both
\&\f(CW\*(C`obj\*(C'\fR and \f(CW\*(C`class\*(C'\fR in the same table request.
.PP
\&\fBHTML::TableParser\fR automatically determines if your object
or class has one of the required methods. If you wish it \fInot\fR
to use a particular method, set it equal to \f(CW\*(C`undef\*(C'\fR. For example
.PP
.Vb 1
\& %table_req = ( ..., obj => $obj, end => undef )
.Ve
.PP
indicates the object's \fBend\fR method should not be called, even
if it exists.
.PP
You can specify arbitrary data to be passed to the callback functions
via the \f(CW\*(C`udata\*(C'\fR attribute:
.PP
.Vb 1
\& %table_req = ( ..., udata => \e%hash_of_my_special_stuff )
.Ve
.SS "Specifying Data cleanup operations"
.IX Subsection "Specifying Data cleanup operations"
Data cleanup operations may be specified uniquely for each table. The
available keys are \f(CW\*(C`Chomp\*(C'\fR, \f(CW\*(C`Decode\*(C'\fR, \f(CW\*(C`DecodeNBSP\*(C'\fR, and \f(CW\*(C`Trim\*(C'\fR. They should be
set to a non-zero value if the operation is to be performed.
.SS "Other Attributes"
.IX Subsection "Other Attributes"
The \f(CW\*(C`MultiMatch\*(C'\fR key is used when a request is capable of handling
multiple tables in the document. Ordinarily, a request will process
a single table only (even \f(CW\*(C`DEFAULT\*(C'\fR requests).
Set it to a non-zero value to allow the request to handle more than
one table.
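.PP
The effect of \f(CW\*(C`MultiMatch\*(C'\fR can be sketched with a single request whose
id is a regular expression matching every top level table. The sample
\&\s-1HTML\s0 and the \f(CW%count\fR hash are assumptions made for illustration;
without \f(CW\*(C`MultiMatch\*(C'\fR the request would, per the text above, be used for
the first matching table only.
.PP
.Vb 10
\& use strict;
\& use warnings;
\& use HTML::TableParser;
\&
\& my %count;    # rows seen, keyed by table id
\&
\& my $p = HTML::TableParser\->new( [ {
\&     id         => qr/^\ed+$/,    # any top level table
\&     MultiMatch => 1,            # reuse this request for each match
\&     row        => sub { my ($id) = @_; $count{$id}++ },
\& } ] );
\&
\& # hypothetical document with two top level tables
\& $p\->parse( q{<table><tr><td>a</td></tr></table>}
\&          . q{<table><tr><td>b</td></tr><tr><td>c</td></tr></table>} );
\& $p\->eof;
\&
\& print "table $_: $count{$_} row(s)\en" for sort keys %count;
.Ve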
.SH "BUGS"
.IX Header "BUGS"
Please report any bugs or feature requests on the bugtracker website
or by email to bug\-HTML\-TableParser@rt.cpan.org.
.PP
When submitting a bug or request, please include a test-file or a
patch to an existing test-file that illustrates the bug or desired
feature.
.SH "SOURCE"
.IX Header "SOURCE"
The development version is hosted on GitHub.
.SH "AUTHOR"
.IX Header "AUTHOR"
Diab Jerius
.SH "COPYRIGHT AND LICENSE"
.IX Header "COPYRIGHT AND LICENSE"
This software is Copyright (c) 2018 by Smithsonian Astrophysical Observatory.
.PP
This is free software, licensed under:
.PP
.Vb 1
\& The GNU General Public License, Version 3, June 2007
.Ve