.\" Automatically generated by Pod::Man 4.11 (Pod::Simple 3.35)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "LinkExtractor 3pm"
.TH LinkExtractor 3pm "2020-09-09" "perl v5.30.3" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.
.\" Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
HTML::LinkExtractor \- Extract links from an HTML document
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
HTML::LinkExtractor is used for extracting links from \s-1HTML.\s0
It is very similar to HTML::LinkExtor,
except that besides getting the \s-1URL,\s0 you also get the link-text.
.PP
Example ( \fBplease run the examples\fR ):
.PP
.Vb 2
\&    use HTML::LinkExtractor;
\&    use Data::Dumper;
\&
\&    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
\&    my $LX = new HTML::LinkExtractor();
\&
\&    $LX\->parse(\e$input);
\&
\&    print Dumper($LX\->links);
\&    _\|_END_\|_
\&    # the above example will yield
\&    $VAR1 = [
\&              {
\&                \*(Aq_TEXT\*(Aq => \*(Aq<a href="http://perl.com/"> I am a LINK!!! </a>\*(Aq,
\&                \*(Aqhref\*(Aq => bless(do{\e(my $o = \*(Aqhttp://perl.com/\*(Aq)}, \*(AqURI::http\*(Aq),
\&                \*(Aqtag\*(Aq => \*(Aqa\*(Aq
\&              }
\&            ];
.Ve
.PP
\&\f(CW\*(C`HTML::LinkExtractor\*(C'\fR will also correctly extract nested
\&\fIlink-type\fR tags.
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 3
\&    ## the demo
\&    perl LinkExtractor.pm
\&    perl LinkExtractor.pm file.html otherfile.html
\&
\&    ## or if the module is installed, but you don\*(Aqt know where
\&
\&    perl \-MHTML::LinkExtractor \-e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
\&    perl \-MHTML::LinkExtractor \-e\*(Aq system $^X, $INC{q{HTML/LinkExtractor.pm}} \*(Aq
\&
\&    ## or
\&
\&    use HTML::LinkExtractor;
\&    use LWP::Simple qw( get );
\&
\&    my $base = \*(Aqhttp://search.cpan.org\*(Aq;
\&    my $html = get($base.\*(Aq/recent\*(Aq);
\&    my $LX = new HTML::LinkExtractor();
\&
\&    $LX\->parse(\e$html);
\&
\&    print qq{\en};
\&
\&    for my $Link( @{ $LX\->links } ) {
\&        ## new modules are linked by /author/NAME/Dist
\&        if( $$Link{href}=~ m{^\e/author\e/\ew+} ) {
\&            print $$Link{_TEXT}."\en";
\&        }
\&    }
\&
\&    undef $LX;
\&    _\|_END_\|_
\&
\&    ## or
\&
\&    use HTML::LinkExtractor;
\&    use Data::Dumper;
\&
\&    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
\&    my $LX = new HTML::LinkExtractor(
\&        sub {
\&            print Data::Dumper::Dumper(@_);
\&        },
\&        \*(Aqhttp://perlFox.org/\*(Aq,
\&    );
\&
\&    $LX\->parse(\e$input);
\&    $LX\->strip(1);
\&    $LX\->parse(\e$input);
\&    _\|_END_\|_
\&
\&    #### Calculate the total size of a web\-page
\&    #### adds up the sizes of all the images and stylesheets and stuff
\&
\&    use strict;
\&    use LWP::Simple;    # exports get() and head()
\&    use HTML::LinkExtractor;
\&    #
\&    my $url = shift || \*(Aqhttp://www.google.com\*(Aq;
\&    my $html = get($url);
\&    my $Total = length $html;
\&    #
\&    print "initial size $Total\en";
\&    #
\&    my $LX = new HTML::LinkExtractor(
\&        sub {
\&            my( $X, $tag ) = @_;
\&    #
\&            unless( grep {$_ eq $tag\->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
\&    #
\&                print "$$tag{tag}\en";
\&    #
\&                for my $urlAttr ( @{$HTML::LinkExtractor::TAGS{$$tag{tag}}} ) {
\&                    if( exists $$tag{$urlAttr} ) {
\&                        my $size = (head( $$tag{$urlAttr} ))[1];
\&                        $Total += $size if $size;
\&                        print "adding $size\en" if $size;
\&                    }
\&                }
\&            }
\&        },
\&        $url,
\&        0
\&    );
\&    #
\&    $LX\->parse(\e$html);
\&    #
\&    print "The total size of \en$url\en is $Total bytes\en";
\&    _\|_END_\|_
.Ve
.SH "METHODS"
.IX Header "METHODS"
.ie n .SS """$LX\->new([\e&callback, [$baseUrl, [1]]])"""
.el .SS "\f(CW$LX\->new([\e&callback, [$baseUrl, [1]]])\fP"
.IX Subsection "$LX->new([&callback, [$baseUrl, [1]]])"
Accepts 3 arguments, all of which are optional.
If for example you want to pass a \f(CW$baseUrl\fR, but don't
want to have a callback invoked, just put \f(CW\*(C`undef\*(C'\fR in place of a subref.
.PP
This is the only class method.
.IP "1." 4
a callback ( a sub reference, as in \f(CW\*(C`sub{}\*(C'\fR, or \f(CW\*(C`\e&sub\*(C'\fR)
which is to be called each time a new \s-1LINK\s0 is encountered
( for \f(CW@HTML::LinkExtractor::TAGS_IN_NEED\fR this means
after the closing tag is encountered )
.Sp
The callback receives an object reference(\f(CW$LX\fR) and a link hashref.
.IP "2." 4
a base \s-1URL\s0 (passed to \s-1URI\-\s0>new, so it's up to you to make sure it's valid),
which is used to convert all relative \s-1URI\s0s to absolute ones.
.Sp
.Vb 1
\&    $ALinkP{href} = URI\->new_abs( $ALink{href}, $base );
.Ve
.IP "3." 4
A \*(L"boolean\*(R" (just stick with 1).
See the example in \*(L"\s-1DESCRIPTION\*(R"\s0.
Normally, you'd get back _TEXT that looks like
.Sp
.Vb 1
\&    \*(Aq_TEXT\*(Aq => \*(Aq<a href="http://perl.com/"> I am a LINK!!! </a>\*(Aq,
.Ve
.Sp
If you turn this option on, you'll get the following instead
.Sp
.Vb 1
\&    \*(Aq_TEXT\*(Aq => \*(Aq I am a LINK!!! \*(Aq,
.Ve
.Sp
The private utility function \f(CW\*(C`_stripHTML\*(C'\fR does this
by using HTML::TokeParser's method get_trimmed_text.
.Sp
You can turn this feature on and off by using
\&\f(CW\*(C`$LX\->strip(undef || 0 || 1)\*(C'\fR
.ie n .SS """$LX\->parse( $filename || *FILEHANDLE || \e$FileContent )"""
.el .SS "\f(CW$LX\->parse( $filename || *FILEHANDLE || \e$FileContent )\fP"
.IX Subsection "$LX->parse( $filename || *FILEHANDLE || $FileContent )"
Each time you call \f(CW\*(C`parse\*(C'\fR, you should pass it a
\&\f(CW$filename\fR, a \f(CW*FILEHANDLE\fR, or a \f(CW\*(C`\e$FileContent\*(C'\fR.
.PP
Each time you call \f(CW\*(C`parse\*(C'\fR a new \f(CW\*(C`HTML::TokeParser\*(C'\fR object
is created and stored in \f(CW\*(C`$this\->{_tp}\*(C'\fR.
.PP
You shouldn't need to mess with the TokeParser object.
.ie n .SS """$LX\->links()"""
.el .SS "\f(CW$LX\->links()\fP"
.IX Subsection "$LX->links()"
Only after you call \f(CW\*(C`parse\*(C'\fR will this method return anything.
This method returns a reference to an ArrayOfHashes,
which basically looks like (Data::Dumper output)
.PP
.Vb 1
\&    $VAR1 = [ { tag => \*(Aqimg\*(Aq, src => \*(Aqimage.png\*(Aq }, ];
.Ve
.PP
Please note that if you provide a callback this array will be empty.
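.PP
A minimal sketch of walking the structure returned by \f(CW\*(C`links\*(C'\fR, assuming no callback was passed to \f(CW\*(C`new\*(C'\fR.
The input string and the http://example.com/ base \s-1URL\s0 are made up for illustration,
and \f(CW\*(C`href\*(C'\fR and \f(CW\*(C`src\*(C'\fR are simply the attributes this particular input happens to use.
.PP
.Vb 3
\&    use strict;
\&    use warnings;
\&    use HTML::LinkExtractor;
\&
\&    # hypothetical input; example.com is only a placeholder
\&    my $html = q{<a href="/docs/">Docs</a> <img src="logo.png">};
\&
\&    # no callback (undef), a base URL, and the strip option turned on
\&    my $LX = HTML::LinkExtractor\->new( undef, \*(Aqhttp://example.com/\*(Aq, 1 );
\&    $LX\->parse( \e$html );
\&
\&    for my $link ( @{ $LX\->links } ) {
\&        # every element is a hashref that carries a \*(Aqtag\*(Aq key
\&        print "tag:  $$link{tag}\en";
\&        # an attribute is only present if the tag actually carried it
\&        for my $attr (qw( href src )) {
\&            print "$attr: $$link{$attr}\en" if exists $$link{$attr};
\&        }
\&        # _TEXT is only provided for tags in @HTML::LinkExtractor::TAGS_IN_NEED
\&        print "text: $$link{_TEXT}\en" if exists $$link{_TEXT};
\&    }
.Ve
.PP
If a callback had been passed as the first argument, the loop body would not run,
because the array stays empty in that case.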
.ie n .SS """$LX\->strip( [ 0 || 1 ])"""
.el .SS "\f(CW$LX\->strip( [ 0 || 1 ])\fP"
.IX Subsection "$LX->strip( [ 0 || 1 ])"
If you pass in \f(CW\*(C`undef\*(C'\fR (or nothing), returns the state of the option.
Passing in a true or false value sets the option.
.PP
If you wanna know what the option does see
\&\f(CW\*(C`$LX\->new([\e&callback, [$baseUrl, [1]]])\*(C'\fR
.SH "WHAT'S A LINK-type tag"
.IX Header "WHAT'S A LINK-type tag"
Take a look at \f(CW%HTML::LinkExtractor::TAGS\fR to see
what I consider to be a link-type tag.
.PP
Take a look at \f(CW@HTML::LinkExtractor::VALID_URL_ATTRIBUTES\fR to see
all the possible tag attributes which can contain \s-1URI\s0s (the links!!)
.PP
Take a look at \f(CW@HTML::LinkExtractor::TAGS_IN_NEED\fR to see
the tags for which the \f(CW\*(Aq_TEXT\*(Aq\fR attribute is provided,
like \f(CW\*(C` TEST \*(C'\fR
.SS "How can that be?!?!"
.IX Subsection "How can that be?!?!"
I took a look at \f(CW%HTML::Tagset::linkElements\fR and the following \s-1URL\s0s
.PP
.Vb 1
\&    http://www.blooberry.com/indexdot/html/tagindex/all.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/a/a\-hyperlink.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/a/area.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/d/div.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/l/link.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/o/object.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/q/q.htm
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
\&    http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm
\&
\&    And the special cases
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm
\&        \*(Aq!doctype\*(Aq is really a process instruction, but is still listed
\&        in %TAGS with \*(Aqurl\*(Aq as the attribute
\&
\&    and
\&
\&    http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm
\&        If there is a valid url, \*(Aqurl\*(Aq is set as the attribute.
\&        The meta tag has no \*(Aqattributes\*(Aq listed in %TAGS.
.Ve
.SH "SEE ALSO"
.IX Header "SEE ALSO"
HTML::LinkExtor, HTML::TokeParser, HTML::Tagset.
.SH "AUTHOR"
.IX Header "AUTHOR"
D.H (PodMaster)
.PP
Please use http://rt.cpan.org/ to report bugs.
.PP
Just go to
http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML\-Scrubber
to see a bug list and/or report new ones.
.SH "LICENSE"
.IX Header "LICENSE"
Copyright (c) 2003, 2004 by D.H. (PodMaster). All rights reserved.
.PP
This module is free software;
you can redistribute it and/or modify it under
the same terms as Perl itself.
The \s-1LICENSE\s0 file contains the full text of the license.