.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.40)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "lwptut 3pm"
.TH lwptut 3pm "2021-01-11" "perl v5.32.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
lwptut \-\- An LWP Tutorial
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
\&\s-1LWP\s0 (short for \*(L"Library for \s-1WWW\s0 in Perl\*(R") is a very popular group of
Perl modules for accessing data on the Web. Like most Perl
module-distributions, each of \s-1LWP\s0's component modules comes with
documentation that is a complete reference to its interface. However,
there are so many modules in \s-1LWP\s0 that it's hard to know where to start
looking for information on how to do even the simplest, most common
things.
.PP
Really introducing you to using \s-1LWP\s0 would require a whole book \*(-- a book
that just happens to exist, called \fIPerl & \s-1LWP\s0\fR. But this article
should give you a taste of how you can go about some common tasks with
\&\s-1LWP.\s0
.SS "Getting documents with LWP::Simple"
.IX Subsection "Getting documents with LWP::Simple"
If you just want to get what's at a particular \s-1URL,\s0 the simplest way
to do it is to use LWP::Simple's functions.
.PP
In a Perl program, you can call its \f(CW\*(C`get($url)\*(C'\fR function.  It will try
getting that \s-1URL\s0's content.  If it works, then it'll return the
content; but if there's some error, it'll return undef.
.PP
.Vb 2
\&  my $url = \*(Aqhttp://www.npr.org/programs/fa/?todayDate=current\*(Aq;
\&    # Just an example: the URL for the most recent /Fresh Air/ show
\&
\&  use LWP::Simple;
\&  my $content = get $url;
\&  die "Couldn\*(Aqt get $url" unless defined $content;
\&
\&  # Then go do things with $content, like this:
\&
\&  if($content =~ m/jazz/i) {
\&    print "They\*(Aqre talking about jazz today on Fresh Air!\en";
\&  }
\&  else {
\&    print "Fresh Air is apparently jazzless today.\en";
\&  }
.Ve
.PP
The handiest variant on \f(CW\*(C`get\*(C'\fR is \f(CW\*(C`getprint\*(C'\fR, which is useful in Perl
one-liners.  If it can get the page whose \s-1URL\s0 you provide, it sends it
to \s-1STDOUT\s0; otherwise it complains to \s-1STDERR.\s0
.PP
.Vb 1
\&  % perl \-MLWP::Simple \-e "getprint \*(Aqhttp://www.cpan.org/RECENT\*(Aq"
.Ve
.PP
That is the \s-1URL\s0 of a plain text file that lists new files in \s-1CPAN\s0 in
the past two weeks.  You can easily make it part of a tidy little
shell command, like this one that mails you the list of new
\&\f(CW\*(C`Acme::\*(C'\fR modules:
.PP
.Vb 2
\&  % perl \-MLWP::Simple \-e "getprint \*(Aqhttp://www.cpan.org/RECENT\*(Aq"  \e
\&     | grep "/by\-module/Acme" | mail \-s "New Acme modules! Joy!" $USER
.Ve
.PP
There are other useful functions in LWP::Simple, including one function
for running a \s-1HEAD\s0 request on a \s-1URL\s0 (useful for checking links, or
getting the last-revised time of a \s-1URL\s0), and two functions for
saving/mirroring a \s-1URL\s0 to a local file. See the LWP::Simple
documentation for the full details, or chapter 2 of \fIPerl
& \s-1LWP\s0\fR for more examples.
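.PP
As a rough sketch of those other functions (the local filenames here are
placeholders made up for illustration), \f(CW\*(C`head\*(C'\fR, \f(CW\*(C`getstore\*(C'\fR, and
\&\f(CW\*(C`mirror\*(C'\fR might be used something like this:
.PP
.Vb 8
\&  use LWP::Simple;  # also exports is_success() from HTTP::Status
\&
\&  my $url = \*(Aqhttp://www.cpan.org/RECENT\*(Aq;
\&
\&  # In list context, head() returns (content_type, document_length,
\&  # modified_time, expires, server), or an empty list on failure.
\&  if( my @info = head($url) ) {
\&    print "Type: $info[0]\en" if defined $info[0];
\&    print "Last modified: ", scalar localtime($info[2]), "\en" if $info[2];
\&  }
\&
\&  # getstore() saves the content to a file and returns the HTTP status code.
\&  my $status = getstore($url, \*(Aqrecent.txt\*(Aq);   # placeholder filename
\&  die "Couldn\*(Aqt save $url \-\- status $status" unless is_success($status);
\&
\&  # mirror() fetches the document only if it is newer than the local
\&  # file; it returns 304 (Not Modified) when nothing has changed.
\&  $status = mirror($url, \*(Aqrecent\-mirror.txt\*(Aq);  # placeholder filename
\&  print "Mirror status: $status\en";
.Ve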
.SS "The Basics of the \s-1LWP\s0 Class Model"
.IX Subsection "The Basics of the LWP Class Model"
LWP::Simple's functions are handy for simple cases, but they
don't support cookies or authorization, don't support setting header
lines in the \s-1HTTP\s0 request, and generally don't support reading header lines
in the \s-1HTTP\s0 response (notably the full \s-1HTTP\s0 error message, in case of an
error). To get at all those features, you'll have to use the full \s-1LWP\s0
class model.
.PP
While \s-1LWP\s0 consists of dozens of classes, the main two that you have to
understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent
is a class for \*(L"virtual browsers\*(R" which you use for performing requests,
and HTTP::Response is a class for the responses (or error messages)
that you get back from those requests.
.PP
The basic idiom is \f(CW\*(C`$response = $browser\->get($url)\*(C'\fR, or more fully
illustrated:
.PP
.Vb 1
\&  # Early in your program:
\&  
\&  use LWP 5.64; # Loads all important LWP classes, and makes
\&                #  sure your version is reasonably recent.
\&
\&  my $browser = LWP::UserAgent\->new;
\&  
\&  ...
\&  
\&  # Then later, whenever you need to make a get request:
\&  my $url = \*(Aqhttp://www.npr.org/programs/fa/?todayDate=current\*(Aq;
\&  
\&  my $response = $browser\->get( $url );
\&  die "Can\*(Aqt get $url \-\- ", $response\->status_line
\&   unless $response\->is_success;
\&
\&  die "Hey, I was expecting HTML, not ", $response\->content_type
\&   unless $response\->content_type eq \*(Aqtext/html\*(Aq;
\&     # or whatever content\-type you\*(Aqre equipped to deal with
\&
\&  # Otherwise, process the content somehow:
\&  
\&  if($response\->decoded_content =~ m/jazz/i) {
\&    print "They\*(Aqre talking about jazz today on Fresh Air!\en";
\&  }
\&  else {
\&    print "Fresh Air is apparently jazzless today.\en";
\&  }
.Ve
.PP
There are two objects involved: \f(CW$browser\fR, which holds an object of
class LWP::UserAgent, and then the \f(CW$response\fR object, which is of
class HTTP::Response. You really need only one browser object per
program; but every time you make a request, you get back a new
HTTP::Response object, which will have some interesting attributes:
.IP "\(bu" 4
A status code indicating
success or failure
(which you can test with \f(CW\*(C`$response\->is_success\*(C'\fR).
.IP "\(bu" 4
An \s-1HTTP\s0 status
line that is hopefully informative if there's failure (which you can
see with \f(CW\*(C`$response\->status_line\*(C'\fR,
returning something like \*(L"404 Not Found\*(R").
.IP "\(bu" 4
A \s-1MIME\s0 content-type like \*(L"text/html\*(R", \*(L"image/gif\*(R",
\&\*(L"application/xml\*(R", etc., which you can see with
\&\f(CW\*(C`$response\->content_type\*(C'\fR.
.IP "\(bu" 4
The actual content of the response, in \f(CW\*(C`$response\->decoded_content\*(C'\fR.
If the response is \s-1HTML,\s0 that's where the \s-1HTML\s0 source will be; if
it's a \s-1GIF,\s0 then \f(CW\*(C`$response\->decoded_content\*(C'\fR will be the binary
\&\s-1GIF\s0 data.
.IP "\(bu" 4
And dozens of other convenient and more specific methods that are
documented in the docs for HTTP::Response, and its superclasses
HTTP::Message and HTTP::Headers.
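.PP
To make that list a little more concrete, here is a small sketch that
prints a few of those attributes.  It assumes that \f(CW$response\fR came from
an earlier \f(CW\*(C`$browser\->get($url)\*(C'\fR call; the choice of the
\&\*(L"Last-Modified\*(R" header is just for illustration:
.PP
.Vb 8
\&  print "Status code:  ", $response\->code, "\en";         # like 200 or 404
\&  print "Status line:  ", $response\->status_line, "\en";
\&  print "Content type: ", $response\->content_type, "\en";
\&
\&  # header() is inherited from HTTP::Headers; it returns undef if the
\&  # server didn\*(Aqt send that header line.
\&  my $modified = $response\->header(\*(AqLast\-Modified\*(Aq);
\&  print "Last modified: $modified\en" if defined $modified;
\&
\&  # content() is the raw body, so this counts bytes, not characters.
\&  print "Body length:  ", length( $response\->content ), " bytes\en";
.Ve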
.SS "Adding Other \s-1HTTP\s0 Request Headers"
.IX Subsection "Adding Other HTTP Request Headers"
The most commonly used syntax for requests is \f(CW\*(C`$response =
$browser\->get($url)\*(C'\fR, but in truth, you can add extra \s-1HTTP\s0 header
lines to the request by adding a list of key-value pairs after the \s-1URL,\s0
like so:
.PP
.Vb 1
\&  $response = $browser\->get( $url, $key1, $value1, $key2, $value2, ... );
.Ve
.PP
For example, here's how to send some commonly used headers, in case
you're dealing with a site that would otherwise reject your request:
.PP
.Vb 6
\&  my @ns_headers = (
\&   \*(AqUser\-Agent\*(Aq => \*(AqMozilla/4.76 [en] (Win98; U)\*(Aq,
\&   \*(AqAccept\*(Aq => \*(Aqimage/gif, image/x\-xbitmap, image/jpeg, image/pjpeg, image/png, */*\*(Aq,
\&   \*(AqAccept\-Charset\*(Aq => \*(Aqiso\-8859\-1,*,utf\-8\*(Aq,
\&   \*(AqAccept\-Language\*(Aq => \*(Aqen\-US\*(Aq,
\&  );
\&
\&  ...
\&  
\&  $response = $browser\->get($url, @ns_headers);
.Ve
.PP
If you weren't reusing that array, you could just go ahead and do this:
.PP
.Vb 6
\&  $response = $browser\->get($url,
\&   \*(AqUser\-Agent\*(Aq => \*(AqMozilla/4.76 [en] (Win98; U)\*(Aq,
\&   \*(AqAccept\*(Aq => \*(Aqimage/gif, image/x\-xbitmap, image/jpeg, image/pjpeg, image/png, */*\*(Aq,
\&   \*(AqAccept\-Charset\*(Aq => \*(Aqiso\-8859\-1,*,utf\-8\*(Aq,
\&   \*(AqAccept\-Language\*(Aq => \*(Aqen\-US\*(Aq,
\&  );
.Ve
.PP
If you were only ever changing the 'User\-Agent' line, you could just change
the \f(CW$browser\fR object's default line from \*(L"libwww\-perl/5.65\*(R" (or the like)
to whatever you like, using the LWP::UserAgent \f(CW\*(C`agent\*(C'\fR method:
.PP
.Vb 1
\&   $browser\->agent(\*(AqMozilla/4.76 [en] (Win98; U)\*(Aq);
.Ve
.SS "Enabling Cookies"
.IX Subsection "Enabling Cookies"
A default LWP::UserAgent object acts like a browser with its cookie
support turned off. You turn it on by setting
its \f(CW\*(C`cookie_jar\*(C'\fR attribute. A \*(L"cookie jar\*(R" is an object representing
a little database of all
the \s-1HTTP\s0 cookies that a browser knows about. It can correspond to a
file on disk, or it can be
an in-memory object that starts out empty and whose collection of
cookies will disappear once the program is finished running.
.PP
To give a browser an in-memory empty cookie jar, you set its \f(CW\*(C`cookie_jar\*(C'\fR
attribute like so:
.PP
.Vb 2
\&  use HTTP::CookieJar::LWP;
\&  $browser\->cookie_jar( HTTP::CookieJar::LWP\->new );
.Ve
.PP
To save a cookie jar to disk, see \*(L"dump_cookies\*(R" in HTTP::CookieJar.
To load cookies from disk into a jar, see \*(L"load_cookies\*(R" in HTTP::CookieJar.
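.PP
As a minimal sketch of that save-and-restore cycle, you might write the
dumped cookies to a file, one per line, and feed them back in on the next
run.  The filename \fIcookies.txt\fR is made up for this example, the
\&\f(CW$browser\fR object is assumed to exist already, and error handling is
kept deliberately short:
.PP
.Vb 8
\&  use HTTP::CookieJar::LWP;
\&
\&  my $cookie_file = \*(Aqcookies.txt\*(Aq;   # hypothetical filename
\&  my $jar = HTTP::CookieJar::LWP\->new;
\&
\&  # On startup: reload whatever an earlier run saved, if anything.
\&  if( open my $in, \*(Aq<\*(Aq, $cookie_file ) {
\&    chomp( my @saved = <$in> );
\&    close $in;
\&    $jar\->load_cookies(@saved);
\&  }
\&  $browser\->cookie_jar($jar);
\&
\&  # ... make requests with $browser ...
\&
\&  # Before exiting: dump_cookies returns a list of cookie strings;
\&  # the "persistent" flag leaves out session\-only cookies.
\&  open my $out, \*(Aq>\*(Aq, $cookie_file or die "Can\*(Aqt write $cookie_file: $!";
\&  print $out "$_\en" for $jar\->dump_cookies( { persistent => 1 } );
\&  close $out;
.Ve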
.SS "Posting Form Data"
.IX Subsection "Posting Form Data"
Many \s-1HTML\s0 forms send data to their server using an \s-1HTTP POST\s0 request, which
you can send with this syntax:
.PP
.Vb 7
\& $response = $browser\->post( $url,
\&   [
\&     formkey1 => value1, 
\&     formkey2 => value2, 
\&     ...
\&   ],
\& );
.Ve
.PP
Or if you need to send \s-1HTTP\s0 headers:
.PP
.Vb 9
\& $response = $browser\->post( $url,
\&   [
\&     formkey1 => value1, 
\&     formkey2 => value2, 
\&     ...
\&   ],
\&   headerkey1 => value1, 
\&   headerkey2 => value2, 
\& );
.Ve
.PP
For example, the following program makes an AltaVista-style search request
to Yahoo Search
(by sending some form data via an \s-1HTTP POST\s0 request), and extracts from
the \s-1HTML\s0 the report of the number of matches:
.PP
.Vb 4
\&  use strict;
\&  use warnings;
\&  use LWP 5.64;
\&  my $browser = LWP::UserAgent\->new;
\&
\&  my $word = \*(Aqtarragon\*(Aq;
\&
\&  my $url = \*(Aqhttp://search.yahoo.com/yhs/search\*(Aq;
\&  my $response = $browser\->post( $url,
\&    [ \*(Aqq\*(Aq => $word,  # the Altavista query string
\&      \*(Aqfr\*(Aq => \*(Aqaltavista\*(Aq, \*(Aqpg\*(Aq => \*(Aqq\*(Aq, \*(Aqavkw\*(Aq => \*(Aqtgz\*(Aq, \*(Aqkl\*(Aq => \*(AqXX\*(Aq,
\&    ]
\&  );
\&  die "$url error: ", $response\->status_line
\&   unless $response\->is_success;
\&  die "Weird content type at $url \-\- ", $response\->content_type
\&   unless $response\->content_is_html;
\&
\&  if( $response\->decoded_content =~ m{([0\-9,]+)(?:<.*?>)? results for} ) {
\&    # The substring will be like "996,000</strong> results for"
\&    print "$word: $1\en";
\&  }
\&  else {
\&    print "Couldn\*(Aqt find the match\-string in the response\en";
\&  }
.Ve
.SS "Sending \s-1GET\s0 Form Data"
.IX Subsection "Sending GET Form Data"
Some \s-1HTML\s0 forms convey their form data not by sending the data
in an \s-1HTTP POST\s0 request, but by making a normal \s-1GET\s0 request with
the data stuck on the end of the \s-1URL.\s0  For example, if you went to
\&\f(CW\*(C`www.imdb.com\*(C'\fR and ran a search on \*(L"Blade Runner\*(R", the \s-1URL\s0 you'd see
in your browser window would be:
.PP
.Vb 1
\&  http://www.imdb.com/find?s=all&q=Blade+Runner
.Ve
.PP
To run the same search with \s-1LWP,\s0 you'd use this idiom, which involves
the \s-1URI\s0 class:
.PP
.Vb 3
\&  use URI;
\&  my $url = URI\->new( \*(Aqhttp://www.imdb.com/find\*(Aq );
\&    # makes an object representing the URL
\&
\&  $url\->query_form(  # And here the form data pairs:
\&    \*(Aqq\*(Aq => \*(AqBlade Runner\*(Aq,
\&    \*(Aqs\*(Aq => \*(Aqall\*(Aq,
\&  );
\&
\&  my $response = $browser\->get($url);
.Ve
.PP
See chapter 5 of \fIPerl & \s-1LWP\s0\fR for a longer discussion of \s-1HTML\s0 forms
and of form data, and chapters 6 through 9 for a longer discussion of
extracting data from \s-1HTML.\s0
.SS "Absolutizing URLs"
.IX Subsection "Absolutizing URLs"
The \s-1URI\s0 class that we just mentioned above provides all sorts of methods
for accessing and modifying parts of URLs (such as asking what sort of \s-1URL\s0 it
is with \f(CW\*(C`$url\->scheme\*(C'\fR, asking what host it refers to with \f(CW\*(C`$url\->host\*(C'\fR, and so on), as described in the docs for the \s-1URI\s0
class.  However, the methods of most immediate interest
are the \f(CW\*(C`query_form\*(C'\fR method seen above, and now the \f(CW\*(C`new_abs\*(C'\fR method
for taking a probably-relative \s-1URL\s0 string (like \*(L"../foo.html\*(R") and getting
back an absolute \s-1URL\s0 (like \*(L"http://www.perl.com/stuff/foo.html\*(R"), as
shown here:
.PP
.Vb 2
\&  use URI;
\&  $abs = URI\->new_abs($maybe_relative, $base);
.Ve
.PP
For example, consider this program that matches URLs in the \s-1HTML\s0
list of new modules in \s-1CPAN:\s0
.PP
.Vb 4
\&  use strict;
\&  use warnings;
\&  use LWP;
\&  my $browser = LWP::UserAgent\->new;
\&  
\&  my $url = \*(Aqhttp://www.cpan.org/RECENT.html\*(Aq;
\&  my $response = $browser\->get($url);
\&  die "Can\*(Aqt get $url \-\- ", $response\->status_line
\&   unless $response\->is_success;
\&  
\&  my $html = $response\->decoded_content;
\&  while( $html =~ m/<A HREF=\e"(.*?)\e"/g ) {
\&    print "$1\en";
\&  }
.Ve
.PP
When run, it emits output that starts out something like this:
.PP
.Vb 7
\&  MIRRORING.FROM
\&  RECENT
\&  RECENT.html
\&  authors/00whois.html
\&  authors/01mailrc.txt.gz
\&  authors/id/A/AA/AASSAD/CHECKSUMS
\&  ...
.Ve
.PP
However, if you actually want to have those be absolute URLs, you
can use the \s-1URI\s0 module's \f(CW\*(C`new_abs\*(C'\fR method, by changing the \f(CW\*(C`while\*(C'\fR
loop to this:
.PP
.Vb 3
\&  while( $html =~ m/<A HREF=\e"(.*?)\e"/g ) {
\&    print URI\->new_abs( $1, $response\->base ) ,"\en";
\&  }
.Ve
.PP
(The \f(CW\*(C`$response\->base\*(C'\fR method from HTTP::Response
is for returning what \s-1URL\s0
should be used for resolving relative URLs \*(-- it's usually just
the same as the \s-1URL\s0 that you requested.)
.PP
That program then emits nicely absolute URLs:
.PP
.Vb 7
\&  http://www.cpan.org/MIRRORING.FROM
\&  http://www.cpan.org/RECENT
\&  http://www.cpan.org/RECENT.html
\&  http://www.cpan.org/authors/00whois.html
\&  http://www.cpan.org/authors/01mailrc.txt.gz
\&  http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
\&  ...
.Ve
.PP
See chapter 4 of \fIPerl & \s-1LWP\s0\fR for a longer discussion of \s-1URI\s0 objects.
.PP
Of course, using a regexp to match hrefs is a bit simplistic, and for
more robust programs, you'll probably want to use an HTML-parsing module
like HTML::LinkExtor or HTML::TokeParser or even maybe
HTML::TreeBuilder.
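.PP
As a hedged sketch of that more robust approach, here is roughly how the
loop above could be redone with HTML::LinkExtor.  It assumes the
\&\f(CW$response\fR object from the program above; the \f(CW@links\fR array and the
choice to keep only \f(CW\*(C`<a href=...>\*(C'\fR links are just for illustration:
.PP
.Vb 8
\&  use HTML::LinkExtor;
\&
\&  my @links;
\&  my $extractor = HTML::LinkExtor\->new(
\&    sub {
\&      # Called once per link\-bearing tag; tag names arrive lowercased.
\&      my($tag, %attr) = @_;
\&      push @links, $attr{href} if $tag eq \*(Aqa\*(Aq and defined $attr{href};
\&    },
\&    $response\->base,   # with a base URL given, links come back absolute
\&  );
\&  $extractor\->parse( $response\->decoded_content );
\&  $extractor\->eof;
\&
\&  print "$_\en" for @links;
.Ve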
.SS "Other Browser Attributes"
.IX Subsection "Other Browser Attributes"
LWP::UserAgent objects have many attributes for controlling how they
work.  Here are a few notable ones:
.IP "\(bu" 4
\&\f(CW\*(C`$browser\->timeout(15);\*(C'\fR
.Sp
This sets this browser object to give up on requests that don't answer
within 15 seconds.
.IP "\(bu" 4
\&\f(CW\*(C`$browser\->protocols_allowed( [ \*(Aqhttp\*(Aq, \*(Aqgopher\*(Aq] );\*(C'\fR
.Sp
This sets this browser object to not speak any protocols other than \s-1HTTP\s0
and gopher. If it tries accessing any other kind of \s-1URL\s0 (like an \*(L"ftp:\*(R"
or \*(L"mailto:\*(R" or \*(L"news:\*(R" \s-1URL\s0), then it won't actually try connecting, but
instead will immediately return an error code 500, with a message like
\&\*(L"Access to 'ftp' URIs has been disabled\*(R".
.IP "\(bu" 4
\&\f(CW\*(C`use LWP::ConnCache; $browser\->conn_cache(LWP::ConnCache\->new());\*(C'\fR
.Sp
This tells the browser object to try using the \s-1HTTP/1.1\s0 \*(L"Keep-Alive\*(R"
feature, which speeds up requests by reusing the same socket connection
for multiple requests to the same server.
.IP "\(bu" 4
\&\f(CW\*(C`$browser\->agent( \*(AqSomeName/1.23 (more info here maybe)\*(Aq )\*(C'\fR
.Sp
This changes how the browser object will identify itself in
the default \*(L"User-Agent\*(R" line in its \s-1HTTP\s0 requests.  By default,
it'll send \*(L"libwww\-perl/\fIversionnumber\fR\*(R", like
\&\*(L"libwww\-perl/5.65\*(R".  You can change that to something more descriptive
like this:
.Sp
.Vb 1
\&  $browser\->agent( \*(AqSomeName/3.14 (contact@robotplexus.int)\*(Aq );
.Ve
.Sp
Or if need be, you can go in disguise, like this:
.Sp
.Vb 1
\&  $browser\->agent( \*(AqMozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)\*(Aq );
.Ve
.IP "\(bu" 4
\&\f(CW\*(C`push @{ $browser\->requests_redirectable }, \*(AqPOST\*(Aq;\*(C'\fR
.Sp
This tells this browser to obey redirection responses to \s-1POST\s0 requests
(like most modern interactive browsers), even though the \s-1HTTP RFC\s0 says
that should not normally be done.
.PP
For more options and information, see the full documentation for
LWP::UserAgent.
.SS "Writing Polite Robots"
.IX Subsection "Writing Polite Robots"
If you want to make sure that your LWP-based program respects \fIrobots.txt\fR
files and doesn't make too many requests too fast, you can use the LWP::RobotUA
class instead of the LWP::UserAgent class.
.PP
The LWP::RobotUA class is just like LWP::UserAgent, and you can use it like so:
.PP
.Vb 3
\&  use LWP::RobotUA;
\&  my $browser = LWP::RobotUA\->new(\*(AqYourSuperBot/1.34\*(Aq, \*(Aqyou@yoursite.com\*(Aq);
\&    # Your bot\*(Aqs name and your email address
\&
\&  my $response = $browser\->get($url);
.Ve
.PP
But LWP::RobotUA adds these features:
.IP "\(bu" 4
If the \fIrobots.txt\fR on \f(CW$url\fR's server forbids you from accessing
\&\f(CW$url\fR, then the \f(CW$browser\fR object (assuming it's of class LWP::RobotUA)
won't actually request it, but instead will give you back (in \f(CW$response\fR) a 403 error
with a message \*(L"Forbidden by robots.txt\*(R".  That is, if you have this line:
.Sp
.Vb 2
\&  die "$url \-\- ", $response\->status_line, "\enAborted"
\&   unless $response\->is_success;
.Ve
.Sp
then the program would die with an error message like this:
.Sp
.Vb 2
\&  http://whatever.site.int/pith/x.html \-\- 403 Forbidden by robots.txt
\&  Aborted at whateverprogram.pl line 1234
.Ve
.IP "\(bu" 4
If this \f(CW$browser\fR object sees that the last time it talked to
\&\f(CW$url\fR's server was too recent, then it will pause (via \f(CW\*(C`sleep\*(C'\fR) to
avoid making too many requests too often. By default it pauses for
one minute, but you can control that with the \f(CW\*(C`$browser\->delay( \f(CIminutes\f(CW )\*(C'\fR attribute.
.Sp
For example, this code:
.Sp
.Vb 1
\&  $browser\->delay( 7/60 );
.Ve
.Sp
\&...means that this browser will pause when it needs to avoid talking to
any given server more than once every 7 seconds.
.PP
For more options and information, see the full documentation for
LWP::RobotUA.
.SS "Using Proxies"
.IX Subsection "Using Proxies"
In some cases, you will want to (or will have to) use proxies for
accessing certain sites and/or using certain protocols. This is most
commonly the case when your \s-1LWP\s0 program is running (or could be running)
on a machine that is behind a firewall.
.PP
To make a browser object use proxies that are defined in the usual
environment variables (\f(CW\*(C`HTTP_PROXY\*(C'\fR, etc.), just call the \f(CW\*(C`env_proxy\*(C'\fR
method on a user-agent object before you go making any requests on it.
Specifically:
.PP
.Vb 2
\&  use LWP::UserAgent;
\&  my $browser = LWP::UserAgent\->new;
\&  
\&  # And before you go making any requests:
\&  $browser\->env_proxy;
.Ve
.PP
For more information on proxy parameters, see the LWP::UserAgent
documentation, specifically the \f(CW\*(C`proxy\*(C'\fR, \f(CW\*(C`env_proxy\*(C'\fR,
and \f(CW\*(C`no_proxy\*(C'\fR methods.
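.PP
If the environment variables aren't set, a sketch of configuring proxies
by hand might look like the following.  The proxy host, port, and domain
names are invented for this example, so substitute your own:
.PP
.Vb 4
\&  # Send http and ftp requests through one (hypothetical) proxy...
\&  $browser\->proxy( [\*(Aqhttp\*(Aq, \*(Aqftp\*(Aq], \*(Aqhttp://proxy.example.int:8080/\*(Aq );
\&
\&  # ...but contact these domains directly, bypassing the proxy.
\&  $browser\->no_proxy( \*(Aqlocalhost\*(Aq, \*(Aqintranet.example.int\*(Aq );
.Ve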
.SS "\s-1HTTP\s0 Authentication"
.IX Subsection "HTTP Authentication"
Many web sites restrict access to documents by using \*(L"\s-1HTTP\s0
Authentication\*(R". This isn't just any form of \*(L"enter your password\*(R"
restriction, but is a specific mechanism where the \s-1HTTP\s0 server sends the
browser an \s-1HTTP\s0 code that says \*(L"That document is part of a protected
\&'realm', and you can access it only if you re-request it and add some
special authorization headers to your request\*(R".
.PP
For example, the Unicode.org admins stop email-harvesting bots from
harvesting the contents of their mailing list archives, by protecting
them with \s-1HTTP\s0 Authentication, and then publicly stating the username
and password (at \f(CW\*(C`http://www.unicode.org/mail\-arch/\*(C'\fR) \*(-- namely
username \*(L"unicode-ml\*(R" and password \*(L"unicode\*(R".
.PP
Consider this \s-1URL,\s0 which is part of the protected
area of the web site:
.PP
.Vb 1
\&  http://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html
.Ve
.PP
If you access that with a browser, you'll get a prompt
like 
\&\*(L"Enter username and password for 'Unicode\-MailList\-Archives' at server
\&'www.unicode.org'\*(R".
.PP
In \s-1LWP,\s0 if you just request that \s-1URL,\s0 like this:
.PP
.Vb 2
\&  use LWP;
\&  my $browser = LWP::UserAgent\->new;
\&
\&  my $url =
\&   \*(Aqhttp://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html\*(Aq;
\&  my $response = $browser\->get($url);
\&
\&  die "Error: ", $response\->header(\*(AqWWW\-Authenticate\*(Aq) || \*(AqError accessing\*(Aq,
\&    #  (\*(AqWWW\-Authenticate\*(Aq is the realm\-name)
\&    "\en ", $response\->status_line, "\en at $url\en Aborting"
\&   unless $response\->is_success;
.Ve
.PP
Then you'll get this error:
.PP
.Vb 4
\&  Error: Basic realm="Unicode\-MailList\-Archives"
\&   401 Authorization Required
\&   at http://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html
\&   Aborting at auth1.pl line 9.  [or wherever]
.Ve
.PP
\&...because the \f(CW$browser\fR doesn't know the username and password
for that realm (\*(L"Unicode-MailList-Archives\*(R") at that host
(\*(L"www.unicode.org\*(R").  The simplest way to let the browser know about this
is to use the \f(CW\*(C`credentials\*(C'\fR method to let it know about a username and
password that it can try using for that realm at that host.  The syntax is:
.PP
.Vb 5
\&  $browser\->credentials(
\&    \*(Aqservername:portnumber\*(Aq,
\&    \*(Aqrealm\-name\*(Aq,
\&    \*(Aqusername\*(Aq => \*(Aqpassword\*(Aq
\&  );
.Ve
.PP
In most cases, the port number is 80, the default \s-1TCP/IP\s0 port for \s-1HTTP\s0; and
you usually call the \f(CW\*(C`credentials\*(C'\fR method before you make any requests.
For example:
.PP
.Vb 5
\&  $browser\->credentials(
\&    \*(Aqreports.mybazouki.com:80\*(Aq,
\&    \*(Aqweb_server_usage_reports\*(Aq,
\&    \*(Aqplinky\*(Aq => \*(Aqbanjo123\*(Aq
\&  );
.Ve
.PP
So if we add the following to the program above, right after the \f(CW\*(C`$browser = LWP::UserAgent\->new;\*(C'\fR line...
.PP
.Vb 5
\&  $browser\->credentials(  # add this to our $browser\*(Aqs "key ring"
\&    \*(Aqwww.unicode.org:80\*(Aq,
\&    \*(AqUnicode\-MailList\-Archives\*(Aq,
\&    \*(Aqunicode\-ml\*(Aq => \*(Aqunicode\*(Aq
\&  );
.Ve
.PP
\&...then when we run it, the request succeeds, instead of causing the
\&\f(CW\*(C`die\*(C'\fR to be called.
.SS "Accessing \s-1HTTPS\s0 URLs"
.IX Subsection "Accessing HTTPS URLs"
When you access an \s-1HTTPS URL,\s0 it'll work for you just like an \s-1HTTP URL\s0
would \*(-- if your \s-1LWP\s0 installation has \s-1HTTPS\s0 support (via an appropriate
Secure Sockets Layer library).  For example:
.PP
.Vb 8
\&  use LWP;
\&  my $url = \*(Aqhttps://www.paypal.com/\*(Aq;   # Yes, HTTPS!
\&  my $browser = LWP::UserAgent\->new;
\&  my $response = $browser\->get($url);
\&  die "Error at $url\en ", $response\->status_line, "\en Aborting"
\&   unless $response\->is_success;
\&  print "Whee, it worked!  I got that ",
\&   $response\->content_type, " document!\en";
.Ve
.PP
If your \s-1LWP\s0 installation doesn't have \s-1HTTPS\s0 support set up, then the
response will be unsuccessful, and you'll get this error message:
.PP
.Vb 3
\&  Error at https://www.paypal.com/
\&   501 Protocol scheme \*(Aqhttps\*(Aq is not supported
\&   Aborting at paypal.pl line 7.   [or whatever program and line]
.Ve
.PP
If your \s-1LWP\s0 installation \fIdoes\fR have \s-1HTTPS\s0 support installed, then the
response should be successful, and you should be able to consult
\&\f(CW$response\fR just like with any normal \s-1HTTP\s0 response.
.PP
For information about installing \s-1HTTPS\s0 support for your \s-1LWP\s0
installation, see the helpful \fI\s-1README.SSL\s0\fR file that comes in the
libwww-perl distribution.
.SS "Getting Large Documents"
.IX Subsection "Getting Large Documents"
When you're requesting a large (or at least potentially large) document,
a problem with the normal way of using the request methods (like \f(CW\*(C`$response = $browser\->get($url)\*(C'\fR) is that the response object in
memory will have to hold the whole document \*(-- \fIin memory\fR. If the
response is a thirty megabyte file, this is likely to be quite an
imposition on this process's memory usage.
.PP
A notable alternative is to have \s-1LWP\s0 save the content to a file on disk,
instead of saving it up in memory.  This is the syntax to use:
.PP
.Vb 3
\&  $response = $browser\->get($url,
\&                              \*(Aq:content_file\*(Aq => $filespec,
\&                           );
.Ve
.PP
For example,
.PP
.Vb 3
\&  $response = $browser\->get(\*(Aqhttp://search.cpan.org/\*(Aq,
\&                              \*(Aq:content_file\*(Aq => \*(Aq/tmp/sco.html\*(Aq
\&                           );
.Ve
.PP
When you use this \f(CW\*(C`:content_file\*(C'\fR option, the \f(CW$response\fR will have
all the normal header lines, but \f(CW\*(C`$response\->content\*(C'\fR will be
empty.  Errors writing to the content file (for example due to
permission denied or the filesystem being full) will be reported via
the \f(CW\*(C`Client\-Aborted\*(C'\fR or \f(CW\*(C`X\-Died\*(C'\fR response headers, and not the
\&\f(CW\*(C`is_success\*(C'\fR method:
.PP
.Vb 3
\&  if (($response\->header(\*(AqClient\-Aborted\*(Aq) // \*(Aq\*(Aq) eq \*(Aqdie\*(Aq) {
\&    # handle error ...
\&  }
.Ve
.PP
Note that this \*(L":content_file\*(R" option isn't supported under older
versions of \s-1LWP,\s0 so you should consider adding \f(CW\*(C`use LWP 5.66;\*(C'\fR to check
the \s-1LWP\s0 version, if you think your program might run on systems with
older versions.
.PP
If you need to be compatible with older \s-1LWP\s0 versions, then use
this syntax, which does the same thing:
.PP
.Vb 2
\&  use HTTP::Request::Common;
\&  $response = $browser\->request( GET($url), $filespec );
.Ve
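.PP
If you would rather process the data as it arrives than store it at all,
LWP::UserAgent also accepts a \f(CW\*(C`:content_cb\*(C'\fR option naming a callback
that is handed the content one chunk at a time.  The following is only a
sketch: it assumes the \f(CW$browser\fR and \f(CW$url\fR variables from earlier, and
the byte counter stands in for whatever real processing you need:
.PP
.Vb 8
\&  my $bytes = 0;
\&  my $response = $browser\->get( $url,
\&    \*(Aq:content_cb\*(Aq => sub {
\&      my($chunk, $res, $protocol) = @_;
\&      $bytes += length $chunk;   # or parse $chunk, write it out, etc.
\&    },
\&  );
\&  die "Can\*(Aqt get $url \-\- ", $response\->status_line
\&   unless $response\->is_success;
\&  print "Received $bytes bytes\en";
.Ve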
.SH "SEE ALSO"
.IX Header "SEE ALSO"
Remember, this article is just the most rudimentary introduction to
\&\s-1LWP\s0 \*(-- to learn more about \s-1LWP\s0 and LWP-related tasks, you really
must read from the following:
.IP "\(bu" 4
LWP::Simple \*(-- simple functions for getting/heading/mirroring URLs
.IP "\(bu" 4
\&\s-1LWP\s0 \*(-- overview of the libwww-perl modules
.IP "\(bu" 4
LWP::UserAgent \*(-- the class for objects that represent \*(L"virtual browsers\*(R"
.IP "\(bu" 4
HTTP::Response \*(-- the class for objects that represent the response to
an \s-1LWP\s0 request, as in \f(CW\*(C`$response = $browser\->get(...)\*(C'\fR
.IP "\(bu" 4
HTTP::Message and HTTP::Headers \*(-- classes that provide more methods
to HTTP::Response.
.IP "\(bu" 4
\&\s-1URI\s0 \*(-- class for objects that represent absolute or relative URLs
.IP "\(bu" 4
URI::Escape \*(-- functions for URL-escaping and URL-unescaping strings
(like turning \*(L"this & that\*(R" to and from \*(L"this%20%26%20that\*(R").
.IP "\(bu" 4
HTML::Entities \*(-- functions for HTML-escaping and HTML-unescaping strings
(like turning \*(L"C. & E. Brontë\*(R" to and from \*(L"C. &amp; E. Bront&euml;\*(R")
.IP "\(bu" 4
HTML::TokeParser and HTML::TreeBuilder \*(-- classes for parsing \s-1HTML\s0
.IP "\(bu" 4
HTML::LinkExtor \*(-- class for finding links in \s-1HTML\s0 documents
.IP "\(bu" 4
The book \fIPerl & \s-1LWP\s0\fR by Sean M. Burke.  O'Reilly & Associates, 
2002.  \s-1ISBN: 0\-596\-00178\-9,\s0 <http://oreilly.com/catalog/perllwp/>.  The
whole book is also available free online:
<http://lwp.interglacial.com>.
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright 2002, Sean M. Burke.  You can redistribute this document and/or
modify it, but only under the same terms as Perl itself.
.SH "AUTHOR"
.IX Header "AUTHOR"
Sean M. Burke \f(CW\*(C`sburke@cpan.org\*(C'\fR