.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "HTTP::OAI::Harvester 3pm"
.TH HTTP::OAI::Harvester 3pm "2023-09-28" "perl v5.36.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
HTTP::OAI::Harvester \- Agent for harvesting from Open Archives version 1.0, 1.1, 2.0 and static ('2.0s') compatible repositories
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
\&\f(CW\*(C`HTTP::OAI::Harvester\*(C'\fR is the harvesting front-end in the OAI-PERL library.
.PP
To harvest from an OAI-PMH compliant repository, create an \f(CW\*(C`HTTP::OAI::Harvester\*(C'\fR
object using the baseURL option and then call OAI-PMH methods to request data
from the repository. To handle version 1.0/1.1 repositories automatically you
\fBmust\fR request \f(CW\*(C`Identify()\*(C'\fR first.
.PP
It is recommended that you request an Identify from the repository and use the
\f(CW\*(C`repository()\*(C'\fR method to update the Identify object used by the harvester.
.PP
When making \s-1OAI\s0 requests the underlying HTTP::OAI::UserAgent module takes
care of automatic redirection (\s-1HTTP\s0 status 302) and retry-after (\s-1HTTP\s0
status 503). OAI-PMH flow control (i.e. resumption tokens) is handled
transparently by \f(CW\*(C`HTTP::OAI::Response\*(C'\fR.
.SS "Static Repository Support"
.IX Subsection "Static Repository Support"
Static repositories are automatically and transparently supported within the
existing \s-1API.\s0 To harvest a static repository, specify the repository \s-1XML\s0
file using the baseURL argument to HTTP::OAI::Harvester. An initial request is
made to determine whether the base \s-1URL\s0 specifies a static repository or a
normal \s-1OAI 1\s0.x/2.0 \s-1CGI\s0 repository. To prevent this initial request, state
the \s-1OAI\s0 version using an HTTP::OAI::Identify object, e.g.
.PP
.Vb 5
\&    $h = HTTP::OAI::Harvester\->new(
\&        repository => HTTP::OAI::Identify\->new(
\&            baseURL => \*(Aqhttp://arXiv.org/oai2\*(Aq,
\&            version => \*(Aq2.0\*(Aq,
\&    ));
.Ve
.PP
If a static repository is found the response is cached, and further requests
are served by that cache.
Static repositories do not support sets; any request that uses sets results in
a noSetHierarchy error. You can determine whether the repository is static by
checking the version ($h\->repository\->version), which will be \*(L"2.0s\*(R" for
static repositories.
.SH "FURTHER READING"
.IX Header "FURTHER READING"
You should refer to the Open Archives Protocol version 2.0 and other \s-1OAI\s0
documentation, available from http://www.openarchives.org/.
.PP
Note that OAI-PMH 1.0 and 1.1 are deprecated.
.SH "BEFORE USING EXAMPLES"
.IX Header "BEFORE USING EXAMPLES"
In the examples I use the \s-1OAI\s0 interfaces of arXiv.org and cogprints. To
avoid causing annoyance to their server administrators, please contact them
before performing testing or large downloads (or use other, less loaded,
servers for testing).
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\&    use HTTP::OAI;
\&
\&    my $h = HTTP::OAI::Harvester\->new(baseURL=>\*(Aqhttp://arXiv.org/oai2\*(Aq);
\&    my $response = $h\->Identify;
\&
\&    if( $response\->is_error ) {
\&        print "Error requesting Identify:\en",
\&            $response\->code . " " . $response\->message, "\en";
\&        exit;
\&    }
\&
\&    # Note: repositoryVersion will always be 2.0, $response\->version returns
\&    # the actual version the repository is running
\&    print "Repository supports protocol version ", $response\->version, "\en";
\&
\&    # Version 1.x repositories don\*(Aqt support metadataPrefix,
\&    # but OAI\-PERL will drop the prefix automatically
\&    # if an Identify was requested first (as above)
\&    $response = $h\->ListIdentifiers(
\&        metadataPrefix=>\*(Aqoai_dc\*(Aq,
\&        from=>\*(Aq2001\-02\-03\*(Aq,
\&        until=>\*(Aq2001\-04\-10\*(Aq
\&    );
\&
\&    if( $response\->is_error ) {
\&        die("Error harvesting: " . $response\->message . "\en");
\&    }
\&
\&    print "responseDate => ", $response\->responseDate, "\en",
\&        "requestURL => ", $response\->requestURL, "\en";
\&
\&    while( my $id = $response\->next ) {
\&        print "identifier => ", $id\->identifier;
\&        # Only available from OAI 2.0 repositories
\&        print " (", $id\->datestamp, ")" if $id\->datestamp;
\&        print " (", $id\->status, ")" if $id\->status;
\&        print "\en";
\&        # Only available from OAI 2.0 repositories
\&        for( $id\->setSpec ) {
\&            print "\et", $_, "\en";
\&        }
\&    }
\&
\&    # Using a handler
\&    $response = $h\->ListRecords(
\&        metadataPrefix=>\*(Aqoai_dc\*(Aq,
\&        handlers=>{metadata=>\*(AqHTTP::OAI::Metadata::OAI_DC\*(Aq},
\&        onRecord=>sub {
\&            my $rec = shift;
\&
\&            printf("%s\et%s\et%s\en",
\&                $rec\->identifier,
\&                $rec\->datestamp,
\&                join(\*(Aq,\*(Aq, @{$rec\->metadata\->dc\->{\*(Aqtitle\*(Aq}}));
\&        }
\&    );
\&
\&    # End program
\&    #################
\&
\&    #################
\&    # If you have some local OAI\-PMH response data you want to
\&    # parse you can use the OAI\-PMH verb as in:
\&
\&    use HTTP::OAI;
\&    my $I = HTTP::OAI::Identify\->new();
\&
\&    # If you have a $content string with some cached OAI\-PMH verb=Identify response
\&    # it can be parsed like this:
\&    $I\->parse_string($content);
\&
\&    # Or if you have an opened file handle $fh to a file with a cached
\&    # OAI\-PMH verb=Identify response
\&    $I\->parse_file($fh);
\&
\&    # Using either method now you can do something like
\&
\&    printf "RepositoryName: %s\en", $I\->repositoryName;
\&    for ($I\->adminEmail) {
\&        print $_, "\en";
\&    }
.Ve
.SH "METHODS"
.IX Header "METHODS"
.ie n .IP "HTTP::OAI::Harvester\->new( %params )" 4
.el .IP "HTTP::OAI::Harvester\->new( \f(CW%params\fR )" 4
.IX Item "HTTP::OAI::Harvester->new( %params )"
This constructor method returns a new instance of \f(CW\*(C`HTTP::OAI::Harvester\*(C'\fR.
Requires either an HTTP::OAI::Identify object, which in turn must contain a
baseURL, or a baseURL from which to construct an Identify object.
.Sp
Any other parameters are passed to the HTTP::OAI::UserAgent module, and from
there to the LWP::UserAgent module.
.Sp
.Vb 6
\&    $h = HTTP::OAI::Harvester\->new(
\&        baseURL => \*(Aqhttp://arXiv.org/oai2\*(Aq,
\&        resume=>0, # Suppress automatic resumption
\&    );
\&    $id = $h\->repository();
\&    $h\->repository($h\->Identify);
\&
\&    $h = HTTP::OAI::Harvester\->new(
\&        repository => HTTP::OAI::Identify\->new(
\&            baseURL => \*(Aqhttp://arXiv.org/oai2\*(Aq,
\&    ));
.Ve
.ie n .IP "$h\->\fBrepository()\fR" 4
.el .IP "\f(CW$h\fR\->\fBrepository()\fR" 4
.IX Item "$h->repository()"
Returns and optionally sets the HTTP::OAI::Identify object used by the
Harvester agent.
.ie n .IP "$h\->resume( [1] )" 4
.el .IP "\f(CW$h\fR\->resume( [1] )" 4
.IX Item "$h->resume( [1] )"
If set to true (the default), resumption tokens are handled automatically by
requesting the next partial list during \f(CW\*(C`next()\*(C'\fR calls.
.SH "OAI-PMH Verbs"
.IX Header "OAI-PMH Verbs"
The six OAI-PMH verbs are the requests supported by an OAI-PMH interface.
.SS "Error Messages"
.IX Subsection "Error Messages"
Use \f(CW\*(C`is_success()\*(C'\fR or \f(CW\*(C`is_error()\*(C'\fR on the returned object to determine
whether an error occurred (see HTTP::OAI::Response).
.PP
\&\f(CW\*(C`code()\*(C'\fR and \f(CW\*(C`message()\*(C'\fR return the error code (200 is success) and a
human-readable message respectively. Errors returned by the repository can be
retrieved using the \f(CW\*(C`errors()\*(C'\fR method:
.PP
.Vb 3
\&    foreach my $error ($r\->errors) {
\&        print $error\->code, "\et", $error\->message, "\en";
\&    }
.Ve
.PP
Note: \f(CW\*(C`is_success()\*(C'\fR is true for the \s-1OAI\s0 Error Code \f(CW\*(C`noRecordsMatch\*(C'\fR
(i.e. empty set), although \f(CW\*(C`errors()\*(C'\fR will still contain the \s-1OAI\s0 error.
.SS "Flow Control"
.IX Subsection "Flow Control"
If the response contained a resumption token, it can be retrieved using the
\f(CW$r\fR\->resumptionToken method.
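.PP
The following is a minimal sketch of manual flow control, for the case where
automatic resumption has been switched off with \f(CW\*(C`resume => 0\*(C'\fR. It rests
on two assumptions that should be checked against the installed version:
that \f(CW$r\fR\->resumptionToken returns an HTTP::OAI::ResumptionToken object
whose own \f(CW\*(C`resumptionToken\*(C'\fR method yields the token string, and that the
verb methods accept that string through a \f(CW\*(C`resumptionToken\*(C'\fR parameter.
The date range and the \f(CW%args\fR variable are illustrative only.
.PP
.Vb 3
\&    use strict;
\&    use warnings;
\&    use HTTP::OAI;
\&
\&    # See BEFORE USING EXAMPLES before running this against a live server.
\&    my $h = HTTP::OAI::Harvester\->new(
\&        baseURL => \*(Aqhttp://arXiv.org/oai2\*(Aq,
\&        resume  => 0, # Suppress automatic resumption
\&    );
\&
\&    my %args = (
\&        metadataPrefix => \*(Aqoai_dc\*(Aq,
\&        from           => \*(Aq2001\-02\-03\*(Aq,
\&        until          => \*(Aq2001\-04\-10\*(Aq,
\&    );
\&
\&    while( 1 ) {
\&        my $r = $h\->ListIdentifiers( %args );
\&        die("Error harvesting: " . $r\->message . "\en") if $r\->is_error;
\&
\&        # With resume off, next() stops at the end of this partial list
\&        while( my $id = $r\->next ) {
\&            print $id\->identifier, "\en";
\&        }
\&
\&        # Assumption: the token object exposes the token string; a missing
\&        # or empty token marks the last partial list
\&        my $rt = $r\->resumptionToken;
\&        my $token = $rt ? $rt\->resumptionToken : undef;
\&        last unless defined $token && length $token;
\&
\&        # In OAI\-PMH the resumption token is an exclusive argument, so the
\&        # follow\-up request carries nothing else
\&        %args = ( resumptionToken => $token );
\&    }
.Ve
.PP
Leaving resume at its default is simpler; a loop like this is mainly useful
when the token has to be stored between runs or partial lists have to be
throttled by the caller.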
.SS "Methods"
.IX Subsection "Methods"
These methods return an object subclassed from HTTP::Response (where the class
corresponds to the verb requested, e.g. \f(CW\*(C`GetRecord\*(C'\fR requests return an
\f(CW\*(C`HTTP::OAI::GetRecord\*(C'\fR object).
.ie n .IP "$r = $h\->GetRecord( %params )" 4
.el .IP "\f(CW$r\fR = \f(CW$h\fR\->GetRecord( \f(CW%params\fR )" 4
.IX Item "$r = $h->GetRecord( %params )"
Get a single record from the repository identified by identifier, in format
metadataPrefix.
.Sp
.Vb 8
\&    $gr = $h\->GetRecord(
\&        identifier => \*(Aqoai:arXiv.org:hep\-th/0001001\*(Aq, # Required
\&        metadataPrefix => \*(Aqoai_dc\*(Aq # Required
\&    );
\&    die $gr\->message if $gr\->is_error;
\&    $rec = $gr\->next;
\&    printf("%s (%s)\en", $rec\->identifier, $rec\->datestamp);
\&    $dom = $rec\->metadata\->dom;
.Ve
.ie n .IP "$r = $h\->\fBIdentify()\fR" 4
.el .IP "\f(CW$r\fR = \f(CW$h\fR\->\fBIdentify()\fR" 4
.IX Item "$r = $h->Identify()"
Get information about the repository.
.Sp
.Vb 2
\&    $id = $h\->Identify();
\&    print join \*(Aq,\*(Aq, $id\->adminEmail;
.Ve
.ie n .IP "$r = $h\->ListIdentifiers( %params )" 4
.el .IP "\f(CW$r\fR = \f(CW$h\fR\->ListIdentifiers( \f(CW%params\fR )" 4
.IX Item "$r = $h->ListIdentifiers( %params )"
Retrieve the identifiers, datestamps, sets and deleted status for all records
within the specified date range (from/until) and set spec (set), or resume an
existing harvest by specifying resumptionToken. 1.x repositories will only
return the identifier.
.Sp
.Vb 11
\&    $lr = $h\->ListIdentifiers(
\&        metadataPrefix => \*(Aqoai_dc\*(Aq, # Required
\&        from => \*(Aq2001\-10\-01\*(Aq,
\&        until => \*(Aq2001\-10\-31\*(Aq,
\&        set => \*(Aqphysics:hep\-th\*(Aq,
\&    );
\&    while($rec = $lr\->next)
\&    {
\&        print $rec\->identifier, "\en"; # ... or do something else with $rec
\&    }
\&    die $lr\->message if $lr\->is_error;
.Ve
.ie n .IP "$r = $h\->ListMetadataFormats( %params )" 4
.el .IP "\f(CW$r\fR = \f(CW$h\fR\->ListMetadataFormats( \f(CW%params\fR )" 4
.IX Item "$r = $h->ListMetadataFormats( %params )"
List available metadata formats. Given an identifier the repository should
only return those metadata formats for which that item can be disseminated.
.Sp
.Vb 7
\&    $lmdf = $h\->ListMetadataFormats(
\&        identifier => \*(Aqoai:arXiv.org:hep\-th/0001001\*(Aq
\&    );
\&    for($lmdf\->metadataFormat) {
\&        print $_\->metadataPrefix, "\en";
\&    }
\&    die $lmdf\->message if $lmdf\->is_error;
.Ve
.ie n .IP "$r = $h\->ListRecords( %params )" 4
.el .IP "\f(CW$r\fR = \f(CW$h\fR\->ListRecords( \f(CW%params\fR )" 4
.IX Item "$r = $h->ListRecords( %params )"
Return full records within the specified date range (from/until), set and
metadata format, or specify a resumption token to resume a previous partial
harvest.
.Sp
.Vb 11
\&    $lr = $h\->ListRecords(
\&        metadataPrefix => \*(Aqoai_dc\*(Aq, # Required
\&        from => \*(Aq2001\-10\-01\*(Aq,
\&        until => \*(Aq2001\-10\-01\*(Aq,
\&        set => \*(Aqphysics:hep\-th\*(Aq,
\&    );
\&    while($rec = $lr\->next)
\&    {
\&        print $rec\->identifier, "\en"; # ... or do something else with $rec
\&    }
\&    die $lr\->message if $lr\->is_error;
.Ve
.ie n .IP "$r = $h\->ListSets( %params )" 4
.el .IP "\f(CW$r\fR = \f(CW$h\fR\->ListSets( \f(CW%params\fR )" 4
.IX Item "$r = $h->ListSets( %params )"
Return a list of sets provided by the repository. The scope of sets is
undefined by OAI-PMH, so a set may represent any subset of a collection.
Optionally provide a resumption token to resume a previous partial request.
.Sp
.Vb 6
\&    $ls = $h\->ListSets();
\&    while($set = $ls\->next)
\&    {
\&        print $set\->setSpec, "\en";
\&    }
\&    die $ls\->message if $ls\->is_error;
.Ve
.SH "ENVIRONMENT"
.IX Header "ENVIRONMENT"
The default \s-1HTTP\s0 agent string is OAI\-PERL/\fIversion\fR, where \fIversion\fR is
the \s-1HTTP::OAI\s0 version.
This agent string can be set via the \s-1HTTP_OAI_AGENT\s0 environment variable.
.SH "AUTHOR"
.IX Header "AUTHOR"
These modules have been written by Tim Brody.