CHECKBOT(1p) | User Contributed Perl Documentation | CHECKBOT(1p)
NAME

Checkbot - WWW Link Verifier

SYNOPSIS

checkbot [--cookies] [--debug] [--file file name] [--help]
         [--mailto email addresses] [--noproxy list of domains]
         [--verbose]
         [--url start URL]
         [--match match string] [--exclude exclude string]
         [--proxy proxy URL] [--internal-only]
         [--ignore ignore string]
         [--filter substitution regular expression]
         [--style style file URL]
         [--note note] [--sleep seconds] [--timeout timeout]
         [--interval seconds] [--dontwarn HTTP response codes]
         [--enable-virtual]
         [--language language code]
         [--suppress suppression file]
         [start URLs]
DESCRIPTION

Checkbot verifies the links in a specific portion of the World Wide Web. It creates HTML pages with diagnostics.

Checkbot uses LWP to find URLs on pages and to check them. It supports the same schemes as LWP does, and finds the same links that HTML::LinkExtor will find.

Checkbot considers links to be either 'internal' or 'external'. Internal links are links within the web space that needs to be checked. If an internal link points to a web document this document is retrieved, and its links are extracted and processed. External links are only checked to be working. Checkbot checks links as it finds them, so internal and external links are checked at the same time, even though they are treated differently.

Options for Checkbot are:

- --cookies
- Accept cookies from the server and offer them again at later requests. This may be useful for servers that use cookies to handle sessions. By default Checkbot does not accept any cookies.
- --debug
- Enable debugging mode. Not really supported anymore, but it will keep some files around that otherwise would be deleted.
- --file <file name>
- Use the file file name as the basis for the summary
file names. The summary page will get the file name given, and the
server pages are based on the file name without the .html
extension. For example, setting this option to "index.html" will
create a summary page called index.html and server pages called
index-server1.html and index-server2.html.
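The naming scheme described above can be sketched in shell (the file name and server numbers are illustrative values, not taken from a real run):

```shell
# Derive server page names from the summary file name, as --file does:
# the summary page keeps the name given, while server pages drop the
# .html extension and append a per-server suffix.
summary='index.html'            # illustrative value for --file
base="${summary%.html}"         # strip the .html extension -> 'index'
echo "$summary"                 # summary page: index.html
echo "${base}-server1.html"     # server page:  index-server1.html
echo "${base}-server2.html"     # server page:  index-server2.html
```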
- --help
- Shows brief help message on the standard output.
- --mailto <email address>[,<email address>]
- Send mail to the email address when Checkbot is done checking. You can give more than one address separated by commas. The notification email includes a small summary of the results. As of Checkbot 1.76 email is only sent if problems have been found during the Checkbot run.
- --noproxy <list of domains>
- Do not proxy requests to the given domains. The list of domains must be a comma-separated list. For example, to avoid using the proxy for localhost and someserver.xyz, you can use "--noproxy localhost,someserver.xyz".
- --verbose
- Show verbose output while running. Includes all links checked, results from the checks, etc.
- --url <start URL>
- Set the start URL. Checkbot starts checking at this URL,
and then recursively checks all links found on this page. The start URL
takes precedence over additional URLs specified on the command line.
- --match <match string>
- This option selects which pages Checkbot considers local.
If the match string is contained within the URL, then Checkbot
considers the page local, retrieves it, and will check all the links
contained on it. Otherwise the page is considered external and it is only
checked with a HEAD request.
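The internal/external decision can be sketched as a plain substring test (the match string and URL below are illustrative; Checkbot performs this matching internally in Perl):

```shell
# Classify a URL as internal or external depending on whether the
# match string is contained within it, as --match does.
match='degraaff.org'                  # illustrative --match value
url='http://degraaff.org/checkbot/'   # illustrative URL from a page
kind='external'
case "$url" in
  *"$match"*) kind='internal' ;;      # contained: retrieve and parse links
esac
echo "$kind"    # -> internal
```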
- --exclude <exclude string>
- URLs matching the exclude string are considered to
be external, even if they happen to match the match string (See
option "--match"). URLs matching the --exclude string are still
being checked and will be reported if problems are found, but they will
not be checked for further links into the site.
- --filter <filter string>
- This option defines a filter string, which is a perl
regular expression. This filter is run on each URL found, thus rewriting
the URL before it enters the queue to be checked. It can be used to remove
elements from a URL. This option can be useful when symbolic links point
to the same directory, or when a content management system adds session
IDs to URLs.
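As a sketch of what such a filter does, the substitution below strips a session-ID parameter before a URL would enter the queue (sed stands in here for Checkbot's internal application of the Perl substitution; the ';jsessionid=' parameter name is a made-up example):

```shell
# A URL as it might be found on a page, with a CMS session ID appended.
url='http://www.example.com/page.html;jsessionid=ABC123'
# The equivalent of a --filter like 's/;jsessionid=\w*//':
filtered=$(printf '%s' "$url" | sed 's/;jsessionid=[A-Za-z0-9]*//')
echo "$filtered"    # -> http://www.example.com/page.html
```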
- --ignore <ignore string>
- URLs matching the ignore string are not checked at
all, they are completely ignored by Checkbot. This can be useful to ignore
known problem links, or to ignore links leading into databases. The
ignore string is matched after the filter string has
been applied.
- --proxy <proxy URL>
- This option specifies the URL of a proxy server. Only HTTP and FTP requests will be sent to that proxy server.
- --internal-only
- Skip the checking of external links at the end of the Checkbot run. Only matching links are checked. Note that some redirections may still cause external links to be checked.
- --note <note>
- The note is included verbatim in the mail message
(See option "--mailto"). This can be useful to include the URL
of the summary HTML page for easy reference, for instance.
- --sleep <seconds>
- Number of seconds to sleep in between requests. Default is 0 seconds, i.e. do not sleep at all between requests. Setting this option can be useful to keep the load on the web server down while running Checkbot. This option can also be set to a fractional number, i.e. a value of 0.1 will sleep one tenth of a second between requests.
- --timeout <timeout>
- Default timeout for the requests, specified in seconds. The default is 2 minutes.
- --interval <seconds>
- The maximum interval between updates of the results web pages in seconds. Default is 3 hours (10800 seconds). Checkbot will start the interval at one minute, and gradually extend it towards the maximum interval.
- --style <URL of style file>
- When this option is used, Checkbot embeds this URL as a link to a style file on each page it writes. This makes it easy to customize the layout of pages generated by Checkbot.
- --dontwarn <HTTP response codes regular expression>
- Do not include warnings on the result pages for those HTTP
response codes which match the regular expression. For instance,
--dontwarn "(301|404)" would not include 301 and 404 response
codes.
In addition to standard HTTP response codes, Checkbot generates some internal response codes, which can be suppressed in the same way:

- 901 Host name expected but not found
- In this case the URL supports a host name, but none was found in the URL. This usually indicates a mistake in the URL. An exception is that this check is not applied to news: URLs.
- 902 Unqualified host name found
- In this case the host name does not contain the domain part. This usually means that the pages work fine when viewed within the original domain, but not when viewed from outside it.
- 903 Double slash in URL path
- The URL has a double slash in it. This is legal, but some web servers cannot handle it very well and may cause Checkbot to run away. See also the comments below.
- 904 Unknown scheme in URL
- The URL starts with a scheme that Checkbot does not know about. This is often caused by mistyping the scheme of the URL, but the scheme can also be a legal one. In that case please let me know so that it can be added to Checkbot.
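To preview which codes a given --dontwarn expression would suppress, the regular-expression match can be sketched in shell (the list of codes is illustrative):

```shell
# Hypothetical preview: which response codes --dontwarn "(301|404)"
# would suppress. grep -E stands in for Checkbot's internal Perl match.
for code in 200 301 403 404 901; do
  if printf '%s' "$code" | grep -Eq '(301|404)'; then
    echo "$code: suppressed"
  else
    echo "$code: warned"
  fi
done
```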
- --enable-virtual
- This option enables dealing with virtual servers. Checkbot then assumes that all hostnames for internal servers are unique, even though their IP addresses may be the same. Normally Checkbot uses the IP address to distinguish servers. This has the advantage that if a server has two names (e.g. www and bamboozle) its pages only get checked once. When you want to check multiple virtual servers this causes problems, which this feature works around by using the hostname to distinguish the server.
- --language
- The argument for this option is a two-letter language code. Checkbot will use language negotiation to request files in that language. The default is to request English language (language code 'en').
- --suppress
- The argument for this option is a file which contains
combinations of error codes and URLs for which to suppress warnings. This
can be used to avoid reporting of known and unfixable URL errors or
warnings.
For example:

    # 301 Moved Permanently
    301 http://www.w3.org/P3P

    # 403 Forbidden
    403 http://www.herring.com/
    403 /http:\/\/wikipedia.org\/.*/
- --allow-simple-hosts (deprecated)
- This option turns off warnings about URLs which contain
unqualified host names. This is useful for intranet sites which often use
just a simple host name or even "localhost" in their links.
HINTS AND TIPS
- Problems with checking FTP links
- Some users may experience consistent problems with checking FTP links. In these cases it may be useful to instruct Net::FTP to use passive FTP mode to check files. This can be done by setting the environment variable FTP_PASSIVE to 1. For example, using the bash shell: "FTP_PASSIVE=1 checkbot ...". See the Net::FTP documentation for more details.
- Run-away Checkbot
- In some cases Checkbot literally takes forever to finish. There are two common causes for this problem: a server that mishandles double slashes in URL paths (see response code 903 under "--dontwarn"), and dynamically generated pages, such as links leading into a database, which can yield an endless supply of new URLs. The "--ignore" and "--exclude" options can help work around both.
- Problems with https:// links
- The error message

      Can't locate object method "new" via package "LWP::Protocol::https::Socket"

  means that LWP is missing SSL support. This is usually fixed by installing the Crypt::SSLeay module, which LWP uses to handle https: links.
EXAMPLES

The most simple use of Checkbot is to check a set of pages on a server. To check my checkbot pages I would use:

    checkbot http://degraaff.org/checkbot/

Checkbot runs can take some time, so Checkbot can send a notification mail when the run is done:

    checkbot --mailto hans@degraaff.org http://degraaff.org/checkbot/

It is possible to check a set of local files without using a web server. This only works for static files, but may be useful in some cases.

    checkbot file:///var/www/documents/
PREREQUISITES

This script uses the "LWP" modules.

COREQUISITES

This script can send mail when "Mail::Send" is present.

AUTHOR

Hans de Graaff <hans@degraaff.org>

2008-10-15 | perl v5.10.1