NAME
htdig - retrieve HTML documents for the ht://Dig search engine
SYNOPSIS
htdig [options]
DESCRIPTION
Htdig retrieves HTML documents using the HTTP protocol and gathers information
from them that can later be used to search them. This program can be
referred to as the search robot.
OPTIONS
- -
- Get the list of URLs to start indexing from standard input.
This will override the default parameter start_url specified in the
config file and the file supplied to the -m option.
- -a
- Use alternate work files. Tells htdig to append
.work to database files, causing a second copy of the database to
be built. This allows the original files to be used by htsearch during the
indexing run.
- -c configfile
- Use the specified configfile instead of the
default.
- -h maxhops
- Restrict the dig to documents that are at most
maxhops links away from the starting document. This only works if
option -i is also given.
- -i
- Initial. Do not use any old databases. Old databases will
be erased before running the program.
- -m filename
- Minimal run. Only index the URLs given in the file
filename, ignoring all others. URLs in the file should be formatted
one URL per line.
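As an illustration, the URL list for -m could be generated by a small script. The following is a minimal sketch in Python; the helper name write_url_list and the file name urls.txt are hypothetical and not part of ht://Dig. Only the one-URL-per-line layout comes from the description above.

```python
from pathlib import Path


def write_url_list(urls, path):
    """Write URLs to `path`, one per line, skipping blanks and duplicates.

    Hypothetical helper: only the one-URL-per-line layout is taken from
    the htdig manual page; the de-duplication is a convenience choice.
    """
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    # Terminate every URL with a newline so the file is strictly line-based.
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

The resulting file could then be passed to the program as: htdig -m urls.txt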
- -s
- Print statistics about the dig after completion.
- -t
- Create an ASCII version of the document database. This
database is easy to parse with other programs so that information can be
extracted from it for purposes other than searching. One could gather some
interesting statistics from this database.
Fieldname | Value
u | URL
t | Title
a | State (0 normal, 1 not found, 2 not indexed, 3 obsolete)
m | Time of last modification reported by the server
s | Document size in bytes
H | Excerpt of the document
h | Meta description
l | Time of last retrieval
L | Count of links in the document (outgoing links)
b | Number of links to the document, also called incoming links or backlinks
c | Hop count of this document
g | Signature of this document (used to detect duplicates)
e | E-mail address to use for a notification from htnotify
n | Date on which such notification is sent
S | Subject of the notification message
d | The text of incoming links pointing to this document (e.g. <a href="docURL">description</a>)
A | Anchors in the document (i.e. <A NAME=...)
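To illustrate how another program might read this database, here is a minimal Python sketch. This page does not spell out the record layout, so the sketch assumes that each record is a single line that starts with the document ID, followed by tab-separated fieldname:value pairs. Treat that layout, the function name parse_record and the long field names as assumptions to check against a real dump.

```python
# Long names are hypothetical labels for the one-letter fields listed above.
FIELD_NAMES = {
    "u": "url", "t": "title", "a": "state", "m": "last_modified",
    "s": "size", "H": "excerpt", "h": "meta_description",
    "l": "last_retrieved", "L": "outgoing_links", "b": "backlinks",
    "c": "hop_count", "g": "signature", "e": "notify_email",
    "n": "notify_date", "S": "notify_subject", "d": "link_descriptions",
    "A": "anchors",
}

# State codes as documented for the "a" field.
STATES = {"0": "normal", "1": "not found", "2": "not indexed", "3": "obsolete"}


def parse_record(line):
    """Parse one record, ASSUMING 'docid<TAB>field:value<TAB>field:value...'."""
    doc_id, *pairs = line.rstrip("\n").split("\t")
    record = {"doc_id": doc_id}
    for pair in pairs:
        # Split on the first colon only, so URLs keep their own colons.
        key, sep, value = pair.partition(":")
        if not sep:
            continue  # not a field:value pair; ignore it
        record[FIELD_NAMES.get(key, key)] = value
    if "state" in record:
        record["state"] = STATES.get(record["state"], record["state"])
    return record
```

All values are kept as strings; converting sizes, counts or timestamps is left to the caller because their exact encoding is not described here.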
- -u username:password
- Tells htdig to send the supplied username and password with
each HTTP request. The credentials will be encoded using the
'Basic' authentication method. A colon (:) must separate
the username and password.
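For reference, the HTTP 'Basic' method base64-encodes the string username:password and sends it in an Authorization header. The Python sketch below shows that general encoding; it is not taken from the htdig source, and the function name basic_auth_header and the choice of UTF-8 are assumptions.

```python
import base64


def basic_auth_header(credentials):
    """Build the Authorization header line for a 'username:password' string.

    Illustrative only: this shows what HTTP Basic authentication produces
    in general, not how htdig implements it internally.
    """
    username, sep, password = credentials.partition(":")
    if not sep:
        # Mirrors the requirement that a colon separates the two parts.
        raise ValueError("expected username:password")
    raw = f"{username}:{password}".encode("utf-8")  # UTF-8 is an assumption
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization: Basic {token}"
```

Because base64 is an encoding and not encryption, credentials sent this way are readable by anyone who can observe the request.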
- -v
- Verbose mode. This increases the verbosity of the program.
Using more than two -v options is probably only useful for debugging. The
default verbose mode (a single -v) gives a progress report
while digging. The format of the progress report is described below.
A line is shown for each URL, with three numbers before the URL and some symbols
after it. The first number is the number of documents parsed so far, the
second is the DocID for this document, and the third is the hop count of the
document (number of hops from one of the start_url documents). The symbols
printed after the URL have the following meanings:
- "*" is printed for a link already
visited
- "+" is printed for a new link just
queued
- "-" is output for a link rejected for any
of a number of reasons. To find out what those reasons are, you need to
run htdig with at least 3 -v options, i.e. -vvv.
- If no "*", "+" or "-" symbols follow the URL, this does not mean that
the document was not parsed or was empty, only that no links to other
documents were found within it. With more verbose output, these symbols
are interspersed among several lines of debugging output.
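The progress report described above can be summarized by a script. The Python sketch below counts the link symbols on each line. The exact separators are not given on this page, so the sketch assumes a colon-separated layout of the form parsed:docid:hops:URL: symbols. That layout, the function name parse_progress_line and the result keys are assumptions to verify against real output, which may carry additional fields.

```python
import re
from collections import Counter

# ASSUMED layout: "<parsed>:<docid>:<hops>:<url>: <symbols>".
LINE_RE = re.compile(r"^(\d+):(\d+):(\d+):(.*?):\s*([*+\-]*)\s*$")

# Meaning of each symbol, as documented for the default verbose mode.
SYMBOLS = {"*": "already_visited", "+": "queued", "-": "rejected"}


def parse_progress_line(line):
    """Return the numbers, URL and per-symbol link counts, or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # the line does not follow the assumed layout
    parsed, doc_id, hops, url, symbols = match.groups()
    counts = Counter(SYMBOLS[s] for s in symbols)
    return {
        "parsed": int(parsed),
        "doc_id": int(doc_id),
        "hops": int(hops),
        "url": url,
        "links": {name: counts.get(name, 0) for name in SYMBOLS.values()},
    }
```

A line with no symbols yields zero for all three counts, which matches the case where a document was parsed but contained no links to other documents.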
FILES
- /etc/htdig/htdig.conf
- The default configuration file.
SEE ALSO
Please refer to the HTML pages (in the htdig-doc package)
/usr/share/doc/htdig-doc/html/index.html and the manual pages
htdigconfig(8), htmerge(1), htnotify(1), htsearch(1) and rundig(1)
for a detailed description of ht://Dig and its commands.
AUTHOR
This manual page was written by Christian Schwarz, modified by Stijn de Bekker.
It is updated and maintained by Robert Ribnitz and based on the HTML
documentation of ht://Dig.