.\"
.\"	SWISH++
.\"	swish++.index.5
.\"
.\"	Copyright (C) 1998-2003  Paul J. Lucas
.\"
.\"	This program is free software; you can redistribute it and/or modify
.\"	it under the terms of the GNU General Public License as published by
.\"	the Free Software Foundation; either version 2 of the License, or
.\"	(at your option) any later version.
.\"
.\"	This program is distributed in the hope that it will be useful,
.\"	but WITHOUT ANY WARRANTY; without even the implied warranty of
.\"	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\"	GNU General Public License for more details.
.\"
.\"	You should have received a copy of the GNU General Public License
.\"	along with this program; if not, write to the Free Software
.\"	Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
.\"
.\" ---------------------------------------------------------------------------
.\" define code-start macro
.de cS
.sp
.nf
.RS 5
.ft CW
.ta .5i 1i 1.5i 2i 2.5i 3i 3.5i 4i 4.5i 5i 5.5i
..
.\" define code-end macro
.de cE
.ft 1
.RE
.fi
.if !'\\$1'0' .sp
..
.\" ---------------------------------------------------------------------------
.TH \f3swish++.index\f1 5 "March 29, 2004" "SWISH++"
.SH NAME
swish++.index \- SWISH++ index file format
.SH SYNOPSIS
.nf
.ft CW
.ta 10
long	num_words;
off_t	word_offset[ num_words ];
long	num_stop_words;
off_t	stop_word_offset[ num_stop_words ];
long	num_directories;
off_t	directory_offset[ num_directories ];
long	num_files;
off_t	file_offset[ num_files ];
long	num_meta_names;
off_t	meta_name_offset[ num_meta_names ];
.ft 2
	word index
	stop-word index
	directory index
	file index
	meta-name index
.ft 1
.fi
.SH DESCRIPTION
The index file format used by SWISH++ is as shown above.
Every \f(CWword_offset\f1 is an offset into the
.I "word index"
pointing at the first character of a word entry;
similarly,
every \f(CWstop_word_offset\f1 is an offset into the
.I "stop-word index"
pointing at the first character of a stop-word entry;
similarly,
every \f(CWdirectory_offset\f1 is an offset into the
.I "directory index"
pointing at the first character of a directory entry;
similarly,
every \f(CWfile_offset\f1 is an offset into the
.I "file index"
pointing at the first byte of a file entry;
finally,
every \f(CWmeta_name_offset\f1 is an offset into the
.I "meta-name index"
pointing at the first character of a meta-name entry.
.P
The index file is laid out this way so that it can be mapped into memory via the
.BR mmap (2)
Unix system call, enabling ``instantaneous'' access.
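.P
As an illustration only, the following sketch reads the five counts
from the start of an index that has already been mapped into memory.
It assumes that the fields are stored back to back
exactly as listed in the SYNOPSIS,
in the native sizes of \f(CWlong\f1 and \f(CWoff_t\f1 and with no padding;
the names \f(CWindex_counts\f1 and \f(CWread_counts\f1 are hypothetical
and are not part of SWISH++.
.cS
```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical holder for the five counts named in the SYNOPSIS. */
struct index_counts {
    long words, stop_words, directories, files, meta_names;
};

/* Read each count and skip over the offset table that follows it.
 * Assumes native long and off_t with no padding between fields.
 * Returns 0 on success or -1 if the buffer is too short. */
int read_counts(const unsigned char *base, size_t size,
                struct index_counts *c) {
    long *field[5];
    size_t pos = 0;
    int i;

    field[0] = &c->words;
    field[1] = &c->stop_words;
    field[2] = &c->directories;
    field[3] = &c->files;
    field[4] = &c->meta_names;

    for (i = 0; i < 5; ++i) {
        long n;
        if (size - pos < sizeof n)
            return -1;
        memcpy(&n, base + pos, sizeof n);   /* avoids alignment traps */
        pos += sizeof n;
        if (n < 0 || (size_t)n > (size - pos) / sizeof(off_t))
            return -1;
        pos += (size_t)n * sizeof(off_t);   /* skip the offset table */
        *field[i] = n;
    }
    return 0;
}
```
.cE
Here \f(CWbase\f1 and \f(CWsize\f1 would be the address returned by
.BR mmap (2)
and the size of the file.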
.SS Encoded Integers
All integers in an index file are stored in an encoded format for compactness.
An integer is encoded to use a varying number of bytes.
For a given byte, only the lower 7 bits are used for data;
the high bit, if set, indicates that
the integer continues into the next byte.
The encoded integer is in big-endian byte order.
For example, the integers 0\-127
are encoded as single bytes of
\f(CW\\x00\f1\-\f(CW\\x7F\f1, respectively;
the integer 128 is encoded as the two bytes of \f(CW\\x8100\f1.
.P
Note that the byte \f(CW\\x80\fP
will never be the first byte of an encoded integer
(although it can be any other byte);
therefore, it can be used as a
.I marker
to embed extra information into an encoded integer byte sequence.
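.P
The encoding described above can be sketched as follows.
This is a minimal illustration, not the SWISH++ source:
the names \f(CWenc_int\f1 and \f(CWdec_int\f1 are hypothetical,
and \f(CWunsigned long\f1 is assumed to be at most 64 bits wide.
.cS
```c
#include <stddef.h>

/* Decode one integer starting at buf[*pos]; advances *pos past it.
 * Each byte carries 7 data bits, most significant group first;
 * a set high bit means that another byte follows. */
unsigned long dec_int(const unsigned char *buf, size_t *pos) {
    unsigned long n = 0;
    unsigned char b;
    do {
        b = buf[(*pos)++];
        n = (n << 7) | (b & 0x7F);
    } while (b & 0x80);
    return n;
}

/* Encode n into out, which must have room for 10 bytes.
 * Returns the number of bytes written. */
size_t enc_int(unsigned long n, unsigned char *out) {
    unsigned char tmp[10];
    size_t len = 0, i;

    /* Collect the 7-bit groups, least significant first. */
    do {
        tmp[len++] = (unsigned char)(n & 0x7F);
        n >>= 7;
    } while (n != 0);

    /* Emit the most significant group first and
     * set the high bit on every byte except the last. */
    for (i = 0; i < len; ++i) {
        unsigned char b = tmp[len - 1 - i];
        out[i] = (unsigned char)(i + 1 < len ? (b | 0x80) : b);
    }
    return len;
}
```
.cE
With this sketch, 128 encodes to the two bytes 0x81 0x00,
matching the example above.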
.SS Encoded Integer Lists
Because the high bit of the last byte of an encoded integer is always clear,
lists of encoded integers can be stored one right after the other
with no separators.
For example, the byte sequence \f(CW\\x010203\f1
encodes the three separate integers 1, 2, and 3;
the byte sequence \f(CW\\x81002A8101\f1
encodes the three integers 128, 42, and 129.
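.P
A run of back-to-back integers can be decoded with a simple loop.
The sketch below is illustrative only;
the name \f(CWdec_list\f1 and its signature are hypothetical.
.cS
```c
#include <stddef.h>

/* Decode every integer stored in buf[0..len) into out,
 * writing at most max values.  Returns the number decoded.
 * No separators are needed because the last byte of
 * each integer always has its high bit clear. */
size_t dec_list(const unsigned char *buf, size_t len,
                unsigned long *out, size_t max) {
    size_t pos = 0, count = 0;
    while (pos < len && count < max) {
        unsigned long n = 0;
        unsigned char b;
        do {
            b = buf[pos++];
            n = (n << 7) | (b & 0x7F);
        } while ((b & 0x80) && pos < len);
        out[count++] = n;
    }
    return count;
}
```
.cE
Given the five bytes 0x81 0x00 0x2A 0x81 0x01,
this sketch yields the three values 128, 42, and 129.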
.SS Word Entries
Every word entry in the
.I "word index"
is of the form:
.cS
\f2word\fP0\f3\s+2{\s-2\fP\f2data\fP\f3\s+2}...\s-2\fP
.cE
that is: a null-terminated word followed by one or more
.I data
entries where a
.I data
entry is:
.cS
\f3\s+2{\s-2\fP\f2F\fP\f3\s+2}{\s-2\fP\f2O\fP\f3\s+2}{\s-2\fP\f2R\fP\f3\s+2}[{\s-2\fP\f2list\fP\f3\s+2}...]{\s-2\fP\f2marker\fP\f3\s+2}\s-2\fP
.cE
that is: a file-index
.RI ( F )
followed by the number of occurrences in the file
.RI ( O )
followed by a rank
.RI ( R )
followed by zero or more
.I lists
of integers followed by a
.I marker
byte.
The
.I file-index
is an index into the \f(CWfile_offset\f1 table;
the
.I marker
byte is one of:
.P
.RS 5
.PD 0
.TP 4
\f(CW00\f1
Another
.I data
entry follows.
.TP
\f(CW80\f1
No more
.I data
entries follow.
.PD
.RE
.P
A
.I list
is:
.cS
\f3\s+2{\s-2\fP\f2type\fP\f3\s+2}{\s-2\fP\f2I\fP\f3\s+2}...\s-2\fP\\x80
.cE
that is: a
.I type
followed by one or more integers
.RI ( I )
followed by an \f(CW\\x80\f1 byte
where
.I type
defines the type of list, i.e., what the integers mean.
Currently, there are two types of list:
.TP 4
\f(CW01\fP
Meta-ID list.
Meta-IDs are unique integers
identifying which meta name(s) a word is associated with
in the meta-name index.
.TP
\f(CW02\fP
Word-position list.
Every word in a file has a position:
the first word is 1, the second word is 2, etc.
Each word position is stored as a delta from the previous position
for compactness.
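.P
Putting the pieces together,
one
.I data
entry might be parsed as sketched below.
This is not the SWISH++ implementation:
the names \f(CWword_data\f1, \f(CWparse_data\f1, and \f(CWMAX_ITEMS\f1
are hypothetical,
the input is trusted to be well-formed,
and the first word-position delta is assumed to be relative to zero.
.cS
```c
#include <stddef.h>

enum { MAX_ITEMS = 16 };    /* arbitrary cap for this sketch */

struct word_data {
    unsigned long file_index, occurrences, rank;
    unsigned long meta_ids[MAX_ITEMS];
    size_t num_meta_ids;
    unsigned long positions[MAX_ITEMS];
    size_t num_positions;
    int more;               /* nonzero if another data entry follows */
};

static unsigned long dec_int(const unsigned char *buf, size_t *pos) {
    unsigned long n = 0;
    unsigned char b;
    do {
        b = buf[(*pos)++];
        n = (n << 7) | (b & 0x7F);
    } while (b & 0x80);
    return n;
}

/* Parse one data entry starting at buf[*pos]; advances *pos.
 * Returns 0 on success or -1 on an unknown list type. */
int parse_data(const unsigned char *buf, size_t *pos,
               struct word_data *d) {
    d->file_index = dec_int(buf, pos);
    d->occurrences = dec_int(buf, pos);
    d->rank = dec_int(buf, pos);
    d->num_meta_ids = d->num_positions = 0;

    for (;;) {
        unsigned char type = buf[(*pos)++];
        unsigned long prev = 0;

        if (type == 0x00 || type == 0x80) {     /* marker byte */
            d->more = (type == 0x00);
            return 0;
        }
        if (type != 0x01 && type != 0x02)
            return -1;

        /* 0x80 never starts an integer, so it ends the list. */
        while (buf[*pos] != 0x80) {
            unsigned long n = dec_int(buf, pos);
            if (type == 0x01) {
                if (d->num_meta_ids < MAX_ITEMS)
                    d->meta_ids[d->num_meta_ids++] = n;
            } else {
                prev += n;  /* undo the delta encoding */
                if (d->num_positions < MAX_ITEMS)
                    d->positions[d->num_positions++] = prev;
            }
        }
        ++*pos;             /* skip the list terminator */
    }
}
```
.cE
A caller would skip the null-terminated word first,
then call \f(CWparse_data\f1 repeatedly while \f(CWmore\f1 is nonzero.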
.SS Stop-Word Entries
Every stop-word entry in the
.I "stop-word index"
is of the form:
.cS
\f2stop-word\fP0
.cE
that is: every word is null-terminated.
.SS Directory Entries
Every directory entry in the
.I "directory index"
is of the form:
.cS
\f2directory-path\fP0
.cE
that is: a null-terminated full pathname of a directory
(not including the trailing slash).
The pathnames are relative to where the indexing was performed
(unless absolute paths were used).
.SS File Entries
Every file entry in the
.I "file index"
is of the form:
.cS
\f3\s+2{\s-2\fP\f2D\fP\f3\s+2}\s-2\fP\f2file-name\fP0\f3\s+2{\s-2\fP\f2S\fP\f3\s+2}{\s-2\fP\f2W\fP\f3\s+2}\s-2\fP\f2file-title\fP0
.cE
that is: the file's directory index
.RI ( D )
followed by a null-terminated file name
followed by the file's size in bytes
.RI ( S )
followed by the number of words in the file
.RI ( W )
followed by the file's null-terminated title.
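.P
A file entry of this form could be read as in the following sketch.
The names \f(CWfile_entry\f1 and \f(CWparse_file_entry\f1 are hypothetical,
and the entry is assumed to be well-formed.
.cS
```c
#include <stddef.h>
#include <string.h>

struct file_entry {
    unsigned long dir_index;    /* D */
    const char *name;           /* points into the index itself */
    unsigned long size;         /* S, in bytes */
    unsigned long num_words;    /* W */
    const char *title;          /* points into the index itself */
};

static unsigned long dec_int(const unsigned char **p) {
    unsigned long n = 0;
    unsigned char b;
    do {
        b = *(*p)++;
        n = (n << 7) | (b & 0x7F);
    } while (b & 0x80);
    return n;
}

/* Parse the file entry starting at p.
 * Returns a pointer to the byte just past the entry. */
const unsigned char *parse_file_entry(const unsigned char *p,
                                      struct file_entry *e) {
    e->dir_index = dec_int(&p);
    e->name = (const char *)p;
    p += strlen(e->name) + 1;   /* skip the name and its null */
    e->size = dec_int(&p);
    e->num_words = dec_int(&p);
    e->title = (const char *)p;
    p += strlen(e->title) + 1;  /* skip the title and its null */
    return p;
}
```
.cE
Because the strings are not copied,
the pointers stay valid only while the index remains mapped.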
.P
For an HTML or XHTML file,
the title is the text between the \f(CW<TITLE>\f1 and \f(CW</TITLE>\f1 tags;
for an MP3 file,
the title is the value of the title field;
for a mail or news file,
the title is the value of the \f(CWSubject\f1 header;
for a LaTeX file,
the title is the argument of the \f(CW\\title\f1 command;
for a Unix manual page file,
the title is the contents of the first line within the \f(CWNAME\f1 section.
If a file is not one of those types of files, or is but does not have a title,
the title is simply the file (not path) name.
.SS Meta-Name Entries
Every meta-name entry in the
.I "meta-name index"
is of the form:
.cS
\f2meta-name\fP0\f3\s+2{\s-2\f2I\f3\s+2}\s-2\f(CW
.cE
that is: a null-terminated meta-name followed by the ID
.RI ( I ).
.SH CAVEATS
Generated index files are machine-dependent:
they depend on the sizes of data types and on the byte order
of the machine that generated them.
.SH SEE ALSO
.BR index++ (1),
.BR search++ (1)
.SH AUTHOR
Paul J. Lucas
.RI < pauljlucas@mac.com >