.\" This is a comment
.TH IFILE "1" "November 2004" "ifile 1.3.4" "User Commands"
.SH NAME
ifile \- core executable for the ifile mail filtering system
.SH SYNOPSIS
.B ifile
[\fB-b \fIfile\fR] [\fB-q\fR|\fB-Q\fR] [\fB-g\fR] [\fB-k\fR] [\fB-o\fR] [\fB-v \fInum\fR] [\fIlexing options\fR] \fIfile \fI...
.br
.B ifile
\fB-c\fR \fB-q\fR|\fB-Q\fR [\fB-T \fIthreshold\fR] [\fB-b \fIfile\fR] [\fB-g\fR] [\fB-k\fR] [\fB-o\fR] [\fIlexing options\fR] \fIfile \fI...
.br
.B ifile
[\fB-b \fIfile\fR] [\fB-d \fIfolder\fR] [\fB-i \fIfolder\fR|\fB-u \fIfolder\fR] [\fB-g\fR] [\fB-k\fR] [\fB-o\fR] [\fB-v \fInum\fR] [\fIlexing options\fR] \fIfile \fI...
.br
.B ifile
\fB-r\fR [\fB-b \fIfile\fR]
.SH DESCRIPTION
.B ifile
is a mail filter client that uses machine learning to classify
e-mail into folders (mailboxes).  It uses the Naive Bayes algorithm,
which treats each document as an unordered collection of words and
assigns it to the folder whose word distribution most closely matches
that of the document.
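.PP
The idea can be sketched in a few lines of Python.
This is an illustration only, not the code ifile runs: the function name
\fBclassify\fR, the add-one smoothing and the toy word counts are all
assumptions made for the example, and folder priors are ignored.
.PP
.nf
```python
import math

def classify(words, folder_counts):
    """Return (folder, log_score) pairs, best first.

    folder_counts maps a folder name to a dict of word -> count.
    Word order is ignored; add-one smoothing is an assumption of this sketch.
    """
    vocab = set()
    for counts in folder_counts.values():
        vocab.update(counts)
    scores = []
    for folder, counts in folder_counts.items():
        total = sum(counts.values())
        score = 0.0
        for word in words:
            # Smoothed estimate of P(word | folder)
            p = (counts.get(word, 0) + 1) / (total + len(vocab))
            score += math.log(p)
        scores.append((folder, score))
    return sorted(scores, key=lambda fs: fs[1], reverse=True)

# Hypothetical per-folder word counts
toy = {
    "spam": {"make": 5, "yahoo": 1},
    "friends": {"jrennie": 16, "yahoo": 2},
}
ranking = classify(["jrennie", "yahoo"], toy)
print(ranking[0][0])
```
.fi
.PP
Scores are sums of log probabilities, so they are negative and the
folder with the score closest to zero ranks first.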
.SH OPTIONS
.TP
\fB\-b\fR, \fB\-\-db\-file\fR=\fIfile\fR
Location to read/store ifile database.  Default is
~/.idata
.TP
\fB\-c\fR, \fB\-\-concise\fR
equivalent of "ifile \fB\-v\fR 0 | head \fB\-1\fR | cut \fB\-f1\fR
\fB\-d\fR".  Must be used with \fB-q\fR or \fB-Q\fR.
.TP
\fB\-d\fR, \fB\-\-delete\fR=\fIfolder\fR
Delete the statistics for each of \fIfiles\fR from the
category \fIfolder\fR
.TP
\fB\-f\fR, \fB\-\-folder\-calcs\fR=\fIfolder\fR
Show the word-probability calculations for \fIfolder\fR
.TP
\fB\-g\fR, \fB\-\-log\-file\fR
Create and store debugging information in
~/.ifile.log
.TP
\fB\-i\fR, \fB\-\-insert\fR=\fIfolder\fR
Add the statistics for each of the files to the
category \fIfolder\fR
.TP
\fB\-k\fR, \fB\-\-keep\-infrequent\fR
Leave in the database words that occur
infrequently (normally they are tossed)
.TP
\fB\-l\fR, \fB\-\-query\-loocv\fR=\fIfolder\fR
For each of the files, temporarily removes file from
\fIfolder\fR, performs query and then reinserts file in
\fIfolder\fR.  Database is not modified.
.TP
\fB\-o\fR, \fB\-\-occur\fR
Use a bit-vector document representation: count
each word at most once per document.
.TP
\fB\-q\fR, \fB\-\-query\fR
Output rating scores for each of the files
.TP
\fB\-Q\fR, \fB\-\-query\-insert\fR
For each of the files, output rating scores and add
statistics for the folder with the highest score
.TP
\fB\-T\fR, \fB\-\-threshold\fR=\fIthreshold\fR
When used with both \fB\-c\fR and \fB\-q\fR,
output the two highest ranking categories if
their scores differ by at most \fIthreshold\fR / 1000,
which can be used to detect borderline cases.
When used with \fB\-q\fR alone and any \fIthreshold\fR > 0,
also output the score difference as a percentage.
For example,
.RS
.RS
\fBifile \-T\fR1 \fB\-q\fR foo.txt
.RE
might result in
.RS
.br
spam \-15570.48640776
.br
non-spam \-18728.00272369
.br
diff[spam,non-spam](%) 9.21
.RE
If so, then 
.RS
\fBifile \-T\fR93 \fB\-q \-c\fR foo.txt
.RE
will result in
.RS
foo.txt spam,non-spam
.RE
whereas
.RS
\fBifile \-T\fR92 \fB\-q \-c\fR foo.txt
.RE
will result in
.RS
foo.txt spam
.RE
.RE
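.IP
The figures in this example are consistent with the percentage being the
absolute score difference divided by the sum of the absolute scores, and
with \fB\-T\fR comparing that value, as a fraction, against
\fIthreshold\fR / 1000.
The following Python sketch reproduces the example under that assumption.
The formula is inferred from the output shown above, not taken from the
ifile source, and the function name \fBdiff_percent\fR is hypothetical.
.IP
.nf
```python
def diff_percent(best, second):
    """Score gap as a percentage of the summed score magnitudes (assumed formula)."""
    return 100.0 * abs(best - second) / (abs(best) + abs(second))

spam, non_spam = -15570.48640776, -18728.00272369
pct = diff_percent(spam, non_spam)
print(round(pct, 2))

# Assumed rule: report both categories when the gap, as a fraction,
# is at most threshold / 1000.
for threshold in (93, 92):
    both = pct / 100.0 <= threshold / 1000.0
    print(threshold, "spam,non-spam" if both else "spam")
```
.fi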
.TP
\fB\-r\fR, \fB\-\-reset\-data\fR
Erases all currently stored information
.TP
\fB\-u\fR, \fB\-\-update\fR=\fIfolder\fR
Same as \fB\-i\fR, except that statistics are added only if \fIfolder\fR
already exists
.TP
\fB\-v\fR, \fB\-\-verbosity\fR=\fInum\fR
Amount of output while running: 0=silent, 1=quiet,
2=progress, 3=verbose, 4=debug
.PP
Lexing options:
.TP
\fB\-a\fR, \fB\-\-alpha\-lexer\fR
Lex words as sequences of alphabetic characters
(default)
.TP
\fB\-A\fR, \fB\-\-alpha\-only\-lexer\fR
Only lex space-separated character sequences which
are composed entirely of alphabetic characters
.TP
\fB\-h\fR, \fB\-\-strip\-header\fR
Skip all of the header lines except Subject:,
From: and To:
.TP
\fB\-m\fR, \fB\-\-max\-length\fR=\fIchar\fR
Ignore portion of message after first \fIchar\fR
characters.  Use entire message if \fIchar\fR set to 0.
Default is 50,000.
.TP
\fB\-p\fR, \fB\-\-print\-tokens\fR
Just tokenize and print, don't do any other
processing.  Documents are returned as a list of
word, frequency pairs.
.TP
\fB\-s\fR, \fB\-\-no\-stoplist\fR
Do not throw out overly frequent (stoplist) words
when lexing
.TP
\fB\-S\fR, \fB\-\-stemming\fR
Use 'Porter' stemming algorithm when lexing
documents
.TP
\fB\-w\fR, \fB\-\-white\-lexer\fR
Lex words as sequences of space separated
characters
.PP
If no files are specified on the command line, ifile will use standard input
as its message to process.
.TP
\fB-?\fR, \fB\-\-help\fR
Give this help list
.TP
\fB\-\-usage\fR
Give a short usage message
.TP
\fB\-V\fR, \fB\-\-version\fR
Print program version
.PP
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
.SH FILES
.TP
.I ~/.idata
ifile database (default location).  See the \fIFAQ\fR included in the ifile package for a description of the database format.
.SH AUTHOR
Jason Rennie <jrennie@csail.mit.edu> and many others.  See the ChangeLog for the full list.
.\".SH "SEE ALSO"
.\".BR ifilter_mh (1),
.\".BR irefile_mh (1),
.\".BR knowledge_base.mh (1),
.\".BR news2mail (1).
.SH EXAMPLES
Before using
.BR ifile ,
you need to train it.
Let's say that you have three folders, "spam", "ifile" and "friends",
and the following directory structure:

.RS
/--+--spam----+--1
   |          +--2
   |          +--3
   |
   +--ifile---+--1
   |          +--2
   |          +--3
   |
   +--friends-+--1
              +--2
              +--3
.RE

The following commands build the ifile database in ~/.idata (use the
.B \-b
option to specify a different location for the database):

.RS
.br
.BR "ifile \-h \-i" " spam /spam/*"
.br
.BR "ifile \-h \-i" " ifile /ifile/*"
.br
.BR "ifile \-h \-i" " friends /friends/*"
.RE

The 
.B \-h
option strips off headers besides "Subject:", "From:" and "To:". 
I find that 
.B \-h
improves ifile's performance, but you may find otherwise for
your personal collection.

Note that we have made the argument to 
.B \-i
the same as the corresponding folder name. This is not necessary. The
argument to
.B \-i
can be any word you want to use to identify a category of e-mails. The
argument to 
.B \-i
must not include whitespace characters (space, tab, linefeed, etc.).

At this point, your ~/.idata file should look something like this:

.RS
.br
spam ifile friends 
.br
662 1020 6451 
.br
3 3 3 
.br
jrennie 9 0:3 1:18 2:16 
.br
mindspring 6 1:7 2:5 
.br
make 9 0:5 1:3 
.br
yahoo 9 0:1 1:22 2:2 
.RE

The first line is the space-separated list of folders. Their ordering
specifies a numbering (spam=0, ifile=1, friends=2). The second line is a
token count for each folder (e.g. 662 tokens observed in the three spam
messages). The third line is an e-mail count for each folder (e.g. 3
e-mails for each of spam, ifile and friends). Each following line
specifies statistics for a word. The format of a line is 

.RS
\fIword age folder\fR:\fIcount\fR [\fIfolder\fR:\fIcount\fR ...]
.RE

where \fIfolder\fR is the folder number determined by the first line
ordering. Folders with a count of zero are not listed. So, the line
beginning with "jrennie" indicates that "jrennie" appeared 3 times in
"spam" e-mails, 18 times in "ifile" e-mails and 16 times in "friends"
e-mails. The \fIage\fR is the number of e-mails that have been processed
since the word was added to the database. Very infrequent words are
pruned from the database to keep the database size down.
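.PP
The layout described above is simple enough to read with a short script.
The following Python sketch parses database text in that layout.
It assumes exactly the format described here (the FAQ remains the
authoritative description), and the function name \fBparse_idata\fR and
the structure it returns are inventions of this example.
.PP
.RS
.nf
```python
def parse_idata(text):
    """Parse ifile database text laid out as described above."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    folders = lines[0].split()
    token_counts = dict(zip(folders, map(int, lines[1].split())))
    mail_counts = dict(zip(folders, map(int, lines[2].split())))
    words = {}
    for line in lines[3:]:
        # Each word line is: word age folder:count [folder:count ...]
        word, age, *pairs = line.split()
        counts = {}
        for pair in pairs:
            index, count = pair.split(":")
            counts[folders[int(index)]] = int(count)
        words[word] = {"age": int(age), "counts": counts}
    return folders, token_counts, mail_counts, words

sample = """spam ifile friends
662 1020 6451
3 3 3
jrennie 9 0:3 1:18 2:16
make 9 0:5 1:3
"""
folders, tokens, mails, words = parse_idata(sample)
print(words["jrennie"]["counts"])
```
.fi
.RE
.PP
Folders missing from a word line, such as "friends" for "make", are
simply absent from that word's counts, mirroring the rule that zero
counts are not listed.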

Now that you have a database, you might want to filter some e-mails. Say
you have the following incoming e-mails:

.RS
/--inbox--+--1
          +--2
          +--3
.RE

To find out what folders ifile thinks these e-mails belong in, run

.RS
.br
.BR "ifile \-c \-q" " /inbox/1"
.br
.BR "ifile \-c \-q" " /inbox/2"
.br
.BR "ifile \-c \-q" " /inbox/3"
.RE

Let's say that 1 is about ifile, 2 is spam and 3 is from a
friend. Assuming ifile does its job correctly, you'll see output like
this:

.RS
.br
/inbox/1 ifile
.br
/inbox/2 spam
.br
/inbox/3 friends
.RE

With so little training data, ifile is unlikely to get the labels
correct, but you should get the idea :-)

Now, if you move the e-mails to the folders suggested by ifile, you'll
want to update the database accordingly. You can do this with the 
.B \-i
option, like before. Or, you can simply use 
.B \-Q
in place of 
.B \-q
above. This automatically adds the e-mail to the folder ifile suggests.

Now, assume for a moment that e-mail 1 was actually spam. We have already
added it to the "ifile" category and put it in the ifile folder. We need
to move it to the spam folder and update the ifile database accordingly.
We can update the database with the following command:

.RS
.BR "ifile \-d" " ifile " "\-i" " spam /inbox/1"
.RE

This deletes the e-mail from "ifile" and adds it to "spam".
.SH "SEE ALSO"
Examples of how to use
.B ifile
together with
.BR procmail (1)
and
.BR metamail (1)
can be found in the directory
.BR /usr/share/doc/ifile/examples .