Pigmail Installation Notes
--------------------------
Requirements:
------------
1. Ruby - grab it from your OS distribution's package archive, or
.
2. TMail - best acquired as part of the `ruby-sumo' package: see
or from the RAA, at
.
3. DBM or GDBM - should be part of your Ruby installation
4. A large body of mails, for each category (spam/ham/whatever) you wish
to identify.
Installation
------------
Download the tarball, pigmail-version.tar.gz, and unpack it into a
subdirectory, thus:
bash$ mkdir -p ~/ruby/pigmail
bash$ cd ~/ruby/pigmail
bash$ tar xvpfz ~/Downloads/pigmail-version.tar.gz
For CVS access instead, try:
bash$ mkdir -p ~/ruby/
bash$ cd ~/ruby/
bash$ cvs -d ":pserver:anonymous@cvs.pigmail.sourceforge.net:/cvsroot/pigmail" login
bash$ cvs -qz3 -d ":pserver:anonymous@cvs.pigmail.sourceforge.net:/cvsroot/pigmail" co pigmail
Either way, you may want to add this directory to your PATH:
sh$ PATH=~/ruby/pigmail:$PATH ; export PATH
bash$ export PATH=~/ruby/pigmail:$PATH
tcsh$ set path = (~/ruby/pigmail $path) ; rehash
zsh% set -A path ~/ruby/pigmail $path ; rehash
as appropriate.
Configuration:
-------------
The default configuration provides two categories, called `ham' and `spam',
with an equal weight assigned to each token-class.
The categories may be renamed by amending the Categories array in config.rb:
Categories=["ham", "spam"]
Usage:
-----
To train pigmail about a particular mail:
pigmail --learn=categoryname ([filename]+ | [ < filename])
If no filenames are given on the command line, pigmail reads the mail from
stdin instead.
e.g.
pigmail --learn=spam probably-spam/{1,3,17}
This is best done in batch against an MH or other file-per-email folder,
e.g.:
find probably-spam -type f -size +5 -size -50 ! -name \*.gz \
! -name .\* | xargs -n 20 pigmail --learn=spam
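For those who would rather drive the batch from Ruby, the following is a
rough sketch of the same selection and batching. The helper names
candidate_mails and learn_commands are made up for illustration and are not
part of pigmail; the size test mimics find's 512-byte blocks, rounded up.

```ruby
require "find"

BLOCK_SIZE = 512 # find(1) measures -size in 512-byte blocks, rounding up

# Hypothetical helper: select files much as the find(1) example does
# (regular files, more than 5 and fewer than 50 blocks, no *.gz, no dotfiles).
def candidate_mails(dir)
  Find.find(dir).select do |path|
    stat = File.lstat(path)
    next false unless stat.file?
    name = File.basename(path)
    next false if name.start_with?(".") || name.end_with?(".gz")
    blocks = (stat.size + BLOCK_SIZE - 1) / BLOCK_SIZE
    blocks > 5 && blocks < 50
  end.sort
end

# Hypothetical helper: build one pigmail command line per batch of files,
# mirroring `xargs -n 20`.
def learn_commands(files, category, batch_size = 20)
  files.each_slice(batch_size).map do |batch|
    ["pigmail", "--learn=#{category}", *batch]
  end
end
```

Each returned command is an argument array, so it can be handed to
system(*cmd) without going through a shell; filenames containing spaces are
then passed intact.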
If you absolutely must use the mbox format, try:
formail -ds pigmail --learn=category < folder.mbox
although this is considerably slower.
Now you've trained it, run some checks on similar mails:
pigmail --check probably-spam/2
The response will indicate the most likely category based on the training
data:
Matching file: probably-spam/2: best category = spam
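If you want to post-process that response in a script, a small parser along
these lines may help. The pattern is inferred purely from the sample line
above, and parse_check_line is a hypothetical name; other output formats are
not covered.

```ruby
# Pattern inferred from the sample response; an assumption, not a documented format.
CHECK_LINE = /\AMatching file: (?<file>.+): best category = (?<category>\S+)\z/

# Returns a hash with :file and :category, or nil if the line does not match.
def parse_check_line(line)
  m = CHECK_LINE.match(line.chomp)
  m && { file: m[:file], category: m[:category] }
end

result = parse_check_line("Matching file: probably-spam/2: best category = spam")
puts result[:file]      # probably-spam/2
puts result[:category]  # spam
```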
Procmail
--------
There are two ways to use this with procmail. The favoured method is to
invoke pigmail in pass-through mode, where the mail is reprinted in its
entirety but with a hint header added.
The corresponding block in .procmailrc is:
:0fwHB
|$PIGMAIL --check --passthrough
:0
* ^X-Pigmail-Hint: spam
spam-pigmail
to redirect all spams to a separate folder for later perusal.
Alternatively, for those who prefer to use the exit-code, pigmail has been
written to return the category-number (starting from 0) that matches the
mail. While weird, this does mean that an exit-code of 0 means ham, and 1
means spam, in the default configuration. Use the --exitstatus or `-e'
option for this mode.
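As a sketch of how a wrapper script might interpret that exit-code, the
category name is simply the entry at that position in the Categories list.
The helper names below are hypothetical, and combining --check with
--exitstatus in one invocation is an assumption based on the description
above.

```ruby
# Same order as Categories in config.rb; the array index is the exit-code.
CATEGORIES = ["ham", "spam"]

# Map an exit-code back to a category name; nil if it is out of range.
def category_for_exit_status(status, categories = CATEGORIES)
  return nil unless status.is_a?(Integer) && status >= 0
  categories[status]
end

# Hypothetical wrapper: run pigmail on one file and return the category name.
# Assumes pigmail is on the PATH and accepts both options together.
def check_mail(path, categories = CATEGORIES)
  system("pigmail", "--check", "--exitstatus", path)
  category_for_exit_status($?.exitstatus, categories)
end

puts category_for_exit_status(0)  # ham
puts category_for_exit_status(1)  # spam
```

Bear in mind that a genuine failure with exit-code 1 would be
indistinguishable from a spam verdict in this mode.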
Feedback mode:
-------------
In addition to simple learning and checking, pigmail may be told to feed
the mail's tokens back into whichever category matched.
The theory is, if you train pigmail that mails containing the words
apple
orange
banana
are spam, then when someone sends you a mail where the most predominant
words are
apple
banana
kumquat
then kumquat will be added to the list of spam words. The same applies to
every token-class.
This means that both categories will naturally become more refined over
time, although you will need to unlearn and re-learn any false positives or
negatives.
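The fruit example can be played through with a toy model. This is only an
illustration of the feedback idea: ToyClassifier is a made-up class that
counts overlapping words, and it is not how pigmail itself scores mail.

```ruby
require "set"

# Toy model only: each category is just a set of words seen so far.
class ToyClassifier
  def initialize(categories)
    @words = categories.to_h { |c| [c, Set.new] }
  end

  # Add tokens to a category's word list.
  def learn(category, tokens)
    @words.fetch(category).merge(tokens)
  end

  # Pick the category sharing the most words with the mail.
  def check(tokens)
    @words.max_by { |_, known| (known & tokens.to_set).size }.first
  end

  # Feedback mode: check, then feed the tokens back into the winner.
  def check_and_learn(tokens)
    category = check(tokens)
    learn(category, tokens)
    category
  end

  def known?(category, word)
    @words.fetch(category).include?(word)
  end
end

toy = ToyClassifier.new(["ham", "spam"])
toy.learn("spam", %w[apple orange banana])
toy.learn("ham", %w[meeting agenda minutes])
puts toy.check_and_learn(%w[apple banana kumquat])  # spam
puts toy.known?("spam", "kumquat")                  # true
```

The same sketch shows why mistakes need correcting by hand: a mail that is
wrongly matched has its words fed into the wrong category as well.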
To effect this, either write some procmail rules to handle the header and
re-add to the category, or, preferably, simply specify both --check and
--learn options simultaneously.
Note that if you specify a category, --learn=something, and --check
together, your choice will be overridden with the category resulting from
--check.
Advanced Configuration:
----------------------
The weightings assigned to each token-class may also be adjusted by amending
config.rb:
Weight=Hash.new(1.0)
Weight["WordPairs"]=2.0
Weight["Sender"]=5.0
according to taste.
These weights apply only with the `--check' option; they are not reflected
in the backend databases at all.
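To see what such a table means in Ruby terms: Hash.new(1.0) returns 1.0 for
any token-class that has no explicit entry. The sketch below shows one way
weights could be combined at check time. The weighted sum, the "Words"
token-class name and the raw scores are assumptions for illustration, since
the actual scoring formula is not described here.

```ruby
# As in config.rb: unlisted token-classes default to a weight of 1.0.
Weight = Hash.new(1.0)
Weight["WordPairs"] = 2.0
Weight["Sender"] = 5.0

# Hypothetical scoring: multiply each token-class score by its weight and sum.
def weighted_score(raw_scores, weights = Weight)
  raw_scores.sum { |token_class, score| weights[token_class] * score }
end

# Made-up raw scores for one category; "Words" has no entry, so it weighs 1.0.
raw = { "Words" => 0.5, "WordPairs" => 0.25, "Sender" => 0.125 }
puts weighted_score(raw)  # 1.625
```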