Pigmail Installation Notes
--------------------------

Requirements:
------------

1. Ruby - grab it from your OS distribution's package archive, or .

2. TMail - best acquired as part of the `ruby-sumo' package: see
   or from the RAA, at .

3. DBM or GDBM - should be part of your Ruby installation.

4. A large body of mails for each category (spam/ham/whatever) you wish
   to identify.

Installation
------------

Download the tarball, pigmail-version.tar.gz, and unpack it into a
subdirectory, thus:

    bash$ mkdir -p ~/ruby/pigmail
    bash$ cd ~/ruby/pigmail
    bash$ tar xvpfz ~/Downloads/pigmail-version.tar.gz

For CVS access instead, try:

    bash$ mkdir -p ~/ruby/
    bash$ cd ~/ruby/
    bash$ cvs -d ":pserver:anonymous@cvs.pigmail.sourceforge.net:/cvsroot/pigmail" login
    bash$ cvs -qz3 -d ":pserver:anonymous@cvs.pigmail.sourceforge.net:/cvsroot/pigmail" co pigmail

Either way, you may want to add this directory to your PATH:

    sh$   PATH=~/ruby/pigmail:$PATH ; export PATH
    bash$ export PATH=~/ruby/pigmail:$PATH
    tcsh$ set path = (~/ruby/pigmail $path) ; rehash
    zsh%  set -A path ~/ruby/pigmail $path ; rehash

as appropriate.

Configuration:
-------------

The default configuration provides two categories, called `ham' and
`spam', with an equal weight assigned to each token-class.

The categories may be renamed by amending config.rb:

    Categories=["ham", "spam"]

Usage:
-----

To train pigmail about a particular mail:

    pigmail --learn=categoryname ([filename]+ | [ < filename])

It falls back to stdin if no filenames are given on the command line.

e.g.

    pigmail --learn=spam probably-spam/{1,3,17}

This is best done in batch against an MH or other file-per-email folder,
e.g.:

    find probably-spam -type f -size +5 -size -50 ! -name \*.gz \
        ! -name .\* | xargs -n 20 pigmail --learn=spam

If you absolutely must use the mbox format, try:

    formail -ds pigmail --learn=category < folder.mbox

although this is considerably slower.

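For those who would rather drive the batch training from Ruby, the find
pipeline above can be approximated as follows. This is only a sketch: the
helper names training_files and learn_commands are invented for this
example and are not part of pigmail, and the size test assumes find's
usual 512-byte blocks, rounded up.

```ruby
require "find"

# Regular files of more than 5 and fewer than 50 blocks (512 bytes each,
# rounded up, as find counts for -size), skipping *.gz and dotfiles.
# Hypothetical helper, not part of pigmail.
def training_files(dir)
  Find.find(dir).select do |path|
    stat   = File.lstat(path)
    name   = File.basename(path)
    blocks = (stat.size + 511) / 512
    stat.file? && blocks > 5 && blocks < 50 &&
      !name.end_with?(".gz") && !name.start_with?(".")
  end.sort
end

# One pigmail command line per batch of 20 files, like `xargs -n 20'.
# Hypothetical helper, not part of pigmail.
def learn_commands(dir, category)
  training_files(dir).each_slice(20).map do |batch|
    ["pigmail", "--learn=#{category}", *batch]
  end
end

# Only runs when invoked as a script with a folder and a category.
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  learn_commands(ARGV[0], ARGV[1]).each { |cmd| system(*cmd) }
end
```

Saved under a name of your choosing, say learn_batch.rb, it would be run
as `ruby learn_batch.rb probably-spam spam'.
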
Now that you've trained it, run some checks on similar mails:

    pigmail --check probably-spam/2

The response will indicate the most likely category based on the
training data:

    Matching file: probably-spam/2: best category = spam

Procmail
--------

There are two ways to use this with procmail. The favoured method is to
invoke pigmail in pass-through mode, where the mail is reprinted in its
entirety but with a hint header added. The corresponding block in
.procmailrc is:

    :0fwHB
    |$PIGMAIL --check --passthrough

    :0
    * ^X-Pigmail-Hint: spam
    spam-pigmail

to redirect all spam to a separate folder for later perusal.

Alternatively, for those who prefer to use the exit-code, pigmail has
been written to return the category-number (starting from 0) that
matches the mail. While weird, this does mean that an exit-code of 0
means ham, and 1 means spam, in the default configuration. Use the
--exitstatus or `-e' option for this mode.

Feedback mode:
-------------

In addition to simple learning and checking, pigmail may be told to feed
the mail's tokens back into whichever category matched.

The theory is, if you train pigmail that mails containing the words

    apple orange banana

are spam, then when someone sends you a mail where the most predominant
words are

    apple banana kumquat

then kumquat will be added to the list of spam words. The same applies
to all token-classes.

This means that both categories will naturally become more refined over
time, although you will need to unlearn and re-learn any false positives
or negatives.

To effect this, either write some procmail rules to handle the header
and re-add the mail to the category, or, preferably, simply specify both
the --check and --learn options simultaneously.

Note that if you specify a category, --learn=something, and --check
together, your choice will be overridden with the category resulting
from --check.

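The apple/banana/kumquat example above can be played through with a toy
model in Ruby. This is a sketch of the feedback idea only: the
word-overlap scoring, the sample ham words, and the method names check
and check_and_learn are assumptions made for illustration, and pigmail's
real scoring will differ.

```ruby
require "set"

# Toy model: each category is just a set of known words.
# This is NOT pigmail's real scoring; it only illustrates feedback.
categories = {
  "ham"  => Set.new(%w[meeting lunch report]),
  "spam" => Set.new(%w[apple orange banana]),
}

# Like --check: pick the category sharing the most words with the mail.
def check(categories, words)
  categories.max_by { |_name, known| (known & words.to_set).size }.first
end

# Like --check with --learn: feed the mail's words back into whichever
# category matched, and report that category.
def check_and_learn(categories, words)
  best = check(categories, words)
  categories[best].merge(words)
  best
end

mail = %w[apple banana kumquat]
best = check_and_learn(categories, mail)
puts "best category = #{best}"
puts "spam now knows kumquat: #{categories["spam"].include?("kumquat")}"
```

The mail matches spam on two words, so its one new word, kumquat, joins
the spam set while the ham set is left alone.
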
Advanced Configuration:
----------------------

The weightings assigned to each token-class may be changed by amending
config.rb as well:

    Weight=Hash.new(1.0)
    Weight["WordPairs"]=2.0
    Weight["Sender"]=5.0

according to taste. These weights apply only with the `--check' option;
they are not reflected in the backend databases at all.
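
As a short illustration of how such a table behaves in Ruby: Hash.new(1.0)
hands back 1.0 for any token-class that has no entry of its own, so only
the exceptions need listing. The token-class name "Words", the per-class
scores, and the weighted-sum rule below are invented for this sketch;
how pigmail actually combines the weights is not shown here.

```ruby
# Same table as the config.rb fragment: unlisted token-classes get 1.0.
Weight = Hash.new(1.0)
Weight["WordPairs"] = 2.0
Weight["Sender"]    = 5.0

# Hypothetical per-class scores for one mail against one category.
class_scores = { "Words" => 0.5, "WordPairs" => 0.25, "Sender" => 0.1 }

# Assumed combination rule: a plain weighted sum.
# Pigmail's real formula may differ.
def weighted_total(scores, weights)
  scores.sum { |token_class, score| score * weights[token_class] }
end

total = weighted_total(class_scores, Weight)
puts "weight for Words (unlisted): #{Weight["Words"]}"
puts "weighted total: #{total}"
```

Looking up an unlisted class does not store it, so the table itself keeps
only the two explicit entries.
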