bogofilter-wordlist - Trained database for indimail-spamfilter


bogofilter-wordlist provides a trained database for bogofilter. The training data is a set of spam and ham (non-spam) emails collected from the SpamAssassin public mail corpus.

The corpus is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points:

* All headers are reproduced in full. Some address obfuscation has taken place, and hostnames in some cases have been replaced with "spamassassin.taint.org" (which has a valid MX record). In most cases though, the headers appear as they were received.
* All of these messages were posted to public fora, were sent to me in the knowledge that they may be made public, were sent by me, or originated as newsletters from public news web sites.
* Relying on data from public networked blacklists like DNSBLs, Razor, DCC or Pyzor for identification of these messages is not recommended, as a previous downloader of this corpus might have reported them!
* Copyright for the text in the messages remains with the original senders.

OK, now onto the corpus description. It's split into six parts, as follows:

* spam: 500 spam messages, all received from non-spam-trap sources.
* easy_ham: 2500 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc).
* hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc.
* easy_ham_2: 1400 non-spam messages. A more recent addition to the set.
* spam_2: 1396 spam messages. Again, more recent.
* spam_3: 393 spam messages. Most recent.

Total count: 6439 messages, with about a 35% spam ratio.
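
The total and the spam ratio follow directly from the per-part counts listed above. A minimal sketch of that arithmetic, with the counts copied from the list (the variable and function names are illustrative, not part of the package):

```python
# Message counts copied from the corpus description above.
# The grouping into spam and ham parts follows that list.
SPAM_COUNTS = [500, 1396, 393]   # the three spam parts
HAM_COUNTS = [2500, 250, 1400]   # the two easy-ham parts and hard_ham


def corpus_summary(spam_counts, ham_counts):
    """Return (total messages, spam messages, spam ratio)."""
    spam = sum(spam_counts)
    total = spam + sum(ham_counts)
    return total, spam, spam / total


total, spam, ratio = corpus_summary(SPAM_COUNTS, HAM_COUNTS)
print(f"{total} messages, {spam} spam, ratio {ratio:.1%}")
# prints "6439 messages, 2289 spam, ratio 35.5%"
```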

The corpora are prefixed with the date they were assembled. The messages are named by a message number and their MD5 checksum.
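
The naming scheme can be checked mechanically. The sketch below assumes that a message file name has the shape `<number>.<32 hex digits>` and that the checksum is the MD5 of the file's raw bytes; neither detail is stated above, so treat both as assumptions. The function names are hypothetical.

```python
import hashlib
import re
from pathlib import Path

# Assumed name shape: "<message number>.<32 lowercase hex digits>".
NAME_RE = re.compile(r"^(\d+)\.([0-9a-f]{32})$")


def parse_message_name(name):
    """Split a corpus file name into (message number, MD5 hex string)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected message file name: {name!r}")
    return int(match.group(1)), match.group(2)


def checksum_matches(path):
    """Compare the MD5 in the file name with the MD5 of the file's bytes.

    Assumption: the checksum covers the raw file exactly as stored.
    """
    path = Path(path)
    _, expected = parse_message_name(path.name)
    actual = hashlib.md5(path.read_bytes()).hexdigest()
    return actual == expected
```

A mismatch does not necessarily indicate corruption; it may only mean the checksum was computed over a different form of the message than the stored file.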

This corpus lives [here](http://spamassassin.apache.org/publiccorpus/). Mail jm - public - corpus AT jmason dot org if you have questions.
https://github.com/mbhangui/indimail-virtualdomains

Source Files

| Filename | Size (bytes) | Size |
| --- | --- | --- |
| _service | 411 | 411 Bytes |
| _service:download_url:bogofilter-wordlist-obs.tar.gz | 12937306 | 12.3 MB |
| _service:extract_file:PKGBUILD | 1334 | 1.3 KB |
| _service:extract_file:bogofilter-wordlist-1.0.0.tar.gz | 12936781 | 12.3 MB |
| _service:extract_file:bogofilter-wordlist.changes | 317 | 317 Bytes |
| _service:extract_file:bogofilter-wordlist.dsc | 427 | 427 Bytes |
| _service:extract_file:bogofilter-wordlist.spec | 1650 | 1.61 KB |
| _service:extract_file:debian.tar.gz | 2039 | 1.99 KB |

Latest Revision
Manvendra Bhangui (mbhangui) committed (revision 7): trigger service run