mirror of https://github.com/carlostrub/sisyphus
add slides
parent
1e9254d1f7
commit
de13cdbcd5
Binary file not shown.
After Width: | Height: | Size: 8.3 KiB |
@ -0,0 +1,74 @@
|
||||
|
||||
Sisyphus
|
||||
How to store 50 000 mails in 10MB to fight Spammers
|
||||
1 Mar 2018
|
||||
Tags: sisyphus, spam, junk, mail
|
||||
|
||||
Carlo Strub
|
||||
economist, gopher, rustacean, FreeBSD developer
|
||||
cs@carlostrub.ch
|
||||
cs@FreeBSD.org
|
||||
https://carlostrub.ch
|
||||
https://github.com/carlostrub
|
||||
|
||||
|
||||
* Junk Mail
|
||||
|
||||
What is it?
|
||||
|
||||
- Mail we do not want to have in our mailbox.
|
||||
- The same sender might sometimes be in either category.
|
||||
|
||||
How to fight it?
|
||||
|
||||
- Block lists, e.g. Spamhaus, etc.
|
||||
- Sophisticated filters, e.g. SpamAssassin
|
||||
- Greylisting, tarpit, and other exotic punishments
|
||||
|
||||
* Sisyphus
|
||||
|
||||
- requires zero configuration, neither on the server nor on the client
|
||||
- works with any MTA and any client
|
||||
- learns about your preferences based on all messages in your inbox and your junk folder
|
||||
- can handle multiple mail accounts with independent junk mail preferences
|
||||
- requires minimal resources, e.g. learning over 50 000 mails and keeping track of roughly 90 000 words requires only 10MB of storage
|
||||
- BSD licensed
|
||||
|
||||
* How it works — Bayes' Rule
|
||||
|
||||
.image bayes.png
|
||||
|
||||
* It's all about counters
|
||||
|
||||
- All needed probabilities can be calculated using counters
|
||||
- But counters are costly in general (storage complexity proportional to number of elements)
|
||||
- What if we learn a mail twice?
|
||||
|
||||
* HyperLogLog Algorithm
|
||||
|
||||
- Hashes of a stream of data has interesting properties regarding cardinality:
|
||||
1) number of leading zeroes yields estimate on lower bound (bit-pattern observables)
|
||||
2) smallest values yield estimate on cardinality (order statistics observables)
|
||||
- Two consequences for Sisyphus:
|
||||
1) we can count all words in all mails on very small space
|
||||
2) we do not have to check whether we already learned a mail
|
||||
|
||||
|
||||
* Implementation
|
||||
- Pure go
|
||||
- Database: bolt (stores sisyphus.db in Maildir)
|
||||
- Learns all mails in Maildir
|
||||
- Classifies new mail, triggered by FSNotify
|
||||
- Dependencies:
|
||||
github.com/boltdb/boltdb
|
||||
github.com/carlostrub/maildir
|
||||
github.com/fsnotify/fsnotify
|
||||
github.com/gonum/stat
|
||||
github.com/kennygrant/sanitize
|
||||
github.com/retailnext/hllpp
|
||||
github.com/sirupsen/logrus
|
||||
github.com/urfave/cli
|
||||
- Principles: 12factor App, semantic versioning
|
||||
|
||||
* API
|
||||
.link https://godoc.org/github.com/carlostrub/sisyphus
|
Loading…
Reference in New Issue