Efficiently identifying spam honeypots

213 Views Asked by At

I realise that the title is computing specific, but I think the underlying problem is general - I just don't know how to phrase it more generally (which may be part of my problem). So I am asking here, but please say if I should ask elsewhere (where?).

Some spam (unwanted email) detection services construct blacklists using "honeypots" - known email addresses that receive spam (say N of these, in total). If some (small) number of these (say n) receive spam from the same address then that address is blacklisted.

So I started to wonder how a spammer could identify the honeypots.

For n=1 then you could identify the (single) honeypot with emails from log2(N) distinct addresses - number the N candidates from 0 to N-1 and use the binary representation of that number to select which emails to send. Then see which addresses are blocked and reconstruct the honeypot's number from the appropriate bits.

For n=2 I initially thought that a similar approach would work, numbering the pairs. However, every honeypot is a member of multiple pairs, and so you get confusion. I think it would work if you partitioned the pairs into subsets that didn't have common honeypots and then used a separate numbering scheme for each subset. But I don't know how to do that partition, or how it affects the total number of distinct addresses (or total number of emails) needed.

Anyway, my question is - what is the general, most efficient solution to this problem? Also, is it a known (more general) problem?

Thanks (and apologies - I don't have a clue what tags to use, either).

EDIT - I am still curious about solutions that send all emails before checking any results.

2

There are 2 best solutions below

1
On BEST ANSWER

For $n = 1$ do a binary search (this is the same solution as yours, but in different words).

For $n = 2$ do a binary search... It will take you at most $2\log_2 N$ steps. The idea is that you first check $[0,...N/2)$ and $[N/2...N)$ and then if both halves contains a honey pot, use $n=1$ solution for each. If both pots are contained in one half, use solution for $n = 2$ recursively on that half (its a smaller set).

For $n \ll N$ do a similar thing: split the set in halves, ignore the empty intervals and continue on those that contain the pots. This will take something like $n \log N$, but I didn't do a proper calculation here.

Also, observe, that with no bounds on $n$ your situation might be arbitrarily bad. For example for $n = N-1$ you have to use $N-1$ steps in the worst case, and $\frac{N}{2}$ in average, i.e. the only way you could find the only non-honey-pot is to ping them one by one, any other strategy would not give you any information.

I hope this helps $\ddot\smile$

0
On

Well, I have a general solution, but it requires an awful lot of addresses.

We need the first $N$ primes ($P_1$ to $P_N$), and $K = P_N \cdot P_{N-1}$ addresses (for the pairs case). Number the addresses (1 to K) and send honeypot $i$ emails from addresses at multiples of $P_i$. Then when an address is blocked look for the (two) prime factors of the number associated with the blocked address to identify the honeypots.

This extends to triplets etc, but requires even more addresses.