I have a bunch of sets. Some sets contain bad values. I know which sets have them, but not which values are bad.

96 Views Asked by At

My company sends email on behalf of many other companies. Hotmail tells us when we start sending spammy messages, but they only say "some of the emails this giant batch of messages had spammy stuff", and not specifically which emails were spammy. Those batches of emails contain stuff from lots of different clients we have, and I need to narrow down which clients are sending the spammy messages.

So, given a bunch of sets of emails we sent, and Hotmail giving us a "yay" or "nay" on each set, and given that emails and clients are 1:1 so it's easy to which client sent any given email in the sets, how can I tell which clients are probably the spammers?

(P.S.- I'm not the brightest math whiz in the box, so please explain your answer in layman's terms).

2

There are 2 best solutions below

0
On

If you have access to the bad batch, why not partition it by client? Then forward each partitioned batch to an email address your company owns to see which batch is flagged. Since every email in the partitioned batch belongs to one client, you'll know who the spammer is.

2
On

Here's a way you could model the problem. Suppose you have clients numbered $i=1,2,\ldots,M$, and each client has an associated probability $p_i$ which is the probability that they don't ruin a batch of messages with spam if they are included in the batch.

The simplest thing to do would be to say $p_i=0$ for spammers, i.e. they always spoil the batch. Then you could say the likely spammers were clients who never participated in a good batch of messages.

More elaborate, you could say that each spammer has the same probability of passing $p_i=\sigma$ and each non-spammer passes with probability $p_i=\nu$.

Then for a given set of clients $C$ in a batch, the probability of a good batch of messages is $$ G_C = \prod_{i\in C} p_i $$ and the probability of a bad batch of messages is $1-G_C$.

Given a set of batches with client sets $C_1,C_2,\ldots$, the likelihood of getting the realized good/bad labelling back is $$ P = \prod_{C_j~\mathrm{good}} G_{C_j} \prod_{C_j~\mathrm{bad}} (1-G_{C_j}) $$

We can look for the maximum likelihood condition by maximizing $\log P$ $$ \begin{align} L &= \log P = \sum_{C_j~\mathrm{good}} \log G_{C_j} + \sum_{C_j~\mathrm{bad}} \log (1-G_{C_j}) \\ & \simeq \sum_{C_j~\mathrm{good}}\sum_{i\in C_j} \log p_i - \sum_{C_j~\mathrm{bad}}\prod_{i\in C_j}p_i \end{align} $$ to a first order approximation assuming $G_C$ is close to zero for bad batches. Then $$ \frac{\partial L}{\partial p_i} \simeq \sum_{\mathrm{good}}\frac{1}{p_i}-\sum_{\mathrm{bad}}\frac{G_{c_j}}{p_i} $$ and we can look the best spammer/non-spammer classification by starting with and arbitrary labelling (say all non-spammer, e.g.), then reclassifying the clients with largest negative derivative as spammers, and/or those with largest positive derivative as non-spammers, then repeat as necessary until you converge to a (local) maximum.

Some variations of this approach would be to consider the probabilities on a message-by-message basis to account for different volumes, or to use something like the EM algorithm to also try to determine $\sigma$ and $\nu$. But since you asked for layman's terms I think I should stop here.