Confused about the definition of differential privacy


From standard references, the definition of differential privacy is as follows.

A mechanism $M$ satisfies $\epsilon$-differential privacy ($\epsilon$-DP) if the following condition holds: for all $x, x' \in X$ with $d(x, x') = 1$, and for all measurable sets $S \subseteq \mathbb R^n$,

$$\mathbb P(M(x) \in S) \le e^\epsilon\, \mathbb P(M(x') \in S).$$

From my understanding, this inequality means that even when the realized outputs satisfy $M(x) \neq M(x')$, differential privacy is still achieved, as long as $\mathbb P(M(x) \in S) \le e^\epsilon\, \mathbb P(M(x') \in S)$ holds for every measurable set $S$.

Is it true that differential privacy covers the case where $M(x) \neq M(x')$, as long as $M(x) \in S$ and $M(x') \in S$?

I am especially confused about this when the Laplace mechanism is used to achieve differential privacy, where $M(x) = M(x')$ seems to be implicitly assumed (for example, here).
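To make the question concrete, here is a minimal sketch of the Laplace mechanism for a hypothetical counting query with sensitivity 1 (the function names and the example counts 100 and 101 are illustrative, not from any particular reference). On any single run the two noisy outputs will almost surely differ; the DP guarantee is about their distributions, not about individual draws:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) random variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(count, epsilon, sensitivity=1.0):
    # Noise with scale sensitivity/epsilon gives epsilon-DP for a counting query.
    return count + laplace_noise(sensitivity / epsilon)

# Neighbouring datasets: the true counts differ by one person.
f_x, f_x_prime = 100, 101
out_x = laplace_mechanism(f_x, epsilon=0.5)
out_x_prime = laplace_mechanism(f_x_prime, epsilon=0.5)
# out_x != out_x_prime almost surely; DP constrains the *distributions*.
```

The point of the sketch is that $M(x) = M(x')$ is not assumed anywhere; each call just adds fresh noise to its own input.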

There is 1 answer below.


Hi not_einstein (BTW I'm not Einstein either, like all of living humanity, so identifying as not Einstein is differentially private, though I doubt it contributes meaningful information at all, even cumulatively :) )

The formal definition of DP says something about the distribution of the mechanism's (algorithm's) outputs.

Generally speaking, if an algorithm gets two datasets as input that differ by only one data item, e.g. a single person's response to a midterm poll, the distribution of the mechanism's output should not change so much that an attacker can infer what that person's response was.

Formally, if $x, x'$ are datasets identical up to one person, the outputs of the mechanism $\mathcal{M}$ on a single run might differ, but the distributions of $\mathcal{M}$'s outputs on $x$ and on $x'$ are so similar that you cannot find any measurable set $S$ that separates them. (If you do not know what measurable means, that is fine: measurable sets are there for the sake of pure mathematical correctness, and you rarely need to deal with them in practice.)
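The claim that no set $S$ separates the two distributions can be checked pointwise for the Laplace mechanism: the ratio of the two output densities never exceeds $e^\epsilon$, so integrating over any $S$ preserves the DP inequality. A small numerical sketch (the means 100 and 101 stand for the query answers on two hypothetical neighbouring datasets):

```python
import math

def laplace_pdf(t, mu, scale):
    # Density of a Laplace(mu, scale) distribution at t.
    return math.exp(-abs(t - mu) / scale) / (2.0 * scale)

epsilon, sensitivity = 0.5, 1.0
scale = sensitivity / epsilon
mu_x, mu_xp = 100.0, 101.0  # query answers on neighbouring datasets

# Scan a grid of outputs: the pointwise density ratio is at most e^epsilon,
# so P(M(x) in S) <= e^epsilon * P(M(x') in S) for every measurable S.
worst = max(laplace_pdf(t, mu_x, scale) / laplace_pdf(t, mu_xp, scale)
            for t in (mu_x + 0.01 * k for k in range(-1000, 1000)))
assert worst <= math.exp(epsilon) + 1e-9
```

The maximum ratio is attained exactly at $e^\epsilon$, which is why the Laplace scale must be calibrated to sensitivity/$\epsilon$.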

The $\epsilon$ in the definition, entering through the factor $e^{\epsilon}$, is called the privacy budget: it quantifies how different you allow the two output distributions to be.

  • If $\epsilon$ is very large, then the algorithm is not private, because the attacker will surely find some $S$ that separates $x$ from $x'$ and will uncover that person's data item, e.g. the single person's midterm answer that differs between dataset $x$ and dataset $x'$.
  • If $\epsilon = 0$, then the algorithm is perfectly private but not very useful, like an algorithm that tags every Stack Exchange user who is not the famous scientist Albert Einstein as 'not_einstein'. Of course, if he were alive, the discussion would be a bit different...
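The $\epsilon = 0$ case in the last bullet can be written down in two lines, echoing the 'not_einstein' joke: a mechanism that ignores its input satisfies the DP inequality with equality for every $S$, so it is perfectly private and perfectly uninformative:

```python
def constant_mechanism(dataset):
    # Ignores its input entirely: for any S, P(M(x) in S) == P(M(x') in S),
    # so the DP inequality holds with epsilon = 0 -- perfectly private,
    # and exactly as useless as tagging every user 'not_einstein'.
    return "not_einstein"

# Neighbouring (or wildly different) datasets give identical outputs.
assert constant_mechanism(["alice"]) == constant_mechanism(["bob", "carol"])
```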

Hope this helps, and sorry for the late response; I only got into this subject recently.

You can find very useful and practical stuff here and more theoretical stuff here.