Use Held-out estimator for unseen events


The held-out estimator is an empirical technique for estimating the probabilities of events. (I would put a Wikipedia link, but I couldn't find a Wikipedia page on this subject.)

The main idea is to split the experimental data into two sets: training and held-out. (From this point onward I will use NLP terminology for clarity.)

Define: $S^T$ - The training corpus
$S^H$ - The calibration (held-out) corpus
$N_r^T$ - The number of distinct events that appear exactly $r$ times in $S^T$
$c^T(x)$, $c^H(x)$ - The number of times that event $x$ appears in $S^T$ and $S^H$, respectively.

With the notation above, the held-out estimate of the probability of an event $x$ that appears $r$ times in the training corpus is given by the following equation:

$P_{HO}(x : c^T(x)=r)=\frac{\sum_{x:c^T(x)=r} c^H(x)}{N_r^T \, |S^H|}$
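
To make the formula concrete, here is a minimal Python sketch of how I understand the computation for events with $r \ge 1$. The function name `held_out_probability` and the toy corpora are my own assumptions for illustration, and I take $|S^H|$ to be the number of tokens in the held-out corpus. It deliberately does not handle $r = 0$, since the definitions above do not say how to obtain $N_0^T$.

```python
from collections import Counter


def held_out_probability(train_tokens, heldout_tokens, r):
    """Held-out estimate for a single event seen exactly r times in training.

    Hypothetical helper for illustration; only meaningful for r >= 1,
    because N_r^T is computed from the events observed in training.
    """
    c_train = Counter(train_tokens)    # c^T(x)
    c_held = Counter(heldout_tokens)   # c^H(x); missing keys count as 0

    # All events x with c^T(x) = r; their number is N_r^T.
    events_r = [x for x, count in c_train.items() if count == r]
    n_r = len(events_r)
    if n_r == 0:
        raise ValueError(f"no event appears exactly {r} times in training")

    # Numerator: total held-out count of those events.
    held_total = sum(c_held[x] for x in events_r)

    # |S^H| is taken to be the number of held-out tokens (an assumption).
    return held_total / (n_r * len(heldout_tokens))


# Toy corpora, made up for this example.
train = "a a b b c d".split()
held = "a b b c e e".split()

p1 = held_out_probability(train, held, 1)  # events c and d
p2 = held_out_probability(train, held, 2)  # events a and b
print(p1, p2)
```

In this toy data the word `e` appears only in the held-out corpus, so it never enters either sum.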

And now for the question...

This is all well and good if the event we are talking about appears in both $S^T$ and $S^H$, but what happens when $c^T(x)=c^H(x)=0$? These are words (events) that don't appear in either corpus, so I don't know what words they are. If so, how can I get $c^H(x)$ if I have no idea what $x$ is in the first place?

Am I missing something? Is there another equation that handles unseen words (events)?