How to read and understand the underlying mathematical concepts of Data Programming


I was going through the official paper Data Programming: Creating Large Training Sets, Quickly (https://arxiv.org/pdf/1605.07723.pdf) and came across these two equations:

$$ \mu_{\alpha,\beta}(\Lambda, Y) = \frac{1}{2}\prod_{i=1}^{m}\left(\beta_i\alpha_i\mathbf{1}_{\{\Lambda_i = Y\}} + \beta_i(1-\alpha_i)\mathbf{1}_{\{\Lambda_i = -Y\}} + (1-\beta_i)\mathbf{1}_{\{\Lambda_i = 0\}}\right) \tag{1} $$

$$ (\hat{\alpha}, \hat{\beta}) = \arg\max_{\alpha,\beta} \sum_{x \in S} \log \sum_{y' \in \{-1,1\}} \mu_{\alpha,\beta}(\lambda(x), y') \tag{2} $$

$\Lambda = (\Lambda_1,\dots,\Lambda_m)$ are $m$ labelling functions which label our training data.

Here are my doubts regarding equation 1:

What does $\mathbf{1}_{\{\Lambda_i=Y\}}$ in equation (1) mean? And what is the value of $Y$: is it just $\{-1,1\}$, or the entire set of true labels in the dataset? And the final value of equation (1) will be just a number, right?

How would I read equations (1) and (2) mathematically?

Also, equation (2) is compared with the logistic-regression loss function. But when we minimise a loss function, we also have the true labels attached to it, and I am not able to understand what the true labels are in the case of equation (2) given in the above image.

How do we put $\lambda(x)$ from equation (2) into equation (1)? Here $x$ would be an unlabelled training data point. What does the inner summation in equation (2) mean, given that $y'$ is already taken care of in equation (1)? Can someone elaborate more on equation (2) in terms of a data frame with $x$ features?

Update based on the answer: I have written equation (2) based on equation (1). Please let me know if it is correct.

[image of my working for equation (2)]

BEST ANSWER

I'm not totally familiar with that paper, but I think I can give you some insight. So, $\Lambda = (\Lambda_1,\dots,\Lambda_m)$ are some random variables. Each is assumed to be independent and can take values in $\{-1,0,1\}$, or equivalently in $\{-Y,0,Y\}$. Hence, each variable has a distribution function which will look something like this: $$ p_i(\Lambda_i) = \left\{ \begin{array}{ll} \beta_i\alpha_i & \text{if $\Lambda_i = Y$} \\ \beta_i(1-\alpha_i) & \text{if $\Lambda_i = -Y$}\\ (1-\beta_i) & \text{if $\Lambda_i = 0$}\\ \end{array}\right. $$ with parameters $\alpha_i,\beta_i$. Note that $\beta_i\alpha_i + \beta_i(1-\alpha_i) + (1-\beta_i) = 1$, so it is a true probability distribution.

Now, note that $p_i(\Lambda_i)$ can be written compactly using the indicator function $$ \mathbf{1}_{\{p(x)\}} = \left\{ \begin{array}{ll} 1 & \text{if $p(x)$ is true} \\ 0 & \text{if $p(x)$ is false}\\ \end{array}\right. $$ for some predicate $p(x)$ (which might state something like "$x=4$"). Using this function, we can write: $$ p_i(\Lambda_i) = \beta_i\alpha_i\mathbf{1}_{\{\Lambda_i=Y\}} + \beta_i(1-\alpha_i)\mathbf{1}_{\{\Lambda_i=-Y\}} + (1-\beta_i)\mathbf{1}_{\{\Lambda_i=0\}} $$

Now we can look at the joint probability distribution of the whole $\Lambda=(\Lambda_1,\dots,\Lambda_m)$. Since all the $\Lambda_i$ are independent, the PDF of $\Lambda$ is the product of all the $p_i(\Lambda_i)$.
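As a quick sketch of this (the function names are mine, not the paper's), the per-function distribution $p_i$ and the product over the independent $\Lambda_i$ could be written as:

```python
import numpy as np

def p_i(lam_i, y, alpha_i, beta_i):
    """Distribution of a single labelling-function output lam_i in {-1, 0, 1},
    conditioned on the true class y in {-1, 1}."""
    if lam_i == y:
        return beta_i * alpha_i          # the function labels, and agrees with y
    elif lam_i == -y:
        return beta_i * (1 - alpha_i)    # the function labels, but disagrees with y
    else:                                # lam_i == 0: the function abstains
        return 1 - beta_i

def pdf_lambda(lam, y, alpha, beta):
    """PDF of Lambda = (Lambda_1, ..., Lambda_m) given y: since the Lambda_i
    are independent, it is the product of the per-function probabilities."""
    return np.prod([p_i(l, y, a, b) for l, a, b in zip(lam, alpha, beta)])
```

You can check numerically that each $p_i$ sums to 1 over $\Lambda_i \in \{-1,0,1\}$, matching the algebraic identity above.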

Then, we want to take a look at the joint probability distribution of both $\Lambda$ and $Y$. Note that $Y$ can only take values in $\{-1,1\}$ with equal probability, so $1/2$ for each. When we combine these, the joint probability density of $(\Lambda,Y)$ is $\mu_{\alpha,\beta}(\Lambda,Y)$, as in equation (1).

Finally, by obtaining the maximum of the log-likelihood as in equation (2) (i.e., finding the "best" $(\alpha,\beta)$), you obtain a probability distribution from which you can sample data that "look like" the data in the training dataset, with generated predicted classes $Y$.
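A minimal sketch of equations (1) and (2) as code (again, the names are mine, and this is just the objective, not the optimiser):

```python
import math

def mu(lam, y, alpha, beta):
    """Equation (1): joint probability of (Lambda = lam, Y = y); the factor
    1/2 is the uniform prior on Y in {-1, 1}."""
    p = 0.5
    for l, a, b in zip(lam, alpha, beta):
        if l == y:
            p *= b * a
        elif l == -y:
            p *= b * (1 - a)
        else:                  # l == 0: the labelling function abstained
            p *= 1 - b
    return p

def log_likelihood(lambdas, alpha, beta):
    """The objective of equation (2): for each point's label vector lambda(x),
    the unknown Y is marginalised out by summing mu over y' in {-1, 1}."""
    return sum(math.log(mu(lam, 1, alpha, beta) + mu(lam, -1, alpha, beta))
               for lam in lambdas)
```

Maximising `log_likelihood` over $(\alpha,\beta)$, e.g. with gradient ascent, gives the $(\hat\alpha,\hat\beta)$ of equation (2).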

So, answering your questions:

What does $\mathbf{1}_{\{\Lambda_i=Y\}}$ in equation (1) mean? It's the indicator function, used to write the PDF of $\Lambda$ more compactly.

What is the value of $Y$: is it just $\{-1,1\}$ or the entire set of true labels in the dataset? It's the unknown predicted class. At this point it is neither fixed nor given in a dataset. It is a random variable; when sampling from $\mu_{\alpha,\beta}$ you obtain a concrete value. This is precisely what a generative algorithm is: you "learn" a probability distribution from data, and then sample from it.
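For concreteness, sampling a $(\Lambda, Y)$ pair from $\mu_{\alpha,\beta}$ could look like this (a sketch with my own names, following the per-function distribution defined above):

```python
import random

def sample_mu(alpha, beta):
    """Draw one (Lambda, Y) pair from mu_{alpha,beta}: Y is uniform on
    {-1, 1}, then each Lambda_i is drawn from its conditional distribution."""
    y = random.choice([-1, 1])
    lam = []
    for a, b in zip(alpha, beta):
        r = random.random()
        if r < b * a:       # probability beta_i * alpha_i: correct label
            lam.append(y)
        elif r < b:         # probability beta_i * (1 - alpha_i): wrong label
            lam.append(-y)
        else:               # probability 1 - beta_i: abstain
            lam.append(0)
    return lam, y
```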

The final value of equation (1) will be just a number, right? It's the joint probability function of $(\Lambda,Y)$, where $\Lambda$ and $Y$ are not fixed but random variables.

But I am not able to understand what the true labels are in the case of equation (2)? The data you are learning from is contained in the dataset $S$; there are no true labels. In (2) you replace the random variables $\Lambda$ with the concrete values $\lambda(x)$ obtained from all points $x\in S$.
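As a toy illustration (these labelling functions and data points are made up, not from the paper), obtaining $\lambda(x)$ just means running the $m$ labelling functions on every $x \in S$:

```python
# Three made-up labelling functions over short text snippets:
def lf_positive_word(x):
    return 1 if "good" in x else 0      # vote +1, otherwise abstain

def lf_negative_word(x):
    return -1 if "bad" in x else 0      # vote -1, otherwise abstain

def lf_exclamation(x):
    return 1 if x.endswith("!") else 0

lfs = [lf_positive_word, lf_negative_word, lf_exclamation]
S = ["good movie!", "bad plot", "nothing special"]

# lambda(x) for each x in S -- these vectors are the concrete values
# substituted for Lambda in equation (2):
label_matrix = [[lf(x) for lf in lfs] for x in S]
print(label_matrix)   # [[1, 0, 1], [0, -1, 0], [0, 0, 0]]
```

Each row of `label_matrix` is one $\lambda(x)$, and the parameters are then fit to these observed vectors alone; no true label for any $x$ is ever used.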

Hope this helps!