We have a series of observations $(x_i,y_i)$ where $y_i \in \{0,1\}$ and $x_i \in (0,1)$. I want to fit a curve $y = f(x)$ to these observations so that $f(x)$ can be interpreted as the probability that $y = 1$ given some arbitrary metric $x$. How can I do this, and under what assumptions on the model? Note that the observations of $x$ will cluster around $x \approx 1/2$, with extremely few observations close to the boundaries.
Some assumptions:
- $f(x)$ is monotone
- $f(x) \rightarrow 0$ when $x \rightarrow 0$, $f(x) \rightarrow 1$ when $x \rightarrow 1$
- We have tons of data
Can we do this by linear regression on some transformation of the observations?
Bin your data!
For instance, take $\{(x_i,y_i) \mid x_i \in [0.2,0.3]\}$ and call this the data chunk for that interval. With intervals of width $0.1$ you will have 10 such chunks, and the fraction of observations with $y_i = 1$ in each chunk gives you a local estimate of the probability. So the subset above would give an estimate of $f(0.25)$, which you could then use in fitting $f$.
Of course, how many intervals you divide into depends on the amount of data, and how quickly $f$ is supposed to vary. A good rule of thumb would be -- if you have $n$ data points, divide into $\sqrt{n}$ bins of $\sqrt{n}$ points each.
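Here is a minimal sketch of the $\sqrt{n}$ rule of thumb in Python with NumPy. The helper name `binned_frequencies`, the synthetic data, and the choice $f(x) = x$ as the "true" probability are all assumptions made for illustration; they are not part of the question.

```python
import numpy as np

def binned_frequencies(x, y, n_bins=None):
    """Sort by x, cut into equal-count bins, return (mean x, fraction of y == 1) per bin."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if n_bins is None:
        # Rule of thumb: about sqrt(n) bins of about sqrt(n) points each.
        n_bins = max(1, int(round(np.sqrt(len(x)))))
    order = np.argsort(x)
    x_chunks = np.array_split(x[order], n_bins)
    y_chunks = np.array_split(y[order], n_bins)
    centers = np.array([c.mean() for c in x_chunks])
    freqs = np.array([c.mean() for c in y_chunks])
    return centers, freqs

# Synthetic data (assumption): x clusters around 1/2, and P(y = 1 | x) = x.
rng = np.random.default_rng(0)
n = 10_000
x = rng.beta(5, 5, size=n)
y = (rng.random(n) < x).astype(float)

centers, freqs = binned_frequencies(x, y)
print(len(centers), "bins; mean |freq - x| =", np.mean(np.abs(freqs - centers)))
```

Equal-count bins are used here rather than equal-width ones because the $x_i$ cluster near $1/2$: equal-width bins near the boundaries would contain very few points and give noisy frequencies.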
Plotting your fitted curve $f$ over these "averaged" points will probably give you a good idea of how well you're estimating the local probability. But note that you can also find a fitting curve $f$ by just doing an ordinary least-squares fit to the raw $\{0,1\}$ data. The function that minimizes the expected squared error is the conditional mean of $y$ given $x$, and for $y \in \{0,1\}$ that conditional mean is exactly the probability that $y = 1$, so least squares is already aiming at the right target.
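As a rough illustration of fitting directly to the $\{0,1\}$ data, the sketch below fits a cubic polynomial by least squares with `np.polyfit`. The "true" curve $f(x) = 3x^2 - 2x^3$, the Beta-distributed $x$ values, and the sample size are assumptions chosen only so that the result can be checked against a known answer.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def true_f(x):
    # Hypothetical monotone curve with f(0) = 0 and f(1) = 1.
    return 3 * x**2 - 2 * x**3

# x clusters around 1/2; y is a 0/1 draw with P(y = 1 | x) = true_f(x).
x = rng.beta(2, 2, size=n)
y = (rng.random(n) < true_f(x)).astype(float)

# Ordinary least squares on the raw 0/1 labels, cubic polynomial model.
coeffs = np.polyfit(x, y, deg=3)

# Compare the fit with the true probability where most of the data lives.
grid = np.linspace(0.2, 0.8, 61)
fitted = np.polyval(coeffs, grid)
max_err = np.max(np.abs(fitted - true_f(grid)))
print(f"max |fit - true f| on [0.2, 0.8]: {max_err:.4f}")
```

The comparison is restricted to $[0.2, 0.8]$ on purpose: with few observations near the boundaries, the fit is much less well determined there.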
Note, though, that least squares will only be faithful if you give it enough degrees of freedom to fit the data. For instance: if the true function $f$ has $f(x) = 0$ for all $x \in [0,0.5)$ and $f(x) = 1$ for all $x \in [0.5,1)$, and you try to fit a linear function, you'll get something like $f_0(x) = 3x/2 - 1/4$ (this is the exact least-squares line when the $x_i$ are spread uniformly over $(0,1)$). That will fit the data decently well, but it gives negative probabilities for $x < 1/6$ and probabilities above $1$ for $x > 5/6$. So make sure to use a sufficiently flexible model, or a model that is constrained to have $f \in [0,1]$ everywhere.
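The linear example can be checked numerically. The sketch below assumes the $x$ values lie on a uniform grid over $(0,1)$ (an assumption, since the actual data cluster around $1/2$) and uses the noiseless step function as $y$, so the result is deterministic.

```python
import numpy as np

# Uniform grid of midpoints in (0, 1); no point falls exactly on 0.5.
n = 10_000
x = (np.arange(n) + 0.5) / n

# Step function: 0 on [0, 0.5), 1 on [0.5, 1).
y = (x >= 0.5).astype(float)

# Least-squares straight line; polyfit returns the highest degree first.
slope, intercept = np.polyfit(x, y, deg=1)

# The line crosses zero at x = -intercept / slope, below which it is negative.
zero_crossing = -intercept / slope
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, zero at x = {zero_crossing:.4f}")
```

The slope and intercept come out close to $3/2$ and $-1/4$, and the line is negative for $x$ below roughly $1/6$.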