I am a data scientist at a company that takes user deposits. I want to answer the question: how likely is an account whose deposit balance has dropped to $0 - "died", in other words - to re-fund - or "resurrect" - within the next 30 days? I chose the number of days dead as my independent variable and collected data monthly with the following independent and dependent variables:
- Independent variable (d) - the number of days the account has been dead.
- Dependent variable (R) - did the account resurrect over the next 30 days? {1: yes, 0:no}
I used logistic regression to obtain a function of the form f(d) = p(R=1|d) = 1/(1 + e^(-(b1 + b2*d))), where d is the number of days dead and f(d) = p(R=1|d) is the probability of resurrection over the next 30 days given those d days dead.
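For concreteness, here is a minimal sketch of this first model in Python with scikit-learn. The data are synthetic and the "true" coefficients (0.5 and -0.02) are made up purely for illustration; the point is only the shape of the fit: a single feature d and a binary label R.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical): days dead vs. 30-day resurrection.
rng = np.random.default_rng(0)
d = rng.integers(1, 365, size=1000)
# Assumed "true" logit b1 = 0.5, b2 = -0.02, for illustration only.
p_true = 1.0 / (1.0 + np.exp(-(0.5 - 0.02 * d)))
R = rng.binomial(1, p_true)

# Fit p(R=1|d) = 1/(1 + e^(-(b1 + b2*d))).
model = LogisticRegression().fit(d.reshape(-1, 1), R)
b1, b2 = model.intercept_[0], model.coef_[0, 0]

def f(days_dead):
    """Estimated p(R=1 | d): probability of resurrection within 30 days."""
    return 1.0 / (1.0 + np.exp(-(b1 + b2 * days_dead)))
```

With enough data the fitted b2 should come out negative here, matching the intuition that longer-dead accounts are less likely to resurrect.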
This is useful, but I'd like to redefine R as whether the account resurrected over a generalized w days {1: yes, 0: no} and then estimate f(d,w) = p(R=1|d,w), the probability that the account resurrects over the next w days given that it has been dead for d days.
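To make the target concrete, one minimal way to operationalize f(d, w) is to recompute the label at several horizons w for each snapshot, giving rows (d, w, R_w), and treat w as a second feature. Everything below is a hedged sketch on synthetic data: the generative coefficients, the horizon grid, and the linear-in-(d, w) logit are all assumptions, not a claim that this is the right estimator (that is the question).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset: each row is (days dead d, horizon w, label R_w),
# where R_w = 1 if the account resurrected within the next w days.
rng = np.random.default_rng(1)
n = 2000
d = rng.integers(1, 365, size=n)
w = rng.choice([7, 14, 30, 60, 90], size=n)
# Assumed generative model for illustration: longer death lowers the
# resurrection probability, a longer window raises it.
p_true = 1.0 / (1.0 + np.exp(-(-1.0 - 0.02 * d + 0.03 * w)))
R = rng.binomial(1, p_true)

# One candidate estimator: logistic regression with (d, w) as features.
X = np.column_stack([d, w])
model = LogisticRegression().fit(X, R)

def f(days_dead, window):
    """Estimated p(R=1 | d, w) under a simple linear-in-(d, w) logit."""
    z = model.intercept_[0] + model.coef_[0] @ np.array([days_dead, window])
    return 1.0 / (1.0 + np.exp(-z))
```

Note that this simple parametrization does not enforce that f(d, w) is non-decreasing in w, even though the true quantity must be (a resurrection within 7 days is also a resurrection within 30), which is part of why I'm asking about bias.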
My question is three-part:
- What is the least biased way to estimate this function from the data?
- What is a bias-variance efficient way to estimate this function from the data?
- What is a computationally feasible way to estimate this function from the data?