I have a business problem that requires calculating the percentage drop-off across the different stages of a survey. Unfortunately, I can't share the actual business case for confidentiality reasons, so I have tried my best to explain it generically below.
The business sends out surveys in different domains (for example: Tech, Medicine, Science, Sports, etc.) to a list of individuals selected based on the domain. We have the contact details for these individuals. Individuals who are contacted as part of a survey go through 3 stages. In each stage, the individual should provide a response. If there is no response, or if the response is not satisfactory, the individual is dropped at that stage. An individual cannot get to stage 2 without passing stage 1. The stages happen in chronological order, and each runs for a random number of days. At every stage some individuals will be dropped. Individuals who remain after the 3rd stage are selected.
The business question is: if X individuals are required in stage 3 within Y days, how many individuals should be contacted at the start of the survey?
I have calculated the daily response rates for each stage in each domain based on past surveys. I have data which looks something like the below. Note: these are not real percentages, just made-up values. I have included them here to give an idea of how I'm approaching the problem.
Daily response rate for day x = (# positive responses on day x) / (total positive responses)
A few things to note about the daily response %:
1) If the past data contains 100 surveys in "Tech", then the day 0 response % is calculated as an average of the day 0 values seen in these 100 surveys. The same logic applies for each stage.
2) The daily response % of stage 2 & 3 are calculated just based on their total responses.
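To make the calculation above concrete, here is a minimal Python sketch of how I read it: each survey's daily counts are turned into shares of that survey's total positive responses, and the shares are then averaged day by day across surveys. The function names and the counts are made up for illustration, and padding shorter surveys with zeros is my own assumption.

```python
def daily_response_rates(daily_counts):
    """Share of one survey's positive responses that arrived on each day."""
    total = sum(daily_counts)
    if total == 0:
        return [0.0] * len(daily_counts)
    return [count / total for count in daily_counts]


def average_rates(surveys):
    """Average the day-by-day rates across surveys.

    Assumption: a survey that ended earlier contributes 0 for later days.
    """
    n_days = max(len(s) for s in surveys)
    rates = [
        daily_response_rates(s) + [0.0] * (n_days - len(s)) for s in surveys
    ]
    return [sum(r[d] for r in rates) / len(rates) for d in range(n_days)]


# Hypothetical stage-1 positive-response counts per day (day 0, 1, 2)
# for two past surveys in one domain.
past_surveys = [[4, 10, 6], [2, 6, 2]]
avg = average_rates(past_surveys)
print(avg)
```

The same function would be applied separately to the stage 2 and stage 3 counts, which is why each stage's percentages only sum to 100% within that stage.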
I believe there is a way to answer the business question above using the response percentages. Is this right? If yes, can someone please explain how this can be done?
I need a way to relate the daily response % of stage 3 to that of stage 1 in order to answer the business question; for now these percentages stand on their own. For "Tech", the day 0 response % for stage 2 is 2% in the table. But this is not really 2% of everyone: it is 2% of the individuals who passed stage 1 (who are themselves only 4% of the total individuals who participated in the survey). I can say what I need logically, but I need help with how this can be calculated mathematically.
If I understand your question I think what you want is this:
As of day Y, in total A (=sum of column 1 up to day Y) people were contacted initially, and of those B (=sum of column 3 up to day Y) had made it to stage 3 by that day. So the pass rate is B/A, and you would need AX/B people in total to expect to achieve X people by day Y.
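As a rough sketch of that arithmetic in Python (the per-day counts are hypothetical, days are numbered from 0, and the function name is just something I made up):

```python
def required_contacts(contacted_per_day, stage3_per_day, day_y, x_needed):
    """Expected number of people to contact so that about x_needed
    have reached stage 3 by day_y, using the historical ratio B/A."""
    a = sum(contacted_per_day[: day_y + 1])   # A: contacted up to day Y
    b = sum(stage3_per_day[: day_y + 1])      # B: reached stage 3 by day Y
    if b == 0:
        raise ValueError("nobody reached stage 3 by day Y; rate is undefined")
    # A * X / B, rounded up to a whole person using integer arithmetic.
    return (a * x_needed + b - 1) // b


# Hypothetical history: counts per day for days 0, 1, 2.
contacted = [500, 300, 200]
reached_stage3 = [0, 10, 40]

print(required_contacts(contacted, reached_stage3, day_y=2, x_needed=100))
```

Here A = 1000 and B = 50, so the pass rate is 5% and about 2000 people are needed to expect 100 at stage 3.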
Of course, that only gets you an expected value of X. If instead you want to be sure (or sufficiently confident) of getting at least X, then you'll need more people. How many more is a more complicated question.
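One simple way to get a feel for "how many more" is to assume (and this is purely my assumption, not something your data guarantees) that each contacted person independently reaches stage 3 by day Y with probability p = B/A, so the number who make it is binomial. You can then search for the smallest number of contacts that gives at least X successes with the confidence you want. This sketch ignores the uncertainty in p itself, so treat the result as optimistic.

```python
import math


def prob_at_least(n, p, x):
    """P(Binomial(n, p) >= x), computed as 1 - P(Binomial <= x - 1)."""
    if x <= 0:
        return 1.0
    if x > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    cdf = 0.0
    for k in range(x):
        # Log of the binomial pmf, to avoid overflow for large n.
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q
        )
        cdf += math.exp(log_pmf)
    return max(0.0, 1.0 - cdf)


def contacts_for_confidence(p, x, confidence=0.95):
    """Smallest n with P(at least x reach stage 3) >= confidence.

    Assumes 0 < p < 1 and independent individuals.
    """
    n = x  # fewer than x contacts can never yield x successes
    while prob_at_least(n, p, x) < confidence:
        n += 1
    return n


# Hypothetical pass rate of 5% and a target of 100 people at stage 3.
n_needed = contacts_for_confidence(p=0.05, x=100, confidence=0.95)
print(n_needed)
```

With these made-up numbers the answer comes out above the 2000 you'd get from the expected-value calculation, which is the extra margin you pay for the confidence level.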
And on the actual data you show, you're stuffed, because pretty much no-one made it to stage 3 at all and so both the number required and the uncertainty in that estimate are huge. But hopefully your real data is nicer than that.