Regression vs. Normal Distribution

89 Views Asked by At

I have to estimate something using historical data. Should I find the equation of the curve of best fit to estimate? Or use a confidence interval, standard deviation, and a z-score to calculate it? Conceptually, which is more accurate. What are the benefits? I just don't know what approach to take.

I have some data about a number of projects that have been categorized. For each project, I have the proportion of the total project length that was spent on testing. For each category, I have to estimate how long testing would likely take for any project of that category. I was wondering how best to solve this. The independent variable is, I suppose, the total length of each project. Is the dependent variable the absolute time spent on testing, or the % of time spent on testing out of the total length of the project?

I've plotted absolute time spent on testing against total length of the project and that has a weak correlation. R^2 = 0.3

1

There are 1 best solutions below

0
On

Have you tried plotting data in each category separately?

Regression is not, strictly speaking, appropriate in this case. Let $X$ denote the total length of a project, $1_K$ where $K\in \{2,3,4,5\}$ denote the category indicators and $Y$ denote the total time spent in testing. It is clear that $Y\in [0,X]$. If you run a regression (for example)

$$Y \sim \beta_0 + \sum_{j=2}^5 \alpha_j 1_j + \beta_1 X + \beta_2 X^2$$

then the predicted value $\hat{Y}$ may be outside of the interval $[0,X]$. This leads to a number of problems: nonsensical predictions are possible, predictions are made based on the assumption of normal errors, which is clearly violated in this case, etc.

The right statistical tool for this data is the beta regression. I recommend using R (excellent free statistical package). This reference should help you to get started: http://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf

P.S. If the plot of $Y$ against $X$ in each category separately looks like a blob, why not proceed as I suggested earlier? Bayesian approach allows you to make clear probability statements about your prediction.