Regression with error coming from rounding


I am looking at the following model:

$c$ is a fixed vector in $\mathbb{R}_+^n$, and for any $x \in \mathbb{R}_+^n$ we observe the value $y = [c^Tx]$, i.e. $c^Tx$ rounded to the nearest integer.

I want to determine $c$ based on observations of $x_1, \ldots, x_k$ and $y_1, \ldots, y_k$. (Before the edit: $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$.)

Of course I could simply do least squares, but then I do not know how to get a good measure of the error. I think one could somehow exploit the fact that the error is not really random but comes from rounding, and that it should be possible to measure the quality of the resulting vector $c$.

So here is the question:

Do you know a method to determine $c$ together with a bound on the error in $c$, i.e. an interval for each coefficient of $c$ in which the true value is guaranteed to lie? (Of course I'd prefer methods with small error bounds.)

EDIT: I fixed a typo: I may have many more observations $k$ than the dimension $n$ of the vectors. My hope is that increasing $k$ yields better bounds.

There are 2 answers below.

Answer 1:

For a limited number of levels of $y$, try using multinomial probit. The idea is that for $n - 0.5 < y < n + 0.5$ you observe $n$; treat this as a categorical dependent variable.
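A minimal sketch of this idea (my own illustration, not the answerer's exact recipe): since the rounding thresholds are known, each "category" $n$ corresponds to the fixed interval $(n - 0.5, n + 0.5)$, so the probit likelihood reduces to interval regression with known cutpoints. All data and the fixed latent scale below are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic problem: observe y = round(c . x), try to recover c.
rng = np.random.default_rng(0)
n_vars, n_obs = 3, 400
c_true = np.array([1.3, 0.7, 2.1])            # unknown coefficient vector
X = rng.uniform(0.0, 5.0, size=(n_obs, n_vars))
y = np.round(X @ c_true)                      # observed rounded responses
scale = 0.5                                   # fixed latent scale (a modelling choice)

def neg_log_lik(c):
    mu = X @ c
    # P(observe y_k) = Phi((y_k + 0.5 - mu_k)/scale) - Phi((y_k - 0.5 - mu_k)/scale),
    # i.e. a probit with known cutpoints at the rounding boundaries.
    p = norm.cdf((y + 0.5 - mu) / scale) - norm.cdf((y - 0.5 - mu) / scale)
    return -np.sum(np.log(np.clip(p, 1e-12, None)))

c0 = np.linalg.lstsq(X, y, rcond=None)[0]     # least-squares starting point
res = minimize(neg_log_lik, c0, method="Nelder-Mead")
c_hat = res.x                                 # estimate of c
```

With many observations the estimate lands close to `c_true`; the fixed `scale` only smooths the objective and is not part of the original model.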

Hope this helps!

Answer 2:

There are several methods for adjusting estimated regression coefficients for rounding errors in the data. Most of these techniques adjust the main diagonal of the sample covariance matrix of the variables. One of the most commonly used is the so-called Sheppard's correction. It is based on a Taylor expansion of the likelihood function, which ultimately leads to subtracting a quantity (depending on the intervals used in rounding) from the diagonal of the covariance matrix.

To apply the correction: let us consider a regression model with $M$ independent variables and $N$ observations given by

$$y_k=\alpha +\beta_1x_{1k} + \beta_2x_{2k} + \cdots + \beta_M x_{Mk} +\epsilon_k$$

with $k=1,2,\ldots, N$, and where the errors $\epsilon_k$ are normally distributed. The uncorrected least-squares estimate of the regression coefficients is given by $\displaystyle \hat \beta=(SS_{xx})^{-1}(SS_{xy})$, where $SS_{xx}$ is the $M\times M$ covariance matrix of the independent variables, $SS_{xy}$ is the $M\times 1$ matrix relating the independent variables to the dependent variable, and $\hat \beta=(\hat \beta_1, \hat \beta_2, \ldots, \hat \beta_M)^T$. In $SS_{xx}$, the term in row $i$ and column $j$ is given by $\displaystyle s_{ij}=\frac{1}{N} \sum_{k=1}^N x_{ik}x_{jk}$, whereas the terms of $SS_{xy}$ are given by $\displaystyle r_{i}=\frac{1}{N} \sum_{k=1}^N x_{ik}y_k$.
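As a quick numeric sanity check of these definitions (on invented data, with the intercept dropped for brevity), the uncorrected estimate can be computed directly from the cross-moment matrices, written here as $(SS_{xx})^{-1}SS_{xy}$ so the dimensions line up:

```python
import numpy as np

# Synthetic, noise-only (no rounding yet) regression data.
rng = np.random.default_rng(1)
M, N = 2, 500
beta_true = np.array([2.0, -1.0])
X = rng.normal(size=(N, M))
y = X @ beta_true + 0.1 * rng.normal(size=N)

SS_xx = (X.T @ X) / N          # M x M: s_ij = (1/N) sum_k x_ik x_jk
SS_xy = (X.T @ y) / N          # M x 1: r_i  = (1/N) sum_k x_ik y_k
beta_hat = np.linalg.solve(SS_xx, SS_xy)   # uncorrected LSE
```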

If there is rounding error, each observation $x_{ik}$ can be rewritten as $\displaystyle x_{ik}=\mathring {x}_{ik}+z_{ik}$, where $\displaystyle \mathring x_{ik}$ and $\displaystyle z_{ik}$ are the rounded value and the corresponding rounding error of the $\displaystyle k^{th}$ item of the $\displaystyle i^{th}$ independent variable, respectively. Similarly, each observation $\displaystyle y_{k}$ can be rewritten as $\displaystyle y_{k}=\mathring {y}_{k}+z_{0k}$, where $\displaystyle \mathring y_{k}$ and $\displaystyle z_{0k}$ are the rounded value and the rounding error of the $\displaystyle k^{th}$ item of the dependent variable. Let us consider the covariance matrix of the independent variables obtained from the rounded data, $\displaystyle \mathring {s}_{ij}=\frac{1}{N} \sum_{k=1}^N \mathring {x}_{ik}\mathring{x}_{jk}$. Calling $\displaystyle w_i$ the bin width used to round the $i^{th}$ variable, if we assume that the rounding errors are independent of the true (unrounded) data and uniformly distributed over the symmetric interval of rounding precision $\displaystyle (-\frac{w_i}{2}, \frac{w_i}{2})$, then the correct terms $\displaystyle s_{ij}$ of the covariance matrix can be estimated from $\displaystyle \mathring {s}_{ij}$ as

$$s_{ij}=\frac{1}{N} \sum_{k=1}^N \left( \mathring {x}_{ik} \mathring {x}_{jk} + \mathring{x}_{ik} z_{jk} + \mathring{x}_{jk} z_{ik} + z_{ik} z_{jk} \right)$$

which leads to

$$s_{ij}= \left\{ \begin{array}{rl} \mathring s_{ij} -\frac{1}{12}w_i^2 + \mathcal{O}(1) &\mbox{ if $i=j$} \\ \mathring s_{ij} + \mathcal{O}(1) &\mbox{ otherwise} \end{array} \right.$$

Applying the same procedure to the matrix of dependent vs independent variables, we get

$$\displaystyle r_i=\mathring r_i + \mathcal{O}(1)$$

where $\displaystyle \mathring r_i=\frac{1}{N} \sum_{k=1}^N \mathring {y}_{k}\mathring{x}_{ik}$ is the covariance of the dependent vs independent variables obtained from the rounded data.

The formulas above allow the correction for rounding to be obtained directly from the rounded data. In practice, to perform your analysis:

  • build an initial covariance matrix $\displaystyle SS'_{xx}$ for the independent variables using the rounded data, where each term $\displaystyle \mathring s_{ij}$ is given by $\displaystyle \frac{1}{N} \sum_{k=1}^N \mathring{x}_{ik}\mathring{x}_{jk}$;

  • build the covariance matrix $\displaystyle SS_{xy}$ for the dependent vs independent variables using the rounded data, where each term $\displaystyle \mathring {r}_{i}$ is given by $\displaystyle \frac{1}{N} \sum_{k=1}^N \mathring{y}_{k}\mathring{x}_{ik}$;

  • using Sheppard's correction, modify the initial matrix $\displaystyle SS'_{xx}$ by subtracting $\displaystyle \frac{1}{12}w_i^2$ from each term of the main diagonal, so that you get the corrected matrix $\displaystyle SS_{xx}$;

  • calculate the regression coefficients as $\displaystyle \hat \beta=(SS_{xx})^{-1}(SS_{xy})$, as in ordinary least-squares regression.
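The four steps above can be sketched on synthetic data as follows; the bin widths, coefficients, and noise level are arbitrary choices for illustration, and the comparison with the uncorrected estimate is included only to show the direction of the effect:

```python
import numpy as np

# Generate true regressors, round them to bin width w, then apply
# Sheppard's correction to the cross-moment matrix before solving.
rng = np.random.default_rng(2)
M, N = 2, 20000
beta_true = np.array([1.5, -0.8])
w = np.array([2.0, 2.0])                      # rounding bin widths per variable
X = rng.normal(0.0, 3.0, size=(N, M))         # true (unrounded) regressors
y = X @ beta_true + 0.2 * rng.normal(size=N)
Xr = np.round(X / w) * w                      # rounded regressors, as observed

SSp_xx = (Xr.T @ Xr) / N                      # step 1: rounded covariance SS'_xx
SS_xy = (Xr.T @ y) / N                        # step 2: rounded cross-covariance
SS_xx = SSp_xx - np.diag(w**2 / 12.0)         # step 3: subtract w_i^2/12 on diagonal
beta_hat = np.linalg.solve(SS_xx, SS_xy)      # step 4: corrected coefficients

beta_naive = np.linalg.solve(SSp_xx, SS_xy)   # uncorrected, for comparison
```

The uncorrected estimate is attenuated toward zero because rounding inflates the diagonal of the covariance matrix; the correction removes most of that bias.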

Since this correction simply amounts to using a corrected matrix, we can estimate the standard error of each regression coefficient by the usual procedure of ordinary LSE. The possibility of calculating a measure of error for the regression parameters is important for assessing the quality of the resulting vector, and for obtaining an interval for each coefficient in which the true value is highly likely to lie (as correctly pointed out in the question). In this regard, recall that in linear regression the parameter estimates are normally distributed with mean equal to the true regression parameter and covariance matrix $\displaystyle \Sigma = s^2(A'A)^{-1}$, where $\displaystyle s^2$ is the residual variance and $A$ is the design matrix defined by the regression model equation ($\displaystyle A'$ is its transpose). From this formula the standard error of each regression coefficient can be calculated, and from that we easily get the corresponding 95% confidence intervals.
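A short sketch of that interval computation on invented data (no rounding here, just the normal-theory formula $\Sigma = s^2(A'A)^{-1}$ and the usual $\pm 1.96\,\mathrm{SE}$ intervals):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 2, 300
beta_true = np.array([0.5, 1.2])
A = rng.normal(size=(N, M))                   # design matrix (no intercept)
y = A @ beta_true + 0.3 * rng.normal(size=N)

beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta_hat
s2 = resid @ resid / (N - M)                  # residual variance s^2
cov = s2 * np.linalg.inv(A.T @ A)             # Sigma = s^2 (A'A)^{-1}
se = np.sqrt(np.diag(cov))                    # standard error per coefficient
ci = np.stack([beta_hat - 1.96 * se,          # 95% confidence intervals
               beta_hat + 1.96 * se], axis=1)
```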

Note that, as already highlighted, other methods can be used. For example, if we instead assume that the rounding errors are independent of the rounded data, the same calculations lead to an alternative correction (the so-called BRB correction), similar to Sheppard's but in which the term $\frac{1}{12}w^2$ is added to, rather than subtracted from, the main diagonal of the $SS_{xx}$ matrix. However, current evidence suggests that Sheppard's correction performs better than other techniques. Moreover, whatever method is used, the adjustments often lead to considerably larger 95% CIs for the regression coefficients than would be obtained from unrounded data. Lastly, while the sampling standard deviations decrease in proportion to $1/\sqrt{N}$ as the sample size $N$ increases, the errors of the rounding adjustment usually do not decrease with $N$.