Is Student's t-distribution valid when samples themselves have uncertainty - such as quantisation errors?


NB: I was going to post this on Physics Stack Exchange; I'm not really sure where it fits. But I'm only a lowly engineer, so please go easy on the notation if you can.

Using Student's t-distribution I can infer the parameters ($\mu,\sigma^2$) of a probability distribution from $n$ samples of data that I assume are drawn from a Gaussian. However, in all the examples I've seen, the $n$ samples are simple point values. How can I infer a probability distribution when my $n$ samples are not point values but probability distributions themselves? What is the effect of measurement uncertainty on the shape of the inferred distribution?

Context

I'm trying to measure how long some code takes to run on a computer. The timer is low resolution - a similar order of magnitude to the duration I'm trying to measure - so the true timestamps are quantised into 100 ms bins. Assuming a uniform (rectangular) probability distribution within each bin, the time differences have a triangular probability distribution.

e.g. a task starting at $142$ ms and ending at $331$ ms will, after quantisation, appear to start at $100\pm50$ ms and end at $300\pm50$ ms. The difference then has a triangular probability distribution, centred on $200$ ms with a half-width of $100$ ms.
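To sanity-check the triangular claim, here is a quick Monte Carlo sketch (my own illustration, not part of the question): with round-to-nearest quantisation into 100 ms bins, each timestamp picks up a Uniform$(-50, 50)$ error, and the difference of two such independent errors is triangular on $(-100, 100)$ with mean zero.

```python
import random

BIN = 100.0  # quantiser bin width in ms (assumption: round-to-nearest)

def quantise(t):
    """Round a timestamp to the nearest multiple of BIN."""
    return BIN * round(t / BIN)

random.seed(0)
errors = []
for _ in range(100_000):
    start = random.uniform(0, 1000)          # arbitrary true start time, ms
    end = start + random.uniform(0, 1000)    # arbitrary true duration, ms
    true_diff = end - start
    meas_diff = quantise(end) - quantise(start)
    errors.append(meas_diff - true_diff)

# The error of the measured difference is the sum of two independent
# Uniform(-50, 50) quantisation errors: triangular on (-100, 100), mean 0.
mean_err = sum(errors) / len(errors)
print(round(mean_err, 1))
```

The empirical mean error sits near zero and no error exceeds $\pm 100$ ms, consistent with the triangular distribution described above.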

I have several of these triangular timespan measurements, and I'd like to use them to determine the parameters of a distribution. As I say, I could just ignore the quantisation errors in my samples and plug the modal (centre) values into the t-distribution, but surely those errors will increase the uncertainty ($\sigma$) of my inferred Gaussian?

Answer

All you have to do is incorporate this uncertainty into the statistics that you are measuring. I'll give you examples using $\bar{x}$ and $s^2$, but it's not difficult to see how this generalizes.

Suppose we have some function of $n$ variables: $$f:\mathbb{R}^n\to \mathbb{R},\qquad f:(x_1,\dots,x_n)\mapsto f(x_1,\dots,x_n).$$ If each argument carries an error $\delta x_i$, then to first order (and assuming independent errors) the propagated error in $f$ is $$\delta f=\sqrt{\sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\,\delta x_i\right)^2}.$$
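As a quick numerical sketch of this propagation formula (my own illustration; the function and error values are made up), one can estimate the partial derivatives by central differences and combine the per-argument contributions in quadrature:

```python
import math

def propagate_error(f, x, dx, h=1e-6):
    """Return delta_f = sqrt(sum((df/dx_i * dx_i)^2)) using central differences."""
    total = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        partial = (f(xp) - f(xm)) / (2 * h)  # numerical df/dx_i
        total += (partial * dx[i]) ** 2
    return math.sqrt(total)

# Example: f(x1, x2) = x1 + x2 with an error of 0.05 on each argument;
# both partials are 1, so delta_f = 0.05 * sqrt(2).
err = propagate_error(lambda v: v[0] + v[1], [1.0, 2.0], [0.05, 0.05])
print(round(err, 4))  # → 0.0707
```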

Let's apply this, for example, to the sample mean. For a sample of data $x_1,\dots,x_n$ the sample mean is $$\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i.$$ Therefore, for all $i\in \{1,\dots,n\}$, $$\frac{\partial \bar{x}}{\partial x_i}=\frac{1}{n}.$$ So if each of our measurements has an associated error $\delta x_i$, the propagated error in the sample mean is $$\delta \bar{x}=\frac{1}{n}\sqrt{\sum_{i=1}^n {\delta x_i}^2}.$$ Thus, when estimating the population mean, you have to incorporate both the standard error, as usual, and whatever error you get from the above.

Let's also do the sample variance. Recall that $$s^2=\frac{1}{n-1}\sum_{j=1}^n (x_j-\bar{x})^2.$$ Differentiating with respect to $x_i$ (note that every term of the sum depends on $x_i$ through $\bar{x}$), $$\frac{\partial s^2}{\partial x_i}=\frac{2}{n-1}\left[(x_i-\bar{x})-\frac{1}{n}\sum_{j=1}^n (x_j-\bar{x})\right]=\frac{2}{n-1}(x_i-\bar{x}),$$ since $\sum_{j=1}^n (x_j-\bar{x})=0$. Thus $$\delta s^2=\frac{2}{n-1}\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2\,{\delta x_i}^2}.$$ Note: don't confuse $\delta s^2$ with $(\delta s)^2$!
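Applying the sample-mean formula to the asker's timing example might look like the sketch below. The readings and the per-sample uncertainty are illustrative assumptions, not values from the question; for the uncertainty I take the standard deviation of a triangular error on $\pm 100$ ms, which is $100/\sqrt{6}\approx 40.8$ ms.

```python
import math

# Hypothetical modal timespan readings, in ms (illustrative data).
x = [200.0, 300.0, 200.0, 400.0, 300.0]
# Assumed per-sample uncertainty: std of a triangular error on +/-100 ms.
dx = [100.0 / math.sqrt(6)] * len(x)

n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # sample variance

# Propagated error in the sample mean: (1/n) * sqrt(sum(dx_i^2)).
d_xbar = math.sqrt(sum(d ** 2 for d in dx)) / n

print(round(xbar, 1), round(s2, 1), round(d_xbar, 1))  # → 280.0 7000.0 18.3
```

Because the per-sample errors are equal here, the propagated error reduces to $\delta x/\sqrt{n}$, i.e. it shrinks with more measurements just as the standard error does.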

So to expand this answer using some of the topics mentioned in the comments: suppose we have real data points $\mathbf{x}=(x_1,\dots,x_n)$, intended to measure the position in $\Bbb{R}^n$ of some desired point $\mathbf{x}_0$. Suppose each measurement has a corresponding error $\delta x_i$, and that the $\delta x_i$ are i.i.d. random variables, each with PDF $p$ on $\Bbb{R}$. We assume the desired point $\mathbf{x}_0$ lies within the box $(x_1\pm\delta x_1,\dots,x_n\pm\delta x_n)$. The question is: how large do we expect the net error, $\Vert \mathbf{x}-\mathbf{x}_0\Vert$, to be? Basically what we need to do is define the random vector $\delta\mathbf{x}=(\delta x_1,\dots,\delta x_n)$ and take the expected value of its norm. Writing $p_{\Vert\delta\mathbf{x}\Vert}$ for the PDF of $\Vert\delta\mathbf{x}\Vert$, this is $$\mathrm{E}\left(\Vert\delta\mathbf{x}\Vert\right)=\int_0^\infty \epsilon\, p_{\Vert\delta\mathbf{x}\Vert}(\epsilon)\,\mathrm{d}\epsilon.$$ The density $p_{\Vert\delta\mathbf{x}\Vert}(\epsilon)$ is an integral in its own right, and it is quite tricky: it involves the volume of the $n$-dimensional spherical shell bounded by the radii $\epsilon$ and $\epsilon + \mathrm{d}\epsilon$, weighted at each point by the joint PDF of the $\delta x_i$. Because the $\delta x_i$ are i.i.d., the integrand has enough symmetry that it can often be reduced to a one-dimensional integral. I'll need a bit more time if I'm to iron out all the details, though.
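As a concrete sketch of this expectation, here is a Monte Carlo estimate of $\mathrm{E}(\Vert\delta\mathbf{x}\Vert)$ for one special case. The Gaussian choice of $p$ is my assumption for illustration (the answer leaves $p$ general); for i.i.d. $N(0,\sigma^2)$ errors the norm follows a chi distribution, whose mean has the closed form $\sigma\sqrt{2}\,\Gamma\!\big(\tfrac{n+1}{2}\big)/\Gamma\!\big(\tfrac{n}{2}\big)$.

```python
import math
import random

random.seed(1)
n, sigma, trials = 3, 1.0, 200_000

# Monte Carlo: average the Euclidean norm of an i.i.d. Gaussian error vector.
total = 0.0
for _ in range(trials):
    total += math.sqrt(sum(random.gauss(0, sigma) ** 2 for _ in range(n)))
mc = total / trials

# Closed form: mean of the chi distribution with n degrees of freedom.
exact = sigma * math.sqrt(2) * math.gamma((n + 1) / 2) / math.gamma(n / 2)
print(round(mc, 3), round(exact, 3))
```

The two values agree to within Monte Carlo noise, illustrating the one-dimensional reduction the answer alludes to.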