Computation of the mean of a random variable to estimate algorithm complexity


I wrote an incremental algorithm whose complexity I would like to evaluate. The algorithm works with a sliding window of size $n$.

To study the complexity, the window is considered full, and the data present are $\{x_{1},\dots,x_{n}\}$. They are assumed to be i.i.d. random variables, with no assumption regarding their distribution; in particular, they are not necessarily bounded. Finally, the most recent data point is $x_{n+1}$.

The complexity of the algorithm depends on the distance from $x_{n+1}$ to its “nearest lower value” $x_{l}$, whose index is $l=\max\left\{ k\in\{1,\dots,n\}\ \text{s.t.}\ x_{k}<x_{n+1}\right\}$. The distance $d$ is then $d=n+1-l$.
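For concreteness, here is a minimal Python sketch (the function name and example values are mine, not from the original algorithm) of how $l$ and $d$ are obtained once the window is full, with the convention $d=n+1$ when no lower value exists:

```python
def distance_to_nearest_lower(window, x_new):
    """Return d = n + 1 - l, where l is the largest 1-based index k
    with window[k] < x_new; if no such k exists, return n + 1."""
    n = len(window)
    for k in range(n, 0, -1):          # scan from most recent to oldest
        if window[k - 1] < x_new:      # found the nearest lower value
            return n + 1 - k
    return n + 1                       # x_new is the minimum of all n+1 values

# Example: window x_1..x_5 = [3, 7, 2, 9, 8], new value x_6 = 5.
# The nearest lower value is x_3 = 2, so l = 3 and d = 6 - 3 = 3.
print(distance_to_nearest_lower([3, 7, 2, 9, 8], 5))  # -> 3
```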

$d$ is obviously a random variable, and I would like to determine its mean, which would give the average-case complexity of the algorithm.

I first thought $d$ was a geometric variable with parameter $1/2$, since $P\left(x_{n+1}\geq x_{n}\right)=1/2$ (which I cannot prove either), which would give $E\left[d\right]=2$; but that is not the case. Indeed, the events $\{x_{n+1}\geq x_{n}\},\{x_{n+1}\geq x_{n-1}\},\dots$ are not independent, so $d$ does not follow a geometric law. Moreover, I ran some experiments in Excel to estimate $E\left[d\right]$ under different distributions, and it seems that it does not depend on the distribution of the $x_{i}$, only on $n$ (I cannot post the image, but it confirms this).
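This observation is easy to reproduce; below is a short Monte Carlo sketch (my own code, not part of the original Excel experiment) estimating $E[d]$ for a few distributions. The three estimates should agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_d(draw, n):
    """Draw x_1..x_{n+1} i.i.d. and return d = n + 1 - l,
    or n + 1 when no value is below x_{n+1}."""
    x = draw(n + 1)
    lower = np.nonzero(x[:n] < x[n])[0]   # 0-based indices k-1 with x_k < x_{n+1}
    return n + 1 - (lower[-1] + 1) if lower.size else n + 1

n, trials = 50, 100_000
for name, draw in [("uniform",     lambda m: rng.random(m)),
                   ("normal",      lambda m: rng.normal(size=m)),
                   ("exponential", lambda m: rng.exponential(size=m))]:
    est = np.mean([sample_d(draw, n) for _ in range(trials)])
    print(f"{name:12s} E[d] ~ {est:.3f}")
```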

Below is an approach I tried in order to determine $E\left[d\right]$, which proved unsuccessful:

$E\left[d\right]=\sum_{k=1}^{n}k\,P\left(d=k\right)$

and

$\begin{eqnarray*} P\left(d=k\right) & = & P\left(x_{n-k}<x_{n+1}\cap x_{n-k+1}\geq x_{n+1}\cap\dots\cap x_{n}\geq x_{n+1}\right)\\ & = & \int_{\lambda_{1}<\lambda_{2}}P\left(x_{n-k}<\lambda_{1}\cap x_{n+1}\in\left[\lambda_{1};\lambda_{2}\right]\cap\bigcap_{j=n-k+1}^{n}\left\{ x_{j}\geq\lambda_{2}\right\} \right)d\lambda_{1}d\lambda_{2}\\ & = & \int_{\lambda_{1}<\lambda_{2}}P\left(x_{n-k}<\lambda_{1}\right)P\left(x_{n+1}\in\left[\lambda_{1};\lambda_{2}\right]\right)\prod_{j=n-k+1}^{n}P\left(x_{j}\geq\lambda_{2}\right)d\lambda_{1}d\lambda_{2}\\ & = & \int_{\lambda_{1}<\lambda_{2}}F_{X}\left(\lambda_{1}\right)\left(F_{X}\left(\lambda_{2}\right)-F_{X}\left(\lambda_{1}\right)\right)\left(1-F_{X}\left(\lambda_{2}\right)\right)^{k}d\lambda_{1}d\lambda_{2} \end{eqnarray*}$

But I cannot go any further. Do you have any idea? Thanks in advance!


There are 2 answers below.

BEST ANSWER

I just gave an answer at https://mathoverflow.net/questions/139706/computation-of-the-mean-of-a-random-variable-to-estimate-algorithm-complexity, which almost agrees with that of Wild Chan. I repost it here for comparison.

....

It seems to me that $P(d>k)=\tfrac1{k+1}$ for $k\in\{0,\dots,n\}$, so that $$E[d]=\sum_{k\ge 0}P(d>k)=\sum_{k=0}^{n}\frac{1}{k+1}=H_{n+1}\simeq \ln n.$$ The reason is that the last $k+1$ terms of the sequence are in exchangeable (random) order, so the last one is the smallest among them with probability $\tfrac1{k+1}$, and $d>k$ exactly when $x_{n+1}$ is that minimum. When the last term is the smallest of all, I assume that $d$ is set equal to $n+1$. Does that make sense?
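This tail formula is easy to test numerically; here is a small sketch of mine (uniform data, so the no-atoms condition below holds) comparing the empirical $P(d>k)$ with $1/(k+1)$, using the $d=n+1$ convention from above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 200_000
ds = np.empty(trials, dtype=int)
for t in range(trials):
    x = rng.random(n + 1)                      # atomless distribution
    lower = np.nonzero(x[:n] < x[n])[0]
    ds[t] = n + 1 - (lower[-1] + 1) if lower.size else n + 1

for k in range(n + 1):
    # empirical P(d > k) vs predicted 1/(k+1)
    print(k, round((ds > k).mean(), 4), round(1 / (k + 1), 4))
```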

For this to be valid, you need to assume that the probability distribution has no atoms, which is true for instance if it has a probability density.

.....

Note that I obtain implicitly, for $1\le k\le n$, $$P(d=k)=P(d>k-1)-P(d>k)=\frac1k-\frac1{k+1}=\frac{1}{k(k+1)}$$ and $$P(d=n+1)=P(d>n)=\frac{1}{n+1},$$ to be compared with the solution by Wild Chan, $$P(d=k)=\frac{1}{k^2+3k+2}=\frac{1}{(k+1)(k+2)}$$ for $0\le k\le n-1$, which is not a probability distribution, for it does not add up to one. There is also a shift due to different understandings of the definition of $d$: mine is that the minimum value of $d$ is $1$, while the other solution seems to assume $P(d=0)=1/2$. This explains the difference of $-2+\tfrac1{n+1}$ between the two solutions.
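These claims can be verified exactly with rational arithmetic; a sketch using Python's fractions module (variable names are mine):

```python
from fractions import Fraction

n = 10
# My distribution: P(d=k) = 1/(k(k+1)) for 1 <= k <= n, P(d=n+1) = 1/(n+1)
p = {k: Fraction(1, k * (k + 1)) for k in range(1, n + 1)}
p[n + 1] = Fraction(1, n + 1)
assert sum(p.values()) == 1                  # a genuine probability distribution

H = lambda m: sum(Fraction(1, j) for j in range(1, m + 1))   # harmonic numbers
E_mine = sum(k * pk for k, pk in p.items())
assert E_mine == H(n + 1)                    # E[d] = H_{n+1}

# Wild Chan's expectation, and the predicted gap of -2 + 1/(n+1)
E_wc = sum(Fraction(k, k * k + 3 * k + 2) for k in range(n))
assert E_wc - E_mine == -2 + Fraction(1, n + 1)
```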

Also note a shorter computation, using $F(1-F)^{k}=(1-F)^{k}-(1-F)^{k+1}$: $$\begin{eqnarray*} \int_{-\infty}^\infty F'(y)F(y)(1-F(y))^kdy &=& \int_{-\infty}^\infty F'(y)(1-F(y))^kdy - \int_{-\infty}^\infty F'(y)(1-F(y))^{k+1}dy \\ &=& \frac1{k+1}-\frac1{k+2}, \end{eqnarray*}$$ which agrees with Wild Chan's result too.
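The substitution $u=F(y)$, $du=F'(y)\,dy$ turns these into Beta-type integrals over $[0,1]$; a quick symbolic check with SymPy (my sketch, for small integer $k$):

```python
import sympy as sp

u = sp.symbols('u', positive=True)
for k in range(6):
    # int F'(y) F(y) (1 - F(y))^k dy  =  int_0^1 u (1 - u)^k du
    val = sp.integrate(u * (1 - u)**k, (u, 0, 1))
    assert val == sp.Rational(1, k + 1) - sp.Rational(1, k + 2)
print("checked k = 0..5")
```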

ANSWER (by Wild Chan)

I think $P(d=k)$ should be calculated via the joint distribution. Writing $y$ for $x_{n+1}$ and $\varphi=F'$ for the common density, $$\begin{eqnarray*} P(d=k) &=& P(x_{n-k}<y\ \cap x_{n-k+1}\geq y\ \cap\ldots\cap x_n\geq y) \\ &=& \int_{x_{n-k}<y,\ x_{n-k+1}\geq y,\ldots,x_n\geq y}\varphi(x_1)\cdots\varphi(x_n)\varphi(y)\,dx_1\ldots dx_n\,dy \\ &=&\int_{-\infty}^\infty \varphi(y)\,dy \left(\int_{-\infty}^y\varphi(x_{n-k})dx_{n-k}\int_y^\infty\varphi(x_{n-k+1})dx_{n-k+1}\ \ldots\ \int_y^\infty\varphi(x_n)dx_n\right) \\ &=&\int_{-\infty}^\infty F'(y)F(y)(1-F(y))^kdy \\ &=&\int_{-\infty}^\infty F'\left(\sum_{i=0}^k(-1)^i\binom{k}{i}F^{i+1}\right)dy \\ &=&\int_{-\infty}^\infty \left(\sum_{i=0}^k\frac{(-1)^i\binom{k}{i}}{i+2}F^{i+2}\right)'dy \\ &=&\sum_{i=0}^k\frac{(-1)^i\binom{k}{i}}{i+2}=\frac{1}{k^2+3k+2} \end{eqnarray*}$$
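The final step rests on the identity $\sum_{i=0}^{k}\frac{(-1)^{i}\binom{k}{i}}{i+2}=\frac{1}{(k+1)(k+2)}$ (a Beta integral in disguise); a quick exact check (my own sketch):

```python
from fractions import Fraction
from math import comb

for k in range(10):
    s = sum(Fraction((-1)**i * comb(k, i), i + 2) for i in range(k + 1))
    assert s == Fraction(1, k * k + 3 * k + 2)   # = 1/((k+1)(k+2))
print("identity holds for k = 0..9")
```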

Maybe clumsy... And then the expectation is obtained: $$E[d]=\sum_{k=0}^{n-1}\frac{k}{k^2+3k+2}=H_{n+1}+\frac{1}{n+1}-2.$$ So the asymptotic complexity of the algorithm is $\Theta(\ln n)$.
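For completeness, the closed form follows from a partial-fraction decomposition and two telescoping harmonic sums:

$$\frac{k}{(k+1)(k+2)}=\frac{2}{k+2}-\frac{1}{k+1},\qquad\text{so}\qquad \sum_{k=0}^{n-1}\frac{k}{(k+1)(k+2)}=2\left(H_{n+1}-1\right)-H_{n}=H_{n+1}+\frac{1}{n+1}-2.$$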


In addition, if we use symmetry and instead calculate $$P(x_{n-k}>y\ \cap x_{n-k+1}\leq y\ \cap\ldots\cap x_n\leq y)=\int_{-\infty}^\infty F'(y)(1-F(y))F(y)^k\,dy,$$ we can get the answer much faster.