By simulation we create a vector $Y = (y_1,y_2,\dots,y_n)$, where each $y_i \in \mathbb{R}$ is drawn independently from a given non-degenerate distribution.
Next, again by simulation, we create a vector $\xi = (\xi_1,\xi_2,\dots,\xi_n)$, where the $\xi_i$ are independent realizations of a random variable taking only a finite number of values $\{\alpha_1,\alpha_2,\dots,\alpha_k\}$ with probabilities $p_1,p_2,\dots,p_k$ respectively. The $\alpha_j$ are given.
Suppose also that we have a function $f: \mathbb{R} \to \mathbb{R}$.
We regress $\begin{bmatrix} f(y_1+\xi_1) \\ f(y_2+\xi_2) \\ \vdots \\ f(y_n+\xi_n) \end{bmatrix}$ on $\begin{bmatrix} f(y_1+\alpha_1) & f(y_1+\alpha_2) & \cdots & f(y_1+\alpha_k) \\ f(y_2+\alpha_1) & f(y_2+\alpha_2) & \cdots & f(y_2+\alpha_k)\\ \vdots & \vdots & \ddots & \vdots \\ f(y_n+\alpha_1) & f(y_n+\alpha_2) & \cdots & f(y_n+\alpha_k) \end{bmatrix}.$
By regression I mean that we choose $\beta_1,\dots,\beta_k$ to minimize $\sum_{i=1}^n\bigl(f(y_i+\xi_i)-\sum_{j=1}^k\beta_j f(y_i+\alpha_j)\bigr)^2$.
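As a concrete instance of this setup, the regression can be sketched in a few lines of NumPy. The distribution of $y_i$, the function $f$, and the $\alpha_j$, $p_j$ below are illustrative choices, not part of the question:

```python
# Sketch of the regression setup with hypothetical parameter choices.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
alphas = np.array([1.0, 2.0, 3.0])            # the given alpha_j
probs = np.array([0.2, 0.5, 0.3])             # the probabilities p_j
f = np.tanh                                   # some f: R -> R, applied elementwise

y = rng.normal(0.0, 1.0, size=n)              # the vector Y
xi = rng.choice(alphas, size=n, p=probs)      # realizations of xi

target = f(y + xi)                            # response: f(y_i + xi_i)
design = f(y[:, None] + alphas[None, :])      # n x k matrix: f(y_i + alpha_j)
beta, *_ = np.linalg.lstsq(design, target, rcond=None)
print(beta)                                   # the fitted beta_j
```

Whether `beta` approaches `probs` as `n` grows is exactly the question being asked.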
Intuitively, I think that as $n \to \infty$ the least-squares procedure should give us the following equation:
$f(Y + \xi) = p_1\, f(Y+\alpha_1) + p_2\, f(Y+\alpha_2) + \dots + p_k\, f(Y+\alpha_k),$
where $f(Y + \xi)$ and $f(Y+\alpha_j)$ denote the column vectors above.
So my conjecture is that as $n \to \infty$, $\beta_j \to p_j$.
My question is: what conditions should be imposed on the function $f$ to get the equation above? Is my intuition correct that we should normally get such an equation? Perhaps we also need to impose some conditions on the distribution of the $y_i$.
UPDATE 2019-05-24:
Oh, I just realized (I don't know what took me so long) that when $f$ is linear, the matrix has rank $2$! E.g., for $f(x) = x$, the matrix equals $Y 1_k^T + 1_n \alpha^T$, where $1_m$ denotes the column vector of $m$ ones. Since $\operatorname{rank}(Y 1_k^T) = \operatorname{rank}(1_n \alpha^T) = 1$, the sum has rank at most $2$. (And it will have rank exactly $2$ because $Y$ is randomly generated.) Based on this alone, when $k>2$ there are leftover degrees of freedom, so there is no reason to expect $\beta_j \to p_j$.
To be more explicit: there is a subspace of dimension $k-2$ in the choice of the $\beta$ vector, and every choice of $\beta$ in this subspace results in the same $\sum_{j=1}^k \beta_j (Y + \alpha_j)$ and therefore the same summed squared error! Exactly which choice gets picked is left to the implementation details of the least-squares package.
To conclude: when $f()$ is linear the conjecture is false. I'm not sure yet whether the conjecture can be true for some non-linear $f()$.
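The rank claim is easy to verify numerically. A quick sketch with illustrative values ($k = 5$, $y_i \sim N(10,1)$, which are not forced by the argument):

```python
# Numerical check that the design matrix has rank 2 when f is the identity.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
alphas = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = rng.normal(10.0, 1.0, size=n)

# With f(x) = x, entry (i, j) is y_i + alpha_j, i.e. Y 1_k^T + 1_n alpha^T.
design = y[:, None] + alphas[None, :]
print(np.linalg.matrix_rank(design))   # prints 2, despite having k = 5 columns
```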
UPDATE 2019-05-23:
I still don't know under what conditions your conjecture will hold, but here is another case where it doesn't. I tried:
$y_i \sim N(10, 1)$ i.i.d.
$\{\alpha_1, ..., \alpha_5\} = \{1,2,3,4,5\}$ equiprobable
$f(x) = x$ i.e. identity function
repeated runs with $n=10^5, 10^6,$ etc.
Simulation result: the final $\beta_j$ are not $\approx 0.2$.
Since $y_i \gg \alpha_j$, the optimal $\beta$s must have $\sum_{j=1}^5 \beta_j \approx 1$, and the simulation results support that. However, individual $\beta_j$ can be very different from $0.2$. Indeed, in some runs we have $|\beta_j| \approx 10^{11}$, but some are positive and some are negative, and $\sum \beta_j \approx 1$. Geometrically, what seems to have happened is that the $5$ different $f(Y + \alpha_j)$ are not parallel (contrast this with my Example $1$ below), but they are almost parallel, since $y_i \gg \alpha_j$. So when you try to write $f(Y+\xi)$ as a linear combination of $5$ almost-parallel vectors, tiny differences may get exaggerated in the name of minimizing the summed squared error.
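One way to quantify "almost parallel" is the condition number of the design matrix. A sketch using the same illustrative setup as this update ($f(x) = x$, $y_i \sim N(10,1)$, $\alpha_j \in \{1,\dots,5\}$); since the columns here are in fact exactly linearly dependent, the condition number is astronomically large:

```python
# Measuring the "almost parallel columns" effect via the condition number.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(10.0, 1.0, size=100_000)
alphas = np.arange(1.0, 6.0)

design = y[:, None] + alphas[None, :]   # f(x) = x
# A huge condition number means tiny perturbations in the fit can blow up beta.
print(np.linalg.cond(design))
```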
Another way to look at this is that the error contribution from row $i$ is
$$\delta_i = (y_i (1 - \sum_{j=1}^5 \beta_j) + (\xi_i - \sum_{j=1}^5 \beta_j \alpha_j))^2$$
Roughly speaking, $\sum \beta_j = 1$ would zero out the first term, while $\sum \beta_j \alpha_j = E[\xi_i]$ would minimize the second term. However, with $5$ different $\beta_j$ and only $2$ equations, once again there is a lot of freedom left. On any particular run, the extra freedom might be used to "overfit" the data, and therefore there is no guarantee that $\beta_j$ will converge to the "nominal" solution of $\beta_j = p_j$.
[Python code available if you're interested]
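Since the code is not reproduced in the post, here is a minimal hypothetical reconstruction of the experiment described above. Note that `np.linalg.lstsq` returns the minimum-norm minimizer, so the particular $\beta$ you see is solver-dependent, exactly as noted in the 2019-05-24 update; a solver that forms the (singular) normal equations can instead return the huge coefficients mentioned above.

```python
# Hypothetical reconstruction of the simulation:
# y_i ~ N(10, 1), alphas = {1,...,5} equiprobable, f(x) = x, n = 10^5.
import numpy as np

rng = np.random.default_rng(3)
n = 10**5
alphas = np.arange(1.0, 6.0)
y = rng.normal(10.0, 1.0, size=n)
xi = rng.choice(alphas, size=n)          # equiprobable draws

design = y[:, None] + alphas[None, :]    # f = identity
target = y + xi
beta, *_ = np.linalg.lstsq(design, target, rcond=None)

# Which minimizer you get depends on the solver (the design is rank 2),
# but every near-minimizer has sum(beta) close to 1, as argued above.
print(beta, beta.sum())
```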
Partial answer / too long for a comment
You're interested in $n \to \infty$, but I feel there is some weirdness with $Y$ being unspecified. That is, it feels a little odd to me to say both "$Y \in \mathbb{R}^n$ is given" and "$n \to \infty$". Is $Y_{(n)} \in \mathbb{R}^n$ given for every $n$?
If you're conjecturing the convergence for some (i.e. given) infinite sequence $\mathbb{Y}= (Y_{(1)}, Y_{(2)}, \dots, Y_{(n)}, \dots)$ where $Y_{(n)} \in \mathbb{R}^n$, then it has a chance to be true, but my Example $1$ still shows it's potentially false (depending on your interpretation).
If you're conjecturing the convergence for all infinite sequences $\mathbb{Y}$, then I'd think the conjecture is false, simply because an adversary can choose each $Y_{(n+1)}$ to be sufficiently different from $Y_{(n)}$ so that the $\beta$s do not converge at all. My Example $2$ below is an informal attempt to show this.
As yet another (perhaps more natural?) alternative, you might actually have a distribution for $y_i$ in mind, say $N(0,1)$, and as $n$ increases you just keep adding another $y_i$ i.e. another row to the regression. This case... I'm not so sure, but my guess is that for linear $f$ the conjecture is probably true.
Terminology: I will use $i$ as row index, so $1 \le i \le n$, and $j$ as column index, so $1 \le j \le k$.
Example 1: Let $Y_{(n)} = 0$ for every length $n$. Then every column $f(Y+\alpha_j) = f(\alpha_j) \vec{1}$ where $\vec{1}$ denotes the all-$1$s vector. Thus the matrix becomes rank $1$ (all columns are parallel), and crucially, $\sum_j \beta_j f(Y+\alpha_j) = (\sum_j \beta_j f(\alpha_j)) \vec{1}$.
In this case, the sum of squared errors is $\Delta_n = \sum_{i=1}^n (f(\xi_i) - \sum_j \beta_j f(\alpha_j))^2$. Under most interpretations of how you generate $\xi_i$ we would conclude that $\Delta_n$ is minimized when $\sum_j \beta_j f(\alpha_j) = E[f(\xi_i)] = \sum_j p_j f(\alpha_j)$, regardless of what $f$ is.
So $\beta_j = p_j$ is certainly a solution. But due to the degeneracy, you have $k$ different $\beta$s and only $1$ equation, so there are many other $(\beta_1, \dots, \beta_k)$ that satisfy $\sum_j \beta_j f(\alpha_j) = E[f(\xi_i)] = \sum_j p_j f(\alpha_j)$, and $\beta_j = p_j$ is not the only solution. Does this count as an example of your conjectured convergence? (IMHO, no, but it is somewhat a matter of interpretation...)
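Example 1 is easy to see in code. A sketch with illustrative choices ($f = \exp$, $k = 3$), exhibiting two different $\beta$ vectors that achieve exactly the same summed squared error:

```python
# Example 1 in code: Y = 0 makes every column of the design parallel.
# Illustrative choices: f = exp, alphas = {1,2,3}, probs = {0.2, 0.5, 0.3}.
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
alphas = np.array([1.0, 2.0, 3.0])
probs = np.array([0.2, 0.5, 0.3])
xi = rng.choice(alphas, size=n, p=probs)

design = np.tile(np.exp(alphas), (n, 1))   # every row is (f(a_1), f(a_2), f(a_3))
target = np.exp(xi)

def sse(beta):
    return np.sum((target - design @ beta) ** 2)

# Shift beta = p along the null space of the row vector (f(a_1), f(a_2), f(a_3)):
# the fitted values, and hence the error, are unchanged.
shift = np.array([np.exp(alphas[1]), -np.exp(alphas[0]), 0.0])
print(sse(probs), sse(probs + shift))      # the two errors are identical
```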
Example 2: Without loss of generality, assume the $\alpha$s are confined to some smallish range, e.g. $(-10, 10)$. As the adversary, I pick some extremely fast-growing sequence, e.g. $y_i = 10^i$, and function, e.g. $f(x) = e^x$. I'll argue informally that in this case there is no convergence: as you add each row, that new row (i.e. the last row, row $n$) will dominate the regression. Specifically, suppose the last $\xi_n = \alpha_q$, the maximum $\alpha$. Then, due to the fast growth of both $y_i$ and $f$, the optimizing $\beta$s will be e.g. $\beta_q \approx 1$ and all other $\beta_j \approx 0$, just because minimizing the last row's squared error $\delta = (f(y_n + \xi_n) - \sum_j \beta_j f(y_n + \alpha_j))^2$ is the dominating concern. [At least, it is obvious that $\beta_j = p_j$ cannot be anywhere near the optimal choice if the last $\xi_n$ equals the maximum $\alpha$.] A similar thing will happen if $\xi_n$ equals the minimum $\alpha$. Thus, as $n$ increases and each new $\xi$ comes along, the $\beta$s will fluctuate and will not converge. Sorry this is informal, but I hope it makes sense.