I am confused by the following problem and am observing strange behavior: the regularized and ordinary least squares solutions do not match. Here is the problem:
Assume I have $Y=AX+b$, where $b$ is a small random noise term.
$Y=[1\ \ 2]^T$, $A=\begin{bmatrix}1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1\end{bmatrix}$. Solving the least squares problem with MATLAB's backslash operator gives me the following solution: $$ X=[1;\ 2;\ 0;\ 0] $$ Moreover, $A^TA$ is singular. Therefore I introduced a regularization parameter $\delta$ and instead solved $$(A^TA+\delta I)^{-1}A^TY$$ I know regularization amounts to an optimization problem that jointly minimizes the residual and the norm of the solution; however, my answer now is $$[0.4975;\ 0.9950;\ 0.4975;\ 0.9950]$$
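The two computations can be reproduced with NumPy (a sketch; $\delta = 0.01$ is an assumed value, chosen because it matches the numbers $0.4975$ and $0.9950$ reported above; note that NumPy's `lstsq` returns the minimum-norm solution rather than the sparse basic solution MATLAB's backslash picks for underdetermined systems):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
y = np.array([1.0, 2.0])

# Plain least squares: lstsq returns the minimum-norm solution [0.5, 1, 0.5, 1],
# unlike MATLAB's backslash, which returns the basic solution [1, 2, 0, 0].
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Ridge-regularized solution (A^T A + delta*I)^{-1} A^T y
delta = 0.01  # assumed value, consistent with the reported 0.4975 / 0.9950
x_ridge = np.linalg.solve(A.T @ A + delta * np.eye(4), A.T @ y)

print(x_ls)     # → [0.5 1.  0.5 1. ]
print(x_ridge)  # → [0.4975... 0.9950... 0.4975... 0.9950...]
```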
Note that adding the first and third entries of this solution gives me approximately 1, and adding the second and fourth entries gives me approximately 2.
Why is this happening? It seems the regularized solution gives no new information: although the vector $X$ has four elements, I effectively recover only two of them and can never determine the other two. Can anyone here help me solve this problem? I know this might give me the least-norm solution, but that's not what I am after.
First of all, you have an underdetermined system since you have more unknowns than equations. This means that, without any assumptions on the problem such as sparsity, you will have infinitely many solutions. Note that, for instance, $X=[0.5;1;0.5;1]$ or $X=[0.75;1.9;0.25;0.1]$ are also perfectly correct solutions to your original least squares problem.
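You can verify numerically that all of these candidates fit the system exactly (a quick NumPy check):

```python
import numpy as np

# The system A x = y is underdetermined (2 equations, 4 unknowns),
# so infinitely many x solve it exactly.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
y = np.array([1.0, 2.0])

candidates = [[1, 2, 0, 0], [0.5, 1, 0.5, 1], [0.75, 1.9, 0.25, 0.1]]
residuals = [np.linalg.norm(A @ np.array(x, float) - y) for x in candidates]
print(residuals)  # → [0.0, 0.0, 0.0] — every candidate is an exact solution
```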
It seems to me that you have used ridge regression to regularize your problem $$ ||y-Ax||_2^2+\delta||x||_2^2 $$ yielding $$ x^* = \left(A^TA+\delta I \right)^{-1}A^Ty $$ What you have done is add a small weight $\delta$ to the diagonal of $A^TA$ to make this matrix invertible (or, put differently: you increased its eigenvalues). Your original $A$ has duplicate columns $[1\ 0]^T$ and $[0\ 1]^T$. When this is the case, ridge regression will yield a solution that divides the power equally among the corresponding elements of $x$. To see this, assume that $A = [a\ a]$, i.e., a matrix composed of two identical columns. The problem is then $$ \underset{x}{\text{minimize}}\ ||y-Ax||^2_2+\delta||x||_2^2=\underset{x_1,x_2}{\text{minimize}}\ ||y-ax_1-ax_2||^2_2+\delta x_1^2+\delta x_2^2 $$ Taking the partial derivatives with respect to $x_1$ and $x_2$ and setting them to zero, one gets $$ -a^T(y-ax_1-ax_2)+\delta x_1=0\\ -a^T(y-ax_1-ax_2)+\delta x_2=0 $$ which means that $$ x_1=x_2=\frac{a^Ty}{2a^Ta+\delta}\quad (1) $$ So when you apply ridge regression to your problem, you get a unique solution in which the power is evenly distributed over the variables corresponding to the identical columns. Now, you stated that the sum of the first and third entries of the solution gave you $1$ and the sum of the second and fourth entries gave you $2$. This is not correct: each sum will actually be slightly less than $1$ and $2$, because of the $\delta$ in the denominator of $(1)$. For the original least squares problem, however, the power can be distributed in any way between the variables corresponding to the identical columns of $A$. This is the difference between the two approaches.
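The closed form $(1)$ can be checked numerically against the ridge solution. In your problem the duplicate-column pairs are $\{1,3\}$ with $a=[1\ 0]^T$ and $\{2,4\}$ with $a=[0\ 1]^T$; this sketch again assumes $\delta = 0.01$, the value consistent with your reported numbers:

```python
import numpy as np

delta = 0.01  # assumed regularization weight matching 0.4975 / 0.9950
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
y = np.array([1.0, 2.0])

# Ridge solution computed directly from the normal equations
x_ridge = np.linalg.solve(A.T @ A + delta * np.eye(4), A.T @ y)

# Closed form (1), x_i = a^T y / (2 a^T a + delta), per duplicate-column pair
for pair in [(0, 2), (1, 3)]:
    a = A[:, pair[0]]
    pred = (a @ y) / (2 * (a @ a) + delta)
    print(pred, x_ridge[list(pair)])  # both pair entries equal the prediction

# The pair sums fall short of 1 and 2 because delta > 0:
print(x_ridge[0] + x_ridge[2], x_ridge[1] + x_ridge[3])  # → 2/2.01, 4/2.01
```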
Hope this clarified things!