I want to optimize a sparse problem using discrete math of the following form:
S [16500x1]: measured data vector
D [1500x2]: a dictionary matrix containing two basis signals of 1500 samples
X [16500x2]: the impulse responses of the problem that need to be found.
The signal generation has the following structure: $$S[n] = \sum_i D_i[n] * X_i[n] + \eta$$ So basically, the signal S[n] is the sum of discrete convolutions of $D_i$ and $X_i$.
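For concreteness, the generation step can be sketched in numpy like this (the spike positions, amplitudes, and noise level are made up for illustration; only the array sizes come from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n_S, n_D, n_dict = 16500, 1500, 2        # sizes from the question

D = rng.standard_normal((n_D, n_dict))   # stand-in for the two basis signals
X = np.zeros((n_S, n_dict))              # sparse impulse responses
for i in range(n_dict):
    idx = rng.choice(n_S, size=20, replace=False)
    X[idx, i] = rng.standard_normal(20)

# S[n] = sum_i (D_i * X_i)[n] + eta, with 'same' keeping length n_S
S = sum(np.convolve(X[:, i], D[:, i], mode='same') for i in range(n_dict))
S = S + 0.01 * rng.standard_normal(n_S)
```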
To reconstruct the impulse responses $X_i$ from a measurement of $S$, I use a sparse, convex optimization problem with the following loss function: $$ \epsilon = \Big\| \sum_i D_i * X_i - S \Big\|^2_2 + \lambda \cdot \|X\|_1$$
I want to solve this using simple gradient descent, so I calculate the gradient of $\epsilon$:
$$\frac{\partial \epsilon}{\partial X_i} = 2\cdot \Big(\sum_j D_j * X_j - S\Big) * D_i + \lambda\cdot \mathrm{sign}(X_i)$$
However, when I implement this, the loss diverges, even with very small step sizes. Only when I flip the second $D_i$ (the one in the gradient's convolution) in the time dimension do I get the results I expect. The idea for reversing came from an implementation of this I saw in the frequency domain, where the Hermitian transpose was used in the multiplication (which corresponds to time reversal in the time domain).
So, a long introduction for the question: why do I need to time-reverse the second $D_i$ (right before the plus sign)? I do not see where this is coming from, based on my knowledge of the chain rule. Any help in understanding this would be greatly appreciated!
Thanks! PS. Sorry for sloppy notation. Not a mathematician.
EDIT: added the code:
gradientVecs = zeros(size(X));
maxIter = 1000;
for cntIter = 1 : maxIter
    % Forward model: sum of convolutions D_i * X_i
    curDX = 0;
    for cntDict = 1 : size(D, 2)
        curDX = curDX + conv(X(:, cntDict), D(:, cntDict), 'same');
    end
    curLoss = sum(sum((curDX - S).^2)) + lambda * sum(sum(abs(X)));
    % Gradient per dictionary atom
    for cntDict = 1 : size(D, 2)
        % Why is there a flip here???? HERE???
        gradientVecs(:, cntDict) = 2 * conv(curDX - S, flip(D(:, cntDict)), 'same') ...
            + lambda * sign(X(:, cntDict));
    end
    X = X - 0.02 * gradientVecs;
end
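For what it's worth, here is a minimal numpy port of the same loop on a toy problem (a single atom; the sizes, atom shape, and step size are made up), which confirms that with the flipped kernel the loss does decrease:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 21                        # toy sizes (the question uses 16500 / 1500)
d = np.exp(-0.5 * ((np.arange(m) - m // 2) / 3.0) ** 2)
d /= np.linalg.norm(d)                # a single unit-norm dictionary atom
x_true = np.zeros(n)
x_true[[30, 90, 150]] = [1.0, -0.5, 0.8]   # sparse ground truth
S = np.convolve(x_true, d, mode='same')

lam, step = 1e-3, 0.01
x = np.zeros(n)
loss0 = None
for it in range(300):
    r = np.convolve(x, d, mode='same') - S              # residual
    loss = np.sum(r ** 2) + lam * np.sum(np.abs(x))
    if loss0 is None:
        loss0 = loss
    # the flip: correlate the residual with d, i.e. convolve with d reversed
    g = 2 * np.convolve(r, d[::-1], mode='same') + lam * np.sign(x)
    x -= step * g
```

(Using an odd-length atom keeps the 'same'-mode trimming symmetric, so the flipped convolution is the exact adjoint of the forward one.)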
OK, I found the answer to my own question:
First, let us recap the problem we want to find the gradient for:
$$\epsilon = \Big\| \sum_i d_i*x_i - s \Big\|_2^2 + \lambda \cdot \sum_i \|x_i\|_1$$
We can rewrite the discrete convolution operator in matrix form: $$\sum_i d_i*x_i = \sum_i D_i \cdot x_i$$ where $D_i$ is a matrix of which the columns are shifted versions of $d_i$. $D_i$ therefore is a Toeplitz matrix.
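This matrix form is easy to sanity-check numerically (a numpy sketch with toy sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 6                          # toy sizes
d = rng.standard_normal(M)           # dictionary atom d_i
x = rng.standard_normal(N)           # code x_i

# Toeplitz convolution matrix: column j holds d shifted down by j samples,
# so that D @ x equals the full convolution d * x (length N + M - 1)
D = np.zeros((N + M - 1, N))
for j in range(N):
    D[j:j + M, j] = d

assert np.allclose(D @ x, np.convolve(d, x))
```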
Then, we can rewrite the loss function: $$\epsilon = \Big\| \sum_i D_i \cdot x_i - s \Big\|_2^2 + \lambda \cdot \sum_i \|x_i\|_1$$
We can then write out the $L_2$ norm: $$ \epsilon = \Big( \sum_i D_i \cdot x_i - s \Big)^T\cdot\Big( \sum_i D_i \cdot x_i - s \Big) + \lambda \cdot \sum_i \|x_i\|_1$$
Which expands to: $$ \epsilon = \sum_{i,j} x_i^T D_i^T D_j x_j - 2s^T\sum_i D_i x_i + s^Ts + \lambda \cdot \sum_i \|x_i\|_1$$
We can then take the gradient w.r.t. $x_i$ (note that $x_i$ also appears in the cross terms of the double sum): $$ \frac{\partial \epsilon}{\partial x_i} = 2\sum_j D^T_i D_j x_j - 2D_i^T s + \lambda \cdot \mathrm{sign}(x_i)$$
$$ \frac{\partial \epsilon}{\partial x_i} = 2\cdot D_i^T \cdot\Big(\sum_j D_j x_j - s\Big) + \lambda \cdot \mathrm{sign}(x_i)$$
The multiplication by $D_i^T$ in front of $\big(\sum_j D_j x_j - s\big)$ equals a convolution with the time-reversed $d_i$, because of the Toeplitz structure of $D_i$. Hence the flip in the code.
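That last identity can also be checked numerically: multiplying by $D_i^T$ is the same as convolving with the reversed atom (a numpy sketch with toy sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 6                          # toy sizes
d = rng.standard_normal(M)           # dictionary atom d_i
r = rng.standard_normal(N + M - 1)   # a residual of full-convolution length

# Toeplitz convolution matrix, built so that D @ x == np.convolve(d, x)
D = np.zeros((N + M - 1, N))
for j in range(N):
    D[j:j + M, j] = d

# D^T @ r is the 'valid' convolution of r with the time-reversed d,
# i.e. the correlation of r with d
assert np.allclose(D.T @ r, np.convolve(r, d[::-1], mode='valid'))
```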