I'm trying to formulate an ADMM for performing dictionary learning (for sparse coding) on a set of data.
Let's assume we have a data matrix of $X \in \mathbb{R}^{M \times N}$, a dictionary of $D \in \mathbb{R}^{M \times K}$ and a coefficient matrix of $A \in \mathbb{R}^{K \times N}$, where $K < N$.
For learning the dictionary, I am following the form of:
$\arg\min_{D,A,B} \frac{1}{2}\|X - DA\|^2_F + \lambda \|B\|_1$
$\text{s.t. } B = A,\ \|d_i\|^2_2 \leq 1$
The Lagrangian (using scaled form) is:
$\mathcal{L}(D,A,B,U) = \frac{1}{2}\|X - DA\|^2_F + \lambda \|B\|_1 + \frac{\rho}{2} \|B-A+U\|^2_F$
The update expressions are:
$B^* = \mathcal{S}_{\lambda/\rho}\left[A - U\right]$ (soft thresholding)
$W^* = (D^TD + \rho I)^{-1} (D^T X + \rho B + \rho U)$ (the updated $A$)
$D^* = XW^T(WW^T)^{-1}$
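For reference, here is how I would sketch one consistent set of scaled-form updates in Python/NumPy rather than MATLAB (function and variable names are mine, and the $D$-step simply projects each column onto the unit ball instead of solving the constrained subproblem exactly):

```python
import numpy as np

def soft_threshold(V, tau):
    """Elementwise soft-thresholding operator S_tau."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def admm_dictionary_learning(X, K, lam=0.1, rho=1.0, n_iter=200, seed=0):
    """ADMM loop for min 0.5*||X - D A||_F^2 + lam*||B||_1  s.t.  B = A."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    D = rng.standard_normal((M, K))
    D /= np.linalg.norm(D, axis=0)               # start with unit-norm atoms
    B = np.zeros((K, N))
    U = np.zeros((K, N))                         # scaled dual variable
    for _ in range(n_iter):
        # A-step: quadratic subproblem, closed form
        A = np.linalg.solve(D.T @ D + rho * np.eye(K),
                            D.T @ X + rho * (B + U))
        # B-step: soft-thresholding
        B = soft_threshold(A - U, lam / rho)
        # dual ascent on the constraint B = A
        U = U + B - A
        # D-step: least squares D = X B^T (B B^T)^{-1}, small ridge for stability
        G = B @ B.T + 1e-8 * np.eye(K)
        D = np.linalg.solve(G, B @ X.T).T
        # project columns onto the unit ball (||d_i||_2 <= 1)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, B
```

On synthetic data generated from a sparse model, this sketch drives the relative fidelity $\|X - DB\|_F / \|X\|_F$ well below 1, so it can serve as a baseline to compare a MATLAB implementation against.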
Having implemented this in MATLAB, I find that I am unable to obtain a reasonable solution: the coefficients are suitably sparse, but the data fidelity ($\|X-DA\|^2_F$) is very poor.
What are some strategies for choosing $\lambda$ and $\rho$?
When there is no constraint in the dictionary learning problem (usually there is a constraint on the norm of each column), ADMM reduces to a simple alternating minimization method. If you do include the constraints, I think it reduces to something like the Method of Optimal Directions (MOD), perhaps with some negligible differences. So for the classical form of dictionary learning, I see no advantage in using ADMM over a popular method such as K-SVD (note that K-SVD is also an alternating minimization method). It is also worth noting that K-SVD does not converge to a global optimum either, but it works well in practice: dictionary learning is not a convex problem but a bi-convex one, so there are local minima.
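For concreteness, the MOD dictionary update mentioned above is just a least-squares fit of $D$ to fixed codes $A$, i.e. $D = XA^T(AA^T)^{-1}$, followed by column normalization. A minimal NumPy sketch (function name and the small ridge term are mine):

```python
import numpy as np

def mod_dictionary_update(X, A, eps=1e-8):
    """Method of Optimal Directions: D = X A^T (A A^T)^{-1}, unit-norm columns."""
    K = A.shape[0]
    G = A @ A.T + eps * np.eye(K)           # small ridge guards invertibility
    D = np.linalg.solve(G, A @ X.T).T       # = X A^T (A A^T)^{-1}  (G symmetric)
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)
```

If $X$ is exactly $D_{\text{true}}A$ with unit-norm atoms and well-conditioned codes, this recovers $D_{\text{true}}$ up to the ridge perturbation, which is why MOD alternated with a sparse-coding step behaves so similarly to the ADMM scheme in the question.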