$\underset{K}{\text{minimize}} ( (K\circ P)\mathbf{1}_m - T)^\intercal ((K\circ P)\mathbf{1}_m - T)$ with equality constraint


I would like to solve the following optimization problem.

\begin{align} &\underset{K}{\text{minimize}} & ( (K\circ P)\mathbf{1}_m - T)^\intercal ((K\circ P)\mathbf{1}_m - T)\\ &\text{s.t}& K^\intercal \mathbf{1}_n = \mathbf{1}_m \end{align}

Here $\circ$ denotes the Hadamard product, $K, P \in \mathbb{R}^{n\times m}$, $T \in \mathbb{R}^n$, $^\intercal$ denotes the transpose, and $\mathbf{1}_n, \mathbf{1}_m$ are all-ones vectors of size $n$ and $m$ respectively.
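For concreteness, the objective and constraint can be spelled out numerically. A minimal sketch assuming NumPy, with small made-up dimensions and a hypothetical helper `objective`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
P = rng.standard_normal((n, m))
T = rng.standard_normal(n)

def objective(K, P, T):
    """phi(K) = ((K o P) 1_m - T)^T ((K o P) 1_m - T)."""
    r = (K * P) @ np.ones(m) - T      # (K o P) 1_m - T, i.e. row sums minus T
    return r @ r

# One feasible K: every column sums to 1, so K^T 1_n = 1_m holds.
K = np.full((n, m), 1.0 / n)
assert np.allclose(K.T @ np.ones(n), np.ones(m))
```

Note that `(K * P) @ np.ones(m)` is just the row-sum of the Hadamard product, matching $(K\circ P)\mathbf{1}_m$.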

I tried, but got stuck pretty quickly. First I set up the Lagrangian ${\displaystyle {\mathcal {L}}(x,\lambda )=f(x)-\lambda g(x)}$, which here reads:

$$ ( (K\circ P)\mathbf{1}_m - T)^\intercal ((K\circ P)\mathbf{1}_m - T) - \lambda^\intercal(K^\intercal\mathbf{1}_n - \mathbf{1}_m)\\ $$

Where $\lambda \in \mathbb{R}^m$ (one multiplier per scalar constraint in $K^\intercal\mathbf{1}_n = \mathbf{1}_m$). I then expand the expression:

$$ \mathcal {L}=((K\circ P)\mathbf{1}_m)^\intercal((K\circ P)\mathbf{1}_m)- ((K\circ P)\mathbf{1}_m)^\intercal T - T ^\intercal ((K\circ P)\mathbf{1}_m) + T^\intercal T - \lambda^\intercal(K^\intercal\mathbf{1}_n - \mathbf{1}_m) $$

$$ \mathcal {L}=\mathbf{1}_m^\intercal (K\circ P)^\intercal(K\circ P)\mathbf{1}_m- \mathbf{1}_m^\intercal (K\circ P)^\intercal T - T ^\intercal ((K\circ P)\mathbf{1}_m) + T^\intercal T - \lambda^\intercal(K^\intercal\mathbf{1}_n - \mathbf{1}_m) $$

From here I have questions, because I do not know which properties hold for expressions like $(K\circ P)^\intercal(K\circ P)$ involving the Hadamard product. In any case, continuing with the derivatives, $\frac{\partial \mathcal{L}}{\partial \lambda}$ is easy:

$$ \frac{\partial \mathcal{L}}{\partial \lambda} =K^\intercal \mathbf{1}_n - \mathbf{1}_m $$

However, for $\frac{\partial \mathcal{L}}{\partial K}$ I do not know how to proceed, since $K$ is a matrix. I know for sure that I can drop the constant term $T^\intercal T$, but beyond that I'm lost.

$$ \frac{\partial \mathcal{L}}{\partial K} = ? $$

I looked into the Matrix Cookbook and here for some insight. I should be able to use a property such as $\nabla_A \operatorname{Tr}(XAY) = X^\intercal Y^\intercal$, but I'm still stuck; in particular, I would not know how to combine that property with $(K\circ P)^\intercal(K\circ P)$.

Accepted answer:

$\def\D{{\rm Diag}}\def\d{{\rm diag}}\def\o{{\large\tt1}}\def\R#1{\mathbb{R}^{#1}}\def\p#1#2{\frac{\partial #1}{\partial #2}}$Let $(\odot,\oslash)$ denote element-wise multiplication and division. Then, instead of building a Lagrangian, you can satisfy the constraint automatically using the all-ones matrix $J\in\R{n\times n}$ and an unconstrained matrix $U\in\R{n\times m}$, by constructing
$$\eqalign{ K &= U\oslash JU \quad\implies\quad K^T\o_n = \o_m \\ }$$ It will also be convenient to define some auxiliary variables $$\eqalign{ a &= \left(P\odot K\right)\o_m - T \\ A &= \D(a) \\ F &= 2P\odot (a\o_m^T) = 2AP \\ b &= U^T\o_n \\ B &= \D(b) \\ C &= B^{-1} \\ v &= \d\left(U^TFC^2\right) \\ }$$ Write the objective function in terms of these new variables.
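The claim that $K = U\oslash JU$ satisfies the constraint is easy to verify numerically; a quick NumPy check with an arbitrary positive $U$ (positivity keeps the column sums in $JU$ away from zero):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 3
U = rng.random((n, m)) + 0.1      # arbitrary positive unconstrained matrix
J = np.ones((n, n))               # all-ones matrix

# J U repeats each column sum down the column, so U ./ (J U)
# normalizes every column of U to sum to 1.
K = U / (J @ U)
assert np.allclose(K.T @ np.ones(n), np.ones(m))
```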
Then calculate the differential and gradient. $$\eqalign{ \phi &= a:a \\ d\phi &= 2a:da \\ &= 2a:\left(P\odot dK\right)\o_m \\ &= F:dK \\ &= F:\left(dU\oslash JU\right) - F:(U\odot J\,dU \oslash JU\oslash JU) \\ &= \left(F\oslash JU\right):dU - (F\oslash JU\oslash JU):(U\odot J\,dU) \\ &= FC:dU - FC^2:U\,dB \\ &= FC:dU - U^TFC^2:\D\left(dU^T\o_n\right) \\ &= FC:dU - v\o_n^T:dU^T \\ &= \Big(FC - \o_nv^T\Big):dU \\ \p{\phi}{U} &= \Big(FC - \o_nv^T\Big) \;\doteq\; G \\\\ }$$

Knowing the gradient, we can use gradient descent to solve for the optimal $U.$ $$\eqalign{ i &= 0 \\ U_0 &= \o_n\o_m^T \qquad\qquad \big({\rm initialize}\big) \\ \\ K_i &= U_i\oslash JU_i \qquad\; \big({\rm current\,solution}\big) \\ F_i &= 2\;\D\Big(\big(P\odot K_i\big)\o_m - T\Big) P \\ C_i &= \D\big(U_i^T\o_n\big)^{-1} \\ v_i &= \d\big(U_i^TF_iC_i^2\big) \\ G_i &= F_iC_i - \o_nv_i^T \\ U_{i+1} &= U_i - \lambda_i G_i \\ i &= i+{\tt1} \qquad\qquad \big({\rm next\,iteration}\big) \\ }$$ The step length $\lambda_i>0$ is chosen to minimize (or sufficiently reduce) $\phi$ at each step.
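The iteration above translates almost line-for-line into NumPy. A sketch, with a hypothetical helper `solve_K` and a simple backtracking line search standing in for the unspecified step-length rule (the loop counts and tolerances are arbitrary choices):

```python
import numpy as np

def solve_K(P, T, iters=500, lam0=1.0):
    """Gradient descent on U following the updates above; returns K = U ./ (J U)."""
    n, m = P.shape
    ones_m = np.ones(m)
    U = np.ones((n, m))                       # U_0 = 1_n 1_m^T

    def phi(U):
        K = U / U.sum(axis=0)                 # K = U ./ (J U)
        a = (P * K) @ ones_m - T
        return a @ a

    for _ in range(iters):
        b = U.sum(axis=0)                     # b = U^T 1_n  (column sums)
        K = U / b                             # K_i
        a = (P * K) @ ones_m - T
        F = 2.0 * a[:, None] * P              # F = 2 Diag(a) P
        FC = F / b                            # F C, with C = Diag(b)^{-1}
        v = np.einsum('ij,ij->j', U, FC / b)  # v = diag(U^T F C^2)
        G = FC - np.outer(np.ones(n), v)      # G = F C - 1_n v^T
        f0, lam, moved = phi(U), lam0, False
        while lam > 1e-12:                    # backtracking line search
            Un = U - lam * G
            # keep column sums positive so the division stays well defined
            if np.all(Un.sum(axis=0) > 1e-12) and phi(Un) <= f0:
                U, moved = Un, True
                break
            lam *= 0.5
        if not moved:
            break                             # no descent step found; stop
    return U / U.sum(axis=0)
```

The returned $K$ is feasible by construction, and each accepted step does not increase $\phi$.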

Instead of gradient descent, it might be possible to solve $G=0$ directly with some clever algebra; however, the equation is highly nonlinear.


In the above, a colon is used as convenient notation for the trace product $$\eqalign{ A:B &= {\rm Tr}\big(A^TB\big) \;=\; \sum_{i=1}^n\sum_{j=1}^m A_{ij} B_{ij} \\ A:A &= \big\|A\big\|_F^2 \\ }$$ ${\bf NB\!:}\;{\rm If}\;m={\tt1},\,{\rm then}\,(A,B)$ are vectors and the colon product reduces to the standard dot product.
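Both colon identities are one-liners to confirm numerically, e.g. with NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 2))

# A : B  =  Tr(A^T B)  =  sum of element-wise products
assert np.isclose(np.trace(A.T @ B), (A * B).sum())

# A : A  =  squared Frobenius norm
assert np.isclose(np.trace(A.T @ A), np.linalg.norm(A, 'fro') ** 2)
```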