Solve for the gradient of $\log \sum\limits_{i = 1}^{m} \exp(a_i^Tx + bi)$

5.2k Views Asked by At

This is a standard problem in convex optimization with well known solution but I cannot seem to follow the procedure given in Boyd's CVX book pg 643

Suppose I am given $f(x) = \log \sum\limits_{i = 1}^{m} \exp(a_i^Tx + bi)$, $f: R^n \to R$, I need to find the gradient

Then, by the chain rule:

$Df(x) = Dg(h(x))Dh(x)= D(\log \sum\limits_{i = 1}^{m} \exp(a_i^Tx + bi))D( \sum\limits_{i = 1}^{m} \exp(a_i^Tx + bi))$

$Df(x) = \dfrac{1}{\sum\limits_{i = 1}^{m} \exp(a_i^Tx + bi)} [?? \text{what is } D( \sum\limits_{i = 1}^{m} \exp(a_i^Tx + bi))??] $

I would really appreciate if someone can show me how you would get a closed form of the expression $D( \sum\limits_{i = 1}^{m} \exp(a_i^Tx + bi))$.

The final solution is $\nabla f(x) = \dfrac{1}{1^Tz}A^Tz$, where $z_i = \exp(a_i^Tx + b_i)$

2

There are 2 best solutions below

3
On BEST ANSWER

Split it up. Calculate each partial derivative of each term of the sum separately. By the chain rule we have $$\partial_{x_j}\exp(a_i^Tx+bi)=\exp(a_i^Tx+bi)\partial_{x_j}(a^T_ix+bi)=a_{ij}\exp(a_i^Tx+bi).$$ Hence $D\sum_{i=1}^m\exp(a_i^Tx+bi)=(\sum_{i=1}^ma_{ij}z_i)_{j=1}^n=A^Tz$.

0
On

I had to peek at the PDF you linked, to see that the notation $a^T_i$ denotes the $i^{th}$ row of the matrix $A$, rather than the transpose of the $i^{th}$ column of the matrix.

You can approach the problem using sequential change-of-variables within differential expressions. This is useful in matrix calculus, because quite often the intermediate quantities needed to apply the chain-rule are higher order tensors which are difficult to work with in standard matrix-vector notation.

Define the variables $$\eqalign{ y &= Ax + b \cr z &= \exp(y) \cr s &= 1:z \cr f &= \log(s) \cr }$$ where colon denotes the Frobenius product $($so that $1\!:\!z=1^Tz)$ and the functions are applied element-wise to vector arguments.

Now expand the differential of $f$ in terms of these variables $$\eqalign{ df &= d\log(s) = \frac{ds}{s} \cr &= \frac{1:dz}{s} \cr &= \frac{1:z\circ dy}{s} = \frac{1\circ z:dy}{s} \cr &= \frac{z:dy}{s} = \frac{z:A\,dx}{s} \cr &= \frac{A^Tz:dx}{s} = \frac{A^Tz}{s}:dx \cr }$$ where $\circ$ denotes the Hadamard product. The advantage of using the Frobenius and Hadamard products is that they commute, i.e. $\,A\circ B:C=A: B\circ C,\,$ which can sometimes simplify your results.

Since $df=(\frac{\partial f}{\partial x}):dx,\,$ the gradient is $$\eqalign{ \frac{\partial f}{\partial x} &= \frac{A^Tz}{s} = \frac{A^Tz}{1^Tz} \cr }$$