Question about a convex optimization using gradient descent


Recently I read a paper, but I was confused by the optimization method in it. In the following I will try to abstract the problem from the text. Suppose we have six variables $\bf{\theta_{1}},\bf{\theta_{2}},\bf{\theta_{3}}$, $w_1,w_2,w_3$ (note that each $\bf{\theta}$ is a column vector and each $w$ is a scalar) and a continuous, differentiable function $f(x)$. The variables satisfy the relationship $$ \theta_{i}^{t+1} = \sum_{n=1}^{3}w_n\theta_{n}^{t}, \quad i=1,2,3 $$ Here $t$ denotes a moment, with $t = 0,1,2,\dots,N$; also, let's treat $i$ as some fixed value. Our target is to find, at each moment $t$, the $w_1, w_2, w_3$ that minimize the following: $$ \min_{w_1,w_2,w_3}f(\theta_i^{t+1})=\min_{w_1,w_2,w_3}f\left(\sum_{n=1}^{3}w_n\theta_{n}^{t}\right) $$ In my opinion, at the moment $t+1$, because $w_1,w_2,w_3$ are variables and $\theta_1^t,\theta_2^t,\theta_3^t$ are constant vectors, we can regard $\theta_i^{t+1}$ as a linear combination of $w_1,w_2,w_3$. To minimize $f(\theta_i^{t+1})$ with the gradient descent method, we can find an approximate solution $\tilde{\theta}_i^{t+1}$ after $T$ iterations: $$ (\theta_i^{t+1})^{T+1} = (\theta_i^{t+1})^{T} - \eta\nabla_{(\theta_{i}^{t+1})^T} f((\theta_i^{t+1})^{T}) $$ Here $\eta$ is the step size.
Using the chain rule, we have: $$ \nabla_{(\theta_{i}^{t+1})^T} f((\theta_i^{t+1})^{T})=\frac{\partial f((\theta_i^{t+1})^{T}) }{\partial w_1}\cdot \frac{\partial w_1 }{\partial \theta_{i}^{t+1}} + \frac{\partial f((\theta_i^{t+1})^{T}) }{\partial w_2}\cdot \frac{\partial w_2 }{\partial \theta_{i}^{t+1}} + \frac{\partial f((\theta_i^{t+1})^{T}) }{\partial w_3}\cdot \frac{\partial w_3 }{\partial \theta_{i}^{t+1}} $$ Let $\bf{w}$$=[w_1,w_2,w_3]^{H}$, where $H$ denotes transposition; then we have: $$ \nabla_{(\theta_{i}^{t+1})^T} f((\theta_i^{t+1})^{T})=[\frac{\partial w_1 }{\partial \theta_{i}^{t+1}},\frac{\partial w_2 }{\partial \theta_{i}^{t+1}},\frac{\partial w_3 }{\partial \theta_{i}^{t+1}}]^{H} \nabla_{\bf{w}} f((\theta_i^{t+1})^{T}) $$ So the final iteration formula should be $$ (\theta_i^{t+1})^{T+1} = (\theta_i^{t+1})^{T} - \eta[\frac{\partial w_1 }{\partial \theta_{i}^{t+1}},\frac{\partial w_2 }{\partial \theta_{i}^{t+1}},\frac{\partial w_3 }{\partial \theta_{i}^{t+1}}]^{H} \nabla_{\bf{w}} f((\theta_i^{t+1})^{T}) $$ The above is my opinion, but in the paper the final result seems to be $$ (\theta_i^{t+1})^{T+1} = (\theta_i^{t+1})^{T} - \eta \bf{1}^{H} \nabla_{\bf{w}} f((\theta_i^{t+1})^{T}) $$ Here $\bf{1}$ is a size-3 vector of ones. By the way, it seems that the moment $t$ and the iteration $T$ are mixed up in the paper; the above is purely my personal understanding. I hope someone can tell me what went wrong in my solution process. If you are interested in the paper, you can search for "PERSONALIZED FEDERATED LEARNING WITH FIRST ORDER MODEL OPTIMIZATION", but it may be tough going if you don't know about federated learning. Thank you.
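To make my setup concrete, here is a minimal numerical sketch (not taken from the paper) of gradient descent on the weights $\bf{w}$ for a toy quadratic $f(x)=\|x-\text{target}\|^2$. All names, dimensions, and values (`thetas`, `target`, `eta`, the iteration count) are illustrative assumptions; the only part meant to match the question is the chain-rule fact that $\partial f(\sum_n w_n\theta_n)/\partial w_n = \theta_n^{H}\,\nabla f(\sum_n w_n\theta_n)$:

```python
import numpy as np

# Toy data: three fixed "column vectors" theta_1^t, theta_2^t, theta_3^t
# (rows of `thetas`, dimension 4) and a simple quadratic objective.
rng = np.random.default_rng(0)
thetas = rng.normal(size=(3, 4))   # theta_n^t stacked as rows
target = rng.normal(size=4)        # arbitrary point defining f

def f(x):
    return np.sum((x - target) ** 2)

def grad_f(x):
    # Gradient of f with respect to its vector argument.
    return 2.0 * (x - target)

# Gradient descent on w, using the chain rule:
#   d f(sum_n w_n theta_n) / d w_n = theta_n . grad_f(sum_n w_n theta_n)
w = np.ones(3) / 3.0               # start from uniform weights
eta = 0.01                         # step size
for _ in range(200):
    x = thetas.T @ w               # theta_i^{t+1} = sum_n w_n theta_n^t
    grad_w = thetas @ grad_f(x)    # one partial derivative per weight
    w -= eta * grad_w

print("final weights:", w)
print("final objective:", f(thetas.T @ w))
```

In this sketch the descent variable is $\bf{w}$ (three scalars), and each update moves $\theta_i^{t+1}=\sum_n w_n\theta_n^t$ only indirectly, through $\bf{w}$.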