While reading the paper on AdaGrad, an optimization method for machine learning, I came across an inequality on page 5 (available here) that I do not understand.
Let $g_{1:T}$ denote the matrix $G_T=[g_1, \ldots, g_T]$ whose columns are the $d$-dimensional subgradients of the hinge loss $L_t(x) = \max(0, 1 - y_t (z_t \cdot x))$, where $x$ is a weight vector of dimension $d$, $z_t$ a feature vector, and $y_t$ its correct label at time step $t$.
Denote by $g_{1:T,i}$ the $i$-th row of this matrix. They write:
For our first example, which was also given by McMahan and Streeter (2010), consider the following sparse random data scenario, where the vectors $z_t \in \{-1,0,1\}^d$. Assume that in each round $t$, feature $i$ appears with probability $p_i = \min\{1, ci^{-\alpha}\}$ for some $\alpha \in (1, \infty)$ and a dimension-independent constant $c$. Then taking the expectation of the gradient terms, we have:
$$(1)\qquad \mathbb{E} \sum_{i=1}^{d} \|g_{1:T, i}\|_2 = \sum_{i=1}^{d} \mathbb{E} \left[\sqrt{|\{t : |g_{t,i}| = 1 \}|}\right] \leq \sum_{i=1}^{d} \sqrt{\mathbb{E}\,|\{t : |g_{t,i}| = 1\}|} = \sum_{i=1}^{d} \sqrt{p_i T}$$
I am not sure what the random variable is in this case. I believe what is happening is this: for each dimension $i$ of our subgradient vectors, we consider the $i$-th row of the matrix $G_T$, whose entries are non-zero only at those time steps $t$ where the $i$-th component of the subgradient is non-zero, and then apply Jensen's inequality.
I also know that the gradient of the hinge loss is $\frac{\partial L}{\partial x_i}(x) = -y z_i$ if $y (z \cdot x) < 1$, and $0$ otherwise. I'm also not sure what the authors mean by a feature $i$ "appearing" — does this mean a value of $z_{t,i}$ that is non-zero?
I'm looking for a clarification of $(1)$ and would be happy to provide any other details/corrections.
The notation in the paper is indeed confusing and many steps are not clear/explicit. I'll give it a go!
Let's start by computing the subgradient. Notice that $$y_t\langle z_t, x\rangle = 1$$ is a hyperplane dividing the parameter space into two halves. On each of these halves, the subgradient is actually a gradient and you can compute it by standard calculus. On the dividing hyperplane the gradient is not well defined, as there is no unique linear approximation to the function at that point, but in this case this does not really matter (I'll explain later).
Where $y_t\langle z_t, x\rangle > 1$ the loss is identically $0$ and the gradient is $0$.
Where $y_t\langle z_t, x\rangle < 1$, the $i$-th partial derivative is $$\frac{\partial f_t(x)}{\partial x_i} = -z_{t, i}y_t.$$ What about $y_t\langle z_t, x\rangle = 1$? In this case you get the subdifferential, which is a full set and not a single vector. This is not a problem, as all we care about is finding some bound on the components of each one of these vectors. Thinking about the hinge loss in one or two dimensions, you can convince yourself that each vector in the subdifferential must have its $i$-th component in the interval with endpoints $-z_{t, i}y_t$ and $0$. So for the purposes of getting the inequality we can consider the worst case, i.e. the $i$-th component being equal to $-z_{t, i}y_t$.
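This case analysis can be sketched in a few lines of Python (the function name is mine, not from the paper; at the kink I return the extreme element of the subdifferential, matching the worst case used in the bound):

```python
import numpy as np

def hinge_subgradient(x, z, y):
    """One subgradient of L(x) = max(0, 1 - y * <z, x>).

    At the kink y*<z, x> == 1 the subdifferential is a whole set;
    here we return the extreme element -y*z used in the bound.
    """
    margin = y * np.dot(z, x)
    if margin > 1:
        return np.zeros_like(x, dtype=float)  # loss is identically 0 on this half
    return -y * z.astype(float)               # holds for margin < 1 (and at the kink)

# Correctly classified with margin > 1: zero subgradient
x = np.array([2.0, 0.0])
z = np.array([1, 0])
assert np.allclose(hinge_subgradient(x, z, y=1), [0.0, 0.0])

# Misclassified point: i-th component is -z_i * y, here -(−1)*1 = 1
assert np.allclose(hinge_subgradient(x, z, y=-1), [1.0, 0.0])
```

Note that every component of the output is $-z_{t,i} y_t$ or $0$, which is what drives the counting argument below.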
And these are all the possible values of the gradient.
We also know that $z_{t, i}$ is nonzero with probability $p_i$; this means that the components of the gradient are nonzero with probability at most $p_i$.
Lastly, we notice that the standard hinge loss is used in two-class classification problems, where $y_t = \pm 1$.
Finally, we are ready for equation $(1)$!
The first equality follows from the linearity of the expected value and from the fact that each $g_{t, i}$ is $\pm 1$ or $0$, so the sum of the squares equals the number of non-zero elements: $\|g_{1:T, i}\|_2 = \sqrt{\sum_{t=1}^T g_{t,i}^2} = \sqrt{|\{t : |g_{t,i}| = 1\}|}$.
The inequality follows from Jensen's inequality applied to the concave function $\sqrt{\cdot}$: $\mathbb{E}[\sqrt{X}] \leq \sqrt{\mathbb{E}[X]}$.
The last equality follows from computing the expected value of $\|(g_{1, i}, \dots, g_{T, i})\|_{1}$ (which equals the number of nonzero elements among the $i$-th components of the gradients), which is
$$ \mathbb{E}[\sum_{t=1}^{T}\vert g_{t, i}\vert] = \sum_{t=1}^{T}\mathbb{E}[\vert g_{t, i}\vert] = \sum_{t=1}^{T} p_i = Tp_i.$$
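As a sanity check, here is a small Monte Carlo simulation (mine, not from the paper) of a single coordinate $i$: the number of rounds where feature $i$ appears is $\mathrm{Binomial}(T, p_i)$, and $\|g_{1:T,i}\|_2$ is the square root of that count. The empirical mean of the norm should sit just below the Jensen bound $\sqrt{p_i T}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p_i, trials = 1000, 0.3, 2000

# For each trial, count the rounds where feature i appears, i.e. |g_{t,i}| = 1.
counts = rng.binomial(T, p_i, size=trials)

mean_norm = np.sqrt(counts).mean()  # empirical E[ sqrt(|{t : |g_{t,i}| = 1}|) ]
jensen_bound = np.sqrt(p_i * T)     # sqrt( E[|{t : |g_{t,i}| = 1}|] ) = sqrt(p_i * T)

# Jensen: E[sqrt(X)] <= sqrt(E[X]), with near-equality since X concentrates
assert mean_norm <= jensen_bound
```

Because the count concentrates around $p_i T$ for large $T$, the two quantities are very close, which is why the paper treats $\sqrt{p_i T}$ as a tight proxy for the expected norm.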