I am reading the book *Foundations of Machine Learning*, and the author gives many proofs of different theorems. Here is one part of a proof about the Perceptron algorithm that I don't quite understand.
Note: $w_t$ is a vector
Let \begin{equation} w_{t+1} = \begin{cases} w_t + \eta y_t x_t & \text{if } y_t(w_t \cdot x_t) < 0 \\ w_t & \text{otherwise} \end{cases} \end{equation}
and in the proof he has \begin{equation} \|w_{T+1}\| = \sqrt{\sum_{t=0}^{T} \left( \|w_{t+1}\|^2 - \|w_t\|^2 \right)} \end{equation}
He mentions that the above equation uses a telescoping sum together with $w_0 = 0$. I have no idea why the left-hand side equals the right-hand side here; can you help me explain this?
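For concreteness, here is how I read the update rule in code (my own sketch, with `eta` standing for the step size $\eta$ and labels assumed to be in $\{-1, +1\}$):

```python
import numpy as np

def perceptron_update(w, x, y, eta=1.0):
    # Update only on a misclassified point (y * (w . x) < 0);
    # otherwise leave w unchanged, matching the cases above.
    if y * np.dot(w, x) < 0:
        return w + eta * y * x
    return w
```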
Thanks in advance!
This is an algebraic identity that has very little to do with the specific function: let $f\colon \mathbb Z_{\ge0} \to \mathbb R_{\ge0}$ be any function satisfying $f(0)=0$; then $$ f(T+1) = \sqrt{\sum_{0\le t\le T} \big( f(t+1)^2-f(t)^2 \big)}. $$

To see this, it's equivalent to show that $$ f(T+1)^2 = \sum_{0\le t\le T} \big( f(t+1)^2-f(t)^2 \big). $$

And this is easy to show by induction, or indeed by recognizing the right-hand side as a telescoping sum: each term $f(t)^2$ with $0 < t \le T$ appears once with a plus sign and once with a minus sign, so everything cancels except $f(T+1)^2$ and $-f(0)^2 = 0$. Taking $T=3$ for example: $$ f(4)^2 = (f(1)^2-f(0)^2) + (f(2)^2-f(1)^2) + (f(3)^2-f(2)^2) + (f(4)^2-f(3)^2). $$
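If a numerical check helps, here is a small sketch (not from the book: a toy mistake-driven run with $w_0 = 0$ and made-up random data) confirming that the two sides of the identity agree:

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.zeros(3)   # w_0 = 0, as the proof assumes
sq_diffs = []     # records ||w_{t+1}||^2 - ||w_t||^2 at each step
for _ in range(50):
    x = rng.normal(size=3)
    y = rng.choice([-1.0, 1.0])
    # mistake-driven update: change w only when y * (w . x) < 0
    w_next = w + y * x if y * np.dot(w, x) < 0 else w
    sq_diffs.append(np.dot(w_next, w_next) - np.dot(w, w))
    w = w_next

lhs = np.linalg.norm(w)       # ||w_{T+1}||
rhs = np.sqrt(sum(sq_diffs))  # square root of the telescoping sum
```

Up to floating-point error, `lhs` and `rhs` coincide, exactly as the cancellation argument predicts.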