For two variables to have zero covariance, there must be no linear dependence between them. Independence is a stronger requirement than zero covariance, because independence also excludes nonlinear relationships. It is possible for two variables to be dependent but have zero covariance.
Page 59
Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron, Deep learning, Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press (ISBN 978-0-262-03561-3/hbk; 978-0-262-33743-4/ebook). xxii, 775 p. (2016). ZBL1373.68009.
I understand the why stochastic independence implies zero covariance, which is $\mathbf{E}[\mathbf{XY}]\triangleq\iint_{\mathbb{R^2}} xy*p_{xy}(x,y)dxdy = \iint_{\mathbb{R^2}} xy*p_{x}(x)*p_y(y)dxdy = \int_{\mathbb{R}}x*p_x(x)dx\int_{\mathbb{R}}y*p_y(y)dy = \mathbf{E}[\mathbf{X}]\mathbf{E}[\mathbf{Y}]$ as long as the independence condition holds. Then, it follows that, since $\mathbf{Cov}[\mathbf{XY}]\triangleq\mathbf{E}[\mathbf{XY}]-\mathbf{E}[\mathbf{X}]\mathbf{E}[\mathbf{Y}]$, substitution yields $\mathbf{Cov}[\mathbf{XY}] = 0$.
I apologize for the digression; now comes my question: I understand what linear and non-linear dependence entail, but I fail to see why the type of dependence between the $\mathbf{X}$ and $\mathbf{Y}$ would affect stochastic independence. I apologize if my question isn't worded in the perfect way possible. Thanks in advance!
Correlation only captures linear dependence. E.g. Take $Y = X^2$. These two random variables $X,Y$ are definitely dependent on each other i.e. given the value of one, we can comment on the value of the other. However, the correlation between $Y,X$ is zero. But surely, $X,Y$ are NOT independent.
UPDATE:
Intuition behind why correlation captures only linear dependence:
Say two rvs $X,Y$ have linear relationship i.e $Y = aX + \epsilon$, where $\epsilon$ is independent of $X$ and has variances $\sigma^2_{\epsilon}$.
\begin{align} \rho_{XY} &= \frac{\Bbb E(XY) - \Bbb E(X) \Bbb E(Y)}{\sigma_X \sigma_Y} \\ &= \frac{\Bbb E(X(aX+\epsilon)) - \Bbb E(X) \Bbb E(aX+\epsilon)}{\sigma_X \sqrt{(a^2 \sigma^2_x + \sigma^2_{\epsilon})}} \\ &= \frac{a\Bbb E(X^2) - a\Bbb E(X)^2}{\sigma_X \sqrt{(a^2 \sigma^2_x + \sigma^2_{\epsilon})}} = \frac{a\sigma_X}{\sqrt{(a^2 \sigma^2_x + \sigma^2_{\epsilon})}} \\ \end{align}
If $\sigma_{\epsilon} = 0$ i.e. the error term is not stochastic, then the correlation is 1.
On the other hand, by having a large $\sigma_{\epsilon}$, correlation is close to zero.