Suppose we have the following equations: $$ \begin{align} y_i=\sum_j{w_{ij}x_j} ,\\ w_{ij}^{'}=x_{i}^{T}x_j ,\\ w_{ij}=\frac{e^{w_{ij}^{'}}}{\sum_k{e^{w_{ik}^{'}}}} \end{align} $$ where $T$ is the transpose operation, and $w_{ij}$ is obtained by applying the $SoftMax$ function to $w_{ij}^{'}$ over the index $j$.
In matrix form (stacking the vectors $x_i$ as the rows of $X$) we can rearrange the equations above as follows: $$ \begin{align} Y=WX, \\ W=SoftMax(XX^T) \end{align} $$ where the $SoftMax$ is applied row-wise.
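As a quick sanity check, here is a small NumPy sketch (the data is random, purely for illustration) verifying that the matrix form reproduces the elementwise definitions, under the assumption that the rows of $X$ are the vectors $x_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))  # rows are the (made-up) vectors x_i

# Elementwise definitions, vectorized:
Wp = X @ X.T                                             # w'_{ij} = x_i^T x_j
W = np.exp(Wp) / np.exp(Wp).sum(axis=1, keepdims=True)   # row-wise softmax
Y = W @ X                                                # y_i = sum_j w_{ij} x_j

# Compare row 0 against the explicit sum from the first equation:
y0 = sum(W[0, j] * X[j] for j in range(X.shape[0]))
assert np.allclose(Y[0], y0)
assert np.allclose(W.sum(axis=1), 1.0)  # each row of W is a distribution
```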
Question: Since $Y$ is defined as a matrix multiplication, that step is purely linear, so its gradient is just a (constant) linear transformation and should not vanish. By contrast, $W=SoftMax(XX^T)$ is non-linear and can cause vanishing gradients. Could you please explain what the relationship is between linearity/non-linearity and vanishing/non-vanishing gradients?
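To make the contrast in the question concrete, here is a small NumPy sketch (the logit vector and scale factors are made up for illustration). The Jacobian of the softmax is $\operatorname{diag}(s) - ss^T$; as the logits grow and the softmax saturates, every entry of this Jacobian collapses toward zero, whereas the Jacobian of a linear map $y = Wx$ is simply the constant $W$, independent of the input:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    # d softmax(z)_i / d z_j = s_i * (delta_ij - s_j) = diag(s) - s s^T
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 0.5, -0.5])  # arbitrary illustrative logits
grads = {scale: np.abs(softmax_jacobian(scale * z)).max()
         for scale in (1, 10, 100)}
# As the logits are scaled up, the softmax saturates (one entry -> 1)
# and the largest Jacobian entry shrinks toward 0: gradients flowing
# back through the softmax vanish. A linear map y = W x has the
# constant Jacobian W, so no such shrinkage occurs there.
```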