I tried to solve linear least squares and I am somewhat stuck / confused about the properties of vector multiplication, in particular associativity and commutativity.
The way I defined my problem is the following: Let $\phi(x_n)$ define the features of data point $x_n$ with $\phi(x_n),w \in \mathbb{R}^{m}$. Then the loss is given by $$ L = \frac{1}{2} \sum^N _{n=1} (\phi(x_n)^Tw -y_n)^2 $$
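For concreteness, here is how I compute this loss numerically (a sketch with made-up random data; `Phi` is a hypothetical design matrix whose row $n$ is $\phi(x_n)^T$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 6, 3
Phi = rng.standard_normal((N, m))   # row n is phi(x_n)^T
y = rng.standard_normal(N)
w = rng.standard_normal(m)

# loss exactly as written: (1/2) * sum_n (phi(x_n)^T w - y_n)^2
L_loop = 0.5 * sum((Phi[n] @ w - y[n]) ** 2 for n in range(N))

# the same loss, vectorized
L_vec = 0.5 * np.sum((Phi @ w - y) ** 2)

assert np.isclose(L_loop, L_vec)
```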
First I expanded the square: $$ L = \frac{1}{2} \sum^N _{n=1} (\phi(x_n)^Tw)^2 - 2(\phi(x_n)^Tw)y_n + y_n ^2 $$
Then I took the derivative w.r.t. $w$ and set it to 0:
$$ 0 = \sum^N _{n=1} (\phi(x_n)^Tw)\phi(x_n) - \phi(x_n)y_n \quad\Rightarrow\quad \sum^N _{n=1} \phi(x_n)y_n = \sum^N _{n=1} (\phi(x_n)^Tw)\phi(x_n) $$ The term $\sum^N _{n=1} (\phi(x_n)^Tw)\phi(x_n)$ confuses me. For one, it does not look associative, yet the multiplication is still defined because $\phi(x_n)^Tw$ is a scalar. Also, if I write $(\phi(x_n)^Tw)\phi(x_n) = \phi(x_n)(\phi(x_n)^Tw)$, then this turns out to be the correct solution for least squares, provided I can use associativity, i.e. $(\phi(x_n)\phi(x_n)^T)w$. This would make sense to me, since scalar multiplication is commutative. But why am I allowed to use associativity now and not before? Also, if I can treat $\phi(x_n)^Tw$ as a "unit" which is a scalar, then it must also hold that $(\phi(x_n)^Tw)^T = w^T\phi(x_n)$, which again does not look associative with $\phi(x_n)$.
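To convince myself that these expressions really are equal, I ran a small NumPy sanity check with a random feature vector and weight vector (just illustrative data, not my actual problem):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
phi = rng.standard_normal(m)   # stands in for one phi(x_n)
w = rng.standard_normal(m)

# scalar (phi^T w) times the vector phi
a = (phi @ w) * phi
# vector phi times the scalar (phi^T w)
b = phi * (phi @ w)
# outer product (phi phi^T), an m x m matrix, applied to w
c = np.outer(phi, phi) @ w

assert np.allclose(a, b)
assert np.allclose(b, c)
```

All three agree numerically, which is part of why I suspect my confusion is about notation rather than about the actual values.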
So, I guess, my question is: where is the mistake in my derivation / my line of thinking? I'm pretty sure it has something to do with the fact that the first expression is a $1\times 1$ matrix that I treat as a scalar, but I'm very confused about the whole situation, especially about when I am allowed to use commutativity and when I can use associativity.
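For reference, this is how I checked that the rearranged equation $\sum_n \phi(x_n)\phi(x_n)^T w = \sum_n \phi(x_n) y_n$ really gives the least-squares solution (again with random data; `np.linalg.lstsq` serves as ground truth):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 8, 3
Phi = rng.standard_normal((N, m))   # row n is phi(x_n)^T
y = rng.standard_normal(N)

# left-hand side: sum_n phi(x_n) phi(x_n)^T  (an m x m matrix, equals Phi^T Phi)
A = sum(np.outer(Phi[n], Phi[n]) for n in range(N))
# right-hand side: sum_n phi(x_n) y_n  (an m-vector, equals Phi^T y)
b = sum(Phi[n] * y[n] for n in range(N))

w_normal = np.linalg.solve(A, b)
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```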