I don't quite get how logistic loss works for binary classification:
$$\log(1+\exp(-y\cdot \mathbf{w}^T\mathbf{x})), \quad y\in\{-1,+1\}$$
Minimizing this function over $\mathbf{w}$ seems to me to simply make $y\cdot\mathbf{w}^T\mathbf{x}$ as large as possible, i.e. setting each $w_i$ to infinity (negative or positive, depending on the signs of $y$ and $x_i$).
What do I misunderstand?
$f(w; x_i) = 1/(1+\exp(-y_i w^Tx_i))$ is the probability of observing $y_i$ given $x_i$. Given a set of observations, assuming independence, the likelihood is the product of these functions. Applying a logarithmic transformation does not affect the location of the maximum (merely its value), and negating the result turns the maximization into a minimization problem. You are therefore interested in the $w$ that minimizes $-\log \prod_i f(w; x_i)$, or, equivalently, that minimizes $$\sum_i \log\left(1+\exp(-y_i w^Tx_i)\right).$$

Now you cannot simply let $y_i w^Tx_i$ go to $\infty$ for all $i$: observations with opposite labels pull $w$ in opposite directions, so driving one term toward zero inflates another. (The exception is linearly separable data, where $\|w\|$ does diverge; that is one reason regularization is typically added in practice.)
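To make this concrete, here is a small sketch (with made-up 1-D data and NumPy) showing that on a dataset containing both labels the summed logistic loss has a finite minimizer, while for a single observation the loss keeps decreasing as $w$ grows:

```python
import numpy as np

# Hypothetical 1-D data with both labels present and not linearly
# separable: the point (x=0.5, y=-1) sits among the positives.
x = np.array([1.0, 2.0, -1.0, 0.5])
y = np.array([+1.0, +1.0, -1.0, -1.0])

def total_loss(w):
    # sum_i log(1 + exp(-y_i * w * x_i))
    return np.sum(np.log1p(np.exp(-y * w * x)))

# Scan w over a grid: the loss rises again for large |w|,
# so the minimizer is finite.
ws = np.linspace(-10, 10, 2001)
vals = np.array([total_loss(w) for w in ws])
w_star = ws[np.argmin(vals)]
print(f"finite minimizer w* ~ {w_star:.2f}, loss {vals.min():.3f}")
print(f"loss at w=10: {total_loss(10.0):.3f}")  # larger: the mislabeled-side term blows up

# By contrast, with a single observation (y=+1, x=1) the loss
# decreases monotonically in w, which is exactly the questioner's intuition.
single = lambda w: np.log1p(np.exp(-w))
print(single(1.0), ">", single(10.0))
```

The key point is the misclassified-side term $\log(1+\exp(0.5\,w))$, which grows linearly in $w$ and eventually outweighs the terms that shrink to zero.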