I consider the following simple linear program
$$\min_x c^{T}x \text{ subject to } Ax \leq b$$
Its Lagrangian dual problem is $\max\limits_{\lambda \geq 0} D(\lambda)$ where $D(\lambda) = \min\limits_{x} L(x, \lambda)$ and
$$L(x,\lambda) = c^{T}x + \lambda^{T}(Ax-b)$$
In section 7.3.1 of Mathematics for Machine Learning, however, the authors minimize $L(x, \lambda)$ by differentiating $L$ with respect to $x$ and setting the derivative to $0$, obtaining $c + A^{T} \lambda = 0$.
But I think the minimum of $L(x,\lambda)$ over $x$ is just $-\infty$, since $L$ is linear in $x$: sending $x$ to $\pm \infty$ componentwise (more precisely, $x_{i} \to -\infty$ if $(c+A^{T}\lambda)_{i}>0$, and $x_{i} \to \infty$ if $(c+A^{T}\lambda)_{i}<0$) drives $L(x, \lambda)$ to $-\infty$. So in my opinion, the Lagrangian does not attain a minimum.
Moreover, in convex optimization, for an unconstrained problem with a differentiable convex objective, we have the theorem that $x$ is a minimizer $\iff$ $\nabla f(x) = 0$, by the first-order condition for differentiable convex functions. This also seems strange, since $f(x) = ax+b$ is affine, hence convex, but its infimum is $-\infty$, not a finite value, when $a \neq 0$.
What's wrong with my logic? (I don't think the constraints resolve either case: the Lagrangian is introduced precisely to remove the constraints, and the latter theorem assumes an unconstrained problem.)
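To make my claim concrete, here is a quick numerical check with made-up toy data ($c$, $A$, $b$, and $\lambda$ are my own choices, not from the book): when $c + A^{T}\lambda \neq 0$, walking along the descent direction drives $L(x,\lambda)$ down without bound.

```python
import numpy as np

# Toy data (my own, just to check the claim): if c + A^T lambda != 0,
# then L(x, lambda) = c^T x + lambda^T (A x - b) is unbounded below in x.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([4.0])
lam = np.array([0.5])          # some choice of lambda >= 0

g = c + A.T @ lam              # gradient of L in x; nonzero here
for t in [1.0, 10.0, 100.0]:
    x = -t * g                 # walk along the descent direction -g
    L = c @ x + lam @ (A @ x - b)
    print(L)                   # values keep decreasing
```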
You are correct that if $(c+A^\mathsf T \lambda)_{i}>0$ for some $i$, then $L(x,\lambda)$ can be made arbitrarily small by taking $x_i \to -\infty$, and if $(c+A^\mathsf T \lambda)_{i}<0$ for some $i$, then $L(x,\lambda)$ can be made arbitrarily small by taking $x_i \to \infty$.
On the other hand, if $c+A^\mathsf T \lambda = 0$, then $L(x,\lambda)$ simplifies to just $-\lambda^{\mathsf T}b$, and we can't make that arbitrarily small by changing $x$, because it no longer depends on $x$.
As a result, if we define $D(\lambda) = \inf_x L(x,\lambda)$, then we will have $$D(\lambda) = \begin{cases} -\lambda^{\mathsf T} b & c+A^\mathsf T \lambda = 0 \\ -\infty & \text{otherwise.}\end{cases}$$ But we are trying to maximize $D(\lambda)$, so we want to avoid the branch where it's $-\infty$: we want to choose $\lambda$ so that $c+A^\mathsf T \lambda = 0$. Therefore the dual is equivalent to the linear program $$\max_\lambda -\lambda^{\mathsf T}b \text{ subject to } c+A^\mathsf T \lambda = 0, \lambda \ge 0.$$ This is why we set the gradient to $0$, and in general this is the logic that lets us derive additional constraints on any Lagrangian dual.
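You can verify this equivalence numerically on a small instance. Below is a sketch using `scipy.optimize.linprog` (the instance $c$, $A$, $b$ is made up for illustration): the primal and the dual derived above attain the same optimal value, as strong duality for feasible LPs guarantees.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (made up): min c^T x  subject to  A x <= b
c = np.array([1.0, 2.0])
A = np.array([[-1.0, 0.0],
              [0.0, -1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, 4.0])

# Primal: linprog handles A_ub x <= b_ub directly; allow x to be free.
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual derived above: max -b^T lambda  s.t.  c + A^T lambda = 0, lambda >= 0.
# linprog minimizes, so we minimize b^T lambda and negate at the end.
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3)

print(primal.fun)   # primal optimal value
print(-dual.fun)    # dual optimal value; matches the primal
```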