It is known that the solution of the linear system $Ax=b$, where $A$ is symmetric and positive definite, is the minimizer of the quadratic function $f(x)= \frac{1}{2} x^T A x - x^T b$.
We can solve it using the steepest descent method. However, there is a better method, which chooses the directions $p_k$ to be $A$-conjugate.
What I don't understand is why choosing the current direction to be conjugate to the previous ones is a good thing.
Why is the CG method better?
As already mentioned in the comments, the typical weakness of steepest/gradient descent can be seen by considering, for instance, the model problem $$A = \begin{pmatrix} \lambda & 0 \\ 0 & \sigma\end{pmatrix}, \quad 0 < \lambda \ll \sigma , \quad b = \begin{pmatrix} 0 \\ 0 \end{pmatrix} $$ with objective function $$ f(x, y) = \frac12 \Big( \lambda x^2 + \sigma y^2 \Big). $$ The resulting contours are ellipses which are strongly elongated along the $x$-axis and rapidly increasing in the $y$-direction. Consequently, the convergence of steepest/gradient descent can be rather slow, as illustrated by the picture below.
(Note that two successive update directions are orthogonal to one another; this is not apparent in this picture due to the different scalings of the $x$- and $y$-axes.)
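The zig-zagging is easy to reproduce numerically. The following sketch runs steepest descent with exact line search on the model problem above, with the (arbitrary) choice $\lambda = 1$, $\sigma = 25$ and a hypothetical starting point; the minimizer is $x^* = 0$, yet the iterates approach it only slowly, while each new residual is exactly orthogonal to the previous one:

```python
import numpy as np

# Model problem with lambda = 1 << sigma = 25 (values chosen for illustration)
A = np.diag([1.0, 25.0])
b = np.zeros(2)

x = np.array([25.0, 1.0])           # starting point away from x* = 0
prev_r = None
for k in range(10):
    r = A @ x - b                   # gradient of f at x
    if prev_r is not None:          # successive directions are orthogonal
        assert abs(prev_r @ r) < 1e-8 * np.linalg.norm(prev_r) * np.linalg.norm(r)
    alpha = (r @ r) / (r @ A @ r)   # exact line search along -r
    x = x - alpha * r               # steepest descent step
    prev_r = r

print(np.linalg.norm(x))            # still far from 0 after 10 steps
```

For this starting point the error shrinks by only a factor of $12/13$ per step, so even after 10 iterations the iterate is nowhere near the solution.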
To argue why CG performs much better, it is helpful to introduce the following concept of optimality for subspaces $V \subseteq \mathbb R^n$: Call $\boldsymbol y \in \mathbb R^n$ optimal for $V$ if $$f(\boldsymbol y) = \min_{\boldsymbol u \in V} f(\boldsymbol y + \boldsymbol u). $$ In other words, $\boldsymbol y$ is optimal for $V$ if on the affine subspace $\boldsymbol y + V$ the objective $f$ is minimal at $\boldsymbol y$.
Now let $\big \{ \boldsymbol d_i \big \}_{i = 1, \dots, s} $ be a basis of $V$. Furthermore, for given $\boldsymbol y$ and $\boldsymbol c \in \mathbb R^s $ define $$h(\boldsymbol c) := f\bigg(\boldsymbol y + \sum_{i=1}^s c_i \boldsymbol d_i \bigg). $$ Then $\boldsymbol y$ is optimal for $V$ iff $\nabla_{\boldsymbol c} h(\boldsymbol 0) = \boldsymbol 0 $. Carrying out the differentiation, one obtains by the chain rule the "condition for optimality for subspaces" $$ \frac{\partial h}{\partial c_i}(\boldsymbol 0) = \big \langle \nabla f(\boldsymbol y), \boldsymbol d_i \big \rangle = \big \langle A \boldsymbol y - \boldsymbol b, \boldsymbol d_i \big \rangle = \big \langle \boldsymbol r_{\boldsymbol y}, \boldsymbol d_i \big \rangle \overset{!}{=} 0, \quad i = 1, \dots, s. \label{1} \tag{1}$$
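Condition \eqref{1} can be checked directly for a one-dimensional subspace: after one exact line search along a direction $\boldsymbol d$, the new iterate is optimal for $\operatorname{span}\{\boldsymbol d\}$, i.e. the new residual is orthogonal to $\boldsymbol d$. A minimal sketch (the matrix, right-hand side, and direction below are arbitrary choices for the demonstration):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # some s.p.d. matrix
b = np.array([1.0, 2.0])
y = np.array([2.0, 1.0])
d = np.array([1.0, -1.0])

r = A @ y - b                      # residual/gradient at y
alpha = -(r @ d) / (d @ A @ d)     # minimizes f(y + alpha*d) over alpha
y_new = y + alpha * d
r_new = A @ y_new - b

print(r_new @ d)                   # 0: y_new is optimal for span{d}, cf. (1)
```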
With this concept, you can motivate $A$-conjugate search directions as follows: Assume that your first search direction $\boldsymbol p^0 $ is given by the steepest descent one, $$\boldsymbol p^0 = \nabla f(\boldsymbol x^0) = A \boldsymbol x^0 - \boldsymbol b =: \boldsymbol r^0. \label{2} \tag{2} $$ Here, I use an update formula of the form $$\boldsymbol x^{k+1} = \boldsymbol x^{k} + \alpha^k \boldsymbol p^k \label{3} \tag{3}$$ and let $\alpha^k$ take care of the sign. Then, assume you are given / can find a second search direction $\boldsymbol p^1$ that is $A$-conjugate w.r.t. the previous one $\boldsymbol p^0 $, i.e., $$ \big \langle \boldsymbol p^1, A \boldsymbol p^0 \big \rangle = \big \langle A \boldsymbol p^1, \boldsymbol p^0 \big \rangle = 0, \label{4} \tag{4}$$ where the first equality holds since $A$ is symmetric. Minimizing $f(\boldsymbol x^k + \alpha \boldsymbol p^k)$ over $\alpha$ yields the optimal step length $$ \alpha^k = -\frac{ \big \langle \boldsymbol p^k, \boldsymbol r^k \big \rangle }{\big \langle \boldsymbol p^k, A \boldsymbol p^k \big \rangle}, \qquad \boldsymbol r^k := A \boldsymbol x^k - \boldsymbol b, \label{5} \tag{5}$$ which for $k = 0$ reduces by \eqref{2} to $-\big \langle \boldsymbol p^0, \boldsymbol p^0 \big \rangle / \big \langle \boldsymbol p^0, A \boldsymbol p^0 \big \rangle$ and is then guaranteed to be negative since $A$ is s.p.d. With this, you can show that \begin{align} \big \langle \boldsymbol p^0, A \boldsymbol x^2 - \boldsymbol b \big \rangle \overset{\eqref{3}}{=}& \big \langle \boldsymbol p^0, A (\boldsymbol x^1 + \alpha^1 \boldsymbol p^1) - \boldsymbol b \big \rangle \\ =& \big \langle \boldsymbol p^0, A \boldsymbol x^1 - \boldsymbol b \big \rangle + \alpha^1 \big \langle \boldsymbol p^0, A \boldsymbol p^1 \big \rangle \\ \overset{\eqref{3}, \eqref{4}}{=}& \big \langle \boldsymbol p^0, A (\boldsymbol x^0 + \alpha^0 \boldsymbol p^0) - \boldsymbol b \big \rangle \\ =& \big \langle \boldsymbol p^0, A \boldsymbol x^0 - \boldsymbol b \big \rangle + \alpha^0 \big \langle \boldsymbol p^0, A \boldsymbol p^0\big \rangle \\ \overset{\eqref{2}, \eqref{5}}{=}& \big \langle \boldsymbol p^0, \boldsymbol p^0 \big \rangle - \frac{ \big \langle \boldsymbol p^0, \boldsymbol p^0 \big \rangle }{\big \langle \boldsymbol p^0,A \boldsymbol p^0 \big \rangle} \big \langle \boldsymbol p^0, A \boldsymbol p^0\big \rangle = 0 \label{6} \tag{6} \end{align} and \begin{align} \big \langle \boldsymbol p^1, A \boldsymbol x^2 - \boldsymbol b \big \rangle \overset{\eqref{3}}{=}& \big \langle \boldsymbol p^1, A (\boldsymbol x^1 + \alpha^1 \boldsymbol p^1) - \boldsymbol b \big \rangle \\ =& \big \langle \boldsymbol p^1, \boldsymbol r^1 \big \rangle + \alpha^1 \big \langle \boldsymbol p^1, A \boldsymbol p^1 \big \rangle \\ \overset{\eqref{5}}{=}& \big \langle \boldsymbol p^1, \boldsymbol r^1 \big \rangle - \frac{ \big \langle \boldsymbol p^1, \boldsymbol r^1 \big \rangle }{\big \langle \boldsymbol p^1,A \boldsymbol p^1 \big \rangle} \big \langle \boldsymbol p^1, A \boldsymbol p^1\big \rangle = 0.\label{7} \tag{7} \end{align}
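The two-step derivation above can be verified numerically. The sketch below follows the same conventions ($\boldsymbol r = A\boldsymbol x - \boldsymbol b$, update $\boldsymbol x \leftarrow \boldsymbol x + \alpha \boldsymbol p$ with negative $\alpha$) on a 2D instance of the model problem; the concrete numbers for $b$ and $x^0$, and the Gram–Schmidt construction of an $A$-conjugate $\boldsymbol p^1$, are my own choices for illustration:

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 25.0]])   # model problem, lambda = 1, sigma = 25
b = np.array([2.0, 3.0])
x0 = np.array([5.0, -4.0])

p0 = A @ x0 - b                            # (2): first direction is the gradient r^0
a0 = -(p0 @ p0) / (p0 @ A @ p0)            # (5) for k = 0
x1 = x0 + a0 * p0                          # (3)

r1 = A @ x1 - b
# Make p^1 A-conjugate to p^0 by A-orthogonalizing r^1 against p^0, cf. (4)
p1 = r1 - ((r1 @ (A @ p0)) / (p0 @ A @ p0)) * p0
a1 = -(p1 @ r1) / (p1 @ A @ p1)            # (5) for k = 1
x2 = x1 + a1 * p1

print(p0 @ (A @ x2 - b), p1 @ (A @ x2 - b))        # both 0, i.e. (6) and (7)
print(np.linalg.norm(x2 - np.linalg.solve(A, b)))  # x^2 is the exact solution
```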
The significance of this is that, due to $A$ being s.p.d. and $ \big \langle \boldsymbol p^1, A \boldsymbol p^0 \big \rangle = 0$, the directions $\boldsymbol p^0, \boldsymbol p^1$ are linearly independent and thus form a basis of $V = \mathbb R^2$! By \eqref{6}, \eqref{7} we have also shown that $\boldsymbol x^2$ is optimal for $V = \mathbb R^2$, cf. \eqref{1}, and is therefore the exact solution. This proves in two dimensions the well-known property of the CG algorithm that, in exact arithmetic, it converges in at most $n$ iterations for an $n$-dimensional problem.
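The same finite-termination property can be observed for general $n$. Here is a minimal sketch of the standard CG recurrence on a random s.p.d. system; note it uses the more common sign convention $\boldsymbol r = \boldsymbol b - A\boldsymbol x$ with positive step lengths, and the random test matrix is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # s.p.d. by construction
b = rng.standard_normal(n)

x = np.zeros(n)
r = b - A @ x                      # residual (sign convention r = b - Ax)
p = r.copy()
for k in range(n):                 # at most n iterations suffice
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)     # exact line search along p
    x = x + alpha * p
    r_new = r - alpha * Ap
    if np.linalg.norm(r_new) < 1e-12:
        break
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p           # new direction, A-conjugate to the previous ones
    r = r_new

print(np.linalg.norm(A @ x - b))   # essentially 0 after at most n iterations
```

Each new direction is built so that it stays $A$-conjugate to all previous ones, which is exactly why the iterate after $k$ steps is optimal for the span of the first $k$ directions, as in the 2D argument above.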