I am looking for an intuition for the following behavior:
Let $f(x)=x^2$
Apparently the Newton's method iterations $x_{n+1}=x_n-\frac{f(x_n)}{f'(x_n)}$ converge relatively slowly to the root (at $x=0$) in this case. I was told this happens because the derivative at the root is also zero: $f'(0)=0$.
However, I don't understand why the derivative would matter:
$$x_{n+1}=x_n-\frac{x_n^2}{2x_n}=\frac{x_n}{2}$$ I would say that because the root is $0$, $x_n$ becomes smaller and smaller, and therefore the steps towards the 'right' answer become smaller and smaller.
a) Is this reasoning correct?
b) What does all this have to do with the derivative?
I created some graphics:
First $f(x)=x^2$ with $x_0=2$
And $f(x)=x^2-1$ with $x_0=3$
Indeed, with the double root the convergence goes slower. Still I do not really have an intuition for this behavior. I can see the calculations confirm this result, but why is this also intuitively true?
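To make the comparison concrete, here is a quick Python sketch of the same two experiments (the step count is arbitrary):

```python
# Newton's method on f(x) = x^2 with x0 = 2 (double root at 0)
# versus f(x) = x^2 - 1 with x0 = 3 (simple root at 1).

def newton(f, fp, x, steps):
    """Run `steps` Newton iterations starting from x."""
    for _ in range(steps):
        x = x - f(x) / fp(x)
    return x

double = newton(lambda x: x * x, lambda x: 2 * x, 2.0, 10)      # error only halves each step
simple = newton(lambda x: x * x - 1, lambda x: 2 * x, 3.0, 10)  # essentially exactly 1.0
print(double, simple)
```

After ten steps the double-root iterate is still $2/2^{10}$ away from $0$, while the simple-root iterate has converged to machine precision.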



Try running your argument for the function
$$ f(x) = (x-1)^2 $$
You will quickly see that it is the derivative at the root, and not the value of the root, that matters.
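A small Python sketch of that experiment (the starting point $x_0 = 3$ is my own choice): the error $x_n - 1$ halves each step even though the root sits at $1$, not $0$.

```python
# Newton's method on f(x) = (x - 1)^2, whose double root is at x = 1.
# The error x_n - 1 still halves each step, so the slow convergence is
# caused by the vanishing derivative at the root, not by the root being 0.

def newton_step(x):
    f = (x - 1) ** 2
    fp = 2 * (x - 1)
    return x - f / fp

x = 3.0
errors = []
for _ in range(5):
    x = newton_step(x)
    errors.append(x - 1)

print(errors)  # each error is exactly half the previous one
```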
Okay: first, note that in Newton's method, unless you accidentally hit the root exactly in finitely many steps, the steps always get smaller and smaller. So the qualitative fact that the steps get smaller says nothing about the rate of convergence.
The key is this:
Consider a polynomial (for general functions, this would be the Taylor polynomial near a root $f(b) = 0$)
$$ f(x) = a_1 (x-b) + a_2 (x-b)^2 + a_3 (x-b)^3 $$
which has a root at $x = b$.
Simple root case
Consider the simple root case $a_1 \neq 0$. We have that the steps
$$ (x-b) - \frac{f(x)}{f'(x)} = \frac{(x-b) f'(x) - f(x)}{f'(x)} = \frac{ a_2 (x-b)^2 + 2 a_3 (x-b)^3}{a_1 + 2 a_2 (x-b) + 3 a_3 (x-b)^2} $$
So this means that when $x_n$ is very close to the root $b$, that
$$ x_{n+1} - b = (x_n-b) - \frac{f(x_n)}{f'(x_n)} = \frac{a_2}{a_1} (x_n-b)^2 + o((x_n - b)^2) $$
That is, if the error $\epsilon_n = (x_n - b)$ at step $n$ is small, then the error $\epsilon_{n+1} \approx \frac{a_2}{a_1} \epsilon_n^2$ is much smaller.
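This quadratic error law can be checked numerically. For $f(x) = x^2 - 1$ near the simple root $b = 1$ we have $a_1 = f'(1) = 2$ and $a_2 = 1$, so the ratio $\epsilon_{n+1}/\epsilon_n^2$ should approach $a_2/a_1 = \frac12$ (the starting point below is my own choice):

```python
# Track eps_{n+1} / eps_n^2 for Newton's method on f(x) = x^2 - 1,
# which has a simple root at b = 1. The ratio should tend to a_2/a_1 = 0.5.

x = 1.5
ratios = []
for _ in range(4):
    eps = x - 1
    x = x - (x * x - 1) / (2 * x)
    ratios.append((x - 1) / eps**2)

print(ratios)  # approaches 0.5 as x_n nears the root
```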
Double root case
What if $a_1 = 0$ and $a_2 \neq 0$? The same computations give
$$ (x-b) - \frac{f(x)}{f'(x)} = \frac{a_2(x-b)^2 + 2 a_3(x-b)^3}{2 a_2(x-b) + 3a_3(x-b)^2} $$
which, when $x_n$ is very close to $b$, gives
$$ x_{n+1} - b = (x_n - b) - \frac{f(x_n)}{f'(x_n)} = \frac12 (x_n-b) + o(x_n-b) $$
that is to say, the error decays like
$$ \epsilon_{n+1} \approx \frac12 \epsilon_n $$
Now the difference is clear: for the same starting (small) error $\epsilon$, in the simple root case the new error after running the algorithm for one step is only $\frac{a_2}{a_1} \epsilon^2$; in the double root case the new error is $\frac12 \epsilon$, which is much bigger whenever $\epsilon \ll \frac{|a_1|}{2|a_2|}$.
In words: in the simple root case Newton's method has the property that "the closer you are to the root, the higher the percentage of the error corrected in each step", with the percentage going toward 100%. In the double root case, by contrast, Newton's method has the property that "when close to the root, only about half of the error can be corrected at most in each step".
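The "percentage corrected per step" can be watched directly by tracking $\epsilon_{n+1}/\epsilon_n$ (the functions and the starting distance of $1$ from each root are my own choices):

```python
# Ratio eps_{n+1}/eps_n per step: simple root of f(x) = x^2 - 1 at 1
# versus double root of g(x) = x^2 at 0.

def step_ratios(f, fp, root, x, steps):
    """Return the per-step error ratios of Newton's method."""
    out = []
    for _ in range(steps):
        eps = x - root
        x = x - f(x) / fp(x)
        out.append((x - root) / eps)
    return out

simple_r = step_ratios(lambda t: t * t - 1, lambda t: 2 * t, 1.0, 2.0, 4)
double_r = step_ratios(lambda t: t * t, lambda t: 2 * t, 0.0, 2.0, 4)
print(simple_r)  # shrinks toward 0: an ever-higher percentage is corrected
print(double_r)  # stays at 0.5: only half the error is corrected each step
```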
Double root case revisited
Now, if we use the formula given by Felix Mann in his comment, we can rework the approximation using the same polynomial as above. Assume $a_1 = 0$ and $a_2 \neq 0$.
$$ (x-b) - 2 \frac{f(x)}{f'(x)} = \frac{a_3(x-b)^3}{2a_2(x-b) + 3a_3(x-b)^2} $$
so starting with error $\epsilon_n = x_n - b$, your new error is $$ \epsilon_{n+1} \approx \frac{a_3}{2a_2} \epsilon_n^2 $$ and you get again the property that "the closer you get to the double root, the higher the percentage of error corrected per step". But notice that if the root were actually simple (meaning $a_1 \neq 0$), then you would get
$$ (x-b) - 2 \frac{f(x)}{f'(x)} = \frac{- a_1(x-b) + a_3(x-b)^3}{a_1 + 2a_2(x-b) + 3a_3(x-b)^2} $$
and we arrive at a situation where "if the initial error is small, the error of the next step is approximately the same size!" And you won't even converge to the root. (So the modified formula really should only be used when you know for sure that the root is a double root.)
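Both behaviors can be sketched numerically. Below I apply the modified step $x - 2f(x)/f'(x)$ to $f(x) = x^2 + x^3$ (double root at $0$ with $a_2 = a_3 = 1$, so the prediction is $\epsilon_{n+1} \approx \epsilon_n^2/2$) and then, as a cautionary example of my own, to the simple root of $g(x) = x$, where the step just flips the sign of the error forever:

```python
# Modified Newton step x - 2 f(x)/f'(x), intended for double roots.

def modified_step(f, fp, x):
    return x - 2 * f(x) / fp(x)

# Double root: f(x) = x^2 + x^3, root 0. Errors should decay quadratically.
x = 0.5
errs = []
for _ in range(4):
    x = modified_step(lambda t: t * t + t**3, lambda t: 2 * t + 3 * t * t, x)
    errs.append(x)

# Simple root: g(x) = x, root 0. The step maps y to -y: it never converges.
y = 0.5
for _ in range(4):
    y = modified_step(lambda t: t, lambda t: 1.0, y)

print(errs, y)  # errs shrink rapidly; y just oscillates between +/- 0.5
```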
Intuition
The very first thing you need to remember is that Newton's method converges in one step for linear functions. That is, if you look at Newton's method, each step proceeds by approximating the curve by a triangle with height $f(x_n)$ and slope $f'(x_n)$ and correcting the guess $x_n$ by the corresponding base $f(x_n) / f'(x_n)$.
Simple root case
In the simple root case, when we zoom into the graph near its root, the graph straightens out. So it becomes more and more like a triangle where we can find the base by dividing the height by the slope. Because the graph itself converges toward the triangle, Newton's method becomes more and more effective the closer you are to the root.
(The above animation zooms in on the origin of the graph of $y = (x+0.1)^3 - 0.001$, which has a simple root at $x = 0$.)
Multiple root case
In the multiple root case, zooming in doesn't straighten out the curve at all. The picture always remains far from a triangle. In particular, as we zoom in you don't get the gain in effectiveness of the triangle approximation that we saw in the simple root case.
(The above animation zooms in on the origin of the graph of $y = x^3$, which has a triple root at $x = 0$.)
The Schröder approach mentioned in the comments below overcomes this by recognizing that the "shape" the graph converges to is in fact $y = x^m$, where $m$ is the multiplicity. So instead of approximating by triangles (the linear case) one can approximate by these higher-order curves. Again, since the graph becomes more and more like the model behavior of our approximation as we zoom in, we get the same gain in effectiveness.
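A sketch of this idea, using the step $x_{n+1} = x_n - m\,f(x_n)/f'(x_n)$ for a root of multiplicity $m$ (the test function $f(x) = x^3$ is my own choice; since its graph *is* the model curve $y = x^3$, the $m = 3$ step lands on the root exactly):

```python
# Multiplicity-aware Newton step on f(x) = x^3, a triple root at 0.
# Plain Newton removes only a third of the error per step; the m = 3
# step is exact for a pure power.

f = lambda x: x**3
fp = lambda x: 3 * x * x

x = 1.0
plain = []
for _ in range(3):
    x = x - f(x) / fp(x)
    plain.append(x)          # roughly 2/3, 4/9, 8/27

y = 1.0
y = y - 3 * f(y) / fp(y)     # exact for a pure power

print(plain, y)
```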