Background
I am self-studying linear algebra, and I got stuck on some steps of the proof of the following theorem:
Theorem$\quad$ Let $A$ be an $n \times n$ complex matrix and $B$ be an $m \times m$ complex matrix. For any $n \times m$ complex matrix $C$, the Sylvester equation $AX+XB=C$ has a unique solution $X$, which is an $n \times m$ complex matrix, if and only if $A$ and $-B$ do not share any eigenvalue.
Here is the proof:
Proof$\quad$ The equation $AX+XB=C$ is a linear system with $mn$ unknowns and $mn$ equations. Hence, it is uniquely solvable for any given $C$ if and only if the homogeneous equation $AX+XB=0$ admits only the trivial solution.
First suppose that $A$ and $-B$ do not share any eigenvalue. Let $X$ be a solution to the homogeneous system $AX+XB=0$. Then $AX=X(-B)$. Since \begin{align*} &A(AX) = A(X(-B))\\ \implies\ &(AA)X = (AX)(-B) = (X(-B))(-B) = X((-B)(-B))\\ \implies\ &A^2X = X(-B)^2, \end{align*} by mathematical induction, we have $A^kX=X(-B)^k$ for each $k\in\mathbb{N}$. Then, $p(A)X = Xp(-B)$ for any polynomial $p$. In particular, let $p$ be the characteristic polynomial of $A$; that is, let $p(r) = \det(A-rI)$. Then, $p(A) = 0$. If $E$ is any square matrix, let $\sigma(E)$ denote the set of eigenvalues of $E$. Then, $\sigma(p(-B)) = p(\sigma(-B))$. Since $A$ and $-B$ do not share any eigenvalue, $p(\sigma(-B))$ does not contain zero, and hence, $p(-B)$ is nonsingular. Thus, $X$ is the zero matrix.
Conversely, suppose that $A$ and $-B$ share an eigenvalue $\lambda$. Let $\mathbf{u}$ be a corresponding right eigenvector for $A$ and $\mathbf{v}^*$ be a corresponding left eigenvector for $-B$; that is, $A\mathbf{u}=\lambda\mathbf{u}$ and $\mathbf{v}^*(-B) = \lambda\mathbf{v}^*$. Let $X=\mathbf{u}\mathbf{v}^*$. Then, $X$ is not the zero matrix, and $AX+XB = A(\mathbf{u}\mathbf{v}^*) - (\mathbf{u}\mathbf{v}^*)(-B) = \lambda\mathbf{u}\mathbf{v}^* - \lambda\mathbf{u}\mathbf{v}^* = 0$. Therefore, $X$ is a nontrivial solution to the homogeneous system $AX+XB=0$.
My Questions
I could not understand certain steps in the above proof.
For the "if" direction (the second paragraph in the proof):
- After it proved that $A^kX=X(-B)^k$, it claimed that $p(A)X=Xp(-B)$ for any polynomial. Why is that?
- After it defined $p$ to be the characteristic polynomial, it said that $p(A)=0$. Is this because one might just plug in $A$ to $p(r)$ to get $p(A) = \det(A-AI) = \det(0) = 0$?
- I have difficulty understanding the notation $p(\sigma(-B))$. While $\sigma(-B)$ is the set of eigenvalues of $-B$, what is $p(\sigma(-B))$? Is it the image of the set of eigenvalues of $-B$ under the polynomial $p$? If so, does $\sigma(p(-B)) = p(\sigma(-B))$ mean that the two sets are equal?
- Why does the fact that $A$ and $-B$ do not share any eigenvalue imply that $p(\sigma(-B))$ does not contain zero, and how does that imply that $p(-B)$ is nonsingular?
- Why does the nonsingularity of $p(-B)$ imply that $X$ is the zero matrix?
For the "only if" part (the third paragraph in the above proof):
- I am not familiar with the notation $\mathbf{v}^*$. Is it the conjugate transpose of $\mathbf{v}$? If so, why can $\mathbf{u}\mathbf{v}^*$ not be the zero matrix?
I apologize for my lousy beginner linear algebra background. I would really appreciate it if someone could help me with these questions!
Short version: $X$ distributes over the additions in $p(A)$; in each term, you swap $A^k X$ for $X(-B)^k$, then undistribute the $X$.
In detail: Let $N \in \Bbb{N}$ and $p(A) = \sum_{i=0}^N c_i A^i$. Then \begin{align*} p(A)X &= \sum_{i=0}^N c_i A^i X \\ &= \sum_{i=0}^N c_i X (-B)^i \\ &= X \sum_{i=0}^N c_i (-B)^i \\ &= X p(-B) \text{.} \end{align*}
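If you want to see this identity in action, here is a small NumPy check; the matrices are my own illustrative choices, picked so that $A$ and the matrix playing the role of $-B$ share the eigenvalue $2$, which makes a nonzero $X$ with $AX = X(-B)$ easy to write down:

```python
import numpy as np

# Illustrative matrices (my own choice): A and -B share the eigenvalue 2.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])            # eigenvalues 2 and 3
minus_B = np.array([[2.0, 0.0],
                    [5.0, 7.0]])      # eigenvalues 2 and 7

u = np.array([[1.0], [0.0]])          # right eigenvector: A u = 2 u
w = np.array([[1.0, 0.0]])            # left eigenvector: w (-B) = 2 w
X = u @ w                             # then A X = X (-B)
assert np.allclose(A @ X, X @ minus_B)

def poly_at(coeffs, M):
    """Evaluate p(M) = sum_i coeffs[i] * M**i for a square matrix M
    (coefficients in ascending order of degree)."""
    result = np.zeros_like(M)
    power = np.eye(M.shape[0])
    for c in coeffs:
        result = result + c * power
        power = power @ M
    return result

coeffs = [3.0, -1.0, 4.0, 2.0]        # an arbitrary polynomial p
lhs = poly_at(coeffs, A) @ X
rhs = X @ poly_at(coeffs, minus_B)
print(np.allclose(lhs, rhs))          # True: p(A)X = Xp(-B)
```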
Not quite: $\det(A - AI) = \det(0) = 0$ is an equation about a scalar, not about the square matrix $p(A)$ obtained by substituting $A$ into the polynomial $p$. That $p(A)$ is the zero matrix is precisely the Cayley-Hamilton theorem.
$p(\sigma(-B))$ is the set of values $p(e_1)$, $p(e_2)$, ..., $p(e_N)$ where $-B$ has $N$ eigenvalues (with or without repetition; the set is the same either way) named $e_1$, $e_2$, ..., $e_N$. This uses fairly common notation: for $f$ a function and $S$ a set, $f(S) = \{f(s) \mid s \in S\}$. And yes, $\sigma(p(-B)) = p(\sigma(-B))$ means the two sets are equal: "the set of eigenvalues of the matrix $p(-B)$ is the set of values obtained by evaluating $p$ at the eigenvalues of $-B$". (This is the spectral mapping theorem for polynomials.)
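The set equality itself is easy to test numerically; here is a sketch with a matrix and polynomial of my own choosing:

```python
import numpy as np

# A triangular matrix standing in for -B; its eigenvalues sit on the diagonal.
M = np.array([[2.0, 1.0, 0.0],
              [0.0, 5.0, 3.0],
              [0.0, 0.0, -1.0]])     # eigenvalues 2, 5, -1

def p(x):
    return x**2 - 3*x + 1            # an arbitrary polynomial

pM = M @ M - 3*M + np.eye(3)         # p evaluated at the matrix

# Compare sigma(p(M)) with p(sigma(M)) as sorted lists of values.
eig_of_pM = np.sort(np.linalg.eigvals(pM).real)
p_of_eig = np.sort(p(np.linalg.eigvals(M).real))
print(np.allclose(eig_of_pM, p_of_eig))   # True
```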
The roots of the characteristic polynomial of $A$ are exactly its eigenvalues. So, having taken $p$ to be the characteristic polynomial of $A$, $p$ is zero on each of $A$'s eigenvalues. (Additional theory you should know: (1) the characteristic polynomial need not be the minimal (degree) polynomial, and (2) if it is not, its only "additional" roots are repetitions of the minimal polynomial's roots. Consequently, $p$ is zero exactly on the eigenvalues of $A$ and nonzero elsewhere.) Since $A$ and $-B$ do not share eigenvalues, $p$ cannot be zero on any of $-B$'s eigenvalues. That is, $0 \not\in p(\sigma(-B))$, so $0 \not\in \sigma(p(-B))$ (by the set equality from the previous question), and hence $p(-B)$ is nonsingular. (A matrix is singular if and only if $0$ is one of its eigenvalues.)
When we took $p$ to be the characteristic polynomial of $A$, the identity $p(A)X = Xp(-B)$ simplified to $0 = Xp(-B)$. Since $p(-B)$ is nonsingular, multiplying on the right by $p(-B)^{-1}$ gives $X = 0$. (In row terms: each row of $X$ times $p(-B)$ is the zero vector, and a nonsingular matrix has a trivial left nullspace, so every row of $X$ must be the zero vector and $X$ is the zero matrix.)
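Here is a numeric sketch of those last steps, again with illustrative matrices of my own choosing whose spectra are disjoint; `np.poly` recovers the characteristic polynomial's coefficients:

```python
import numpy as np

# Disjoint spectra: A has eigenvalues 1, 3; -B has eigenvalues 4, 6.
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
minus_B = np.array([[4.0, 0.0],
                    [1.0, 6.0]])

coeffs = np.poly(A)                  # characteristic polynomial, highest degree first

def poly_at(coeffs, M):
    """Evaluate the polynomial at a square matrix by Horner's rule
    (coefficients in descending order of degree, as np.poly returns)."""
    result = np.zeros_like(M)
    for c in coeffs:
        result = result @ M + c * np.eye(M.shape[0])
    return result

p_A = poly_at(coeffs, A)
p_minus_B = poly_at(coeffs, minus_B)

print(np.allclose(p_A, 0))                   # True: Cayley-Hamilton, p(A) = 0
print(abs(np.linalg.det(p_minus_B)) > 1e-9)  # True: p(-B) is nonsingular,
                                             # so X p(-B) = 0 forces X = 0
```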
(I probably would have numbered this "6", since you were numbering your question blobs.) Yes, $\mathbf{v}^*$ standardly denotes the conjugate transpose of $\mathbf{v}$, although reading through how it is used here, the plain transpose would also work. (And it must at least be the transpose for $X = \mathbf{u}\mathbf{v}^*$ to have the correct dimensions.)
The zero vector is never an eigenvector, so neither $\mathbf{u}$ nor $\mathbf{v}$ is the zero vector. The $(i,j)$ entry of the outer product $\mathbf{u}\mathbf{v}^*$ is $u_i \overline{v_j}$, so choosing nonzero coordinates $u_i$ and $v_j$ exhibits a nonzero entry; hence $\mathbf{u}\mathbf{v}^*$ is not the $0$ matrix.
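Finally, the whole "only if" construction can be checked in a few lines (shared eigenvalue $2$, matrices of my own choosing):

```python
import numpy as np

# A and -B share the eigenvalue 2.
A = np.array([[2.0, 1.0],
              [0.0, 5.0]])            # eigenvalues 2, 5
minus_B = np.array([[2.0, 0.0],
                    [3.0, 4.0]])      # eigenvalues 2, 4
B = -minus_B

u = np.array([[1.0], [0.0]])          # right eigenvector: A u = 2 u
v_star = np.array([[1.0, 0.0]])       # left eigenvector: v* (-B) = 2 v*
X = u @ v_star                        # a nonzero 2x2 matrix

print(np.allclose(A @ X + X @ B, 0))  # True: X solves the homogeneous system
print(np.any(X != 0))                 # True: X is nontrivial
```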