Consider a real-valued orthogonal matrix $Q$ and a sequence of diagonal matrices $\{D_m\}_{m=1}^\infty$. All entries of $Q$ are real, and the entries of each $D_m$ are real and positive. What is the cost of the following multiplication(s)?
$$\Sigma_m = Q D_m Q^\top$$
Is there any way this can be accomplished in $\mathcal O(n^2)$ time? I'm interested in the amortized cost, meaning that I'm okay with a $\mathcal O(n^3)$ pre-processing step if it leads to (eventual) $\mathcal O(n^2)$ multiplications.
I'm not aware of any algorithm for this, although perhaps someone knows of one. Nothing I write below uses the fact that $Q$ is orthogonal, so perhaps that can be exploited in some way.
In terms of theoretical asymptotic complexity, $AA^T$ can be computed slightly faster than an arbitrary matrix product can be. Specifically, faster by a factor of $2/(2^\omega-3)$, where $\omega \approx 2.37$ is the current best exponent of matrix multiplication.
However, since your matrices are of size $n=1000$, the constants suppressed by the big-O and, more importantly, the implementation of the matrix product algorithms will come into play. Effective use of a library such as numpy, which takes advantage of CPU caching, vectorization, etc., will likely have a very large impact on the runtime.
symmetry
If we're assuming we are going to pay $n^3$, then we should focus on the constant in front. An obvious optimization is that $QD_mQ^T$ is symmetric, so only the upper (or lower) triangular part needs to be computed. This can be done directly by noting that $$ [\Sigma_m]_{i,j} = Q_{[i,:]} D_m Q_{[j,:]}^T = \sum_{k=1}^{n} [D_m]_{k,k} Q_{i,k} Q_{j,k}, $$ where $Q_{[i,:]}$ is the $i$-th row of $Q$. Computing each entry this way requires $3n-1$ flops, and this must be done for $n(n+1)/2$ entries. If $m=1,\ldots, M$, then this is a total cost of $M(3n-1)n(n+1)/2\approx (3/2)Mn^3$.
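As a sanity check, here is a minimal numpy sketch of the entrywise formula (the variable names and the random test matrices are my own, just for illustration):

```python
import numpy as np

# Check the entrywise formula [Sigma_m]_{i,j} = sum_k [D_m]_{k,k} Q_{i,k} Q_{j,k}
# on a small random instance.
rng = np.random.default_rng(0)
n = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a random orthogonal matrix
d = rng.uniform(0.1, 1.0, size=n)                 # positive diagonal entries of D_m

Sigma = Q @ np.diag(d) @ Q.T

# One entry via the sum over k (only the upper triangle is needed by symmetry).
i, j = 1, 3
entry = np.sum(d * Q[i, :] * Q[j, :])
assert np.isclose(entry, Sigma[i, j])
```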
preprocessing
However, notice that the only dependence of this sum on $D_m$ is through the term $[D_{m}]_{k,k}$. Thus, the products $Q_{i,k}Q_{j,k}$ can be precomputed. If we define a new vector $$ q^{(i,j)} = [Q_{i,1}Q_{j,1}, Q_{i,2}Q_{j,2}, \ldots, Q_{i,n}Q_{j,n} ]^T, $$ then we have $[\Sigma_{m}]_{i,j} = (q^{(i,j)})^T d_m$, where $d_m$ is the vector whose entries are the diagonal of $D_m$.
By symmetry, $q^{(i,j)} = q^{(j,i)}$, so we can preprocess our data set by computing $q^{(i,j)}$ for $i=1,2,\ldots, n$, $j=1,2,\ldots, i$, and storing all $n(n+1)/2$ vectors. The cost of computing each $q^{(i,j)}$ is $n$ flops, so the total cost is $n^2(n+1)/2 \approx n^3/2$ (and this much storage is also required).
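The preprocessing step can be sketched in a few lines of numpy. Here the stacked vectors are stored as rows of a matrix I'll call `P` (the name and the random test data are my own):

```python
import numpy as np

# Preprocessing: store q^{(i,j)} for all i >= j as the rows of an
# n(n+1)/2 x n matrix P, at cost ~ n^3/2 multiplications.
rng = np.random.default_rng(1)
n = 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal Q

rows, cols = np.tril_indices(n)   # all (i, j) index pairs with i >= j
P = Q[rows, :] * Q[cols, :]       # row for (i, j) holds Q[i, k] * Q[j, k], k = 1..n

# Each entry of Sigma_m is now a dot product with the diagonal vector d_m.
d = rng.uniform(0.1, 1.0, size=n)
Sigma = Q @ np.diag(d) @ Q.T
assert np.allclose(P @ d, Sigma[rows, cols])
```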
vectorization
Now the cost of computing $[\Sigma_{m}]_{i,j}$ is the cost of a dot product, $2n-1$ flops, which must be done $n(n+1)/2$ times for each $\Sigma_m$. This gives a total cost of $n^2(n+1)/2 + M(2n-1)n(n+1)/2 \approx n^3/2 + M n^3$, so if $M$ is very large, we improve on the direct approach by a factor of roughly $1.5$.
But, perhaps even more important is that all of these products can be vectorized. Specifically, we can store the $q^{(i,j)}$s in an $n(n+1)/2\times n$ matrix and all the $d_m$s in an $n\times M$ matrix; a single matrix product then produces every entry of every $\Sigma_m$. The last step would be to take the data out of this matrix product and put it into a usable form.
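Putting the pieces together, a sketch of the fully vectorized version might look like this (again, the names `P`, `D`, `T` and the random test data are my own; in practice the single large product would go to an optimized BLAS):

```python
import numpy as np

# One matrix product computes the lower triangles of Sigma_1, ..., Sigma_M at once.
rng = np.random.default_rng(2)
n, M = 4, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal Q
D = rng.uniform(0.1, 1.0, size=(n, M))            # column m holds the diagonal d_m

rows, cols = np.tril_indices(n)
P = Q[rows, :] * Q[cols, :]   # n(n+1)/2 x n preprocessing matrix of q^{(i,j)} rows

T = P @ D                     # n(n+1)/2 x M: entry (r, m) is [Sigma_m]_{i,j}

# Scatter each column of T back into a symmetric n x n matrix.
Sigmas = np.zeros((M, n, n))
Sigmas[:, rows, cols] = T.T
Sigmas[:, cols, rows] = T.T   # mirror the strict lower triangle

for m in range(M):
    assert np.allclose(Sigmas[m], Q @ np.diag(D[:, m]) @ Q.T)
```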
What is faster in practice will probably depend a lot on what libraries you're able to use, and how they manage memory, etc.