Given a fixed vector ${\bf x} = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^N$, I am wondering if someone has a good idea on how to generate (in MATLAB; the version I am currently using is R2022b) the matrix $A = (a_{ij})_{1\leq i,j\leq N}$ whose entries are given by $a_{ij} = x_i - x_j$ for all $1\leq i,j\leq N$. I believe that using nested for loops (one loop over $i$ and another over $j$) will probably be too costly in moderately high dimensions, so I am eager to find a better way of performing this task (if possible).
Another question concerns matrix multiplication. Suppose that $A, B \in \mathcal{M}_{N \times N}$ are $N \times N$ real-valued matrices. I am wondering if there is a relatively "fast" way (compared to an entry-wise for loop) of generating the vector ${\bf c} = (c_1, c_2, \ldots, c_N) \in \mathbb{R}^N$, where $c_i$ is defined to be the inner product of the $i$-th row of $A$ and the $i$-th row of $B$.
The arithmetic intensity of an algorithm is the number of arithmetic operations divided by the number of memory operations. Both of the algorithms you are considering have arithmetic intensity $O(1)$. For such algorithms it is impossible to achieve a significant fraction of the peak flop rate on a modern computer, because arithmetic operations are much faster than memory operations.
@LutzLehmann is correct that $A=x-x'$ will work when $x$ is a vector. However, MATLAB should issue an error message and return to the prompt unless $x$ is a real number, i.e., a vector of length $1$. Perhaps this oversight will be fixed in a future version of MATLAB.
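For concreteness, a minimal illustration of the implicit-expansion approach (this requires implicit expansion, available since R2016b, and assumes $x$ is a column vector):

```matlab
x = (1:5)';    % example column vector
A = x - x';    % implicit expansion: A(i,j) = x(i) - x(j)
```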
Our concern is to reduce the memory traffic. For your first calculation you should do
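For example, a sketch along the lines described below (assuming $x$ is a column vector of length $N$; the variable names are my own):

```matlab
N = numel(x);            % x is assumed to be a column vector
a = zeros(N, N);         % preallocate storage for the result
for j = 1 : N
    a(:, j) = x - x(j);  % fill column j: a(i,j) = x(i) - x(j), stride-1 writes
end
```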
This preallocates storage for the matrix $A$; the use of the ':' operator will be translated into a SIMD vector operation, and you will minimize cache misses for the array $a$.
Your second calculation is more difficult because MATLAB stores matrices by columns whereas your inner products are between the rows of $A$ and $B$. Obviously, you could do
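For instance, something like the following naive row-by-row loop (a sketch; the names $a$, $b$, $c$ are assumptions):

```matlab
c = zeros(N, 1);
for i = 1 : N
    c(i) = a(i, :) * b(i, :)';   % inner product of the i-th rows: stride-n reads
end
```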
and you would get the correct result, but the performance would be abysmal because you would be doing stride-$n$ access of the arrays $a$ and $b$. Every time you need a new element it would not be in cache, and you would have to read an entire cache line just to access that one element. If your cache lines contain $k$ words, then your memory traffic would be at least $k$ times what you actually need.
Explicitly transposing the matrices $A$ and $B$ is an option, but I doubt that you will gain anything at all, as a transpose is not a trivial operation. You can, however, do this
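(a sketch of the column-wise accumulation; $d$ holds the running sums):

```matlab
d = zeros(N, 1);
for j = 1 : N
    d = d + a(:, j) .* b(:, j);  % accumulate column by column: stride-1 reads
end
c = d;                           % c(i) is the inner product of the i-th rows
```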
This will be translated into vectorized operations, and you will have stride-1 access of $d$, $a$, and $b$.
A complete test program is given here
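A reconstruction along those lines (not the original program; the choice of $n$ and the variable names are my own, and the timings use `tic`/`toc`):

```matlab
% Time the column-wise fill of a(i,j) = x(i) - x(j), then compare the
% stride-n and stride-1 versions of the row-wise inner products.
n = 10000;
x = rand(n, 1);

% First calculation: fill A one column at a time.
tic
a = zeros(n, n);
for j = 1 : n
    a(:, j) = x - x(j);
end
toc

b = rand(n, n);

% Second calculation, naive version: row-by-row inner products (stride-n).
tic
c = zeros(n, 1);
for i = 1 : n
    c(i) = a(i, :) * b(i, :)';
end
toc

% Second calculation, column-wise version (stride-1).
tic
d = zeros(n, 1);
for j = 1 : n
    d = d + a(:, j) .* b(:, j);
end
toc

% Sanity check: the two versions should agree up to rounding.
norm(c - d, inf) / norm(c, inf)
```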
The output on my system is
You will notice that the stride-$n$ access (second timing) is about 8 times slower than the stride-1 access (third timing). This is not surprising, as I am using Intel chips, where the cache lines are almost certainly 64 bytes = 8 double-precision words.
You can make the stride-$n$ version run arbitrarily slowly by increasing $n$ so that the matrices spill into main memory instead of staying in the L3 cache. On my system the effect is quite noticeable at $n = 30{,}000$.