Given a fixed vector ${\bf x} = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^N$, I am wondering if someone has a good idea on how to generate (in MATLAB; the version I am currently using is R2022b) the matrix $A = (a_{ij})_{1\leq i,j\leq N}$ whose entries are given by $a_{ij} = x_i - x_j$ for all $1\leq i,j\leq N$. I believe that using nested for loops (one loop over $i$ and another over $j$) will probably be too costly in moderately high dimensions, so I am eager to find a better way of performing this task (if possible).
Another question concerns matrix multiplication. Suppose that $A, B \in \mathcal{M}_{N \times N}$ are $N \times N$ real-valued matrices. I am wondering if there is a relatively "fast" way (compared to an entry-wise for loop) of generating the vector ${\bf c} = (c_1, c_2, \ldots, c_N) \in \mathbb{R}^N$, where $c_i$ is defined to be the inner product of the $i$-th row of $A$ and the $i$-th row of $B$.
The arithmetic intensity of an algorithm is the number of arithmetic operations divided by the number of memory operations. Both of the algorithms you are considering have arithmetic intensity $O(1)$. For such algorithms it is impossible to achieve a significant fraction of the peak flop rate on a modern computer, because arithmetic operations are much faster than memory operations.
@LutzLehmann is correct that $A=x-x'$ will work when $x$ is a vector. However, MATLAB should issue an error message and return to the prompt unless $x$ is a real number, i.e., a vector of length $1$. Perhaps this oversight will be fixed in a future version of MATLAB.
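For concreteness, a minimal illustration of the implicit-expansion approach (this requires implicit expansion, available since R2016b, and assumes $x$ is a column vector):

```matlab
x = (1:5)';    % example column vector
A = x - x';    % implicit expansion: A(i,j) = x(i) - x(j)
```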
Our concern is to reduce the memory traffic. For your first calculation you should do
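For example, a sketch along the lines described below (assuming $x$ is a column vector of length $N$; the variable names are my own):

```matlab
N = numel(x);            % x is assumed to be a column vector
a = zeros(N, N);         % preallocate storage for the result
for j = 1 : N
    a(:, j) = x - x(j);  % fill column j: a(i,j) = x(i) - x(j), stride-1 writes
end
```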
This preallocates storage for the matrix $A$; the use of the ':' operator will be translated into a SIMD vector operation, and you will minimize cache misses for the array $a$.
Your second calculation is more difficult because MATLAB stores matrices by columns whereas your inner products are between the rows of $A$ and $B$. Obviously, you could do
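For instance, something like the following naive row-by-row loop (a sketch; the names $a$, $b$, $c$ are assumptions):

```matlab
c = zeros(N, 1);
for i = 1 : N
    c(i) = a(i, :) * b(i, :)';   % inner product of the i-th rows: stride-n reads
end
```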
and you would get the correct result, but the performance would be abysmal because you would be doing stride-$n$ access of the arrays $a$ and $b$. Every time you need a new element it would not be in cache, and you would have to read an entire cache line just to access that one element. If your cache lines contain $k$ words, then your memory traffic would be at least $k$ times what you actually need.
Explicitly transposing the matrices $A$ and $B$ is an option, but I doubt that you will gain anything at all, as a transpose is not a trivial operation. You can, however, do this
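(a sketch of the column-wise accumulation; $d$ holds the running sums):

```matlab
d = zeros(N, 1);
for j = 1 : N
    d = d + a(:, j) .* b(:, j);  % accumulate column by column: stride-1 reads
end
c = d;                           % c(i) is the inner product of the i-th rows
```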
This will be translated into vectorized operations, and you will have stride-1 access of $d$, $a$, and $b$.
A complete test program is given here
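A reconstruction along those lines (not the original program; the choice of $n$ and the variable names are my own, and the timings use `tic`/`toc`):

```matlab
% Time the column-wise fill of a(i,j) = x(i) - x(j), then compare the
% stride-n and stride-1 versions of the row-wise inner products.
n = 10000;
x = rand(n, 1);

% First calculation: fill A one column at a time.
tic
a = zeros(n, n);
for j = 1 : n
    a(:, j) = x - x(j);
end
toc

b = rand(n, n);

% Second calculation, naive version: row-by-row inner products (stride-n).
tic
c = zeros(n, 1);
for i = 1 : n
    c(i) = a(i, :) * b(i, :)';
end
toc

% Second calculation, column-wise version (stride-1).
tic
d = zeros(n, 1);
for j = 1 : n
    d = d + a(:, j) .* b(:, j);
end
toc

% Sanity check: the two versions should agree up to rounding.
norm(c - d, inf) / norm(c, inf)
```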
The output on my system is
You will notice that the stride-$n$ access (second timing) is about 8 times slower than the stride-1 access (third timing). This is not surprising, as I am using Intel chips, where the cache lines are almost certainly 64 bytes = 8 double-precision words.
You can make the stride-$n$ version run arbitrarily slowly by increasing $n$ so that the matrices spill into main memory instead of staying in the L3 cache. On my system the effect is quite noticeable at $n = 30{,}000$.