TL;DR: Can the Gohberg-Semencul Formula be extended to invert 2D and 3D zero-padded convolutions?
Given a discrete, invertible, zero-padded convolution operation on a 2D image (call this transformation $T$), I want a fast way to compute the inverse transformation $T^{-1} y$. That is, given the convolution $T$ and a target image $y$, I want to find the image $x$ such that $Tx = y$. I want this calculation to be fast (around O(n log(n)) in the number of pixels n), and I want it to account for the boundary effects of the zero-padding.
For a 1-dimensional signal, a zero-padded convolution can be represented by a Toeplitz matrix T where:
$ T_{i,j} = T_{i+1,j+1} = t_{i-j} $
where the sequence $t$ contains the entries of the convolution kernel (appropriately centered).
Toeplitz matrices have nice properties. For example, matrix-vector products can be computed in O(n log(n)) time using the fast Fourier transform and the convolution theorem. The Gohberg-Semencul Formula (GSF) (see Formula a1) provides a way to solve Toeplitz systems $Tx = y$ in O(n log(n)) time, after paying an initial upfront cost to find the first and last columns of $T^{-1}$. However, lacking a strong math background, I do not understand why the GSF works, and so far I am treating it pretty much as a black box.
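For context, the O(n log(n)) matrix-vector product works by embedding the n-by-n Toeplitz matrix into a circulant matrix of size 2n, which the FFT diagonalizes. A minimal sketch (the function name is my own):

```python
import numpy as np

def toeplitz_matvec(col, row, x):
    """Multiply an n-by-n Toeplitz matrix (given by its first column
    `col` and first row `row`, with col[0] == row[0]) by a vector x
    in O(n log n) time via circulant embedding."""
    n = len(x)
    # First column of a 2n-by-2n circulant containing T in its top-left
    # block: the column of T, one padding zero, then the reversed tail
    # of the row of T (the negative-lag entries t_{-1}, ..., t_{-(n-1)}).
    c = np.concatenate([col, [0.0], row[:0:-1]])
    # Circulant matvec = pointwise product in Fourier space.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, 2 * n))
    # The first n entries of the circulant product equal T @ x.
    return y[:n].real
```

The GSF then expresses $T^{-1}$ as a short sum of products of triangular Toeplitz matrices, so applying $T^{-1}$ reduces to a handful of these fast matvecs.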
However, the matrices I am interested in are not traditional Toeplitz matrices with constant diagonals, but higher-dimensional "Toeplitz tensors". My "Toeplitz tensors" represent zero-padded convolutions of 2- or 3-dimensional signals, e.g. a Gaussian blur of an image.
Because my Toeplitz tensors are still just convolutions, matrix-vector multiplication can still be performed in O(n log(n)) time using the FFT. But what I really want is some extension of the GSF to my setting. Since the GSF is a black box to me, I don't know how to adapt it to my situation, or if doing so is even possible.
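For what it's worth, the fast matrix-vector product in my setting is just FFT convolution with enough zero-padding to avoid circular wraparound. A sketch of the 2D case (the function name and the "same"-crop convention are my own choices):

```python
import numpy as np

def conv2d_matvec(kernel, img):
    """Apply a zero-padded 2D convolution (the matrix-vector product
    of the 2D 'Toeplitz tensor') in O(n log n) time."""
    H, W = img.shape
    kh, kw = kernel.shape
    # Pad both arrays to the full linear-convolution size so the
    # circular FFT convolution reproduces the zero-padded one.
    ph, pw = H + kh - 1, W + kw - 1
    out = np.fft.irfft2(np.fft.rfft2(img, (ph, pw)) *
                        np.fft.rfft2(kernel, (ph, pw)), (ph, pw))
    # Crop the central part so the output has the input's shape.
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return out[top:top + H, left:left + W]
```

The 3D case is identical with `rfftn`/`irfftn`; it is the analogue of the GSF (the solve direction, not the multiply direction) that I cannot figure out.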
I haven't had much luck Googling around, either. It seems that not a lot is written about these Toeplitz tensors; some sources say that such matrices are "block Toeplitz with Toeplitz blocks" (BTTB), but I am pretty sure that description fails to capture some important structure of the matrix that makes the O(n log(n)) matrix-vector product possible.