I am reading this paper where they use matrix factorization in place of the attention mechanism in their Hamburger model. In Section 2.2.2 they say,
Vector Quantization (VQ) (Gray & Neuhoff, 1998), a classic data compression algorithm, can be formulated as an optimization problem in terms of matrix decomposition: $$ \min _{\boldsymbol{D}, \boldsymbol{C}}\|\boldsymbol{X}-\boldsymbol{D} \boldsymbol{C}\|_F \quad \text { s.t. } \mathbf{c}_i \in\left\{\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_r\right\}\tag{1} $$ where $\mathbf{e}_i$ is the canonical basis vector, $\mathbf{e}_i=[0, \cdots, 1, \cdots, 0]^{\top}$. The solution to minimize the objective in Eq. (1) is K-means (Gray & Neuhoff, 1998). However, to ensure that VQ is differentiable, we replace the hard arg min and Euclidean distance with softmax and cosine similarity, leading to Alg. 1, where $\operatorname{cosine}(\boldsymbol{D}, \boldsymbol{X})$ is a similarity matrix whose entries satisfy $\operatorname{cosine}(\boldsymbol{D}, \boldsymbol{X})_{i j}=\frac{\mathbf{d}_i^{\top} \mathbf{x}_j}{\|\mathbf{d}_i\|\|\mathbf{x}_j\|}$, softmax is applied column-wise, and $T$ is the temperature. Further, we can obtain a hard assignment by a one-hot vector when $T \rightarrow 0$.
I didn't understand the last line, "...softmax is applied column-wise, and $T$ is the temperature. Further, we can obtain a hard assignment by a one-hot vector when $T \rightarrow 0$." What do they mean by $T$ (temperature) here?
In fact, I don't quite get the justification for replacing $\arg \min$ with softmax here. And what is their update rule $D\leftarrow XC^{\top}\operatorname{diag}(C\mathbf{1}_n)^{-1}$ doing?
Also, their VQ seems similar to traditional non-negative matrix factorization with a regularization (constraint) on $C$. Or am I confusing the two?
Thanks in advance.
Update
I removed some questions that might not comply with the M.SE rules and turned the thread into a single math-based question. Hope it gets better reach now.

The authors of the paper are making the Vector Quantization (VQ) algorithm differentiable, which is necessary for gradient-based optimization in neural networks. VQ, in its original form, hard-assigns each data point to its nearest cluster center, and the arg min behind that assignment is not differentiable, so gradients cannot flow through it. To overcome this, the authors replace the hard arg min with the softmax function and the Euclidean distance with cosine similarity.
The softmax function is used because it provides a continuous, differentiable approximation of the arg min operation. Instead of hard assignments, softmax generates a probability distribution over the cluster centers, which can be seen as "soft assignments". The temperature parameter $T$ in the softmax function controls the sharpness of this distribution. As $T$ approaches 0, the distribution becomes more like a one-hot vector, mimicking the hard assignment of the original VQ.
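Concretely, the soft assignment presumably takes the form below (my reconstruction from the quoted text; the exact scaling in Alg. 1 may differ), where the column-wise softmax with temperature $T$ turns similarities into assignment weights:
$$ C_{ij} \;=\; \frac{\exp\!\left(\operatorname{cosine}(\boldsymbol{D},\boldsymbol{X})_{ij}/T\right)}{\sum_{k=1}^{r}\exp\!\left(\operatorname{cosine}(\boldsymbol{D},\boldsymbol{X})_{kj}/T\right)} $$
As $T \to 0$, the largest similarity in each column dominates the sum, so $\mathbf{c}_j$ tends to the one-hot vector picking $\arg\max_i \operatorname{cosine}(\boldsymbol{D},\boldsymbol{X})_{ij}$ (the hard assignment); as $T$ grows, the column flattens toward a uniform distribution over the $r$ atoms.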
The update rule $D\leftarrow XC^{\top}\operatorname{diag}(C\mathbf{1}_n)^{-1}$ is derived from the optimization problem in the paper. The authors minimize the objective $\|\boldsymbol{X}-\boldsymbol{D} \boldsymbol{C}\|_F$ with the hard assignment replaced by the soft assignment matrix $C$, and this rule updates the dictionary matrix $D$ at each iteration.
High-level explanation of the update rule:
In this rule, $XC^{\top}$ accumulates, for each dictionary atom, the soft-weighted sum of the data points assigned to it, while $C\mathbf{1}_n$ collects the total assignment weight each atom receives, so multiplying by $\operatorname{diag}(C\mathbf{1}_n)^{-1}$ normalizes those sums. Column by column this reads $$\mathbf{d}_k \leftarrow \frac{\sum_{j=1}^{n} C_{kj}\,\mathbf{x}_j}{\sum_{j=1}^{n} C_{kj}},$$ i.e. each atom becomes the weighted mean of the points softly assigned to it; when the columns of $C$ are one-hot, this is exactly the K-means centroid update. The update rule thus keeps the optimization differentiable while still closely resembling the original VQ algorithm.
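Putting the pieces together, here is a minimal NumPy sketch of one way to run this soft VQ loop. The function name, the initialization, and the small epsilon terms are my own choices for illustration, not the authors' code:

```python
import numpy as np

def soft_vq(X, r, T=0.1, n_iter=10, seed=0):
    """Soft vector quantization sketch.

    X has shape (d, n); returns a dictionary D of shape (d, r) and a soft
    assignment matrix C of shape (r, n). Structure follows the description
    above: cosine similarity, column-wise softmax with temperature T, and
    the multiplicative dictionary update D <- X C^T diag(C 1_n)^{-1}.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # Initialize the dictionary with r randomly chosen data columns (a common heuristic).
    D = X[:, rng.choice(n, size=r, replace=False)].copy()

    for _ in range(n_iter):
        # cosine(D, X)_{ij} = d_i^T x_j / (||d_i|| ||x_j||), shape (r, n).
        Dn = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-8)
        Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-8)
        S = Dn.T @ Xn

        # Column-wise softmax with temperature: each column of C is a soft
        # assignment of one data point over the r dictionary atoms.
        logits = S / T
        logits -= logits.max(axis=0, keepdims=True)   # numerical stability
        C = np.exp(logits)
        C /= C.sum(axis=0, keepdims=True)

        # D <- X C^T diag(C 1_n)^{-1}: each atom becomes the weighted mean
        # of the data points softly assigned to it.
        D = (X @ C.T) / (C.sum(axis=1) + 1e-8)
    return D, C
```

Running it with a small temperature, e.g. `soft_vq(X, r=8, T=0.01)`, makes the columns of `C` nearly one-hot, which recovers the hard, K-means-style assignment discussed above.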