I am dealing with a single cell expression dataset. The dataset forms a $m \times n$ count matrix where each cell is a row and each gene is a column. Each element in the raw count matrix is a non-negative integer. By convention, the matrix is normalized by applying $\log(x + 1)$ element-wise. The offset is needed as many genes have zero count. I wish to compare the similarity of two cells which requires the dot product $\langle \vec{v}, \vec{w} \rangle$.
How should I compute the dot product given $\log(\vec{v} + 1)$ and $\log(\vec{w} + 1)$?
An idea I have is to compute the following:
$$ \begin{align}\langle \vec{1}, \exp(\log(\vec{v} + 1) + \log(\vec{w} + 1))\rangle - \langle \vec{1}, \vec{v}\rangle - \langle \vec{1}, \vec{w}\rangle - n & = \sum_i^n (v_i + 1)(w_i + 1) - \sum_i^n v_i - \sum_i^n w_i - n \\ &= \sum_i^n (v_i w_i + v_i + w_i + 1) - \sum_i^n v_i - \sum_i^n w_i - n \\ &= \sum_i^n v_i w_i \\ &= \langle \vec{v}, \vec{w} \rangle \end{align} $$
So I implemented my idea and the $\langle \exp(\log(\vec{v} + 1)) - 1, \exp(\log(\vec{w} + 1)) -1 \rangle$ proposed by @VHarisop.
def dot1p(v_log1p, w_log1p) -> float:
v = np.expm1(v_log1p)
w = np.expm1(w_log1p)
# Compute the other terms in the equation
vw_sum_exp = np.exp(v_log1p + w_log1p) # exp(log(v + 1) + log(w + 1))
return np.sum(vw_sum_exp) - np.sum(v) - np.sum(w) - len(v)
def npdot1p(v_log1p: np.ndarray, w_log1p: np.ndarray) -> float:
return np.dot(np.expm1(v_log1p), np.expm1(w_log1p))
For a $n$-dimensional random vector between $[0, 1)$, it seems that $\langle \exp(\log(\vec{v} + 1)) - 1, \exp(\log(\vec{w} + 1)) -1 \rangle$ is the faster solution with negligible error. Since there is only one multiplication for each pair of elements, it should indeed be the case.
Size: 10, Time (dot1p): 0.000053 seconds, Time (npdot1p): 0.000013 seconds, Difference: 7.632330e-16
Size: 100, Time (dot1p): 0.000029 seconds, Time (npdot1p): 0.000010 seconds, Difference: 1.084507e-16
Size: 1000, Time (dot1p): 0.000258 seconds, Time (npdot1p): 0.000024 seconds, Difference: 1.631105e-13
Size: 10000, Time (dot1p): 0.000299 seconds, Time (npdot1p): 0.000177 seconds, Difference: 1.622506e-12
Size: 100000, Time (dot1p): 0.004293 seconds, Time (npdot1p): 0.001869 seconds, Difference: 2.761014e-12
Size: 1000000, Time (dot1p): 0.028257 seconds, Time (npdot1p): 0.014261 seconds, Difference: 2.968028e-10
Size: 10000000, Time (dot1p): 0.175802 seconds, Time (npdot1p): 0.103509 seconds, Difference: 8.228977e-09
Size: 100000000, Time (dot1p): 1.700164 seconds, Time (npdot1p): 0.999349 seconds, Difference: 8.626604e-08
