How to represent a location of an object in an image mathematically?


So, let's say I have two matrices, $\mathbf{A}$ and $\mathbf{B}$, each of dimensions $l \times l \times 256$. Essentially, each one is a feature map (as it's popularly known in the AI world) of spatial dimensions $l \times l$ and depth $256$. $\mathbf{B}$ is a warped version of $\mathbf{A}$, which means there is some matrix $\mathbf{H}$ (popularly called a homography matrix in the computer vision community) that allows me to map $\mathbf{A}$ to $\mathbf{B}$.

Now, let's say there is a set of pixel locations in $A$ that are "interesting." I call this set $K$, the keypoints. If I apply my homography matrix to $k$ where $k \in K$, I get $k' = \mathbf{H}k$. I define $\mathbf{d_k}$ as a vector containing all the values of $\mathbf{A}$ along the third dimension at the pixel location $k$. Likewise, $\mathbf{d_k'}$ contains all the values of $\mathbf{B}$ along the third dimension at pixel location $k'$.
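To make the setup concrete, here is a minimal NumPy sketch of the two steps described above: warping a keypoint with $\mathbf{H}$ and reading off the depth vector at a pixel. It rests on several assumptions that are not stated in the text: $\mathbf{H}$ is $3 \times 3$ and acts on homogeneous coordinates $(x, y, 1)$, the result is divided by its last component, pixels are indexed as `[row = y, column = x]`, and non-integer locations are rounded to the nearest pixel. The helper names `warp_point` and `descriptor_at` are hypothetical.

```python
import numpy as np

def warp_point(H, xy):
    """Apply a 3x3 homography H to a 2D pixel location (x, y)."""
    x, y = xy
    p = H @ np.array([x, y, 1.0])  # lift to homogeneous coordinates
    return p[:2] / p[2]            # divide out the projective scale

def descriptor_at(F, xy):
    """Return the depth vector of feature map F (shape l x l x C) at (x, y).

    Assumes row = y, column = x, and nearest-pixel rounding.
    """
    x, y = np.rint(xy).astype(int)
    return F[y, x, :]

# Toy example: H is a pure translation by (2, 1), an assumption for illustration.
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
l, C = 8, 256
rng = np.random.default_rng(0)
A = rng.standard_normal((l, l, C))  # stand-in for the feature map A

k = np.array([3, 4])                # one keypoint (x, y) in A
k_prime = warp_point(H, k)          # corresponding location in B
d_k = descriptor_at(A, k)           # vector of length 256
```

Under these assumptions, $\mathbf{d_k}$ is just the slice `A[y, x, :]`, which suggests that a short indexing notation may be enough to define it.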

So finally, I'm defining a loss function as follows:

$$L = \frac{\sum^K_k{l_d(d_k, d_k')}}{|K|}$$

$$l_d(\mathbf{d}, \mathbf{d'}) = \max(0, m_p - \mathbf{d^T}\mathbf{d'})$$
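As a sanity check on the formulas, here is a minimal NumPy sketch of the loss as written. It assumes $m_p$ is a scalar margin (it is not defined above) and that the descriptors have already been stacked into two arrays with one row per keypoint. The function name `hinge_descriptor_loss` is hypothetical.

```python
import numpy as np

def hinge_descriptor_loss(D, D_prime, m_p):
    """Mean over keypoints of max(0, m_p - d_k^T d_k').

    D, D_prime: arrays of shape (|K|, C); row i holds d_k and d_k'
    for the i-th keypoint. m_p: scalar margin (assumed).
    """
    dots = np.sum(D * D_prime, axis=1)           # d_k^T d_k' for each k
    return np.mean(np.maximum(0.0, m_p - dots))  # average hinge term

# Two toy keypoints with 2-dimensional descriptors.
D = np.array([[1.0, 0.0],
              [0.0, 1.0]])
D_prime = np.array([[1.0, 0.0],
                    [1.0, 0.0]])
loss = hinge_descriptor_loss(D, D_prime, m_p=1.0)
```

Here the first pair has inner product 1 and contributes 0, the second has inner product 0 and contributes 1, so the mean over $|K| = 2$ keypoints is 0.5.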

My questions are the following.

  1. How can I properly describe $d_k$ mathematically instead of having to be so verbose?
  2. Is everything else I stated also sound, particularly where I'm bolding and not bolding things?

My explanation of this function will be reviewed by mathematically rigorous people. Hence, as much scrutiny as possible would be appreciated.