How to write out a percentile rank thresholding for a vector as a mathematical notation?

884 Views Asked by At

I'm attempting to write out an operation as a mathematical formula. I've been able to code it, but I'm stuck when writing it as a concise formula. Essentially, I'm trying to extract a number of elements from each row of a matrix and then calculate the distance between only that subset of each row. More specifically, the operation is a row-wise rank threshold for each row of a matrix (separately) and find the distance for nonzero elements as a mathematical formula. I also want to perform this at multiple thresholds.

My input is a matrix M, which has the size m by n. I would like to extract the indices of the top/highest x% of elements for each row n of this matrix (where x is the top 5, 10, 25, 50, 75 and 100% elements). I have the x and y coordinates for each element.

Here's the equation to binarize the matrix M for each inclusion threshold: $$ B_{n,m,v} = \begin{cases} \begin{split} 1&, if \, M_{n,m} \geq M_{n,m_{n,v}} \\ 0&, otherwise \end{split} \end{cases} $$ $$ m_{n,v} = \lfloor s_v * N \rfloor $$ Where B is a binary matrix denoting which elements survive thresholding at inclusion threshold sv (sv = .05, .1, .25. .5, .75, 1.0), such that Bn,m,v is 1 if Mn,m is higher than the threshold value (e.g. the xth percentile (sv) value of for row n). Where mn,v is the index of the xth percentile (sv) highest highest value for row n. Big N is the total number of elements for each row (i.e. length of m). There is one threshold value for each row for each inclusion threshold. This equation assumes the M is sorted such that the element with the highest value is at index 1.

$$ D_{n,v} = \frac{\sum_{k=1}^{n_v} \sum_{j=1, j \ne k}^{n_v} dist(M_{n,j,v}, M_{n,k,v})}{n_v(n_v-1)} $$ $$ dist(B_{n,j,v}, B_{n,k,v}) = \sqrt{(x_{j} - x_{k})^2 + (y_{j} - y_{k})^2} $$ Where Dn,v denotes the average pairwise euclidean distance of surviving elements for row n at inclusion threshold at index v. Where x and y are vectors of coordinates of each element (and has a length of N). Where k and j are the indices of the surviving elements for each row for each threshold (e.g. where Bc,h,v is 1). I think here j and k are 1....n when they should be the actual indices of surviving elements, but I'm unsure how to write that.

Overall I feel like this equation should be more succinct. I feel like what I'm actually coding is simple but that's not reflected when I write out the equation. Additionally, I'm unsure about some of the syntax (such as j and k should be the indices of surviving elements at each threshold for each row). Any help on this problem or resources for thinking through how to equationify code would be super appreciated.

1

There are 1 best solutions below

0
On

$ \def\bbR{{\mathbb R}} \def\e{\varepsilon} \def\l{\left} \def\r{\right} \def\o{{\tt1}} \def\p{\partial} \def\lr#1{\l(#1\r)} \def\LR#1{\Big(#1\Big)} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} $Given $M\in\bbR^{m\times n}$ and a threshold index $p\le n$, let $$e_i\in\bbR^n \qquad\qquad \e_k\in\bbR^p$$ denote the standard basis vectors for the respective dimensions and use them to create a truncated identity matrix (which you can think of as the first $p$-columns of $I_n)$ $$\eqalign{ I_\tau = \sum_{k=1}^p e_k\e_k^T \;\in\bbR^{n\times p} \\ }$$ Multiplying M by the truncated identity produces a truncated data matrix $$\eqalign{ X = MI_\tau \;\in\bbR^{m\times p} \\ }$$ Using this post (and setting $Y=X)$ we can write an expression for the squared Euclidean distances between the columns $X$ $$\eqalign{ &D^{\odot 2} = (X\odot X)^TJ + J^T(X\odot X) - 2X^TX \\ &D ​= \LR{(X\odot X)^TJ + J^T(X\odot X) - 2X^TX}^{\odot 1/2} \\ }$$ where $(\odot)$ indicates an elementwise/Hadamard product (or power) and $J\in\bbR^{m\times p}$ is the all-ones matrix.

I think this captures the essence of your problem in matrix form. If you're coding this in a language like Julia (or Matlab) you'd write something like

X,J = M[:,1:p], ones(m,p)
Y = X.*X
D = sqrt.(Y'J + J'Y - 2*X'X)