I'm working on a machine learning problem. The ML part is irrelevant to this question, but I'll briefly describe it.
I want to balance a dataset of recordings from a number of speakers. The imbalance is high: some speakers have as many as 1000x more recordings than others. For my problem I need as many diverse speakers as possible, but a large overall number of recordings is also highly desirable.
I thought it would be nice to balance the dataset in a way that leaves most of the speakers with their recordings as they are, but shrinks the number of recordings from the speakers that dominate the dataset.
I came up with the function:
$$ f(x) = \begin{cases} med(D) + \sqrt{(x - med(D)) \cdot W_x}, & \text{if } x > med(D)\\ x, & \text{if } x \le med(D) \end{cases} $$
where $D = \{n_1,\ n_2,\ \dots,\ n_k\}$ is the vector of the numbers $n_x$ of recordings from each speaker, $med$ is the median, and $W_x$ is the weight of each speaker in the dataset, $W_x=\frac{n_x}{\sum_{i=1}^{k} n_i}$.
This function is supposed to leave speakers at or below the threshold $med(D)$ untouched and, for the dominating speakers, keep only some of the recordings above this threshold.
Currently, when I want to process the data $D$ in vectorized form, I have to select the dominating speakers and their weights and apply the first case of the formula to them only:
\begin{equation}f(D) = med(D) + \sqrt{(D - med(D)) \cdot W} \end{equation}
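The select-then-apply approach described above could be sketched in NumPy roughly as follows. This is only an illustration, assuming the counts are stored in a 1-D array; the function name `shrink_dominant` and the example counts are hypothetical.

```python
import numpy as np

def shrink_dominant(counts):
    """Apply the piecewise formula to a vector of per-speaker recording counts."""
    counts = np.asarray(counts, dtype=float)
    med = np.median(counts)
    weights = counts / counts.sum()   # W_x = n_x / (sum of all counts)
    out = counts.copy()               # speakers at or below the median stay untouched
    dominant = counts > med           # boolean mask selecting the dominating speakers
    # First case of the formula, applied to the selected speakers only.
    out[dominant] = med + np.sqrt((counts[dominant] - med) * weights[dominant])
    return out

# Hypothetical counts, for illustration only.
D = np.array([2, 4, 6, 8, 1000])
balanced = shrink_dominant(D)
```

The results are real numbers, so they would still need rounding before being used as recording counts.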
I was wondering whether there is a function with similar properties that would let me process the whole dataset in vectorized form, without having to select the dominating speakers and process them separately.
It's my first question on Stack, so please be understanding. I'm an engineer with minimal math proficiency, but I would value any feedback and help with this problem.