Suppose I have a data set of $n$ paired data points in two variables, $x_i$ and $y_i$. Can the percentage of data points with $x_i > y_i$ be arbitrarily high, say over $90\%$, while the median of $y$ is still greater than the median of $x$?
I have good reason to believe the answer is yes, but I would like to know whether anyone knows of, or can come up with, a nice or geometric proof. Bonus points for an algorithm that can generate such data.
Thank you.
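Yes, of course this is possible. Consider the paired data $X = (5,5,5,5,5,5,9,10,11,12,13)$ and $Y = (1,1,1,1,1,6,6,7,8,9,10)$, matched position by position. The median of $Y$ ($6$) is greater than the median of $X$ ($5$), yet $x_i > y_i$ for $10$ of the $11$ pairs, about $91\%$.
To push the percentage arbitrarily high, take $n$ to be a large odd number for convenience and fix the desired medians, with $\operatorname{median}(Y) > \operatorname{median}(X)$. For $i \leq \left\lfloor \frac{n}{2}\right\rfloor$, generate $x_i < \operatorname{median}(X)$ and set $y_i = x_i - random\_num()$, where $random\_num()$ returns a positive number. For $i > \left\lceil \frac{n}{2}\right\rceil$, generate $y_i > \operatorname{median}(Y)$ and set $x_i = y_i + random\_num()$. For the single remaining index $i = \left\lceil \frac{n}{2}\right\rceil$, set $x_i = \operatorname{median}(X)$ and $y_i = \operatorname{median}(Y)$. Then $x_i > y_i$ for every pair except the middle one, a fraction of $\frac{n-1}{n}$, which tends to $1$ as $n$ grows. If you want additional constraints on $X$ and $Y$ so that the data doesn't look so artificial, that might be harder.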
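The construction in this answer can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `generate_pairs`, its parameters, the default medians $5$ and $6$, and the use of a uniform distribution for the positive offsets are all my own choices, not part of the answer above.

```python
import random
from statistics import median


def generate_pairs(n, median_x=5.0, median_y=6.0, spread=10.0, seed=None):
    """Return lists (xs, ys) of odd length n with median(ys) > median(xs),
    while x_i > y_i for every pair except the middle one.

    All names and defaults here are illustrative assumptions.
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd integer >= 3")
    if median_y <= median_x:
        raise ValueError("median_y must exceed median_x")

    rng = random.Random(seed)
    half = n // 2
    xs, ys = [], []

    # Lower half: x lies below median_x, and y lies strictly below x.
    for _ in range(half):
        x = median_x - rng.uniform(0.1, spread)
        y = x - rng.uniform(0.1, spread)
        xs.append(x)
        ys.append(y)

    # Middle pair: the only pair with x < y; it pins both medians.
    xs.append(median_x)
    ys.append(median_y)

    # Upper half: y lies above median_y, and x lies strictly above y.
    for _ in range(half):
        y = median_y + rng.uniform(0.1, spread)
        x = y + rng.uniform(0.1, spread)
        xs.append(x)
        ys.append(y)

    return xs, ys


def fraction_x_greater(xs, ys):
    """Fraction of pairs with x_i > y_i."""
    return sum(x > y for x, y in zip(xs, ys)) / len(xs)
```

Calling `generate_pairs(101, seed=0)` and then `fraction_x_greater(xs, ys)` should give $100/101$ by construction, while `median(ys)` stays above `median(xs)`. Shuffling the pairs together afterwards would hide the ordering without changing either property.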