Before doing any task, a program estimates how long the task will take. The estimate is an ordinal categorical variable, say $X$, whose possible values are $1, 2, \dots, 5$. The higher the value of $X$, the longer the expected duration.
Now, some data have been collected, so we know the time each task actually took. This real time can be represented as a continuous variable $Y$. To sum up, we have data $(X_i, Y_i)$ for a set of tasks $i \in \{1, \dots, N\}$.
The goal is to assess whether the program is reliable. To do so, we consider 5 empirical distributions $\delta_1, \dots, \delta_5$ such that $\delta_k = \{ Y_i \text{ s.t. } X_i = k \}$. My idea is to compare each pair $(\delta_q, \delta_p)$ in order to answer this question. I'd like to add that the distributions $\delta_k$ are not necessarily normal. I did some research and found that the Mann–Whitney–Wilcoxon test could help.
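To make the construction of the $\delta_k$ concrete, here is a minimal Python sketch that groups the observed durations by predicted category. The observations and the helper name `group_by_category` are made up for illustration; they are not the real measurements.

```python
from collections import defaultdict

def group_by_category(pairs):
    """Return {k: sorted list of Y_i with X_i == k} for observations (X_i, Y_i)."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {k: sorted(v) for k, v in sorted(groups.items())}

# Hypothetical observations (X_i, Y_i): predicted category, measured duration.
data = [(1, 0.8), (2, 1.9), (1, 1.1), (3, 3.2),
        (2, 2.4), (5, 9.0), (4, 5.5), (3, 2.7)]
delta = group_by_category(data)
print(delta[1])  # durations of the tasks predicted as category 1
```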
Can someone help me formulate it correctly? Thank you.
Edit:
The program is perfectly reliable if and only if $\forall (q,p) \in \{1,2,\dots,5\}^2, \ \left( q < p \Rightarrow \forall (Y_q,Y_p) \in \delta_q \times \delta_p, \ Y_q < Y_p \right)$.
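This condition is equivalent to requiring $\max \delta_q < \min \delta_p$ whenever $q < p$, which can be checked directly. A minimal sketch in Python, assuming every category has at least one observation; the function name `is_perfectly_reliable` and the sample values are hypothetical:

```python
from itertools import combinations

def is_perfectly_reliable(delta):
    """delta maps each category k to a non-empty list of observed durations.

    Returns True iff every duration in delta[q] is strictly below every
    duration in delta[p] for all pairs q < p.
    """
    return all(max(delta[q]) < min(delta[p])
               for q, p in combinations(sorted(delta), 2))

ordered = {1: [0.8, 1.1], 2: [1.9, 2.4], 3: [2.7, 3.2]}
overlapping = {1: [0.8, 2.0], 2: [1.9, 2.4], 3: [2.7, 3.2]}
print(is_perfectly_reliable(ordered))      # True
print(is_perfectly_reliable(overlapping))  # False
```

A single overlapping pair of values is enough to make the check fail, which is why it is too strict for real data.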
But in the real world, with real data, this condition is not fulfilled, so the program is not perfectly reliable. I would therefore like to use statistics to test the reliability of the program. To do so, I consider the Mann–Whitney–Wilcoxon test applied to my distributions:
The null hypothesis is: a randomly selected value from $\delta_q$ is equally likely to be greater than or less than a randomly selected value from $\delta_p$ (for all $p$, $q$ s.t. $p \neq q$).
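In terms of probabilities, the null hypothesis for a pair says $P(Y_q < Y_p) = P(Y_q > Y_p)$, whereas perfect reliability corresponds to $P(Y_q < Y_p) = 1$ for $q < p$. This probability can be estimated by counting ordered pairs, with ties counted as one half, which amounts to one of the two Mann-Whitney $U$ statistics divided by the number of pairs. A minimal sketch with made-up values; `prob_less` is a hypothetical helper name:

```python
def prob_less(sample_q, sample_p):
    """Estimate P(Y_q < Y_p) + 0.5 * P(Y_q == Y_p) over all cross pairs."""
    wins = sum((a < b) + 0.5 * (a == b) for a in sample_q for b in sample_p)
    return wins / (len(sample_q) * len(sample_p))

# Made-up durations for two categories q < p.
delta_q = [0.8, 1.1, 2.0]
delta_p = [1.9, 2.4, 2.7]
print(prob_less(delta_q, delta_p))  # 8 of the 9 pairs are correctly ordered
```

A value near 0.5 is what the null hypothesis predicts, while a value near 1 is what a reliable program would produce.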
I think this test can help me. I mean: if the null hypothesis is true for a pair $(\delta_q, \delta_p)$, the program does not separate categories $q$ and $p$, so it is not reliable.
Can someone tell me whether this sounds correct? Maybe I am wrong. I will continue my research and try to find more elements to solve this problem. Since this problem can come up frequently in the real world, I would be glad to share my findings here.
Edit 2:
I used Python to get the following charts.
[Figure: Histograms and kernel density estimates of the $\delta$ distributions]
[Figure: Boxplots of the distributions]
A quick look makes me think that the program is unreliable. But when running a Mann–Whitney U test with Python (https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html) with $\alpha=5\%$, the only pairs for which the null hypothesis is not rejected are $(\delta_1,\delta_2)$ and $(\delta_3, \delta_4)$. This is really strange...
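One point that may matter here: a two-sided test only detects that two groups differ, not in which direction. A one-sided alternative is closer to the ordering condition $q < p \Rightarrow Y_q < Y_p$. The sketch below is one possible way to set this up with `scipy.stats.mannwhitneyu` and its `alternative="less"` option. The data are made up, `pairwise_one_sided` is a hypothetical helper, and no multiple-comparison correction is applied.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_one_sided(delta, alpha=0.05):
    """For each q < p, test H1: values in delta[q] tend to be smaller
    than values in delta[p].

    Returns {(q, p): (p_value, rejected)}.
    """
    results = {}
    for q, p in combinations(sorted(delta), 2):
        res = mannwhitneyu(delta[q], delta[p], alternative="less")
        p_value = float(res.pvalue)
        results[(q, p)] = (p_value, p_value < alpha)
    return results

# Made-up durations, not the real data: categories 2 and 3 overlap heavily.
delta = {
    1: [0.8, 1.2, 0.9, 1.5, 1.1, 1.3, 1.0, 1.4],
    2: [2.0, 2.3, 1.9, 2.6, 2.1, 2.4, 1.8, 2.2],
    3: [2.1, 2.2, 1.7, 2.5, 2.0, 2.7, 1.9, 2.3],
}
for pair, (p_value, rejected) in pairwise_one_sided(delta).items():
    print(pair, round(p_value, 4), "reject H0" if rejected else "keep H0")
```

With this setup, rejecting the null hypothesis for a pair $q < p$ supports the claim that category $q$ tends to take less time than category $p$, whereas failing to reject it means the data do not show that ordering.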
P.S. Sorry, I mentioned Python, but my problem is not code-related. I want to know how to solve this problem mathematically and which reasoning I should follow.
Thanks again.