I am trying to compare the errors from two statistical models in order to give evidence to one being "better" in terms of lower prediction error than the other.
To formalize this, I thought that a test of stochastic dominance between two collections of random variables (the OOS errors) would be a good idea. Ideally the null hypothesis would be either:
$$\mathbb{H}_0 : P(X>Y) = 0.5$$
$$\mathbb{H}'_0 : F(x) \ge G(x) \quad \forall x \in \mathbb{R}$$
I have found resources pointing me to the Kruskal-Wallis test, but unfortunately cannot seem to find a paper explicitly stating and proving one of these (or similar) null hypotheses. Many sources I check simply state that the null is that the medians of the two distributions are equal, but this is not what I want to check. Any help is appreciated.
Edit for more information: Essentially I want as minimal assumptions on the data as possible. The two samples $X_1, \dots , X_n$ and $Y_1, \dots , Y_n$ are such that $(X_1, Y_1)$ is independent of $(X_2, Y_2)$, but in general $X_i$ and $Y_i$ are dependent. Also, the $X_i$ and $Y_i$ all have different distributions (so in particular we cannot assume they all come from the same family).
Based on your comment and edit, you have paired data. You don't seem to feel comfortable assuming the data are normal. Consequently, I have three possibilities to suggest: (1) a one-sample Wilcoxon signed rank test on $D_i = X_i - Y_i$ to test $H_0: \eta_D \le 0$ against $H_a: \eta_D > 0,$ where $\eta_D$ is the median of the population from which the $D_i$ are taken; (2) a permutation test; (3) simple linear regression to see whether the intercept is above 0.
Fake data: In order to illustrate, I need some data. I generate data for $n = 50$ pairs with $X_i \stackrel{indep}{\sim} \mathsf{Gamma}(shape=2, rate = 1/5)$ (a right-skewed distribution with positive values), $D_i \stackrel{indep}{\sim}\mathsf{Norm}(\mu=.5, \sigma=3),$ and $Y_i = X_i + D_i.$
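The simulation just described can be sketched in Python with `numpy` (the original analysis was presumably done in other software, so treat this as an illustrative translation). Note that `numpy`'s gamma sampler is parameterized by scale rather than rate, so rate $1/5$ becomes scale $5$; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2022)  # arbitrary seed, for reproducibility

n = 50
x = rng.gamma(shape=2, scale=5, size=n)   # Gamma(shape=2, rate=1/5) => scale=5
d = rng.normal(loc=0.5, scale=3, size=n)  # differences with population mean 0.5
y = x + d                                 # paired with x by construction
```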
These are for illustration only. I'm not trying to guess the nature of your actual data. Here are means and standard deviations of the three variables, their (high positive) correlation, a look at the first six pairs (and their differences), a histogram of the differences (more positive than negative), and a scatterplot of $(X,Y)$-pairs. The scatterplot shows that most points lie above the 45-degree line, indicating that $Y$'s tend to be larger than $X$'s.
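Continuing the sketch, the summaries mentioned above (means, standard deviations, and the correlation of the pairs) are easy to reproduce; the high correlation arises because $Y$ is $X$ plus relatively small noise.

```python
import numpy as np

rng = np.random.default_rng(2022)  # same arbitrary seed as before
n = 50
x = rng.gamma(shape=2, scale=5, size=n)
y = x + rng.normal(loc=0.5, scale=3, size=n)

print(x.mean(), x.std(ddof=1))           # X: mean 10, sd ~7.07 in theory
print(y.mean(), y.std(ddof=1))           # Y shifted up by about 0.5
print(np.corrcoef(x, y)[0, 1])           # strong positive correlation expected
```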
Wilcoxon Signed-Rank Test.
This test shows that the median population difference is likely positive. Specifically, the P-value of about 3% (< 5%) indicates rejection of the null hypothesis at the 5% level.
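A sketch of this test using `scipy.stats.wilcoxon` on the same simulated pairs (with a freshly drawn sample, the P-value will vary and need not fall below 5%):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2022)  # arbitrary seed
n = 50
x = rng.gamma(shape=2, scale=5, size=n)
y = x + rng.normal(loc=0.5, scale=3, size=n)

# One-sided signed-rank test of H0: median(Y - X) <= 0 vs Ha: median(Y - X) > 0
res = stats.wilcoxon(y - x, alternative='greater')
print(res.pvalue)
```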
Permutation Test. Section 3 of this paper in the Journal of Statistics Education gives an introductory account of permutation tests on paired data. Similar permutation tests on my fake data again show that the $Y$'s tend to be larger than the $X$'s. Various methods of measuring the differences between $(X,Y)$-pairs are possible. It is not necessary to assume that the variables have any particular distribution.
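One standard paired permutation scheme (a sketch, not necessarily the exact variant in the cited paper) exploits the fact that under the null the sign of each difference is exchangeable: randomly flip signs and recompute the mean difference.

```python
import numpy as np

rng = np.random.default_rng(2022)  # arbitrary seed
n = 50
x = rng.gamma(shape=2, scale=5, size=n)
y = x + rng.normal(loc=0.5, scale=3, size=n)
d = y - x

obs = d.mean()                              # observed mean difference
n_perm = 10_000
signs = rng.choice([-1.0, 1.0], size=(n_perm, n))
perm_means = (signs * d).mean(axis=1)       # mean difference under sign flips

# One-sided P-value with the usual +1 correction to include the observed value
p_value = (np.sum(perm_means >= obs) + 1) / (n_perm + 1)
print(p_value)
```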
Regression. A simple linear regression of the $Y$'s on the $X$'s shows that the regression line lies above the 45-degree line in the plot. A formal test of this is part of the usual output of regression procedures. However, the most elementary regression procedure assumes that 'errors' about the regression line are normally distributed (but not that the $X$'s are normally distributed).
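The regression approach can be sketched with `scipy.stats.linregress`; here the (hypothetical) formal test is a one-sided $t$-test on the intercept, using the intercept's standard error from the fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2022)  # arbitrary seed
n = 50
x = rng.gamma(shape=2, scale=5, size=n)
y = x + rng.normal(loc=0.5, scale=3, size=n)

# Ordinary least squares fit of Y on X; the data were built with slope 1
res = stats.linregress(x, y)
t_intercept = res.intercept / res.intercept_stderr
# One-sided P-value for H0: intercept <= 0 against Ha: intercept > 0
p = stats.t.sf(t_intercept, df=n - 2)
print(res.slope, res.intercept, p)
```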
Conclusion: I cannot tell you exactly what test to use without knowing more about your data. But from what you have said, it is reasonable to hope that considering the methods I have mentioned here will put you on the right track.