Calculating the expected value of the number of samples belonging to one class in the 'best' fold of cross-validation


Assume we are doing $k$-fold cross-validation on a dataset of $N$ samples, of which $R$ ($0\le R\le N$) belong to class A and the remaining $N-R$ belong to class B. Of the $k$ folds, one will contain the most class A samples. How do I calculate the expected number of class A samples in this 'best' fold, $\mathbb{E}[x_{max}]$? I understand that the count in each individual fold follows a hypergeometric distribution, and I can probably get a good estimate of the expected value from the distribution of the maximum of $k$ i.i.d. variables: $$f_{X_{(k)}}(x) = \frac{d}{dx} F_{X_{(k)}}(x) = \frac{d}{dx} [F(x)]^k = k[F(x)]^{k-1}f(x) $$ where $F$ and $f$ are the distribution function and density of the count in a single fold.
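Since the hypergeometric distribution is discrete, the same idea can be applied through the CDF rather than a density, using $\mathbb{E}[\max] = \sum_{x \ge 0} \left(1 - [F(x)]^k\right)$ for non-negative integer counts. The following is a minimal sketch of that approximation only; it assumes $N$ is divisible by $k$ so that every fold has size $n = N/k$, and the function names are made up for illustration.

```python
from math import comb


def hypergeom_pmf(x, N, R, n):
    """P(X = x): number of class A samples in a single fold of size n."""
    if x < max(0, n - (N - R)) or x > min(n, R):
        return 0.0
    return comb(R, x) * comb(N - R, n - x) / comb(N, n)


def expected_max_independent(N, R, k):
    """E[max] of k i.i.d. hypergeometric counts.

    This treats the folds as independent, so it is only an approximation
    for real cross-validation folds.
    """
    n = N // k  # assumption: N is divisible by k, all folds have size n
    cdf = 0.0
    expected = 0.0
    # E[max] = sum over x >= 0 of P(max > x) = sum of (1 - F(x)^k)
    for x in range(0, min(n, R)):
        cdf += hypergeom_pmf(x, N, R, n)
        expected += 1.0 - cdf ** k
    return expected
```

For the tiny case `expected_max_independent(4, 2, 2)` this evaluates to 23/18, about 1.278.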

However, this approach assumes the $k$ folds are independent, which they are not: together they must contain exactly $R$ class A samples. Is there a way to calculate the expected value more accurately, and is it possible to calculate the variance as well? Thank you!
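One possible way to account for the dependence is to count label placements directly. If the folds are a uniformly random partition into $k$ folds of equal size $n = N/k$ (an assumption; unequal folds would need a per-fold size), the class A counts jointly follow a multivariate hypergeometric distribution, so $P(x_{max} \le m)$ is the number of ways to place the $R$ class A labels with at most $m$ in every fold, divided by $\binom{N}{R}$. The sketch below computes the mean and variance exactly with rational arithmetic; the function names are illustrative, not from any library.

```python
from fractions import Fraction
from math import comb


def count_assignments(n, k, R, m):
    """Ways to place R class A labels into k folds of size n
    such that no fold holds more than m of them."""
    # ways[r] = number of placements of r labels in the folds handled so far
    ways = [1] + [0] * R
    for _ in range(k):
        new = [0] * (R + 1)
        for r, w in enumerate(ways):
            if w == 0:
                continue
            for x in range(0, min(m, n, R - r) + 1):
                new[r + x] += w * comb(n, x)
        ways = new
    return ways[R]


def max_fold_moments(N, R, k):
    """Exact (mean, variance) of the largest class A count over the folds."""
    n = N // k  # assumption: N is divisible by k
    total = comb(N, R)
    mean = Fraction(0)
    second = Fraction(0)
    for m in range(0, min(n, R)):
        # tail = P(max > m)
        tail = 1 - Fraction(count_assignments(n, k, R, m), total)
        mean += tail                   # E[X]   = sum of P(X > m)
        second += (2 * m + 1) * tail   # E[X^2] = sum of (2m + 1) P(X > m)
    return mean, second - mean ** 2
```

For example, `max_fold_moments(4, 2, 2)` returns a mean of 4/3 and a variance of 2/9, which can be checked by hand: the two folds hold (0, 2), (1, 1) or (2, 0) class A samples with probabilities 1/6, 4/6 and 1/6.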