My classmates and I are trying to figure out what J. Ross Quinlan means on page 41 of C4.5: Programs for Machine Learning. He says:
The probability of error cannot be determined easily, but has itself a (posterior) probability distribution that is usually summarized by a pair of confidence limits. For a given confidence level CF, the upper limit on this probability can be found from the confidence limits for the binomial distribution; this upper limit is here written $U_{CF}(E, N)$.
Quinlan gives a few examples:
$U_{25\%}(0, 6) = 0.206$
My class textbook, Data Mining Concepts, Models, Methods, and Algorithms by Mehmed Kantardzic gives slightly more detail, but not enough:
C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted error rate. This method is called pessimistic pruning. For every node in a tree, the estimation of the upper confidence limit $U_{cf}$ is computed using the statistical tables for binomial distribution (given in most textbooks on statistics). Parameter $U_{cf}$ is a function of $|T_i|$ and $E$ for a given node. C4.5 uses the default confidence level of 25% and compares $U_{25\%}(|T_i|/E)$ for a given node $T_i$ with a weighted confidence of its leaves.
Kantardzic provides a few more examples of this function:
$U_{25\%}(6,0) = 0.206, U_{25\%}(9,0) = 0.143, U_{25\%}(1,0) = 0.750$
At least one other person on the Internet has the same question.
I have been unable to find these values in the binomial probability distribution ${n \choose r} p^r q^{n-r}$.
What does this syntax mean, and how can I compute this function (ideally in R or Julia)?
I finally answered my own question. Apparently this is the Clopper-Pearson confidence interval. You can use the Binomial Confidence Interval calculator at https://statpages.info/confint.html. First, enter $25$ for `% Area in Upper Tail` and $0$ for `% Area in Lower Tail`.
R
You can compute this in R using the `GenBinomApps` package.
Julia
Julia's `HypothesisTests.jl` package will not accept such a low level. I have not found a way to compute $U_{25\%}(n, x)$ in Julia yet, but I have not tried very hard either.
Python
This Stack Overflow discussion covers some ways to get this CI in Python.
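For what it's worth, here is a minimal sketch that needs only the Python standard library. It assumes that $U_{CF}(E, N)$ is the one-sided Clopper-Pearson upper limit, i.e. the value of $p$ that solves $\sum_{k=0}^{E} {N \choose k} p^k (1-p)^{N-k} = CF$ (equivalently, the $1 - CF$ quantile of a Beta$(E+1, N-E)$ distribution). The function names `binom_cdf` and `upper_limit` are made up for this illustration, and the root is found by plain bisection rather than by any library routine.

```python
from math import comb


def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(e + 1))


def upper_limit(e, n, cf=0.25, tol=1e-12):
    """Assumed definition of U_CF(E, N): the p with P(X <= e | n, p) = cf."""
    if e >= n:
        return 1.0  # every case is an error, so the upper limit is 1
    lo, hi = 0.0, 1.0
    # The CDF decreases as p grows, so bisection brackets the unique root.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid  # CDF still above cf: p has to be larger
        else:
            hi = mid
    return (lo + hi) / 2


# The three examples from the question, written as (E, N) pairs.
for n in (6, 9, 1):
    print(n, round(upper_limit(0, n), 3))
```

With $E = 0$ the equation collapses to $(1 - p)^N = CF$, so $U_{CF}(0, N) = 1 - CF^{1/N}$, which reproduces the quoted values $0.206$, $0.143$ and $0.750$ for $N = 6, 9, 1$.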