What does Quinlan mean by "the confidence limits for the binomial distribution"?

My classmates and I are trying to figure out what J. Ross Quinlan means on page 41 of C4.5: Programs for Machine Learning. He says:

The probability of error cannot be determined easily, but has itself a (posterior) probability distribution that is usually summarized by a pair of confidence limits. For a given confidence level CF, the upper limit on this probability can be found from the confidence limits for the binomial distribution; this upper limit is here written $U_{CF}(E, N)$.

Quinlan gives examples such as:

$U_{25\%}(0, 6) = 0.206$

My class textbook, Data Mining: Concepts, Models, Methods, and Algorithms by Mehmed Kantardzic, gives slightly more detail, but not enough:

C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted error rate. This method is called pessimistic pruning. For every node in a tree, the estimation of the upper confidence limit $U_{cf}$ is computed using the statistical tables for binomial distribution (given in most textbooks on statistics). Parameter $U_{cf}$ is a function of $|T_i|$ and $E$ for a given node. C4.5 uses the default confidence level of 25% and compares $U_{25\%}(|T_i|/E)$ for a given node $T_i$ with a weighted confidence of its leaves.

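Kantardzic provides a few more examples of this function. Note that his arguments appear in the opposite order from Quinlan's, with the number of cases first and the number of errors second:

$U_{25\%}(6,0) = 0.206$, $U_{25\%}(9,0) = 0.143$, $U_{25\%}(1,0) = 0.750$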

At least one other person on the Internet has the same question.

I have been unable to find these values in the binomial probability distribution ${n \choose r} p^r q^{n-r}$.

What does this notation mean, and how do I compute this function (ideally in R or Julia)?

Best answer

I finally answered my own question. Apparently $U_{CF}(E, N)$ is the upper limit of a one-sided Clopper-Pearson confidence interval. You can reproduce the values with the Binomial Confidence Interval calculator at https://statpages.info/confint.html: enter $25$ for % Area in Upper Tail and $0$ for % Area in Lower Tail before computing.
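As a sketch of what that limit is, assuming the standard one-sided Clopper-Pearson definition (the symbols $p$ and $k$ are my own notation, not Quinlan's): $U_{CF}(E, N)$ is the error probability $p$ at which observing $E$ or fewer errors in $N$ cases has probability exactly $CF$:

$${N \choose 0} p^0 (1-p)^{N} + \dots + {N \choose E} p^E (1-p)^{N-E} = \sum_{k=0}^{E} {N \choose k} p^k (1-p)^{N-k} = CF, \qquad p = U_{CF}(E, N).$$

With $E = 0$ only the $k = 0$ term remains, so $(1-p)^N = CF$ and

$$U_{CF}(0, N) = 1 - CF^{1/N}, \qquad U_{25\%}(0, 6) = 1 - 0.25^{1/6} \approx 0.206.$$

This explains why the values cannot be read off the binomial probability formula directly: they come from solving the cumulative sum for $p$.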

R

You can compute this in R using the GenBinomApps package.

> install.packages('GenBinomApps')
Warning: package 'GenBinomApps' is in use and will not be installed
> library(GenBinomApps)
> clopper.pearson.ci(0, 6, alpha = .25, CI = "upper")
 Confidence.Interval Lower.limit Upper.limit alpha
               upper           0   0.2062995  0.25
> clopper.pearson.ci(0, 9, alpha = .25, CI = "upper")
 Confidence.Interval Lower.limit Upper.limit alpha
               upper           0    0.142756  0.25
> clopper.pearson.ci(0, 1, alpha = .25, CI = "upper")
 Confidence.Interval Lower.limit Upper.limit alpha
               upper           0        0.75  0.25
> clopper.pearson.ci(1, 16, alpha = .25, CI = "upper")
 Confidence.Interval Lower.limit Upper.limit alpha
               upper           0   0.1596107  0.25

Julia

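The confint function in Julia's HypothesisTests.jl package rejects level = .25 because it requires a coverage level in the range (0.5, 1). A 25% upper-tail area corresponds to a one-sided coverage of 75%, so passing level = .75 may be what the function expects, but I have not verified this. I have not found a way to compute $U_{25\%}(E, N)$ in Julia yet, but I have not tried very hard either.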

julia> using StatsKit

julia> confint(BinomialTest(0, 6); level = .25, tail = :right)
ERROR: ArgumentError: coverage level 0.25 not in range (0.5, 1)
Stacktrace:
 [1] check_level
   @ C:\Users\wjhol\.julia\packages\HypothesisTests\V7PST\src\HypothesisTests.jl:96 [inlined]
 [2] confint(x::BinomialTest; level::Float64, tail::Symbol, method::Symbol)
   @ HypothesisTests C:\Users\wjhol\.julia\packages\HypothesisTests\V7PST\src\binomial.jl:104
 [3] top-level scope
   @ REPL[2]:1

Python

A discussion on Stack Overflow covers some ways to get this CI in Python.
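For reference, here is a minimal standard-library sketch, assuming $U_{CF}(E, N)$ is the one-sided Clopper-Pearson upper limit, that is, the $p$ at which the binomial probability of seeing at most $E$ errors in $N$ cases equals $CF$. The names `binom_cdf` and `upper_limit` are my own, not taken from C4.5 or from any library, and the root is found by plain bisection rather than by whatever method C4.5 itself uses.

```python
from math import comb


def binom_cdf(e: int, n: int, p: float) -> float:
    """Return P(X <= e) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(e + 1))


def upper_limit(e: int, n: int, cf: float = 0.25, tol: float = 1e-12) -> float:
    """One-sided Clopper-Pearson upper limit (hypothetical helper).

    Finds the p at which P(X <= e) equals cf, using bisection on [0, 1].
    """
    if not 0 <= e <= n:
        raise ValueError("need 0 <= e <= n")
    if not 0.0 < cf < 1.0:
        raise ValueError("need 0 < cf < 1")
    if e == n:
        # P(X <= n) is 1 for every p, so the limit is the trivial bound.
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # The CDF decreases as p grows, so a CDF above cf means p is too small.
        if binom_cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    for e, n in [(0, 6), (0, 9), (0, 1), (1, 16)]:
        print(f"U_25%({e}, {n}) = {upper_limit(e, n):.7f}")
```

The four cases in the loop are the same ones passed to clopper.pearson.ci in the R session above, so the two sets of results can be compared directly.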