Confusing Sampling from observed data


Suppose we are given some small set of data on bundles of electrical wires and increasing voltages run through them, and we note how many of the individual wires fail.

So, for example, in a data set of 6 observations, for each observation $i$ there are $w_{i}$ wires, a voltage $v_{i}$, and $f_{i}$ of the wires fail.

Suppose we are given the following information (note that each successive sample has a higher voltage, and we see an increasing proportion of failed wires).

$w_{1}=14$ and $f_{1}=4$

$w_{2}=13$ and $f_{2}=4$

$w_{3}=7$ and $f_{3}=3$

$w_{4}=10$ and $f_{4}=5$

$w_{5}=12$ and $f_{5}=7$

$w_{6}=20$ and $f_{6}=13$

I.e., we have a parameter space (where $t_{i}$ is the proportion that fail at voltage $v_{i}$) $\{t_{i}: t_{1} \lt t_{2} \lt t_{3} \lt \dots \lt t_{6} \le 1\}$, and we assume a flat prior over it.

My goal is to model this as a conditional distribution and sample from it, so that I can make statements about each $t_{i}$, such as its mean and deviation (assuming the flat prior, e.g. from histograms of the samples).

I know about sampling in general, but I am wondering how, from just this simple data, I can correctly form the conditional distribution, using rejection or transformations for example, and then use Gibbs sampling to draw conclusions about the individual failure proportions.

My thoughts:

Well, it seems that the number of wires that fail is a function of the voltage. As voltage increases, so too does the proportion of failed wires.

Possibly I could use rejection method to sample from the distribution that is creating this?

So I would want to find some function $g(x)$ such that $g(x) \ge f(x)$ for all $x$ , then simulate uniform random variables and check the conditions.

However, as of now I don't have a distribution. I guess I could draw one by hand, using 1,2,3,4,5,6... on the x axis for the observations and the corresponding proportion of failures on the y axis.

I know for a distribution, we need the probabilities to sum/integrate to 1.

The probabilities here, I assume, would be the probabilities that a certain proportion fails. So for $n$ wires, we would have a probability that a proportion $p_{1}=\frac{1}{n}$ fails, a probability that a proportion $p_{2}=\frac{2}{n}$ fails, all the way up to the probability that all wires fail.

So it looks more like the form of a CDF, as voltage increases, ie if we write in the form of a function, $F(v_{1})=\frac{4}{14}$ , $F(v_{2})=\frac{4}{13}$, and so forth, so if we had an unlimited sample, as $n \to \infty$ , $F(v_{n}) \to 1$

and I suppose then $F^{-1}=f$ would be our density, but I am still not sure how to do this in finite case.

Issues: We are not told anything about underlying distribution, parameters or form. Only the data given. So do we take the data that is given to be the initialising values?

I was thinking I could possibly just assume that the failures follow a binomial distribution, with the binomial parameter following some other distribution such as a beta. How does that sound? Would we then need to also put some distribution on the $w_{i}$ ? I would be okay trying it without that distribution, but I want to understand how I can have the failure probability increase

Any advice , ideas and answers are much appreciated.

2

There are 2 best solutions below

0
On BEST ANSWER

$\textbf{Edition of 06.12.2018}$

Let us consider the third observation ($w_3=7,\quad f_3=3$).

The binomial distribution can be presented as the table of values $$P(w,f,p)=\binom wf p^f(1-p)^{w-f},\quad f=0,1,\dots,w,\tag1$$ $$\begin{vmatrix} f & P(w_3,f,p) & P_i\left(w_3,f,\dfrac{f_3}{w_3}\right) & P_F(w_3,f)\\ 0 & (1-p)^7 & 0.0198945 & 0.0512821\\ 1 & 7p(1-p)^6 & 0.104446 & 0.130536\\ 2 & 21p^2(1-p)^5 & 0.235004 & 0.195804\\ 3 & 35p^3(1-p)^4 & 0.293755 & 0.217560\\ 4 & 35p^4(1-p)^3 & 0.220316 & 0.190365\\ 5 & 21p^5(1-p)^2 & 0.0991424 & 0.130536\\ 6 & 7p^6(1-p) & 0.0247856 & 0.0652681\\ 7 & p^7 & 0.0026556 & 0.018648 \end{vmatrix}\tag2$$ where $p$ is the unknown probability of failure in a single test.

There are two main ways to obtain $p(w_3,f_3).$

The first way, the maximum likelihood method (MLM), is to estimate $p$ by the observed frequency $$p(w_3,f_3) = \dfrac{f_3}{w_3},\tag4$$ (see also Wolfram Alpha plot of distribution)

p=3/7

The second way is the fiducial (Fisher) approach, in which $p$ is treated as a random variable whose density is $$f_F(w_i,f_i,p) = CP(w_i,f_i,p) = C\binom {w_i}{f_i} p^{f_i}(1-p)^{w_i-f_i},\tag5$$ where the constant $C$ is found from the condition $$\int\limits_0^1 f_F(w_i,f_i,p)\,\mathrm dp = 1.$$ For $i=3$ $$f_F(w_3,f_3,p) = C_3P(w_3,f_3,p)= C_3\cdot35p^3(1-p)^4,\tag6$$ $$C_3=\dfrac1{\int\limits_0^1 P(w_3,f_3,p)\,\mathrm dp} = \dfrac1{\int\limits_0^1 35p^3(1-p)^4\,\mathrm dp}=8\tag7$$ (see also Wolfram Alpha).
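The normalizing constant in $(5)$ can be checked with exact arithmetic. The following is a minimal sketch, not part of the original answer; the function names are illustrative, and it relies on the standard identity $\int_0^1 p^a(1-p)^b\,\mathrm dp = \frac{a!\,b!}{(a+b+1)!}$ for non-negative integers $a,b$.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact integral of p^a (1-p)^b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def normalizing_constant(w, f):
    """C such that C * comb(w, f) * p^f * (1-p)^(w-f) integrates to 1 on [0, 1]."""
    return 1 / (comb(w, f) * beta_integral(f, w - f))

# (w_i, f_i) pairs taken from the question
data = [(14, 4), (13, 4), (7, 3), (10, 5), (12, 7), (20, 13)]
constants = [normalizing_constant(w, f) for w, f in data]
print(constants)
```

For every observation the constant comes out as $w_i+1$, so the density in $(5)$ is the $\mathrm{Beta}(f_i+1,\,w_i-f_i+1)$ density.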

Therefore, $$f_F(w_3,f_3,p) = 8\binom 73p^3(1-p)^4 = 280p^3(1-p)^4,\tag8$$ and the distribution $(1)$ changes to $$P_F(w_3,f)= \int\limits_0^1 \binom{w_3}f p^f(1-p)^{w_3-f} f_F(w_3,f_3,p)\,\mathrm dp,\quad f=0,1,\dots,7\tag9$$ (see also Wolfram Alpha plot of distribution)

fiducial

This approach looks more rigorous, because it takes the parameter $w_i$ into account.

The expectation $E(f)$ can be calculated as

$$E(f) = \sum_{f=0}^w fP(f),$$ and the variance $V(f)$ as $$V(f) = \sum_{f=0}^w (f-E(f))^2 P(f)$$
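The integral in $(9)$ and the two sums above can be evaluated exactly. This is a minimal sketch under the assumption that the fiducial density is $\mathrm{Beta}(f_i+1,\,w_i-f_i+1)$, as the normalization in $(5)$ implies; the helper names are hypothetical. The parameter `n` is the number of wires in the predicted bundle.

```python
from fractions import Fraction
from math import comb, factorial

def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a, b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def predictive_pmf(n, w_obs, f_obs):
    """P_F for k = 0..n failures out of n wires, given f_obs failures out of w_obs."""
    a, b = f_obs + 1, w_obs - f_obs + 1  # parameters of the fiducial Beta density
    return [comb(n, k) * beta_fn(k + a, n - k + b) / beta_fn(a, b)
            for k in range(n + 1)]

def mean_var(pmf):
    """E(f) and V(f) computed directly from the sums in the text."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var

pmf7 = predictive_pmf(7, 7, 3)    # third observation, predicted for w = 7
print([float(p) for p in pmf7])
print(mean_var(predictive_pmf(20, 7, 3)))  # third observation, predicted for w = 20
```

The first printed list corresponds to the $P_F(w_3,f)$ column of table $(2)$.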

The information obtained about the parameter $p$ allows us to get the distribution law for any $w.$ For $w=20$ the plot of the calculated distribution for the first way is

p=3/7 w=20

and for the second one is

Fiducial w=20

This allows us to compare the probability distributions across observations with different sample sizes.

$$\begin{vmatrix} i & w_i & f_i & F_i & f_{Fi} & E\left(20,\frac{f_i}{w_i}\right) & V\left(20,\frac {f_i}{w_i}\right) & E_F(20,p) & V_F(20,p) \\ 1 & 14 & 4 & \dfrac27 & 15015p^4(1-p)^{10} & \dfrac{40}7 & \dfrac{200}{49} & \dfrac{25}4 & \dfrac{2475}{272}\\ 2 & 13 & 4 & \dfrac4{13} & 10010p^4(1-p)^{9} & \dfrac{80}{13} & \dfrac{720}{169} & \dfrac{20}3 & \dfrac{175}{18}\\ 3 & 7 & 3 & \dfrac37 & 280p^3(1-p)^{4} & \dfrac{60}7 & \dfrac{240}{49} & \dfrac{80}9 & \dfrac{1160}{81}\\ 4 & 10 & 5 & \dfrac12 & 2772p^5(1-p)^{5} & 10 & 5 & 10 & \dfrac{160}{13}\\ 5 & 12 & 7 & \dfrac7{12} & 10296p^7(1-p)^{5} & \dfrac{35}3 & \dfrac{175}{36} & \dfrac{80}7 & \dfrac{544}{49}\\ 6 & 20 & 13 & \dfrac{13}{20} & 1627920p^{13}(1-p)^{7} & 13 & \dfrac{91}{20} & \dfrac{140}{11} & \dfrac{23520}{2783}\\ \end{vmatrix}\tag{10}$$

$\mathbf{Observation\ 1\quad w_1=14\quad f_1=4}$

MLM plot:

MLM 4/14

Fiducial plot:

Fiducial 4/14

$\mathbf{Observation\ 2\quad w_2=13\quad f_2=4}$

MLM plot:

MLM 4/13

Fiducial plot:

Fiducial 4/13

$\mathbf{Observation\ 3\quad w_3=7\quad f_3=3}$

MLM plot:

MLM 3/7

Fiducial plot:

Fiducial 3/7

$\mathbf{Observation\ 4\quad w_4=10\quad f_4=5}$

MLM plot:

MLM 5/10

Fiducial plot:

Fiducial 5/10

$\mathbf{Observation\ 5\quad w_5=12\quad f_5=7}$

MLM plot:

MLM 7/12

Fiducial plot:

Fiducial 7/12

$\mathbf{Observation\ 6\quad w_6=20\quad f_6=13}$

MLM plot

MLM 13/20

Fiducial plot:

Fiducial 13/20

Analysis of the graphs shows that as the number of wires $w_i$ in an observation grows, the results of the two methods converge.
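The densities $f_F$ treat each observation separately, whereas the question also imposes the ordering $t_1 \lt t_2 \lt \dots \lt t_6$ under a flat prior. A minimal rejection-sampling sketch of that constrained case follows. It is an illustration added under stated assumptions, not the answer's own method: the likelihood is taken to be binomial, so each unconstrained density is $\mathrm{Beta}(f_i+1,\,w_i-f_i+1)$, and a joint draw is accepted only when it is strictly increasing. The function names and the number of proposals are arbitrary choices.

```python
import random

# (w_i, f_i) pairs taken from the question
DATA = [(14, 4), (13, 4), (7, 3), (10, 5), (12, 7), (20, 13)]

def sample_ordered(data, n_proposals, seed=0):
    """Rejection sampler: propose independent Beta draws, keep only ordered ones."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        t = [rng.betavariate(f + 1, w - f + 1) for w, f in data]
        if all(x < y for x, y in zip(t, t[1:])):  # t_1 < t_2 < ... < t_6
            accepted.append(t)
    return accepted

def summarize(samples):
    """Sample mean and standard deviation of each t_i over the accepted draws."""
    n = len(samples)
    summary = []
    for i in range(len(samples[0])):
        column = [s[i] for s in samples]
        mean = sum(column) / n
        sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
        summary.append((mean, sd))
    return summary

samples = sample_ordered(DATA, 50_000)
print(f"accepted {len(samples)} of 50000 proposals")
for i, (m, s) in enumerate(summarize(samples), start=1):
    print(f"t_{i}: mean={m:.3f} sd={s:.3f}")
```

Only a small fraction of proposals is accepted, which is tolerable for six observations; a Gibbs sampler with truncated Beta conditionals would avoid the waste but needs an inverse-CDF routine for the Beta distribution.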

0
On

A sketch will be of much help in summarizing the terms of the problem.

Wire_Insulation_1

We have a production of wires in which the insulation resistance is spread over a range of voltages with a certain PDF and relevant CDF.

We set a voltage $V_k$ in the range, take a relatively small sample of wires of size $w_k$ (which varies from test to test), and record the number of wires that fail, $f_k$.

The $w_k$ wires will have a distribution of breaking voltages which ideally follows the population CDF; that is, when dividing the vertical range of probability into $w_k$ equal intervals, we would expect to find one wire in each (placed at its center).
This means that the elements, projected onto the vertical scale, follow a uniform probability density on the $[0,1]$ interval.

Then we are going to assign to $V_k$ a value $P'_k$ for the CDF, corresponding to the interval limit between failed/not-failed as indicated in the sketch ($0.4$ in the example shown).

Now, with respect to the underlying population distribution, corresponding to a huge sample, a small sample will introduce two kinds of error:
- a "discretization" error, because of the gap between the last failed and the first surviving wire;
- a "sampling" error, because the sample will deviate from an exact uniform distribution.

We can account for both by asking ourselves:
given $w_k$ elements from a uniform distribution on $[0,1]$, of which $f_k$ failed the test, what is the probability that one of the failed elements lies at the threshold $0 \le P'_k \le 1$, the remaining $f_k-1$ lie below it, and $w_k-f_k$ lie above it?

That is clearly expressible as $$ \bbox[lightyellow] { p(P'_{\,k} )\,dP'_{\,k} = w_{\,k} \,dP'_{\,k} \left( \matrix{ w_{\,k} - 1 \cr f_{\,k} - 1 \cr} \right) {P'_{\,k}} ^{f_{\,k} - 1} \left( {1 - P'_{\,k} } \right)^{w_{\,k} - f_{\,k} } }$$

It is easy to check, through the expression of the Beta Function that the integral of the above correctly gives $1$.
In fact $p(P'_{\,k} )$ is a Beta Distribution PDF $$ \bbox[lightyellow] { p(P'_{\,k} ) = Beta\left( {f_{\,k} ,\,w_{\,k} - f_{\,k} + 1} \right) }$$ because $$ w\left( \matrix{ w - 1 \cr f - 1 \cr} \right) = w{{\Gamma \left( w \right)} \over {\Gamma \left( f \right)\Gamma \left( {w - f + 1} \right)}} = {{\Gamma \left( {w + 1} \right)} \over {\Gamma \left( f \right)\Gamma \left( {w + 1 - f} \right)}} = {1 \over {{\rm B}\left( {f,w - f + 1} \right)}} $$

Note that in the cited reference it is affirmed that
The beta distribution is a suitable model for the random behavior of percentages and proportions.

In the construction above, we have set the threshold $P'_k$ to coincide with the failed element of highest resistance. Actually there is a gap between this and the first good item (the one with the lowest resistance), so the threshold could be moved up to the latter. That is equivalent to choosing a $ Beta\left( {f_{\,k}+1 ,\,w_{\,k} - f_{\,k}} \right)$.

So, if there is not a need for more sophistication, we can take the threshold to be at half of the gap, thus to take $$ \bbox[lightyellow] { p(P'_{\,k} ) = Beta\left( {f_{\,k} + 1/2,\,w_{\,k} - f_{\,k} + 1/2} \right) }$$ which gives a mean and variance of $$ \bbox[lightyellow] { E\left( {P'_{\,k} } \right) = {{f_{\,k} + 1/2} \over {w_{\,k} + 1}}\quad {\rm var}\left( {P'_{\,k} } \right) = {{\left( {f_{\,k} + 1/2} \right)\left( {w_{\,k} - f_{\,k} + 1/2} \right)} \over {\left( {w_{\,k} + 1} \right)^{\,2} \left( {w_{\,k} + 2} \right)}} }$$
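As a minimal sketch, the mean and variance formulas above can be evaluated exactly for the six observations in the question; the function name is illustrative and not from the original answer.

```python
from fractions import Fraction

# (w_k, f_k) pairs taken from the question
DATA = [(14, 4), (13, 4), (7, 3), (10, 5), (12, 7), (20, 13)]

def midgap_beta_moments(w, f):
    """Mean and variance of Beta(f + 1/2, w - f + 1/2)."""
    a = Fraction(2 * f + 1, 2)        # f + 1/2
    b = Fraction(2 * (w - f) + 1, 2)  # w - f + 1/2
    mean = a / (a + b)                                # (f + 1/2) / (w + 1)
    var = a * b / ((a + b) ** 2 * (a + b + 1))        # general Beta variance
    return mean, var

for k, (w, f) in enumerate(DATA, start=1):
    mean, var = midgap_beta_moments(w, f)
    print(f"k={k}: E = {mean} = {float(mean):.4f}, sd = {float(var) ** 0.5:.4f}")
```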

It is this mean that should be assigned to $P'_k$, with an associated "error" that follows the Beta distribution around it.

After that you can perform a regression on the resulting plot of $(V_k, P'_k)$, or a distribution fitting, to estimate the underlying population CDF.
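One way such a regression could look is sketched below. Everything specific in it is an assumption: the question does not give the voltages, so the observation index $1,\dots,6$ stands in for $V_k$ as a placeholder, and a logistic curve fitted by ordinary least squares on the logit scale is just one possible choice of CDF family.

```python
from math import exp, log

# (w_k, f_k) pairs taken from the question
DATA = [(14, 4), (13, 4), (7, 3), (10, 5), (12, 7), (20, 13)]

def midgap_mean(w, f):
    """Mean of Beta(f + 1/2, w - f + 1/2), used as the CDF value P'_k."""
    return (f + 0.5) / (w + 1)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# PLACEHOLDER voltages: the real V_k are not given, so the index 1..6 is used.
voltages = [1, 2, 3, 4, 5, 6]
logits = [log(p / (1 - p)) for p in (midgap_mean(w, f) for w, f in DATA)]
slope, intercept = fit_line(voltages, logits)

def fitted_cdf(v):
    """Logistic CDF estimate at voltage v (in placeholder units)."""
    return 1 / (1 + exp(-(intercept + slope * v)))

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print([round(fitted_cdf(v), 3) for v in voltages])
```

A weighted fit that uses the Beta variances as weights would respect the different sample sizes $w_k$ better than this unweighted version.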