Anderson-Darling Goodness of Fit MATLAB trying to understand parameters


I'll preface this by saying it is for an assignment, but one I've already handed in (not yet graded). I am trying to learn how Anderson-Darling goodness-of-fit tests work, using the adtest function specifically. To see how it works, the assignment had us generate a random uniformly distributed sample, then use the function to test whether the sample comes from a Gaussian distribution. (Ideally the test should reject, since it is not a Gaussian distribution.) Now we get into the specifics of using the function.

The assignment specifically tells us to give the function the parameters to test for a normal distribution with the same mean and std as the generated random variable.

According to the documentation, if I want to check a specific probability distribution,

  1. I could go about it like this:

pd = makedist('Normal','mu',mu,'sigma',sigma);
h = adtest(x,'Distribution',pd)

In this regard, I'd first calculate the mean and sigma of the variable from the endpoints a and b of the uniform distribution:

mu = (a+b)/2
sigma = (b-a)/sqrt(12)

(Note that (b-a)^2/12 is the variance; the standard deviation is its square root.)

Then call it as seen in the documentation.
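Those formulas can be sanity-checked numerically. Here's a quick sketch in Python with NumPy (as a stand-in for MATLAB; the endpoint values are just illustrative), comparing the theoretical mean and std of a uniform distribution against sample estimates:

```python
import numpy as np

# Endpoints of the uniform distribution (illustrative values)
a, b = 2.0, 5.0

# Theoretical mean and standard deviation of U(a, b)
mu = (a + b) / 2
sigma = (b - a) / np.sqrt(12)   # note: (b - a)^2 / 12 is the *variance*

# Compare against sample estimates from a large sample
rng = np.random.default_rng(0)
x = rng.uniform(a, b, size=100_000)
print(x.mean(), mu)          # both close to 3.5
print(x.std(ddof=1), sigma)  # both close to 0.866
```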

  2. Or I could simply call it as is:

h = adtest(x)

And the function will estimate the mean and std from the data and perform the test. When running with a small sample size (< 100), method 2) seems to vastly outperform method 1). When I use more samples, they both seem to perform similarly (almost consistently rejecting the null hypothesis). I don't exactly understand why this is so:


Question 1: Why did the results change so much between these two tests?
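For reference, the comparison can be roughly sketched in Python with SciPy (a stand-in, since I can't run adtest outside MATLAB): SciPy's `anderson` always estimates the parameters from the data, so it plays the role of method 2), while `kstest` with fully specified parameters stands in for method 1). The rejection rates over repeated draws of uniform data are:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 0.0, 1.0
mu, sigma = (a + b) / 2, (b - a) / np.sqrt(12)  # true parameters of U(a, b)

def rejection_rates(n, reps=200, alpha=0.05):
    """Fraction of replications in which each method rejects normality."""
    fixed = estimated = 0
    for _ in range(reps):
        x = rng.uniform(a, b, size=n)
        # "Method 1": fully specified N(mu, sigma) -- KS test as a stand-in
        if stats.kstest(x, 'norm', args=(mu, sigma)).pvalue < alpha:
            fixed += 1
        # "Method 2": Anderson-Darling, parameters estimated from the data
        res = stats.anderson(x, dist='norm')
        if res.statistic > res.critical_values[2]:  # index 2 = 5% level
            estimated += 1
    return fixed / reps, estimated / reps

rates = {n: rejection_rates(n) for n in (30, 1000)}
print(rates)
```

With a large sample the estimated-parameter test rejects essentially every time, while at small n the two methods can behave quite differently.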


Continuing on:

As I see it, we generally don't have the "actual distribution" of the random variable. Specifically, I won't know the actual mean and std of the random variable to feed into the function as in option 1).

To my understanding, the whole point of the test is to check whether the random variable matches a specific distribution, so that we can make further assumptions about the data in a controlled way that allows us to make predictions.

So, I guess


Question 2: Why would we ever give it an exact mean and variance to test against, rather than estimating them from the data itself? Is this practical or commonplace?


Thanks for your time.

(To any keeners out there: we also used the chi2gof test and kstest with similar requirements. If you want to address the same questions above for those functions as well, it'd be greatly appreciated, but I'd consider the question answered even if it's only with respect to adtest.)

1 Answer

BEST ANSWER:
  1. The estimators computed are the sample mean $\bar{X}_n$ and the sample variance $S^2$. While they are consistent (i.e., they converge to the true $E[X]$ and $\mathrm{Var}(X)$ as the sample grows), for relatively small sample sizes they may differ noticeably from the true parameter values. For a large enough sample, $\bar{X}_n$ and $S^2$ are close to the true values, hence the results should be approximately the same. Note also that when the parameters are estimated from the data, the null distribution of the test statistic changes, and adtest uses critical values adjusted for that composite case, which further affects small-sample behavior.
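The consistency of $\bar{X}_n$ and $S^2$ can be illustrated numerically. A Python/NumPy sketch (averaging the estimation error over replications just to smooth the comparison):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.0, 1.0
true_mu, true_sigma = (a + b) / 2, (b - a) / np.sqrt(12)

def mean_abs_errors(n, reps=500):
    """Average |estimate - truth| for the sample mean and sample std."""
    mu_err = sigma_err = 0.0
    for _ in range(reps):
        x = rng.uniform(a, b, size=n)
        mu_err += abs(x.mean() - true_mu)
        sigma_err += abs(x.std(ddof=1) - true_sigma)
    return mu_err / reps, sigma_err / reps

errs = {n: mean_abs_errors(n) for n in (20, 2000)}
print(errs)  # errors shrink roughly like 1/sqrt(n)
```

At n = 20 the estimates can sit far from the truth, which is exactly the regime where the two adtest variants diverge.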

  2. You may, for example, have historical data or theoretical considerations that supply the candidate parameter values you want to test against. In most real-world situations you don't know the true values, so you need to estimate them from the data.