How to deal with copulas?

438 Views Asked by At

I want to model a couple of variables $(X,Y)$ using copulas. My idea is to model the marginal distributions and then use copulas to combine marginal distributions and copula to get a joint distribution.

I am not sure how it works though.

Let's say that $X$ and $Y$ follow Gamma distributions and that I found the optimal parameters to fit my curves. The following is my (Python) code to fit the Gamma distribution to $X$, I use almost the same for $Y$.

# x = [80, 100, 285, 186, 134, 89, 78, 45, 98, ...]
sns.distplot(x)
fit_alpha, fit_loc, fit_beta = stats.gamma.fit(x)
rv = stats.gamma(a=fit_alpha, loc=fit_loc, scale=fit_beta)
absc = np.linspace(0, 500, 100)
sns.lineplot(absc, rv.pdf(absc))

Graph of X

I can now plot the cumulative distribution functions of my observations (the cdf are the cdf of the fitted Gamma distributions), and I check that the graphs obtained are (almost) those of uniform distributions (which I believe is a condition to use copulas). Same as before, here is the code for $X$ but I did the same for $Y$.

x_prime = stats.gamma.cdf(x, a=fit_alpha, loc=fit_loc, scale=fit_beta)
plt.hist(x_prime);

Graph of x_prime, the fitted cdf of the X observations

Then I can display the scatter plot of x_prime and y_prime (which are the fitted cdf of my observations). I believe that what I get at this stage is a copula.

plt.scatter(x_prime, y_prime, alpha=0.22)

Copula (based on X and Y)

Was it an appropriate procedure? And more importantly, what am I supposed to do with that to get a joint distribution?

1

There are 1 best solutions below

12
On BEST ANSWER

By doing so, the modelization of the copula function will depend on the parametric model you used for your marginals. Even If you know the true model that generated your data (which is rarely the case in real-world problems), the method you used to estimate the parameters is not exact and the copula you obtain will still depend on the marginals.

To avoid that, you need to transform your data into rank data. For example, let say you have a two-dimensional random vector $\mathbf{X} = (X_1, X_2)$ and a sample of 4 observations $\mathcal{S} = \left\{(-1,3), (0.5,1), (0, -2), (1,0)\right\}$. The rank variable $\mathbf{U}_i[j]$ is given by the rank of $\mathbf{X}_i[j]$ for all $1\leq j \leq 4$. So from my example you have $\mathcal{R} = \left\{(1,4), (3,3), (2, 1), (4,2)\right\}$.

Using this transformation, you can then renormalize the data in order to have values in $[0, 1]$ and obtain the empirical copula. Then choosing a parametric model for the copula, you can estimate the best parameters using, for example, the MLE method.

I finish with an example using the library OpenTURNS (https://openturns.github.io/www/index.html) which has already a lot of implemented methods to deal with copulas. Let say I have data generated from a distribution who has a Gaussian copula with a correlation parameter $\rho = 0.8$ and Gamma marginals with rate parameter $k = 1$ and shape parameter $\lambda = 1$ :

import openturns as ot
import matplotlib.pyplot as plt
from openturns.viewer import View

ot.RandomGenerator.SetSeed(42)

marginals = [ot.Gamma()] * 2 # default parameters are [1, 1]
copula = ot.NormalCopula(ot.CorrelationMatrix([[1, 0.8], [0.8, 1]]))

distribution = ot.ComposedDistribution(marginals, copula)

sample = distribution.getSample(1000)}

A sample of 1000 data points

If I transform my data into rank data, I obtain the copula (the chosen normalization is so that data are in $(0,1)$):

rank_sample = (sample.rank() + 0.5) / (sample.getSize() + 1)

The same sample in the "copula space"

Then, I can estimate the copula function using whatever the model I want. Since I know the true copula model I will use it:

copula_fit = ot.NormalCopulaFactory().build(sample)
print(copula_fit.getParameter())
# returns 0.796686

And then, I can estimate the marginals:

marginals_fit = [ot.ExponentialFactory().build(sample.getMarginal(0)),
                 ot.ExponentialFactory().build(sample.getMarginal(1))]
print([mf.getParameter() for mf in marginals_fit])
# returns [1.11684,1.08329] and [1.10042,1.07048]

Note that I have chosen to estimate the copula and then the marginals but I could have done the converse since the two problems are independent.

Finally you can merge the marginals and the copula you learned into a distribution and plot its density:

fig, ax = plt.subplots()
distribution_fit = ot.ComposedDistribution(marginals_fit, copula_fit)
View(distribution_fit.drawPDF(), figure=fig, axes=[ax], add_legend=False)
plt.show()

Learned density

Or you can evaluate the density at any point:

print(distribution_fit.computePDF([0.5, 0.5])) # Returns 0.6708