I want to model a couple of variables $(X,Y)$ using copulas. My idea is to model the marginal distributions and then use copulas to combine marginal distributions and copula to get a joint distribution.
I am not sure how it works though.
Let's say that $X$ and $Y$ follow Gamma distributions and that I found the optimal parameters to fit my curves. The following is my (Python) code to fit the Gamma distribution to $X$, I use almost the same for $Y$.
# x = [80, 100, 285, 186, 134, 89, 78, 45, 98, ...]
sns.distplot(x)
fit_alpha, fit_loc, fit_beta = stats.gamma.fit(x)
rv = stats.gamma(a=fit_alpha, loc=fit_loc, scale=fit_beta)
absc = np.linspace(0, 500, 100)
sns.lineplot(absc, rv.pdf(absc))
I can now plot the cumulative distribution functions of my observations (the cdf are the cdf of the fitted Gamma distributions), and I check that the graphs obtained are (almost) those of uniform distributions (which I believe is a condition to use copulas). Same as before, here is the code for $X$ but I did the same for $Y$.
x_prime = stats.gamma.cdf(x, a=fit_alpha, loc=fit_loc, scale=fit_beta)
plt.hist(x_prime);
Then I can display the scatter plot of x_prime and y_prime (which are the fitted cdf of my observations). I believe that what I get at this stage is a copula.
plt.scatter(x_prime, y_prime, alpha=0.22)
Was it an appropriate procedure? And more importantly, what am I supposed to do with that to get a joint distribution?



By doing so, the modelization of the copula function will depend on the parametric model you used for your marginals. Even If you know the true model that generated your data (which is rarely the case in real-world problems), the method you used to estimate the parameters is not exact and the copula you obtain will still depend on the marginals.
To avoid that, you need to transform your data into rank data. For example, let say you have a two-dimensional random vector $\mathbf{X} = (X_1, X_2)$ and a sample of 4 observations $\mathcal{S} = \left\{(-1,3), (0.5,1), (0, -2), (1,0)\right\}$. The rank variable $\mathbf{U}_i[j]$ is given by the rank of $\mathbf{X}_i[j]$ for all $1\leq j \leq 4$. So from my example you have $\mathcal{R} = \left\{(1,4), (3,3), (2, 1), (4,2)\right\}$.
Using this transformation, you can then renormalize the data in order to have values in $[0, 1]$ and obtain the empirical copula. Then choosing a parametric model for the copula, you can estimate the best parameters using, for example, the MLE method.
I finish with an example using the library OpenTURNS (https://openturns.github.io/www/index.html) which has already a lot of implemented methods to deal with copulas. Let say I have data generated from a distribution who has a Gaussian copula with a correlation parameter $\rho = 0.8$ and Gamma marginals with rate parameter $k = 1$ and shape parameter $\lambda = 1$ :
If I transform my data into rank data, I obtain the copula (the chosen normalization is so that data are in $(0,1)$):
Then, I can estimate the copula function using whatever the model I want. Since I know the true copula model I will use it:
And then, I can estimate the marginals:
Note that I have chosen to estimate the copula and then the marginals but I could have done the converse since the two problems are independent.
Finally you can merge the marginals and the copula you learned into a distribution and plot its density:
Or you can evaluate the density at any point: