This notebook contains material from cbe67701-uncertainty-quantification; content is available on Github.

3.1 Copulas

Created by Krishnendu Mukherjee (kmukherj@nd.edu)

In [1]:
## installing copula library via pip
!pip install copulas
Requirement already satisfied: copulas in /anaconda3/lib/python3.7/site-packages (0.3.0)
Requirement already satisfied: scipy<1.3,>=1.2 in /anaconda3/lib/python3.7/site-packages (from copulas) (1.2.1)
Requirement already satisfied: pandas<0.25,>=0.22.0 in /anaconda3/lib/python3.7/site-packages (from copulas) (0.24.2)
Requirement already satisfied: docutils<0.15,>=0.10 in /anaconda3/lib/python3.7/site-packages (from copulas) (0.14)
Requirement already satisfied: numpy<1.17,>=1.13.1 in /anaconda3/lib/python3.7/site-packages (from copulas) (1.16.2)
Requirement already satisfied: exrex<0.11,>=0.10.5 in /anaconda3/lib/python3.7/site-packages (from copulas) (0.10.5)
Requirement already satisfied: boto3<1.10,>=1.7.47 in /anaconda3/lib/python3.7/site-packages (from copulas) (1.9.253)
Requirement already satisfied: matplotlib<4,>=2.2.2 in /anaconda3/lib/python3.7/site-packages (from copulas) (3.0.3)
Requirement already satisfied: pytz>=2011k in /anaconda3/lib/python3.7/site-packages (from pandas<0.25,>=0.22.0->copulas) (2018.9)
Requirement already satisfied: python-dateutil>=2.5.0 in /anaconda3/lib/python3.7/site-packages (from pandas<0.25,>=0.22.0->copulas) (2.8.0)
Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /anaconda3/lib/python3.7/site-packages (from boto3<1.10,>=1.7.47->copulas) (0.2.1)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /anaconda3/lib/python3.7/site-packages (from boto3<1.10,>=1.7.47->copulas) (0.10.0)
Requirement already satisfied: botocore<1.13.0,>=1.12.253 in /anaconda3/lib/python3.7/site-packages (from boto3<1.10,>=1.7.47->copulas) (1.12.253)
Requirement already satisfied: cycler>=0.10 in /anaconda3/lib/python3.7/site-packages (from matplotlib<4,>=2.2.2->copulas) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /anaconda3/lib/python3.7/site-packages (from matplotlib<4,>=2.2.2->copulas) (1.0.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /anaconda3/lib/python3.7/site-packages (from matplotlib<4,>=2.2.2->copulas) (2.3.1)
Requirement already satisfied: six>=1.5 in /anaconda3/lib/python3.7/site-packages (from python-dateutil>=2.5.0->pandas<0.25,>=0.22.0->copulas) (1.12.0)
Requirement already satisfied: urllib3<1.26,>=1.20; python_version >= "3.4" in /anaconda3/lib/python3.7/site-packages (from botocore<1.13.0,>=1.12.253->boto3<1.10,>=1.7.47->copulas) (1.24.1)
Requirement already satisfied: setuptools in /anaconda3/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib<4,>=2.2.2->copulas) (40.8.0)

Warning: If you run this notebook on Colab, you must select Runtime --> Restart and Run All. This is because the package copulas uses a different version of pandas and other packages already available on Colab.

In [2]:
## import all needed Python libraries here
import numpy as np
import pandas as pd
from scipy import stats
import random
from copulas import random_seed
import matplotlib.pyplot as plt
# essential function for generating Normal copula and visualization
from copulas.multivariate import GaussianMultivariate
from copulas.visualization import compare_3d
from mpl_toolkits import mplot3d

3.1.1 Definition

Copulas can help to "link" Cumulative distribution function (CDF) to Joint distribution for any number of input parameters.

$$ F_{XY} = C(F_X (x), F_Y (y)) $$

Here, C is the copula function which takes marginal CDF for each variable and creates a joint CDF.

  • Very useful in finance and insurance industries for modeling risk distribution.
  • They allow one to easily model and estimate the distribution of random vectors by estimating marginals and copulae separately.

Normal Copula

One of the simplest Copula in the Normal/Gaussian copula:

$$ C_N (u,v) = \Phi_R (\Phi^{-1}(u), \Phi^{-1}(v)) $$

Where, $u$ and $v$ are the random variables, $R$ the correlation matrix and $C_N$ is the normal copula.

In [3]:
## https://pypi.org/project/copulas/#:~:text=Copulas%20is%20a%20Python%20library,following%20the%20same%20statistical%20properties. 
## Defining A trivariate correlated distribution, 2 beta and 1 Normal    
def sample_trivariate_xyz(size=1000, seed=42):
    """Sample from three dimensional toy dataset.
    The output is a DataFrame containing three columns:
    * ``x``: Beta distribution with a=0.1 and b=0.1
    * ``y``: Beta distribution with a=0.1 and b=0.5
    * ``z``: Normal distribution + 10 times ``y``
    Args:
        size (int):
            Amount of samples to generate. Defaults to 1000.
        seed (int):
            Random seed to use. Defaults to 42.
    Retruns:
        pandas.DataFrame:
            DataFrame with three columns, ``x``, ``y`` and ``z``.
    """
    with random_seed(seed):
        x = stats.beta.rvs(a=0.1, b=0.1, size=size)
        y = stats.beta.rvs(a=0.1, b=0.5, size=size)
        return pd.DataFrame({
            'x': x,
            'y': y,
            'z': np.random.normal(size=size) + y*10
    })
        
# Defining a function for 3 dimensional plot for visualization, 
# followed by a comparision plot

def scatter_3d(data, title="Original dataset", columns=None, fig=None, position=None):
    """Plot 3 dimensional data in a scatter plot."""
    fig = fig or plt.figure()
    position = position or 111

    ax = fig.add_subplot(position, projection='3d')
    ax.scatter(*(
        data[column]
        for column in columns or data.columns
    ))
    ax.set_xlabel('X',Fontsize="10")
    ax.set_ylabel('Y',Fontsize="10")
    ax.set_zlabel('Z',Fontsize="10")
    
    if title:
        ax.set_title(title,Fontsize="20")
        ax.title.set_position([.5, 1.05])

    return ax
def scatter_3d_2(data, title="Syntheic dataset", columns=None, fig=None, position=None):
    """Plot 3 dimensional data in a scatter plot."""
    fig = fig or plt.figure()
    position = position or 111

    ax = fig.add_subplot(position, projection='3d')
    ax.scatter(*(
        data[column]
        for column in columns or data.columns
    ))
    ax.set_xlabel('X',Fontsize="10")
    ax.set_ylabel('Y',Fontsize="10")
    ax.set_zlabel('Z',Fontsize="10")
    
    if title:
        ax.set_title(title,Fontsize="20")
        ax.title.set_position([.5, 1.05])

    return ax

# defining another trivariate distribution for which outputs age, income and health 
# expectancy 
def sample_trivariate_age_income_health(size=100, seed=42):
    """Sample from a bivariate toy dataset.
    This dataset contains two columns which correspond to the simulated age and
    income which are positively correlated with outliers.
    Args:
        size (int):
            Amount of samples to generate. Defaults to 100.
        seed (int):
            Random seed to use. Defaults to 42.
    Retruns:
        pandas.DataFrame:
            DataFrame with two columns, ``age`` and ``income``.
    """
    with random_seed(seed):
        age = stats.norm.rvs(25, 15.0, size=size)
        income = 2.3*(age**2) 
        health_expec = abs(age) + 2*np.log(income) 
    return pd.DataFrame({
        "age": age,
        "income": income,
        "health_expec" : health_expec
    })
In [4]:
## Sampling and visualizing the input tri-variate normal distribution dataset
data = sample_trivariate_xyz()
scatter_3d(data)
# Tip: Any .csv can be used as input with pd.dataframe command
Out[4]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x1a19f1d6a0>
In [5]:
## Fitting a gaussian copula to the data
copula = GaussianMultivariate()
copula.fit(data)

# Sampling synthetic data
synthetic_data = copula.sample(len(data))

# Plotting the real and the synthetic data to compare
scatter_3d_2(synthetic_data)
/anaconda3/lib/python3.7/site-packages/scipy/stats/_continuous_distns.py:515: RuntimeWarning: invalid value encountered in sqrt
  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
Out[5]:
<matplotlib.axes._subplots.Axes3DSubplot at 0x1a1ad90ac8>
In [ ]: