Estimating Eigenvalues in Empirical DataΒΆ

BackgroundΒΆ

This notebook explores how missing values affect the estimation of the covariance matrix and the eigenvalues of the covariance matrix. This is a serious issue as missing values lead to noisy estimates of the covariance and its eigenvalues. In some cases it can also lead to negative eigenvalues (which should not be possible for covariance matrices). Before we do this we will first review some linear algebra concepts. If you know linear algebra or just don't care about math you can skip the background section and jump straight to the code.

Positive Semi-Definite MatricesΒΆ

Recall that a symmetric $(n\times n)$ matrix $\Sigma$ is positive semi-definite if $x^T\Sigma x \geq 0, \forall x\in \mathbb{R}^n$. Equivalently, a $(n\times n)$ matrix $\Sigma$ is positive semi-definite if $\lambda_i \geq 0, \forall i\in \{1, \dots, n\}$, that is all eigenvalues are non-negative. To prove this we consider the characteristic equation:

$$ (\Sigma -\lambda I)x = 0\\ x^T(\Sigma -\lambda I)x = 0\\ x^T\Sigma x -x^T\lambda Ix = 0\\ x^T\Sigma x = \lambda x^Tx $$

Now if $x^T\Sigma x\geq0$ we have that $x^T\Sigma x = \lambda x^Tx \geq 0$ and since $x^Tx=\sum_{i=1}^n x_i^2\geq0, \forall x\in\mathbb{R}^n$ then we must have $\lambda_i \geq 0, \forall i\in{\{1, \dots n\}}$.

For the other direction we use the spectral theorem for real matrices and decompose $\Sigma = U\Lambda U^T$ where $U$ is orthonormal and $\Lambda$ is diagonal with eigenvalues on the diagonal. Now for any $x\in \mathbb{R}^n$ (not only eigenvectors) consider $x^T\Sigma x = x^T U\Lambda U^T x = (U^T x)^T\Lambda (U^T x) = y^T \Lambda y = \sum_{i=1}^n \lambda_i y_i^2$ where $y = U^Tx$. So, if $\lambda_i\geq 0, \forall i\in{\{1, \dots n\}}$ we must have $x^T\Sigma x \geq 0$.

Data SimulationΒΆ

To produce a positive semi-definite matrix $\Sigma$ we let $\Sigma = LL^T$ which is guaranteed to be positive semi-definite. The proof of this is straightforward. Let $x \in \mathbb{R}^n$. We need to show that $x^TLL^Tx \geq 0$: $$ x^TLL^Tx = (L^Tx)^T(L^Tx) = y^Ty \geq 0 $$

For our simulation we will use a multivariate normal distribution $X \sim \mathcal{N}(\mu, \Sigma)$ where $\mu\in\mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n\times n}$. To produce such an $X$ we set $X=\mu + LZ$ where $L$ is a lower triangular matrix and $Z\sim \mathcal{N}(0, I_n)$ where $I_n$ is the $n$-dimensional identity matrix. Using standard rules for expectation and variance we get: $$ \mathbb{E}[X] = \mathbb{E}[\mu + LZ] = \mu\\ \text{Var}[X] = \text{Var}[\mu + LZ] = LI_nL^T = LL^T = \Sigma $$ Since a linear combination of normally distributed variables are normally distributed we get $X \sim \mathcal{N}(\mu, \Sigma)$ as desired. In this experiment we use a lower triangular $L$ (see Cholesky decomposition for more on this).

InΒ [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import LedoitWolf

seed = 42
np.set_printoptions(precision=2, suppress=True)
np.random.seed(seed)
InΒ [2]:
m = 10_000
n = 3
Z = np.random.randn(m, n)
InΒ [3]:
Z = np.random.randn(m, n)

L = np.array([
    [1, 0, 0], 
    [2, 1, 0], 
    [-3, 1, 1]
])
sigma = L @ L.T
np.fill_diagonal(L, 1)
mu = np.random.randn(n, 1)

X = mu + L @ Z.T
print("Theoretical covariance matrix:")
print(sigma)
print("Sample covariance matrix:")
print(np.cov(X))
Theoretical covariance matrix:
[[ 1  2 -3]
 [ 2  5 -5]
 [-3 -5 11]]
Sample covariance matrix:
[[ 1.01  2.04 -3.02]
 [ 2.04  5.08 -5.06]
 [-3.02 -5.06 10.99]]

From the above we can see that the theoretical and sample covariance matrices coincide.

Introducing Missing Values and Calculating EigenvaluesΒΆ

In a perfect environment we can easily estimate the covariance, but in reality we might have missing data. In the following section we introduce missing values into our data to simulate a real-world environment. We choose an $\eta \in [0, 1)$ fraction of the $X$ matrix to be missing. That is if $\eta=0$ we have the original matrix and if $\eta=1$ the $X$ matrix will be all missing values. $\eta = 1$ does not really make sense so we have a right-open interval for the fraction of missing values. The positions of the missing values are chosen uniformly at random. After introducing missing values, we calculate the covariance matrix of the resulting matrix and the eigenvalues of the covariance matrix. We bootstrap and do this $B$ times to get uncertainty estimates of the eigenvalues. We finally plot boxplots for all eigenvalues at each level of missing values.

InΒ [4]:
def bootstrap_eigenvalues(
    X: np.ndarray, 
    B: int = 100, 
    na_fractions: np.ndarray = np.linspace(0.0, 0.8, 9), 
    shrinkage: bool = False, 
    imputation: bool = False, 
    imputation_method: str = "mean",
    bootstrap: bool = False
):
    """
    Bootstrap the eigenvalues of the covariance matrix of X for different fractions of missing values.
    Args:
        X: The data matrix of shape (m, n).
        B: The number of bootstrap samples.
        na_fractions: The fractions of missing values to use.
        shrinkage: Whether to use Ledoit Wolf shrinkage.
        imputation: Whether to impute missing values.
        imputation_method: The method to impute missing values.
        bootstrap: Whether to use a bootstrapped eigenvalue estimate.
    Returns:
        eigenvalues_samples: The eigenvalues of the covariance matrix of X for different fractions of missing values of shape (n, B, len(na_fractions)).
    """
    assert imputation_method in ["mean", "median"], "Invalid imputation method"

    eigenvalues_samples = []
    removed_fraction_when_negative, na_fraction_when_negative = [], []
    for na_frac in na_fractions:
        eigenvalues_na_fraction = []
        for _ in range(B):
            # Introduce missing values
            mask_na = np.random.choice([True, False], size=X.shape, p=[1-na_frac, na_frac])
            X_na = np.where(mask_na, X, np.nan)

            # Perform imputation if needed
            df = pd.DataFrame(X_na.T)
            if imputation:
                if imputation_method == "mean":
                    df = df.fillna(df.mean())
                elif imputation_method == "median":
                    df = df.fillna(df.median())
            

            # Calculate eigenvalues
            if shrinkage: # Ledoit-Wolf Shrinkage
                try:
                    cov_ledoit_wolf = LedoitWolf().fit(df.dropna()).covariance_
                    eigenvalues = np.linalg.eigvals(cov_ledoit_wolf)
                except np.linalg.LinAlgError as e:
                    print(f"Error calculating eigenvalues for na fraction {na_frac}: {e}")
                    eigenvalues = np.zeros(X.shape[1])
            elif bootstrap: # Bootstrap the Data with Missing Values
                B_prime = 100
                df = df.dropna()
                bootstrapped_eigenvalues = []
                for _ in range(B_prime):
                    df_subset = df.sample(frac=0.5)
                    cov = df_subset.cov().to_numpy()
                    eigenvalues = np.linalg.eigvals(cov)
                    bootstrapped_eigenvalues.append(eigenvalues)
                eigenvalues = np.array(bootstrapped_eigenvalues).mean(axis=0)
            else: 
                try:
                    cov = df.cov().to_numpy()
                    eigenvalues = np.linalg.eigvals(cov)
                except np.linalg.LinAlgError as e:
                    print(f"Error calculating eigenvalues for na fraction {na_frac}: {e}")
                    eigenvalues = np.zeros(X.shape[1])

            if np.any(eigenvalues < 0):
                na_fraction_when_negative.append(na_frac)
                removed_fraction_when_negative.append(1-df.dropna().size/df.size)

            eigenvalues_na_fraction.append(eigenvalues)

        eigenvalues_samples.append(np.array(eigenvalues_na_fraction))

    eigenvalues_samples = np.array(eigenvalues_samples)
    eigenvalues_samples = np.swapaxes(eigenvalues_samples, 0, 2)
    return eigenvalues_samples, na_fraction_when_negative, removed_fraction_when_negative

ExamplesΒΆ

Although the variance of the estimates increases significantly with the level of missing values, the estimates can remain approximately unbiased depending on how you compute the covariance. With complete-case analysis (dropping all rows containing any missing value), the remaining data still comes from the same distribution under MCAR, so the estimates are unbiased. With pairwise deletion (as used by pandas.DataFrame.cov()), each entry $\widehat{\text{Cov}}(X_j, X_k)$ is computed using only rows where both variables are observed. Different pairs use different subsets of the data, so sample sizes vary across entries and the resulting matrix need not be positive semi-definite, which can produce negative eigenvalues and unstable estimates.

InΒ [5]:
# Base example
Z1 = np.random.randn(m, n)

L1 = np.tril(np.random.randn(n, n))
np.fill_diagonal(L1, 1)
mu1 = np.random.randn(n, 1)

# We transpose Z1 because it makes it easier to calculate the covariance matrix using numpy because it treats the rows as features
X1 = mu1 + L1 @ Z1.T

# High correlation example
Z2 = np.random.randn(m, n)

L2 = 10*np.tril(np.random.uniform(0.9, 1.1, size=(n, n)))
np.fill_diagonal(L2, 1)
mu2 = np.random.randn(n, 1)

# Again, we transpose Z2 because it makes it easier to calculate the covariance matrix using numpy because it treats the rows as features
X2 = mu2 + L2 @ Z2.T

print("Correlation matrix example 1:")
print(np.corrcoef(X1))

print("Correlation matrix example 2:")
print(np.corrcoef(X2))
Correlation matrix example 1:
[[ 1.    0.24 -0.93]
 [ 0.24  1.   -0.38]
 [-0.93 -0.38  1.  ]]
Correlation matrix example 2:
[[1.   1.   0.69]
 [1.   1.   0.76]
 [0.69 0.76 1.  ]]
InΒ [6]:
# Setup for plotting
tick_labels = np.linspace(0.0, 0.8, 9)
tick_labels = [f"{tick:.1f}" for tick in tick_labels]

eigenvalues_samples_X1, _, _ = bootstrap_eigenvalues(X1)
true_eigenvalues_X1 = np.linalg.eigvals(np.cov(X1))

eigenvalues_samples_X2, na_fraction_when_negative_X2, removed_fraction_when_negative_X2 = bootstrap_eigenvalues(X2)
true_eigenvalues_X2 = np.linalg.eigvals(np.cov(X2))
InΒ [7]:
# Base example plotting
for i in range(n):
    plt.figure(figsize=(10, 5))
    plt.boxplot(eigenvalues_samples_X1[i], tick_labels=tick_labels)
    plt.axhline(y=true_eigenvalues_X1[i], color='r', linestyle='--', label='True Eigenvalue')
    plt.title(f'Eigenvalues for Feature {i+1} for Different Fractions of Missing Values')
    plt.xlabel('Fraction of Missing Values')
    plt.ylabel('Eigenvalue')
    plt.legend()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
InΒ [8]:
# High correlation example plotting
for i in range(n):
    plt.figure(figsize=(10, 5))
    plt.boxplot(eigenvalues_samples_X2[i], tick_labels=tick_labels)
    plt.axhline(y=true_eigenvalues_X2[i], color='r', linestyle='--', label='True Eigenvalue')
    plt.title(f'Eigenvalues for Feature {i+1} for Different Fractions of Missing Values')
    plt.xlabel('Fraction of Missing Values')
    plt.ylabel('Eigenvalue')
    plt.legend()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Dealing with Noisy EstimatesΒΆ

Dealing with this issue and also the invalid covariance matrices (due to the negative eigenvalues) is not trivial. In the below I present a few different ways to do this.

Do Nothing (But Set Negative Eigenvalues to $0$)ΒΆ

The easiest and maybe also most sensible thing to do is just to do nothing. This gives you unbiased estimates of the eigenvalues. One problem with this is that some eigenvalues might be negative. As mentioned, we see this happening in the second example above. This happens when the true eigenvalue are close to $0$. An easy fix to this is to set the negative eigenvalues to $0$. The fraction of missing values does seem to affect if we get negative eigenvalues. In the dataframe below we see that on average we remove $76\%$ of the values when getting negative eigenvalues. However, in the most extreme case we only remove $26\%$ of the values and still get negative eigenvalues. Doing nothing and setting negative eigenvalues to zero does not deal with reducing the variance of the estimates. To this end we can employ different variance reduction techniques discussed below.

InΒ [9]:
df = pd.DataFrame({"na_fraction_when_negative": na_fraction_when_negative_X2, "removed_fraction_when_negative": removed_fraction_when_negative_X2})
df.describe()
Out[9]:
na_fraction_when_negative removed_fraction_when_negative
count 416.000000 416.000000
mean 0.463221 0.762566
std 0.225630 0.235474
min 0.100000 0.261100
25% 0.300000 0.654075
50% 0.500000 0.872800
75% 0.700000 0.970700
max 0.800000 0.993800

Ledoit-Wolf ShrinkageΒΆ

One way to reduce variance is Ledoit-Wolf shrinkage which shrinks the covariance matrix towards a targeted covariance matrix. The targeted covariance matrix is often that all covariances are the same. Your prior is therefore that all features have the same covariance. This can be good for high-dimensional data such as financial time series which often has very noisy covariance matrices. Sklearn has an implementation of the Ledoit-Wolf shrinkage. We implement this in the below. However, as you can see this does not really work when the data has many missing values, i.e., it is sparse. This is because the Ledoit-Wolf shrinkage requires a reliable empirical covariance matrix which it doesn't have for high-sparsity cases.

InΒ [10]:
shrinkage_samples_X1, _, _ = bootstrap_eigenvalues(X1, shrinkage=True)

for i in range(X.shape[0]):
    fig, axs = plt.subplots(1, 2, figsize=(16, 5), sharey=True)
    axs[0].boxplot(eigenvalues_samples_X1[i], tick_labels=tick_labels)
    axs[1].boxplot(shrinkage_samples_X1[i], tick_labels=tick_labels)
    axs[0].axhline(y=true_eigenvalues_X1[i], color='r', linestyle='--', label='True Eigenvalue')
    axs[1].axhline(y=true_eigenvalues_X1[i], color='r', linestyle='--', label='True Eigenvalue')
    axs[0].set_title(f'Eigenvalues for Feature {i+1} for Different Fractions of Missing Values')
    axs[1].set_title(f'Shrinkage Eigenvalues for Feature {i+1} for Different Fractions of Missing Values')
    axs[0].set_xlabel('Fraction of Missing Values')
    axs[1].set_xlabel('Fraction of Missing Values')
    axs[0].set_ylabel('Eigenvalue')
    axs[1].set_ylabel('Eigenvalue')
    axs[0].legend()
    axs[1].legend()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Deleting Columns With Too Many Missing ValuesΒΆ

In the examples above we choose the positions of the missing values uniformly at random. There is therefore no structure in where the missing values are and you can end up with with deleting a lot of perfectly fine data. This happens because pandas calculates covariance matrices by deleting missing values row-wise. If a few columns contain most of the missing values, deleting those columns would be preferable.

Bootstrap the Data with Missing ValuesΒΆ

Given a dataset with missing values you can bootstrap the data and calculate the covariance matrix and its corresponding eigenvalues $B'$ times and let the the mean of those be your estimate. Bootstrapping is a very common variance reduction technique. The plots below demonstrates that bootstrapping can reduce the variance in the estimates. In this example, bootstrapping works very well for the second feature, but less so for the two remaining. This is most likely due to numerical instability. Bootstrapping is computationally expensive, but it does reduce variance and keeps the estimates unbiased.

InΒ [11]:
bootstrapped_samples_X1, _, _ = bootstrap_eigenvalues(X1, bootstrap=True)

for i in range(X.shape[0]):
    fig, axs = plt.subplots(1, 2, figsize=(16, 5), sharey=True)
    axs[0].boxplot(eigenvalues_samples_X1[i], tick_labels=tick_labels)
    axs[1].boxplot(bootstrapped_samples_X1[i], tick_labels=tick_labels)
    axs[0].axhline(y=true_eigenvalues_X1[i], color='r', linestyle='--', label='True Eigenvalue')
    axs[1].axhline(y=true_eigenvalues_X1[i], color='r', linestyle='--', label='True Eigenvalue')
    axs[0].set_title(f'Eigenvalues for Feature {i+1} for Different Fractions of Missing Values')
    axs[1].set_title(f'Bootstrapped Eigenvalues for Feature {i+1} for Different Fractions of Missing Values')
    axs[0].set_xlabel('Fraction of Missing Values')
    axs[1].set_xlabel('Fraction of Missing Values')
    axs[0].set_ylabel('Eigenvalue')
    axs[1].set_ylabel('Eigenvalue')
    axs[0].legend()
    axs[1].legend()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

ImputationΒΆ

Some imputation techniques like EM-algorithms can be used. In time series data where there is temporal structure one might also consider forward-filling. However, simple imputation techniques such as a mean- or median-imputation can be very dangerous. Replacing missing values with the mean or median often leads to biased estimates of the eigenvalues of the covariance matrix as it breaks down the correlation structure. Imputing with mean or median implicitly treat the features as independent which is bad when we want to capture the correlation structure. This bias is evident from the plots below. As we can see, we do reduce the variance of the estimates dramatically but the bias is quite serious. Simple imputation like this will also lead to underestimating the variance of the data.

InΒ [12]:
eigenvalues_samples_X1_imputed, _, _ = bootstrap_eigenvalues(X1, imputation=True, imputation_method="mean")

for i in range(X.shape[0]):
    fig, axs = plt.subplots(1, 2, figsize=(16, 5), sharey=True)
    axs[0].boxplot(eigenvalues_samples_X1[i], tick_labels=tick_labels)
    axs[1].boxplot(eigenvalues_samples_X1_imputed[i], tick_labels=tick_labels)
    axs[0].axhline(y=true_eigenvalues_X1[i], color='r', linestyle='--', label='True Eigenvalue')
    axs[1].axhline(y=true_eigenvalues_X1[i], color='r', linestyle='--', label='True Eigenvalue')
    axs[0].set_title(f'Eigenvalues for Feature {i+1} for Different Fractions of Missing Values')
    axs[1].set_title(f'Eigenvalues for Feature {i+1} for Different Fractions of Missing Values with Imputation')
    axs[0].set_xlabel('Fraction of Missing Values')
    axs[1].set_xlabel('Fraction of Missing Values')
    axs[0].set_ylabel('Eigenvalue')
    axs[1].set_ylabel('Eigenvalue')
    axs[0].legend()
    axs[1].legend()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

ConclusionΒΆ

This notebook examines how missing values impact the estimate of the covariance matrix and its eigenvalues. In the last section I present different methods for dealing with this issue. The following is a summary of the methods and when to use them.

  • Do Nothing (But Set Negative Eigenvalues to $0$): The go-to solution.
  • Ledoit-Wolf Shrinkage: Only use this when the fraction of missing values are relatively low.
  • Deleting Columns With Too Many Missing Values: Do this when a few columns contain most of the missing data.
  • Bootstrap the Data with Missing Values: Do this on small datasets or when compute is not an issue.
  • Imputation: Depending on the imputation method this can be dangerous, so be thoughtful on which imputation method to use. EM can work.