Imputing Missing Values and Those Below LOD#

Author: Nathan A. Mahynski

Date: 2023/08/23

Description: Illustrate different ways to impute missing or bad data and implement this in pipelines.

[ ]:

if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/mahynski/pychemauth@main
    import os
    os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries

[ ]:

try:
    import pychemauth
except:
    raise ImportError("pychemauth not installed")

import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2

[ ]:

import imblearn
import sklearn

from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd

[ ]:

%watermark -t -m -v --iversions

Python implementation: CPython
Python version       : 3.11.4
IPython version      : 8.14.0

Compiler    : GCC 12.2.0
OS          : Linux
Release     : 6.2.0-34-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 40
Architecture: 64bit

numpy     : 1.24.3
pandas    : 1.5.3
pychemauth: 0.0.0b3
json      : 2.0.9
matplotlib: 3.7.2
imblearn  : 0.11.0
watermark : 2.4.3
sklearn   : 1.3.0

Load the Data

Let’s load some sample data, then purposefully corrupt it so that we can try to reconstruct it with various methods.

[ ]:

from sklearn.datasets import fetch_california_housing
data_bunch = fetch_california_housing(return_X_y=False, as_frame=False)

# Let's just work with the first 2k datapoints in this example
raw_X = data_bunch.data[:2000,:]
raw_y = data_bunch.target[:2000]

# Randomly delete 5% of the entries
n_delete = int(raw_X.shape[0]*0.05)

np.random.seed(42)
row_idx = [np.random.randint(low=0, high=raw_X.shape[0])
     for i in range(n_delete)]
col_idx = [np.random.randint(low=0, high=raw_X.shape[1])
     for i in range(n_delete)]

missing_X = raw_X.copy()
for i,j in zip(row_idx, col_idx):
    missing_X[i,j] = np.nan

We’ll use these functions to visualize the corrupted data and reconstructions later on.

[ ]:

def compare(raw_X, reconstructed_X, first=None):
    print('Reconstructed\tOriginal\tDifference\tRelative Err')
    for i,j in zip(row_idx[:first], col_idx[:first]):
        print('%.3e\t'%reconstructed_X[i,j]
              +'%.3e\t'%raw_X[i,j]
              +'%.3e\t'%(reconstructed_X[i,j]-raw_X[i,j])
              +'%.3f'%(np.abs((reconstructed_X[i,j]-raw_X[i,j])/raw_X[i,j]))
             )

def plot_performance(gs, components):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))

    for i,ax in enumerate(axes):
        ax.errorbar(
            x=components,
            y=-gs.cv_results_['mean_test_score'],
            yerr=gs.cv_results_['std_test_score'] # Errorbars are +/- 1 standard deviation
        )
        if i == 0:
            ax.set_yscale('log')
            ax.set_title('In Log Scale')
        else:
            ax.set_title('Not in Log Scale')
        ax.set_xlabel('Number of Components')
        ax.set_ylabel('Sum Squared Error on Test Set')

Iterative PCA (Missing X values)#

PCA can be used to reconstruct the missing values using a maximum likelihood estimation method.

As stated in [1]:

“The PCA/IA algorithm in the version applied in chemometrics can be summarized as:

fill in missing elements with their initial estimates,

perform singular value decomposition of the complete data set,

reconstruct X with the predefined number of factors,

replace the missing elements with the predicted values and go to step 2 until convergence.”

References:

[1] Walczak, B. and Massart, D. L., “Dealing with missing data: Part I,” Chemometrics and Intelligent Laboratory Systems 58 (2001) 15-27.

[2] Walczak, B. and Massart, D. L., “Dealing with missing data: Part II,” Chemometrics and Intelligent Laboratory Systems 58 (2001) 29-42.

Fixed Number of Components

The number of components being used in the singular value decomposition is a hyperparameter that should be optimized inside a pipeline using cross-validation; however, in this first example we will just set the value for illustration.

[ ]:

from pychemauth.preprocessing.missing import PCA_IA

[ ]:

itim = PCA_IA(
    n_components=3,
    scale_x=True,
    missing_values=np.nan,
    tol=1.0e-4,
    max_iters=5000
)

[ ]:

reconstructed_X = itim.fit_transform(missing_X)
compare(raw_X, reconstructed_X, first=10)

Reconstructed   Original        Difference      Relative Err
1.528e+01       2.300e+01       -7.721e+00      0.336
1.230e+00       9.953e-01       2.350e-01       0.236
2.639e+01       1.500e+01       1.139e+01       0.760
2.772e+01       1.300e+01       1.472e+01       1.132
2.318e+00       2.620e+00       -3.020e-01      0.115
1.386e+00       1.117e+00       2.690e-01       0.241
2.400e+00       3.728e+00       -1.328e+00      0.356
3.924e+01       3.842e+01       8.229e-01       0.021
-1.222e+02      -1.222e+02      -5.861e-02      0.000
3.771e+01       3.785e+01       -1.373e-01      0.004

Unknown Number of Components

Usually, we need to figure out what a good number of components is. We can use cross-validation for this.

[ ]:

pipeline = sklearn.pipeline.Pipeline(steps=[
    ("pca_ia", PCA_IA(
        n_components=1,
        tol=1e-04,
        max_iters=5000,
        scale_x=True)
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
components = np.arange(1, raw_X.shape[1], 1)
param_grid = [{
    'pca_ia__n_components': components,
}]

# Let's use a simple grid search cross-validation stratgegy with 5-fold CV
gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

# Fit the PCA_IA object
_ = gs.fit(
    missing_X,
)

[ ]:

# Let's plot the results
plot_performance(gs, components)

../../_images/jupyter_api_imputing_20_0.png

PCA/IA is working well when there are nearly as many components as features. In log scale it is clear that the test set error is monotonically decreasing, which means the CV object will choose the maximum number of components as the optimum.

[ ]:

gs.best_params_ # Indeed, N = 7 is what the CV object chooses

{'pca_ia__n_components': 7}

[ ]:

reconstructed_X = gs.transform(missing_X)
compare(raw_X, reconstructed_X, first=10)

Reconstructed   Original        Difference      Relative Err
2.969e+01       2.300e+01       6.685e+00       0.291
1.087e+00       9.953e-01       9.147e-02       0.092
3.107e+01       1.500e+01       1.607e+01       1.071
3.265e+01       1.300e+01       1.965e+01       1.511
2.652e+00       2.620e+00       3.176e-02       0.012
1.020e+00       1.117e+00       -9.738e-02      0.087
2.708e+00       3.728e+00       -1.020e+00      0.274
3.824e+01       3.842e+01       -1.844e-01      0.005
-1.220e+02      -1.222e+02      1.970e-01       0.002
3.831e+01       3.785e+01       4.560e-01       0.012

You can then use this in other pipelines. You can specify the imputer without any hyperparameters in those cases, for example. Below is an example of how you might do that when using a subsquent PLS-DA model.

pipeline = imblearn.pipeline.Pipeline(steps=[
    ("pca_ia", PCA_IA(n_components=7, scale_x=False)),
    ("plsda", PLSDA(n_components=5,
                    alpha=0.05,
                    scale_x=True,
                    not_assigned='UNKNOWN',
                    style='soft',
                   )
    )
])

# Specify hyperparameters associated with each step
param_grid = [{
    "pca_ia__n_components": np.arange(1, 10, 2)
    "plsda__n_components": np.arange(1, 10, 2),
    "plsda__alpha": [0.07, 0.05, 0.03, 0.01],
}]

# Create grid search cross-validation object
gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=5,
    error_score=0,
    refit=True
)

# Train
_ = gs.fit(x_train, y_train)

Iterative PLS (Missing X values)#

PCA is an unsupervised approach and only looks at the feature matrix, X, to reconstruct the missing values. If we use PLS instead this becomes a supervised method, which now considers the response variable, y, in the reconstruction.

As stated in [1]:

“For a PLS model, the IA is included into the PLS procedure as follows:

fill in missing elements with their initial estimates for fn=1:A (number of factors),

calculate the mean of X and of y,

center X and y,

calculate weights, scores and loadings,

substract the predicted X and y from the original X and y and go to step 4 end,

reconstruct X with the actual set of scores and loadings (and A factors),

fill in missing elements in X with their estimates and go to step 2 until convergence.”

Fixed Number of Components

[ ]:

from pychemauth.preprocessing.missing import PLS_IA

[ ]:

itim = PLS_IA(
    n_components=3,
    missing_values=np.nan,
    scale_x=True,
    tol=1.0e-2,
    max_iters=10000
)

[ ]:

reconstructed_X = itim.fit_transform(
    missing_X,
    raw_y.reshape(-1,1)
)
compare(raw_X, reconstructed_X, first=10)

Reconstructed   Original        Difference      Relative Err
3.746e+01       2.300e+01       1.446e+01       0.629
8.934e-01       9.953e-01       -1.019e-01      0.102
2.734e+01       1.500e+01       1.234e+01       0.823
3.071e+01       1.300e+01       1.771e+01       1.363
1.992e+00       2.620e+00       -6.286e-01      0.240
-5.122e-01      1.117e+00       -1.629e+00      1.459
2.614e+00       3.728e+00       -1.114e+00      0.299
3.827e+01       3.842e+01       -1.507e-01      0.004
-1.223e+02      -1.222e+02      -1.529e-01      0.001
3.829e+01       3.785e+01       4.413e-01       0.012

Unknown Number of Components

[ ]:

pipeline = sklearn.pipeline.Pipeline(steps=[
    ("pls_ia", PLS_IA(
        n_components=1,
        max_iters=10000,
        tol=1.0e-2,
        scale_x=False)
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
components = np.arange(1, raw_X.shape[1], 1)
param_grid = [{
    'pls_ia__n_components': components,
}]

# Let's use a simple grid search cross-validation stratgegy with 5-fold CV
gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

# Fit the PLS_IA object
_ = gs.fit(
    missing_X,
    raw_y.reshape(-1,1)
)

[ ]:

# Let's plot the results
plot_performance(gs, components)

../../_images/jupyter_api_imputing_33_0.png

PLS/IA seems to work much better than PCA/IA, as expected when comparing a supervised to an unsuprvised approach.

[ ]:

gs.best_params_

{'pls_ia__n_components': 7}

[ ]:

reconstructed_X = gs.transform(missing_X)
compare(raw_X, reconstructed_X, first=10)

Reconstructed   Original        Difference      Relative Err
3.135e+01       2.300e+01       8.348e+00       0.363
1.070e+00       9.953e-01       7.423e-02       0.075
3.133e+01       1.500e+01       1.633e+01       1.089
3.133e+01       1.300e+01       1.833e+01       1.410
2.618e+00       2.620e+00       -2.178e-03      0.001
8.100e-01       1.117e+00       -3.071e-01      0.275
2.663e+00       3.728e+00       -1.066e+00      0.286
3.994e+01       3.842e+01       1.525e+00       0.040
-1.220e+02      -1.222e+02      2.144e-01       0.002
3.823e+01       3.785e+01       3.802e-01       0.010

Below Limit of Detection (LOD)#

Another commonly encounted case where imputation is necessary is when measurements are below an instrument’s LOD. This data may be biased based on the instrument itself (e.g., could return a fixed value or have non-Gaussian behavior) and it is generally preferable to impute those value randomly to avoid introducing bias in models trained on this data.

Data in the feature matrix (X) is assumed “missing” because it is below the limit of detection (LOD). However, any values explicitly less than the LODs provided are also selected for imputation. In both cases, random values are chosen between 0 and the LOD (which must be provided by the user).

[ ]:

from pychemauth.preprocessing.missing import LOD

Missing values are only because < LOD

First, let’s consider a feature matrix where values less than LOD have been reported as NaN. We simply need to detect and impute.

[ ]:

# Hypothetical feature matrix
X = np.array(
    [
        [1.0, 2.0, 3.0, 4.0],
        [np.nan, 3.0, 2.0, np.nan],
        [5.0, 1.0, np.nan, 5.0],
        [2.0, 3.0, 4.0, 5.0]
    ]
)

# Hypothetical limit of detection for each column in the feature matrix
lod = np.array([0.01, 0.01, 1.0, 3.0])

[ ]:

imputer = LOD(
    lod,
    missing_values=np.nan,
    seed=42
)

# Each NaN is replaced with a random value < lod for each column
imputer.fit_transform(X)

array([[1.        , 2.        , 3.        , 4.        ],
       [0.00773956, 3.        , 2.        , 2.57579376],
       [5.        , 1.        , 0.43887844, 5.        ],
       [2.        , 3.        , 4.        , 5.        ]])

[ ]:

# However, any value less than LOD is automatically imputed so we don't technically need them to be flagged

X = np.array(
    [
        [1.0, 2.0, 3.0, 4.0],
        [0.001, 3.0, 2.0, 0.001],
        [5.0, 1.0, 0.001, 5.0],
        [2.0, 3.0, 4.0, 5.0]
    ]
)

imputer.transform(X)

array([[1.        , 2.        , 3.        , 4.        ],
       [0.00697368, 3.        , 2.        , 2.92686705],
       [5.        , 1.        , 0.09417735, 5.        ],
       [2.        , 3.        , 4.        , 5.        ]])

Missing values and < LOD are different

It is also possible that we may encounter cases where some data is missing and some data is < LOD.

In this case, we would may want to impute the missing data with PLS_IA or PCA_IA, but randomly impute the data that has been recorded, but is less than LOD.

[ ]:

# Now assume NaN is a missing value and there are other entries that are < LOD also present.
X = np.array(
    [
        [1.0, np.nan, 3.0, 4.0],
        [0.001, 3.0, 2.0, np.nan],
        [5.0, 1.0, np.nan, 5.0],
        [2.0, 3.0, 0.001, 5.0]
    ]
)

A naive imputer will replace BOTH the < LOD values and NaNs.

[ ]:

imputer = LOD(
    lod,
    seed=42,
#     missing_values=np.nan
)

imputer.fit_transform(X)

array([[1.00000000e+00, 4.38878440e-03, 3.00000000e+00, 4.00000000e+00],
       [7.73956049e-03, 3.00000000e+00, 2.00000000e+00, 2.82532044e-01],
       [5.00000000e+00, 1.00000000e+00, 8.58597920e-01, 5.00000000e+00],
       [2.00000000e+00, 3.00000000e+00, 6.97368029e-01, 5.00000000e+00]])

Instead, we should ignore NaN values and let the imputer only work on the values explicitly less than LOD. One option is change the default value for missing_values to something else, e.g., -1. The other option is to specify this value as ‘ignored.’

[ ]:

# Step 1: Impute < LOD values while ignoring the NaNs which will be handled later
imputer = LOD(
    lod,
    ignore=np.nan,
    seed=42
)
X_lod = imputer.fit_transform(X)
X_lod

array([[1.        ,        nan, 3.        , 4.        ],
       [0.00773956, 3.        , 2.        ,        nan],
       [5.        , 1.        ,        nan, 5.        ],
       [2.        , 3.        , 0.43887844, 5.        ]])

[ ]:

# Step 2: Now remove NaNs by doing imputation
itim = PCA_IA(
    n_components=1,
    missing_values=np.nan, # NaN's were preserved during LOD imputation
    scale_x=True,
    tol=1.0e-6,
    max_iters=5000
)
X_recon = itim.fit_transform(
    X_lod,
)
X_recon

array([[ 1.        ,  3.64184538,  3.        ,  4.        ],
       [ 0.00773956,  3.        ,  2.        ,  4.3137826 ],
       [ 5.        ,  1.        , -2.32238208,  5.        ],
       [ 2.        ,  3.        ,  0.43887844,  5.        ]])

Below is an example of using these two imputation methods together in a pipeline.

pipeline = imblearn.pipeline.Pipeline(steps=[
    ("lod", LOD(lod, ignore=np.nan, seed=42)),
    ("pca_ia", PCA_IA(n_components=9, scale_x=False)),
    ("plsda", PLSDA(n_components=5,
                    alpha=0.05,
                    scale_x=True,
                    not_assigned='UNKNOWN',
                    style='soft',
                   )
    )
])

# Specify hyperparameters associated with each step
param_grid = [{
    "pca_ia__n_components": np.arange(1, 10, 2)
    "plsda__n_components": np.arange(1, 10, 2),
    "plsda__alpha": [0.07, 0.05, 0.03, 0.01],
}]

# Create grid search cross-validation object
gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=5,
    error_score=0,
    refit=True
)

# Train
_ = gs.fit(x_train, y_train)

Imputing Missing Values and Those Below LOD

Contents

Imputing Missing Values and Those Below LOD#

Load the Data

Iterative PCA (Missing X values)#

Fixed Number of Components

Unknown Number of Components

Iterative PLS (Missing X values)#

Fixed Number of Components

Unknown Number of Components

Below Limit of Detection (LOD)#

Missing values are only because < LOD

Missing values and < LOD are different