Imputing Missing Values and Those Below LOD#
Author: Nathan A. Mahynski
Date: 2023/08/23
Description: Illustrate different ways to impute missing or bad data and implement this in pipelines.
[ ]:
if 'google.colab' in str(get_ipython()):
!pip install git+https://github.com/mahynski/pychemauth@main
import os
os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries
[ ]:
try:
import pychemauth
except:
raise ImportError("pychemauth not installed")
import matplotlib.pyplot as plt
%matplotlib inline
import watermark
%load_ext watermark
%load_ext autoreload
%autoreload 2
[ ]:
import imblearn
import sklearn
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd
[ ]:
%watermark -t -m -v --iversions
Python implementation: CPython
Python version : 3.11.4
IPython version : 8.14.0
Compiler : GCC 12.2.0
OS : Linux
Release : 6.2.0-34-generic
Machine : x86_64
Processor : x86_64
CPU cores : 40
Architecture: 64bit
numpy : 1.24.3
pandas : 1.5.3
pychemauth: 0.0.0b3
json : 2.0.9
matplotlib: 3.7.2
imblearn : 0.11.0
watermark : 2.4.3
sklearn : 1.3.0
Load the Data
Let’s load some sample data, then purposefully corrupt it so that we can try to reconstruct it with various methods.
[ ]:
from sklearn.datasets import fetch_california_housing
data_bunch = fetch_california_housing(return_X_y=False, as_frame=False)
# Let's just work with the first 2k datapoints in this example
raw_X = data_bunch.data[:2000,:]
raw_y = data_bunch.target[:2000]
# Randomly delete 5% of the entries
n_delete = int(raw_X.shape[0]*0.05)
np.random.seed(42)
row_idx = [np.random.randint(low=0, high=raw_X.shape[0])
for i in range(n_delete)]
col_idx = [np.random.randint(low=0, high=raw_X.shape[1])
for i in range(n_delete)]
missing_X = raw_X.copy()
for i,j in zip(row_idx, col_idx):
missing_X[i,j] = np.nan
We’ll use these functions to visualize the corrupted data and reconstructions later on.
[ ]:
def compare(raw_X, reconstructed_X, first=None):
print('Reconstructed\tOriginal\tDifference\tRelative Err')
for i,j in zip(row_idx[:first], col_idx[:first]):
print('%.3e\t'%reconstructed_X[i,j]
+'%.3e\t'%raw_X[i,j]
+'%.3e\t'%(reconstructed_X[i,j]-raw_X[i,j])
+'%.3f'%(np.abs((reconstructed_X[i,j]-raw_X[i,j])/raw_X[i,j]))
)
def plot_performance(gs, components):
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))
for i,ax in enumerate(axes):
ax.errorbar(
x=components,
y=-gs.cv_results_['mean_test_score'],
yerr=gs.cv_results_['std_test_score'] # Errorbars are +/- 1 standard deviation
)
if i == 0:
ax.set_yscale('log')
ax.set_title('In Log Scale')
else:
ax.set_title('Not in Log Scale')
ax.set_xlabel('Number of Components')
ax.set_ylabel('Sum Squared Error on Test Set')
Iterative PCA (Missing X values)#
PCA can be used to reconstruct the missing values using a maximum likelihood estimation method.
As stated in [1]:
“The PCA/IA algorithm in the version applied in chemometrics can be summarized as:
fill in missing elements with their initial estimates,
perform singular value decomposition of the complete data set,
reconstruct X with the predefined number of factors,
replace the missing elements with the predicted values and go to step 2 until convergence.”
References:
Fixed Number of Components
The number of components being used in the singular value decomposition is a hyperparameter that should be optimized inside a pipeline using cross-validation; however, in this first example we will just set the value for illustration.
[ ]:
from pychemauth.preprocessing.missing import PCA_IA
[ ]:
itim = PCA_IA(
n_components=3,
scale_x=True,
missing_values=np.nan,
tol=1.0e-4,
max_iters=5000
)
[ ]:
reconstructed_X = itim.fit_transform(missing_X)
compare(raw_X, reconstructed_X, first=10)
Reconstructed Original Difference Relative Err
1.528e+01 2.300e+01 -7.721e+00 0.336
1.230e+00 9.953e-01 2.350e-01 0.236
2.639e+01 1.500e+01 1.139e+01 0.760
2.772e+01 1.300e+01 1.472e+01 1.132
2.318e+00 2.620e+00 -3.020e-01 0.115
1.386e+00 1.117e+00 2.690e-01 0.241
2.400e+00 3.728e+00 -1.328e+00 0.356
3.924e+01 3.842e+01 8.229e-01 0.021
-1.222e+02 -1.222e+02 -5.861e-02 0.000
3.771e+01 3.785e+01 -1.373e-01 0.004
Unknown Number of Components
Usually, we need to figure out what a good number of components is. We can use cross-validation for this.
[ ]:
pipeline = sklearn.pipeline.Pipeline(steps=[
("pca_ia", PCA_IA(
n_components=1,
tol=1e-04,
max_iters=5000,
scale_x=True)
)
])
# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
components = np.arange(1, raw_X.shape[1], 1)
param_grid = [{
'pca_ia__n_components': components,
}]
# Let's use a simple grid search cross-validation stratgegy with 5-fold CV
gs = GridSearchCV(
estimator=pipeline,
param_grid=param_grid,
n_jobs=-1,
cv=sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0),
error_score=0,
refit=True
)
# Fit the PCA_IA object
_ = gs.fit(
missing_X,
)
[ ]:
# Let's plot the results
plot_performance(gs, components)
PCA/IA is working well when there are nearly as many components as features. In log scale it is clear that the test set error is monotonically decreasing, which means the CV object will choose the maximum number of components as the optimum.
[ ]:
gs.best_params_ # Indeed, N = 7 is what the CV object chooses
{'pca_ia__n_components': 7}
[ ]:
reconstructed_X = gs.transform(missing_X)
compare(raw_X, reconstructed_X, first=10)
Reconstructed Original Difference Relative Err
2.969e+01 2.300e+01 6.685e+00 0.291
1.087e+00 9.953e-01 9.147e-02 0.092
3.107e+01 1.500e+01 1.607e+01 1.071
3.265e+01 1.300e+01 1.965e+01 1.511
2.652e+00 2.620e+00 3.176e-02 0.012
1.020e+00 1.117e+00 -9.738e-02 0.087
2.708e+00 3.728e+00 -1.020e+00 0.274
3.824e+01 3.842e+01 -1.844e-01 0.005
-1.220e+02 -1.222e+02 1.970e-01 0.002
3.831e+01 3.785e+01 4.560e-01 0.012
You can then use this in other pipelines. You can specify the imputer without any hyperparameters in those cases, for example. Below is an example of how you might do that when using a subsquent PLS-DA model.
pipeline = imblearn.pipeline.Pipeline(steps=[
("pca_ia", PCA_IA(n_components=7, scale_x=False)),
("plsda", PLSDA(n_components=5,
alpha=0.05,
scale_x=True,
not_assigned='UNKNOWN',
style='soft',
)
)
])
# Specify hyperparameters associated with each step
param_grid = [{
"pca_ia__n_components": np.arange(1, 10, 2)
"plsda__n_components": np.arange(1, 10, 2),
"plsda__alpha": [0.07, 0.05, 0.03, 0.01],
}]
# Create grid search cross-validation object
gs = GridSearchCV(
estimator=pipeline,
param_grid=param_grid,
n_jobs=-1,
cv=5,
error_score=0,
refit=True
)
# Train
_ = gs.fit(x_train, y_train)
Iterative PLS (Missing X values)#
PCA is an unsupervised approach and only looks at the feature matrix, X, to reconstruct the missing values. If we use PLS instead this becomes a supervised method, which now considers the response variable, y, in the reconstruction.
As stated in [1]:
“For a PLS model, the IA is included into the PLS procedure as follows:
fill in missing elements with their initial estimates for fn=1:A (number of factors),
calculate the mean of X and of y,
center X and y,
calculate weights, scores and loadings,
substract the predicted X and y from the original X and y and go to step 4 end,
reconstruct X with the actual set of scores and loadings (and A factors),
fill in missing elements in X with their estimates and go to step 2 until convergence.”
Fixed Number of Components
[ ]:
from pychemauth.preprocessing.missing import PLS_IA
[ ]:
itim = PLS_IA(
n_components=3,
missing_values=np.nan,
scale_x=True,
tol=1.0e-2,
max_iters=10000
)
[ ]:
reconstructed_X = itim.fit_transform(
missing_X,
raw_y.reshape(-1,1)
)
compare(raw_X, reconstructed_X, first=10)
Reconstructed Original Difference Relative Err
3.746e+01 2.300e+01 1.446e+01 0.629
8.934e-01 9.953e-01 -1.019e-01 0.102
2.734e+01 1.500e+01 1.234e+01 0.823
3.071e+01 1.300e+01 1.771e+01 1.363
1.992e+00 2.620e+00 -6.286e-01 0.240
-5.122e-01 1.117e+00 -1.629e+00 1.459
2.614e+00 3.728e+00 -1.114e+00 0.299
3.827e+01 3.842e+01 -1.507e-01 0.004
-1.223e+02 -1.222e+02 -1.529e-01 0.001
3.829e+01 3.785e+01 4.413e-01 0.012
Unknown Number of Components
[ ]:
pipeline = sklearn.pipeline.Pipeline(steps=[
("pls_ia", PLS_IA(
n_components=1,
max_iters=10000,
tol=1.0e-2,
scale_x=False)
)
])
# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
components = np.arange(1, raw_X.shape[1], 1)
param_grid = [{
'pls_ia__n_components': components,
}]
# Let's use a simple grid search cross-validation stratgegy with 5-fold CV
gs = GridSearchCV(
estimator=pipeline,
param_grid=param_grid,
n_jobs=-1,
cv=sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0),
error_score=0,
refit=True
)
# Fit the PLS_IA object
_ = gs.fit(
missing_X,
raw_y.reshape(-1,1)
)
[ ]:
# Let's plot the results
plot_performance(gs, components)
PLS/IA seems to work much better than PCA/IA, as expected when comparing a supervised to an unsuprvised approach.
[ ]:
gs.best_params_
{'pls_ia__n_components': 7}
[ ]:
reconstructed_X = gs.transform(missing_X)
compare(raw_X, reconstructed_X, first=10)
Reconstructed Original Difference Relative Err
3.135e+01 2.300e+01 8.348e+00 0.363
1.070e+00 9.953e-01 7.423e-02 0.075
3.133e+01 1.500e+01 1.633e+01 1.089
3.133e+01 1.300e+01 1.833e+01 1.410
2.618e+00 2.620e+00 -2.178e-03 0.001
8.100e-01 1.117e+00 -3.071e-01 0.275
2.663e+00 3.728e+00 -1.066e+00 0.286
3.994e+01 3.842e+01 1.525e+00 0.040
-1.220e+02 -1.222e+02 2.144e-01 0.002
3.823e+01 3.785e+01 3.802e-01 0.010
Below Limit of Detection (LOD)#
Another commonly encounted case where imputation is necessary is when measurements are below an instrument’s LOD. This data may be biased based on the instrument itself (e.g., could return a fixed value or have non-Gaussian behavior) and it is generally preferable to impute those value randomly to avoid introducing bias in models trained on this data.
Data in the feature matrix (X) is assumed “missing” because it is below the limit of detection (LOD). However, any values explicitly less than the LODs provided are also selected for imputation. In both cases, random values are chosen between 0 and the LOD (which must be provided by the user).
[ ]:
from pychemauth.preprocessing.missing import LOD
Missing values are only because < LOD
First, let’s consider a feature matrix where values less than LOD have been reported as NaN. We simply need to detect and impute.
[ ]:
# Hypothetical feature matrix
X = np.array(
[
[1.0, 2.0, 3.0, 4.0],
[np.nan, 3.0, 2.0, np.nan],
[5.0, 1.0, np.nan, 5.0],
[2.0, 3.0, 4.0, 5.0]
]
)
# Hypothetical limit of detection for each column in the feature matrix
lod = np.array([0.01, 0.01, 1.0, 3.0])
[ ]:
imputer = LOD(
lod,
missing_values=np.nan,
seed=42
)
# Each NaN is replaced with a random value < lod for each column
imputer.fit_transform(X)
array([[1. , 2. , 3. , 4. ],
[0.00773956, 3. , 2. , 2.57579376],
[5. , 1. , 0.43887844, 5. ],
[2. , 3. , 4. , 5. ]])
[ ]:
# However, any value less than LOD is automatically imputed so we don't technically need them to be flagged
X = np.array(
[
[1.0, 2.0, 3.0, 4.0],
[0.001, 3.0, 2.0, 0.001],
[5.0, 1.0, 0.001, 5.0],
[2.0, 3.0, 4.0, 5.0]
]
)
imputer.transform(X)
array([[1. , 2. , 3. , 4. ],
[0.00697368, 3. , 2. , 2.92686705],
[5. , 1. , 0.09417735, 5. ],
[2. , 3. , 4. , 5. ]])
Missing values and < LOD are different
[ ]:
# Now assume NaN is a missing value and there are other entries that are < LOD also present.
X = np.array(
[
[1.0, np.nan, 3.0, 4.0],
[0.001, 3.0, 2.0, np.nan],
[5.0, 1.0, np.nan, 5.0],
[2.0, 3.0, 0.001, 5.0]
]
)
A naive imputer will replace BOTH the < LOD values and NaNs.
[ ]:
imputer = LOD(
lod,
seed=42,
# missing_values=np.nan
)
imputer.fit_transform(X)
array([[1.00000000e+00, 4.38878440e-03, 3.00000000e+00, 4.00000000e+00],
[7.73956049e-03, 3.00000000e+00, 2.00000000e+00, 2.82532044e-01],
[5.00000000e+00, 1.00000000e+00, 8.58597920e-01, 5.00000000e+00],
[2.00000000e+00, 3.00000000e+00, 6.97368029e-01, 5.00000000e+00]])
Instead, we should ignore NaN values and let the imputer only work on the values explicitly less than LOD. One option is change the default value for missing_values to something else, e.g., -1. The other option is to specify this value as ‘ignored.’
[ ]:
# Step 1: Impute < LOD values while ignoring the NaNs which will be handled later
imputer = LOD(
lod,
ignore=np.nan,
seed=42
)
X_lod = imputer.fit_transform(X)
X_lod
array([[1. , nan, 3. , 4. ],
[0.00773956, 3. , 2. , nan],
[5. , 1. , nan, 5. ],
[2. , 3. , 0.43887844, 5. ]])
[ ]:
# Step 2: Now remove NaNs by doing imputation
itim = PCA_IA(
n_components=1,
missing_values=np.nan, # NaN's were preserved during LOD imputation
scale_x=True,
tol=1.0e-6,
max_iters=5000
)
X_recon = itim.fit_transform(
X_lod,
)
X_recon
array([[ 1. , 3.64184538, 3. , 4. ],
[ 0.00773956, 3. , 2. , 4.3137826 ],
[ 5. , 1. , -2.32238208, 5. ],
[ 2. , 3. , 0.43887844, 5. ]])
Below is an example of using these two imputation methods together in a pipeline.
pipeline = imblearn.pipeline.Pipeline(steps=[
("lod", LOD(lod, ignore=np.nan, seed=42)),
("pca_ia", PCA_IA(n_components=9, scale_x=False)),
("plsda", PLSDA(n_components=5,
alpha=0.05,
scale_x=True,
not_assigned='UNKNOWN',
style='soft',
)
)
])
# Specify hyperparameters associated with each step
param_grid = [{
"pca_ia__n_components": np.arange(1, 10, 2)
"plsda__n_components": np.arange(1, 10, 2),
"plsda__alpha": [0.07, 0.05, 0.03, 0.01],
}]
# Create grid search cross-validation object
gs = GridSearchCV(
estimator=pipeline,
param_grid=param_grid,
n_jobs=-1,
cv=5,
error_score=0,
refit=True
)
# Train
_ = gs.fit(x_train, y_train)