Regression with PLS#

Author: Nathan A. Mahynski

Date: 2023/08/23

Description: This is an example of using PLS to create a model, following the procedures outlined in Detection of Outliers in Projection-Based Modeling,” Rodionova and Pomerantsev, Analytical Chemistry 92 (2020) 2656−2664.

Open In Colab

Figure 1 from this paper illustrates the workflow:

b13d453268074376add0ba3714f33828

[ ]:
if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/mahynski/pychemauth@main
    import os
    os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries
[ ]:
try:
    import pychemauth
except:
    raise ImportError("pychemauth not installed")

import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2
[ ]:
import imblearn
import sklearn

from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd
[ ]:
%watermark -t -m -v --iversions
Python implementation: CPython
Python version       : 3.11.4
IPython version      : 8.14.0

Compiler    : GCC 12.2.0
OS          : Linux
Release     : 6.2.0-34-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 40
Architecture: 64bit

matplotlib: 3.7.2
watermark : 2.4.3
imblearn  : 0.11.0
pandas    : 1.5.3
pychemauth: 0.0.0b3
numpy     : 1.24.3
json      : 2.0.9
sklearn   : 1.3.0

Utilities for Later

[ ]:
def plot_predictions(model, X, y):
    plt.plot(y, model.predict(X).ravel(), 'o')
    plt.plot(y, y, '-')
    plt.xlabel('Actual Salary')
    _ = plt.ylabel('Predicted Salary')

    extremes, outliers = model.check_xy_outliers(X, y)
    plt.plot(
        y[outliers],
        model.predict(X[outliers]),
        color='red',
        marker='x',
        ms=10,
        lw=0,
        label='Outliers'
    )

    plt.plot(
        y[extremes],
        model.predict(X[extremes]),
        color='yellow',
        marker='*',
        ms=10,
        lw=0,
        label='Extreme Values'
    )

    plt.legend(loc='best')

Load the Data

Load some example data downloaded from Kaggle here.

[ ]:
df = pd.read_csv('https://raw.githubusercontent.com/mahynski/pychemauth/main/docs/jupyter/data/Hitters.csv')
[ ]:
# Let's do a little clean up and only select the numerical columns.
df = df.select_dtypes(include=np.number).dropna()
[ ]:
df.head(5)
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors Salary
1 315 81 7 24 38 39 14 3449 835 69 321 414 375 632 43 10 475.0
2 479 130 18 66 72 76 3 1624 457 63 224 266 263 880 82 14 480.0
3 496 141 20 65 78 37 11 5628 1575 225 828 838 354 200 11 3 500.0
4 321 87 10 39 42 30 2 396 101 12 48 46 33 805 40 4 91.5
5 594 169 4 74 51 35 11 4408 1133 19 501 336 194 282 421 25 750.0
[ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:,:-1].values,
    df['Salary'].values, # Let's try to predict the salary based on the other numerical features.
    shuffle=True,
    random_state=42,
    test_size=0.2
)

Fitting a PLS Model#

[ ]:
from pychemauth.regressor.pls import PLS
[ ]:
model = PLS(
    n_components=3,
    alpha=0.05,
    gamma=0.01,
    scale_x=True,
    robust=True, # Use robust methods at first before the data is "clean"
    sft=False
)

_ = model.fit(X_train, y_train)
[ ]:
# Some of the training data might be considered outliers (based on total distance) which can skew the model.
_ = model.visualize(X_train, y_train, figsize=(12,4))
../../_images/jupyter_api_pls_17_0.png
[ ]:
# There might be 1 outlier in the test set, too.
_ = model.visualize(X_test, y_test, figsize=(12,4))
../../_images/jupyter_api_pls_18_0.png
[ ]:
# The model has picked out many reasonable points to consider outliers (which we shouldn't train using, and don't
# expect the model to perform well on) as well as extreme points.

# Note that based on alpha ~ 0.05 we expect about 5% of the data to be irregular.
plot_predictions(model, X_train, y_train)
../../_images/jupyter_api_pls_19_0.png

Optimizing the Model Without Data Cleaning#

Let’s use a pipeline to optimize the model by searching over many hyperparameters. This should really only be done when the dataset has been cleaned, but we will proceed here for the sake of illustration.

[ ]:
pipeline = imblearn.pipeline.Pipeline(steps=[
    ("pls", PLS(n_components=1, alpha=0.05, gamma=0.01, scale_x=True)
    )
])

param_grid = [{
    'pls__n_components':np.arange(1, 10),
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(X_train, y_train)
[ ]:
# The best parameters found can be accessed like this.
gs.best_params_
{'pls__n_components': 1}
[ ]:
gs.best_score_ # The best (default is R^2, coefficient of determination) score it recieved was.
0.4263544920180595
[ ]:
# For a 1D optimization like this (only optimizing the number of components) you can easily visualize where the best
# value occurs.
plt.errorbar(gs.cv_results_['param_pls__n_components'].data,
             gs.cv_results_['mean_test_score'],
             yerr=gs.cv_results_['std_test_score'])
plt.xlabel('n_components')
plt.ylabel(r'$R^2$')

_ = plt.axvline(gs.best_params_['pls__n_components'], color='red')
../../_images/jupyter_api_pls_25_0.png
[ ]:
# The refit=True (default) refits the model on the entire training set in the end so you can use it directly.
gs.score(X_test, y_test)
0.12406291855662177
[ ]:
plot_predictions(gs.best_estimator_.named_steps['pls'], X_test, y_test)
../../_images/jupyter_api_pls_27_0.png

Outlier Detection#

Steps 1 and 2 in the workflow at the beginning of this document are handled by the last step with CV. Now we can turn to optimizing the training set by removing outliers. Let’s use the same model as in the last example.

[ ]:
optimal_model = PLS(n_components=1, alpha=0.05, gamma=0.01, scale_x=True)
_ = optimal_model.fit(X_train, y_train)

Step 3

[ ]:
_ = optimal_model.visualize(X_train, y_train, figsize=(12,4))
../../_images/jupyter_api_pls_32_0.png
[ ]:
# Let's look at how many outliers we have.
extremes, outliers = optimal_model.check_xy_outliers(X_train, y_train)
np.sum(outliers) # Indeed, we have 1 outlier
3
[ ]:
# Let's start collecting these outliers.
X_out = X_train[outliers, :]
y_out = y_train[outliers]

Step 4

[ ]:
# Select data that is NOT an outlier (inlier and extreme points, or "regular" points) to use for training.
X_train_new = X_train[~outliers]
y_train_new = y_train[~outliers]
[ ]:
# Train again - once more, we see there are outliers.
_ = optimal_model.fit(X_train_new, y_train_new)
_ = optimal_model.visualize(X_train_new, y_train_new, figsize=(12,4))
../../_images/jupyter_api_pls_37_0.png
[ ]:
# Once again, get the outliers, continue to collect them, and train again on the updated dataset
extremes, outliers = optimal_model.check_xy_outliers(X_train_new, y_train_new)

X_out = np.vstack((X_out, X_train_new[outliers,:]))
y_out = np.concatenate((y_out, y_train_new[outliers]))

X_train_new = X_train_new[~outliers]
y_train_new = y_train_new[~outliers]

_ = optimal_model.fit(X_train_new, y_train_new)
_ = optimal_model.visualize(X_train_new, y_train_new, figsize=(12,4))
../../_images/jupyter_api_pls_38_0.png
[ ]:
# There is still 1 outlier, so let's repeat the last cell's code again.
extremes, outliers = optimal_model.check_xy_outliers(X_train_new, y_train_new)

X_out = np.vstack((X_out, X_train_new[outliers,:]))
y_out = np.concatenate((y_out, y_train_new[outliers]))

X_train_new = X_train_new[~outliers]
y_train_new = y_train_new[~outliers]

_ = optimal_model.fit(X_train_new, y_train_new)
_ = optimal_model.visualize(X_train_new, y_train_new, figsize=(12,4))
../../_images/jupyter_api_pls_39_0.png
[ ]:
# The inner loop is done and we have a candidate "clean" dataset.
# Here are the points we removed so far.
X_out
array([[  495,   151,    17,    61,    84,    78,    10,  5624,  1679,
          275,   884,  1015,   709,  1045,    88,    13],
       [  419,   101,    18,    65,    58,    92,    20,  9528,  2510,
          548,  1509,  1659,  1342,     0,     0,     0],
       [  237,    52,     0,    15,    25,    30,    24, 14053,  4256,
          160,  2165,  1314,  1566,   523,    43,     6],
       [  514,   144,     0,    67,    54,    79,     9,  4739,  1169,
           13,   583,   374,   528,   229,   453,    15],
       [  677,   238,    31,   117,   113,    53,     5,  2223,   737,
           93,   349,   401,   171,  1377,   100,     6],
       [  354,    77,    16,    36,    55,    41,    20,  8716,  2172,
          384,  1172,  1267,  1057,    83,   174,    16],
       [  507,   122,    29,    78,    85,    91,    18,  7761,  1947,
          347,  1175,  1152,  1380,   808,   108,     2],
       [  618,   200,    20,    98,   110,    62,    13,  7127,  2163,
          351,  1104,  1289,   564,   330,    16,     8]])

Step 5

[ ]:
# Let's check if the outliers we iteratively collected are still all considered outliers.
# Indeed, they are so we do not need to go back to step 6 and repeat the loop above.
_ = optimal_model.visualize(X_out, y_out, figsize=(12,4))
../../_images/jupyter_api_pls_42_0.png
[ ]:
extremes, outliers = optimal_model.check_xy_outliers(X_out, y_out)
[ ]:
np.all(outliers)
True

Step 7

[ ]:
final_model = PLS(
    n_components=1,
    alpha=0.05,
    gamma=0.01,
    scale_x=True,
    robust=False, # Now that the dataset is considered "clean" we need should use classical statistical methods
    sft=False
)

_ = final_model.fit(X_train_new, y_train_new)
[ ]:
_ = final_model.visualize(X_train_new, y_train_new, figsize=(12,4))
../../_images/jupyter_api_pls_47_0.png
[ ]:
# The classical model accepts a few outliers as extremes, but only barely.  These were determined to be outliers by
# a robust version of this model.
_ = final_model.visualize(X_out, y_out, figsize=(12,4))
../../_images/jupyter_api_pls_48_0.png
[ ]:
plot_predictions(final_model, X_train, y_train)
../../_images/jupyter_api_pls_49_0.png
[ ]:
plot_predictions(final_model, X_test, y_test)
../../_images/jupyter_api_pls_50_0.png

Automatic Loop

That was a lot of manual work and code-copying. It would be nice to have all of that automated, and in fact, the sft option does just that.

[ ]:
# The 'robust' option will be ignored here - robust=True is used during the cleaning, but after the dataset is cleaned
# the final model will be re-trained the clean dataset using robust=False

auto_model = PLS(
    n_components=1,
    alpha=0.05,
    gamma=0.01,
    scale_x=True,
    robust=True, # Will be ignored
    sft=True
)

_ = auto_model.fit(X_train, y_train)
[ ]:
auto_model.sft_history # These are the points we removed manually!
{'outer_loops': 1,
 'removed': {'X': array([[495, 151, 17, 61, 84, 78, 10, 5624, 1679, 275, 884, 1015, 709,
          1045, 88, 13],
         [419, 101, 18, 65, 58, 92, 20, 9528, 2510, 548, 1509, 1659, 1342,
          0, 0, 0],
         [237, 52, 0, 15, 25, 30, 24, 14053, 4256, 160, 2165, 1314, 1566,
          523, 43, 6],
         [514, 144, 0, 67, 54, 79, 9, 4739, 1169, 13, 583, 374, 528, 229,
          453, 15],
         [677, 238, 31, 117, 113, 53, 5, 2223, 737, 93, 349, 401, 171,
          1377, 100, 6],
         [354, 77, 16, 36, 55, 41, 20, 8716, 2172, 384, 1172, 1267, 1057,
          83, 174, 16],
         [507, 122, 29, 78, 85, 91, 18, 7761, 1947, 347, 1175, 1152, 1380,
          808, 108, 2],
         [618, 200, 20, 98, 110, 62, 13, 7127, 2163, 351, 1104, 1289, 564,
          330, 16, 8]], dtype=object),
  'y': array([2460.0, 487.5, 750.0, 1940.0, 1975.0, 200.0, 535.0, 2412.5],
        dtype=object)},
 'iterations': {1: {'initially removed X': array([[  495,   151,    17,    61,    84,    78,    10,  5624,  1679,
             275,   884,  1015,   709,  1045,    88,    13],
          [  419,   101,    18,    65,    58,    92,    20,  9528,  2510,
             548,  1509,  1659,  1342,     0,     0,     0],
          [  237,    52,     0,    15,    25,    30,    24, 14053,  4256,
             160,  2165,  1314,  1566,   523,    43,     6],
          [  514,   144,     0,    67,    54,    79,     9,  4739,  1169,
              13,   583,   374,   528,   229,   453,    15],
          [  677,   238,    31,   117,   113,    53,     5,  2223,   737,
              93,   349,   401,   171,  1377,   100,     6],
          [  354,    77,    16,    36,    55,    41,    20,  8716,  2172,
             384,  1172,  1267,  1057,    83,   174,    16],
          [  507,   122,    29,    78,    85,    91,    18,  7761,  1947,
             347,  1175,  1152,  1380,   808,   108,     2],
          [  618,   200,    20,    98,   110,    62,    13,  7127,  2163,
             351,  1104,  1289,   564,   330,    16,     8]]),
   'initially removed y': array([2460. ,  487.5,  750. , 1940. , 1975. ,  200. ,  535. , 2412.5]),
   'returned X': None,
   'returned y': None}}}
[ ]:
np.allclose(
    np.asarray(auto_model.sft_history['removed']['X'], dtype=np.float64),
    np.asarray(X_out, dtype=np.float64)
)
True
[ ]:
np.allclose(
    np.asarray(auto_model.sft_history['removed']['y'], dtype=np.float64),
    np.asarray(y_out, dtype=np.float64)
)
True
[ ]:
plot_predictions(auto_model, X_train, y_train)
../../_images/jupyter_api_pls_57_0.png
[ ]:
plot_predictions(auto_model, X_test, y_test)
../../_images/jupyter_api_pls_58_0.png

Optimizing the Model With Data Cleaning#

With the ‘sft’ option we can use pipelines to optimize the models which internally use this data cleaning. It would be a lot of work to do all of the above manually inside a cross-validation loop!

[ ]:
pipeline = imblearn.pipeline.Pipeline(steps=[
    ("pls",
     PLS(
         n_components=1,
         alpha=0.05,
         gamma=0.01,
         scale_x=True,
         sft=True
     )
    )
])

param_grid = [{
    'pls__n_components':np.arange(1, 10),
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(X_train, y_train)
[ ]:
# A few fits still fail because they are not able to perfectly clean the dataset since the loops shown in the figure
# do not converge.  Still, we are able to get generally good performance, and in this case we obtain the same result.
gs.best_params_
{'pls__n_components': 1}
[ ]:
plot_predictions(gs.best_estimator_.named_steps['pls'], X_train, y_train)
../../_images/jupyter_api_pls_63_0.png
[ ]:
plot_predictions(gs.best_estimator_.named_steps['pls'], X_test, y_test)
../../_images/jupyter_api_pls_64_0.png
[ ]:
_ = final_model.visualize(X_train, y_train, figsize=(12,4))
../../_images/jupyter_api_pls_65_0.png
[ ]:
_ = final_model.visualize(X_test, y_test, figsize=(12,4))
../../_images/jupyter_api_pls_66_0.png