Regression with PLS#

Author: Nathan A. Mahynski

Date: 2023/08/23

Description: This is an example of using PLS to create a model, following the procedures outlined in Detection of Outliers in Projection-Based Modeling,” Rodionova and Pomerantsev, Analytical Chemistry 92 (2020) 2656−2664.

Figure 1 from this paper illustrates the workflow:

b13d453268074376add0ba3714f33828

[ ]:

if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/mahynski/pychemauth@main
    import os
    os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries

[ ]:

try:
    import pychemauth
except:
    raise ImportError("pychemauth not installed")

import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2

[ ]:

import imblearn
import sklearn

from sklearn.model_selection import GridSearchCV

import numpy as np
import pandas as pd

[ ]:

%watermark -t -m -v --iversions

Python implementation: CPython
Python version       : 3.11.4
IPython version      : 8.14.0

Compiler    : GCC 12.2.0
OS          : Linux
Release     : 6.2.0-34-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 40
Architecture: 64bit

matplotlib: 3.7.2
watermark : 2.4.3
imblearn  : 0.11.0
pandas    : 1.5.3
pychemauth: 0.0.0b3
numpy     : 1.24.3
json      : 2.0.9
sklearn   : 1.3.0

Utilities for Later

[ ]:

def plot_predictions(model, X, y):
    plt.plot(y, model.predict(X).ravel(), 'o')
    plt.plot(y, y, '-')
    plt.xlabel('Actual Salary')
    _ = plt.ylabel('Predicted Salary')

    extremes, outliers = model.check_xy_outliers(X, y)
    plt.plot(
        y[outliers],
        model.predict(X[outliers]),
        color='red',
        marker='x',
        ms=10,
        lw=0,
        label='Outliers'
    )

    plt.plot(
        y[extremes],
        model.predict(X[extremes]),
        color='yellow',
        marker='*',
        ms=10,
        lw=0,
        label='Extreme Values'
    )

    plt.legend(loc='best')

Load the Data

Load some example data downloaded from Kaggle here.

[ ]:

df = pd.read_csv('https://raw.githubusercontent.com/mahynski/pychemauth/main/docs/jupyter/data/Hitters.csv')

[ ]:

# Let's do a little clean up and only select the numerical columns.
df = df.select_dtypes(include=np.number).dropna()

[ ]:

df.head(5)

	AtBat	Hits	HmRun	Runs	RBI	Walks	Years	CAtBat	CHits	CHmRun	CRuns	CRBI	CWalks	PutOuts	Assists	Errors	Salary
1	315	81	7	24	38	39	14	3449	835	69	321	414	375	632	43	10	475.0
2	479	130	18	66	72	76	3	1624	457	63	224	266	263	880	82	14	480.0
3	496	141	20	65	78	37	11	5628	1575	225	828	838	354	200	11	3	500.0
4	321	87	10	39	42	30	2	396	101	12	48	46	33	805	40	4	91.5
5	594	169	4	74	51	35	11	4408	1133	19	501	336	194	282	421	25	750.0

[ ]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:,:-1].values,
    df['Salary'].values, # Let's try to predict the salary based on the other numerical features.
    shuffle=True,
    random_state=42,
    test_size=0.2
)

Fitting a PLS Model#

[ ]:

from pychemauth.regressor.pls import PLS

[ ]:

model = PLS(
    n_components=3,
    alpha=0.05,
    gamma=0.01,
    scale_x=True,
    robust=True, # Use robust methods at first before the data is "clean"
    sft=False
)

_ = model.fit(X_train, y_train)

[ ]:

# Some of the training data might be considered outliers (based on total distance) which can skew the model.
_ = model.visualize(X_train, y_train, figsize=(12,4))

[ ]:

# There might be 1 outlier in the test set, too.
_ = model.visualize(X_test, y_test, figsize=(12,4))

[ ]:

# The model has picked out many reasonable points to consider outliers (which we shouldn't train using, and don't
# expect the model to perform well on) as well as extreme points.

# Note that based on alpha ~ 0.05 we expect about 5% of the data to be irregular.
plot_predictions(model, X_train, y_train)

Optimizing the Model Without Data Cleaning#

Let’s use a pipeline to optimize the model by searching over many hyperparameters. This should really only be done when the dataset has been cleaned, but we will proceed here for the sake of illustration.

[ ]:

pipeline = imblearn.pipeline.Pipeline(steps=[
    ("pls", PLS(n_components=1, alpha=0.05, gamma=0.01, scale_x=True)
    )
])

param_grid = [{
    'pls__n_components':np.arange(1, 10),
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(X_train, y_train)

[ ]:

# The best parameters found can be accessed like this.
gs.best_params_

{'pls__n_components': 1}

[ ]:

gs.best_score_ # The best (default is R^2, coefficient of determination) score it recieved was.

0.4263544920180595

[ ]:

# For a 1D optimization like this (only optimizing the number of components) you can easily visualize where the best
# value occurs.
plt.errorbar(gs.cv_results_['param_pls__n_components'].data,
             gs.cv_results_['mean_test_score'],
             yerr=gs.cv_results_['std_test_score'])
plt.xlabel('n_components')
plt.ylabel(r'$R^2$')

_ = plt.axvline(gs.best_params_['pls__n_components'], color='red')

[ ]:

# The refit=True (default) refits the model on the entire training set in the end so you can use it directly.
gs.score(X_test, y_test)

0.12406291855662177

[ ]:

plot_predictions(gs.best_estimator_.named_steps['pls'], X_test, y_test)

Outlier Detection#

Steps 1 and 2 in the workflow at the beginning of this document are handled by the last step with CV. Now we can turn to optimizing the training set by removing outliers. Let’s use the same model as in the last example.

[ ]:

optimal_model = PLS(n_components=1, alpha=0.05, gamma=0.01, scale_x=True)
_ = optimal_model.fit(X_train, y_train)

Step 3

[ ]:

_ = optimal_model.visualize(X_train, y_train, figsize=(12,4))

[ ]:

# Let's look at how many outliers we have.
extremes, outliers = optimal_model.check_xy_outliers(X_train, y_train)
np.sum(outliers) # Indeed, we have 1 outlier

[ ]:

# Let's start collecting these outliers.
X_out = X_train[outliers, :]
y_out = y_train[outliers]

Step 4

[ ]:

# Select data that is NOT an outlier (inlier and extreme points, or "regular" points) to use for training.
X_train_new = X_train[~outliers]
y_train_new = y_train[~outliers]

[ ]:

# Train again - once more, we see there are outliers.
_ = optimal_model.fit(X_train_new, y_train_new)
_ = optimal_model.visualize(X_train_new, y_train_new, figsize=(12,4))

[ ]:

# Once again, get the outliers, continue to collect them, and train again on the updated dataset
extremes, outliers = optimal_model.check_xy_outliers(X_train_new, y_train_new)

X_out = np.vstack((X_out, X_train_new[outliers,:]))
y_out = np.concatenate((y_out, y_train_new[outliers]))

X_train_new = X_train_new[~outliers]
y_train_new = y_train_new[~outliers]

_ = optimal_model.fit(X_train_new, y_train_new)
_ = optimal_model.visualize(X_train_new, y_train_new, figsize=(12,4))

[ ]:

# There is still 1 outlier, so let's repeat the last cell's code again.
extremes, outliers = optimal_model.check_xy_outliers(X_train_new, y_train_new)

X_out = np.vstack((X_out, X_train_new[outliers,:]))
y_out = np.concatenate((y_out, y_train_new[outliers]))

X_train_new = X_train_new[~outliers]
y_train_new = y_train_new[~outliers]

_ = optimal_model.fit(X_train_new, y_train_new)
_ = optimal_model.visualize(X_train_new, y_train_new, figsize=(12,4))

[ ]:

# The inner loop is done and we have a candidate "clean" dataset.
# Here are the points we removed so far.
X_out

array([[  495,   151,    17,    61,    84,    78,    10,  5624,  1679,
          275,   884,  1015,   709,  1045,    88,    13],
       [  419,   101,    18,    65,    58,    92,    20,  9528,  2510,
          548,  1509,  1659,  1342,     0,     0,     0],
       [  237,    52,     0,    15,    25,    30,    24, 14053,  4256,
          160,  2165,  1314,  1566,   523,    43,     6],
       [  514,   144,     0,    67,    54,    79,     9,  4739,  1169,
           13,   583,   374,   528,   229,   453,    15],
       [  677,   238,    31,   117,   113,    53,     5,  2223,   737,
           93,   349,   401,   171,  1377,   100,     6],
       [  354,    77,    16,    36,    55,    41,    20,  8716,  2172,
          384,  1172,  1267,  1057,    83,   174,    16],
       [  507,   122,    29,    78,    85,    91,    18,  7761,  1947,
          347,  1175,  1152,  1380,   808,   108,     2],
       [  618,   200,    20,    98,   110,    62,    13,  7127,  2163,
          351,  1104,  1289,   564,   330,    16,     8]])

Step 5

[ ]:

# Let's check if the outliers we iteratively collected are still all considered outliers.
# Indeed, they are so we do not need to go back to step 6 and repeat the loop above.
_ = optimal_model.visualize(X_out, y_out, figsize=(12,4))

[ ]:

extremes, outliers = optimal_model.check_xy_outliers(X_out, y_out)

[ ]:

np.all(outliers)

True

Step 7

[ ]:

final_model = PLS(
    n_components=1,
    alpha=0.05,
    gamma=0.01,
    scale_x=True,
    robust=False, # Now that the dataset is considered "clean" we need should use classical statistical methods
    sft=False
)

_ = final_model.fit(X_train_new, y_train_new)

[ ]:

_ = final_model.visualize(X_train_new, y_train_new, figsize=(12,4))

[ ]:

# The classical model accepts a few outliers as extremes, but only barely.  These were determined to be outliers by
# a robust version of this model.
_ = final_model.visualize(X_out, y_out, figsize=(12,4))

[ ]:

plot_predictions(final_model, X_train, y_train)

[ ]:

plot_predictions(final_model, X_test, y_test)

Automatic Loop

That was a lot of manual work and code-copying. It would be nice to have all of that automated, and in fact, the sft option does just that.

[ ]:

# The 'robust' option will be ignored here - robust=True is used during the cleaning, but after the dataset is cleaned
# the final model will be re-trained the clean dataset using robust=False

auto_model = PLS(
    n_components=1,
    alpha=0.05,
    gamma=0.01,
    scale_x=True,
    robust=True, # Will be ignored
    sft=True
)

_ = auto_model.fit(X_train, y_train)

[ ]:

auto_model.sft_history # These are the points we removed manually!

{'outer_loops': 1,
 'removed': {'X': array([[495, 151, 17, 61, 84, 78, 10, 5624, 1679, 275, 884, 1015, 709,
          1045, 88, 13],
         [419, 101, 18, 65, 58, 92, 20, 9528, 2510, 548, 1509, 1659, 1342,
          0, 0, 0],
         [237, 52, 0, 15, 25, 30, 24, 14053, 4256, 160, 2165, 1314, 1566,
          523, 43, 6],
         [514, 144, 0, 67, 54, 79, 9, 4739, 1169, 13, 583, 374, 528, 229,
          453, 15],
         [677, 238, 31, 117, 113, 53, 5, 2223, 737, 93, 349, 401, 171,
          1377, 100, 6],
         [354, 77, 16, 36, 55, 41, 20, 8716, 2172, 384, 1172, 1267, 1057,
          83, 174, 16],
         [507, 122, 29, 78, 85, 91, 18, 7761, 1947, 347, 1175, 1152, 1380,
          808, 108, 2],
         [618, 200, 20, 98, 110, 62, 13, 7127, 2163, 351, 1104, 1289, 564,
          330, 16, 8]], dtype=object),
  'y': array([2460.0, 487.5, 750.0, 1940.0, 1975.0, 200.0, 535.0, 2412.5],
        dtype=object)},
 'iterations': {1: {'initially removed X': array([[  495,   151,    17,    61,    84,    78,    10,  5624,  1679,
             275,   884,  1015,   709,  1045,    88,    13],
          [  419,   101,    18,    65,    58,    92,    20,  9528,  2510,
             548,  1509,  1659,  1342,     0,     0,     0],
          [  237,    52,     0,    15,    25,    30,    24, 14053,  4256,
             160,  2165,  1314,  1566,   523,    43,     6],
          [  514,   144,     0,    67,    54,    79,     9,  4739,  1169,
              13,   583,   374,   528,   229,   453,    15],
          [  677,   238,    31,   117,   113,    53,     5,  2223,   737,
              93,   349,   401,   171,  1377,   100,     6],
          [  354,    77,    16,    36,    55,    41,    20,  8716,  2172,
             384,  1172,  1267,  1057,    83,   174,    16],
          [  507,   122,    29,    78,    85,    91,    18,  7761,  1947,
             347,  1175,  1152,  1380,   808,   108,     2],
          [  618,   200,    20,    98,   110,    62,    13,  7127,  2163,
             351,  1104,  1289,   564,   330,    16,     8]]),
   'initially removed y': array([2460. ,  487.5,  750. , 1940. , 1975. ,  200. ,  535. , 2412.5]),
   'returned X': None,
   'returned y': None}}}

[ ]:

np.allclose(
    np.asarray(auto_model.sft_history['removed']['X'], dtype=np.float64),
    np.asarray(X_out, dtype=np.float64)
)

True

[ ]:

np.allclose(
    np.asarray(auto_model.sft_history['removed']['y'], dtype=np.float64),
    np.asarray(y_out, dtype=np.float64)
)

True

[ ]:

plot_predictions(auto_model, X_train, y_train)

[ ]:

plot_predictions(auto_model, X_test, y_test)

Optimizing the Model With Data Cleaning#

With the ‘sft’ option we can use pipelines to optimize the models which internally use this data cleaning. It would be a lot of work to do all of the above manually inside a cross-validation loop!

[ ]:

pipeline = imblearn.pipeline.Pipeline(steps=[
    ("pls",
     PLS(
         n_components=1,
         alpha=0.05,
         gamma=0.01,
         scale_x=True,
         sft=True
     )
    )
])

param_grid = [{
    'pls__n_components':np.arange(1, 10),
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.KFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(X_train, y_train)

[ ]:

# A few fits still fail because they are not able to perfectly clean the dataset since the loops shown in the figure
# do not converge.  Still, we are able to get generally good performance, and in this case we obtain the same result.
gs.best_params_

{'pls__n_components': 1}

[ ]:

plot_predictions(gs.best_estimator_.named_steps['pls'], X_train, y_train)

[ ]:

plot_predictions(gs.best_estimator_.named_steps['pls'], X_test, y_test)

[ ]:

_ = final_model.visualize(X_train, y_train, figsize=(12,4))

[ ]:

_ = final_model.visualize(X_test, y_test, figsize=(12,4))

Regression with PLS

Contents