Authentication with EllipticManifold (including UNEQ)#

Author: Nathan A. Mahynski

Date: 2023/08/23

Description: Illustrate modeling with the EllipticManifold model.

Open In Colab

In projection methods a dimensionality reduction is first performed to compress the data into a lower dimensional space; authentication can be performed by enveloping the projected points in an elliptical boundary (often based on Mahalanobis distance). The critical distance can be determined using a \(\chi^2\) statistics, for example. This uses “score” distance alone, while other methods like DD-SIMCA use both the score distance and an “outer distance” (error) to determine class membership.

PCA and related approaches (e.g., SIMCA) perform linear dimensionality reduction (DR) before computing the boundary which determines class membership. This authentication approach can be generalized by replacing the PCA with any non-linear DR step. This is what the EllipticManifold model allows you to do; it is essentially the application of sklearn’s EllipticEnvelope fitted around the training data after undergoing some kind of DR. It assumes the data follows a Gaussian distribution in the lower dimensional space. Not using any DR results in an UNEQ model, discussed below.

[ ]:
if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/mahynski/pychemauth@main
    import os
    os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries
[ ]:
try:
    import pychemauth
except:
    raise ImportError("pychemauth not installed")

import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2
[ ]:
import imblearn
import numpy as np
import sklearn.decomposition
import sklearn.manifold

from sklearn.model_selection import GridSearchCV

from pychemauth.manifold.elliptic import EllipticManifold_Model
[ ]:
%watermark -t -m -v --iversions
Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.85+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

matplotlib: 3.7.2
sklearn   : 1.3.0
imblearn  : 0.11.0
numpy     : 1.24.3
watermark : 2.4.3
pychemauth: 0.0.0b4

Load the Data

[ ]:
from sklearn.datasets import load_iris as load_data
X, y = load_data(return_X_y=True, as_frame=True)
[ ]:
# Let's turn the indices into names
names = dict(zip(np.arange(3), ['setosa', 'versicolor', 'virginica']))
y = y.apply(lambda x: names[x])
[ ]:
X.head(5)
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
[ ]:
from sklearn.model_selection import train_test_split
from pychemauth.preprocessing.scaling import CorrectedScaler

X_train, X_test, y_train, y_test = train_test_split(
    X.values,
    y.values, # Let's try to predict the salary based on the other numerical features.
    shuffle=True,
    random_state=0,
    test_size=0.2,
    stratify=y # It is usually important to balance the test and train set so they have the same fraction of classes
)

# Normally we would include this in a pipeline, but for these examples we will just do this ahead of time.
scaler = CorrectedScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Unequal Class Models (UNEQ) = No Dimensionality Reduction#

Unequal class models [1] are probabilistic models based on the distance from the center of ellipse fit around multivariate data. Essentially, the empirical center of the feature matrix for a class is taken as the model, and the Mahalonobis distance is computed between each point and the center. Critical values of Hotelling’s \(T^2\) statistics (aka score distance) are used to define the size of the ellipse (class space) using a predetermined confidence level. Forina and co-workers [2] suggest Beta statistics are more appropriate for defining this critical size for the training samples while the use of \(T^2\) statistics is appropriate for the test set. Here, we compute critical distances using \(\chi^2\) statistics as in DD-SIMCA and other projection methods (in the limit of a large number of samples Hotellings \(T^2\) becomes \(\chi^2\) distributed). Thus, when no dimensionality reduction is performed, this results in an UNEQ model.

  1. “Multivariate class modeling for the verification of food-authenticity claims,” Oliveri, P., and Downey, G., TrAC Trends in Anal. Chem. 35 (2012) 74-86.

  2. “UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution,” Derde, M.P., and Massart, D.L., Anal. Chim. Acta 184 (1986) 33.

  3. “Distance and class space in the UNEQ class-modeling technique,” Forina, M., Lanteri, S., and Sarabia, L., J. Chemometrics 9 (1995) 69.

[ ]:
# If we do not provide a dr_model then the EllipticManifold simply fits an ellipse in the original data space.
# For this example, let's just use the first 2 columns so that we will be able to plot the results nicely.

setosa = EllipticManifold_Model(
    alpha=0.05,
    robust=True, # Estimate the covariance matrix for the Mahalanobis distance using a robust approach (MCD)
    center='score', # Center the ellipse around the empirical mean of the projected data
)

_ = setosa.fit(
    X_train[y_train == 'setosa'][:,:2],
    y_train[y_train == 'setosa']
)
[ ]:
# No transformation occurs!
np.allclose(setosa.transform(X_train[:,:2]), X_train[:,:2])
True
[ ]:
# Here PC1 = the first column of X, and PC2 is the second column since no transformation has occured.
_ = setosa.visualize(
    [
        X_train[y_train == 'setosa'][:,:2],
        X_train[y_train == 'versicolor'][:,:2],
        X_train[y_train == 'virginica'][:,:2]
    ],
    ['setosa', 'versicolor', 'virginica']
)
../../_images/jupyter_api_elliptic_manifold_16_0.png
[ ]:
# Predict how many training examples are considered inliers
setosa.predict(X_train[y_train == 'setosa'][:,:2])
array([ True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True])
[ ]:
setosa.predict(X_train[y_train == 'virginica'][:,:2])
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])
[ ]:
setosa.predict(X_train[y_train == 'versicolor'][:,:2])
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])

Principal Components Analysis#

We do not need to use a non-linear DR method. As a first example, let’s use PCA.

[ ]:
# If more than 1 class is used during training an exception is thrown
setosa = EllipticManifold_Model(
    alpha=0.05,
    dr_model=sklearn.decomposition.PCA, # A linear approach
    kwargs={'n_components':2}, # Keywords for the dr_model
    ndims='n_components', # The keyword that corresponds to the dimensionality of space
    robust=True, # Estimate the covariance matrix for the Mahalanobis distance using a robust approach
    center='score' # Center the ellipse around the empirical mean of the projected data
)

_ = setosa.fit(
    X_train[y_train == 'setosa'],
    y_train[y_train == 'setosa']
)
[ ]:
_ = setosa.visualize(
    [
        X_train[y_train == 'setosa'],
        X_train[y_train == 'versicolor'],
        X_train[y_train == 'virginica']
    ],
    ['setosa', 'versicolor', 'virginica']
)
../../_images/jupyter_api_elliptic_manifold_22_0.png
[ ]:
# Predict how many training examples are considered inliers
setosa.predict(X_train[y_train == 'setosa'])
array([ True,  True,  True,  True,  True,  True, False,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True, False, False,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True])
[ ]:
# Predict if the other classes are considered inliers
setosa.predict(X_train[y_train != 'setosa'])
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False])
[ ]:
# The extremes plot helps you understand how Gaussian the data in the reduced dimensionality space
_ = setosa.extremes_plot(X_train[y_train == 'setosa'], upper_frac=1.0)
../../_images/jupyter_api_elliptic_manifold_25_0.png

Kernel PCA#

PCA can be made non-linear by switching to kernel PCA. See scikit-learn’s documentation for details on hyperparameters.

[ ]:
setosa = EllipticManifold_Model(
    alpha=0.05,
    dr_model=sklearn.decomposition.KernelPCA,
    kwargs={'n_components': 2, 'kernel': 'cosine'},
    ndims='n_components',
    robust=True,
    center='score'
)

_ = setosa.fit(
    X_train[y_train == 'setosa'],
    y_train[y_train == 'setosa']
)
[ ]:
_ = setosa.visualize(
    [
        X_train[y_train == 'setosa'],
        X_train[y_train == 'versicolor'],
        X_train[y_train == 'virginica']
    ],
    ['setosa', 'versicolor', 'virginica']
)
../../_images/jupyter_api_elliptic_manifold_29_0.png
[ ]:
_ = setosa.extremes_plot(X_train[y_train == 'setosa'], upper_frac=1.0)
../../_images/jupyter_api_elliptic_manifold_30_0.png

Isometric Mapping#

Isometric mapping is another non-linear DR approach. You can read more in scikit-learn’s documentation.

[ ]:
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16,6))
ax = axes.flatten()

idx = 0
for dims in [1, 2]:
    for nebrs in [5, 10, 15, 20]:
        setosa = EllipticManifold_Model(
            alpha=0.05,
            dr_model=sklearn.manifold.Isomap,
            kwargs={"n_components": dims, "metric":'minkowski', "p":2, "n_neighbors":nebrs},
            ndims="n_components",
            robust=True,
            center='score'
        )
        _ = setosa.fit(X_train[y_train == 'setosa'])

        _ = setosa.visualize(
            [
                X_train[y_train == 'setosa'],
                X_train[y_train == 'versicolor'],
                X_train[y_train == 'virginica']
            ],
            ['setosa', 'versicolor', 'virginica'],
            ax=ax[idx]
        )

        test_sns = np.sum(setosa.predict(X_test[y_test == 'setosa'])) / np.sum(y_test == 'setosa')
        test_sps = 1.0 - np.sum(setosa.predict(X_test[y_test != 'setosa'])) / np.sum(y_test != 'setosa')

        ax[idx].set_title('Nebrs = {}, Test SNS = {}, SPS = {}'.format(nebrs, '%.2f'%test_sns, '%.2f'%test_sps))
        idx += 1
plt.tight_layout()
../../_images/jupyter_api_elliptic_manifold_33_0.png

Locally Linear Embedding#

Locally linear embedding mapping is another non-linear DR approach. You can read more in scikit-learn’s documentation.

[ ]:
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16,6))
ax = axes.flatten()

idx = 0
for dims in [1, 2]:
    for nebrs in [5, 10, 15, 20]:
        setosa = EllipticManifold_Model(
            alpha=0.05,
            dr_model=sklearn.manifold.LocallyLinearEmbedding,
            kwargs={"n_components": dims, "n_neighbors":nebrs},
            ndims="n_components",
            robust=True,
            center='score'
        )
        _ = setosa.fit(X_train[y_train == 'setosa'])

        _ = setosa.visualize(
            [
                X_train[y_train == 'setosa'],
                X_train[y_train == 'versicolor'],
                X_train[y_train == 'virginica']
            ],
            ['setosa', 'versicolor', 'virginica'],
            ax=ax[idx]
        )

        test_sns = np.sum(setosa.predict(X_test[y_test == 'setosa'])) / np.sum(y_test == 'setosa')
        test_sps = 1.0 - np.sum(setosa.predict(X_test[y_test != 'setosa'])) / np.sum(y_test != 'setosa')

        ax[idx].set_title('Nebrs = {}, Test SNS = {}, SPS = {}'.format(nebrs, '%.2f'%test_sns, '%.2f'%test_sps))
        idx += 1
plt.tight_layout()
../../_images/jupyter_api_elliptic_manifold_36_0.png

UMAP Example#

UMAP has a lot of parameters that should be understood before using it. See the documentation for explanation. Briefly, there are 4 that matter the most:

  • n_neighbors

  • n_components

  • metric

  • min_dist

IMPORTANTLY you should always set random_state to an integer value to ensure reproducibility between runs since UMAP is stochastic.

[ ]:
import umap
[ ]:
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16,6))
ax = axes.flatten()

idx = 0
for dims in [1, 2]:
    for nebrs in [5, 10, 15, 20]:
        setosa = EllipticManifold_Model(
            alpha=0.05,
            dr_model=umap.UMAP,
            kwargs={
                "n_components": dims,
                "n_neighbors": nebrs,
                "random_state": 0, # Always set this for reproducibility
                "metric": "euclidean",
                "min_dist": 1.0
            },
            ndims="n_components",
            robust=True,
            center='score'
        )
        _ = setosa.fit(X_train[y_train == 'setosa'])

        _ = setosa.visualize(
            [
                X_train[y_train == 'setosa'],
                X_train[y_train == 'versicolor'],
                X_train[y_train == 'virginica']
            ],
            ['setosa', 'versicolor', 'virginica'],
            ax=ax[idx]
        )

        test_sns = np.sum(setosa.predict(X_test[y_test == 'setosa'])) / np.sum(y_test == 'setosa')
        test_sps = 1.0 - np.sum(setosa.predict(X_test[y_test != 'setosa'])) / np.sum(y_test != 'setosa')

        ax[idx].set_title('Nebrs = {}, Test SNS = {}, SPS = {}'.format(nebrs, '%.2f'%test_sns, '%.2f'%test_sps))
        idx += 1
plt.tight_layout()
../../_images/jupyter_api_elliptic_manifold_40_0.png

Building an Authenticator#

In training loops and other applications it is convenient to not have to separate classes for training. This is one thing that EllipticManifold_Authenticator conveniently does for you, which is illustrated below. Here, we simply specify the class we care about and all others are automatically ignored like we did manually above.

Another factor we need to consider is that there are 2 ways to train an authentication model [1]. A “rigorous” approach would be to train a model considering only the true examples of a known class. We could select a desired alpha value and adjust n_components to try to achieve this, since it is the desired type I error rate. This is essentially like setting the accuracy for true positives (TSNS), while ignoring the behavior with respect to alternative classes.

Alternatively, we could use examples of alternatives and adjust our hyperparameters to optimize an overall metric like total efficiency (TEFF) which considers how often we get true positives correct (TSNS) and how often we correctly reject the true negatives (TSPS). This is a “compliant” approach. These sorts of models use more information and might seem more performant, but this data is always biased by the set of known alternatives we use. Since there are an infinite number of inauthentic things, this introduces some bias which can be difficult to quantify.

[1] “Rigorous and compliant approaches to one-class classification,” Rodionova, O., Oliveri, P., and Pomerantsev, A. Chem. and Intell. Lab. Sys. (2016) 89-96.

[ ]:
from pychemauth.manifold.elliptic import EllipticManifold_Authenticator
[ ]:
# First let's consider a compliant model
compliant = EllipticManifold_Authenticator(
    alpha=0.05,
    dr_model=sklearn.decomposition.PCA, # A linear approach
    kwargs={'n_components':2}, # Keywords for the dr_model
    ndims='n_components', # The keyword that corresponds to the dimensionality of space
    robust=True, # Estimate the covariance matrix for the Mahalanobis distance using a robust approach
    center='score', # Center the ellipse around the empirical mean of the projected data
    target_class='setosa', # Tell the classifier which class to use
    use='compliant' # Using a compliant approach will use the other classes provided at training time
)
[ ]:
# Note that y_train should contain examples of the target_class or else an exception will be thrown.  Otherwise,
# all other classes will be used by the model to optimize TEFF
_ = compliant.fit(X_train, y_train)
[ ]:
compliant.metrics(X_train, y_train)
{'TEFF': 0.9082951062292475,
 'TSNS': 0.825,
 'TSPS': 1.0,
 'CSPS': {'versicolor': 1.0, 'virginica': 1.0}}
[ ]:
# Now consider a rigorous model
rigorous = EllipticManifold_Authenticator(
    alpha=0.05,
    dr_model=sklearn.decomposition.PCA, # A linear approach
    kwargs={'n_components':2}, # Keywords for the dr_model
    ndims='n_components', # The keyword that corresponds to the dimensionality of space
    robust=True, # Estimate the covariance matrix for the Mahalanobis distance using a robust approach
    center='score', # Center the ellipse around the empirical mean of the projected data
    target_class='setosa', # Tell the classifier which class to use
    use='rigorous' # Using a compliant approach will use the other classes provided at training time
)
[ ]:
_ = rigorous.fit(X_train, y_train)
[ ]:
rigorous.metrics(X_train, y_train)
{'TEFF': 0.9082951062292475,
 'TSNS': 0.825,
 'TSPS': 1.0,
 'CSPS': {'versicolor': 1.0, 'virginica': 1.0}}

The metrics are identical! This is because the choice of rigorous vs. compliant has nothing to do with the training of the model. This is important when we optimize the model since it determines what the optimal model looks like; i.e., should it be the one with the TSNS ~ 1-alpha, or the one with the best TEFF?

[ ]:
# The compliant model's score is the TEFF.
compliant.score(X_train, y_train)
0.9082951062292475
[ ]:
# We score a rigorous model as -(TSNS - (1-alpha))^2, so that a higher score is better (least negative) and a score of
# 0 corresponds to hitting the target exactly.
rigorous.score(X_train, y_train)
-0.015625

Optimizing the Authenticator#

Compliant Model

[ ]:
compliant_pipeline = imblearn.pipeline.Pipeline(
    steps=[
        ("model", EllipticManifold_Authenticator(
            alpha=0.05,
            dr_model=sklearn.decomposition.PCA, # A linear approach
            kwargs={'n_components':1}, # Keywords for the dr_model
            ndims='n_components', # The keyword that corresponds to the dimensionality of space
            robust=True, # Estimate the covariance matrix for the Mahalanobis distance using a robust approach
            center='score', # Center the ellipse around the empirical mean of the projected data
            target_class='setosa', # Tell the classifier which class to use
            use='compliant' # Using a compliant approach will use the other classes provided at training time
        )
    )
])

param_grid = [{
    'model__kwargs':[{'n_components':i} for i in [1, 2, 3, 4]],
    'model__alpha': np.logspace(-3, np.log10(0.5), 10), # For compliant models we can also adjust alpha to increase TEFF
}]

gs = GridSearchCV(
    estimator=compliant_pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(X_train, y_train)
[ ]:
gs.best_params_
{'model__alpha': 0.001, 'model__kwargs': {'n_components': 2}}
[ ]:
print("Test set TEFF = {}; training set TEFF = {}".format(
    '%.3f'%(gs.score(X_test, y_test)),
    '%.3f'%(gs.score(X_train, y_train)),
    )
)
Test set TEFF = 1.000; training set TEFF = 1.000
[ ]:
_ = gs.best_estimator_["model"].model.visualize(
    [
        X_train[y_train == 'setosa'],
        X_train[y_train == 'versicolor'],
        X_train[y_train == 'virginica']
    ],
    ['setosa', 'versicolor', 'virginica']
)
../../_images/jupyter_api_elliptic_manifold_57_0.png
[ ]:
gs.best_estimator_["model"].metrics(X_train, y_train)
{'TEFF': 1.0,
 'TSNS': 1.0,
 'TSPS': 1.0,
 'CSPS': {'versicolor': 1.0, 'virginica': 1.0}}
[ ]:
gs.best_estimator_["model"].metrics(X_test, y_test)
{'TEFF': 1.0,
 'TSNS': 1.0,
 'TSPS': 1.0,
 'CSPS': {'versicolor': 1.0, 'virginica': 1.0}}

Rigorous Model

[ ]:
rigorous_pipeline = imblearn.pipeline.Pipeline(
    steps=[
        ("model", EllipticManifold_Authenticator(
            alpha=0.05,
            dr_model=sklearn.decomposition.PCA, # A linear approach
            kwargs={'n_components':1}, # Keywords for the dr_model
            ndims='n_components', # The keyword that corresponds to the dimensionality of space
            robust=True, # Estimate the covariance matrix for the Mahalanobis distance using a robust approach
            center='score', # Center the ellipse around the empirical mean of the projected data
            target_class='setosa', # Tell the classifier which class to use
            use='rigorous' # Using a rigorous approach will use the other classes provided at training time
        )
    )
])

param_grid = [{
    'model__kwargs':[{'n_components':i} for i in [1, 2, 3, 4]], # Now alpha is fixed and we adjust n_components to reach it
}]

gs = GridSearchCV(
    estimator=rigorous_pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(X_train, y_train)
[ ]:
gs.best_params_
{'model__kwargs': {'n_components': 1}}
[ ]:
print("Test set TEFF = {}; training set TEFF = {}".format(
    '%.4f'%(gs.score(X_test, y_test)),
    '%.4f'%(gs.score(X_train, y_train)),
    )
)
Test set TEFF = -0.0025; training set TEFF = -0.0025
[ ]:
# We can manually compute these scores
tsns_test = np.sum(gs.predict(X_test[y_test == 'setosa'])) / np.sum(y_test == 'setosa')
(tsns_test - (1.0 - 0.05))**2
0.0025000000000000044
[ ]:
# We can manually compute these scores
tsns_train = np.sum(gs.predict(X_train[y_train == 'setosa'])) / np.sum(y_train == 'setosa')
(tsns_train - (1.0 - 0.05))**2
0.0024999999999999935
[ ]:
_ = gs.best_estimator_["model"].model.visualize(
    [
        X_train[y_train == 'setosa'],
        X_train[y_train == 'versicolor'],
        X_train[y_train == 'virginica']
    ],
    ['setosa', 'versicolor', 'virginica']
)
../../_images/jupyter_api_elliptic_manifold_66_0.png
[ ]:
gs.best_estimator_["model"].metrics(X_train, y_train)
{'TEFF': 0.4107919181288746,
 'TSNS': 0.9,
 'TSPS': 0.1875,
 'CSPS': {'versicolor': 0.275, 'virginica': 0.09999999999999998}}
[ ]:
gs.best_estimator_["model"].metrics(X_test, y_test)
{'TEFF': 0.22360679774997907,
 'TSNS': 1.0,
 'TSPS': 0.050000000000000044,
 'CSPS': {'versicolor': 0.09999999999999998, 'virginica': 0.0}}