Python-based Chemometric Authentication#

This toolkit performs chemometric analysis, with a primary focus on authentication. Its methods follow scikit-learn’s estimator API so that they can be deployed in pipelines, used with GridSearchCV and similar tools, and combined with workflows involving other modern machine learning (ML) tools. Wikipedia defines chemometrics as “the science of extracting information from chemical systems by data-driven means.” Unlike in many other areas of science, technology, and engineering, measurements on many chemical systems remain difficult to collect, which makes data scarcer than in other arenas. As a result, conventional statistical methods remain the predominant tools for chemometric analysis. As instruments improve, databases are developed, and advanced algorithms become less data-intensive, it is clear that modern machine learning and artificial intelligence (AI) methods will be brought to bear on these problems. A consistent API enables many different models to be easily deployed and compared.
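To illustrate what following scikit-learn’s estimator API buys you, here is a minimal sketch of a custom one-class model used inside a Pipeline and tuned with GridSearchCV. The `DistanceAuthenticator` class, its `quantile` parameter, and the synthetic data are hypothetical and are not part of PyChemAuth; only the scikit-learn classes are real.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DistanceAuthenticator(ClassifierMixin, BaseEstimator):
    """Hypothetical one-class model (not part of PyChemAuth).

    Accepts a sample (+1) if it lies within a radius of the training mean,
    otherwise rejects it (-1).
    """

    def __init__(self, quantile=0.95):
        self.quantile = quantile

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.center_ = X.mean(axis=0)
        distances = np.linalg.norm(X - self.center_, axis=1)
        # Radius enclosing the chosen fraction of the training samples.
        self.radius_ = np.quantile(distances, self.quantile)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        distances = np.linalg.norm(X - self.center_, axis=1)
        return np.where(distances <= self.radius_, 1, -1)


# Synthetic "authentic" samples only; every training label is +1.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 4))
y_train = np.ones(60, dtype=int)

pipeline = Pipeline([("scale", StandardScaler()), ("model", DistanceAuthenticator())])
search = GridSearchCV(pipeline, {"model__quantile": [0.8, 0.9, 0.99]}, cv=3)
search.fit(X_train, y_train)
```

Because only authentic samples are scored here, the default accuracy is simply the fraction of held-out genuine samples that are accepted, so this search favors large quantiles. A real workflow would use a metric that also penalizes accepting anomalies.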

Authentication is typically a one-class classification (OCC), or class modeling, approach designed to detect anomalies. This contrasts with conventional multi-class classification (discriminative) models, which involve supervised learning of multiple classes to distinguish between them; the primary weakness of such a model is that it cannot recognize when a new sample belongs to none of the classes it was trained on. Within the context of anomaly detection, scikit-learn differentiates between outlier detection and novelty detection. In outlier detection, the training data is considered polluted and certain samples need to be detected and removed, whereas novelty detection methods assume the training data is “clean” and anomalies need to be detected only when new samples are tested. Both are important in the context of authentication models; here is a nice repository for more anomaly detection resources.
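The distinction between the two modes can be sketched with scikit-learn’s LocalOutlierFactor, which supports both. The synthetic data and the planted anomaly at (8, 8) are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 2))
# Training set polluted with one planted anomaly (illustrative assumption).
contaminated = np.vstack([clean, [[8.0, 8.0]]])

# Outlier detection: the training data itself is screened.
# fit_predict labels each training sample +1 (inlier) or -1 (outlier).
outlier_detector = LocalOutlierFactor(n_neighbors=20)
train_labels = outlier_detector.fit_predict(contaminated)

# Novelty detection: the training data is assumed clean, and only
# new samples are labeled after fitting.
novelty_detector = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(clean)
new_labels = novelty_detector.predict(np.array([[0.0, 0.0], [8.0, 8.0]]))
```

In the first case the planted point is flagged among the training samples; in the second the model never sees it during training and flags it only at test time.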

Out-of-distribution (OOD) detection is a more general term which encompasses these and other tasks, such as open-set recognition. A taxonomy describing how these tasks are interrelated can be found here and further reading here. Detecting distribution shifts in the data at test-time is critical to building safe, reliable models deployed in real (open) world settings.

License Information#

See LICENSE.md for more information.
Any mention of commercial products is for information only; it does not imply recommendation or endorsement by NIST.

Core Capabilities#

Exploratory Data Analysis#

You should always perform exploratory data analysis to understand your data. This includes, for example, checking for missing values, NaN, and inf entries, and computing basic descriptive statistics. While Python packages like Pandas and DTale are excellent pre-existing tools, the eda package herein contains additional methods.
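As a minimal sketch of these first checks using Pandas, the snippet below counts NaN and inf entries per column and then summarizes the finite values. The column names and numbers are made up for illustration; the eda package’s own methods are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one NaN and one inf entry.
df = pd.DataFrame(
    {"Ca": [1.2, np.nan, 0.9, 1.1], "Fe": [0.3, 0.4, np.inf, 0.2]}
)

# Count missing (NaN) values in each column.
n_missing = df.isna().sum()

# Count infinite values in each column.
n_infinite = pd.Series(np.isinf(df.to_numpy()).sum(axis=0), index=df.columns)

# Treat inf as missing so the descriptive statistics stay finite.
summary = df.replace([np.inf, -np.inf], np.nan).describe()
```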

Preprocessors#

scikit-learn provides a number of simple preprocessing and normalization steps, including data standardization and imputation approaches. PyChemAuth extends these to include:

Imputing Missing Data#

Expectation Maximization with Iterative PCA (missing X values)
Expectation Maximization with Iterative PLS (missing X values)
Limit of Detection (randomly below LOD)
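The limit-of-detection idea can be sketched as follows. This is not PyChemAuth’s implementation: the function name `impute_below_lod` is hypothetical, and the sketch assumes that below-LOD measurements are recorded as NaN and are replaced by a uniform random draw between 0 and the LOD of that column.

```python
import numpy as np


def impute_below_lod(X, lod, seed=0):
    """Replace NaN entries with random values in [0, LOD) for each column.

    Assumption: NaN marks a measurement that fell below the column's LOD.
    """
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # copy so the input is left untouched
    lod = np.broadcast_to(np.asarray(lod, dtype=float), (X.shape[1],))
    missing = np.isnan(X)
    # Uniform draws in [0, 1) scaled column-wise by the LOD.
    draws = rng.uniform(size=X.shape) * lod
    X[missing] = draws[missing]
    return X


X_raw = np.array([[1.0, np.nan], [np.nan, 2.0], [1.5, 2.5]])
X_filled = impute_below_lod(X_raw, lod=[0.5, 0.1])
```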

Scaling#

Corrected Scaling (akin to scikit-learn’s StandardScaler but uses unbiased/corrected standard deviation instead)
Pareto Scaling (divides by square root of standard deviation)
Robust Scaling (divides by IQR instead of standard deviation)
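The three scalers differ only in the denominator, and possibly the center. The NumPy sketch below follows the descriptions above. It assumes mean centering for the first two, median centering for the robust variant, and the corrected (ddof=1) standard deviation for Pareto scaling; the library’s own classes may make different choices.

```python
import numpy as np


def corrected_scale(X):
    """Center on the mean and divide by the unbiased (ddof=1) standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def pareto_scale(X):
    """Center on the mean and divide by the square root of the standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))


def robust_scale(X):
    """Center on the median (an assumption) and divide by the interquartile range."""
    X = np.asarray(X, dtype=float)
    q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)
    return (X - median) / (q3 - q1)


X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 45.0]])
X_corrected = corrected_scale(X_demo)
X_pareto = pareto_scale(X_demo)
X_robust = robust_scale(X_demo)
```

Pareto scaling shrinks large-variance columns less aggressively than full standardization, so each column keeps a standard deviation equal to the square root of its original one.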

Filtering#

Savitzky-Golay
Standard and Robust Normal Variates, SNV, RNV
Multiplicative Scatter Correction, MSC
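As a rough illustration of two of these filters, the sketch below implements SNV (each spectrum is centered and scaled by its own mean and standard deviation) and MSC (each spectrum is regressed against a reference spectrum, then the fitted offset and slope are removed). The function names and the choice of ddof=1 are assumptions, not the library’s API.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: scale every row (spectrum) independently."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    If no reference is given, the mean spectrum is used.
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row ~ slope * reference + intercept, then undo both effects.
        slope, intercept = np.polyfit(reference, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected


# Three copies of one underlying signal with different gain and offset.
base = np.linspace(0.0, 1.0, 50) ** 2
spectra = np.array([a * base + b for a, b in [(1.0, 0.0), (2.0, 0.5), (0.5, -0.2)]])
spectra_snv = snv(spectra)
spectra_msc = msc(spectra, reference=base)
```

Both corrections collapse the three distorted copies onto a single curve, which is the behavior these filters are meant to provide for additive and multiplicative scatter effects.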

Generating Synthetic Data#

Resampling can be used to balance classes during training, or to supplement measurements that are very hard to make. New, synthetic data can also be generated by various means; imblearn pipelines are designed to work with various up/down sampling routines and can be used as drop-in replacements for standard scikit-learn pipelines.
See Imbalanced Learning for methods like SMOTE, ADASYN, etc.
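The core idea behind SMOTE-style oversampling is to create a new point by interpolating between a sample and one of its nearest same-class neighbors. The NumPy sketch below shows only that idea; the function `smote_like` is hypothetical and simplified, and imblearn’s SMOTE should be used in practice.

```python
import numpy as np


def smote_like(X_minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating toward near neighbors.

    Simplified sketch; requires k < number of minority samples.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances, ignoring each point's distance to itself.
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(distances, np.inf)
    neighbors = np.argsort(distances, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbors.
    base = rng.integers(0, len(X), size=n_new)
    partner = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the new point a random fraction of the way along the segment.
    gap = rng.uniform(size=(n_new, 1))
    return X[base] + gap * (X[partner] - X[base])


rng_demo = np.random.default_rng(1)
X_minority = rng_demo.normal(size=(10, 3))
X_synthetic = smote_like(X_minority, n_new=25)
```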

Feature Selection#

Feature extraction, such as PCA, involves manipulating inputs to produce new “dimensions” or composite features, such as the first principal component. Feature selection simply involves selecting a subset of known features (such as columns) to use. scikit-learn has many built-in examples that you can use. Additional tools such as BorutaSHAP and some based on the Jensen-Shannon Divergence are also implemented here.
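One way a Jensen-Shannon divergence could be used for feature selection is to score each column by how different its distribution is between two classes, then keep the highest-scoring columns. The sketch below is an assumption about the general idea, not the library’s implementation; the histogram binning and the function names are hypothetical.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two histograms."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_features_by_js(X, y, bins=10):
    """Score each column by the JS divergence between two classes."""
    X = np.asarray(X, dtype=float)
    class_a, class_b = np.unique(y)  # assumes exactly two classes
    scores = []
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        p, _ = np.histogram(X[y == class_a, j], bins=edges)
        q, _ = np.histogram(X[y == class_b, j], bins=edges)
        scores.append(js_divergence(p.astype(float), q.astype(float)))
    scores = np.array(scores)
    return np.argsort(scores)[::-1], scores


# Column 0 separates the classes; column 1 is pure noise.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
X_demo = np.column_stack([rng.normal(size=200) + 5.0 * y_demo, rng.normal(size=200)])
order, scores = rank_features_by_js(X_demo, y_demo)
```

Unlike PCA, this keeps the original columns intact: the output is a ranking of known features, not a set of new composite ones.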

Conventional Chemometrics (Small data limit)#

Conventional chemometric authentication methods generally fall under the umbrella of multivariate regression or classification tasks. For example, the model proposed when performing multilinear regression is \(\vec{y} = \mathbf{M} \mathbf{X} + \vec{b}\), where the matrix \(\mathbf{M}\) must be solved for. (Un)supervised classification is commonly performed via projection methods, which compress the data into a lower dimensional space. A common choice of data model is \(\mathbf{X} = \mathbf{T} \mathbf{P}^T + \mathbf{E}\), where the scores matrix, \(\mathbf{T}\), represents the projection of the \(\mathbf{X}\) matrix into the lower dimensional score space. The \(\mathbf{P}\) matrix, often referred to as loadings (though conventions differ between disciplines), is computed in different ways. For example, PCA (unsupervised) uses the leading eigenvectors of the covariance matrix of the columns (features) of \(\mathbf{X}\), whereas PLS uses a different (supervised, cross-) decomposition which is a function of both \(\mathbf{X}\) and \(\vec{y}\). \(\mathbf{E}\) is the error (residual) resulting from this model.
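For PCA, the projection model above can be checked numerically. The sketch below assumes column-centered data and uses the SVD to obtain the loadings, which are the same vectors as the leading eigenvectors of the feature covariance matrix; the data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)  # column-center the data

# Rows of Vt are the principal directions, ordered by singular value.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2               # number of components kept
P = Vt[:k].T        # loadings, shape (5, k)
T = Xc @ P          # scores, shape (30, k)
E = Xc - T @ P.T    # residual left after the k-component model

# Eigenvalues of the feature covariance matrix, largest first.
covariance = np.cov(Xc, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(covariance))[::-1]
```

The squared singular values divided by n - 1 match the covariance eigenvalues, and the residual is orthogonal to the scores, so the variance not captured by the first k components ends up entirely in \(\mathbf{E}\).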

OCC methods require careful preparation of the training set to remove outliers so that “masking” effects do not affect your final model. Manual data inspection is typically required. Thus, conventional authentication methods can be considered novelty detection methods (no outliers in training), but many have built-in capabilities to iteratively “clean” the training set if outliers are assumed to be present initially. See “Detection of Outliers in Projection-Based Modeling” by Rodionova and Pomerantsev for an example of outlier detection and removal in projection-based modeling.

Classifiers#

PCA (for data inspection)
PLS-DA (soft and hard variants) - discriminant analysis is not the same as OCC for authentication.
SIMCA
DD-SIMCA

Regressors#

Some Further Reading#

Topological Methods (Intermediate data limit)#

“Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.” - scikit-learn documentation

These approaches may be considered intermediate between conventional chemometric methods and modern AI/ML algorithms. These are generally non-linear dimensionality reduction methods that try to preserve properties, like the topology, of the original data; once projected into a lower dimensional space (embedding), statistical models can be constructed, for example, by drawing an ellipse around the points belonging to a known class. Conventional chemometric authentication methods operate in a similar fashion but with a simpler dimensionality reduction step. Although many methods can be used to detect anomalies in this embedding (score space), we favor the elliptic envelope here for its simplicity and statistical interpretability. Only members of one known class are purposefully trained on (at a time).

EllipticManifold - a combined manifold learning/dimensionality reduction step followed by the determination of an elliptical boundary to detect outliers.
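The two-step idea (embed, then draw an ellipse) can be sketched as follows. This is not the EllipticManifold implementation: the class `EllipseInEmbedding` is hypothetical, a 2-D PCA projection stands in for the manifold learning step, and the boundary assumes the embedded points are roughly Gaussian so that squared Mahalanobis distances follow a chi-squared distribution with 2 degrees of freedom.

```python
import numpy as np


class EllipseInEmbedding:
    """Hypothetical sketch: 2-D PCA embedding followed by an elliptical boundary."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha  # expected fraction of class members rejected

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[:2].T
        Z = self.transform(X)
        self.center_ = Z.mean(axis=0)
        self.precision_ = np.linalg.inv(np.cov(Z, rowvar=False))
        # Chi-squared quantile for 2 degrees of freedom has a closed form:
        # CDF(x) = 1 - exp(-x / 2), so the (1 - alpha) quantile is -2 ln(alpha).
        self.threshold_ = -2.0 * np.log(self.alpha)
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_

    def mahalanobis_sq(self, X):
        d = self.transform(X) - self.center_
        return np.einsum("ij,jk,ik->i", d, self.precision_, d)

    def predict(self, X):
        # True means the sample falls inside the ellipse.
        return self.mahalanobis_sq(X) <= self.threshold_


rng = np.random.default_rng(0)
X_class = rng.normal(size=(200, 5))  # members of the single known class
model = EllipseInEmbedding(alpha=0.05).fit(X_class)
accepted = model.predict(X_class)
```

Note that this sketch only measures distance within the embedding, so a sample that is displaced orthogonally to the projection plane would not be detected.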

Machine Learning (Large data limit)#

In ML/AI, the problem of detecting novelties (a previously unknown class) when only a finite subset of known classes is available for training is known as open set recognition. Here are some references for further reading:

These routines offer the most flexible approaches and include alternative boundary construction methods besides ellipses. Detecting “new or unusual” objects with AI/ML is often formulated as an outlier detection problem. This is a fairly mature field when it comes to feature-vector data for conventional ML/statistical algorithms, but deep learning approaches currently struggle to outperform their other ML counterparts (see the “Familiarity Hypothesis” by Dietterich and Guyer).

Outlier detection with PyOD - This encompasses many different approaches including isolation forests and autoencoders. Its API is largely similar to sklearn and therefore compatible with this ecosystem.
Semi-supervised Positive-Unlabeled (PU) learning.

Explanations and Interpretations#

While examination of loadings, for example, is one way to understand commonly employed chemometric tools, more complex models require more complex tools to inspect these “black boxes”.

SHAP - “(SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.” Its model-agnostic nature means that this can be employed to explain any model or pipeline.
LIME - Local Interpretable Model-agnostic Explanations constructs a simpler, more interpretable model around a point in question to help understand why a prediction about this point has been made.
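To make the Shapley value idea concrete, the sketch below computes exact Shapley values for a tiny model by enumerating every coalition of features. It is not how the SHAP package works internally: here "absent" features are simply replaced by a single baseline value, which is a simplifying assumption, and the enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline point."""
    n = len(x)

    def value(subset):
        # Features in the coalition take their real value, others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(z):
    # One additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x_point = [1.0, 2.0, 3.0]
x_baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x_point, x_baseline)
```

The attributions sum to the difference between the prediction at the point and at the baseline, and the interaction term is split equally between the two features that share it.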

“Interpretable AI” refers to models that are inherently a “glassbox”, meaning their inner workings are transparent. This is not the same as “explained black boxes” (XAI), which are inscrutable by definition, although methods like SHAP can help the user develop a sense of (dis)trust about the model and potentially debug it. Explainable boosting machines (EBM) are a discriminative method, but can be helpful to compare and contrast with explained black boxes or authentication models. EBMs are slow to train, so they are best for small-to-medium data applications, which many chemometric applications fall under.

An EBM from interpretML is a “tree-based, cyclic gradient boosting Generalized Additive Model with automatic interaction detection. EBMs are often as accurate as state-of-the-art blackbox models while remaining completely interpretable.”
pyGAM does not follow scikit-learn’s API, but its models are very useful glassbox models to consider.

“Probabilities” that ML routines produce are usually not guaranteed to be “meaningful.” Elliptic boundaries and other conventional techniques often invoke assumptions, such as the normality of the data, that allow meaningful interpretation of the distances and probabilities these methods yield. For example, if you have a set of points for which the probability of class membership is 80%, you would expect 80% of those points to belong to the class and to be incorrect about 20% of them. However, ML routines often produce “probabilities” which are nothing more than numerical values whose maximum determines the assigned class; the exact value of that probability does not need to be meaningful for these routines to produce (accurate) predictions of class membership. This can be addressed with probability calibration. The basic solution is to add another function that translates the output of an ML model into something more meaningful. See here for more detailed examples and discussion. Calibration may be particularly useful before trying to apply explanation tools or interpret the results of a model.
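Whether a model’s “probabilities” are meaningful can be checked by binning the predictions and comparing the mean predicted probability in each bin with the observed fraction of positives. The NumPy sketch below does this and summarizes the gap as an expected calibration error; the function names and the simulated scores are illustrative assumptions. scikit-learn’s calibration_curve and CalibratedClassifierCV provide the checking and correcting steps in practice.

```python
import numpy as np


def reliability_table(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed frequency, count) per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to an equal-width bin; 1.0 goes in the last bin.
    index = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = index == b
        if mask.any():
            rows.append((y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows


def expected_calibration_error(y_true, y_prob, n_bins=5):
    """Count-weighted average gap between predicted and observed frequencies."""
    rows = reliability_table(y_true, y_prob, n_bins)
    total = sum(count for _, _, count in rows)
    return sum(count / total * abs(conf - freq) for conf, freq, count in rows)


# Simulated scores: p_true is calibrated by construction, p_distorted is a
# monotone distortion that ranks samples identically but is not calibrated.
rng = np.random.default_rng(0)
p_true = rng.uniform(size=20000)
labels = rng.uniform(size=20000) < p_true
p_distorted = p_true ** 3

ece_calibrated = expected_calibration_error(labels, p_true)
ece_distorted = expected_calibration_error(labels, p_distorted)
```

Both sets of scores rank the samples identically, so a threshold-based classifier could perform equally well with either, yet only the first can be read as a probability. This is the gap that calibration is meant to close.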
