Multivariate Drift Detection

Here we describe in detail NannyML’s multivariate drift detection methods helping build a deeper understanding of how they work. But first, let’s see why they are needed.

Limitations of Univariate Drift Detection

Machine Learning models typically have a multidimensional input space. In binary classification problems, models are trained to find the optimal decision boundary. This boundary depends on the structure of the data within the model input space. However the world is not static, and the structure of a model’s input data can change. This change can then cause our existing decision boundary to be suboptimal.

The Univariate Drift Detection tutorial describes how NannyML analyzes each feature individually, and observes whether there are changes in the resulting feature distributions over time.

However, this is not enough to capture all the changes that may affect a machine learning model. The changes in correlations and more complex changes in relationships between model inputs might have a significant impact on model performance without changing univariate distributions of features. The “butterfly” dataset, introduced below, demonstrates this.

“Butterfly” Dataset

Let’s see first how we can construct an instance of the Butterfly dataset.

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> import nannyml as nml
>>> from scipy.spatial.transform import Rotation

>>> # 10 reference periods
>>> # 10 analysis periods
>>> # Days/week * Hours/day * events/hour
>>> DPP = 7*24*12

>>> np.random.seed(23)
>>> s1 = np.random.randn(DPP*20)
>>> x1 = s1 + np.random.randn(DPP*20)/8
>>> x2 = s1 + np.random.randn(DPP*20)/8
>>> x3 = np.random.randn(DPP*20)/8
>>> xdat = np.array([x1, x2, x3]).T

>>> rot = Rotation.from_euler('z', 90, degrees=True)

>>> # following matrix multiplication implementation, we need a 3xN data matrix hence we transpose
>>> ydat = np.matmul(rot.as_matrix(), xdat.T).T

>>> # create overall array that has drifted and not drifted subsets.
>>> # drift is sudden and affects last 5 weeks
>>> dataar = np.concatenate(
...     (xdat[:-5*DPP], ydat[-5*DPP:]),
...     axis=0
>>> )

>>> # convert data to dataframe
>>> datadf = pd.DataFrame(dataar, columns=['feature1', 'feature2', 'feature3'])

>>> # add "timestamp" column
>>> datadf = datadf.assign(ordered = pd.date_range(start='1/6/2020', freq='5min', periods=20*DPP))

>>> # Adding helper column - duplicates date range functionality
>>> datadf['week'] = datadf.ordered.dt.isocalendar().week - 1
>>> # Adding partition column
>>> datadf['partition'] = 'reference'
>>> datadf.loc[datadf.week >= 11, ['partition']] = 'analysis'

>>> # Assign random predictions and targets (we won't be using them but they are needed for NannyML)
>>> datadf = datadf.assign(y_pred_proba = np.random.rand(DPP*20))
>>> datadf = datadf.assign(y_true = np.random.randint(2, size=DPP*20))

The key property of the butterfly dataset is the data drift on its first two features. This data drift is introduced by a 90 degree rotation across the z-axis. The practical effect of this rotation is that we started with two features, feature1 and feature2 that were positively correlated but are now negatively correlated. The following code creates a plot that clearly shows the resulting data drift.

>>> # let's construct a dataframe for visuzlization purposes
>>> dat1 = datadf.loc[datadf.week == 10, ['feature1', 'feature2']][:1500]
>>> dat1['week'] = 10
>>> dat2 = datadf.loc[datadf.week == 16, ['feature1', 'feature2']][:1500]
>>> dat2['week'] = 16
>>> data_sample = pd.concat([dat1, dat2], ignore_index=True)

>>> # let's plot
>>> colors = nml.plots.colors.Colors
>>> figure = sns.jointplot(
...     data=data_sample,
...     x="feature1",
...     y="feature2",
...     hue="week",
...     palette=[colors.BLUE_SKY_CRAYOLA.value, colors.RED_IMPERIAL.value]
>>> )
>>> figure.fig.suptitle('Data Distributions before and after rotation drift')
../_images/butterfly-scatterplot.svg

The plot shows that the univariate distribution of features feature1 and feature2 are unchanged. With blue color you can see the original distribution of the two features and with red color you can see the resulting distribution after we have applied our transformation.

Using NannyML to compute and plot the univariate drift statistics shows that on the individual feature level no changes are visible.

>>> # Let's first create the analysis and reference datasets NannyML needs.
>>> reference_df = datadf.loc[datadf['partition'] == 'reference'].reset_index(drop=True)
>>> reference_df.drop(['week', 'partition'], axis=1, inplace=True)
>>> analysis_df = datadf.loc[datadf['partition'] == 'analysis'].reset_index(drop=True)
>>> analysis_df.drop(['y_true', 'week', 'partition'], axis=1, inplace=True)

>>> feature_column_names = ['feature1', 'feature2', 'feature3']

>>> # Let's instantiate and calibrate univariate drift
>>> univariate_calculator = nml.UnivariateDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='ordered',
...     continuous_methods=['kolmogorov_smirnov'],
...     categorical_methods=['chi2'],
...     chunk_size=DPP
>>> )
>>> univariate_calculator.fit(reference_data=reference_df)

>>> # let's compute (and visualize) results across all the dataset.
>>> univariate_results = univariate_calculator.calculate(data=analysis_df)
>>> figure = univariate_results.filter(
...     period='all',
...     column_names=univariate_results.continuous_column_names
>>> ).plot(kind='distribution')
>>> figure.show()
../_images/butterfly-univariate-drift-distributions.svg

These results make it clear that the univariate distribution results do not detect any drift. However, we know there is data drift in the butterfly dataset. As mentioned, the correlation between features feature1 and feature2 has changed from positive to negative. A methodology that is able to identify this change is needed. This is where Multivariate Drift Detection using Data Reconstruction with PCA can be applied.

Data Reconstruction with PCA

This method is able to capture complex changes in our data. The algorithm implementing Data Reconstruction with PCA works in three steps as described below.

The first step is data preparation. This includes missing values Imputation, frequency encoding, and scaling the data. Missing values need to be imputed because it is a PCA requirement. Frequency encoding is used to convert all categorical features into numbers. The next thing to do is standardize all features to 0 mean and unit variance, to make sure that all features contribute to PCA on equal footing.

The second step is the dimensionality reduction where PCA comes in. By default it aims to capture 65% of the dataset’s variance, but this is a parameter that can be changed. The PCA algorithm is fitted on the reference dataset and learns a transformation from the pre-processed model input space to a latent space.

NannyML then applies this transformation to compress the data that is being analyzed. It is important to note that the PCA method captures the internal structure of the model input data and ignores any random noise that is usually present.

The third step is decompressing the data we just compressed. This is done using the inverse PCA transformation which transforms the data from latent space back to the preprocessed model input space.

Then, the euclidean distance between the original data points and their re-constructed counterparts is computed. The resulting distances are then aggregated to get their average. The resulting number is called Reconstruction Error.

Understanding Reconstruction Error with PCA

As PCA learns the internal structure of the data, a significant change in the reconstruction error means that the learned structure no longer accurately approximates the current data structure. This indicates data drift.

When applying PCA we lose some information about our dataset. This means that the reconstructed data will always be slightly different compared to the original, and the reconstruction error reflects that. Because of this the valuable insight doesn’t come from the value of the reconstruction error but from the change in reconstruction error values over time. The change tells if there is data drift.

This is because when there is data drift the principal components the PCA method has learnt become suboptimal. This will result in worse reconstruction of the new data and therefore a different reconstruction error.

Because of the noise present in real world datasets, there will always be some variability in the reconstruction error results. So not every change in reconstruction error values means that we have data drift.

The variability of reconstruction error values on a known good dataset is used to determine an acceptable variance on the reconstruction error values. Any reconstruction error values outside of that variance represent a significant change in reconstruction error.

NannyML computes the mean and standard deviation of the reconstruction error with PCA on the reference dataset based on the different results for each Data Chunk. This establishes a range of expected values of reconstruction error. A threshold for significant change in NannyML is defined as values that are more than three standard deviations away from the mean of the reference data.

Reconstruction Error with PCA on the butterfly dataset

Now that we have a better understanding of Reconstruction Error with PCA, let’s see how it performs on the butterfly dataset.

>>> # Let's compute multivariate drift
>>> rcerror_calculator = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='ordered',
...     chunk_size=DPP
>>> ).fit(reference_data=reference_df)
>>> # let's compute results for analysis period
>>> rcerror_results = rcerror_calculator.calculate(data=analysis_df)

>>> # let's visualize results across all the dataset
>>> figure1 = rcerror_results.plot()
>>> figure1.show()
../_images/butterfly-multivariate-drift-pca.svg

The change in the butterfly dataset is now clearly visible through the change in the reconstruction error, while our earlier univariate approach detected no change.

For more information on using Reconstruction Error with PCA check the Multivariate Drift - Data Reconstruction with PCA tutorial.

Domain Classifier

A Domain Classifier allows us to create a measure of how easy it is to discriminate the reference data from the examined chunk data. NannyML uses a LightGBM classifier. As a measure of discrimination performance NannyML uses the cross-validated AUROC score. Similar to data reconstruction with PCA this method is also able to capture complex changes in our data.

The algorithm implementing Domain Classifier follows the steps described below. Please note that the process described below is repeated for each Data Chunk. The process consists of two basic parts, data preprocessing and classifier cross validation.

The data pre-processing part consists of the following steps:

  • Assigning label 0 to reference data and label 1 to chunk data rows.

  • Use the model inputs as features.

  • Concatenate resulting data.

  • Remove duplicate rows once. We are keeping the rows comping from chunk data in order to get meaningful results when we use the method on reference data chunks.

  • Encode categorical data as integers for better compatibility with LightGBM.

The classifier cross validation part uses the data created and consists of the following steps:

  • Optionally, hyperparameter tuning is performed. The hyperparameters learnt during this step will be used in the model training steps below. If hyperparameter tuning is not requested, user specified hyperparameters can be used instead of the default LightGBM options.

  • Stratified split is used to split the data into validation folds

  • For each split NannyML trains an LGBMClassifier and saves its predicted scores in the validation fold.

  • The predictions across all folds are used to calculate the resulting AUROC score.

The higher the AUROC score the easier it is to distinguish the datasets, hence the more different they are.

Understanding Domain Classifier

The Domain Classifier method relies on a machine learning algorithm to distinguish between the reference and the chunk data. We are using a LightGBM Classifier. Because of the versatility of this approach the classifier is quite sensitive to shifts in the data. Moreover the classifier’s behavior is non linear. Hence we shouldn’t directly translate classifier AUROC values to possible performance impact. It is better to rely on performance estimation methods for that.

Domain Classifier on the butterfly dataset

Now that we have a better understanding of Domain Classifier, let’s see how it performs on the butterfly dataset.

>>> # Let's compute multivariate drift
>>> drift_classifier = nml.DomainClassifierCalculator(
...     feature_column_names=feature_column_names,
...     timestamp_column_name='ordered',
...     chunk_size=DPP
>>> ).fit(reference_data=reference_df)
>>> # let's compute results for analysis period
>>> drift_classifier_results = drift_classifier.calculate(data=analysis_df)

>>> # let's visualize results across all the dataset
>>> figure2 = drift_classifier_results.plot()
>>> figure2.show()
../_images/butterfly-multivariate-drift-cdd.svg

The change in the butterfly dataset is now clearly visible through the change in the classifier’s AUROC, while our earlier univariate approach detected no change.

For more information on using Domain Classifier check the Multivariate Drift - Domain Classifier tutorial.