Multivariate Drift Detection

Why Perform Multivariate Drift Detection

Multivariate data drift detection addresses the shortcomings of univariate drift detection methods. It provides a single summary number, reducing the risk of false alerts, and it detects subtle changes in the data structure that univariate approaches cannot capture.
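
To see why, consider a toy illustration (plain numpy, not NannyML code): two features whose marginal distributions are unchanged while their correlation flips sign. Univariate checks on each feature in isolation see nothing, yet the joint structure of the data has clearly changed.

>>> import numpy as np

>>> rng = np.random.default_rng(0)
>>> before = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=10_000)
>>> after = rng.multivariate_normal([0, 0], [[1, -0.9], [-0.9, 1]], size=10_000)

>>> # Per-feature means and standard deviations are statistically identical ...
>>> print(before.mean(axis=0).round(2), after.mean(axis=0).round(2))
>>> print(before.std(axis=0).round(2), after.std(axis=0).round(2))

>>> # ... but the relationship between the two features has flipped
>>> print(np.corrcoef(before.T)[0, 1].round(2), np.corrcoef(after.T)[0, 1].round(2))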

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> # Load synthetic data
>>> reference, analysis, _ = nml.load_synthetic_car_loan_dataset()
>>> display(reference.head())

>>> non_feature_columns = ['timestamp', 'y_pred_proba', 'y_pred', 'repaid']

>>> # Define feature columns
>>> feature_column_names = [
...     col for col in reference.columns
...     if col not in non_feature_columns
... ]

>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)

>>> display(results.filter(period='analysis').to_df())

>>> display(results.filter(period='reference').to_df())

>>> figure = results.plot()
>>> figure.show()

Walkthrough

NannyML uses Data Reconstruction with PCA to detect multivariate drift. For a detailed explanation of the method, see the Data Reconstruction with PCA Deep Dive.

The method returns a single number, the reconstruction error. Changes in this value reflect changes in the structure of the model inputs.
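
The intuition can be sketched with plain scikit-learn (this is an illustration of the principle only; NannyML's internal implementation differs in its preprocessing, imputation, and defaults): fit a PCA compressor on the reference data, then measure how well it reconstructs new data.

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.preprocessing import StandardScaler

>>> rng = np.random.default_rng(42)
>>> cov = [[1, 0.8, 0.6], [0.8, 1, 0.5], [0.6, 0.5, 1]]
>>> X_reference = rng.multivariate_normal([0, 0, 0], cov, size=5000)
>>> X_analysis = rng.multivariate_normal([0, 0, 0], np.eye(3), size=5000)  # correlations gone

>>> scaler = StandardScaler().fit(X_reference)
>>> pca = PCA(n_components=2).fit(scaler.transform(X_reference))

>>> def reconstruction_error(X):
...     compressed = pca.transform(scaler.transform(X))
...     reconstructed = pca.inverse_transform(compressed)
...     # mean Euclidean distance between each row and its reconstruction
...     return np.linalg.norm(scaler.transform(X) - reconstructed, axis=1).mean()

>>> print(reconstruction_error(X_reference))  # low: PCA has learned the reference structure
>>> print(reconstruction_error(X_analysis))   # higher: that structure no longer holds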

NannyML calculates the reconstruction error over time for the monitored model and raises an alert if the values fall outside the range defined by the variance within the reference data period.
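
As a sketch of how such a range can be derived (assuming a standard-deviation-based threshold of the reference mean plus or minus three standard deviations, which is our understanding of NannyML's default), using the per-chunk reference values shown later in this walkthrough:

>>> import numpy as np

>>> reference_chunk_errors = np.array([
...     1.13641, 1.13448, 1.13608, 1.14011, 1.12608,
...     1.14202, 1.12515, 1.14321, 1.13095, 1.12703,
... ])
>>> mean, std = reference_chunk_errors.mean(), reference_chunk_errors.std()
>>> # Chunks whose reconstruction error falls outside this band raise an alert;
>>> # this approximately reproduces the thresholds in the result tables below
>>> print(round(mean - 3 * std, 5), round(mean + 3 * std, 5))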

In order to monitor a model, NannyML needs to learn about it from a reference dataset. It can then monitor the data that is subject to actual analysis, provided as the analysis dataset. You can read more about this in our section on data periods.

Let’s start by loading some synthetic data provided by the NannyML package, and setting it up as our reference and analysis dataframes. This synthetic data is for a binary classification model, but multi-class classification can be handled in the same way.

>>> import nannyml as nml
>>> from IPython.display import display

>>> # Load synthetic data
>>> reference, analysis, _ = nml.load_synthetic_car_loan_dataset()
>>> display(reference.head())

|   | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 2018-01-01 00:00:00.000 | 0.99 | 1 |
| 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 2018-01-01 00:08:43.152 | 0.07 | 0 |
| 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 2018-01-01 00:17:26.304 | 1 | 1 |
| 3 | 22652 | 20K - 40K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 2018-01-01 00:26:09.456 | 0.98 | 1 |
| 4 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 2018-01-01 00:34:52.608 | 0.99 | 1 |

The DataReconstructionDriftCalculator module implements this functionality. We need to instantiate it with appropriate parameters: the names of the feature columns that we want to run drift detection on, and the name of the timestamp column. The features can be passed in as a simple list of strings, or we can build that list by excluding the non-feature columns from the dataframe's columns and passing the result to the column_names argument.

Next, the fit() method needs to be called on the reference data, which the results will be based on. Then the calculate() method will calculate the multivariate drift results on the data provided to it.

>>> non_feature_columns = ['timestamp', 'y_pred_proba', 'y_pred', 'repaid']

>>> # Define feature columns
>>> feature_column_names = [
...     col for col in reference.columns
...     if col not in non_feature_columns
... ]

>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)
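
Chunking can also be configured differently. As a variant (chunk_period is a standard NannyML chunking parameter; here 'M' creates one chunk per calendar month based on the timestamp column):

>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_period='M'  # one chunk per calendar month instead of a fixed row count
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)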

Any missing values in our data need to be imputed. By default, NannyML imputes the most frequent value for categorical features and the mean for continuous features. These defaults can be overridden with an instance of the scikit-learn SimpleImputer class, in which case NannyML will perform the imputation as instructed.

An example where custom imputation strategies are used can be seen below.

>>> non_feature_columns = ['timestamp', 'y_pred_proba', 'y_pred', 'repaid']

>>> feature_column_names = [
...     col for col in reference.columns
...     if col not in non_feature_columns
... ]

>>> from sklearn.impute import SimpleImputer

>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000,
...     imputer_categorical=SimpleImputer(strategy='constant', fill_value='missing'),
...     imputer_continuous=SimpleImputer(strategy='median')
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)

Because our synthetic dataset does not have missing values, the results are the same in both cases. We can see the results for the data provided to the calculate() method as a dataframe.

>>> display(results.filter(period='analysis').to_df())

|   | chunk |   |   |   |   |   |   | reconstruction_error |   |   |   |   |   |   |
|   | key | chunk_index | start_index | end_index | start_date | end_date | period | sampling_error | value | upper_confidence_boundary | lower_confidence_boundary | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-10-30 18:00:00 | 2018-11-30 00:27:16.848000 | analysis | 0.00699616 | 1.14152 | 1.16251 | 1.12053 | 1.15404 | 1.11426 | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-11-30 00:36:00 | 2018-12-30 07:03:16.848000 | analysis | 0.00699616 | 1.13064 | 1.15162 | 1.10965 | 1.15404 | 1.11426 | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-12-30 07:12:00 | 2019-01-29 13:39:16.848000 | analysis | 0.00699616 | 1.13891 | 1.1599 | 1.11793 | 1.15404 | 1.11426 | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2019-01-29 13:48:00 | 2019-02-28 20:15:16.848000 | analysis | 0.00699616 | 1.14504 | 1.16603 | 1.12405 | 1.15404 | 1.11426 | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2019-02-28 20:24:00 | 2019-03-31 02:51:16.848000 | analysis | 0.00699616 | 1.13756 | 1.15855 | 1.11657 | 1.15404 | 1.11426 | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2019-03-31 03:00:00 | 2019-04-30 09:27:16.848000 | analysis | 0.00699616 | 1.24921 | 1.27019 | 1.22822 | 1.15404 | 1.11426 | True |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2019-04-30 09:36:00 | 2019-05-30 16:03:16.848000 | analysis | 0.00699616 | 1.2431 | 1.26409 | 1.22211 | 1.15404 | 1.11426 | True |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2019-05-30 16:12:00 | 2019-06-29 22:39:16.848000 | analysis | 0.00699616 | 1.25815 | 1.27914 | 1.23716 | 1.15404 | 1.11426 | True |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2019-06-29 22:48:00 | 2019-07-30 05:15:16.848000 | analysis | 0.00699616 | 1.22818 | 1.24917 | 1.20719 | 1.15404 | 1.11426 | True |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2019-07-30 05:24:00 | 2019-08-29 11:51:16.848000 | analysis | 0.00699616 | 1.25988 | 1.28087 | 1.2389 | 1.15404 | 1.11426 | True |
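
Since to_df() returns a regular pandas dataframe with two header levels (chunk and reconstruction_error), the alerting chunks can be pulled out with ordinary pandas indexing. A small sketch, assuming that column layout:

>>> df = results.filter(period='analysis').to_df()
>>> # Keep only the chunks where the alert flag is set
>>> alerting = df[df[('reconstruction_error', 'alert')]]
>>> display(alerting[[('chunk', 'key'), ('reconstruction_error', 'value')]])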

The drift results for the reference data can be retrieved by filtering the results object:

>>> display(results.filter(period='reference').to_df())

|   | chunk |   |   |   |   |   |   | reconstruction_error |   |   |   |   |   |   |
|   | key | chunk_index | start_index | end_index | start_date | end_date | period | sampling_error | value | upper_confidence_boundary | lower_confidence_boundary | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-01-01 00:00:00 | 2018-01-31 06:27:16.848000 | reference | 0.00699616 | 1.13641 | 1.1574 | 1.11542 | 1.15404 | 1.11426 | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-01-31 06:36:00 | 2018-03-02 13:03:16.848000 | reference | 0.00699616 | 1.13448 | 1.15547 | 1.11349 | 1.15404 | 1.11426 | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-03-02 13:12:00 | 2018-04-01 19:39:16.848000 | reference | 0.00699616 | 1.13608 | 1.15706 | 1.11509 | 1.15404 | 1.11426 | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2018-04-01 19:48:00 | 2018-05-02 02:15:16.848000 | reference | 0.00699616 | 1.14011 | 1.1611 | 1.11913 | 1.15404 | 1.11426 | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2018-05-02 02:24:00 | 2018-06-01 08:51:16.848000 | reference | 0.00699616 | 1.12608 | 1.14707 | 1.10509 | 1.15404 | 1.11426 | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2018-06-01 09:00:00 | 2018-07-01 15:27:16.848000 | reference | 0.00699616 | 1.14202 | 1.16301 | 1.12103 | 1.15404 | 1.11426 | False |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2018-07-01 15:36:00 | 2018-07-31 22:03:16.848000 | reference | 0.00699616 | 1.12515 | 1.14613 | 1.10416 | 1.15404 | 1.11426 | False |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2018-07-31 22:12:00 | 2018-08-31 04:39:16.848000 | reference | 0.00699616 | 1.14321 | 1.16419 | 1.12222 | 1.15404 | 1.11426 | False |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2018-08-31 04:48:00 | 2018-09-30 11:15:16.848000 | reference | 0.00699616 | 1.13095 | 1.15194 | 1.10997 | 1.15404 | 1.11426 | False |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2018-09-30 11:24:00 | 2018-10-30 17:51:16.848000 | reference | 0.00699616 | 1.12703 | 1.14801 | 1.10604 | 1.15404 | 1.11426 | False |

NannyML can also visualize the multivariate drift results in a plot. Our plot contains several key elements.

  • The purple step plot shows the reconstruction error in each chunk of the analysis period. Thick squared point markers indicate the middle of these chunks.

  • The low-saturated purple area around the reconstruction error indicates the sampling error.

  • The red horizontal dashed lines show upper and lower thresholds for alerting purposes.

  • If the reconstruction error crosses the upper or lower threshold, an alert is raised. This is indicated with a low-saturated red background across the whole width of the relevant chunk, and additionally by a red, diamond-shaped point marker in the middle of the chunk.

>>> figure = results.plot()
>>> figure.show()
[Figure: PCA reconstruction error over time (pca-reconstruction-error.svg)]
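
The plot is returned as a plotly Figure, so beyond show() it can also be exported, for example (write_image requires the optional kaleido package):

>>> figure.write_html('multivariate-drift.html')   # interactive HTML
>>> figure.write_image('multivariate-drift.svg')   # static image; needs kaleido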

The multivariate drift results provide a concise summary of where data drift is happening in our input data.

Insights

Using this method of detecting drift, we can identify changes that we might not have noticed using univariate methods alone.

What Next

After reviewing the results, we may want to look at the drift results of individual features to see what changed in each of the model's features.

The Performance Estimation functionality can be used to estimate the impact of the observed changes.

For more information on how multivariate drift detection works, the Data Reconstruction with PCA explanation page gives more details.