Multivariate Data Drift Detection

Why Perform Multivariate Drift Detection

Multivariate data drift detection addresses the shortcomings of univariate data detection methods. It provides one summary number reducing the risk of false alerts and detects more subtle changes in the data structure that cannot be detected with univariate approaches.

Just The Code

If you just want the code to experiment yourself within a Jupyter Notebook, here you go:

>>> import nannyml as nml
>>> import pandas as pd
>>> from IPython.display import display
>>> reference, analysis, analysis_target = nml.load_synthetic_binary_classification_dataset()
>>> metadata = nml.extract_metadata(data = reference, model_name='wfh_predictor', model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.target_column_name = 'work_home_actual'
>>> display(reference.head())

>>> # Let's initialize the object that will perform Data Reconstruction with PCA
>>> # Let's use a chunk size of 5000 data points to create our drift statistics
>>> rcerror_calculator = nml.DataReconstructionDriftCalculator(model_metadata=metadata, chunk_size=5000)
>>> rcerror_calculator = rcerror_calculator.fit(reference_data=reference)
>>> # let's see RC error statistics for all available data
>>> data = pd.concat([reference, analysis], ignore_index=True)
>>> rcerror_results = rcerror_calculator.calculate(data=data)

>>> from sklearn.impute import SimpleImputer

>>> # Let's initialize the object that will perform Data Reconstruction with PCA
>>> rcerror_calculator = nml.DataReconstructionDriftCalculator(
>>>     model_metadata=metadata,
>>>     chunk_size=5000,
>>>     imputer_categorical=SimpleImputer(strategy='constant', fill_value='missing'),
>>>     imputer_continuous=SimpleImputer(strategy='median')
>>> )
>>> # NannyML compares drift versus the full reference dataset.
>>> rcerror_calculator.fit(reference_data=reference)
>>> # let's see RC error statistics for all available data
>>> rcerror_results = rcerror_calculator.calculate(data=data)

>>> # We use the data property of the results class to view the relevant data.
>>> display(rcerror_results.data)

>>> figure = rcerror_results.plot(kind='drift')
>>> figure.show()

Walkthrough on multivariate drift detection

NannyML uses Data Reconstruction with PCA to detect such changes. For a detailed explanation of the method see Data Reconstruction with PCA Deep Dive. The method returns a single number, Reconstruction Error. The changes in this value reflect a change in the structure of the model inputs. NannyML calculates the reconstruction error over time for the monitored model and raises an alert if the values get outside of a range defined by the variance in the reference period.

Let’s start by loading some synthetic data provided by the NannyML package.

>>> import nannyml as nml
>>> import pandas as pd
>>> from IPython.display import display
>>> reference, analysis, analysis_target = nml.load_synthetic_binary_classification_dataset()
>>> metadata = nml.extract_metadata(data = reference, model_name='wfh_predictor', model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.target_column_name = 'work_home_actual'
>>> display(reference.head())

distance_from_office

salary_range

gas_price_per_litre

public_transportation_cost

wfh_prev_workday

workday

tenure

identifier

work_home_actual

timestamp

y_pred_proba

partition

y_pred

0

5.96225

40K - 60K €

2.11948

8.56806

False

Friday

0.212653

0

1

2014-05-09 22:27:20

0.99

reference

1

1

0.535872

40K - 60K €

2.3572

5.42538

True

Tuesday

4.92755

1

0

2014-05-09 22:59:32

0.07

reference

0

2

1.96952

40K - 60K €

2.36685

8.24716

False

Monday

0.520817

2

1

2014-05-09 23:48:25

1

reference

1

3

2.53041

20K - 40K €

2.31872

7.94425

False

Tuesday

0.453649

3

1

2014-05-10 01:12:09

0.98

reference

1

4

2.25364

60K+ €

2.22127

8.88448

True

Thursday

5.69526

4

1

2014-05-10 02:21:34

0.99

reference

1

The DataReconstructionDriftCalculator module implements this functionality. After instantiating it with appropriate parameters the fit() method needs to be called on the reference data where results will be based off. Then the calculate() method will calculate the multivariate drift results on the data provided to it. One way to use it can be seen below:

>>> # Let's initialize the object that will perform Data Reconstruction with PCA
>>> # Let's use a chunk size of 5000 data points to create our drift statistics
>>> rcerror_calculator = nml.DataReconstructionDriftCalculator(model_metadata=metadata, chunk_size=5000)
>>> rcerror_calculator = rcerror_calculator.fit(reference_data=reference)
>>> # let's see RC error statistics for all available data
>>> data = pd.concat([reference, analysis], ignore_index=True)
>>> rcerror_results = rcerror_calculator.calculate(data=data)

Missing values in our data need to be imputed. The default Imputation implemented by NannyML imputes the most frequent value for categorical features and the mean for continuous features. These defaults can be overridden with an instance of SimpleImputer class in which cases NannyML will perform the imputation as instructed. An example where custom imputation strategies are used can be seen below:

>>> from sklearn.impute import SimpleImputer
>>> # Let's initialize the object that will perform Data Reconstruction with PCA
>>> rcerror_calculator = nml.DataReconstructionDriftCalculator(
>>>     model_metadata=metadata,
>>>     chunk_size=5000,
>>>     imputer_categorical=SimpleImputer(strategy='constant', fill_value='missing'),
>>>     imputer_continuous=SimpleImputer(strategy='median')
>>> )
>>> # NannyML compares drift versus the full reference dataset.
>>> rcerror_calculator.fit(reference_data=reference)
>>> # let's see RC error statistics for all available data
>>> rcerror_results = rcerror_calculator.calculate(data=data)

Because our synthetic dataset does not have missing values, the results are the same in both cases:

>>> # We use the data property of the results class to view the relevant data.
>>> display(rcerror_results.data)

key

start_index

end_index

start_date

end_date

partition

reconstruction_error

lower_threshold

upper_threshold

alert

0

[0:4999]

0

4999

2014-05-09 22:27:20

2014-09-09 08:18:27

reference

1.12096

1.09658

1.13801

False

1

[5000:9999]

5000

9999

2014-09-09 09:13:35

2015-01-09 00:02:51

reference

1.11807

1.09658

1.13801

False

2

[10000:14999]

10000

14999

2015-01-09 00:04:43

2015-05-09 15:54:26

reference

1.11724

1.09658

1.13801

False

3

[15000:19999]

15000

19999

2015-05-09 16:02:08

2015-09-07 07:14:37

reference

1.12551

1.09658

1.13801

False

4

[20000:24999]

20000

24999

2015-09-07 07:27:47

2016-01-08 16:02:05

reference

1.10945

1.09658

1.13801

False

5

[25000:29999]

25000

29999

2016-01-08 17:22:00

2016-05-09 11:09:39

reference

1.12276

1.09658

1.13801

False

6

[30000:34999]

30000

34999

2016-05-09 11:19:36

2016-09-04 03:30:35

reference

1.10714

1.09658

1.13801

False

7

[35000:39999]

35000

39999

2016-09-04 04:09:35

2017-01-03 18:48:21

reference

1.12713

1.09658

1.13801

False

8

[40000:44999]

40000

44999

2017-01-03 19:00:51

2017-05-03 02:34:24

reference

1.11424

1.09658

1.13801

False

9

[45000:49999]

45000

49999

2017-05-03 02:49:38

2017-08-31 03:10:29

reference

1.11045

1.09658

1.13801

False

10

[50000:54999]

50000

54999

2017-08-31 04:20:00

2018-01-02 00:45:44

analysis

1.11854

1.09658

1.13801

False

11

[55000:59999]

55000

59999

2018-01-02 01:13:11

2018-05-01 13:10:10

analysis

1.11504

1.09658

1.13801

False

12

[60000:64999]

60000

64999

2018-05-01 14:25:25

2018-09-01 15:40:40

analysis

1.12546

1.09658

1.13801

False

13

[65000:69999]

65000

69999

2018-09-01 16:19:07

2018-12-31 10:11:21

analysis

1.12845

1.09658

1.13801

False

14

[70000:74999]

70000

74999

2018-12-31 10:38:45

2019-04-30 11:01:30

analysis

1.12289

1.09658

1.13801

False

15

[75000:79999]

75000

79999

2019-04-30 11:02:00

2019-09-01 00:24:27

analysis

1.22839

1.09658

1.13801

True

16

[80000:84999]

80000

84999

2019-09-01 00:28:54

2019-12-31 09:09:12

analysis

1.22003

1.09658

1.13801

True

17

[85000:89999]

85000

89999

2019-12-31 10:07:15

2020-04-30 11:46:53

analysis

1.23739

1.09658

1.13801

True

18

[90000:94999]

90000

94999

2020-04-30 12:04:32

2020-09-01 02:46:02

analysis

1.20605

1.09658

1.13801

True

19

[95000:99999]

95000

99999

2020-09-01 02:46:13

2021-01-01 04:29:32

analysis

1.24258

1.09658

1.13801

True

NannyML can also visualize multivariate drift results with the following code:

>>> figure = rcerror_results.plot(kind='drift')
>>> figure.show()
../../_images/drift-guide-multivariate.svg

The multivariate drift results provide a concise summary of where data drift is happening in our input data.

Insights and Follow Ups

After reviewing the results we may want to look at the drift results of individual features to see what changed in the model’s feature’s individually. Moreover the Performance Estimation functionality can be used to estimate the impact of the observed changes.

For more information on how multivariate drift works the Data Reconstruction with PCA explanation page gives more details.