# Multivariate Drift Detection

## Why Perform Multivariate Drift Detection

Multivariate data drift detection addresses the shortcomings of univariate drift detection methods. It reduces the risk of false alerts by providing a single summary metric, and it detects subtle changes in the structure of the data that univariate approaches cannot pick up.

## Just The Code

```python
>>> import nannyml as nml
>>> from IPython.display import display

>>> reference, analysis, _ = nml.load_synthetic_car_loan_dataset()

>>> non_feature_columns = ['timestamp', 'y_pred_proba', 'y_pred', 'repaid']

>>> # Define feature columns
>>> feature_column_names = [
...     col for col in reference.columns
...     if col not in non_feature_columns
... ]

>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)

>>> display(results.filter(period='analysis').to_df())

>>> display(results.filter(period='reference').to_df())

>>> figure = results.plot()
>>> figure.show()
```

## Walkthrough

NannyML uses Data Reconstruction with PCA to detect multivariate drift. For a detailed explanation of the method, see the Data Reconstruction with PCA Deep Dive.

The method returns a single number, the reconstruction error. Changes in this value reflect changes in the structure of the model inputs.

NannyML calculates the reconstruction error over time for the monitored model, and raises an alert if the values get outside a range defined by the variance in the reference data period.
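To build intuition for what this measures, here is a minimal, self-contained sketch of the same idea on toy data using scikit-learn. It skips the preprocessing NannyML performs internally (categorical encoding, imputation, component selection), so treat it as an illustration of the mechanism rather than a reproduction of the calculator.

```python
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.preprocessing import StandardScaler

>>> rng = np.random.default_rng(0)

>>> # Reference period: two strongly correlated features.
>>> x = rng.normal(size=5000)
>>> reference_X = np.column_stack([x, x + 0.1 * rng.normal(size=5000)])

>>> # Later chunk: similar marginals, but the correlation between the features has broken down.
>>> drifted_X = rng.normal(size=(5000, 2))

>>> # Fit the scaler and the PCA compression on the reference period only.
>>> scaler = StandardScaler().fit(reference_X)
>>> pca = PCA(n_components=1).fit(scaler.transform(reference_X))

>>> def reconstruction_error(X):
...     scaled = scaler.transform(X)
...     reconstructed = pca.inverse_transform(pca.transform(scaled))
...     return np.mean(np.linalg.norm(scaled - reconstructed, axis=1))

>>> reconstruction_error(reference_X), reconstruction_error(drifted_X)  # the drifted chunk scores higher
```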

In order to monitor a model, NannyML needs to learn about it from a reference dataset. Then it can monitor the data subject to actual analysis, provided as the analysis dataset. You can read more about this in our section on data periods.

Let’s start by loading some synthetic data provided by the NannyML package and setting it up as our reference and analysis dataframes. This synthetic data is for a binary classification model, but multi-class classification can be handled in the same way.

```python
>>> import nannyml as nml
>>> from IPython.display import display

>>> reference, analysis, _ = nml.load_synthetic_car_loan_dataset()
```

| | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 2018-01-01 00:00:00.000 | 0.99 | 1 |
| 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 2018-01-01 00:08:43.152 | 0.07 | 0 |
| 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 2018-01-01 00:17:26.304 | 1 | 1 |
| 3 | 22652 | 20K - 20K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 2018-01-01 00:26:09.456 | 0.98 | 1 |
| 4 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 2018-01-01 00:34:52.608 | 0.99 | 1 |

The `DataReconstructionDriftCalculator` class implements this functionality. We need to instantiate it with appropriate parameters:

• column_names: A list with the column names of the features we want to run drift detection on.

• timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.

• n_components (Optional): The n_components parameter as passed to the sklearn PCA constructor.

• chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.

• chunk_number (Optional): The number of chunks to be created out of data provided for each period.

• chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.

• chunker (Optional): A NannyML `Chunker` object that will handle aggregating the provided data into chunks.

• imputer_categorical (Optional): An sklearn SimpleImputer object specifying an appropriate strategy for imputing missing values for categorical features.

• imputer_continuous (Optional): An sklearn SimpleImputer object specifying an appropriate strategy for imputing missing values for continuous features.

• threshold (Optional): The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the thresholds tutorial. A sketch of overriding the default chunking and threshold appears after the instantiation example below.

Next, the `fit()` method needs to be called on the reference data, which the results will be based on. Then the `calculate()` method will calculate the multivariate drift results on the provided data.

```python
>>> non_feature_columns = ['timestamp', 'y_pred_proba', 'y_pred', 'repaid']

>>> # Define feature columns
>>> feature_column_names = [
...     col for col in reference.columns
...     if col not in non_feature_columns
... ]

>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)
```
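As mentioned in the parameter list above, the chunking and threshold behaviour can be overridden. The sketch below uses calendar-month chunks and a fixed upper alert limit; it assumes `ConstantThreshold` is importable from `nannyml.thresholds`, so double-check the import path and the available threshold classes against the thresholds tutorial for your NannyML version.

```python
>>> from nannyml.thresholds import ConstantThreshold  # assumed import path

>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_period='M',                        # calendar-month chunks instead of a fixed chunk_size
...     threshold=ConstantThreshold(upper=1.2),  # alert only when reconstruction error exceeds 1.2
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)
```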

Any missing values in our data need to be imputed. By default, NannyML imputes the most frequent value for categorical features and the mean for continuous features. These defaults can be overridden with instances of the `SimpleImputer` class, in which case NannyML will perform the imputation as instructed.

An example of where custom imputation strategies are used can be seen below.

```python
>>> non_feature_columns = ['timestamp', 'y_pred_proba', 'y_pred', 'repaid']

>>> feature_column_names = [
...     col for col in reference.columns
...     if col not in non_feature_columns
... ]

>>> from sklearn.impute import SimpleImputer

>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000,
...     imputer_categorical=SimpleImputer(strategy='constant', fill_value='missing'),
...     imputer_continuous=SimpleImputer(strategy='median')
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)
```

Because our synthetic dataset does not have missing values, the results are the same in both cases. The results for the data provided to the `calculate()` method can be viewed as a dataframe:

```python
>>> display(results.filter(period='analysis').to_df())
```

| | key | chunk_index | start_index | end_index | start_date | end_date | period | sampling_error | reconstruction_error | upper_confidence_boundary | lower_confidence_boundary | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-10-30 18:00:00 | 2018-11-30 00:27:16.848000 | analysis | 0.00699616 | 1.14152 | 1.16251 | 1.12053 | 1.15302 | 1.11528 | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-11-30 00:36:00 | 2018-12-30 07:03:16.848000 | analysis | 0.00699616 | 1.13064 | 1.15162 | 1.10965 | 1.15302 | 1.11528 | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-12-30 07:12:00 | 2019-01-29 13:39:16.848000 | analysis | 0.00699616 | 1.13891 | 1.1599 | 1.11793 | 1.15302 | 1.11528 | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2019-01-29 13:48:00 | 2019-02-28 20:15:16.848000 | analysis | 0.00699616 | 1.14504 | 1.16603 | 1.12405 | 1.15302 | 1.11528 | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2019-02-28 20:24:00 | 2019-03-31 02:51:16.848000 | analysis | 0.00699616 | 1.13756 | 1.15855 | 1.11657 | 1.15302 | 1.11528 | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2019-03-31 03:00:00 | 2019-04-30 09:27:16.848000 | analysis | 0.00699616 | 1.24921 | 1.27019 | 1.22822 | 1.15302 | 1.11528 | True |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2019-04-30 09:36:00 | 2019-05-30 16:03:16.848000 | analysis | 0.00699616 | 1.2431 | 1.26409 | 1.22211 | 1.15302 | 1.11528 | True |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2019-05-30 16:12:00 | 2019-06-29 22:39:16.848000 | analysis | 0.00699616 | 1.25815 | 1.27914 | 1.23716 | 1.15302 | 1.11528 | True |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2019-06-29 22:48:00 | 2019-07-30 05:15:16.848000 | analysis | 0.00699616 | 1.22818 | 1.24917 | 1.20719 | 1.15302 | 1.11528 | True |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2019-07-30 05:24:00 | 2019-08-29 11:51:16.848000 | analysis | 0.00699616 | 1.25988 | 1.28087 | 1.2389 | 1.15302 | 1.11528 | True |

The drift results from the reference data can be retrieved by filtering the results object for the reference period:

```python
>>> display(results.filter(period='reference').to_df())
```

| | key | chunk_index | start_index | end_index | start_date | end_date | period | sampling_error | reconstruction_error | upper_confidence_boundary | lower_confidence_boundary | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-01-01 00:00:00 | 2018-01-31 06:27:16.848000 | reference | 0.00699616 | 1.13641 | 1.1574 | 1.11542 | 1.15302 | 1.11528 | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-01-31 06:36:00 | 2018-03-02 13:03:16.848000 | reference | 0.00699616 | 1.13448 | 1.15547 | 1.11349 | 1.15302 | 1.11528 | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-03-02 13:12:00 | 2018-04-01 19:39:16.848000 | reference | 0.00699616 | 1.13608 | 1.15706 | 1.11509 | 1.15302 | 1.11528 | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2018-04-01 19:48:00 | 2018-05-02 02:15:16.848000 | reference | 0.00699616 | 1.14011 | 1.1611 | 1.11913 | 1.15302 | 1.11528 | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2018-05-02 02:24:00 | 2018-06-01 08:51:16.848000 | reference | 0.00699616 | 1.12608 | 1.14707 | 1.10509 | 1.15302 | 1.11528 | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2018-06-01 09:00:00 | 2018-07-01 15:27:16.848000 | reference | 0.00699616 | 1.14202 | 1.16301 | 1.12103 | 1.15302 | 1.11528 | False |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2018-07-01 15:36:00 | 2018-07-31 22:03:16.848000 | reference | 0.00699616 | 1.12515 | 1.14613 | 1.10416 | 1.15302 | 1.11528 | False |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2018-07-31 22:12:00 | 2018-08-31 04:39:16.848000 | reference | 0.00699616 | 1.14321 | 1.16419 | 1.12222 | 1.15302 | 1.11528 | False |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2018-08-31 04:48:00 | 2018-09-30 11:15:16.848000 | reference | 0.00699616 | 1.13095 | 1.15194 | 1.10997 | 1.15302 | 1.11528 | False |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2018-09-30 11:24:00 | 2018-10-30 17:51:16.848000 | reference | 0.00699616 | 1.12703 | 1.14801 | 1.10604 | 1.15302 | 1.11528 | False |

NannyML can also visualize the multivariate drift results in a plot. Our plot contains several key elements.

• The purple step plot shows the reconstruction error in each chunk of the analysis period. Thick squared point markers indicate the middle of these chunks.

• The low-saturated purple area around the reconstruction error indicates the sampling error.

• The red horizontal dashed lines show upper and lower thresholds for alerting purposes.

• If the reconstruction error crosses the upper or lower threshold an alert is raised which is indicated with a red, low-saturated background across the whole width of the relevant chunk. A red, diamond-shaped point marker additionally indicates this in the middle of the chunk.

```python
>>> figure = results.plot()
>>> figure.show()
```
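The figure returned by `results.plot()` is a Plotly figure, so it can be saved with Plotly's standard export methods if you want to keep the chart outside of a notebook. A small sketch (static image export additionally requires the optional `kaleido` package):

```python
>>> # Persist the chart: interactive HTML, or a static image (the latter needs kaleido installed).
>>> figure.write_html('multivariate_drift_results.html')
>>> figure.write_image('multivariate_drift_results.svg')
```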

The multivariate drift results provide a concise summary of where data drift is happening in our input data.
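If you prefer to consume the results programmatically, the exported dataframe can also be filtered on the alert flag. A short sketch, assuming the two-level column layout (`chunk` / `reconstruction_error`) shown in the tables above:

```python
>>> analysis_df = results.filter(period='analysis').to_df()

>>> # Keep only the chunks whose reconstruction error breached a threshold.
>>> alerting_chunks = analysis_df[analysis_df[('reconstruction_error', 'alert')]]
>>> display(alerting_chunks[[('chunk', 'key'), ('reconstruction_error', 'value')]])
```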

## Insights

Using this method of detecting drift, we can identify changes that we might have missed using univariate methods alone.

## What Next

After reviewing these results, we may want to look at the drift results of individual features to see what changed in each of them.

The Performance Estimation functionality can be used to estimate the impact of the observed changes.

For more information on how multivariate drift detection works, the Data Reconstruction with PCA explanation page gives more details.