Univariate Drift Detection

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df = nml.load_synthetic_binary_classification_dataset()[0]
>>> analysis_df = nml.load_synthetic_binary_classification_dataset()[1]
>>> display(reference_df.head())

>>> column_names = ['distance_from_office', 'salary_range', 'gas_price_per_litre', 'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure', 'y_pred_proba', 'y_pred']
>>> calc = nml.UnivariateDriftCalculator(
...     column_names=column_names,
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
... )

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='analysis', column_names=['distance_from_office']).to_df())

>>> drift_fig = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='drift')
>>> drift_fig.show()

>>> drift_fig = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='drift')
>>> drift_fig.show()

>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='distribution')
>>> figure.show()

>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='distribution')
>>> figure.show()

Walkthrough

NannyML’s univariate approach to data drift looks at each variable individually and compares the chunks created from the analysis data period with the reference period. You can read more about periods and other data requirements in our section on data periods.

The comparison results in a single number, a drift metric, representing the amount of drift between the reference and analysis chunks. NannyML calculates them for every chunk, allowing you to track them over time.
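To make the idea of one drift number per chunk concrete, here is a minimal sketch in plain Python. It is an illustration only, not NannyML's implementation: the helper names `ks_statistic` and `split_into_chunks` and the toy data are assumptions made for this example. It compares each chunk with the reference data using the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions.

```python
from bisect import bisect_right


def ks_statistic(reference, chunk):
    """Largest absolute gap between the empirical CDFs of two samples."""
    ref_sorted = sorted(reference)
    chunk_sorted = sorted(chunk)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(ref_sorted) | set(chunk_sorted):
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_chunk = bisect_right(chunk_sorted, x) / len(chunk_sorted)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_chunk))
    return largest_gap


def split_into_chunks(values, chunk_size):
    """Hypothetical helper: consecutive slices of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


# Toy data: the first analysis chunk matches the reference, the second is shifted.
reference = list(range(100))
analysis = list(range(100)) + [v + 50 for v in range(100)]

drift_per_chunk = [ks_statistic(reference, c) for c in split_into_chunks(analysis, 100)]
print(drift_per_chunk)
```

The first chunk yields a statistic of zero and the shifted chunk a clearly larger one, which is the kind of per-chunk signal that can then be tracked over time.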

NannyML offers both statistical tests and distance measures to detect drift. These are referred to as methods. Some methods apply only to continuous data, others only to categorical data, and some can be used on both. NannyML lets you choose which methods to use for each of these two data types.
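As an example of a distance measure, the sketch below computes a Jensen-Shannon distance between the category frequencies of two samples. It is a simplified illustration under stated assumptions: the helper names are hypothetical, base-2 logarithms are used so the result lies between 0 and 1, and NannyML's own implementation (for instance how it bins continuous data) may differ.

```python
from collections import Counter
from math import log2, sqrt


def relative_frequencies(values, categories):
    """Share of each category in values, in the order given by categories."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2) of two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the Kullback-Leibler sum.
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return sqrt(max(divergence, 0.0))


categories = ["Monday", "Friday"]
reference = ["Monday"] * 50 + ["Friday"] * 50
same_chunk = ["Friday"] * 50 + ["Monday"] * 50
monday_only = ["Monday"] * 100
friday_only = ["Friday"] * 100

no_drift = jensen_shannon_distance(
    relative_frequencies(reference, categories),
    relative_frequencies(same_chunk, categories),
)
max_drift = jensen_shannon_distance(
    relative_frequencies(monday_only, categories),
    relative_frequencies(friday_only, categories),
)
print(no_drift, max_drift)
```

Identical distributions give a distance of 0 and completely disjoint ones give 1, so the value can be read as a bounded measure of how far a chunk has moved from the reference.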

We begin by loading some synthetic data provided in the NannyML package. This is data for a binary classification model, but other model types operate in the same way.

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df = nml.load_synthetic_binary_classification_dataset()[0]
>>> analysis_df = nml.load_synthetic_binary_classification_dataset()[1]
>>> display(reference_df.head())

|   | distance_from_office | salary_range | gas_price_per_litre | public_transportation_cost | wfh_prev_workday | workday | tenure | identifier | work_home_actual | timestamp | y_pred_proba | period | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.96225 | 40K - 60K € | 2.11948 | 8.56806 | False | Friday | 0.212653 | 0 | 1 | 2014-05-09 22:27:20 | 0.99 | reference | 1 |
| 1 | 0.535872 | 40K - 60K € | 2.3572 | 5.42538 | True | Tuesday | 4.92755 | 1 | 0 | 2014-05-09 22:59:32 | 0.07 | reference | 0 |
| 2 | 1.96952 | 40K - 60K € | 2.36685 | 8.24716 | False | Monday | 0.520817 | 2 | 1 | 2014-05-09 23:48:25 | 1 | reference | 1 |
| 3 | 2.53041 | 20K - 40K € | 2.31872 | 7.94425 | False | Tuesday | 0.453649 | 3 | 1 | 2014-05-10 01:12:09 | 0.98 | reference | 1 |
| 4 | 2.25364 | 60K+ € | 2.22127 | 8.88448 | True | Thursday | 5.69526 | 4 | 1 | 2014-05-10 02:21:34 | 0.99 | reference | 1 |

The UnivariateDriftCalculator class implements the functionality needed for univariate drift detection. We need to instantiate it with appropriate parameters:

  • The names of the columns to be evaluated.

  • A list of methods to use on continuous columns. You can choose from kolmogorov_smirnov, jensen_shannon and wasserstein.

  • A list of methods to use on categorical columns. You can choose from chi2, jensen_shannon and l_infinity.

  • Optionally, the name of the column containing the observation timestamps.

  • Optionally, a chunking approach or a predefined chunker. If neither is provided, the default chunker, which creates 10 chunks, will be used.
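The effect of splitting the data into 10 chunks can be sketched in a few lines of plain Python. This is an illustration rather than NannyML's chunker: `chunk_bounds` is a hypothetical helper, and letting the last chunk absorb any remainder is an assumption of this sketch. For the 50,000 rows of each synthetic dataset it produces boundaries of the same form as the chunk keys in the results.

```python
def chunk_bounds(n_rows, n_chunks=10):
    """Return (start_index, end_index) pairs for n_chunks consecutive chunks."""
    size = n_rows // n_chunks
    bounds = []
    for i in range(n_chunks):
        start = i * size
        # Assumption of this sketch: the last chunk takes any leftover rows.
        end = n_rows - 1 if i == n_chunks - 1 else start + size - 1
        bounds.append((start, end))
    return bounds


keys = [f"[{start}:{end}]" for start, end in chunk_bounds(50_000)]
print(keys)
```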

>>> column_names = ['distance_from_office', 'salary_range', 'gas_price_per_litre', 'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure', 'y_pred_proba', 'y_pred']
>>> calc = nml.UnivariateDriftCalculator(
...     column_names=column_names,
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
... )

Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with. Then the calculate() method will calculate the drift results on the data provided to it.

The results can be filtered to only include a certain data period, method or column by using the filter() method. To inspect the result data, convert the results into a DataFrame by calling the to_df() method. By default this returns a DataFrame with a multi-level column index: the first level represents the column, the second level the method that was used, and the third level the values, thresholds and alerts for that method.

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='analysis', column_names=['distance_from_office']).to_df())

|   | ('chunk', 'chunk', 'chunk_index') | ('chunk', 'chunk', 'end_date') | ('chunk', 'chunk', 'end_index') | ('chunk', 'chunk', 'key') | ('chunk', 'chunk', 'period') | ('chunk', 'chunk', 'start_date') | ('chunk', 'chunk', 'start_index') | ('distance_from_office', 'jensen_shannon', 'alert') | ('distance_from_office', 'jensen_shannon', 'lower_threshold') | ('distance_from_office', 'jensen_shannon', 'upper_threshold') | ('distance_from_office', 'jensen_shannon', 'value') | ('distance_from_office', 'kolmogorov_smirnov', 'alert') | ('distance_from_office', 'kolmogorov_smirnov', 'lower_threshold') | ('distance_from_office', 'kolmogorov_smirnov', 'upper_threshold') | ('distance_from_office', 'kolmogorov_smirnov', 'value') |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-01-02 00:45:44 | 4999 | [0:4999] | analysis | 2017-08-31 04:20:00 | 0 | False |  | 0.1 | 0.0261007 | False |  |  | 0.0131 |
| 1 | 1 | 2018-05-01 13:10:10 | 9999 | [5000:9999] | analysis | 2018-01-02 01:13:11 | 5000 | False |  | 0.1 | 0.0202971 | False |  |  | 0.01124 |
| 2 | 2 | 2018-09-01 15:40:40 | 14999 | [10000:14999] | analysis | 2018-05-01 14:25:25 | 10000 | False |  | 0.1 | 0.0210957 | False |  |  | 0.01682 |
| 3 | 3 | 2018-12-31 10:11:21 | 19999 | [15000:19999] | analysis | 2018-09-01 16:19:07 | 15000 | False |  | 0.1 | 0.0362101 | False |  |  | 0.01436 |
| 4 | 4 | 2019-04-30 11:01:30 | 24999 | [20000:24999] | analysis | 2018-12-31 10:38:45 | 20000 | False |  | 0.1 | 0.0287082 | False |  |  | 0.01116 |
| 5 | 5 | 2019-09-01 00:24:27 | 29999 | [25000:29999] | analysis | 2019-04-30 11:02:00 | 25000 | True |  | 0.1 | 0.464732 | True |  |  | 0.43548 |
| 6 | 6 | 2019-12-31 09:09:12 | 34999 | [30000:34999] | analysis | 2019-09-01 00:28:54 | 30000 | True |  | 0.1 | 0.460044 | True |  |  | 0.43032 |
| 7 | 7 | 2020-04-30 11:46:53 | 39999 | [35000:39999] | analysis | 2019-12-31 10:07:15 | 35000 | True |  | 0.1 | 0.466746 | True |  |  | 0.43786 |
| 8 | 8 | 2020-09-01 02:46:02 | 44999 | [40000:44999] | analysis | 2020-04-30 12:04:32 | 40000 | True |  | 0.1 | 0.4663 | True |  |  | 0.43608 |
| 9 | 9 | 2021-01-01 04:29:32 | 49999 | [45000:49999] | analysis | 2020-09-01 02:46:13 | 45000 | True |  | 0.1 | 0.467798 | True |  |  | 0.43852 |

You can also disable the multi-level index behavior and return a flat structure by setting multilevel=False. Both the column name and the method have now been included within the column names.

>>> display(results.filter(period='analysis', column_names=['distance_from_office']).to_df(multilevel=False))

|   | chunk_index | chunk_end_date | chunk_end_index | chunk_key | chunk_period | chunk_start_date | chunk_start_index | distance_from_office_jensen_shannon_alert | distance_from_office_jensen_shannon_lower_threshold | distance_from_office_jensen_shannon_upper_threshold | distance_from_office_jensen_shannon_value | distance_from_office_kolmogorov_smirnov_alert | distance_from_office_kolmogorov_smirnov_lower_threshold | distance_from_office_kolmogorov_smirnov_upper_threshold | distance_from_office_kolmogorov_smirnov_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-01-02 00:45:44 | 4999 | [0:4999] | analysis | 2017-08-31 04:20:00 | 0 | False |  | 0.1 | 0.0261007 | False |  |  | 0.0131 |
| 1 | 1 | 2018-05-01 13:10:10 | 9999 | [5000:9999] | analysis | 2018-01-02 01:13:11 | 5000 | False |  | 0.1 | 0.0202971 | False |  |  | 0.01124 |
| 2 | 2 | 2018-09-01 15:40:40 | 14999 | [10000:14999] | analysis | 2018-05-01 14:25:25 | 10000 | False |  | 0.1 | 0.0210957 | False |  |  | 0.01682 |
| 3 | 3 | 2018-12-31 10:11:21 | 19999 | [15000:19999] | analysis | 2018-09-01 16:19:07 | 15000 | False |  | 0.1 | 0.0362101 | False |  |  | 0.01436 |
| 4 | 4 | 2019-04-30 11:01:30 | 24999 | [20000:24999] | analysis | 2018-12-31 10:38:45 | 20000 | False |  | 0.1 | 0.0287082 | False |  |  | 0.01116 |
| 5 | 5 | 2019-09-01 00:24:27 | 29999 | [25000:29999] | analysis | 2019-04-30 11:02:00 | 25000 | True |  | 0.1 | 0.464732 | True |  |  | 0.43548 |
| 6 | 6 | 2019-12-31 09:09:12 | 34999 | [30000:34999] | analysis | 2019-09-01 00:28:54 | 30000 | True |  | 0.1 | 0.460044 | True |  |  | 0.43032 |
| 7 | 7 | 2020-04-30 11:46:53 | 39999 | [35000:39999] | analysis | 2019-12-31 10:07:15 | 35000 | True |  | 0.1 | 0.466746 | True |  |  | 0.43786 |
| 8 | 8 | 2020-09-01 02:46:02 | 44999 | [40000:44999] | analysis | 2020-04-30 12:04:32 | 40000 | True |  | 0.1 | 0.4663 | True |  |  | 0.43608 |
| 9 | 9 | 2021-01-01 04:29:32 | 49999 | [45000:49999] | analysis | 2020-09-01 02:46:13 | 45000 | True |  | 0.1 | 0.467798 | True |  |  | 0.43852 |
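One way to read this output is to look for the first chunk that raises an alert. The sketch below does so in plain Python on the jensen_shannon values copied from the analysis results for distance_from_office. Treating 0.1 as the upper threshold, and an alert as a value exceeding it, is a reading of this particular output and not a statement of how every method decides on alerts.

```python
# (chunk key, jensen_shannon value, upper threshold, alert), copied from the output above
rows = [
    ("[0:4999]", 0.0261007, 0.1, False),
    ("[5000:9999]", 0.0202971, 0.1, False),
    ("[10000:14999]", 0.0210957, 0.1, False),
    ("[15000:19999]", 0.0362101, 0.1, False),
    ("[20000:24999]", 0.0287082, 0.1, False),
    ("[25000:29999]", 0.464732, 0.1, True),
    ("[30000:34999]", 0.460044, 0.1, True),
    ("[35000:39999]", 0.466746, 0.1, True),
    ("[40000:44999]", 0.4663, 0.1, True),
    ("[45000:49999]", 0.467798, 0.1, True),
]

# Chunks flagged as drifting, in chronological order.
alerting = [key for key, value, upper, alert in rows if alert]
first_alert = alerting[0]

# Check that, in this output, an alert coincides with value > upper threshold.
consistent = all((value > upper) == alert for _, value, upper, alert in rows)

print(first_alert, len(alerting), consistent)
```

On the flat DataFrame itself, the same selection would be ordinary boolean indexing on the distance_from_office_jensen_shannon_alert column.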

The drift results for the reference data are accessible through the filter() method of the drift calculator results:

>>> display(results.filter(period='reference', column_names=['distance_from_office']).to_df())

|   | ('chunk', 'chunk', 'chunk_index') | ('chunk', 'chunk', 'end_date') | ('chunk', 'chunk', 'end_index') | ('chunk', 'chunk', 'key') | ('chunk', 'chunk', 'period') | ('chunk', 'chunk', 'start_date') | ('chunk', 'chunk', 'start_index') | ('distance_from_office', 'jensen_shannon', 'alert') | ('distance_from_office', 'jensen_shannon', 'lower_threshold') | ('distance_from_office', 'jensen_shannon', 'upper_threshold') | ('distance_from_office', 'jensen_shannon', 'value') | ('distance_from_office', 'kolmogorov_smirnov', 'alert') | ('distance_from_office', 'kolmogorov_smirnov', 'lower_threshold') | ('distance_from_office', 'kolmogorov_smirnov', 'upper_threshold') | ('distance_from_office', 'kolmogorov_smirnov', 'value') |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2014-09-09 08:18:27 | 4999 | [0:4999] | reference | 2014-05-09 22:27:20 | 0 | False |  | 0.1 | 0.0294645 | False |  |  | 0.01034 |
| 1 | 1 | 2015-01-09 00:02:51 | 9999 | [5000:9999] | reference | 2014-09-09 09:13:35 | 5000 | False |  | 0.1 | 0.0236588 | False |  |  | 0.0075 |
| 2 | 2 | 2015-05-09 15:54:26 | 14999 | [10000:14999] | reference | 2015-01-09 00:04:43 | 10000 | False |  | 0.1 | 0.0264403 | False |  |  | 0.0082 |
| 3 | 3 | 2015-09-07 07:14:37 | 19999 | [15000:19999] | reference | 2015-05-09 16:02:08 | 15000 | False |  | 0.1 | 0.0217733 | False |  |  | 0.0086 |
| 4 | 4 | 2016-01-08 16:02:05 | 24999 | [20000:24999] | reference | 2015-09-07 07:27:47 | 20000 | False |  | 0.1 | 0.0239721 | False |  |  | 0.0091 |
| 5 | 5 | 2016-05-09 11:09:39 | 29999 | [25000:29999] | reference | 2016-01-08 17:22:00 | 25000 | False |  | 0.1 | 0.0275768 | False |  |  | 0.01458 |
| 6 | 6 | 2016-09-04 03:30:35 | 34999 | [30000:34999] | reference | 2016-05-09 11:19:36 | 30000 | False |  | 0.1 | 0.0268749 | False |  |  | 0.0129 |
| 7 | 7 | 2017-01-03 18:48:21 | 39999 | [35000:39999] | reference | 2016-09-04 04:09:35 | 35000 | False |  | 0.1 | 0.0312645 | False |  |  | 0.0138 |
| 8 | 8 | 2017-05-03 02:34:24 | 44999 | [40000:44999] | reference | 2017-01-03 19:00:51 | 40000 | False |  | 0.1 | 0.0273523 | False |  |  | 0.01586 |
| 9 | 9 | 2017-08-31 03:10:29 | 49999 | [45000:49999] | reference | 2017-05-03 02:49:38 | 45000 | False |  | 0.1 | 0.0296272 | False |  |  | 0.00924 |

The next step is visualizing the results. NannyML can plot both the drift and the distribution for a given column. We’ll first plot the jensen_shannon method results for each continuous column:

>>> drift_fig = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='drift')
>>> drift_fig.show()
../../_images/drift-guide-continuous.svg

We then plot the chi2 results for each categorical column:

>>> drift_fig = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='drift')
>>> drift_fig.show()
../../_images/drift-guide-categorical.svg

NannyML also shows details about the distributions of continuous and categorical variables.

For continuous variables, NannyML plots the estimated probability distribution of the variable for each chunk in a plot called a joyplot. The chunks where drift was detected are highlighted. We can create joyplots for the model’s continuous variables as follows:

>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='distribution')
>>> figure.show()
../../_images/drift-guide-joyplot-continuous.svg

For categorical variables, NannyML plots stacked bar charts to show the variable’s distribution for each chunk. If a variable has more than 5 categories, the top 4 are displayed and the rest are grouped together to make the plots easier to view. We can create stacked bar charts for the model’s categorical variables with the code below:

>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='distribution')
>>> figure.show()
../../_images/drift-guide-categorical.svg
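The grouping rule for variables with many categories can be sketched as follows. This is only an illustration of the idea described above, not NannyML's plotting code: the function name `group_rare_categories`, its parameters and the "Other" label are all hypothetical.

```python
from collections import Counter


def group_rare_categories(values, max_categories=5, keep=4, other_label="Other"):
    """Keep the most frequent categories and merge the rest under one label."""
    counts = Counter(values)
    if len(counts) <= max_categories:
        # Few enough categories: show them all unchanged.
        return dict(counts)
    top = counts.most_common(keep)
    remaining = sum(counts.values()) - sum(n for _, n in top)
    grouped = dict(top)
    grouped[other_label] = remaining
    return grouped


# Toy chunk with six distinct categories, one more than the limit of five.
workdays = (
    ["Monday"] * 6 + ["Tuesday"] * 5 + ["Wednesday"] * 4
    + ["Thursday"] * 3 + ["Friday"] * 2 + ["Saturday"] * 1
)
grouped = group_rare_categories(workdays)
print(grouped)
```

With six categories in the toy chunk, the four most frequent ones are kept and the remaining two are merged into a single bar.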

The drift calculator operates on any column. This means it is not limited to model features, but works on model scores and predictions as well. You can see the drift plots for the model scores (y_pred_proba) and the model predictions (y_pred) below.

../../_images/drift-guide-y_pred_proba.svg
../../_images/drift-guide-y_pred.svg

Insights

After reviewing the above results we have a good understanding of what has changed in our model’s population.

What Next

The Performance Estimation functionality of NannyML can help estimate the impact of the observed changes on model performance. The ranking functionality can help rank drifted features to suggest which ones to prioritize for further investigation if needed. This would be an ad-hoc investigation that is not covered by NannyML.