Univariate Drift Detection

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())

>>> column_names = ['car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred']

>>> calc = nml.UnivariateDriftCalculator(
...     column_names=column_names,
...     treat_as_categorical=['y_pred'],
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
... )

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='analysis', column_names=['debt_to_income_ratio']).to_df())

>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='drift')
>>> figure.show()

>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='drift')
>>> figure.show()

>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='distribution')
>>> figure.show()

>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='distribution')
>>> figure.show()

Advanced configuration

Set up custom chunking [tutorial] [API reference]
Set up custom thresholds [tutorial] [API reference]

Walkthrough

NannyML’s univariate approach for data drift looks at each variable individually and compares the chunks created from the analysis data period with the reference period. You can read more about periods and other data requirements in our section on data periods.

The comparison results in a single number, a drift metric, representing the amount of drift between the reference and analysis chunks. NannyML calculates them for every chunk, allowing you to track them over time.
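To make the idea of one drift value per chunk concrete, here is a minimal pure-Python sketch. It is not NannyML's implementation: the data is randomly generated, the chunk size of 500 is an arbitrary choice, and the helper `ks_statistic` is a hypothetical name for a straightforward two-sample Kolmogorov-Smirnov statistic.

```python
import bisect
import random


def ks_statistic(reference, sample):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    smp_sorted = sorted(sample)
    largest_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in ref_sorted + smp_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_smp = bisect.bisect_right(smp_sorted, x) / len(smp_sorted)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_smp))
    return largest_gap


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# Hypothetical analysis data: the second half is shifted to simulate drift.
analysis = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [rng.gauss(1.0, 1.0) for _ in range(1000)]

chunk_size = 500  # arbitrary size chosen for this sketch
drift_per_chunk = [
    ks_statistic(reference, analysis[start:start + chunk_size])
    for start in range(0, len(analysis), chunk_size)
]

for i, value in enumerate(drift_per_chunk):
    print(f"chunk {i}: KS statistic = {value:.3f}")
```

Each chunk is compared against the full reference sample, so the shifted chunks should produce visibly larger values than the unshifted ones.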

NannyML offers both statistical tests and distance measures to detect drift. These are referred to as methods. Some methods apply only to continuous data, others only to categorical data, and some can be used on both. NannyML lets you choose which methods to use for each of these two types of data.
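As an illustration of a distance measure, the sketch below computes a Jensen-Shannon distance between the category frequencies of a reference sample and two chunks. It assumes the common definition (square root of the Jensen-Shannon divergence with base-2 logarithms); whether NannyML uses exactly this formulation is an assumption here, and the category counts are made up.

```python
import math
from collections import Counter


def relative_frequencies(values, categories):
    """Share of each category in `values`, in the order given by `categories`."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs); lies in [0, 1]."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


# Hypothetical counts for a categorical column such as size_of_downpayment.
categories = ["0%", "10%", "30%", "40%"]
reference_values = ["0%"] * 40 + ["10%"] * 30 + ["30%"] * 20 + ["40%"] * 10
similar_chunk = ["0%"] * 38 + ["10%"] * 32 + ["30%"] * 19 + ["40%"] * 11
shifted_chunk = ["0%"] * 10 + ["10%"] * 20 + ["30%"] * 30 + ["40%"] * 40

ref_p = relative_frequencies(reference_values, categories)
similar_distance = jensen_shannon_distance(ref_p, relative_frequencies(similar_chunk, categories))
shifted_distance = jensen_shannon_distance(ref_p, relative_frequencies(shifted_chunk, categories))

print(f"similar chunk: {similar_distance:.4f}")
print(f"shifted chunk: {shifted_distance:.4f}")
```

The same function can be applied to continuous data once the values are binned, which is one way a single method can serve both data types.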

We begin by loading some synthetic data provided in the NannyML package. This is data for a binary classification model, but other model types operate in the same way.

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())

	car_value	salary_range	debt_to_income_ratio	loan_length	repaid_loan_on_prev_car	size_of_downpayment	driver_tenure	repaid	timestamp	y_pred_proba	y_pred
0	39811	40K - 60K €	0.63295	19	False	40%	0.212653	1	2018-01-01 00:00:00.000	0.99	1
1	12679	40K - 60K €	0.718627	7	True	10%	4.92755	0	2018-01-01 00:08:43.152	0.07	0
2	19847	40K - 60K €	0.721724	17	False	0%	0.520817	1	2018-01-01 00:17:26.304	1	1
3	22652	20K - 20K €	0.705992	16	False	10%	0.453649	1	2018-01-01 00:26:09.456	0.98	1
4	21268	60K+ €	0.671888	21	True	30%	5.69526	1	2018-01-01 00:34:52.608	0.99	1

The UnivariateDriftCalculator class implements the functionality needed for univariate drift detection. We need to instantiate it with appropriate parameters:

The names of the columns to be evaluated.
A list of methods to use on continuous columns. You can choose from kolmogorov_smirnov, jensen_shannon, wasserstein and hellinger.
A list of methods to use on categorical columns. You can choose from chi2, jensen_shannon, l_infinity and hellinger.
Optionally, the name of the column containing the observation timestamps.
Optionally, a chunking approach or a predefined chunker. If neither is provided, the default chunker, which creates 10 chunks, will be used.
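The effect of splitting a dataset into 10 chunks can be sketched in plain Python. This is an illustration rather than NannyML's chunker: the function name `split_into_chunks` is hypothetical, and the way leftover rows are spread over the first chunks is an assumption. The key format mirrors the `[start:end]` keys shown in the result tables.

```python
def split_into_chunks(n_rows, n_chunks=10):
    """Return (start_index, end_index) pairs, end inclusive, covering rows 0..n_rows-1."""
    base, extra = divmod(n_rows, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # Assumption: any remainder rows go to the first `extra` chunks.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds


chunks = split_into_chunks(50_000)
keys = [f"[{start}:{end}]" for start, end in chunks]
print(keys)
```

With 50,000 rows, each chunk holds 5,000 rows, which matches the index ranges in the tables further down.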

>>> column_names = ['car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred']

>>> calc = nml.UnivariateDriftCalculator(
...     column_names=column_names,
...     treat_as_categorical=['y_pred'],
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
... )

Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with. Then the calculate() method will calculate the drift results on the data provided to it.

The results can be filtered to include only a certain data period, method or column by using the filter() method. You can inspect the results by converting them into a DataFrame with the to_df() method. By default this returns a DataFrame with a multi-level index. The first level represents the column, the second level the method that was used, and the third level the values, thresholds and alerts for that method.

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='analysis', column_names=['debt_to_income_ratio']).to_df())

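	chunk	chunk	chunk	chunk	chunk	chunk	chunk	debt_to_income_ratio	debt_to_income_ratio	debt_to_income_ratio	debt_to_income_ratio	debt_to_income_ratio
	chunk	chunk	chunk	chunk	chunk	chunk	chunk	kolmogorov_smirnov	kolmogorov_smirnov	jensen_shannon	jensen_shannon	jensen_shannon
	chunk_index	end_date	end_index	key	period	start_date	start_index	alert	value	alert	upper_threshold	value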
0	0	2018-11-30 00:27:16.848000	4999	[0:4999]	analysis	2018-10-30 18:00:00	0	False	0.01576	False	0.1	0.0316611
1	1	2018-12-30 07:03:16.848000	9999	[5000:9999]	analysis	2018-11-30 00:36:00	5000	False	0.01268	False	0.1	0.0300113
2	2	2019-01-29 13:39:16.848000	14999	[10000:14999]	analysis	2018-12-30 07:12:00	10000	False	0.01734	False	0.1	0.0311286
3	3	2019-02-28 20:15:16.848000	19999	[15000:19999]	analysis	2019-01-29 13:48:00	15000	False	0.0128	False	0.1	0.0294644
4	4	2019-03-31 02:51:16.848000	24999	[20000:24999]	analysis	2019-02-28 20:24:00	20000	False	0.01918	False	0.1	0.0308095
5	5	2019-04-30 09:27:16.848000	29999	[25000:29999]	analysis	2019-03-31 03:00:00	25000	False	0.00824	False	0.1	0.0286811
6	6	2019-05-30 16:03:16.848000	34999	[30000:34999]	analysis	2019-04-30 09:36:00	30000	False	0.01058	False	0.1	0.0436276
7	7	2019-06-29 22:39:16.848000	39999	[35000:39999]	analysis	2019-05-30 16:12:00	35000	False	0.01002	False	0.1	0.0292533
8	8	2019-07-30 05:15:16.848000	44999	[40000:44999]	analysis	2019-06-29 22:48:00	40000	False	0.01068	False	0.1	0.0306276
9	9	2019-08-29 11:51:16.848000	49999	[45000:49999]	analysis	2019-07-30 05:24:00	45000	False	0.0068	False	0.1	0.0283303

You can also disable the multi-level index behavior and return a flat structure by setting multilevel=False. Both the column name and the method have now been included within the column names.

>>> display(results.filter(period='analysis', column_names=['debt_to_income_ratio']).to_df(multilevel=False))

	chunk_index	chunk_end_date	chunk_end_index	chunk_key	chunk_period	chunk_start_date	chunk_start_index	debt_to_income_ratio_kolmogorov_smirnov_alert	debt_to_income_ratio_kolmogorov_smirnov_value	debt_to_income_ratio_jensen_shannon_alert	debt_to_income_ratio_jensen_shannon_upper_threshold	debt_to_income_ratio_jensen_shannon_value
0	0	2018-11-30 00:27:16.848000	4999	[0:4999]	analysis	2018-10-30 18:00:00	0	False	0.01576	False	0.1	0.0316611
1	1	2018-12-30 07:03:16.848000	9999	[5000:9999]	analysis	2018-11-30 00:36:00	5000	False	0.01268	False	0.1	0.0300113
2	2	2019-01-29 13:39:16.848000	14999	[10000:14999]	analysis	2018-12-30 07:12:00	10000	False	0.01734	False	0.1	0.0311286
3	3	2019-02-28 20:15:16.848000	19999	[15000:19999]	analysis	2019-01-29 13:48:00	15000	False	0.0128	False	0.1	0.0294644
4	4	2019-03-31 02:51:16.848000	24999	[20000:24999]	analysis	2019-02-28 20:24:00	20000	False	0.01918	False	0.1	0.0308095
5	5	2019-04-30 09:27:16.848000	29999	[25000:29999]	analysis	2019-03-31 03:00:00	25000	False	0.00824	False	0.1	0.0286811
6	6	2019-05-30 16:03:16.848000	34999	[30000:34999]	analysis	2019-04-30 09:36:00	30000	False	0.01058	False	0.1	0.0436276
7	7	2019-06-29 22:39:16.848000	39999	[35000:39999]	analysis	2019-05-30 16:12:00	35000	False	0.01002	False	0.1	0.0292533
8	8	2019-07-30 05:15:16.848000	44999	[40000:44999]	analysis	2019-06-29 22:48:00	40000	False	0.01068	False	0.1	0.0306276
9	9	2019-08-29 11:51:16.848000	49999	[45000:49999]	analysis	2019-07-30 05:24:00	45000	False	0.0068	False	0.1	0.0283303
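The flat column names follow a `<column>_<method>_<field>` pattern, which makes them easy to look up programmatically. The sketch below uses a plain dictionary holding one row copied from the table above (chunk 6) instead of a real DataFrame; the helper `lookup` is a hypothetical name introduced for illustration.

```python
# One row of the flat result table above (chunk 6), as a plain dictionary.
row = {
    "chunk_key": "[30000:34999]",
    "debt_to_income_ratio_kolmogorov_smirnov_alert": False,
    "debt_to_income_ratio_kolmogorov_smirnov_value": 0.01058,
    "debt_to_income_ratio_jensen_shannon_alert": False,
    "debt_to_income_ratio_jensen_shannon_upper_threshold": 0.1,
    "debt_to_income_ratio_jensen_shannon_value": 0.0436276,
}

column_name = "debt_to_income_ratio"
methods = ["kolmogorov_smirnov", "jensen_shannon"]


def lookup(row, column_name, method, field):
    """Build the flat column name from its three parts and read the value."""
    return row[f"{column_name}_{method}_{field}"]


summary = {
    method: (lookup(row, column_name, method, "value"), lookup(row, column_name, method, "alert"))
    for method in methods
}

for method, (value, alert) in summary.items():
    print(f"{row['chunk_key']} {method}: value={value}, alert={alert}")
```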

The drift results from the reference data are accessible through the filter() method of the drift calculator results:

>>> display(results.filter(period='reference', column_names=['debt_to_income_ratio']).to_df())

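	chunk	chunk	chunk	chunk	chunk	chunk	chunk	debt_to_income_ratio	debt_to_income_ratio	debt_to_income_ratio	debt_to_income_ratio	debt_to_income_ratio
	chunk	chunk	chunk	chunk	chunk	chunk	chunk	kolmogorov_smirnov	kolmogorov_smirnov	jensen_shannon	jensen_shannon	jensen_shannon
	chunk_index	end_date	end_index	key	period	start_date	start_index	alert	value	alert	upper_threshold	value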
0	0	2018-01-31 06:27:16.848000	4999	[0:4999]	reference	2018-01-01 00:00:00	0	False	0.01112	False	0.1	0.0333679
1	1	2018-03-02 13:03:16.848000	9999	[5000:9999]	reference	2018-01-31 06:36:00	5000	False	0.01218	False	0.1	0.028066
2	2	2018-04-01 19:39:16.848000	14999	[10000:14999]	reference	2018-03-02 13:12:00	10000	False	0.00878	False	0.1	0.0225969
3	3	2018-05-02 02:15:16.848000	19999	[15000:19999]	reference	2018-04-01 19:48:00	15000	False	0.0095	False	0.1	0.0315869
4	4	2018-06-01 08:51:16.848000	24999	[20000:24999]	reference	2018-05-02 02:24:00	20000	False	0.00754	False	0.1	0.0310501
5	5	2018-07-01 15:27:16.848000	29999	[25000:29999]	reference	2018-06-01 09:00:00	25000	False	0.0103	False	0.1	0.0316479
6	6	2018-07-31 22:03:16.848000	34999	[30000:34999]	reference	2018-07-01 15:36:00	30000	False	0.01094	False	0.1	0.0258014
7	7	2018-08-31 04:39:16.848000	39999	[35000:39999]	reference	2018-07-31 22:12:00	35000	False	0.01736	False	0.1	0.0325098
8	8	2018-09-30 11:15:16.848000	44999	[40000:44999]	reference	2018-08-31 04:48:00	40000	False	0.00842	False	0.1	0.0248975
9	9	2018-10-30 17:51:16.848000	49999	[45000:49999]	reference	2018-09-30 11:24:00	45000	False	0.00786	False	0.1	0.0284742
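The relationship between the value, upper_threshold and alert columns can be sketched with the jensen_shannon values from the reference table above. The rule used here (an alert is raised when the value exceeds the upper threshold) is an assumption about how alerts are derived, although it is consistent with every row shown.

```python
# jensen_shannon values for debt_to_income_ratio, copied from the reference table above.
jensen_shannon_values = [
    0.0333679, 0.028066, 0.0225969, 0.0315869, 0.0310501,
    0.0316479, 0.0258014, 0.0325098, 0.0248975, 0.0284742,
]
upper_threshold = 0.1

# Assumption: a chunk raises an alert when its value exceeds the upper threshold.
alerts = [value > upper_threshold for value in jensen_shannon_values]
largest_value = max(jensen_shannon_values)

print(f"alerts raised: {sum(alerts)} of {len(alerts)} chunks")
print(f"largest value: {largest_value} (threshold {upper_threshold})")
```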

The next step is visualizing the results. NannyML can plot both the drift and the distribution for a given column. We first plot the jensen_shannon results for each continuous column, shown below.

>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='drift')
>>> figure.show()

Note that y_pred_proba is among the columns shown. The drift calculator operates on any column, so it is not limited to model features: it works on model scores and predictions as well. The same applies to categorical columns. The plot below shows the chi2 results for each categorical column, including the y_pred column.

>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='drift')
>>> figure.show()

NannyML also shows details about the distributions of continuous and categorical variables.

For continuous variables, NannyML plots the estimated probability distribution of the variable for each chunk in a plot called a joyplot. The chunks where drift was detected are highlighted.

We can create joyplots for the model’s continuous variables with the code below.

>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='distribution')
>>> figure.show()
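The per-chunk density curves that make up a joyplot can be approximated with a kernel density estimate. The sketch below is a rough, dependency-free illustration and not the estimator NannyML uses: the Gaussian kernel, the bandwidth of 0.02, the evaluation grid and the generated data are all assumptions.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` with Gaussian kernels."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


rng = random.Random(1)
# Hypothetical chunks of a continuous column; the second one has drifted upwards.
chunks = {
    "chunk 0": [rng.gauss(0.6, 0.05) for _ in range(300)],
    "chunk 1": [rng.gauss(0.8, 0.05) for _ in range(300)],
}

grid = [i / 100 for i in range(0, 141)]  # evaluation points from 0.00 to 1.40
curves = {}
for name, values in chunks.items():
    density = gaussian_kde(values, bandwidth=0.02)
    curves[name] = [density(x) for x in grid]

# The location of each curve's peak shows where the chunk's values concentrate.
peaks = {name: grid[curve.index(max(curve))] for name, curve in curves.items()}
print(peaks)
```

Stacking one such curve per chunk, ordered in time, gives the ridge-like layout of a joyplot, and a shift in the peak between chunks is the visual sign of drift.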

For categorical variables NannyML plots stacked bar charts to show the variable’s distribution for each chunk. If a variable has more than 5 categories, the top 4 are displayed and the rest are grouped together to make the plots easier to view. The chunks where drift was detected are highlighted.

We can create stacked bar charts for the model’s categorical variables with the code below.

>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='distribution')
>>> figure.show()
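The grouping rule for categorical plots described above (more than 5 categories: keep the top 4 and merge the rest) can be sketched as follows. The function name `group_small_categories`, the "Other" label and the example counts are hypothetical and only illustrate the rule.

```python
from collections import Counter


def group_small_categories(values, max_categories=5, keep_top=4, other_label="Other"):
    """Keep the most frequent categories and merge the rest when there are too many."""
    counts = Counter(values)
    if len(counts) <= max_categories:
        return dict(counts)
    top = counts.most_common(keep_top)
    grouped = dict(top)
    # Everything outside the top categories is summed into a single bucket.
    grouped[other_label] = sum(counts.values()) - sum(n for _, n in top)
    return grouped


# Hypothetical column with seven categories.
values = ["a"] * 10 + ["b"] * 8 + ["c"] * 6 + ["d"] * 5 + ["e"] * 3 + ["f"] * 2 + ["g"] * 1
grouped = group_small_categories(values)
print(grouped)

# A column with exactly five categories is left untouched.
unchanged = group_small_categories(["a", "b", "c", "d", "e"])
print(unchanged)
```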

Insights

After reviewing the above results we have a good understanding of what has changed in our model’s population.

What Next

The Performance Estimation functionality of NannyML can help estimate the impact of the observed changes on model performance. The ranking functionality can help rank drifted features to suggest which ones to prioritize for further investigation, if needed. That would be an ad-hoc investigation that is not covered by NannyML.