Univariate Drift Detection
Just The Code
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())
>>> column_names = ['car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred']
>>> calc = nml.UnivariateDriftCalculator(
... column_names=column_names,
... treat_as_categorical=['y_pred'],
... timestamp_column_name='timestamp',
... continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
... categorical_methods=['chi2', 'jensen_shannon'],
... )
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='analysis', column_names=['debt_to_income_ratio']).to_df())
>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='drift')
>>> figure.show()
>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='drift')
>>> figure.show()
>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='distribution')
>>> figure.show()
>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='distribution')
>>> figure.show()
Advanced configuration
To learn how Chunk works and to set up custom chunking, check out the chunking tutorial.
To learn how ConstantThreshold works and to set up custom thresholds, check out the thresholds tutorial.
Walkthrough
NannyML’s univariate approach for data drift looks at each variable individually and compares the chunks created from the analysis data period with the reference period. You can read more about periods and other data requirements in our section on data periods.
The comparison results in a single number, a drift metric, representing the amount of drift between the reference and analysis chunks. NannyML calculates them for every chunk, allowing you to track them over time.
NannyML offers both statistical tests as well as distance measures to detect drift. They are referred to as methods. Some methods only apply to continuous data, others to categorical data and some might be used on both. NannyML lets you choose which methods to use for these two types of data.
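To make the idea of a drift metric concrete, here is a minimal sketch in plain Python (an illustration of the concept, not NannyML's internal implementation) of how a distance measure such as Jensen-Shannon reduces the comparison between a reference distribution and a chunk's distribution to a single number:

```python
from math import log2, sqrt

def jensen_shannon_distance(p, q):
    """JS distance between two discrete distributions (log base 2, range [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

reference_dist = [0.5, 0.3, 0.2]   # category frequencies in the reference period
chunk_dist     = [0.2, 0.3, 0.5]   # frequencies in one analysis chunk
print(jensen_shannon_distance(reference_dist, reference_dist))  # 0.0 -> no drift
print(jensen_shannon_distance(reference_dist, chunk_dist) > 0)  # True -> some drift
```

Identical distributions yield 0, fully disjoint ones yield 1, so the per-chunk value can be tracked over time and compared against a threshold.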
We begin by loading some synthetic data provided in the NannyML package. This is data for a binary classification model, but other model types operate in the same way.
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())
| | id | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 2018-01-01 00:00:00.000 | 0.99 | 1 |
| 1 | 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 2018-01-01 00:08:43.152 | 0.07 | 0 |
| 2 | 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 2018-01-01 00:17:26.304 | 1 | 1 |
| 3 | 3 | 22652 | 20K - 40K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 2018-01-01 00:26:09.456 | 0.98 | 1 |
| 4 | 4 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 2018-01-01 00:34:52.608 | 0.99 | 1 |
The UnivariateDriftCalculator class implements the functionality needed for univariate drift detection.
First, we need to instantiate it with the appropriate parameters:
column_names: A list with the names of columns to be evaluated.
treat_as_categorical (Optional): A list of column names to treat as categorical columns.
timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.
categorical_methods (Optional): A list of methods to use on categorical columns. You can choose from chi2, jensen_shannon, l_infinity, and hellinger.
continuous_methods (Optional): A list of methods to use on continuous columns. You can choose from kolmogorov_smirnov, jensen_shannon, wasserstein, and hellinger.
chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.
chunk_number (Optional): The number of chunks to be created out of the data provided for each period.
chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.
chunker (Optional): A NannyML Chunker object that will handle aggregating the provided data in order to create chunks.
thresholds (Optional): A dictionary allowing users to set a custom threshold strategy for each method. It links a Threshold subclass to a method name. For more information about thresholds, check out the thresholds tutorial.
computation_params (Optional): A dictionary which allows users to specify whether they want drift calculated on the exact reference data or on an estimated distribution of the reference data obtained using binning techniques. Applicable only to Kolmogorov-Smirnov and Wasserstein. For more information, see the UnivariateDriftCalculator API reference.
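As an illustration of what the size-based chunking argument does conceptually (a plain-Python sketch, not NannyML's Chunker implementation), chunk_size=5000 over 50,000 rows yields ten consecutive chunks:

```python
def split_by_size(rows, chunk_size):
    # Slice the data into consecutive chunks of chunk_size observations each
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

rows = list(range(50_000))
chunks = split_by_size(rows, 5_000)
print(len(chunks))                  # 10
print(chunks[0][0], chunks[0][-1])  # 0 4999, i.e. a chunk key like [0:4999]
```

The drift metric is then computed once per such chunk against the reference data.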
>>> column_names = ['car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred']
>>> calc = nml.UnivariateDriftCalculator(
... column_names=column_names,
... treat_as_categorical=['y_pred'],
... timestamp_column_name='timestamp',
... continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
... categorical_methods=['chi2', 'jensen_shannon'],
... )
Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with. Then the calculate() method will calculate the drift results on the provided data.
The results can be filtered to only include a certain data period, method, or column by using the filter() method. You can evaluate the result data by converting the results into a DataFrame by calling the to_df() method.
By default, this returns a DataFrame with a multi-level index. The first level represents the column, the second level is the method, and the third level holds the values, thresholds, and alerts for that method.
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='analysis', column_names=['debt_to_income_ratio']).to_df())
| | chunk key | chunk_index | start_index | end_index | start_date | end_date | period | kolmogorov_smirnov value | upper_threshold | lower_threshold | alert | jensen_shannon value | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-10-30 18:00:00 | 2018-11-30 00:27:16.848000 | analysis | 0.01576 | 0.0185838 | | False | 0.0316611 | 0.0393276 | | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-11-30 00:36:00 | 2018-12-30 07:03:16.848000 | analysis | 0.01268 | 0.0185838 | | False | 0.0300113 | 0.0393276 | | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-12-30 07:12:00 | 2019-01-29 13:39:16.848000 | analysis | 0.01734 | 0.0185838 | | False | 0.0311286 | 0.0393276 | | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2019-01-29 13:48:00 | 2019-02-28 20:15:16.848000 | analysis | 0.0128 | 0.0185838 | | False | 0.0294644 | 0.0393276 | | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2019-02-28 20:24:00 | 2019-03-31 02:51:16.848000 | analysis | 0.01918 | 0.0185838 | | True | 0.0308095 | 0.0393276 | | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2019-03-31 03:00:00 | 2019-04-30 09:27:16.848000 | analysis | 0.00824 | 0.0185838 | | False | 0.0286811 | 0.0393276 | | False |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2019-04-30 09:36:00 | 2019-05-30 16:03:16.848000 | analysis | 0.01058 | 0.0185838 | | False | 0.0436276 | 0.0393276 | | True |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2019-05-30 16:12:00 | 2019-06-29 22:39:16.848000 | analysis | 0.01002 | 0.0185838 | | False | 0.0292533 | 0.0393276 | | False |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2019-06-29 22:48:00 | 2019-07-30 05:15:16.848000 | analysis | 0.01068 | 0.0185838 | | False | 0.0306276 | 0.0393276 | | False |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2019-07-30 05:24:00 | 2019-08-29 11:51:16.848000 | analysis | 0.0068 | 0.0185838 | | False | 0.0283303 | 0.0393276 | | False |
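The alert flags in the results follow a simple rule: each method's value is compared against its thresholds. A minimal sketch of that comparison (illustrative only; NannyML's actual Threshold classes are configurable):

```python
def is_alert(value, upper_threshold=None, lower_threshold=None):
    # An alert fires when the drift metric crosses either threshold
    above = upper_threshold is not None and value > upper_threshold
    below = lower_threshold is not None and value < lower_threshold
    return above or below

# Chunk 4's kolmogorov_smirnov value crosses the upper threshold, hence its alert
print(is_alert(0.01918, upper_threshold=0.0185838))  # True
print(is_alert(0.01576, upper_threshold=0.0185838))  # False
```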
You can also disable the multi-level index behavior and return a flat structure by setting multilevel=False
.
Both the column name and the method have now been included within the column names.
>>> display(results.filter(period='analysis', column_names=['debt_to_income_ratio']).to_df(multilevel=False))
| | chunk_key | chunk_index | chunk_start_index | chunk_end_index | chunk_start_date | chunk_end_date | chunk_period | debt_to_income_ratio_kolmogorov_smirnov_value | debt_to_income_ratio_kolmogorov_smirnov_upper_threshold | debt_to_income_ratio_kolmogorov_smirnov_lower_threshold | debt_to_income_ratio_kolmogorov_smirnov_alert | debt_to_income_ratio_jensen_shannon_value | debt_to_income_ratio_jensen_shannon_upper_threshold | debt_to_income_ratio_jensen_shannon_lower_threshold | debt_to_income_ratio_jensen_shannon_alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-10-30 18:00:00 | 2018-11-30 00:27:16.848000 | analysis | 0.01576 | 0.0185838 | | False | 0.0316611 | 0.0393276 | | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-11-30 00:36:00 | 2018-12-30 07:03:16.848000 | analysis | 0.01268 | 0.0185838 | | False | 0.0300113 | 0.0393276 | | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-12-30 07:12:00 | 2019-01-29 13:39:16.848000 | analysis | 0.01734 | 0.0185838 | | False | 0.0311286 | 0.0393276 | | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2019-01-29 13:48:00 | 2019-02-28 20:15:16.848000 | analysis | 0.0128 | 0.0185838 | | False | 0.0294644 | 0.0393276 | | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2019-02-28 20:24:00 | 2019-03-31 02:51:16.848000 | analysis | 0.01918 | 0.0185838 | | True | 0.0308095 | 0.0393276 | | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2019-03-31 03:00:00 | 2019-04-30 09:27:16.848000 | analysis | 0.00824 | 0.0185838 | | False | 0.0286811 | 0.0393276 | | False |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2019-04-30 09:36:00 | 2019-05-30 16:03:16.848000 | analysis | 0.01058 | 0.0185838 | | False | 0.0436276 | 0.0393276 | | True |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2019-05-30 16:12:00 | 2019-06-29 22:39:16.848000 | analysis | 0.01002 | 0.0185838 | | False | 0.0292533 | 0.0393276 | | False |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2019-06-29 22:48:00 | 2019-07-30 05:15:16.848000 | analysis | 0.01068 | 0.0185838 | | False | 0.0306276 | 0.0393276 | | False |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2019-07-30 05:24:00 | 2019-08-29 11:51:16.848000 | analysis | 0.0068 | 0.0185838 | | False | 0.0283303 | 0.0393276 | | False |
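The flat column names above follow a simple composition pattern, which can be sketched as (an illustration of the naming scheme, not NannyML code):

```python
# Flat column names concatenate column, method, and metric with underscores
column, method = "debt_to_income_ratio", "kolmogorov_smirnov"
metrics = ("value", "upper_threshold", "lower_threshold", "alert")
flat_names = [f"{column}_{method}_{metric}" for metric in metrics]
print(flat_names[0])  # debt_to_income_ratio_kolmogorov_smirnov_value
```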
The drift results from the reference data are accessible through the filter() method of the drift calculator results:
>>> display(results.filter(period='reference', column_names=['debt_to_income_ratio']).to_df())
| | chunk key | chunk_index | start_index | end_index | start_date | end_date | period | kolmogorov_smirnov value | upper_threshold | lower_threshold | alert | jensen_shannon value | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-01-01 00:00:00 | 2018-01-31 06:27:16.848000 | reference | 0.01112 | 0.0185838 | | False | 0.0333679 | 0.0393276 | | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-01-31 06:36:00 | 2018-03-02 13:03:16.848000 | reference | 0.01218 | 0.0185838 | | False | 0.028066 | 0.0393276 | | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-03-02 13:12:00 | 2018-04-01 19:39:16.848000 | reference | 0.00878 | 0.0185838 | | False | 0.0225969 | 0.0393276 | | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2018-04-01 19:48:00 | 2018-05-02 02:15:16.848000 | reference | 0.0095 | 0.0185838 | | False | 0.0315869 | 0.0393276 | | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2018-05-02 02:24:00 | 2018-06-01 08:51:16.848000 | reference | 0.00754 | 0.0185838 | | False | 0.0310501 | 0.0393276 | | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2018-06-01 09:00:00 | 2018-07-01 15:27:16.848000 | reference | 0.0103 | 0.0185838 | | False | 0.0316479 | 0.0393276 | | False |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2018-07-01 15:36:00 | 2018-07-31 22:03:16.848000 | reference | 0.01094 | 0.0185838 | | False | 0.0258014 | 0.0393276 | | False |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2018-07-31 22:12:00 | 2018-08-31 04:39:16.848000 | reference | 0.01736 | 0.0185838 | | False | 0.0325098 | 0.0393276 | | False |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2018-08-31 04:48:00 | 2018-09-30 11:15:16.848000 | reference | 0.00842 | 0.0185838 | | False | 0.0248975 | 0.0393276 | | False |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2018-09-30 11:24:00 | 2018-10-30 17:51:16.848000 | reference | 0.00786 | 0.0185838 | | False | 0.0284742 | 0.0393276 | | False |
The next step is visualizing the results. NannyML can plot both the drift and distribution for a given column.
We will first plot the jensen_shannon
method results for each continuous column shown below.
>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='drift')
>>> figure.show()
Note that y_pred_proba is included among the columns shown. This means that the drift calculator is not limited to model features, but can also be applied to model scores and predictions.
This also applies to categorical columns. The plot below shows the chi2 results for each categorical column, which likewise include the y_pred column.
>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='drift')
>>> figure.show()
NannyML also shows details about the distributions of continuous and categorical variables.
For continuous variables, NannyML plots the estimated probability distribution of the variable for each chunk in a plot called a joyplot. The chunks where drift was detected are highlighted.
Using the code below, we can create joyplots for the model's continuous variables.
>>> figure = results.filter(column_names=results.continuous_column_names, methods=['jensen_shannon']).plot(kind='distribution')
>>> figure.show()
NannyML plots stacked bar charts for categorical variables to show the variable’s distribution for each chunk. If a variable has more than 5 categories, the top 4 are displayed and the rest are grouped together to make the plots easier to view. In addition, the chunks where drift was detected are highlighted.
We can create stacked bar charts for the model’s categorical variables with the code below.
>>> figure = results.filter(column_names=results.categorical_column_names, methods=['chi2']).plot(kind='distribution')
>>> figure.show()
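The top-5 grouping rule described above can be sketched as follows (a plain-Python illustration of the behavior, not NannyML's plotting code):

```python
from collections import Counter

def group_top_categories(values, keep=4):
    # Keep the `keep` most frequent categories and lump the rest into "other"
    counts = Counter(values)
    kept = dict(counts.most_common(keep))
    other = sum(c for cat, c in counts.items() if cat not in kept)
    if other:
        kept["other"] = other
    return kept

values = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 1 + ["f"] * 1
print(group_top_categories(values))  # {'a': 5, 'b': 4, 'c': 3, 'd': 2, 'other': 2}
```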
Insights
After reviewing the above results, we have a good understanding of what has changed in our model's population.
What Next
The Performance Estimation functionality of NannyML can help provide estimates of the impact of the observed changes on model performance. The ranking functionality can help rank drifted features to suggest which ones to prioritize for further investigation if needed. This would be an ad hoc investigation that is not covered by NannyML.