Univariate Drift Detection

Why Perform Univariate Drift Detection

Univariate Drift Detection looks at each feature individually and checks whether its distribution has changed. It’s a simple, fully explainable form of data drift detection and is the most straightforward to understand and communicate.

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df = nml.load_synthetic_binary_classification_dataset()[0]
>>> analysis_df = nml.load_synthetic_binary_classification_dataset()[1]
>>> display(reference_df.head())

>>> feature_column_names = [
...     col for col in reference_df.columns if col not in [
...         'timestamp', 'y_pred_proba', 'period', 'y_pred', 'work_home_actual', 'identifier'
...     ]
... ]
>>> calc = nml.UnivariateStatisticalDriftCalculator(
...     feature_column_names=feature_column_names,
...     timestamp_column_name='timestamp'
... )

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.data.iloc[:, :9])

>>> display(calc.previous_reference_results.iloc[:, :9])

>>> for feature in calc.feature_column_names:
...     drift_fig = results.plot(
...         kind='feature_drift',
...         feature_column_name=feature,
...         plot_reference=True
...     )
...     drift_fig.show()

>>> for cont_feat in calc.continuous_column_names:
...     figure = results.plot(
...         kind='feature_distribution',
...         feature_column_name=cont_feat,
...         plot_reference=True
...     )
...     figure.show()

>>> for cat_feat in calc.categorical_column_names:
...     figure = results.plot(
...         kind='feature_distribution',
...         feature_column_name=cat_feat,
...         plot_reference=True
...     )
...     figure.show()

>>> ranker = nml.Ranker.by('alert_count')
>>> ranked_features = ranker.rank(results, only_drifting = False)
>>> display(ranked_features)

Walkthrough

NannyML’s univariate approach to data drift looks at each variable individually and runs statistical tests that compare each chunk of the analysis period with the reference period. You can read more about the data required in our section on data periods.
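To make the idea of chunks concrete, here is a minimal sketch of splitting rows into consecutive fixed-size chunks. It assumes size-based chunking with 5,000 rows per chunk and labels each chunk with a `[start:end]` key. The `make_chunks` helper is hypothetical and not part of the NannyML API, and NannyML's own chunking logic may differ.

```python
def make_chunks(n_rows, chunk_size=5000):
    """Split row positions 0..n_rows-1 into consecutive fixed-size chunks.

    Hypothetical helper for illustration; not part of the NannyML API.
    """
    chunks = []
    for start in range(0, n_rows, chunk_size):
        end = min(start + chunk_size, n_rows) - 1  # inclusive end index
        chunks.append({"key": f"[{start}:{end}]", "start_index": start, "end_index": end})
    return chunks


chunks = make_chunks(50_000)
print(len(chunks))        # number of chunks
print(chunks[0]["key"])   # key of the first chunk
print(chunks[-1]["key"])  # key of the last chunk
```

Each chunk is then tested against the reference data on its own, so drift results come out as one row per chunk.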

NannyML uses the two-sample Kolmogorov-Smirnov test for continuous features and the chi-squared test for categorical features. Both tests produce a statistic that measures the observed drift and a p-value that indicates how likely a result at least this extreme would be if there were no drift.

If the p-value is less than 0.05, NannyML considers the result unlikely to be due to chance and issues an alert for the associated chunk and feature.
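As a rough illustration of the test for continuous features, the sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand and applies the 0.05 alert rule. It is standard-library Python on simulated data, not NannyML's implementation. The helpers `ks_statistic` and `ks_p_value` are hypothetical names, and the p-value uses the asymptotic Kolmogorov series, which is only an approximation.

```python
import bisect
import math
import random


def ks_statistic(reference, analysis):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, ana = sorted(reference), sorted(analysis)
    d = 0.0
    for x in ref + ana:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_ana = bisect.bisect_right(ana, x) / len(ana)
        d = max(d, abs(cdf_ref - cdf_ana))
    return d


def ks_p_value(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series converges slowly here; the p-value is effectively 1
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2) for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * series))


random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [random.gauss(0.5, 1.0) for _ in range(2000)]  # simulated drift in the mean

d = ks_statistic(reference, shifted)
p = ks_p_value(d, len(reference), len(shifted))
alert = p < 0.05  # the alert rule described above
print(f"statistic={d:.3f}, p_value={p:.3g}, alert={alert}")
```

The statistic is the size of the drift signal, while the p-value and the threshold decide whether that signal is strong enough to raise an alert.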

We begin by loading some synthetic data provided in the NannyML package. This is data for a binary classification model, but other model types operate in the same way.

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df = nml.load_synthetic_binary_classification_dataset()[0]
>>> analysis_df = nml.load_synthetic_binary_classification_dataset()[1]
>>> display(reference_df.head())

	distance_from_office	salary_range	gas_price_per_litre	public_transportation_cost	wfh_prev_workday	workday	tenure	identifier	work_home_actual	timestamp	y_pred_proba	period	y_pred
0	5.96225	40K - 60K €	2.11948	8.56806	False	Friday	0.212653	0	1	2014-05-09 22:27:20	0.99	reference	1
1	0.535872	40K - 60K €	2.3572	5.42538	True	Tuesday	4.92755	1	0	2014-05-09 22:59:32	0.07	reference	0
2	1.96952	40K - 60K €	2.36685	8.24716	False	Monday	0.520817	2	1	2014-05-09 23:48:25	1	reference	1
3	2.53041	20K - 40K €	2.31872	7.94425	False	Tuesday	0.453649	3	1	2014-05-10 01:12:09	0.98	reference	1
4	2.25364	60K+ €	2.22127	8.88448	True	Thursday	5.69526	4	1	2014-05-10 02:21:34	0.99	reference	1

The UnivariateStatisticalDriftCalculator class implements the functionality needed for Univariate Drift Detection. We instantiate it with two parameters: the column names of the features we want to run drift detection on, and the name of the timestamp column. The features can be passed in as a plain list of strings; here we build that list by excluding the dataframe columns that are not features.

>>> feature_column_names = [
...     col for col in reference_df.columns if col not in [
...         'timestamp', 'y_pred_proba', 'period', 'y_pred', 'work_home_actual', 'identifier'
...     ]
... ]
>>> calc = nml.UnivariateStatisticalDriftCalculator(
...     feature_column_names=feature_column_names,
...     timestamp_column_name='timestamp'
... )

Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with. Then the calculate() method will calculate the drift results on the data provided to it.

We then display a small subset of our results by selecting the first nine columns of the results.data dataframe.

NannyML returns a dataframe with one row per chunk. After the columns that identify the chunk, each feature gets a column with the test statistic, a column with the corresponding p-value, a column that says whether there is a drift alert for that feature and chunk, and a column with the p-value threshold used for the alert.

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.data.iloc[:, :9])

	key	start_index	end_index	start_date	end_date	salary_range_chi2	salary_range_p_value	salary_range_alert	salary_range_threshold
0	[0:4999]	0	4999	2017-08-31 04:20:00	2018-01-02 00:45:44	1.03368	0.793	False	0.05
1	[5000:9999]	5000	9999	2018-01-02 01:13:11	2018-05-01 13:10:10	5.76241	0.124	False	0.05
2	[10000:14999]	10000	14999	2018-05-01 14:25:25	2018-09-01 15:40:40	2.65396	0.448	False	0.05
3	[15000:19999]	15000	19999	2018-09-01 16:19:07	2018-12-31 10:11:21	0.0708428	0.995	False	0.05
4	[20000:24999]	20000	24999	2018-12-31 10:38:45	2019-04-30 11:01:30	1.00542	0.8	False	0.05
5	[25000:29999]	25000	29999	2019-04-30 11:02:00	2019-09-01 00:24:27	455.622	0	True	0.05
6	[30000:34999]	30000	34999	2019-09-01 00:28:54	2019-12-31 09:09:12	428.633	0	True	0.05
7	[35000:39999]	35000	39999	2019-12-31 10:07:15	2020-04-30 11:46:53	453.247	0	True	0.05
8	[40000:44999]	40000	44999	2020-04-30 12:04:32	2020-09-01 02:46:02	438.26	0	True	0.05
9	[45000:49999]	45000	49999	2020-09-01 02:46:13	2021-01-01 04:29:32	474.892	0	True	0.05
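To see how the alert column follows from the statistic, the sketch below converts a few of the chi-squared statistics from the table into p-values and compares them with the 0.05 threshold. It assumes 3 degrees of freedom. That is an inference, not something stated here: it reproduces the p-values shown and would correspond to four salary categories. The `chi2_sf_3dof` helper is a hypothetical name and uses the closed form of the chi-squared survival function that holds for 3 degrees of freedom only.

```python
import math


def chi2_sf_3dof(x):
    """Survival function P(X > x) of a chi-squared variable with 3 degrees of freedom.

    Closed form, valid for 3 degrees of freedom only (an assumption of this sketch).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


THRESHOLD = 0.05

# Chi-squared statistics copied from the salary_range_chi2 column above.
examples = [("[0:4999]", 1.03368), ("[5000:9999]", 5.76241), ("[25000:29999]", 455.622)]

for chunk_key, statistic in examples:
    p_value = chi2_sf_3dof(statistic)
    alert = p_value < THRESHOLD
    print(chunk_key, round(p_value, 3), alert)
```

The first two chunks come out above the threshold and the third far below it, which matches the alert column in the table.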

The drift results from the reference data are accessible through the previous_reference_results property of the drift calculator:

>>> display(calc.previous_reference_results.iloc[:, :9])

	key	start_index	end_index	start_date	end_date	salary_range_chi2	salary_range_p_value	salary_range_alert	salary_range_threshold
0	[0:4999]	0	4999	2014-05-09 22:27:20	2014-09-09 08:18:27	2.89878	0.407	False	0.05
1	[5000:9999]	5000	9999	2014-09-09 09:13:35	2015-01-09 00:02:51	3.14439	0.37	False	0.05
2	[10000:14999]	10000	14999	2015-01-09 00:04:43	2015-05-09 15:54:26	2.45188	0.484	False	0.05
3	[15000:19999]	15000	19999	2015-05-09 16:02:08	2015-09-07 07:14:37	4.06262	0.255	False	0.05
4	[20000:24999]	20000	24999	2015-09-07 07:27:47	2016-01-08 16:02:05	2.41399	0.491	False	0.05
5	[25000:29999]	25000	29999	2016-01-08 17:22:00	2016-05-09 11:09:39	3.79606	0.284	False	0.05
6	[30000:34999]	30000	34999	2016-05-09 11:19:36	2016-09-04 03:30:35	3.22884	0.358	False	0.05
7	[35000:39999]	35000	39999	2016-09-04 04:09:35	2017-01-03 18:48:21	1.3933	0.707	False	0.05
8	[40000:44999]	40000	44999	2017-01-03 19:00:51	2017-05-03 02:34:24	0.304785	0.959	False	0.05
9	[45000:49999]	45000	49999	2017-05-03 02:49:38	2017-08-31 03:10:29	2.98758	0.394	False	0.05

NannyML can also visualize those results on plots.

>>> for feature in calc.feature_column_names:
...     drift_fig = results.plot(
...         kind='feature_drift',
...         feature_column_name=feature,
...         plot_reference=True
...     )
...     drift_fig.show()

NannyML also shows details about the distributions of continuous and categorical variables. For continuous variables, NannyML plots the estimated probability distribution of the variable for each chunk in a plot called a joyplot. The chunks where drift was detected are highlighted. We can create joyplots for the model’s continuous variables with the code below:

>>> for cont_feat in calc.continuous_column_names:
...     figure = results.plot(
...         kind='feature_distribution',
...         feature_column_name=cont_feat,
...         plot_reference=True
...     )
...     figure.show()

NannyML can plot the same kind of distribution detail for categorical features, again highlighting the chunks with possible data drift. To focus on the categorical plots only, we loop over the calculator's categorical columns.

For categorical variables, NannyML plots stacked bar charts to show the variable’s distribution for each chunk. If a variable has more than 5 categories, the top 4 are displayed and the rest are grouped together to make the plots easier to view. We can create stacked bar charts for the model’s categorical variables with the code below:

>>> for cat_feat in calc.categorical_column_names:
...     figure = results.plot(
...         kind='feature_distribution',
...         feature_column_name=cat_feat,
...         plot_reference=True
...     )
...     figure.show()
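The grouping rule for variables with many categories can be sketched as follows. This is a minimal illustration of the "top 4 plus the rest" idea, not NannyML's plotting code. The helper name `group_rare_categories` and the "Other" label are assumptions made for the example.

```python
from collections import Counter


def group_rare_categories(values, max_categories=5, top_n=4, other_label="Other"):
    """Keep the top_n most frequent categories and merge the rest into one bucket.

    Grouping only applies when there are more than max_categories distinct values.
    """
    counts = Counter(values)
    if len(counts) <= max_categories:
        return dict(counts)
    top = counts.most_common(top_n)
    grouped = dict(top)
    grouped[other_label] = sum(counts.values()) - sum(count for _, count in top)
    return grouped


# Six distinct categories, so the two rarest are merged into "Other".
workdays = ["Mon"] * 6 + ["Tue"] * 5 + ["Wed"] * 4 + ["Thu"] * 3 + ["Fri"] * 2 + ["Sat"] * 1
print(group_rare_categories(workdays))
```

Applying the same grouping to every chunk keeps the stacked bars comparable from one chunk to the next.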

NannyML can also rank features by the number of alerts they raised within the data analyzed for drift. The only_drifting argument controls whether the ranking covers all the model inputs or just the ones that have drifted. NannyML returns a dataframe with the resulting ranking of features.

>>> ranker = nml.Ranker.by('alert_count')
>>> ranked_features = ranker.rank(results, only_drifting = False)
>>> display(ranked_features)

	feature	number_of_alerts	rank
0	distance_from_office	5	1
1	salary_range	5	2
2	public_transportation_cost	5	3
3	wfh_prev_workday	5	4
4	tenure	2	5
5	gas_price_per_litre	0	6
6	workday	0	7
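The idea behind ranking by alert count can be sketched in a few lines of plain Python. This is an illustration rather than the Ranker implementation: `rank_by_alert_count` is a hypothetical helper, the way ties are broken is an assumption, and the per-chunk flags below are only partly taken from the results above.

```python
def rank_by_alert_count(alerts_by_feature, only_drifting=False):
    """Rank features by their number of drift alerts, highest first.

    alerts_by_feature maps a feature name to its per-chunk alert flags.
    Ties keep their input order (an assumption made for this sketch).
    """
    counts = [(feature, sum(flags)) for feature, flags in alerts_by_feature.items()]
    if only_drifting:
        counts = [(feature, n) for feature, n in counts if n > 0]
    counts.sort(key=lambda item: item[1], reverse=True)  # stable sort keeps tie order
    return [
        {"feature": feature, "number_of_alerts": n, "rank": rank}
        for rank, (feature, n) in enumerate(counts, start=1)
    ]


alerts = {
    "gas_price_per_litre": [False] * 10,           # no alerts, as in the ranking above
    "salary_range": [False] * 5 + [True] * 5,      # alerts in the last five chunks
    "tenure": [False] * 8 + [True] * 2,            # two alerts; their positions are made up
}

for row in rank_by_alert_count(alerts):
    print(row)
```

Passing `only_drifting=True` to this helper drops the features with no alerts, mirroring the choice between ranking all inputs and ranking only the drifting ones.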

Insights

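After reviewing the above results we have a good understanding of what has changed in our model’s population. In the ranking, distance_from_office, salary_range, public_transportation_cost and wfh_prev_workday each raised alerts in five chunks, tenure raised two, and gas_price_per_litre and workday raised none.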

What Next

The Performance Estimation functionality of NannyML can help estimate the impact of the observed changes on model performance.

If needed, we can investigate further why our population characteristics have changed the way they did. This is an ad hoc investigation that is not covered by NannyML.