Drift Detection for Regression Model Targets

Why Perform Drift Detection for Model Targets

The performance of a machine learning model can be affected if the distribution of targets changes. The target distribution can change both because of data drift but also because of label shift.

A change in the target distribution may mean that business assumptions on which the model is used may need to be revisited.

NannyML uses TargetDistributionCalculator in order to monitor drift in the Target distribution. It can calculate the KS statistic (from the Kolmogorov-Smirnov test) for aggregated drift results but also show the target distribution results per chunk with joyploys.

Note

The Target Drift detection process can handle missing target values across all data periods.

Note

The following example uses timestamps. These are optional but have an impact on the way data is chunked and results are plotted. You can read more about them in the data requirements.

Just The Code

>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df = nml.load_synthetic_car_price_dataset()[0]
>>> analysis_df = nml.load_synthetic_car_price_dataset()[1]
>>> analysis_target_df = nml.load_synthetic_car_price_dataset()[2]
>>> analysis_df = analysis_df.join(analysis_target_df)

>>> display(reference_df.head(3))

>>> calc = nml.TargetDistributionCalculator(
...     y_true='y_true',
...     timestamp_column_name='timestamp',
...     problem_type='regression',
>>> )

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.data)

>>> display(results.calculator.previous_reference_results)

>>> target_drift_fig = results.plot(kind='target_drift', plot_reference=True)
>>> target_drift_fig.show()

>>> target_distribution_fig = results.plot(kind='target_distribution', plot_reference=True)
>>> target_distribution_fig.show()

Walkthrough

In order to monitor a model, NannyML needs to learn about it from a reference dataset. Then it can monitor the data that is subject to actual analysis, provided as the analysis dataset. You can read more about this in our section on data periods.

Let’s start by loading some synthetic car pricing data provided by the NannyML package, and setting it up as our reference and analysis dataframes.

The analysis_targets dataframe contains the target results of the analysis period. This is kept separate in the synthetic data because it is not used during performance estimation. But it is required to detect drift for the targets, so the first thing we need to in this case is set up the right data in the right dataframes. The analysis target values are expected to be ordered correctly, just like in sklearn.

>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df = nml.load_synthetic_car_price_dataset()[0]
>>> analysis_df = nml.load_synthetic_car_price_dataset()[1]
>>> analysis_target_df = nml.load_synthetic_car_price_dataset()[2]
>>> analysis_df = analysis_df.join(analysis_target_df)

>>> display(reference_df.head(3))

	car_age	km_driven	price_new	accident_count	door_count	fuel	transmission	y_true	y_pred	timestamp
0	15	144020	42810	4	3	diesel	automatic	569	1246	2017-01-24 08:00:00.000
1	12	57078	31835	3	3	electric	automatic	4277	4924	2017-01-24 08:00:33.600
2	2	76288	31851	3	5	diesel	automatic	7011	5744	2017-01-24 08:01:07.200

Now that the data is in place we’ll create a new TargetDistributionCalculator instantiating it with the appropriate parameters. We need the name for the target, y_true, and the timestamp columns. We also need to specify the machine learning problem we are working on.

>>> calc = nml.TargetDistributionCalculator(
...     y_true='y_true',
...     timestamp_column_name='timestamp',
...     problem_type='regression',
>>> )

Afterwards, the fit() method gets called on the reference period, which represent an accepted target distribution which we will compare against the analysis period.

Then the calculate() method is called to calculate the target drift results on the data provided. We use the previously assembled data as an argument.

We can display the results of this calculation in a dataframe.

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.data)

	key	chunk_index	start_index	end_index	start_date	end_date	metric_target_drift	statistical_target_drift	p_value	thresholds	alert	significant
0	[0:5999]	0	0	5999	2017-02-16 16:00:00	2017-02-18 23:59:26.400000	4862.94	0.01425	0.215879	0.05	False	False
1	[6000:11999]	1	6000	11999	2017-02-19 00:00:00	2017-02-21 07:59:26.400000	4790.58	0.0165667	0.0990255	0.05	False	False
2	[12000:17999]	2	12000	17999	2017-02-21 08:00:00	2017-02-23 15:59:26.400000	4793.35	0.0100667	0.634331	0.05	False	False
3	[18000:23999]	3	18000	23999	2017-02-23 16:00:00	2017-02-25 23:59:26.400000	4838.26	0.0119167	0.4175	0.05	False	False
4	[24000:29999]	4	24000	29999	2017-02-26 00:00:00	2017-02-28 07:59:26.400000	4799.13	0.00866667	0.803771	0.05	False	False
5	[30000:35999]	5	30000	35999	2017-02-28 08:00:00	2017-03-02 15:59:26.400000	4852.64	0.171683	4.6704e-141	0.05	True	True
6	[36000:41999]	6	36000	41999	2017-03-02 16:00:00	2017-03-04 23:59:26.400000	4875.46	0.180117	2.5805e-155	0.05	True	True
7	[42000:47999]	7	42000	47999	2017-03-05 00:00:00	2017-03-07 07:59:26.400000	4867.59	0.179067	1.67957e-153	0.05	True	True
8	[48000:53999]	8	48000	53999	2017-03-07 08:00:00	2017-03-09 15:59:26.400000	4885.11	0.183233	9.21428e-161	0.05	True	True
9	[54000:59999]	9	54000	59999	2017-03-09 16:00:00	2017-03-11 23:59:26.400000	4787.09	0.187383	3.64399e-168	0.05	True	True

We can also display the results from the reference dataframe.

>>> display(results.calculator.previous_reference_results)

	key	chunk_index	start_index	end_index	start_date	end_date	metric_target_drift	statistical_target_drift	p_value	thresholds	alert	significant
0	[0:5999]	0	0	5999	2017-01-24 08:00:00	2017-01-26 15:59:26.400000	4894.48	0.0161	0.116973	0.05	False	False
1	[6000:11999]	1	6000	11999	2017-01-26 16:00:00	2017-01-28 23:59:26.400000	4767.13	0.0120167	0.407002	0.05	False	False
2	[12000:17999]	2	12000	17999	2017-01-29 00:00:00	2017-01-31 07:59:26.400000	4744	0.0119667	0.412231	0.05	False	False
3	[18000:23999]	3	18000	23999	2017-01-31 08:00:00	2017-02-02 15:59:26.400000	4791.32	0.0122667	0.381454	0.05	False	False
4	[24000:29999]	4	24000	29999	2017-02-02 16:00:00	2017-02-04 23:59:26.400000	4827.68	0.0100833	0.632257	0.05	False	False
5	[30000:35999]	5	30000	35999	2017-02-05 00:00:00	2017-02-07 07:59:26.400000	4811.59	0.0095	0.704746	0.05	False	False
6	[36000:41999]	6	36000	41999	2017-02-07 08:00:00	2017-02-09 15:59:26.400000	4773.53	0.00613333	0.985698	0.05	False	False
7	[42000:47999]	7	42000	47999	2017-02-09 16:00:00	2017-02-11 23:59:26.400000	4766	0.01155	0.457293	0.05	False	False
8	[48000:53999]	8	48000	53999	2017-02-12 00:00:00	2017-02-14 07:59:26.400000	4750.15	0.00665	0.968017	0.05	False	False
9	[54000:59999]	9	54000	59999	2017-02-14 08:00:00	2017-02-16 15:59:26.400000	4744.8	0.00786667	0.885651	0.05	False	False

The results can be also easily plotted by using the plot() method. We first plot the KS Statistic drift results for each chunk.

>>> target_drift_fig = results.plot(kind='target_drift', plot_reference=True)
>>> target_drift_fig.show()

Note that a dashed line, instead of a solid line, will be used for chunks that have missing target values.

And then we create the joyplot to visualize the target distribution values for each chunk.

>>> target_distribution_fig = results.plot(kind='target_distribution', plot_reference=True)
>>> target_distribution_fig.show()

Insights

Looking at the results we can see that there has been some target drift towards lower car prices. We should also check to see if the performance of our model has been affected through realized performance monitoring. Lastly we would need to check with the business stakeholders to see if the changes observed can affect the company’s sales and marketing policies.

What Next

The Monitoring Realized Performance functionality of NannyML can can add context to the target drift results showing whether there are associated performance changes. Moreover the Univariate Drift Detection as well as the Multivariate Drift Detection can add further context if needed.