Drift Detection for Multiclass Classification Model Targets
Why Perform Drift Detection for Model Targets
The performance of a machine learning model can be affected if the distribution of targets changes. The target distribution can change because of data drift as well as because of label shift.
A change in the target distribution may mean that the business assumptions under which the model is used need to be revisited.
NannyML uses TargetDistributionCalculator to monitor drift in the target distribution. It can calculate the mean occurrence of positive events for binary classification problems.
It can also calculate the chi-squared statistic (from the chi-squared test) of the available target values for each chunk, for both binary and multiclass classification problems.
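To make the chi-squared comparison concrete, here is a minimal pure-Python sketch of a Pearson chi-squared statistic between the reference targets and the targets of one chunk. It is an illustration under stated assumptions, not NannyML's internal implementation: the helper name chi2_statistic is hypothetical, and missing target values are assumed to have been dropped beforehand.

```python
from collections import Counter


def chi2_statistic(reference_targets, chunk_targets):
    """Pearson chi-squared statistic for a 2 x k table (reference vs. chunk).

    Hypothetical helper for illustration only; assumes both inputs are
    non-empty and contain no missing values.
    """
    ref_counts = Counter(reference_targets)
    chunk_counts = Counter(chunk_targets)
    classes = sorted(set(ref_counts) | set(chunk_counts))
    n_ref = sum(ref_counts.values())
    n_chunk = sum(chunk_counts.values())
    total = n_ref + n_chunk

    statistic = 0.0
    for label in classes:
        class_total = ref_counts[label] + chunk_counts[label]
        for observed, row_total in (
            (ref_counts[label], n_ref),
            (chunk_counts[label], n_chunk),
        ):
            # Expected count if both samples shared the same class proportions.
            expected = row_total * class_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up target values, not taken from the NannyML dataset.
reference = ["prepaid_card"] * 50 + ["upmarket_card"] * 50
chunk = ["prepaid_card"] * 30 + ["upmarket_card"] * 70
print(chi2_statistic(reference, chunk))
```

A statistic of 0 means the chunk has exactly the same class proportions as the reference data; larger values indicate a larger difference between the two.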
Note
The Target Drift detection process can handle missing target values across all data periods.
Note
The following example uses timestamps. These are optional but have an impact on the way data is chunked and results are plotted. You can read more about them in the data requirements.
Just The Code
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, analysis_target_df = nml.load_synthetic_multiclass_classification_dataset()
>>> analysis_df = analysis_df.merge(analysis_target_df, on='identifier')
>>> display(reference_df.head(3))
>>> calc = nml.TargetDistributionCalculator(
...     y_true='y_true',
...     timestamp_column_name='timestamp',
...     problem_type='classification_multiclass'
... )
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.data.head(3))
>>> target_drift_fig = results.plot(kind='target_drift', plot_reference=True)
>>> target_drift_fig.show()
>>> target_distribution_fig = results.plot(kind='target_distribution', plot_reference=True)
>>> target_distribution_fig.show()
Walkthrough
In order to monitor a model, NannyML needs to learn about it from a reference dataset. Then it can monitor the data that is subject to actual analysis, provided as the analysis dataset. You can read more about this in our section on data periods.
Let’s start by loading some synthetic data provided by the NannyML package, and setting it up as our reference and analysis dataframes. This synthetic data is for a multiclass classification model.
The analysis_target_df dataframe contains the target results of the analysis period. This is kept separate in the synthetic data because it is not used during performance estimation. But it is required to detect drift for the targets, so the first thing we need to do in this case is set up the right data in the right dataframes. The analysis target values are joined on the analysis frame by the identifier column.
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, analysis_target_df = nml.load_synthetic_multiclass_classification_dataset()
>>> analysis_df = analysis_df.merge(analysis_target_df, on='identifier')
>>> display(reference_df.head(3))
| | acq_channel | app_behavioral_score | requested_credit_limit | app_channel | credit_bureau_score | stated_income | is_customer | period | identifier | timestamp | y_pred_proba_prepaid_card | y_pred_proba_highstreet_card | y_pred_proba_upmarket_card | y_pred | y_true |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Partner3 | 1.80823 | 350 | web | 309 | 15000 | True | reference | 60000 | 2020-05-02 02:01:30 | 0.97 | 0.03 | 0 | prepaid_card | prepaid_card |
| 1 | Partner2 | 4.38257 | 500 | mobile | 418 | 23000 | True | reference | 60001 | 2020-05-02 02:03:33 | 0.87 | 0.13 | 0 | prepaid_card | prepaid_card |
| 2 | Partner2 | -0.787575 | 400 | web | 507 | 24000 | False | reference | 60002 | 2020-05-02 02:04:49 | 0.47 | 0.35 | 0.18 | prepaid_card | upmarket_card |
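The join described above can be sanity-checked on a toy example. This is a minimal sketch, assuming pandas is available (the example already works with pandas dataframes); the two small frames are made-up stand-ins for analysis_df and analysis_target_df, and validate="one_to_one" is an optional pandas check rather than something the NannyML example requires.

```python
import pandas as pd

# Made-up stand-ins for the analysis data and its separately stored targets.
toy_analysis = pd.DataFrame({
    "identifier": [1, 2, 3],
    "y_pred": ["prepaid_card", "upmarket_card", "prepaid_card"],
})
toy_targets = pd.DataFrame({
    "identifier": [1, 2, 3],
    "y_true": ["prepaid_card", "highstreet_card", "prepaid_card"],
})

# validate="one_to_one" makes pandas raise a MergeError if an identifier
# appears more than once on either side of the join.
merged = toy_analysis.merge(toy_targets, on="identifier", validate="one_to_one")

# The default inner join drops rows whose identifier has no match,
# so comparing row counts shows whether any analysis rows were lost.
rows_lost = len(toy_analysis) - len(merged)
print(merged)
print(rows_lost)
```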
Now that the data is in place we’ll create a new TargetDistributionCalculator, instantiating it with the appropriate parameters. We only need the target column (y_true), the timestamp column and the problem type.
>>> calc = nml.TargetDistributionCalculator(
...     y_true='y_true',
...     timestamp_column_name='timestamp',
...     problem_type='classification_multiclass'
... )
Afterwards, the fit() method gets called on the reference period, which represents an accepted target distribution against which we will compare the analysis period.
Then the calculate() method is called to calculate the target drift results on the data provided. We use the previously assembled data as an argument.
We can display the results of this calculation in a dataframe.
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.data.head(3))
| | key | chunk_index | start_index | end_index | start_date | end_date | period | targets_missing_rate | metric_target_drift | statistical_target_drift | p_value | thresholds | alert | significant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:5999] | 0 | 0 | 5999 | 2020-09-01 03:10:01 | 2020-09-13 16:15:10 | | 0 | nan | 0.521545 | 0.770456 | 0.05 | False | False |
| 1 | [6000:11999] | 1 | 6000 | 11999 | 2020-09-13 16:15:32 | 2020-09-25 19:48:42 | | 0 | nan | 2.11226 | 0.3478 | 0.05 | False | False |
| 2 | [12000:17999] | 2 | 12000 | 17999 | 2020-09-25 19:50:04 | 2020-10-08 02:53:47 | | 0 | nan | 0.940108 | 0.624969 | 0.05 | False | False |
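The p_value and alert columns can be cross-checked by hand. The sketch below assumes the chi-squared test here has 2 degrees of freedom (three target classes compared across two samples, reference and chunk), in which case the chi-squared survival function reduces to exp(-x / 2). It also assumes a chunk is flagged when its p-value falls below the value in the thresholds column. Neither assumption is stated in the text, but both are consistent with the three rows shown.

```python
import math

# (statistical_target_drift, p_value, thresholds, alert), copied from the
# three result rows displayed above.
rows = [
    (0.521545, 0.770456, 0.05, False),
    (2.11226, 0.3478, 0.05, False),
    (0.940108, 0.624969, 0.05, False),
]


def chi2_sf_dof2(statistic):
    """Survival function of a chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-statistic / 2)


for statistic, reported_p, threshold, reported_alert in rows:
    p_value = chi2_sf_dof2(statistic)
    alert = p_value < threshold  # assumed alert rule, not confirmed by the source
    print(f"statistic={statistic}: p={p_value:.6f} (reported {reported_p}), alert={alert}")
```

With all three p-values well above 0.05, none of these chunks shows a statistically significant change in the target distribution.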
The results can also be easily plotted by using the plot() method.
>>> target_drift_fig = results.plot(kind='target_drift', plot_reference=True)
>>> target_drift_fig.show()
>>> target_distribution_fig = results.plot(kind='target_distribution', plot_reference=True)
>>> target_distribution_fig.show()
Warning
Since our target data contains non-numerical values and more than 2 unique values, we currently don’t support plotting using the distribution='metric' parameter. NannyML will print out warnings to inform you about this:
UserWarning: the target column contains 3 unique values. NannyML cannot provide a value for 'metric_target_drift' when there are more than 2 unique values. All 'metric_target_drift' values will be set to np.NAN
UserWarning: the target column contains non-numerical values. NannyML cannot provide a value for 'metric_target_drift'.All 'metric_target_drift' values will be set to np.NAN
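The rule behind these warnings can be sketched in a few lines. This is an illustration of the behaviour the warning messages describe, not NannyML's code: the helper mean_target_or_nan is hypothetical, and it treats the mean of a binary 0/1 target as the mean occurrence of positive events.

```python
import math
from numbers import Number


def mean_target_or_nan(targets):
    """Mean of the targets, or NaN when no meaningful mean exists.

    Hypothetical helper: returns NaN for empty input, non-numerical
    values, or more than 2 unique values.
    """
    unique_values = set(targets)
    if not targets or len(unique_values) > 2:
        return math.nan
    # bool is a Number subclass in Python, so True/False targets count as numerical.
    if not all(isinstance(value, Number) for value in unique_values):
        return math.nan
    return sum(targets) / len(targets)


print(mean_target_or_nan([0, 1, 1, 0]))
print(mean_target_or_nan(["prepaid_card", "upmarket_card", "highstreet_card"]))
```

The multiclass targets in this tutorial fail both conditions, which is why every metric_target_drift value in the results is nan.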
What Next
The Monitoring Realized Performance functionality of NannyML can add context to the target drift results, showing whether there are associated performance changes.