Drift Detection for Model Outputs
Why Perform Drift Detection for Model Outputs
The distribution of model outputs tells us how likely our population will do what the model predicts. If the model’s population changes, then our populations’ actions will be different. The difference in actions is very important to know as soon as possible because they directly affect the business results from operating a machine learning model.
Just The Code
If you just want the code to experiment yourself within a Jupyter Notebook, here you go:
>>> import nannyml as nml
>>> import pandas as pd
>>> from IPython.display import display
>>> reference, analysis, analysis_target = nml.load_synthetic_binary_classification_dataset()
>>> metadata = nml.extract_metadata(data = reference, model_name='wfh_predictor', model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.target_column_name = 'work_home_actual'
>>> display(reference.head())
>>> # Let's initialize the object that will perform the Univariate Drift calculations
>>> # Let's use a chunk size of 5000 data points to create our drift statistics
>>> univariate_calculator = nml.UnivariateStatisticalDriftCalculator(model_metadata=metadata, chunk_size=5000)
>>> univariate_calculator = univariate_calculator.fit(reference_data=reference)
>>> # let's see drift statistics for all available data
>>> data = pd.concat([reference, analysis], ignore_index=True)
>>> univariate_results = univariate_calculator.calculate(data=data)
>>> # let's view a small subset of our results:
>>> # We use the data property of the results class to view the relevant data.
>>> y_pred_proba_result_columns = list(univariate_results.data.columns)[:5] + [s for s in list(univariate_results.data.columns) if "y_pred_proba" in s]
>>> display(univariate_results.data[y_pred_proba_result_columns][-7:-3])
>>> figure = univariate_results.plot(kind='prediction_drift', metric='statistic')
>>> figure.show()
>>> figure = univariate_results.plot(kind='prediction_distribution', metric='statistic')
>>> figure.show()
Walkthrough on Drift Detection for Model Outputs
NannyML detects data drift for Model Outputs using the Univariate Drift Detection methodology.
Let’s start by loading some synthetic data provided by the NannyML package.
>>> import nannyml as nml
>>> import pandas as pd
>>> from IPython.display import display
>>> reference, analysis, analysis_target = nml.load_synthetic_binary_classification_dataset()
>>> metadata = nml.extract_metadata(data = reference, model_name='wfh_predictor', model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.target_column_name = 'work_home_actual'
>>> display(reference.head())
distance_from_office |
salary_range |
gas_price_per_litre |
public_transportation_cost |
wfh_prev_workday |
workday |
tenure |
identifier |
work_home_actual |
timestamp |
y_pred_proba |
partition |
y_pred |
|
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 |
5.96225 |
40K - 60K € |
2.11948 |
8.56806 |
False |
Friday |
0.212653 |
0 |
1 |
2014-05-09 22:27:20 |
0.99 |
reference |
1 |
1 |
0.535872 |
40K - 60K € |
2.3572 |
5.42538 |
True |
Tuesday |
4.92755 |
1 |
0 |
2014-05-09 22:59:32 |
0.07 |
reference |
0 |
2 |
1.96952 |
40K - 60K € |
2.36685 |
8.24716 |
False |
Monday |
0.520817 |
2 |
1 |
2014-05-09 23:48:25 |
1 |
reference |
1 |
3 |
2.53041 |
20K - 20K € |
2.31872 |
7.94425 |
False |
Tuesday |
0.453649 |
3 |
1 |
2014-05-10 01:12:09 |
0.98 |
reference |
1 |
4 |
2.25364 |
60K+ € |
2.22127 |
8.88448 |
True |
Thursday |
5.69526 |
4 |
1 |
2014-05-10 02:21:34 |
0.99 |
reference |
1 |
The UnivariateStatisticalDriftCalculator
class implements the functionality needed for drift detection in model outputs as well.
Following the process shown at Univariate Drift Detection
UnivariateStatisticalDriftCalculator
is instantiated with appropriate parameters and the
fit()
method
is called on the reference data where results will be based off. Then the
calculate()
method
calculates the drift results on the data provided to it. An example using it can be seen below:
>>> # Let's initialize the object that will perform the Univariate Drift calculations
>>> # Let's use a chunk size of 5000 data points to create our drift statistics
>>> univariate_calculator = nml.UnivariateStatisticalDriftCalculator(model_metadata=metadata, chunk_size=5000)
>>> univariate_calculator = univariate_calculator.fit(reference_data=reference)
>>> # let's see drift statistics for all available data
>>> data = pd.concat([reference, analysis], ignore_index=True)
>>> univariate_results = univariate_calculator.calculate(data=data)
>>> # let's view a small subset of our results:
>>> # We use the data property of the results class to view the relevant data.
>>> y_pred_proba_result_columns = list(univariate_results.data.columns)[:5] + [s for s in list(univariate_results.data.columns) if "y_pred_proba" in s]
>>> display(univariate_results.data[y_pred_proba_result_columns][-7:-3])
key |
start_index |
end_index |
start_date |
end_date |
y_pred_proba_dstat |
y_pred_proba_p_value |
y_pred_proba_alert |
y_pred_proba_threshold |
|
---|---|---|---|---|---|---|---|---|---|
13 |
[65000:69999] |
65000 |
69999 |
2018-09-01 16:19:07 |
2018-12-31 10:11:21 |
0.01058 |
0.685 |
False |
0.05 |
14 |
[70000:74999] |
70000 |
74999 |
2018-12-31 10:38:45 |
2019-04-30 11:01:30 |
0.01408 |
0.325 |
False |
0.05 |
15 |
[75000:79999] |
75000 |
79999 |
2019-04-30 11:02:00 |
2019-09-01 00:24:27 |
0.1307 |
0 |
True |
0.05 |
16 |
[80000:84999] |
80000 |
84999 |
2019-09-01 00:28:54 |
2019-12-31 09:09:12 |
0.1273 |
0 |
True |
0.05 |
NannyML can visualize the statistical properties of the drift in model outputs with:
>>> figure = univariate_results.plot(kind='prediction_drift', metric='statistic')
>>> figure.show()
NannyML can also show how the distributions of the model predictions evolved over time:
>>> figure = univariate_results.plot(kind='prediction_distribution', metric='statistic')
>>> figure.show()
Insights and Follow Ups
Looking at the results we see that we have a false alert on the first chunk of the analysis data. This is similar
to the tenure
variable in the univariate drift results where there is also
a false alert because the drift measured by the KS d-statistic is very low. This
can happen when the statistical tests consider significant a small change in the distribution of a variable
in the chunks. But because the change is small it is usually not significant from a model monitoring perspective.
If required the Performance Estimation functionality of NannyML can help provide estimates of the impact of the observed changes to Model Outputs.