Monitoring Realized Performance

Why Monitoring Realized Performance

The realized performance of a machine learning model is typically a good proxy for the business impact of the model. A significant drop in performance normally means a lot of value generated by the model is at risk, so close monitoring and quick resolution of issues are essential.

This guide shows how to use NannyML to calculate the Realized Performance of a model. Target values need to be available in both the reference and analysis data. All monitoring metrics available by NannyML for monitoring will be showed.

Note

The performance monitoring process requires no missing values in the target data on the reference dataset. However, the analysis data can contain missing values. The entries with missing values will simply be ignored when calculating the performance results. If there are so many missing values that the available data are below the Minimum chunk size then the performance results are omitted from the resulting visualizations because they are too noisy, to be reliable.

Binary Classification

Just The Code

If you just want the code to experiment yourself, here you go:

>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference, analysis, analysis_targets = nml.datasets.load_synthetic_binary_classification_dataset()
>>> display(reference.head(3))

>>> data = pd.concat([reference, analysis.set_index('identifier').join(analysis_targets.set_index('identifier'), on='identifier', rsuffix='_r')], ignore_index=True).reset_index(drop=True)
>>> display(data.loc[data['partition'] == 'analysis'].head(3))

>>> metadata = nml.extract_metadata(reference, model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.target_column_name = 'work_home_actual'
>>> display(metadata.is_complete())

>>> performance_calculator = nml.PerformanceCalculator(
...     model_metadata=metadata,
...     # use NannyML to tell us what metrics are supported
...     metrics=nml.performance_estimation.confidence_based.results.SUPPORTED_METRIC_VALUES,
...     chunk_size=5000
... ).fit(reference_data=reference)

>>> realized_performance = performance_calculator.calculate(data)

>>> display(realized_performance.data.head(3))

>>> for metric in performance_calculator.metrics:
...     figure = realized_performance.plot(kind='performance', metric=metric)
...     figure.show()

Walkthrough on Monitoring Realized Performance

Prepare the data

For simplicity the guide is based on a synthetic dataset where the monitored model predicts whether an employee will work from home.

>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference, analysis, analysis_targets = nml.datasets.load_synthetic_binary_classification_dataset()
>>> display(reference.head(3))

	distance_from_office	salary_range	gas_price_per_litre	public_transportation_cost	wfh_prev_workday	workday	tenure	identifier	work_home_actual	timestamp	y_pred_proba	partition	y_pred
0	5.96225	40K - 60K €	2.11948	8.56806	False	Friday	0.212653	0	1	2014-05-09 22:27:20	0.99	reference	1
1	0.535872	40K - 60K €	2.3572	5.42538	True	Tuesday	4.92755	1	0	2014-05-09 22:59:32	0.07	reference	0
2	1.96952	40K - 60K €	2.36685	8.24716	False	Monday	0.520817	2	1	2014-05-09 23:48:25	1	reference	1

The realized performance will be calculated on the combination of both reference and analysis data. The analysis target values are joined on the analysis frame by the identifier column.

>>> data = pd.concat([reference, analysis.set_index('identifier').join(analysis_targets.set_index('identifier'), on='identifier', rsuffix='_r')], ignore_index=True).reset_index(drop=True)
>>> display(data.loc[data['partition'] == 'analysis'].head(3))

	distance_from_office	salary_range	gas_price_per_litre	public_transportation_cost	wfh_prev_workday	workday	tenure	identifier	work_home_actual	timestamp	y_pred_proba	partition	y_pred
50000	0.527691	0 - 20K €	1.8	8.96072	False	Tuesday	4.22463	nan	1	2017-08-31 04:20:00	0.99	analysis	1
50001	8.48513	20K - 40K €	2.22207	8.76879	False	Friday	4.9631	nan	1	2017-08-31 05:16:16	0.98	analysis	1
50002	2.07388	40K - 60K €	2.31008	8.64998	True	Friday	4.58895	nan	1	2017-08-31 05:56:44	0.98	analysis	1

The reference and analysis dataframes correspond to reference and analysis periods of the monitored data. To understand what they are read data periods. The analysis_targets dataframe contains the target results of the analysis period and we will not be using it during Performance Estimation.

One of the first steps in using NannyML is providing metadata information about the model we are monitoring. Some information is inferred automatically and we provide the rest.

>>> metadata = nml.extract_metadata(reference, model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.target_column_name = 'work_home_actual'
>>> display(metadata.is_complete())
(True, [])

We see that the metadata are complete. Full information on how to extract metadata can be found in the providing metadata guide.

Fit calculator and calculate

In the next step a PerformanceCalculator is created using the previously extracted ModelMetadata, a list of metrics and an optional chunking specification. The list of metrics specifies which metrics should be calculated. The following metrics are currently supported:

roc_auc
f1
precision
recall
specificity
accuracy

For more information on metrics, check the metrics module.

The new PerformanceCalculator is then fitted using the fit() method on the reference data.

>>> performance_calculator = nml.PerformanceCalculator(
...     model_metadata=metadata,
...     # use NannyML to tell us what metrics are supported
...     metrics=nml.performance_estimation.confidence_based.results.SUPPORTED_METRIC_VALUES,
...     chunk_size=5000
... ).fit(reference_data=reference)

The fitted PerformanceCalculator can be used to calculate realized performance metrics on data for which target values are available. This is typically done on all data for which target values are available. In our example this includes both reference and analysis.

>>> realized_performance = performance_calculator.calculate(data)

View the results

NannyML can output a dataframe that contains all the results:

>>> display(realized_performance.data.head(3))

	key	start_index	end_index	start_date	end_date	partition	roc_auc	roc_auc_thresholds	roc_auc_alert	f1	f1_thresholds	f1_alert	precision	precision_thresholds	precision_alert	recall	recall_thresholds	recall_alert	specificity	specificity_thresholds	specificity_alert	accuracy	accuracy_thresholds	accuracy_alert
0	[0:4999]	0	4999	2014-05-09 22:27:20	2014-09-09 08:18:27	reference	0.976253	(0.963316535948479, 0.9786597341713761)	False	0.953803	(0.9350467474218009, 0.9610943245280688)	False	0.951308	(0.9247411224999635, 0.9611314708654666)	False	0.956311	(0.940831383455992, 0.9657258748427315)	False	0.952136	(0.9247408281519457, 0.9601131753790443)	False	0.9542	(0.9350787461431096, 0.9606012538568904)	False
1	[5000:9999]	5000	9999	2014-09-09 09:13:35	2015-01-09 00:02:51	reference	0.969045	(0.963316535948479, 0.9786597341713761)	False	0.940963	(0.9350467474218009, 0.9610943245280688)	False	0.934748	(0.9247411224999635, 0.9611314708654666)	False	0.947262	(0.940831383455992, 0.9657258748427315)	False	0.9357	(0.9247408281519457, 0.9601131753790443)	False	0.9414	(0.9350787461431096, 0.9606012538568904)	False
2	[10000:14999]	10000	14999	2015-01-09 00:04:43	2015-05-09 15:54:26	reference	0.971742	(0.963316535948479, 0.9786597341713761)	False	0.954483	(0.9350467474218009, 0.9610943245280688)	False	0.949804	(0.9247411224999635, 0.9611314708654666)	False	0.959208	(0.940831383455992, 0.9657258748427315)	False	0.948283	(0.9247408281519457, 0.9601131753790443)	False	0.9538	(0.9350787461431096, 0.9606012538568904)	False

Apart from chunking and chunk and partition-related data, the results data have the a set of columns for each calculated metric. When taking roc_auc as an example:

roc_auc - The value of the metric for a specific chunk.

roc_auc_thresholds - A tuple containing the lower and upper thresholds. Crossing them will raise an alert on significant metric change. The thresholds are calculated based on the realized performance metric of the monitored model on chunks in the reference period. The thresholds are 3 standard deviations away from the mean performance calculated on reference chunks.

roc_auc_alert - Flag indicating potentially significant performance change. True if realized performance crosses upper or lower threshold.

The results can be plotted for vizual inspection:

>>> for metric in performance_calculator.metrics:
...     figure = realized_performance.plot(kind='performance', metric=metric)
...     figure.show()

../_images/tutorial-perf-guide-Accuracy.svg

../_images/tutorial-perf-guide-Precision.svg

../_images/tutorial-perf-guide-ROC_AUC.svg

../_images/tutorial-perf-guide-Recall.svg

../_images/tutorial-perf-guide-Specificity.svg

Multiclass Classification

Just The Code

If you just want the code to experiment yourself, here you go:

>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference, analysis, analysis_targets = nml.datasets.load_synthetic_multiclass_classification_dataset()
>>> display(reference.head(3))

>>> data = pd.concat([
...     reference,
...     analysis.set_index('identifier').join(analysis_targets.set_index('identifier'), on='identifier', rsuffix='_r')
>>> ], ignore_index=True).reset_index(drop=True)
>>> display(data.loc[data['partition'] == 'analysis'].head(3))

>>> metadata = nml.extract_metadata(
...     reference,
...     model_name='credit_card_segment',
...     model_type='classification_multiclass',
...     exclude_columns=['identifier']
>>> )
>>> metadata.target_column_name = 'y_true'
>>> display(metadata.is_complete())

>>> performance_calculator = nml.PerformanceCalculator(
...     model_metadata=metadata,
...     metrics=['roc_auc', 'f1'],
...     chunk_size=6000
>>> ).fit(reference_data=reference)

>>> realized_performance = performance_calculator.calculate(data)

>>> display(realized_performance.data.head(3))

>>> for metric in performance_calculator.metrics:
...     figure = realized_performance.plot(kind='performance', metric=metric)
...     figure.show()

Walkthrough on Monitoring Realized Performance

Prepare the data

For simplicity the guide is based on a synthetic dataset where the monitored model predicts which type of credit card product new customers should be assigned to.

>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference, analysis, analysis_targets = nml.datasets.load_synthetic_multiclass_classification_dataset()
>>> display(reference.head(3))

	acq_channel	app_behavioral_score	requested_credit_limit	app_channel	credit_bureau_score	stated_income	is_customer	partition	identifier	timestamp	y_pred_proba_prepaid_card	y_pred_proba_highstreet_card	y_pred_proba_upmarket_card	y_pred	y_true
0	Partner3	1.80823	350	web	309	15000	True	reference	60000	2020-05-02 02:01:30	0.97	0.03	0	prepaid_card	prepaid_card
1	Partner2	4.38257	500	mobile	418	23000	True	reference	60001	2020-05-02 02:03:33	0.87	0.13	0	prepaid_card	prepaid_card
2	Partner2	-0.787575	400	web	507	24000	False	reference	60002	2020-05-02 02:04:49	0.47	0.35	0.18	prepaid_card	upmarket_card

The realized performance will be calculated on the combination of both reference and analysis data. The analysis target values are joined on the analysis frame by the identifier column.

>>> data = pd.concat([
...     reference,
...     analysis.set_index('identifier').join(analysis_targets.set_index('identifier'), on='identifier', rsuffix='_r')
>>> ], ignore_index=True).reset_index(drop=True)
>>> display(data.loc[data['partition'] == 'analysis'].head(3))

	acq_channel	app_behavioral_score	requested_credit_limit	app_channel	credit_bureau_score	stated_income	is_customer	partition	identifier	timestamp	y_pred_proba_prepaid_card	y_pred_proba_highstreet_card	y_pred_proba_upmarket_card	y_pred	y_true
60000	Organic	-1.64376	300	store	439	15000	False	analysis	nan	2020-09-01 03:10:01	0.39	0.35	0.26	prepaid_card	upmarket_card
60001	Partner2	-0.148435	450	store	565	18000	False	analysis	nan	2020-09-01 03:10:53	0.72	0.01	0.27	prepaid_card	prepaid_card
60002	Partner1	-2.28461	600	mobile	691	28000	False	analysis	nan	2020-09-01 03:11:39	0.03	0.75	0.22	highstreet_card	highstreet_card

The reference and analysis dataframes correspond to reference and analysis periods of the monitored data. To understand what they are read data periods. The analysis_targets dataframe contains the target results of the analysis period and we will not be using it during Performance Estimation.

One of the first steps in using NannyML is providing metadata information about the model we are monitoring. Some information is infered automatically and we provide the rest.

>>> metadata = nml.extract_metadata(
...     reference,
...     model_name='credit_card_segment',
...     model_type='classification_multiclass',
...     exclude_columns=['identifier']
>>> )
>>> metadata.target_column_name = 'y_true'
>>> display(metadata.is_complete())
(True, [])

We see that the metadata are complete. Full information on how to extract metadata can be found in the providing metadata guide.

Fit calculator and calculate

In the next step a PerformanceCalculator is created using the previously extracted ModelMetadata, a list of metrics and an optional chunking specification. The list of metrics specifies which metrics should be calculated. The following metrics are currently supported:

roc_auc
f1
precision
recall
specificity
accuracy

For more information on metrics, check the metrics module.

The new PerformanceCalculator is then fitted using the fit() method on the reference data.

>>> performance_calculator = nml.PerformanceCalculator(
...     model_metadata=metadata,
...     metrics=['roc_auc', 'f1'],
...     chunk_size=6000
>>> ).fit(reference_data=reference)

The fitted PerformanceCalculator can be used to calculate realized performance metrics on data for which target values are available. This is typically done on all data for which target values are available. In our example this includes both reference and analysis.

>>> realized_performance = performance_calculator.calculate(data)

View the results

NannyML can output a dataframe that contains all the results:

>>> display(realized_performance.data.head(3))

	key	start_index	end_index	start_date	end_date	partition	roc_auc	roc_auc_thresholds	roc_auc_alert	f1	f1_thresholds	f1_alert
0	[0:5999]	0	5999	2020-05-02 02:01:30	2020-05-14 12:25:35	reference	0.90476	(0.900902260737325, 0.9135156728918074)	False	0.750532	(0.741253919065521, 0.7649438592270994)	False
1	[6000:11999]	6000	11999	2020-05-14 12:29:25	2020-05-26 18:27:42	reference	0.905917	(0.900902260737325, 0.9135156728918074)	False	0.751148	(0.741253919065521, 0.7649438592270994)	False
2	[12000:17999]	12000	17999	2020-05-26 18:31:06	2020-06-07 19:55:45	reference	0.909329	(0.900902260737325, 0.9135156728918074)	False	0.75714	(0.741253919065521, 0.7649438592270994)	False

Apart from chunking and chunk and partition-related data, the results data have the a set of columns for each calculated metric. When taking roc_auc as an example:

roc_auc - The value of the metric for a specific chunk.

roc_auc_thresholds - A tuple containing the lower and upper thresholds. Crossing them will raise an alert on significant metric change. The thresholds are calculated based on the realized performance metric of the monitored model on chunks in the reference period. The thresholds are 3 standard deviations away from the mean performance calculated on reference chunks.

roc_auc_alert - Flag indicating potentially significant performance change. True if realized performance crosses upper or lower threshold.

The results can be plotted for vizual inspection:

>>> for metric in performance_calculator.metrics:
...     figure = realized_performance.plot(kind='performance', metric=metric)
...     figure.show()

../_images/tutorial-perf-guide-mc-F1.svg

../_images/tutorial-perf-guide-mc-ROC_AUC.svg

Regression

Warning

Performance calculation does not support regression use cases yet.

Insights and Follow Ups

After reviewing the performance calculation results we have to decide if further investigation is needed. The Data Drift functionality can help here.

If needed further investigation can be performed as to wheher the model’s performance is satisfactory according to business requirements. This is an ad-hoc investigation that is not covered by NannyML.