Comparing Estimated and Realized Performance

Just the code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_target_df = nml.load_synthetic_car_loan_dataset()

>>> analysis_target_df.head(3)

>>> analysis_with_targets = analysis_df.merge(analysis_target_df, left_index=True, right_index=True)

>>> display(analysis_with_targets.head(3))

>>> # Estimate performance without targets
>>> estimator = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     chunk_size=5000,
...     problem_type='classification_binary',
... )

>>> estimator.fit(reference_df)

>>> results = estimator.estimate(analysis_df)

>>> display(results.filter(period='analysis').to_df())

>>> # Calculate realized performance using targets
>>> calculator = nml.PerformanceCalculator(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     chunk_size=5000,
...     problem_type='classification_binary',
... ).fit(reference_df)
>>> realized_results = calculator.calculate(analysis_with_targets)
>>> display(realized_results.filter(period='analysis').to_df())

>>> # Show comparison plot
>>> results.filter(period='analysis').compare(realized_results).plot().show()

Walkthrough

Once targets become available, the quality of the estimates provided by NannyML can be evaluated.

The beginning of the code below is similar to that in the tutorial on performance calculation with binary classification data.

For simplicity, this guide is based on a synthetic dataset included in the library, where the monitored model predicts whether a customer will repay a loan taken out to buy a car. Check out the Car Loan Dataset to learn more about this dataset.

The dataset provided contains targets for the analysis period: the target values for the monitored model are stored in the repaid column.

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_target_df = nml.load_synthetic_car_loan_dataset()

>>> analysis_target_df.head(3)

   repaid
0       1
1       1
2       1

For this example, the analysis targets and the analysis data frame are joined on their index.

>>> analysis_with_targets = analysis_df.merge(analysis_target_df, left_index=True, right_index=True)

>>> display(analysis_with_targets.head(3))

   car_value salary_range  debt_to_income_ratio  loan_length repaid_loan_on_prev_car size_of_downpayment  driver_tenure                timestamp  y_pred_proba  y_pred  repaid
0      12638    0 - 20K €              0.487926           21                   False                 10%        4.22463  2018-10-30 18:00:00.000          0.99       1       1
1      52425  20K - 40K €              0.672183           20                   False                 40%         4.9631  2018-10-30 18:08:43.152          0.98       1       1
2      20369  40K - 60K €               0.70309           19                    True                 40%        4.58895  2018-10-30 18:17:26.304          0.98       1       1

Estimating performance without targets

We create the Confidence-based Performance Estimation (CBPE) estimator with a list of metrics and an optional chunking specification. For more information about chunking, check out the chunking tutorial; two alternative chunking specifications are also sketched after the code below.

>>> # Estimate performance without targets
>>> estimator = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     chunk_size=5000,
...     problem_type='classification_binary',
... )
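
The chunking used here is size-based, but the same estimator accepts other chunking specifications. Below is a minimal sketch of two alternatives, using the chunk_number and chunk_period parameters; all other arguments stay the same as above.

>>> # Sketch: the same estimator with a fixed number of chunks instead of a fixed size
>>> estimator_by_count = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     chunk_number=10,  # split the data into 10 equally sized chunks
...     problem_type='classification_binary',
... )

>>> # Sketch: calendar-based chunks, one chunk per month of the timestamp column
>>> estimator_by_month = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     chunk_period='M',  # monthly chunks
...     problem_type='classification_binary',
... )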

The CBPE estimator is then fitted on the reference data using the fit() method.

The fitted estimator is used to estimate performance on the analysis data, for which no targets are needed.

We filter the results to the analysis period so that only the estimated values are shown.

>>> estimator.fit(reference_df)

>>> results = estimator.estimate(analysis_df)

>>> display(results.filter(period='analysis').to_df())

           chunk                                                                                          roc_auc
             key  chunk_index  start_index  end_index           start_date                    end_date    period     value  sampling_error  realized  upper_confidence_boundary  lower_confidence_boundary  upper_threshold  lower_threshold  alert
0       [0:4999]            0            0       4999  2018-10-30 18:00:00  2018-11-30 00:27:16.848000  analysis  0.968631      0.00181072       nan                   0.974063                   0.963198          0.97866         0.963317  False
1    [5000:9999]            1         5000       9999  2018-11-30 00:36:00  2018-12-30 07:03:16.848000  analysis  0.969044      0.00181072       nan                   0.974476                   0.963612          0.97866         0.963317  False
2  [10000:14999]            2        10000      14999  2018-12-30 07:12:00  2019-01-29 13:39:16.848000  analysis  0.969444      0.00181072       nan                   0.974876                   0.964012          0.97866         0.963317  False
3  [15000:19999]            3        15000      19999  2019-01-29 13:48:00  2019-02-28 20:15:16.848000  analysis  0.969047      0.00181072       nan                   0.974479                   0.963615          0.97866         0.963317  False
4  [20000:24999]            4        20000      24999  2019-02-28 20:24:00  2019-03-31 02:51:16.848000  analysis  0.968873      0.00181072       nan                   0.974305                   0.963441          0.97866         0.963317  False
5  [25000:29999]            5        25000      29999  2019-03-31 03:00:00  2019-04-30 09:27:16.848000  analysis  0.960478      0.00181072       nan                    0.96591                   0.955046          0.97866         0.963317   True
6  [30000:34999]            6        30000      34999  2019-04-30 09:36:00  2019-05-30 16:03:16.848000  analysis  0.961134      0.00181072       nan                   0.966566                   0.955701          0.97866         0.963317   True
7  [35000:39999]            7        35000      39999  2019-05-30 16:12:00  2019-06-29 22:39:16.848000  analysis  0.960536      0.00181072       nan                   0.965968                   0.955104          0.97866         0.963317   True
8  [40000:44999]            8        40000      44999  2019-06-29 22:48:00  2019-07-30 05:15:16.848000  analysis  0.961869      0.00181072       nan                   0.967301                   0.956437          0.97866         0.963317   True
9  [45000:49999]            9        45000      49999  2019-07-30 05:24:00  2019-08-29 11:51:16.848000  analysis  0.960537      0.00181072       nan                   0.965969                   0.955104          0.97866         0.963317   True
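
Note that to_df() returns a frame with a two-level column index, as visible in the table header above, so individual columns are addressed with tuples. A minimal sketch of pulling out the estimated values:

>>> est_df = results.filter(period='analysis').to_df()
>>> # Column names are (metric, statistic) tuples on the two-level index
>>> est_df[('roc_auc', 'value')].head(3)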

Comparing to realized performance

We’ll first calculate the realized performance:

>>> # Calculate realized performance using targets
>>> calculator = nml.PerformanceCalculator(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     chunk_size=5000,
...     problem_type='classification_binary',
... ).fit(reference_df)
>>> realized_results = calculator.calculate(analysis_with_targets)
>>> display(realized_results.filter(period='analysis').to_df())

           chunk                                                                                                                roc_auc
             key  chunk_index  start_index  end_index           start_date                    end_date    period  targets_missing_rate  sampling_error     value  upper_threshold  lower_threshold  alert
0       [0:4999]            0            0       4999  2018-10-30 18:00:00  2018-11-30 00:27:16.848000  analysis                     0      0.00181072  0.970962          0.97866         0.963317  False
1    [5000:9999]            1         5000       9999  2018-11-30 00:36:00  2018-12-30 07:03:16.848000  analysis                     0      0.00181072  0.970248          0.97866         0.963317  False
2  [10000:14999]            2        10000      14999  2018-12-30 07:12:00  2019-01-29 13:39:16.848000  analysis                     0      0.00181072  0.976282          0.97866         0.963317  False
3  [15000:19999]            3        15000      19999  2019-01-29 13:48:00  2019-02-28 20:15:16.848000  analysis                     0      0.00181072  0.967721          0.97866         0.963317  False
4  [20000:24999]            4        20000      24999  2019-02-28 20:24:00  2019-03-31 02:51:16.848000  analysis                     0      0.00181072  0.969886          0.97866         0.963317  False
5  [25000:29999]            5        25000      29999  2019-03-31 03:00:00  2019-04-30 09:27:16.848000  analysis                     0      0.00181072   0.96005          0.97866         0.963317   True
6  [30000:34999]            6        30000      34999  2019-04-30 09:36:00  2019-05-30 16:03:16.848000  analysis                     0      0.00181072   0.95853          0.97866         0.963317   True
7  [35000:39999]            7        35000      39999  2019-05-30 16:12:00  2019-06-29 22:39:16.848000  analysis                     0      0.00181072  0.959041          0.97866         0.963317   True
8  [40000:44999]            8        40000      44999  2019-06-29 22:48:00  2019-07-30 05:15:16.848000  analysis                     0      0.00181072  0.963094          0.97866         0.963317   True
9  [45000:49999]            9        45000      49999  2019-07-30 05:24:00  2019-08-29 11:51:16.848000  analysis                     0      0.00181072  0.957556          0.97866         0.963317   True
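
Beyond inspecting the tables, the per-chunk estimation error can be computed directly, since both result frames share the same chunking. The following is a sketch using plain pandas on the two-level columns shown above.

>>> est = results.filter(period='analysis').to_df()
>>> real = realized_results.filter(period='analysis').to_df()
>>> # Difference between estimated and realized ROC AUC per chunk
>>> diff = est[('roc_auc', 'value')] - real[('roc_auc', 'value')]
>>> diff.abs().mean()  # mean absolute estimation error across the analysis chunks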

We can then visualize both estimated and realized performance in a single comparison plot.

>>> # Show comparison plot
>>> results.filter(period='analysis').compare(realized_results).plot().show()
[Image: comparison_plot.svg, the comparison plot of estimated and realized ROC AUC per chunk]
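
The comparison plot is a Plotly figure, so it can also be kept for reporting. A sketch, assuming the optional kaleido package is installed for static image export:

>>> fig = results.filter(period='analysis').compare(realized_results).plot()
>>> fig.write_html('comparison_plot.html')  # interactive HTML, no extra dependencies
>>> fig.write_image('comparison_plot.svg')  # static image export, requires kaleido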