Creating and Estimating a Custom Binary Classification Metric

This tutorial explains how to use NannyML to estimate a custom metric based on the confusion matrix for binary classification models in the absence of target data. In particular, we will create a balanced accuracy metric. To find out how CBPE estimates the confusion matrix components, read the explanation of Confidence-based Performance Estimation.

Just the Code

>>> import nannyml as nml
>>> from IPython.display import display
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> reference_df = nml.load_synthetic_car_loan_dataset()[0]
>>> analysis_df = nml.load_synthetic_car_loan_dataset()[1]

>>> display(reference_df.head(3))

>>> estimator = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     metrics=['confusion_matrix'],
...     problem_type='classification_binary',
...     normalize_confusion_matrix="all",
... )

>>> estimator.fit(reference_df)

>>> results = estimator.estimate(analysis_df)

>>> results_data = results.to_df()
>>> display(results_data)

>>> true_pos_rate = results_data['true_positive']['value'].values
>>> false_pos_rate = results_data['false_positive']['value'].values
>>> true_neg_rate = results_data['true_negative']['value'].values
>>> false_neg_rate = results_data['false_negative']['value'].values

>>> sensitivity = true_pos_rate / (true_pos_rate + false_neg_rate)
>>> specificity = true_neg_rate / (true_neg_rate + false_pos_rate)

>>> balanced_accuracy = (sensitivity + specificity) / 2

>>> num_ref_chunks = len(results.filter(period='reference').to_df())

>>> reference_index = np.arange(num_ref_chunks)
>>> analysis_index = np.arange(num_ref_chunks, len(results_data))

>>> plt.plot(reference_index, balanced_accuracy[:num_ref_chunks], label='Reference', marker='o')
>>> plt.plot(analysis_index, balanced_accuracy[num_ref_chunks:], label='Analysis', marker='o')

>>> plt.axvline(x=num_ref_chunks-0.5, color='gray')

>>> plt.xlabel('Chunk Number')
>>> plt.ylabel('Estimated Balanced Accuracy')
>>> plt.title('Estimated Balanced Accuracy')

>>> plt.legend()

>>> plt.show()

Walkthrough

While NannyML offers out-of-the-box support for the estimation of a number of metrics (see the full list in our Estimating Performance for Binary Classification page), it is also possible to create custom metrics. In this tutorial we will create a balanced accuracy metric, using the confusion matrix as a building block.

For simplicity this guide is based on a synthetic dataset included in the library, where the monitored model predicts whether a customer will repay a loan to buy a car. You can read more about this synthetic dataset here.

In order to monitor a model, NannyML needs to learn about it from a reference dataset. Then it can monitor the data that is subject to actual analysis, provided as the analysis dataset. You can read more about this in our section on data periods.

We start by importing the libraries we’ll need and loading the dataset we’ll be using:

>>> import nannyml as nml
>>> from IPython.display import display
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> reference_df = nml.load_synthetic_car_loan_dataset()[0]
>>> analysis_df = nml.load_synthetic_car_loan_dataset()[1]

>>> display(reference_df.head(3))

|   | id | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp | y_pred_proba | y_pred |
|---|----|-----------|--------------|----------------------|-------------|-------------------------|---------------------|---------------|--------|-----------|--------------|--------|
| 0 | 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 2018-01-01 00:00:00.000 | 0.99 | 1 |
| 1 | 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 2018-01-01 00:08:43.152 | 0.07 | 0 |
| 2 | 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 2018-01-01 00:17:26.304 | 1 | 1 |
Next we create the Confidence-based Performance Estimation (CBPE) estimator to estimate the confusion matrix elements that we will need for our custom metric. To estimate the confusion matrix elements, we specify the metrics parameter as ['confusion_matrix']. We also set the normalize_confusion_matrix parameter to "all" to get the rate instead of the count for each cell.

>>> estimator = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     metrics=['confusion_matrix'],
...     problem_type='classification_binary',
...     normalize_confusion_matrix="all",
... )
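To build intuition for what the "all" normalization means, here is a minimal sketch using scikit-learn's confusion_matrix rather than NannyML itself (the toy labels are ours, purely for illustration): with normalize='all', every cell is divided by the total number of samples, so the four rates always sum to 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

# normalize='all' divides each cell by the total sample count,
# mirroring normalize_confusion_matrix="all" in the CBPE estimator
cm = confusion_matrix(y_true, y_pred, normalize='all')
print(cm)        # rows: actual 0/1, columns: predicted 0/1
print(cm.sum())  # the four rates sum to 1.0
```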

The CBPE estimator is then fitted using the fit() method on the reference data.

>>> estimator.fit(reference_df)

The fitted estimator can then be used to estimate performance on other data, for which performance cannot be calculated. Typically, this would be the latest production data, where targets are missing. In our example this is the analysis_df data.

NannyML can then output a dataframe that contains all the results.

>>> results = estimator.estimate(analysis_df)

>>> results_data = results.to_df()
>>> display(results_data)

The results dataframe has two-level columns: a chunk group (key, chunk_index, start_index, end_index, start_date, end_date, period) and, for each confusion matrix element (true_positive, true_negative, false_positive, false_negative), the columns value, sampling_error, realized, upper_confidence_boundary, lower_confidence_boundary, upper_threshold, lower_threshold and alert. The chunk metadata and per-element results are shown below (upper_conf/lower_conf abbreviate the confidence boundaries, upper_thr/lower_thr the thresholds; realized is nan for analysis chunks, where targets are unavailable).

Chunk metadata:

| row | key | chunk_index | start_index | end_index | period |
|-----|-----|-------------|-------------|-----------|--------|
| 0 | [0:4999] | 0 | 0 | 4999 | reference |
| 1 | [5000:9999] | 1 | 5000 | 9999 | reference |
| 2 | [10000:14999] | 2 | 10000 | 14999 | reference |
| 3 | [15000:19999] | 3 | 15000 | 19999 | reference |
| 4 | [20000:24999] | 4 | 20000 | 24999 | reference |
| 5 | [25000:29999] | 5 | 25000 | 29999 | reference |
| 6 | [30000:34999] | 6 | 30000 | 34999 | reference |
| 7 | [35000:39999] | 7 | 35000 | 39999 | reference |
| 8 | [40000:44999] | 8 | 40000 | 44999 | reference |
| 9 | [45000:49999] | 9 | 45000 | 49999 | reference |
| 10 | [0:4999] | 0 | 0 | 4999 | analysis |
| 11 | [5000:9999] | 1 | 5000 | 9999 | analysis |
| 12 | [10000:14999] | 2 | 10000 | 14999 | analysis |
| 13 | [15000:19999] | 3 | 15000 | 19999 | analysis |
| 14 | [20000:24999] | 4 | 20000 | 24999 | analysis |
| 15 | [25000:29999] | 5 | 25000 | 29999 | analysis |
| 16 | [30000:34999] | 6 | 30000 | 34999 | analysis |
| 17 | [35000:39999] | 7 | 35000 | 39999 | analysis |
| 18 | [40000:44999] | 8 | 40000 | 44999 | analysis |
| 19 | [45000:49999] | 9 | 45000 | 49999 | analysis |

true_positive:

| row | value | sampling_error | realized | upper_conf | lower_conf | upper_thr | lower_thr | alert |
|-----|-------|----------------|----------|------------|------------|-----------|-----------|-------|
| 0 | 0.458185 | 0.00705286 | 0.4596 | 0.479343 | 0.437026 | 0.478879 | 0.449401 | False |
| 1 | 0.456855 | 0.00705286 | 0.455 | 0.478013 | 0.435696 | 0.478879 | 0.449401 | False |
| 2 | 0.469963 | 0.00705286 | 0.471 | 0.491121 | 0.448804 | 0.478879 | 0.449401 | False |
| 3 | 0.46226 | 0.00705286 | 0.4634 | 0.483419 | 0.441102 | 0.478879 | 0.449401 | False |
| 4 | 0.468431 | 0.00705286 | 0.4674 | 0.489589 | 0.447272 | 0.478879 | 0.449401 | False |
| 5 | 0.459727 | 0.00705286 | 0.458 | 0.480885 | 0.438568 | 0.478879 | 0.449401 | False |
| 6 | 0.465254 | 0.00705286 | 0.4648 | 0.486413 | 0.444096 | 0.478879 | 0.449401 | False |
| 7 | 0.469571 | 0.00705286 | 0.469 | 0.49073 | 0.448412 | 0.478879 | 0.449401 | False |
| 8 | 0.465682 | 0.00705286 | 0.4682 | 0.48684 | 0.444523 | 0.478879 | 0.449401 | False |
| 9 | 0.466762 | 0.00705286 | 0.465 | 0.48792 | 0.445603 | 0.478879 | 0.449401 | False |
| 10 | 0.481766 | 0.00705286 | nan | 0.502925 | 0.460608 | 0.478879 | 0.449401 | True |
| 11 | 0.454646 | 0.00705286 | nan | 0.475804 | 0.433487 | 0.478879 | 0.449401 | False |
| 12 | 0.455756 | 0.00705286 | nan | 0.476914 | 0.434597 | 0.478879 | 0.449401 | False |
| 13 | 0.457828 | 0.00705286 | nan | 0.478987 | 0.43667 | 0.478879 | 0.449401 | False |
| 14 | 0.468372 | 0.00705286 | nan | 0.489531 | 0.447213 | 0.478879 | 0.449401 | False |
| 15 | 0.461246 | 0.00705286 | nan | 0.482404 | 0.440087 | 0.478879 | 0.449401 | False |
| 16 | 0.459067 | 0.00705286 | nan | 0.480225 | 0.437908 | 0.478879 | 0.449401 | False |
| 17 | 0.458246 | 0.00705286 | nan | 0.479404 | 0.437087 | 0.478879 | 0.449401 | False |
| 18 | 0.453561 | 0.00705286 | nan | 0.47472 | 0.432403 | 0.478879 | 0.449401 | False |
| 19 | 0.473578 | 0.00705286 | nan | 0.494737 | 0.45242 | 0.478879 | 0.449401 | False |

true_negative:

| row | value | sampling_error | realized | upper_conf | lower_conf | upper_thr | lower_thr | alert |
|-----|-------|----------------|----------|------------|------------|-----------|-----------|-------|
| 0 | 0.486383 | 0.00706512 | 0.4866 | 0.507579 | 0.465188 | 0.494119 | 0.464881 | False |
| 1 | 0.485678 | 0.00706512 | 0.4844 | 0.506873 | 0.464482 | 0.494119 | 0.464881 | False |
| 2 | 0.473446 | 0.00706512 | 0.4752 | 0.494641 | 0.452251 | 0.494119 | 0.464881 | False |
| 3 | 0.481754 | 0.00706512 | 0.4808 | 0.502949 | 0.460559 | 0.494119 | 0.464881 | False |
| 4 | 0.475128 | 0.00706512 | 0.4708 | 0.496324 | 0.453933 | 0.494119 | 0.464881 | False |
| 5 | 0.484389 | 0.00706512 | 0.4862 | 0.505584 | 0.463193 | 0.494119 | 0.464881 | False |
| 6 | 0.476255 | 0.00706512 | 0.4802 | 0.49745 | 0.45506 | 0.494119 | 0.464881 | False |
| 7 | 0.475337 | 0.00706512 | 0.476 | 0.496532 | 0.454141 | 0.494119 | 0.464881 | False |
| 8 | 0.479609 | 0.00706512 | 0.4768 | 0.500804 | 0.458414 | 0.494119 | 0.464881 | False |
| 9 | 0.47831 | 0.00706512 | 0.478 | 0.499505 | 0.457115 | 0.494119 | 0.464881 | False |
| 10 | 0.460026 | 0.00706512 | nan | 0.481221 | 0.43883 | 0.494119 | 0.464881 | True |
| 11 | 0.488676 | 0.00706512 | nan | 0.509871 | 0.46748 | 0.494119 | 0.464881 | False |
| 12 | 0.489736 | 0.00706512 | nan | 0.510931 | 0.46854 | 0.494119 | 0.464881 | False |
| 13 | 0.486988 | 0.00706512 | nan | 0.508183 | 0.465793 | 0.494119 | 0.464881 | False |
| 14 | 0.476273 | 0.00706512 | nan | 0.497468 | 0.455078 | 0.494119 | 0.464881 | False |
| 15 | 0.449469 | 0.00706512 | nan | 0.470664 | 0.428273 | 0.494119 | 0.464881 | True |
| 16 | 0.452083 | 0.00706512 | nan | 0.473278 | 0.430888 | 0.494119 | 0.464881 | True |
| 17 | 0.452947 | 0.00706512 | nan | 0.474142 | 0.431752 | 0.494119 | 0.464881 | True |
| 18 | 0.460828 | 0.00706512 | nan | 0.482024 | 0.439633 | 0.494119 | 0.464881 | True |
| 19 | 0.438153 | 0.00706512 | nan | 0.459349 | 0.416958 | 0.494119 | 0.464881 | True |

false_positive:

| row | value | sampling_error | realized | upper_conf | lower_conf | upper_thr | lower_thr | alert |
|-----|-------|----------------|----------|------------|------------|-----------|-----------|-------|
| 0 | 0.0204154 | 0.00202397 | 0.019 | 0.0264873 | 0.0143435 | 0.025818 | 0.016022 | False |
| 1 | 0.0207453 | 0.00202397 | 0.0226 | 0.0268172 | 0.0146733 | 0.025818 | 0.016022 | False |
| 2 | 0.0208371 | 0.00202397 | 0.0198 | 0.0269091 | 0.0147652 | 0.025818 | 0.016022 | False |
| 3 | 0.0207396 | 0.00202397 | 0.0196 | 0.0268115 | 0.0146677 | 0.025818 | 0.016022 | False |
| 4 | 0.0209695 | 0.00202397 | 0.022 | 0.0270414 | 0.0148976 | 0.025818 | 0.016022 | False |
| 5 | 0.0208731 | 0.00202397 | 0.0226 | 0.026945 | 0.0148012 | 0.025818 | 0.016022 | False |
| 6 | 0.0201459 | 0.00202397 | 0.0206 | 0.0262178 | 0.014074 | 0.025818 | 0.016022 | False |
| 7 | 0.0210291 | 0.00202397 | 0.0216 | 0.027101 | 0.0149572 | 0.025818 | 0.016022 | False |
| 8 | 0.0207181 | 0.00202397 | 0.0182 | 0.02679 | 0.0146462 | 0.025818 | 0.016022 | False |
| 9 | 0.0214382 | 0.00202397 | 0.0232 | 0.0275101 | 0.0153662 | 0.025818 | 0.016022 | False |
| 10 | 0.0212337 | 0.00202397 | nan | 0.0273056 | 0.0151617 | 0.025818 | 0.016022 | False |
| 11 | 0.0199543 | 0.00202397 | nan | 0.0260262 | 0.0138824 | 0.025818 | 0.016022 | False |
| 12 | 0.0198442 | 0.00202397 | nan | 0.0259161 | 0.0137723 | 0.025818 | 0.016022 | False |
| 13 | 0.0205719 | 0.00202397 | nan | 0.0266438 | 0.0145 | 0.025818 | 0.016022 | False |
| 14 | 0.020428 | 0.00202397 | nan | 0.0264999 | 0.014356 | 0.025818 | 0.016022 | False |
| 15 | 0.0287544 | 0.00202397 | nan | 0.0348263 | 0.0226825 | 0.025818 | 0.016022 | True |
| 16 | 0.0283335 | 0.00202397 | nan | 0.0344054 | 0.0222616 | 0.025818 | 0.016022 | True |
| 17 | 0.0295542 | 0.00202397 | nan | 0.0356261 | 0.0234823 | 0.025818 | 0.016022 | True |
| 18 | 0.0272388 | 0.00202397 | nan | 0.0333107 | 0.0211669 | 0.025818 | 0.016022 | True |
| 19 | 0.0296219 | 0.00202397 | nan | 0.0356938 | 0.02355 | 0.025818 | 0.016022 | True |

false_negative:

| row | value | sampling_error | realized | upper_conf | lower_conf | upper_thr | lower_thr | alert |
|-----|-------|----------------|----------|------------|------------|-----------|-----------|-------|
| 0 | 0.0350166 | 0.00261473 | 0.0348 | 0.0428607 | 0.0271724 | 0.0416915 | 0.0291885 | False |
| 1 | 0.0367222 | 0.00261473 | 0.038 | 0.0445664 | 0.028878 | 0.0416915 | 0.0291885 | False |
| 2 | 0.035754 | 0.00261473 | 0.034 | 0.0435982 | 0.0279098 | 0.0416915 | 0.0291885 | False |
| 3 | 0.035246 | 0.00261473 | 0.0362 | 0.0430901 | 0.0274018 | 0.0416915 | 0.0291885 | False |
| 4 | 0.0354715 | 0.00261473 | 0.0398 | 0.0433157 | 0.0276274 | 0.0416915 | 0.0291885 | False |
| 5 | 0.0350115 | 0.00261473 | 0.0332 | 0.0428557 | 0.0271673 | 0.0416915 | 0.0291885 | False |
| 6 | 0.0383451 | 0.00261473 | 0.0344 | 0.0461892 | 0.0305009 | 0.0416915 | 0.0291885 | False |
| 7 | 0.0340635 | 0.00261473 | 0.0334 | 0.0419076 | 0.0262193 | 0.0416915 | 0.0291885 | False |
| 8 | 0.033991 | 0.00261473 | 0.0368 | 0.0418352 | 0.0261468 | 0.0416915 | 0.0291885 | False |
| 9 | 0.03349 | 0.00261473 | 0.0338 | 0.0413341 | 0.0256458 | 0.0416915 | 0.0291885 | False |
| 10 | 0.0369745 | 0.00261473 | nan | 0.0448186 | 0.0291303 | 0.0416915 | 0.0291885 | False |
| 11 | 0.0367245 | 0.00261473 | nan | 0.0445687 | 0.0288803 | 0.0416915 | 0.0291885 | False |
| 12 | 0.0346643 | 0.00261473 | nan | 0.0425084 | 0.0268201 | 0.0416915 | 0.0291885 | False |
| 13 | 0.0346121 | 0.00261473 | nan | 0.0424563 | 0.0267679 | 0.0416915 | 0.0291885 | False |
| 14 | 0.034927 | 0.00261473 | nan | 0.0427712 | 0.0270829 | 0.0416915 | 0.0291885 | False |
| 15 | 0.0605314 | 0.00261473 | nan | 0.0683756 | 0.0526873 | 0.0416915 | 0.0291885 | True |
| 16 | 0.060517 | 0.00261473 | nan | 0.0683612 | 0.0526729 | 0.0416915 | 0.0291885 | True |
| 17 | 0.0592531 | 0.00261473 | nan | 0.0670972 | 0.0514089 | 0.0416915 | 0.0291885 | True |
| 18 | 0.0583718 | 0.00261473 | nan | 0.066216 | 0.0505277 | 0.0416915 | 0.0291885 | True |
| 19 | 0.0586468 | 0.00261473 | nan | 0.066491 | 0.0508026 | 0.0416915 | 0.0291885 | True |

From these results we extract the estimated value of each confusion matrix component for each chunk of data. To do so, we simply index into the results dataframe:

>>> true_pos_rate = results_data['true_positive']['value'].values
>>> false_pos_rate = results_data['false_positive']['value'].values
>>> true_neg_rate = results_data['true_negative']['value'].values
>>> false_neg_rate = results_data['false_negative']['value'].values
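The chained indexing above works because the results dataframe has two-level column names: the first key selects a metric group, the second a column within it. A minimal illustration with a toy two-level dataframe (the values here are made up for demonstration only):

```python
import pandas as pd

# build a small dataframe with two-level (MultiIndex) columns,
# mimicking the layout of the CBPE results dataframe
columns = pd.MultiIndex.from_tuples([
    ('true_positive', 'value'),
    ('true_positive', 'alert'),
    ('false_negative', 'value'),
])
df = pd.DataFrame([[0.46, False, 0.035],
                   [0.48, True, 0.037]], columns=columns)

# the first key selects the metric group, the second the sub-column
values = df['true_positive']['value'].values
print(values)  # one estimate per chunk
```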

Now that we have these values, we can use them to calculate the sensitivity and specificity for each chunk of data, and from those the balanced accuracy.

As a reminder, the balanced accuracy is defined as:

\[\text{balanced accuracy} = \frac{1}{2} \left( \text{sensitivity} + \text{specificity} \right)\]

and the sensitivity and specificity are defined as:

\[\text{sensitivity} = \frac{TP}{TP + FN}\]
\[\text{specificity} = \frac{TN}{TN + FP}\]

where \(TP\) is the number of true positives (or true positive rate), \(TN\) is the number of true negatives (or true negative rate), \(FP\) is the number of false positives (or false positive rate), and \(FN\) is the number of false negatives (or false negative rate).

>>> sensitivity = true_pos_rate / (true_pos_rate + false_neg_rate)
>>> specificity = true_neg_rate / (true_neg_rate + false_pos_rate)

>>> balanced_accuracy = (sensitivity + specificity) / 2
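As a quick sanity check on these formulas (a toy example, not drawn from the tutorial's dataset): with the "all" normalization each cell is count / total, and the shared total cancels in the ratios, so sensitivity and specificity computed from rates equal those computed from raw counts, and the resulting balanced accuracy matches scikit-learn's reference implementation.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# toy labels, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

n = len(y_true)
# confusion-matrix rates, as produced by normalize_confusion_matrix="all"
tp_rate = np.sum((y_true == 1) & (y_pred == 1)) / n
fn_rate = np.sum((y_true == 1) & (y_pred == 0)) / n
tn_rate = np.sum((y_true == 0) & (y_pred == 0)) / n
fp_rate = np.sum((y_true == 0) & (y_pred == 1)) / n

# the shared denominator n cancels, so rates behave like counts here
sensitivity = tp_rate / (tp_rate + fn_rate)
specificity = tn_rate / (tn_rate + fp_rate)
bal_acc = (sensitivity + specificity) / 2

# agrees with sklearn's reference implementation
assert np.isclose(bal_acc, balanced_accuracy_score(y_true, y_pred))
```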

To distinguish between the balanced accuracy for the reference data and the analysis data, we get the number of chunks in the reference period and use it to index the balanced_accuracy array.

>>> num_ref_chunks = len(results.filter(period='reference').to_df())

>>> reference_index = np.arange(num_ref_chunks)
>>> analysis_index = np.arange(num_ref_chunks, len(results_data))

Since balanced accuracy is not supported out of the box with NannyML, we will create a custom plot to visualize the performance estimation results.

>>> plt.plot(reference_index, balanced_accuracy[:num_ref_chunks], label='Reference', marker='o')
>>> plt.plot(analysis_index, balanced_accuracy[num_ref_chunks:], label='Analysis', marker='o')

>>> plt.axvline(x=num_ref_chunks-0.5, color='gray')

>>> plt.xlabel('Chunk Number')
>>> plt.ylabel('Estimated Balanced Accuracy')
>>> plt.title('Estimated Balanced Accuracy')

>>> plt.legend()

>>> plt.show()
[Figure: estimated balanced accuracy per chunk for the reference and analysis periods]

Insights

After reviewing the performance estimation results, we should be able to see any indications of performance change that NannyML has detected based upon the model’s inputs and outputs alone.

What’s next

The Data Drift functionality can help us understand whether data drift is causing the performance problem. When the target values become available, we can compare the realized and estimated custom performance metric results.
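For example, once the targets arrive for the analysis period, the realized balanced accuracy per chunk can be computed directly and overlaid on the estimated values. A hedged sketch (the helper name realized_balanced_accuracy and the toy stand-in data are ours, not part of NannyML; in the tutorial the inputs would be analysis_df['repaid'] and analysis_df['y_pred'], and the chunk size of 5000 matches this tutorial's chunking):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def realized_balanced_accuracy(y_true, y_pred, chunk_size=5000):
    """Realized balanced accuracy per chunk of consecutive rows."""
    scores = []
    for start in range(0, len(y_true), chunk_size):
        end = start + chunk_size
        scores.append(balanced_accuracy_score(y_true[start:end],
                                              y_pred[start:end]))
    return np.array(scores)

# toy stand-in data: a model that is right ~90% of the time
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=20_000)
y_pred = np.where(rng.random(20_000) < 0.9, y_true, 1 - y_true)

realized = realized_balanced_accuracy(y_true, y_pred)
print(realized.shape)  # one score per 5000-row chunk
```

These per-chunk realized scores can then be plotted alongside the estimated balanced_accuracy array from the walkthrough above to see how well the estimation tracked reality.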