Monitoring Realized Performance for Regression

Note

The following example uses timestamps. These are optional but have an impact on the way data is chunked and results are plotted. You can read more about them in the data requirements.

Just The Code

>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display

>>> # Load the dataset once and unpack its three dataframes
>>> reference_df, analysis_df, analysis_target_df = nml.load_synthetic_car_price_dataset()
>>> analysis_df = analysis_df.join(analysis_target_df)

>>> display(reference_df.head(3))

>>> calc = nml.PerformanceCalculator(
...     y_pred='y_pred',
...     y_true='y_true',
...     timestamp_column_name='timestamp',
...     problem_type='regression',
...     metrics=['mae', 'mape', 'mse', 'msle', 'rmse', 'rmsle'],
...     chunk_size=6000)

>>> calc.fit(reference_df)

>>> results = calc.calculate(analysis_df)
>>> display(results.data)

>>> display(results.calculator.previous_reference_results)

>>> for metric in calc.metrics:
...     figure = results.plot(kind='performance', plot_reference=True, metric=metric)
...     figure.show()

Walkthrough

For simplicity, this guide uses a synthetic dataset in which the monitored model predicts the selling price of a used car. You can learn more about this dataset.

In order to monitor a model, NannyML needs to learn about it from a reference dataset. Then it can monitor the data that is subject to actual analysis, provided as the analysis dataset. You can read more about this in our section on data periods.

The analysis_target_df dataframe contains the target values for the analysis period. The synthetic data keeps them separate because they are not used during performance estimation. Calculating realized performance does require them, so the first step is to join the analysis target values with the analysis data.

>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display

>>> # Load the dataset once and unpack its three dataframes
>>> reference_df, analysis_df, analysis_target_df = nml.load_synthetic_car_price_dataset()
>>> analysis_df = analysis_df.join(analysis_target_df)

>>> display(reference_df.head(3))

	car_age	km_driven	price_new	accident_count	door_count	fuel	transmission	y_true	y_pred	timestamp
0	15	144020	42810	4	3	diesel	automatic	569	1246	2017-01-24 08:00:00.000
1	12	57078	31835	3	3	electric	automatic	4277	4924	2017-01-24 08:00:33.600
2	2	76288	31851	3	5	diesel	automatic	7011	5744	2017-01-24 08:01:07.200

Next, a PerformanceCalculator is created by specifying the metrics to calculate (a list, or just one metric), the data columns these metrics require, an optional chunking specification, and the type of machine learning problem being addressed.

The list of metrics specifies which performance metrics of the monitored model will be calculated. The following metrics are currently supported:

mae - mean absolute error
mape - mean absolute percentage error
mse - mean squared error
rmse - root mean squared error
msle - mean squared logarithmic error
rmsle - root mean squared logarithmic error

For more information on metrics, check the metrics module.
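
As a rough illustration of what these six metrics measure, the sketch below computes them by hand for the three rows shown in the reference data preview above. It is not NannyML's implementation. It assumes the common log(1 + y) form for the logarithmic errors, and the mean helper is introduced here only for the example.

```python
import math

# y_true and y_pred of the three rows in the reference data preview.
y_true = [569, 4277, 7011]
y_pred = [1246, 4924, 5744]


def mean(values):
    """Arithmetic mean of an iterable (illustrative helper)."""
    values = list(values)
    return sum(values) / len(values)


pairs = list(zip(y_true, y_pred))

mae = mean(abs(t - p) for t, p in pairs)            # mean absolute error
mape = mean(abs(t - p) / abs(t) for t, p in pairs)  # mean absolute percentage error
mse = mean((t - p) ** 2 for t, p in pairs)          # mean squared error
rmse = math.sqrt(mse)                               # root mean squared error

# Assumption: logarithmic errors compare log(1 + y) values,
# which requires non-negative targets and predictions.
msle = mean((math.log1p(t) - math.log1p(p)) ** 2 for t, p in pairs)
rmsle = math.sqrt(msle)

for name, value in [("mae", mae), ("mape", mape), ("mse", mse),
                    ("msle", msle), ("rmse", rmse), ("rmsle", rmsle)]:
    print(f"{name}: {value:.4f}")
```

With only three rows the values are not comparable to the chunk-level results in this guide. The point is only to show how each metric is built from the same prediction errors.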

>>> calc = nml.PerformanceCalculator(
...     y_pred='y_pred',
...     y_true='y_true',
...     timestamp_column_name='timestamp',
...     problem_type='regression',
...     metrics=['mae', 'mape', 'mse', 'msle', 'rmse', 'rmsle'],
...     chunk_size=6000)

>>> calc.fit(reference_df)

The new PerformanceCalculator is fitted using the fit() method on the reference data.

The fitted PerformanceCalculator can then be used, through its calculate() method, to calculate realized performance metrics on any data for which target values are available. The results for the analysis data are available as a dataframe through the data attribute of the returned object.

>>> results = calc.calculate(analysis_df)
>>> display(results.data)

	key	chunk_index	start_index	end_index	start_date	end_date	mae	mae_lower_threshold	mae_upper_threshold	mae_sampling_error	mae_alert	mape	mape_lower_threshold	mape_upper_threshold	mape_sampling_error	mape_alert	mse	mse_lower_threshold	mse_upper_threshold	mse_sampling_error	mse_alert	msle	msle_lower_threshold	msle_upper_threshold	msle_sampling_error	msle_alert	rmse	rmse_lower_threshold	rmse_upper_threshold	rmse_sampling_error	rmse_alert	rmsle	rmsle_lower_threshold	rmsle_upper_threshold	rmsle_sampling_error	rmsle_alert
0	[0:5999]	0	0	5999	2017-02-16 16:00:00	2017-02-18 23:59:26.400000	853.4	817.855	874.805	8.21576	False	0.228707	0.229456	0.237019	0.00248466	True	1.14313e+06	1.02681e+06	1.21572e+06	21915	False	0.0704883	0.0696521	0.0737091	0.0011989	False	1069.17	1014.28	1103.31	10.348	False	0.265496	0.263948	0.271511	0.002239	False
1	[6000:11999]	1	6000	11999	2017-02-19 00:00:00	2017-02-21 07:59:26.400000	853.137	817.855	874.805	8.21576	False	0.230818	0.229456	0.237019	0.00248466	False	1.13987e+06	1.02681e+06	1.21572e+06	21915	False	0.0699896	0.0696521	0.0737091	0.0011989	False	1067.65	1014.28	1103.31	10.348	False	0.264556	0.263948	0.271511	0.002239	False
2	[12000:17999]	2	12000	17999	2017-02-21 08:00:00	2017-02-23 15:59:26.400000	846.304	817.855	874.805	8.21576	False	0.229042	0.229456	0.237019	0.00248466	True	1.12872e+06	1.02681e+06	1.21572e+06	21915	False	0.0696923	0.0696521	0.0737091	0.0011989	False	1062.41	1014.28	1103.31	10.348	False	0.263993	0.263948	0.271511	0.002239	False
3	[18000:23999]	3	18000	23999	2017-02-23 16:00:00	2017-02-25 23:59:26.400000	855.495	817.855	874.805	8.21576	False	0.233624	0.229456	0.237019	0.00248466	False	1.15829e+06	1.02681e+06	1.21572e+06	21915	False	0.0719322	0.0696521	0.0737091	0.0011989	False	1076.24	1014.28	1103.31	10.348	False	0.268202	0.263948	0.271511	0.002239	False
4	[24000:29999]	4	24000	29999	2017-02-26 00:00:00	2017-02-28 07:59:26.400000	849.33	817.855	874.805	8.21576	False	0.233887	0.229456	0.237019	0.00248466	False	1.12429e+06	1.02681e+06	1.21572e+06	21915	False	0.0724877	0.0696521	0.0737091	0.0011989	False	1060.32	1014.28	1103.31	10.348	False	0.269235	0.263948	0.271511	0.002239	False
5	[30000:35999]	5	30000	35999	2017-02-28 08:00:00	2017-03-02 15:59:26.400000	702.518	817.855	874.805	8.21576	True	0.262864	0.229456	0.237019	0.00248466	True	829589	1.02681e+06	1.21572e+06	21915	True	0.104949	0.0696521	0.0737091	0.0011989	True	910.818	1014.28	1103.31	10.348	True	0.323958	0.263948	0.271511	0.002239	True
6	[36000:41999]	6	36000	41999	2017-03-02 16:00:00	2017-03-04 23:59:26.400000	700.736	817.855	874.805	8.21576	True	0.26346	0.229456	0.237019	0.00248466	True	829693	1.02681e+06	1.21572e+06	21915	True	0.104814	0.0696521	0.0737091	0.0011989	True	910.875	1014.28	1103.31	10.348	True	0.32375	0.263948	0.271511	0.002239	True
7	[42000:47999]	7	42000	47999	2017-03-05 00:00:00	2017-03-07 07:59:26.400000	684.702	817.855	874.805	8.21576	True	0.26095	0.229456	0.237019	0.00248466	True	792287	1.02681e+06	1.21572e+06	21915	True	0.104347	0.0696521	0.0737091	0.0011989	True	890.105	1014.28	1103.31	10.348	True	0.323027	0.263948	0.271511	0.002239	True
8	[48000:53999]	8	48000	53999	2017-03-07 08:00:00	2017-03-09 15:59:26.400000	705.814	817.855	874.805	8.21576	True	0.265371	0.229456	0.237019	0.00248466	True	835917	1.02681e+06	1.21572e+06	21915	True	0.104714	0.0696521	0.0737091	0.0011989	True	914.285	1014.28	1103.31	10.348	True	0.323596	0.263948	0.271511	0.002239	True
9	[54000:59999]	9	54000	59999	2017-03-09 16:00:00	2017-03-11 23:59:26.400000	698.344	817.855	874.805	8.21576	True	0.265757	0.229456	0.237019	0.00248466	True	825936	1.02681e+06	1.21572e+06	21915	True	0.105882	0.0696521	0.0737091	0.0011989	True	908.81	1014.28	1103.31	10.348	True	0.325394	0.263948	0.271511	0.002239	True
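
The key, chunk_index, start_index and end_index columns follow directly from chunk_size=6000. The sketch below reproduces them for a dataset of 60,000 rows, a count inferred from the last end_index in the table. The function name size_based_chunks is hypothetical and this is not NannyML's chunker. In particular, emitting a shorter final chunk when the row count is not a multiple of the chunk size is an assumption of the sketch, and NannyML's handling of incomplete chunks may differ.

```python
def size_based_chunks(n_rows, chunk_size):
    """Yield (key, chunk_index, start_index, end_index) for consecutive row blocks.

    Hypothetical helper for illustration only.
    """
    for chunk_index, start in enumerate(range(0, n_rows, chunk_size)):
        # end_index is inclusive, as in the results table.
        end = min(start + chunk_size, n_rows) - 1
        yield f"[{start}:{end}]", chunk_index, start, end


chunks = list(size_based_chunks(n_rows=60_000, chunk_size=6000))
for key, chunk_index, start, end in chunks:
    print(key, chunk_index, start, end)
```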

The results from the reference data are also available.

>>> display(results.calculator.previous_reference_results)

	key	chunk_index	start_index	end_index	start_date	end_date	mae	mae_lower_threshold	mae_upper_threshold	mae_sampling_error	mae_alert	mape	mape_lower_threshold	mape_upper_threshold	mape_sampling_error	mape_alert	mse	mse_lower_threshold	mse_upper_threshold	mse_sampling_error	mse_alert	msle	msle_lower_threshold	msle_upper_threshold	msle_sampling_error	msle_alert	rmse	rmse_lower_threshold	rmse_upper_threshold	rmse_sampling_error	rmse_alert	rmsle	rmsle_lower_threshold	rmsle_upper_threshold	rmsle_sampling_error	rmsle_alert
0	[0:5999]	0	0	5999	2017-01-24 08:00:00	2017-01-26 15:59:26.400000	863.932	817.855	874.805	8.21576	False	0.23274	0.229456	0.237019	0.00248466	False	1.18007e+06	1.02681e+06	1.21572e+06	21915	False	0.0715427	0.0696521	0.0737091	0.0011989	False	1086.31	1014.28	1103.31	10.348	False	0.267475	0.263948	0.271511	0.002239	False
1	[6000:11999]	1	6000	11999	2017-01-26 16:00:00	2017-01-28 23:59:26.400000	844.491	817.855	874.805	8.21576	False	0.234282	0.229456	0.237019	0.00248466	False	1.12407e+06	1.02681e+06	1.21572e+06	21915	False	0.0721316	0.0696521	0.0737091	0.0011989	False	1060.22	1014.28	1103.31	10.348	False	0.268573	0.263948	0.271511	0.002239	False
2	[12000:17999]	2	12000	17999	2017-01-29 00:00:00	2017-01-31 07:59:26.400000	830.578	817.855	874.805	8.21576	False	0.231986	0.229456	0.237019	0.00248466	False	1.07831e+06	1.02681e+06	1.21572e+06	21915	False	0.0709387	0.0696521	0.0737091	0.0011989	False	1038.42	1014.28	1103.31	10.348	False	0.266343	0.263948	0.271511	0.002239	False
3	[18000:23999]	3	18000	23999	2017-01-31 08:00:00	2017-02-02 15:59:26.400000	838.746	817.855	874.805	8.21576	False	0.231618	0.229456	0.237019	0.00248466	False	1.07827e+06	1.02681e+06	1.21572e+06	21915	False	0.0709489	0.0696521	0.0737091	0.0011989	False	1038.4	1014.28	1103.31	10.348	False	0.266362	0.263948	0.271511	0.002239	False
4	[24000:29999]	4	24000	29999	2017-02-02 16:00:00	2017-02-04 23:59:26.400000	857.765	817.855	874.805	8.21576	False	0.235091	0.229456	0.237019	0.00248466	False	1.14923e+06	1.02681e+06	1.21572e+06	21915	False	0.0727984	0.0696521	0.0737091	0.0011989	False	1072.02	1014.28	1103.31	10.348	False	0.269812	0.263948	0.271511	0.002239	False
5	[30000:35999]	5	30000	35999	2017-02-05 00:00:00	2017-02-07 07:59:26.400000	852.697	817.855	874.805	8.21576	False	0.232364	0.229456	0.237019	0.00248466	False	1.15555e+06	1.02681e+06	1.21572e+06	21915	False	0.0712554	0.0696521	0.0737091	0.0011989	False	1074.97	1014.28	1103.31	10.348	False	0.266937	0.263948	0.271511	0.002239	False
6	[36000:41999]	6	36000	41999	2017-02-07 08:00:00	2017-02-09 15:59:26.400000	842.253	817.855	874.805	8.21576	False	0.232789	0.229456	0.237019	0.00248466	False	1.12037e+06	1.02681e+06	1.21572e+06	21915	False	0.0715653	0.0696521	0.0737091	0.0011989	False	1058.48	1014.28	1103.31	10.348	False	0.267517	0.263948	0.271511	0.002239	False
7	[42000:47999]	7	42000	47999	2017-02-09 16:00:00	2017-02-11 23:59:26.400000	837.9	817.855	874.805	8.21576	False	0.235516	0.229456	0.237019	0.00248466	False	1.10396e+06	1.02681e+06	1.21572e+06	21915	False	0.0729194	0.0696521	0.0737091	0.0011989	False	1050.7	1014.28	1103.31	10.348	False	0.270036	0.263948	0.271511	0.002239	False
8	[48000:53999]	8	48000	53999	2017-02-12 00:00:00	2017-02-14 07:59:26.400000	844.266	817.855	874.805	8.21576	False	0.232423	0.229456	0.237019	0.00248466	False	1.09914e+06	1.02681e+06	1.21572e+06	21915	False	0.0711648	0.0696521	0.0737091	0.0011989	False	1048.4	1014.28	1103.31	10.348	False	0.266767	0.263948	0.271511	0.002239	False
9	[54000:59999]	9	54000	59999	2017-02-14 08:00:00	2017-02-16 15:59:26.400000	850.673	817.855	874.805	8.21576	False	0.233561	0.229456	0.237019	0.00248466	False	1.12369e+06	1.02681e+06	1.21572e+06	21915	False	0.0715405	0.0696521	0.0737091	0.0011989	False	1060.04	1014.28	1103.31	10.348	False	0.267471	0.263948	0.271511	0.002239	False

Apart from the chunk and period-related columns, the results data contain a set of columns for each calculated metric, where <metric> stands for the metric name (for example, mae):

targets_missing_rate - The fraction of missing target data.

<metric> - The value of the metric for a specific chunk.

<metric>_lower_threshold and <metric>_upper_threshold - Lower and upper thresholds for the performance metric. Crossing them raises an alert that the metric has changed significantly. The thresholds are calculated during the fit phase from the realized performance of the reference chunks: they lie 3 standard deviations away from the mean performance on those chunks.

<metric>_alert - A flag indicating a potentially significant performance change. True if the realized performance crosses the upper or lower threshold.

<metric>_sampling_error - The estimated sampling error for the relevant metric.
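
The threshold rule can be checked by hand against the tables in this guide. The sketch below takes the ten mae values from the reference results, computes the mean plus and minus 3 standard deviations, and applies the result to two analysis chunks. It is an illustration and not NannyML's code. Two assumptions are made: the standard deviation is the population form (which reproduces the mae_lower_threshold and mae_upper_threshold values of 817.855 and 874.805 up to rounding), and an alert requires a value strictly outside the thresholds. The is_alert helper is hypothetical.

```python
import statistics

# mae per reference chunk, copied from the reference results table.
reference_mae = [863.932, 844.491, 830.578, 838.746, 857.765,
                 852.697, 842.253, 837.9, 844.266, 850.673]

mean_mae = statistics.fmean(reference_mae)
# Assumption: population standard deviation (no Bessel correction).
std_mae = statistics.pstdev(reference_mae)

lower_threshold = mean_mae - 3 * std_mae
upper_threshold = mean_mae + 3 * std_mae
print(f"lower={lower_threshold:.3f} upper={upper_threshold:.3f}")


def is_alert(value, lower, upper):
    """Flag a chunk whose metric lies strictly outside the thresholds."""
    return value < lower or value > upper


# mae of analysis chunks 0 and 5, copied from the analysis results table.
analysis_mae = {0: 853.4, 5: 702.518}
alerts = {chunk: is_alert(value, lower_threshold, upper_threshold)
          for chunk, value in analysis_mae.items()}
print(alerts)
```

Chunk 0 stays inside the band and chunk 5 falls below the lower threshold, which agrees with the mae_alert column.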

The results can be plotted for visual inspection:

>>> for metric in calc.metrics:
...     figure = results.plot(kind='performance', plot_reference=True, metric=metric)
...     figure.show()

Insights

The RMSE and RMSLE results show an interesting effect. RMSE penalizes mispredictions symmetrically, while RMSLE penalizes underprediction more than overprediction. From analysis chunk 5 onward, RMSE drops from roughly 1,065 to roughly 910, while RMSLE rises from roughly 0.265 to roughly 0.324. So although the model has become somewhat more accurate according to RMSE, the increase in RMSLE tells us that it is now underpredicting more than it was before.
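
The asymmetry of the logarithmic error can be seen on a single prediction. The sketch below uses made-up numbers (a true price of 1000, predicted 500 too low and 500 too high) and assumes the log(1 + y) form of the squared logarithmic error. It is only meant to illustrate the effect, not to reproduce NannyML's calculation.

```python
import math

# Made-up values for illustration: the same absolute miss in both directions.
y_true = 1000.0
under_pred = 500.0   # 500 below the true value
over_pred = 1500.0   # 500 above the true value


def squared_error(t, p):
    """Per-sample term of MSE and RMSE."""
    return (t - p) ** 2


def squared_log_error(t, p):
    """Per-sample term of MSLE and RMSLE, assuming the log(1 + y) form."""
    return (math.log1p(t) - math.log1p(p)) ** 2


print("squared error:    ", squared_error(y_true, under_pred),
      squared_error(y_true, over_pred))
print("squared log error:", squared_log_error(y_true, under_pred),
      squared_log_error(y_true, over_pred))
```

The squared error is identical in both directions, while the squared logarithmic error is larger for the underprediction, because the logarithm compares ratios rather than differences.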

What Next

If we decide further investigation is needed, the Data Drift functionality can help us to see what feature changes may be contributing to any performance changes.

It is also wise to check whether the model’s performance is satisfactory according to business requirements. This is an ad-hoc investigation that is not covered by NannyML.