Quick Start Guide

NannyML is a library that makes Model Monitoring more productive. It estimates the performance of your models in the absence of the target, detects data drift, and identifies the data drift that is responsible for any drop in performance.

NannyML provides a sample synthetic dataset that can be used for testing purposes.

>>> import pandas as pd
>>> import nannyml as nml
>>> reference, analysis, analysis_target = nml.load_synthetic_sample()
>>> reference.head()

	distance_from_office	salary_range	gas_price_per_litre	public_transportation_cost	wfh_prev_workday	workday	tenure	identifier	work_home_actual	timestamp	y_pred_proba	partition
0	5.96225	40K - 60K €	2.11948	8.56806	False	Friday	0.212653	0	1	2014-05-09 22:27:20	0.99	reference
1	0.535872	40K - 60K €	2.3572	5.42538	True	Tuesday	4.92755	1	0	2014-05-09 22:59:32	0.07	reference
2	1.96952	40K - 60K €	2.36685	8.24716	False	Monday	0.520817	2	1	2014-05-09 23:48:25	1	reference
3	2.53041	20K - 20K €	2.31872	7.94425	False	Tuesday	0.453649	3	1	2014-05-10 01:12:09	0.98	reference
4	2.25364	60K+ €	2.22127	8.88448	True	Thursday	5.69526	4	1	2014-05-10 02:21:34	0.99	reference

The synthetic dataset describes a binary classification model that predicts whether an employee will work from home on the next workday. The predicted probability of the employee working from home is in the y_pred_proba column, and the Target is in the work_home_actual column. The model inputs are distance_from_office, salary_range, gas_price_per_litre, public_transportation_cost, wfh_prev_workday, workday and tenure. identifier is the Identifier column and timestamp is the Timestamp column.

The next step is to have NannyML deduce some information about the model from the dataset and to choose how we will split our data into Data Chunks.

>>> metadata = nml.extract_metadata(data=reference, model_name='wfh_predictor')
>>> metadata.target_column_name = 'work_home_actual'
>>> data = pd.concat([reference, analysis], ignore_index=True)
>>> # Let's use a chunk size of 5000 data points to create our drift statistics
>>> chunk_size = 5000
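
To make the idea of size-based Data Chunks concrete, here is a minimal pandas sketch that cuts a frame into consecutive blocks of a fixed number of rows. The helper name split_into_chunks and the toy frame are hypothetical and only illustrate the concept; how NannyML builds its chunks internally, including how it treats a final incomplete chunk, is not shown here.

```python
import pandas as pd


def split_into_chunks(df: pd.DataFrame, chunk_size: int) -> list:
    """Split a DataFrame into consecutive chunks of at most chunk_size rows."""
    return [df.iloc[start:start + chunk_size] for start in range(0, len(df), chunk_size)]


# Toy stand-in for the combined reference + analysis data (hypothetical values).
toy = pd.DataFrame({
    "identifier": range(12),
    "y_pred_proba": [0.1 * (i % 10) for i in range(12)],
})

chunks = split_into_chunks(toy, chunk_size=5)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)
```

With chunk_size = 5000, each point in the plots further down would summarize one block of 5000 rows rather than 5 as in this toy example.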

The data are already split into a reference and an analysis partition. NannyML uses the reference partition to establish a baseline for expected model performance and the analysis partition to check whether the monitored model keeps performing as expected. For more information about partitions, see Data Partitions. The key thing to note is that we don’t expect the analysis partition to contain information about the Target. This is why, in the synthetic dataset, the Target is provided in a separate object.

>>> analysis.head()

	distance_from_office	salary_range	gas_price_per_litre	public_transportation_cost	wfh_prev_workday	workday	tenure	identifier	timestamp	y_pred_proba	partition
49995	6.04391	0 - 20K €	1.98303	5.89122	True	Thursday	6.41158	99995	2021-01-01 02:42:38	0.17	analysis
49996	5.67666	20K - 20K €	2.04855	7.5841	True	Wednesday	3.86351	99996	2021-01-01 04:04:01	0.55	analysis
49997	3.14311	0 - 20K €	2.2082	6.57467	True	Tuesday	6.46297	99997	2021-01-01 04:12:57	0.22	analysis
49998	8.33514	40K - 60K €	2.39448	5.25745	True	Monday	6.40706	99998	2021-01-01 04:17:41	0.02	analysis
49999	8.26605	0 - 20K €	1.41597	8.10898	False	Friday	6.90411	99999	2021-01-01 04:29:32	0.02	analysis
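
Since the analysis partition has no Target column, targets that arrive later have to be matched back to their predictions. The sketch below shows one way to do that with a left join on identifier. It assumes the separate target object carries identifier and work_home_actual columns, which is an assumption rather than something shown above, and the target values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the analysis data and the late-arriving targets.
analysis_toy = pd.DataFrame({
    "identifier": [99995, 99996, 99997],
    "y_pred_proba": [0.17, 0.55, 0.22],
})
targets_toy = pd.DataFrame({
    "identifier": [99995, 99997],   # the target for 99996 has not arrived yet
    "work_home_actual": [0, 1],     # made-up outcomes
})

# A left join keeps every prediction; rows without a target get NaN.
joined = analysis_toy.merge(targets_toy, on="identifier", how="left")
has_target = joined["work_home_actual"].notna()
print(joined)
```

Rows where has_target is False are exactly the ones for which performance can only be estimated, not measured.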

This quick start guide walks you through running NannyML, viewing the estimated performance of your model, and exploring data drift detection. It assumes that your data is already formatted according to the NannyML data formatting requirements.

Estimating Performance without Targets

NannyML can estimate the performance of a Machine Learning model in production without access to its Target. To find out how, see Performance Estimation.

>>> # fit estimator and estimate
>>> estimator = nml.CBPE(model_metadata=metadata, chunk_size=chunk_size)
>>> estimator.fit(reference)
>>> estimated_performance = estimator.estimate(data=data)
>>> # show results
>>> figure = estimated_performance.plot(kind='performance')
>>> figure.show()

The results indicate that the model’s performance is likely to be negatively impacted in the second half of 2019.
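
To build intuition for how performance can be estimated without targets, the sketch below shows the general idea under one strong assumption: the predicted probabilities are well calibrated. In that case a prediction with score p is correct with probability p (or 1 - p for a negative prediction), so expected accuracy can be computed from the scores alone. This is a simplified illustration on simulated data, not NannyML's implementation, and the metric and the 0.5 threshold are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated scores, with labels drawn so the scores are calibrated by construction.
y_pred_proba = rng.uniform(0.0, 1.0, size=n)
y_true = rng.uniform(0.0, 1.0, size=n) < y_pred_proba
y_pred = y_pred_proba >= 0.5

# Probability that each individual prediction is correct, using scores only.
p_correct = np.where(y_pred, y_pred_proba, 1.0 - y_pred_proba)
estimated_accuracy = float(p_correct.mean())

# Realized accuracy, which needs the targets and is shown only for comparison.
realized_accuracy = float((y_pred == y_true).mean())

print(f"estimated: {estimated_accuracy:.3f}, realized: {realized_accuracy:.3f}")
```

The two numbers agree closely here only because the simulated scores are calibrated; with poorly calibrated scores an estimate of this kind would be biased.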

Detecting Data Drift

NannyML allows for further investigation into potential performance issues with its data drift detection functionality. See Detecting drift in model inputs for more details.

An example of using NannyML to compute and visualize data drift for the model inputs can be seen below:

>>> # Let's initialize the object that will perform the Univariate Drift calculations
>>> univariate_calculator = nml.UnivariateStatisticalDriftCalculator(model_metadata=metadata, chunk_size=chunk_size)
>>> univariate_calculator.fit(reference_data=reference)
>>> univariate_results = univariate_calculator.calculate(data=data)
>>> # let's plot drift results for all model inputs
>>> for feature in metadata.features:
...     figure = univariate_results.plot(kind='feature_drift', metric='statistic', feature_label=feature.label)
...     figure.show()
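
As a rough picture of what a univariate drift statistic measures, the following sketch compares a continuous feature in two chunks against a reference sample using the two-sample Kolmogorov-Smirnov statistic, implemented directly with NumPy. Whether NannyML uses this exact statistic for a given feature is an assumption here, and the data is simulated; the function name ks_statistic is hypothetical.

```python
import numpy as np


def ks_statistic(sample_a, sample_b) -> float:
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


rng = np.random.default_rng(42)
reference_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
chunk_stable = rng.normal(loc=0.0, scale=1.0, size=1000)   # same distribution
chunk_shifted = rng.normal(loc=1.0, scale=1.0, size=1000)  # mean shifted by 1

stat_stable = ks_statistic(reference_feature, chunk_stable)
stat_shifted = ks_statistic(reference_feature, chunk_shifted)
print(f"stable chunk: {stat_stable:.3f}, shifted chunk: {stat_shifted:.3f}")
```

The shifted chunk yields a much larger statistic than the stable one, which is the kind of jump that the feature drift plots make visible chunk by chunk.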

When there are a lot of drifted features, NannyML can also rank them by the number of alerts they have raised:

>>> ranker = nml.Ranker.by('alert_count')
>>> ranked_features = ranker.rank(univariate_results, model_metadata=metadata, only_drifting=False)
>>> ranked_features

	feature	number_of_alerts	rank
0	wfh_prev_workday	5	1
1	salary_range	5	2
2	distance_from_office	5	3
3	public_transportation_cost	5	4
4	tenure	2	5
5	workday	0	6
6	gas_price_per_litre	0	7
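
A ranking like the one above can be reproduced in spirit with plain pandas: count the alerts per feature across chunks and sort in descending order. The sketch assumes one boolean alert column per feature named with an _alert suffix, which is a hypothetical layout chosen for the example and not necessarily the layout of NannyML's results; the alert values are made up.

```python
import pandas as pd

# One row per chunk, one boolean alert column per feature (hypothetical layout).
alerts = pd.DataFrame({
    "tenure_alert": [False, True, True, False],
    "workday_alert": [False, False, False, False],
    "salary_range_alert": [True, True, True, False],
})

counts = alerts.sum().rename("number_of_alerts")
ranking = (
    counts.rename_axis("feature")
    .reset_index()
    # A stable sort keeps the original column order among tied features.
    .sort_values("number_of_alerts", ascending=False, kind="stable")
    .reset_index(drop=True)
)
ranking["feature"] = ranking["feature"].str.replace("_alert", "", regex=False)
ranking["rank"] = range(1, len(ranking) + 1)
print(ranking)
```

Ties are broken here by column order, so features with equal alert counts still receive distinct ranks.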

NannyML can also look for drift in the model outputs:

>>> figure = univariate_results.plot(kind='prediction_drift', metric='statistic')
>>> figure.show()

More complex data drift cases can be detected by Data Reconstruction with PCA. For more information see Data Reconstruction with PCA Deep Dive.

>>> # Let's initialize the object that will perform Data Reconstruction with PCA
>>> rcerror_calculator = nml.DataReconstructionDriftCalculator(model_metadata=metadata, chunk_size=chunk_size)
>>> # NannyML compares drift versus the full reference dataset.
>>> rcerror_calculator.fit(reference_data=reference)
>>> # let's see Reconstruction error statistics for all available data
>>> rcerror_results = rcerror_calculator.calculate(data=data)
>>> figure = rcerror_results.plot(kind='drift')
>>> figure.show()
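
The idea behind reconstruction error can be sketched with NumPy alone: learn a low-dimensional PCA representation on reference data, then compress and decompress new data and measure how much is lost. In the simulated example below, the drifted data keeps roughly the same per-feature distributions but loses the correlation between two features, so univariate checks would see little while the reconstruction error rises. The function names, the number of components and the data are assumptions made for illustration, not NannyML's implementation.

```python
import numpy as np

rng = np.random.default_rng(7)


def make_data(n: int, correlated: bool) -> np.ndarray:
    """Three features; x2 tracks x1 closely only when correlated is True."""
    x1 = rng.normal(size=n)
    x2 = x1 + 0.1 * rng.normal(size=n) if correlated else rng.normal(size=n)
    x3 = rng.normal(size=n)
    return np.column_stack([x1, x2, x3])


def fit_pca(X: np.ndarray, n_components: int):
    """Standardize X and return (mean, std, top principal directions)."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    _, _, vt = np.linalg.svd((X - mean) / std, full_matrices=False)
    return mean, std, vt[:n_components]


def reconstruction_error(X: np.ndarray, mean, std, components) -> float:
    """Mean Euclidean distance between scaled rows and their PCA reconstruction."""
    Xs = (X - mean) / std
    reconstructed = Xs @ components.T @ components
    return float(np.linalg.norm(Xs - reconstructed, axis=1).mean())


reference_toy = make_data(5000, correlated=True)
mean, std, components = fit_pca(reference_toy, n_components=2)

reference_error = reconstruction_error(reference_toy, mean, std, components)
drifted_error = reconstruction_error(make_data(5000, correlated=False), mean, std, components)
print(f"reference: {reference_error:.3f}, drifted: {drifted_error:.3f}")
```

Two components are enough for the reference data because x1 and x2 are nearly redundant there; once that relationship breaks, the discarded direction carries real variance and the error grows.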

Putting everything together, we see that four features exhibit data drift during late 2019: distance_from_office, salary_range, public_transportation_cost and wfh_prev_workday. This drift is responsible for the potential negative impact on performance that we observed.