Quickstart

What is NannyML?

NannyML detects silent model failures, estimates the performance of ML models in the absence of ground truth, and robustly detects data drift potentially responsible for the failure. It can also monitor realized performance when the target data is available.

This Quickstart presents some of the core functionalities of NannyML.

Example Workflow with NannyML

Loading data

We will use a real-world dataset containing inputs and predictions of a binary classification model that predicts whether an individual is employed based on survey data. To learn more about this dataset, check out US Census Employment dataset.

The data is split into two periods: reference and analysis. The reference data is used by NannyML to establish a baseline for model performance and drift detection. The model’s test set can serve as the reference data. The analysis data is the data you want to analyze, i.e., to check whether the model maintains its performance or whether the feature distributions have shifted. This would usually be the latest production data.

Let’s load the libraries and the data:

>>> import nannyml as nml
>>> import pandas as pd
>>> from IPython.display import display

>>> reference_df, analysis_df, _ = nml.load_us_census_ma_employment_data()
>>> display(reference_df.head())
>>> display(analysis_df.head())

	id	AGEP	…	RAC1P	employed	year	prediction	predicted_probability
0	0	62	…	1	0	2015	0	0.121211
1	1	48	…	1	0	2015	1	0.816033
2	2	47	…	1	0	2015	1	0.951815
3	3	34	…	2	0	2015	1	0.563825
4	4	33	…	1	1	2015	1	0.944436

	id	AGEP	…	SEX	RAC1P	year	prediction	predicted_probability
0	68785	46	…	1	1	2016	1	0.948828
1	68786	46	…	2	1	2016	1	0.772002
2	68787	12	…	2	1	2016	0	0.000149194
3	68788	52	…	2	1	2016	1	0.90607
4	68789	21	…	1	1	2016	1	0.699663

The dataframes contain:

model inputs like AGEP, SCHL, etc.
year - the year the data was gathered. The reference_df data covers 2015 while analysis_df ranges from 2016 to 2018.
employed - classification target. Notice that the target is not available in analysis_df.
prediction - analyzed model predictions.
predicted_probability - analyzed model predicted probability scores.

Estimating Performance without Targets

ML models are deployed to production once their business value and performance have been validated and tested. This usually takes place in the model development phase. The main goal of ML model monitoring is to continuously verify whether the model maintains its anticipated performance (which is not the case most of the time [1]).

Monitoring performance is relatively straightforward when targets are available, but this is often not the case. The labels can be delayed, costly, or impossible to get. In such cases, estimating performance is a good start for the monitoring workflow. NannyML can estimate the performance of an ML model without access to targets.

To reliably assess the performance of an ML model, we need to aggregate data. We call this aggregation chunking, and the result of it is a chunk. There are many ways to define chunks in NannyML. In this Quickstart, we will use size-based chunking and define the size of the chunk to be 5000 observations:

>>> chunk_size = 5000
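To make size-based chunking concrete, here is a minimal sketch in plain Python. It is an illustration only, not NannyML's implementation: the helper chunk_bounds and the row count of 12,000 are hypothetical, and how the library treats an incomplete last chunk is not covered here.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, end) row positions of consecutive size-based chunks.

    `end` is exclusive. The last chunk may hold fewer than `chunk_size` rows.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


# Hypothetical dataset of 12,000 rows, split with the chunk size used above.
bounds = chunk_bounds(n_rows=12_000, chunk_size=5000)
for start, end in bounds:
    print(f"rows {start} to {end - 1}: {end - start} observations")
```

Each chunk then yields one aggregated value per metric, which is why results in the rest of this Quickstart are reported per chunk.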

For binary classification model performance estimation we will use the CBPE class (Confidence-based Performance Estimation) to estimate the roc_auc metric. Let’s initialize the estimator and provide the required arguments:

>>> estimator = nml.CBPE(
...     problem_type='classification_binary',
...     y_pred_proba='predicted_probability',
...     y_pred='prediction',
...     y_true='employed',
...     metrics=['roc_auc'],
...     chunk_size=chunk_size,
... )

Now we will fit it on reference_df and estimate on analysis_df:

>>> estimator = estimator.fit(reference_df)
>>> estimated_performance = estimator.estimate(analysis_df)
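For some intuition on how performance can be estimated without targets, consider the simpler case of accuracy. If the predicted probabilities are well calibrated (an assumption of this sketch), a positive prediction with probability p is expected to be correct with probability p, and a negative prediction with probability 1 - p. The sketch below is a simplified illustration of that idea and not the CBPE implementation; estimating roc_auc is more involved. The helper expected_accuracy is hypothetical, and the input values are the five analysis rows displayed earlier.

```python
def expected_accuracy(predictions, probabilities):
    """Expected accuracy of binary predictions, assuming calibrated scores.

    predictions   -- predicted labels (0 or 1)
    probabilities -- predicted probability of the positive class
    """
    if len(predictions) != len(probabilities) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    # Probability that each individual prediction is correct.
    per_row = [
        p if y_hat == 1 else 1.0 - p
        for y_hat, p in zip(predictions, probabilities)
    ]
    return sum(per_row) / len(per_row)


# The first five rows of the analysis data shown above.
predictions = [1, 1, 0, 1, 1]
probabilities = [0.948828, 0.772002, 0.000149194, 0.90607, 0.699663]

estimate = expected_accuracy(predictions, probabilities)
print(f"expected accuracy without targets: {estimate:.3f}")
```

No target column is used anywhere in this calculation, which is the point: the estimate relies entirely on the model's own confidence.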

Let’s visualize the results:

>>> figure = estimated_performance.plot()
>>> figure.show()

The estimated performance dropped significantly in the later part of the analysis. Let’s investigate this to determine whether we can rely on the estimation.

Investigating Data Distribution Shifts

Once we’ve identified a performance issue, we will troubleshoot it. We will quantify potential distribution shifts for all the features using the univariate drift detection module. We will instantiate the UnivariateDriftCalculator class with the required arguments, fit it on reference_df, and calculate on analysis_df.

>>> feature_column_names = ['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG',
...                         'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P']

>>> univariate_calculator = nml.UnivariateDriftCalculator(
...     column_names=feature_column_names,
...     chunk_size=chunk_size
... )

>>> univariate_calculator.fit(reference_df)
>>> univariate_drift = univariate_calculator.calculate(analysis_df)
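The drift plots in this Quickstart report the Jensen-Shannon (JS) distance between the reference data and each chunk. As a rough illustration of that measure, the sketch below computes it for two categorical distributions in plain Python. It is not NannyML's implementation (binning of continuous features, for example, is not covered), the helper js_distance is hypothetical, and the frequencies are made up for the example.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.

    p and q hold relative frequencies over the same categories and each
    sum to 1. With base-2 logarithms the result lies between 0 and 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl_divergence(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(divergence, 0.0))


# Hypothetical relative frequencies of two categories of one feature.
reference_freq = [0.95, 0.05]
chunk_freq = [0.30, 0.70]

drift = js_distance(reference_freq, chunk_freq)
print(f"JS distance between reference and chunk: {drift:.3f}")
```

A value of 0 means the two distributions are identical, and values closer to 1 indicate a stronger shift.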

Now let’s select only the features that drifted the most. To do this, we use one of the ranking methods - AlertCountRanker():

>>> alert_count_ranker = nml.AlertCountRanker()
>>> alert_count_ranked_features = alert_count_ranker.rank(univariate_drift)
>>> display(alert_count_ranked_features.head())

	number_of_alerts	column_name	rank
0	37	ANC	1
1	29	AGEP	2
2	28	RELP	3
3	28	MAR	4
4	28	DREM	5
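Ranking by alert count boils down to counting, for each feature, the chunks in which a drift alert was raised and then sorting the features by that count. The sketch below illustrates the idea in plain Python. It is a simplification and not NannyML's implementation; the helper rank_by_alert_count, the feature names, and the alert flags are all hypothetical.

```python
def rank_by_alert_count(alerts_by_column):
    """Rank columns by the number of chunks that raised a drift alert.

    alerts_by_column maps a column name to a list of per-chunk alert flags.
    Returns (column_name, number_of_alerts, rank) tuples, most alerts first.
    Ties keep the order in which the columns were given.
    """
    counts = {column: sum(flags) for column, flags in alerts_by_column.items()}
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (column, count, rank)
        for rank, (column, count) in enumerate(ordered, start=1)
    ]


# Hypothetical alert flags for three features over four chunks.
alerts = {
    "feature_a": [False, True, True, True],
    "feature_b": [False, False, False, True],
    "feature_c": [True, True, True, True],
}

ranking = rank_by_alert_count(alerts)
for column, count, rank in ranking:
    print(f"{rank}. {column}: {count} alerts")
```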

We will take a closer look at three of the drifting features:

RELP - the relationship of the person with the house owner.

AGEP - age of the person.

SCHL - education level.

Let’s plot univariate drift results for these features:

>>> figure = univariate_drift.filter(column_names=['RELP','AGEP', 'SCHL']).plot()
>>> figure.show()

The plots show the Jensen-Shannon (JS) distance calculated between the reference data and each chunk for every feature. For AGEP and RELP, one can see a mild shift starting around one-third of the way into the analysis period and a high peak that likely corresponds to the performance drop. Around the same time, a similar peak can be noticed for SCHL. Let’s check whether the shift happens at the same time as the performance drop by showing both results in a single plot:

>>> uni_drift_AGEP_analysis = univariate_drift.filter(column_names=['AGEP'], period='analysis')
>>> figure = estimated_performance.compare(uni_drift_AGEP_analysis).plot()
>>> figure.show()

The main drift peak indeed coincides with the strongest performance drop. It is interesting to see that there is a noticeable shift magnitude increase right before the estimated drop happens. That looks like an early sign of incoming issues. Now let’s have a closer look at changes in the distributions by visualizing them in the analysis period:

>>> figure = univariate_drift.filter(period='analysis', column_names=['RELP','AGEP', 'SCHL']).plot(kind='distribution')
>>> figure.show()

Let’s summarize the shifts:

The distribution of person age (AGEP) has strongly shifted towards younger people (around 18 years old).

The relative frequencies of the categories in RELP have changed significantly. Since the plots are interactive (when run in a notebook), they allow checking the corresponding values in the bar plots. The category that has increased its relative frequency from around 5% in the reference period to almost 70% in the chunk with the strongest drift is encoded with value 17, which refers to Noninstitutionalized group quarters population. This corresponds to people who live in group quarters other than institutions. Examples are: college dormitories, rooming houses, religious group houses, communes, or halfway houses.

The distribution of SCHL changed, with one of the categories doubling its relative frequency. This category is encoded with value 19, which corresponds to people with 1 or more years of college credit, no degree.

So the main respondents in the period with data shift are young people who finished at least one year of college but did not graduate and don’t live at their parents’ houses. Most likely, a large survey campaign was conducted at college or university dormitories. These findings indicate that a significant part of the shift is covariate shift [2], which CBPE handles well.

Comparing Estimated with Realized Performance when Targets Arrive

Once the labels are in place, we can calculate the realized performance and compare it with the estimation to verify its accuracy. We will use the PerformanceCalculator and follow the familiar pattern: initialize, fit, and calculate. Then we will plot the comparison:

>>> _, _, analysis_targets_df = nml.load_us_census_ma_employment_data()

>>> analysis_with_targets_df = pd.concat([analysis_df, analysis_targets_df], axis=1)
>>> display(analysis_with_targets_df.head())

	id	AGEP	…	year	prediction	predicted_probability	id	employed
0	68785	46	…	2016	1	0.948828	68785	1
1	68786	46	…	2016	1	0.772002	68786	1
2	68787	12	…	2016	0	0.000149194	68787	0
3	68788	52	…	2016	1	0.90607	68788	1
4	68789	21	…	2016	1	0.699663	68789	0

>>> performance_calculator = nml.PerformanceCalculator(
...     problem_type='classification_binary',
...     y_pred_proba='predicted_probability',
...     y_pred='prediction',
...     y_true='employed',
...     metrics=['roc_auc'],
...     chunk_size=chunk_size)

>>> performance_calculator.fit(reference_df)
>>> calculated_performance = performance_calculator.calculate(analysis_with_targets_df)

>>> figure = estimated_performance.filter(period='analysis').compare(calculated_performance).plot()
>>> figure.show()

We see that the realized performance has indeed sharply dropped in the two indicated chunks. The performance was relatively stable in the preceding period even though AGEP was already slightly shifted at that time. This confirms the need to monitor performance/estimated performance, as not every shift impacts performance.
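As a reminder of what the realized roc_auc measures once targets are available, the sketch below computes it from its pairwise definition: the share of (positive, negative) pairs in which the positive observation received the higher score. It is an illustration in plain Python, not the PerformanceCalculator implementation, and the helper roc_auc is hypothetical. The inputs are the five rows of analysis_with_targets_df displayed above.

```python
def roc_auc(targets, scores):
    """ROC AUC as the share of correctly ordered (positive, negative) pairs.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(targets, scores) if y == 1]
    negatives = [s for y, s in zip(targets, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute ROC AUC")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Targets and predicted probabilities of the five rows shown above.
targets = [1, 1, 0, 1, 0]
scores = [0.948828, 0.772002, 0.000149194, 0.90607, 0.699663]

realized = roc_auc(targets, scores)
print(f"realized ROC AUC on these five rows: {realized}")
```

Five rows are far too few for a reliable metric, which is one reason the calculation is done per chunk of 5000 observations.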

What’s next?

This Quickstart presents some of the core functionalities of NannyML on an example of real-world binary classification data. The walk-through is concise to help you get familiar with the fundamental concepts and structure of the library. NannyML provides other useful functionalities (like multivariate drift detection) that can help you monitor your models in production comprehensively. All of our tutorials are an excellent place to start exploring them.

If you want to know what is implemented under the hood, visit how it works. Finally, if you are just looking for examples on other datasets or ML problems, look through our examples.

References