Quickstart

What is NannyML?

NannyML detects silent model failure, estimates the performance of ML models after deployment before target data becomes available, and robustly detects the data drift potentially responsible for the failure. It can also monitor performance once target data is available.

Installing NannyML

NannyML depends on LightGBM, which might require you to install additional OS-specific binaries. You can follow the official LightGBM installation guide.

From the shell of your Python environment, type:

$ pip install nannyml

or

$ conda install -c conda-forge nannyml

or

$ docker run -v /local/config/dir/:/config/ nannyml/nannyml nml run

Contents of the Quickstart

This Quickstart presents core functionalities of NannyML on an example binary classification model that predicts whether a customer will repay a loan to buy a car.

First, the whole code is shown so you can jump in and experiment right away if you want.

This is followed by a detailed walkthrough that helps you get familiar with the flow and explains the details. The synthetic dataset used contains inputs that are already merged with model predictions and ready to be used by NannyML directly.

All our tutorials are a good place to find detailed guides on the main concepts and functionalities. If you want to know what is implemented under the hood, visit How it Works. Finally, if you are just looking for examples on other datasets or ML problems, look through our Examples.

Note

The following example does not use any timestamps. These are optional but have an impact on the way data is chunked and results are plotted. You can read more about them in the data requirements.

Just the code

>>> import nannyml as nml
>>> from IPython.display import display

>>> # Load synthetic data
>>> reference, analysis, analysis_target = nml.load_synthetic_car_loan_dataset()
>>> display(reference.head())
>>> display(analysis.head())

>>> # Choose a chunker or set a chunk size
>>> chunk_size = 5000

>>> # initialize, specify required data columns, fit estimator and estimate
>>> estimator = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     metrics=['roc_auc'],
...     chunk_size=chunk_size,
...     problem_type='classification_binary',
... )
>>> estimator = estimator.fit(reference)
>>> estimated_performance = estimator.estimate(analysis)

>>> # Show results
>>> figure = estimated_performance.plot()
>>> figure.show()

>>> # Define feature columns
>>> feature_column_names = [
...     col for col in reference.columns if col not in [
...         'timestamp', 'repaid',
...     ]]

>>> # Let's initialize the object that will perform the Univariate Drift calculations
>>> univariate_calculator = nml.UnivariateDriftCalculator(
...     column_names=feature_column_names,
...     treat_as_categorical=['y_pred'],
...     chunk_size=chunk_size,
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
... )
>>> univariate_calculator = univariate_calculator.fit(reference)
>>> univariate_results = univariate_calculator.calculate(analysis)
>>> # Plot drift results for all continuous columns
>>> figure = univariate_results.filter(
...     column_names=univariate_results.continuous_column_names,
...     period='analysis',
...     methods=['jensen_shannon']).plot(kind='drift')
>>> figure.show()

>>> # Plot drift results for all categorical columns
>>> figure = univariate_results.filter(
...     column_names=univariate_results.categorical_column_names,
...     period='analysis',
...     methods=['chi2']).plot(kind='drift')
>>> figure.show()

>>> ranker = nml.CorrelationRanker()
>>> # ranker fits on one metric and reference period data only
>>> ranker.fit(
...     estimated_performance.filter(period='reference', metrics=['roc_auc']))
>>> # ranker ranks on one drift method and one performance metric
>>> ranked_features = ranker.rank(
...     univariate_results.filter(methods=['jensen_shannon']),
...     estimated_performance.filter(metrics=['roc_auc']),
...     only_drifting=False)
>>> display(ranked_features)

>>> # Let's initialize the object that will perform Data Reconstruction with PCA
>>> rcerror_calculator = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     chunk_size=chunk_size
... ).fit(reference_data=reference)
>>> # let's see Reconstruction error statistics for all available data
>>> rcerror_results = rcerror_calculator.calculate(analysis)
>>> figure = rcerror_results.plot()
>>> figure.show()

Walkthrough

We start by loading the synthetic dataset included in the library. This synthetic dataset contains inputs and predictions of a binary classification model that predicts whether a customer will repay a loan to buy a car.

The probability of the customer repaying the loan is included in the y_pred_proba column, while the prediction is in y_pred column.

The model inputs are car_value, salary_range, debt_to_income_ratio, loan_length, repaid_loan_on_prev_car, size_of_downpayment and driver_tenure.

timestamp is the Timestamp column.

The data are split into a reference period and an analysis period. NannyML uses the reference period to establish a baseline for expected model performance. The analysis period is where we estimate or monitor performance, as well as detect data drift.

For more information about periods check Data Periods. A key thing to remember is that the analysis period doesn’t need to contain the Target data.
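As an illustration of the two periods only (this toy frame and the cutoff date are made up for the example, not part of the NannyML dataset), a timestamp-based split might look like this:

```python
import pandas as pd

# Hypothetical production log; in practice the reference period is often
# the test set the model was evaluated on, and the analysis period is
# the newer production data that follows it.
df = pd.DataFrame({
    'timestamp': pd.date_range('2018-01-01', periods=6, freq='D'),
    'y_pred_proba': [0.9, 0.2, 0.7, 0.8, 0.4, 0.6],
})

cutoff = pd.Timestamp('2018-01-04')
reference = df[df['timestamp'] < cutoff]   # rows before the cutoff
analysis = df[df['timestamp'] >= cutoff]   # rows from the cutoff onwards
print(len(reference), len(analysis))  # 3 3
```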

Let’s load and preview the data:

>>> import nannyml as nml
>>> from IPython.display import display

>>> # Load synthetic data
>>> reference, analysis, analysis_target = nml.load_synthetic_car_loan_dataset()
>>> display(reference.head())
>>> display(analysis.head())

|   | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp               | y_pred_proba | y_pred |
|---|-----------|--------------|----------------------|-------------|-------------------------|---------------------|---------------|--------|-------------------------|--------------|--------|
| 0 | 39811     | 40K - 60K €  | 0.63295              | 19          | False                   | 40%                 | 0.212653      | 1      | 2018-01-01 00:00:00.000 | 0.99         | 1      |
| 1 | 12679     | 40K - 60K €  | 0.718627             | 7           | True                    | 10%                 | 4.92755       | 0      | 2018-01-01 00:08:43.152 | 0.07         | 0      |
| 2 | 19847     | 40K - 60K €  | 0.721724             | 17          | False                   | 0%                  | 0.520817      | 1      | 2018-01-01 00:17:26.304 | 1            | 1      |
| 3 | 22652     | 20K - 20K €  | 0.705992             | 16          | False                   | 10%                 | 0.453649      | 1      | 2018-01-01 00:26:09.456 | 0.98         | 1      |
| 4 | 21268     | 60K+ €       | 0.671888             | 21          | True                    | 30%                 | 5.69526       | 1      | 2018-01-01 00:34:52.608 | 0.99         | 1      |

|   | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | timestamp               | y_pred_proba | y_pred |
|---|-----------|--------------|----------------------|-------------|-------------------------|---------------------|---------------|-------------------------|--------------|--------|
| 0 | 12638     | 0 - 20K €    | 0.487926             | 21          | False                   | 10%                 | 4.22463       | 2018-10-30 18:00:00.000 | 0.99         | 1      |
| 1 | 52425     | 20K - 20K €  | 0.672183             | 20          | False                   | 40%                 | 4.9631        | 2018-10-30 18:08:43.152 | 0.98         | 1      |
| 2 | 20369     | 40K - 60K €  | 0.70309              | 19          | True                    | 40%                 | 4.58895       | 2018-10-30 18:17:26.304 | 0.98         | 1      |
| 3 | 10592     | 20K - 20K €  | 0.653258             | 21          | False                   | 10%                 | 4.71101       | 2018-10-30 18:26:09.456 | 0.97         | 1      |
| 4 | 33933     | 0 - 20K €    | 0.722263             | 18          | False                   | 0%                  | 0.906738      | 2018-10-30 18:34:52.608 | 0.92         | 1      |

We need to choose how we will split our data into Data Chunks.

>>> # Choose a chunker or set a chunk size
>>> chunk_size = 5000
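To illustrate what a fixed chunk size means, here is a sketch of the idea in plain Python (illustrative only, not NannyML's actual chunker, which also offers other chunking strategies and configurable handling of incomplete trailing chunks):

```python
def chunk_indices(n_rows, chunk_size):
    """Split row indices into consecutive chunks of chunk_size rows.

    Any leftover rows at the end form a final, smaller chunk.
    """
    return [list(range(start, min(start + chunk_size, n_rows)))
            for start in range(0, n_rows, chunk_size)]

# 12 rows with a chunk size of 5 yield chunks of 5, 5 and 2 rows.
print([len(chunk) for chunk in chunk_indices(12, 5)])  # [5, 5, 2]
```

Every drift statistic and performance estimate below is computed once per chunk, so the chunk size trades off temporal resolution against statistical noise.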

Estimating Performance without Targets

NannyML can estimate the performance of a machine learning model in production without access to its target. For more details on how to use performance estimation, see our tutorial on performance estimation; for details on how the algorithm behind it works, see Confidence-based Performance Estimation (CBPE).
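The intuition behind CBPE can be sketched in a few lines of plain Python: if the predicted probabilities are well calibrated, each prediction contributes fractional counts to an expected confusion matrix, with no labels needed. This is an illustration of the idea only, not NannyML's implementation (which estimates ROC AUC here and calibrates the scores on reference data first):

```python
def expected_confusion_matrix(y_pred_proba, y_pred):
    """Expected confusion-matrix counts from calibrated scores, no labels.

    A calibrated score p means the observation is truly positive with
    probability p, so each row contributes fractionally to two cells.
    """
    tp = fp = fn = tn = 0.0
    for p, pred in zip(y_pred_proba, y_pred):
        if pred == 1:
            tp += p        # truly positive with probability p
            fp += 1 - p    # truly negative with probability 1 - p
        else:
            fn += p
            tn += 1 - p
    return tp, fp, fn, tn

tp, fp, fn, tn = expected_confusion_matrix([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
precision = tp / (tp + fp)  # expected precision: 1.7 / 2.0 = 0.85
```

Computing such expected metrics per chunk is what lets the estimator flag likely performance changes before any targets arrive.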

>>> # initialize, specify required data columns, fit estimator and estimate
>>> estimator = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     metrics=['roc_auc'],
...     chunk_size=chunk_size,
...     problem_type='classification_binary',
... )
>>> estimator = estimator.fit(reference)
>>> estimated_performance = estimator.estimate(analysis)

>>> # Show results
>>> figure = estimated_performance.plot()
>>> figure.show()
[Image: quick-start-perf-est.svg — estimated performance plot]

The results indicate that the model’s performance is likely to be negatively impacted from the second half of the analysis period.

Detecting Data Drift

NannyML allows for further investigation into potential performance issues with its data drift detection functionality. See Detecting Data Drift for more details.
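The continuous methods used below compare the reference and analysis distributions of a single column. As a standalone sketch with synthetic numbers (illustrative only, using SciPy directly rather than NannyML's internals):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference_col = rng.normal(loc=0.0, scale=1.0, size=5000)
analysis_col = rng.normal(loc=0.5, scale=1.0, size=5000)  # mean has drifted

# Kolmogorov-Smirnov: the largest gap between the two empirical CDFs.
ks_result = ks_2samp(reference_col, analysis_col)

# Jensen-Shannon: a [0, 1] distance between the two binned distributions.
bins = np.histogram_bin_edges(np.concatenate([reference_col, analysis_col]), bins=30)
ref_hist, _ = np.histogram(reference_col, bins=bins, density=True)
ana_hist, _ = np.histogram(analysis_col, bins=bins, density=True)
js_distance = jensenshannon(ref_hist, ana_hist, base=2)

# Both statistics are clearly above zero for this shifted sample.
print(round(float(ks_result.statistic), 2), round(float(js_distance), 2))
```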

>>> # Define feature columns
>>> feature_column_names = [
...     col for col in reference.columns if col not in [
...         'timestamp', 'repaid',
...     ]]

>>> # Let's initialize the object that will perform the Univariate Drift calculations
>>> univariate_calculator = nml.UnivariateDriftCalculator(
...     column_names=feature_column_names,
...     treat_as_categorical=['y_pred'],
...     chunk_size=chunk_size,
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
... )
>>> univariate_calculator = univariate_calculator.fit(reference)
>>> univariate_results = univariate_calculator.calculate(analysis)
>>> # Plot drift results for all continuous columns
>>> figure = univariate_results.filter(
...     column_names=univariate_results.continuous_column_names,
...     period='analysis',
...     methods=['jensen_shannon']).plot(kind='drift')
>>> figure.show()

>>> # Plot drift results for all categorical columns
>>> figure = univariate_results.filter(
...     column_names=univariate_results.categorical_column_names,
...     period='analysis',
...     methods=['chi2']).plot(kind='drift')
>>> figure.show()
[Image: quick-start-drift-continuous.svg — drift results for continuous columns]
[Image: quick-start-drift-categorical.svg — drift results for categorical columns]

When many features have drifted, NannyML can also rank them by their correlation with a chosen performance metric's results, to help prioritize further investigation. For more information, check the ranking tutorial.
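At its core, correlation ranking boils down to a Pearson correlation between per-chunk drift values and per-chunk performance values. A toy illustration with made-up numbers (not the actual results below; NannyML's ranker correlates drift with changes in performance relative to the reference period, which is roughly why the correlations it reports are positive):

```python
from scipy.stats import pearsonr

# Made-up per-chunk values for one feature: drift rises in the later
# chunks while the estimated ROC AUC falls.
drift_per_chunk = [0.02, 0.03, 0.02, 0.25, 0.31, 0.28]
roc_auc_per_chunk = [0.97, 0.96, 0.97, 0.85, 0.82, 0.84]

# A feature whose drift closely tracks the performance change gets a
# strong correlation and therefore a high rank.
correlation, p_value = pearsonr(drift_per_chunk, roc_auc_per_chunk)
print(correlation < -0.9)  # True: drift up, performance down
```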

>>> ranker = nml.CorrelationRanker()
>>> # ranker fits on one metric and reference period data only
>>> ranker.fit(
...     estimated_performance.filter(period='reference', metrics=['roc_auc']))
>>> # ranker ranks on one drift method and one performance metric
>>> ranked_features = ranker.rank(
...     univariate_results.filter(methods=['jensen_shannon']),
...     estimated_performance.filter(metrics=['roc_auc']),
...     only_drifting=False)
>>> display(ranked_features)

|   | column_name             | pearsonr_correlation | pearsonr_pvalue | has_drifted | rank |
|---|-------------------------|----------------------|-----------------|-------------|------|
| 0 | repaid_loan_on_prev_car | 0.99829              | 1.17771e-23     | True        | 1    |
| 1 | y_pred_proba            | 0.998072             | 3.47458e-23     | True        | 2    |
| 2 | loan_length             | 0.996876             | 2.66146e-21     | True        | 3    |
| 3 | salary_range            | 0.996512             | 7.16292e-21     | True        | 4    |
| 4 | car_value               | 0.996148             | 1.74676e-20     | True        | 5    |
| 5 | size_of_downpayment     | 0.307497             | 0.18722         | False       | 6    |
| 6 | debt_to_income_ratio    | 0.250211             | 0.287342        | False       | 7    |
| 7 | y_pred                  | 0.0752823            | 0.752426        | False       | 8    |
| 8 | driver_tenure           | -0.134447            | 0.571988        | False       | 9    |

More complex data drift cases can be detected with Data Reconstruction with PCA. For more information, see Data Reconstruction with PCA.
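The idea can be sketched with scikit-learn: compress the reference data with PCA, reconstruct it, and track the reconstruction error on new data. When the relationships between columns change, the error grows even if no single column drifts. This is an illustration of the concept only, not NannyML's implementation (which also imputes and scales the data first):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Reference period: two strongly correlated features.
base = rng.normal(size=1000)
reference_data = np.column_stack([base, base + 0.1 * rng.normal(size=1000)])

# Analysis period: similar marginal distributions, but the correlation is gone.
analysis_data = rng.normal(size=(1000, 2))

pca = PCA(n_components=1).fit(reference_data)

def reconstruction_error(data):
    """Mean Euclidean distance between rows and their PCA reconstructions."""
    reconstructed = pca.inverse_transform(pca.transform(data))
    return float(np.linalg.norm(data - reconstructed, axis=1).mean())

# Breaking the learned structure inflates the reconstruction error.
print(reconstruction_error(analysis_data) > reconstruction_error(reference_data))  # True
```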

>>> # Let's initialize the object that will perform Data Reconstruction with PCA
>>> rcerror_calculator = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     chunk_size=chunk_size
... ).fit(reference_data=reference)
>>> # let's see Reconstruction error statistics for all available data
>>> rcerror_results = rcerror_calculator.calculate(analysis)
>>> figure = rcerror_results.plot()
>>> figure.show()
[Image: quick-start-drift-multivariate.svg — reconstruction error plot]

Insights

With NannyML we were able to estimate performance in the absence of ground truth. The estimation has shown a potential drop in ROC AUC in the second half of the analysis period. Univariate and multivariate data drift detection algorithms have identified data drift.

Putting everything together, we see that four features exhibit data drift from the second half of the analysis period onwards: loan_length, salary_range, car_value and repaid_loan_on_prev_car.

This drift is likely responsible for the potential negative impact on performance that we have observed in this period.

What next

This could be further investigated by analyzing the changes in the distributions of the input variables. Check the tutorials on data drift to find out how to plot distributions with NannyML.

You can now try using NannyML on your own data. Our Tutorials are a good place to find out how to do this.