Data requirements
In this guide, we’ll present an overview of the different kinds of data NannyML requires to run its various features. The specifics for each feature are also covered in the Tutorials, but an overview of all the different requirements is presented here for reference.
Data Periods
NannyML works with two Data Periods. The first one, called the reference period, is represented by the reference dataset, and is used to establish the expectations of the model’s performance.
The second one, called the analysis period, is represented by the analysis dataset which, as the name suggests, is analyzed by NannyML to check whether the model’s performance meets the expectations set using the reference dataset.
Reference Period
The reference period’s purpose is to establish a baseline of expectations for the machine learning model being monitored. It needs to include the model inputs, model outputs and the performance results of the monitored model. The performance of the model for this period is assumed to be stable and acceptable.
The reference dataset contains observations for which target values are available, so the model’s performance can be calculated on this set. The ranges and distributions of the model inputs, outputs and targets need to be well known and validated for this set. For newly deployed models, the reference dataset is usually the test set on which the model was evaluated before entering production. For a model that has been in production for some time, the reference dataset is usually a benchmark period selected from the model’s production data during which it performed as expected.
Warning
Don’t use model training data as a reference dataset. Machine learning models tend to overfit on their training data, so the resulting expectations for model performance would be unrealistic.
Analysis Period
The analysis period is where NannyML analyzes the data drift and the performance of the monitored model, using the knowledge gained from studying the reference period. In a typical use case it consists of the latest production data, going back to a desired point in time that falls after the end of the reference period. The analysis period is not required to have targets available.
When performing drift analysis, NannyML compares each Data Chunk of the analysis period with the reference data and flags any meaningful changes in the data distributions as data drift.
The analysis data does not need to contain target values; without them, performance can only be estimated. If target values are provided for the analysis period, they can be used to calculate Realized Performance, but they are ignored when estimating performance.
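To make the two periods concrete, here is a minimal sketch of the fit/estimate pattern that NannyML estimators and calculators follow: expectations are learned from the reference dataset and then applied to the analysis dataset. It uses the CBPE performance estimator and the synthetic car loan dataset introduced later in this guide, and is meant as an illustration rather than a full tutorial.

>>> import nannyml as nml

>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()

>>> # Expectations are established on the reference period...
>>> estimator = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     problem_type='classification_binary',
...     metrics=['roc_auc'],
... )
>>> estimator.fit(reference_df)

>>> # ...and the analysis period is checked against them; no targets are needed here.
>>> estimated_performance = estimator.estimate(analysis_df)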
Columns
The following sections describe the different data columns that NannyML requires. These differ based on the type of model being monitored and the functionality being used. Some columns are common across model types, while others are specific to a given model type. Note that columns describing the same thing are expected to have the same name in the reference and analysis datasets.
We will illustrate this using the fictional car_loan model included with the library, a binary classifier trying to predict whether a prospective customer will pay off a car loan.
Below we see the columns contained in our dataset.
>>> import nannyml as nml
>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
>>> reference_df[['timestamp', 'y_pred_proba', 'y_pred', 'repaid']].head()
|   | timestamp | y_pred_proba | y_pred | repaid |
|---|---|---|---|---|
| 0 | 2018-01-01 00:00:00.000 | 0.99 | 1 | 1 |
| 1 | 2018-01-01 00:08:43.152 | 0.07 | 0 | 0 |
| 2 | 2018-01-01 00:17:26.304 | 1 | 1 | 1 |
| 3 | 2018-01-01 00:26:09.456 | 0.98 | 1 | 1 |
| 4 | 2018-01-01 00:34:52.608 | 0.99 | 1 | 1 |
>>> reference_df[[
... 'car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure'
... ]].head()
|   | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure |
|---|---|---|---|---|---|---|---|
| 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 |
| 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 |
| 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 |
| 3 | 22652 | 20K - 40K € | 0.705992 | 16 | False | 10% | 0.453649 |
| 4 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 |
In the following sections we will explain their purpose.
Timestamp
The column containing the timestamp at which the observation occurred, i.e. when the model was invoked with the given inputs and produced the resulting prediction. See Timestamp.
In the sample data this is the timestamp column.
Note

Format: any format supported by pandas, most likely:

- ISO 8601, e.g. 2021-10-13T08:47:23Z
- Unix-epoch in units of seconds, e.g. 1513393355
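As a quick illustration of these formats, both of the values from the note above parse into pandas timestamps; this is plain pandas behaviour, independent of NannyML.

>>> import pandas as pd

>>> # ISO 8601 strings are parsed directly.
>>> pd.to_datetime('2021-10-13T08:47:23Z')
Timestamp('2021-10-13 08:47:23+0000', tz='UTC')

>>> # Unix-epoch values need an explicit unit.
>>> pd.to_datetime(1513393355, unit='s')
Timestamp('2017-12-16 03:02:35')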
Warning

This column is optional. When a timestamp column is not provided, plots will no longer use a time-based x-axis but will use the index of the chunks instead.

Some Chunker classes might require the presence of a timestamp, such as the PeriodBasedChunker.
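As a small sketch of how a timestamp enables time-based chunking, the example below splits the reference data into monthly chunks with the PeriodBasedChunker; the offset and timestamp_column_name arguments are used as we understand them here, so refer to the Chunking documentation for the authoritative signature.

>>> import nannyml as nml

>>> reference_df, _, _ = nml.load_synthetic_car_loan_dataset()

>>> # Split the reference data into monthly chunks based on the timestamp column.
>>> chunker = nml.PeriodBasedChunker(offset='M', timestamp_column_name='timestamp')
>>> chunks = chunker.split(reference_df)
>>> monthly_keys = [chunk.key for chunk in chunks]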
Target
The actual outcome of the event the machine learning model is trying to predict.
In the sample data this is the repaid column.
Required in the reference data for performance estimation, and in both reference and analysis data to calculate realized performance.
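As an illustration, the sketch below joins the separately shipped analysis targets back onto the analysis data and calculates realized performance. The assumption that the targets frame holds a repaid column aligned by row order with the analysis dataframe is ours; adapt the join to your own data.

>>> import nannyml as nml

>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()

>>> # Assumption: the targets frame holds a 'repaid' column aligned by row order with analysis_df.
>>> analysis_with_targets = analysis_df.assign(repaid=analysis_targets_df['repaid'].values)

>>> calculator = nml.PerformanceCalculator(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     problem_type='classification_binary',
...     metrics=['roc_auc', 'f1'],
... )
>>> calculator.fit(reference_df)
>>> realized_performance = calculator.calculate(analysis_with_targets)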
Features
The features of your model. These can be categorical or continuous; NannyML identifies which is which based on their declared pandas data types.
In the sample data, the features are car_value, salary_range, debt_to_income_ratio, loan_length, repaid_loan_on_prev_car, size_of_downpayment and driver_tenure.
Required to estimate performance for regression models and detect data drift on features.
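As a minimal sketch, feature columns are passed to drift detection by name; here we use the sample dataset’s feature names with the UnivariateDriftCalculator.

>>> import nannyml as nml

>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()

>>> feature_columns = [
...     'car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length',
...     'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure',
... ]

>>> # Continuous vs. categorical handling is inferred from the pandas dtypes of these columns.
>>> calculator = nml.UnivariateDriftCalculator(
...     column_names=feature_columns,
...     timestamp_column_name='timestamp',
... )
>>> calculator.fit(reference_df)
>>> feature_drift = calculator.calculate(analysis_df)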
Model Output columns
Predicted class probabilities
The score or probability that is emitted by the model, most likely a float.
In the sample data this is the y_pred_proba column.
Required for running performance estimation on binary classification models.
In multiclass classification problems each class is expected to have its own score or probability column. They are required for running performance estimation on multiclass models.
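As a sketch of the multiclass case, y_pred_proba is passed as a mapping from class label to probability column rather than a single column name; the class and column names below are hypothetical.

>>> import nannyml as nml

>>> # Hypothetical multiclass setup: one probability column per class.
>>> estimator = nml.CBPE(
...     y_pred_proba={
...         'prepaid_card': 'y_pred_proba_prepaid_card',
...         'upmarket_card': 'y_pred_proba_upmarket_card',
...         'highstreet_card': 'y_pred_proba_highstreet_card',
...     },
...     y_pred='y_pred',
...     y_true='y_true',
...     timestamp_column_name='timestamp',
...     problem_type='classification_multiclass',
...     metrics=['roc_auc'],
... )
>>> # Fitting and estimating then proceed exactly as in the binary example above.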
Prediction class labels
The predicted label, retrieved by interpreting (thresholding) the prediction scores or probabilities.
In the sample data this is the y_pred column.
Required for running performance estimation or performance calculation on binary classification, multiclass, and regression models. On binary classification models, it is not required for calculating the AUROC and average precision metrics.
NannyML Functionality Requirements
Since version 0.5, NannyML has relaxed its column requirements so that each piece of functionality only requires the columns it actually uses. These requirements are summarized in the table below:
| Data | Performance Estimation (Classification models) | Performance Estimation (Regression models) | Realized Performance | Feature Drift (Univariate) | Feature Drift (Multivariate) | Target Drift | Output Drift |
|---|---|---|---|---|---|---|---|
| timestamp | Optional | Optional | Optional | Optional | Optional | Optional | Optional |
| features | | Required (reference and analysis) | | Required (reference and analysis) | Required (reference and analysis) | | |
| y_pred_proba | Required (reference and analysis) | | | | | | Required (reference and analysis) |
| y_pred | Required (reference and analysis); not needed for ROC_AUC or average precision metrics | Required (reference and analysis) | Required (reference and analysis); not needed for ROC_AUC or average precision metrics | | | | Required (reference and analysis) |
| y_true | Required (reference only) | Required (reference only) | Required (reference and analysis) | | | Required (reference and analysis) | |
What’s next
You can check out our tutorials on how to estimate performance, calculate realized performance, and detect data drift.