Providing metadata
Why is data preparation required?
NannyML can process any data used in supported models. It requires model metadata to assign a correct role to each column of the data set. You can provide a `ModelMetadata` object that allows NannyML to make sense of your data: it lets you specify what the model inputs, model predictions and targets are for your monitored model. This guide will illustrate how to use NannyML to help you create this `ModelMetadata` object.
Metadata for binary classification
We’ll use a sample data set for this guide. The dataset describes a machine learning model that tries to predict whether an employee will work from home on the next day. You can read more about it on the dataset introduction page.
Just the code
>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.load_synthetic_binary_classification_dataset()
>>> reference.columns
Index(['distance_from_office', 'salary_range', 'gas_price_per_litre',
'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure',
'identifier', 'work_home_actual', 'timestamp', 'y_pred_proba',
'partition', 'y_pred'],
dtype='object')
>>> reference.head()
| | distance_from_office | salary_range | gas_price_per_litre | public_transportation_cost | wfh_prev_workday | workday | tenure | identifier | work_home_actual | timestamp | y_pred_proba | partition | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.96225 | 40K - 60K € | 2.11948 | 8.56806 | False | Friday | 0.212653 | 0 | 1 | 2014-05-09 22:27:20 | 0.99 | reference | 1 |
| 1 | 0.535872 | 40K - 60K € | 2.3572 | 5.42538 | True | Tuesday | 4.92755 | 1 | 0 | 2014-05-09 22:59:32 | 0.07 | reference | 0 |
| 2 | 1.96952 | 40K - 60K € | 2.36685 | 8.24716 | False | Monday | 0.520817 | 2 | 1 | 2014-05-09 23:48:25 | 1 | reference | 1 |
| 3 | 2.53041 | 20K - 40K € | 2.31872 | 7.94425 | False | Tuesday | 0.453649 | 3 | 1 | 2014-05-10 01:12:09 | 0.98 | reference | 1 |
| 4 | 2.25364 | 60K+ € | 2.22127 | 8.88448 | True | Thursday | 5.69526 | 4 | 1 | 2014-05-10 02:21:34 | 0.99 | reference | 1 |
>>> metadata = nml.extract_metadata(data=reference, model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.is_complete()
(False, ['target_column_name'])
>>> metadata.target_column_name = 'work_home_actual'
>>> metadata.is_complete()
(True, [])
>>> metadata.to_df()
| | label | column_name | type | description |
|---|---|---|---|---|
| 0 | timestamp_column_name | timestamp | continuous | timestamp |
| 1 | partition_column_name | partition | categorical | partition |
| 2 | target_column_name | work_home_actual | categorical | target |
| 3 | distance_from_office | distance_from_office | continuous | extracted feature: distance_from_office |
| 4 | salary_range | salary_range | categorical | extracted feature: salary_range |
| 5 | gas_price_per_litre | gas_price_per_litre | continuous | extracted feature: gas_price_per_litre |
| 6 | public_transportation_cost | public_transportation_cost | continuous | extracted feature: public_transportation_cost |
| 7 | wfh_prev_workday | wfh_prev_workday | categorical | extracted feature: wfh_prev_workday |
| 8 | workday | workday | categorical | extracted feature: workday |
| 9 | tenure | tenure | continuous | extracted feature: tenure |
| 10 | prediction_column_name | y_pred | continuous | predicted label |
| 11 | predicted_probability_column_name | y_pred_proba | continuous | predicted score/probability |
Walkthrough
The first line loads the demo data. Note that it returns three different `DataFrames`. The first two correspond to the different data periods, containing the data of the reference and analysis periods. The third `DataFrame` contains the target values for the analysis period. It can be joined with the analysis period data using the shared `identifier` column.
>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.load_synthetic_binary_classification_dataset()
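The join described above is not performed by the loader itself. A minimal pandas sketch of it, using toy stand-in frames that carry only a couple of the real dataset's columns, could look like this:

```python
import pandas as pd

# Toy stand-ins for the `analysis` and `analysis_targets` frames returned
# by the loader (the real frames have many more columns and rows).
analysis = pd.DataFrame({
    'identifier': [50000, 50001, 50002],
    'y_pred': [1, 0, 1],
})
analysis_targets = pd.DataFrame({
    'identifier': [50000, 50001, 50002],
    'work_home_actual': [1, 0, 0],
})

# Join the analysis period data with its target values
# on the shared `identifier` column.
analysis_with_targets = analysis.merge(analysis_targets, on='identifier')
print(analysis_with_targets.columns.tolist())
# ['identifier', 'y_pred', 'work_home_actual']
```

An inner join on `identifier` is enough here because every analysis row has exactly one matching target row.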
The next lines take a quick peek at the data inside the reference period.
>>> reference.columns
Index(['distance_from_office', 'salary_range', 'gas_price_per_litre',
'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure',
'identifier', 'work_home_actual', 'timestamp', 'y_pred_proba',
'partition', 'y_pred'],
dtype='object')
The `y_pred` and `y_pred_proba` columns contain the predicted labels and prediction scores or probabilities, i.e. the model outputs.
The `work_home_actual` column contains the target values (remember, we’re looking at the reference period here, for which target values are available).
The `partition` column contains the name of the data period each observation belongs to; in this case all of them belong to the reference period.
The `timestamp` column contains the timestamp at which the model made that particular prediction.
The `identifier` column is used to uniquely identify each row. It is not a feature, as it does not serve as an input to the model.
The rest of the columns are the model inputs, containing either continuous or categorical feature values.
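NannyML infers the feature type (continuous or categorical) during extraction. A rough dtype-based approximation of that split, shown here as an illustration rather than NannyML's actual implementation, could look like this:

```python
import pandas as pd

# Toy slice of the reference data; the real frame has more rows and columns.
features = pd.DataFrame({
    'distance_from_office': [5.96, 0.54, 1.97],
    'salary_range': ['40K - 60K €', '40K - 60K €', '20K - 40K €'],
    'wfh_prev_workday': [False, True, False],
    'tenure': [0.21, 4.93, 0.52],
})

# Simple dtype-based split: numeric columns are treated as continuous,
# everything else (strings, booleans) as categorical.
continuous = features.select_dtypes(include='number').columns.tolist()
categorical = [c for c in features.columns if c not in continuous]
print(continuous)   # ['distance_from_office', 'tenure']
print(categorical)  # ['salary_range', 'wfh_prev_workday']
```

Note that `select_dtypes(include='number')` excludes boolean columns, so `wfh_prev_workday` correctly lands in the categorical bucket.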
We can now leverage the `nannyml.metadata.extraction.extract_metadata()` function to create a `ModelMetadata` object from the reference data.
>>> metadata = nml.extract_metadata(data=reference, model_type='classification_binary', exclude_columns=['identifier'])
The `data` argument is used to pass the data sample for the extraction.
The `model_type` argument specifies the type of the monitored model - either `classification_binary` or `classification_multiclass`. The exact algorithm does not matter, as NannyML doesn’t use the model itself when analysing data; this argument merely allows the `nannyml.metadata.extraction.extract_metadata()` function to look for specific patterns in the columns.
The `exclude_columns` argument is used to pass along the names of columns that are not relevant to the model. In this example the `identifier` column is such a column: it is only used as a helper to perform the join between the analysis period data and its target values. By excluding it we ensure it is not picked up as a model feature by NannyML.
The `nannyml.metadata.base.is_complete()` function checks whether all required metadata properties have been provided. It is normally used internally to validate user inputs. The function returns a `bool` indicating whether the metadata is complete, together with a list containing the names of any missing properties. Running this step is not necessary, but it is a convenient way to double-check in advance that everything is in order.
>>> metadata.is_complete()
(False, ['target_column_name'])
We can see that the extraction was not able to find the `target_column_name`, i.e. the column containing the target values (`work_home_actual` in our case). The `nannyml.metadata.extraction.extract_metadata()` function uses some simple heuristics to yield its results; you can read more about its inner workings in the how it works section. This means that in some cases it will not succeed in extracting all required information.
The following line of code modifies the `ModelMetadata` object returned by the `nannyml.metadata.extraction.extract_metadata()` function by setting its `target_column_name` property.
>>> metadata.target_column_name = 'work_home_actual'
Note
All `BinaryClassificationMetadata` properties can be updated when they are missing or incorrect. These are:
- `target_column_name`
- `partition_column_name`
- `timestamp_column_name`
- `prediction_column_name`
- `predicted_probability_column_name`
We see the metadata is now considered complete. We can represent the `ModelMetadata` object as a `DataFrame` for easy inspection.
>>> metadata.is_complete()
(True, [])
>>> metadata.to_df()
| | label | column_name | type | description |
|---|---|---|---|---|
| 0 | timestamp_column_name | timestamp | continuous | timestamp |
| 1 | partition_column_name | partition | categorical | partition |
| 2 | target_column_name | work_home_actual | categorical | target |
| 3 | distance_from_office | distance_from_office | continuous | extracted feature: distance_from_office |
| 4 | salary_range | salary_range | categorical | extracted feature: salary_range |
| 5 | gas_price_per_litre | gas_price_per_litre | continuous | extracted feature: gas_price_per_litre |
| 6 | public_transportation_cost | public_transportation_cost | continuous | extracted feature: public_transportation_cost |
| 7 | wfh_prev_workday | wfh_prev_workday | categorical | extracted feature: wfh_prev_workday |
| 8 | workday | workday | categorical | extracted feature: workday |
| 9 | tenure | tenure | continuous | extracted feature: tenure |
| 10 | prediction_column_name | y_pred | continuous | predicted label |
| 11 | predicted_probability_column_name | y_pred_proba | continuous | predicted score/probability |
Metadata for multiclass classification
We’ll use a sample data set for this guide. The dataset describes a machine learning model that tries to predict the most appropriate product for new customers applying for a credit card. You can read more about it on the dataset introduction page.
Just the code
>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.load_synthetic_multiclass_classification_dataset()
>>> reference.columns
Index(['acq_channel', 'app_behavioral_score', 'requested_credit_limit',
'app_channel', 'credit_bureau_score', 'stated_income', 'is_customer',
'partition', 'identifier', 'timestamp', 'y_pred_proba_prepaid_card',
'y_pred_proba_highstreet_card', 'y_pred_proba_upmarket_card', 'y_pred',
'y_true'],
dtype='object')
>>> reference.head()
| | acq_channel | app_behavioral_score | requested_credit_limit | app_channel | credit_bureau_score | stated_income | is_customer | partition | identifier | timestamp | y_pred_proba_prepaid_card | y_pred_proba_highstreet_card | y_pred_proba_upmarket_card | y_pred | y_true |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Partner3 | 1.80823 | 350 | web | 309 | 15000 | True | reference | 60000 | 2020-05-02 02:01:30 | 0.97 | 0.03 | 0 | prepaid_card | prepaid_card |
| 1 | Partner2 | 4.38257 | 500 | mobile | 418 | 23000 | True | reference | 60001 | 2020-05-02 02:03:33 | 0.87 | 0.13 | 0 | prepaid_card | prepaid_card |
| 2 | Partner2 | -0.787575 | 400 | web | 507 | 24000 | False | reference | 60002 | 2020-05-02 02:04:49 | 0.47 | 0.35 | 0.18 | prepaid_card | upmarket_card |
| 3 | Partner3 | -2.13177 | 300 | mobile | 324 | 38000 | False | reference | 60003 | 2020-05-02 02:07:59 | 0.26 | 0.5 | 0.24 | highstreet_card | upmarket_card |
| 4 | Partner3 | -1.36294 | 450 | mobile | 736 | 38000 | True | reference | 60004 | 2020-05-02 02:20:19 | 0.03 | 0.04 | 0.93 | upmarket_card | upmarket_card |
>>> metadata = nml.extract_metadata(data=reference, model_type='classification_multiclass', exclude_columns=['identifier'])
>>> metadata.is_complete()
(True, [])
>>> metadata.to_df()
| | label | column_name | type | description |
|---|---|---|---|---|
| 0 | timestamp_column_name | timestamp | continuous | timestamp |
| 1 | partition_column_name | partition | categorical | partition |
| 2 | target_column_name | y_true | categorical | target |
| 3 | acq_channel | acq_channel | categorical | extracted feature: acq_channel |
| 4 | app_behavioral_score | app_behavioral_score | continuous | extracted feature: app_behavioral_score |
| 5 | requested_credit_limit | requested_credit_limit | categorical | extracted feature: requested_credit_limit |
| 6 | app_channel | app_channel | categorical | extracted feature: app_channel |
| 7 | credit_bureau_score | credit_bureau_score | continuous | extracted feature: credit_bureau_score |
| 8 | stated_income | stated_income | categorical | extracted feature: stated_income |
| 9 | is_customer | is_customer | categorical | extracted feature: is_customer |
| 10 | prediction_column_name | y_pred | continuous | predicted label |
| 11 | predicted_probability_column_name_prepaid_card | y_pred_proba_prepaid_card | continuous | predicted score/probability for class 'prepaid_card' |
| 12 | predicted_probability_column_name_highstreet_card | y_pred_proba_highstreet_card | continuous | predicted score/probability for class 'highstreet_card' |
| 13 | predicted_probability_column_name_upmarket_card | y_pred_proba_upmarket_card | continuous | predicted score/probability for class 'upmarket_card' |
>>> metadata.predicted_probabilities_column_names
{'prepaid_card': 'y_pred_proba_prepaid_card',
'highstreet_card': 'y_pred_proba_highstreet_card',
'upmarket_card': 'y_pred_proba_upmarket_card'}
Walkthrough
The first line loads the demo data. Note that it returns three different `DataFrames`. The first two correspond to the different data periods, containing the data of the reference and analysis periods. The third `DataFrame` contains the target values for the analysis period. It can be joined with the analysis period data using the shared `identifier` column.
>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.load_synthetic_multiclass_classification_dataset()
The next lines take a quick peek at the data inside the reference period.
>>> reference.columns
Index(['acq_channel', 'app_behavioral_score', 'requested_credit_limit',
'app_channel', 'credit_bureau_score', 'stated_income', 'is_customer',
'partition', 'identifier', 'timestamp', 'y_pred_proba_prepaid_card',
'y_pred_proba_highstreet_card', 'y_pred_proba_upmarket_card', 'y_pred',
'y_true'],
dtype='object')
The `y_pred` column contains the labels predicted by the model. The `y_pred_proba_prepaid_card`, `y_pred_proba_highstreet_card` and `y_pred_proba_upmarket_card` columns contain the predicted class probabilities for the three classes labeled `prepaid_card`, `highstreet_card` and `upmarket_card`.
The `y_true` column contains the target values (remember, we’re looking at the reference period here, for which target values are available).
The `partition` column contains the name of the data period each observation belongs to; in this case all of them belong to the reference period.
The `timestamp` column contains the timestamp at which the model made that particular prediction.
The `identifier` column is used to uniquely identify each row. It is not a feature, as it does not serve as an input to the model.
The rest of the columns are the model inputs, containing either continuous or categorical feature values.
We can now leverage the `nannyml.metadata.extraction.extract_metadata()` function to create a `ModelMetadata` object from the reference data.
>>> metadata = nml.extract_metadata(data=reference, model_type='classification_multiclass', exclude_columns=['identifier'])
The `data` argument is used to pass the data sample for the extraction.
The `model_type` argument specifies the type of the monitored model - either `classification_binary` or `classification_multiclass`. The exact algorithm does not matter, as NannyML doesn’t use the model itself when analysing data; this argument merely allows the `nannyml.metadata.extraction.extract_metadata()` function to look for specific patterns in the columns.
The `exclude_columns` argument is used to pass along the names of columns that are not relevant to the model. In this example the `identifier` column is such a column: it is only used as a helper to perform the join between the analysis period data and its target values. By excluding it we ensure it is not picked up as a model feature by NannyML.
The `nannyml.metadata.base.is_complete()` function checks whether all required metadata properties have been provided. It is normally used internally to validate user inputs. The function returns a `bool` indicating whether the metadata is complete, together with a list containing the names of any missing properties. Running this step is not necessary, but it is a convenient way to double-check in advance that everything is in order.
>>> metadata.is_complete()
(True, [])
We can see that the extraction was able to find all required properties. The metadata is considered to be complete.
Note
All `MulticlassClassificationMetadata` properties can be updated when they are missing or incorrect. These are:
- `target_column_name`
- `partition_column_name`
- `timestamp_column_name`
- `prediction_column_name`
- `predicted_probabilities_column_names`
We can represent the `ModelMetadata` object as a `DataFrame` for easy inspection.
>>> metadata.is_complete()
(True, [])
>>> metadata.to_df()
| | label | column_name | type | description |
|---|---|---|---|---|
| 0 | timestamp_column_name | timestamp | continuous | timestamp |
| 1 | partition_column_name | partition | categorical | partition |
| 2 | target_column_name | y_true | categorical | target |
| 3 | acq_channel | acq_channel | categorical | extracted feature: acq_channel |
| 4 | app_behavioral_score | app_behavioral_score | continuous | extracted feature: app_behavioral_score |
| 5 | requested_credit_limit | requested_credit_limit | categorical | extracted feature: requested_credit_limit |
| 6 | app_channel | app_channel | categorical | extracted feature: app_channel |
| 7 | credit_bureau_score | credit_bureau_score | continuous | extracted feature: credit_bureau_score |
| 8 | stated_income | stated_income | categorical | extracted feature: stated_income |
| 9 | is_customer | is_customer | categorical | extracted feature: is_customer |
| 10 | prediction_column_name | y_pred | continuous | predicted label |
| 11 | predicted_probability_column_name_prepaid_card | y_pred_proba_prepaid_card | continuous | predicted score/probability for class 'prepaid_card' |
| 12 | predicted_probability_column_name_highstreet_card | y_pred_proba_highstreet_card | continuous | predicted score/probability for class 'highstreet_card' |
| 13 | predicted_probability_column_name_upmarket_card | y_pred_proba_upmarket_card | continuous | predicted score/probability for class 'upmarket_card' |
We can now inspect the `MulticlassClassificationMetadata` object and find the mapping of class labels to the predicted probability column for each class, stored as a Python `dict`.
>>> metadata.predicted_probabilities_column_names
{'prepaid_card': 'y_pred_proba_prepaid_card',
'highstreet_card': 'y_pred_proba_highstreet_card',
'upmarket_card': 'y_pred_proba_upmarket_card'}
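This mapping makes it easy to go from a class label to its probability column. A small pandas sketch with stand-in data (the probability values here are invented for illustration):

```python
import pandas as pd

# The class-to-column mapping as shown above.
class_to_column = {
    'prepaid_card': 'y_pred_proba_prepaid_card',
    'highstreet_card': 'y_pred_proba_highstreet_card',
    'upmarket_card': 'y_pred_proba_upmarket_card',
}

# Toy frame holding just the three probability columns.
scores = pd.DataFrame({
    'y_pred_proba_prepaid_card': [0.97, 0.47],
    'y_pred_proba_highstreet_card': [0.03, 0.35],
    'y_pred_proba_upmarket_card': [0.00, 0.18],
})

# Select the probability column for a single class by its label...
prepaid_scores = scores[class_to_column['prepaid_card']]

# ...or select all probability columns in a fixed (alphabetical) class order.
ordered = scores[[class_to_column[c] for c in sorted(class_to_column)]]
print(prepaid_scores.tolist())  # [0.97, 0.47]
```

Keeping column access behind the mapping means downstream code never hardcodes the `y_pred_proba_*` naming convention.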
Insights and Follow Ups
Warning
Because the extraction is based on simple rules, the results are never guaranteed to be completely correct. It is strongly advised to review the results of `extract_metadata` and update the values where needed. NannyML will raise a `MissingMetadataException` when trying to run any functionality using incomplete metadata.
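The warning above suggests a defensive pattern: catch the exception, repair the metadata, and retry. The sketch below illustrates the shape of that pattern with a locally defined stand-in exception class and a hypothetical `run_calculator` helper; neither is part of the NannyML API.

```python
class MissingMetadataException(Exception):
    """Local stand-in for NannyML's exception, defined here for illustration."""

def run_calculator(metadata_complete: bool) -> str:
    # Mimics a calculator refusing to run when metadata is incomplete.
    if not metadata_complete:
        raise MissingMetadataException("missing properties: ['target_column_name']")
    return 'ok'

try:
    run_calculator(metadata_complete=False)
except MissingMetadataException as exc:
    # Repair the metadata here (e.g. set metadata.target_column_name) and retry.
    print(f'complete the metadata first: {exc}')
```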
Note
We are aware that this boilerplate setup step creates some friction. We’re actively working on reducing it.
To find out more about the columns that should be in your dataset, check out the data requirements documentation.
You can read how metadata extraction works to find out more about our naming conventions and heuristics.
You can put your shiny new metadata to use in drift calculation, performance calculation or performance estimation.