Providing metadata

Why is data preparation required?

NannyML can process data from any supported model type, but it requires model metadata to assign the correct role to each column of the data set. You provide this via a ModelMetadata object, which allows NannyML to make sense of your data by specifying what the model inputs, model predictions and targets are for your monitored model.

This guide illustrates how to use NannyML to help you create this ModelMetadata object.

Metadata for binary classification

We’ll use a sample data set for this guide. The dataset describes a machine learning model that tries to predict whether an employee will work from home on the next day. You can read more about it on the dataset introduction page.

Just the code

>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.load_synthetic_binary_classification_dataset()
>>> reference.columns
Index(['distance_from_office', 'salary_range', 'gas_price_per_litre',
   'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure',
   'identifier', 'work_home_actual', 'timestamp', 'y_pred_proba',
   'partition', 'y_pred'],
  dtype='object')
>>> reference.head()

   distance_from_office salary_range  gas_price_per_litre  public_transportation_cost  wfh_prev_workday   workday    tenure  identifier  work_home_actual            timestamp  y_pred_proba  partition  y_pred
0               5.96225  40K - 60K €              2.11948                     8.56806             False    Friday  0.212653           0                 1  2014-05-09 22:27:20          0.99  reference       1
1              0.535872  40K - 60K €              2.35720                     5.42538              True   Tuesday  4.927550           1                 0  2014-05-09 22:59:32          0.07  reference       0
2               1.96952  40K - 60K €              2.36685                     8.24716             False    Monday  0.520817           2                 1  2014-05-09 23:48:25          1.00  reference       1
3               2.53041  20K - 40K €              2.31872                     7.94425             False   Tuesday  0.453649           3                 1  2014-05-10 01:12:09          0.98  reference       1
4               2.25364       60K+ €              2.22127                     8.88448              True  Thursday  5.695260           4                 1  2014-05-10 02:21:34          0.99  reference       1

>>> metadata = nml.extract_metadata(data=reference, model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.is_complete()
(False, ['target_column_name'])
>>> metadata.target_column_name = 'work_home_actual'
>>> metadata.is_complete()
(True, [])
>>> metadata.to_df()

                                label                 column_name         type                                    description
0               timestamp_column_name                   timestamp   continuous                                      timestamp
1               partition_column_name                   partition  categorical                                      partition
2                  target_column_name            work_home_actual  categorical                                         target
3                distance_from_office        distance_from_office   continuous       extracted feature: distance_from_office
4                        salary_range                salary_range  categorical               extracted feature: salary_range
5                 gas_price_per_litre         gas_price_per_litre   continuous        extracted feature: gas_price_per_litre
6          public_transportation_cost  public_transportation_cost   continuous  extracted feature: public_transportation_cost
7                    wfh_prev_workday            wfh_prev_workday  categorical           extracted feature: wfh_prev_workday
8                             workday                     workday  categorical                    extracted feature: workday
9                              tenure                      tenure   continuous                     extracted feature: tenure
10             prediction_column_name                      y_pred   continuous                                predicted label
11  predicted_probability_column_name                y_pred_proba   continuous                    predicted score/probability

Walkthrough

The first line loads the demo data. Note that it returns three different DataFrames. The first two contain the data of the reference and analysis periods. The third DataFrame contains the target values for the analysis period; it can be joined with the analysis data using the shared identifier column.

>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.load_synthetic_binary_classification_dataset()
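That join on the shared identifier column can be sketched with plain pandas. The toy frames below stand in for the analysis and analysis_targets DataFrames; their columns mirror the sample data but the values are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the analysis data and its separately delivered targets.
analysis = pd.DataFrame({
    'identifier': [0, 1, 2],
    'distance_from_office': [5.9, 0.5, 2.0],
    'y_pred': [1, 0, 1],
})
analysis_targets = pd.DataFrame({
    'identifier': [0, 1, 2],
    'work_home_actual': [1, 0, 1],
})

# Join the targets onto the analysis period using the shared identifier column.
analysis_with_targets = analysis.merge(analysis_targets, on='identifier')
print(analysis_with_targets.columns.tolist())
```

An inner join on identifier is enough here because every analysis row has exactly one matching target row.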

The next line takes a quick peek at the columns inside the reference period data.

>>> reference.columns
Index(['distance_from_office', 'salary_range', 'gas_price_per_litre',
   'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure',
   'identifier', 'work_home_actual', 'timestamp', 'y_pred_proba',
   'partition', 'y_pred'],
  dtype='object')

The y_pred and y_pred_proba columns contain the predicted labels and prediction scores or probabilities, i.e. the model outputs.

The work_home_actual column contains the target values (remember, we’re looking at the reference period here, for which target values are available).

The partition column contains the name of the data period the observation belongs to; in this case, all observations belong to the reference period.

The timestamp column contains the time at which the model made this particular prediction.

The identifier column is used to uniquely identify each row. It is not a feature as it does not serve as an input for the model.

The rest of the columns are the model inputs containing either continuous or categorical feature values.


We can now leverage the nannyml.metadata.extraction.extract_metadata() function to create a ModelMetadata object from the reference data.

>>> metadata = nml.extract_metadata(data=reference, model_type='classification_binary', exclude_columns=['identifier'])

The data argument is used to pass the data sample for the extraction.

The model_type argument allows us to specify the type of the model that is monitored: either classification_binary or classification_multiclass. The exact algorithm does not matter, as NannyML doesn’t use the model itself when analysing data. This argument tells the nannyml.metadata.extraction.extract_metadata() function which patterns to look for in the columns.

The exclude_columns argument is used to pass along the names of columns that are not relevant to the model. In this example case the identifier column is such a column: it is only used as a helper to perform the join between the analysis period data and its target values. By excluding it we can ensure it is not picked up as a model feature by NannyML.
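The effect of an exclude list can be sketched in a few lines of plain Python. This is illustrative only, not NannyML’s actual implementation: the RESERVED set stands in for columns the extraction recognises as metadata (target, timestamp, outputs, partition), and everything else becomes a candidate feature unless explicitly excluded:

```python
# Illustrative only: how an exclude list keeps helper columns out of the
# candidate feature set during metadata extraction.
ALL_COLUMNS = ['distance_from_office', 'salary_range', 'identifier',
               'work_home_actual', 'timestamp', 'y_pred_proba',
               'partition', 'y_pred']
RESERVED = {'work_home_actual', 'timestamp', 'y_pred_proba',
            'partition', 'y_pred'}

def candidate_features(columns, exclude_columns=()):
    """Columns that could be model features: not reserved, not excluded."""
    dropped = RESERVED | set(exclude_columns)
    return [c for c in columns if c not in dropped]

print(candidate_features(ALL_COLUMNS, exclude_columns=['identifier']))
# ['distance_from_office', 'salary_range']
```

Without the exclusion, identifier would be misclassified as a feature, since nothing about its name or values marks it as metadata.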


The nannyml.metadata.base.is_complete() function checks whether all required metadata properties have been provided; NannyML also uses it internally to validate user inputs. It returns a tuple: a bool indicating whether the metadata is complete, and a list with the names of any missing properties. Running this step is not strictly necessary, but it lets you double-check in advance that everything is in order.

>>> metadata.is_complete()
(False, ['target_column_name'])
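The (bool, missing_names) return pattern can be mimicked in a few lines of plain Python. This is a sketch of the contract, not NannyML’s implementation; the REQUIRED list and the extracted dict are illustrative stand-ins:

```python
# Required properties for binary classification metadata (per this guide).
REQUIRED = ['timestamp_column_name', 'partition_column_name',
            'target_column_name', 'prediction_column_name',
            'predicted_probability_column_name']

def is_complete(metadata: dict):
    """Return (complete?, names of missing required properties)."""
    missing = [name for name in REQUIRED if metadata.get(name) is None]
    return (len(missing) == 0, missing)

extracted = {'timestamp_column_name': 'timestamp',
             'partition_column_name': 'partition',
             'target_column_name': None,  # extraction could not find it
             'prediction_column_name': 'y_pred',
             'predicted_probability_column_name': 'y_pred_proba'}

print(is_complete(extracted))  # (False, ['target_column_name'])
extracted['target_column_name'] = 'work_home_actual'
print(is_complete(extracted))  # (True, [])
```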

We can see that the extraction was not able to find the target_column_name, i.e. the column containing the target values (work_home_actual) in our case.


The nannyml.metadata.extraction.extract_metadata() function uses simple heuristics to produce its results; you can read more about its inner workings in the how it works section. This means that in some cases it will not succeed in extracting all required information.

The following line of code modifies the ModelMetadata object returned by the nannyml.metadata.extraction.extract_metadata() function by setting its target_column_name property.

>>> metadata.target_column_name = 'work_home_actual'

Note

All BinaryClassificationMetadata properties can be updated when they are missing or incorrect.

These are:
  • target_column_name

  • partition_column_name

  • timestamp_column_name

  • prediction_column_name

  • predicted_probability_column_name


We see the metadata is now considered complete. We can represent the ModelMetadata object as a DataFrame for easy inspection.

>>> metadata.is_complete()
(True, [])
>>> metadata.to_df()

                                label                 column_name         type                                    description
0               timestamp_column_name                   timestamp   continuous                                      timestamp
1               partition_column_name                   partition  categorical                                      partition
2                  target_column_name            work_home_actual  categorical                                         target
3                distance_from_office        distance_from_office   continuous       extracted feature: distance_from_office
4                        salary_range                salary_range  categorical               extracted feature: salary_range
5                 gas_price_per_litre         gas_price_per_litre   continuous        extracted feature: gas_price_per_litre
6          public_transportation_cost  public_transportation_cost   continuous  extracted feature: public_transportation_cost
7                    wfh_prev_workday            wfh_prev_workday  categorical           extracted feature: wfh_prev_workday
8                             workday                     workday  categorical                    extracted feature: workday
9                              tenure                      tenure   continuous                     extracted feature: tenure
10             prediction_column_name                      y_pred   continuous                                predicted label
11  predicted_probability_column_name                y_pred_proba   continuous                    predicted score/probability

Metadata for multiclass classification

We’ll use a sample data set for this guide. The dataset describes a machine learning model that tries to predict the most appropriate product for new customers applying for a credit card. You can read more about it on the dataset introduction page.

Just the code

>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.load_synthetic_multiclass_classification_dataset()
>>> reference.columns
Index(['acq_channel', 'app_behavioral_score', 'requested_credit_limit',
   'app_channel', 'credit_bureau_score', 'stated_income', 'is_customer',
   'partition', 'identifier', 'timestamp', 'y_pred_proba_prepaid_card',
   'y_pred_proba_highstreet_card', 'y_pred_proba_upmarket_card', 'y_pred',
   'y_true'],
  dtype='object')
>>> reference.head()

  acq_channel  app_behavioral_score  requested_credit_limit app_channel  credit_bureau_score  stated_income  is_customer  partition  identifier            timestamp  y_pred_proba_prepaid_card  y_pred_proba_highstreet_card  y_pred_proba_upmarket_card           y_pred         y_true
0    Partner3              1.808230                     350         web                  309          15000         True  reference       60000  2020-05-02 02:01:30                       0.97                          0.03                        0.00     prepaid_card   prepaid_card
1    Partner2              4.382570                     500      mobile                  418          23000         True  reference       60001  2020-05-02 02:03:33                       0.87                          0.13                        0.00     prepaid_card   prepaid_card
2    Partner2             -0.787575                     400         web                  507          24000        False  reference       60002  2020-05-02 02:04:49                       0.47                          0.35                        0.18     prepaid_card  upmarket_card
3    Partner3             -2.131770                     300      mobile                  324          38000        False  reference       60003  2020-05-02 02:07:59                       0.26                          0.50                        0.24  highstreet_card  upmarket_card
4    Partner3             -1.362940                     450      mobile                  736          38000         True  reference       60004  2020-05-02 02:20:19                       0.03                          0.04                        0.93    upmarket_card  upmarket_card

>>> metadata = nml.extract_metadata(data=reference, model_type='classification_multiclass', exclude_columns=['identifier'])
>>> metadata.is_complete()
(True, [])
>>> metadata.to_df()

                                                label                   column_name         type                                               description
0                               timestamp_column_name                     timestamp   continuous                                                 timestamp
1                               partition_column_name                     partition  categorical                                                 partition
2                                  target_column_name                        y_true  categorical                                                    target
3                                         acq_channel                   acq_channel  categorical                            extracted feature: acq_channel
4                                app_behavioral_score          app_behavioral_score   continuous                   extracted feature: app_behavioral_score
5                              requested_credit_limit        requested_credit_limit  categorical                 extracted feature: requested_credit_limit
6                                         app_channel                   app_channel  categorical                            extracted feature: app_channel
7                                 credit_bureau_score           credit_bureau_score   continuous                    extracted feature: credit_bureau_score
8                                       stated_income                 stated_income  categorical                          extracted feature: stated_income
9                                         is_customer                   is_customer  categorical                            extracted feature: is_customer
10                             prediction_column_name                        y_pred   continuous                                           predicted label
11     predicted_probability_column_name_prepaid_card     y_pred_proba_prepaid_card   continuous     predicted score/probability for class 'prepaid_card'
12  predicted_probability_column_name_highstreet_card  y_pred_proba_highstreet_card   continuous  predicted score/probability for class 'highstreet_card'
13    predicted_probability_column_name_upmarket_card    y_pred_proba_upmarket_card   continuous     predicted score/probability for class 'upmarket_card'

>>> metadata.predicted_probabilities_column_names
{'prepaid_card': 'y_pred_proba_prepaid_card',
 'highstreet_card': 'y_pred_proba_highstreet_card',
 'upmarket_card': 'y_pred_proba_upmarket_card'}

Walkthrough

The first line loads the demo data. Note that it returns three different DataFrames. The first two contain the data of the reference and analysis periods. The third DataFrame contains the target values for the analysis period; it can be joined with the analysis data using the shared identifier column.

>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.load_synthetic_multiclass_classification_dataset()

The next line takes a quick peek at the columns inside the reference period data.

>>> reference.columns
Index(['acq_channel', 'app_behavioral_score', 'requested_credit_limit',
   'app_channel', 'credit_bureau_score', 'stated_income', 'is_customer',
   'partition', 'identifier', 'timestamp', 'y_pred_proba_prepaid_card',
   'y_pred_proba_highstreet_card', 'y_pred_proba_upmarket_card', 'y_pred',
   'y_true'],
  dtype='object')

The y_pred column contains the labels predicted by the model.

The y_pred_proba_prepaid_card, y_pred_proba_highstreet_card and y_pred_proba_upmarket_card columns contain the predicted class probabilities for the three classes labeled prepaid_card, highstreet_card and upmarket_card.
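A quick sanity check you might run on such per-class probability columns is that each row’s probabilities sum to (approximately) one. The toy frame below stands in for the reference data; the values mirror the first rows of the sample output:

```python
import pandas as pd

# Toy rows mimicking the per-class probability columns from the sample data.
reference = pd.DataFrame({
    'y_pred_proba_prepaid_card':    [0.97, 0.87, 0.47],
    'y_pred_proba_highstreet_card': [0.03, 0.13, 0.35],
    'y_pred_proba_upmarket_card':   [0.00, 0.00, 0.18],
})

# Pick out the probability columns by their shared naming convention.
proba_columns = [c for c in reference.columns
                 if c.startswith('y_pred_proba_')]
row_totals = reference[proba_columns].sum(axis=1)

# Each row's class probabilities should sum to ~1 (rounded for float noise).
assert (row_totals.round(6) == 1.0).all()
```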

The y_true column contains the target values (remember, we’re looking at the reference period here, for which target values are available).

The partition column contains the name of the data period the observation belongs to; in this case, all observations belong to the reference period.

The timestamp column contains the time at which the model made this particular prediction.

The identifier column is used to uniquely identify each row. It is not a feature as it does not serve as an input for the model.

The rest of the columns are the model inputs containing either continuous or categorical feature values.


We can now leverage the nannyml.metadata.extraction.extract_metadata() function to create a ModelMetadata object from the reference data.

>>> metadata = nml.extract_metadata(data=reference, model_type='classification_multiclass', exclude_columns=['identifier'])

The data argument is used to pass the data sample for the extraction.

The model_type argument allows us to specify the type of the model that is monitored: either classification_binary or classification_multiclass. The exact algorithm does not matter, as NannyML doesn’t use the model itself when analysing data. This argument tells the nannyml.metadata.extraction.extract_metadata() function which patterns to look for in the columns.

The exclude_columns argument is used to pass along the names of columns that are not relevant to the model. In this example case the identifier column is such a column: it is only used as a helper to perform the join between the analysis period data and its target values. By excluding it we can ensure it is not picked up as a model feature by NannyML.


The nannyml.metadata.base.is_complete() function checks whether all required metadata properties have been provided; NannyML also uses it internally to validate user inputs. It returns a tuple: a bool indicating whether the metadata is complete, and a list with the names of any missing properties. Running this step is not strictly necessary, but it lets you double-check in advance that everything is in order.

>>> metadata.is_complete()
(True, [])

We can see that the extraction was able to find all required properties. The metadata is considered to be complete.

Note

All MulticlassClassificationMetadata properties can be updated when they are missing or incorrect.

These are:
  • target_column_name

  • partition_column_name

  • timestamp_column_name

  • prediction_column_name

  • predicted_probabilities_column_names


We can represent the ModelMetadata object as a DataFrame for easy inspection.

>>> metadata.is_complete()
(True, [])
>>> metadata.to_df()

                                                label                   column_name         type                                               description
0                               timestamp_column_name                     timestamp   continuous                                                 timestamp
1                               partition_column_name                     partition  categorical                                                 partition
2                                  target_column_name                        y_true  categorical                                                    target
3                                         acq_channel                   acq_channel  categorical                            extracted feature: acq_channel
4                                app_behavioral_score          app_behavioral_score   continuous                   extracted feature: app_behavioral_score
5                              requested_credit_limit        requested_credit_limit  categorical                 extracted feature: requested_credit_limit
6                                         app_channel                   app_channel  categorical                            extracted feature: app_channel
7                                 credit_bureau_score           credit_bureau_score   continuous                    extracted feature: credit_bureau_score
8                                       stated_income                 stated_income  categorical                          extracted feature: stated_income
9                                         is_customer                   is_customer  categorical                            extracted feature: is_customer
10                             prediction_column_name                        y_pred   continuous                                           predicted label
11     predicted_probability_column_name_prepaid_card     y_pred_proba_prepaid_card   continuous     predicted score/probability for class 'prepaid_card'
12  predicted_probability_column_name_highstreet_card  y_pred_proba_highstreet_card   continuous  predicted score/probability for class 'highstreet_card'
13    predicted_probability_column_name_upmarket_card    y_pred_proba_upmarket_card   continuous     predicted score/probability for class 'upmarket_card'


We can now inspect the MulticlassClassificationMetadata object to find the mapping from each class label to its predicted probability column, stored as a Python dict.

>>> metadata.predicted_probabilities_column_names
{'prepaid_card': 'y_pred_proba_prepaid_card',
 'highstreet_card': 'y_pred_proba_highstreet_card',
 'upmarket_card': 'y_pred_proba_upmarket_card'}
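One practical use of this mapping is translating between class labels and column names in either direction. The sketch below works on a plain dict standing in for metadata.predicted_probabilities_column_names:

```python
# Toy stand-in for metadata.predicted_probabilities_column_names.
predicted_probabilities_column_names = {
    'prepaid_card': 'y_pred_proba_prepaid_card',
    'highstreet_card': 'y_pred_proba_highstreet_card',
    'upmarket_card': 'y_pred_proba_upmarket_card',
}

# Forward lookup: which column holds the scores for a given class label?
column = predicted_probabilities_column_names['upmarket_card']
print(column)  # y_pred_proba_upmarket_card

# Reverse lookup: which class does a probability column belong to?
class_for_column = {col: label
                    for label, col in predicted_probabilities_column_names.items()}
print(class_for_column['y_pred_proba_prepaid_card'])  # prepaid_card
```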

Insights and Follow Ups

Warning

Because the extraction is based on simple rules, the results are never guaranteed to be completely correct. It is strongly advised to review the results of extract_metadata and update the values where needed.

NannyML will raise a MissingMetadataException when trying to run any functionality using incomplete metadata.
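The fail-fast pattern behind that exception can be sketched in plain Python. Both classes below are illustrative stand-ins (the exception name comes from this guide; its signature, and the FakeMetadata class, are assumptions), showing how a completeness check guards the rest of the pipeline:

```python
class MissingMetadataException(Exception):
    """Stand-in for NannyML's exception of the same name (illustrative)."""

class FakeMetadata:
    """Minimal stand-in exposing the (bool, missing) is_complete() contract."""
    def __init__(self, missing):
        self._missing = list(missing)

    def is_complete(self):
        return (not self._missing, self._missing)

def require_complete(metadata):
    """Fail fast before running any calculator on incomplete metadata."""
    complete, missing = metadata.is_complete()
    if not complete:
        raise MissingMetadataException(f"missing metadata properties: {missing}")

try:
    require_complete(FakeMetadata(['target_column_name']))
except MissingMetadataException as exc:
    print(exc)  # missing metadata properties: ['target_column_name']
```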

Note

We are aware that this boilerplate setup step creates some friction. We’re actively working on reducing it.

To find out more about the columns that should be in your dataset, check out the data requirements documentation.

You can read the how metadata extraction works section to find out more about our naming conventions and heuristics.

You can put your shiny new metadata to use in drift calculation, performance calculation or performance estimation.