Synthetic Binary Classification Dataset

NannyML provides a synthetic dataset describing a binary classification problem in order to make it easier to test and document its features.

Problem Description

The dataset describes a machine learning model that tries to predict whether an employee will work from home on the next day.

Dataset Description

A sample of the dataset can be seen below.

>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.datasets.load_synthetic_binary_classification_dataset()
>>> display(reference.head(3))

distance_from_office

salary_range

gas_price_per_litre

public_transportation_cost

wfh_prev_workday

workday

tenure

identifier

work_home_actual

timestamp

y_pred_proba

partition

y_pred

0

5.96225

40K - 60K €

2.11948

8.56806

False

Friday

0.212653

0

1

2014-05-09 22:27:20

0.99

reference

1

1

0.535872

40K - 60K €

2.3572

5.42538

True

Tuesday

4.92755

1

0

2014-05-09 22:59:32

0.07

reference

0

2

1.96952

40K - 60K €

2.36685

8.24716

False

Monday

0.520817

2

1

2014-05-09 23:48:25

1

reference

1

The model uses 7 features:

  • distance_from_office: A numerical feature. The distance in kilometers from the employee’s house to the workplace.

  • salary_range: A categorical feature with 4 categories that bin the employee’s yearly income.

  • gas_price_per_litre: A numerical feature. The price of gas per litre close to the employee’s residence.

  • public_transportation_cost: A numerical feature. The price, in euros, of public transportation from the employee’s residence to the workplace.

  • wfh_prev_workday: A categorical feature with 2 categories, stating whether the employee worked from home the previous workday.

  • workday: A categorical feature with 5 categories. The day of the week where we want to predict whether the employee will work from home.

  • tenure: A numerical feature describing how many years the employee has been at the company.

The model predicts both a probability of the employee working from home that is available from the y_pred_proba column. A binary prediction is also available from the y_pred column. The work_home_actual is the Target column describing what actually happened.

There are also three auxiliarry columns that are helpful but not used by the monitored model:

  • identifier: A unique number referencing each employee. This is very useful for joining the target results on the analysis dataset when we want to compare estimated with realized performace.

  • timestamp: A date column informing us of the date the prediction was made.

  • partition: The partition column tells us which Data Period each row comes from.

Metadata Extraction

The dataset’s columns are name such that the heuristics NannyML uses to extract metadata can identify them. We can see below how to extract metadata

>>> metadata = nml.extract_metadata(data = reference, model_name='wfh_predictor', model_type='classification_binary', exclude_columns=['identifier'])
>>> metadata.is_complete()
(False, ['target_column_name'])

We see that the target_column_name has not been correctly idenfied. We need to manually specify it.

>>> metadata.target_column_name = 'work_home_actual'
>>> metadata.is_complete()
(True, [])

Let’s now see the metadata that NannyML has inferred about the model.

>>> metadata.to_df()

label

column_name

type

description

0

timestamp_column_name

timestamp

continuous

timestamp

1

partition_column_name

partition

categorical

partition

2

target_column_name

work_home_actual

categorical

target

3

distance_from_office

distance_from_office

continuous

extracted feature: distance_from_office

4

salary_range

salary_range

categorical

extracted feature: salary_range

5

gas_price_per_litre

gas_price_per_litre

continuous

extracted feature: gas_price_per_litre

6

public_transportation_cost

public_transportation_cost

continuous

extracted feature: public_transportation_cost

7

wfh_prev_workday

wfh_prev_workday

categorical

extracted feature: wfh_prev_workday

8

workday

workday

categorical

extracted feature: workday

9

tenure

tenure

continuous

extracted feature: tenure

10

prediction_column_name

y_pred

continuous

predicted label

11

predicted_probability_column_name

y_pred_proba

continuous

predicted score/probability

For more information about specifying metadata look at Providing Metadata.