Synthetic Binary Classification Dataset

NannyML provides a synthetic dataset describing a binary classification problem, to make it easier to test and document its features.

To find out what requirements NannyML has for datasets, check out Data Requirements.

Problem Description

The dataset describes a machine learning model that tries to predict whether an employee will work from home on the next day.

Dataset Description

A sample of the dataset can be seen below.

>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.datasets.load_synthetic_binary_classification_dataset()
>>> display(reference.head(3))

distance_from_office

salary_range

gas_price_per_litre

public_transportation_cost

wfh_prev_workday

workday

tenure

identifier

work_home_actual

timestamp

y_pred_proba

y_pred

0

5.96225

40K - 60K €

2.11948

8.56806

False

Friday

0.212653

0

1

2014-05-09 22:27:20

0.99

1

1

0.535872

40K - 60K €

2.3572

5.42538

True

Tuesday

4.92755

1

0

2014-05-09 22:59:32

0.07

0

2

1.96952

40K - 60K €

2.36685

8.24716

False

Monday

0.520817

2

1

2014-05-09 23:48:25

1

1

The model uses 7 features:

  • distance_from_office: A numerical feature. The distance in kilometers from the employee’s house to the workplace.

  • salary_range: A categorical feature with 4 categories that identify the range the employee’s yearly income falls within.

  • gas_price_per_litre: A numerical feature. The price of gas per litre close to the employee’s residence.

  • public_transportation_cost: A numerical feature. The price, in euros, of public transportation from the employee’s residence to the workplace.

  • wfh_prev_workday: A categorical feature with 2 categories, stating whether the employee worked from home the previous workday.

  • workday: A categorical feature with 5 categories. The day of the week where we want to predict whether the employee will work from home.

  • tenure: A numerical feature describing how many years the employee has been at the company.

The model predicts the probability of the employee working from home, recorded in the y_pred_proba column. A binary prediction is also available from the y_pred column. The work_home_actual is the Target column describing what actually happened.

There are also three auxiliary columns that are helpful but not used by the monitored model:

  • identifier: A unique number referencing each employee. This is very useful for joining the target results on the analysis dataset, when we want to compare estimated with realized performace..

  • timestamp: A date column informing us of the date the prediction was made.