Synthetic Binary Classification Dataset

NannyML provides a synthetic dataset describing a binary classification problem, to make it easier to test and document its features.

To find out what requirements NannyML has for datasets, check out Data Requirements.

Problem Description

The dataset describes a machine learning model that tries to predict whether an employee will work from home on the next day.

Dataset Description

A sample of the dataset can be seen below.

>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.datasets.load_synthetic_binary_classification_dataset()
>>> display(reference.head(3))

	distance_from_office	salary_range	gas_price_per_litre	public_transportation_cost	wfh_prev_workday	workday	tenure	identifier	work_home_actual	timestamp	y_pred_proba	y_pred
0	5.96225	40K - 60K €	2.11948	8.56806	False	Friday	0.212653	0	1	2014-05-09 22:27:20	0.99	1
1	0.535872	40K - 60K €	2.3572	5.42538	True	Tuesday	4.92755	1	0	2014-05-09 22:59:32	0.07	0
2	1.96952	40K - 60K €	2.36685	8.24716	False	Monday	0.520817	2	1	2014-05-09 23:48:25	1	1

The model uses 7 features:

distance_from_office: A numerical feature. The distance in kilometers from the employee’s house to the workplace.
salary_range: A categorical feature with 4 categories that identify the range the employee’s yearly income falls within.
gas_price_per_litre: A numerical feature. The price of gas per litre close to the employee’s residence.
public_transportation_cost: A numerical feature. The price, in euros, of public transportation from the employee’s residence to the workplace.
wfh_prev_workday: A categorical feature with 2 categories, stating whether the employee worked from home the previous workday.
workday: A categorical feature with 5 categories. The day of the week for which we want to predict whether the employee will work from home.
tenure: A numerical feature describing how many years the employee has been at the company.
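The feature list above can be written down as plain Python lists, which may be handy when selecting columns later. This is only a sketch: the names numerical_features, categorical_features and feature_columns are illustrative and are not part of NannyML.

```python
# Column names copied from the feature list above, grouped by type.
numerical_features = [
    "distance_from_office",
    "gas_price_per_litre",
    "public_transportation_cost",
    "tenure",
]
categorical_features = [
    "salary_range",
    "wfh_prev_workday",
    "workday",
]

# All model inputs in one list (illustrative name, not a NannyML API).
feature_columns = numerical_features + categorical_features
print(len(feature_columns))
```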

The model predicts the probability of the employee working from home, recorded in the y_pred_proba column. A binary prediction is also available in the y_pred column. The work_home_actual column is the target, recording what actually happened.
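To make the relationship between these three columns concrete, the sketch below copies the three sample rows shown earlier into plain Python and compares y_pred with work_home_actual. The names sample_rows, correct and accuracy are illustrative, and three rows say nothing about the model's actual performance.

```python
# The three sample rows shown earlier, reduced to the prediction and target columns.
sample_rows = [
    {"identifier": 0, "y_pred_proba": 0.99, "y_pred": 1, "work_home_actual": 1},
    {"identifier": 1, "y_pred_proba": 0.07, "y_pred": 0, "work_home_actual": 0},
    {"identifier": 2, "y_pred_proba": 1.0, "y_pred": 1, "work_home_actual": 1},
]

# Count the rows where the binary prediction matches what actually happened.
correct = sum(row["y_pred"] == row["work_home_actual"] for row in sample_rows)
accuracy = correct / len(sample_rows)
print(f"{correct} of {len(sample_rows)} sample predictions match the target")
```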

There are also two auxiliary columns that are helpful but not used by the monitored model:

identifier: A unique number referencing each employee. This is useful for joining the target results to the analysis dataset when we want to compare estimated with realized performance.
timestamp: A datetime column recording when the prediction was made.
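Joining targets on identifier can be sketched in plain Python as follows. The rows and identifier values are made up for illustration, and the sketch only assumes that both the analysis data and the targets carry an identifier column, as described above. With pandas DataFrames the same step is typically a merge on that column.

```python
# Hypothetical analysis rows: predictions whose targets are not yet known.
analysis_rows = [
    {"identifier": 50000, "y_pred": 1},
    {"identifier": 50001, "y_pred": 0},
    {"identifier": 50002, "y_pred": 1},
]
# Hypothetical targets that arrive later, possibly in a different order.
target_rows = [
    {"identifier": 50002, "work_home_actual": 0},
    {"identifier": 50000, "work_home_actual": 1},
    {"identifier": 50001, "work_home_actual": 0},
]

# Index the targets by identifier, then attach each target to its prediction.
target_by_id = {row["identifier"]: row["work_home_actual"] for row in target_rows}
joined = [
    {**row, "work_home_actual": target_by_id[row["identifier"]]}
    for row in analysis_rows
    if row["identifier"] in target_by_id  # skip rows whose target is still missing
]

for row in joined:
    print(row)
```

Matching on identifier rather than on row position keeps the join correct even when targets arrive out of order or only for part of the analysis data.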