Synthetic Binary Classification Car Loan Dataset

NannyML provides a synthetic dataset describing a binary classification problem, to make it easier to test and document its features.

To find out what requirements NannyML has for datasets, check out Data Requirements.

Problem Description

The dataset describes a machine learning model that predicts whether a customer will repay a loan to buy a car.

Dataset Description

A sample of the dataset can be seen below.

>>> import nannyml as nml
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head(3))

	car_value	salary_range	debt_to_income_ratio	loan_length	repaid_loan_on_prev_car	size_of_downpayment	driver_tenure	y_pred_proba	y_pred	repaid	timestamp
0	39811	40K - 60K €	0.63295	19	False	40%	0.212653	0.99	1	1	2018-01-01 00:00:00.000
1	12679	40K - 60K €	0.718627	7	True	10%	4.92755	0.07	0	0	2018-01-01 00:08:43.152
2	19847	40K - 60K €	0.721724	17	False	0%	0.520817	1	1	1	2018-01-01 00:17:26.304

The model uses 7 features:

car_value - a numerical feature representing the price of the car.
salary_range - a categorical feature with 4 categories that identify the range the customer's yearly income falls within.
debt_to_income_ratio - a numerical feature representing the customer's ratio of debt to income.
loan_length - a numerical feature representing the number of months over which the customer wants to repay the loan.
repaid_loan_on_prev_car - a categorical feature with 2 categories, indicating whether the customer repaid a loan on a previous car.
size_of_downpayment - a categorical feature with 10 categories, representing the size of the downpayment as a percentage of the car value, in increments of 10%.
driver_tenure - a numerical feature describing how many years the customer has been driving.

There are 3 columns that hold the output of the model and the target it is compared against:

y_pred_proba - the model's predicted probability that the customer repays the loan.
y_pred - the model's prediction in binary form.
repaid - the target column, describing whether the customer actually repaid the loan.
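
To make the relationship between these three columns concrete, here is a minimal sketch using the three reference rows shown above. The 0.5 cutoff is an assumption: the rows are consistent with it, but nothing on this page states which threshold the model uses. The names `sample_rows` and `THRESHOLD` are illustrative only.

```python
# (y_pred_proba, y_pred, repaid) copied from the three reference rows shown above.
sample_rows = [
    (0.99, 1, 1),
    (0.07, 0, 0),
    (1.00, 1, 1),
]

# Assumed cutoff; not confirmed for this model.
THRESHOLD = 0.5

# y_pred is the binary form of y_pred_proba under the assumed cutoff.
binarized = [int(proba >= THRESHOLD) for proba, _, _ in sample_rows]
matches_y_pred = all(
    b == y_pred for b, (_, y_pred, _) in zip(binarized, sample_rows)
)

# Comparing y_pred with the target shows which predictions were correct.
correct = sum(y_pred == repaid for _, y_pred, repaid in sample_rows)

print(matches_y_pred)
print(f"{correct} of {len(sample_rows)} sample predictions match the target")
```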

There is also an auxiliary column that is helpful but not used by the monitored model:

timestamp - a datetime column recording when the prediction was made.
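
The column roles described above can be written down once and reused, for example when passing column names to other code. The sketch below is plain Python. The constant names and the `check_columns` helper are hypothetical and not part of NannyML; the column names are taken from the sample header shown above.

```python
# Column roles as described above. The constant names are illustrative.
FEATURE_COLUMNS = [
    "car_value",
    "salary_range",
    "debt_to_income_ratio",
    "loan_length",
    "repaid_loan_on_prev_car",
    "size_of_downpayment",
    "driver_tenure",
]
Y_PRED_PROBA = "y_pred_proba"
Y_PRED = "y_pred"
TARGET = "repaid"
TIMESTAMP = "timestamp"


def check_columns(columns):
    """Return the expected columns that are missing from `columns`."""
    expected = FEATURE_COLUMNS + [Y_PRED_PROBA, Y_PRED, TARGET, TIMESTAMP]
    present = set(columns)
    return [name for name in expected if name not in present]


# Header of the reference sample shown above.
reference_columns = [
    "car_value", "salary_range", "debt_to_income_ratio", "loan_length",
    "repaid_loan_on_prev_car", "size_of_downpayment", "driver_tenure",
    "y_pred_proba", "y_pred", "repaid", "timestamp",
]

# An empty list means every expected column is present.
print(check_columns(reference_columns))
```

With the dataset loaded, `check_columns(reference_df.columns)` performs the same check on the real DataFrame.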

Data Quality Version

NannyML also provides a version of the car loan dataset that includes missing and unseen values, to demonstrate its data quality modules. The modeled problem and the columns are the same. You can access this dataset with:

>>> import nannyml as nml
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_data_quality_dataset()
>>> # let's show an instance where new and missing values are present.
>>> display(analysis_df.iloc[41515:41520])

	car_value	salary_range	debt_to_income_ratio	loan_length	repaid_loan_on_prev_car	size_of_downpayment	driver_tenure	timestamp	y_pred_proba	period	y_pred
41515	58071	40K - 60K €	0.694352	20	True	30%	0.44644	2019-07-09 02:57:35.280	0.9	analysis	1
41516	40317	20K - 20K €	0.581372	8	True	50%	nan	2019-07-09 03:06:18.432	0.16	analysis	0
41517	57487	40K - 60K €	0.703041	7	True	30%	5.2826	2019-07-09 03:15:01.584	0.07	analysis	0
41518	21555	0 - 20K €	0.268774	16	False	0%	4.04887	2019-07-09 03:23:44.736	0.01	analysis	0
41519	78265	40K - 60K €	0.71856	19	True	40%	0.208278	2019-07-09 03:32:27.888	0.85	analysis	1

The dataset has induced missing values in the salary_range and driver_tenure features, and a new value, 50%, in the size_of_downpayment feature. In the sample above, row 41516 has both a missing driver_tenure and the new 50% downpayment value. You can see how the dataset is used in the data quality tutorials.
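
As a rough illustration of what "missing" and "unseen" mean here, the sketch below counts both in plain Python, using values copied from the five analysis rows shown above. The helper names are hypothetical and not part of NannyML. The `reference_categories` list is an assumed stand-in for the size_of_downpayment values present in the reference data; it is not read from the dataset.

```python
import math


def is_missing(value):
    """True for None and for float NaN, the usual markers of a missing value."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(values):
    """Count the missing entries in an iterable of values."""
    return sum(1 for value in values if is_missing(value))


def unseen_values(reference_values, analysis_values):
    """Return analysis values that never occur in the reference values."""
    seen = {v for v in reference_values if not is_missing(v)}
    return {v for v in analysis_values if not is_missing(v)} - seen


# Values copied from analysis rows 41515 to 41519 shown above.
driver_tenure = [0.44644, float("nan"), 5.2826, 4.04887, 0.208278]
size_of_downpayment = ["30%", "50%", "30%", "0%", "40%"]

# Assumption: an illustrative subset of reference categories, not the real list.
reference_categories = ["0%", "10%", "30%", "40%"]

print(count_missing(driver_tenure))
print(unseen_values(reference_categories, size_of_downpayment))
```

On the loaded DataFrames, pandas offers the missing-value count directly, for example `analysis_df['driver_tenure'].isna().sum()`.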