Synthetic Binary Classification Car Loan Dataset

NannyML provides a synthetic dataset describing a binary classification problem, to make it easier to test and document its features.

To find out what requirements NannyML has for datasets, check out Data Requirements.

Problem Description

The dataset describes a machine learning model that predicts whether a customer will repay a loan to buy a car.

Dataset Description

A sample of the dataset can be seen below.

>>> import nannyml as nml
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head(3))

   car_value  salary_range  debt_to_income_ratio  loan_length  repaid_loan_on_prev_car  size_of_downpayment  driver_tenure  y_pred_proba  y_pred  repaid  timestamp
0  39811      40K - 60K €   0.63295               19           False                    40%                  0.212653       0.99          1       1       2018-01-01 00:00:00.000
1  12679      40K - 60K €   0.718627              7            True                     10%                  4.92755        0.07          0       0       2018-01-01 00:08:43.152
2  19847      40K - 60K €   0.721724              17           False                    0%                   0.520817       1             1       1       2018-01-01 00:17:26.304

The model uses 7 features:

  • car_value - a numerical feature representing the price of the car.

  • salary_range - a categorical feature with 4 categories that identify the range the customer's yearly income falls within.

  • debt_to_income_ratio - a numerical feature representing the customer's ratio of debt to income.

  • loan_length - a numerical feature representing the number of months over which the customer wants to repay the loan.

  • repaid_loan_on_prev_car - a categorical feature with 2 categories, stating whether the customer repaid a previous car loan.

  • size_of_downpayment - a categorical feature with 10 categories, representing the down payment as a percentage of the car value, in increments of 10%.

  • driver_tenure - a numerical feature describing how many years the customer has been driving.

There are 3 columns that reference the output of the model:

  • y_pred_proba - the model's predicted probability of the customer repaying the loan.

  • y_pred - the model prediction in binary form.

  • repaid - the Target column describing if the customer actually repaid the loan.

There is also an auxiliary column that is helpful but not used by the monitored model:

  • timestamp - a date column recording when the prediction was made.
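
The column roles described above can be written down explicitly, for instance to select only the features when inspecting a record. The following is a minimal sketch in plain Python, using the first sample row shown earlier. The constant names and the split_by_role helper are illustrative assumptions and are not part of NannyML.

```python
# Column roles as described in the lists above.
# These constant names are illustrative, not NannyML identifiers.
FEATURES = [
    "car_value", "salary_range", "debt_to_income_ratio", "loan_length",
    "repaid_loan_on_prev_car", "size_of_downpayment", "driver_tenure",
]
MODEL_OUTPUTS = ["y_pred_proba", "y_pred"]
TARGET = "repaid"
TIMESTAMP = "timestamp"

# First row of the reference sample shown earlier.
row = {
    "car_value": 39811,
    "salary_range": "40K - 60K €",
    "debt_to_income_ratio": 0.63295,
    "loan_length": 19,
    "repaid_loan_on_prev_car": False,
    "size_of_downpayment": "40%",
    "driver_tenure": 0.212653,
    "y_pred_proba": 0.99,
    "y_pred": 1,
    "repaid": 1,
    "timestamp": "2018-01-01 00:00:00.000",
}


def split_by_role(record):
    """Split one record into features, model outputs, target, and timestamp."""
    return {
        "features": {name: record[name] for name in FEATURES},
        "outputs": {name: record[name] for name in MODEL_OUTPUTS},
        "target": record[TARGET],
        "timestamp": record[TIMESTAMP],
    }


parts = split_by_role(row)

# Every column of the record belongs to exactly one role.
assert len(parts["features"]) + len(parts["outputs"]) + 2 == len(row)
print(parts["outputs"], parts["target"])
```

Keeping the target separate from the model outputs mirrors how the dataset is delivered: the loader returns the analysis targets in their own DataFrame.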

Data Quality Version

NannyML also provides a version of the car loan dataset that includes missing and unseen values, to demonstrate its data quality modules. The problem modeled and the columns included are the same. You can access this dataset with:

>>> import nannyml as nml
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_data_quality_dataset()
>>> # let's show an instance where new and missing values are present.
>>> display(analysis_df.iloc[41515:41520])

       car_value  salary_range  debt_to_income_ratio  loan_length  repaid_loan_on_prev_car  size_of_downpayment  driver_tenure  timestamp                y_pred_proba  period    y_pred
41515  58071      40K - 60K €   0.694352              20           True                     30%                  0.44644        2019-07-09 02:57:35.280  0.9           analysis  1
41516  40317      20K - 40K €   0.581372              8            True                     50%                  nan            2019-07-09 03:06:18.432  0.16          analysis  0
41517  57487      40K - 60K €   0.703041              7            True                     30%                  5.2826         2019-07-09 03:15:01.584  0.07          analysis  0
41518  21555      0 - 20K €     0.268774              16           False                    0%                   4.04887        2019-07-09 03:23:44.736  0.01          analysis  0
41519  78265      40K - 60K €   0.71856               19           True                     40%                  0.208278       2019-07-09 03:32:27.888  0.85          analysis  1

The dataset has induced missing values in the salary_range and driver_tenure features, and a new value, 50%, in the size_of_downpayment feature. You can see how the dataset is used in the data quality tutorials.
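
To make the two kinds of issue concrete, the sketch below counts missing values and finds unseen categories by hand. It is plain Python and does not use or reproduce NannyML's data quality modules. The helper names are illustrative assumptions. The three reference rows from the first sample stand in for the full reference data, so the set of "seen" categories here is only what those rows happen to contain.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows, column):
    """Count the rows in which `column` is missing."""
    return sum(is_missing(r[column]) for r in rows)


def unseen_values(reference_rows, analysis_rows, column):
    """Return values of `column` present in analysis but absent from reference."""
    seen = {r[column] for r in reference_rows if not is_missing(r[column])}
    observed = {r[column] for r in analysis_rows if not is_missing(r[column])}
    return observed - seen


# Stand-in for the reference data: the three reference sample rows shown earlier.
reference_rows = [
    {"size_of_downpayment": "40%", "driver_tenure": 0.212653},
    {"size_of_downpayment": "10%", "driver_tenure": 4.92755},
    {"size_of_downpayment": "0%", "driver_tenure": 0.520817},
]

# Rows 41516, 41518 and 41519 from the data quality sample above.
analysis_rows = [
    {"size_of_downpayment": "50%", "driver_tenure": float("nan")},
    {"size_of_downpayment": "0%", "driver_tenure": 4.04887},
    {"size_of_downpayment": "40%", "driver_tenure": 0.208278},
]

missing_tenure = count_missing(analysis_rows, "driver_tenure")
new_downpayments = unseen_values(reference_rows, analysis_rows, "size_of_downpayment")
print(missing_tenure, new_downpayments)  # prints: 1 {'50%'}
```

On the full DataFrames returned by the loader, the same checks would normally be done with pandas (for example with isna()) or with NannyML's own data quality functionality, as covered in the tutorials.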