Synthetic Regression Dataset
NannyML provides a synthetic dataset describing a regression problem, to make it easier to test and document its features.
To find out what requirements NannyML has for datasets, check out Data Requirements.
Problem Description
The dataset describes a machine learning model that tries to predict the price of a used car.
Dataset Description
A sample of the dataset can be seen below.
>>> import nannyml as nml
>>> reference_df, analysis_df, analysis_targets_df = nml.datasets.load_synthetic_car_price_dataset()
>>> display(reference_df.head())
|   | car_age | km_driven | price_new | accident_count | door_count | fuel | transmission | y_true | y_pred | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 144020 | 42810 | 4 | 3 | diesel | automatic | 569 | 1246 | 2017-01-24 08:00:00.000 |
| 1 | 12 | 57078 | 31835 | 3 | 3 | electric | automatic | 4277 | 4924 | 2017-01-24 08:00:33.600 |
| 2 | 2 | 76288 | 31851 | 3 | 5 | diesel | automatic | 7011 | 5744 | 2017-01-24 08:01:07.200 |
| 3 | 7 | 97593 | 29288 | 2 | 3 | electric | manual | 5576 | 6781 | 2017-01-24 08:01:40.800 |
| 4 | 13 | 9985 | 41350 | 1 | 5 | diesel | automatic | 6456 | 6822 | 2017-01-24 08:02:14.400 |
The model uses 7 features:
car_age - a numerical feature. The age of the car in years.
km_driven - a numerical feature. The number of kilometers the car has been driven.
price_new - a numerical feature. The price of the car in Euros when it was new.
accident_count - a numerical feature. The number of accidents the car has been involved in.
door_count - a numerical feature. The number of doors the car has. If it is a hatchback, the door count is increased by 1.
fuel - a categorical feature describing whether the car uses gas, diesel or electricity as fuel.
transmission - a categorical feature describing whether the car uses manual or automatic transmission.
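As a minimal sketch, the column roles listed above can be written down as plain Python lists. The variable names here are illustrative and are not taken from the NannyML API; the sample row is copied by hand from the first row of the table.

```python
# Column roles as described in the feature list.
# These variable names are hypothetical, chosen only for this illustration.
numerical_features = ["car_age", "km_driven", "price_new", "accident_count", "door_count"]
categorical_features = ["fuel", "transmission"]
feature_column_names = numerical_features + categorical_features

# First sample row of the reference data, copied by hand from the table.
row = {
    "car_age": 15,
    "km_driven": 144020,
    "price_new": 42810,
    "accident_count": 4,
    "door_count": 3,
    "fuel": "diesel",
    "transmission": "automatic",
    "y_true": 569,
    "y_pred": 1246,
    "timestamp": "2017-01-24 08:00:00.000",
}

# Keep only the columns the model uses as input.
features_only = {name: row[name] for name in feature_column_names}

print(len(feature_column_names))  # 7
print(sorted(set(row) - set(feature_column_names)))  # ['timestamp', 'y_pred', 'y_true']
```

Keeping numerical and categorical names in separate lists makes it easy to treat the two kinds of features differently later on.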
The model's predicted price for each car is stored in the y_pred column. The y_true column is the target: the actual price of the car.
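To illustrate how the prediction and target columns relate, the sketch below computes the mean absolute error over the five sample rows shown in the table. The values are copied by hand, so the result describes only this small sample and not the model's performance on the full dataset.

```python
# y_true and y_pred for the five sample rows, copied by hand from the table.
y_true = [569, 4277, 7011, 5576, 6456]
y_pred = [1246, 4924, 5744, 6781, 6822]

# Absolute difference between actual and predicted price, row by row.
absolute_errors = [abs(actual - predicted) for actual, predicted in zip(y_true, y_pred)]

# Mean absolute error over the sample.
mae = sum(absolute_errors) / len(absolute_errors)

print(absolute_errors)  # [677, 647, 1267, 1205, 366]
print(mae)  # 832.4
```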
There is also an auxiliary column that is helpful but not used by the monitored model:
timestamp - a datetime column recording when the prediction was made.
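As a small illustration of the timestamp format, the sketch below parses the five sample values from the table with the standard library and measures the gap between consecutive predictions. The format string is an assumption based on how the sample rows are printed, and the spacing is only checked for these five rows.

```python
from datetime import datetime

# Timestamps of the five sample rows, copied by hand from the table.
sample_timestamps = [
    "2017-01-24 08:00:00.000",
    "2017-01-24 08:00:33.600",
    "2017-01-24 08:01:07.200",
    "2017-01-24 08:01:40.800",
    "2017-01-24 08:02:14.400",
]

# Assumed format: date, time, and fractional seconds, as printed in the sample.
parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f") for ts in sample_timestamps]

# Seconds elapsed between each pair of consecutive sample rows.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(parsed, parsed[1:])]

print(gaps)  # [33.6, 33.6, 33.6, 33.6]
```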