California Housing Dataset
Modifying California Housing Dataset
We are using the California Housing Dataset to create a real-data example dataset for NannyML. The process has three steps:
Enriching the data
Training a Machine Learning Model
Meeting NannyML Data Requirements
To find out what requirements NannyML has for datasets, check out Data Requirements.
We need to start by loading the dataset.
>>> # Import required libraries
>>> import pandas as pd
>>> import numpy as np
>>> import datetime as dt
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.metrics import roc_auc_score
>>> cali = fetch_california_housing(as_frame=True)
>>> df = pd.concat([cali.data, cali.target], axis=1)
>>> df.head(2)
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0   6.98413    1.02381       322.0   2.55556     37.88    -122.23        4.526
1  8.3014      21.0   6.23814    0.97188      2401.0   2.10984     37.86    -122.22        3.585
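Before enriching the data, a quick sanity check can confirm the load went as expected. A minimal sketch, assuming the standard scikit-learn copy of the dataset (20,640 rows, 8 features plus the target, no missing values):

>>> # basic structural checks on the freshly loaded frame
>>> df.shape
(20640, 9)
>>> int(df.isna().sum().sum())
0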
Enriching the data
We need to add the following to the dataset:
A time dimension
A split into reference and analysis sets
A binary classification target
>>> # add artificial timestamp
>>> timestamps = [dt.datetime(2020,1,1) + dt.timedelta(hours=x/2) for x in df.index]
>>> df['timestamp'] = timestamps
>>> # add periods/partitions
>>> train_beg = dt.datetime(2020,1,1)
>>> train_end = dt.datetime(2020,5,1)
>>> test_beg = dt.datetime(2020,5,1)
>>> test_end = dt.datetime(2020,9,1)
>>> df.loc[df['timestamp'].between(train_beg, train_end, inclusive='left'), 'partition'] = 'train'
>>> df.loc[df['timestamp'].between(test_beg, test_end, inclusive='left'), 'partition'] = 'test'
>>> df['partition'] = df['partition'].fillna('production')
>>> # create new classification target - house value higher than the training set median
>>> df_train = df[df['partition']=='train']
>>> df['clf_target'] = np.where(df['MedHouseVal'] > df_train['MedHouseVal'].median(), 1, 0)
>>> df = df.drop('MedHouseVal', axis=1)
>>> del df_train
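As an optional check, not part of the original walkthrough, we can verify how the rows landed in each partition. Since 2020 is a leap year and the artificial timestamps add 48 half-hourly rows per day, the expected sizes can be worked out by hand:

>>> # train: 121 days (Jan-Apr 2020) * 48 rows/day = 5808
>>> # test:  123 days (May-Aug 2020) * 48 rows/day = 5904
>>> # production: 20640 - 5808 - 5904 = 8928
>>> df['partition'].value_counts().to_dict()
{'production': 8928, 'test': 5904, 'train': 5808}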
Training a Machine Learning Model
>>> # fit classifier
>>> target = 'clf_target'
>>> meta = 'partition'
>>> features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
>>> df_train = df[df['partition']=='train']
>>> clf = RandomForestClassifier(random_state=42)
>>> clf.fit(df_train[features], df_train[target])
>>> # predicted probability of the positive class
>>> df['y_pred_proba'] = clf.predict_proba(df[features])[:,1]
>>> # class prediction using a stricter 0.8 probability threshold
>>> df['y_pred'] = df['y_pred_proba'].map(lambda p: int(p >= 0.8))
>>> # Check roc auc score
>>> for partition_name, partition_data in df.groupby('partition', sort=False):
... print(partition_name, roc_auc_score(partition_data[target], partition_data['y_pred_proba']))
train 1.0
test 0.8737681614409617
production 0.8224322932364313
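Note that y_pred uses a 0.8 probability threshold rather than the 0.5 rule implied by clf.predict, so positive predictions are stricter. A quick, hypothetical way to convince yourself of the effect:

>>> # a higher threshold can only reduce the predicted-positive rate
>>> strict_rate = df['y_pred'].mean()
>>> default_rate = (df['y_pred_proba'] >= 0.5).mean()
>>> assert strict_rate <= default_rate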
Meeting NannyML Data Requirements
We now split the data to satisfy the NannyML format requirements.
>>> df_for_nanny = df[df['partition']!='train'].reset_index(drop=True)
>>> df_for_nanny['partition'] = df_for_nanny['partition'].map({'test':'reference', 'production':'analysis'})
>>> reference_df = df_for_nanny[df_for_nanny['partition']=='reference'].copy()
>>> analysis_df = df_for_nanny[df_for_nanny['partition']=='analysis'].copy()
>>> analysis_targets_df = analysis_df[['clf_target']].copy()
>>> analysis_df = analysis_df.drop('clf_target', axis=1)
>>> # drop the partition column, since it is not part of the NannyML data requirements
>>> reference_df.drop('partition', axis=1, inplace=True)
>>> analysis_df.drop('partition', axis=1, inplace=True)
The reference_df dataframe represents the reference Data Period, while the analysis_df dataframe represents the analysis period. The analysis_targets_df dataframe contains the targets for the analysis period, which are provided separately.
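With the data in this shape, the dataframes can be fed straight into NannyML. As a rough sketch, assuming a recent NannyML release (parameter names may differ between versions), estimating ROC AUC on the analysis period with CBPE would look something like this:

>>> import nannyml as nml
>>> # fit the CBPE performance estimator on the reference period...
>>> estimator = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='clf_target',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     problem_type='classification_binary',
... )
>>> estimator = estimator.fit(reference_df)
>>> # ...then estimate performance on the analysis period, which has no targets
>>> estimated_performance = estimator.estimate(analysis_df)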