.. _dataset-california: ========================== California Housing Dataset ========================== Modifying California Housing Dataset ==================================== We are using the `California Housing Dataset`_ to create a real data example dataset for NannyML. There are three steps needed for this process: - Enriching the data - Training a Machine Learning Model - Meeting NannyML Data Requirements To find out what requirements NannyML has for datasets, check out :ref:`Data Requirements`. We need to start by loading the dataset. .. code-block:: python >>> # Import required libraries >>> import pandas as pd >>> import numpy as np >>> import datetime as dt >>> from sklearn.datasets import fetch_california_housing >>> from sklearn.ensemble import RandomForestClassifier >>> from sklearn.metrics import roc_auc_score >>> cali = fetch_california_housing(as_frame=True) >>> df = pd.concat([cali.data, cali.target], axis=1) >>> df.head(2) +----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------+ | | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal | +====+==========+============+============+=============+==============+============+============+=============+===============+ | 0 | 8.3252 | 41 | 6.98413 | 1.02381 | 322 | 2.55556 | 37.88 | -122.23 | 4.526 | +----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------+ | 1 | 8.3014 | 21 | 6.23814 | 0.97188 | 2401 | 2.10984 | 37.86 | -122.22 | 3.585 | +----+----------+------------+------------+-------------+--------------+------------+------------+-------------+---------------+ Enriching the data ================== The things that need to be added to the dataset are: - A time dimension - Splitting the data into reference and analysis sets - A binary classification target .. code-block:: python >>> # add artificial timestamp >>> timestamps = [dt.datetime(2020,1,1) + dt.timedelta(hours=x/2) for x in df.index] >>> df['timestamp'] = timestamps >>> # add periods/partitions >>> train_beg = dt.datetime(2020,1,1) >>> train_end = dt.datetime(2020,5,1) >>> test_beg = dt.datetime(2020,5,1) >>> test_end = dt.datetime(2020,9,1) >>> df.loc[df['timestamp'].between(train_beg, train_end, inclusive='left'), 'partition'] = 'train' >>> df.loc[df['timestamp'].between(test_beg, test_end, inclusive='left'), 'partition'] = 'test' >>> df['partition'] = df['partition'].fillna('production') >>> # create new classification target - house value higher than mean >>> df_train = df[df['partition']=='train'] >>> df['clf_target'] = np.where(df['MedHouseVal'] > df_train['MedHouseVal'].median(), 1, 0) >>> df = df.drop('MedHouseVal', axis=1) >>> del df_train Training a Machine Learning Model ================================= .. code-block:: python >>> # fit classifier >>> target = 'clf_target' >>> meta = 'partition' >>> features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'] >>> df_train = df[df['partition']=='train'] >>> clf = RandomForestClassifier(random_state=42) >>> clf.fit(df_train[features], df_train[target]) >>> df['y_pred_proba'] = clf.predict_proba(df[features])[:,1] >>> df['y_pred'] = df['y_pred_proba'].map(lambda p: int(p >= 0.8)) >>> # Check roc auc score >>> for partition_name, partition_data in df.groupby('partition', sort=False): ... print(partition_name, roc_auc_score(partition_data[target], partition_data['y_pred_proba'])) train 1.0 test 0.8737681614409617 production 0.8224322932364313 Meeting NannyML Data Requirements ================================= The data are now being split to satisfy NannyML format requirements. .. code-block:: python >>> df_for_nanny = df[df['partition']!='train'].reset_index(drop=True) >>> df_for_nanny['partition'] = df_for_nanny['partition'].map({'test':'reference', 'production':'analysis'}) >>> reference_df = df_for_nanny[df_for_nanny['partition']=='reference'].copy() >>> analysis_df = df_for_nanny[df_for_nanny['partition']=='analysis'].copy() >>> analysis_targets_df = analysis_df[['clf_target']].copy() >>> analysis_df = analysis_df.drop('clf_target', axis=1) >>> # dropping partition column that is now removed from requirements. >>> reference_df.drop('partition', axis=1, inplace=True) >>> analysis_df.drop('partition', axis=1, inplace=True) The ``reference_df`` dataframe represents the reference :term:`Data Period` and the ``analysis_df`` dataframe represents the analysis period. The ``analysis_targets_df`` dataframe contains the targets for the analysis period, which is provided separately. .. _California Housing Dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html