Binary Classification: California Housing Dataset
This example outlines a typical workflow for estimating the performance of a model without access to ground truth, detecting performance issues, and identifying potential root causes for these issues. In this example, we use NannyML on the modified California Housing Prices dataset.
You can see what modifications were made to the data to make it suitable for the use case in California Housing Dataset.
Load and prepare data
Let’s load the dataset from NannyML’s included datasets.
>>> import pandas as pd
>>> import nannyml as nml
>>> reference_df, analysis_df, analysis_targets_df = nml.datasets.load_modified_california_housing_dataset()
>>> reference_df.head(3)
| | id | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | timestamp | clf_target | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 9.8413 | 32 | 7.17004 | 1.01484 | 4353 | 2.93725 | 34.22 | -118.19 | 2020-10-01 00:00:00 | 1 | 0.99 | 1 |
| 1 | 1 | 8.3695 | 37 | 7.45875 | 1.06271 | 941 | 3.10561 | 34.22 | -118.21 | 2020-10-01 01:00:00 | 1 | 1 | 1 |
| 2 | 2 | 8.72 | 44 | 6.16318 | 1.04603 | 668 | 2.79498 | 34.2 | -118.18 | 2020-10-01 02:00:00 | 1 | 1 | 1 |
Performance Estimation
We first want to estimate performance for the analysis period, using the reference period as our performance baseline.
>>> # fit performance estimator and estimate for combined reference and analysis
>>> cbpe = nml.CBPE(
... y_pred='y_pred',
... y_pred_proba='y_pred_proba',
... y_true='clf_target',
... timestamp_column_name='timestamp',
... problem_type='classification_binary',
... chunk_period='M',
... metrics=['roc_auc'])
>>> cbpe.fit(reference_data=reference_df)
>>> est_perf = cbpe.estimate(analysis_df)
UserWarning: The resulting list of chunks contains 1 underpopulated chunks. They contain too few records to be statistically relevant and might negatively influence the quality of calculations. Please consider splitting your data in a different way or continue at your own risk.
We get a warning that some chunks are too small. Let’s quickly check what’s going on here.
>>> est_perf_data = est_perf.to_df()
>>> print(est_perf.data[('chunk', 'end_index')] - est_perf.data[('chunk', 'start_index')])
0 743
1 719
2 743
3 743
4 671
5 743
6 719
7 743
8 719
9 743
10 743
11 719
12 743
13 719
14 743
15 743
16 671
17 743
18 719
19 215
dtype: int64
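These differences are what calendar-month chunking of hourly data would produce. The sketch below uses only the standard library. It assumes, based on the first rows shown earlier, one record per hour starting at 2020-10-01 00:00. The end of the range (2022-05-10, exclusive) is an assumption chosen to reproduce the sizes printed above and is not read from the dataset. Note that `end_index - start_index` is one less than the number of records in a chunk.

```python
from collections import Counter
from datetime import datetime, timedelta


def records_per_month(start, end):
    """Count hourly records in [start, end) per calendar month."""
    counts = Counter()
    current = start
    while current < end:
        counts[current.strftime("%Y-%m")] += 1
        current += timedelta(hours=1)
    return counts


# Assumed time range: hourly records from the first reference timestamp
# up to (but excluding) 2022-05-10. The end date is a guess that matches
# the chunk sizes printed above.
counts = records_per_month(datetime(2020, 10, 1), datetime(2022, 5, 10))

for month, n_records in counts.items():
    # end_index - start_index equals the record count minus one
    print(month, n_records, n_records - 1)
```

Under these assumptions the final month holds only 216 hourly records (nine days), which is why it is flagged as underpopulated.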
The last chunk is smaller than the others due to the selected chunking method. Let’s remove it to make sure everything we visualise is reliable.
>>> est_perf.data = est_perf.data[:-1].copy()
>>> est_perf.data.tail(2)
| | chunk / key | chunk / chunk_index | chunk / start_index | chunk / end_index | chunk / start_date | chunk / end_date | chunk / period | roc_auc / value | roc_auc / sampling_error | roc_auc / realized | roc_auc / upper_confidence_boundary | roc_auc / lower_confidence_boundary | roc_auc / upper_threshold | roc_auc / lower_threshold | roc_auc / alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 2022-03 | 9 | 6552 | 7295 | 2022-03-01 00:00:00 | 2022-03-31 23:59:59.999999999 | analysis | 0.829077 | 0.00778098 | nan | 0.85242 | 0.805734 | 1 | 0.708336 | False |
| 18 | 2022-04 | 10 | 7296 | 8015 | 2022-04-01 00:00:00 | 2022-04-30 23:59:59.999999999 | analysis | 0.910661 | 0.0079096 | nan | 0.93439 | 0.886932 | 1 | 0.708336 | False |
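The numbers in this table suggest how the boundary and alert columns relate to each other. The sketch below is an inference from the two rows shown, not a description of NannyML's internals. It assumes the confidence boundaries are the estimate plus or minus three sampling errors, and that an alert is raised when the estimate falls outside the thresholds. The helper names `confidence_band` and `is_alert` are hypothetical.

```python
def confidence_band(value, sampling_error, n_errors=3):
    """Return (lower, upper) as value -/+ n_errors * sampling_error (assumed rule)."""
    margin = n_errors * sampling_error
    return value - margin, value + margin


def is_alert(value, lower_threshold, upper_threshold):
    """Flag an estimate that falls outside the threshold range (assumed rule)."""
    return value < lower_threshold or value > upper_threshold


# Estimated ROC AUC and sampling error copied from the two rows above.
rows = {
    "2022-03": (0.829077, 0.00778098),
    "2022-04": (0.910661, 0.0079096),
}

bands = {}
for key, (value, sampling_error) in rows.items():
    lower, upper = confidence_band(value, sampling_error)
    bands[key] = (lower, upper, is_alert(value, 0.708336, 1.0))
    print(key, round(lower, 6), round(upper, 6), bands[key][2])
```

Under these assumptions both rows reproduce the boundaries listed in the table to the displayed precision, and neither raises an alert.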
Now we can plot the estimated performance confidently.
>>> fig = est_perf.filter(metrics=['roc_auc'], period='analysis').plot()
>>> fig.show()
CBPE estimates a significant performance drop in the chunk corresponding to the month of September.
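To build some intuition for how ROC AUC can be estimated without targets, the sketch below treats each predicted probability as the chance that the record is positive and weights every pair of records accordingly. This is a simplified pairwise approximation that assumes well-calibrated probabilities. It is not NannyML's CBPE implementation, and the function name `expected_roc_auc` is hypothetical.

```python
def expected_roc_auc(probabilities):
    """Approximate the expected ROC AUC from calibrated scores alone.

    Each ordered pair (i, j) is weighted by the probability that record i
    is positive and record j is negative. The pair counts as correctly
    ordered when the score of i is higher, and as half when scores tie.
    """
    numerator = 0.0
    denominator = 0.0
    for i, p_pos in enumerate(probabilities):
        for j, p_neg in enumerate(probabilities):
            if i == j:
                continue  # a record cannot be paired with itself
            weight = p_pos * (1.0 - p_neg)
            denominator += weight
            if p_pos > p_neg:
                numerator += weight
            elif p_pos == p_neg:
                numerator += 0.5 * weight
    if denominator == 0.0:
        raise ValueError("need at least one likely positive and one likely negative")
    return numerator / denominator


confident = expected_roc_auc([0.0, 0.0, 1.0, 1.0])   # scores separate the classes
uninformative = expected_roc_auc([0.5, 0.5, 0.5, 0.5])  # scores carry no ranking
print(confident, uninformative)
```

The closer the probabilities sit to 0.5, the lower the expected ROC AUC, which is the kind of signal a confidence-based estimator can pick up when the inputs shift.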
Comparison with the actual performance
Because we have the ground truth for our dataset, we can use it to calculate ROC AUC on the relevant chunks, and compare it to the estimated values.
>>> from sklearn.metrics import roc_auc_score
>>> import matplotlib.pyplot as plt
>>> # add ground truth to analysis
>>> analysis_full = pd.merge(analysis_df, analysis_targets_df, left_index=True, right_index=True)
>>> df_all = pd.concat([reference_df, analysis_full]).reset_index(drop=True)
>>> df_all['timestamp'] = pd.to_datetime(df_all['timestamp'])
>>> # calculate actual ROC AUC
>>> target_col = cbpe.y_true
>>> pred_score_col = 'y_pred_proba'
>>> actual_performance = []
>>> for idx in est_perf_data.index:
... start_date, end_date = est_perf_data.loc[idx, ('chunk', 'start_date')], est_perf_data.loc[idx, ('chunk', 'end_date')]
... sub = df_all[df_all['timestamp'].between(start_date, end_date)]
... actual_perf = roc_auc_score(sub[target_col], sub[pred_score_col])
... est_perf_data.loc[idx, ('roc_auc', 'realized')] = actual_perf
>>> # plot
>>> first_analysis = est_perf_data[('chunk', 'key')].values[8]
>>> plt.figure(figsize=(10,5))
>>> plt.plot(est_perf_data[('chunk', 'key')], est_perf_data[('roc_auc', 'value')], label='estimated ROC AUC')
>>> plt.plot(est_perf_data[('chunk', 'key')], est_perf_data[('roc_auc', 'realized')], label='actual ROC AUC')
>>> plt.xticks(rotation=90)
>>> plt.axvline(x=first_analysis, label='First analysis chunk', linestyle=':', color='grey')
>>> plt.ylabel('ROC AUC')
>>> plt.legend()
>>> plt.show()
We can see that the significant drop in the first few chunks of the analysis period was estimated accurately. After that, the estimate follows the overall trend well, although the estimated performance has a lower variance than the actual performance.
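The visual comparison can be backed by a couple of summary numbers. The sketch below computes the mean absolute error between the two series and the variance of each. The input values are made-up placeholders for illustration only and are not results from this dataset. The function name `compare_series` is hypothetical.

```python
from statistics import mean, pvariance


def compare_series(estimated, realized):
    """Summarise how closely estimated values track realized ones."""
    errors = [abs(e - r) for e, r in zip(estimated, realized)]
    return {
        "mae": mean(errors),
        "var_estimated": pvariance(estimated),
        "var_realized": pvariance(realized),
    }


# Placeholder values, NOT taken from the California Housing results.
estimated = [0.90, 0.85, 0.80, 0.85]
realized = [0.92, 0.82, 0.76, 0.88]

summary = compare_series(estimated, realized)
print(summary)
```

In the workflow above, the same function could presumably be applied to the `('roc_auc', 'value')` and `('roc_auc', 'realized')` columns of `est_perf_data` after converting them to lists.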
Drift detection
The next step is to find out what might be responsible for this drop in ROC AUC. Let’s try using univariate drift detection, and see what we discover.
>>> feature_column_names = [
... col for col in reference_df
... if col not in ['y_pred', 'y_pred_proba', 'clf_target', 'timestamp']]
>>> univariate_calculator = nml.UnivariateDriftCalculator(column_names=feature_column_names,
... timestamp_column_name='timestamp',
... chunk_period='M',
... continuous_methods=['kolmogorov_smirnov'],
... categorical_methods=['chi2']).fit(reference_data=reference_df)
>>> univariate_results = univariate_calculator.calculate(analysis_df)
>>> nml.AlertCountRanker().rank(univariate_results)
| | number_of_alerts | column_name | rank |
|---|---|---|---|
| 0 | 3 | Longitude | 1 |
| 1 | 1 | Latitude | 2 |
| 2 | 0 | id | 3 |
| 3 | 0 | Population | 4 |
| 4 | 0 | MedInc | 5 |
| 5 | 0 | HouseAge | 6 |
| 6 | 0 | AveRooms | 7 |
| 7 | 0 | AveOccup | 8 |
| 8 | 0 | AveBedrms | 9 |
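The ordering in this table can be reproduced with a plain sort. The sketch below starts from the alert counts listed above and sorts by count, then by column name, both descending. The tie-break rule is inferred from the table and is not confirmed against NannyML's source. The function name `rank_by_alerts` is hypothetical.

```python
def rank_by_alerts(alert_counts):
    """Rank columns by alert count, breaking ties by name (both descending)."""
    ordered = sorted(
        alert_counts.items(),
        key=lambda item: (item[1], item[0]),
        reverse=True,
    )
    return [
        {"number_of_alerts": count, "column_name": name, "rank": rank}
        for rank, (name, count) in enumerate(ordered, start=1)
    ]


# Alert counts copied from the ranking table above.
alert_counts = {
    "id": 0, "MedInc": 0, "HouseAge": 0, "AveRooms": 0, "AveBedrms": 0,
    "Population": 0, "AveOccup": 0, "Latitude": 1, "Longitude": 3,
}

ranking = rank_by_alerts(alert_counts)
for row in ranking:
    print(row["rank"], row["column_name"], row["number_of_alerts"])
```

Because most columns tie at zero alerts, their relative order here says nothing about how much they drifted.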
The ranking shows alerts for only two columns: Longitude drifted in 3 of the 12 analysis chunks and Latitude in 1, while no other column raised an alert. Alert counts do not tell us how large the drift is, so let's look at its magnitude by examining the KS distance statistics.
>>> # select only the KS D-statistic columns and average them over chunks
>>> univariate_results.to_df().loc[:, (slice(None), 'kolmogorov_smirnov', 'value')].mean().sort_values(ascending=False)
| | 0 |
|---|---|
| ('id', 'kolmogorov_smirnov', 'value') | 0.87428 |
| ('Longitude', 'kolmogorov_smirnov', 'value') | 0.712709 |
| ('Latitude', 'kolmogorov_smirnov', 'value') | 0.672904 |
| ('HouseAge', 'kolmogorov_smirnov', 'value') | 0.201638 |
| ('MedInc', 'kolmogorov_smirnov', 'value') | 0.154952 |
| ('AveOccup', 'kolmogorov_smirnov', 'value') | 0.14389 |
| ('AveRooms', 'kolmogorov_smirnov', 'value') | 0.129277 |
| ('AveBedrms', 'kolmogorov_smirnov', 'value') | 0.0891403 |
| ('Population', 'kolmogorov_smirnov', 'value') | 0.0735623 |
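The values above are Kolmogorov-Smirnov D-statistics: the largest vertical gap between the empirical cumulative distribution functions of two samples, ranging from 0 (identical) to 1 (no overlap). A minimal standard-library sketch of that definition follows. It is meant for illustration and is not the routine NannyML calls. The function name `ks_statistic` is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, analysis):
    """Two-sample KS D-statistic: max gap between the two empirical CDFs."""
    ref = sorted(reference)
    ana = sorted(analysis)
    d_stat = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to evaluate the gap at every observed value.
    for x in ref + ana:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_ana = bisect_right(ana, x) / len(ana)
        d_stat = max(d_stat, abs(cdf_ref - cdf_ana))
    return d_stat


same = ks_statistic([1, 2, 3], [1, 2, 3])            # identical samples
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])        # no overlap at all
partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])   # half of the values overlap
print(same, disjoint, partial)
```

Read this way, a mean D-statistic around 0.7 means the chunk and reference distributions barely overlap.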
Excluding id, which is a row identifier rather than a model input, Longitude and Latitude have the largest mean D-statistic. Let’s plot their distributions for the analysis period.
>>> fig = univariate_results.filter(
... column_names=['Longitude', 'Latitude'],
... period='analysis',
... methods=['kolmogorov_smirnov']
... ).plot(kind='distribution', number_of_columns=1)
>>> fig.show()
Indeed, we can see that the distributions of these variables are completely different in each chunk. This was expected: in the original dataset, observations from nearby locations sit next to each other, so each time-based chunk covers a different area. Let’s see it on a scatter plot:
>>> analysis_res = est_perf.data
>>> plt.figure(figsize=(8,6))
>>> for idx in analysis_res.index[:10]:
... start_date, end_date = analysis_res.loc[idx, ('chunk', 'start_date')], analysis_res.loc[idx, ('chunk', 'end_date')]
... sub = df_all[df_all['timestamp'].between(start_date, end_date)]
... plt.scatter(sub['Latitude'], sub['Longitude'], s=5, label="Chunk {}".format(str(idx)))
>>> plt.legend()
>>> plt.xlabel('Latitude')
>>> plt.ylabel('Longitude')
>>> plt.show()
In this example, NannyML estimated the performance (ROC AUC) of a model without accessing the target data. We can see from our comparison with the targets that the estimate is quite accurate. Next, the potential root causes of the drop in performance were indicated by detecting data drift. This was achieved using univariate methods that identified the features which drifted the most.