Domain Classifier
The second multivariate drift detection method of NannyML is Domain Classifier. It provides a measure of how easy it is to discriminate the reference data from the examined chunk data. You can read more about it in the How it works: Domain Classifier section. When there is no data drift, the datasets cannot be discerned and we get a value of 0.5. The more drift there is, the higher the returned measure will be, up to a value of 1.
Just The Code
>>> import nannyml as nml
>>> from IPython.display import display
>>> # Load synthetic data
>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())
>>> # Define feature columns
>>> feature_column_names = [
... 'car_value',
... 'salary_range',
... 'debt_to_income_ratio',
... 'loan_length',
... 'repaid_loan_on_prev_car',
... 'size_of_downpayment',
... 'driver_tenure'
>>> ]
>>> calc = nml.DomainClassifierCalculator(
... feature_column_names=feature_column_names,
... timestamp_column_name='timestamp',
... chunk_size=5000
>>> )
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='analysis').to_df())
>>> display(results.filter(period='reference').to_df())
>>> figure = results.plot()
>>> figure.show()
Advanced configuration
To learn how Chunk works and to set up custom chunkings, check out the chunking tutorial. To learn how ConstantThreshold works and to set up custom thresholds, check out the thresholds tutorial.
Walkthrough
The method returns a single number per chunk, measuring how well the internal classifier can discriminate between the reference data and the chunk data. Any increase in the discrimination value above 0.5 reflects a change in the structure of the model inputs.
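A minimal sketch of this idea, assuming a single numeric feature and using the raw feature value itself as the discrimination score (NannyML actually trains a LightGBM classifier on all features and reports its cross-validated AUROC; the data below is made up for illustration):

```python
import random

def auroc(scores_ref, scores_chunk):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly drawn chunk score outranks a randomly drawn reference score."""
    combined = [(v, 0) for v in scores_ref] + [(v, 1) for v in scores_chunk]
    combined.sort(key=lambda t: t[0])
    # Sum of the 1-based ranks of the chunk observations in the pooled sample.
    rank_sum = sum(rank for rank, (_, label) in enumerate(combined, start=1) if label == 1)
    n_chunk, n_ref = len(scores_chunk), len(scores_ref)
    u = rank_sum - n_chunk * (n_chunk + 1) / 2
    return u / (n_ref * n_chunk)

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(2000)]
no_drift = [random.gauss(0, 1) for _ in range(2000)]   # same distribution
drifted = [random.gauss(2, 1) for _ in range(2000)]    # shifted distribution

print(auroc(reference, no_drift))  # close to 0.5: datasets indistinguishable
print(auroc(reference, drifted))   # close to 1: clear drift
```

With no drift the discriminator cannot do better than chance, so the value hovers around 0.5; the further the chunk distribution moves away from reference, the closer the value gets to 1.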
NannyML calculates the discrimination value for the monitored model’s inputs and raises an alert if the values fall outside the pre-defined range of [0.45, 0.65]. If needed, this range can be adjusted by specifying a threshold strategy more appropriate for the user’s data.
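The alerting rule above can be sketched as a simple comparison of each chunk's discrimination value against constant bounds ([0.45, 0.65] is the default range for this method; the per-chunk values below are made up for illustration):

```python
# Default constant thresholds for the Domain Classifier method.
LOWER, UPPER = 0.45, 0.65

def alerts(values, lower=LOWER, upper=UPPER):
    """Flag every chunk whose discrimination value falls outside [lower, upper]."""
    return [not (lower <= v <= upper) for v in values]

# Hypothetical per-chunk discrimination (AUROC) values -- illustration only.
chunk_aurocs = [0.50, 0.49, 0.51, 0.91, 0.93]
print(alerts(chunk_aurocs))  # the last two chunks raise drift alerts
```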
In order to monitor a model, NannyML needs to learn about it from a reference dataset. Then it can monitor the data subject to actual analysis, provided as the analysis dataset. You can read more about this in our section on data periods.
Let’s start by loading some synthetic data provided by the NannyML package and setting it up as our reference and analysis dataframes. This synthetic data is for a binary classification model, but multi-class classification can be handled in the same way.
>>> import nannyml as nml
>>> from IPython.display import display
>>> # Load synthetic data
>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())
| | id | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 2018-01-01 00:00:00.000 | 0.99 | 1 |
| 1 | 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 2018-01-01 00:08:43.152 | 0.07 | 0 |
| 2 | 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 2018-01-01 00:17:26.304 | 1 | 1 |
| 3 | 3 | 22652 | 20K - 40K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 2018-01-01 00:26:09.456 | 0.98 | 1 |
| 4 | 4 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 2018-01-01 00:34:52.608 | 0.99 | 1 |
The DomainClassifierCalculator class implements this functionality. We need to instantiate it with the appropriate parameters:
feature_column_names: A list with the column names of the features we want to run drift detection on.
treat_as_categorical (Optional): A list containing the names of features in the provided data set that should be treated as categorical. Need not be exhaustive.
timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.
chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.
chunk_number (Optional): The number of chunks to be created out of data provided for each period.
chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.
chunker (Optional): A NannyML Chunker object that will handle the aggregation of the provided data in order to create chunks.
cv_folds_num (Optional): Number of cross-validation folds to use when calculating the DC discrimination value.
hyperparameters (Optional): A dictionary used to provide your own custom hyperparameters when training the discrimination model. Check out the available hyperparameter options in the LightGBM docs.
tune_hyperparameters (Optional): A boolean controlling whether hypertuning should be performed on the internal regressor models whilst fitting on reference data.
hyperparameter_tuning_config (Optional): A dictionary that allows you to provide a custom hyperparameter tuning configuration when tune_hyperparameters has been set to True. Available options are available in the AutoML FLAML documentation.
threshold (Optional): The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the thresholds tutorial.
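As an illustration of the hyperparameters option, a custom dictionary for the internal LightGBM discriminator might look like the sketch below. The parameter names are standard LightGBM (scikit-learn API) options; the values are arbitrary examples, not recommendations, so check the LightGBM docs before relying on them:

```python
# Hypothetical LightGBM hyperparameters for the internal discriminator,
# to be passed via the calculator's `hyperparameters` argument.
hyperparameters = {
    'n_estimators': 200,    # number of boosting rounds
    'learning_rate': 0.05,  # shrinkage applied per round
    'num_leaves': 31,       # maximum leaves per tree (model complexity)
}
```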
Next, the fit() method needs to be called on the reference data, which the results will be based on. Then the calculate() method will calculate the multivariate drift results on the provided data.
>>> # Define feature columns
>>> feature_column_names = [
... 'car_value',
... 'salary_range',
... 'debt_to_income_ratio',
... 'loan_length',
... 'repaid_loan_on_prev_car',
... 'size_of_downpayment',
... 'driver_tenure'
>>> ]
>>> calc = nml.DomainClassifierCalculator(
... feature_column_names=feature_column_names,
... timestamp_column_name='timestamp',
... chunk_size=5000
>>> )
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
We can view the results computed on the data provided to the calculate() method as a dataframe.
>>> display(results.filter(period='analysis').to_df())
| | chunk key | chunk_index | start_index | end_index | start_date | end_date | period | domain_classifier_auroc | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-10-30 18:00:00 | 2018-11-30 00:27:16.848000 | analysis | 0.502704 | 0.65 | 0.45 | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-11-30 00:36:00 | 2018-12-30 07:03:16.848000 | analysis | 0.49639 | 0.65 | 0.45 | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-12-30 07:12:00 | 2019-01-29 13:39:16.848000 | analysis | 0.490815 | 0.65 | 0.45 | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2019-01-29 13:48:00 | 2019-02-28 20:15:16.848000 | analysis | 0.493005 | 0.65 | 0.45 | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2019-02-28 20:24:00 | 2019-03-31 02:51:16.848000 | analysis | 0.503402 | 0.65 | 0.45 | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2019-03-31 03:00:00 | 2019-04-30 09:27:16.848000 | analysis | 0.913519 | 0.65 | 0.45 | True |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2019-04-30 09:36:00 | 2019-05-30 16:03:16.848000 | analysis | 0.913364 | 0.65 | 0.45 | True |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2019-05-30 16:12:00 | 2019-06-29 22:39:16.848000 | analysis | 0.916356 | 0.65 | 0.45 | True |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2019-06-29 22:48:00 | 2019-07-30 05:15:16.848000 | analysis | 0.913297 | 0.65 | 0.45 | True |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2019-07-30 05:24:00 | 2019-08-29 11:51:16.848000 | analysis | 0.916694 | 0.65 | 0.45 | True |
The drift results from the reference data are accessible from the properties of the results object:
>>> display(results.filter(period='reference').to_df())
| | chunk key | chunk_index | start_index | end_index | start_date | end_date | period | domain_classifier_auroc | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | 2018-01-01 00:00:00 | 2018-01-31 06:27:16.848000 | reference | 0.508085 | 0.65 | 0.45 | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | 2018-01-31 06:36:00 | 2018-03-02 13:03:16.848000 | reference | 0.505428 | 0.65 | 0.45 | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | 2018-03-02 13:12:00 | 2018-04-01 19:39:16.848000 | reference | 0.506587 | 0.65 | 0.45 | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | 2018-04-01 19:48:00 | 2018-05-02 02:15:16.848000 | reference | 0.499824 | 0.65 | 0.45 | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | 2018-05-02 02:24:00 | 2018-06-01 08:51:16.848000 | reference | 0.507135 | 0.65 | 0.45 | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | 2018-06-01 09:00:00 | 2018-07-01 15:27:16.848000 | reference | 0.498486 | 0.65 | 0.45 | False |
| 6 | [30000:34999] | 6 | 30000 | 34999 | 2018-07-01 15:36:00 | 2018-07-31 22:03:16.848000 | reference | 0.501805 | 0.65 | 0.45 | False |
| 7 | [35000:39999] | 7 | 35000 | 39999 | 2018-07-31 22:12:00 | 2018-08-31 04:39:16.848000 | reference | 0.494281 | 0.65 | 0.45 | False |
| 8 | [40000:44999] | 8 | 40000 | 44999 | 2018-08-31 04:48:00 | 2018-09-30 11:15:16.848000 | reference | 0.505302 | 0.65 | 0.45 | False |
| 9 | [45000:49999] | 9 | 45000 | 49999 | 2018-09-30 11:24:00 | 2018-10-30 17:51:16.848000 | reference | 0.502734 | 0.65 | 0.45 | False |
NannyML can also visualize the multivariate drift results in a plot. Our plot contains several key elements.
The purple step plot shows the discrimination value (classifier AUROC) in each chunk of the analysis period. Thick squared point markers indicate the middle of these chunks.
The red horizontal dashed lines show upper and lower thresholds for alerting purposes.
If the discrimination value crosses the upper or lower threshold, an alert is raised. A red, diamond-shaped point marker in the middle of the chunk additionally indicates this.
>>> figure = results.plot()
>>> figure.show()
The multivariate drift results provide a concise summary of where data drift is happening in our input data.
Insights
Using this method of detecting drift, we can identify changes that we may not have seen using solely univariate methods.
What Next
After reviewing the results, we can look at the drift results of individual features to see what changed in each of them.
The Performance Estimation functionality can be used to estimate the impact of the observed changes.