nannyml.drift.ranker module

Module containing ways to rank features according to drift.

This module allows you to rank the columns within a UnivariateDriftCalculator result according to their degree of drift.

The following rankers are currently available:

  • AlertCountRanker: ranks the features according to the number of drift detection alerts they cause.

  • CorrelationRanker: ranks the features according to their correlation with changes in realized or estimated performance.
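
Both rankers expose a rank() method; the CorrelationRanker additionally has to be fitted on reference-period performance results first. A minimal sketch of the shared pattern, assuming univariate_results and performance_results are precomputed result objects of the kinds produced in the full examples below:

>>> # Rank by alert count: no fitting required.
>>> alert_ranker = nml.AlertCountRanker()
>>> alert_ranking = alert_ranker.rank(univariate_results)
>>>
>>> # Rank by correlation with performance: fit on the reference period first.
>>> corr_ranker = nml.CorrelationRanker()
>>> corr_ranker.fit(performance_results.filter(period='reference'))
>>> corr_ranking = corr_ranker.rank(univariate_results, performance_results)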

class nannyml.drift.ranker.AlertCountRanker[source]

Bases: object

Ranks the features according to the number of drift detection alerts they cause.

rank(drift_calculation_result: nannyml.drift.univariate.result.Result, only_drifting: bool = False) → pandas.core.frame.DataFrame[source]

Ranks the features according to the number of drift detection alerts they cause.

Parameters
  • drift_calculation_result (nannyml.drift.univariate.Result) – The result of a univariate drift calculation.

  • only_drifting (bool, default=False) – Omits features without alerts from the ranking results.

Returns

ranking – A DataFrame containing the feature names and their ranks (rank 1 is the feature with the most alerts, rank 2 the second-most, and so on). Features with the same number of alerts are ranked alphanumerically by feature name.

Return type

pd.DataFrame

Examples

>>> import nannyml as nml
>>> from IPython.display import display
>>>
>>> reference_df, analysis_df, target_df = nml.load_synthetic_binary_classification_dataset()
>>>
>>> display(reference_df.head())
>>>
>>> column_names = [
>>>     col for col in reference_df.columns if col not in ['timestamp', 'y_pred_proba', 'period',
>>>                                                        'y_pred', 'work_home_actual', 'identifier']]
>>>
>>> calc = nml.UnivariateDriftCalculator(column_names=column_names,
>>>     timestamp_column_name='timestamp')
>>>
>>> calc.fit(reference_df)
>>>
>>> results = calc.calculate(analysis_df.merge(target_df, on='identifier'))
>>>
>>> ranker = nml.AlertCountRanker()
>>> ranked_features = ranker.rank(drift_calculation_result=results, only_drifting=False)
>>> display(ranked_features)
        number_of_alerts                 column_name  rank
0                      5            wfh_prev_workday     1
1                      5                salary_range     2
2                      5  public_transportation_cost     3
3                      5        distance_from_office     4
4                      0                     workday     5
5                      0            work_home_actual     6
6                      0                      tenure     7
7                      0         gas_price_per_litre     8
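
Setting only_drifting=True omits the alert-free features from the ranking. A minimal sketch, reusing the ranker and results objects from the example above:

>>> # Keep only the features that raised at least one drift alert.
>>> drifting_only = ranker.rank(drift_calculation_result=results, only_drifting=True)
>>> display(drifting_only)
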
class nannyml.drift.ranker.CorrelationRanker[source]

Bases: object

Ranks the features according to their correlation with changes in realized or estimated performance.

Examples

>>> import nannyml as nml
>>> from IPython.display import display
>>>
>>> reference_df, analysis_df, target_df = nml.load_synthetic_binary_classification_dataset()
>>>
>>> column_names = [col for col in reference_df.columns
>>>                 if col not in ['timestamp', 'y_pred_proba', 'period',
>>>                                'y_pred', 'work_home_actual', 'identifier']]
>>>
>>> univ_calc = nml.UnivariateDriftCalculator(column_names=column_names,
>>>                                           timestamp_column_name='timestamp')
>>>
>>> univ_calc.fit(reference_df)
>>> univariate_results = univ_calc.calculate(analysis_df.merge(target_df, on='identifier'))
>>>
>>> realized_calc = nml.PerformanceCalculator(
>>>     y_pred_proba='y_pred_proba',
>>>     y_pred='y_pred',
>>>     y_true='work_home_actual',
>>>     timestamp_column_name='timestamp',
>>>     problem_type='classification_binary',
>>>     metrics=['roc_auc'])
>>> realized_calc.fit(reference_df)
>>> realized_perf_results = realized_calc.calculate(analysis_df.merge(target_df, on='identifier'))
>>>
>>> ranker = nml.CorrelationRanker()
>>> # ranker fits on one metric and reference period data only
>>> ranker.fit(realized_perf_results.filter(period='reference'))
>>> # ranker ranks on one drift method and one performance metric
>>> correlation_ranked_features = ranker.rank(
>>>     univariate_results,
>>>     realized_perf_results,
>>>     only_drifting = False)
>>> display(correlation_ranked_features)
                  column_name  pearsonr_correlation  pearsonr_pvalue  has_drifted  rank
0            wfh_prev_workday              0.929710     3.076474e-09         True     1
1  public_transportation_cost              0.925910     4.872173e-09         True     2
2                salary_range              0.921556     8.014868e-09         True     3
3        distance_from_office              0.920749     8.762147e-09         True     4
4         gas_price_per_litre              0.340076     1.423541e-01        False     5
5                     workday              0.154622     5.151128e-01        False     6
6            work_home_actual             -0.030899     8.971071e-01        False     7
7                      tenure             -0.177018     4.553046e-01        False     8
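
When analysis-period targets are unavailable, estimated performance results can be used in place of realized ones. A hedged sketch using the CBPE estimator; the configuration mirrors the realized calculator above and is an assumption, not part of the original example:

>>> # Sketch: rank against estimated performance from CBPE instead of
>>> # realized performance. Column configuration mirrors the calculator above.
>>> estimator = nml.CBPE(
>>>     y_pred_proba='y_pred_proba',
>>>     y_pred='y_pred',
>>>     y_true='work_home_actual',
>>>     timestamp_column_name='timestamp',
>>>     problem_type='classification_binary',
>>>     metrics=['roc_auc'])
>>> estimator.fit(reference_df)
>>> estimated_perf_results = estimator.estimate(analysis_df)
>>>
>>> ranker = nml.CorrelationRanker()
>>> ranker.fit(estimated_perf_results.filter(period='reference'))
>>> estimated_ranking = ranker.rank(univariate_results, estimated_perf_results)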

Creates a new CorrelationRanker instance.

fit(reference_performance_calculation_result: Optional[Union[nannyml.performance_estimation.confidence_based.results.Result, nannyml.performance_estimation.direct_loss_estimation.result.Result, nannyml.performance_calculation.result.Result]] = None) → nannyml.drift.ranker.CorrelationRanker[source]

Calculates the average performance over the reference period. This value is stored in the mean_reference_performance property of the ranker.

Parameters

reference_performance_calculation_result (Union[CBPEResults, DLEResults, PerformanceCalculationResults]) – Results from any performance calculator or estimator, e.g. PerformanceCalculator, CBPE, or DLE.

Returns

ranker – The fitted CorrelationRanker instance.

Return type

CorrelationRanker
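
A minimal sketch of fit, reusing realized_perf_results from the example above; the average reference performance computed during fitting is then readable from the mean_reference_performance property:

>>> ranker = nml.CorrelationRanker()
>>> ranker.fit(realized_perf_results.filter(period='reference'))
>>> # Average performance over the reference period, computed during fit.
>>> print(ranker.mean_reference_performance)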

rank(drift_calculation_result: nannyml.drift.univariate.result.Result, performance_calculation_result: Optional[Union[nannyml.performance_estimation.confidence_based.results.Result, nannyml.performance_estimation.direct_loss_estimation.result.Result, nannyml.performance_calculation.result.Result]] = None, only_drifting: bool = False)[source]

Ranks the features according to the correlation between their univariate drift results and changes in realized or estimated performance.

Parameters
  • drift_calculation_result (UnivariateResults) – The univariate drift results containing the features we want to rank.

  • performance_calculation_result (Union[CBPEResults, DLEResults, PerformanceCalculationResults]) – Results from any performance calculator or estimator, e.g. PerformanceCalculator, CBPE, or DLE.

  • only_drifting (bool, default=False) – Omits features without alerts from the ranking results.

Returns

ranking – A DataFrame containing the feature names, the correlation of their drift results with performance, whether they drifted, and their ranks (rank 1 is the feature whose drift correlates most strongly with the change in performance).

Return type

pd.DataFrame
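
As with the AlertCountRanker, only_drifting=True restricts the output to features that raised drift alerts. A minimal sketch, reusing the fitted ranker and the results from the example above:

>>> drifting_only = ranker.rank(
>>>     univariate_results,
>>>     realized_perf_results,
>>>     only_drifting=True)
>>> display(drifting_only)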