nannyml.drift.ranker module
Module containing ways to rank features according to drift.
This module allows you to rank the columns within a UnivariateDriftCalculator result according to their degree of drift.
The following rankers are currently available:
AlertCountRanker: ranks the features according to the number of drift detection alerts they cause.
CorrelationRanker: ranks the features according to their correlation with changes in realized or estimated performance.
- class nannyml.drift.ranker.AlertCountRanker
Bases: object
Ranks the features according to the number of drift detection alerts they cause.
- rank(rankable_result: Union[nannyml.drift.univariate.result.Result, nannyml.data_quality.missing.result.Result, nannyml.data_quality.unseen.result.Result], only_drifting: bool = False) → pandas.core.frame.DataFrame
Ranks the features according to the number of drift detection alerts they cause.
- Parameters
rankable_result (RankableResult) – The result of a univariate drift or data quality (missing or unseen values) calculation.
only_drifting (bool, default=False) – Omits features without alerts from the ranking results.
- Returns
ranking – A DataFrame containing the feature names and their ranks (the highest rank is 1, the second-highest is 2, and so on). Features with the same number of alerts are ranked alphanumerically on the feature name.
- Return type
pd.DataFrame
Examples
>>> import nannyml as nml
>>> from IPython.display import display
>>>
>>> reference_df, analysis_df, target_df = nml.load_synthetic_binary_classification_dataset()
>>>
>>> display(reference_df.head())
>>>
>>> # Keep only the feature columns
>>> column_names = [
>>>     col for col in reference_df.columns if col not in ['timestamp', 'y_pred_proba', 'period',
>>>                                                        'y_pred', 'work_home_actual', 'identifier']]
>>>
>>> calc = nml.UnivariateDriftCalculator(column_names=column_names,
>>>                                      timestamp_column_name='timestamp')
>>>
>>> # Fit on the reference period, then calculate drift on the analysis period
>>> calc.fit(reference_df)
>>>
>>> results = calc.calculate(analysis_df.merge(target_df, on='identifier'))
>>>
>>> ranker = nml.AlertCountRanker()
>>> # The result is passed positionally; the documented parameter name is rankable_result
>>> ranked_features = ranker.rank(results, only_drifting=False)
>>> display(ranked_features)
   number_of_alerts                 column_name  rank
0                 5            wfh_prev_workday     1
1                 5                salary_range     2
2                 5  public_transportation_cost     3
3                 5        distance_from_office     4
4                 0                     workday     5
5                 0            work_home_actual     6
6                 0                      tenure     7
7                 0         gas_price_per_litre     8
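The ranking rule described above can be sketched without NannyML. This is a minimal illustration, not the library's implementation: the alert flags and the helper rank_by_alert_count are hypothetical, and ties are broken on the feature name in descending order, an assumption chosen only because it matches the sample output above.

```python
# Hypothetical per-chunk alert flags for each feature (True = drift alert).
alerts_per_feature = {
    "salary_range": [False, True, True, False],
    "tenure": [False, False, False, False],
    "distance_from_office": [True, True, False, False],
    "workday": [False, False, False, False],
}


def rank_by_alert_count(alerts, only_drifting=False):
    """Return (column_name, number_of_alerts, rank) rows; rank 1 has the most alerts."""
    counts = [(name, sum(flags)) for name, flags in alerts.items()]
    if only_drifting:
        # Drop features that never raised an alert.
        counts = [(name, n) for name, n in counts if n > 0]
    # Sort on alert count, then feature name, both descending (assumed tie-break).
    counts.sort(key=lambda item: (item[1], item[0]), reverse=True)
    return [(name, n, rank) for rank, (name, n) in enumerate(counts, start=1)]


ranking = rank_by_alert_count(alerts_per_feature)
for row in ranking:
    print(row)
```

Calling rank_by_alert_count(alerts_per_feature, only_drifting=True) keeps only the two features that raised at least one alert, mirroring the only_drifting parameter.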
- class nannyml.drift.ranker.CorrelationRanker
Bases: object
Ranks the features according to their correlation with changes in realized or estimated performance.
- Examples
>>> import nannyml as nml
>>> from IPython.display import display
>>>
>>> reference_df, analysis_df, target_df = nml.load_synthetic_binary_classification_dataset()
>>>
>>> column_names = [col for col in reference_df.columns
>>>                 if col not in ['timestamp', 'y_pred_proba', 'period',
>>>                                'y_pred', 'work_home_actual', 'identifier']]
>>>
>>> univ_calc = nml.UnivariateDriftCalculator(column_names=column_names,
>>>                                           timestamp_column_name='timestamp')
>>>
>>> # Fit the drift calculator on reference data before calculating on analysis data
>>> univ_calc.fit(reference_df)
>>> univariate_results = univ_calc.calculate(analysis_df.merge(target_df, on='identifier'))
>>>
>>> realized_calc = nml.PerformanceCalculator(
>>>     y_pred_proba='y_pred_proba',
>>>     y_pred='y_pred',
>>>     y_true='work_home_actual',
>>>     timestamp_column_name='timestamp',
>>>     problem_type='classification_binary',
>>>     metrics=['roc_auc'])
>>> realized_calc.fit(reference_df)
>>> realized_perf_results = realized_calc.calculate(analysis_df.merge(target_df, on='identifier'))
>>>
>>> ranker = nml.CorrelationRanker()
>>> # ranker fits on one metric and reference period data only
>>> ranker.fit(realized_perf_results.filter(period='reference'))
>>> # ranker ranks on one drift method and one performance metric
>>> correlation_ranked_features = ranker.rank(
>>>     univariate_results,
>>>     realized_perf_results,
>>>     only_drifting=False)
>>> display(correlation_ranked_features)
                  column_name  pearsonr_correlation  pearsonr_pvalue  has_drifted  rank
0            wfh_prev_workday              0.929710     3.076474e-09         True     1
1  public_transportation_cost              0.925910     4.872173e-09         True     2
2                salary_range              0.921556     8.014868e-09         True     3
3        distance_from_office              0.920749     8.762147e-09         True     4
4         gas_price_per_litre              0.340076     1.423541e-01        False     5
5                     workday              0.154622     5.151128e-01        False     6
6            work_home_actual             -0.030899     8.971071e-01        False     7
7                      tenure             -0.177018     4.553046e-01        False     8
Creates a new CorrelationRanker instance.
- fit(reference_performance_calculation_result: Optional[Union[nannyml.performance_estimation.confidence_based.results.Result, nannyml.performance_estimation.direct_loss_estimation.result.Result, nannyml.performance_calculation.result.Result]] = None) → nannyml.drift.ranker.CorrelationRanker
Calculates the average performance during the reference period. This value is stored in the mean_reference_performance property of the ranker.
- Parameters
reference_performance_calculation_result (Union[CBPEResults, DLEResults, PerformanceCalculationResults]) – Results from any performance calculator or estimator, e.g. PerformanceCalculator, CBPE or DLE.
- Returns
ranker – The fitted ranker, with mean_reference_performance set.
- Return type
CorrelationRanker
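As a rough illustration of what fitting amounts to, the sketch below averages a series of per-chunk metric values. The ROC AUC numbers are hypothetical and plain Python stands in for the result object, so treat this as a sketch of the idea rather than the library's code.

```python
from statistics import mean

# Hypothetical per-chunk ROC AUC values from the reference period.
reference_roc_auc = [0.97, 0.96, 0.98, 0.97]

# The baseline against which the performance of later chunks is compared.
mean_reference_performance = mean(reference_roc_auc)
print(round(mean_reference_performance, 4))
```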
- rank(rankable_result: Union[nannyml.drift.univariate.result.Result, nannyml.data_quality.missing.result.Result, nannyml.data_quality.unseen.result.Result], performance_result: Optional[Union[nannyml.performance_estimation.confidence_based.results.Result, nannyml.performance_estimation.direct_loss_estimation.result.Result, nannyml.performance_calculation.result.Result]] = None, only_drifting: bool = False)
Ranks the features according to the correlation between their drift results and changes in realized or estimated performance.
- Parameters
rankable_result (RankableResult) – The univariate, data quality or simple statistic drift results containing the features we want to rank.
performance_result (PerformanceResult) – Results from any performance calculator or estimator, e.g. PerformanceCalculator, CBPE or DLE.
only_drifting (bool, default=False) – Omits features without alerts from the ranking results.
- Returns
ranking – A DataFrame containing the feature names, their Pearson correlation and p-value, a flag showing whether each feature has drifted, and their ranks (the highest rank is 1, the second-highest is 2, and so on).
- Return type
pd.DataFrame
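The idea behind correlation ranking can be sketched in plain Python. Several things here are assumptions rather than documented behaviour: performance change is taken as the absolute deviation from the mean reference performance, features are ordered by signed Pearson correlation in descending order (consistent with the sample output in the class example), and all input values and the helper pearson are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical inputs: mean reference performance, per-chunk performance,
# and per-chunk drift values for two features.
mean_reference_performance = 0.97
performance = [0.97, 0.96, 0.92, 0.90]
drift_values = {
    "salary_range": [0.01, 0.02, 0.10, 0.15],
    "tenure": [0.03, 0.01, 0.02, 0.01],
}

# Assumed definition: absolute deviation of each chunk from the reference mean.
performance_change = [abs(p - mean_reference_performance) for p in performance]

# Correlate each feature's drift values with the performance change.
correlations = {
    name: pearson(values, performance_change)
    for name, values in drift_values.items()
}

# Rank 1 goes to the feature with the highest correlation (assumed ordering).
ranking = sorted(correlations.items(), key=lambda item: item[1], reverse=True)
for rank, (name, corr) in enumerate(ranking, start=1):
    print(rank, name, round(corr, 3))
```

In this toy data the drift of salary_range grows as performance moves away from the reference mean, so it ranks first, while tenure shows no such relationship.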