nannyml.drift.ranker module
Module containing ways to rank features according to drift.
This module allows you to rank the columns within a
UnivariateDriftCalculator
result according to their degree of drift.
The following rankers are currently available:
AlertCountRanker
: ranks the features according to the number of drift detection alerts they cause.
CorrelationRanker
: ranks the features according to their correlation with changes in realized or estimated performance.
- class nannyml.drift.ranker.AlertCountRanker[source]
Bases:
object
Ranks the features according to the number of drift detection alerts they cause.
- rank(rankable_result: Union[Result, Result, Result, Result, Result, Result, Result, Result], only_drifting: bool = False) DataFrame [source]
Ranks the features according to the number of drift detection alerts they cause.
- Parameters:
rankable_result (RankableResult) – The result of a univariate drift calculation.
only_drifting (bool, default=False) – Omits features without alerts from the ranking results.
- Returns:
ranking – A DataFrame containing the feature names and their ranks (the highest rank starts at 1, second-highest rank is 2, etc.). Features with the same number of alerts are ranked alphanumerically on the feature name.
- Return type:
pd.DataFrame
Examples
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> analysis_full_df = analysis_df.merge(analysis_targets_df, left_index=True, right_index=True)
>>> feature_column_names = [
...     'car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car',
...     'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred', 'repaid'
... ]
>>> univ_calc = nml.UnivariateDriftCalculator(
...     column_names=feature_column_names,
...     treat_as_categorical=['y_pred', 'repaid'],
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
...     chunk_size=5000
... )
>>> univ_calc.fit(reference_df)
>>> univariate_results = univ_calc.calculate(analysis_full_df)
>>> alert_count_ranker = nml.AlertCountRanker()
>>> alert_count_ranked_features = alert_count_ranker.rank(
...     univariate_results.filter(methods=['jensen_shannon']),
...     only_drifting=False)
>>> display(alert_count_ranked_features)
   number_of_alerts              column_name  rank
0                 5             y_pred_proba     1
1                 5             salary_range     2
2                 5  repaid_loan_on_prev_car     3
3                 5              loan_length     4
4                 0                car_value     5
5                 0                   y_pred     6
6                 0      size_of_downpayment     7
7                 0                   repaid     8
8                 0            driver_tenure     9
9                 0     debt_to_income_ratio    10
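The ranking rule described for AlertCountRanker can be illustrated without NannyML at all. The following is a minimal sketch in plain Python, not the library's implementation: the `alerts` dictionary and the `rank_by_alert_count` helper are hypothetical names introduced here, and breaking ties in ascending name order is an assumption of this sketch (the library's actual tie order may differ).

```python
# Hypothetical per-chunk alert flags for three features (not NannyML output).
alerts = {
    "salary_range": [True, True, False, True],
    "car_value": [False, False, False, False],
    "loan_length": [True, False, True, True],
}


def rank_by_alert_count(alert_flags, only_drifting=False):
    """Rank features by how many chunks raised an alert (rank 1 = most alerts)."""
    # Count the alerts per feature; True counts as 1, False as 0.
    counts = {name: sum(flags) for name, flags in alert_flags.items()}
    if only_drifting:
        # Mirror the only_drifting option: drop features that never alerted.
        counts = {name: count for name, count in counts.items() if count > 0}
    # Most alerts first; ties are broken on the feature name (assumed ascending).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"column_name": name, "number_of_alerts": count, "rank": position}
        for position, (name, count) in enumerate(ordered, start=1)
    ]


ranking = rank_by_alert_count(alerts)
for row in ranking:
    print(row)
```

Calling `rank_by_alert_count(alerts, only_drifting=True)` returns the same ranking without `car_value`, which never raised an alert in this made-up data.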
- class nannyml.drift.ranker.CorrelationRanker[source]
Bases:
object
Ranks the features according to their correlation with changes in realized or estimated performance.
Examples
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> analysis_full_df = analysis_df.merge(analysis_targets_df, left_index=True, right_index=True)
>>> feature_column_names = [
...     'car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car',
...     'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred', 'repaid'
... ]
>>> univ_calc = nml.UnivariateDriftCalculator(
...     column_names=feature_column_names,
...     treat_as_categorical=['y_pred', 'repaid'],
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
...     chunk_size=5000
... )
>>> univ_calc.fit(reference_df)
>>> univariate_results = univ_calc.calculate(analysis_full_df)
>>> realized_calc = nml.PerformanceCalculator(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     problem_type='classification_binary',
...     metrics=['roc_auc', 'recall'],
...     chunk_size=5000)
>>> realized_calc.fit(reference_df)
>>> realized_perf_results = realized_calc.calculate(analysis_full_df)
>>> ranker2 = nml.CorrelationRanker()
>>> # ranker fits on one metric and reference period data only
>>> ranker2.fit(
...     realized_perf_results.filter(period='reference', metrics=['recall']))
>>> # ranker ranks on one drift method and one performance metric
>>> correlation_ranked_features2 = ranker2.rank(
...     univariate_results.filter(period='analysis', methods=['jensen_shannon']),
...     realized_perf_results.filter(period='analysis', metrics=['recall']),
...     only_drifting=False)
>>> display(correlation_ranked_features2)
               column_name  pearsonr_correlation  pearsonr_pvalue  has_drifted  rank
0  repaid_loan_on_prev_car               0.96897      3.90719e-06         True     1
1             y_pred_proba              0.966157      5.50918e-06         True     2
2              loan_length              0.965298      6.08385e-06         True     3
3                car_value              0.963623      7.33185e-06         True     4
4             salary_range              0.963456      7.46561e-06         True     5
5      size_of_downpayment              0.308948         0.385072        False     6
6     debt_to_income_ratio              0.307373         0.387627        False     7
7                   y_pred             -0.357571         0.310383        False     8
8                   repaid             -0.395842         0.257495        False     9
9            driver_tenure             -0.575807        0.0815202        False    10
Creates a new CorrelationRanker instance.
- fit(reference_performance_calculation_result: Optional[Union[Result, Result, Result]] = None) CorrelationRanker [source]
Calculates the average performance during the reference period. This value is stored in the mean_reference_performance property of the ranker.
- Parameters:
reference_performance_calculation_result (Union[CBPEResults, DLEResults, PerformanceCalculationResults]) – Results from any performance calculator or estimator, e.g. PerformanceCalculator, CBPE or DLE.
- Returns:
ranking – The fitted ranker, with mean_reference_performance set.
- Return type:
CorrelationRanker
- rank(rankable_result: Union[Result, Result, Result, Result, Result, Result, Result, Result], performance_result: Optional[Union[Result, Result, Result]] = None, only_drifting: bool = False)[source]
Ranks the features according to how strongly their drift results correlate with changes in realized or estimated performance.
- Parameters:
rankable_result (RankableResult) – The univariate, data quality or simple statistic drift results containing the features we want to rank.
performance_result (PerformanceResults) – Results from any performance calculator or estimator, e.g. PerformanceCalculator, CBPE or DLE.
only_drifting (bool, default=False) – Omits features without alerts from the ranking results.
- Returns:
ranking – A DataFrame containing the feature names and their ranks (the highest rank starts at 1, second-highest rank is 2, etc.). Features with the same number of alerts are ranked alphanumerically on the feature name.
- Return type:
pd.DataFrame
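The fit and rank steps of CorrelationRanker can be sketched in plain Python to show the idea behind them. This is an illustration under stated assumptions, not the library's code: all numbers are made up, the `pearson` helper is hypothetical, and measuring performance change as the absolute difference from the mean reference performance is an assumption of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-chunk metric values (e.g. recall), not NannyML output.
reference_performance = [0.90, 0.91, 0.89, 0.90]
analysis_performance = [0.90, 0.85, 0.80, 0.75]

# Hypothetical per-chunk drift values for two features in the analysis period.
drift_values = {
    "loan_length": [0.02, 0.10, 0.20, 0.30],
    "driver_tenure": [0.05, 0.04, 0.06, 0.05],
}

# "fit": remember the average performance over the reference period.
mean_reference_performance = sum(reference_performance) / len(reference_performance)

# "rank": correlate each feature's drift with the absolute performance change
# (assumed definition of "change" for this sketch).
abs_performance_change = [
    abs(value - mean_reference_performance) for value in analysis_performance
]
correlations = {
    name: pearson(values, abs_performance_change)
    for name, values in drift_values.items()
}

# Highest correlation gets rank 1.
ranking = sorted(correlations, key=correlations.get, reverse=True)
for position, name in enumerate(ranking, start=1):
    print(position, name, round(correlations[name], 3))
```

In this made-up data the drift of `loan_length` grows as performance moves away from its reference level, so it ranks above `driver_tenure`, whose drift stays roughly flat.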