nannyml.drift.ranker module

Module containing ways to rank features according to drift.

This module allows you to rank the columns within a UnivariateDriftCalculator result according to their degree of drift.

The following rankers are currently available:

AlertCountRanker: ranks the features according to the number of drift detection alerts they cause.
CorrelationRanker: ranks the features according to their correlation with changes in realized or estimated performance.

class nannyml.drift.ranker.AlertCountRanker

Bases: object

Ranks the features according to the number of drift detection alerts they cause.

rank(rankable_result: Union[Result, Result, Result, Result, Result, Result, Result, Result], only_drifting: bool = False) → DataFrame

Ranks the features according to the number of drift detection alerts they cause.

Parameters:

rankable_result (RankableResult) – The result of a univariate drift calculation.
only_drifting (bool, default=False) – Omits features without alerts from the ranking results.

Returns:

ranking – A DataFrame containing the feature names and their ranks (rank 1 is the highest, rank 2 the second-highest, and so on). Features with the same number of alerts are ranked alphanumerically on the feature name.

Return type:

pd.DataFrame

Examples

>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> analysis_full_df = analysis_df.merge(analysis_targets_df, left_index=True, right_index=True)
>>> feature_column_names = [
...     'car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car',
...     'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred', 'repaid'
... ]
>>> univ_calc = nml.UnivariateDriftCalculator(
...     column_names=feature_column_names,
...     treat_as_categorical=['y_pred', 'repaid'],
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
...     chunk_size=5000
... )
>>> univ_calc.fit(reference_df)
>>> univariate_results = univ_calc.calculate(analysis_full_df)
>>> alert_count_ranker = nml.AlertCountRanker()
>>> alert_count_ranked_features = alert_count_ranker.rank(
...     univariate_results.filter(methods=['jensen_shannon']),
...     only_drifting=False)
>>> display(alert_count_ranked_features)
        number_of_alerts                 column_name  rank
0                      5                y_pred_proba     1
1                      5                salary_range     2
2                      5     repaid_loan_on_prev_car     3
3                      5                 loan_length     4
4                      0                   car_value     5
5                      0                      y_pred     6
6                      0         size_of_downpayment     7
7                      0                      repaid     8
8                      0               driver_tenure     9
9                      0        debt_to_income_ratio     10
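
The ranking rule described above can be illustrated without NannyML. The following is a minimal sketch, not the library's implementation: the function `rank_by_alert_count` and the `toy_alerts` data are hypothetical, and breaking ties by ascending column name is an assumption about how "alphanumerically" is applied.

```python
from typing import Dict, List


def rank_by_alert_count(
    alerts: Dict[str, List[bool]], only_drifting: bool = False
) -> List[dict]:
    """Rank columns by their number of alerts (rank 1 = most alerts)."""
    # Count the alert flags raised for each column across all chunks.
    counts = {name: sum(flags) for name, flags in alerts.items()}
    if only_drifting:
        # Drop columns that never raised an alert.
        counts = {name: n for name, n in counts.items() if n > 0}
    # Most alerts first; ties broken by column name (ascending is an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"column_name": name, "number_of_alerts": n, "rank": position}
        for position, (name, n) in enumerate(ordered, start=1)
    ]


# Hypothetical per-chunk alert flags, not taken from the car loan dataset.
toy_alerts = {
    "salary_range": [False, True, True],
    "car_value": [False, False, False],
    "loan_length": [True, True, False],
}
ranking = rank_by_alert_count(toy_alerts)
drifting_only = rank_by_alert_count(toy_alerts, only_drifting=True)
```

With `only_drifting=True`, `car_value` is omitted because it never raised an alert, mirroring the parameter described above.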

class nannyml.drift.ranker.CorrelationRanker

Bases: object

Ranks the features according to their correlation with changes in realized or estimated performance.

Examples

>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> analysis_full_df = analysis_df.merge(analysis_targets_df, left_index=True, right_index=True)
>>> feature_column_names = [
...     'car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length', 'repaid_loan_on_prev_car',
...     'size_of_downpayment', 'driver_tenure', 'y_pred_proba', 'y_pred', 'repaid'
... ]
>>> univ_calc = nml.UnivariateDriftCalculator(
...     column_names=feature_column_names,
...     treat_as_categorical=['y_pred', 'repaid'],
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
...     categorical_methods=['chi2', 'jensen_shannon'],
...     chunk_size=5000
... )
>>> univ_calc.fit(reference_df)
>>> univariate_results = univ_calc.calculate(analysis_full_df)
>>> realized_calc = nml.PerformanceCalculator(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     problem_type='classification_binary',
...     metrics=['roc_auc', 'recall'],
...     chunk_size=5000)
>>> realized_calc.fit(reference_df)
>>> realized_perf_results = realized_calc.calculate(analysis_full_df)
>>> ranker2 = nml.CorrelationRanker()
>>> # ranker fits on one metric and reference period data only
>>> ranker2.fit(
...     realized_perf_results.filter(period='reference', metrics=['recall']))
>>> # ranker ranks on one drift method and one performance metric
>>> correlation_ranked_features2 = ranker2.rank(
...     univariate_results.filter(period='analysis', methods=['jensen_shannon']),
...     realized_perf_results.filter(period='analysis', metrics=['recall']),
...     only_drifting=False)
>>> display(correlation_ranked_features2)
                  column_name  pearsonr_correlation  pearsonr_pvalue  has_drifted  rank
0     repaid_loan_on_prev_car               0.96897      3.90719e-06         True     1
1                y_pred_proba              0.966157      5.50918e-06         True     2
2                 loan_length              0.965298      6.08385e-06         True     3
3                   car_value              0.963623      7.33185e-06         True     4
4                salary_range              0.963456      7.46561e-06         True     5
5         size_of_downpayment              0.308948         0.385072        False     6
6        debt_to_income_ratio              0.307373         0.387627        False     7
7                      y_pred             -0.357571         0.310383        False     8
8                      repaid             -0.395842         0.257495        False     9
9               driver_tenure             -0.575807        0.0815202        False     10

Creates a new CorrelationRanker instance.

fit(reference_performance_calculation_result: Optional[Union[Result, Result, Result]] = None) → CorrelationRanker

Calculates the average performance during the reference period. This value is stored in the mean_reference_performance property of the ranker.

Parameters:

reference_performance_calculation_result (Union[CBPEResults, DLEResults, PerformanceCalculationResults]) – Results from any performance calculator or estimator, e.g. PerformanceCalculator, CBPE, or DLE.

Returns:

ranking – The fitted ranker.

Return type:

CorrelationRanker

rank(rankable_result: Union[Result, Result, Result, Result, Result, Result, Result, Result], performance_result: Optional[Union[Result, Result, Result]] = None, only_drifting: bool = False)

Ranks the features according to the correlation between their drift results and changes in realized or estimated performance.

Parameters:

rankable_result (RankableResult) – The univariate, data quality or simple statistic drift results containing the features we want to rank.
performance_result (PerformanceResults) – Results from any performance calculator or estimator, e.g. PerformanceCalculator, CBPE, or DLE.
only_drifting (bool, default=False) – Omits features without alerts from the ranking results.

Returns:

ranking – A DataFrame containing the feature names, their Pearson correlation with the performance change and its p-value, whether each feature has drifted, and their ranks (rank 1 is the highest, rank 2 the second-highest, and so on).

Return type:

pd.DataFrame
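
The fit and rank steps can be sketched in plain Python to show how the two pieces fit together. This is an illustration under stated assumptions, not the library's code: the helper names and toy values are hypothetical, the use of the absolute difference from the mean reference performance is an assumption, and sorting by signed correlation (descending) is chosen only because it matches the example output above.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = math.sqrt(var_x * var_y)
    # This sketch treats an undefined correlation (constant series) as 0.0.
    return cov / denominator if denominator > 0 else 0.0


def rank_by_correlation(
    drift_values: Dict[str, List[float]],
    analysis_performance: List[float],
    mean_reference_performance: float,
) -> List[dict]:
    """Rank columns by how strongly their drift tracks performance change."""
    # Assumption: performance change is the absolute distance from the
    # mean reference performance, computed per analysis chunk.
    abs_change = [abs(p - mean_reference_performance) for p in analysis_performance]
    scores = {
        name: pearson(values, abs_change) for name, values in drift_values.items()
    }
    # Highest correlation first (signed, as in the example output above).
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    return [
        {"column_name": name, "pearsonr_correlation": r, "rank": position}
        for position, (name, r) in enumerate(ordered, start=1)
    ]


# "fit": average performance over hypothetical reference chunks.
reference_recall = [0.90, 0.92, 0.91]
mean_reference = sum(reference_recall) / len(reference_recall)

# "rank": hypothetical per-chunk drift values and analysis performance.
analysis_recall = [0.91, 0.85, 0.80, 0.75]
toy_drift = {
    "feature_a": [0.02, 0.10, 0.20, 0.30],
    "feature_b": [0.30, 0.20, 0.10, 0.05],
}
correlation_ranking = rank_by_correlation(toy_drift, analysis_recall, mean_reference)
```

Here `feature_a` drifts more as recall moves away from its reference mean, so it is ranked first; `feature_b` moves in the opposite direction and is ranked second.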