nannyml.drift.ranking module

Module containing ways to rank drifting features.

class nannyml.drift.ranking.AlertCountRanking[source]

Bases: Ranking

Ranks features by the number of drift ‘alerts’ they’ve caused.

ALERT_COLUMN_SUFFIX = '_alert'

rank(drift_calculation_result: UnivariateStatisticalDriftCalculatorResult, only_drifting: bool = False) → DataFrame[source]

Compares the number of alerts for each feature and ranks them accordingly.

Parameters:

drift_calculation_result (pd.DataFrame) – The drift calculation results. Requires alert columns to be present. These are recognized and parsed using the ALERT_COLUMN_SUFFIX pattern, currently equal to '_alert'.
only_drifting (bool, default=False) – Omits features without alerts from the ranking results.

Returns:

feature_ranking – A DataFrame containing the feature names and their ranks (the highest rank starts at 1, second-highest rank is 2, etc.)

Return type:

pd.DataFrame

Examples

>>> import nannyml as nml
>>> from IPython.display import display
>>>
>>> reference_df = nml.load_synthetic_binary_classification_dataset()[0]
>>> analysis_df = nml.load_synthetic_binary_classification_dataset()[1]
>>> target_df = nml.load_synthetic_binary_classification_dataset()[2]
>>>
>>> display(reference_df.head())
>>>
>>> feature_column_names = [
>>>     col for col in reference_df.columns if col not in ['timestamp', 'y_pred_proba', 'period',
>>>                                                        'y_pred', 'repaid']]
>>>
>>> calc = nml.UnivariateStatisticalDriftCalculator(feature_column_names=feature_column_names,
>>>                                                 timestamp_column_name='timestamp')
>>>
>>> calc.fit(reference_df)
>>>
>>> results = calc.calculate(analysis_df.merge(target_df, on='identifier'))
>>>
>>> ranker = nml.Ranker.by('alert_count')
>>> ranked_features = ranker.rank(results, only_drifting=False)
>>> display(ranked_features)
                      feature  number_of_alerts  rank
0                  identifier                10     1
1        distance_from_office                 5     2
2                salary_range                 5     3
3  public_transportation_cost                 5     4
4            wfh_prev_workday                 5     5
5                      tenure                 2     6
6         gas_price_per_litre                 0     7
7                     workday                 0     8
8            work_home_actual                 0     9

class nannyml.drift.ranking.Ranker[source]

Bases: object

Factory class to easily access Ranking implementations.

classmethod by(key: str = 'alert_count', ranking_args: Optional[Dict[str, Any]] = None) → Ranking[source]

Returns a Ranking subclass instance given a key value.

If the provided key equals None, then a new instance of the default Ranking (AlertCountRanking) will be returned.

If a non-existent key is provided an InvalidArgumentsException is raised.

Parameters:

key (str, default='alert_count') – The key used to retrieve a Ranking. When providing a key that is already in the index, the value will be overwritten.
ranking_args (Dict[str, Any], default=None) – A dictionary of arguments that will be passed to the Ranking during creation.

Returns:

ranking – A new instance of a specific Ranking subclass.

Return type:

Ranking

Examples

>>> ranking = Ranker.by('alert_count')

classmethod register(key: str) → Callable[source]

Adds a Ranking to the registry using the provided key.

Just use the decorator above any Ranking subclass to have it automatically registered.

Examples

>>> @Ranker.register('alert_count')
>>> class AlertCountRanking(Ranking):
>>>     pass
>>>
>>> # Use the Ranking
>>> ranker = nml.Ranker.by('alert_count')
>>> ranked_features = ranker.rank(results, only_drifting=False)

registry: Dict[str, Ranking] = {'alert_count': <class 'nannyml.drift.ranking.AlertCountRanking'>}

class nannyml.drift.ranking.Ranking[source]

Bases: ABC

Class that abstracts ranking features by impact on model performance.

rank(drift_calculation_result: UnivariateStatisticalDriftCalculatorResult, only_drifting: bool = False) → DataFrame[source]

Ranks the features within a drift calculation according to impact on model performance.

Parameters:

drift_calculation_result (UnivariateStatisticalDriftCalculatorResult) – The drift calculation results.
only_drifting (bool) – Omits non-drifting features from the ranking if True.

Returns:

feature_ranking – A DataFrame containing at least a feature name and a rank per row.

Return type:

pd.DataFrame