nannyml.drift.univariate.calculator module

Calculates drift for individual columns.

Supported drift detection methods are:

Kolmogorov-Smirnov statistic (continuous)
Wasserstein distance (continuous)
Chi-squared statistic (categorical)
L-infinity distance (categorical)
Jensen-Shannon distance (continuous and categorical)
Hellinger distance (continuous and categorical)
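To make the categorical distances in this list more concrete, here is a minimal standard-library sketch that computes the L-infinity, Hellinger and Jensen-Shannon distances between the category frequencies of two samples. It uses the textbook definitions (with a base-2 logarithm for Jensen-Shannon, which is an assumption) and is not NannyML's implementation. The helper names `relative_frequencies` and `categorical_distances` are hypothetical. The chi-squared statistic and the continuous methods are not covered.

```python
from collections import Counter
from math import log2, sqrt


def relative_frequencies(sample, categories):
    """Share of each category in the sample, in the order given by `categories`."""
    counts = Counter(sample)
    return [counts[c] / len(sample) for c in categories]


def categorical_distances(reference, chunk):
    """Textbook distances between the category frequencies of two samples."""
    categories = sorted(set(reference) | set(chunk))
    p = relative_frequencies(reference, categories)
    q = relative_frequencies(chunk, categories)

    # L-infinity: largest absolute difference in any single category's share.
    l_infinity = max(abs(a - b) for a, b in zip(p, q))

    # Hellinger: scaled Euclidean distance between square-root frequencies.
    hellinger = sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

    # Jensen-Shannon: square root of the averaged KL divergence to the mixture m.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    js_divergence = (kl(p, m) + kl(q, m)) / 2
    jensen_shannon = sqrt(max(js_divergence, 0.0))  # guard against tiny negatives

    return {
        "l_infinity": l_infinity,
        "hellinger": hellinger,
        "jensen_shannon": jensen_shannon,
    }


reference = ["a"] * 6 + ["b"] * 4
chunk = ["a"] * 3 + ["b"] * 7
print(categorical_distances(reference, chunk))
```

With these definitions all three distances are 0 for identical frequency tables and 1 for samples that share no category.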

For more information, check out the tutorial or the deep dive.

For help selecting the correct univariate drift detection method for your use case, check the method selection guide.

class nannyml.drift.univariate.calculator.UnivariateDriftCalculator(column_names: Union[str, List[str]], treat_as_numerical: Optional[Union[str, List[str]]] = None, treat_as_categorical: Optional[Union[str, List[str]]] = None, timestamp_column_name: Optional[str] = None, categorical_methods: Optional[Union[str, List[str]]] = None, continuous_methods: Optional[Union[str, List[str]]] = None, chunk_size: Optional[int] = None, chunk_number: Optional[int] = None, chunk_period: Optional[str] = None, chunker: Optional[Chunker] = None, thresholds: Optional[Dict[str, Threshold]] = None, computation_params: Optional[dict[str, Any]] = None)

Bases: AbstractCalculator

Calculates drift for individual features.

Creates a new UnivariateDriftCalculator instance.

Parameters:

column_names (Union[str, List[str]]) – A string or list containing the names of features in the provided data set. A drift score will be calculated for each entry in this list.
treat_as_numerical (Union[str, List[str]]) – A single column name or list of column names to be treated as numerical by the calculator.
treat_as_categorical (Union[str, List[str]]) – A single column name or list of column names to be treated as categorical by the calculator.
timestamp_column_name (str) – The name of the column containing the timestamp of the model prediction.
categorical_methods (Union[str, List[str]], default=['jensen_shannon']) –
A method name or list of method names that will be performed on categorical columns. Supported methods for categorical variables:
- jensen_shannon
- chi2
- hellinger
- l_infinity
continuous_methods (Union[str, List[str]], default=['jensen_shannon']) –
A method name or list of method names that will be performed on continuous columns. Supported methods for continuous variables:
- jensen_shannon
- kolmogorov_smirnov
- hellinger
- wasserstein
chunk_size (int) – Splits the data into chunks containing chunk_size observations. Only one of chunk_size, chunk_number or chunk_period should be given.
chunk_number (int) – Splits the data into chunk_number pieces. Only one of chunk_size, chunk_number or chunk_period should be given.
chunk_period (str) – Splits the data according to the given period. Only one of chunk_size, chunk_number or chunk_period should be given.
chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.
thresholds (dict) –
Defaults to:
```
{
    'kolmogorov_smirnov': StandardDeviationThreshold(std_lower_multiplier=None),
    'jensen_shannon': StandardDeviationThreshold(std_lower_multiplier=None),
    'wasserstein': StandardDeviationThreshold(std_lower_multiplier=None),
    'hellinger': StandardDeviationThreshold(std_lower_multiplier=None),
    'l_infinity': StandardDeviationThreshold(std_lower_multiplier=None),
}
```
A dictionary that maps a method name to a Threshold subclass, allowing users to set a custom threshold for each method. This dictionary is optional. When given, its values override the default values shown above; otherwise the defaults apply.
The chi2 method does not currently support custom thresholds. Additional research is required to determine how to transition from its current p-value based implementation.
computation_params (dict) –
Defaults to:
```
{
    'kolmogorov_smirnov': {
        'calculation_method': 'auto',
        'n_bins': 10_000
    },
    'wasserstein': {
        'calculation_method': 'auto',
        'n_bins': 10_000
    }
}
```
A dictionary which allows users to specify whether they want drift calculated on the exact reference data or an estimated distribution of the reference data obtained using binning techniques. Applicable only to Kolmogorov-Smirnov and Wasserstein.

calculation_method : Specifies whether the entire reference data or a binned version of it is stored.
The default value is auto.
- auto : Use exact for reference data smaller than 10 000 rows, estimated for larger.
- exact : Store the whole reference data.
  
  When calculating on a chunk, scipy.stats.ks_2samp(reference, chunk, method='exact') is called with the whole reference and chunk vectors.
- estimated : Store reference data binned into n_bins (default=10 000).
  
  The D-statistic is calculated from the binned eCDF. Bins are quantile-based for Kolmogorov-Smirnov and equal-width for Wasserstein. Note that for reference data of 10 000 rows (with the default n_bins of 10 000), the D-statistic should be the same for the exact and estimated methods. The p-value for this method is calculated using the asymptotic distribution of the test statistic (as in scipy.stats.ks_2samp with method='asymp').
n_bins : Number of bins used to bin data when calculation_method = estimated.

The default value is 10 000. The larger the value, the more precise the calculation (closer to calculation_method = exact), but the more data is stored in the fitted calculator.
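
The trade-off between the exact and estimated calculation methods can be illustrated with a small standard-library sketch. It resolves `auto` using the row-count rule above (how a reference of exactly 10 000 rows is handled is an assumption) and compares a Kolmogorov-Smirnov D-statistic computed from the full reference data with one computed from quantile-based bin edges only. This is one simplified reading of "binned eCDF", not NannyML's implementation. All function names are hypothetical, and p-values are left out.

```python
from bisect import bisect_right
import random


def resolve_calculation_method(method, n_reference_rows):
    """Resolve 'auto' with the row-count rule; the exact boundary is assumed."""
    if method != "auto":
        return method
    return "exact" if n_reference_rows < 10_000 else "estimated"


def ecdf(sorted_sample, x):
    """Fraction of observations that are <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def d_statistic_exact(reference, chunk):
    """Largest gap between the two eCDFs, using every reference value."""
    ref, chk = sorted(reference), sorted(chunk)
    return max(abs(ecdf(ref, x) - ecdf(chk, x)) for x in ref + chk)


def fit_binned_reference(reference, n_bins):
    """Keep only quantile-based bin edges and the reference eCDF at each edge."""
    ref = sorted(reference)
    last = len(ref) - 1
    indices = sorted({i * last // n_bins for i in range(n_bins + 1)})
    edges = [ref[i] for i in indices]
    return edges, [ecdf(ref, edge) for edge in edges]


def d_statistic_estimated(binned_reference, chunk):
    """Largest eCDF gap when the reference is known only at its bin edges."""
    edges, cdf_at_edges = binned_reference
    chk = sorted(chunk)

    def reference_cdf(x):
        k = bisect_right(edges, x)  # number of stored edges <= x
        return cdf_at_edges[k - 1] if k else 0.0

    return max(abs(reference_cdf(x) - ecdf(chk, x)) for x in edges + chk)


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
chunk = [rng.gauss(0.5, 1.0) for _ in range(200)]

exact = d_statistic_exact(reference, chunk)
coarse = d_statistic_estimated(fit_binned_reference(reference, n_bins=20), chunk)
fine = d_statistic_estimated(fit_binned_reference(reference, n_bins=500), chunk)
print(f"exact={exact:.4f} coarse={coarse:.4f} fine={fine:.4f}")
```

In this sketch the estimated statistic matches the exact one once n_bins reaches the number of reference rows, because every reference value then becomes a bin edge. With fewer bins it drifts from the exact value by at most the share of reference rows that fall in a single bin.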

Examples

>>> import nannyml as nml
>>> reference, analysis, _ = nml.load_synthetic_car_price_dataset()
>>> column_names = [col for col in reference.columns if col not in ['timestamp', 'y_pred', 'y_true']]
>>> calc = nml.UnivariateDriftCalculator(
...   column_names=column_names,
...   timestamp_column_name='timestamp',
...   continuous_methods=['kolmogorov_smirnov', 'jensen_shannon', 'wasserstein'],
...   categorical_methods=['chi2', 'jensen_shannon', 'l_infinity'],
... ).fit(reference)
>>> res = calc.calculate(analysis)
>>> res = res.filter(period='analysis')
>>> for column_name in res.continuous_column_names:
...     for method in res.continuous_method_names:
...         res.plot(kind='drift', column_name=column_name, method=method).show()
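
The default thresholds listed under Parameters all pass std_lower_multiplier=None. One way to read this is as an alert band with an upper bound only, which suits distance-like drift values where a small value indicates little drift. The standard-library sketch below illustrates that reading. The upper multiplier of 3, the use of the population standard deviation, and deriving the band from per-chunk reference values are all assumptions here. `std_threshold_bounds` and `is_alert` are hypothetical helpers, not part of NannyML.

```python
from statistics import mean, pstdev


def std_threshold_bounds(reference_values, std_lower_multiplier=3, std_upper_multiplier=3):
    """Return (lower, upper) around the reference mean; a None multiplier disables that side."""
    mu = mean(reference_values)
    sigma = pstdev(reference_values)  # population standard deviation (assumed)
    lower = None if std_lower_multiplier is None else mu - std_lower_multiplier * sigma
    upper = None if std_upper_multiplier is None else mu + std_upper_multiplier * sigma
    return lower, upper


def is_alert(value, lower, upper):
    """Flag a value that falls outside whichever bounds are enabled."""
    below = lower is not None and value < lower
    above = upper is not None and value > upper
    return below or above


# Hypothetical per-chunk drift values measured on reference data.
reference_scores = [0.02, 0.03, 0.025, 0.035, 0.03]
lower, upper = std_threshold_bounds(reference_scores, std_lower_multiplier=None)

for score in (0.0, 0.03, 0.2):
    print(score, is_alert(score, lower, upper))
```

Because the lower bound is disabled, a drift value of 0.0 never raises an alert in this sketch, while a value well above the reference mean does.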