nannyml.drift.univariate.calculator module

Calculates drift for individual features using the Kolmogorov-Smirnov and chi2-contingency statistical tests.

class nannyml.drift.univariate.calculator.UnivariateDriftCalculator(column_names: Union[str, List[str]], treat_as_categorical: Optional[Union[str, List[str]]] = None, timestamp_column_name: Optional[str] = None, categorical_methods: Optional[Union[str, List[str]]] = None, continuous_methods: Optional[Union[str, List[str]]] = None, chunk_size: Optional[int] = None, chunk_number: Optional[int] = None, chunk_period: Optional[str] = None, chunker: Optional[nannyml.chunk.Chunker] = None, thresholds: Optional[Dict[str, nannyml.thresholds.Threshold]] = None, computation_params: Optional[dict[str, Any]] = None)

Bases: nannyml.base.AbstractCalculator

Calculates drift for individual features.

Creates a new UnivariateDriftCalculator instance.

Parameters

column_names (Union[str, List[str]]) – A string or list containing the names of features in the provided data set. A drift score will be calculated for each entry in this list.
treat_as_categorical (Union[str, List[str]]) – A single column name or list of column names to be treated as categorical by the calculator.
timestamp_column_name (str) – The name of the column containing the timestamp of the model prediction.
categorical_methods (Union[str, List[str]], default=['jensen_shannon']) – A method name or list of method names that will be performed on categorical columns.
continuous_methods (Union[str, List[str]], default=['jensen_shannon']) – A method name or list of method names that will be performed on continuous columns.
chunk_size (int) – Splits the data into chunks containing chunk_size observations. Only one of chunk_size, chunk_number or chunk_period should be given.
chunk_number (int) – Splits the data into chunk_number pieces. Only one of chunk_size, chunk_number or chunk_period should be given.
chunk_period (str) – Splits the data according to the given period. Only one of chunk_size, chunk_number or chunk_period should be given.
chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.
thresholds (dict, default={ 'kolmogorov_smirnov': StandardDeviationThreshold(std_lower_multiplier=None), 'jensen_shannon': ConstantThreshold(upper=0.1), 'wasserstein': StandardDeviationThreshold(std_lower_multiplier=None), 'hellinger': ConstantThreshold(upper=0.1), 'l_infinity': ConstantThreshold(upper=0.1) }) –
A dictionary allowing users to set a custom threshold for each method. It links a Threshold subclass to a method name. This dictionary is optional. When a dictionary is given its values will override the default values. If no dictionary is given a default will be applied. The default method thresholds are as follows:
- kolmogorov_smirnov: StandardDeviationThreshold(std_lower_multiplier=None)
- jensen_shannon: ConstantThreshold(upper=0.1)
- wasserstein: StandardDeviationThreshold(std_lower_multiplier=None)
- hellinger: ConstantThreshold(upper=0.1)
- l_infinity: ConstantThreshold(upper=0.1)
The chi2 method does not support custom thresholds for now. Additional research is required to determine how to transition from its current p-value based implementation.
computation_params (dict, default={'kolmogorov_smirnov': {'calculation_method': {'auto', 'exact', 'estimated'}, 'n_bins': 10 000}, 'wasserstein': {'calculation_method': {'auto', 'exact', 'estimated'}, 'n_bins': 10 000}}) –

A dictionary which allows users to specify whether they want drift calculated on the exact reference data or an estimated distribution of the reference data obtained using binning techniques. Applicable only to Kolmogorov-Smirnov and Wasserstein.
calculation_method : Specifies whether the entire reference data or a binned version of it will be stored.
The default value is auto.
- auto : Use exact for reference data smaller than 10 000 rows, estimated for larger.
- exact : Store the whole reference data.
  When calculating on a chunk, scipy.stats.ks_2samp(reference, chunk, method='exact') is called and the whole reference and chunk vectors are passed.
- estimated : Store reference data binned into n_bins (default=10 000).
  The D-statistic will be calculated based on the binned eCDF. Bins are quantile-based for Kolmogorov-Smirnov and equal-width based for Wasserstein. Note that for reference data of 10 000 rows the resulting D-statistic for the exact and estimated methods should be the same. The p-value in that method is calculated using the asymptotic distribution of the test statistic (as in scipy.stats.ks_2samp with method='asymp').
n_bins : Number of bins used to bin data when calculation_method = estimated.
The default value is 10 000. The larger the value the more precise the calculation (closer to calculation_method = exact ) but more data will be stored in the fitted calculator.
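
The trade-off between the exact and estimated calculation methods can be illustrated with a small, standard-library-only sketch. It is not NannyML's implementation: the helper quantile_summary is hypothetical and simply keeps one quantile point per bin, and the data sizes (2 000 reference rows, 500 chunk rows) are chosen only to keep the example fast. Under those assumptions it shows that the binned D-statistic matches the exact one when n_bins equals the number of reference rows, and stays close to it when fewer bins are stored.

```python
import bisect
import random


def ecdf(sorted_values, x):
    """Empirical CDF: fraction of sorted_values that are <= x."""
    return bisect.bisect_right(sorted_values, x) / len(sorted_values)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D-statistic: the largest gap between the two eCDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both eCDFs are step functions, so the largest gap occurs at an observed value.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


def quantile_summary(reference, n_bins):
    """Hypothetical helper: keep the upper edge of each of n_bins quantile bins."""
    ref = sorted(reference)
    n = len(ref)
    # Index of the last observation in the k-th bin, i.e. ceil(k * n / n_bins) - 1.
    return [ref[(k * n + n_bins - 1) // n_bins - 1] for k in range(1, n_bins + 1)]


rng = random.Random(0)  # fixed seed so the run is reproducible
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
chunk = [rng.gauss(0.3, 1.0) for _ in range(500)]  # shifted mean simulates drift

d_exact = ks_statistic(reference, chunk)                          # whole reference stored
d_full = ks_statistic(quantile_summary(reference, 2000), chunk)   # n_bins == number of rows
d_coarse = ks_statistic(quantile_summary(reference, 100), chunk)  # 20x less data stored

print(f"exact={d_exact:.4f}  2000 bins={d_full:.4f}  100 bins={d_coarse:.4f}")
```

In this sketch the binned eCDF differs from the full eCDF by less than 1 / n_bins at every point, which is why a larger n_bins brings the estimated result closer to the exact one at the cost of storing more values.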

Examples

>>> import nannyml as nml
>>> reference, analysis, _ = nml.load_synthetic_car_price_dataset()
>>> column_names = [col for col in reference.columns if col not in ['timestamp', 'y_pred', 'y_true']]
>>> calc = nml.UnivariateDriftCalculator(
...   column_names=column_names,
...   timestamp_column_name='timestamp',
...   continuous_methods=['kolmogorov_smirnov', 'jensen_shannon', 'wasserstein'],
...   categorical_methods=['chi2', 'jensen_shannon', 'l_infinity'],
... ).fit(reference)
>>> res = calc.calculate(analysis)
>>> res = res.filter(period='analysis')
>>> for column_name in res.continuous_column_names:
...     for method in res.continuous_method_names:
...         res.plot(kind='drift', column_name=column_name, method=method).show()
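
The two kinds of default threshold listed under the thresholds parameter behave differently, which the following standard-library sketch tries to make concrete. It is an illustration, not NannyML's code: the function names are hypothetical, the upper multiplier of 3 and the use of the population standard deviation are assumptions, and the per-chunk drift values are made up for the example. A constant threshold is a fixed limit, while a standard-deviation threshold is derived from the drift values observed on the reference chunks. Passing std_lower_multiplier=None, as in the defaults above, leaves the lower side unbounded.

```python
import statistics


def constant_threshold(upper):
    """Fixed upper limit that does not depend on the reference results."""
    return None, upper


def std_threshold(reference_values, std_lower_multiplier=3, std_upper_multiplier=3):
    """Limits placed a number of standard deviations away from the reference mean.

    The multiplier default of 3 and the population standard deviation are assumptions.
    """
    mean = statistics.mean(reference_values)
    std = statistics.pstdev(reference_values)
    lower = None if std_lower_multiplier is None else mean - std_lower_multiplier * std
    upper = None if std_upper_multiplier is None else mean + std_upper_multiplier * std
    return lower, upper


def alerts(values, lower, upper):
    """Flag every value that falls outside the (optional) lower and upper limits."""
    return [
        (lower is not None and v < lower) or (upper is not None and v > upper)
        for v in values
    ]


# Made-up per-chunk Kolmogorov-Smirnov statistics on the reference period.
reference_ks = [0.010, 0.012, 0.011, 0.013, 0.009]
ks_lower, ks_upper = std_threshold(reference_ks, std_lower_multiplier=None)
ks_alerts = alerts([0.012, 0.015, 0.030], ks_lower, ks_upper)

# Made-up per-chunk Jensen-Shannon distances checked against a constant limit of 0.1.
js_lower, js_upper = constant_threshold(upper=0.1)
js_alerts = alerts([0.05, 0.12], js_lower, js_upper)

print(ks_lower, round(ks_upper, 4), ks_alerts, js_alerts)
```

With the real calculator, the equivalent choice is made by passing a dictionary such as {'jensen_shannon': ConstantThreshold(upper=0.1)} to the thresholds parameter, as described in the parameter list.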