nannyml.drift.univariate.calculator module

Calculates drift for individual features, for example using the Kolmogorov-Smirnov test for continuous features and the chi-squared contingency test for categorical features.

class nannyml.drift.univariate.calculator.UnivariateDriftCalculator(column_names: Union[str, List[str]], timestamp_column_name: Optional[str] = None, categorical_methods: Optional[Union[str, List[str]]] = None, continuous_methods: Optional[Union[str, List[str]]] = None, chunk_size: Optional[int] = None, chunk_number: Optional[int] = None, chunk_period: Optional[str] = None, chunker: Optional[nannyml.chunk.Chunker] = None)

Bases: nannyml.base.AbstractCalculator

Calculates drift for individual features.

Creates a new UnivariateDriftCalculator instance.

Parameters
  • column_names (Union[str, List[str]]) – A string or list containing the names of features in the provided data set. A drift score will be calculated for each entry in this list.

  • timestamp_column_name (str) – The name of the column containing the timestamp of the model prediction.

  • categorical_methods (Union[str, List[str]], default=['jensen_shannon']) – A method name or list of method names that will be performed on categorical columns.

  • continuous_methods (Union[str, List[str]], default=['jensen_shannon']) – A method name or list of method names that will be performed on continuous columns.

  • chunk_size (int) – Splits the data into chunks containing chunk_size observations. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_number (int) – Splits the data into chunk_number pieces. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_period (str) – Splits the data according to the given period. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.
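
The difference between chunk_size and chunk_number can be illustrated with a minimal pure-Python sketch. The helpers split_by_size and split_by_number are hypothetical and are not part of NannyML; in particular, how a trailing incomplete chunk or an uneven split is handled here is an assumption, and the library's own chunkers may behave differently.

```python
def split_by_size(rows, chunk_size):
    """Consecutive chunks of chunk_size rows.

    Assumption: a shorter trailing chunk is kept as-is.
    """
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def split_by_number(rows, chunk_number):
    """Exactly chunk_number consecutive chunks of near-equal size.

    Assumption: leftover rows are spread over the first chunks.
    """
    base, extra = divmod(len(rows), chunk_number)
    chunks, start = [], 0
    for i in range(chunk_number):
        end = start + base + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


rows = list(range(12))  # stand-in for 12 observations
by_size = split_by_size(rows, chunk_size=4)        # fixed rows per chunk
by_number = split_by_number(rows, chunk_number=4)  # fixed number of chunks
```

With 12 observations, chunk_size=4 fixes the number of rows per chunk, while chunk_number=4 fixes the number of chunks, so the two settings generally produce different splits of the same data.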

Examples

>>> import nannyml as nml
>>> reference, analysis, _ = nml.load_synthetic_car_price_dataset()
>>> column_names = [col for col in reference.columns if col not in ['timestamp', 'y_pred', 'y_true']]
>>> calc = nml.UnivariateDriftCalculator(
...     column_names=column_names,
...     timestamp_column_name='timestamp',
...     continuous_methods=['kolmogorov_smirnov', 'jensen_shannon', 'wasserstein'],
...     categorical_methods=['chi2', 'jensen_shannon', 'l_infinity'],
... ).fit(reference)
>>> res = calc.calculate(analysis)
>>> res = res.filter(period='analysis')
>>> for column_name in res.continuous_column_names:
...     for method in res.continuous_method_names:
...         res.plot(kind='drift', column_name=column_name, method=method).show()
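
For intuition about what a method such as 'kolmogorov_smirnov' measures, the two-sample Kolmogorov-Smirnov statistic is the largest absolute difference between the empirical cumulative distribution functions of two samples. The following is a minimal pure-Python sketch of that definition; the function name ks_statistic is hypothetical and this is not NannyML's implementation.

```python
from bisect import bisect_right


def ks_statistic(reference, analysis):
    """Two-sample KS statistic: max |ECDF_reference(x) - ECDF_analysis(x)|."""
    ref = sorted(reference)
    ana = sorted(analysis)
    # The ECDFs only change at observed values, so checking those is enough.
    points = sorted(set(ref) | set(ana))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(ana, x) / len(ana))
        for x in points
    )


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])      # identical samples
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])   # partially overlapping
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])        # no overlap at all
```

The statistic ranges from 0 for identical samples to 1 for samples whose values do not overlap, so larger values indicate a larger distributional difference between the two samples.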