nannyml.drift.model_inputs.univariate.statistical.calculator module

Calculates drift for individual features using the Kolmogorov-Smirnov and chi2-contingency statistical tests.
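
As a rough illustration of the two test statistics named above, the sketch below computes a two-sample Kolmogorov-Smirnov statistic for a continuous feature and a Pearson chi-squared statistic for a categorical one, using only the standard library. The helper names `ks_statistic` and `chi2_statistic` are hypothetical and are not part of NannyML. The sketch computes no p-values and applies no continuity correction, so it may not match the calculator's output exactly.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(reference, analysis):
    """Two-sample KS statistic: the largest absolute gap between the
    empirical cumulative distribution functions of the two samples."""
    ref_sorted = sorted(reference)
    ana_sorted = sorted(analysis)
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every distinct value seen in either sample.
    for x in set(ref_sorted) | set(ana_sorted):
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        ana_cdf = bisect_right(ana_sorted, x) / len(ana_sorted)
        gap = max(gap, abs(ref_cdf - ana_cdf))
    return gap


def chi2_statistic(reference, analysis):
    """Pearson chi-squared statistic for the 2 x k contingency table of
    category counts (rows: reference / analysis, columns: categories)."""
    ref_counts, ana_counts = Counter(reference), Counter(analysis)
    n_ref, n_ana = len(reference), len(analysis)
    total = n_ref + n_ana
    stat = 0.0
    for category in set(ref_counts) | set(ana_counts):
        column_total = ref_counts[category] + ana_counts[category]
        cells = ((ref_counts[category], n_ref), (ana_counts[category], n_ana))
        for observed, row_total in cells:
            # Expected count under the hypothesis that both samples share
            # the same category distribution.
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy data: a shifted continuous feature and a categorical
# feature whose category mix changed between reference and analysis.
print(ks_statistic([0.1, 0.4, 0.5, 0.9], [0.6, 0.8, 1.2, 1.5]))
print(chi2_statistic(["car", "car", "bike"], ["bike", "bike", "car"]))
```

Larger values of either statistic indicate a bigger difference between the reference and analysis distributions of that feature.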

class nannyml.drift.model_inputs.univariate.statistical.calculator.UnivariateStatisticalDriftCalculator(feature_column_names: List[str], timestamp_column_name: Optional[str] = None, chunk_size: Optional[int] = None, chunk_number: Optional[int] = None, chunk_period: Optional[str] = None, chunker: Optional[Chunker] = None)

Bases: AbstractCalculator

Calculates drift for individual features using statistical tests.

Creates a new UnivariateStatisticalDriftCalculator instance.

Parameters:
  • feature_column_names (List[str]) – A list containing the names of features in the provided data set. A drift score will be calculated for each entry in this list.

  • timestamp_column_name (str, default=None) – The name of the column containing the timestamp of the model prediction.

  • chunk_size (int) – Splits the data into chunks containing chunk_size observations. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_number (int) – Splits the data into chunk_number pieces. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_period (str) – Splits the data according to the given period. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.
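
The difference between chunk_size and chunk_number can be sketched in plain Python as follows. This is only an illustration of the two splitting strategies described above, not NannyML's Chunker implementation. The helper names are hypothetical, and the handling of leftover rows (a shorter final chunk, or slightly uneven pieces) is an assumption that may differ from the library. Splitting by chunk_period is omitted because it depends on timestamps.

```python
def check_chunking_args(chunk_size=None, chunk_number=None, chunk_period=None):
    """Hypothetical guard for the 'only one of' rule in the parameter list."""
    given = [v for v in (chunk_size, chunk_number, chunk_period) if v is not None]
    if len(given) > 1:
        raise ValueError(
            "give only one of chunk_size, chunk_number or chunk_period"
        )


def split_by_size(rows, chunk_size):
    """Consecutive chunks of chunk_size rows; the last one may be shorter
    (an assumption of this sketch)."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def split_by_number(rows, chunk_number):
    """Exactly chunk_number consecutive pieces whose sizes differ by at
    most one row (an assumption of this sketch)."""
    base, extra = divmod(len(rows), chunk_number)
    chunks, start = [], 0
    for i in range(chunk_number):
        # The first `extra` pieces each take one of the leftover rows.
        end = start + base + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


rows = list(range(10))
check_chunking_args(chunk_size=4)
print(split_by_size(rows, 4))    # three chunks: 4, 4 and 2 rows
print(split_by_number(rows, 3))  # three chunks: 4, 3 and 3 rows
```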

Examples

>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df = nml.load_synthetic_binary_classification_dataset()[0]
>>> analysis_df = nml.load_synthetic_binary_classification_dataset()[1]
>>> display(reference_df.head())
>>> feature_column_names = [
...     col for col in reference_df.columns if col not in [
...         'timestamp', 'y_pred_proba', 'period', 'y_pred', 'work_home_actual', 'identifier'
...     ]
... ]
>>> calc = nml.UnivariateStatisticalDriftCalculator(
...     feature_column_names=feature_column_names,
...     timestamp_column_name='timestamp'
... )
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.data.iloc[:, :9])
>>> display(calc.previous_reference_results.iloc[:, :9])
>>> for feature in calc.feature_column_names:
...     drift_fig = results.plot(
...         kind='feature_drift',
...         feature_column_name=feature,
...         plot_reference=True
...     )
...     drift_fig.show()
>>> for cont_feat in calc.continuous_column_names:
...     figure = results.plot(
...         kind='feature_distribution',
...         feature_column_name=cont_feat,
...         plot_reference=True
...     )
...     figure.show()
>>> for cat_feat in calc.categorical_column_names:
...     figure = results.plot(
...         kind='feature_distribution',
...         feature_column_name=cat_feat,
...         plot_reference=True
...     )
...     figure.show()
>>> ranker = nml.Ranker.by('alert_count')
>>> ranked_features = ranker.rank(results, only_drifting=False)
>>> display(ranked_features)