nannyml.drift.multivariate.data_reconstruction.calculator module

Drift calculator using Reconstruction Error as a measure of drift.

class nannyml.drift.multivariate.data_reconstruction.calculator.DataReconstructionDriftCalculator(column_names: List[str], timestamp_column_name: Optional[str] = None, n_components: Union[int, float, str] = 0.65, chunk_size: Optional[int] = None, chunk_number: Optional[int] = None, chunk_period: Optional[str] = None, chunker: Optional[nannyml.chunk.Chunker] = None, imputer_categorical: Optional[sklearn.impute._base.SimpleImputer] = None, imputer_continuous: Optional[sklearn.impute._base.SimpleImputer] = None)

Bases: nannyml.base.AbstractCalculator

BaseDriftCalculator implementation using Reconstruction Error as a measure of drift.

Creates a new DataReconstructionDriftCalculator instance.

Parameters
  • column_names (List[str]) – A list containing the names of features in the provided data set. All of these features will be used by the multivariate data reconstruction drift calculator to calculate an aggregate drift score.

  • timestamp_column_name (str, default=None) – The name of the column containing the timestamp of the model prediction.

  • n_components (Union[int, float, str], default=0.65) – The n_components parameter as passed to the sklearn.decomposition.PCA constructor. See https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

  • chunk_size (int, default=None) – Splits the data into chunks containing chunk_size observations. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_number (int, default=None) – Splits the data into chunk_number pieces. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_period (str, default=None) – Splits the data according to the given period. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunker (Chunker, default=None) – The Chunker used to split the data sets into a list of chunks.

  • imputer_categorical (SimpleImputer, default=None) – The SimpleImputer used to impute categorical features in the data. If None, missing values are imputed with the most_frequent strategy.

  • imputer_continuous (SimpleImputer, default=None) – The SimpleImputer used to impute continuous features in the data. If None, missing values are imputed with the mean strategy.
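
To make the n_components and imputer parameters more concrete, the following is a minimal sketch of a PCA reconstruction error for continuous features only. It is not NannyML's implementation: the helper name reconstruction_error, the use of StandardScaler, and the synthetic data are assumptions made for illustration. The idea is to fit the imputer, scaler and PCA on reference data, then compress and decompress a later chunk and measure how far the result lands from the original.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def reconstruction_error(reference, chunk, n_components=0.65):
    """Hypothetical helper: mean Euclidean distance between a chunk and its PCA reconstruction."""
    # Everything is fitted on the reference data only.
    imputer = SimpleImputer(strategy="mean").fit(reference)
    scaler = StandardScaler().fit(imputer.transform(reference))
    # A float n_components keeps enough components to explain that share of variance.
    pca = PCA(n_components=n_components, svd_solver="full")
    pca.fit(scaler.transform(imputer.transform(reference)))

    scaled = scaler.transform(imputer.transform(chunk))
    reconstructed = pca.inverse_transform(pca.transform(scaled))
    return float(np.mean(np.linalg.norm(scaled - reconstructed, axis=1)))


rng = np.random.default_rng(0)


def make_data(n, correlated):
    """Four features forming two pairs that are either strongly correlated or independent."""
    a, c = rng.normal(size=n), rng.normal(size=n)
    if correlated:
        b = a + 0.1 * rng.normal(size=n)
        d = c + 0.1 * rng.normal(size=n)
    else:
        b, d = rng.normal(size=n), rng.normal(size=n)
    return np.column_stack([a, b, c, d])


reference = make_data(2000, correlated=True)
stable_chunk = make_data(1000, correlated=True)
stable_chunk[0, 0] = np.nan  # a missing value, filled by the mean imputer
drifted_chunk = make_data(1000, correlated=False)

stable_error = reconstruction_error(reference, stable_chunk)
drifted_error = reconstruction_error(reference, drifted_chunk)
print(stable_error, drifted_error)
```

In this sketch the drifted chunk keeps the same marginal distributions but loses the correlation structure learned from the reference data, so its reconstruction error is noticeably higher than that of the stable chunk.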

Examples

>>> import nannyml as nml
>>> from IPython.display import display
>>> # Load synthetic data
>>> reference = nml.load_synthetic_binary_classification_dataset()[0]
>>> analysis = nml.load_synthetic_binary_classification_dataset()[1]
>>> display(reference.head())
>>> # Define feature columns
>>> column_names = [
...     col for col in reference.columns if col not in [
...         'timestamp', 'y_pred_proba', 'period', 'y_pred', 'work_home_actual', 'identifier'
...     ]]
>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)
>>> display(results.data)
>>> display(results.calculator.previous_reference_results)
>>> figure = results.plot(plot_reference=True)
>>> figure.show()
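
The example above uses chunk_size=5000. As a rough illustration of how the three chunking options differ, the sketch below splits a small frame with plain pandas and NumPy. It does not use NannyML's Chunker classes, the column names are made up, and the library may treat an incomplete final chunk differently from this sketch, which simply keeps it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 60 daily observations starting 2021-01-01.
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=60, freq="D"),
    "value": np.arange(60),
})

# chunk_size style: fixed number of observations per chunk.
size = 25
size_chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]

# chunk_number style: a fixed number of (nearly) equal chunks.
number = 4
number_chunks = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), number)]

# chunk_period style: one chunk per calendar period, here per month.
period_chunks = [g for _, g in df.groupby(df["timestamp"].dt.to_period("M"))]

print([len(c) for c in size_chunks])
print([len(c) for c in number_chunks])
print([len(c) for c in period_chunks])
```
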

nannyml.drift.multivariate.data_reconstruction.calculator.sampling_error(components: Tuple, data: pandas.core.frame.DataFrame) -> float