nannyml.drift.multivariate.data_reconstruction.calculator module

Calculates the data reconstruction error on unseen analysis data after fitting on reference data.

This calculator wraps a PCA transformation. It is fitted on reference data when the fit method is called. When the calculate method is called, it performs the inverse transformation on the analysis data and computes the Euclidean distance between the analysis data and its reconstructed version.

This is the data reconstruction error, and it can be used as a measure of drift between the reference and analysis data sets.
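The idea behind the reconstruction error can be sketched with plain NumPy (an SVD-based PCA): fit principal components on reference data, reconstruct new data from the truncated basis, and measure the mean Euclidean residual. Note this is only an illustration of the concept; the actual calculator additionally scales and imputes the data and uses sklearn.decomposition.PCA, and the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Reference" data: the PCA basis is fitted on it.
reference = rng.normal(size=(500, 5))
mean = reference.mean(axis=0)
# SVD of the centered reference data yields the principal components (rows of Vt).
_, _, components = np.linalg.svd(reference - mean, full_matrices=False)
components = components[:2]  # keep a truncated basis of 2 components

def reconstruction_error(data: np.ndarray) -> float:
    """Project onto the PCA subspace, invert, and average the Euclidean residuals."""
    centered = data - mean
    reconstructed = centered @ components.T @ components
    return float(np.mean(np.linalg.norm(centered - reconstructed, axis=1)))

# Analysis data drawn from the same distribution reconstructs well;
# drifted data (shifted mean, larger variance) reconstructs worse.
analysis_like = rng.normal(size=(500, 5))
drifted = rng.normal(loc=2.0, scale=2.0, size=(500, 5))
print(reconstruction_error(analysis_like) < reconstruction_error(drifted))  # True
```

A sustained increase in this error on analysis chunks signals that the analysis data no longer lies in the subspace learned from the reference data, i.e. drift.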

class nannyml.drift.multivariate.data_reconstruction.calculator.DataReconstructionDriftCalculator(column_names: ~typing.List[str], timestamp_column_name: ~typing.Optional[str] = None, n_components: ~typing.Union[int, float, str] = 0.65, chunk_size: ~typing.Optional[int] = None, chunk_number: ~typing.Optional[int] = None, chunk_period: ~typing.Optional[str] = None, chunker: ~typing.Optional[~nannyml.chunk.Chunker] = None, imputer_categorical: ~typing.Optional[~sklearn.impute._base.SimpleImputer] = None, imputer_continuous: ~typing.Optional[~sklearn.impute._base.SimpleImputer] = None, threshold: ~nannyml.thresholds.Threshold = StandardDeviationThreshold{'std_lower_multiplier': 3, 'std_upper_multiplier': 3, 'offset_from': <function nanmean>})[source]

Bases: AbstractCalculator

Multivariate Drift Calculator using PCA Reconstruction Error as a measure of drift.

Creates a new DataReconstructionDriftCalculator instance.

Parameters:
  • column_names – List[str] A list containing the names of features in the provided data set. All of these features will be used by the multivariate data reconstruction drift calculator to calculate an aggregate drift score.

  • timestamp_column_name – str, default=None The name of the column containing the timestamp of the model prediction.

  • n_components – Union[int, float, str], default=0.65 The n_components parameter as passed to the sklearn.decomposition.PCA constructor. See https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

  • chunk_size – int, default=None Splits the data into chunks containing chunk_size observations. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_number – int, default=None Splits the data into chunk_number pieces. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_period – str, default=None Splits the data according to the given period. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunker – Chunker, default=None The Chunker used to split the data sets into lists of chunks.

  • imputer_categorical – SimpleImputer, default=None The SimpleImputer used to impute categorical features in the data. Defaults to using most_frequent value.

  • imputer_continuous – SimpleImputer, default=None The SimpleImputer used to impute continuous features in the data. Defaults to using mean value.

  • threshold – Threshold, default=StandardDeviationThreshold The threshold you wish to evaluate values on. Defaults to a StandardDeviationThreshold with default options. The other allowed value is ConstantThreshold.
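The default StandardDeviationThreshold shown in the signature (std_lower_multiplier=3, std_upper_multiplier=3, offset_from=nanmean) can be illustrated with NumPy: the alert bounds sit three standard deviations on either side of the nan-aware mean of the reference reconstruction errors. The error values below are made up for illustration.

```python
import numpy as np

# Per-chunk reconstruction errors computed on reference data (illustrative values).
reference_errors = np.array([0.81, 0.79, 0.83, 0.80, 0.82, np.nan])

# StandardDeviationThreshold with default options:
center = np.nanmean(reference_errors)   # offset_from=<function nanmean>
std = np.nanstd(reference_errors)
lower = center - 3 * std                # std_lower_multiplier=3
upper = center + 3 * std                # std_upper_multiplier=3

# An analysis chunk whose reconstruction error falls outside [lower, upper]
# is flagged as drift; an in-range value is not.
print(lower < 0.80 < upper)  # True
print(0.95 > upper)          # True: this chunk would raise a drift alert
```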

Examples:

>>> import nannyml as nml
>>> # Load synthetic data
>>> reference, analysis, _ = nml.load_synthetic_car_loan_dataset()
>>> feature_column_names = [
...     'car_value',
...     'salary_range',
...     'debt_to_income_ratio',
...     'loan_length',
...     'repaid_loan_on_prev_car',
...     'size_of_downpayment',
...     'driver_tenure',
... ]
>>> calc = nml.DataReconstructionDriftCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)
>>> figure = results.plot()
>>> figure.show()