nannyml.drift.multivariate.domain_classifier.calculator module

Calculates multivariate drift on unseen analysis data by training a domain classifier against reference data.

This calculator wraps a LightGBM classifier that is trained to discriminate between reference data and analysis data. The reference data is registered when the fit method is called. On calling the calculate method, the classifier's cross-validated performance at telling the two data sets apart is computed for each chunk of analysis data.

This discrimination value can be used as a measure of drift between the reference and analysis data sets: the easier the two are to tell apart, the more the data has drifted.

class nannyml.drift.multivariate.domain_classifier.calculator.DomainClassifierCalculator(feature_column_names: ~typing.Union[str, ~typing.List[str]], treat_as_categorical: ~typing.Optional[~typing.Union[str, ~typing.List[str]]] = None, timestamp_column_name: ~typing.Optional[str] = None, chunk_size: ~typing.Optional[int] = None, chunk_number: ~typing.Optional[int] = None, chunk_period: ~typing.Optional[str] = None, chunker: ~typing.Optional[~nannyml.chunk.Chunker] = None, cv_folds_num: ~typing.Optional[int] = 5, hyperparameters: ~typing.Optional[~typing.Dict[str, ~typing.Any]] = {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 13, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0}, tune_hyperparameters: bool = False, hyperparameter_tuning_config: ~typing.Optional[~typing.Dict[str, ~typing.Any]] = {'estimator_list': ['lgbm'], 'eval_method': 'cv', 'hpo_method': 'cfo', 'metric': 'roc_auc', 'n_splits': 5, 'seed': 1, 'task': 'binary', 'time_budget': 120, 'verbose': 0}, threshold: ~nannyml.thresholds.Threshold = ConstantThreshold{'lower': 0.45, 'upper': 0.65})[source]

Bases: AbstractCalculator

DomainClassifierCalculator implementation.

Uses the cross-validated performance of a drift detection classifier as a measure of drift.
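The idea of using cross-validated classifier performance as a drift measure can be sketched with the standard library alone. This is a minimal illustration, not the calculator's implementation: it swaps the LightGBM model for a simple nearest-centroid scorer, and the helper names (auroc, centroid, sq_dist, domain_classifier_auroc) and the generated data are hypothetical. Reference rows are labelled 0, chunk rows are labelled 1, and the mean AUROC over the folds is the discrimination value.

```python
import random
from statistics import mean


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid(rows):
    """Column-wise mean of a list of equally sized rows."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def domain_classifier_auroc(reference, chunk, folds=5, seed=13):
    """Cross-validated AUROC of a nearest-centroid model separating reference (0) from chunk (1)."""
    data = [(row, 0) for row in reference] + [(row, 1) for row in chunk]
    random.Random(seed).shuffle(data)
    fold_scores = []
    for k in range(folds):
        test = data[k::folds]
        train = [item for i, item in enumerate(data) if i % folds != k]
        c0 = centroid([r for r, y in train if y == 0])
        c1 = centroid([r for r, y in train if y == 1])
        # A higher score means the row looks more like the chunk than the reference.
        scores = [sq_dist(r, c0) - sq_dist(r, c1) for r, _ in test]
        fold_scores.append(auroc(scores, [y for _, y in test]))
    return mean(fold_scores)


# Hypothetical data: one chunk from the reference distribution, one with a shifted first feature.
rng = random.Random(0)
reference = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
same = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
shifted = [[rng.gauss(2, 1), rng.gauss(0, 1)] for _ in range(500)]

auc_same = domain_classifier_auroc(reference, same)
auc_shifted = domain_classifier_auroc(reference, shifted)
print(f"no drift: {auc_same:.2f}, drift: {auc_shifted:.2f}")
```

When the chunk follows the reference distribution the classifier cannot do better than chance, so the AUROC stays close to 0.5. When a feature has shifted, the two sets become separable and the AUROC moves towards 1.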

Create a new DomainClassifierCalculator instance.

feature_column_names: Union[str, List[str]]

A list containing the names of features in the provided data set. All of these features will be used by the multivariate classifier for drift detection to calculate an aggregate drift metric.

treat_as_categorical: Optional[Union[str, List[str]]], default=None

A list containing the names of features in the provided data set that should be treated as categorical. It need not be exhaustive.

timestamp_column_name: Optional[str], default=None

The name of the column containing the timestamp of the model prediction.

chunk_size: int, default=None

Splits the data into chunks containing chunk_size observations. Only one of chunk_size, chunk_number or chunk_period should be given.

chunk_number: int, default=None

Splits the data into chunk_number pieces. Only one of chunk_size, chunk_number or chunk_period should be given.

chunk_period: str, default=None

Splits the data according to the given period. Only one of chunk_size, chunk_number or chunk_period should be given.

chunker: Chunker, default=None

The Chunker used to split the data sets into a list of chunks.
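The difference between size-based and count-based chunking can be illustrated with plain Python lists. This is a sketch under stated assumptions: split_by_size and split_by_number are hypothetical helpers, not part of NannyML, and the library's own handling of an incomplete final chunk may differ from what is shown here.

```python
from typing import List, Sequence


def split_by_size(rows: Sequence, chunk_size: int) -> List[Sequence]:
    """Consecutive chunks of `chunk_size` rows; the last chunk may be shorter."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def split_by_number(rows: Sequence, chunk_number: int) -> List[Sequence]:
    """`chunk_number` consecutive chunks whose sizes differ by at most one row."""
    base, extra = divmod(len(rows), chunk_number)
    chunks, start = [], 0
    for k in range(chunk_number):
        # The first `extra` chunks each take one additional row.
        end = start + base + (1 if k < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


rows = list(range(12))
by_size = split_by_size(rows, chunk_size=5)
by_number = split_by_number(rows, chunk_number=4)
print([len(c) for c in by_size], [len(c) for c in by_number])
```

With chunk_size the number of chunks depends on the amount of data, whereas with chunk_number the size of each chunk does. That is one reason to supply only one of the two.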

cv_folds_num: Optional[int], default=5

Number of cross-validation folds to use when calculating the domain classifier discrimination value.

hyperparameters: Optional[Dict[str, Any]]

A dictionary used to provide your own custom hyperparameters when training the discrimination model. Defaults to the LightGBM settings shown in the class signature above. Check out the available hyperparameter options in the LightGBM docs.

tune_hyperparameters: bool, default=False

A boolean controlling whether hyperparameter tuning should be performed on the internal classifier model. Tuning hyperparameters takes some time and does not guarantee better results, hence it defaults to False.

threshold: Threshold, default=ConstantThreshold

The threshold you wish to evaluate values on. Defaults to a ConstantThreshold with a lower value of 0.45 and an upper value of 0.65.
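How a constant lower and upper threshold turns a discrimination value into an alert can be sketched as follows. The function is_alert is a hypothetical helper, and the use of strict inequalities is an assumption; only the default bounds of 0.45 and 0.65 come from the signature above.

```python
from typing import Optional


def is_alert(value: float, lower: Optional[float] = 0.45, upper: Optional[float] = 0.65) -> bool:
    """Return True when `value` falls outside the [lower, upper] band.

    A bound set to None is ignored (assumed behaviour for this sketch).
    """
    below = lower is not None and value < lower
    above = upper is not None and value > upper
    return below or above


# Three hypothetical per-chunk discrimination values.
flags = [is_alert(v) for v in (0.50, 0.70, 0.40)]
print(flags)
```

A value near 0.5 means the classifier cannot separate the chunk from the reference data, so the default band is centred close to that value.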

hyperparameter_tuning_config: Optional[Dict[str, Any]]

A dictionary that allows you to provide a custom hyperparameter tuning configuration when tune_hyperparameters has been set to True. The following dictionary is the default tuning configuration, as listed in the class signature. It can be used as a template to modify:

{
    "time_budget": 120,
    "metric": "roc_auc",
    "estimator_list": ['lgbm'],
    "eval_method": "cv",
    "hpo_method": "cfo",
    "n_splits": 5,
    "task": 'binary',
    "seed": 1,
    "verbose": 0,
}

For an overview of possible parameters for the tuning process check out the FLAML documentation.
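One way to customise the tuning configuration is to copy the defaults and override only the entries that need to change. The sketch below uses the default values listed in the class signature; the variable names and the 30 second budget are illustrative assumptions.

```python
# Default tuning configuration, copied from the class signature.
default_tuning_config = {
    "time_budget": 120,
    "metric": "roc_auc",
    "estimator_list": ["lgbm"],
    "eval_method": "cv",
    "hpo_method": "cfo",
    "n_splits": 5,
    "task": "binary",
    "seed": 1,
    "verbose": 0,
}

# Override a single entry without mutating the defaults (30 is an arbitrary example value).
custom_tuning_config = {**default_tuning_config, "time_budget": 30}

# The result would then be passed to the calculator together with
# tune_hyperparameters=True, via hyperparameter_tuning_config=custom_tuning_config.
print(custom_tuning_config["time_budget"])
```

Building the custom dictionary from the defaults keeps every other setting, such as the roc_auc metric and the binary task, consistent with the classification problem the calculator solves.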

>>> import nannyml as nml
>>> # Load synthetic data
>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
>>> # Columns that are not model features
>>> # (assumed here: the timestamp, prediction and target columns)
>>> non_feature_columns = ['timestamp', 'y_pred_proba', 'y_pred', 'repaid']
>>> # Define feature columns
>>> feature_column_names = [
...     col for col in reference_df.columns
...     if col not in non_feature_columns
... ]
>>> calc = nml.DomainClassifierCalculator(
...     feature_column_names=feature_column_names,
...     timestamp_column_name='timestamp',
...     chunk_size=5000
... )
>>> # Fit on reference data, then compute drift for each chunk of analysis data
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> figure = results.plot()
>>> figure.show()

tune_hyperparams(X: DataFrame, y: ndarray)[source]

nannyml.drift.multivariate.domain_classifier.calculator.drop_matching_duplicate_rows(X: DataFrame, y: ndarray, subset: List[str]) → Tuple[DataFrame, ndarray][source]

nannyml.drift.multivariate.domain_classifier.calculator.preprocess_categorical_features(X: DataFrame, continuous_column_names: List[str], categorical_column_names: List[str]) → DataFrame[source]