.. _multivariate_drift_detection_dc:

=================
Domain Classifier
=================

The second multivariate drift detection method of NannyML is Domain Classifier.
It provides a measure of how easy it is to discriminate the reference data from the examined chunk data.
You can read more about it in the :ref:`How it works: Domain Classifier` section.
When there is no data drift, the datasets can't be discerned and we get a value of 0.5.
The more drift there is, the higher the returned measure will be, up to a value of 1.

Just The Code
-------------

.. nbimport::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cells: 1 3 4 6 8

.. admonition:: **Advanced configuration**
    :class: hint

    - To learn how :class:`~nannyml.chunk.Chunk` works and to set up custom chunkings check out the :ref:`chunking tutorial`.
    - To learn how :class:`~nannyml.thresholds.ConstantThreshold` works and to set up custom thresholds check out the :ref:`thresholds tutorial`.

Walkthrough
-----------

The method returns a single number, measuring the discrimination capability of the discriminator.
Any increase in the discrimination value above 0.5 reflects a change in the structure of the model inputs.

NannyML calculates the discrimination value for the monitored model's inputs, and raises an alert if the
values get outside the pre-defined range of ``[0.45, 0.65]``. If needed, this range can be adjusted by specifying
a threshold strategy more appropriate for the user's data.

In order to monitor a model, NannyML needs to learn about it from a reference dataset.
Then it can monitor the data subject to actual analysis, provided as the analysis dataset.
You can read more about this in our section on :ref:`data periods`.

Let's start by loading some synthetic data provided by the NannyML package and setting it up as our reference
and analysis dataframes. This synthetic data is for a binary classification model, but multi-class classification
can be handled in the same way.

.. nbimport::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cells: 1

.. nbtable::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cell: 2

The :class:`~nannyml.drift.multivariate.domain_classifier.calculator.DomainClassifierCalculator`
class implements this functionality. We need to instantiate it with appropriate parameters:

- **feature_column_names:** A list with the column names of the features we want to run drift detection on.
- **treat_as_categorical (Optional):** A list containing the names of features in the provided data set that should
  be treated as categorical. The list need not be exhaustive.
- **timestamp_column_name (Optional):** The name of the column in the reference data that contains timestamps.
- **chunk_size (Optional):** The number of observations in each chunk of data used. Only one chunking argument
  needs to be provided. For more information about :term:`chunking` configurations check out the
  :ref:`chunking tutorial`.
- **chunk_number (Optional):** The number of chunks to be created out of data provided for each :ref:`period`.
- **chunk_period (Optional):** The time period based on which we aggregate the provided data in order to create chunks.
- **chunker (Optional):** A NannyML :class:`~nannyml.chunk.Chunker` object that will handle the aggregation of the
  provided data in order to create chunks.
- **cv_folds_num (Optional):** Number of cross-validation folds to use when calculating the DC discrimination value.
- **hyperparameters (Optional):** A dictionary used to provide your own custom hyperparameters when training the
  discrimination model. Check out the available hyperparameter options in the `LightGBM docs`_.
- **tune_hyperparameters (Optional):** A boolean controlling whether hyperparameter tuning should be performed on the
  internal discrimination model whilst fitting on reference data.
- **hyperparameter_tuning_config (Optional):** A dictionary that allows you to provide a custom hyperparameter
  tuning configuration when ``tune_hyperparameters`` has been set to ``True``.
  The available options are listed in the `AutoML FLAML documentation`_.
- **threshold (Optional):** The threshold strategy used to calculate the alert threshold limits.
  For more information about thresholds, check out the :ref:`thresholds tutorial`.

Next, the :meth:`~nannyml.base.AbstractCalculator.fit` method needs to be called on the reference data,
which the results will be based on. Then the
:meth:`~nannyml.base.AbstractCalculator.calculate` method will
calculate the multivariate drift results on the provided data.

.. nbimport::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cells: 3

We can view the results for the data provided to the
:meth:`~nannyml.base.AbstractCalculator.calculate` method as a dataframe.

.. nbimport::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cells: 4

.. nbtable::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cell: 5

The drift results from the reference data are accessible from the properties of the results object:

.. nbimport::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cells: 6

.. nbtable::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cell: 7

NannyML can also visualize the multivariate drift results in a plot. Our plot contains several key elements.

* The purple step plot shows the discrimination value in each chunk of the analysis period. Thick squared point
  markers indicate the middle of these chunks.
* The red horizontal dashed lines show upper and lower thresholds for alerting purposes.
* If the discrimination value crosses the upper or lower threshold, an alert is raised. A red, diamond-shaped point
  marker additionally indicates this in the middle of the chunk.

.. nbimport::
    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
    :cells: 8

.. image:: /_static/tutorials/detecting_data_drift/multivariate_drift_detection/classifier-for-drift-detection.svg

The multivariate drift results provide a concise summary of where data drift
is happening in our input data.

Insights
--------

Using this method of detecting drift, we can identify changes that we may not have seen using solely univariate methods.

What Next
---------

After reviewing the results, we want to look at the :ref:`drift results of individual features`
to see what changed in the model's features individually.

The :ref:`Performance Estimation` functionality can be used to
estimate the impact of the observed changes.

.. _`AutoML FLAML documentation`: https://microsoft.github.io/FLAML/docs/reference/automl/automl
.. _`LightGBM docs`: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
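
As a supplementary illustration of the alerting rule described in the Walkthrough, the sketch below flags
discrimination values that fall outside the default range of ``[0.45, 0.65]``. It is a plain-Python
approximation of the idea, not NannyML's implementation: the ``flag_alerts`` helper and the per-chunk values
are hypothetical and exist only for this example.

.. code-block:: python

    from typing import List

    def flag_alerts(values: List[float], lower: float = 0.45, upper: float = 0.65) -> List[bool]:
        """Return True for every value that lies outside the closed range [lower, upper]."""
        return [value < lower or value > upper for value in values]

    # Hypothetical per-chunk discrimination values (0.5 means no detectable drift).
    chunk_values = [0.50, 0.52, 0.48, 0.71, 0.93]

    alerts = flag_alerts(chunk_values)
    for value, alert in zip(chunk_values, alerts):
        print(f"discrimination value {value:.2f} -> alert: {alert}")

In practice the limits come from the threshold strategy passed to the calculator, so a custom strategy would
change ``lower`` and ``upper`` rather than the comparison itself.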