nannyml.drift.univariate.methods module

This module contains the different drift detection method implementations.

The MethodFactory converts drift detection method names into instances of the matching Method subclass.

The UnivariateDriftCalculator class performs the required data transformations, then loops over all the Method instances it holds, either fitting each one on reference data or calculating its drift value on analysis data.

class nannyml.drift.univariate.methods.Chi2Statistic(**kwargs)[source]

Bases: Method

Calculates the Chi2-contingency statistic.

An alert will be raised for a Chunk if p_value < 0.05.

Creates a new Method instance.

Parameters:
  • display_name (str) – The name of the metric. Used to display in plots. If not given this name will be derived from the calculation_function.

  • column_name (str) – The name used to indicate the metric in columns of a DataFrame.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.

  • computation_params (dict, default=None) – A dictionary specifying parameter names and values to be used in the computation of the drift method.

  • upper_threshold (float, default=None) – An optional upper threshold for the data quality metric.

  • lower_threshold (float, default=None) – An optional lower threshold for the data quality metric.

  • upper_threshold_limit (float, default=None) – An optional upper threshold limit for the data quality metric.

  • lower_threshold_limit (float, default=0) – An optional lower threshold limit for the data quality metric.

alert(value: float)[source]

Evaluates if an alert has occurred for this method on the current chunk data.

Parameters:

value (float) – The method value for a given chunk

fit(reference_data: Series, timestamps: Optional[Series] = None) Self[source]

Fits a Method on reference data.

Parameters:
  • reference_data (pd.Series) – The reference data used for fitting a Method. Must have target data available.

  • timestamps (Optional[pd.Series], default=None) – A series containing the reference data Timestamps
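As a rough illustration of the statistic behind Chi2Statistic, the sketch below builds a 2 x k contingency table from a reference sample and an analysis sample and computes Pearson's chi-squared statistic from it. It is not the library's implementation: the helper names chi2_statistic and p_value_one_dof are hypothetical, no continuity correction is applied, and the p-value shortcut only holds for one degree of freedom (two categories).

```python
import math
from collections import Counter


def chi2_statistic(reference, analysis):
    """Pearson chi-squared statistic and degrees of freedom for two categorical samples."""
    ref_counts, ana_counts = Counter(reference), Counter(analysis)
    categories = sorted(set(ref_counts) | set(ana_counts))
    # Rows: reference and analysis; columns: one per category.
    observed = [
        [ref_counts[c] for c in categories],
        [ana_counts[c] for c in categories],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under the hypothesis that both samples share one distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(categories) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Chi-squared survival function; valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(statistic / 2))


reference = ["a"] * 30 + ["b"] * 10
analysis = ["a"] * 10 + ["b"] * 30
statistic, dof = chi2_statistic(reference, analysis)
p_value = p_value_one_dof(statistic)  # only meaningful because dof == 1 here
alert = p_value < 0.05  # mirrors the documented alert rule
```

For tables with more than two categories the p-value has to come from the general chi-squared distribution, which a statistics library would normally provide.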

class nannyml.drift.univariate.methods.FeatureType(value)[source]

Bases: str, Enum

An enumeration indicating if a Method is applicable to continuous data, categorical data or both.

CATEGORICAL = 'categorical'
CONTINUOUS = 'continuous'
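Because FeatureType derives from both str and Enum, its members behave like plain strings in comparisons and can be looked up from their string value. The sketch below uses a local stand-in class that mirrors the two documented members rather than importing the real one.

```python
from enum import Enum


class FeatureType(str, Enum):
    """Local stand-in mirroring the documented enum; not imported from nannyml."""

    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"


# The str mixin makes members compare equal to their string values.
is_equal = FeatureType.CATEGORICAL == "categorical"

# Members can be recovered from the raw string, e.g. when parsing configuration.
looked_up = FeatureType("continuous")
```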
class nannyml.drift.univariate.methods.HellingerDistance(**kwargs)[source]

Bases: Method

Calculates the Hellinger Distance between two distributions.

Creates a new Method instance.

Parameters:
  • display_name (str) – The name of the metric. Used to display in plots. If not given this name will be derived from the calculation_function.

  • column_name (str) – The name used to indicate the metric in columns of a DataFrame.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.

  • computation_params (dict, default=None) – A dictionary specifying parameter names and values to be used in the computation of the drift method.

  • upper_threshold (float, default=None) – An optional upper threshold for the data quality metric.

  • lower_threshold (float, default=None) – An optional lower threshold for the data quality metric.

  • upper_threshold_limit (float, default=None) – An optional upper threshold limit for the data quality metric.

  • lower_threshold_limit (float, default=0) – An optional lower threshold limit for the data quality metric.
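To make the Hellinger distance concrete, here is a minimal sketch for two categorical samples. It assumes the standard definition H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)) over relative frequencies; the helper names are hypothetical, and how the library bins continuous data is not covered.

```python
import math
from collections import Counter


def relative_frequencies(values, categories):
    """Share of each category in `values`, in the order given by `categories`."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same categories."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


reference = ["a", "a", "b", "b"]
analysis = ["a", "a", "a", "a"]
categories = sorted(set(reference) | set(analysis))

distance = hellinger(
    relative_frequencies(reference, categories),
    relative_frequencies(analysis, categories),
)
```

The result lies between 0 (identical distributions) and 1 (distributions with no category in common).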

class nannyml.drift.univariate.methods.JensenShannonDistance(**kwargs)[source]

Bases: Method

Calculates Jensen-Shannon distance.

By default an alert will be raised if distance > 0.1.

Creates a new Method instance.

Parameters:
  • display_name (str) – The name of the metric. Used to display in plots. If not given this name will be derived from the calculation_function.

  • column_name (str) – The name used to indicate the metric in columns of a DataFrame.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.

  • computation_params (dict, default=None) – A dictionary specifying parameter names and values to be used in the computation of the drift method.

  • upper_threshold (float, default=None) – An optional upper threshold for the data quality metric.

  • lower_threshold (float, default=None) – An optional lower threshold for the data quality metric.

  • upper_threshold_limit (float, default=None) – An optional upper threshold limit for the data quality metric.

  • lower_threshold_limit (float, default=0) – An optional lower threshold limit for the data quality metric.
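The Jensen-Shannon distance is the square root of the Jensen-Shannon divergence. The sketch below computes it for two discrete distributions that are already expressed as probabilities over the same bins or categories. Using base-2 logarithms, which bound the distance to [0, 1], is an assumption here, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    # The mixture m is positive wherever p or q is, so the KL terms are well defined.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


reference_probs = [0.5, 0.5]
analysis_probs = [0.9, 0.1]
distance = jensen_shannon_distance(reference_probs, analysis_probs)
alert = distance > 0.1  # mirrors the documented default alert rule
```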

class nannyml.drift.univariate.methods.KolmogorovSmirnovStatistic(**kwargs)[source]

Bases: Method

Calculates the Kolmogorov-Smirnov d-stat.

An alert will be raised for a Chunk if p_value < 0.05.

Creates a new Method instance.

Parameters:
  • display_name (str) – The name of the metric. Used to display in plots. If not given this name will be derived from the calculation_function.

  • column_name (str) – The name used to indicate the metric in columns of a DataFrame.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.

  • computation_params (dict, default=None) – A dictionary specifying parameter names and values to be used in the computation of the drift method.

  • upper_threshold (float, default=None) – An optional upper threshold for the data quality metric.

  • lower_threshold (float, default=None) – An optional lower threshold for the data quality metric.

  • upper_threshold_limit (float, default=None) – An optional upper threshold limit for the data quality metric.

  • lower_threshold_limit (float, default=0) – An optional lower threshold limit for the data quality metric.
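The Kolmogorov-Smirnov d-stat is the largest vertical gap between the empirical cumulative distribution functions of two samples. A minimal sketch, with a hypothetical function name, is shown below. It computes only the statistic; the p-value used for the documented alert rule is not derived here.

```python
from bisect import bisect_right


def ks_statistic(reference, analysis):
    """Two-sample KS statistic: max |ECDF_reference(x) - ECDF_analysis(x)|."""
    ref_sorted = sorted(reference)
    ana_sorted = sorted(analysis)
    d_stat = 0.0
    # Both ECDFs are right-continuous step functions, so the largest gap
    # is reached at one of the observed values.
    for x in ref_sorted + ana_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_ana = bisect_right(ana_sorted, x) / len(ana_sorted)
        d_stat = max(d_stat, abs(cdf_ref - cdf_ana))
    return d_stat


d_stat = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```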

class nannyml.drift.univariate.methods.LInfinityDistance(**kwargs)[source]

Bases: Method

Calculates the L-Infinity Distance.

An alert will be raised if distance > 0.1.

Creates a new Method instance.

Parameters:
  • display_name (str) – The name of the metric. Used to display in plots. If not given this name will be derived from the calculation_function.

  • column_name (str) – The name used to indicate the metric in columns of a DataFrame.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.

  • computation_params (dict, default=None) – A dictionary specifying parameter names and values to be used in the computation of the drift method.

  • upper_threshold (float, default=None) – An optional upper threshold for the data quality metric.

  • lower_threshold (float, default=None) – An optional lower threshold for the data quality metric.

  • upper_threshold_limit (float, default=None) – An optional upper threshold limit for the data quality metric.

  • lower_threshold_limit (float, default=0) – An optional lower threshold limit for the data quality metric.
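For categorical data, the L-Infinity distance can be read as the largest absolute difference between the relative frequencies of any single category. The sketch below follows that reading; the function name is hypothetical and the code is an illustration, not the library's implementation.

```python
from collections import Counter


def l_infinity(reference, analysis):
    """Largest per-category gap in relative frequency between two samples."""
    ref_counts, ana_counts = Counter(reference), Counter(analysis)
    categories = set(ref_counts) | set(ana_counts)
    return max(
        abs(ref_counts[c] / len(reference) - ana_counts[c] / len(analysis))
        for c in categories
    )


reference = ["a"] * 5 + ["b"] * 5
analysis = ["a"] * 8 + ["b"] * 2
distance = l_infinity(reference, analysis)
alert = distance > 0.1  # mirrors the documented alert rule
```

A category that appears in only one sample still counts, because Counter returns 0 for missing keys.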

class nannyml.drift.univariate.methods.Method(display_name: str, column_name: str, chunker: Chunker, threshold: Threshold, computation_params: Optional[Dict[str, Any]] = None, upper_threshold_limit: Optional[float] = None, lower_threshold_limit: Optional[float] = None)[source]

Bases: ABC

A method base class to express the amount of drift between two distributions.

Creates a new Method instance.

Parameters:
  • display_name (str) – The name of the metric. Used to display in plots. If not given this name will be derived from the calculation_function.

  • column_name (str) – The name used to indicate the metric in columns of a DataFrame.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.

  • computation_params (dict, default=None) – A dictionary specifying parameter names and values to be used in the computation of the drift method.

  • upper_threshold (float, default=None) – An optional upper threshold for the data quality metric.

  • lower_threshold (float, default=None) – An optional lower threshold for the data quality metric.

  • upper_threshold_limit (float, default=None) – An optional upper threshold limit for the data quality metric.

  • lower_threshold_limit (float, default=0) – An optional lower threshold limit for the data quality metric.

__eq__(other)[source]

Establishes equality by comparing all properties.

alert(value: float)[source]

Evaluates if an alert has occurred for this method on the current chunk data.

Parameters:

value (float) – The method value for a given chunk

calculate(data: Series)[source]

Calculates drift within data with respect to the reference data.

Parameters:

data (pd.Series) – The data to compare to the reference data.

fit(reference_data: Series, timestamps: Optional[Series] = None) Self[source]

Fits a Method on reference data.

Parameters:
  • reference_data (pd.Series) – The reference data used for fitting a Method. Must have target data available.

  • timestamps (Optional[pd.Series], default=None) – A series containing the reference data Timestamps
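The fit, calculate and alert methods suggest a simple lifecycle: fit once on reference data, then calculate a value per chunk and check it against a threshold. The sketch below imitates that lifecycle with plain Python lists. SketchMethod and MeanShift are hypothetical stand-ins invented for illustration; they are not NannyML classes, and the real base class takes the constructor arguments documented above.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence


class SketchMethod(ABC):
    """Hypothetical stand-in for a drift method; not the library's Method class."""

    def __init__(self, upper_threshold: Optional[float] = None):
        self.upper_threshold = upper_threshold
        self._reference: Optional[List[float]] = None

    def fit(self, reference_data: Sequence[float]) -> "SketchMethod":
        # Remember the reference sample so later chunks can be compared to it.
        self._reference = list(reference_data)
        return self  # returning self allows chaining, like the documented fit()

    def calculate(self, data: Sequence[float]) -> float:
        if self._reference is None:
            raise RuntimeError("call fit() before calculate()")
        return self._distance(self._reference, list(data))

    def alert(self, value: float) -> bool:
        # Alert only when a threshold is set and the value exceeds it.
        return self.upper_threshold is not None and value > self.upper_threshold

    @abstractmethod
    def _distance(self, reference: List[float], data: List[float]) -> float:
        """Subclasses define how drift is measured."""


class MeanShift(SketchMethod):
    """Toy drift measure: absolute difference between sample means."""

    def _distance(self, reference: List[float], data: List[float]) -> float:
        return abs(sum(data) / len(data) - sum(reference) / len(reference))


method = MeanShift(upper_threshold=0.5).fit([1.0, 2.0, 3.0])
value = method.calculate([2.0, 3.0, 4.0])
has_alert = method.alert(value)
```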

class nannyml.drift.univariate.methods.MethodFactory[source]

Bases: object

A factory class that produces Method instances given a ‘key’ string and a ‘feature_type’ it supports.

classmethod create(key: str, feature_type: FeatureType, **kwargs) Method[source]

Returns a Method instance for a given key and FeatureType.

The value for the key is passed explicitly by the end user (provided within the UnivariateDriftCalculator initializer). The value for the FeatureType is provided implicitly by deducing it from the reference data upon fitting the UnivariateDriftCalculator.

Any additional keyword arguments are passed along to the initializer of the Method.

classmethod register(key: str, feature_type: FeatureType) Callable[source]

A decorator used to register a specific Method implementation to the factory.

Registering a Method requires a key string and a FeatureType.

The key sets the string value to select a Method by, e.g. chi2 to select the Chi2-contingency implementation when creating a UnivariateDriftCalculator.

Some Methods are only applicable to one FeatureType, e.g. Kolmogorov-Smirnov can only be used with continuous data and Chi2-contingency only with categorical data. Others, such as the Jensen-Shannon distance, support multiple types. These can be registered multiple times, once for each FeatureType they support. The value for key can be identical across registrations; the factory uses both the FeatureType and the key value to determine which class to instantiate.

Examples

>>> @MethodFactory.register(key='jensen_shannon', feature_type=FeatureType.CONTINUOUS)
... @MethodFactory.register(key='jensen_shannon', feature_type=FeatureType.CATEGORICAL)
... class JensenShannonDistance(Method):
...     pass
registry: Dict[str, Dict[FeatureType, Type[Method]]] = {'chi2': {FeatureType.CATEGORICAL: <class 'nannyml.drift.univariate.methods.Chi2Statistic'>}, 'hellinger': {FeatureType.CATEGORICAL: <class 'nannyml.drift.univariate.methods.HellingerDistance'>, FeatureType.CONTINUOUS: <class 'nannyml.drift.univariate.methods.HellingerDistance'>}, 'jensen_shannon': {FeatureType.CATEGORICAL: <class 'nannyml.drift.univariate.methods.JensenShannonDistance'>, FeatureType.CONTINUOUS: <class 'nannyml.drift.univariate.methods.JensenShannonDistance'>}, 'kolmogorov_smirnov': {FeatureType.CONTINUOUS: <class 'nannyml.drift.univariate.methods.KolmogorovSmirnovStatistic'>}, 'l_infinity': {FeatureType.CATEGORICAL: <class 'nannyml.drift.univariate.methods.LInfinityDistance'>}, 'wasserstein': {FeatureType.CONTINUOUS: <class 'nannyml.drift.univariate.methods.WassersteinDistance'>}}
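The register and create classmethods follow a decorator-based registry pattern, and the nested registry dictionary above is keyed first by method name and then by FeatureType. The self-contained sketch below reproduces that pattern with local stand-in classes. It is an illustration under assumptions rather than the library's code; in particular, raising ValueError for an unknown combination is a choice made for this sketch.

```python
from enum import Enum
from typing import Callable, Dict, Type


class FeatureType(str, Enum):
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"


class Method:
    """Stand-in base class; it simply stores whatever keyword arguments it receives."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class MethodFactory:
    # key -> feature type -> class, mirroring the documented registry layout
    registry: Dict[str, Dict[FeatureType, Type[Method]]] = {}

    @classmethod
    def register(cls, key: str, feature_type: FeatureType) -> Callable:
        def decorator(method_cls: Type[Method]) -> Type[Method]:
            cls.registry.setdefault(key, {})[feature_type] = method_cls
            return method_cls  # return the class unchanged so decorators can stack

        return decorator

    @classmethod
    def create(cls, key: str, feature_type: FeatureType, **kwargs) -> Method:
        try:
            method_cls = cls.registry[key][feature_type]
        except KeyError:
            # Assumption: the sketch signals unsupported combinations with ValueError.
            raise ValueError(
                f"no method registered for key '{key}' "
                f"and feature type '{feature_type.value}'"
            ) from None
        return method_cls(**kwargs)


@MethodFactory.register(key="jensen_shannon", feature_type=FeatureType.CONTINUOUS)
@MethodFactory.register(key="jensen_shannon", feature_type=FeatureType.CATEGORICAL)
class JensenShannonDistance(Method):
    pass


@MethodFactory.register(key="chi2", feature_type=FeatureType.CATEGORICAL)
class Chi2Statistic(Method):
    pass


method = MethodFactory.create("jensen_shannon", FeatureType.CONTINUOUS)
```

Requesting a key with a feature type it was never registered for, such as "chi2" with FeatureType.CONTINUOUS, fails in this sketch instead of silently falling back to another class.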
class nannyml.drift.univariate.methods.WassersteinDistance(**kwargs)[source]

Bases: Method

Calculates the Wasserstein Distance between two distributions.

An alert will be raised for a Chunk if .

Creates a new Method instance.

Parameters:
  • display_name (str) – The name of the metric. Used to display in plots. If not given this name will be derived from the calculation_function.

  • column_name (str) – The name used to indicate the metric in columns of a DataFrame.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.

  • computation_params (dict, default=None) – A dictionary specifying parameter names and values to be used in the computation of the drift method.

  • upper_threshold (float, default=None) – An optional upper threshold for the data quality metric.

  • lower_threshold (float, default=None) – An optional lower threshold for the data quality metric.

  • upper_threshold_limit (float, default=None) – An optional upper threshold limit for the data quality metric.

  • lower_threshold_limit (float, default=0) – An optional lower threshold limit for the data quality metric.
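For one-dimensional samples, the Wasserstein distance (first order) equals the area between the two empirical cumulative distribution functions. The sketch below computes it that way for samples of any size; the function name is hypothetical and the code is an illustration, not the library's implementation.

```python
from bisect import bisect_right


def wasserstein_1d(reference, analysis):
    """First-order Wasserstein distance between two 1-D empirical distributions."""
    ref_sorted = sorted(reference)
    ana_sorted = sorted(analysis)
    points = sorted(set(ref_sorted) | set(ana_sorted))
    total = 0.0
    # Between consecutive observed values both ECDFs are constant, so the
    # area is a sum of rectangles: |ECDF gap| * interval width.
    for left, right in zip(points, points[1:]):
        cdf_ref = bisect_right(ref_sorted, left) / len(ref_sorted)
        cdf_ana = bisect_right(ana_sorted, left) / len(ana_sorted)
        total += abs(cdf_ref - cdf_ana) * (right - left)
    return total


distance = wasserstein_1d([0, 1, 3], [5, 6, 8])
```

Unlike the bounded distances above, the result is expressed in the units of the feature itself, so shifting every value by 5 yields a distance of 5.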