nannyml.drift.univariate.methods module

This module contains the different drift detection method implementations.

The MethodFactory will convert the drift detection method names into an instance of the base Method class.

The UnivariateDriftCalculator class performs the required data transformations, then loops over all the Method instances it holds, either fitting each one on reference data or calculating its drift value on analysis data.

class nannyml.drift.univariate.methods.CategoricalHellingerDistance(**kwargs)[source]

Bases: Method

Calculates the Hellinger Distance between two distributions.

Initialize Hellinger Distance method.
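As a rough illustration of the quantity this method reports, the sketch below computes the Hellinger distance between the relative category frequencies of two samples using only the standard library. The function names `relative_frequencies` and `hellinger_distance` are hypothetical and are not part of NannyML; the library's own implementation may differ in how it aligns categories.

```python
import math
from collections import Counter


def relative_frequencies(values, categories):
    """Return the share of each category in `values`, in the order given."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def hellinger_distance(reference, analysis):
    """Hellinger distance between two categorical samples (illustrative only)."""
    # Use the union of categories so both frequency vectors line up.
    categories = sorted(set(reference) | set(analysis))
    p = relative_frequencies(reference, categories)
    q = relative_frequencies(analysis, categories)
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2)
```

The distance is 0 for identical frequency vectors and 1 when the two samples share no category.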

class nannyml.drift.univariate.methods.CategoricalJensenShannonDistance(**kwargs)[source]

Bases: Method

Calculates Jensen-Shannon distance.

By default an alert will be raised if distance > 0.1.

Initialize Jensen-Shannon method.
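The following standard-library sketch shows one way to compute a Jensen-Shannon distance on category frequencies and compare it against the default 0.1 alert threshold mentioned above. It assumes a base-2 logarithm so that the distance stays within [0, 1]; the names `js_distance` and `raises_alert` are hypothetical and not taken from NannyML.

```python
import math
from collections import Counter


def js_distance(reference, analysis):
    """Jensen-Shannon distance between two categorical samples (illustrative)."""
    categories = sorted(set(reference) | set(analysis))
    ref_counts, ana_counts = Counter(reference), Counter(analysis)
    p = [ref_counts[c] / len(reference) for c in categories]
    q = [ana_counts[c] / len(analysis) for c in categories]
    # The mixture distribution is the pointwise average of p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence in bits; terms with a_i == 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


def raises_alert(distance, threshold=0.1):
    """Mirror the documented default: alert when distance > 0.1."""
    return distance > threshold
```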

class nannyml.drift.univariate.methods.Chi2Statistic(**kwargs)[source]

Bases: Method

Calculates the Chi2-contingency statistic.

An alert will be raised for a Chunk if p_value < 0.05.

Initialize Chi2-contingency method.

alert(value: float)[source]

Evaluates if an alert has occurred for Chi2 on the current chunk data.

For Chi2, alerts are based on p-values rather than on the actual method values, unlike all other univariate drift methods.

Parameters: value (float) – The method value for a given chunk
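To make the p-value based alert concrete, here is a minimal standard-library sketch. It is restricted to exactly two categories, so the test has one degree of freedom and the p-value has a closed form, and it applies no continuity correction (SciPy's `chi2_contingency` applies one by default for 2x2 tables, so its numbers would differ). The names `chi2_two_categories` and `PValueAlertSketch` are hypothetical and are not NannyML's implementation.

```python
import math
from collections import Counter


def chi2_two_categories(reference, analysis):
    """Chi-squared contingency statistic and p-value for a 2x2 table."""
    categories = sorted(set(reference) | set(analysis))
    if len(categories) != 2:
        raise ValueError("this sketch handles exactly two categories")
    ref_counts, ana_counts = Counter(reference), Counter(analysis)
    table = [
        [ref_counts[c] for c in categories],
        [ana_counts[c] for c in categories],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


class PValueAlertSketch:
    """Hypothetical method whose alert ignores the statistic and uses the p-value."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self._reference = None
        self._p_value = None

    def fit(self, reference):
        self._reference = list(reference)
        return self

    def calculate(self, data):
        statistic, self._p_value = chi2_two_categories(self._reference, list(data))
        return statistic

    def alert(self, value):
        # `value` is accepted for interface compatibility but not used here.
        return self._p_value < self.alpha
```

In this sketch `alert` only makes sense after `calculate` has been called on the same chunk, because the p-value is stored as a side effect of the calculation.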

fit(reference_data: Series, timestamps: Optional[Series] = None) → Self[source]

Fits Chi2 Method on reference data.

Parameters:

reference_data (pd.Series) – The reference data used for fitting a Method. Must have target data available.
timestamps (Optional[pd.Series], default=None) – A series containing the reference data Timestamps

class nannyml.drift.univariate.methods.ContinuousHellingerDistance(**kwargs)[source]

Bases: Method

Calculates the Hellinger Distance between two distributions.

Initialize Hellinger Distance method.
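For continuous data the two samples first have to be turned into comparable discrete distributions. The sketch below bins both samples on shared, equal-width edges and then applies the Hellinger formula. The fixed bin count and the names `histogram` and `continuous_hellinger` are assumptions made for illustration; NannyML's actual binning strategy is not described in this section.

```python
import math


def histogram(values, low, high, n_bins):
    """Relative frequencies of `values` over n_bins equal-width bins on [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        # The maximum value would fall just outside the last bin, so clamp it.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return [c / len(values) for c in counts]


def continuous_hellinger(reference, analysis, n_bins=10):
    """Hellinger distance between two continuous samples via shared bins."""
    low = min(min(reference), min(analysis))
    high = max(max(reference), max(analysis))
    if high == low:
        # All observations are identical, so the distributions coincide.
        return 0.0
    p = histogram(reference, low, high, n_bins)
    q = histogram(analysis, low, high, n_bins)
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)
```

Sharing the bin edges between both samples matters: binning each sample on its own range would make the two histograms incomparable.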

class nannyml.drift.univariate.methods.ContinuousJensenShannonDistance(**kwargs)[source]

Bases: Method

Calculates Jensen-Shannon distance.

By default an alert will be raised if distance > 0.1.

Initialize Jensen-Shannon method.

class nannyml.drift.univariate.methods.FeatureType(value)[source]

Bases: str, Enum

An enumeration indicating if a Method is applicable to continuous data, categorical data or both.

CATEGORICAL = 'categorical'

CONTINUOUS = 'continuous'
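Because the enumeration derives from both str and Enum, its members compare equal to plain strings and can be looked up by value. The stand-in class `FeatureTypeSketch` below mirrors the two members listed above so that the behaviour can be tried without installing the library; it is not the NannyML class itself.

```python
from enum import Enum


class FeatureTypeSketch(str, Enum):
    """Local stand-in with the same bases and members as FeatureType."""

    CATEGORICAL = 'categorical'
    CONTINUOUS = 'continuous'


# Lookup by value returns the singleton member.
continuous = FeatureTypeSketch('continuous')

# The str mixin lets members compare equal to their plain string value.
is_same_as_string = FeatureTypeSketch.CATEGORICAL == 'categorical'
```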

class nannyml.drift.univariate.methods.KolmogorovSmirnovStatistic(**kwargs)[source]

Bases: Method

Calculates the Kolmogorov-Smirnov d-stat.

An alert will be raised for a Chunk if p_value < 0.05.

Initialize Kolmogorov-Smirnov method.
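The d-stat is the largest vertical gap between the empirical cumulative distribution functions of the two samples. A minimal standard-library sketch follows; the function name `ks_statistic` is hypothetical, and the p-value used for alerting is deliberately left out.

```python
from bisect import bisect_right


def ks_statistic(reference, analysis):
    """Two-sample Kolmogorov-Smirnov statistic (illustrative, no p-value)."""
    ref_sorted, ana_sorted = sorted(reference), sorted(analysis)
    d_stat = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so checking every observation is enough to find the maximum gap.
    for x in ref_sorted + ana_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        ana_cdf = bisect_right(ana_sorted, x) / len(ana_sorted)
        d_stat = max(d_stat, abs(ref_cdf - ana_cdf))
    return d_stat
```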

class nannyml.drift.univariate.methods.LInfinityDistance(**kwargs)[source]

Bases: Method

Calculates the L-Infinity Distance.

An alert will be raised if distance > 0.1.

Initialize L-Infinity Distance method.
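The L-Infinity distance between two categorical samples is the largest absolute difference between their per-category relative frequencies. The sketch below uses only the standard library; the name `l_infinity_distance` is hypothetical and not part of NannyML.

```python
from collections import Counter


def l_infinity_distance(reference, analysis):
    """Largest absolute gap in relative frequency across all categories."""
    categories = set(reference) | set(analysis)
    ref_counts, ana_counts = Counter(reference), Counter(analysis)
    return max(
        abs(ref_counts[c] / len(reference) - ana_counts[c] / len(analysis))
        for c in categories
    )


# Category 'a' moves from 50% to 75% of the data, 'b' from 50% to 25%.
distance = l_infinity_distance(['a', 'a', 'b', 'b'], ['a', 'a', 'a', 'b'])
alert = distance > 0.1  # threshold quoted in the description above
```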

class nannyml.drift.univariate.methods.Method(display_name: str, column_name: str, chunker: Chunker, threshold: Threshold, computation_params: Optional[Dict[str, Any]] = None, upper_threshold_limit: Optional[float] = None, lower_threshold_limit: Optional[float] = None)[source]

Bases: ABC

A method base class to express the amount of drift between two distributions.

Creates a new Method instance.

Parameters:

display_name (str) – The name of the metric, used for display in plots. If not given, this name will be derived from the calculation_function.
column_name (str) – The name used to indicate the metric in columns of a DataFrame.
chunker (Chunker) – The Chunker used to split the data sets into lists of chunks.
computation_params (dict, default=None) – A dictionary specifying parameter names and values to be used in the computation of the drift method.
threshold (Threshold) – Threshold class defining threshold strategy.
upper_threshold_limit (float, default=None) – An optional upper limit for the threshold of the drift method.
lower_threshold_limit (float, default=None) – An optional lower limit for the threshold of the drift method.

__eq__(other)[source]: Establishes equality by comparing all properties.

alert(value: float)[source]

Evaluates if an alert has occurred for this method on the current chunk data.

Parameters: value (float) – The method value for a given chunk

calculate(data: Series)[source]

Calculates drift within data with respect to the reference data.

Parameters: data (pd.Series) – The data to compare to the reference data.

fit(reference_data: Series, timestamps: Optional[Series] = None) → Self[source]

Fits a Method on reference data.

Parameters:

reference_data (pd.Series) – The reference data used for fitting a Method. Must have target data available.
timestamps (Optional[pd.Series], default=None) – A series containing the reference data Timestamps
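The fit, calculate and alert steps described above can be pictured with a small standalone skeleton. Everything in it is hypothetical: `DriftMethodSketch`, its `_calculate` hook, the plain numeric thresholds and the toy `MeanShift` subclass are illustrative assumptions and do not reproduce NannyML's Method class, its Threshold objects or its Chunker.

```python
from abc import ABC, abstractmethod


class DriftMethodSketch(ABC):
    """Hypothetical stand-in for a drift method; not NannyML's Method class."""

    def __init__(self, upper_threshold=None, lower_threshold=None):
        self.upper_threshold = upper_threshold
        self.lower_threshold = lower_threshold
        self._reference = None

    def fit(self, reference_data):
        # Remember the reference sample and return self to allow chaining.
        self._reference = list(reference_data)
        return self

    def calculate(self, data):
        if self._reference is None:
            raise RuntimeError("call fit() before calculate()")
        return self._calculate(list(data))

    @abstractmethod
    def _calculate(self, data):
        """Return a drift value for `data` relative to the reference sample."""

    def alert(self, value):
        too_high = self.upper_threshold is not None and value > self.upper_threshold
        too_low = self.lower_threshold is not None and value < self.lower_threshold
        return too_high or too_low


class MeanShift(DriftMethodSketch):
    """Toy method: absolute difference between the sample means."""

    def _calculate(self, data):
        reference_mean = sum(self._reference) / len(self._reference)
        return abs(sum(data) / len(data) - reference_mean)


method = MeanShift(upper_threshold=0.5).fit([1, 2, 3])
value = method.calculate([2, 3, 4])
has_alert = method.alert(value)
```

Returning self from `fit` matches the `→ Self` annotation in the signature above and allows construction and fitting to be chained in one expression.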

class nannyml.drift.univariate.methods.MethodFactory[source]

Bases: object

A factory class that produces Method instances given a ‘key’ string and a ‘feature_type’ it supports.

classmethod create(key: str, feature_type: FeatureType, **kwargs) → Method[source]

Returns a Method instance for a given key and FeatureType.

The value for the key is passed explicitly by the end user (provided within the UnivariateDriftCalculator initializer). The value for the FeatureType is provided implicitly by deducing it from the reference data upon fitting the UnivariateDriftCalculator.

Any additional keyword arguments are passed along to the initializer of the Method.

classmethod register(key: str, feature_type: FeatureType) → Callable[source]

A decorator used to register a specific Method implementation to the factory.

Registering a Method requires a key string and a FeatureType.

The key sets the string value to select a Method by, e.g. chi2 to select the Chi2-contingency implementation when creating a UnivariateDriftCalculator.

Some Methods are only applicable to one FeatureType, e.g. Kolmogorov-Smirnov can only be used with continuous data and Chi2-contingency only with categorical data. Others, such as the Jensen-Shannon distance, support multiple types. These can be registered multiple times, once for each FeatureType they support. The value for key can be identical across registrations, because the factory uses both the FeatureType and the key value to determine which class to instantiate.

Examples

>>> @MethodFactory.register(key='jensen_shannon', feature_type=FeatureType.CONTINUOUS)
... @MethodFactory.register(key='jensen_shannon', feature_type=FeatureType.CATEGORICAL)
... class JensenShannonDistance(Method):
...     pass

registry: Dict[str, Dict[FeatureType, Type[Method]]] = {
    'chi2': {
        FeatureType.CATEGORICAL: <class 'nannyml.drift.univariate.methods.Chi2Statistic'>,
    },
    'hellinger': {
        FeatureType.CATEGORICAL: <class 'nannyml.drift.univariate.methods.CategoricalHellingerDistance'>,
        FeatureType.CONTINUOUS: <class 'nannyml.drift.univariate.methods.ContinuousHellingerDistance'>,
    },
    'jensen_shannon': {
        FeatureType.CATEGORICAL: <class 'nannyml.drift.univariate.methods.CategoricalJensenShannonDistance'>,
        FeatureType.CONTINUOUS: <class 'nannyml.drift.univariate.methods.ContinuousJensenShannonDistance'>,
    },
    'kolmogorov_smirnov': {
        FeatureType.CONTINUOUS: <class 'nannyml.drift.univariate.methods.KolmogorovSmirnovStatistic'>,
    },
    'l_infinity': {
        FeatureType.CATEGORICAL: <class 'nannyml.drift.univariate.methods.LInfinityDistance'>,
    },
    'wasserstein': {
        FeatureType.CONTINUOUS: <class 'nannyml.drift.univariate.methods.WassersteinDistance'>,
    },
}
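The two-level registry and the decorator-based registration can be reproduced in a few lines of plain Python. The sketch below is a simplified stand-in: `FactorySketch`, `FeatureTypeSketch`, `JensenShannonSketch` and `Chi2Sketch` are hypothetical names, and the error handling is an assumption rather than a description of what MethodFactory does for unknown keys.

```python
from enum import Enum
from typing import Callable, Dict, Type


class FeatureTypeSketch(str, Enum):
    CATEGORICAL = 'categorical'
    CONTINUOUS = 'continuous'


class FactorySketch:
    """Hypothetical factory keyed first by name, then by feature type."""

    registry: Dict[str, Dict[FeatureTypeSketch, Type]] = {}

    @classmethod
    def register(cls, key: str, feature_type: FeatureTypeSketch) -> Callable:
        def decorator(method_class: Type) -> Type:
            # Create the inner dict on first use, then store the class.
            cls.registry.setdefault(key, {})[feature_type] = method_class
            return method_class

        return decorator

    @classmethod
    def create(cls, key: str, feature_type: FeatureTypeSketch, **kwargs):
        if key not in cls.registry:
            raise KeyError(f"unknown method key '{key}'")
        if feature_type not in cls.registry[key]:
            raise KeyError(f"'{key}' does not support {feature_type.value} data")
        # Extra keyword arguments are forwarded to the class initializer.
        return cls.registry[key][feature_type](**kwargs)


@FactorySketch.register('jensen_shannon', FeatureTypeSketch.CONTINUOUS)
@FactorySketch.register('jensen_shannon', FeatureTypeSketch.CATEGORICAL)
class JensenShannonSketch:
    def __init__(self, **kwargs):
        self.kwargs = kwargs


@FactorySketch.register('chi2', FeatureTypeSketch.CATEGORICAL)
class Chi2Sketch:
    def __init__(self, **kwargs):
        self.kwargs = kwargs


method = FactorySketch.create(
    'jensen_shannon', FeatureTypeSketch.CONTINUOUS, column_name='age'
)
```

Because the decorator returns the class unchanged, stacking it registers the same class under several feature types without wrapping it.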

class nannyml.drift.univariate.methods.WassersteinDistance(**kwargs)[source]

Bases: Method

Calculates the Wasserstein Distance between two distributions.

An alert will be raised for a Chunk if .

Initialize Wasserstein Distance method.
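For one-dimensional samples, the first Wasserstein distance equals the area between the two empirical cumulative distribution functions. The standard-library sketch below integrates that area piecewise; the name `wasserstein_1d` is hypothetical, and the sketch makes no claim about how NannyML computes or thresholds the value.

```python
from bisect import bisect_right


def wasserstein_1d(reference, analysis):
    """First Wasserstein distance between two 1-D samples (illustrative)."""
    ref_sorted, ana_sorted = sorted(reference), sorted(analysis)
    all_values = sorted(ref_sorted + ana_sorted)
    distance = 0.0
    # Between consecutive observations both empirical CDFs are constant,
    # so the area is a sum of rectangles: |CDF gap| * interval width.
    for left, right in zip(all_values, all_values[1:]):
        ref_cdf = bisect_right(ref_sorted, left) / len(ref_sorted)
        ana_cdf = bisect_right(ana_sorted, left) / len(ana_sorted)
        distance += abs(ref_cdf - ana_cdf) * (right - left)
    return distance
```

Unlike the bounded distances above, the result is expressed in the units of the feature itself, so shifting every value by a constant changes the distance by exactly that constant.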