Thresholds
Threshold basics
NannyML performance metrics and drift methods have thresholds associated with them in order to generate alerts when necessary. The Threshold class is responsible for calculating those thresholds.
Its thresholds() method takes a numpy.ndarray of values as input and returns two values: a lower and an upper threshold value. The input values are typically the metric or method values calculated on reference data.
The process of calculating the threshold values is as follows.
The calculator or estimator runs and uses the reference data to compute the values for the related method or metric for each chunk. Those values are used by the thresholds() method to calculate the associated lower and upper threshold values.
When the calculator or estimator runs on analysis data, the method or metric value for each chunk is compared with the lower and upper threshold values to see whether it breaches either of them. If so, the alert flag will be set to True for that chunk.
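The process described above can be sketched in plain NumPy. This is only an illustration and not NannyML's implementation: the helper names compute_thresholds and flag_alerts, the rule of mean plus or minus three standard deviations, the strict comparisons, and the metric values are all assumptions made up for this example.

```python
import numpy as np


def compute_thresholds(reference_values, multiplier=3.0):
    """Hypothetical helper: derive (lower, upper) from per-chunk reference values."""
    baseline = np.mean(reference_values)
    offset = multiplier * np.std(reference_values)
    return baseline - offset, baseline + offset


def flag_alerts(values, lower, upper):
    """Hypothetical helper: True where a chunk value breaches either threshold."""
    values = np.asarray(values, dtype=float)
    return (values < lower) | (values > upper)


# Made-up per-chunk metric values, for illustration only.
reference_values = np.array([0.90, 0.92, 0.91, 0.89, 0.93])
analysis_values = np.array([0.91, 0.80, 0.92])

# Thresholds come from the reference chunks only.
lower, upper = compute_thresholds(reference_values)

# Each analysis chunk is then compared against those fixed thresholds.
alerts = flag_alerts(analysis_values, lower, upper)
print(lower, upper, alerts)
```

Only the second analysis chunk falls outside the range derived from the reference chunks, so it is the only one flagged.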
All NannyML calculators and estimators have a threshold property that allows you to set a custom threshold for their metrics or inspect them.
Some metrics have mathematical boundaries. For example, the F1 score is limited to \([0, 1]\).
To enforce these boundaries, some metrics and drift methods within NannyML have lower and upper limits.
When calculating the threshold values during fitting, NannyML will check if the calculated threshold values fall within
these limits. If they don’t, the breaching threshold value(s) will be overridden by the theoretical limit.
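The override described here amounts to clamping each threshold to its limit. A minimal sketch follows, assuming a hypothetical helper named apply_limits and made-up threshold values; it is not how NannyML exposes or implements this check.

```python
def apply_limits(lower, upper, lower_limit=None, upper_limit=None):
    """Hypothetical helper: replace thresholds that fall outside the theoretical limits."""
    # A disabled threshold (None) or a missing limit (None) is left untouched.
    if lower is not None and lower_limit is not None and lower < lower_limit:
        lower = lower_limit
    if upper is not None and upper_limit is not None and upper > upper_limit:
        upper = upper_limit
    return lower, upper


# A metric bounded to [0, 1], such as F1, with made-up calculated thresholds.
limited = apply_limits(-0.05, 1.02, lower_limit=0.0, upper_limit=1.0)
unchanged = apply_limits(0.4, 0.9, lower_limit=0.0, upper_limit=1.0)
print(limited, unchanged)
```

Thresholds that already sit inside the limits pass through unchanged, and only the breaching side is replaced.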
NannyML also supports disabling the lower, upper or both thresholds. We’ll illustrate this in the following examples.
Constant thresholds
The ConstantThreshold class is a very basic threshold. It is given a lower and upper value when initialized, and these will be returned as the lower and upper threshold values, independent of what reference data is passed to it.
The ConstantThreshold can be configured using the following parameters:
lower: an optional float that sets the constant lower threshold value. Defaults to None. Setting this to None disables the lower threshold.
upper: an optional float that sets the constant upper threshold value. Defaults to None. Setting this to None disables the upper threshold.
>>> ct = nml.thresholds.ConstantThreshold(lower=0.5, upper=0.9)
>>> ct.thresholds(np.asarray(range(3)))
(0.5, 0.9)
The lower and upper parameters have a default value of None. NannyML interprets a missing value as meaning that the corresponding threshold should not be applied. For example, providing no lower value means that no lower threshold is applied.
>>> js = nml.thresholds.ConstantThreshold(upper=0.1)
>>> js.thresholds(np.asarray(range(3)))
(None, 0.1)
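To make the effect of a disabled threshold concrete, the sketch below shows one way an alert check could skip a side that is None. The breaches helper and the drift values are hypothetical and serve only as an illustration; NannyML's internal check may differ.

```python
import numpy as np


def breaches(values, lower=None, upper=None):
    """Hypothetical helper: flag values outside the enabled thresholds."""
    values = np.asarray(values, dtype=float)
    alerts = np.zeros(values.shape, dtype=bool)
    if lower is not None:  # a None lower threshold is never checked
        alerts |= values < lower
    if upper is not None:  # a None upper threshold is never checked
        alerts |= values > upper
    return alerts


# Made-up drift values for three chunks.
values = [0.02, 0.08, 0.15]
upper_only = breaches(values, lower=None, upper=0.1)
both = breaches(values, lower=0.05, upper=0.1)
print(upper_only, both)
```

With only the upper threshold enabled, the low first value no longer raises an alert.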
Standard deviation thresholds
The StandardDeviationThreshold class will use the mean of the data it is given as a baseline. It will then add the standard deviation of the given data, scaled by a multiplier, to that baseline to calculate the upper threshold value. By subtracting the standard deviation, scaled by a multiplier, from the baseline it calculates the lower threshold value.
This is easier to illustrate in code:
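import numpy as np

data = np.asarray(range(10))
baseline = np.mean(data)  # the mean of the data is the baseline
offset = np.std(data)  # one standard deviation of the data
upper_offset = offset * 3  # scaled by the upper multiplier
lower_offset = offset * 3  # scaled by the lower multiplier
lower_threshold, upper_threshold = baseline - lower_offset, baseline + upper_offset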
The StandardDeviationThreshold can be configured using the following parameters:
std_lower_multiplier: an optional float that scales the offset for the lower threshold value. Defaults to 3.
std_upper_multiplier: an optional float that scales the offset for the upper threshold value. Defaults to 3.
offset_from: a function used to aggregate the given data into the baseline. The mean is used by default.
These examples show how to create a StandardDeviationThreshold.
This first example demonstrates the default usage.
>>> stdt = nml.thresholds.StandardDeviationThreshold()
>>> stdt.thresholds(np.asarray(range(3)))
(-1.4494897427831779, 3.449489742783178)
This next example shows how to configure the StandardDeviationThreshold.
Multipliers can make the offset smaller or larger, and alternatives to the mean may be provided as well.
>>> stdt = nml.thresholds.StandardDeviationThreshold(std_lower_multiplier=0.1, std_upper_multiplier=5, offset_from=np.max)
>>> stdt.thresholds(np.asarray(range(3)))
(1.9183503419072274, 6.08248290463863)
By providing a None value you can disable one or more thresholds. The following example shows how to disable the lower threshold by setting the appropriate multiplier to None.
>>> stdt = nml.thresholds.StandardDeviationThreshold(std_lower_multiplier=None)
>>> stdt.thresholds(np.asarray(range(3)))
(None, 3.449489742783178)
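The three results above can be reproduced with a few lines of NumPy, which may help when checking expected threshold values by hand. The std_thresholds function below is a hypothetical stand-in written for this illustration under the assumption that the calculation follows the description in this section. It is not the library's implementation.

```python
import numpy as np


def std_thresholds(data, std_lower_multiplier=3, std_upper_multiplier=3, offset_from=np.mean):
    """Hypothetical re-derivation of the standard deviation threshold calculation."""
    data = np.asarray(data, dtype=float)
    baseline = offset_from(data)  # aggregate the data into a baseline
    std = np.std(data)  # population standard deviation (NumPy's default)
    # A multiplier of None disables the corresponding threshold.
    lower = None if std_lower_multiplier is None else baseline - std * std_lower_multiplier
    upper = None if std_upper_multiplier is None else baseline + std * std_upper_multiplier
    return lower, upper


data = np.asarray(range(3))
default = std_thresholds(data)
custom = std_thresholds(data, std_lower_multiplier=0.1, std_upper_multiplier=5, offset_from=np.max)
no_lower = std_thresholds(data, std_lower_multiplier=None)
print(default, custom, no_lower)
```

For the input [0, 1, 2] the mean is 1 and the standard deviation is roughly 0.816, so the default thresholds are 1 minus and plus three times that value, matching the first example up to floating-point rounding.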
Warning
The Chi-squared, \(\chi^2\), drift detection method for categorical data does not support custom thresholds yet. It currently uses p-values for thresholding, and replacing them with the custom thresholding system, or incorporating them into it, requires further research.
For now it will continue to function as it did before.
When specifying a custom threshold for Chi-squared in the UnivariateDriftCalculator, NannyML will log a warning message to clarify that the custom threshold will be ignored.