nannyml.data_quality.range.calculator module

Range monitor for continuous numerical variables, ensuring that the values supplied stay within the bounds seen in the reference data.

class nannyml.data_quality.range.calculator.NumericalRangeCalculator(column_names: Union[str, List[str]], normalize: bool = True, timestamp_column_name: Optional[str] = None, chunk_size: Optional[int] = None, chunk_number: Optional[int] = None, chunk_period: Optional[str] = None, chunker: Optional[Chunker] = None, threshold: Threshold = ConstantThreshold(lower=None, upper=0))

Bases: AbstractCalculator

NumericalRangeCalculator checks that the numerical ranges in the monitored data set match those of the reference data set.

Creates a new NumericalRangeCalculator instance.

Parameters:
  • column_names (Union[str, List[str]]) – A string or list containing the names of features in the provided data set. Out-of-range values will be calculated for each entry in this list.

  • normalize (bool, default=True) – Whether to provide the ratio of out-of-range values (True) or the absolute number of out-of-range values (False).

  • timestamp_column_name (str) – The name of the column containing the timestamp of the model prediction.

  • chunk_size (int) – Splits the data into chunks containing chunk_size observations. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_number (int) – Splits the data into chunk_number pieces. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunk_period (str) – Splits the data according to the given period. Only one of chunk_size, chunk_number or chunk_period should be given.

  • chunker (Chunker) – The Chunker used to split the data sets into a list of chunks.

  • threshold (Threshold, default=ConstantThreshold(lower=None, upper=0)) – The threshold used to evaluate the calculated values. Defaults to a ConstantThreshold with no lower bound and an upper bound of 0, so any out-of-range value exceeds the threshold. The other available option is StandardDeviationThreshold.

Examples

>>> import nannyml as nml
>>> reference_df, analysis_df, _ = nml.load_synthetic_car_price_dataset()
>>> feature_column_names = [col for col in reference_df.columns if col not in [
...     'fuel','transmission','timestamp', 'y_pred', 'y_true']]
>>> calc = nml.NumericalRangeCalculator(
...     column_names=feature_column_names,
...     timestamp_column_name='timestamp',
... ).fit(reference_df)
>>> res = calc.calculate(analysis_df)
>>> res.filter(period='analysis').plot().show()