nannyml.metadata.base module

NannyML module providing classes and utilities for dealing with model metadata.

class nannyml.metadata.base.ModelMetadata(model_type: nannyml.metadata.base.ModelType, model_name: Optional[str] = None, features: Optional[List[nannyml.metadata.feature.Feature]] = None, target_column_name: str = 'target', partition_column_name: str = 'partition', timestamp_column_name: str = 'date')

Bases: abc.ABC

The ModelMetadata class contains all the information NannyML requires about your model.

To process your model inputs and outputs, NannyML needs to understand what your model (and hence its inputs and outputs) looks like. The ModelMetadata class combines all the information about your model that NannyML might need. We call this the model metadata, since it does not concern the actual model (e.g. weights or coefficients) but generic information about your model.

These properties are:

- model_name : a human-readable name for the model.
- model_purpose : an optional description of the use for your model.
- model_problem : the kind of problem your model is trying to solve. We currently only support binary_classification problems but are planning to support more very soon!
- features : the list of Features for the model.
- identifier_column_name : name of the column that contains a value that acts as an identifier for the observation, i.e. it is unique over all observations.
- prediction_column_name : name of the column that contains the model's predictions.
- target_column_name : name of the column that contains the ground truth / target / actual.
- partition_column_name : name of the column that contains the partition the observation belongs to. Allowed partition values are ‘reference’ and ‘analysis’.
- timestamp_column_name : name of the column that contains the timestamp indicating when the observation occurred.

Creates a new ModelMetadata instance.

Parameters

model_type (ModelType) – The kind of problem your model is trying to solve. Used to determine which metadata properties should be known by NannyML.
model_name (string, default=None) – A human-readable name for the model.
features (List[Feature]) – The list of Features for the model. Optional, defaults to None.
target_column_name (string) – The name of the column that contains the ground truth / target / actual. Optional, defaults to target.
partition_column_name (string) – The name of the column that contains the partition the observation belongs to. Allowed partition values are ‘reference’ and ‘analysis’. Optional, defaults to partition.
timestamp_column_name (string) – The name of the column that contains the timestamp indicating when the observation occurred. Optional, defaults to date.

Returns

metadata

Return type

ModelMetadata

Examples

>>> from nannyml.metadata import ModelMetadata, Feature, FeatureType
>>> metadata = ModelMetadata(model_type='classification_binary', target_column_name='work_home_actual')
>>> metadata.features = [Feature(column_name='dist_from_office', label='office_distance',
...     description='Distance from home to the office', feature_type=FeatureType.CONTINUOUS),
...     Feature(column_name='salary_range', label='salary_range', feature_type=FeatureType.CATEGORICAL)]
>>> metadata.to_dict()
{'timestamp_column_name': 'date',
 'partition_column_name': 'partition',
 'target_column_name': 'work_home_actual',
 'prediction_column_name': None,
 'predicted_probability_column_name': None,
 'features': [{'label': 'office_distance',
   'column_name': 'dist_from_office',
   'type': 'continuous',
   'description': 'Distance from home to the office'},
  {'label': 'salary_range',
   'column_name': 'salary_range',
   'type': 'categorical',
   'description': None}]}

__repr__(): Converts the ModelMetadata instance to a string representation.

__str__(): Converts the ModelMetadata instance to a string representation.

property categorical_features: List[nannyml.metadata.feature.Feature]

Retrieves all categorical features.

Returns: features – A list of all categorical features
Return type: List[Feature]

Examples

>>> from nannyml.metadata import ModelMetadata, Feature, FeatureType
>>> metadata = ModelMetadata(model_type='classification_binary', target_column_name='work_home_actual')
>>> metadata.features = [
...     Feature('cat1', 'cat1', FeatureType.CATEGORICAL), Feature('cat2', 'cat2', FeatureType.CATEGORICAL),
...     Feature('cont1', 'cont1', FeatureType.CONTINUOUS), Feature('cont2', 'cont2', FeatureType.CONTINUOUS)]
>>> metadata.categorical_features
[Feature({'label': 'cat1', 'column_name': 'cat1', 'type': 'categorical', 'description': None}),
Feature({'label': 'cat2', 'column_name': 'cat2', 'type': 'categorical', 'description': None})]

property continuous_features: List[nannyml.metadata.feature.Feature]

Retrieves all continuous features.

Returns: features – A list of all continuous features
Return type: List[Feature]

Examples

>>> from nannyml.metadata import ModelMetadata, Feature, FeatureType
>>> metadata = ModelMetadata(model_type='classification_binary', model_name='work_from_home',
...     target_column_name='work_home_actual')
>>> metadata.features = [
...     Feature('cat1', 'cat1', FeatureType.CATEGORICAL), Feature('cat2', 'cat2', FeatureType.CATEGORICAL),
...     Feature('cont1', 'cont1', FeatureType.CONTINUOUS), Feature('cont2', 'cont2', FeatureType.CONTINUOUS)]
>>> metadata.continuous_features
[Feature({'label': 'cont1', 'column_name': 'cont1', 'type': 'continuous', 'description': None}),
Feature({'label': 'cont2', 'column_name': 'cont2', 'type': 'continuous', 'description': None})]

abstract enrich(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Creates copies of all metadata columns with fixed names.

Parameters: data (DataFrame) – The data to enrich
Returns: enriched_data – A DataFrame that has all metadata present in NannyML-specific columns.
Return type: DataFrame

abstract extract(data: pandas.core.frame.DataFrame, model_name: Optional[str] = None, exclude_columns: Optional[List[str]] = None)

Tries to extract model metadata from a given data set.

Manually constructing model metadata can be cumbersome, especially if you have hundreds of features. NannyML includes this helper function that tries to do the boring stuff for you using some simple rules.

By default, all columns in the given dataset are considered to be either model features or metadata. Use the exclude_columns parameter to prevent columns from being interpreted as metadata or features.

Parameters

data (DataFrame) – The dataset containing model inputs and outputs, enriched with the required metadata.
model_name (str) – A human-readable name for the model.
exclude_columns (List[str], default=None) – A list of column names that are to be skipped during metadata extraction, preventing them from being interpreted as either model metadata or model features.

Returns

metadata – A fully initialized ModelMetadata subclass instance.

Return type

ModelMetadata

Notes

This method is usually not called directly. Instead, call the nannyml.metadata.extraction.extract_metadata() function, which delegates to this method.

This particular abstract method provides common functionality for its subclasses and is always called there using a super().extract() call.
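The delegation pattern described in these notes can be sketched in plain Python. This is only an illustration, not NannyML's implementation: the classes BaseMetadata and BinaryMetadata, the list-of-column-names input (instead of a DataFrame), and the naming rules for the target and prediction columns are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class BaseMetadata(ABC):
    """Hypothetical stand-in for an abstract metadata base class."""

    def __init__(self) -> None:
        self.model_name: Optional[str] = None
        self.target_column_name: Optional[str] = None
        self.feature_columns: List[str] = []

    @abstractmethod
    def extract(self, columns: List[str], model_name: Optional[str] = None,
                exclude_columns: Optional[List[str]] = None) -> "BaseMetadata":
        # An abstract method may still carry a body; subclasses reach it via super().
        excluded = set(exclude_columns or [])
        kept = [c for c in columns if c not in excluded]
        self.model_name = model_name
        # Assumed rule for this sketch: a column named 'target' holds the target.
        self.target_column_name = 'target' if 'target' in kept else None
        self.feature_columns = [c for c in kept if c != 'target']
        return self


class BinaryMetadata(BaseMetadata):
    """Hypothetical subclass adding model-type-specific extraction."""

    def __init__(self) -> None:
        super().__init__()
        self.prediction_column_name: Optional[str] = None

    def extract(self, columns: List[str], model_name: Optional[str] = None,
                exclude_columns: Optional[List[str]] = None) -> "BinaryMetadata":
        # Run the shared logic first, then apply the subclass-specific rule.
        super().extract(columns, model_name, exclude_columns)
        # Assumed rule for this sketch: a column named 'y_pred' holds predictions.
        if 'y_pred' in self.feature_columns:
            self.prediction_column_name = 'y_pred'
            self.feature_columns.remove('y_pred')
        return self


metadata = BinaryMetadata().extract(
    ['id', 'age', 'salary', 'y_pred', 'target'],
    model_name='demo',
    exclude_columns=['id'],
)
```

Here the excluded 'id' column is ignored entirely, while the remaining columns are split into target, prediction, and features.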

feature(index: Optional[int] = None, feature: Optional[str] = None, column: Optional[str] = None) → Optional[nannyml.metadata.feature.Feature]

A function used to access a specific model feature.

Because a model might contain hundreds of features, NannyML provides this utility method to filter through them and find the exact feature you need.

Parameters

index (int) – Retrieve a Feature using its index in the features list.
feature (str) – Retrieve a feature using its label.
column (str) – Retrieve a feature using the name of the column it has in the model inputs/outputs.

Returns

feature – A single Feature matching the search criteria. Returns None if none were found matching the criteria or no criteria were provided.

Return type

Feature

Examples

>>> from nannyml.metadata import ModelMetadata, Feature, FeatureType
>>> metadata = ModelMetadata(model_type='classification_binary', target_column_name='work_home_actual')
>>> metadata.features = [Feature(column_name='dist_from_office', label='office_distance',
...     description='Distance from home to the office', feature_type=FeatureType.CONTINUOUS),
...     Feature(column_name='salary_range', label='salary_range', feature_type=FeatureType.CATEGORICAL)]
>>> metadata.feature(index=1)
Feature({'label': 'salary_range', 'column_name': 'salary_range', 'type': 'categorical', 'description': None})
>>> metadata.feature(feature='office_distance')
Feature({'label': 'office_distance', 'column_name': 'dist_from_office', 'type': 'continuous',
'description': 'Distance from home to the office'})
>>> metadata.feature(column='dist_from_office')
Feature({'label': 'office_distance', 'column_name': 'dist_from_office', 'type': 'continuous',
'description': 'Distance from home to the office'})
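The lookup behaviour documented above (by index, by label, or by column name, with None when nothing matches or no criterion is given) can be approximated with a small standalone function. This is a sketch under assumptions: SimpleFeature and find_feature are hypothetical names, and the precedence of index over label over column is a choice made for this example, not something the documentation states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleFeature:
    """Hypothetical stand-in for a feature record."""
    label: str
    column_name: str
    feature_type: str
    description: Optional[str] = None


def find_feature(features: List[SimpleFeature], index: Optional[int] = None,
                 feature: Optional[str] = None,
                 column: Optional[str] = None) -> Optional[SimpleFeature]:
    # Assumed precedence for this sketch: index, then label, then column name.
    if index is not None:
        return features[index] if 0 <= index < len(features) else None
    if feature is not None:
        return next((f for f in features if f.label == feature), None)
    if column is not None:
        return next((f for f in features if f.column_name == column), None)
    # No criteria provided.
    return None


features = [
    SimpleFeature('office_distance', 'dist_from_office', 'continuous',
                  'Distance from home to the office'),
    SimpleFeature('salary_range', 'salary_range', 'categorical'),
]

by_index = find_feature(features, index=1)
by_label = find_feature(features, feature='office_distance')
by_column = find_feature(features, column='dist_from_office')
```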

abstract is_complete() → Tuple[bool, List[str]]

Flags whether the ModelMetadata is complete or still has missing values.

Returns

complete (bool) – True when all required fields are present, False otherwise
missing (List[str]) – A list of all missing properties. Empty when metadata is complete.

Examples

>>> from nannyml.metadata import ModelMetadata, Feature, FeatureType
>>> metadata = ModelMetadata(model_type='classification_binary', model_name='work_from_home',
...     target_column_name='work_home_actual')
>>> metadata.features = [
...     Feature('cat1', 'cat1', FeatureType.CATEGORICAL), Feature('cat2', 'cat2', FeatureType.CATEGORICAL),
...     Feature('cont1', 'cont1', FeatureType.CONTINUOUS), Feature('cont2', 'cont2', FeatureType.UNKNOWN)]
>>> # missing either predicted labels or predicted probabilities, 'cont2' has an unknown feature type
>>> metadata.is_complete()
(False, ['predicted_probability_column_name', 'prediction_column_name'])
>>> metadata.predicted_probability_column_name = 'y_pred_proba'  # fix the missing value
>>> metadata.feature(feature='cont2').feature_type = FeatureType.CONTINUOUS
>>> metadata.is_complete()
(True, [])
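The (complete, missing) return shape can be illustrated with a standalone check. This is a hedged sketch rather than the library's logic: check_complete is a hypothetical helper, and the rule that a group of alternative properties is reported as missing only when none of them is set is an assumption inferred from the example output above.

```python
from typing import Dict, List, Optional, Tuple


def check_complete(values: Dict[str, Optional[str]], required: List[str],
                   at_least_one_of: List[str]) -> Tuple[bool, List[str]]:
    # Every required property must have a value.
    missing = [name for name in required if values.get(name) is None]
    # Assumed rule: the alternatives count as missing only if none of them is set.
    if all(values.get(name) is None for name in at_least_one_of):
        missing.extend(at_least_one_of)
    return len(missing) == 0, missing


values = {
    'target_column_name': 'work_home_actual',
    'timestamp_column_name': 'date',
    'prediction_column_name': None,
    'predicted_probability_column_name': None,
}
required = ['target_column_name', 'timestamp_column_name']
alternatives = ['predicted_probability_column_name', 'prediction_column_name']

before = check_complete(values, required, alternatives)

# Supplying one of the alternatives is enough to complete the metadata.
values['predicted_probability_column_name'] = 'y_pred_proba'
after = check_complete(values, required, alternatives)
```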

abstract property metadata_columns: Returns all metadata columns that are added to the data by the enrich method.

property partition_column_name

print()

Returns a string representation of a ModelMetadata instance.

Examples

>>> from nannyml.metadata import ModelMetadata, Feature, FeatureType
>>> metadata = ModelMetadata(model_type='classification_binary', model_name='work_from_home',
...     target_column_name='work_home_actual')
>>> metadata.features = [Feature(column_name='dist_from_office', label='office_distance',
...     description='Distance to the office', feature_type=FeatureType.CONTINUOUS),
...     Feature(column_name='salary_range', label='salary_range', feature_type=FeatureType.CATEGORICAL)]
>>> metadata.print()
Metadata for model work_from_home
--
# Warning - unable to identify all essential data
# Please identify column names for all '~ UNKNOWN ~' values
--
Model problem                       binary_classification
Timestamp column                    date
Partition column                    partition
Prediction column                   ~ UNKNOWN ~
Predicted probability column        ~ UNKNOWN ~
Target column                       work_home_actual
--
Features
--
Name                                Column                              Type            Description
office_distance                     dist_from_office                    continuous      Distance to the office
salary_range                        salary_range                        categorical     None

property target_column_name

property timestamp_column_name

abstract to_df() → pandas.core.frame.DataFrame

Converts a ModelMetadata instance into a read-only DataFrame.

Examples

>>> from nannyml.metadata import ModelMetadata, Feature, FeatureType
>>> metadata = ModelMetadata(model_type='classification_binary', target_column_name='work_home_actual')
>>> metadata.features = [Feature(column_name='dist_from_office', label='office_distance',
...     description='Distance from home to the office', feature_type=FeatureType.CONTINUOUS),
...     Feature(column_name='salary_range', label='salary_range', feature_type=FeatureType.CATEGORICAL)]
>>> metadata.to_df()

abstract to_dict() → Dict[str, Any]: Converts a ModelMetadata instance into a dictionary.

class nannyml.metadata.base.ModelType(value)

Bases: str, enum.Enum

A listing of all possible model types.

The model type will determine which specific metadata properties are required by NannyML. Each ModelMetadata subclass will be associated with a specific ModelType.

CLASSIFICATION_BINARY = 'classification_binary'

CLASSIFICATION_MULTICLASS = 'classification_multiclass'

static parse(model_type_str: str): Returns a ModelType instance from a string representation.
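A string-backed enum with a static parse helper can be sketched as follows. The class name ModelKind is a hypothetical stand-in that mirrors the two values listed above, and raising ValueError for unknown strings is an assumption of this sketch, not documented behaviour of ModelType.parse.

```python
from enum import Enum


class ModelKind(str, Enum):
    """Hypothetical stand-in mirroring the documented ModelType values."""

    CLASSIFICATION_BINARY = 'classification_binary'
    CLASSIFICATION_MULTICLASS = 'classification_multiclass'

    @staticmethod
    def parse(model_type_str: str) -> "ModelKind":
        # Enum lookup by value; unknown strings raise ValueError (assumed behaviour).
        try:
            return ModelKind(model_type_str)
        except ValueError:
            valid = [member.value for member in ModelKind]
            raise ValueError(
                f"unknown model type '{model_type_str}', expected one of {valid}"
            ) from None


parsed = ModelKind.parse('classification_binary')
```

Because the enum also derives from str, its members compare equal to their plain string values, which is one reason a string such as 'classification_binary' can be used where a model type is expected.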