Configuration file
Locations
The nml CLI will look for configuration files called either nannyml.yaml or nann.yml in a number of preset locations. These presets can be overridden by telling NannyML where to look for your configuration. You can do this by using an environment variable or a command line argument.
The nml CLI will go over the possible options in the following order:
1. Evaluate the -c or --configuration-path command line argument.
When providing an explicit location to the nml CLI, the configuration file living at that location will always be prioritised above any preset location.

nml -c /path/to/nann.yml run
2. Evaluate the NML_CONFIG_PATH environment variable.
If the NML_CONFIG_PATH environment variable was found, its value will be interpreted as a path pointing to your config file.

export NML_CONFIG_PATH=/path/to/nann.yml
3. Look for nannyml.yaml or nann.yml in the /config directory.
This directory is unlikely to exist on your local system, but it is easy to use when mounting your configuration files into the NannyML docker container.

docker run -v /path/to/config/dir/:/config/ nannyml/nannyml nml run
4. Look for nannyml.yaml or nann.yml in the current working directory ($PWD).
When working on your local system you can just run the nml CLI in the same location as your nannyml.yaml or nann.yml file. Make sure you’ve activated your virtual environment when using one!
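For example, assuming your configuration file sits in the directory you run from (the directory name here is a placeholder):

cd /path/to/project    # contains nannyml.yaml or nann.yml
nml run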
Format
The configuration file format is broken down into multiple sections.
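As a rough orientation, a configuration file combining the sections described below could be laid out like this; the paths and values are placeholders, not defaults:

input:
  reference_data:
    path: /data/reference.csv        # see the input section
  analysis_data:
    path: /data/analysis.csv
output:
  raw_files:
    path: /data/out/                 # see the output section
column_mapping:                      # see the column mapping section
  timestamp: timestamp
  y_pred: y_pred
  y_true: y_true
chunker:                             # optional, see the chunker section
  chunk_size: 5000
problem_type: classification_binary  # see the standalone parameters section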
Input section
This section describes the input data for NannyML, i.e. the reference and analysis datasets.
The following snippet shows the basic form of pointing towards two local CSV files as reference and analysis data.
input:
  reference_data:
    path: /data/synthetic_sample_reference.csv
  analysis_data:
    path: /data/synthetic_sample_analysis.csv
You can also work with data living in cloud storage. We currently support reading data from S3 buckets (Amazon Web Services), GCS buckets (Google Cloud Platform) and ADLS or Azure Blob Storage (Microsoft Azure). We use the awesome fsspec project for this.
You can provide credentials to access these locations in a cloud-vendor specific way (e.g. by setting environment variables or providing config files such as .aws/credentials), or you can provide them in the configuration.
input:
  reference_data:
    path: s3://nml-data/synthetic_sample_reference.pq
    credentials:  # providing example AWS credentials
      client_kwargs:
        aws_access_key_id: 'ACCESS_KEY_ID'
        aws_secret_access_key: 'SECRET_ACCESS_KEY'
  analysis_data:
    path: gs://nml-data/synthetic_sample_analysis.pq
    credentials:  # providing example GCP credentials
      token: ml6-workshop-fa83b3d60b5d.json  # path to service account key file
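For ADLS or Azure Blob Storage a similar snippet can be used; the credential keys below mirror the ones shown for the store section further down and are assumptions to adapt to your own setup:

input:
  reference_data:
    path: abfs://nml-data/synthetic_sample_reference.pq
    credentials:  # example Azure credentials
      account_name: '<ACCOUNT_NAME>'
      account_key: '<ACCOUNT_KEY>'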
Any pandas.read_csv or pandas.read_parquet options can be passed along by providing them in the configuration using the read_args parameter.
input:
  reference_data:
    path: /data/synthetic_sample_reference.csv
    read_args:
      delimiter: ;
      chunksize: 100000
When target values are delivered separately you can specify these as an input as well. You must also provide a column used to join your target values with your analysis data.
input:
  reference_data:
    path: /data/synthetic_sample_reference.csv
  analysis_data:
    path: /data/synthetic_sample_analysis.csv
  target_data:
    path: /data/synthetic_sample_analysis_gt.csv
    join_column: identifier
Output section
The output section allows you to instruct NannyML on how and where to write the outputs of the calculations. We currently support writing data and plots to a local or cloud filesystem or exporting data to a relational database.
Warning
This is a very early release and additional ways of outputting data are on their way. This configuration section is likely to change significantly in the future.
Writing to filesystem
You can specify the folder to write outputs to using the path parameter. The optional format parameter allows you to choose the format to export the results DataFrames in. Allowed values are csv and parquet, with parquet being the default.
output:
  raw_files:
    path: /data/out/
    format: parquet
The output section supports the use of credentials:
output:
  raw_files:
    path: s3://nml-data/synthetic_sample_reference.pq
    credentials:  # providing example AWS credentials
      client_kwargs:
        aws_access_key_id: 'ACCESS_KEY_ID'
        aws_secret_access_key: 'SECRET_ACCESS_KEY'
The output format supports passing along any pandas.to_csv or pandas.to_parquet arguments using the write_args parameter.
output:
  raw_files:
    path: /data/out/
    format: csv
    write_args:
      header: False
Writing to a pickle file
NannyML supports directly pickling the Result objects returned by calculators and estimators. Use the following configuration to enable this:
output:
  pickle:
    path: /data/out/  # a *.pkl file will be written here by each calculator/estimator
Writing to a relational database
NannyML can also export its data to a relational database. When provided with a connection string NannyML will create the required table structure and insert calculator and estimator results in there.
Warning
Your data must contain a timestamp column in order to use this functionality.
There is a separate table for each calculator and estimator. The following sample from the cbpe_performance_metrics table illustrates their overall structure:
id | model_id | run_id | timestamp                  | metric_name | value              | alert
---|----------|--------|----------------------------|-------------|--------------------|------
1  | 2        | 4      | 2014-05-09 12:00:00.000000 | ROC AUC     | 0.9395984406102346 | false
2  | 2        | 4      | 2014-05-10 12:00:00.000000 | ROC AUC     | 0.9669333004887973 | false
3  | 2        | 4      | 2014-05-11 12:00:00.000000 | ROC AUC     | 0.9616566861394408 | false
4  | 2        | 4      | 2014-05-12 12:00:00.000000 | ROC AUC     | 0.9631921191605108 | false
5  | 2        | 4      | 2014-05-13 12:00:00.000000 | ROC AUC     | 0.9679918198658687 | false
6  | 2        | 4      | 2014-05-14 12:00:00.000000 | ROC AUC     | 0.9680751598579069 | false
7  | 2        | 4      | 2014-05-15 12:00:00.000000 | ROC AUC     | 0.9593668335222013 | false
8  | 2        | 4      | 2014-05-16 12:00:00.000000 | ROC AUC     | 0.964513389926401  | false
9  | 2        | 4      | 2014-05-17 12:00:00.000000 | ROC AUC     | 0.9674120045991212 | false
id is the database primary (technical) key, uniquely identifying each row.
model_id is a foreign key to the model table. It currently only contains a name for a model but having this allows you to filter on a model when performing queries or visualizing in dashboards.
run_id is a foreign key to the run table. It contains information about how and when NannyML was run. It also serves to filter metrics that were inserted during a given run, allowing you to easily remove these in case of errors.
timestamp is a timestamp created by finding the middle point of the start and end timestamps for each chunk. E.g. for a chunk starting at midnight and ending just before midnight of that day, the generated timestamp will be at noon.
metric_name is a column specific to some calculators and estimators. It contains the name of the metric that’s being calculated or estimated.
value contains the actual value that was being calculated. This might be a realized or estimated performance metric or a drift metric.
alert contains a boolean value (true or false) indicating whether the metric crossed a threshold, thus raising an alert.
upper_threshold contains the value of the upper threshold for the metric. Exceeding this value results in an alert.
lower_threshold contains the value of the lower threshold for the metric. Dropping below this value results in an alert.
feature_name is not listed here but is present in univariate calculator results. It contains the name of the feature the metric value belongs to.
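To illustrate the filtering described above, a query along the following lines could surface the alerting metrics for a single model; the table and column names are taken from the sample above, and the model_id value is a placeholder to adapt to your own data:

SELECT timestamp, metric_name, value
FROM cbpe_performance_metrics
WHERE model_id = 2
  AND alert = true
ORDER BY timestamp;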
We currently support all databases supported by SQLAlchemy. You can find more information on the required connection strings in their Engine Configuration. The following snippet illustrates how to configure the database export to a Postgres database running locally.
output:
  database:
    connection_string: postgresql://postgres:mysecretpassword@localhost:5432/postgres
    model_name: my regression model
Note the presence of the model_name value. It will ensure an entry for the given name is present in the model table (by either retrieving or creating it) and link it to the metrics using the model.id value as a foreign key.
This configuration is optional but recommended. Dropping this parameter results in the metrics being written without a model_id value, which makes them harder to link to a single given model.
Column mapping section
This section is responsible for teaching NannyML about your specific model: what its features, predictions, etc. are. You do this by providing a column mapping that associates a NannyML-specific meaning with your input data. For more information on this, check out the Data requirements documentation.
The following snippet lists the column mapping for the Synthetic Binary Classification Car Loan Dataset.
column_mapping:
  features:
    - car_value
    - salary_range
    - debt_to_income_ratio
    - loan_length
    - repaid_loan_on_prev_car
    - size_of_downpayment
    - tenure
  timestamp: timestamp
  y_pred: y_pred
  y_pred_proba: y_pred_proba
  y_true: repaid
This snippet shows how to set up the column mapping for the Synthetic Multiclass Classification Dataset.
column_mapping:
  features:
    - acq_channel
    - app_behavioral_score
    - requested_credit_limit
    - app_channel
    - credit_bureau_score
    - stated_income
    - is_customer
  timestamp: timestamp
  y_pred: y_pred
  y_pred_proba:
    prepaid_card: y_pred_proba_prepaid_card
    highstreet_card: y_pred_proba_highstreet_card
    upmarket_card: y_pred_proba_upmarket_card
  y_true: y_true
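For a regression model the mapping follows the same pattern but has no y_pred_proba entry. The following sketch mirrors the regression example at the end of this page:

column_mapping:
  features:
    - car_age
    - km_driven
    - price_new
    - accident_count
    - door_count
    - transmission
    - fuel
  timestamp: timestamp
  y_pred: y_pred
  y_true: y_true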
Store section
This section lets you set up a FilesystemStore for caching purposes. When a FilesystemStore is configured it will be used to store and load fitted calculators during the run. NannyML will use the store to try to load pre-fitted calculators. If none can be found, a new calculator will be created, fitted and persisted using the store. The next time NannyML is run using the same configuration file it will find the stored calculator and reuse it.
Check out the tutorial on storing and loading calculators to learn more.
This snippet shows how to set up the store in the configuration using the local filesystem:
store:
  file:
    path: /out/nml-cache/calculators
This snippet shows how to use S3:
store:
  file:
    path: s3://my-bucket/nml/cache/
    credentials:
      client_kwargs:
        aws_access_key_id: '<ACCESS_KEY_ID>'
        aws_secret_access_key: '<SECRET_ACCESS_KEY>'
This snippet shows how to use Google Cloud Storage:
store:
  file:
    path: gs://my-bucket/nml/cache/
    credentials:
      token: service-account-access-key.json
This snippet shows how to use Azure Blob Storage:
store:
  file:
    path: abfs://my-bucket/nml/cache/
    credentials:
      account_name: '<ACCOUNT_NAME>'
      account_key: '<ACCOUNT_KEY>'
Chunker section
The chunker section allows you to set the chunking behavior for all of the calculators and estimators that will be run.
Check the Chunking documentation for more information on the practice of chunking and the available Chunkers.
This section is optional and when it is absent NannyML will use a DefaultChunker instead.
chunker:
  chunk_size: 5000  # chunks of fixed size

chunker:
  chunk_period: W  # chunks grouping observations by week
Scheduling section
The scheduling section allows you to configure the schedule NannyML runs on. This section is optional; if it is absent NannyML will just run a single time, unscheduled.
There are currently two ways of scheduling in NannyML.
Interval scheduling allows you to set the interval between NannyML runs, such as every 6 hours or every 3 days. The available time increments are weeks, days, hours and minutes.
Cron scheduling allows you to leverage the widely known crontab expressions to control scheduling.
scheduling:
  interval:
    days: 1  # wait one day from the timestamp at which the command is run
scheduling:
  cron:
    crontab: "*/5 * * * *"  # every 5 minutes, so on 00:05, 00:10, 00:15, ...
Standalone parameters section
This section contains some standalone parameters that mostly serve as an alternative to CLI arguments.
The required problem_type variable allows you to pass along a ProblemType value. NannyML uses this information to better understand the provided model inputs and outputs.
problem_type: regression # pass the problem type (one of 'classification_binary', 'classification_multiclass' or 'regression')
ignore_errors: True # continue execution if a calculator/estimator fails
Templating paths
To use NannyML as a scheduled job we provide some support for path templating. This allows you to read data from and write data to locations that are based on timestamps.
The following example illustrates writing outputs to a 3-tiered directory structure for years, months and days. When NannyML is run as a daily scheduled job the results will be written to a different folder each day, preserving the outputs of previous runs.
output:
  path: /data/out/{{year}}/{{month}}/{{day}}
The following placeholders are currently supported:
minute
hour
day
weeknumber
month
year
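The same placeholders can be used for input paths, as in the multiclass example below. A brief sketch, with the bucket and file names as placeholders:

input:
  analysis_data:
    path: s3://nml-data/{{year}}/{{month}}/{{day}}/analysis.csv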
Examples
The following example contains the configuration required to run the nml CLI for the Synthetic Binary Classification Car Loan Dataset. All data is read from and written to the local filesystem.
input:
  reference_data:
    path: data/synthetic_sample_reference.csv
  analysis_data:
    path: data/synthetic_sample_analysis.csv
output:
  raw_files:
    path: out/
    format: parquet
column_mapping:
  features:
    - car_value
    - salary_range
    - debt_to_income_ratio
    - loan_length
    - repaid_loan_on_prev_car
    - size_of_downpayment
    - tenure
  timestamp: timestamp
  y_pred: y_pred
  y_pred_proba: y_pred_proba
  y_true: repaid
problem_type: classification_binary
ignore_errors: True
The following example contains the configuration used to run the nml CLI on the Synthetic Multiclass Classification Dataset.
Input data is read from one S3 bucket using templated paths. Targets have been provided separately - they are not present in the analysis data. The results are written to another S3 bucket, also using a templated path.
input:
  reference_data:
    path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_reference.csv
    credentials:
      client_kwargs:
        aws_access_key_id: 'DATA_ACCESS_KEY_ID'
        aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'
  analysis_data:
    path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_analysis.csv
    credentials:
      client_kwargs:
        aws_access_key_id: 'DATA_ACCESS_KEY_ID'
        aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'
  target_data:
    path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_analysis.csv
    join_column: identifier
    credentials:
      client_kwargs:
        aws_access_key_id: 'DATA_ACCESS_KEY_ID'
        aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'
output:
  raw_files:
    path: s3://nml-results/{{year}}/{{month}}/{{day}}
    format: parquet
    credentials:  # different credentials
      client_kwargs:
        aws_access_key_id: 'RESULTS_ACCESS_KEY_ID'
        aws_secret_access_key: 'RESULTS_SECRET_ACCESS_KEY'
chunker:
  chunk_size: 5000
column_mapping:
  features:
    - acq_channel
    - app_behavioral_score
    - requested_credit_limit
    - app_channel
    - credit_bureau_score
    - stated_income
    - is_customer
  timestamp: timestamp
  y_pred: y_pred
  y_pred_proba:
    prepaid_card: y_pred_proba_prepaid_card
    highstreet_card: y_pred_proba_highstreet_card
    upmarket_card: y_pred_proba_upmarket_card
  y_true: y_true
problem_type: classification_multiclass
ignore_errors: False
The following example contains the configuration required to run the nml CLI for the Synthetic Regression Dataset. The data is read from the local filesystem but written to an external database.
input:
  reference_data:
    path: data/regression_synthetic_reference.csv
  analysis_data:
    path: data/regression_synthetic_analysis.csv
  target_data:
    path: data/regression_synthetic_analysis_targets.csv
output:
  database:
    connection_string: postgresql://postgres:mysecretpassword@localhost:5432/postgres
    model_name: regression_car_price
problem_type: regression
chunker:
  chunk_period: D
column_mapping:
  features:
    - car_age
    - km_driven
    - price_new
    - accident_count
    - door_count
    - transmission
    - fuel
  timestamp: timestamp
  y_pred: y_pred
  y_true: y_true