Configuration file
Locations
The nml CLI will look for configuration files called either nannyml.yaml or nann.yml in a number of preset locations. These presets can be overridden by telling NannyML where to look for your configuration. You can do this by using an environment variable or a command line argument.
The nml CLI will go over the possible options in the following order:
1. Evaluate the -c or --configuration-path command line argument.
When providing an explicit location to the nml CLI, the configuration file living at that location will always be prioritised above any preset location.

nml -c /path/to/nann.yml run
2. Evaluate the NML_CONFIG_PATH environment variable.
If the NML_CONFIG_PATH environment variable was found, its value will be interpreted as a path pointing to your config file.

export NML_CONFIG_PATH=/path/to/nann.yml
3. Look for nannyml.yaml or nann.yml in the /config directory.
This directory is unlikely to exist on your local system, but it is easy to use when mounting your configuration files into the NannyML docker container.

docker run -v /path/to/config/dir/:/config/ nannyml/nannyml nml run
4. Look for nannyml.yaml or nann.yml in the current working directory ($PWD).
When working on your local system you can just run the nml CLI in the same location as your nannyml.yaml or nann.yml file. Make sure you’ve activated your virtual environment when using one!
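For example, assuming your configuration file sits in the directory you run from (the directory name here is a placeholder):

cd /path/to/project    # contains nannyml.yaml or nann.yml
nml run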
Format
The configuration file format is broken down into multiple sections.
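As a rough orientation, a configuration file combining the sections described below could be laid out like this; the paths and values are placeholders, not defaults:

input:
  reference_data:
    path: /data/reference.csv        # see the input section
  analysis_data:
    path: /data/analysis.csv
output:
  raw_files:
    path: /data/out/                 # see the output section
column_mapping:                      # see the column mapping section
  timestamp: timestamp
  y_pred: y_pred
  y_true: y_true
chunker:                             # optional, see the chunker section
  chunk_size: 5000
problem_type: classification_binary  # see the standalone parameters section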
Input section
This section describes the input data for NannyML, i.e. the reference and analysis datasets.
The following snippet shows the basic form of pointing towards two local CSV files as reference and analysis data.
input:
  reference_data:
    path: /data/synthetic_sample_reference.csv
  analysis_data:
    path: /data/synthetic_sample_analysis.csv
You can also work with data living in cloud storage. We currently support reading data from S3 buckets (Amazon Web Services), GCS buckets (Google Cloud Platform) and ADLS or Azure Blob Storage (Microsoft Azure). We use the awesome fsspec project for this.
You can provide credentials to access these locations in a cloud-vendor specific way (e.g. by setting environment variables or providing config files such as .aws/credentials), or you can provide them in the configuration.
input:
  reference_data:
    path: s3://nml-data/synthetic_sample_reference.pq
    credentials:  # providing example AWS credentials
      client_kwargs:
        aws_access_key_id: 'ACCESS_KEY_ID'
        aws_secret_access_key: 'SECRET_ACCESS_KEY'
  analysis_data:
    path: gs://nml-data/synthetic_sample_analysis.pq
    credentials:  # providing example GCP credentials
      token: ml6-workshop-fa83b3d60b5d.json  # path to service account key file
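For ADLS or Azure Blob Storage a similar snippet can be used; the credential keys below mirror the ones shown for the store section further down and are assumptions to adapt to your own setup:

input:
  reference_data:
    path: abfs://nml-data/synthetic_sample_reference.pq
    credentials:  # example Azure credentials
      account_name: '<ACCOUNT_NAME>'
      account_key: '<ACCOUNT_KEY>'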
Any pandas.read_csv or pandas.read_parquet options can be passed along by providing them in the configuration using the read_args parameter.
input:
  reference_data:
    path: /data/synthetic_sample_reference.csv
    read_args:
      delimiter: ;
      chunksize: 100000
When target values are delivered separately you can specify these as an input as well. You must also provide a column used to join your target values with your analysis data.
input:
  reference_data:
    path: /data/synthetic_sample_reference.csv
  analysis_data:
    path: /data/synthetic_sample_analysis.csv
  target_data:
    path: /data/synthetic_sample_analysis_gt.csv
    join_column: identifier
Output section
The output section allows you to instruct NannyML on how and where to write the outputs of the calculations. We currently support writing data and plots to a local or cloud filesystem or exporting data to a relational database.
Warning
This is a very early release and additional ways of outputting data are on their way. This configuration section is likely to change significantly in the future.
Writing to filesystem
You can specify the folder to write outputs to using the path parameter. The optional format parameter allows you to choose the format to export the results DataFrames in. Allowed values are csv and parquet, with parquet being the default.
output:
  raw_files:
    path: /data/out/
    format: parquet
The output section supports the use of credentials:
output:
  raw_files:
    path: s3://nml-data/synthetic_sample_reference.pq
    credentials:  # providing example AWS credentials
      client_kwargs:
        aws_access_key_id: 'ACCESS_KEY_ID'
        aws_secret_access_key: 'SECRET_ACCESS_KEY'
The output format supports passing along any pandas.to_csv or pandas.to_parquet arguments using the write_args parameter.
output:
  raw_files:
    path: /data/out/
    format: csv
    write_args:
      header: False
Writing to a pickle file
NannyML supports directly pickling the Result objects returned by calculators and estimators. Use the following configuration to enable this:
output:
  pickle:
    path: /data/out/  # a *.pkl file will be written here by each calculator/estimator
Writing to a relational database
NannyML can also export its data to a relational database. When provided with a connection string NannyML will create the required table structure and insert calculator and estimator results in there.
Warning
Your data must contain a timestamp column in order to use this functionality.
There is a separate table for each calculator and estimator. The following sample from the cbpe_performance_metrics table illustrates their overall structure:
id | model_id | run_id | timestamp                  | metric_name | value              | alert
---|----------|--------|----------------------------|-------------|--------------------|------
1  | 2        | 4      | 2014-05-09 12:00:00.000000 | ROC AUC     | 0.9395984406102346 | false
2  | 2        | 4      | 2014-05-10 12:00:00.000000 | ROC AUC     | 0.9669333004887973 | false
3  | 2        | 4      | 2014-05-11 12:00:00.000000 | ROC AUC     | 0.9616566861394408 | false
4  | 2        | 4      | 2014-05-12 12:00:00.000000 | ROC AUC     | 0.9631921191605108 | false
5  | 2        | 4      | 2014-05-13 12:00:00.000000 | ROC AUC     | 0.9679918198658687 | false
6  | 2        | 4      | 2014-05-14 12:00:00.000000 | ROC AUC     | 0.9680751598579069 | false
7  | 2        | 4      | 2014-05-15 12:00:00.000000 | ROC AUC     | 0.9593668335222013 | false
8  | 2        | 4      | 2014-05-16 12:00:00.000000 | ROC AUC     | 0.964513389926401  | false
9  | 2        | 4      | 2014-05-17 12:00:00.000000 | ROC AUC     | 0.9674120045991212 | false
id is the database primary (technical) key, uniquely identifying each row.
model_id is a foreign key to the model table. It currently only contains a name for a model but having this allows you to filter on a model when performing queries or visualizing in dashboards.
run_id is a foreign key to the run table. It contains information about how and when NannyML was run. It also serves to filter metrics that were inserted during a given run, allowing you to easily remove these in case of errors.
timestamp is a timestamp created by finding the middle point of the start and end timestamps for each chunk. E.g. for a chunk starting at midnight and ending just before midnight of that day, the generated timestamp will be at noon.
metric_name is a column specific to some calculators and estimators. It contains the name of the metric that’s being calculated or estimated.
value contains the actual value that was being calculated. This might be a realized or estimated performance metric or a drift metric.
alert contains a boolean value (true or false) indicating whether the metric crossed a threshold, thus raising an alert.
upper_threshold contains the value of the upper threshold for the metric. Exceeding this value results in an alert.
lower_threshold contains the value of the lower threshold for the metric. Dropping below this value results in an alert.
feature_name is not listed here but is present in univariate calculator results. It contains the name of the feature the metric value belongs to.
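To illustrate the filtering described above, a query along the following lines could surface the alerting metrics for a single model; the table and column names are taken from the sample above, and the model_id value is a placeholder to adapt to your own data:

SELECT timestamp, metric_name, value
FROM cbpe_performance_metrics
WHERE model_id = 2
  AND alert = true
ORDER BY timestamp;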
We currently support all databases supported by SQLAlchemy. You can find more information on the required connection strings in their Engine Configuration. The following snippet illustrates how to configure the database export to a Postgres database running locally.
output:
  database:
    connection_string: postgresql://postgres:mysecretpassword@localhost:5432/postgres
    model_name: my regression model
Note the presence of the model_name value. It will ensure an entry for the given name is present in the model table (by either retrieving or creating it) and link it to the metrics using the model.id value as a foreign key.
This configuration is optional but recommended. Dropping this parameter results in the metrics being written without a model_id value, which makes them harder to link to a single given model.
Column mapping section
This section is responsible for teaching NannyML about your specific model: what its features, predictions, etc. are. You do this by providing a column mapping that associates a NannyML-specific meaning with your input data. For more information on this, check out the Data requirements documentation.
The following snippet lists the column mapping for the Synthetic Binary Classification Car Loan Dataset.
column_mapping:
  features:
    - car_value
    - salary_range
    - debt_to_income_ratio
    - loan_length
    - repaid_loan_on_prev_car
    - size_of_downpayment
    - tenure
  timestamp: timestamp
  y_pred: y_pred
  y_pred_proba: y_pred_proba
  y_true: repaid
This snippet shows how to set up the column mapping for the Synthetic Multiclass Classification Dataset.
column_mapping:
  features:
    - acq_channel
    - app_behavioral_score
    - requested_credit_limit
    - app_channel
    - credit_bureau_score
    - stated_income
    - is_customer
  timestamp: timestamp
  y_pred: y_pred
  y_pred_proba:
    prepaid_card: y_pred_proba_prepaid_card
    highstreet_card: y_pred_proba_highstreet_card
    upmarket_card: y_pred_proba_upmarket_card
  y_true: y_true
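For a regression model the mapping follows the same pattern but has no y_pred_proba entry. The following sketch mirrors the regression example at the end of this page:

column_mapping:
  features:
    - car_age
    - km_driven
    - price_new
    - accident_count
    - door_count
    - transmission
    - fuel
  timestamp: timestamp
  y_pred: y_pred
  y_true: y_true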
Store section
This section lets you set up a FilesystemStore for caching purposes. When a FilesystemStore is configured it will be used to store and load fitted calculators during the run. NannyML will use the store to try to load pre-fitted calculators. If none can be found, a new calculator will be created, fitted and persisted using the store. The next time NannyML is run using the same configuration file it will find the stored calculator and reuse it.
Check out the tutorial on storing and loading calculators to learn more.
This snippet shows how to set up the store in the configuration using the local filesystem:
store:
  file:
    path: /out/nml-cache/calculators
This snippet shows how to use S3:
store:
  file:
    path: s3://my-bucket/nml/cache/
    credentials:
      client_kwargs:
        aws_access_key_id: '<ACCESS_KEY_ID>'
        aws_secret_access_key: '<SECRET_ACCESS_KEY>'
This snippet shows how to use Google Cloud Storage:
store:
  file:
    path: gs://my-bucket/nml/cache/
    credentials:
      token: service-account-access-key.json
This snippet shows how to use Azure Blob Storage:
store:
  file:
    path: abfs://my-bucket/nml/cache/
    credentials:
      account_name: '<ACCOUNT_NAME>'
      account_key: '<ACCOUNT_KEY>'
Chunker section
The chunker section allows you to set the chunking behavior for all of the calculators and estimators that will be run.
Check the Chunking documentation for more information on the practice of chunking and the available Chunkers.
This section is optional and when it is absent NannyML will use a DefaultChunker instead.
chunker:
  chunk_size: 5000  # chunks of fixed size

chunker:
  chunk_period: W  # chunks grouping observations by week
Scheduling section
The scheduling section allows you to configure the schedule NannyML runs on. This section is optional; if it is absent NannyML will just run a single time, unscheduled.
There are currently two ways of scheduling in NannyML.
Interval scheduling allows you to set the interval between NannyML runs, such as every 6 hours or every 3 days. The available time increments are weeks, days, hours and minutes.
Cron scheduling allows you to leverage the widely known crontab expressions to control scheduling.
scheduling:
  interval:
    days: 1  # wait one day from the timestamp at which the command is run
scheduling:
  cron:
    crontab: "*/5 * * * *"  # every 5 minutes, so on 00:05, 00:10, 00:15, ...
Standalone parameters section
This section contains some standalone parameters that mostly serve as an alternative to CLI arguments.
The required problem_type variable allows you to pass along a ProblemType value. NannyML uses this information to better understand the provided model inputs and outputs.
problem_type: regression # pass the problem type (one of 'classification_binary', 'classification_multiclass' or 'regression')
ignore_errors: True # continue execution if a calculator/estimator fails
Templating paths
To use NannyML as a scheduled job we provide some support for path templating. This allows you to read data from and write data to locations that are based on timestamps.
The following example illustrates writing outputs to a 3-tiered directory structure for years, months and days. When NannyML is run as a daily scheduled job the results will be written to a different folder each day, preserving the outputs of previous runs.
output:
  path: /data/out/{{year}}/{{month}}/{{day}}
The following placeholders are currently supported:
minute
hour
day
weeknumber
month
year
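The same placeholders can be used for input paths, as in the multiclass example below. A brief sketch, with the bucket and file names as placeholders:

input:
  analysis_data:
    path: s3://nml-data/{{year}}/{{month}}/{{day}}/analysis.csv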
Examples
The following example contains the configuration required to run the nml CLI for the Synthetic Binary Classification Car Loan Dataset. All data is read from and written to the local filesystem.
input:
  reference_data:
    path: data/synthetic_sample_reference.csv
  analysis_data:
    path: data/synthetic_sample_analysis.csv
output:
  raw_files:
    path: out/
    format: parquet
column_mapping:
  features:
    - car_value
    - salary_range
    - debt_to_income_ratio
    - loan_length
    - repaid_loan_on_prev_car
    - size_of_downpayment
    - tenure
  timestamp: timestamp
  y_pred: y_pred
  y_pred_proba: y_pred_proba
  y_true: repaid
problem_type: classification_binary
ignore_errors: True
The following example contains the configuration used to run the nml CLI on the Synthetic Multiclass Classification Dataset.
Input data is read from one S3 bucket using templated paths. Targets have been provided separately - they are not present in the analysis data. The results are written to another S3 bucket, also using a templated path.
input:
  reference_data:
    path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_reference.csv
    credentials:
      client_kwargs:
        aws_access_key_id: 'DATA_ACCESS_KEY_ID'
        aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'
  analysis_data:
    path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_analysis.csv
    credentials:
      client_kwargs:
        aws_access_key_id: 'DATA_ACCESS_KEY_ID'
        aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'
  target_data:
    path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_analysis.csv
    join_column: identifier
    credentials:
      client_kwargs:
        aws_access_key_id: 'DATA_ACCESS_KEY_ID'
        aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'
output:
  raw_files:
    path: s3://nml-results/{{year}}/{{month}}/{{day}}
    format: parquet
    credentials:  # different credentials
      client_kwargs:
        aws_access_key_id: 'RESULTS_ACCESS_KEY_ID'
        aws_secret_access_key: 'RESULTS_SECRET_ACCESS_KEY'
chunker:
  chunk_size: 5000
column_mapping:
  features:
    - acq_channel
    - app_behavioral_score
    - requested_credit_limit
    - app_channel
    - credit_bureau_score
    - stated_income
    - is_customer
  timestamp: timestamp
  y_pred: y_pred
  y_pred_proba:
    prepaid_card: y_pred_proba_prepaid_card
    highstreet_card: y_pred_proba_highstreet_card
    upmarket_card: y_pred_proba_upmarket_card
  y_true: y_true
problem_type: classification_multiclass
ignore_errors: False
The following example contains the configuration required to run the nml CLI for the Synthetic Regression Dataset. The data is read from the local filesystem but written to an external database.
input:
  reference_data:
    path: data/regression_synthetic_reference.csv
  analysis_data:
    path: data/regression_synthetic_analysis.csv
  target_data:
    path: data/regression_synthetic_analysis_targets.csv
output:
  database:
    connection_string: postgresql://postgres:mysecretpassword@localhost:5432/postgres
    model_name: regression_car_price
problem_type: regression
chunker:
  chunk_period: D
column_mapping:
  features:
    - car_age
    - km_driven
    - price_new
    - accident_count
    - door_count
    - transmission
    - fuel
  timestamp: timestamp
  y_pred: y_pred
  y_true: y_true