.. _cli_configuration:

=======================================================
Configuration file
=======================================================

.. _cli_configuration_location:

Locations
---------------------------------------------

The ``nml`` CLI will look for configuration files called either ``nannyml.yaml`` or ``nann.yml`` in a number of
preset locations. These presets can be overridden by telling NannyML where to look for your configuration,
using either an environment variable or a command line argument.

The ``nml`` CLI will go over the possible options in the following order:

#. Evaluate the ``-c`` or ``--configuration-path`` command line argument

   When providing an explicit location to the ``nml`` CLI, the configuration file living at that location
   will always be prioritised above any preset location.

   .. code-block:: bash

       nml -c /path/to/nann.yml run

#. Evaluate the ``NML_CONFIG_PATH`` environment variable

   If the ``NML_CONFIG_PATH`` environment variable is set, its value will be interpreted as a path pointing
   to your config file.

   .. code-block:: bash

       export NML_CONFIG_PATH=/path/to/nann.yml

#. Look for ``nannyml.yaml`` or ``nann.yml`` in the ``/config`` directory

   This directory is unlikely to exist on your local system, but is easy to use when mounting your
   configuration files into the NannyML docker container.

   .. code-block:: bash

       docker run -v /path/to/config/dir/:/config/ nannyml/nannyml nml run

#. Look for ``nannyml.yaml`` or ``nann.yml`` in the current working ``$PWD`` directory

   When working on your local system you can just run the ``nml`` CLI in the same location as your
   ``nannyml.yaml`` or ``nann.yml`` file. Make sure you've activated your virtual environment when using one!

.. _cli_configuration_format:

Format
---------------------------------------------

The configuration file format is broken down into multiple sections.

Input section
*************

This section describes the input data for NannyML, i.e.
the ``reference`` and ``analysis`` datasets.
The following snippet shows the basic form of pointing towards two local CSV files as reference and analysis data.

.. code-block:: yaml

    input:
      reference_data:
        path: /data/synthetic_sample_reference.csv

      analysis_data:
        path: /data/synthetic_sample_analysis.csv

You can also work with data living in cloud storage. We currently support reading data from S3 buckets
(Amazon Web Services), GCS buckets (Google Cloud Platform) and ADLS or Azure Blob Storage (Microsoft Azure).
We use the awesome fsspec project for this.

You can provide credentials to access these locations either in a cloud-vendor-specific way
(e.g. by setting environment variables or providing config files like ``.aws/credentials``)
or directly in the configuration.

.. code-block:: yaml

    input:
      reference_data:
        path: s3://nml-data/synthetic_sample_reference.pq
        credentials:  # providing example AWS credentials
          client_kwargs:
            aws_access_key_id: 'ACCESS_KEY_ID'
            aws_secret_access_key: 'SECRET_ACCESS_KEY'

      analysis_data:
        path: gs://nml-data/synthetic_sample_analysis.pq
        credentials:  # providing example GCP credentials
          token: ml6-workshop-fa83b3d60b5d.json  # path to service account key file

Any ``pandas.read_csv`` or ``pandas.read_parquet`` options can be passed along by providing them in the
configuration using the ``read_args`` parameter.

.. code-block:: yaml

    input:
      reference_data:
        path: /data/synthetic_sample_reference.csv
        read_args:
          delimiter: ';'
          chunksize: 100000

When target values are delivered separately you can specify these as an input as well.
You must also provide a column used to join your target values with your analysis data.

.. code-block:: yaml

    input:
      reference_data:
        path: /data/synthetic_sample_reference.csv

      analysis_data:
        path: /data/synthetic_sample_analysis.csv

      target_data:
        path: /data/synthetic_sample_analysis_gt.csv
        join_column: identifier

Output section
**************

The output section allows you to instruct NannyML on how and where to write the outputs of the calculations.
We currently support writing data and plots to a local or cloud filesystem or exporting data to a relational database.

.. warning::

    This is a very early release and additional ways of outputting data are on their way.
    This configuration section will be prone to big changes in the future.

Writing to filesystem
""""""""""""""""""""""

You can specify the folder to write outputs to using the ``path`` parameter.
The optional ``format`` parameter allows you to choose the format in which the result DataFrames are exported.
Allowed values are ``csv`` and ``parquet``, with ``parquet`` being the default.

.. code-block:: yaml

    output:
      raw_files:
        path: /data/out/
        format: parquet

The output section supports the use of credentials:

.. code-block:: yaml

    output:
      raw_files:
        path: s3://nml-data/synthetic_sample_reference.pq
        credentials:  # providing example AWS credentials
          client_kwargs:
            aws_access_key_id: 'ACCESS_KEY_ID'
            aws_secret_access_key: 'SECRET_ACCESS_KEY'

Any ``pandas.DataFrame.to_csv`` or ``pandas.DataFrame.to_parquet`` options can be passed along
using the ``write_args`` parameter.

.. code-block:: yaml

    output:
      raw_files:
        path: /data/out/
        format: csv
        write_args:
          header: False

Writing to a pickle file
"""""""""""""""""""""""""

NannyML supports directly pickling the ``Result`` objects returned by calculators and estimators.
Use the following configuration to enable this:

.. code-block:: yaml

    output:
      pickle:
        path: /data/out/  # a *.pkl file will be written here by each calculator/estimator

Writing to a relational database
""""""""""""""""""""""""""""""""

NannyML can also export its data to a relational database.
When provided with a connection string NannyML will create the required table structure and insert calculator and
estimator results in there.

.. warning::

    Your data must contain a :term:`timestamp` column in order to use this functionality.

There is a separate table for each calculator and estimator. The following sample from the
``cbpe_performance_metrics`` table illustrates their overall structure:

+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| id  | model_id  | run_id  | timestamp                   | metric_name  | value               | alert  |
+=====+===========+=========+=============================+==============+=====================+========+
| 1   | 2         | 4       | 2014-05-09 12:00:00.000000  | ROC AUC      | 0.9395984406102346  | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| 2   | 2         | 4       | 2014-05-10 12:00:00.000000  | ROC AUC      | 0.9669333004887973  | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| 3   | 2         | 4       | 2014-05-11 12:00:00.000000  | ROC AUC      | 0.9616566861394408  | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| 4   | 2         | 4       | 2014-05-12 12:00:00.000000  | ROC AUC      | 0.9631921191605108  | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| 5   | 2         | 4       | 2014-05-13 12:00:00.000000  | ROC AUC      | 0.9679918198658687  | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| 6   | 2         | 4       | 2014-05-14 12:00:00.000000  | ROC AUC      | 0.9680751598579069  | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| 7   | 2         | 4       | 2014-05-15 12:00:00.000000  | ROC AUC      | 0.9593668335222013  | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| 8   | 2         | 4       | 2014-05-16 12:00:00.000000  | ROC AUC      | 0.964513389926401   | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+
| 9   | 2         | 4       | 2014-05-17 12:00:00.000000  | ROC AUC      | 0.9674120045991212  | false  |
+-----+-----------+---------+-----------------------------+--------------+---------------------+--------+

- **id** is the database primary (technical) key, uniquely identifying each row.
- **model_id** is a foreign key to the ``model`` table. That table currently only contains a name for a model,
  but having it allows you to filter on a model when performing queries or visualizing in dashboards.
- **run_id** is a foreign key to the ``run`` table. That table contains information about how and when NannyML
  was run. It also serves to filter metrics that were inserted during a given run, allowing you to easily remove
  these in case of errors.
- **timestamp** is a :term:`timestamp` created by finding the middle point of the start and end timestamps
  for each :term:`chunk`. E.g. for a chunk starting at midnight and ending just before midnight of that day,
  the generated timestamp will be at noon.
- **metric_name** is a column specific to some calculators and estimators. It contains the name of the metric
  that's being calculated or estimated.
- **value** contains the actual value that was calculated. This might be a realized or estimated
  performance metric or a drift metric.
- **alert** contains a boolean value (``true`` or ``false``) indicating whether the metric crossed a threshold,
  thus raising an alert.
- **upper_threshold** contains the value of the upper threshold for the metric. Exceeding this value results in
  an alert.
- **lower_threshold** contains the value of the lower threshold for the metric. Dropping below this value results
  in an alert.
- **feature_name** is not listed here but is present in univariate calculator results. It contains the name of the
  feature the metric value belongs to.

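
To make the table layout described above more concrete, here is a minimal Python sketch that rebuilds the first
rows of the sample table in an in-memory SQLite database and runs the two kinds of queries the column descriptions
mention: selecting the metrics of a given run and removing them again. It is an illustration only. The column
types are assumptions, SQLite merely stands in for your actual database, and the schema NannyML creates may
differ in its details.

.. code-block:: python

    import sqlite3

    # In-memory stand-in for the real database (assumption: your setup will differ).
    conn = sqlite3.connect(":memory:")

    # Column names follow the sample table; the column types are assumed.
    conn.execute(
        """
        CREATE TABLE cbpe_performance_metrics (
            id INTEGER PRIMARY KEY,
            model_id INTEGER,
            run_id INTEGER,
            timestamp TEXT,
            metric_name TEXT,
            value REAL,
            alert BOOLEAN
        )
        """
    )

    # The first three rows of the sample table.
    rows = [
        (1, 2, 4, "2014-05-09 12:00:00.000000", "ROC AUC", 0.9395984406102346, False),
        (2, 2, 4, "2014-05-10 12:00:00.000000", "ROC AUC", 0.9669333004887973, False),
        (3, 2, 4, "2014-05-11 12:00:00.000000", "ROC AUC", 0.9616566861394408, False),
    ]
    conn.executemany(
        "INSERT INTO cbpe_performance_metrics VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )

    # Select one metric for a single run, ordered by chunk timestamp.
    run_metrics = conn.execute(
        "SELECT timestamp, value FROM cbpe_performance_metrics "
        "WHERE run_id = ? AND metric_name = ? ORDER BY timestamp",
        (4, "ROC AUC"),
    ).fetchall()
    for timestamp, value in run_metrics:
        print(timestamp, value)

    # Remove everything a faulty run inserted, using run_id as the filter.
    deleted = conn.execute(
        "DELETE FROM cbpe_performance_metrics WHERE run_id = ?", (4,)
    ).rowcount
    remaining = conn.execute(
        "SELECT COUNT(*) FROM cbpe_performance_metrics"
    ).fetchone()[0]

Against a real database you would open the connection with your own driver or engine instead of
``sqlite3.connect``; the queries themselves only rely on the columns shown in the sample.
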

We currently support all databases supported by SQLAlchemy. You can find more information on the required
connection strings in the SQLAlchemy *Engine Configuration* documentation.

The following snippet illustrates how to configure the database export to a Postgres database running locally.

.. code-block:: yaml

    output:
      database:
        connection_string: postgresql://postgres:mysecretpassword@localhost:5432/postgres
        model_name: my regression model

Note the presence of the ``model_name`` value. It will ensure an entry for the given name is present in the
``model`` table (by either retrieving or creating it) and link it to the metrics using the ``model.id`` value
as a foreign key. This configuration is optional but recommended. Omitting this parameter results in the metrics
being written without a ``model_id`` value, which makes them harder to link to a single given model.

Column mapping section
***********************

This section is responsible for teaching NannyML about your specific model: what its features, predictions
and so on are. You do this by providing a column mapping that associates a NannyML-specific meaning with the
columns of your input data. For more information on this, check out the :ref:`data_requirements` documentation.

The following snippet lists the column mapping for the :ref:`dataset-synthetic-binary-car-loan`.

.. code-block:: yaml

    column_mapping:
      features:
        - car_value
        - salary_range
        - debt_to_income_ratio
        - loan_length
        - repaid_loan_on_prev_car
        - size_of_downpayment
        - tenure
      timestamp: timestamp
      y_pred: y_pred
      y_pred_proba: y_pred_proba
      y_true: repaid

This snippet shows how to set up the column mapping for the :ref:`dataset-synthetic-multiclass`.

.. code-block:: yaml

    column_mapping:
      features:
        - acq_channel
        - app_behavioral_score
        - requested_credit_limit
        - app_channel
        - credit_bureau_score
        - stated_income
        - is_customer
      timestamp: timestamp
      y_pred: y_pred
      y_pred_proba:
        prepaid_card: y_pred_proba_prepaid_card
        highstreet_card: y_pred_proba_highstreet_card
        upmarket_card: y_pred_proba_upmarket_card
      y_true: y_true

.. _cli_configuration_store:

Store section
*****************

This section lets you set up a :class:`~nannyml.io.store.file_store.FilesystemStore` for caching purposes.

When a :class:`~nannyml.io.store.file_store.FilesystemStore` is configured it will be used to store and load
fitted calculators during the run. NannyML will use the store to try to load pre-fitted calculators.
If none can be found a new calculator will be created, fitted and persisted using the store.
The next time NannyML is run using the same configuration file it will find the stored calculator and reuse it.

Check out the :ref:`tutorial on storing and loading calculators` to learn more.

This snippet shows how to set up the store in configuration using the local filesystem:

.. code-block:: yaml

    store:
      file:
        path: /out/nml-cache/calculators

This snippet shows how to use S3:

.. code-block:: yaml

    store:
      file:
        path: s3://my-bucket/nml/cache/
        credentials:
          client_kwargs:
            aws_access_key_id: ''
            aws_secret_access_key: ''

This snippet shows how to use Google Cloud Storage:

.. code-block:: yaml

    store:
      file:
        path: gs://my-bucket/nml/cache/
        credentials:
          token: service-account-access-key.json

This snippet shows how to use Azure Blob Storage:

.. code-block:: yaml

    store:
      file:
        path: abfs://my-bucket/nml/cache/
        credentials:
          account_name: ''
          account_key: ''

Chunker section
*****************

The chunker section allows you to set the chunking behavior for all of the calculators and estimators that will be
run. Check the :ref:`chunking` documentation for more information on the practice of chunking and the available
``Chunkers``.
This section is optional and when it is absent NannyML will use a :class:`~nannyml.chunking.DefaultChunker` instead.

.. code-block:: yaml

    chunker:
      chunk_size: 5000  # chunks of fixed size

.. code-block:: yaml

    chunker:
      chunk_period: W  # chunks grouping observations by week

Scheduling section
*******************

The scheduling section allows you to configure the schedule NannyML is to run on.
This section is optional and if none is found NannyML will just run a single time, unscheduled.

There are currently two ways of scheduling in NannyML.

- **Interval** scheduling allows you to set the interval between NannyML runs, such as *every 6 hours* or
  *every 3 days*. The available time increments are ``weeks``, ``days``, ``hours`` and ``minutes``.
- **Cron** scheduling allows you to leverage the widely known ``crontab`` expressions to control scheduling.

.. code-block:: yaml
    :caption: Interval based scheduling configuration

    scheduling:
      interval:
        days: 1  # wait one day from the timestamp at which the command is run

.. code-block:: yaml
    :caption: ``cron`` based scheduling configuration

    scheduling:
      cron:
        crontab: "*/5 * * * *"  # every 5 minutes, so on 00:05, 00:10, 00:15, ...

Standalone parameters section
*****************************

This section contains some standalone parameters that mostly serve as an alternative to CLI arguments.

The required ``problem_type`` variable allows you to pass along a :class:`~nannyml._typing.ProblemType` value.
NannyML uses this information to better understand the provided model inputs and outputs.

.. code-block:: yaml

    problem_type: regression  # pass the problem type (one of 'classification_binary', 'classification_multiclass' or 'regression')

The ``ignore_errors`` variable controls whether execution continues when a calculator or estimator fails.

.. code-block:: yaml

    ignore_errors: True  # continue execution if a calculator/estimator fails

Templating paths
--------------------------------------------

To use NannyML as a scheduled job we provide some support for path templating.
This allows you to read data from and write data to locations that are based on timestamps.

The following example illustrates writing outputs to a 3-tiered directory structure for years, months and days.
When NannyML is run as a daily scheduled job the results will be written to a different folder each day,
preserving the outputs of previous runs.

.. code-block:: yaml

    output:
      raw_files:
        path: /data/out/{{year}}/{{month}}/{{day}}

The following placeholders are currently supported:

- ``minute``
- ``hour``
- ``day``
- ``weeknumber``
- ``month``
- ``year``

Examples
--------------------------------------------

The following example contains the configuration required to run the ``nml`` CLI for the
:ref:`dataset-synthetic-binary-car-loan`. All data is read from and written to the local filesystem.

.. code-block:: yaml

    input:
      reference_data:
        path: data/synthetic_sample_reference.csv

      analysis_data:
        path: data/synthetic_sample_analysis.csv

    output:
      raw_files:
        path: out/
        format: parquet

    column_mapping:
      features:
        - car_value
        - salary_range
        - debt_to_income_ratio
        - loan_length
        - repaid_loan_on_prev_car
        - size_of_downpayment
        - tenure
      timestamp: timestamp
      y_pred: y_pred
      y_pred_proba: y_pred_proba
      y_true: repaid

    problem_type: classification_binary

    ignore_errors: True

The following example contains the configuration used to run the ``nml`` CLI on the
:ref:`dataset-synthetic-multiclass`. Input data is read from one S3 bucket using templated paths.
Targets have been provided separately - they are not present in the analysis data.
The results are written to another S3 bucket, also using a templated path.

.. code-block:: yaml

    input:
      reference_data:
        path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_reference.csv
        credentials:
          client_kwargs:
            aws_access_key_id: 'DATA_ACCESS_KEY_ID'
            aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'

      analysis_data:
        path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_analysis.csv
        credentials:
          client_kwargs:
            aws_access_key_id: 'DATA_ACCESS_KEY_ID'
            aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'

      target_data:
        path: s3://nml-data/{{year}}/{{month}}/{{day}}/mc_analysis.csv
        join_column: identifier
        credentials:
          client_kwargs:
            aws_access_key_id: 'DATA_ACCESS_KEY_ID'
            aws_secret_access_key: 'DATA_SECRET_ACCESS_KEY'

    output:
      raw_files:
        path: s3://nml-results/{{year}}/{{month}}/{{day}}
        format: parquet
        credentials:  # different credentials
          client_kwargs:
            aws_access_key_id: 'RESULTS_ACCESS_KEY_ID'
            aws_secret_access_key: 'RESULTS_SECRET_ACCESS_KEY'

    chunker:
      chunk_size: 5000

    column_mapping:
      features:
        - acq_channel
        - app_behavioral_score
        - requested_credit_limit
        - app_channel
        - credit_bureau_score
        - stated_income
        - is_customer
      timestamp: timestamp
      y_pred: y_pred
      y_pred_proba:
        prepaid_card: y_pred_proba_prepaid_card
        highstreet_card: y_pred_proba_highstreet_card
        upmarket_card: y_pred_proba_upmarket_card
      y_true: y_true

    problem_type: classification_multiclass

    ignore_errors: False

The following example contains the configuration required to run the ``nml`` CLI for the
:ref:`dataset-synthetic-regression`. The data is read from the local filesystem but written to an external database.

.. code-block:: yaml

    input:
      reference_data:
        path: data/regression_synthetic_reference.csv

      analysis_data:
        path: data/regression_synthetic_analysis.csv

      target_data:
        path: data/regression_synthetic_analysis_targets.csv

    output:
      database:
        connection_string: postgresql://postgres:mysecretpassword@localhost:5432/postgres
        model_name: regression_car_price

    problem_type: regression

    chunker:
      chunk_period: D

    column_mapping:
      features:
        - car_age
        - km_driven
        - price_new
        - accident_count
        - door_count
        - transmission
        - fuel
      timestamp: timestamp
      y_pred: y_pred
      y_true: y_true

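
As a closing illustration of the templated paths used in the examples, the following Python sketch shows one way
``{{year}}``-style placeholders could be expanded from the time of a run. It is not NannyML's implementation:
the ``render_path`` helper is hypothetical, and the zero-padded values and the ``%W`` week numbering are
assumptions made for the sake of the example.

.. code-block:: python

    import re
    from datetime import datetime

    # Supported placeholder names, mapped to strftime codes.
    # Assumption: zero-padded values and Monday-based week numbers (%W).
    PLACEHOLDER_FORMATS = {
        "minute": "%M",
        "hour": "%H",
        "day": "%d",
        "weeknumber": "%W",
        "month": "%m",
        "year": "%Y",
    }

    # Matches {{name}}, allowing optional whitespace inside the braces.
    PLACEHOLDER_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


    def render_path(template: str, run_time: datetime) -> str:
        """Hypothetical helper: fill in every placeholder using run_time."""

        def substitute(match: re.Match) -> str:
            name = match.group(1)
            if name not in PLACEHOLDER_FORMATS:
                raise KeyError(f"unsupported placeholder: {name}")
            return run_time.strftime(PLACEHOLDER_FORMATS[name])

        return PLACEHOLDER_PATTERN.sub(substitute, template)


    run_time = datetime(2023, 5, 9, 14, 30)
    rendered = render_path("s3://nml-results/{{year}}/{{month}}/{{day}}", run_time)
    print(rendered)

Under these assumptions a run on 9 May 2023 renders the path as ``s3://nml-results/2023/05/09``, so each daily
run ends up in its own folder.
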