Glossary

Alert

An alert refers to a variable at a particular chunk that gets flagged for possible data drift. The alert is raised after the drift functionality of NannyML finds the drift characteristics for this variable and chunk to be suspect.

Butterfly dataset

A dataset used in Data Reconstruction with PCA to give an example where univariate drift statistics are insufficient in detecting complex data drifts in multidimensional data.

CBPE (Confidence-Based Performance Estimation)

A family of methods that estimate model performance in the absence of ground truth by taking advantage of the confidence expressed in the monitored model's predicted probabilities or scores.
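To make the idea concrete, here is a minimal sketch of how accuracy could be estimated from predicted probabilities alone. It assumes the probabilities are well calibrated and uses a hypothetical 0.5 decision threshold; the variable names are illustrative and this is not NannyML's implementation.

```python
# Hypothetical calibrated probabilities of the positive class.
probabilities = [0.9, 0.8, 0.3, 0.6, 0.1]

# Assumed decision threshold of 0.5 to turn probabilities into labels.
predicted_labels = [1 if p >= 0.5 else 0 for p in probabilities]

# If the probabilities are calibrated, a positive prediction is correct
# with probability p and a negative prediction with probability 1 - p.
p_correct = [
    p if label == 1 else 1 - p
    for p, label in zip(probabilities, predicted_labels)
]

# The expected accuracy is the average probability of being correct.
estimated_accuracy = sum(p_correct) / len(p_correct)
print(f"Estimated accuracy: {estimated_accuracy:.2f}")
```

No target values are used anywhere in the calculation, which is why the estimate can be produced as soon as predictions are available.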

Chi Squared test

The Chi Squared test, or chi2 test as it is sometimes called, is a nonparametric statistical test for discrete distributions. It is used to examine whether there is a statistically significant difference between expected and observed frequencies for one or more categories of a contingency table. In NannyML we use the Chi Squared test to answer whether two samples of a categorical variable come from different distributions.

The Chi Squared test results include the chi squared statistic and a p-value. The bigger the chi squared statistic, the more the two samples we are comparing differ. The p-value is the probability of obtaining a result at least as extreme as the one observed, provided both samples come from the same distribution.

You can find more information on the Wikipedia Chi-squared test page. At NannyML we use the scipy implementation of the Chi-square test of independence of variables in a contingency table.
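As an illustration, the comparison of two categorical samples can be sketched with `scipy.stats.chi2_contingency`. The sample values and the names `reference` and `analysis` are made up for this example, and the snippet only shows the statistical test, not how NannyML calls it internally.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical samples of one categorical variable.
reference = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
analysis = ["a"] * 20 + ["b"] * 30 + ["c"] * 50

# Build a contingency table: one row per sample, one column per category.
samples = pd.DataFrame({
    "period": ["reference"] * len(reference) + ["analysis"] * len(analysis),
    "value": reference + analysis,
})
contingency_table = pd.crosstab(samples["period"], samples["value"])

# Run the test of independence on the observed frequencies.
statistic, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"chi2 statistic: {statistic:.2f}, p-value: {p_value:.6f}")
```

A small p-value suggests that the category frequencies differ between the two samples by more than chance alone would explain.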

Concept Drift

A change in the underlying pattern (or mapping) between the Model Inputs and the Target (P(y|X)).

Data Drift

A change in the joint distribution of Model Inputs (P(X)).

Data Chunk

A data chunk is simply a sample of data. All the results generated by NannyML are calculated and presented at the chunk level, i.e. a chunk is a single data point in the monitoring results. Chunks are usually created based on time periods: they contain all the observations and predictions from a single hour, day, month etc., depending on the selected interval. They can also be size-based, so that each chunk contains n observations, or number-based, so that the whole dataset is split into k chunks. In each case the chronological order of data between chunks is maintained.
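The three chunking strategies can be sketched with plain pandas. The column names, the values of `n` and `k`, and the data are all hypothetical; this only illustrates the concept and does not use NannyML's own chunking API.

```python
import pandas as pd

# Hypothetical data: one prediction per hour over four days.
start = pd.Timestamp("2021-10-13")
data = pd.DataFrame({
    "timestamp": start + pd.to_timedelta(list(range(96)), unit="h"),
    "prediction": range(96),
})

# Sort once so that chronology is preserved within and between chunks.
data = data.sort_values("timestamp").reset_index(drop=True)

# Time-based: one chunk per calendar day.
time_chunks = [chunk for _, chunk in data.groupby(data["timestamp"].dt.date)]

# Size-based: each chunk holds n observations (the last one may be smaller).
n = 32
size_chunks = [data.iloc[i:i + n] for i in range(0, len(data), n)]

# Number-based: split the whole dataset into k chunks of near-equal size.
k = 4
bounds = [round(i * len(data) / k) for i in range(k + 1)]
count_chunks = [data.iloc[bounds[i]:bounds[i + 1]] for i in range(k)]

print(len(time_chunks), len(size_chunks), len(count_chunks))
```

Each element of the resulting lists would then correspond to one data point in the monitoring results.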

Data Period

A data period is a subset of the data used to monitor a model. NannyML expects the provided data to be in one of two data periods.

The first data period is called the reference period. It contains all the observations for a period with an accepted level of performance. It most likely also includes target data.

The second subset of the data is the analysis period. It contains the observations you want NannyML to analyse. In the absence of targets, performance in the analysis period can be estimated.

You can read more about Data Periods in the relevant data requirements section.

Estimated Performance

The performance the monitored model is expected to have as a result of the Performance Estimation process. Estimated performance can be available immediately after predictions are made.

Feature

A variable used by our machine learning model. The model inputs consist of features.

Latent space

A space of reduced dimensionality, compared to the model input space, that can represent our input data. This space is the result of a representation learning algorithm. Data points that are close together in the model input space are also close together in the latent space.

Ground truth

A synonym for Target.

Identifier

Usually a single column, but can be multiple columns where necessary. It is used to uniquely identify an observation. When providing Target data at a later point in time, this value can help refer back to the original prediction.

Being able to uniquely identify each row of data can help reference any particular issues NannyML might identify and make resolving issues easier for you. As we add functionality to provide target data afterwards your data will already be in the correct shape to support it!

Note

Format: No specific format. Any str or int value is possible.
Candidates:
An existing identifier from your business case.
A technical identifier such as a globally unique identifier (GUID).
A hash of some (or all) of your column values, using a hashing function with appropriate collision properties, e.g. the SHA-2 and SHA-3 families.
A concatenation of your dataset name and a row number.
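As a sketch of the hash-based candidate, the snippet below derives an identifier from a row's column values with SHA-256 (a member of the SHA-2 family) from Python's standard library. The helper `row_identifier` and the example columns are hypothetical.

```python
import hashlib


def row_identifier(row: dict) -> str:
    """Return a SHA-256 hex digest built from the row's columns and values."""
    # Sort the keys so the identifier does not depend on column order.
    payload = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical observation.
row = {"customer_id": 42, "amount": 19.99, "country": "BE"}
identifier = row_identifier(row)
print(identifier)
```

Note that two rows with identical values in every hashed column receive the same identifier, so the hashed columns need to be distinctive enough to tell observations apart.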

Imputation

The process of substituting missing values with actual values in a dataset.

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test, or KS test as it is more commonly called, is a nonparametric statistical test of the equality of continuous one-dimensional probability distributions. It can be used to compare a sample with a reference probability distribution (the one-sample KS test) or to compare two samples. In NannyML we use the two-sample KS test to answer whether the two samples in question come from different distributions.

The KS test results include the KS statistic, or d-statistic as it is more commonly called, and a p-value. The d-statistic takes values between 0 and 1. The bigger the d-statistic, the more the two samples we are comparing differ. The p-value is the probability of obtaining a result at least as extreme as the one observed, provided both samples come from the same distribution.

You can find more information on the Wikipedia KS test page. At NannyML we use the scipy implementation of the two sample KS test.
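A two-sample KS test can be sketched with `scipy.stats.ks_2samp`. The two samples below are synthetic and the names `reference` and `analysis` are chosen for illustration only; the snippet shows the test itself rather than NannyML's internal usage.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic samples of one continuous variable; the second one is shifted.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)
analysis = rng.normal(loc=0.5, scale=1.0, size=1000)

# Compare the two empirical distributions.
result = ks_2samp(reference, analysis)
d_statistic, p_value = result.statistic, result.pvalue
print(f"d-statistic: {d_statistic:.3f}, p-value: {p_value:.6f}")
```

The d-statistic is the largest vertical distance between the two empirical cumulative distribution functions, which is why it always lies between 0 and 1.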

Model inputs

Every Feature used by the model.

Model outputs

The scores or probabilities that your model predicts for its target outcome.

Model predictions

A synonym for Model outputs.

Multivariate Drift Detection

Drift Detection steps that involve all model features in order to create appropriate drift measures.

Partition Column

A column that tells us what Data Period the data is in. A partition column is necessary for NannyML in order to produce model monitoring results.

PCA

Principal Component Analysis is a method used for dimensionality reduction. The method produces a linear transformation of the input data that results in a space with orthogonal components that maximise the available variance of the input data.

More information is available on the PCA Wikipedia page.

Performance Estimation

Estimating performance of a deployed ML model without having access to Target.

Predictions

A synonym for Model outputs.

Predicted labels

The outcome a machine learning model predicts for the event it was called to predict. Predicted labels are a two-valued categorical variable. They can be represented by integers, usually 0 and 1, booleans, meaning True or False, or strings. For NannyML, in a binary classification problem, it is ideal if predicted labels are presented as integers with 1 representing the positive outcome.

Predicted probabilities

The probabilities assigned by a machine learning model regarding the chance that a positive event materializes for the binary outcome it was called to predict.

Predicted scores

Sometimes the prediction of a machine learning model is transformed into a continuous range of real numbers. Those scores can take values outside the [0,1] range that is allowed for probabilities. The higher the score, the more likely the positive outcome should be.

Realized Performance

The actual performance of the monitored model once Targets become available. The term is used to differentiate between Estimated Performance and actual results.

Reconstruction Error

The average Euclidean distance between the original and the reconstructed data points in a dataset. The reconstructed dataset is created by transforming our model inputs to a Latent space and then transforming them back to the model input space. Given that this process cannot be lossless, there will always be a difference between the original and the reconstructed data. This difference is captured by the reconstruction error.
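The calculation can be sketched with NumPy, using PCA (computed through a singular value decomposition) as the transformation to the latent space. The data, the number of retained components, and all variable names are assumptions made for this example; NannyML's own implementation may include additional preprocessing steps.

```python
import numpy as np

# Hypothetical model inputs: 200 rows, 3 features, where the third feature
# is almost a linear combination of the first two.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x1, x2, x1 + x2 + 0.1 * rng.normal(size=200)])

# PCA via SVD on the centered data: the rows of vt are the principal components.
mean = X.mean(axis=0)
centered = X - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:2]  # keep a 2-dimensional latent space

# Transform to the latent space and back to the model input space.
latent = centered @ components.T
reconstructed = latent @ components + mean

# Average Euclidean distance between original and reconstructed points.
reconstruction_error = np.linalg.norm(X - reconstructed, axis=1).mean()
print(f"Reconstruction error: {reconstruction_error:.4f}")
```

The error is small here because two components capture almost all of the variance; data whose structure differs from what the components were fitted on would be reconstructed less accurately.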

Standard Error

The Standard Error of a statistic is the standard deviation of its sampling distribution. It can also refer to an estimate of that standard deviation. If the statistic is the sample mean, it is called the Standard Error of the Mean, abbreviated as SEM.

The exact value of the standard error of the mean for an independent sample of \(n\) observations taken from a statistical population with standard deviation \(\sigma\) is:

\[{\sigma }_{\bar {x}}\ ={\frac {\sigma }{\sqrt {n}}}\]

More information can be read at the Wikipedia Standard Error page.
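In practice the population standard deviation is usually unknown, so the SEM is estimated by plugging the sample standard deviation into the formula above. A minimal sketch with made-up sample values:

```python
import math
import statistics

# Hypothetical sample of n = 8 observations.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Sample standard deviation (n - 1 in the denominator) as an estimate of sigma.
sample_std = statistics.stdev(sample)

# Estimated standard error of the mean: s / sqrt(n).
sem = sample_std / math.sqrt(len(sample))
print(f"Estimated SEM: {sem:.4f}")
```

Because of the \(\sqrt{n}\) in the denominator, quadrupling the sample size halves the standard error.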

Target

The actual outcome of the event the machine learning model is trying to predict. Also referred to as Ground truth.

Timestamp

Usually a single column, but can be multiple columns where necessary. This provides NannyML with the date and time that the prediction was made.

NannyML needs to understand when predictions were made, and how you record this, so it can bucket observations in time periods.

Note

Format

Any format supported by Pandas, most likely:

ISO 8601, e.g. 2021-10-13T08:47:23Z
Unix-epoch in units of seconds, e.g. 1513393355
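Both formats can be parsed with `pandas.to_datetime`, as the sketch below shows using the two example values above. The variable names are illustrative.

```python
import pandas as pd

# ISO 8601 strings are parsed directly; the trailing "Z" marks UTC.
iso_timestamps = pd.to_datetime(pd.Series(["2021-10-13T08:47:23Z"]))

# Unix-epoch integers need the unit to be stated explicitly.
epoch_timestamps = pd.to_datetime(pd.Series([1513393355]), unit="s")

print(iso_timestamps.iloc[0])
print(epoch_timestamps.iloc[0])
```

Note that the ISO 8601 value is parsed as timezone-aware (UTC), while the epoch value is parsed as a timezone-naive timestamp, so mixing the two in one column would require normalising them first.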

Univariate Drift Detection

Drift Detection methods that use each model feature individually in order to detect change in Model Inputs.