Detecting Data Drift

Take a machine learning model that uses some multidimensional input data \(\mathbf{X}\) and makes predictions \(y\).

The model has been trained on some data distribution \(P(\mathbf{X})\). There is data drift when the production data comes from a different distribution \(P(\mathbf{X'}) \neq P(\mathbf{X})\).
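As a rough illustration of this definition, the sketch below compares a reference sample with two production samples of a single feature, using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distribution functions). It is a minimal sketch under stated assumptions, not NannyML's implementation: the helper `ks_statistic`, the Gaussian data, and the sample sizes are all hypothetical and chosen only for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, production):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref = sorted(reference)
    prod = sorted(production)
    max_gap = 0.0
    # The largest gap between two step functions occurs at one of the sample points.
    for value in ref + prod:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_prod = bisect_right(prod, value) / len(prod)
        max_gap = max(max_gap, abs(cdf_ref - cdf_prod))
    return max_gap


rng = random.Random(0)  # fixed seed so the example is reproducible

# Hypothetical data: one feature observed at training time and in production.
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]    # mean moved by one std

ks_no_drift = ks_statistic(reference, same_dist)
ks_drift = ks_statistic(reference, shifted)

print(f"KS statistic without drift: {ks_no_drift:.3f}")
print(f"KS statistic with drift:    {ks_drift:.3f}")
```

A statistic close to 0 suggests the two samples follow the same distribution, while a larger value points to drift in that feature. Looking at one feature at a time can miss changes that only show up in the joint distribution \(P(\mathbf{X})\).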

A machine learning model operating on an input distribution different from the one it has been trained on may underperform. It is therefore crucial to detect data drift in a timely manner when a model is in production. Moreover, by further investigating the characteristics of an observed drift, the causes of any performance change can be identified.

There is also a special case of data drift called label shift. In this case, the outcome distributions of the training and production data are different, meaning \(P(y') \neq P(y)\). However, the distribution of the population characteristics given a specific outcome does not change, namely \(P(\mathbf{X'}|y') = P(\mathbf{X}|y)\).
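The following simulation is a small sketch of label shift, assuming a binary outcome and a single Gaussian feature whose mean depends only on the outcome. The functions `sample` and `summarize`, the class means, and the positive rates are hypothetical values picked for illustration.

```python
import random
from statistics import mean


def sample(n, positive_rate, rng):
    """Draw (x, y) pairs: y ~ Bernoulli(positive_rate), x | y ~ Normal(2*y, 1)."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < positive_rate else 0
        # P(X|y) is the same in every period; only P(y) is allowed to change.
        x = rng.gauss(2.0 if y == 1 else 0.0, 1.0)
        data.append((x, y))
    return data


def summarize(data):
    """Estimate P(y=1), the class-conditional means of x, and the overall mean of x."""
    xs_pos = [x for x, y in data if y == 1]
    xs_neg = [x for x, y in data if y == 0]
    return {
        "p_y1": len(xs_pos) / len(data),
        "mean_x_given_y1": mean(xs_pos),
        "mean_x_given_y0": mean(xs_neg),
        "mean_x": mean(x for x, _ in data),
    }


rng = random.Random(0)  # fixed seed so the example is reproducible
train = summarize(sample(5000, positive_rate=0.2, rng=rng))
prod = summarize(sample(5000, positive_rate=0.6, rng=rng))

for key in train:
    print(f"{key:>16}: training={train[key]:.2f}  production={prod[key]:.2f}")
```

In this setup the class-conditional means stay close to each other across the two periods, while the share of positive outcomes and the overall mean of the feature both move. This is why label shift also appears as a change in \(P(\mathbf{X})\).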

Data drift is one of the two main reasons for silent model failure. The second one is concept drift, where the relationship between the model inputs and the target changes. In this case we have: \(P(y'|\mathbf{X'}) \neq P(y|\mathbf{X})\). In production, a model can experience data drift and concept drift simultaneously.
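To make the contrast with data drift concrete, the sketch below simulates pure concept drift: the input distribution is identical in both periods, but the rule that maps inputs to the target changes. The function `accuracy`, the threshold rule, and the fixed "model" are hypothetical and serve only as an illustration.

```python
import random


def accuracy(true_threshold, n, rng):
    """Accuracy of a fixed model when the true rule is y = 1 if x > true_threshold."""
    correct = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                # P(X) is the same in both periods
        y = 1 if x > true_threshold else 0     # P(y|X) depends on the period
        y_pred = 1 if x > 0.0 else 0           # model learned the training-time rule
        correct += int(y == y_pred)
    return correct / n


rng = random.Random(0)  # fixed seed so the example is reproducible
acc_training_period = accuracy(0.0, 5000, rng)
acc_after_concept_drift = accuracy(1.0, 5000, rng)

print(f"Accuracy under the training-time rule: {acc_training_period:.2f}")
print(f"Accuracy after concept drift:          {acc_after_concept_drift:.2f}")
```

Because the inputs are drawn from the same distribution in both periods, a check that only compares \(P(\mathbf{X'})\) with \(P(\mathbf{X})\) would not flag anything here, even though the model's accuracy drops.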

Below we see the various ways in which NannyML detects data drift.