Chunking
Why do we need chunks?
NannyML monitors ML models in production by performing data drift detection and performance estimation or monitoring. This functionality relies on aggregate metrics evaluated on samples of production data. These samples are called chunks. All results are calculated and presented per chunk, i.e. a chunk is a single data point in the monitoring results. You can refer to the Data Drift guide or the Performance Estimation guide to review example results.
Walkthrough on creating chunks
To allow for flexibility, there are many ways to create chunks. The examples below show how different kinds of chunks can be created. They follow the performance estimation flow on the synthetic binary classification dataset provided by NannyML. First, we set up this dataset.
>>> import pandas as pd
>>> import nannyml as nml
>>> from IPython.display import display
>>> data = nml.load_synthetic_binary_classification_dataset()
>>> reference, analysis = data[0], data[1]
Time-based chunking
Time-based chunking creates chunks based on time intervals. One chunk can contain all the observations from a single hour, day, week, month, etc. In most cases, such chunks will vary in the number of observations they contain. Specify the chunk_period argument to get the appropriate split. The example below chunks the data quarterly.
>>> cbpe = nml.CBPE(
>>> y_pred_proba='y_pred_proba',
>>> y_pred='y_pred',
>>> y_true='work_home_actual',
>>> timestamp_column_name='timestamp',
>>> metrics=['roc_auc'],
>>> chunk_period="Q")
>>> cbpe.fit(reference)
>>> est_perf = cbpe.estimate(analysis)
>>> est_perf.data.iloc[:3,:5]
|  | key | start_index | end_index | start_date | end_date |
|---|---|---|---|---|---|
| 0 | 2017Q3 | 0 | 1261 | 2017-08-31 00:00:00 | 2017-09-30 23:59:59 |
| 1 | 2017Q4 | 1262 | 4951 | 2017-10-01 00:00:00 | 2017-12-31 23:59:59 |
| 2 | 2018Q1 | 4952 | 8702 | 2018-01-01 00:00:00 | 2018-03-31 23:59:59 |
Note
Notice that each calendar quarter was taken into account, even if it was not fully covered with records. This means some chunks (usually the first and the last) contain fewer observations. See the first row above: Q3 spans July-September, but the first record in the data is from the last day of August, so the first chunk contains ~1,200 observations while the second and third contain over 3,000. Metrics for such smaller chunks may be less reliably estimated or calculated.
Possible time offsets are listed in the table below:
| Alias | Description |
|---|---|
| S | second |
| T, min | minute |
| H | hour |
| D | day |
| W | week |
| M | month |
| Q | quarter |
| A, y | year |
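These aliases are pandas period aliases. As a quick illustration (plain pandas, not NannyML internals), the sketch below shows how timestamps map to quarterly keys like the ones in the results table above:

```python
import pandas as pd

# Map timestamps to quarterly period keys, similar to what chunk_period="Q" does.
timestamps = pd.Series(pd.to_datetime(["2017-08-31", "2017-10-01", "2018-01-01"]))
keys = timestamps.dt.to_period("Q").astype(str)
print(list(keys))  # ['2017Q3', '2017Q4', '2018Q1']
```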
Size-based chunking
Chunks can be of fixed size, i.e. each chunk contains the same number of observations. Set this up by specifying the
chunk_size
parameter.
>>> cbpe = nml.CBPE(
>>> y_pred_proba='y_pred_proba',
>>> y_pred='y_pred',
>>> y_true='work_home_actual',
>>> timestamp_column_name='timestamp',
>>> metrics=['roc_auc'],
>>> chunk_size=3500)
>>> cbpe.fit(reference_data=reference)
>>> est_perf = cbpe.estimate(analysis)
>>> est_perf.data.iloc[:3,:5]
|  | key | start_index | end_index | start_date | end_date |
|---|---|---|---|---|---|
| 0 | [0:3499] | 0 | 3499 | 2017-08-31 00:00:00 | 2017-11-26 23:59:59 |
| 1 | [3500:6999] | 3500 | 6999 | 2017-11-26 00:00:00 | 2018-02-18 23:59:59 |
| 2 | [7000:10499] | 7000 | 10499 | 2018-02-18 00:00:00 | 2018-05-14 23:59:59 |
Note
If the number of observations is not divisible by the required chunk size, the rows equal to the remainder of the division will be dropped. This ensures that each chunk has the same size, but in the worst case it results in dropping chunk_size - 1 rows. Notice that the last index in the last chunk is 48999 while the last index in the raw data is 49999:
>>> est_perf.data.iloc[-2:,:5]
|  | key | start_index | end_index | start_date | end_date |
|---|---|---|---|---|---|
| 12 | [42000:45499] | 42000 | 45499 | 2020-06-18 00:00:00 | 2020-09-13 23:59:59 |
| 13 | [45500:48999] | 45500 | 48999 | 2020-09-13 00:00:00 | 2020-12-08 23:59:59 |
>>> analysis.index.max()
49999
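The dropped remainder can be verified with simple arithmetic, using the 50,000-row analysis set from this walkthrough:

```python
# Size-based chunking keeps only full chunks; the remainder is dropped.
n_obs = 50_000       # analysis set: indices 0..49999
chunk_size = 3_500

n_chunks = n_obs // chunk_size          # 14 full chunks
dropped = n_obs % chunk_size            # 1000 rows dropped from the tail
last_index = n_chunks * chunk_size - 1  # 48999

print(n_chunks, dropped, last_index)  # 14 1000 48999
```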
Number-based chunking
The total number of chunks can be set by the chunk_number
parameter:
>>> cbpe = nml.CBPE(
>>> y_pred_proba='y_pred_proba',
>>> y_pred='y_pred',
>>> y_true='work_home_actual',
>>> timestamp_column_name='timestamp',
>>> metrics=['roc_auc'],
>>> chunk_number=9)
>>> cbpe.fit(reference_data=reference)
>>> est_perf = cbpe.estimate(analysis)
>>> len(est_perf.data)
9
Note
Chunks created this way will be equal in size. If the number of observations is not divisible by the chunk_number, the number of observations equal to the remainder of the division will be dropped.
>>> est_perf.data.iloc[-2:,:5]
|  | key | start_index | end_index | start_date | end_date |
|---|---|---|---|---|---|
| 7 | [38885:44439] | 38885 | 44439 | 2020-04-03 00:00:00 | 2020-08-18 23:59:59 |
| 8 | [44440:49994] | 44440 | 49994 | 2020-08-18 00:00:00 | 2021-01-01 23:59:59 |
>>> analysis.index.max()
49999
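The same arithmetic applies here, with the chunk size derived from the requested number of chunks:

```python
# Number-based chunking: equal-sized chunks, remainder dropped.
n_obs = 50_000     # analysis set: indices 0..49999
chunk_number = 9

chunk_size = n_obs // chunk_number            # 5555 rows per chunk
dropped = n_obs - chunk_size * chunk_number   # 5 rows dropped
last_index = chunk_size * chunk_number - 1    # 49994

print(chunk_size, dropped, last_index)  # 5555 5 49994
```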
Note
The same splitting rule is always applied to the dataset used for fitting (reference) and the dataset of interest (in the presented case, analysis). Unless these two datasets are of the same size, the chunk sizes can be considerably different. For example, if the reference dataset has 10,000 observations and the analysis dataset has 80,000, and chunking is number-based, chunks in reference will be much smaller than in analysis. Additionally, if data drift or performance estimation is calculated on combined reference and analysis data, the results presented for reference will be calculated on different chunks than the ones it was fitted on.
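A quick sketch of the sizing disparity described above, using the hypothetical row counts from the example:

```python
# With number-based chunking, the same chunk count applies to both datasets,
# so per-chunk row counts differ when the datasets differ in size.
chunk_number = 10
reference_rows = 10_000   # hypothetical reference set size
analysis_rows = 80_000    # hypothetical analysis set size

reference_chunk_rows = reference_rows // chunk_number  # 1000 rows per chunk
analysis_chunk_rows = analysis_rows // chunk_number    # 8000 rows per chunk

print(reference_chunk_rows, analysis_chunk_rows)  # 1000 8000
```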
Automatic chunking
The default chunking method is number-based, with the number of chunks set to 10. It is used when no chunking method is specified.
>>> cbpe = nml.CBPE(
>>> y_pred_proba='y_pred_proba',
>>> y_pred='y_pred',
>>> y_true='work_home_actual',
>>> timestamp_column_name='timestamp',
>>> metrics=['roc_auc'])
>>> cbpe.fit(reference_data=reference)
>>> est_perf = cbpe.estimate(pd.concat([reference, analysis]))
>>> len(est_perf.data)
10
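Since the default is number-based with 10 chunks, and assuming the reference set, like the analysis set, has 50,000 rows (as in this synthetic dataset), the combined data above is split into chunks of 10,000 rows each:

```python
# Default chunking: 10 equally sized chunks over whatever data is passed.
default_chunk_number = 10
n_obs = 50_000 + 50_000   # reference + analysis combined (assumed sizes)
chunk_rows = n_obs // default_chunk_number  # 10000 rows per chunk

print(chunk_rows)  # 10000
```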
Chunks on plots with results
Finally, once the chunking method is selected, the full performance estimation can be run.
Each point on the plot represents a single chunk, with the y-axis showing the performance. Chunks are aligned on the x-axis with the date at the end of the chunk, not the date in its middle. The plots are interactive: hovering over a point displays precise information about the chunk's period, to help prevent any confusion.
>>> cbpe = nml.CBPE(
>>> y_pred_proba='y_pred_proba',
>>> y_pred='y_pred',
>>> y_true='work_home_actual',
>>> timestamp_column_name='timestamp',
>>> metrics=['roc_auc'],
>>> chunk_size=5_000)
>>> cbpe.fit(reference_data=reference)
>>> est_perf = cbpe.estimate(analysis)
>>> est_perf.plot(kind='performance').show()