Chunking

Why do we need chunks?

NannyML monitors ML models in production by performing data drift detection and performance estimation or monitoring. This functionality relies on aggregate metrics evaluated on samples of production data. These samples are called chunks.

All the results generated are calculated and presented per chunk, i.e., a chunk is a single data point on the monitoring results. You can refer to the Data Drift guide or the Performance Estimation guide to review example results.

Walkthrough on creating chunks

To allow for flexibility, there are many ways to create chunks. The examples below show how different kinds of chunks can be created. They follow the performance estimation flow on the synthetic binary classification dataset provided by NannyML. First, we set up this dataset.

>>> import nannyml as nml
>>> reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
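If you want a quick look at the data before chunking, an optional inspection like the sketch below works; it only assumes the loader returns plain pandas DataFrames.

>>> # optional: a quick look at the frames we will be chunking
>>> print(f"reference: {reference_df.shape}, analysis: {analysis_df.shape}")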

Time-based chunking

Time-based chunking creates chunks based on time intervals. A single chunk can contain all the observations from an hour, a day, a month, or a year. In most cases, such chunks will vary in the number of observations they contain. Specify the chunk_period argument to get the appropriate split. The example below chunks the data quarterly.

>>> cbpe = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     problem_type='classification_binary',
...     chunk_period="Q", # here we define the chunk period.
... )
>>> cbpe.fit(reference_df)
>>> est_perf = cbpe.estimate(analysis_df)

>>> est_perf.data.iloc[:3, :6]
    chunk
      key  chunk_index  start_index  end_index           start_date                       end_date
0  2018Q1            0            0      14863  2018-01-01 00:00:00  2018-03-31 23:59:59.999999999
1  2018Q2            1        14864      29892  2018-04-01 00:00:00  2018-06-30 23:59:59.999999999
2  2018Q3            2        29893      45086  2018-07-01 00:00:00  2018-09-30 23:59:59.999999999

Note

Notice that the data is split along calendar quarters, even if a quarter is not fully covered by records. Chunks at the boundaries of the data (usually the first and the last) can therefore contain fewer observations than the others, which makes results calculated or estimated on them less reliable.
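To see this effect on your own data, you can count the records per calendar quarter directly with pandas. This is a minimal sketch, assuming the timestamp column parses with pd.to_datetime:

>>> import pandas as pd

>>> # count observations per calendar quarter to spot partially covered chunks
>>> ts = pd.to_datetime(analysis_df['timestamp'])
>>> analysis_df.groupby(ts.dt.to_period('Q')).size()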

Possible time offsets are listed in the table below:

Alias   Description
------  -----------
S       second
T, min  minute
H       hour
D       day
W       week
M       month
Q       quarter
A, y    year
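Any of these aliases can be passed as the chunk_period argument. As a minimal variation on the earlier example, the estimator below would chunk the same data monthly instead of quarterly; only the chunk_period value changes:

>>> cbpe_monthly = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     problem_type='classification_binary',
...     chunk_period='M',  # monthly chunks, per the alias table above
... )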

Size-based chunking

Chunks can be of fixed size, i.e., each chunk contains the same number of observations. Set this up by specifying the chunk_size parameter.

>>> cbpe = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     problem_type='classification_binary',
...     chunk_size=3500, # here we define the chunk size.
... )
>>> cbpe.fit(reference_df)
>>> est_perf = cbpe.estimate(analysis_df)

>>> est_perf.data.iloc[:3, :6]
          chunk
            key  chunk_index  start_index  end_index           start_date                    end_date
0      [0:3499]            0            0       3499  2018-01-01 00:00:00  2018-01-22 04:28:28.848000
1   [3500:6999]            1         3500       6999  2018-01-22 04:37:12  2018-02-12 09:05:40.848000
2  [7000:10499]            2         7000      10499  2018-02-12 09:14:24  2018-03-05 13:42:52.848000

Note

If the number of observations is not divisible by the chunk_size, by default the leftover observations are appended to the last chunk, making it larger than the chunk_size defined. Notice in the output below that for the last chunk the difference between the start_index and the end_index is larger than the chunk_size.

Incomplete chunk behavior can be configured using the incomplete parameter. Check the custom chunks section if you want to change the default behavior.

>>> est_perf.data.iloc[-2:, :6]
            chunk
              key  chunk_index  start_index  end_index           start_date                    end_date
26  [42000:45499]           12        42000      45499  2019-07-12 01:26:24  2019-08-02 05:54:52.848000
27  [45500:49999]           13        45500      49999  2019-08-02 06:03:36  2019-08-29 11:51:16.848000

>>> last = est_perf.data.iloc[-1].loc['chunk']
>>> print(last.loc['end_index'] - last.loc['start_index'])
4499
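These sizes follow directly from the arithmetic: 50,000 observations split into chunks of 3,500 yield 14 full chunks (49,000 rows), and the 1,000 leftover rows are appended to the last chunk by default.

>>> # worked check of the chunk arithmetic for this dataset
>>> n_rows, chunk_size = 50_000, 3_500
>>> print(n_rows // chunk_size, n_rows % chunk_size)
14 1000
>>> # the last chunk thus holds 3,500 + 1,000 = 4,500 rows, i.e. an
>>> # end_index - start_index of 4499, matching the output above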

Number-based chunking

The total number of chunks can be set by the chunk_number parameter:

>>> cbpe = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     problem_type='classification_binary',
...     chunk_number=9, # here we define the chunk number
... )

>>> cbpe.fit(reference_df)
>>> est_perf = cbpe.estimate(analysis_df)

>>> len(est_perf.filter(period='reference'))
9

Note

Chunks created this way will be equal in size, except possibly the last one.

If the number of observations is not divisible by the chunk_number required, by default the leftover observations are appended to the last chunk, making it slightly larger than the others. Notice below that for the last chunk the difference between the start_index and the end_index is larger than for the other chunks.

Incomplete chunk behavior can be configured using the incomplete parameter. Check the custom chunks section if you want to change the default behavior.

>>> est_perf.filter(period='reference').data.iloc[-2:, :6]
           chunk
             key  chunk_index  start_index  end_index                  start_date                    end_date
7  [38885:44439]            7        38885      44439  2018-08-24 10:46:05.520000  2018-09-27 01:52:31.728000
8  [44440:49999]            8        44440      49999  2018-09-27 02:01:14.880000  2018-10-30 17:51:16.848000

>>> first = est_perf.data.iloc[1].loc['chunk']
>>> last = est_perf.data.iloc[-1].loc['chunk']
>>> print('first chunk len:', first.loc['end_index'] - first.loc['start_index'])
first chunk len: 5554
>>> print('last chunk len:', last.loc['end_index'] - last.loc['start_index'])
last chunk len: 5559
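As before, the lengths follow from the arithmetic: 50,000 observations split into 9 chunks give a base chunk size of 5,555 rows, and the 5 leftover rows are appended to the last chunk.

>>> # worked check: base chunk size and leftover for chunk_number=9
>>> n_rows, chunk_number = 50_000, 9
>>> print(n_rows // chunk_number, n_rows % chunk_number)
5555 5
>>> # first chunk: 5,555 rows (index difference 5554); last: 5,560 rows (5559)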

Warning

The same splitting rule is always applied to the dataset used for fitting (reference) and the dataset of interest (in the presented case, analysis).

Unless these two datasets are of the same size, the chunk sizes can differ considerably. For example, if the reference dataset has 10,000 observations and the analysis dataset has 80,000, number-based chunking will create chunks that are much smaller in reference than in analysis.

Additionally, if data drift or performance estimation is calculated on the combined reference and analysis data, the results presented for the reference period will be calculated on different chunks than the ones used during fitting.
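To make this concrete, the sketch below applies one number-based chunker to two frames of different sizes. The truncated reference frame here is purely hypothetical, used only for illustration:

>>> from nannyml.chunk import CountBasedChunker

>>> chunker = CountBasedChunker(chunk_number=10)
>>> small_ref = reference_df.iloc[:10_000]  # hypothetical smaller reference set
>>> print(len(chunker.split(small_ref)[0]))    # 10,000 / 10 -> 1,000 rows per chunk
>>> print(len(chunker.split(analysis_df)[0]))  # 50,000 / 10 -> 5,000 rows per chunk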

Automatic chunking

The default chunking method is count-based, with the number of chunks set to 10. It is used whenever a chunking method is not specified.

>>> cbpe = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     problem_type='classification_binary',
... )

>>> cbpe.fit(reference_df)
>>> est_perf = cbpe.estimate(analysis_df)

>>> print(len(est_perf.filter(period='reference')))
10

Customize chunk behavior

A custom Chunker() instance can be provided to change the default handling of incomplete chunks or to implement a custom way of chunking the dataset.

For example, SizeBasedChunker() can be used to drop the leftover observations so that all chunks have a fixed size.

>>> from nannyml.chunk import SizeBasedChunker, CountBasedChunker

>>> # The reference dataset contains 50000 records
>>> print(f"Size of reference data: {reference_df.shape[0]}")
Size of reference data: 50000

>>> # We can use the 'drop' strategy to handle incomplete chunks
>>> chunker = SizeBasedChunker(chunk_size=3500, incomplete='drop')

>>> last = chunker.split(reference_df)[-1]
>>> print(f"The last index: {last.end_index}")
The last index: 48999
>>> print(f"Last chunk size: {len(last)}")
Last chunk size: 3500

You could also chunk your data into a fixed number of chunks, choosing to append any leftover observations to the last chunk.

>>> # The reference dataset contains 50000 records
>>> print(f"Size of reference data: {reference_df.shape[0]}")
Size of reference data: 50000

>>> # We can use a different chunker with another 'incomplete' strategy
>>> chunker_count_append = CountBasedChunker(chunk_number=9, incomplete='append')

>>> last = chunker_count_append.split(reference_df)[-1]
>>> print(f"The last index: {last.end_index}")
The last index: 49999
>>> print(f"Last chunk size: {len(last)}")
Last chunk size: 5560

You can then provide your custom chunker to the appropriate calculator or estimator.

>>> cbpe = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     problem_type='classification_binary',
...     chunker=chunker_count_append
... ).fit(reference_data=reference_df)

Chunks on plots with results

Finally, once the chunking method is selected, the full performance estimation can be run.

Each point on the plot represents a single chunk, with the y-axis showing the performance. Points are aligned on the x-axis with the end date of the chunk, not the date in the middle. The plots are interactive: hovering over a point displays precise information about the chunk period to help prevent any confusion.

>>> cbpe = nml.CBPE(
...     y_pred_proba='y_pred_proba',
...     y_pred='y_pred',
...     y_true='repaid',
...     timestamp_column_name='timestamp',
...     metrics=['roc_auc'],
...     problem_type='classification_binary',
...     chunk_size=5_000
... ).fit(reference_data=reference_df)

>>> est_perf = cbpe.estimate(analysis_df)
>>> figure = est_perf.plot(kind='performance')
>>> figure.show()
[Figure chunk-size.svg: estimated performance (ROC AUC) per chunk of 5,000 observations]
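The plot method returns a figure object; in recent NannyML versions this is a Plotly figure, so, assuming the kaleido package is installed, it can also be written to disk instead of being displayed:

>>> # assumes a Plotly figure and an installed kaleido package
>>> figure.write_image('estimated-performance.svg')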