Rows Count
Just The Code
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())
>>> calc = nml.SummaryStatsRowCountCalculator(timestamp_column_name='timestamp', chunk_period='M')
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())
>>> results.filter(period='analysis').plot().show()
Walkthrough
The Row Count calculation is straightforward. For each chunk, NannyML calculates the row count of the selected dataframe. The values from the reference data chunks are used to calculate the alert thresholds. The row counts of the analysis chunks are then compared against those thresholds, generating alerts where applicable.
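By default, NannyML derives these alert thresholds from a standard-deviation band around the reference chunk values (mean ± 3 standard deviations). A minimal sketch of that logic, using the monthly reference row counts shown in the results table below:

```python
# Sketch of NannyML's default standard-deviation thresholding,
# applied to the monthly row counts of the reference period.
reference_counts = [5120, 4625, 5119, 4955, 5120, 4954, 5120, 5120, 4954, 4913]

mean = sum(reference_counts) / len(reference_counts)
std = (sum((c - mean) ** 2 for c in reference_counts) / len(reference_counts)) ** 0.5

# Band of mean +/- 3 standard deviations.
upper, lower = mean + 3 * std, mean - 3 * std
print(round(upper, 2), round(lower, 2))  # 5451.21 4548.79

# An analysis chunk raises an alert when its row count falls outside the band,
# e.g. the partial October 2018 chunk with only 207 rows:
print(not (lower <= 207 <= upper))  # True
```

The computed band matches the `upper_threshold` and `lower_threshold` values NannyML reports for this dataset.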
We begin by loading the synthetic car loan dataset provided by the NannyML package.
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())
| | id | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 2018-01-01 00:00:00.000 | 0.99 | 1 |
| 1 | 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 2018-01-01 00:08:43.152 | 0.07 | 0 |
| 2 | 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 2018-01-01 00:17:26.304 | 1 | 1 |
| 3 | 3 | 22652 | 20K - 40K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 2018-01-01 00:26:09.456 | 0.98 | 1 |
| 4 | 4 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 2018-01-01 00:34:52.608 | 0.99 | 1 |
The SummaryStatsRowCountCalculator class implements the functionality needed for row count calculations. We can instantiate it with the following optional parameters:

- timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.
- chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.
- chunk_number (Optional): The number of chunks to be created out of the data provided for each period.
- chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.
- chunker (Optional): A NannyML Chunker object that will handle aggregating the provided data in order to create chunks.
- threshold (Optional): The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the thresholds tutorial.
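To illustrate what `chunk_period='M'` implies, here is a rough sketch (not NannyML's actual implementation) of grouping rows into monthly chunks by their timestamps. The timestamps here are hypothetical stand-ins for the dataset's `timestamp` column:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical timestamps standing in for the dataset's 'timestamp' column.
timestamps = [
    datetime(2018, 1, 1, 0, 0),
    datetime(2018, 1, 15, 12, 30),
    datetime(2018, 2, 3, 8, 0),
]

# Group row indices by calendar month, mimicking chunk_period='M'.
chunks = defaultdict(list)
for i, ts in enumerate(timestamps):
    chunks[ts.strftime("%Y-%m")].append(i)

print(dict(chunks))  # {'2018-01': [0, 1], '2018-02': [2]}
```

Each resulting group becomes one chunk, and the row count per chunk is simply the size of each group.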
>>> calc = nml.SummaryStatsRowCountCalculator(timestamp_column_name='timestamp', chunk_period='M')
Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared against when generating alerts. Then the calculate() method computes the row count results on the data provided to it.

The results can be filtered to only include a certain data period, method or column by using the filter() method. You can inspect the result data by converting the results into a DataFrame with the to_df() method. By default this returns a DataFrame with a multi-level column index: the first level groups the chunk metadata and the calculated statistic, and the second level holds the resulting information, such as the statistic's values and the alert thresholds.
>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())
The first seven columns (`key` through `period`) fall under the `chunk` level of the column index; the remaining four (`value` through `alert`) under the `rows_count` level.

| | key | chunk_index | start_index | end_index | start_date | end_date | period | value | upper_threshold | lower_threshold | alert |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01 | 0 | 0 | 5119 | 2018-01-01 00:00:00 | 2018-01-31 23:59:59.999999999 | reference | 5120 | 5451.21 | 4548.79 | False |
| 1 | 2018-02 | 1 | 5120 | 9744 | 2018-02-01 00:00:00 | 2018-02-28 23:59:59.999999999 | reference | 4625 | 5451.21 | 4548.79 | False |
| 2 | 2018-03 | 2 | 9745 | 14863 | 2018-03-01 00:00:00 | 2018-03-31 23:59:59.999999999 | reference | 5119 | 5451.21 | 4548.79 | False |
| 3 | 2018-04 | 3 | 14864 | 19818 | 2018-04-01 00:00:00 | 2018-04-30 23:59:59.999999999 | reference | 4955 | 5451.21 | 4548.79 | False |
| 4 | 2018-05 | 4 | 19819 | 24938 | 2018-05-01 00:00:00 | 2018-05-31 23:59:59.999999999 | reference | 5120 | 5451.21 | 4548.79 | False |
| 5 | 2018-06 | 5 | 24939 | 29892 | 2018-06-01 00:00:00 | 2018-06-30 23:59:59.999999999 | reference | 4954 | 5451.21 | 4548.79 | False |
| 6 | 2018-07 | 6 | 29893 | 35012 | 2018-07-01 00:00:00 | 2018-07-31 23:59:59.999999999 | reference | 5120 | 5451.21 | 4548.79 | False |
| 7 | 2018-08 | 7 | 35013 | 40132 | 2018-08-01 00:00:00 | 2018-08-31 23:59:59.999999999 | reference | 5120 | 5451.21 | 4548.79 | False |
| 8 | 2018-09 | 8 | 40133 | 45086 | 2018-09-01 00:00:00 | 2018-09-30 23:59:59.999999999 | reference | 4954 | 5451.21 | 4548.79 | False |
| 9 | 2018-10 | 9 | 45087 | 49999 | 2018-10-01 00:00:00 | 2018-10-31 23:59:59.999999999 | reference | 4913 | 5451.21 | 4548.79 | False |
| 10 | 2018-10 | 0 | 0 | 206 | 2018-10-01 00:00:00 | 2018-10-31 23:59:59.999999999 | analysis | 207 | 5451.21 | 4548.79 | True |
| 11 | 2018-11 | 1 | 207 | 5161 | 2018-11-01 00:00:00 | 2018-11-30 23:59:59.999999999 | analysis | 4955 | 5451.21 | 4548.79 | False |
| 12 | 2018-12 | 2 | 5162 | 10280 | 2018-12-01 00:00:00 | 2018-12-31 23:59:59.999999999 | analysis | 5119 | 5451.21 | 4548.79 | False |
| 13 | 2019-01 | 3 | 10281 | 15400 | 2019-01-01 00:00:00 | 2019-01-31 23:59:59.999999999 | analysis | 5120 | 5451.21 | 4548.79 | False |
| 14 | 2019-02 | 4 | 15401 | 20024 | 2019-02-01 00:00:00 | 2019-02-28 23:59:59.999999999 | analysis | 4624 | 5451.21 | 4548.79 | False |
| 15 | 2019-03 | 5 | 20025 | 25144 | 2019-03-01 00:00:00 | 2019-03-31 23:59:59.999999999 | analysis | 5120 | 5451.21 | 4548.79 | False |
| 16 | 2019-04 | 6 | 25145 | 30099 | 2019-04-01 00:00:00 | 2019-04-30 23:59:59.999999999 | analysis | 4955 | 5451.21 | 4548.79 | False |
| 17 | 2019-05 | 7 | 30100 | 35218 | 2019-05-01 00:00:00 | 2019-05-31 23:59:59.999999999 | analysis | 5119 | 5451.21 | 4548.79 | False |
| 18 | 2019-06 | 8 | 35219 | 40173 | 2019-06-01 00:00:00 | 2019-06-30 23:59:59.999999999 | analysis | 4955 | 5451.21 | 4548.79 | False |
| 19 | 2019-07 | 9 | 40174 | 45293 | 2019-07-01 00:00:00 | 2019-07-31 23:59:59.999999999 | analysis | 5120 | 5451.21 | 4548.79 | False |
| 20 | 2019-08 | 10 | 45294 | 49999 | 2019-08-01 00:00:00 | 2019-08-31 23:59:59.999999999 | analysis | 4706 | 5451.21 | 4548.79 | False |
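To pull a single series out of that multi-level column index, you can index with a (level-0, level-1) tuple. A minimal sketch with a hand-built two-level DataFrame (the column names mirror the output above; this toy frame is not produced by NannyML):

```python
import pandas as pd

# Toy stand-in for results.to_df(): a two-level column index with a
# 'chunk' group and a 'rows_count' group.
df = pd.DataFrame(
    {
        ("chunk", "key"): ["2018-10", "2018-11"],
        ("chunk", "period"): ["analysis", "analysis"],
        ("rows_count", "value"): [207, 4955],
        ("rows_count", "alert"): [True, False],
    }
)

# Select one column with a (level-0, level-1) tuple ...
print(df[("rows_count", "value")].tolist())  # [207, 4955]

# ... or grab a whole top-level group at once.
print(df["rows_count"].columns.tolist())  # ['value', 'alert']
```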
More information on accessing the information contained in the Result can be found on the Working with results page.

The next step is visualizing the results, which is done using the plot() method. It is recommended to filter results for each column and plot them separately.
>>> results.filter(period='analysis').plot().show()
Insights
We see that with a monthly chunking strategy the October 2018 analysis chunk contains too few data points (only 207 rows), which triggers an alert.
What Next
We can also inspect the dataset for other summary statistics, such as the average. We can also inspect the dataset using the Data Quality functionality provided by NannyML. Last but not least, we can look for any data drift present in the dataset using NannyML's Detecting Data Drift functionality.