Summation

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())

>>> feature_column_names = [
...     'car_value', 'debt_to_income_ratio', 'driver_tenure'
... ]
>>> calc = nml.SummaryStatsSumCalculator(
...     column_names=feature_column_names,
... )

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()

Walkthrough

The sum calculation is straightforward. For each chunk, NannyML calculates the sum of every selected numerical column. The sums from the reference chunks are used to calculate the alert thresholds. The sums from the analysis chunks are then compared against those thresholds, and an alert is raised for any chunk whose sum falls outside them.
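The threshold logic described above can be sketched without NannyML. The reference sums below are copied (rounded) from the car_value column of the results table further down. The rule of mean plus or minus three population standard deviations is an assumption that happens to reproduce the thresholds printed in that table; it is not a statement of NannyML's internals, and the helper names are hypothetical.

```python
from statistics import fmean, pstdev

# Per-chunk sums of car_value for the ten reference chunks
# (rounded values copied from the results table).
reference_sums = [
    1.48302e8, 1.48088e8, 1.47888e8, 1.47290e8, 1.47183e8,
    1.49716e8, 1.49592e8, 1.48628e8, 1.48666e8, 1.46879e8,
]


def std_thresholds(values, multiplier=3.0):
    """Hypothetical helper: limits as mean -/+ multiplier * population std."""
    center = fmean(values)
    spread = pstdev(values)
    return center - multiplier * spread, center + multiplier * spread


def is_alert(value, lower, upper):
    """A chunk alerts when its sum falls outside the [lower, upper] band."""
    return value < lower or value > upper


lower, upper = std_thresholds(reference_sums)
print(f"lower={lower:.5e} upper={upper:.5e}")

# Two analysis-chunk sums taken from the same table.
print(is_alert(1.49806e8, lower, upper))  # inside the band
print(is_alert(2.41891e8, lower, upper))  # far above the upper limit
```

Because the thresholds are derived only from the reference chunks, every analysis chunk is judged against the same fixed band.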

We begin by loading the synthetic car loan dataset provided by the NannyML package.

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())

	id	car_value	salary_range	debt_to_income_ratio	loan_length	repaid_loan_on_prev_car	size_of_downpayment	driver_tenure	repaid	timestamp	y_pred_proba	y_pred
0	0	39811	40K - 60K €	0.63295	19	False	40%	0.212653	1	2018-01-01 00:00:00.000	0.99	1
1	1	12679	40K - 60K €	0.718627	7	True	10%	4.92755	0	2018-01-01 00:08:43.152	0.07	0
2	2	19847	40K - 60K €	0.721724	17	False	0%	0.520817	1	2018-01-01 00:17:26.304	1	1
3	3	22652	20K - 20K €	0.705992	16	False	10%	0.453649	1	2018-01-01 00:26:09.456	0.98	1
4	4	21268	60K+ €	0.671888	21	True	30%	5.69526	1	2018-01-01 00:34:52.608	0.99	1

The SummaryStatsSumCalculator class implements the functionality needed for sum value calculations. We need to instantiate it with the appropriate parameters:

column_names: A list with the names of columns to be evaluated.
timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.
chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.
chunk_number (Optional): The number of chunks to be created out of data provided for each period.
chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.
chunker (Optional): A NannyML Chunker object that will handle the aggregation of the provided data in order to create chunks.
threshold (Optional): The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the thresholds tutorial.
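To make the chunking arguments concrete, here is a minimal pure-Python sketch of size-based chunking followed by a per-chunk sum. It is an illustration only: chunk_bounds and chunk_sums are hypothetical helpers, not part of NannyML, and the [start:end] key format simply mirrors the chunk keys shown in the results table.

```python
def chunk_bounds(n_rows, chunk_size):
    """Hypothetical helper: inclusive (start, end) row positions per chunk."""
    return [
        (start, min(start + chunk_size, n_rows) - 1)
        for start in range(0, n_rows, chunk_size)
    ]


def chunk_sums(values, chunk_size):
    """Sum the values in each consecutive chunk, keyed like '[0:4999]'."""
    return {
        f"[{start}:{end}]": sum(values[start:end + 1])
        for start, end in chunk_bounds(len(values), chunk_size)
    }


# 50,000 rows split into chunks of 5,000 give ten chunks,
# the same layout as the chunk keys in the results table.
bounds = chunk_bounds(50_000, 5_000)
print(len(bounds), bounds[0], bounds[-1])

# Toy data: ten values in chunks of four; the last chunk is shorter.
print(chunk_sums(list(range(10)), 4))
```

In this sketch an incomplete final chunk is simply kept as a shorter chunk; NannyML's own handling of incomplete chunks may differ.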

>>> feature_column_names = [
...     'car_value', 'debt_to_income_ratio', 'driver_tenure'
... ]
>>> calc = nml.SummaryStatsSumCalculator(
...     column_names=feature_column_names,
... )

Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with for alert generation. Then the calculate() method will calculate the sum values on the data provided to it.

The results can be filtered to only include a certain data period, method or column by using the filter() method. You can evaluate the result data by converting the results into a DataFrame by calling the to_df() method. By default this returns a DataFrame with a multi-level column index. The first level represents the column, the second level represents resulting information such as the sum values, the alert thresholds or the associated sampling error.

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

	chunk					car_value							debt_to_income_ratio							driver_tenure
	key	chunk_index	start_index	end_index	period	value	sampling_error	upper_confidence_boundary	lower_confidence_boundary	upper_threshold	lower_threshold	alert	value	sampling_error	upper_confidence_boundary	lower_confidence_boundary	upper_threshold	lower_threshold	alert	value	sampling_error	upper_confidence_boundary	lower_confidence_boundary	upper_threshold	lower_threshold	alert
0	[0:4999]	0	0	4999	reference	1.48302e+08	1.44233e+06	1.52629e+08	1.43975e+08	1.50966e+08	1.4548e+08	False	2925.39	10.9802	2958.33	2892.45	2954.47	2901.89	False	23080.7	162.772	23569	22592.4	23327.3	22673.1	False
1	[5000:9999]	1	5000	9999	reference	1.48088e+08	1.44233e+06	1.52415e+08	1.43761e+08	1.50966e+08	1.4548e+08	False	2913.64	10.9802	2946.58	2880.7	2954.47	2901.89	False	23084.6	162.772	23573	22596.3	23327.3	22673.1	False
2	[10000:14999]	2	10000	14999	reference	1.47888e+08	1.44233e+06	1.52215e+08	1.43561e+08	1.50966e+08	1.4548e+08	False	2931.72	10.9802	2964.66	2898.78	2954.47	2901.89	False	22858.1	162.772	23346.4	22369.8	23327.3	22673.1	False
3	[15000:19999]	3	15000	19999	reference	1.4729e+08	1.44233e+06	1.51617e+08	1.42963e+08	1.50966e+08	1.4548e+08	False	2920.13	10.9802	2953.07	2887.19	2954.47	2901.89	False	23145.5	162.772	23633.8	22657.2	23327.3	22673.1	False
4	[20000:24999]	4	20000	24999	reference	1.47183e+08	1.44233e+06	1.5151e+08	1.42856e+08	1.50966e+08	1.4548e+08	False	2928.74	10.9802	2961.68	2895.8	2954.47	2901.89	False	22973.5	162.772	23461.8	22485.2	23327.3	22673.1	False
5	[25000:29999]	5	25000	29999	reference	1.49716e+08	1.44233e+06	1.54043e+08	1.45389e+08	1.50966e+08	1.4548e+08	False	2920.2	10.9802	2953.14	2887.26	2954.47	2901.89	False	23065.2	162.772	23553.5	22576.9	23327.3	22673.1	False
6	[30000:34999]	6	30000	34999	reference	1.49592e+08	1.44233e+06	1.53919e+08	1.45265e+08	1.50966e+08	1.4548e+08	False	2939.18	10.9802	2972.12	2906.24	2954.47	2901.89	False	22850	162.772	23338.4	22361.7	23327.3	22673.1	False
7	[35000:39999]	7	35000	39999	reference	1.48628e+08	1.44233e+06	1.52955e+08	1.44301e+08	1.50966e+08	1.4548e+08	False	2943.21	10.9802	2976.15	2910.27	2954.47	2901.89	False	22903.2	162.772	23391.5	22414.9	23327.3	22673.1	False
8	[40000:44999]	8	40000	44999	reference	1.48666e+08	1.44233e+06	1.52993e+08	1.44339e+08	1.50966e+08	1.4548e+08	False	2924.53	10.9802	2957.47	2891.59	2954.47	2901.89	False	23135.2	162.772	23623.5	22646.8	23327.3	22673.1	False
9	[45000:49999]	9	45000	49999	reference	1.46879e+08	1.44233e+06	1.51206e+08	1.42552e+08	1.50966e+08	1.4548e+08	False	2935.06	10.9802	2968	2902.12	2954.47	2901.89	False	22905.9	162.772	23394.3	22417.6	23327.3	22673.1	False
10	[0:4999]	0	0	4999	analysis	1.49806e+08	1.44233e+06	1.54133e+08	1.45479e+08	1.50966e+08	1.4548e+08	False	2947.7	10.9802	2980.64	2914.75	2954.47	2901.89	False	22560.9	162.772	23049.2	22072.6	23327.3	22673.1	True
11	[5000:9999]	1	5000	9999	analysis	1.4938e+08	1.44233e+06	1.53707e+08	1.45053e+08	1.50966e+08	1.4548e+08	False	2922.52	10.9802	2955.46	2889.58	2954.47	2901.89	False	22931	162.772	23419.3	22442.7	23327.3	22673.1	False
12	[10000:14999]	2	10000	14999	analysis	1.49387e+08	1.44233e+06	1.53713e+08	1.4506e+08	1.50966e+08	1.4548e+08	False	2917.37	10.9802	2950.31	2884.43	2954.47	2901.89	False	23253.3	162.772	23741.6	22765	23327.3	22673.1	False
13	[15000:19999]	3	15000	19999	analysis	1.49919e+08	1.44233e+06	1.54246e+08	1.45592e+08	1.50966e+08	1.4548e+08	False	2928.47	10.9802	2961.41	2895.53	2954.47	2901.89	False	23117.3	162.772	23605.6	22629	23327.3	22673.1	False
14	[20000:24999]	4	20000	24999	analysis	1.46159e+08	1.44233e+06	1.50486e+08	1.41832e+08	1.50966e+08	1.4548e+08	False	2949.23	10.9802	2982.18	2916.29	2954.47	2901.89	False	22938.5	162.772	23426.8	22450.2	23327.3	22673.1	False
15	[25000:29999]	5	25000	29999	analysis	2.41891e+08	1.44233e+06	2.46218e+08	2.37564e+08	1.50966e+08	1.4548e+08	True	2930.12	10.9802	2963.06	2897.17	2954.47	2901.89	False	22915.5	162.772	23403.8	22427.2	23327.3	22673.1	False
16	[30000:34999]	6	30000	34999	analysis	2.45306e+08	1.44233e+06	2.49633e+08	2.40979e+08	1.50966e+08	1.4548e+08	True	2933.18	10.9802	2966.12	2900.24	2954.47	2901.89	False	22969.5	162.772	23457.8	22481.2	23327.3	22673.1	False
17	[35000:39999]	7	35000	39999	analysis	2.44072e+08	1.44233e+06	2.48399e+08	2.39745e+08	1.50966e+08	1.4548e+08	True	2931.72	10.9802	2964.67	2898.78	2954.47	2901.89	False	22753.5	162.772	23241.8	22265.2	23327.3	22673.1	False
18	[40000:44999]	8	40000	44999	analysis	2.4523e+08	1.44233e+06	2.49557e+08	2.40903e+08	1.50966e+08	1.4548e+08	True	2924.66	10.9802	2957.6	2891.72	2954.47	2901.89	False	22984.3	162.772	23472.6	22496	23327.3	22673.1	False
19	[45000:49999]	9	45000	49999	analysis	2.43532e+08	1.44233e+06	2.47859e+08	2.39205e+08	1.50966e+08	1.4548e+08	True	2925.04	10.9802	2957.98	2892.1	2954.47	2901.89	False	23014.1	162.772	23502.4	22525.8	23327.3	22673.1	False

More information on accessing the information contained in the Result can be found on the Working with results page.
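The confidence boundaries in the table can be related to the sampling error with a little arithmetic. In the rows shown, each boundary equals the sum value plus or minus three times the sampling error. The sketch below checks this on two car_value rows copied from the table. The factor of three is inferred from the printed numbers rather than taken from NannyML's documentation, and confidence_band is a hypothetical helper.

```python
def confidence_band(value, sampling_error, multiplier=3.0):
    """Hypothetical helper: (lower, upper) band around a chunk's sum value."""
    margin = multiplier * sampling_error
    return value - margin, value + margin


# (value, sampling_error, printed lower boundary, printed upper boundary)
rows = [
    (1.48302e8, 1.44233e6, 1.43975e8, 1.52629e8),  # reference chunk 0
    (2.41891e8, 1.44233e6, 2.37564e8, 2.46218e8),  # analysis chunk 5
]

for value, sampling_error, printed_lower, printed_upper in rows:
    band_lower, band_upper = confidence_band(value, sampling_error)
    # The table prints six significant digits, so compare with a
    # small relative tolerance instead of exact equality.
    matches = (
        abs(band_lower - printed_lower) <= 1e-5 * printed_lower
        and abs(band_upper - printed_upper) <= 1e-5 * printed_upper
    )
    print(f"{band_lower:.5e} {band_upper:.5e} matches_table={matches}")
```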

The next step is visualizing the results, which is done using the plot() method. It is recommended to filter the results for each column and plot them separately.

>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()

Insights

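We see that the car_value column exhibits a clear change in sum value: from analysis chunk 5 onward its sum rises from roughly 1.5e+08 to roughly 2.4e+08, well above the upper threshold, and alerts are raised. The driver_tenure column raises a single alert in the first analysis chunk, where its sum falls slightly below the lower threshold, and the debt_to_income_ratio column raises none.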
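The alert pattern can be recomputed by comparing each analysis-chunk sum with its thresholds. The numbers below are copied from the results table; alerting_chunks is a hypothetical helper for this illustration, not a NannyML function.

```python
# column -> (lower_threshold, upper_threshold, sums for analysis chunks 0-9)
analysis = {
    "car_value": (1.4548e8, 1.50966e8, [
        1.49806e8, 1.4938e8, 1.49387e8, 1.49919e8, 1.46159e8,
        2.41891e8, 2.45306e8, 2.44072e8, 2.4523e8, 2.43532e8,
    ]),
    "debt_to_income_ratio": (2901.89, 2954.47, [
        2947.7, 2922.52, 2917.37, 2928.47, 2949.23,
        2930.12, 2933.18, 2931.72, 2924.66, 2925.04,
    ]),
    "driver_tenure": (22673.1, 23327.3, [
        22560.9, 22931.0, 23253.3, 23117.3, 22938.5,
        22915.5, 22969.5, 22753.5, 22984.3, 23014.1,
    ]),
}


def alerting_chunks(lower, upper, sums):
    """Return the indices of chunks whose sum falls outside [lower, upper]."""
    return [i for i, s in enumerate(sums) if s < lower or s > upper]


alerts = {name: alerting_chunks(*spec) for name, spec in analysis.items()}
for name, chunks in alerts.items():
    print(f"{name}: alerting analysis chunks {chunks}")
```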

What Next

We can also inspect the dataset for other Summary Statistics, such as Standard Deviation, or look for any Data Drift present in the dataset using the Detecting Data Drift functionality of NannyML.