Summation

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())

>>> feature_column_names = [
...     'car_value', 'debt_to_income_ratio', 'driver_tenure'
... ]
>>> calc = nml.SummaryStatsSumCalculator(
...     column_names=feature_column_names,
... )

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()

Walkthrough

Calculating the sum is straightforward. For each chunk, NannyML sums the values of every selected numerical column. The sums from the reference chunks are used to calculate the alert thresholds. The sums from the analysis chunks are then compared against those thresholds, and an alert is raised for any chunk whose sum falls outside them.
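
The steps above can be sketched in plain Python. This is only an illustration of the idea, not NannyML's implementation: the helper names (`chunk_sums`, `std_thresholds`, `flag_alerts`), the toy data, and the choice of "mean plus or minus three standard deviations" as the threshold rule are all assumptions made for this example.

```python
from statistics import mean, pstdev


def chunk_sums(values, chunk_size):
    """Sum each consecutive block of `chunk_size` values (hypothetical helper)."""
    return [sum(values[i:i + chunk_size]) for i in range(0, len(values), chunk_size)]


def std_thresholds(reference_sums, multiplier=3.0):
    """Assumed rule: mean of the reference chunk sums +/- multiplier * population std."""
    centre = mean(reference_sums)
    spread = pstdev(reference_sums)
    return centre - multiplier * spread, centre + multiplier * spread


def flag_alerts(sums, lower, upper):
    """A chunk alerts when its sum lies outside [lower, upper]."""
    return [not (lower <= s <= upper) for s in sums]


# Toy single-column data, four rows per chunk.
reference = [10, 11, 9, 10, 10, 12, 9, 10, 11, 9, 10, 9]
analysis = [10, 10, 11, 9, 20, 21, 19, 20]

ref_sums = chunk_sums(reference, chunk_size=4)        # thresholds come from these
lower, upper = std_thresholds(ref_sums)
analysis_sums = chunk_sums(analysis, chunk_size=4)    # these are compared to the thresholds
analysis_alerts = flag_alerts(analysis_sums, lower, upper)

print(ref_sums, (round(lower, 2), round(upper, 2)), analysis_sums, analysis_alerts)
```

In this toy run the second analysis chunk sums to roughly twice the reference level, so it is the only one flagged.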

We begin by loading the synthetic car loan dataset provided by the NannyML package.

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())

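==  ==  =========  ============  ====================  ===========  =======================  ===================  =============  ======  =======================  ============  ======
    id  car_value  salary_range  debt_to_income_ratio  loan_length  repaid_loan_on_prev_car  size_of_downpayment  driver_tenure  repaid  timestamp                y_pred_proba  y_pred
==  ==  =========  ============  ====================  ===========  =======================  ===================  =============  ======  =======================  ============  ======
0   0   39811      40K - 60K €   0.63295               19           False                    40%                  0.212653       1       2018-01-01 00:00:00.000  0.99          1
1   1   12679      40K - 60K €   0.718627              7            True                     10%                  4.92755        0       2018-01-01 00:08:43.152  0.07          0
2   2   19847      40K - 60K €   0.721724              17           False                    0%                   0.520817       1       2018-01-01 00:17:26.304  1             1
3   3   22652      20K - 40K €   0.705992              16           False                    10%                  0.453649       1       2018-01-01 00:26:09.456  0.98          1
4   4   21268      60K+ €        0.671888              21           True                     30%                  5.69526        1       2018-01-01 00:34:52.608  0.99          1
==  ==  =========  ============  ====================  ===========  =======================  ===================  =============  ======  =======================  ============  ======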

The SummaryStatsSumCalculator class implements the functionality needed for sum values calculations. We need to instantiate it with appropriate parameters:

  • column_names: A list with the names of columns to be evaluated.

  • timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.

  • chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.

  • chunk_number (Optional): The number of chunks to be created out of data provided for each period.

  • chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.

  • chunker (Optional): A NannyML Chunker object that will handle the aggregation of the provided data in order to create chunks.

  • threshold (Optional): The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the thresholds tutorial.
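
To make the chunking options more concrete, here is a minimal sketch of how `chunk_size` and `chunk_number` could split 50,000 rows into index ranges like the `[0:4999]` keys that appear in the results of this tutorial. The functions `chunks_by_size`, `chunks_by_number` and `chunk_key` are hypothetical helpers written for this illustration. They are not part of NannyML, and the library's own chunkers may treat incomplete trailing chunks differently.

```python
def chunks_by_size(n_rows, chunk_size):
    """Inclusive (start_index, end_index) pairs for fixed-size chunks.

    Assumption: a trailing partial chunk is simply kept as a shorter chunk.
    """
    return [
        (start, min(start + chunk_size, n_rows) - 1)
        for start in range(0, n_rows, chunk_size)
    ]


def chunks_by_number(n_rows, chunk_number):
    """Split into `chunk_number` equal chunks.

    For simplicity this sketch assumes n_rows is divisible by chunk_number.
    """
    return chunks_by_size(n_rows, n_rows // chunk_number)


def chunk_key(start, end):
    """Format a chunk key in the '[start:end]' style shown in the results."""
    return f"[{start}:{end}]"


by_size = chunks_by_size(50_000, chunk_size=5_000)
by_number = chunks_by_number(50_000, chunk_number=10)
keys = [chunk_key(start, end) for start, end in by_size]

print(keys[0], keys[-1], len(keys))
```

For this row count both settings describe the same ten chunks, which is why only one chunking argument is needed.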

>>> feature_column_names = [
...     'car_value', 'debt_to_income_ratio', 'driver_tenure'
... ]
>>> calc = nml.SummaryStatsSumCalculator(
...     column_names=feature_column_names,
... )

Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with for alert generation. Then the calculate() method will calculate the data quality results on the data provided to it.

The results can be filtered to only include a certain data period, method or column by using the filter method. To inspect the results, convert them into a DataFrame by calling the to_df() method. By default this returns a DataFrame with a multi-level index. The first level represents the column, and the second level represents the resulting information, such as the data quality metric values, the alert thresholds or the associated sampling error.

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

chunk

==  =============  ===========  ===========  =========  ==========  ========  =========
    key            chunk_index  start_index  end_index  start_date  end_date  period
==  =============  ===========  ===========  =========  ==========  ========  =========
0   [0:4999]       0            0            4999                             reference
1   [5000:9999]    1            5000         9999                             reference
2   [10000:14999]  2            10000        14999                            reference
3   [15000:19999]  3            15000        19999                            reference
4   [20000:24999]  4            20000        24999                            reference
5   [25000:29999]  5            25000        29999                            reference
6   [30000:34999]  6            30000        34999                            reference
7   [35000:39999]  7            35000        39999                            reference
8   [40000:44999]  8            40000        44999                            reference
9   [45000:49999]  9            45000        49999                            reference
10  [0:4999]       0            0            4999                             analysis
11  [5000:9999]    1            5000         9999                             analysis
12  [10000:14999]  2            10000        14999                            analysis
13  [15000:19999]  3            15000        19999                            analysis
14  [20000:24999]  4            20000        24999                            analysis
15  [25000:29999]  5            25000        29999                            analysis
16  [30000:34999]  6            30000        34999                            analysis
17  [35000:39999]  7            35000        39999                            analysis
18  [40000:44999]  8            40000        44999                            analysis
19  [45000:49999]  9            45000        49999                            analysis
==  =============  ===========  ===========  =========  ==========  ========  =========

car_value

==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====
    period     value        sampling_error  upper_confidence_boundary  lower_confidence_boundary  upper_threshold  lower_threshold  alert
==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====
0   reference  1.48302e+08  1.44233e+06     1.52629e+08                1.43975e+08                1.50966e+08      1.4548e+08       False
1   reference  1.48088e+08  1.44233e+06     1.52415e+08                1.43761e+08                1.50966e+08      1.4548e+08       False
2   reference  1.47888e+08  1.44233e+06     1.52215e+08                1.43561e+08                1.50966e+08      1.4548e+08       False
3   reference  1.4729e+08   1.44233e+06     1.51617e+08                1.42963e+08                1.50966e+08      1.4548e+08       False
4   reference  1.47183e+08  1.44233e+06     1.5151e+08                 1.42856e+08                1.50966e+08      1.4548e+08       False
5   reference  1.49716e+08  1.44233e+06     1.54043e+08                1.45389e+08                1.50966e+08      1.4548e+08       False
6   reference  1.49592e+08  1.44233e+06     1.53919e+08                1.45265e+08                1.50966e+08      1.4548e+08       False
7   reference  1.48628e+08  1.44233e+06     1.52955e+08                1.44301e+08                1.50966e+08      1.4548e+08       False
8   reference  1.48666e+08  1.44233e+06     1.52993e+08                1.44339e+08                1.50966e+08      1.4548e+08       False
9   reference  1.46879e+08  1.44233e+06     1.51206e+08                1.42552e+08                1.50966e+08      1.4548e+08       False
10  analysis   1.49806e+08  1.44233e+06     1.54133e+08                1.45479e+08                1.50966e+08      1.4548e+08       False
11  analysis   1.4938e+08   1.44233e+06     1.53707e+08                1.45053e+08                1.50966e+08      1.4548e+08       False
12  analysis   1.49387e+08  1.44233e+06     1.53713e+08                1.4506e+08                 1.50966e+08      1.4548e+08       False
13  analysis   1.49919e+08  1.44233e+06     1.54246e+08                1.45592e+08                1.50966e+08      1.4548e+08       False
14  analysis   1.46159e+08  1.44233e+06     1.50486e+08                1.41832e+08                1.50966e+08      1.4548e+08       False
15  analysis   2.41891e+08  1.44233e+06     2.46218e+08                2.37564e+08                1.50966e+08      1.4548e+08       True
16  analysis   2.45306e+08  1.44233e+06     2.49633e+08                2.40979e+08                1.50966e+08      1.4548e+08       True
17  analysis   2.44072e+08  1.44233e+06     2.48399e+08                2.39745e+08                1.50966e+08      1.4548e+08       True
18  analysis   2.4523e+08   1.44233e+06     2.49557e+08                2.40903e+08                1.50966e+08      1.4548e+08       True
19  analysis   2.43532e+08  1.44233e+06     2.47859e+08                2.39205e+08                1.50966e+08      1.4548e+08       True
==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====

debt_to_income_ratio

==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====
    period     value        sampling_error  upper_confidence_boundary  lower_confidence_boundary  upper_threshold  lower_threshold  alert
==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====
0   reference  2925.39      10.9802         2958.33                    2892.45                    2954.47          2901.89          False
1   reference  2913.64      10.9802         2946.58                    2880.7                     2954.47          2901.89          False
2   reference  2931.72      10.9802         2964.66                    2898.78                    2954.47          2901.89          False
3   reference  2920.13      10.9802         2953.07                    2887.19                    2954.47          2901.89          False
4   reference  2928.74      10.9802         2961.68                    2895.8                     2954.47          2901.89          False
5   reference  2920.2       10.9802         2953.14                    2887.26                    2954.47          2901.89          False
6   reference  2939.18      10.9802         2972.12                    2906.24                    2954.47          2901.89          False
7   reference  2943.21      10.9802         2976.15                    2910.27                    2954.47          2901.89          False
8   reference  2924.53      10.9802         2957.47                    2891.59                    2954.47          2901.89          False
9   reference  2935.06      10.9802         2968                       2902.12                    2954.47          2901.89          False
10  analysis   2947.7       10.9802         2980.64                    2914.75                    2954.47          2901.89          False
11  analysis   2922.52      10.9802         2955.46                    2889.58                    2954.47          2901.89          False
12  analysis   2917.37      10.9802         2950.31                    2884.43                    2954.47          2901.89          False
13  analysis   2928.47      10.9802         2961.41                    2895.53                    2954.47          2901.89          False
14  analysis   2949.23      10.9802         2982.18                    2916.29                    2954.47          2901.89          False
15  analysis   2930.12      10.9802         2963.06                    2897.17                    2954.47          2901.89          False
16  analysis   2933.18      10.9802         2966.12                    2900.24                    2954.47          2901.89          False
17  analysis   2931.72      10.9802         2964.67                    2898.78                    2954.47          2901.89          False
18  analysis   2924.66      10.9802         2957.6                     2891.72                    2954.47          2901.89          False
19  analysis   2925.04      10.9802         2957.98                    2892.1                     2954.47          2901.89          False
==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====

driver_tenure

==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====
    period     value        sampling_error  upper_confidence_boundary  lower_confidence_boundary  upper_threshold  lower_threshold  alert
==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====
0   reference  23080.7      162.772         23569                      22592.4                    23327.3          22673.1          False
1   reference  23084.6      162.772         23573                      22596.3                    23327.3          22673.1          False
2   reference  22858.1      162.772         23346.4                    22369.8                    23327.3          22673.1          False
3   reference  23145.5      162.772         23633.8                    22657.2                    23327.3          22673.1          False
4   reference  22973.5      162.772         23461.8                    22485.2                    23327.3          22673.1          False
5   reference  23065.2      162.772         23553.5                    22576.9                    23327.3          22673.1          False
6   reference  22850        162.772         23338.4                    22361.7                    23327.3          22673.1          False
7   reference  22903.2      162.772         23391.5                    22414.9                    23327.3          22673.1          False
8   reference  23135.2      162.772         23623.5                    22646.8                    23327.3          22673.1          False
9   reference  22905.9      162.772         23394.3                    22417.6                    23327.3          22673.1          False
10  analysis   22560.9      162.772         23049.2                    22072.6                    23327.3          22673.1          True
11  analysis   22931        162.772         23419.3                    22442.7                    23327.3          22673.1          False
12  analysis   23253.3      162.772         23741.6                    22765                      23327.3          22673.1          False
13  analysis   23117.3      162.772         23605.6                    22629                      23327.3          22673.1          False
14  analysis   22938.5      162.772         23426.8                    22450.2                    23327.3          22673.1          False
15  analysis   22915.5      162.772         23403.8                    22427.2                    23327.3          22673.1          False
16  analysis   22969.5      162.772         23457.8                    22481.2                    23327.3          22673.1          False
17  analysis   22753.5      162.772         23241.8                    22265.2                    23327.3          22673.1          False
18  analysis   22984.3      162.772         23472.6                    22496                      23327.3          22673.1          False
19  analysis   23014.1      162.772         23502.4                    22525.8                    23327.3          22673.1          False
==  =========  ===========  ==============  =========================  =========================  ===============  ===============  =====
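
As a sanity check on how the columns in these results relate to each other, the arithmetic below recomputes the car_value thresholds and one pair of confidence boundaries from the printed values. It assumes that the thresholds are the mean of the reference chunk sums plus or minus three population standard deviations, and that the confidence boundaries are the value plus or minus three sampling errors. Both assumptions are inferred from the numbers shown here rather than taken from the NannyML source, so treat the agreement as suggestive only.

```python
from statistics import mean, pstdev

# car_value sums of the ten reference chunks, copied from the results above.
reference_sums = [
    1.48302e8, 1.48088e8, 1.47888e8, 1.4729e8, 1.47183e8,
    1.49716e8, 1.49592e8, 1.48628e8, 1.48666e8, 1.46879e8,
]

# Assumed threshold rule: mean +/- 3 * population standard deviation.
centre = mean(reference_sums)
spread = pstdev(reference_sums)
lower_threshold = centre - 3 * spread
upper_threshold = centre + 3 * spread

# Analysis chunk [25000:29999]: value and sampling_error as printed above.
value = 2.41891e8
sampling_error = 1.44233e6

# Assumed confidence band: value +/- 3 * sampling_error.
upper_confidence_boundary = value + 3 * sampling_error
lower_confidence_boundary = value - 3 * sampling_error

# An alert is expected when the value lies outside the thresholds.
alert = not (lower_threshold <= value <= upper_threshold)

print(f"thresholds: {lower_threshold:.5e} .. {upper_threshold:.5e}")
print(f"confidence: {lower_confidence_boundary:.5e} .. {upper_confidence_boundary:.5e}")
print(f"alert: {alert}")
```

Up to the rounding of the printed inputs, the recomputed numbers match the upper_threshold, lower_threshold, upper_confidence_boundary and lower_confidence_boundary reported for that chunk, and the chunk is flagged as an alert.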

More information on accessing the information contained in the Result can be found on the Working with results page.

The next step is visualizing the results, which is done using the plot() method. It is recommended to filter results for each column and plot separately.

>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()

[Figures: sum value plots for car_value, debt_to_income_ratio and driver_tenure]

Insights

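We see that only the car_value column exhibits a sustained change in sum value: its sum rises from about 1.5e+08 to about 2.4e+08 in analysis chunk [25000:29999] and stays at that level, raising an alert in every chunk from there to the end of the analysis period. The driver_tenure column raises a single alert in the first analysis chunk, where its sum (22560.9) falls below the lower threshold (22673.1). The debt_to_income_ratio column raises no alerts.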

What Next

We can also inspect the dataset for other Summary Statistics such as Standard Deviation. We can also look for any Data Drift present in the dataset using the Detecting Data Drift functionality of NannyML.