Average
Just The Code
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference, analysis, analysis_targets = nml.load_synthetic_car_loan_dataset()
>>> display(reference.head())
>>> selected_columns = [
...     'car_value', 'debt_to_income_ratio', 'driver_tenure'
... ]
>>> calc = nml.SummaryStatsAvgCalculator(
...     column_names=selected_columns,
... )
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)
>>> display(results.filter(period='all').to_df())
>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()
Walkthrough
The mean value calculation is straightforward. For each chunk, NannyML calculates the mean of every selected numerical column. The values from the reference data chunks are used to calculate the alert thresholds. The mean values from the analysis chunks are then compared against those thresholds, and an alert is generated whenever a value falls outside them.
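The idea can be sketched in plain Python without NannyML. This is only an illustration on made-up numbers: the helper names `chunk_means` and `thresholds` are hypothetical, and the threshold rule (mean of the reference chunk means plus or minus three standard deviations) is an assumption of this sketch rather than a statement of how the library is implemented.

```python
from statistics import mean, pstdev


def chunk_means(values, n_chunks):
    """Split values into n_chunks consecutive, equally sized chunks and
    return the mean of each chunk (trailing leftover values are ignored)."""
    size = len(values) // n_chunks
    return [mean(values[i * size:(i + 1) * size]) for i in range(n_chunks)]


def thresholds(reference_means, multiplier=3.0):
    """Assumed rule: mean of the reference chunk means, plus or minus
    `multiplier` population standard deviations of those means."""
    center = mean(reference_means)
    spread = pstdev(reference_means)
    return center - multiplier * spread, center + multiplier * spread


# Toy data, invented for illustration: a stable reference period and an
# analysis period whose second half shifts upward.
reference = [10, 11, 9, 10, 10, 12, 11, 9, 10, 11, 10, 10]
analysis = [10, 11, 10, 9, 11, 10, 20, 21, 19, 20, 22, 21]

ref_means = chunk_means(reference, n_chunks=4)
lower, upper = thresholds(ref_means)

analysis_means = chunk_means(analysis, n_chunks=4)
# A chunk raises an alert when its mean falls outside [lower, upper].
alerts = [not (lower <= m <= upper) for m in analysis_means]
print(alerts)
```

Here the last two analysis chunks have means far above the band derived from the reference chunks, so only they raise alerts.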
We begin by loading the synthetic car loan dataset provided by the NannyML package.
>>> import nannyml as nml
>>> from IPython.display import display
>>> reference, analysis, analysis_targets = nml.load_synthetic_car_loan_dataset()
>>> display(reference.head())
| | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 2018-01-01 00:00:00.000 | 0.99 | 1 |
| 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 2018-01-01 00:08:43.152 | 0.07 | 0 |
| 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 2018-01-01 00:17:26.304 | 1 | 1 |
| 3 | 22652 | 20K - 20K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 2018-01-01 00:26:09.456 | 0.98 | 1 |
| 4 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 2018-01-01 00:34:52.608 | 0.99 | 1 |
The SummaryStatsAvgCalculator class implements the functionality needed for mean value calculations.
We need to instantiate it with appropriate parameters:
The names of the columns to be evaluated.
Optionally, the name of the column containing the observation timestamps.
Optionally, a chunking approach or a predefined chunker. If neither is provided, the default chunker creating 10 chunks will be used.
Optionally, a threshold strategy to replace the default one. See the thresholds documentation for the available options.
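To make the default chunking concrete, the sketch below shows how splitting 50,000 rows into 10 equal chunks yields boundaries of the form `[start:end]`. The function `count_based_chunks` is a hypothetical helper written for this illustration, not part of NannyML, and it assumes the row count divides evenly by the number of chunks.

```python
def count_based_chunks(n_rows, n_chunks=10):
    """Return inclusive (start_index, end_index) pairs for n_chunks
    consecutive chunks. Assumes n_rows is divisible by n_chunks."""
    size = n_rows // n_chunks
    return [(i * size, (i + 1) * size - 1) for i in range(n_chunks)]


bounds = count_based_chunks(50_000)
# Format each pair the same way a chunk key such as "[0:4999]" is written.
keys = [f"[{start}:{end}]" for start, end in bounds]
print(keys[0], keys[-1])  # prints: [0:4999] [45000:49999]
```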
>>> selected_columns = [
...     'car_value', 'debt_to_income_ratio', 'driver_tenure'
... ]
>>> calc = nml.SummaryStatsAvgCalculator(
...     column_names=selected_columns,
... )
Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with for alert generation. Then the calculate() method will calculate the data quality results on the data provided to it.
The results can be filtered to only include a certain data period, method or column by using the filter() method.
You can evaluate the result data by converting the results into a DataFrame, by calling the to_df() method.
By default this will return a DataFrame with a multi-level column index. The first level represents the column, the second level represents resulting information such as the data quality metric values, the alert thresholds or the associated sampling error.
>>> calc.fit(reference)
>>> results = calc.calculate(analysis)
>>> display(results.filter(period='all').to_df())
| | chunk / key | chunk / chunk_index | chunk / start_index | chunk / end_index | chunk / start_date | chunk / end_date | chunk / period | car_value / value | car_value / sampling_error | car_value / upper_confidence_boundary | car_value / lower_confidence_boundary | car_value / upper_threshold | car_value / lower_threshold | car_value / alert | debt_to_income_ratio / value | debt_to_income_ratio / sampling_error | debt_to_income_ratio / upper_confidence_boundary | debt_to_income_ratio / lower_confidence_boundary | debt_to_income_ratio / upper_threshold | debt_to_income_ratio / lower_threshold | debt_to_income_ratio / alert | driver_tenure / value | driver_tenure / sampling_error | driver_tenure / upper_confidence_boundary | driver_tenure / lower_confidence_boundary | driver_tenure / upper_threshold | driver_tenure / lower_threshold | driver_tenure / alert |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0:4999] | 0 | 0 | 4999 | | | reference | 29660.5 | 288.466 | 30525.9 | 28795.1 | 30193.2 | 29096.1 | False | 0.585079 | 0.00219605 | 0.591667 | 0.578491 | 0.590895 | 0.580378 | False | 4.61614 | 0.0325543 | 4.71381 | 4.51848 | 4.66547 | 4.53461 | False |
| 1 | [5000:9999] | 1 | 5000 | 9999 | | | reference | 29617.7 | 288.466 | 30483.1 | 28752.3 | 30193.2 | 29096.1 | False | 0.582728 | 0.00219605 | 0.589316 | 0.57614 | 0.590895 | 0.580378 | False | 4.61693 | 0.0325543 | 4.71459 | 4.51927 | 4.66547 | 4.53461 | False |
| 2 | [10000:14999] | 2 | 10000 | 14999 | | | reference | 29577.6 | 288.466 | 30443 | 28712.2 | 30193.2 | 29096.1 | False | 0.586344 | 0.00219605 | 0.592932 | 0.579756 | 0.590895 | 0.580378 | False | 4.57162 | 0.0325543 | 4.66928 | 4.47396 | 4.66547 | 4.53461 | False |
| 3 | [15000:19999] | 3 | 15000 | 19999 | | | reference | 29458 | 288.466 | 30323.4 | 28592.6 | 30193.2 | 29096.1 | False | 0.584026 | 0.00219605 | 0.590614 | 0.577438 | 0.590895 | 0.580378 | False | 4.6291 | 0.0325543 | 4.72676 | 4.53144 | 4.66547 | 4.53461 | False |
| 4 | [20000:24999] | 4 | 20000 | 24999 | | | reference | 29436.6 | 288.466 | 30302 | 28571.2 | 30193.2 | 29096.1 | False | 0.585748 | 0.00219605 | 0.592337 | 0.57916 | 0.590895 | 0.580378 | False | 4.5947 | 0.0325543 | 4.69237 | 4.49704 | 4.66547 | 4.53461 | False |
| 5 | [25000:29999] | 5 | 25000 | 29999 | | | reference | 29943.3 | 288.466 | 30808.7 | 29077.9 | 30193.2 | 29096.1 | False | 0.584041 | 0.00219605 | 0.590629 | 0.577452 | 0.590895 | 0.580378 | False | 4.61304 | 0.0325543 | 4.7107 | 4.51537 | 4.66547 | 4.53461 | False |
| 6 | [30000:34999] | 6 | 30000 | 34999 | | | reference | 29918.4 | 288.466 | 30783.8 | 29053 | 30193.2 | 29096.1 | False | 0.587836 | 0.00219605 | 0.594424 | 0.581248 | 0.590895 | 0.580378 | False | 4.57001 | 0.0325543 | 4.66767 | 4.47235 | 4.66547 | 4.53461 | False |
| 7 | [35000:39999] | 7 | 35000 | 39999 | | | reference | 29725.6 | 288.466 | 30591 | 28860.2 | 30193.2 | 29096.1 | False | 0.588643 | 0.00219605 | 0.595231 | 0.582055 | 0.590895 | 0.580378 | False | 4.58064 | 0.0325543 | 4.6783 | 4.48297 | 4.66547 | 4.53461 | False |
| 8 | [40000:44999] | 8 | 40000 | 44999 | | | reference | 29733.2 | 288.466 | 30598.6 | 28867.8 | 30193.2 | 29096.1 | False | 0.584906 | 0.00219605 | 0.591494 | 0.578318 | 0.590895 | 0.580378 | False | 4.62703 | 0.0325543 | 4.72469 | 4.52937 | 4.66547 | 4.53461 | False |
| 9 | [45000:49999] | 9 | 45000 | 49999 | | | reference | 29375.8 | 288.466 | 30241.2 | 28510.4 | 30193.2 | 29096.1 | False | 0.587012 | 0.00219605 | 0.5936 | 0.580424 | 0.590895 | 0.580378 | False | 4.58119 | 0.0325543 | 4.67885 | 4.48353 | 4.66547 | 4.53461 | False |
| 10 | [0:4999] | 0 | 0 | 4999 | | | analysis | 29961.2 | 288.466 | 30826.6 | 29095.8 | 30193.2 | 29096.1 | False | 0.589539 | 0.00219605 | 0.596127 | 0.582951 | 0.590895 | 0.580378 | False | 4.51218 | 0.0325543 | 4.60984 | 4.41452 | 4.66547 | 4.53461 | True |
| 11 | [5000:9999] | 1 | 5000 | 9999 | | | analysis | 29876.1 | 288.466 | 30741.5 | 29010.7 | 30193.2 | 29096.1 | False | 0.584504 | 0.00219605 | 0.591092 | 0.577916 | 0.590895 | 0.580378 | False | 4.58621 | 0.0325543 | 4.68387 | 4.48854 | 4.66547 | 4.53461 | False |
| 12 | [10000:14999] | 2 | 10000 | 14999 | | | analysis | 29877.3 | 288.466 | 30742.7 | 29011.9 | 30193.2 | 29096.1 | False | 0.583473 | 0.00219605 | 0.590061 | 0.576885 | 0.590895 | 0.580378 | False | 4.65065 | 0.0325543 | 4.74832 | 4.55299 | 4.66547 | 4.53461 | False |
| 13 | [15000:19999] | 3 | 15000 | 19999 | | | analysis | 29983.8 | 288.466 | 30849.2 | 29118.4 | 30193.2 | 29096.1 | False | 0.585695 | 0.00219605 | 0.592283 | 0.579106 | 0.590895 | 0.580378 | False | 4.62346 | 0.0325543 | 4.72112 | 4.52579 | 4.66547 | 4.53461 | False |
| 14 | [20000:24999] | 4 | 20000 | 24999 | | | analysis | 29231.8 | 288.466 | 30097.2 | 28366.4 | 30193.2 | 29096.1 | False | 0.589847 | 0.00219605 | 0.596435 | 0.583259 | 0.590895 | 0.580378 | False | 4.5877 | 0.0325543 | 4.68537 | 4.49004 | 4.66547 | 4.53461 | False |
| 15 | [25000:29999] | 5 | 25000 | 29999 | | | analysis | 48378.3 | 288.466 | 49243.7 | 47512.9 | 30193.2 | 29096.1 | True | 0.586023 | 0.00219605 | 0.592611 | 0.579435 | 0.590895 | 0.580378 | False | 4.5831 | 0.0325543 | 4.68076 | 4.48543 | 4.66547 | 4.53461 | False |
| 16 | [30000:34999] | 6 | 30000 | 34999 | | | analysis | 49061.2 | 288.466 | 49926.6 | 48195.8 | 30193.2 | 29096.1 | True | 0.586636 | 0.00219605 | 0.593224 | 0.580048 | 0.590895 | 0.580378 | False | 4.5939 | 0.0325543 | 4.69157 | 4.49624 | 4.66547 | 4.53461 | False |
| 17 | [35000:39999] | 7 | 35000 | 39999 | | | analysis | 48814.5 | 288.466 | 49679.9 | 47949.1 | 30193.2 | 29096.1 | True | 0.586345 | 0.00219605 | 0.592933 | 0.579757 | 0.590895 | 0.580378 | False | 4.5507 | 0.0325543 | 4.64836 | 4.45303 | 4.66547 | 4.53461 | False |
| 18 | [40000:44999] | 8 | 40000 | 44999 | | | analysis | 49046.1 | 288.466 | 49911.5 | 48180.7 | 30193.2 | 29096.1 | True | 0.584932 | 0.00219605 | 0.59152 | 0.578344 | 0.590895 | 0.580378 | False | 4.59686 | 0.0325543 | 4.69453 | 4.4992 | 4.66547 | 4.53461 | False |
| 19 | [45000:49999] | 9 | 45000 | 49999 | | | analysis | 48706.3 | 288.466 | 49571.7 | 47840.9 | 30193.2 | 29096.1 | True | 0.585008 | 0.00219605 | 0.591596 | 0.57842 | 0.590895 | 0.580378 | False | 4.60281 | 0.0325543 | 4.70048 | 4.50515 | 4.66547 | 4.53461 | False |
More information on accessing the information contained in the Result can be found on the Working with results page.
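As a sanity check on how the result columns relate to each other, the sketch below recomputes a few car_value figures from the numbers reported in the results table. The two relationships it tests are inferred from those numbers and are assumptions of this sketch, not documented behaviour: confidence boundaries equal to the value plus or minus three sampling errors, and thresholds equal to the mean of the reference chunk values plus or minus three population standard deviations. The `(column, field)` tuple keys simply mirror the two-level column layout.

```python
from statistics import mean, pstdev

# car_value means of the ten reference chunks, copied from the results table.
reference_values = [29660.5, 29617.7, 29577.6, 29458, 29436.6,
                    29943.3, 29918.4, 29725.6, 29733.2, 29375.8]
sampling_error = 288.466

# Figures reported in the table, keyed like the two-level column index.
reported = {
    ("car_value", "upper_confidence_boundary"): 30525.9,  # first reference chunk
    ("car_value", "lower_confidence_boundary"): 28795.1,  # first reference chunk
    ("car_value", "upper_threshold"): 30193.2,
    ("car_value", "lower_threshold"): 29096.1,
}

# Assumption 1: confidence boundaries are value +/- 3 * sampling_error.
value = reference_values[0]
computed = {
    ("car_value", "upper_confidence_boundary"): value + 3 * sampling_error,
    ("car_value", "lower_confidence_boundary"): value - 3 * sampling_error,
}

# Assumption 2: thresholds are mean +/- 3 population standard deviations
# of the reference chunk values.
center = mean(reference_values)
spread = pstdev(reference_values)
computed[("car_value", "upper_threshold")] = center + 3 * spread
computed[("car_value", "lower_threshold")] = center - 3 * spread

for key, reported_value in reported.items():
    print(key, round(computed[key], 1), reported_value)
```

The recomputed figures agree with the reported ones to within rounding of the table values, which is consistent with the two assumptions but does not prove them.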
The next step is visualizing the results, which is done using the plot() method.
It is recommended to filter results for each column and plot separately.
>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()
Insights
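We see that only the car_value column exhibits a sustained change in mean value: from the sixth analysis chunk onward its mean jumps from roughly 29,200 to 30,000 up to roughly 48,400 to 49,100, well above the upper threshold of 30193.2. The driver_tenure column raises a single alert in the first analysis chunk, where its mean (4.51218) falls just below the lower threshold (4.53461), and debt_to_income_ratio raises none.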
What Next
We can also inspect the dataset for other Summary Statistics such as Standard Deviation. We can also look for any Data Drift present in the dataset using the Detecting Data Drift functionality of NannyML.