Rows Count

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())

>>> calc = nml.SummaryStatsRowCountCalculator(timestamp_column_name='timestamp', chunk_period='M')

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

>>> results.filter(period='analysis').plot().show()

Walkthrough

The Row Count calculation is straightforward. For each chunk, NannyML counts the rows in the selected dataframe. The row counts of the reference chunks are used to calculate the alert thresholds. The row counts of the analysis chunks are then compared against those thresholds, and an alert is raised for any chunk whose count falls outside them.
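
To make the per-chunk counting concrete, here is a minimal sketch of monthly row counting that uses only the Python standard library. It is not NannyML's implementation: the helper name `count_rows_per_month` and the sample timestamps are hypothetical and exist purely for illustration.

```python
from collections import Counter
from datetime import datetime


def count_rows_per_month(timestamps):
    """Count rows per calendar month (hypothetical helper, not part of NannyML)."""
    # Key each row by its 'YYYY-MM' month, similar to the chunk keys of a monthly chunk period.
    counts = Counter(ts.strftime("%Y-%m") for ts in timestamps)
    # Return the months in chronological order.
    return dict(sorted(counts.items()))


# Made-up sample timestamps, for illustration only.
sample = [
    datetime(2018, 1, 1, 0, 0),
    datetime(2018, 1, 15, 12, 30),
    datetime(2018, 2, 3, 8, 0),
    datetime(2018, 2, 20, 9, 45),
    datetime(2018, 2, 28, 23, 59),
    datetime(2018, 3, 1, 0, 0),
]

monthly_counts = count_rows_per_month(sample)
print(monthly_counts)
```

Each key in `monthly_counts` plays the role of a chunk, and each value is that chunk's row count.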

We begin by loading the synthetic car loan dataset provided by the NannyML package.

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()
>>> display(reference_df.head())

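|   | id | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | timestamp | y_pred_proba | y_pred |
|---|----|-----------|--------------|----------------------|-------------|-------------------------|---------------------|---------------|--------|-----------|--------------|--------|
| 0 | 0 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 2018-01-01 00:00:00.000 | 0.99 | 1 |
| 1 | 1 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 2018-01-01 00:08:43.152 | 0.07 | 0 |
| 2 | 2 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 2018-01-01 00:17:26.304 | 1 | 1 |
| 3 | 3 | 22652 | 20K - 20K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 2018-01-01 00:26:09.456 | 0.98 | 1 |
| 4 | 4 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 2018-01-01 00:34:52.608 | 0.99 | 1 |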

The SummaryStatsRowCountCalculator class implements the functionality needed for row count calculations. We need to instantiate it with appropriate optional parameters:

  • timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.

  • chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.

  • chunk_number (Optional): The number of chunks to be created out of data provided for each period.

  • chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.

  • chunker (Optional): A NannyML Chunker object that will handle the aggregation of the provided data in order to create chunks.

  • threshold (Optional): The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the thresholds tutorial.

>>> calc = nml.SummaryStatsRowCountCalculator(timestamp_column_name='timestamp', chunk_period='M')

Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with for alert generation. Then the calculate() method will calculate the row count results on the data provided to it.

The results can be filtered to only include a certain data period by using the filter() method. You can evaluate the result data by converting the results into a DataFrame, by calling the to_df() method. By default this will return a DataFrame with a multi-level column index. The first level is either chunk or rows_count. The second level holds the chunk details (such as key, indices, dates and period) or the row count information (the value, the alert thresholds and the alert flag).

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

|    | chunk / key | chunk / chunk_index | chunk / start_index | chunk / end_index | chunk / start_date | chunk / end_date | chunk / period | rows_count / value | rows_count / upper_threshold | rows_count / lower_threshold | rows_count / alert |
|----|-------------|---------------------|---------------------|-------------------|--------------------|------------------|----------------|--------------------|------------------------------|------------------------------|--------------------|
| 0 | 2018-01 | 0 | 0 | 5119 | 2018-01-01 00:00:00 | 2018-01-31 23:59:59.999999999 | reference | 5120 | 5451.21 | 4548.79 | False |
| 1 | 2018-02 | 1 | 5120 | 9744 | 2018-02-01 00:00:00 | 2018-02-28 23:59:59.999999999 | reference | 4625 | 5451.21 | 4548.79 | False |
| 2 | 2018-03 | 2 | 9745 | 14863 | 2018-03-01 00:00:00 | 2018-03-31 23:59:59.999999999 | reference | 5119 | 5451.21 | 4548.79 | False |
| 3 | 2018-04 | 3 | 14864 | 19818 | 2018-04-01 00:00:00 | 2018-04-30 23:59:59.999999999 | reference | 4955 | 5451.21 | 4548.79 | False |
| 4 | 2018-05 | 4 | 19819 | 24938 | 2018-05-01 00:00:00 | 2018-05-31 23:59:59.999999999 | reference | 5120 | 5451.21 | 4548.79 | False |
| 5 | 2018-06 | 5 | 24939 | 29892 | 2018-06-01 00:00:00 | 2018-06-30 23:59:59.999999999 | reference | 4954 | 5451.21 | 4548.79 | False |
| 6 | 2018-07 | 6 | 29893 | 35012 | 2018-07-01 00:00:00 | 2018-07-31 23:59:59.999999999 | reference | 5120 | 5451.21 | 4548.79 | False |
| 7 | 2018-08 | 7 | 35013 | 40132 | 2018-08-01 00:00:00 | 2018-08-31 23:59:59.999999999 | reference | 5120 | 5451.21 | 4548.79 | False |
| 8 | 2018-09 | 8 | 40133 | 45086 | 2018-09-01 00:00:00 | 2018-09-30 23:59:59.999999999 | reference | 4954 | 5451.21 | 4548.79 | False |
| 9 | 2018-10 | 9 | 45087 | 49999 | 2018-10-01 00:00:00 | 2018-10-31 23:59:59.999999999 | reference | 4913 | 5451.21 | 4548.79 | False |
| 10 | 2018-10 | 0 | 0 | 206 | 2018-10-01 00:00:00 | 2018-10-31 23:59:59.999999999 | analysis | 207 | 5451.21 | 4548.79 | True |
| 11 | 2018-11 | 1 | 207 | 5161 | 2018-11-01 00:00:00 | 2018-11-30 23:59:59.999999999 | analysis | 4955 | 5451.21 | 4548.79 | False |
| 12 | 2018-12 | 2 | 5162 | 10280 | 2018-12-01 00:00:00 | 2018-12-31 23:59:59.999999999 | analysis | 5119 | 5451.21 | 4548.79 | False |
| 13 | 2019-01 | 3 | 10281 | 15400 | 2019-01-01 00:00:00 | 2019-01-31 23:59:59.999999999 | analysis | 5120 | 5451.21 | 4548.79 | False |
| 14 | 2019-02 | 4 | 15401 | 20024 | 2019-02-01 00:00:00 | 2019-02-28 23:59:59.999999999 | analysis | 4624 | 5451.21 | 4548.79 | False |
| 15 | 2019-03 | 5 | 20025 | 25144 | 2019-03-01 00:00:00 | 2019-03-31 23:59:59.999999999 | analysis | 5120 | 5451.21 | 4548.79 | False |
| 16 | 2019-04 | 6 | 25145 | 30099 | 2019-04-01 00:00:00 | 2019-04-30 23:59:59.999999999 | analysis | 4955 | 5451.21 | 4548.79 | False |
| 17 | 2019-05 | 7 | 30100 | 35218 | 2019-05-01 00:00:00 | 2019-05-31 23:59:59.999999999 | analysis | 5119 | 5451.21 | 4548.79 | False |
| 18 | 2019-06 | 8 | 35219 | 40173 | 2019-06-01 00:00:00 | 2019-06-30 23:59:59.999999999 | analysis | 4955 | 5451.21 | 4548.79 | False |
| 19 | 2019-07 | 9 | 40174 | 45293 | 2019-07-01 00:00:00 | 2019-07-31 23:59:59.999999999 | analysis | 5120 | 5451.21 | 4548.79 | False |
| 20 | 2019-08 | 10 | 45294 | 49999 | 2019-08-01 00:00:00 | 2019-08-31 23:59:59.999999999 | analysis | 4706 | 5451.21 | 4548.79 | False |

More information on accessing the information contained in the Result can be found on the Working with results page.
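
The threshold values reported in the results can be reproduced by hand. The sketch below assumes that the thresholds are the mean of the reference row counts plus or minus three population standard deviations. This page does not state that rule, so treat it as an assumption, although it does match the reported values of 5451.21 and 4548.79. The row counts are copied from the results above, and the variable names are illustrative.

```python
from statistics import mean, pstdev

# Row counts of the ten reference chunks, copied from the results.
reference_counts = [5120, 4625, 5119, 4955, 5120, 4954, 5120, 5120, 4954, 4913]

# Row counts of the eleven analysis chunks, keyed by chunk key.
analysis_counts = {
    "2018-10": 207, "2018-11": 4955, "2018-12": 5119, "2019-01": 5120,
    "2019-02": 4624, "2019-03": 5120, "2019-04": 4955, "2019-05": 5119,
    "2019-06": 4955, "2019-07": 5120, "2019-08": 4706,
}

# Assumed rule: mean +/- 3 population standard deviations of the reference counts.
center = mean(reference_counts)
spread = pstdev(reference_counts)
upper_threshold = center + 3 * spread
lower_threshold = center - 3 * spread

# A chunk is flagged when its row count falls outside the threshold band.
alerts = {
    key: not (lower_threshold <= count <= upper_threshold)
    for key, count in analysis_counts.items()
}
alerting_chunks = [key for key, flagged in alerts.items() if flagged]

print(round(upper_threshold, 2), round(lower_threshold, 2))
print(alerting_chunks)
```

Under this assumption the only flagged analysis chunk is 2018-10, which agrees with the alert column in the results.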

The next step is visualizing the results, which is done using the plot() method. It is recommended to filter results for each column and plot separately.

>>> results.filter(period='analysis').plot().show()
[Figure: rows count per chunk for the analysis period (count.svg)]

Insights

We see that with a monthly chunking strategy the October 2018 analysis chunk contains only 207 rows. This is far below the lower threshold of 4548.79, so the chunk raises an alert.

What Next

We can also inspect the dataset for other Summary Statistics such as Average, or use the Data Quality functionality provided by NannyML. Last but not least, we can look for any Data Drift present in the dataset using the Detecting Data Drift functionality of NannyML.