Missing Values Detection

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_titanic_dataset()
>>> display(reference_df.head())

>>> feature_column_names = [
...     'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked',
... ]
>>> calc = nml.MissingValuesCalculator(
...     column_names=feature_column_names,
... )

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()

Walkthrough

NannyML’s approach to missing values detection is straightforward. For each chunk, NannyML calculates the number of missing values. The normalize option converts this count into a relative ratio. The values from the reference data chunks are used to calculate the alert thresholds. The values from the analysis chunks are then compared against those thresholds, and an alert is raised for any chunk that falls outside them.

We begin by loading the Titanic dataset provided by the NannyML package.
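The per-chunk calculation described above can be sketched with plain pandas. This is only an illustration under assumptions: the toy data, the `missing_per_chunk` helper and the chunk size of 3 are hypothetical and are not part of NannyML.

```python
import pandas as pd

# Toy data (not the Titanic set): six rows with a few gaps in each column.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None, 35.0, 54.0],
    "Cabin": [None, "C85", None, "E46", "C123", None],
})


def missing_per_chunk(data, column, chunk_size, normalize=True):
    """Return the missing-value count (or ratio, if normalize) for each chunk."""
    results = []
    for start in range(0, len(data), chunk_size):
        chunk = data[column].iloc[start:start + chunk_size]
        count = int(chunk.isna().sum())
        # normalize=True turns the absolute count into a relative ratio.
        results.append(count / len(chunk) if normalize else count)
    return results


age_counts = missing_per_chunk(df, "Age", chunk_size=3, normalize=False)
age_ratios = missing_per_chunk(df, "Age", chunk_size=3, normalize=True)
cabin_counts = missing_per_chunk(df, "Cabin", chunk_size=3, normalize=False)
print(age_counts, age_ratios, cabin_counts)
```

With normalize switched off the helper returns counts per chunk; with it switched on, the same counts are divided by the chunk length.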

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_titanic_dataset()
>>> display(reference_df.head())

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | boat | body | home.dest | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | nan | S | nan | nan | Bridgerule, Devon | 0 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4 | nan | New York, NY | 1 |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | nan | S | nan | nan | nan | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | D | nan | Scituate, MA | 1 |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | nan | S | nan | nan | Lower Clapton, Middlesex or Erdington, Birmingham | 0 |

The MissingValuesCalculator class implements the functionality needed for missing values calculations. We need to instantiate it with appropriate parameters:

  • column_names: A list with the names of columns to be evaluated.

  • normalize (Optional): A boolean indicating whether we want the absolute count of missing values or their relative ratio. By default it is set to True, which returns the relative ratio.

  • timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.

  • chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.

  • chunk_number (Optional): The number of chunks to be created out of data provided for each period.

  • chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.

  • chunker (Optional): A NannyML Chunker object that will handle the aggregation of the provided data in order to create chunks.

  • thresholds (Optional): The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the thresholds tutorial.
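As an illustration of count-based chunking, the sketch below splits a dataset into a fixed number of chunks and lets the last chunk absorb the remainder. The `chunk_keys` helper is hypothetical and is not NannyML code. The row counts (891 reference rows, 418 analysis rows) and the ten chunks per period are inferred from the chunk keys in the results table of this tutorial, and the rule used here is only an assumption that happens to reproduce those keys.

```python
def chunk_keys(n_rows, n_chunks=10):
    """Split n_rows into n_chunks equal chunks; the last one absorbs the remainder."""
    size = n_rows // n_chunks
    keys = []
    for i in range(n_chunks):
        start = i * size
        # Every chunk holds `size` rows, except the last, which runs to the final row.
        end = n_rows - 1 if i == n_chunks - 1 else start + size - 1
        keys.append(f"[{start}:{end}]")
    return keys


reference_keys = chunk_keys(891)  # assumed number of reference rows
analysis_keys = chunk_keys(418)   # assumed number of analysis rows
print(reference_keys)
print(analysis_keys)
```

Passing chunk_size instead would fix the number of rows per chunk and let the number of chunks vary.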

>>> feature_column_names = [
...     'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked',
... ]
>>> calc = nml.MissingValuesCalculator(
...     column_names=feature_column_names,
... )

Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with for alert generation. Then the calculate() method will calculate the data quality results on the data provided to it.

The results can be filtered to only include a certain data period, method or column by using the filter method. To inspect the results, convert them into a DataFrame by calling the to_df() method. By default this returns a DataFrame with a multi-level index. The first level represents the column; the second level represents the resulting information, such as the data quality metric values, the alert thresholds or the associated sampling error.

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

The output has a two-level column header. The first level is chunk or one of the monitored columns. Under chunk the second level holds key, chunk_index, start_index, end_index, start_date, end_date and period. Under each monitored column it holds value, sampling_error, upper_confidence_boundary, lower_confidence_boundary, upper_threshold, lower_threshold and alert. The output is split into several tables below for readability.

chunk (start_date and end_date are empty for every chunk):

|    | key | chunk_index | start_index | end_index | period |
|----|-----|-------------|-------------|-----------|--------|
| 0 | [0:88] | 0 | 0 | 88 | reference |
| 1 | [89:177] | 1 | 89 | 177 | reference |
| 2 | [178:266] | 2 | 178 | 266 | reference |
| 3 | [267:355] | 3 | 267 | 355 | reference |
| 4 | [356:444] | 4 | 356 | 444 | reference |
| 5 | [445:533] | 5 | 445 | 533 | reference |
| 6 | [534:622] | 6 | 534 | 622 | reference |
| 7 | [623:711] | 7 | 623 | 711 | reference |
| 8 | [712:800] | 8 | 712 | 800 | reference |
| 9 | [801:890] | 9 | 801 | 890 | reference |
| 10 | [0:40] | 0 | 0 | 40 | analysis |
| 11 | [41:81] | 1 | 41 | 81 | analysis |
| 12 | [82:122] | 2 | 82 | 122 | analysis |
| 13 | [123:163] | 3 | 123 | 163 | analysis |
| 14 | [164:204] | 4 | 164 | 204 | analysis |
| 15 | [205:245] | 5 | 205 | 245 | analysis |
| 16 | [246:286] | 6 | 246 | 286 | analysis |
| 17 | [287:327] | 7 | 287 | 327 | analysis |
| 18 | [328:368] | 8 | 328 | 368 | analysis |
| 19 | [369:417] | 9 | 369 | 417 | analysis |

Pclass, Name, Sex, SibSp, Parch and Ticket: in every chunk, value, sampling_error, upper_confidence_boundary, lower_confidence_boundary and upper_threshold are 0, lower_threshold is empty and alert is False.

Fare: identical to the columns above, except for row 13 (analysis chunk [123:163]), where value is 0.0243902, sampling_error is 0, upper_confidence_boundary is 0.0243902, lower_confidence_boundary is 0.0243902, upper_threshold is 0, lower_threshold is empty and alert is True.

Age (upper_threshold is 0.293608 and lower_threshold is 0.103796 in every chunk):

|    | value | sampling_error | upper_confidence_boundary | lower_confidence_boundary | alert |
|----|-------|----------------|---------------------------|---------------------------|-------|
| 0 | 0.235955 | 0.0422925 | 0.362832 | 0.109078 | False |
| 1 | 0.157303 | 0.0422925 | 0.284181 | 0.030426 | False |
| 2 | 0.191011 | 0.0422925 | 0.317889 | 0.0641338 | False |
| 3 | 0.202247 | 0.0422925 | 0.329125 | 0.0753698 | False |
| 4 | 0.202247 | 0.0422925 | 0.329125 | 0.0753698 | False |
| 5 | 0.258427 | 0.0422925 | 0.385304 | 0.13155 | False |
| 6 | 0.224719 | 0.0422925 | 0.351597 | 0.0978417 | False |
| 7 | 0.179775 | 0.0422925 | 0.306653 | 0.0528979 | False |
| 8 | 0.179775 | 0.0422925 | 0.306653 | 0.0528979 | False |
| 9 | 0.155556 | 0.0420569 | 0.281726 | 0.029385 | False |
| 10 | 0.146341 | 0.0623112 | 0.333275 | 0 | False |
| 11 | 0.146341 | 0.0623112 | 0.333275 | 0 | False |
| 12 | 0.292683 | 0.0623112 | 0.479617 | 0.105749 | False |
| 13 | 0.219512 | 0.0623112 | 0.406446 | 0.0325786 | False |
| 14 | 0.195122 | 0.0623112 | 0.382056 | 0.00818835 | False |
| 15 | 0.219512 | 0.0623112 | 0.406446 | 0.0325786 | False |
| 16 | 0.292683 | 0.0623112 | 0.479617 | 0.105749 | False |
| 17 | 0.195122 | 0.0623112 | 0.382056 | 0.00818835 | False |
| 18 | 0.195122 | 0.0623112 | 0.382056 | 0.00818835 | False |
| 19 | 0.163265 | 0.056998 | 0.334259 | 0 | False |

Cabin (upper_threshold is 0.915233 and lower_threshold is 0.626814 in every chunk):

|    | value | sampling_error | upper_confidence_boundary | lower_confidence_boundary | alert |
|----|-------|----------------|---------------------------|---------------------------|-------|
| 0 | 0.808989 | 0.044537 | 0.9426 | 0.675378 | False |
| 1 | 0.797753 | 0.044537 | 0.931364 | 0.664142 | False |
| 2 | 0.797753 | 0.044537 | 0.931364 | 0.664142 | False |
| 3 | 0.662921 | 0.044537 | 0.796532 | 0.52931 | False |
| 4 | 0.842697 | 0.044537 | 0.976308 | 0.709086 | False |
| 5 | 0.730337 | 0.044537 | 0.863948 | 0.596726 | False |
| 6 | 0.786517 | 0.044537 | 0.920128 | 0.652906 | False |
| 7 | 0.752809 | 0.044537 | 0.88642 | 0.619198 | False |
| 8 | 0.741573 | 0.044537 | 0.875184 | 0.607962 | False |
| 9 | 0.788889 | 0.0442889 | 0.921755 | 0.656022 | False |
| 10 | 0.853659 | 0.0656181 | 1 | 0.656804 | False |
| 11 | 0.609756 | 0.0656181 | 0.80661 | 0.412902 | True |
| 12 | 0.780488 | 0.0656181 | 0.977342 | 0.583633 | False |
| 13 | 0.853659 | 0.0656181 | 1 | 0.656804 | False |
| 14 | 0.780488 | 0.0656181 | 0.977342 | 0.583633 | False |
| 15 | 0.780488 | 0.0656181 | 0.977342 | 0.583633 | False |
| 16 | 0.926829 | 0.0656181 | 1 | 0.729975 | True |
| 17 | 0.707317 | 0.0656181 | 0.904171 | 0.510463 | False |
| 18 | 0.829268 | 0.0656181 | 1 | 0.632414 | False |
| 19 | 0.714286 | 0.060023 | 0.894355 | 0.534217 | False |

Embarked (upper_threshold is 0.0156432, lower_threshold is empty and lower_confidence_boundary is 0 in every chunk):

|    | value | sampling_error | upper_confidence_boundary | alert |
|----|-------|----------------|---------------------------|-------|
| 0 | 0.011236 | 0.00501641 | 0.0262852 | False |
| 1 | 0 | 0.00501641 | 0.0150492 | False |
| 2 | 0 | 0.00501641 | 0.0150492 | False |
| 3 | 0 | 0.00501641 | 0.0150492 | False |
| 4 | 0 | 0.00501641 | 0.0150492 | False |
| 5 | 0 | 0.00501641 | 0.0150492 | False |
| 6 | 0 | 0.00501641 | 0.0150492 | False |
| 7 | 0 | 0.00501641 | 0.0150492 | False |
| 8 | 0 | 0.00501641 | 0.0150492 | False |
| 9 | 0.0111111 | 0.00498846 | 0.0260765 | False |
| 10 | 0 | 0.00739088 | 0.0221726 | False |
| 11 | 0 | 0.00739088 | 0.0221726 | False |
| 12 | 0 | 0.00739088 | 0.0221726 | False |
| 13 | 0 | 0.00739088 | 0.0221726 | False |
| 14 | 0 | 0.00739088 | 0.0221726 | False |
| 15 | 0 | 0.00739088 | 0.0221726 | False |
| 16 | 0 | 0.00739088 | 0.0221726 | False |
| 17 | 0 | 0.00739088 | 0.0221726 | False |
| 18 | 0 | 0.00739088 | 0.0221726 | False |
| 19 | 0 | 0.00676068 | 0.020282 | False |

More information on accessing the information contained in the Result can be found on the Working with results page.
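To get a feel for the two-level column layout, the sketch below hand-builds a small pandas DataFrame with the same kind of header and selects from it. It does not call NannyML. The three rows are Cabin values for analysis chunks copied from the results above, and only a subset of the second-level columns is included.

```python
import pandas as pd

# First level: "chunk" or the monitored column; second level: the piece of information.
columns = pd.MultiIndex.from_tuples([
    ("chunk", "key"),
    ("chunk", "period"),
    ("Cabin", "value"),
    ("Cabin", "upper_threshold"),
    ("Cabin", "lower_threshold"),
    ("Cabin", "alert"),
])
rows = [
    ("[0:40]", "analysis", 0.853659, 0.915233, 0.626814, False),
    ("[41:81]", "analysis", 0.609756, 0.915233, 0.626814, True),
    ("[246:286]", "analysis", 0.926829, 0.915233, 0.626814, True),
]
df = pd.DataFrame(rows, columns=columns)

# A (first level, second level) tuple selects a single column.
cabin_values = df[("Cabin", "value")]

# Keys of the chunks where the Cabin alert flag is set.
alerting_keys = df.loc[df[("Cabin", "alert")], ("chunk", "key")].tolist()
print(alerting_keys)
```

The same tuple-based selection should carry over to a DataFrame with this header shape, whatever produced it.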

The next step is visualizing the results, which is done using the plot() method. It is recommended to filter results for each column and plot separately.

>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()

[Figures: missing values plots for the Age, Cabin, Embarked, Fare, Name, Parch, Pclass, Sex, SibSp and Ticket columns]
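The confidence boundaries in the results appear to follow a simple rule: the value plus or minus three times the sampling error, clipped to the range from 0 to 1. This rule is inferred from the numbers in the results and is not a statement about NannyML internals. The `confidence_boundaries` helper below is hypothetical.

```python
def confidence_boundaries(value, sampling_error, multiplier=3.0):
    """Return (lower, upper) as value -/+ multiplier * sampling_error, clipped to [0, 1]."""
    lower = max(0.0, value - multiplier * sampling_error)
    upper = min(1.0, value + multiplier * sampling_error)
    return lower, upper


# Age, reference chunk [0:88]: value and sampling_error copied from the results.
age_lower, age_upper = confidence_boundaries(0.235955, 0.0422925)

# Cabin, analysis chunk [0:40]: the upper boundary is clipped at 1.
cabin_lower, cabin_upper = confidence_boundaries(0.853659, 0.0656181)

print(age_lower, age_upper)
print(cabin_lower, cabin_upper)
```

The clipping explains why several analysis chunks report an upper boundary of exactly 1 or a lower boundary of exactly 0.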

Insights

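We see that most of the dataset columns don’t have missing values. The Age and Cabin columns are the most interesting with regard to missing values: both have missing values in every chunk, and Cabin raises alerts in analysis chunks [41:81] and [246:286]. Fare also raises a single alert, in analysis chunk [123:163], the only chunk where it has a missing value.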
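The Cabin thresholds in the results are consistent with the mean of the reference chunk values plus or minus three population standard deviations. The sketch below reproduces them with NumPy under that assumption. The rule is inferred from the numbers, not taken from NannyML, and `std_thresholds` is a hypothetical helper.

```python
import numpy as np

# Cabin missing-value ratios for the ten reference chunks, copied from the results.
cabin_reference = np.array([
    0.808989, 0.797753, 0.797753, 0.662921, 0.842697,
    0.730337, 0.786517, 0.752809, 0.741573, 0.788889,
])


def std_thresholds(reference_values, multiplier=3.0):
    """Return (lower, upper) as mean -/+ multiplier * population standard deviation."""
    mean = reference_values.mean()
    std = reference_values.std()  # ddof=0: population standard deviation
    return mean - multiplier * std, mean + multiplier * std


lower, upper = std_thresholds(cabin_reference)

# Cabin values for analysis chunks [0:40], [41:81] and [246:286].
analysis_values = [0.853659, 0.609756, 0.926829]
alerts = [bool(v < lower or v > upper) for v in analysis_values]
print(lower, upper, alerts)
```

Chunk [41:81] falls below the lower threshold and chunk [246:286] rises above the upper one, which matches the two Cabin alerts in the results.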

What Next

We can also inspect the dataset for Unseen Values in the Unseen Values Tutorial. Then we can look for any Data Drift present in the dataset using the Detecting Data Drift functionality of NannyML.