Missing Values Detection

Just The Code

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_titanic_dataset()
>>> display(reference_df.head())

>>> feature_column_names = [
...     'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked',
... ]
>>> calc = nml.MissingValuesCalculator(
...     column_names=feature_column_names,
... )

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()

Walkthrough

NannyML’s approach to missing values detection is straightforward. For each chunk, NannyML counts the missing values in every selected column. The normalize option converts that count into a relative ratio, that is, the fraction of values in the chunk that are missing. The values computed on the reference data chunks are used to derive the alert thresholds. The values computed on the analysis chunks are then compared against those thresholds, and an alert is raised whenever a threshold is crossed.
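The per-chunk calculation described above can be sketched with plain pandas. This is a minimal illustration and not NannyML's implementation: the toy data, the chunk_size of 3 and the variable names are all made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one column with some missing entries.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, np.nan, 35.0]})

chunk_size = 3  # hypothetical chunk size for this illustration
chunk_ids = np.arange(len(df)) // chunk_size  # 0, 0, 0, 1, 1, 1

missing = df["Age"].isna()

# Absolute number of missing values per chunk (no normalization).
counts = missing.groupby(chunk_ids).sum()

# Fraction of missing values per chunk (what normalization produces).
ratios = missing.groupby(chunk_ids).mean()

print(counts.tolist())  # [1, 2]
print(ratios.round(3).tolist())  # [0.333, 0.667]
```

The first chunk has one missing value out of three and the second has two out of three, so the normalized values are one third and two thirds.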

We begin by loading the Titanic dataset provided by the NannyML package.

>>> import nannyml as nml
>>> from IPython.display import display

>>> reference_df, analysis_df, analysis_targets_df = nml.load_titanic_dataset()
>>> display(reference_df.head())

	PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	boat	body	home.dest	Survived
0	1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.25	nan	S	nan	nan	Bridgerule, Devon	0
1	2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.2833	C85	C	4	nan	New York, NY	1
2	3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	nan	S	nan	nan	nan	1
3	4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1	C123	S	D	nan	Scituate, MA	1
4	5	3	Allen, Mr. William Henry	male	35	0	373450	8.05	nan	S	nan	nan	Lower Clapton, Middlesex or Erdington, Birmingham	0

The MissingValuesCalculator class implements the functionality needed for missing values calculations. We need to instantiate it with appropriate parameters:

column_names: A list with the names of columns to be evaluated.
normalize (Optional): A boolean that selects between the absolute count of missing values (False) and their relative ratio (True). By default it is set to True.
timestamp_column_name (Optional): The name of the column in the reference data that contains timestamps.
chunk_size (Optional): The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about chunking configurations check out the chunking tutorial.
chunk_number (Optional): The number of chunks to be created out of data provided for each period.
chunk_period (Optional): The time period based on which we aggregate the provided data in order to create chunks.
chunker (Optional): A NannyML Chunker object that will handle the aggregation of the provided data in order to create chunks.
thresholds (Optional): The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the thresholds tutorial.

>>> feature_column_names = [
...     'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked',
... ]
>>> calc = nml.MissingValuesCalculator(
...     column_names=feature_column_names,
... )

Next, the fit() method needs to be called on the reference data, which provides the baseline that the analysis data will be compared with for alert generation. Then the calculate() method will calculate the data quality results on the data provided to it.

The results can be filtered to only include a certain data period, method or column by using the filter() method. To inspect the results, convert them into a DataFrame by calling the to_df() method. By default this returns a DataFrame with a multi-level column index. The first level represents the column, and the second level represents the resulting information, such as the data quality metric values, the alert thresholds or the associated sampling error.

>>> calc.fit(reference_df)
>>> results = calc.calculate(analysis_df)
>>> display(results.filter(period='all').to_df())

	chunk key	chunk_index	start_index	end_index	period	Pclass alert	Name alert	Sex alert	Age value	Age sampling_error	Age upper_confidence_boundary	Age lower_confidence_boundary	Age upper_threshold	Age alert	SibSp alert	Parch alert	Ticket alert	Fare value	Fare upper_confidence_boundary	Fare lower_confidence_boundary	Fare alert	Cabin value	Cabin sampling_error	Cabin upper_confidence_boundary	Cabin lower_confidence_boundary	Cabin upper_threshold	Cabin lower_threshold	Cabin alert	Embarked value	Embarked sampling_error	Embarked upper_confidence_boundary	Embarked upper_threshold	Embarked alert
0	[0:88]	0	0	88	reference	False	False	False	0.235955	0.0422925	0.362832	0.109078	0.374409	False	False	False	False	0	0	0	False	0.808989	0.044537	0.9426	0.675378	1	0.55096	False	0.011236	0.00501641	0.0262852	0.0150438	False
1	[89:177]	1	89	177	reference	False	False	False	0.157303	0.0422925	0.284181	0.030426	0.374409	False	False	False	False	0	0	0	False	0.797753	0.044537	0.931364	0.664142	1	0.55096	False	0	0.00501641	0.0150492	0.0150438	False
2	[178:266]	2	178	266	reference	False	False	False	0.191011	0.0422925	0.317889	0.0641338	0.374409	False	False	False	False	0	0	0	False	0.797753	0.044537	0.931364	0.664142	1	0.55096	False	0	0.00501641	0.0150492	0.0150438	False
3	[267:355]	3	267	355	reference	False	False	False	0.202247	0.0422925	0.329125	0.0753698	0.374409	False	False	False	False	0	0	0	False	0.662921	0.044537	0.796532	0.52931	1	0.55096	False	0	0.00501641	0.0150492	0.0150438	False
4	[356:444]	4	356	444	reference	False	False	False	0.202247	0.0422925	0.329125	0.0753698	0.374409	False	False	False	False	0	0	0	False	0.842697	0.044537	0.976308	0.709086	1	0.55096	False	0	0.00501641	0.0150492	0.0150438	False
5	[445:533]	5	445	533	reference	False	False	False	0.258427	0.0422925	0.385304	0.13155	0.374409	False	False	False	False	0	0	0	False	0.730337	0.044537	0.863948	0.596726	1	0.55096	False	0	0.00501641	0.0150492	0.0150438	False
6	[534:622]	6	534	622	reference	False	False	False	0.224719	0.0422925	0.351597	0.0978417	0.374409	False	False	False	False	0	0	0	False	0.786517	0.044537	0.920128	0.652906	1	0.55096	False	0	0.00501641	0.0150492	0.0150438	False
7	[623:711]	7	623	711	reference	False	False	False	0.179775	0.0422925	0.306653	0.0528979	0.374409	False	False	False	False	0	0	0	False	0.752809	0.044537	0.88642	0.619198	1	0.55096	False	0	0.00501641	0.0150492	0.0150438	False
8	[712:800]	8	712	800	reference	False	False	False	0.179775	0.0422925	0.306653	0.0528979	0.374409	False	False	False	False	0	0	0	False	0.741573	0.044537	0.875184	0.607962	1	0.55096	False	0	0.00501641	0.0150492	0.0150438	False
9	[801:889]	9	801	889	reference	False	False	False	0.157303	0.0422925	0.284181	0.030426	0.374409	False	False	False	False	0	0	0	False	0.786517	0.044537	0.920128	0.652906	1	0.55096	False	0.011236	0.00501641	0.0262852	0.0150438	False
10	[890:890]	10	890	890	reference	False	False	False	0	0.398986	1	0	0.374409	False	False	False	False	0	0	0	False	1	0.420161	1	0	1	0.55096	False	0	0.0473247	0.141974	0.0150438	False
11	[0:40]	0	0	40	analysis	False	False	False	0.146341	0.0623112	0.333275	0	0.374409	False	False	False	False	0	0	0	False	0.853659	0.0656181	1	0.656804	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
12	[41:81]	1	41	81	analysis	False	False	False	0.146341	0.0623112	0.333275	0	0.374409	False	False	False	False	0	0	0	False	0.609756	0.0656181	0.80661	0.412902	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
13	[82:122]	2	82	122	analysis	False	False	False	0.292683	0.0623112	0.479617	0.105749	0.374409	False	False	False	False	0	0	0	False	0.780488	0.0656181	0.977342	0.583633	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
14	[123:163]	3	123	163	analysis	False	False	False	0.219512	0.0623112	0.406446	0.0325786	0.374409	False	False	False	False	0.0243902	0.0243902	0.0243902	True	0.853659	0.0656181	1	0.656804	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
15	[164:204]	4	164	204	analysis	False	False	False	0.195122	0.0623112	0.382056	0.00818835	0.374409	False	False	False	False	0	0	0	False	0.780488	0.0656181	0.977342	0.583633	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
16	[205:245]	5	205	245	analysis	False	False	False	0.219512	0.0623112	0.406446	0.0325786	0.374409	False	False	False	False	0	0	0	False	0.780488	0.0656181	0.977342	0.583633	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
17	[246:286]	6	246	286	analysis	False	False	False	0.292683	0.0623112	0.479617	0.105749	0.374409	False	False	False	False	0	0	0	False	0.926829	0.0656181	1	0.729975	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
18	[287:327]	7	287	327	analysis	False	False	False	0.195122	0.0623112	0.382056	0.00818835	0.374409	False	False	False	False	0	0	0	False	0.707317	0.0656181	0.904171	0.510463	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
19	[328:368]	8	328	368	analysis	False	False	False	0.195122	0.0623112	0.382056	0.00818835	0.374409	False	False	False	False	0	0	0	False	0.829268	0.0656181	1	0.632414	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
20	[369:409]	9	369	409	analysis	False	False	False	0.097561	0.0623112	0.284495	0	0.374409	False	False	False	False	0	0	0	False	0.707317	0.0656181	0.904171	0.510463	1	0.55096	False	0	0.00739088	0.0221726	0.0150438	False
21	[410:417]	10	410	417	analysis	False	False	False	0.5	0.141063	0.923189	0.0768111	0.374409	True	False	False	False	0	0	0	False	0.75	0.148549	1	0.304352	1	0.55096	False	0	0.0167318	0.0501955	0.0150438	False
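
The upper threshold shown for Age can be reproduced by hand. The sketch below assumes that the thresholds are the mean of the reference chunk values plus or minus three population standard deviations. That rule is an assumption inferred from the numbers in the table, not something stated in this tutorial. It uses only Python's standard library, with the Age values copied from the table above.

```python
from statistics import fmean, pstdev

# Age missing-value ratios per reference chunk, copied from the table.
reference_values = [
    0.235955, 0.157303, 0.191011, 0.202247, 0.202247, 0.258427,
    0.224719, 0.179775, 0.179775, 0.157303, 0.0,
]
# Age missing-value ratios per analysis chunk, copied from the table.
analysis_values = [
    0.146341, 0.146341, 0.292683, 0.219512, 0.195122, 0.219512,
    0.292683, 0.195122, 0.195122, 0.097561, 0.5,
]

center = fmean(reference_values)
spread = pstdev(reference_values)  # population standard deviation (assumed)

# Assumed rule: mean +/- 3 standard deviations of the reference values.
upper_threshold = center + 3 * spread
lower_threshold = center - 3 * spread  # negative here, so it cannot be crossed

# A chunk raises an alert when its value falls outside the thresholds.
alerts = [v > upper_threshold or v < lower_threshold for v in analysis_values]

print(round(upper_threshold, 4))  # 0.3744
print(alerts)
```

Under this assumption the upper threshold comes out at roughly 0.3744, matching the table, and only the last analysis chunk (value 0.5) is flagged, which agrees with the True alert in that row.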

More information on accessing the information contained in the Result can be found on the Working with results page.
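
As a rough illustration of the multi-level layout that to_df() returns, the toy frame below mimics two levels of column labels and shows how to select by them with plain pandas. The frame is built by hand from a few values in the table above. It is not produced by NannyML, and the real result contains more columns.

```python
import pandas as pd

# Hand-built stand-in for the result frame: first level is the column
# (or "chunk"), second level is the kind of information.
columns = pd.MultiIndex.from_tuples([
    ("chunk", "key"), ("chunk", "period"),
    ("Age", "value"), ("Age", "upper_threshold"), ("Age", "alert"),
])
toy = pd.DataFrame(
    [
        ["[0:88]", "reference", 0.235955, 0.374409, False],
        ["[89:177]", "reference", 0.157303, 0.374409, False],
    ],
    columns=columns,
)

# Selecting on the first level returns every second-level column for it.
age = toy["Age"]

# Selecting with a (first, second) tuple returns a single Series.
age_values = toy[("Age", "value")]

print(list(age.columns))  # ['value', 'upper_threshold', 'alert']
print(age_values.tolist())  # [0.235955, 0.157303]
```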

The next step is visualizing the results, which is done using the plot() method. It is recommended to filter results for each column and plot separately.

>>> for column_name in results.column_names:
...     results.filter(column_names=column_name).plot().show()

Insights

We see that most of the dataset columns don’t have missing values. The Age and Cabin columns are the most interesting with regard to missing values.

What Next

We can also inspect the dataset for Unseen Values in the Unseen Values Tutorial. Then we can look for any Data Drift present in the dataset using the Detecting Data Drift functionality of NannyML.