.. _unseen_values:

=======================
Unseen Values Detection
=======================

Just The Code
-------------

.. nbimport::
    :path: ./example_notebooks/Tutorial - Unseen Values.ipynb
    :cells: 1 3 4 6

.. _unseen_values_walkthrough:

Walkthrough
-----------

NannyML defines :term:`unseen values` as categorical feature values that are not present in the
:term:`reference period`. NannyML's approach to unseen values detection is simple.
The reference :term:`period` is used to create a set of expected values for each categorical feature.
For each :term:`chunk` in the analysis :term:`period` NannyML calculates the number of unseen values.
There is an option, called ``normalize``, to convert the count of values to a relative ratio if needed.
If unseen values are detected in a chunk, an alert is raised for the relevant feature.

We begin by loading the :ref:`titanic dataset` provided by the NannyML package.

.. nbimport::
    :path: ./example_notebooks/Tutorial - Unseen Values.ipynb
    :cells: 1

.. nbtable::
    :path: ./example_notebooks/Tutorial - Unseen Values.ipynb
    :cell: 2

The :class:`~nannyml.data_quality.unseen.calculator.UnseenValuesCalculator` class implements
the functionality needed for unseen values calculations.
We need to instantiate it with appropriate parameters:

- **column_names:** A list with the names of columns to be evaluated. They need to be categorical columns.
- **normalize (Optional):** A boolean indicating whether to report the relative ratio of unseen values (``True``) or their absolute count (``False``). By default it is set to ``True``.
- **timestamp_column_name (Optional):** The name of the column in the reference data that contains timestamps.
- **chunk_size (Optional):** The number of observations in each chunk of data used. Only one chunking argument needs to be provided. For more information about :term:`chunking` configurations check out the :ref:`chunking tutorial`.
- **chunk_number (Optional):** The number of chunks to be created out of data provided for each :ref:`period`.
- **chunk_period (Optional):** The time period based on which we aggregate the provided data in order to create chunks.
- **chunker (Optional):** A NannyML :class:`~nannyml.chunk.Chunker` object that will handle the aggregation of the provided data in order to create chunks.
- **thresholds (Optional):** The threshold strategy used to calculate the alert threshold limits. For more information about thresholds, check out the :ref:`thresholds tutorial`.

.. warning::

    Because of how unseen values are defined, they will be 0 by definition for the
    :term:`reference period`. Hence the :ref:`StandardDeviationThreshold` threshold option
    is not really applicable for this calculator.

.. nbimport::
    :path: ./example_notebooks/Tutorial - Unseen Values.ipynb
    :cells: 3

Next, the :meth:`~nannyml.base.AbstractCalculator.fit` method needs to be called on the reference data,
which provides the baseline that the analysis data will be compared with for :term:`alert` generation.
Then the :meth:`~nannyml.base.AbstractCalculator.calculate` method will calculate the data quality
results on the data provided to it.

The results can be filtered to only include a certain data period, method or column by using the ``filter`` method.
You can evaluate the result data by converting the results into a `DataFrame`,
by calling the :meth:`~nannyml.base.AbstractResult.to_df` method.
By default this will return a `DataFrame` with a multi-level index. The first level represents the column,
the second level represents resulting information such as the data quality metric values and the alert thresholds.

.. nbimport::
    :path: ./example_notebooks/Tutorial - Unseen Values.ipynb
    :cells: 4

.. nbtable::
    :path: ./example_notebooks/Tutorial - Unseen Values.ipynb
    :cell: 5

More information on accessing the information contained in the
:class:`~nannyml.data_quality.unseen.result.Result`
can be found on the :ref:`working_with_results` page.

The next step is visualizing the results, which is done using the
:meth:`~nannyml.data_quality.unseen.result.Result.plot` method.
It is recommended to filter results for each column and plot separately.

.. nbimport::
    :path: ./example_notebooks/Tutorial - Unseen Values.ipynb
    :cells: 6

.. image:: /_static/tutorials/data_quality/unseen-titanic-Cabin.svg

.. image:: /_static/tutorials/data_quality/unseen-titanic-Embarked.svg

.. image:: /_static/tutorials/data_quality/unseen-titanic-Sex.svg

.. image:: /_static/tutorials/data_quality/unseen-titanic-Ticket.svg

Insights
--------

We see that most of the dataset columns don't have unseen values. The **Ticket** and **Cabin**
columns are the most interesting with regard to unseen values.

What Next
---------

We can also inspect the dataset for missing values in the :ref:`Missing Values Tutorial`.
Then we can look for any :term:`Data Drift` present in the dataset using the :ref:`data-drift`
functionality of NannyML.
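
For readers who want to build some intuition before moving on, the counting logic described in the walkthrough can be sketched in plain pandas. This is only an illustration and not NannyML's implementation: the function name ``unseen_values_per_chunk``, the toy data, the fixed-size chunking and the choice to ignore missing values are all assumptions made for this example.

```python
import pandas as pd


def unseen_values_per_chunk(reference, analysis, column, chunk_size, normalize=True):
    """Hypothetical helper: count values per chunk that never occur in the reference data."""
    # The reference data defines the set of expected values for the column.
    seen = set(reference[column].dropna().unique())

    rows = []
    for start in range(0, len(analysis), chunk_size):
        # Assumption: missing values are ignored rather than counted as unseen.
        chunk = analysis[column].iloc[start:start + chunk_size].dropna()
        n_unseen = int((~chunk.isin(seen)).sum())
        if normalize:
            # Relative ratio of unseen values within the chunk.
            value = n_unseen / len(chunk) if len(chunk) else 0.0
        else:
            # Absolute count of unseen values within the chunk.
            value = n_unseen
        # An alert is flagged as soon as any unseen value appears in the chunk.
        rows.append({"chunk_start": start, "value": value, "alert": n_unseen > 0})
    return pd.DataFrame(rows)


# Toy data, made up for illustration only.
reference = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
analysis = pd.DataFrame({"Embarked": ["S", "X", "C", "Q", "Y", "Z"]})

counts = unseen_values_per_chunk(reference, analysis, "Embarked", chunk_size=3, normalize=False)
ratios = unseen_values_per_chunk(reference, analysis, "Embarked", chunk_size=3, normalize=True)
print(counts)
print(ratios)
```

Running the same sketch on the reference data itself yields 0 for every chunk, which illustrates why a threshold derived from the reference period's standard deviation is not informative for this metric.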