nannyml.datasets.datasets module

Utility module offering curated datasets for quick experimentation.

nannyml.datasets.datasets.load_csv_file_to_df(local_file: str) → DataFrame

Loads a CSV data file from within the NannyML package.

Parameters:

local_file (str, required) – string with the name of the data file to be loaded.

Returns:

df – A DataFrame containing the requested data

Return type:

pd.DataFrame
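
The description above says the file is read from inside the installed package. One common way to locate such a bundled file is `importlib.resources`. The sketch below illustrates that idea only and is not taken from NannyML's source: the helper name `packaged_file_path` and its `package` argument are hypothetical, and the standard-library `json` package stands in for a package that ships data files.

```python
from importlib import resources


def packaged_file_path(package: str, local_file: str):
    """Return a path-like handle to `local_file` inside an importable package.

    Hypothetical helper for illustration; not part of NannyML.
    """
    # resources.files() (Python 3.9+) returns a Traversable rooted at the package.
    return resources.files(package).joinpath(local_file)


# Stand-in demo: locate a file that ships with the standard-library `json` package.
path = packaged_file_path("json", "__init__.py")
print(path.is_file())
```

A path obtained this way could then be passed to `pandas.read_csv` to build the DataFrame.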

nannyml.datasets.datasets.load_modified_california_housing_dataset()

Loads the modified California housing dataset provided for testing the NannyML package.

This dataset has been altered to represent a binary classification problem over time. More information about the dataset can be found at: California Housing Dataset

Returns:

  • reference (pd.DataFrame) – A DataFrame containing the reference period of the modified California housing dataset

  • analysis (pd.DataFrame) – A DataFrame containing the analysis period of the modified California housing dataset

  • analysis_tgt (pd.DataFrame) – A DataFrame containing the target values for the analysis period of the modified California housing dataset

Examples

>>> from nannyml.datasets import load_modified_california_housing_dataset
>>> reference_df, analysis_df, analysis_targets_df = load_modified_california_housing_dataset()

nannyml.datasets.datasets.load_pq_file_to_df(local_file: str) → DataFrame

Loads a Parquet data file from within the NannyML package.

Parameters:

local_file (str, required) – string with the name of the data file to be loaded.

Returns:

df – A DataFrame containing the requested data

Return type:

pd.DataFrame

nannyml.datasets.datasets.load_synthetic_binary_classification_dataset()

Loads the synthetic binary classification dataset provided for testing the NannyML package.

Returns:

  • reference (pd.DataFrame) – A DataFrame containing the reference period of the synthetic binary classification dataset

  • analysis (pd.DataFrame) – A DataFrame containing the analysis period of the synthetic binary classification dataset

  • analysis_tgt (pd.DataFrame) – A DataFrame containing the target values for the analysis period of the synthetic binary classification dataset

Examples

>>> from nannyml.datasets import load_synthetic_binary_classification_dataset
>>> reference_df, analysis_df, analysis_targets_df = load_synthetic_binary_classification_dataset()

nannyml.datasets.datasets.load_synthetic_car_loan_data_quality_dataset()

Loads the synthetic car loan binary classification dataset with missing values, provided for testing the NannyML package.

Returns:

  • reference (pd.DataFrame) – A DataFrame containing the reference period of the synthetic car loan binary classification dataset with missing values

  • analysis (pd.DataFrame) – A DataFrame containing the analysis period of the synthetic car loan binary classification dataset with missing values

  • analysis_tgt (pd.DataFrame) – A DataFrame containing the target values for the analysis period of the synthetic car loan binary classification dataset with missing values

Examples

>>> from nannyml.datasets import load_synthetic_car_loan_data_quality_dataset
>>> reference_df, analysis_df, analysis_targets_df = load_synthetic_car_loan_data_quality_dataset()

nannyml.datasets.datasets.load_synthetic_car_loan_dataset()

Loads the synthetic car loan binary classification dataset provided for testing the NannyML package.

Returns:

  • reference (pd.DataFrame) – A DataFrame containing the reference period of the synthetic car loan binary classification dataset

  • analysis (pd.DataFrame) – A DataFrame containing the analysis period of the synthetic car loan binary classification dataset

  • analysis_tgt (pd.DataFrame) – A DataFrame containing the target values for the analysis period of the synthetic car loan binary classification dataset

Examples

>>> from nannyml.datasets import load_synthetic_car_loan_dataset
>>> reference_df, analysis_df, analysis_targets_df = load_synthetic_car_loan_dataset()

nannyml.datasets.datasets.load_synthetic_car_price_dataset()

Loads the synthetic car price dataset provided for testing the NannyML package on regression problems.

Returns:

  • reference (pd.DataFrame) – A DataFrame containing the reference period of the synthetic car price dataset

  • analysis (pd.DataFrame) – A DataFrame containing the analysis period of the synthetic car price dataset

  • analysis_tgt (pd.DataFrame) – A DataFrame containing the target values for the analysis period of the synthetic car price dataset

Examples

>>> from nannyml.datasets import load_synthetic_car_price_dataset
>>> reference, analysis, analysis_tgt = load_synthetic_car_price_dataset()

nannyml.datasets.datasets.load_synthetic_multiclass_classification_dataset()

Loads the synthetic multiclass classification dataset provided for testing the NannyML package.

Returns:

  • reference (pd.DataFrame) – A DataFrame containing the reference period of the synthetic multiclass classification dataset

  • analysis (pd.DataFrame) – A DataFrame containing the analysis period of the synthetic multiclass classification dataset

  • analysis_tgt (pd.DataFrame) – A DataFrame containing the target values for the analysis period of the synthetic multiclass classification dataset

Examples

>>> from nannyml.datasets import load_synthetic_multiclass_classification_dataset
>>> reference_df, analysis_df, analysis_targets_df = load_synthetic_multiclass_classification_dataset()

nannyml.datasets.datasets.load_titanic_dataset()

Loads the Titanic dataset provided for testing the NannyML package.

The dataset was created by combining two sources: the Kaggle dataset[1] and the data.world dataset[2]. The reference period aligns with the Kaggle train set and the analysis period aligns with the Kaggle test set.

[1]: https://www.kaggle.com/competitions/titanic/data
[2]: https://data.world/nrippner/titanic-disaster-dataset

Returns:

  • reference (pd.DataFrame) – A DataFrame containing the reference period of the Titanic dataset

  • analysis (pd.DataFrame) – A DataFrame containing the analysis period of the Titanic dataset

  • analysis_tgt (pd.DataFrame) – A DataFrame containing the target values for the analysis period of the Titanic dataset

Examples

>>> from nannyml.datasets import load_titanic_dataset
>>> reference_df, analysis_df, analysis_targets_df = load_titanic_dataset()

nannyml.datasets.datasets.load_us_census_ma_employment_data()

Loads the real-world US census MA employment dataset, a binary classification problem that predicts whether an individual is employed.

Returns:

  • reference (pd.DataFrame) – A DataFrame containing the reference period of the US census MA employment dataset

  • analysis (pd.DataFrame) – A DataFrame containing the analysis period of the US census MA employment dataset

  • analysis_tgt (pd.DataFrame) – A DataFrame containing the target values for the analysis period of the US census MA employment dataset

Examples

>>> from nannyml.datasets import load_us_census_ma_employment_data
>>> reference, analysis, analysis_tgt = load_us_census_ma_employment_data()
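
Each dataset loader above returns the same three parts in the same order: reference, analysis, and analysis_tgt. For scripts that switch between datasets, it can help to give those parts names. The sketch below shows one way to do that; `DatasetSplit` and `load_split` are hypothetical names that are not part of NannyML, and `fake_loader` stands in for a real loader so the block runs without the package installed.

```python
from typing import Any, Callable, NamedTuple, Tuple


class DatasetSplit(NamedTuple):
    """Hypothetical container naming the three parts a loader returns."""

    reference: Any
    analysis: Any
    analysis_tgt: Any


def load_split(loader: Callable[[], Tuple[Any, Any, Any]]) -> DatasetSplit:
    """Call a zero-argument loader and wrap its 3-tuple in a DatasetSplit."""
    return DatasetSplit(*loader())


def fake_loader() -> Tuple[str, str, str]:
    # Stand-in for a real loader such as load_titanic_dataset.
    return "reference rows", "analysis rows", "analysis targets"


split = load_split(fake_loader)
print(split.reference)
print(split.analysis_tgt)
```

With NannyML installed, the same helper could presumably be called as `load_split(load_titanic_dataset)`. Because `DatasetSplit` is a tuple, it still unpacks into three variables exactly as in the examples above.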