nannyml.sampling_error.binary_classification module

Module containing functions to estimate sampling error for binary classification metrics.

The implementation of the sampling error estimation is split into two functions.

The first function is called during fitting and calculates the sampling error components based on the reference data. Most of the time these will be the standard deviation of the distribution of differences between y_true and y_pred and the fraction of positive labels in y_true.

The second function will be called during calculation or estimation. It takes the predetermined error components and combines them with the size of the (analysis) data to give an estimate for the sampling error.
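The two-step split can be sketched as follows, using accuracy as the example. This is a minimal illustration and not the library's implementation: the helper names fit_accuracy_components and estimate_accuracy_sampling_error are hypothetical, and the sketch assumes the only component is the standard deviation of per-row correctness, scaled by 1/sqrt(n) for a chunk of n rows.

```python
import numpy as np
import pandas as pd


def fit_accuracy_components(y_true_reference: pd.Series, y_pred_reference: pd.Series) -> tuple:
    """Step 1 (fitting): derive error components from reference data (hypothetical helper)."""
    # 1 where the prediction matches the target, 0 otherwise.
    correct = (np.asarray(y_true_reference) == np.asarray(y_pred_reference)).astype(int)
    return (float(np.std(correct)),)


def estimate_accuracy_sampling_error(components: tuple, data: pd.DataFrame) -> float:
    """Step 2 (calculation or estimation): combine the components with the chunk size."""
    (std,) = components
    return std / np.sqrt(len(data))


reference = pd.DataFrame(
    {"y_true": [1, 0, 1, 1, 0, 0, 1, 0], "y_pred": [1, 0, 0, 1, 0, 1, 1, 0]}
)
components = fit_accuracy_components(reference["y_true"], reference["y_pred"])

# Only the number of rows in the chunk matters for this sketch.
chunk = pd.DataFrame({"y_pred": np.zeros(400, dtype=int)})
sampling_error = estimate_accuracy_sampling_error(components, chunk)
```

Because the components are computed once on reference data, the second step is cheap and can be repeated for every chunk.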

nannyml.sampling_error.binary_classification.accuracy_sampling_error(sampling_error_components: Tuple, data) → float[source]

Calculate the accuracy sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.accuracy_sampling_error_components(y_true_reference: Series, y_pred_reference: Series) → Tuple[source]

Calculate sampling error components for accuracy using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.

Returns:

(std,)

Return type:

Tuple[np.ndarray]

nannyml.sampling_error.binary_classification.ap_sampling_error(sampling_error_components, data)[source]

Calculate the average precision (AP) sampling error for a chunk of data.

If the first component is NaN (due to data quality), the result will be NaN.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.ap_sampling_error_components(y_true_reference: Series, y_pred_proba_reference: Series) → Tuple[ndarray, int][source]

Calculate sampling error components for AP using reference data. The sampling error is calculated on reference data and extrapolated to other sample sizes using a 1/sqrt(n) approximation.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_proba_reference (pd.Series) – Prediction values for the reference dataset.

Returns:

(std, sample_size) – Note that these sampling error components differ from those returned for the other metrics!

Return type:

Tuple[np.ndarray, int]
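The 1/sqrt(n) extrapolation can be illustrated with a short sketch. It assumes the stored components are a standard deviation measured on reference samples of a fixed size, which is then rescaled to the chunk size. The name extrapolate_sampling_error is hypothetical and not part of the library.

```python
import math


def extrapolate_sampling_error(
    reference_std: float, reference_sample_size: int, chunk_size: int
) -> float:
    """Rescale a std measured at one sample size to another (hypothetical helper)."""
    # Sampling error shrinks roughly as 1/sqrt(n), so scale by sqrt(n_reference / n_chunk).
    return reference_std * math.sqrt(reference_sample_size / chunk_size)


# A std of 0.02 measured on samples of 1000 rows, extrapolated to a 4000-row chunk.
error_large_chunk = extrapolate_sampling_error(0.02, 1000, 4000)

# A NaN component propagates to the result.
error_nan = extrapolate_sampling_error(float("nan"), 1000, 4000)
```

Quadrupling the chunk size halves the estimated sampling error under this approximation.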

nannyml.sampling_error.binary_classification.auroc_sampling_error(sampling_error_components, data)[source]

Calculate the AUROC sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.auroc_sampling_error_components(y_true_reference: Series, y_pred_proba_reference: Series) → Tuple[source]

Calculate sampling error components for AUROC using reference data. Calculation is based on the Variance Sum Law and expressing AUROC as a Mann-Whitney U statistic.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_proba_reference (pd.Series) – Prediction values for the reference dataset.

Returns:

(std, fraction)

Return type:

Tuple[np.ndarray, float]
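To make the Mann-Whitney view concrete, the sketch below computes a placement value for each positive row (the share of negatives it outranks). The mean of these values equals AUROC, and their spread gives a standard deviation component. This is a simplified illustration under stated assumptions, not the library's code: all function names are hypothetical, only the variability across positives is considered, and the chunk is assumed to have the same positive fraction as the reference data.

```python
import numpy as np


def placement_values(y_true, y_score) -> np.ndarray:
    """For each positive, the share of negatives it outranks (ties count half)."""
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Pairwise comparison; O(n_pos * n_neg) memory, fine for a small illustration.
    wins = (pos[:, None] > neg[None, :]).mean(axis=1)
    ties = (pos[:, None] == neg[None, :]).mean(axis=1)
    return wins + 0.5 * ties


def auroc_components_sketch(y_true, y_score) -> tuple:
    """Hypothetical helper returning (std, fraction)."""
    values = placement_values(y_true, y_score)
    fraction = float(np.mean(np.asarray(y_true).astype(int)))
    return float(np.std(values)), fraction


def auroc_sampling_error_sketch(components: tuple, chunk_size: int) -> float:
    std, fraction = components
    # The expected number of positives in the chunk is chunk_size * fraction.
    return std / np.sqrt(chunk_size * fraction)


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Mean placement value = U / (n_pos * n_neg) = AUROC.
auroc = float(placement_values(y_true, y_score).mean())
components = auroc_components_sketch(y_true, y_score)
error = auroc_sampling_error_sketch(components, chunk_size=100)
```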

nannyml.sampling_error.binary_classification.business_value_sampling_error(sampling_error_components: Tuple, data) → float[source]

Calculate the business value sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.business_value_sampling_error_components(y_true_reference: Series, y_pred_reference: Series, business_value_matrix: ndarray, normalize_business_value: Optional[str]) → Tuple[float, Optional[str]][source]

Estimate sampling error components for business value using reference data.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.
business_value_matrix (np.ndarray) – A 2x2 matrix of values for the business problem.
normalize_business_value (Optional[str], default=None) – Determines how the business value will be normalized. Allowed values are None and 'per_prediction'.

Returns:

components

Return type:

tuple
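The effect of the normalization option on the sampling error can be sketched as follows. This is an illustration under assumptions, not the library's implementation. The function names are hypothetical. The matrix is assumed to be indexed as [true label, predicted label]. The un-normalized business value is assumed to be a sum over rows, and the 'per_prediction' value a mean over rows.

```python
import numpy as np


def business_value_components_sketch(
    y_true, y_pred, business_value_matrix, normalize_business_value=None
) -> tuple:
    """Hypothetical helper returning (std, normalization type)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    matrix = np.asarray(business_value_matrix, dtype=float)
    # Assumed layout: rows index the true label, columns index the predicted label.
    per_row_value = matrix[y_true, y_pred]
    return float(np.std(per_row_value)), normalize_business_value


def business_value_sampling_error_sketch(components: tuple, chunk_size: int) -> float:
    std, norm_type = components
    if norm_type is None:
        # A sum of n independent terms: the standard error grows with sqrt(n).
        return std * np.sqrt(chunk_size)
    # 'per_prediction': a mean of n terms, so the standard error shrinks with sqrt(n).
    return std / np.sqrt(chunk_size)


y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
matrix = [[0, -1], [-5, 10]]  # illustrative values only

total_components = business_value_components_sketch(y_true, y_pred, matrix, None)
mean_components = business_value_components_sketch(y_true, y_pred, matrix, "per_prediction")
total_error = business_value_sampling_error_sketch(total_components, chunk_size=100)
mean_error = business_value_sampling_error_sketch(mean_components, chunk_size=100)
```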

nannyml.sampling_error.binary_classification.f1_sampling_error(sampling_error_components, data)[source]

Calculate the F1 sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.f1_sampling_error_components(y_true_reference: Series, y_pred_reference: Series) → Tuple[source]

Calculate sampling error components for F1 using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.

Returns:

(std, fraction)

Return type:

Tuple[np.ndarray, float]

nannyml.sampling_error.binary_classification.false_negative_sampling_error(sampling_error_components: Tuple, data) → float[source]

Calculate the false negative rate sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.false_negative_sampling_error_components(y_true_reference: Series, y_pred_reference: Series, normalize_confusion_matrix: Optional[str]) → Tuple[float, float, Optional[str]][source]

Estimate sampling error components for false negative rate using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.
normalize_confusion_matrix (Optional[str]) – The type of normalization to apply to the confusion matrix.

Returns:

(std, relevant_proportion, norm_type)

Return type:

Tuple[float, float, Optional[str]]

nannyml.sampling_error.binary_classification.false_positive_sampling_error(sampling_error_components: Tuple, data) → float[source]

Calculate the false positive rate sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.false_positive_sampling_error_components(y_true_reference: Series, y_pred_reference: Series, normalize_confusion_matrix: Optional[str]) → Tuple[float, float, Optional[str]][source]

Estimate sampling error components for false positive rate using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.
normalize_confusion_matrix (Optional[str]) – The type of normalization to apply to the confusion matrix.

Returns:

(std, relevant_proportion, norm_type)

Return type:

Tuple[float, float, Optional[str]]

nannyml.sampling_error.binary_classification.precision_sampling_error(sampling_error_components, data)[source]

Calculate the precision sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.precision_sampling_error_components(y_true_reference: Series, y_pred_reference: Series) → Tuple[source]

Calculate sampling error components for precision using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.

Returns:

(std, fraction)

Return type:

Tuple[np.ndarray, float]
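One way to read the (std, fraction) pair is sketched below for precision. Only rows predicted as positive contribute to the metric, so the standard error of the mean is computed over those rows, and the chunk size is scaled by the fraction of such rows. This interpretation is an assumption, and the function names are hypothetical rather than the library's.

```python
import numpy as np


def precision_components_sketch(y_true, y_pred) -> tuple:
    """Hypothetical helper returning (std, fraction of rows that are predicted positive)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    predicted_positive = y_pred == 1
    # Per-row outcome among predicted positives: 1 = true positive, 0 = false positive.
    outcomes = y_true[predicted_positive]
    fraction = float(predicted_positive.mean())
    return float(np.std(outcomes)), fraction


def precision_sampling_error_sketch(components: tuple, chunk_size: int) -> float:
    std, fraction = components
    # Effective sample size: the expected number of predicted positives in the chunk.
    return std / np.sqrt(chunk_size * fraction)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

components = precision_components_sketch(y_true, y_pred)
error = precision_sampling_error_sketch(components, chunk_size=160)
```

The same pattern would apply to recall or specificity by changing which rows count as relevant (actual positives or actual negatives, respectively).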

nannyml.sampling_error.binary_classification.recall_sampling_error(sampling_error_components, data)[source]

Calculate the recall sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.recall_sampling_error_components(y_true_reference: Series, y_pred_reference: Series) → Tuple[source]

Calculate sampling error components for recall using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.

Returns:

(std, fraction)

Return type:

Tuple[np.ndarray, float]

nannyml.sampling_error.binary_classification.specificity_sampling_error(sampling_error_components, data)[source]

Calculate the specificity sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.specificity_sampling_error_components(y_true_reference: Series, y_pred_reference: Series) → Tuple[source]

Calculate sampling error components for specificity using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.

Returns:

(std, fraction)

Return type:

Tuple[np.ndarray, float]

nannyml.sampling_error.binary_classification.true_negative_sampling_error(sampling_error_components: Tuple, data) → float[source]

Calculate the true negative rate sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.true_negative_sampling_error_components(y_true_reference: Series, y_pred_reference: Series, normalize_confusion_matrix: Optional[str]) → Tuple[float, float, Optional[str]][source]

Estimate sampling error components for true negative rate using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.
normalize_confusion_matrix (Optional[str]) – The type of normalization to apply to the confusion matrix.

Returns:

(std, relevant_proportion, norm_type)

Return type:

Tuple[float, float, Optional[str]]

nannyml.sampling_error.binary_classification.true_positive_sampling_error(sampling_error_components: Tuple, data) → float[source]

Calculate the true positive rate sampling error for a chunk of data.

Parameters:

sampling_error_components (a set of parameters that were derived from reference data.) –
data (the (analysis) data you want to calculate or estimate a metric for.) –

Returns:

sampling_error

Return type:

float

nannyml.sampling_error.binary_classification.true_positive_sampling_error_components(y_true_reference: Series, y_pred_reference: Series, normalize_confusion_matrix: Optional[str]) → Tuple[float, float, Optional[str]][source]

Estimate sampling error components for true positive rate using reference data. Calculation is based on a modified standard error of the mean formula.

Parameters:

y_true_reference (pd.Series) – Target values for the reference dataset.
y_pred_reference (pd.Series) – Predictions for the reference dataset.
normalize_confusion_matrix (Optional[str]) – The type of normalization to apply to the confusion matrix.

Returns:

(std, relevant_proportion, norm_type)

Return type:

Tuple[float, float, Optional[str]]