US Census Employment dataset

This page shows how the US Census MA dataset was obtained and prepared to serve as an example in the quickstart. The full notebook can be found in our GitHub repository.

To find out what requirements NannyML has for datasets, check out Data Requirements.

Data Source

The dataset comes from the US Census and was obtained using folktables. Feature descriptions are given in the PUMS documentation.

Dataset Description

The task is to predict whether an individual is employed based on features such as age and education. The data covers the state of Massachusetts for the years 2014 to 2018.

Preparing Data for NannyML

Fetching the Data

First we import the required libraries and fetch the data with folktables:

>>> import numpy as np
>>> import pandas as pd
>>> from lightgbm import LGBMClassifier
>>> from folktables import ACSDataSource, ACSEmployment

>>> years = list(range(2014, 2019))
>>> dfs = []
>>> for year in years:
...     data_source = ACSDataSource(survey_year=year, horizon='1-Year', survey='person')
...     data = data_source.get_data(states=["MA"], download=True)
...     features, labels, _ = ACSEmployment.df_to_numpy(data)
...     df = pd.DataFrame(features)
...     df.columns = ACSEmployment.features
...     df[ACSEmployment.target] = labels
...     df['year'] = year
...     dfs.append(df)
...     

>>> df = pd.concat(dfs).reset_index(drop=True)
>>> df.head()

   AGEP  SCHL  MAR  RELP  DIS  ESP  CIT  MIG  MIL  ANC  NATIVITY  DEAR  DEYE  DREM  SEX  RAC1P    ESR  year
0    30    19    1     0    2    0    1    3    4    1         1     2     2     2    1      2   True  2014
1    24    19    1     1    2    0    1    3    4    2         1     2     2     2    2      2  False  2014
2     5     2    5     4    2    2    1    3    0    2         1     2     2     2    2      2  False  2014
3     5     2    5     4    2    2    1    3    0    2         1     2     2     2    1      2  False  2014
4    83    22    1     0    1    0    1    1    2    2         1     2     2     2    1      1  False  2014

Data is fetched for each year separately, and a year column is added to record the survey year.

Descriptions of all the variables can be found in the appendix.

Defining Partitions and Preprocessing

We split the data into three partitions that simulate the model lifecycle: train, test and production (after deployment) data. We use 2014 data for training and 2015 data for evaluation, while 2016-2018 data simulates production.

>>> df['partition'] = None
>>> df['partition'] = np.where(df['year']==2014, 'train', df['partition'])
>>> df['partition'] = np.where(df['year']==2015, 'test', df['partition'])
>>> df['partition'] = np.where(df['year']>2015, 'prod', df['partition'])
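
The same year-to-partition mapping can also be written in a single step. The snippet below is a minimal sketch on a made-up toy frame (the toy name and its contents are illustrative, not part of the original notebook); it uses np.select, which evaluates the conditions in order and takes the first match.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the concatenated ACS frame; only the 'year' column matters here.
toy = pd.DataFrame({"year": [2014, 2015, 2016, 2017, 2018]})

# Conditions are checked in order; the first matching one decides the label.
conditions = [toy["year"] == 2014, toy["year"] == 2015, toy["year"] > 2015]
labels = ["train", "test", "prod"]
toy["partition"] = np.select(conditions, labels, default="unassigned")

# Count rows per partition as a quick sanity check.
partition_counts = toy["partition"].value_counts().to_dict()
```

Any year that matches none of the conditions would show up as 'unassigned' in the counts, which makes gaps in the mapping easy to spot.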

We now define categorical and numeric features:

>>> categorical_features = ['SCHL','MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY',
...                         'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P']

>>> numeric_features = ['AGEP']

Since the categorical features are already encoded in a form the LightGBM model accepts (non-negative, integer-like values), we don’t need any preprocessing. We will just cast them to proper integers. We will also rename the target column and convert the target to int:

>>> features = numeric_features + categorical_features

>>> target_col = 'employed'

>>> df = df.rename(columns={'ESR':target_col})

>>> df[categorical_features] = df[categorical_features].astype(int)
>>> df[target_col] = df[target_col].astype(int)
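
Before casting, it can be worth confirming that the values really are non-negative and integer-like, since astype(int) silently truncates fractional values. The check below is a minimal sketch on a made-up toy frame (column values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy float-encoded categorical columns, similar in shape to the raw features.
toy = pd.DataFrame({"MAR": [1.0, 5.0, 3.0], "MIL": [0.0, 4.0, 2.0]})

values = toy.to_numpy()
# Integer-like: rounding changes nothing. Non-negative: no value below zero.
is_integer_like = bool(np.all(values == np.round(values)))
is_non_negative = bool(np.all(values >= 0))

# Only cast once both checks pass.
if is_integer_like and is_non_negative:
    toy = toy.astype(int)
```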

Developing ML Model and Making Predictions

We will now fit a model that will be subject to monitoring (e.g. in quickstart):

>>> client_model = LGBMClassifier(random_state=1)

>>> client_model.fit(df[df['partition']=='train'][features], df[df['partition']=='train'][target_col],
...                  categorical_feature=categorical_features)

>>> df['prediction'] = client_model.predict(df[features])
>>> df['predicted_probability'] = client_model.predict_proba(df[features])[:,1]
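
A quick way to sanity-check the predictions is to compare them with the target within each partition. The snippet below is a minimal sketch on a made-up toy frame; the numbers are illustrative and say nothing about the actual model's performance.

```python
import pandas as pd

# Toy stand-in with the same column names as the prepared data.
toy = pd.DataFrame({
    "partition":  ["train", "train", "test", "test", "prod", "prod", "prod", "prod"],
    "employed":   [1, 0, 1, 0, 1, 1, 0, 0],
    "prediction": [1, 0, 1, 1, 1, 0, 0, 0],
})

# Share of correct predictions (accuracy) within each partition.
is_correct = toy["prediction"] == toy["employed"]
accuracy_by_partition = is_correct.groupby(toy["partition"]).mean().to_dict()
```

A large gap between train and test accuracy would suggest overfitting, while a drop in the prod partition is the kind of change monitoring is meant to surface.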

Let’s turn the categorical features into the category dtype so that NannyML recognizes them correctly:

>>> for feat in categorical_features:
...     df[feat] = df[feat].astype(str).astype('category')
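
One side effect of converting through str is that the resulting categories are strings and are therefore sorted lexicographically rather than numerically. The snippet below is a minimal sketch on a made-up toy column that illustrates this:

```python
import pandas as pd

# Toy column with integer codes, as the features look after the int cast.
toy = pd.DataFrame({"SCHL": [2, 10, 19, 2]})
toy["SCHL"] = toy["SCHL"].astype(str).astype("category")

# The column is now categorical and its categories are strings.
is_categorical = isinstance(toy["SCHL"].dtype, pd.CategoricalDtype)
categories = list(toy["SCHL"].cat.categories)  # string order: '10' sorts before '2'
```

This only affects the order in which categories are listed; the values themselves are unchanged.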

Splitting and Storing the Data

Now we split the data by partition, drop the columns that are no longer needed, and store the results in the relevant location in the NannyML repository so that the data can be accessed from within the library:

>>> full_reference_data = df[df['partition']=='test'].reset_index(drop=True).drop(columns='partition')
>>> analysis_wo_targets = df[df['partition']=='prod'].reset_index(drop=True).drop(columns=[target_col, 'partition'])
>>> analysis_targets = df[df['partition']=='prod'][[target_col]].reset_index(drop=True)

>>> data_dir = '../../../nannyml/nannyml/datasets/data/'

>>> full_reference_data.to_parquet(data_dir + "employment_MA_reference.pq")
>>> analysis_wo_targets.to_parquet(data_dir + "employment_MA_analysis.pq")
>>> analysis_targets.to_parquet(data_dir + "employment_MA_analysis_target.pq")
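
To make the result of the split concrete, the snippet below repeats the same steps on a made-up toy frame (values are illustrative only) and skips the parquet export. It shows that the reference set keeps the target, the analysis set drops it, and the separately stored targets stay row-aligned with the analysis set.

```python
import pandas as pd

target_col = "employed"

# Toy stand-in for the prepared frame.
toy = pd.DataFrame({
    "AGEP": [30, 24, 5, 83, 41],
    "employed": [1, 0, 0, 0, 1],
    "partition": ["train", "test", "prod", "prod", "prod"],
})

is_test = toy["partition"] == "test"
is_prod = toy["partition"] == "prod"

# Reference data keeps the target; analysis data does not.
reference = toy[is_test].reset_index(drop=True).drop(columns="partition")
analysis = toy[is_prod].reset_index(drop=True).drop(columns=[target_col, "partition"])
analysis_targets = toy[is_prod][[target_col]].reset_index(drop=True)

# Both production frames are re-indexed from 0, so rows line up by position.
rows_align = analysis.index.equals(analysis_targets.index)
```

Note that DataFrame.to_parquet needs a parquet engine such as pyarrow or fastparquet to be installed.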

Appendix: Feature description

This description comes from the PUMS documentation:

AGEP - Age of person, numeric

SCHL - Educational attainment:

  • N/A - less than 3 years old

  • 1 - No schooling completed

  • 2 - Nursery school, preschool

  • 3 - Kindergarten

  • 4 - Grade 1

  • 5 - Grade 2

  • 6 - Grade 3

  • 7 - Grade 4

  • 8 - Grade 5

  • 9 - Grade 6

  • 10 - Grade 7

  • 11 - Grade 8

  • 12 - Grade 9

  • 13 - Grade 10

  • 14 - Grade 11

  • 15 - 12th grade - no diploma

  • 16 - Regular high school diploma

  • 17 - GED or alternative credential

  • 18 - Some college, but less than 1 year

  • 19 - 1 or more years of college credit, no degree

  • 20 - Associate’s degree

  • 21 - Bachelor’s degree

  • 22 - Master’s degree

  • 23 - Professional degree beyond a bachelor’s degree

  • 24 - Doctorate degree

MAR - Marital status:

  • 1 - Married

  • 2 - Widowed

  • 3 - Divorced

  • 4 - Separated

  • 5 - Never married or under 15 years old

RELP - Relationship:

  • 0 - Reference person

  • 1 - Husband/wife

  • 2 - Biological son or daughter

  • 3 - Adopted son or daughter

  • 4 - Stepson or stepdaughter

  • 5 - Brother or sister

  • 6 - Father or mother

  • 7 - Grandchild

  • 8 - Parent-in-law

  • 9 - Son-in-law or daughter-in-law

  • 10 - Other relative

  • 11 - Roomer or boarder

  • 12 - Housemate or roommate

  • 13 - Unmarried partner

  • 14 - Foster child

  • 15 - Other nonrelative

  • 16 - Institutionalized group quarters population

  • 17 - Noninstitutionalized group quarters population

DIS - Disability recode:

  • 1 - With a disability

  • 2 - Without a disability

ESP - Employment status of parents:

  • N/A - not own child of householder, and not child in subfamily

  • 1 - Living with two parents: both parents in labor force

  • 2 - Living with two parents: Father only in labor force

  • 3 - Living with two parents: Mother only in labor force

  • 4 - Living with two parents: Neither parent in labor force

  • 5 - Living with father: Father in the labor force

  • 6 - Living with father: Father not in labor force

  • 7 - Living with mother: Mother in the labor force

  • 8 - Living with mother: Mother not in labor force

CIT - Citizenship status:

  • 1 - Born in the U.S.

  • 2 - Born in Puerto Rico, Guam, the U.S. Virgin Islands, or the Northern Marianas

  • 3 - Born abroad of American parent(s)

  • 4 - U.S. citizen by naturalization

  • 5 - Not a citizen of the U.S.

MIG - Mobility status (lived here 1 year ago):

  • N/A - less than 1 year old

  • 1 - Yes, same house (nonmovers)

  • 2 - No, outside US and Puerto Rico

  • 3 - No, different house in US or Puerto Rico

MIL - Military service:

  • N/A - less than 17 years old

  • 1 - Now on active duty

  • 2 - On active duty in the past, but not now

  • 3 - Only on active duty for training in Reserves/National Guard

  • 4 - Never served in the military

ANC - Ancestry recode:

  • 1 - Single

  • 2 - Multiple

  • 3 - Unclassified

  • 4 - Not reported

  • 8 - Suppressed for data year 2018 for select PUMAs

NATIVITY - Nativity:

  • 1 - Native

  • 2 - Foreign born

DEAR - Hearing difficulty:

  • 1 - Yes

  • 2 - No

DEYE - Vision difficulty:

  • 1 - Yes

  • 2 - No

DREM - Cognitive difficulty:

  • N/A - Less than 5 years old

  • 1 - Yes

  • 2 - No

SEX - Sex:

  • 1 - Male

  • 2 - Female

RAC1P - Recoded detailed race code:

  • 1 - White alone

  • 2 - Black or African American alone

  • 3 - American Indian alone

  • 4 - Alaska Native alone

  • 5 - American Indian and Alaska Native tribes specified or American Indian or Alaska Native, not specified and no other races

  • 6 - Asian alone

  • 7 - Native Hawaiian and Other Pacific Islander alone

  • 8 - Some Other Race alone

  • 9 - Two or More Races

ESR - target:

  • True - employed

  • False - not employed
