US Census Employment dataset
This page shows how the US Census MA dataset was obtained and prepared to serve as an example in the quickstart. The full notebook can be found in our GitHub repository.
To find out what requirements NannyML has for datasets, check out Data Requirements.
Data Source
The dataset comes from the US Census and was obtained using folktables. Feature descriptions are given in the PUMS documentation.
Dataset Description
The task is to predict whether an individual is employed based on features such as age and education. The data comes from the state of Massachusetts and covers the years 2014 to 2018.
Preparing Data for NannyML
Fetching the Data
First we import the required libraries and fetch the data with folktables:
>>> import numpy as np
>>> import pandas as pd
>>> from lightgbm import LGBMClassifier
>>> from folktables import ACSDataSource, ACSEmployment
>>> years = list(range(2014, 2019))
>>> dfs = []
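>>> for year in years:
...     # Download the 1-year ACS person-level survey for Massachusetts.
...     data_source = ACSDataSource(survey_year=year, horizon='1-Year', survey='person')
...     data = data_source.get_data(states=["MA"], download=True)
...     # Extract the ACSEmployment features and the binary target.
...     features, labels, _ = ACSEmployment.df_to_numpy(data)
...     df = pd.DataFrame(features)
...     df.columns = ACSEmployment.features
...     df[ACSEmployment.target] = labels
...     # Record the survey year so that partitions can be defined later.
...     df['year'] = year
...     dfs.append(df)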
...
>>> df = pd.concat(dfs).reset_index(drop=True)
>>> df.head()
|   | AGEP | SCHL | MAR | RELP | DIS | ESP | CIT | MIG | MIL | ANC | NATIVITY | DEAR | DEYE | DREM | SEX | RAC1P | ESR | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 19 | 1 | 0 | 2 | 0 | 1 | 3 | 4 | 1 | 1 | 2 | 2 | 2 | 1 | 2 | True | 2014 |
| 1 | 24 | 19 | 1 | 1 | 2 | 0 | 1 | 3 | 4 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | False | 2014 |
| 2 | 5 | 2 | 5 | 4 | 2 | 2 | 1 | 3 | 0 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | False | 2014 |
| 3 | 5 | 2 | 5 | 4 | 2 | 2 | 1 | 3 | 0 | 2 | 1 | 2 | 2 | 2 | 1 | 2 | False | 2014 |
| 4 | 83 | 22 | 1 | 0 | 1 | 0 | 1 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 1 | 1 | False | 2014 |
The data is fetched for each year separately, and a year column is added to record the survey year.
Descriptions of all the variables can be found in the appendix.
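A quick sanity check after concatenation is to count the rows contributed by each year. The sketch below repeats the concatenation pattern from the loop above on small synthetic frames. The values and row counts are made up for illustration and are not Census records, and the names toy_frames, toy_df and rows_per_year are hypothetical.

```python
import pandas as pd

# Synthetic stand-ins for the per-year frames built in the loop above
# (made-up values, NOT real Census records).
toy_frames = []
for year, n_rows in [(2014, 3), (2015, 2), (2016, 4)]:
    toy = pd.DataFrame({"AGEP": range(20, 20 + n_rows)})
    toy["year"] = year
    toy_frames.append(toy)

# Same pattern as above: stack the frames and rebuild a clean 0..n-1 index.
toy_df = pd.concat(toy_frames).reset_index(drop=True)

# Number of rows per survey year, in chronological order.
rows_per_year = toy_df["year"].value_counts().sort_index()
print(rows_per_year)
```

Running the same check on the real df should show one entry for each year from 2014 to 2018.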
Defining Partitions and Preprocessing
We split the data into three partitions that simulate the model lifecycle: train, test and production (after deployment). We use 2014 data for training, 2015 data for evaluation, and 2016-2018 data to simulate production.
>>> df['partition'] = None
>>> df['partition'] = np.where(df['year']==2014, 'train', df['partition'])
>>> df['partition'] = np.where(df['year']==2015, 'test', df['partition'])
>>> df['partition'] = np.where(df['year']>2015, 'prod', df['partition'])
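The same year-to-partition rule can also be written in a single step. The sketch below is one possible alternative using np.select on a toy frame; it is not the code used to build the dataset, and the "unassigned" default label is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one row per survey year (illustrative only).
toy_df = pd.DataFrame({"year": [2014, 2015, 2016, 2017, 2018]})

# Conditions are evaluated in order; the first match decides the label.
conditions = [
    toy_df["year"] == 2014,
    toy_df["year"] == 2015,
    toy_df["year"] > 2015,
]
labels = ["train", "test", "prod"]

# Rows matching no condition get the (hypothetical) "unassigned" label.
toy_df["partition"] = np.select(conditions, labels, default="unassigned")
print(toy_df)
```

An explicit default makes rows that fall outside the expected year range easy to spot, whereas the chained np.where version leaves them as None.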
We now define categorical and numeric features:
>>> categorical_features = ['SCHL','MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY',
... 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P']
>>> numeric_features = ['AGEP']
Since the categorical features are already encoded in a way LightGBM accepts (non-negative, integer-like values), no further preprocessing is needed. We only cast them to proper integers, rename the target column and convert the target to int:
>>> features = numeric_features + categorical_features
>>> target_col = 'employed'
>>> df = df.rename(columns={'ESR':target_col})
>>> df[categorical_features] = df[categorical_features].astype(int)
>>> df[target_col] = df[target_col].astype(int)
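Before casting, it can be useful to confirm that the values really are non-negative and integer-like, since a cast to int would silently truncate any fractional values. The sketch below shows one way to check this on a toy frame. It assumes, for illustration only, that the codes are stored as floats; the values are made up and the names toy_categorical and is_integer_like are hypothetical.

```python
import pandas as pd

# Made-up codes stored as floats (an assumption for this illustration).
toy_df = pd.DataFrame({"SCHL": [19.0, 2.0, 22.0], "MAR": [1.0, 5.0, 1.0]})
toy_categorical = ["SCHL", "MAR"]

values = toy_df[toy_categorical]
# True only if every value is non-negative and has no fractional part.
is_integer_like = bool(((values >= 0) & (values == values.round())).all().all())

if is_integer_like:
    # Safe to cast: no information is lost.
    toy_df = toy_df.astype({col: int for col in toy_categorical})

print(is_integer_like)
print(toy_df.dtypes)
```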
Developing ML Model and Making Predictions
We will now fit the model that will be monitored (e.g. in the quickstart):
>>> client_model = LGBMClassifier(random_state=1)
>>> client_model.fit(df[df['partition']=='train'][features], df[df['partition']=='train'][target_col],
... categorical_feature=categorical_features)
>>> df['prediction'] = client_model.predict(df[features])
>>> df['predicted_probability'] = client_model.predict_proba(df[features])[:,1]
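Once predictions are stored next to the targets, a simple per-partition accuracy gives a first impression of how the model behaves outside the training data. The sketch below illustrates the idea on a toy frame. The column names follow the code above, but the labels and predictions are made up and say nothing about the actual model's performance.

```python
import pandas as pd

# Made-up targets and predictions, for illustration only.
toy_df = pd.DataFrame({
    "partition":  ["train", "train", "test", "test", "prod", "prod", "prod", "prod"],
    "employed":   [1, 0, 1, 0, 1, 1, 0, 0],
    "prediction": [1, 0, 1, 1, 1, 0, 0, 0],
})

# Boolean series: True where the prediction matches the target.
correct = toy_df["employed"] == toy_df["prediction"]

# The mean of a boolean series is the fraction of True values, i.e. accuracy.
accuracy_by_partition = correct.groupby(toy_df["partition"]).mean()
print(accuracy_by_partition)
```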
Let’s turn the categorical features into the category dtype so that NannyML correctly recognizes them:
>>> for feat in categorical_features:
... df[feat] = df[feat].astype(str).astype('category')
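The effect of this conversion can be inspected on a small example. The sketch below applies the same astype(str).astype('category') chain to a toy column with made-up codes and prints the resulting dtype and categories.

```python
import pandas as pd

# Made-up marital status codes, for illustration only.
toy_df = pd.DataFrame({"MAR": [1, 5, 1, 3]})

# Same conversion as above: integers -> strings -> categorical.
toy_df["MAR"] = toy_df["MAR"].astype(str).astype("category")

print(toy_df["MAR"].dtype)
print(list(toy_df["MAR"].cat.categories))
```

Because the codes pass through str, the categories are strings and are ordered lexicographically, so a code such as '10' sorts before '2'.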
Splitting and Storing the Data
Finally, we split the data by partition, drop the columns that are not needed, and store the results in the NannyML repository so that the data can be accessed from within the library:
>>> full_reference_data = df[df['partition']=='test'].reset_index(drop=True).drop(columns='partition')
>>> analysis_wo_targets = df[df['partition']=='prod'].reset_index(drop=True).drop(columns=[target_col, 'partition'])
>>> analysis_targets = df[df['partition']=='prod'][[target_col]].reset_index(drop=True)
>>> data_dir = '../../../nannyml/nannyml/datasets/data/'
>>> full_reference_data.to_parquet(data_dir + "employment_MA_reference.pq")
>>> analysis_wo_targets.to_parquet(data_dir + "employment_MA_analysis.pq")
>>> analysis_targets.to_parquet(data_dir + "employment_MA_analysis_target.pq")
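The split keeps the analysis features and the analysis targets in separate frames that share the same row order. The sketch below reproduces the split on a toy frame with made-up values and shows that the targets can be joined back by index. It leaves out the parquet step, and the names reference, analysis and rejoined are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values, for illustration only.
toy_df = pd.DataFrame({
    "AGEP": [30, 24, 5, 83],
    "employed": [1, 0, 0, 0],
    "partition": ["test", "prod", "prod", "prod"],
})
target_col = "employed"

is_prod = toy_df["partition"] == "prod"

# Reference data keeps the target; analysis data does not.
reference = toy_df[toy_df["partition"] == "test"].reset_index(drop=True).drop(columns="partition")
analysis = toy_df[is_prod].reset_index(drop=True).drop(columns=[target_col, "partition"])
analysis_targets = toy_df[is_prod][[target_col]].reset_index(drop=True)

# Both frames were reset the same way, so an index join realigns them.
rejoined = analysis.join(analysis_targets)
print(rejoined)
```

Resetting the index on both frames is what makes the later join safe; without it, the two frames would still carry the original row labels from the full dataset.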
Appendix: Feature description
This description comes from the PUMS documentation:
AGEP - Age of the person, numeric
SCHL - Educational attainment:
    N/A - less than 3 years old
    1 - No schooling completed
    2 - Nursery school, preschool
    3 - Kindergarten
    4 - Grade 1
    5 - Grade 2
    6 - Grade 3
    7 - Grade 4
    8 - Grade 5
    9 - Grade 6
    10 - Grade 7
    11 - Grade 8
    12 - Grade 9
    13 - Grade 10
    14 - Grade 11
    15 - 12th grade - no diploma
    16 - Regular high school diploma
    17 - GED or alternative credential
    18 - Some college, but less than 1 year
    19 - 1 or more years of college credit, no degree
    20 - Associate’s degree
    21 - Bachelor’s degree
    22 - Master’s degree
    23 - Professional degree beyond a bachelor’s degree
    24 - Doctorate degree
MAR - Marital status:
    1 - Married
    2 - Widowed
    3 - Divorced
    4 - Separated
    5 - Never married or under 15 years old
RELP - Relationship:
    0 - Reference person
    1 - Husband/wife
    2 - Biological son or daughter
    3 - Adopted son or daughter
    4 - Stepson or stepdaughter
    5 - Brother or sister
    6 - Father or mother
    7 - Grandchild
    8 - Parent-in-law
    9 - Son-in-law or daughter-in-law
    10 - Other relative
    11 - Roomer or boarder
    12 - Housemate or roommate
    13 - Unmarried partner
    14 - Foster child
    15 - Other nonrelative
    16 - Institutionalized group quarters population
    17 - Noninstitutionalized group quarters population
DIS - Disability recode:
    1 - With a disability
    2 - Without a disability
ESP - Employment status of parents:
    b - N/A (not own child of householder, and not child in subfamily)
    1 - Living with two parents: both parents in labor force
    2 - Living with two parents: Father only in labor force
    3 - Living with two parents: Mother only in labor force
    4 - Living with two parents: Neither parent in labor force
    5 - Living with father: Father in the labor force
    6 - Living with father: Father not in labor force
    7 - Living with mother: Mother in the labor force
    8 - Living with mother: Mother not in labor force
CIT - Citizenship status:
    1 - Born in the U.S.
    2 - Born in Puerto Rico, Guam, the U.S. Virgin Islands, or the Northern Marianas
    3 - Born abroad of American parent(s)
    4 - U.S. citizen by naturalization
    5 - Not a citizen of the U.S.
MIG - Mobility status (lived here 1 year ago):
    N/A - less than 1 year old
    1 - Yes, same house (nonmovers)
    2 - No, outside US and Puerto Rico
    3 - No, different house in US or Puerto Rico
MIL - Military service:
    N/A - less than 17 years old
    1 - Now on active duty
    2 - On active duty in the past, but not now
    3 - Only on active duty for training in Reserves/National Guard
    4 - Never served in the military
ANC - Ancestry recode:
    1 - Single
    2 - Multiple
    3 - Unclassified
    4 - Not reported
    8 - Suppressed for data year 2018 for select PUMAs
NATIVITY - Nativity:
    1 - Native
    2 - Foreign born
DEAR - Hearing difficulty:
    1 - Yes
    2 - No
DEYE - Vision difficulty:
    1 - Yes
    2 - No
DREM - Cognitive difficulty:
    N/A - Less than 5 years old
    1 - Yes
    2 - No
SEX - Sex:
    1 - Male
    2 - Female
RAC1P - Recoded detailed race code:
    1 - White alone
    2 - Black or African American alone
    3 - American Indian alone
    4 - Alaska Native alone
    5 - American Indian and Alaska Native tribes specified or American Indian or Alaska Native, not specified and no other races
    6 - Asian alone
    7 - Native Hawaiian and Other Pacific Islander alone
    8 - Some Other Race alone
    9 - Two or More Races
ESR - target:
    True - employed
    False - not employed
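When exploring the data, the numeric codes can be translated back into readable labels with a lookup dictionary. The sketch below does this for MAR, using the code-to-label pairs from the marital status entry in this appendix. The toy rows are made up, and the names MAR_LABELS and MAR_label are hypothetical.

```python
import pandas as pd

# Code-to-label mapping taken from the MAR entry in this appendix.
MAR_LABELS = {
    1: "Married",
    2: "Widowed",
    3: "Divorced",
    4: "Separated",
    5: "Never married or under 15 years old",
}

# Made-up rows, for illustration only.
toy_df = pd.DataFrame({"MAR": [1, 5, 3]})

# Look up each code; codes missing from the dictionary would become NaN.
toy_df["MAR_label"] = toy_df["MAR"].map(MAR_LABELS)
print(toy_df)
```

In the stored dataset the categorical codes were converted to strings before saving, so the dictionary keys would need to be strings (for example '1' instead of 1) when applied to those files.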