Titanic Dataset
NannyML provides the titanic dataset to help showcase its data quality features. The titanic dataset provided here is compiled from two sources, Kaggle and data.world.
Unlike the other datasets provided by NannyML, the titanic dataset has no built-in model making predictions. Hence it cannot be used with NannyML's performance estimation and realized performance modules. To find out what requirements NannyML has for datasets, check out Data Requirements.
Problem Description
The titanic dataset covers the passengers of the RMS Titanic and tells us whether they survived its sinking.
Dataset Description
A sample of the dataset can be seen below.
>>> import nannyml as nml
>>> reference_df, analysis_df, analysis_targets_df = nml.load_titanic_dataset()
>>> reference_df.head()
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | boat | body | home.dest | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | nan | S | nan | nan | Bridgerule, Devon | 0 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4 | nan | New York, NY | 1 |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | nan | S | nan | nan | nan | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | D | nan | Scituate, MA | 1 |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | nan | S | nan | nan | Lower Clapton, Middlesex or Erdington, Birmingham | 0 |
The dataset has 13 features:
Pclass - a proxy for socio-economic status: 1 is Upper, 2 is Middle and 3 is Lower class.
Name - passenger’s name.
Sex - passenger’s sex.
Age - passenger’s age in years. Fractional if less than one. Estimated ages take the form xx.5.
SibSp - number of siblings (brother, sister, stepbrother or stepsister) or spouses (husband or wife; mistresses and fiancés were ignored) aboard.
Parch - number of parents (mother, father) or children (daughter, son, stepdaughter, stepson) aboard. Children who travelled only with a nanny have Parch=0.
Ticket - passenger’s Ticket Number.
Fare - passenger’s Fare.
Cabin - passenger’s cabin number.
Embarked - passenger’s port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat - lifeboat information if the passenger survived.
body - body number if the passenger did not survive and a body was recovered.
home.dest - passenger’s domicile and destination information, if available.
The Survived column tells us whether the passenger survived and is what we call the Target column.
There is also an auxiliary column that Kaggle uses, which we have kept for compatibility:
PassengerId - a unique number identifying each passenger. It is useful as the key for joining the targets to the analysis dataset.
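A join on PassengerId could be sketched as follows. This is only an illustration with plain pandas: the two small frames are hand-made stand-ins with made-up values (not rows from the real dataset), the names toy_analysis_df and toy_targets_df are hypothetical, and it assumes the targets frame holds PassengerId and Survived columns.

```python
import pandas as pd

# Hypothetical stand-in for analysis_df (made-up rows, for illustration only).
toy_analysis_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
    "Sex": ["male", "female", "male"],
})

# Hypothetical stand-in for analysis_targets_df; note the rows are in a different order.
toy_targets_df = pd.DataFrame({
    "PassengerId": [894, 892, 893],
    "Survived": [0, 0, 1],
})

# A left join keeps every analysis row; validate raises if an id is duplicated on either side.
joined_df = toy_analysis_df.merge(
    toy_targets_df, on="PassengerId", how="left", validate="one_to_one"
)
print(joined_df)
```

Joining on the key rather than relying on row order keeps each target attached to the right passenger even if the two frames are sorted differently.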
NannyML uses the titanic dataset in its Data Quality tutorials to showcase the Missing Values and Unseen Values detection functionality.
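To give a rough idea of what these two checks look for, here is a plain pandas sketch. It is not NannyML's implementation, and NannyML's own calculators may define and report these quantities differently. The toy frames and their values are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Hand-made stand-ins for a reference and an analysis period (made-up values).
toy_reference_df = pd.DataFrame({
    "Cabin": ["C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S"],
})
toy_analysis_df = pd.DataFrame({
    "Cabin": [None, None, "E46", None],
    "Embarked": ["S", "Q", "C", "Q"],
})

# Missing values: fraction of null entries per column in each period.
reference_missing_rate = toy_reference_df.isna().mean()
analysis_missing_rate = toy_analysis_df.isna().mean()

# Unseen values: non-null categories in analysis that never occur in reference.
seen_ports = set(toy_reference_df["Embarked"].dropna())
embarked = toy_analysis_df["Embarked"]
unseen_mask = embarked.notna() & ~embarked.isin(seen_ports)
unseen_count = int(unseen_mask.sum())

print(reference_missing_rate["Cabin"], analysis_missing_rate["Cabin"], unseen_count)
```

In this toy data the share of missing Cabin values rises between the two periods, and the port Q appears in analysis without ever occurring in reference.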