Titanic Dataset

NannyML provides the titanic dataset to help showcase its data quality features. The titanic dataset provided here is compiled from two sources, Kaggle and data.world.

One difference between the titanic dataset and the other datasets provided by NannyML is that there is no built-in model making predictions. Hence the titanic dataset cannot be used with NannyML's performance estimation and realized performance modules. To find out what requirements NannyML has for datasets, check out Data Requirements.

Problem Description

The titanic dataset covers the passengers of the RMS Titanic and tells us whether they survived its sinking.

Dataset Description

A sample of the dataset can be seen below.

>>> import nannyml as nml
>>> reference_df, analysis_df, analysis_targets_df = nml.load_titanic_dataset()
>>> reference_df.head()

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | boat | body | home.dest | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | nan | S | nan | nan | Bridgerule, Devon | 0 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4 | nan | New York, NY | 1 |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | nan | S | nan | nan | nan | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | D | nan | Scituate, MA | 1 |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | nan | S | nan | nan | Lower Clapton, Middlesex or Erdington, Birmingham | 0 |

The dataset has 13 features:

  • Pclass - a proxy for socio-economic status: 1 is upper, 2 is middle and 3 is lower class.

  • Name - passenger’s name.

  • Sex - passenger’s sex.

  • Age - passenger’s age. Fractional if less than one; estimated ages are given in the form xx.5.

  • SibSp - number of siblings (brother, sister, stepbrother or stepsister) or spouses (husband or wife; mistresses and fiancés were ignored) aboard.

  • Parch - number of parents (mother, father) or children (daughter, son, stepdaughter, stepson) aboard. Children who travelled only with a nanny have Parch=0.

  • Ticket - passenger’s Ticket Number.

  • Fare - passenger’s Fare.

  • Cabin - passenger’s cabin number.

  • Embarked - passenger’s port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

  • boat - lifeboat information if the passenger survived.

  • body - body number if the passenger did not survive and a body was recovered.

  • home.dest - passenger’s domicile and destination information, if available.
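The Age encoding described in the list above can be checked with a small helper. This is only a sketch: the function name `is_estimated_age` is hypothetical and not part of NannyML, and it assumes that ages below one are always treated as infant ages rather than estimates.

```python
import math


def is_estimated_age(age):
    """Return True when an age of one year or more ends in .5 (the 'estimated' marker)."""
    # Missing ages (None or NaN) carry no information about estimation.
    if age is None or math.isnan(age):
        return False
    # Assumption: ages below one are fractional because they describe infants.
    if age < 1:
        return False
    # A fractional part of exactly one half marks an estimated age.
    return age % 1 == 0.5


for age in (22.0, 28.5, 0.75, float("nan")):
    print(age, is_estimated_age(age))
```

Applied to a whole column, for example with `reference_df["Age"].map(is_estimated_age)`, a helper like this would give a boolean mask of the rows whose age was estimated.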

The Survived column tells us whether the passenger survived and is what we call the Target column.

There is also an auxiliary column that Kaggle uses and that we have kept for compatibility:

  • PassengerId - a unique number referencing each passenger. This is very useful for joining the target results on the analysis dataset.
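As a rough illustration of that join, the sketch below merges target values onto analysis rows with pandas. The two small frames are hand-made stand-ins with illustrative values, not rows from the real dataset, and the sketch assumes that the targets frame holds a PassengerId column next to Survived.

```python
import pandas as pd

# Stand-in for analysis_df: features only, no target (illustrative values).
analysis_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
    "Sex": ["male", "female", "male"],
})

# Stand-in for analysis_targets_df: assumed to contain PassengerId and Survived.
# The rows are deliberately in a different order than in analysis_df.
analysis_targets_df = pd.DataFrame({
    "PassengerId": [894, 892, 893],
    "Survived": [0, 0, 1],
})

# A left join keeps every analysis row in its original order;
# validate raises an error if PassengerId is duplicated on either side.
analysis_full_df = analysis_df.merge(
    analysis_targets_df,
    on="PassengerId",
    how="left",
    validate="one_to_one",
)

print(analysis_full_df)
```

Joining on the key instead of relying on row position keeps the targets aligned even when the two frames are sorted differently.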

NannyML uses the titanic dataset in the Data Quality tutorials to showcase its Missing Values and Unseen Values detection functionality.