nannyml.chunk module

NannyML module providing intelligent splitting of data into chunks.

class nannyml.chunk.Chunk(key: str, data: DataFrame, start_datetime: datetime = datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), end_datetime: datetime = datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), partition: Optional[str] = None)[source]

Bases: object

A subset of data that acts as a logical unit during calculations.

Creates a new chunk.

Parameters
  • key (str, required) – A value describing what data is wrapped in this chunk.

  • data (DataFrame, required) – The data to be contained within the chunk.

  • start_datetime (datetime) – The starting point in time for this chunk.

  • end_datetime (datetime) – The end point in time for this chunk.

  • partition (string, optional) – The ‘partition’ this chunk belongs to, for example ‘reference’ or ‘analysis’.

__len__()[source]

Returns the number of rows held within this chunk.

Returns

length – Number of rows in the data property of the chunk.

Return type

int

__repr__()[source]

Returns textual summary of a chunk.

Returns

chunk_str – A textual summary of the chunk.

Return type

str
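To make the Chunk behaviour above concrete, here is a minimal sketch of a chunk-like wrapper, not the actual nannyml.chunk.Chunk class: the class name ChunkSketch and its repr format are illustrative assumptions, but the __len__ and __repr__ semantics follow the documentation above.

```python
import pandas as pd

# Illustrative sketch only, not nannyml's implementation: a chunk wraps a
# DataFrame together with some bookkeeping (key, time bounds, partition).
class ChunkSketch:
    def __init__(self, key, data, start_datetime=None, end_datetime=None, partition=None):
        self.key = key                      # e.g. '[0:3]' or '2022-01'
        self.data = data                    # the wrapped DataFrame
        self.start_datetime = start_datetime
        self.end_datetime = end_datetime
        self.partition = partition          # e.g. 'reference' or 'analysis'

    def __len__(self):
        # Number of rows in the wrapped DataFrame, as documented for __len__
        return self.data.shape[0]

    def __repr__(self):
        # Textual summary of the chunk, as documented for __repr__
        return f"Chunk[key={self.key}, rows={len(self)}, partition={self.partition}]"

df = pd.DataFrame({'y_pred': [0, 1, 1, 0]})
chunk = ChunkSketch(key='[0:3]', data=df, partition='reference')
```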

class nannyml.chunk.Chunker[source]

Bases: ABC

Base class for Chunker implementations.

Inheriting classes will split a DataFrame into a list of Chunks. They will do this based on several constraints, e.g. observation timestamps, number of observations per Chunk or a preferred number of Chunks.

Creates a new Chunker. Not used directly.

split(data: DataFrame, columns=None, minimum_chunk_size: Optional[int] = None) List[Chunk][source]

Splits a given data frame into a list of chunks.

This method provides a uniform interface across Chunker implementations to keep them interchangeable.

After performing the implementation-specific _split method, there are some checks on the resulting chunk list.

If the total number of chunks is low, a warning is written to the logs.

We dynamically determine the optimal minimum number of observations per chunk and then check whether the resulting chunks contain at least that many. If any chunks are underpopulated, a warning is written to the logs.

Parameters
  • data (DataFrame) – The data to be split into chunks.

  • columns (List[str], default=None) – A list of columns to be included in the resulting chunk data. Unlisted columns will be dropped.

  • minimum_chunk_size (int, default=None) – The recommended minimum number of observations a Chunk should hold. When specified a warning will appear if the split results in underpopulated chunks. When not specified there will be no checks for underpopulated chunks.

Returns

chunks – The list of chunks

Return type

List[Chunk]
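The post-split checks described above can be sketched as follows. This is an assumption about the general shape of the logic, not nannyml internals: the function name split_with_checks and the low-chunk threshold are hypothetical, and plain DataFrames stand in for Chunk objects.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("chunker_sketch")

# Hypothetical sketch of the checks split() performs on the result of the
# implementation-specific _split: warn on few chunks or underpopulated chunks.
# The threshold of 6 chunks is an illustrative assumption.
def split_with_checks(chunks, minimum_chunk_size=None, low_chunk_warning=6):
    if len(chunks) < low_chunk_warning:
        logger.warning("Only %d chunks were created; results may be noisy.", len(chunks))
    # Only check for underpopulated chunks when a minimum size was given,
    # mirroring the documented behaviour of the minimum_chunk_size parameter.
    if minimum_chunk_size is not None:
        underpopulated = [c for c in chunks if len(c) < minimum_chunk_size]
        if underpopulated:
            logger.warning("%d chunk(s) contain fewer than %d observations.",
                           len(underpopulated), minimum_chunk_size)
    return chunks

df = pd.DataFrame({'x': range(10)})
parts = [df.iloc[:4], df.iloc[4:8], df.iloc[8:]]  # stand-in for _split output
checked = split_with_checks(parts, minimum_chunk_size=3)
```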

class nannyml.chunk.CountBasedChunker(chunk_count: int)[source]

Bases: Chunker

A Chunker that will split data into chunks based on the preferred number of total chunks.

Examples

>>> from nannyml.chunk import CountBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = CountBasedChunker(chunk_count=100)
>>> chunks = chunker.split(data=df, minimum_chunk_size=50)

Creates a new CountBasedChunker.

It calculates the number of observations per chunk based on the given chunk count, then splits the data into chunks just like a SizeBasedChunker does.

Parameters

chunk_count (int) – The number of chunks to split the data into.

Returns

chunker

Return type

CountBasedChunker
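The count-to-size calculation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not nannyml's implementation: the function name count_based_split is hypothetical, and plain DataFrame slices stand in for Chunk objects.

```python
import pandas as pd

# Sketch of count-based chunking: derive a fixed chunk size from the
# requested chunk count, then slice the data size-based (like SizeBasedChunker).
def count_based_split(df: pd.DataFrame, chunk_count: int):
    chunk_size = len(df) // chunk_count          # observations per chunk
    # Any leftover rows (len(df) % chunk_count) fall outside the slices.
    return [df.iloc[i:i + chunk_size]
            for i in range(0, chunk_size * chunk_count, chunk_size)]

df = pd.DataFrame({'x': range(1000)})
chunks = count_based_split(df, chunk_count=100)  # 100 chunks of 10 rows each
```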

class nannyml.chunk.DefaultChunker[source]

Bases: Chunker

Splits data into about 10 chunks.

Examples

>>> from nannyml.chunk import DefaultChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = DefaultChunker()
>>> chunks = chunker.split(data=df, minimum_chunk_size=50)

Creates a new DefaultChunker.

DEFAULT_CHUNK_COUNT = 10

class nannyml.chunk.PeriodBasedChunker(date_column_name: str = 'nml_meta_timestamp', offset: str = 'W')[source]

Bases: Chunker

A Chunker that will split data into Chunks based on a date column in the data.

Examples

Chunk using monthly periods and providing a column name

>>> from nannyml.chunk import PeriodBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = PeriodBasedChunker(date_column_name='observation_date', offset='M')
>>> chunks = chunker.split(data=df)

Or chunk using weekly periods

>>> from nannyml.chunk import PeriodBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = PeriodBasedChunker(date_column_name='observation_date', offset='W')
>>> chunks = chunker.split(data=df, minimum_chunk_size=50)

Creates a new PeriodBasedChunker.

Parameters
  • date_column_name (string) – The name of the column in the DataFrame that contains the date used for chunking. Defaults to the metadata timestamp column added by the ModelMetadata.extract_metadata function.

  • offset (a frequency string representing a pandas.tseries.offsets.DateOffset) – The offset determines how the time-based grouping will occur. A list of possible values is to be found at https://pandas.pydata.org/docs/user_guide/timeseries.html#offset-aliases.

Returns

chunker

Return type

PeriodBasedChunker
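The period-based grouping can be illustrated with plain pandas. This sketch only shows the underlying idea of grouping rows by a pandas offset alias; it is not the PeriodBasedChunker implementation, and the column names are illustrative.

```python
import pandas as pd

# 90 daily observations spanning January through March 2022
df = pd.DataFrame({
    'observation_date': pd.date_range('2022-01-01', periods=90, freq='D'),
    'y_pred': range(90),
})

# Group rows by calendar month (the 'M' offset alias); each group would
# become one time-based chunk, keyed by its period.
groups = df.groupby(df['observation_date'].dt.to_period('M'))
chunks = {str(period): group for period, group in groups}
```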

class nannyml.chunk.SizeBasedChunker(chunk_size: int, drop_incomplete: bool = False)[source]

Bases: Chunker

A Chunker that will split data into Chunks based on the preferred number of observations per Chunk.

Notes

  • Chunks are adjacent, not overlapping

  • Leftover observations that cannot fill an entire chunk are kept as a final, incomplete chunk by default; set drop_incomplete=True to drop them instead.

Examples

Chunk using a fixed chunk size of 2000 observations

>>> from nannyml.chunk import SizeBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = SizeBasedChunker(chunk_size=2000)
>>> chunks = chunker.split(data=df, minimum_chunk_size=50)

Create a new SizeBasedChunker.

Parameters
  • chunk_size (int) – The preferred size of the resulting Chunks, i.e. the number of observations in each Chunk.

  • drop_incomplete (bool, default=False) – Indicates whether the final Chunk after splitting should be dropped if it doesn’t contain chunk_size observations. Defaults to False, i.e. the final chunk will always be kept.

Returns

chunker

Return type

SizeBasedChunker
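The size-based splitting and the documented drop_incomplete behaviour can be sketched as follows. This is an assumption-level illustration, not nannyml's implementation: the function name size_based_split is hypothetical, and plain DataFrame slices stand in for Chunk objects.

```python
import pandas as pd

# Sketch of size-based chunking: adjacent, non-overlapping slices of
# chunk_size rows, with the documented handling of the trailing partial chunk.
def size_based_split(df: pd.DataFrame, chunk_size: int, drop_incomplete: bool = False):
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    # With drop_incomplete=True, discard a final chunk that holds fewer than
    # chunk_size observations; by default it is kept.
    if drop_incomplete and chunks and len(chunks[-1]) < chunk_size:
        chunks = chunks[:-1]
    return chunks

df = pd.DataFrame({'x': range(4500)})
kept = size_based_split(df, chunk_size=2000)                           # 2000, 2000, 500
dropped = size_based_split(df, chunk_size=2000, drop_incomplete=True)  # 2000, 2000
```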