nannyml.chunk module

NannyML module providing intelligent splitting of data into chunks.

class nannyml.chunk.Chunk(key: str, data: DataFrame, start_datetime: datetime = datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), end_datetime: datetime = datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), partition: Optional[str] = None)[source]

Bases: object

A subset of data that acts as a logical unit during calculations.

Creates a new chunk.

Parameters
  • key (str, required) – A value describing what data is wrapped in this chunk.

  • data (DataFrame, required) – The data to be contained within the chunk.

  • start_datetime (datetime) – The starting point in time for this chunk.

  • end_datetime (datetime) – The end point in time for this chunk.

  • partition (string, optional) – The ‘partition’ this chunk belongs to, for example ‘reference’ or ‘analysis’.

__len__()[source]

Returns the number of rows held within this chunk.

Returns

length – Number of rows in the data property of the chunk.

Return type

int

__repr__()[source]

Returns textual summary of a chunk.

Returns

chunk_str – A textual summary of the chunk.

Return type

str
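To make the Chunk behaviour above concrete, here is a minimal sketch of a chunk-like wrapper, not the actual nannyml.chunk.Chunk class: the class name ChunkSketch and its repr format are illustrative assumptions, but the __len__ and __repr__ semantics follow the documentation above.

```python
import pandas as pd

# Illustrative sketch only, not nannyml's implementation: a chunk wraps a
# DataFrame together with some bookkeeping (key, time bounds, partition).
class ChunkSketch:
    def __init__(self, key, data, start_datetime=None, end_datetime=None, partition=None):
        self.key = key                      # e.g. '[0:3]' or '2022-01'
        self.data = data                    # the wrapped DataFrame
        self.start_datetime = start_datetime
        self.end_datetime = end_datetime
        self.partition = partition          # e.g. 'reference' or 'analysis'

    def __len__(self):
        # Number of rows in the wrapped DataFrame, as documented for __len__
        return self.data.shape[0]

    def __repr__(self):
        # Textual summary of the chunk, as documented for __repr__
        return f"Chunk[key={self.key}, rows={len(self)}, partition={self.partition}]"

df = pd.DataFrame({'y_pred': [0, 1, 1, 0]})
chunk = ChunkSketch(key='[0:3]', data=df, partition='reference')
```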

class nannyml.chunk.Chunker[source]

Bases: ABC

Base class for Chunker implementations.

Inheriting classes will split a DataFrame into a list of Chunks. They will do this based on several constraints, e.g. observation timestamps, number of observations per Chunk or a preferred number of Chunks.

Creates a new Chunker. Not used directly.

split(data: DataFrame, columns=None, minimum_chunk_size: Optional[int] = None) List[Chunk][source]

Splits a given data frame into a list of chunks.

This method provides a uniform interface across Chunker implementations to keep them interchangeable.

After performing the implementation-specific _split method, there are some checks on the resulting chunk list.

If the total number of chunks is low, a warning is written to the logs.

We dynamically determine the optimal minimum number of observations per chunk and then check whether the resulting chunks contain at least that many. If any chunks are underpopulated, a warning is written to the logs.

Parameters
  • data (DataFrame) – The data to be split into chunks.

  • columns (List[str], default=None) – A list of columns to be included in the resulting chunk data. Unlisted columns will be dropped.

  • minimum_chunk_size (int, default=None) – The recommended minimum number of observations a Chunk should hold. When specified a warning will appear if the split results in underpopulated chunks. When not specified there will be no checks for underpopulated chunks.

Returns

chunks – The list of chunks

Return type

List[Chunk]
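The post-split checks described above can be sketched as follows. This is an assumption about the general shape of the logic, not nannyml internals: the function name split_with_checks and the low-chunk threshold are hypothetical, and plain DataFrames stand in for Chunk objects.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("chunker_sketch")

# Hypothetical sketch of the checks split() performs on the result of the
# implementation-specific _split: warn on few chunks or underpopulated chunks.
# The threshold of 6 chunks is an illustrative assumption.
def split_with_checks(chunks, minimum_chunk_size=None, low_chunk_warning=6):
    if len(chunks) < low_chunk_warning:
        logger.warning("Only %d chunks were created; results may be noisy.", len(chunks))
    # Only check for underpopulated chunks when a minimum size was given,
    # mirroring the documented behaviour of the minimum_chunk_size parameter.
    if minimum_chunk_size is not None:
        underpopulated = [c for c in chunks if len(c) < minimum_chunk_size]
        if underpopulated:
            logger.warning("%d chunk(s) contain fewer than %d observations.",
                           len(underpopulated), minimum_chunk_size)
    return chunks

df = pd.DataFrame({'x': range(10)})
parts = [df.iloc[:4], df.iloc[4:8], df.iloc[8:]]  # stand-in for _split output
checked = split_with_checks(parts, minimum_chunk_size=3)
```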

class nannyml.chunk.CountBasedChunker(chunk_count: int)[source]

Bases: Chunker

A Chunker that will split data into chunks based on the preferred number of total chunks.

Examples

>>> from nannyml.chunk import CountBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = CountBasedChunker(chunk_count=100)
>>> chunks = chunker.split(data=df, minimum_chunk_size=50)

Creates a new CountBasedChunker.

It calculates the number of observations per chunk based on the given chunk count, then splits the data into chunks just like a SizeBasedChunker does.

Parameters

chunk_count (int) – The number of chunks to split the data into.

Returns

chunker

Return type

CountBasedChunker
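The count-to-size calculation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not nannyml's implementation: the function name count_based_split is hypothetical, and plain DataFrame slices stand in for Chunk objects.

```python
import pandas as pd

# Sketch of count-based chunking: derive a fixed chunk size from the
# requested chunk count, then slice the data size-based (like SizeBasedChunker).
def count_based_split(df: pd.DataFrame, chunk_count: int):
    chunk_size = len(df) // chunk_count          # observations per chunk
    # Any leftover rows (len(df) % chunk_count) fall outside the slices.
    return [df.iloc[i:i + chunk_size]
            for i in range(0, chunk_size * chunk_count, chunk_size)]

df = pd.DataFrame({'x': range(1000)})
chunks = count_based_split(df, chunk_count=100)  # 100 chunks of 10 rows each
```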

class nannyml.chunk.DefaultChunker[source]

Bases: Chunker

Splits data into about 10 chunks.

Examples

>>> from nannyml.chunk import DefaultChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = DefaultChunker()
>>> chunks = chunker.split(data=df, minimum_chunk_size=50)

Creates a new DefaultChunker.

DEFAULT_CHUNK_COUNT = 10

class nannyml.chunk.PeriodBasedChunker(date_column_name: str = 'nml_meta_timestamp', offset: str = 'W')[source]

Bases: Chunker

A Chunker that will split data into Chunks based on a date column in the data.

Examples

Chunk using monthly periods and providing a column name

>>> from nannyml.chunk import PeriodBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = PeriodBasedChunker(date_column_name='observation_date', offset='M')
>>> chunks = chunker.split(data=df)

Or chunk using weekly periods

>>> from nannyml.chunk import PeriodBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = PeriodBasedChunker(date_column_name='observation_date', offset='W')
>>> chunks = chunker.split(data=df, minimum_chunk_size=50)

Creates a new PeriodBasedChunker.

Parameters
  • date_column_name (string) – The name of the column in the DataFrame that contains the date used for chunking. Defaults to the metadata timestamp column added by the ModelMetadata.extract_metadata function.

  • offset (a frequency string representing a pandas.tseries.offsets.DateOffset) – The offset determines how the time-based grouping will occur. A list of possible values is to be found at https://pandas.pydata.org/docs/user_guide/timeseries.html#offset-aliases.

Returns

chunker

Return type

PeriodBasedChunker
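The period-based grouping can be illustrated with plain pandas. This sketch only shows the underlying idea of grouping rows by a pandas offset alias; it is not the PeriodBasedChunker implementation, and the column names are illustrative.

```python
import pandas as pd

# 90 daily observations spanning January through March 2022
df = pd.DataFrame({
    'observation_date': pd.date_range('2022-01-01', periods=90, freq='D'),
    'y_pred': range(90),
})

# Group rows by calendar month (the 'M' offset alias); each group would
# become one time-based chunk, keyed by its period.
groups = df.groupby(df['observation_date'].dt.to_period('M'))
chunks = {str(period): group for period, group in groups}
```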

class nannyml.chunk.SizeBasedChunker(chunk_size: int, drop_incomplete: bool = False)[source]

Bases: Chunker

A Chunker that will split data into Chunks based on the preferred number of observations per Chunk.

Notes

  • Chunks are adjacent, not overlapping

  • Leftover observations that cannot fill an entire chunk are kept as a final, incomplete chunk by default; set drop_incomplete=True to drop them instead.

Examples

Chunk using a fixed chunk size of 2000 observations

>>> from nannyml.chunk import SizeBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = SizeBasedChunker(chunk_size=2000)
>>> chunks = chunker.split(data=df, minimum_chunk_size=50)

Create a new SizeBasedChunker.

Parameters
  • chunk_size (int) – The preferred size of the resulting Chunks, i.e. the number of observations in each Chunk.

  • drop_incomplete (bool, default=False) – Indicates whether the final Chunk after splitting should be dropped if it doesn’t contain chunk_size observations. Defaults to False, i.e. the final chunk will always be kept.

Returns

chunker

Return type

SizeBasedChunker
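The size-based splitting and the documented drop_incomplete behaviour can be sketched as follows. This is an assumption-level illustration, not nannyml's implementation: the function name size_based_split is hypothetical, and plain DataFrame slices stand in for Chunk objects.

```python
import pandas as pd

# Sketch of size-based chunking: adjacent, non-overlapping slices of
# chunk_size rows, with the documented handling of the trailing partial chunk.
def size_based_split(df: pd.DataFrame, chunk_size: int, drop_incomplete: bool = False):
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    # With drop_incomplete=True, discard a final chunk that holds fewer than
    # chunk_size observations; by default it is kept.
    if drop_incomplete and chunks and len(chunks[-1]) < chunk_size:
        chunks = chunks[:-1]
    return chunks

df = pd.DataFrame({'x': range(4500)})
kept = size_based_split(df, chunk_size=2000)                           # 2000, 2000, 500
dropped = size_based_split(df, chunk_size=2000, drop_incomplete=True)  # 2000, 2000
```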