nannyml.chunk module
NannyML module providing intelligent splitting of data into chunks.
- class nannyml.chunk.Chunk(key: str, data: pandas.core.frame.DataFrame, start_datetime: Optional[datetime.datetime] = None, end_datetime: Optional[datetime.datetime] = None, start_index: int = -1, end_index: int = -1, period: Optional[str] = None)[source]
Bases:
object
A subset of data that acts as a logical unit during calculations.
Creates a new chunk.
- Parameters
key (str, required.) – A value describing what data is wrapped in this chunk.
data (DataFrame, required) – The data to be contained within the chunk.
start_datetime (datetime) – The starting point in time for this chunk.
end_datetime (datetime) – The end point in time for this chunk.
period (string, optional) – The ‘period’ this chunk belongs to, for example ‘reference’ or ‘analysis’.
- __len__()[source]
Returns the number of rows held within this chunk.
- Returns
length – Number of rows in the data property of the chunk.
- Return type
int
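The `__len__` contract above simply delegates to the wrapped data. A minimal stdlib-only sketch of that behavior, using a list of row dicts as a stand-in for the pandas DataFrame the real Chunk wraps (`ChunkSketch` is a hypothetical name, not part of the library):

```python
from datetime import datetime
from typing import List, Optional


class ChunkSketch:
    """Stand-in illustrating the Chunk contract: a keyed subset of rows."""

    def __init__(self, key: str, data: List[dict],
                 start_datetime: Optional[datetime] = None,
                 end_datetime: Optional[datetime] = None,
                 period: Optional[str] = None):
        self.key = key
        self.data = data
        self.start_datetime = start_datetime
        self.end_datetime = end_datetime
        self.period = period

    def __len__(self) -> int:
        # Number of rows in the data property of the chunk.
        return len(self.data)


chunk = ChunkSketch(key='[0:2]', data=[{'y': 1}, {'y': 0}, {'y': 1}],
                    period='reference')
print(len(chunk))  # 3
```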
- merge(other: nannyml.chunk.Chunk)[source]
Merges two chunks into a single one.
- class nannyml.chunk.Chunker(timestamp_column_name: Optional[str] = None)[source]
Bases:
abc.ABC
Base class for Chunker implementations.
Inheriting classes will split a DataFrame into a list of Chunks. They will do this based on several constraints, e.g. observation timestamps, number of observations per Chunk or a preferred number of Chunks.
Creates a new Chunker.
- split(data: pandas.core.frame.DataFrame, columns=None) List[nannyml.chunk.Chunk] [source]
Splits a given data frame into a list of chunks.
This method provides a uniform interface across Chunker implementations to keep them interchangeable.
After running the implementation-specific _split method, some checks are performed on the resulting chunk list.
If the total number of chunks is low, a warning is written to the logs.
The optimal minimum number of observations per chunk is determined dynamically; if any resulting chunk contains fewer observations than that minimum, a warning is written to the logs as well.
- Parameters
data (DataFrame) – The data to be split into chunks.
columns (List[str], default=None) – A list of columns to be included in the resulting chunk data. Unlisted columns will be dropped.
- Returns
chunks – The list of chunks
- Return type
List[Chunk]
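The post-split checks described above can be sketched as follows. The thresholds here are illustrative assumptions; the real Chunker derives the minimum observations per chunk dynamically from the data:

```python
import warnings
from typing import List

# Illustrative thresholds; not the library's actual heuristics.
LOW_CHUNK_COUNT = 6
MIN_OBSERVATIONS_PER_CHUNK = 300


def validate_chunks(chunk_sizes: List[int]) -> None:
    """Warn about chunk lists that would yield unreliable statistics."""
    if len(chunk_sizes) < LOW_CHUNK_COUNT:
        warnings.warn(f'{len(chunk_sizes)} chunks is a low number of chunks.')
    underpopulated = [s for s in chunk_sizes if s < MIN_OBSERVATIONS_PER_CHUNK]
    if underpopulated:
        warnings.warn(f'{len(underpopulated)} chunks contain fewer than '
                      f'{MIN_OBSERVATIONS_PER_CHUNK} observations.')


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    validate_chunks([500, 500, 120])
print(len(caught))  # 2: low chunk count, plus one underpopulated chunk
```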
- class nannyml.chunk.ChunkerFactory[source]
Bases:
object
- classmethod get_chunker(chunk_size: Optional[int] = None, chunk_number: Optional[int] = None, chunk_period: Optional[str] = None, chunker: Optional[nannyml.chunk.Chunker] = None, timestamp_column_name: Optional[str] = None) nannyml.chunk.Chunker [source]
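get_chunker carries no docstring here. Assuming it dispatches on whichever argument is provided (the precedence order below is an assumption of this sketch, not documented behavior), the selection logic might look like:

```python
from typing import Optional


def pick_chunker(chunk_size: Optional[int] = None,
                 chunk_number: Optional[int] = None,
                 chunk_period: Optional[str] = None,
                 chunker: Optional[str] = None) -> str:
    """Hypothetical dispatch: return the name of the Chunker to use.

    A user-supplied chunker wins; otherwise the first sizing argument
    given decides; with no arguments the DefaultChunker is used.
    """
    if chunker is not None:
        return chunker
    if chunk_size is not None:
        return 'SizeBasedChunker'
    if chunk_number is not None:
        return 'CountBasedChunker'
    if chunk_period is not None:
        return 'PeriodBasedChunker'
    return 'DefaultChunker'


print(pick_chunker(chunk_number=100))  # CountBasedChunker
print(pick_chunker())                  # DefaultChunker
```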
- class nannyml.chunk.CountBasedChunker(chunk_number: int, incomplete: str = 'append', timestamp_column_name: Optional[str] = None)[source]
Bases:
nannyml.chunk.Chunker
A Chunker that will split data into chunks based on the preferred number of total chunks.
Examples
>>> from nannyml.chunk import CountBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = CountBasedChunker(chunk_number=100)
>>> chunks = chunker.split(data=df)
Creates a new CountBasedChunker.
It calculates the number of observations per chunk from the given chunk count, then splits the data into chunks just like a SizeBasedChunker does.
- Parameters
chunk_number (int) – The number of chunks to split the data into.
incomplete (str, default='append') – Choose how to handle any leftover observations that don’t make up a full Chunk. The following options are available:
- 'drop': drop the leftover observations
- 'keep': keep the incomplete Chunk (containing less than chunk_size observations)
- 'append': append leftover observations to the last complete Chunk (overfilling it)
Defaults to 'append'.
- Returns
chunker
- Return type
a count-based instance used to split data into a fixed number of Chunks.
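The size calculation described above (a per-chunk size derived from the requested chunk count, followed by a size-based split) can be sketched as follows; integer division is an assumption of this sketch:

```python
def count_based_chunk_size(n_rows: int, chunk_number: int) -> int:
    """Derive the per-chunk size from the requested number of chunks.

    Leftover observations are then handled by the 'incomplete'
    strategy, exactly as with SizeBasedChunker.
    """
    return n_rows // chunk_number


size = count_based_chunk_size(n_rows=100_000, chunk_number=100)
print(size)  # 1000
```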
- class nannyml.chunk.DefaultChunker(timestamp_column_name: Optional[str] = None)[source]
Bases:
nannyml.chunk.Chunker
Splits data into about 10 chunks.
Examples
>>> from nannyml.chunk import DefaultChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = DefaultChunker()
>>> chunks = chunker.split(data=df)
Creates a new DefaultChunker.
- DEFAULT_CHUNK_COUNT = 10
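"About 10 chunks" suggests the default chunk size is derived from the row count and DEFAULT_CHUNK_COUNT; a sketch under that assumption:

```python
DEFAULT_CHUNK_COUNT = 10


def default_chunk_size(n_rows: int) -> int:
    # Aim for roughly DEFAULT_CHUNK_COUNT chunks of equal size;
    # any remainder is left to the incomplete-chunk handling.
    return n_rows // DEFAULT_CHUNK_COUNT


print(default_chunk_size(25_000))  # 2500
```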
- class nannyml.chunk.PeriodBasedChunker(timestamp_column_name: str, offset: str = 'W')[source]
Bases:
nannyml.chunk.Chunker
A Chunker that will split data into Chunks based on a date column in the data.
Examples
Chunk using monthly periods and providing a column name
>>> from nannyml.chunk import PeriodBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = PeriodBasedChunker(timestamp_column_name='observation_date', offset='M')
>>> chunks = chunker.split(data=df)
Or chunk using weekly periods
>>> from nannyml.chunk import PeriodBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = PeriodBasedChunker(timestamp_column_name='observation_date', offset='W')
>>> chunks = chunker.split(data=df)
Creates a new PeriodBasedChunker.
- Parameters
timestamp_column_name (str) – The name of the column containing the timestamp used to group observations into periods.
offset (a frequency string representing a pandas.tseries.offsets.DateOffset) – The offset determines how the time-based grouping will occur. A list of possible values can be found at https://pandas.pydata.org/docs/user_guide/timeseries.html#offset-aliases.
drop (bool, default=False) – Drops the timestamp column from the chunk data if True.
- Returns
chunker
- Return type
a PeriodBasedChunker instance used to split data into time-based Chunks.
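The time-based grouping relies on pandas offset aliases. A stdlib-only approximation of the offset='W' behavior, grouping rows by ISO year and week (the real chunker supports many more frequencies than this sketch):

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple


def group_by_week(rows: List[dict],
                  timestamp_column_name: str) -> Dict[Tuple[int, int], List[dict]]:
    """Approximate a weekly PeriodBasedChunker split using ISO calendar weeks."""
    groups: Dict[Tuple[int, int], List[dict]] = defaultdict(list)
    for row in rows:
        ts: date = row[timestamp_column_name]
        iso = ts.isocalendar()
        # Key each group by (ISO year, ISO week number).
        groups[(iso[0], iso[1])].append(row)
    return groups


# Two full ISO weeks: Mon 2023-01-02 through Sun 2023-01-15.
rows = [{'observation_date': date(2023, 1, d), 'y': d % 2} for d in range(2, 16)]
chunks = group_by_week(rows, 'observation_date')
print(len(chunks))  # 2
```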
- class nannyml.chunk.SizeBasedChunker(chunk_size: int, incomplete: str = 'append', timestamp_column_name: Optional[str] = None)[source]
Bases:
nannyml.chunk.Chunker
A Chunker that will split data into Chunks based on the preferred number of observations per Chunk.
Notes
Chunks are adjacent, not overlapping.
Leftover observations that cannot fill an entire chunk are handled according to the incomplete parameter; by default ('append') they are appended to the last complete chunk.
Examples
Chunk using a fixed size of 2000 observations, dropping any leftovers
>>> from nannyml.chunk import SizeBasedChunker
>>> df = pd.read_parquet('/path/to/my/data.pq')
>>> chunker = SizeBasedChunker(chunk_size=2000, incomplete='drop')
>>> chunks = chunker.split(data=df)
Create a new SizeBasedChunker.
- Parameters
chunk_size (int) – The preferred size of the resulting Chunks, i.e. the number of observations in each Chunk.
incomplete (str, default='append') –
Choose how to handle any leftover observations that don’t make up a full Chunk. The following options are available:
- 'drop': drop the leftover observations
- 'keep': keep the incomplete Chunk (containing less than chunk_size observations)
- 'append': append leftover observations to the last complete Chunk (overfilling it)
Defaults to 'append'.
- Returns
chunker
- Return type
a size-based instance used to split data into Chunks of a constant size.
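The three incomplete strategies can be sketched over a plain list of rows (a stand-in for the DataFrame; this is an illustration of the splitting logic, not the library implementation):

```python
from typing import List


def size_based_split(rows: List[int], chunk_size: int,
                     incomplete: str = 'append') -> List[List[int]]:
    """Split rows into adjacent, non-overlapping chunks of chunk_size.

    Leftover rows that cannot fill a chunk are dropped, kept as a
    short final chunk, or appended to the last complete chunk.
    """
    n_complete = len(rows) // chunk_size
    chunks = [rows[i * chunk_size:(i + 1) * chunk_size]
              for i in range(n_complete)]
    leftover = rows[n_complete * chunk_size:]
    if leftover:
        if incomplete == 'keep':
            chunks.append(leftover)
        elif incomplete == 'append':
            chunks[-1].extend(leftover)
        # 'drop': discard the leftover rows
    return chunks


rows = list(range(7))
print([len(c) for c in size_based_split(rows, 3, 'drop')])    # [3, 3]
print([len(c) for c in size_based_split(rows, 3, 'keep')])    # [3, 3, 1]
print([len(c) for c in size_based_split(rows, 3, 'append')])  # [3, 4]
```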