Cascade¶

About¶

The Cascade class is the primary interface for the glyphdeck library. It handles and processes data in a record-like structure, providing easy-to-use syntax for LLM data handling workflows.

It validates and enforces all data movements against a common id, ensuring that each record has a unique, immutable identifier that remains consistent, regardless of other changes.
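The common-id check can be pictured with a small standalone sketch. This is not glyphdeck's implementation: the function name `check_ids` and the sample data are hypothetical, and the only assumption is that every new data dictionary must carry exactly the ids fixed when the data was first loaded.

```python
from typing import Dict, List, Set, Union

Id = Union[int, str]


def check_ids(base_ids: Set[Id], data: Dict[Id, List]) -> None:
    """Raise ValueError unless `data` has exactly the ids in `base_ids`."""
    new_ids = set(data)
    missing = base_ids - new_ids
    unexpected = new_ids - base_ids
    if missing or unexpected:
        raise ValueError(
            f"id mismatch: missing={sorted(missing, key=str)}, "
            f"unexpected={sorted(unexpected, key=str)}"
        )


# Ids fixed when the (hypothetical) base data was loaded
base_ids = {1, 2, 3}

# Same ids as the base: passes silently
check_ids(base_ids, {1: ["good"], 2: ["bad"], 3: ["ok"]})

# Ids 2 and 3 are missing and id 4 is new: rejected
try:
    check_ids(base_ids, {1: ["good"], 4: ["new"]})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```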

Inherited Class Instances¶

The Cascade is integrated with instances of utility classes:

sanitiser - Identify and replace pieces of private information within DataDicts using regular expression patterns.
llm_handler - Handler for interacting with Large Language Models (LLMs) within the Cascade.
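As a rough illustration of what regular-expression sanitising of a DataDict-shaped structure involves, the sketch below replaces email addresses with a placeholder. It uses only the standard library; the pattern, the placeholder, and the function name `sanitise` are assumptions made for this example and are not taken from glyphdeck.

```python
import re
from typing import Dict, List, Union

Id = Union[int, str]

# Deliberately simple pattern for illustration; real email matching is harder
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def sanitise(data: Dict[Id, List], placeholder: str = "<EMAIL>") -> Dict[Id, List]:
    """Return a copy of `data` with email addresses in string values replaced."""
    return {
        key: [
            EMAIL_PATTERN.sub(placeholder, value) if isinstance(value, str) else value
            for value in values
        ]
        for key, values in data.items()
    }


reviews = {
    1: ["Great pizza, contact me at jo.bloggs@example.com"],
    2: ["Too salty", 3],  # non-string values are left untouched
}
clean = sanitise(reviews)
```

The ids are preserved as-is, so the sanitised output still lines up with the original records.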

Example¶

import glyphdeck as gd

# Provide a dataframe or a path to a file (csv or xlsx)
data_source = r"tests\testdata.pizzashopreviews.xlsx"

# Initialise cascade instance and identify the unique id (required) and target data
cascade = gd.Cascade(data_source, "Review", "Review Text")

# Optionally remove private information
cascade.sanitiser.run()

# Prepare the llm
cascade.set_llm_handler(
    provider="OpenAI",
    model="gpt-4o-mini",
    system_message=(
        "You are an expert pizza shop customer feedback analyst system. "
        "Analyse the feedback and return results in the correct format."
    ),
    validation_model=gd.validators.SubCatsSentiment,
    cache_identifier="pizzshop_sentiment",
)

# Run the llm_handler
cascade.llm_handler.run("llm_category_sentiment")

Methods & Properties¶

class glyphdeck.Cascade( data_source: str | DataFrame, id_column: str, data_columns: str | List[str], encoding: str = 'utf-8', sheet_name: int | str = 0, )¶

Bases: object

Handles and processes data in a record-like structure, providing easy-to-use syntax for data handling workflows with LLMs.

Automatically validates and enforces all data movements against a common id, ensuring that each record has a unique, immutable identifier that remains consistent, regardless of other changes.

This class is the primary interface for the glyphdeck library.

Inherits the functionalities of other modules across the library for seamless use, including the sanitiser & llm_handler.

records¶

A dictionary to hold all records.

Type: Dict[int, RecordDict]

expected_len¶

The number of values expected in each list in the records data.

Type: int

append( title: str, data: Dict[int | str, List], column_names: str | List[str] | None = None, update_expected_len: bool = False, )¶

Add a new record to the ‘records’ dictionary.

Parameters:

title – The title of the new record.
data – The data dictionary containing the new record’s data.
column_names – The list of column names. Defaults to None.
update_expected_len – Boolean flag to update the expected length of data lists. Defaults to False.

Returns:

None
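How expected_len and update_expected_len might interact can be sketched with a toy class. `MiniRecords` is a hypothetical stand-in, not the Cascade class; the key numbering (starting at 1) and the error type are assumptions for this example.

```python
from typing import Dict, List, Optional, Union

Id = Union[int, str]


class MiniRecords:
    """Toy stand-in for illustration only; not the glyphdeck Cascade class."""

    def __init__(self, expected_len: int) -> None:
        self.expected_len = expected_len
        self.records: Dict[int, dict] = {}

    def append(
        self,
        title: str,
        data: Dict[Id, List],
        column_names: Optional[Union[str, List[str]]] = None,
        update_expected_len: bool = False,
    ) -> None:
        # Every id must map to a list of the same length
        lengths = {len(values) for values in data.values()}
        if len(lengths) != 1:
            raise ValueError("every list in data must have the same length")
        (length,) = lengths
        if update_expected_len:
            self.expected_len = length
        elif length != self.expected_len:
            raise ValueError(
                f"expected {self.expected_len} values per id, got {length}"
            )
        if isinstance(column_names, str):
            column_names = [column_names]
        key = len(self.records) + 1  # assumed numbering, starting at 1
        self.records[key] = {
            "title": title,
            "data": data,
            "column_names": column_names,
        }


store = MiniRecords(expected_len=1)
store.append("sentiment", {1: ["positive"], 2: ["negative"]}, "sentiment")

# Two values per id does not match expected_len=1, so this is rejected
topics = {1: ["crust", "price"], 2: ["service", "wait"]}
try:
    store.append("topics", topics)
    rejected = False
except ValueError:
    rejected = True

# Opting in to the new length makes the same data acceptable
store.append("topics", topics, ["topic_1", "topic_2"], update_expected_len=True)
```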

column_names( record_identifier: int | str, ) → List[str]¶

Return the list of column names corresponding to the provided record_identifier.

Parameters: record_identifier – The record identifier, which can be an integer or a string.
Returns: The list of column names for the specified record.
Return type: List[str]

data( record_identifier: int | str, ) → Dict[int | str, List]¶

Return the data dictionary corresponding to the provided record_identifier.

Parameters: record_identifier – The record identifier, which can be an integer or a string.
Returns: The data dictionary of the specified record.
Return type: DataDict

property delta: timedelta¶

Returns the overall timedelta of the cascade.

Returns: The overall timedelta from the initialisation to the latest record.
Return type: timedelta

df( record_identifier: int | str, recreate=False, ) → DataFrame¶

Return the dataframe corresponding to the provided record_identifier.

Parameters:

record_identifier – The record identifier, which can be an integer or a string.
recreate – A boolean indicating whether to recreate the dataframe from the data in the record. Defaults to False.

Returns:

The dataframe of the specified record.

Return type:

pd.DataFrame

dt( record_identifier: int | str, ) → datetime¶

Return the datetime corresponding to the provided record_identifier.

Parameters: record_identifier – The record identifier, which can be an integer or a string.
Returns: The datetime of the specified record.
Return type: datetime

Retrieve the specified records in the requested output format.

Parameters:

record_identifiers – Optional list of record identifiers (keys or titles). If None, the latest record is used. Defaults to None.
output_type – The type of output to be returned (‘dataframe’, ‘list’, ‘nested list’, or ‘dict’). Defaults to “dataframe”.
rebase – Boolean flag to join the records onto the base dataframe. Defaults to True.
combine – Boolean flag to combine the records before joining onto the base dataframe. Defaults to True.
recreate – Boolean flag to recreate dataframes from record data instead of using existing dataframes. Defaults to False.

Returns:

The output in the specified format.

Return type:

Union[pd.DataFrame, List[pd.DataFrame], Dict[Union[int, str], pd.DataFrame]]
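One possible reading of the rebase and combine flags is sketched below with plain dictionaries instead of dataframes. The helper names `join` and `assemble`, the table layout, and the sample data are all hypothetical, and the real method may differ in detail.

```python
from typing import Dict, List, Union

Row = Dict[str, object]
Table = Dict[Union[int, str], Row]  # id -> {column name: value}


def join(left: Table, right: Table) -> Table:
    """Join two tables on their shared ids, keeping the columns of both."""
    return {key: {**left[key], **right[key]} for key in left}


def assemble(
    base: Table,
    records: List[Table],
    rebase: bool = True,
    combine: bool = True,
) -> List[Table]:
    tables = list(records)
    if combine and tables:
        # Merge all records into a single table first
        merged = tables[0]
        for table in tables[1:]:
            merged = join(merged, table)
        tables = [merged]
    if rebase:
        # Put the base columns in front of each remaining table
        tables = [join(base, table) for table in tables]
    return tables


base = {1: {"Review Text": "Loved it"}, 2: {"Review Text": "Cold pizza"}}
sentiment = {1: {"sentiment": "positive"}, 2: {"sentiment": "negative"}}
topic = {1: {"topic": "taste"}, 2: {"topic": "temperature"}}

combined = assemble(base, [sentiment, topic])                 # one wide table
separate = assemble(base, [sentiment, topic], combine=False)  # one table per record
```

Because every table shares the same ids, the joins never need to handle unmatched rows.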

property latest_column_names: List[str]¶

Returns the column names of the latest record.

Returns: The list of column names of the latest record.
Return type: List[str]

property latest_data: Dict[int | str, List]¶

Returns the data of the latest record.

Returns: The data dictionary of the latest record.
Return type: DataDict

property latest_df: DataFrame¶

Returns the DataFrame of the latest record.

Parameters: recreate – Whether to recreate the DataFrame from the data. Defaults to False.
Returns: The DataFrame of the latest record.
Return type: pd.DataFrame

property latest_dt: datetime¶

Returns the datetime of the latest record.

Returns: The datetime of the latest record.
Return type: datetime

property latest_key: int¶

Returns the key of the latest record.

Returns: The key of the latest record.
Return type: int

Returns the latest record dictionary.

Returns: The latest record data.
Return type: RecordDict

property latest_record_delta: timedelta¶

Returns the timedelta of the latest record.

Returns: The timedelta of the latest record.
Return type: timedelta

property latest_title: str¶

Returns the title of the latest record.

Returns: The title of the latest record.
Return type: str

Return the record corresponding to the provided record number or record title.

Parameters: record_identifier – The identifier for the record, which can be either an integer (record number) or a string (record title).
Returns: The record dict corresponding to the provided identifier.
Return type: RecordDict
Raises: TypeError – If the provided record_identifier is not an integer or string.

record_delta( record_identifier: int | str, ) → timedelta¶

Return the timedelta corresponding to the provided record_identifier.

Parameters: record_identifier – The record identifier, which can be an integer or a string.
Returns: The timedelta of the specified record.
Return type: timedelta

set_expected_len( value: int, )¶

Set the expected length of the data lists in records.

Parameters: value – The expected length for each list in the records data.
Returns: None

set_llm_handler( provider: str, model: str, system_message: str, validation_model, cache_identifier: str, use_cache: bool = True, temperature: float = 0.2, max_validation_retries: int = 2, max_preprepared_coroutines: int = 10, max_awaiting_coroutines: int = 100, )¶

Create the LLMHandler instance for the Cascade instance.

Rather than taking an input_data argument, it always uses self.latest_data. This can be changed. It also passes a reference to the current Cascade instance up through the kwargs.

Parameters:

provider – The name of the LLM provider.
model – The specific model to be used.
system_message – The system message to be used by the LLM.
validation_model – The model used for data validation.
cache_identifier – The identifier for cache storage.
use_cache – Whether to use caching. Defaults to True.
temperature – The sampling temperature for the LLM. Defaults to 0.2.
max_validation_retries – The maximum number of validation retries. Defaults to 2.
max_preprepared_coroutines – The maximum number of pre-prepared coroutines. Defaults to 10.
max_awaiting_coroutines – The maximum number of awaiting coroutines. Defaults to 100.

Returns:

None
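To make max_validation_retries concrete, here is a standalone sketch of a validate-and-retry loop. It does not call glyphdeck or any LLM provider: `call_with_validation`, the JSON key check, and `fake_llm` are hypothetical, and it assumes the retry count is in addition to the first attempt.

```python
import json
from typing import Callable, List


def call_with_validation(
    call: Callable[[], str],
    required_keys: List[str],
    max_validation_retries: int = 2,
) -> dict:
    """Call `call` until its output validates or the retries run out."""
    attempts = max_validation_retries + 1  # assumption: first try plus retries
    last_error = None
    for _ in range(attempts):
        raw = call()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as error:
            last_error = error
            continue
        if isinstance(parsed, dict) and all(key in parsed for key in required_keys):
            return parsed
        last_error = ValueError(f"missing required keys in {parsed!r}")
    raise RuntimeError(f"validation failed after {attempts} attempts") from last_error


# Fake model for the example: malformed output first, valid output second
responses = iter(["not json", '{"sentiment": "positive"}'])
calls = []


def fake_llm() -> str:
    calls.append(1)
    return next(responses)


result = call_with_validation(fake_llm, ["sentiment"])
```

In the library itself, validation is driven by the validation_model argument rather than a key list; the key check here only stands in for that step.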

title( record_identifier: int | str, ) → str¶

Return the title corresponding to the provided record_identifier.

Parameters: record_identifier – The record identifier, which can be an integer or a string.
Returns: The title of the specified record.
Return type: str

title_key( title: str, ) → int¶

Return the record number for a given title.

Parameters: title – The title of the record to retrieve the key for.
Returns: The key of the record associated with the given title.
Return type: int
Raises: TypeError – If the provided title does not exist in the records.
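The title-to-key lookup can be pictured as a scan over the records dictionary. The sketch below is a hypothetical free function working on a plain dict, not the library's method; it raises TypeError for an unknown title only because that is the exception documented above.

```python
from typing import Dict


def title_key(records: Dict[int, dict], title: str) -> int:
    """Return the key of the first record whose title matches."""
    for key, record in records.items():
        if record["title"] == title:
            return key
    raise TypeError(f"no record titled {title!r}")


# Hypothetical records, keyed by record number
records = {
    0: {"title": "initialisation"},
    1: {"title": "sanitised"},
    2: {"title": "llm_category_sentiment"},
}

key = title_key(records, "llm_category_sentiment")

try:
    title_key(records, "does_not_exist")
    lookup_failed = False
except TypeError:
    lookup_failed = True
```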

write_output( file_type: str, file_name_prefix: str, record_identifiers: List[int | str] | int | str | None = None, rebase: bool = True, combine: bool = True, xlsx_use_sheets: bool = True, recreate: bool = False, ) → Self¶

Write the output of the selected records to a file or files.

Parameters:

file_type – The type of file to write the output to. Can be ‘csv’ or ‘xlsx’.
file_name_prefix – The prefix to be used for the output file name.
record_identifiers – The identifiers for the records to be included in the output. Can be a single identifier or a list of identifiers. Defaults to None, which means the latest record is used.
rebase – If True, the output dataframes are joined onto the base dataframe. Defaults to True.
combine – If True, the records are combined before joining onto the base dataframe or returning. Defaults to True.
xlsx_use_sheets – If True and file_type is ‘xlsx’, writes each record to its own sheet in the same file. Defaults to True.
recreate – If True, the dataframes are recreated from the data in the records instead of using existing dataframes. Defaults to False.

Returns:

The Cascade object, allowing further chained operations.

Return type:

Self
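Returning Self is what makes method chaining possible. The toy class below shows the pattern in isolation; `MiniWriter` is hypothetical, writes nothing to disk, and the file names it records are illustrative rather than the names glyphdeck generates.

```python
from typing import List


class MiniWriter:
    """Toy stand-in that records requested outputs instead of writing files."""

    def __init__(self) -> None:
        self.written: List[str] = []

    def write_output(self, file_type: str, file_name_prefix: str) -> "MiniWriter":
        if file_type not in ("csv", "xlsx"):
            raise ValueError("file_type must be 'csv' or 'xlsx'")
        self.written.append(f"{file_name_prefix}.{file_type}")
        return self  # returning the instance lets the next call chain on


# Two outputs requested in a single chained expression
writer = MiniWriter().write_output("csv", "reviews").write_output("xlsx", "reviews")
```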