Cascade

About

The Cascade class is the primary interface for the glyphdeck library. It handles and processes data in a record-like structure, providing easy-to-use syntax for LLM data handling workflows.

It validates and enforces all data movements against a common id, ensuring that each record has a unique, immutable identifier that remains consistent, regardless of other changes.

Inherited Class Instances

The Cascade is integrated with instances of utility classes:

  • sanitiser - Identify and replace pieces of private information within DataDicts using regular expression patterns.

  • llm_handler - Handler for interacting with Large Language Models (LLMs) within the Cascade.
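
The sanitiser's regex-based replacement can be pictured with a minimal sketch in plain Python. The patterns, placeholder tokens, and the sanitise function below are hypothetical and chosen only for illustration; they are not the patterns or API shipped with glyphdeck.

```python
import re
from typing import Dict, List, Union

# A DataDict-like structure: record id -> list of values
DataDict = Dict[Union[int, str], List]

# Hypothetical patterns and placeholders, for illustration only
PATTERNS = {
    "<EMAIL>": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "<NUMBER>": re.compile(r"\d{4,}"),
}


def sanitise(data: DataDict) -> DataDict:
    """Return a copy of data with every pattern match replaced by its placeholder."""
    cleaned: DataDict = {}
    for record_id, values in data.items():
        new_values = []
        for value in values:
            # Only strings can contain private text; other values pass through
            if isinstance(value, str):
                for placeholder, pattern in PATTERNS.items():
                    value = pattern.sub(placeholder, value)
            new_values.append(value)
        cleaned[record_id] = new_values
    return cleaned


reviews = {
    1: ["Great pizza, email me at sam@example.com"],
    2: ["Call 5551234 to complain"],
}
sanitised = sanitise(reviews)
```

The ids are untouched and only the values change, which matches the idea that each record keeps its identifier regardless of other changes.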

Example

import glyphdeck as gd

# Provide a dataframe or a path to a file (csv or xlsx)
data_source = r"tests\testdata.pizzashopreviews.xlsx"

# Initialise cascade instance and identify the unique id (required) and target data
cascade = gd.Cascade(data_source, "Review", "Review Text")

# Optionally remove private information
cascade.sanitiser.run()

# Prepare the llm
cascade.set_llm_handler(
    provider="OpenAI",
    model="gpt-4o-mini",
    system_message=(
        # Trailing space keeps the two concatenated sentences apart
        "You are an expert pizza shop customer feedback analyst system. "
        "Analyse the feedback and return results in the correct format."
    ),
    validation_model=gd.validators.SubCatsSentiment,
    cache_identifier="pizzshop_sentiment",
)

# Run the llm_handler
cascade.llm_handler.run("llm_category_sentiment")

Methods & Properties

class glyphdeck.Cascade(
data_source: str | DataFrame,
id_column: str,
data_columns: str | List[str],
encoding: str = 'utf-8',
sheet_name: int | str = 0,
)

Bases: object

Handles and processes data in a record-like structure, providing easy-to-use syntax for data handling workflows with LLMs.

Automatically validates and enforces all data movements against a common id, ensuring that each record has a unique, immutable identifier that remains consistent, regardless of other changes.

This class is the primary interface for the glyphdeck library.

Inherits the functionality of other modules across the library for seamless use, including the sanitiser and llm_handler.

records

A dictionary to hold all records.

Type:

Dict[int, RecordDict]

expected_len

The number of values expected in each list in the records data.

Type:

int

append(
title: str,
data: Dict[int | str, List],
column_names: str | List[str] | None = None,
update_expected_len: bool = False,
)

Add a new record to the ‘records’ dictionary.

Parameters:
  • title – The title of the new record.

  • data – The data dictionary containing the new record’s data.

  • column_names – The list of column names. Defaults to None.

  • update_expected_len – Boolean flag to update the expected length of data lists. Defaults to False.

Returns:

None
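
The kind of check implied by a common id and an expected list length can be sketched as follows. This is an assumption about the general idea, not glyphdeck's actual validation code; check_new_data and its error messages are hypothetical.

```python
from typing import Dict, List, Union

DataDict = Dict[Union[int, str], List]


def check_new_data(base_ids: set, data: DataDict, expected_len: int) -> None:
    """Raise if data does not line up with the base ids or the expected list length."""
    missing = base_ids - set(data)
    extra = set(data) - base_ids
    if missing or extra:
        raise KeyError(
            f"id mismatch: missing={sorted(missing, key=str)}, "
            f"extra={sorted(extra, key=str)}"
        )
    for record_id, values in data.items():
        if len(values) != expected_len:
            raise ValueError(
                f"id {record_id!r} has {len(values)} values, expected {expected_len}"
            )


base_ids = {1, 2, 3}

# Every id is present and every list has one value, so this passes silently
good = {1: ["positive"], 2: ["negative"], 3: ["neutral"]}
check_new_data(base_ids, good, expected_len=1)

# Id 2 carries two values, so the length check fails
bad = {1: ["positive"], 2: ["negative", "mixed"], 3: ["neutral"]}
try:
    check_new_data(base_ids, bad, expected_len=1)
    error = None
except ValueError as exc:
    error = str(exc)
```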

column_names(
record_identifier: int | str,
) List[str]

Return the list of column names corresponding to the provided record_identifier.

Parameters:

record_identifier – The record identifier, which can be an integer or a string.

Returns:

The list of column names for the specified record.

Return type:

List[str]

data(
record_identifier: int | str,
) Dict[int | str, List]

Return the data dictionary corresponding to the provided record_identifier.

Parameters:

record_identifier – The record identifier, which can be an integer or a string.

Returns:

The data dictionary of the specified record.

Return type:

DataDict

property delta: timedelta

Returns the overall timedelta of the cascade.

Returns:

The overall timedelta from the initialisation to the latest record.

Return type:

timedelta
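
As a rough illustration of an overall timedelta, the sketch below subtracts the earliest record timestamp from the latest one. The timestamps are made up, and treating the lowest key as the initialisation record is an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps standing in for the datetime stored on each record
record_times = {
    0: datetime(2024, 1, 1, 9, 0, 0),   # assumed initialisation record
    1: datetime(2024, 1, 1, 9, 0, 5),
    2: datetime(2024, 1, 1, 9, 1, 30),
}

first_key = min(record_times)
latest_key = max(record_times)

# Overall elapsed time from the first record to the latest record
overall_delta = record_times[latest_key] - record_times[first_key]
```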

df(
record_identifier: int | str,
recreate=False,
) DataFrame

Return the dataframe corresponding to the provided record_identifier.

Parameters:
  • record_identifier – The record identifier, which can be an integer or a string.

  • recreate – A boolean indicating whether to recreate the dataframe from the data in the record. Defaults to False.

Returns:

The dataframe of the specified record.

Return type:

pd.DataFrame

dt(
record_identifier: int | str,
) datetime

Return the datetime corresponding to the provided record_identifier.

Parameters:

record_identifier – The record identifier, which can be an integer or a string.

Returns:

The datetime of the specified record.

Return type:

datetime

get_output(
record_identifiers: List[int | str] | int | str | None = None,
output_type: str = 'dataframe',
rebase: bool = True,
combine: bool = True,
recreate: bool = False,
) DataFrame | List[DataFrame] | Dict[int | str, DataFrame]

Retrieve the specified records in the requested output format.

Parameters:
  • record_identifiers – Optional list of record identifiers (keys or titles). If None, the latest record is used. Defaults to None.

  • output_type – The type of output to be returned (‘dataframe’, ‘list’, ‘nested list’, or ‘dict’). Defaults to “dataframe”.

  • rebase – Boolean flag to join the records onto the base dataframe. Defaults to True.

  • combine – Boolean flag to combine the records before joining onto the base dataframe. Defaults to True.

  • recreate – Boolean flag to recreate dataframes from record data instead of using existing dataframes. Defaults to False.

Returns:

The output in the specified format.

Return type:

Union[pd.DataFrame, List[pd.DataFrame], Dict[Union[int, str], pd.DataFrame]]
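
The combine and rebase flags can be pictured with plain dictionaries keyed by the shared id. The sketch below is a simplified model under the assumption that combining concatenates values per id and rebasing joins them onto the base rows; the real method works with DataFrames, and the combine and rebase helper functions here are hypothetical.

```python
from typing import Dict, List, Union

DataDict = Dict[Union[int, str], List]


def combine(records: List[DataDict]) -> DataDict:
    """Concatenate the value lists of several records, id by id."""
    combined: DataDict = {}
    for data in records:
        for record_id, values in data.items():
            combined.setdefault(record_id, []).extend(values)
    return combined


def rebase(base: DataDict, data: DataDict) -> DataDict:
    """Join data onto the base rows using the shared id."""
    return {record_id: base[record_id] + data[record_id] for record_id in base}


base = {1: ["Loved the crust"], 2: ["Cold on arrival"]}
categories = {1: ["food quality"], 2: ["delivery"]}
sentiment = {1: ["positive"], 2: ["negative"]}

# Combine the two records first, then join the result onto the base rows
output = rebase(base, combine([categories, sentiment]))
```

Because every structure is keyed by the same id, the join needs no matching logic beyond a key lookup.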

property latest_column_names: List[str]

Returns the column names of the latest record.

Returns:

The list of column names of the latest record.

Return type:

List[str]

property latest_data: Dict[int | str, List]

Returns the data of the latest record.

Returns:

The data dictionary of the latest record.

Return type:

DataDict

property latest_df: DataFrame

Returns the DataFrame of the latest record.

Parameters:

recreate – Whether to recreate the DataFrame from the data. Defaults to False.

Returns:

The DataFrame of the latest record.

Return type:

pd.DataFrame

property latest_dt: datetime

Returns the datetime of the latest record.

Returns:

The datetime of the latest record.

Return type:

datetime

property latest_key: int

Returns the key of the latest record.

Returns:

The key of the latest record.

Return type:

int

property latest_record: Dict[str, str | None | List[str] | datetime | timedelta | Dict[int | str, List] | DataFrame]

Returns the latest record dictionary.

Returns:

The latest record data.

Return type:

RecordDict

property latest_record_delta: timedelta

Returns the timedelta of the latest record.

Returns:

The timedelta of the latest record.

Return type:

timedelta

property latest_title: str

Returns the title of the latest record.

Returns:

The title of the latest record.

Return type:

str

record(
record_identifier: int | str,
) Dict[str, str | None | List[str] | datetime | timedelta | Dict[int | str, List] | DataFrame]

Return the record corresponding to the provided record number or record title.

Parameters:

record_identifier – The identifier for the record, which can be either an integer (record number) or a string (record title).

Returns:

The record dict corresponding to the provided identifier.

Return type:

RecordDict

Raises:

TypeError – If the provided record_identifier is not an integer or string.
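
Resolving an identifier that may be either a record number or a record title can be sketched as below. The records dictionary and both functions are simplified stand-ins, not the library's code; the titles "initialisation" and "sanitised" are hypothetical.

```python
from typing import Dict, Union

# Hypothetical stand-in for the records dictionary: integer keys -> record dicts
records: Dict[int, dict] = {
    0: {"title": "initialisation"},
    1: {"title": "sanitised"},
    2: {"title": "llm_category_sentiment"},
}


def title_key(title: str) -> int:
    """Return the integer key of the record with the given title."""
    for key, record_dict in records.items():
        if record_dict["title"] == title:
            return key
    # TypeError mirrors the exception named in this reference for a missing title
    raise TypeError(f"no record titled {title!r}")


def record(record_identifier: Union[int, str]) -> dict:
    """Look a record up by number (int) or by title (str)."""
    if isinstance(record_identifier, int):
        return records[record_identifier]
    if isinstance(record_identifier, str):
        return records[title_key(record_identifier)]
    raise TypeError("record_identifier must be an int or a str")


by_number = record(2)
by_title = record("llm_category_sentiment")
```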

record_delta(
record_identifier: int | str,
) timedelta

Return the timedelta corresponding to the provided record_identifier.

Parameters:

record_identifier – The record identifier, which can be an integer or a string.

Returns:

The timedelta of the specified record.

Return type:

timedelta

set_expected_len(
value: int,
)

Set the expected length of the data lists in records.

Parameters:

value – The expected length for each list in the records data.

Returns:

None

set_llm_handler(
provider: str,
model: str,
system_message: str,
validation_model,
cache_identifier: str,
use_cache: bool = True,
temperature: float = 0.2,
max_validation_retries: int = 2,
max_preprepared_coroutines: int = 10,
max_awaiting_coroutines: int = 100,
)

Create the LLMHandler instance for the Cascade instance.

Rather than taking an input_data argument, it always uses self.latest_data. This can be changed. It also passes a reference to the current Cascade instance up through the kwargs.

Parameters:
  • provider – The name of the LLM provider.

  • model – The specific model to be used.

  • system_message – The system message to be used by the LLM.

  • validation_model – The model used for data validation.

  • cache_identifier – The identifier for cache storage.

  • use_cache – Whether to use caching. Defaults to True.

  • temperature – The sampling temperature for the LLM. Defaults to 0.2.

  • max_validation_retries – The maximum number of validation retries. Defaults to 2.

  • max_preprepared_coroutines – The maximum number of pre-prepared coroutines. Defaults to 10.

  • max_awaiting_coroutines – The maximum number of awaiting coroutines. Defaults to 100.

Returns:

None
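
The role of max_validation_retries can be illustrated with a generic retry loop. This is a sketch of the general pattern only: call_with_validation, fake_llm, and is_valid are hypothetical, and counting one initial attempt plus the retries is an assumption about how the limit is applied.

```python
from typing import Callable


def call_with_validation(
    generate: Callable[[], dict],
    validate: Callable[[dict], bool],
    max_validation_retries: int = 2,
) -> dict:
    """Call generate until validate accepts the result or the retries run out."""
    attempts = 1 + max_validation_retries  # assumed: first try plus retries
    for _ in range(attempts):
        result = generate()
        if validate(result):
            return result
    raise ValueError(f"no valid response after {attempts} attempts")


# Canned responses standing in for LLM output: the first one is invalid
responses = iter([{"sentiment": "great"}, {"sentiment": "positive"}])
calls = []


def fake_llm() -> dict:
    response = next(responses)
    calls.append(response)
    return response


def is_valid(response: dict) -> bool:
    return response.get("sentiment") in {"positive", "neutral", "negative"}


result = call_with_validation(fake_llm, is_valid)
```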

title(
record_identifier: int | str,
) str

Return the title corresponding to the provided record_identifier.

Parameters:

record_identifier – The record identifier, which can be an integer or a string.

Returns:

The title of the specified record.

Return type:

str

title_key(
title: str,
) int

Return the record number for a given title.

Parameters:

title – The title of the record to retrieve the key for.

Returns:

The key of the record associated with the given title.

Return type:

int

Raises:

TypeError – If the provided title does not exist in the records.

write_output(
file_type: str,
file_name_prefix: str,
record_identifiers: List[int | str] | int | str | None = None,
rebase: bool = True,
combine: bool = True,
xlsx_use_sheets: bool = True,
recreate: bool = False,
) Self

Write the output of the selected records to a file or files.

Parameters:
  • file_type – The type of file to write the output to. Can be ‘csv’ or ‘xlsx’.

  • file_name_prefix – The prefix to be used for the output file name.

  • record_identifiers – The identifiers for the records to be included in the output. Can be a single identifier or a list of identifiers. Defaults to None, which means the latest record is used.

  • rebase – If True, the output dataframes are joined onto the base dataframe. Defaults to True.

  • combine – If True, the records are combined before joining onto the base dataframe or returning. Defaults to True.

  • xlsx_use_sheets – If True and file_type is ‘xlsx’, writes each record to its own sheet in the same file. Defaults to True.

  • recreate – If True, the dataframes are recreated from the data in the records instead of using existing dataframes. Defaults to False.

Returns:

The Cascade object, allowing further chained operations.

Return type:

Self
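
Returning the object itself is what makes chained calls possible. The minimal sketch below uses a hypothetical Pipeline class that records requested file names instead of writing files; it is not the Cascade implementation, and the file naming is invented for the example.

```python
from typing import List


class Pipeline:
    """Hypothetical stand-in for an object whose write method returns itself."""

    def __init__(self) -> None:
        self.written: List[str] = []

    def write_output(self, file_type: str, file_name_prefix: str) -> "Pipeline":
        # Record the request instead of touching the filesystem
        self.written.append(f"{file_name_prefix}.{file_type}")
        # Returning self lets the caller chain another call straight away
        return self


pipeline = Pipeline().write_output("csv", "reviews").write_output("xlsx", "reviews")
```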