Data types

DataDict

The DataDict type is the required format for operations within the Cascade and other parts of the glyphdeck library.

glyphdeck.DataDict: alias of Dict[int | str, List]

A basic dict in which each value is a list.

The key can be either an int or a str, as long as it is unique. This corresponds to the id_column argument in Cascade.

The list contains the data to be processed, with each item representing the data for that column. This corresponds to the data_columns argument in Cascade.

example: DataDict = {
    1: ["Delicious and fresh", "Rich culture"],
    2: ["Oversalted and soggy", "Warm but crowded"],
    3: ["Comforting and cheesy", "Historical beauty"],
}
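As a minimal sketch of the shape described above, the following check confirms that every key is an int or a str and that every value is a list. The helper name `is_valid_datadict` is hypothetical and not part of glyphdeck, the equal-length rule is an assumption drawn from "one item per data column", and the alias is redefined locally so the snippet runs without the library installed.

```python
from typing import Dict, List, Union

# Local stand-in for glyphdeck.DataDict so the sketch runs on its own
DataDict = Dict[Union[int, str], List]


def is_valid_datadict(data: DataDict) -> bool:
    """Hypothetical helper: check key types, value types, and row lengths."""
    if not all(isinstance(key, (int, str)) for key in data):
        return False
    if not all(isinstance(row, list) for row in data.values()):
        return False
    # Assumption: every row holds exactly one item per data column
    lengths = {len(row) for row in data.values()}
    return len(lengths) <= 1


example: DataDict = {
    1: ["Delicious and fresh", "Rich culture"],
    2: ["Oversalted and soggy", "Warm but crowded"],
    3: ["Comforting and cheesy", "Historical beauty"],
}

print(is_valid_datadict(example))  # True
```

Uniqueness of the keys needs no separate check here, since a dict cannot hold the same key twice.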

Tip

Use glyphdeck.prepare to get back a tuple containing your dataframe and the same data as a DataDict.

Note

Cascade converts your data into a DataDict automatically when you create an instance.

glyphdeck.prepare(data_source: str | DataFrame, id_column: str, data_columns: str | List[str], encoding: str, sheet: str | int) → Tuple[DataFrame, Dict[int | str, List]]

Conditionally prepares data from various formats into a common DataDict format.

Depending on the input format (dataframe, CSV file, or XLSX file), this function runs the appropriate preparation routine to convert the data into the common dictionary format.

Parameters:

data_source (Union[str, pd.DataFrame]) – The data source to be prepared. This can be a dataframe, a CSV file path, or an XLSX file path.
id_column (str) – The name of the column that contains unique IDs.
data_columns (Union[str, List[str]]) – A single column name or a list of column names that contain the data to be extracted.
encoding (str) – The encoding to use when reading text files.
sheet (Union[str, int]) – The name or number of the sheet to read from in an XLSX file.

Returns:

A tuple containing the prepared dataframe and a dictionary with IDs as keys and lists of column values as values.

Return type:

Tuple[pd.DataFrame, DataDict]

Raises:

AssertionError – If data_source is not a dataframe or a string path to a CSV/XLSX file.
FileNotFoundError – If the specified CSV/XLSX file does not exist.
ValueError – If the file cannot be read as a CSV/XLSX.
AssertionError – If there are issues validating id_column or data_columns.
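To illustrate the kind of conversion described above, here is a rough standard-library sketch that turns CSV rows into the DataDict shape. It is not glyphdeck's implementation: `rows_to_datadict` is a hypothetical helper, it reads the file with `csv.DictReader` rather than a dataframe, and the in-memory CSV text is made up for the example.

```python
import csv
import io
from typing import Dict, List, Union

# Local stand-in for glyphdeck.DataDict so the sketch runs on its own
DataDict = Dict[Union[int, str], List]


def rows_to_datadict(rows, id_column: str, data_columns: Union[str, List[str]]) -> DataDict:
    """Hypothetical stand-in for the conversion step; not glyphdeck's code."""
    # Accept a single column name or a list, mirroring the data_columns parameter
    columns = [data_columns] if isinstance(data_columns, str) else list(data_columns)
    result: DataDict = {}
    for row in rows:
        key = row[id_column]
        # IDs must be unique because they become dictionary keys
        assert key not in result, f"duplicate id: {key!r}"
        result[key] = [row[column] for column in columns]
    return result


# Illustrative CSV content held in memory instead of a file on disk
csv_text = (
    "id,Food Review,Country Review\n"
    "1,Delicious and fresh,Rich culture\n"
    "2,Oversalted and soggy,Warm but crowded\n"
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
data = rows_to_datadict(rows, "id", ["Food Review", "Country Review"])
print(data)
```

In this sketch the keys come out as the strings "1" and "2", because the csv module reads every field as text. The DataDict alias allows either int or str keys.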

Record types

The record types used to pass data through the Cascade.

Tip

These are generated inside the Cascade. You can easily access & manipulate the records using its properties and methods.

glyphdeck.RecordDict: alias of Dict[str, str | None | List[str] | datetime | timedelta | Dict[int | str, List] | DataFrame]

Metadata for data entries in the Cascade. One is recorded each time new or transformed data is appended into a Cascade instance.

The value stored under the "data" key is a DataDict.

{
    "title": "Reviews",
    "dt": datetime.datetime(2024, 10, 8, 17, 45, 2, 285588),
    "delta": datetime.timedelta(0),
    "data": {
        1: ["Delicious and fresh", "Rich culture"],
        2: ["Oversalted and soggy", "Warm but crowded"],
        3: ["Comforting and cheesy", "Historical beauty"],
    },
    "column_names": [
        "Food Review",
        "Country Review"
    ],
}
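The record above can be read back into per-ID rows by pairing each value with its column name. This is a small sketch under the assumption that "column_names" lines up with the lists in "data" by position, as the example suggests. `record_to_rows` is a hypothetical helper, not a glyphdeck function.

```python
from datetime import datetime, timedelta

# The example record from above, written out as a plain dict
record = {
    "title": "Reviews",
    "dt": datetime(2024, 10, 8, 17, 45, 2, 285588),
    "delta": timedelta(0),
    "data": {
        1: ["Delicious and fresh", "Rich culture"],
        2: ["Oversalted and soggy", "Warm but crowded"],
        3: ["Comforting and cheesy", "Historical beauty"],
    },
    "column_names": ["Food Review", "Country Review"],
}


def record_to_rows(record):
    """Hypothetical helper: pair each value with its column name by position."""
    rows = []
    for key, values in record["data"].items():
        row = {"id": key}
        # Assumption: column_names[i] names the i-th item of every list
        row.update(zip(record["column_names"], values))
        rows.append(row)
    return rows


rows = record_to_rows(record)
print(rows[0])
# {'id': 1, 'Food Review': 'Delicious and fresh', 'Country Review': 'Rich culture'}
```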

glyphdeck.RecordsDict: alias of Dict[int, Dict[str, str | None | List[str] | datetime | timedelta | Dict[int | str, List] | DataFrame]]

Stores multiple individual records in order of addition, making them easily available for access via the properties and methods of the Cascade class.

Each individual record only contains the current version of the data.

For example, continuing the previous example, the records would look like this after a Sentiment validator was run on the data:

{
    0: { ... },
    1: { ... },
    2: {
        "title": "LLM Sentiment",
        "dt": datetime.datetime(2024, 10, 8, 17, 59, 28, 207103),
        "delta": datetime.timedelta(microseconds=218445),
        "data": {
            1: [0.8, 0.8],
            2: [-0.75, 0.2],
            3: [0.75, 0.5]
        },
        "column_names": [
            "Food Review_sentiment_score",
            "Country Review_sentiment_score"
        ],
    },
}
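A rough sketch of how records keyed by integers can be appended and read back follows. Inside the library this is handled by the Cascade. Here `append_record` and `latest_record` are hypothetical helpers, and the consecutive keys starting at 0 are an assumption based on the example. Only two shortened records are used, so the keys differ from the example above.

```python
from datetime import datetime, timedelta


def append_record(records, record):
    """Hypothetical helper: store a record under the next integer key."""
    # Assumption: keys are consecutive integers starting at 0
    next_key = max(records) + 1 if records else 0
    records[next_key] = record
    return next_key


def latest_record(records):
    """Hypothetical helper: return the most recently added record."""
    return records[max(records)]


records = {}
append_record(records, {
    "title": "Reviews",
    "dt": datetime(2024, 10, 8, 17, 45, 2, 285588),
    "delta": timedelta(0),
    "data": {1: ["Delicious and fresh", "Rich culture"]},
    "column_names": ["Food Review", "Country Review"],
})
append_record(records, {
    "title": "LLM Sentiment",
    "dt": datetime(2024, 10, 8, 17, 59, 28, 207103),
    "delta": timedelta(microseconds=218445),
    "data": {1: [0.8, 0.8]},
    "column_names": ["Food Review_sentiment_score", "Country Review_sentiment_score"],
})

print(latest_record(records)["title"])  # LLM Sentiment
```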
