sanitiser

About

The sanitiser is used to identify and replace pieces of private information within DataDicts using regular expression patterns.

It supports sanitisation of emails, URLs, file paths, folder paths, dates and numbers, as well as any custom regex patterns you add.
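The idea can be sketched with the standard library alone. The snippet below is not the library's implementation: the email regex, the `<EMAIL>` placeholder and the `sanitise_emails` helper are all hypothetical, and the input simply mimics the `Dict[int | str, List]` shape that BaseSanitiser accepts.

```python
import re
from typing import Dict, List, Union

# Hypothetical, simplified email regex for illustration only
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def sanitise_emails(data: Dict[Union[int, str], List]) -> Dict[Union[int, str], List]:
    """Replace email addresses in every string item with a placeholder."""
    return {
        key: [
            # Non-string items are passed through untouched
            EMAIL_PATTERN.sub("<EMAIL>", item) if isinstance(item, str) else item
            for item in items
        ]
        for key, items in data.items()
    }


example = {1: ["Contact jane.doe@example.com for access", 42]}
sanitised = sanitise_emails(example)
print(sanitised)
```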

Usage

Each Cascade is initialised with an instance of the BaseSanitiser class.

This can be accessed like so:

cascade = gd.Cascade(...)
cascade.sanitiser.[...]
Cascade.sanitiser

Returns the sanitiser object, updating it with provided or latest data.

Returns:

The updated sanitiser object.

Return type:

Sanitiser

Cascade Extension

The .sanitiser available in the Cascade instance adds some extra functionality on top of that provided by the BaseSanitiser class.

defaults

Uses the DataDict from the latest record by default.

selected_data

Data to use instead of the default. Only used when use_selected is True.

Type:

DataDict

use_selected

Whether to use selected data or not.

Type:

bool

run(
title: str = 'sanitised',
)

Run the sanitiser and append the result to the cascade.

Parameters:

title (str) – The title to be given to the sanitised record. Defaults to “sanitised”.

Returns:

The sanitiser instance, which can be used to chain further operations.

Return type:

sanitiser

Raises:

AssertionError – If the provided title argument is not a string.

BaseSanitiser

The .sanitiser also inherits the features of the BaseSanitiser.

class glyphdeck.processors.sanitiser.BaseSanitiser(
input_data: Dict[int | str, List],
pattern_groups: List = None,
)

Bases: object

Sanitises strings by replacing private information with placeholders.

It can be used on its own from this module, but is also available in a more streamlined way through the Cascade class.

email_regex

A regex pattern string for matching email addresses.

Type:

str

email_pattern

A compiled regex pattern for matching email addresses.

Type:

re.Pattern

folder_path_regex

A regex pattern string for matching folder paths.

Type:

str

folder_path_pattern

A compiled regex pattern for matching folder paths.

Type:

re.Pattern

file_path_regex

A regex pattern string for matching full file paths.

Type:

str

file_path_pattern

A compiled regex pattern for matching full file paths.

Type:

re.Pattern

url_regex

A regex pattern string for matching URLs.

Type:

str

url_pattern

A compiled regex pattern for matching URLs.

Type:

re.Pattern

date_regex1

A regex pattern string for matching dates in the form dd-mm-yyyy.

Type:

str

date_pattern1

A compiled regex pattern for matching dates in the form dd-mm-yyyy.

Type:

re.Pattern

date_regex2

A regex pattern string for matching dates like 1 Jan 22 and variations.

Type:

str

date_pattern2

A compiled regex pattern for matching dates like 1 Jan 22 and variations.

Type:

re.Pattern

date_regex3

A regex pattern string for matching dates like 1-mar-2022 and variations.

Type:

str

date_pattern3

A compiled regex pattern for matching dates like 1-mar-2022 and variations.

Type:

re.Pattern
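To make the three date forms concrete, here is a rough sketch using hypothetical stand-ins for date_regex1 to date_regex3. The library's real regexes are not shown in this document and may well differ; the `replace_dates` helper and the `<DATE>` default are likewise assumptions.

```python
import re

# Month abbreviations, optionally followed by the rest of the month name
MONTHS = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"

# Hypothetical stand-ins; the library's actual regex strings may differ
date_regex1 = r"\b\d{1,2}-\d{1,2}-\d{4}\b"              # e.g. 25-12-2022
date_regex2 = rf"\b\d{{1,2}}\s+{MONTHS}\s+\d{{2,4}}\b"  # e.g. 1 Jan 22
date_regex3 = rf"\b\d{{1,2}}-{MONTHS}-\d{{2,4}}\b"      # e.g. 1-mar-2022

# Each regex string has a compiled counterpart, mirroring the *_regex / *_pattern pairs
date_patterns = [
    re.compile(regex, re.IGNORECASE)
    for regex in (date_regex1, date_regex2, date_regex3)
]


def replace_dates(text: str, placeholder: str = "<DATE>") -> str:
    """Apply each date pattern in turn, substituting the placeholder."""
    for pattern in date_patterns:
        text = pattern.sub(placeholder, text)
    return text


print(replace_dates("Due 25-12-2022, sent 1 Jan 22, paid 1-mar-2022"))
```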

number_regex

A regex pattern string for matching words that contain one or more digits.

Type:

str

number_pattern

A compiled regex pattern for matching words that contain one or more digits.

Type:

re.Pattern
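Each *_regex attribute is a plain string and each *_pattern attribute is its compiled form. A minimal sketch of that relationship for the number pattern follows; the regex string and the `<NUM>` placeholder here are assumptions, not the library's actual values.

```python
import re

# Hypothetical regex string: any word containing at least one digit
number_regex = r"\b\w*\d\w*\b"

# The compiled counterpart, as re.compile would produce it
number_pattern = re.compile(number_regex)

text = "Order A123 shipped to unit 7 on level two"
result = number_pattern.sub("<NUM>", text)
print(result)
```

Note that a word such as "two" is left alone because it contains no digit, while "A123" is replaced as a whole.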

overall_run_state

Indicates if any sanitisation has been run.

Type:

bool

active_groups

Active group names from the patterns dictionary.

Type:

List[str]

inactive_groups

Inactive group names from the patterns dictionary.

Type:

List[str]

PatternsDict

alias of Dict[str, Dict[str, str | float | Pattern[str]]]

add_pattern(
pattern_name: str,
group: str,
placeholder: str,
rank: float,
regex: str,
)

Add a new pattern to the BaseSanitiser.

Parameters:
  • pattern_name – The unique name for the pattern.

  • group – The group to which the new pattern belongs.

  • placeholder – The placeholder to substitute matches with.

  • rank – The rank indicating the order in which to process this pattern.

  • regex – The regex string to compile and use for matching.

Returns:

None

Raises:

TypeError – If the placeholder contains invalid characters.
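A simplified sketch of what registering a pattern might involve is given below. It is a standalone stand-in rather than the library's code: the module-level `patterns` dict, the "ticket" example and its regex are hypothetical, and the placeholder validation that raises TypeError is deliberately not modelled because its rules are not described here.

```python
import re
from typing import Dict, Union

PatternsDict = Dict[str, Dict[str, Union[str, float, re.Pattern]]]

# Hypothetical registry standing in for BaseSanitiser.patterns
patterns: PatternsDict = {}


def add_pattern(pattern_name: str, group: str, placeholder: str, rank: float, regex: str) -> None:
    """Compile the regex and store the entry under its unique name."""
    patterns[pattern_name] = {
        "group": group,
        "placeholder": placeholder,
        "rank": rank,
        "pattern": re.compile(regex),
    }


# Hypothetical custom pattern for ticket references such as TKT-1042
add_pattern("ticket", "ticket", "<TICKET>", 0.5, r"\bTKT-\d+\b")

entry = patterns["ticket"]
print(entry["pattern"].sub(entry["placeholder"], "See TKT-1042"))
```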

select_groups(
pattern_groups: List[str],
) -> Self

Activates or deactivates pattern groups.

Parameters:

pattern_groups – A list of pattern groups to activate. All others are deactivated.

Returns:

The updated instance of the BaseSanitiser class.

Return type:

Self

Raises:

KeyError – If a provided group does not exist in the available patterns.
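The activate-listed, deactivate-the-rest behaviour can be sketched with a small stand-in class. `GroupSelector` is hypothetical and not part of the library; it also assumes that every group starts out active, which this document does not state.

```python
from typing import Dict, List


class GroupSelector:
    """Hypothetical stand-in for the group selection behaviour."""

    def __init__(self, groups: List[str]) -> None:
        # Assumption: every known group starts active
        self._state: Dict[str, bool] = {group: True for group in groups}

    @property
    def active_groups(self) -> List[str]:
        return [group for group, on in self._state.items() if on]

    @property
    def inactive_groups(self) -> List[str]:
        return [group for group, on in self._state.items() if not on]

    def select_groups(self, pattern_groups: List[str]) -> "GroupSelector":
        # Validate before mutating so a bad name leaves the state unchanged
        unknown = [group for group in pattern_groups if group not in self._state]
        if unknown:
            raise KeyError(f"Unknown pattern groups: {unknown}")
        for group in self._state:
            self._state[group] = group in pattern_groups
        # Returning self is what allows calls to be chained
        return self


selector = GroupSelector(["email", "url", "date"]).select_groups(["email", "date"])
print(selector.active_groups, selector.inactive_groups)
```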

set_placeholders(
placeholder_dict: Dict[str, str],
) -> Self

Set custom placeholders for the patterns.

Parameters:

placeholder_dict – A dictionary with group names as keys and custom placeholders as values.

Returns:

The updated instance of the BaseSanitiser class.

Return type:

Self

Raises:

KeyError – If a provided key does not exist in the available patterns.
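Because the keys are group names, one entry in placeholder_dict can affect several patterns at once. The sketch below illustrates this with a hypothetical module-level `patterns` dict and pattern names; it is an approximation of the described behaviour, not the library's code, and it returns None rather than Self for brevity.

```python
from typing import Dict

# Hypothetical pattern entries (compiled regexes omitted for brevity)
patterns = {
    "date1": {"group": "date", "placeholder": "<DATE>"},
    "date2": {"group": "date", "placeholder": "<DATE>"},
    "email": {"group": "email", "placeholder": "<EMAIL>"},
}


def set_placeholders(placeholder_dict: Dict[str, str]) -> None:
    """Set the placeholder of every pattern whose group is a key in placeholder_dict."""
    known_groups = {entry["group"] for entry in patterns.values()}
    # Reject unknown groups before changing anything
    for group in placeholder_dict:
        if group not in known_groups:
            raise KeyError(group)
    for entry in patterns.values():
        if entry["group"] in placeholder_dict:
            entry["placeholder"] = placeholder_dict[entry["group"]]


set_placeholders({"date": "[REDACTED DATE]"})
print(patterns["date1"]["placeholder"], patterns["email"]["placeholder"])
```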

BaseSanitiser.patterns: Dict[str, Dict[str, str | float | Pattern[str]]]

Note

>>> # Stores patterns, placeholders & groupings used to sanitise data
>>> # Adding patterns with Sanitiser methods will insert them here
>>> # Each entry is keyed by its unique pattern name
>>> {
>>>     "<pattern_name>": {
>>>         "group": "date",
>>>         "placeholder": "<DATE>",
>>>         "rank": 1,
>>>         "pattern": _date_pattern1,
>>>     },
>>>     ...
>>> }
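The rank field matters because patterns can overlap. The following sketch shows why, under the assumption that lower rank values are processed first (the document does not state the direction). The pattern names, regexes and the `sanitise` helper are hypothetical.

```python
import re

# Hypothetical entries in the same shape as the patterns dictionary
patterns = {
    "number": {
        "group": "number",
        "placeholder": "<NUM>",
        "rank": 2,
        "pattern": re.compile(r"\b\w*\d\w*\b"),
    },
    "date1": {
        "group": "date",
        "placeholder": "<DATE>",
        "rank": 1,
        "pattern": re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b"),
    },
}


def sanitise(text: str) -> str:
    """Apply every pattern in ascending rank order."""
    # Assumption: lower rank values are processed first
    for entry in sorted(patterns.values(), key=lambda e: e["rank"]):
        text = entry["pattern"].sub(entry["placeholder"], text)
    return text


result = sanitise("Invoice 881 issued 25-12-2022")
print(result)
```

If the number pattern ran before the date pattern in this sketch, the date would be broken into `<NUM>-<NUM>-<NUM>` and the date pattern would no longer match it.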