sanitiser¶
About¶
The sanitiser is used to identify and replace pieces of private information
within DataDicts using regular expression patterns.
It supports sanitisation of emails, URLs, file paths, folder paths, dates and numbers, as well as any custom regex patterns you add.
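As a rough illustration of the idea (not glyphdeck's actual implementation), the sketch below replaces emails and URLs in a string with placeholders using Python's `re` module. The regexes, the placeholder strings and the `sanitise_text` helper are simplified assumptions made for this example.

```python
import re

# Simplified, hypothetical patterns for illustration only;
# the library's own regexes are likely more thorough.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"https?://\S+")


def sanitise_text(text: str) -> str:
    """Replace emails and URLs with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = URL.sub("<URL>", text)
    return text


example = "Contact jane.doe@example.com or visit https://example.com/help"
print(sanitise_text(example))
```

The private values are swapped for fixed tokens, so the surrounding text stays readable while the sensitive parts are removed.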
Usage¶
Each Cascade is initialised with an instance of the BaseSanitiser class.
This can be accessed like so:
import glyphdeck as gd

cascade = gd.Cascade(...)  # constructor arguments elided
cascade.sanitiser.[...]    # any sanitiser attribute or method
- Cascade.sanitiser¶
Returns the sanitiser object, updating it with provided or latest data.
- Returns:
The updated sanitiser object.
- Return type:
Sanitiser
Cascade Extension¶
The .sanitiser available in the Cascade instance has some extra
functionality added on top of that provided by the BaseSanitiser class.
- defaults¶
Uses the DataDict from the latest record by default.
- selected_data¶
Data to use instead of the default. Only used when use_selected is True.
- Type:
DataDict
- use_selected¶
Whether to use selected data or not.
- Type:
bool
- run(title: str = 'sanitised')
Run the sanitiser and append the result to the cascade.
- Parameters:
title (str) – The title to be given to the sanitised record. Defaults to “sanitised”.
- Returns:
The sanitiser instance, so that further operations can be chained onto it.
- Return type:
sanitiser
- Raises:
AssertionError – If the provided title argument is not a string.
BaseSanitiser¶
The .sanitiser also inherits the features of the BaseSanitiser.
- class glyphdeck.processors.sanitiser.BaseSanitiser(input_data: Dict[int | str, List], pattern_groups: List = None)
Bases: object
Sanitises strings by replacing private information with placeholders.
It can be used on its own from this module, or accessed in a more streamlined way through the Cascade class.
- email_regex¶
A regex pattern string for matching email addresses.
- Type:
str
- email_pattern¶
A compiled regex pattern for matching email addresses.
- Type:
re.Pattern
- folder_path_regex¶
A regex pattern string for matching folder paths.
- Type:
str
- folder_path_pattern¶
A compiled regex pattern for matching folder paths.
- Type:
re.Pattern
- file_path_regex¶
A regex pattern string for matching full file paths.
- Type:
str
- file_path_pattern¶
A compiled regex pattern for matching full file paths.
- Type:
re.Pattern
- url_regex¶
A regex pattern string for matching URLs.
- Type:
str
- url_pattern¶
A compiled regex pattern for matching URLs.
- Type:
re.Pattern
- date_regex1¶
A regex pattern string for matching dates in the form dd-mm-yyyy.
- Type:
str
- date_pattern1¶
A compiled regex pattern for matching dates in the form dd-mm-yyyy.
- Type:
re.Pattern
- date_regex2¶
A regex pattern string for matching dates like 1 Jan 22 and variations.
- Type:
str
- date_pattern2¶
A compiled regex pattern for matching dates like 1 Jan 22 and variations.
- Type:
re.Pattern
- date_regex3¶
A regex pattern string for matching dates like 1-mar-2022 and variations.
- Type:
str
- date_pattern3¶
A compiled regex pattern for matching dates like 1-mar-2022 and variations.
- Type:
re.Pattern
- number_regex¶
A regex pattern string for matching words that contain one or more digits.
- Type:
str
- number_pattern¶
A compiled regex pattern for matching words that contain one or more digits.
- Type:
re.Pattern
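Each `*_regex` attribute above is a plain string and each `*_pattern` attribute is its compiled form. The sketch below shows that relationship with two stand-in regexes: one for dd-mm-yyyy dates and one for words containing a digit. Both regexes and both placeholders are assumptions written for this example and are not copied from BaseSanitiser.

```python
import re

# Hypothetical stand-ins for date_regex1 and number_regex
date_regex = r"\b\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}\b"
number_regex = r"\b\w*\d\w*\b"

# Compiling once lets the pattern be reused across many strings
date_pattern = re.compile(date_regex)
number_pattern = re.compile(number_regex)

text = "Order A123 shipped on 05-03-2022 to unit 7"

# Dates are replaced first, then any remaining digit-bearing words
dates_removed = date_pattern.sub("<DATE>", text)
numbers_removed = number_pattern.sub("<NUM>", dates_removed)
print(numbers_removed)
```

Running the number pattern first would turn the date into `<NUM>-<NUM>-<NUM>`, which is the kind of ordering problem that the rank value taken by add_pattern appears intended to address.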
- overall_run_state¶
Indicates if any sanitisation has been run.
- Type:
bool
- active_groups¶
Active group names from the patterns dictionary.
- Type:
List[str]
- inactive_groups¶
Inactive group names from the patterns dictionary.
- Type:
List[str]
- PatternsDict¶
alias of
Dict[str, Dict[str, str | float | Pattern[str]]]
- add_pattern(pattern_name: str, group: str, placeholder: str, rank: float, regex: str)
Add a new pattern to the BaseSanitiser.
- Parameters:
pattern_name – The unique name for the pattern.
group – The group to which the new pattern belongs.
placeholder – The placeholder to substitute matches with.
rank – The rank indicating the order in which to process this pattern.
regex – The regex string to compile and use for matching.
- Returns:
None
- Raises:
TypeError – If the placeholder contains invalid characters.
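To make the role of rank concrete, here is a minimal sketch of a pattern registry with the same fields as the documented entries. It is not the BaseSanitiser code: the module-level `patterns` dict, the `sanitise` helper, the example patterns and the choice to process the lowest rank first are all assumptions, and placeholder validation is left out.

```python
import re
from typing import Any, Dict

# Hypothetical registry; keys are pattern names, values mirror the documented fields
patterns: Dict[str, Dict[str, Any]] = {}


def add_pattern(pattern_name: str, group: str, placeholder: str,
                rank: float, regex: str) -> None:
    """Compile the regex and store it under a unique pattern name."""
    patterns[pattern_name] = {
        "group": group,
        "placeholder": placeholder,
        "rank": rank,
        "pattern": re.compile(regex),
    }


def sanitise(text: str) -> str:
    """Apply every stored pattern, lowest rank first (an assumed ordering)."""
    for entry in sorted(patterns.values(), key=lambda e: e["rank"]):
        text = entry["pattern"].sub(entry["placeholder"], text)
    return text


# Registered out of order on purpose; rank decides the processing order
add_pattern("number", "number", "<NUM>", 2, r"\b\w*\d\w*\b")
add_pattern("ticket", "ticket", "<TICKET>", 1, r"\bTKT-\d+\b")

print(sanitise("See TKT-4521 and call extension 204"))
```

In this sketch the more specific ticket pattern gets the lower rank so that it runs before the generic number pattern, which would otherwise consume the digits and leave `TKT-<NUM>` behind.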
- select_groups(pattern_groups: List[str])
Activates or deactivates pattern groups.
- Parameters:
pattern_groups – A list of pattern groups to activate. All others are deactivated.
- Returns:
The updated instance of the BaseSanitiser class.
- Return type:
Self
- Raises:
KeyError – If a provided group does not exist in the available patterns.
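The behaviour described for select_groups (activate the listed groups, deactivate the rest, raise KeyError for unknown names, return the instance) can be sketched as follows. `GroupSelector` is a hypothetical stand-in rather than the real class, and the group names other than "date" are assumed.

```python
from typing import Dict, List


class GroupSelector:
    """Hypothetical stand-in showing group activation, not the real class."""

    def __init__(self, groups: List[str]) -> None:
        # Every known group starts out active in this sketch
        self._active: Dict[str, bool] = {g: True for g in groups}

    def select_groups(self, pattern_groups: List[str]) -> "GroupSelector":
        # Reject unknown names before changing any state
        for group in pattern_groups:
            if group not in self._active:
                raise KeyError(f"Unknown pattern group: {group}")
        for group in self._active:
            self._active[group] = group in pattern_groups
        return self  # returning self allows method chaining

    @property
    def active_groups(self) -> List[str]:
        return [g for g, on in self._active.items() if on]

    @property
    def inactive_groups(self) -> List[str]:
        return [g for g, on in self._active.items() if not on]


selector = GroupSelector(["email", "url", "date", "number"])
selector.select_groups(["email", "date"])
print(selector.active_groups)
print(selector.inactive_groups)
```

Validating all names before touching any state means a failed call leaves the previous selection intact; whether the library does the same is not stated in this page.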
- set_placeholders(placeholder_dict: Dict[str, str])
Set custom placeholders for the patterns.
- Parameters:
placeholder_dict – A dictionary with group names as keys and custom placeholders as values.
- Returns:
The updated instance of the BaseSanitiser class.
- Return type:
Self
- Raises:
KeyError – If a provided key does not exist in the available patterns.
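Because placeholder_dict is keyed by group name, one entry can change the placeholder for several patterns at once. The sketch below illustrates that under assumptions: the pattern table, the pattern names and the replacement text are invented for the example, and this standalone function returns nothing instead of the instance.

```python
from typing import Dict

# Hypothetical pattern table; only the fields needed here are shown
patterns: Dict[str, Dict[str, str]] = {
    "date1": {"group": "date", "placeholder": "<DATE>"},
    "date2": {"group": "date", "placeholder": "<DATE>"},
    "email": {"group": "email", "placeholder": "<EMAIL>"},
}


def set_placeholders(placeholder_dict: Dict[str, str]) -> None:
    """Overwrite the placeholder of every pattern in each named group."""
    known_groups = {entry["group"] for entry in patterns.values()}
    for group, placeholder in placeholder_dict.items():
        if group not in known_groups:
            raise KeyError(f"Unknown pattern group: {group}")
        for entry in patterns.values():
            if entry["group"] == group:
                entry["placeholder"] = placeholder


set_placeholders({"date": "[REDACTED DATE]"})
print(patterns["date2"]["placeholder"])
```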
- BaseSanitiser.patterns: Dict[str, Dict[str, str | float | Pattern[str]]]¶
Note
>>> # Stores patterns, placeholders & groupings used to sanitise data
>>> # Adding patterns with Sanitiser methods will insert them here
>>> {
>>>     {
>>>         "group": "date",
>>>         "placeholder": "<DATE>",
>>>         "rank": 1,
>>>         "pattern": _date_pattern1,
>>>     },
>>>     ...
>>> }