transform_regex_extract
TransformRegexExtract
Bases: PipelineAction
Extract values from a specified column in a DataFrame using regex patterns.
This action extracts values from a column based on a regex pattern and stores the result in a new column. Optionally, you can replace the matched pattern in the original column with a different string, remove the original column, or add a boolean column indicating which rows matched the pattern.
Example
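A minimal sketch of the behaviour described above, written in plain Python with the standard `re` module and a list of dicts standing in for the DataFrame. The helper `regex_extract_rows`, its argument names, and the sample data are hypothetical. The real action may differ in details, for example in which match group is extracted or in what is stored for rows that do not match (an empty string is assumed here).

```python
import re


def regex_extract_rows(rows, source, target, pattern, replace_by=None,
                       keep_original=True, match_info=None):
    """Hypothetical helper that mimics the described behaviour on plain dicts."""
    compiled = re.compile(pattern)
    result = []
    for row in rows:
        new_row = dict(row)
        match = compiled.search(row[source])
        # Store the first match (assumption: empty string when nothing matches).
        new_row[target] = match.group(0) if match else ""
        if replace_by is not None:
            # Replace every occurrence of the pattern in the source column.
            new_row[source] = compiled.sub(replace_by, row[source])
        if match_info:
            # Boolean flag telling whether the pattern matched this row.
            new_row[match_info] = match is not None
        if not keep_original:
            del new_row[source]
        result.append(new_row)
    return result


rows = [{"order": "ID-123 shipped"}, {"order": "pending"}]
out = regex_extract_rows(rows, "order", "order_id", r"ID-\d+",
                         replace_by="", match_info="has_id")
print(out)
```

Here the first row yields `order_id == "ID-123"` with the match removed from `order`, while the second row is flagged as a non-match in `has_id`.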
This action also supports processing multiple columns simultaneously. To use this functionality, structure the configuration as a dictionary mapping each source column name to its extraction parameters.
Example
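The dictionary form can be pictured as follows. This is only a sketch: the inner keys are assumed to mirror the single-column parameters of `run` (`pattern`, `extract_column_name`), the column names and patterns are invented for illustration, and a plain dict stands in for a DataFrame row.

```python
import re

# Hypothetical configuration: one entry per source column. The inner keys are
# assumed to mirror the single-column parameters of `run`.
extract_columns = {
    "email": {"pattern": r"@[\w.]+$", "extract_column_name": "domain"},
    "phone": {"pattern": r"^\+\d{1,3}", "extract_column_name": "country_code"},
}

row = {"email": "jane@example.com", "phone": "+49 170 0000000"}

# Apply each column's extraction independently of the others.
for source_column_name, params in extract_columns.items():
    match = re.search(params["pattern"], row[source_column_name])
    row[params["extract_column_name"]] = match.group(0) if match else ""

print(row)
```

Each source column carries its own pattern and target column, so one action invocation can cover several unrelated extractions.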
Source code in src/cloe_nessy/pipeline/actions/transform_regex_extract.py
run(context, source_column_name='', extract_column_name='', pattern='', keep_original_column=True, replace_by='', match_info_column_name='', extract_columns=None, **_)
Performs a regex extract (and optionally a replace) on one or more columns of a DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | `PipelineContext` | The context in which this action is executed. | required |
| `source_column_name` | `str` | Name of the column to perform the regex extract (and optional replace) on. | `''` |
| `pattern` | `str` | Regex pattern to match. | `''` |
| `replace_by` | `str` | String that replaces the matched pattern in the source column. | `''` |
| `extract_column_name` | `str` | Name of the column that stores the extracted value. | `''` |
| `keep_original_column` | `bool` | Whether to keep the original column. | `True` |
| `match_info_column_name` | `str` | Name of a boolean column that indicates whether a match was found. | `''` |
| `extract_columns` | `dict \| None` | Dictionary mapping each source column name to its extraction parameters, given as in the single-column case. | `None` |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If any of the required arguments are not provided. |
| `ValueError` | If the regex pattern is invalid. |

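The two error cases can be illustrated with a small standalone check. This is a sketch and not the action's actual code: `validate_arguments` is a hypothetical helper, and treating `source_column_name`, `extract_column_name`, and `pattern` as the required arguments is an assumption.

```python
import re


def validate_arguments(source_column_name, extract_column_name, pattern):
    """Hypothetical checks mirroring the two documented ValueError cases."""
    # Assumption: these three arguments are the required ones.
    if not source_column_name or not extract_column_name or not pattern:
        raise ValueError(
            "source_column_name, extract_column_name and pattern are required."
        )
    try:
        # re.compile raises re.error for a malformed pattern.
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"Invalid regex pattern {pattern!r}: {exc}") from exc


for args in [("order", "order_id", ""), ("order", "order_id", "(unclosed")]:
    try:
        validate_arguments(*args)
    except ValueError as err:
        print(f"ValueError: {err}")
```

Compiling the pattern up front surfaces a malformed regex as a `ValueError` before any data is processed.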
Returns:
| Name | Type | Description |
|---|---|---|
| `PipelineContext` | `PipelineContext` | Transformed context with the modified DataFrame. |