transform_join
TransformJoinAction
¶
Bases: PipelineAction
Joins the current DataFrame with another DataFrame defined in joined_data.
The join operation is performed based on specified columns and the type of
join indicated by the how parameter. Supported join types can be taken
from PySpark
documentation
Examples:
Join Tables:
action: TRANSFORM_JOIN
options:
joined_data: ((step:Load Conditions Table))
join_condition: |
left.material = right.material
AND right.sales_org = '10'
AND right.distr_chan = '10'
AND right.knart = 'ZUVP'
AND right.lovmkond <> 'X'
AND right.sales_unit = 'ST'
AND left.calday BETWEEN
to_date(right.date_from, 'yyyyMMdd') AND
to_date(right.date_to, 'yyyyMMdd')
how: left
Referencing a DataFrame from another step
The joined_data parameter is a reference to the DataFrame from another step.
The DataFrame is accessed using the result attribute of the PipelineStep. The syntax
for referencing the DataFrame is ((step:Step Name)), mind the double parentheses.
Dictionary Join Syntax
When using a dictionary for join_on, the keys represent columns
from the DataFrame in context and the values represent columns from
the DataFrame in joined_data. This is useful when joining tables
with different column names for the same logical entity.
Complex Join Conditions
Use join_condition instead of join_on for complex joins with literals,
expressions, and multiple conditions. Reference columns using left.column_name
for the main DataFrame and right.column_name for the joined DataFrame.
Supports all PySpark functions and operators.
Source code in src/cloe_nessy/pipeline/actions/transform_join.py
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 | |
run(context, *, joined_data=None, join_on=None, join_condition=None, how='inner', **_)
¶
Joins the current DataFrame with another DataFrame defined in joined_data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
context
|
PipelineContext
|
Context in which this Action is executed. |
required |
joined_data
|
PipelineStep | None
|
The PipelineStep context defining the DataFrame to join with as the right side of the join. |
None
|
join_on
|
list[str] | str | dict[str, str] | None
|
A string for the join column name, a list of column names, or a dictionary mapping columns from the left DataFrame to the right DataFrame. This defines the condition for the join operation. Mutually exclusive with join_condition. |
None
|
join_condition
|
str | None
|
A string containing a complex join expression with literals, functions, and multiple conditions. Use 'left.' and 'right.' prefixes to reference columns from respective DataFrames. Mutually exclusive with join_on. |
None
|
how
|
str
|
The type of join to perform. Must be one of: inner, cross, outer, full, fullouter, left, leftouter, right, rightouter, semi, anti, etc. |
'inner'
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If no joined_data is provided. |
ValueError
|
If neither join_on nor join_condition is provided. |
ValueError
|
If both join_on and join_condition are provided. |
ValueError
|
If the data from context is None. |
ValueError
|
If the data from the joined_data is None. |
Returns:
| Type | Description |
|---|---|
PipelineContext
|
Context after the execution of this Action, containing the result of the join operation. |