autoJoinTables

Auto-join multiple tables

Events is a root step table, after performing multiple operations we want to join a step down the path (a descendant of root step), however there is no common fields to perform the join.

A = root step table with f1, f2, f3 fields
B = select f1, f2, size(f3) as u1 from A
C = select f1, UDF_X(f2) as u2, u1 from B
D = select UDF_Y(f1) as u3, u2, u1 from C
E = select u1, u2, SCORE(u1,u2,u3) as s from D

As a result we want to join A and E tables but as you can see they don’t have common fields.

So we want to create a new operator that will join neighboring tables and brings all new fields from descendants so resulting table will be

  • A(f1, f2, f3) + (u1, u2, u3, s) or
  • A(f1, f2, f3) + (s) // specified column

Solution
We will join A with B, and then resulting table with C, and so on.
For ex.

  • R1 = Left join A and B based on (f1,f2) columns since both of them have these columns so R1 will have (f1, f2, f3, u1) columns
  • R2 = Left join R1 with C based on (f1, u1) columns, R2 will have (f1, f2, f3, u1, u2) columns
  • R3 = Left join R2 with D based on (u1, u2) columns, R3 will have (f1, f2, f3, u1, u2, u3) columns
  • R4 = Left join R3 with E based on (u1, u2) columns, R4 will have (f1, f2, f3, u1, u2, u3, s) columns

Joinability

A (an event table) can be joined with B (a dimension table) if and only if:

  • B is a child step A
  • They have at least one field in common
  • B has at least one extra field

It’s better to think of such operations as a lookup operation - where the key fields are the common fields between the two tables, and the extra column in B is the new column that is added.

As such the resulting table will never have more rows than the event table. In fact there should be a one to one mapping between the rows of the event table and the output table. The output table will only have additional columns.

Difference between join and lookup
In order to do the look up, we need to turn the dimension table (B in this case) into a map. The key columns would be the columns that we have in common with A. And if there are multiple values for the same key, we will pick one at random.

NOTE this operator ignores lhub_scores in the joins, and if lhub_scores are present in the nodes except root step they will get dropped. Renaming lhub_scores would keep them in the playbook

Operator Usage in Easy Mode

  1. Click + on the parent node.
  2. Enter the Auto Join Tables operator in the search field and select the operator from the Results to open the operator form.
  3. In the Table drop-down, enter or select a table or list of tables to join.
  4. Optional. In the Result Tables drop-down, select a table or list of tables, whose additional columns will be added to the first table and the rest will be dropped.
  5. Click Run to view the result.
  6. Click Save to add the operator to the playbook.
  7. Click Cancel to discard the operator form.

Usage Details

autoJoinTables([list of steps (tables)],)s-- join all steps and collect all new columns
autoJoinTables([list of steps (tables)], list of desired steps in the output) -- join all steps and in the output show only the columns from the second list of step

Note
When second list is empty, it will join all in sequential order and collect all new columns so output will contain all columns from all tables, if second list is not empty then it will still perform sequential join but in the output it will show only the columns from the second list of step, for better understandin look at the examples below.

Example

Input
table1:

Node name columns
NodeAa, b, c
NodeBb, c, d
NodeCc, d, e
NodeDd, e, f
NodeEe, f, g
-- if you want to see all columns in the output:
autoJoinTables([nodeA, nodeB, nodeC, nodeD, nodeE])
-- will produce a table with (a, b, c, d, e, f, g) columns
 
-- if you want to see only columns from nodeA (a,b,c) and new columns from nodeC (e) and nodeD (g) in the output
autoJoinTables([nodeA, nodeB, nodeC, nodeD, nodeE], nodeA, nodeC, nodeD)
-- will produce a table with (a, b, c, e, g) columns

NOTE: This operator is not "smart" enough to know right ordering of the tables or know missing nodes.

For example,

in above example if you put only "nodeA,nodeE" this operator will not know that there is other nodes in between these two nodes
in above example if you put list of tables in reverse order, it will try to compute in reverse order where the output will be incorrect.