SQL joins are one of the most critical parts of any ETL. When wrangling or massaging data from multiple tables, you need, one way or another, to combine the data meaningfully. Good data modelers normalize data into multiple tables to avoid duplicate representation, whether through BCNF (Boyce-Codd Normal Form) or dimensional models. It is common practice to join such tables by relevant columns (also called keys) depending on the context. For example, to find all employees in a particular department, you might join an employee table with a department table on the department number. If instead you want all employees who joined prior to a given date, the join criteria may be a date condition between the tables.
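To make the key-based join concrete, here is a minimal PySpark sketch using hypothetical employee and department tables (the column names and sample rows are illustrative, not from any real schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-example").getOrCreate()

# Hypothetical employee table, each row carrying a department key
employees = spark.createDataFrame(
    [(1, "Alice", 10), (2, "Bob", 20), (3, "Carol", 10)],
    ["emp_id", "name", "dept_id"],
)

# Hypothetical department table keyed by the same column
departments = spark.createDataFrame(
    [(10, "Engineering"), (20, "Sales")],
    ["dept_id", "dept_name"],
)

# Equi-join on the shared key column: employees matched to their department
employees.join(departments, on="dept_id", how="inner").show()
```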
We are not here to discuss joins themselves, but imagine those joins happening on a distributed computing cluster. Assume your employee data is distributed across twenty different nodes and your department data across ten. How does a distributed computing system like Spark join the data efficiently? That is the subject of this article. Towards the end, we will cover Adaptive Query Execution (AQE), a feature introduced in Spark 3.0 that makes things better.
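As a quick preview before we get there, AQE is controlled by a single configuration flag. A minimal sketch, reusing the session from the example above:

```python
# AQE shipped with Spark 3.0 and is enabled by default starting in Spark 3.2.
# On earlier 3.x versions you can turn it on explicitly:
spark.conf.set("spark.sql.adaptive.enabled", "true")
```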