Mathematical Explanation of the Difference between SQL Joins: Inner, Outer, Left, Right

4.3k Views Asked by At

Question

This question calls for a mathematically sound & intuitive explanation of SQL joins that clearly shows the difference between the following:

  • Inner Join
  • Left Join
  • Right Join
  • Full Outer Join

The explanation of joins should not misuse Venn diagrams. This is key. It should also be as accessible as possible to a computer programmer or mathematical beginner. We don't want to scare programmers away from mathematical concepts by using too much jargon. Of course, a little bit of maths is always necessary.

Motivation

The internet is rife with usages of Venn diagrams to explain SQL joins. As pointed out in the following articles, this leads to a grave misunderstanding of either Venn diagrams, SQL joins or both:

As a website that many students of mathematics and computer science consult as a source of truth, it is our responsibility as a community to try everything in our power to propagate truth. Unfortunately, Venn diagram usage to explain a concept which is really Cartesian product at its core is all to rife.

Our own sister site, StackOverflow, is unfortunately part of this problem: https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join/38578#38578. While there are many amazing answers under that question, the prevailing belief on that site appears to be that joins are intersections/unions and Venn diagrams are appropriate to explain them. The top ranked and accepted answer uses Venn diagrams and intersection/union to explain joins.

While there may be some cases where join coincides with intersections and unions, it is not in general the case. I fear that people are simply seeing the special case and accepting the Venn diagram explanation. I fear they are then walking away with improper understanding of SQL joins and set theory.

I am hoping that by posting a question here, even a small percentage of people might be directed here instead of to another site that has SQL joins incorrectly explained using Venn diagrams. I am hoping that at least one of the Stack Exchange websites can have an accepted answer explaining SQL joins that is mathematically accurate, and potentially many other good alternative answers alongside it to provide different perspectives.

To be clear: I think I understand SQL joins myself. The purpose of this question is to create visibility and a source of truth for those new students of computer science and mathematics who might not understand them fully.

Related

Is Cartesian Product same as SQL Full Outer Join?

3

There are 3 best solutions below

21
On BEST ANSWER

Let $A, B$ be sets. We think of $A$ and $B$ as tables, and their elements as rows. Each element of $x\in A$ is a list of data entries, one for each column of $A$.

(Edit: WLOG assume $A$ and $B$ do not have duplicate entries. If they do, add a unique index column to each.)

Let $R$ be any relation, that is, a subset $R \subseteq A \times B$, where we write $a \sim \, b$ if $(a,b) \in R$. In SQL $R$ corresponds to the statement that appears after "ON", e.g., A.name = B.name corresponds to the relation $x \sim y$ if and only if the entry in the name column of for a row $x \in A$ is the same as the name column in a row of $y \in A$.

Then $$A \operatorname{ INNER JOIN } B \operatorname{ON} R = \{(a,b) \in A \times B \, |\, a \sim b\}\, (=R).$$

(Edit: Here $(a,b)$ represents the concatenation of the entries of rows $a$ and $b$, corresponding to SELECT * FROM A JOIN B ON R. Of course the actual output may differ depending on the implementation.)

But here, if $a \in A$ is such that there is no corresponding $b$ such that $a \sim b$, then $a$ will not appear in the join. If you take a left join, you want every $a$ to appear regardless. So you add a special element $\operatorname{NULL}$ and add it to your relation. $\operatorname{NULL}$ obeys the rules

$a \sim \operatorname{NULL}$ iff there is no $b \in B$ with $a \sim b$

$\operatorname{NULL} \sim b$ iff there is no $a \in A$ with $a \sim b$

Now let $$\hat{A} = A \cup \{\operatorname{NULL}\},$$ $$\hat{B} = B \cup \{\operatorname{NULL}\}.$$

Then we have

$$A \operatorname{ INNER JOIN } B \operatorname{ON} R = \{(a,b) \in A \times B \, | a \sim b\}$$ $$A \operatorname{ LEFT JOIN } B \operatorname{ON} R = \{(a,b) \in A \times \hat{B} \, | a \sim b\}$$ $$A \operatorname{ RIGHT JOIN } B \operatorname{ON} R = \{(a,b) \in \hat{A} \times B \, | a \sim b\}$$ $$A \operatorname{ OUTER JOIN } B \operatorname{ON} R = \{(a,b) \in \hat{A} \times \hat{B} \, | a \sim b\}.$$

Thus we'll have the pairs $(a, \operatorname{NULL})$ appear on the left join whenever $a$ doesn't match any $b$, and $(\operatorname{NULL}, b)$ whenever $b$ doesn't match any $a$ in the right join. (note that we don't have $\operatorname{NULL} \sim \operatorname{NULL}$, so we never have $(\operatorname{NULL}, \operatorname{NULL})$.)

The reason that Venn diagrams are used to depict joins is that usually joins are usually done on relations as simple as the one given above, $R$ corresponding to A.name = B.name. In that case, if $\text{names}(T)$ is the set of names that appear in a table $T$, that is, $\text{names}(T)$ = SELECT DISTINCT names FROM T, then

\begin{align*}\text{names}(A\operatorname{ INNER JOIN } B \operatorname{ON} R) &= \text{names}(A)\cap \text{names}(B) \\ \text{names}(A\operatorname{ LEFT JOIN } B \operatorname{ON} R) &= \text{names}(A)\\ \text{names}(A\operatorname{ RIGHT JOIN } B \operatorname{ON} R) &= \text{names}(B)\\ \text{names}(A\operatorname{ OUTER JOIN } B \operatorname{ON} R) &= \text{names}(A)\cup \text{names}(B).\end{align*}

However, this completely loses sight of the fact that joins may be one-to-one, many-to-one, or many-to-many, and personally I've found those Venn diagrams more confusing than helpful when learning about joins.

4
On

Jair Taylor has given us a precise mathematical formalism of the four type of joins in his answer, as called for. This answer supplements that one with a concrete example.

Suppose we have two tables, BuildingPrice and Buyers:

enter image description here

And suppose we want to know which buildings can be afforded by which buyers. We can do a SQL join. Here is the inner join SQL:

SELECT * FROM BuildingPrice JOIN Buyers ON AccountBalance >= Price

The ON condition characterises the relation Jair talks about in his answer. We can then visualise all four joins (with the same ON condition), in the following diagram:

enter image description here

In this diagram, we flip the Buyers table on its side so that its rows are now columns, i.e. we transpose it. We also add the special NULL element that Jair describes. This gives us the cross product, which is the rectangular area achieved by multiplying the columns in the transposed Buyers table, plus NULL, with the rows in the BuildingPrice table, plus NULL. All joins start with the inner join, the green area. The left, right and outer joins add extra elements as required.

Each element in the diagram that's included in the diagram is a pair of rows: one from BuildingPrice and one from Buyers. Of course, what's actually returned by a join is not a set of pairs of rows but a set of rows. So for any given pair, we convert it to a single row of the result table by simply taking the union of all the column to value mappings. For the NULL case, those mappings will all have a value of NULL. So for example, our LEFT join would result in this table:

enter image description here

A Note on NULL

It is important that we have the correct, precise interpretation of NULL here, and what it means for the resulting records in the joined table. WLOG we'll just consider the LEFT JOIN case. Suppose we have an element $x$ of the left table which has no right table elements associated to it. This will, in Jair's characterisation, give rise to the pair $(x, $NULL$)$ being included in the join.

For the actual joined table though, we have to go a step further and convert that pair to a record i.e. a row in the resultant table. For that to work, we need to convert NULL to a column-mapping in the right table, where the value of each mapped column is NULL. So in this case, NULL is actually the map:

As correctly pointed out in the comments, the two tables will not in general have the same set of columns or even the same number of columns, so the meaning of NULL in the LEFT and RIGHT cases is different. WLOG, we're just considering the left case, in which the NULL actually means this mapping representing a row of the right table:

$($Buyers.Name$ \rightarrow$NULL$,$ AccountBalance$ \rightarrow$NULL$)$

13
On

An alternative characterisation of joins starts with LEFT JOIN and defines everything from there. It is equivalent to Jair Taylor's formalism, just a different perspective. This definition is very formal so it should definitely be supplemented by other answers / concrete examples for a good intuition of JOIN.

Definition: Values

Let's define the set $V$ as the set of all possible values in any possible SQL cell. So $V$ would be the union of all possible SQL types. The reason for doing this is so that we don't get bogged down in type-system considerations.

No matter what our universe of values is, we always assume a null value, call it $NULL$.

Definition: Record(s)

Let's say we have a set of columns $C$. A record for $C$ is just a function from $C$ onto $V$. In computer science terms, imagine a dictionary or a map. Let's denote the set of all records for a column set $C$ as $R_C$:

$$R_C = C \rightarrow V$$

Definition: Null Record

Let's say we have a set of columns $C$. We can define the null record for $C$, $NULL_C : R_C$ as follows:

$$NULL_C = \lambda c \mapsto NULL$$

That is, it is the function which maps every column $c : C$ to the value $NULL$.

Definition: Table

Let's say we have a set of columns $C$. A table for $C$ is just a set of records for $C$. Let's denote the set of all such tables as $T_C$. Then:

$$T_C = \mathcal P(R_C)$$

Where $\mathcal P$ is just the symbol for the powerset, i.e. the set of all subsets, of a given set. So a table is just a subset of all possible records for a given set of columns.

Note: As Jair points out in his answer, although tables are in reality bags, not sets of records, we can always add an invisible column to the column set $C$ that must be unique, forcing a set representation. So WLOG, we'll continue with sets, which are easier to handle.

Definition: Left Set Selector

Suppose we have two sets of columns $C$ and $D$. WLOG let's assume these sets are disjoint (in SQL, we can force column names to be disjoint by prepending the table name to get a fully qualified name). And suppose we have two tables $t_C : T_C$ and $t_D : T_D$. And suppose we are given any binary relation $R : \mathcal P(t_C \times t_D)$.

Then we can define a precursor to the left join. Define $S : t_C \rightarrow \mathcal P(t_C \times t_D)$:

$$S(r_C) = \{r_D : t_D | r_C R r_D\}$$

And then define our set selector $LS : \mathcal P(t_c \times (t_d \cup NULL_D))$

$$ LS(r_C) = \begin{cases} S(r_C) & \text{if }S(r_C) \neq \emptyset \\ NULL_D & \text{if }S(r_C) = \emptyset \end{cases} $$

Definition: Left Join Precursor

Given column sets $C, D$, and a relation $R : \mathcal P(t_C \times t_D)$. The left join precursor $LJP : T_C \times T_D \rightarrow \mathcal P(T_C \times (T_D \cup \{NULL_D\}))$ can be defined as follows:

$$LJP(t_c, t_d) = \bigcup_{r_C : T_C} LS(r_C)$$

Record Join

Suppose we have two records $r_C$ and $r_D$ on column sets $C$ and $D$ respectively. Then we can define the joined record on the set $C \cup D$ as:

$$J(r_C, r_D) = \lambda x \mapsto \begin{cases} r_C(x) & x : C \\ r_D(x) & x : D \end{cases} $$

Definition: Left Join

Given column sets $C, D$, and a relation $R : \mathcal P(t_C \times t_D)$. The left join $L : T_C \times T_D \mapsto T_{C \cup D}$ can be defined as:

$$L(t_C, t_D) = \{J(r_C, r_D) : R_{C \cup D}| (r_C, r_D) : LJP(t_C, t_D)\}$$

Definition: Right Join

The right join $RJ$ can be defined using symmetry and the left join:

$$RJ(t_C, t_D) = LJ(t_D, t_C)$$

Definition: Inner Join

$$I(t_C, t_D) = RJ(t_C, t_D) \cap L(t_C, t_D)$$

Definition: Outer Join

$$O(t_C, t_D) = RJ(t_C, t_D) \cup L(t_C, t_D)$$

Venn Diagram Relating all Four Joins

enter image description here

The outer join is not labelled in the picture but it is the union of the areas of the two circles.

NB: THE CIRCLES IN THIS VENN DIAGRAM ARE NOT THE ORIGINAL TABLES THAT WERE JOINED. PLEASE DO NOT GLANCE OVER THIS IMAGE AND MISTAKE THEM AS SUCH.