After last week's post about using a Select statement with an UNION several people have asked me to clarify the differences between an Union and a Join.
I think using some simple graphics makes it a whole lot easier to explain. I have two files:
File 1
|
File 2
|
If I Join the two files I have data from both files into a single row:
Join
|
The Union merges data from two or more subselects into one result set, the data from each of the subselects is kept as a separate row:
Union
|
OK, let me give more detail using some example code. I have two files, which are identical:
A R RFILE1 A F1_1 2S 0 A F1_2 R REFFLD(F1_1 *SRC) A COLHDG('F1_2') |
And:
A REF(FILE1) A R RFILE2 A F2_1 R REFFLD(F1_1) COLHDG('F2_1') A F2_2 R REFFLD(F1_2) COLHDG('F2_2') |
And a third with just one field:
A REF(FILE1) A R RFILE3 A F3_1 R REFFLD(F1_1) COLHDG('F3_1') |
Each of the file contain five records:
FILE1 F1_1 F1_2 1 1 2 2 3 3 4 4 5 5 |
FILE2 F2_1 F2_2 2 2 4 4 6 6 8 8 10 10 |
FILE3 F3_1 3 6 9 12 15 |
I can join the three with the following SQL statement:
01 SELECT A.F1_1,A.F1_2,B.F2_1,B.F2_2,C.F3_1 02 FROM FILE1 A 03 FULL OUTER JOIN FILE2 B 04 ON A.F1_1 - B.F2_1 05 FULL OUTER JOIN FILE3 C 06 ON A.F1_1 = C.F3_1 07 OR B.F2_1 = C.F3_1 |
A FULL OUTER join will include all of the records from the three files, even if there are not matches in the others. I must admit I still refer to the SQL joins diagram to determine which kind of join is best for my desired result.
The results show that the data has been combined into rows.
F1_1 F1_2 F2_1 F2_2 F3_1 1 1 - - - 2 2 2 2 - 3 3 - - 3 4 4 4 4 - 5 5 - - - - - 6 6 6 - - 8 8 - - - 10 10 - - - - - 9 - - - - 12 - - - - 15 |
The dash ( - ) indicates a value of null, in other words no match could be made.
In my first example of an Union I am going to put together two subselects selecting all the records from FILE1 and FILE2.
01 SELECT * FROM FILE1 02 UNION 03 SELECT * FROM FILE2 04 ORDER BY 1 |
As the names of the fields in the files are different I need to give a number for the ORDER BY. By giving the number 1 I want my result set to be sorted by the first row of the result set.
F1_1 F1_2 1 1 2 2 3 3 4 4 5 5 6 6 8 8 10 10 |
By using the UNION my result set is "distinct", in other words duplicate results have been removed. If I want to see the duplicates I use the UNION ALL instead:
01 SELECT * FROM FILE1 02 UNION ALL 03 SELECT * FROM FILE2 04 ORDER BY 1 |
Now I see the duplicates:
F1_1 F1_2 1 1 2 2 2 2 3 3 4 4 4 4 5 5 6 6 8 8 10 10 |
I can include more than two files in an Union, so let me add FILE3.
01 SELECT * FROM FILE1 02 UNION 03 SELECT * FROM FILE2 04 UNION 05 SELECT * FROM FILE3 06 ORDER BY 1 |
When I execute that statement I get an error:
Number of columns not consistent. Message ID . . . . . : SQL0421 Severity . . . . . . : 30 Message type . . . . . : Diagnostic Message . . . . : Number of columns not consistent. Cause . . . . . : The subselects of a UNION, INTERCEPT, or EXCEPT must have the same number of result columns. All rows specified for a multiple row insert or for a VALUES clause must have the same number of values. Recovery . . . : Correct the SQL statement so that the same number of columns are defined for each row. Try the request again. |
The result set of the Union between FILE1 and FILE2 produces a result set of two columns. As FILE3 only has one field it cannot be part of the result set. Unless I do something like:
01 SELECT * FROM FILE1 02 UNION 03 SELECT * FROM FILE2 04 UNION 05 SELECT F3_1,0 FROM FILE3 06 ORDER BY 1 |
Line 5: I have added a second column to the subselect from FILE3 by using the zero ( 0 ). This will place zero into the second column of the result set, and I get:
F1_1 F1_2 1 1 2 2 3 0 3 3 4 4 5 5 6 6 6 0 8 8 9 0 10 10 12 0 15 0 |
If I just want to have a "distinct" list of the values I need to modify my SQL statement to just be for the first columns:
01 SELECT F1_1 FROM FILE1 02 UNION 03 SELECT F2_1 FROM FILE2 04 UNION 05 SELECT F3_1 FROM FILE3 06 ORDER BY 1 |
Which gives me:
F1_1 1 2 3 4 5 6 8 9 10 12 15 |
To those who asked me about the Union clause I hope that these example adequately show the differences between the Join and the Union.
You can learn more about this from the IBM website:
This article was written for IBM i 7.3, and should work for earlier releases too.
FWIW, I use UNION to detect row differences in 2 files where one file is a clone of the other. If 2 rows are identical, only 1 row gets put into the result set.
ReplyDeletewith mytable as
( select * from ITEM a
UNION
select * from ITEM_SAVE b )
select item, count(*) from mytable
group by item
having count(*) > 1
Ringer
Another smart tecnique for detecting row differences from 2 files is using the EXCEPT keyword. Eg.:
DeleteSELECT * FROM item
EXCEPT
SELECT * FROM item_save
This returns all rows from ITEM table that do not have a matching row in ITEM_SAVE. Here the comparison is made on all columns, but you can also choose single columns to compare.
For reference:
https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_72/sqlp/rbafyexcept.htm
I use union with CTE to give me a number of data samples union-ed together. I create a CTE of data where a for 5 rows then 5 rows where b and 2 rows where c then union the data set.
ReplyDeleteIf you like to know, what row comes from what table you can add a new field like:
ReplyDeleteSelect 'tableA' as Source, f11 from tableA
Union
Select 'tableB' as Source, f12 from tableB
a short tip for using union and sort out what row comes from what table afterwards: just add a
ReplyDelete'Table A' as Source, ...
union
'Table B' as Source, ...
...
field and you still know the source if you like to do...