Tuesday, June 9, 2020

Making a hash of records for comparisons

sql hash_row to allow comparison of records in files

Another addition to Db2 for i as part of the latest Technology Refreshes for IBM i 7.4 TR2 only, is the ability to create a hash value for a record from a file using the HASH_ROW SQL built in function. I am not going to describe what hash values are, if you are interested there is a Wikipedia article describing it. I am just going to say that hash-ing a string produces hexadecimal value of the data.

Previously I had been able to hash individual fields, or columns, using the various standards of hash. As HASH_ROW uses SHA512 I can do the same for a field using HASH_SHA512:

01  SELECT HASH_SHA512('A') AS "Hash",
02         LENGTH(TRIM(HASH_SHA512('A'))) AS "Length"
03    FROM SYSIBM.SYSDUMMY1
04  UNION
05  SELECT HASH_SHA512('ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
06         LENGTH(TRIM(HASH_SHA512('ABCDEFGHIJKLMNOPQRSTUVWXYZ')))
07    FROM SYSIBM.SYSDUMMY1

Lines 1 – 3: I am retrieving the hash value of the letter A, line 1, and the length of that hash value, line 2.

Line 4: I am using a UNION clause to combine the results from the two Select statements.

Lines 5 – 7: I am retrieving the hash value and its length of the entire alphabet.

The results how that the length of the hash value is the same even though the lengths of the strings were different.

Hash                 Length
-------------------- ------
69E7660CA6A1843B9...     64
D8AC4B838921A83C4...     64

Why would I want to hash an entire record? As HASH_ROW is relatively quick it allows for me to rapidly be able compare records/rows of data to determine if they are the same.

To be able to compare data I need files with data. This is the first file, TESTFILE.

01  SELECT RRN(A) AS "RRN",
02         A.*,
03         HASH_ROW(A) AS "HASH_ROW" 
04    FROM MYLIB.TESTFILE A

Line 1: I have added the Relative Record Number, RRN, as I will need to be able to determine which record is which later.

Line 2: All the columns/fields from the file.

Line 3: The HASH_ROW value.

The results are:

RRN  F001     F002   F003  HASH_ROW
---  -------  ------ ----  -------------------------
  1  FIRST    ONE       1  4B030C744BB45493F80471...
  2  SECOND   TWO       2  39E8E1A760DC66D149B250...
  3  THIRD    THREE     3  45E899ABBD40EC02E0216C...
  4  FOURTH   FOUR      4  6C727446C2AC6B369720A8...
  5  FIFTH    FIVE      5  D380D0B49F1DB2166EF0B4...

I will be comparing that file to TESTFILE1.

01  SELECT RRN(B) AS "RRN",
02         B.*,
03         HASH_ROW(B) AS "HASH_ROW" 
04    FROM MYLIB.TESTFILE1 B

The results from this file are:

RRN  F001     F002   F003  HASH
---  -------  ------ ----  -------------------------
   1  FIRST    ONE      1  4B030C744BB45493F80471...
   2  THIRD    THREE    3  45E899ABBD40EC02E0216C...
   3  FIFTH    FIVE     5  D380D0B49F1DB2166EF0B4...
   4  SEVENTH  SEVEN    7  82FEF9236EDBA4D666EA26...
   5  NINTH    NINE     9  E2797A8B726A228B2936A4...

Now I have shown how easy getting the HASH_ROW value is I can start comparing the two files. Let me start by comparing the files by RRN:

01  SELECT CASE WHEN HASH_ROW(A) = HASH_ROW(B) THEN 'Y'
02         ELSE 'N' 
03         END AS "MATCH",
04         RRN(A) AS "RRN1",HASH_ROW(A) AS "HASH1",
05         RRN(B) AS "RRN2",HASH_ROW(B) AS "HASH2"
06    FROM MYLIB.TESTFILE A
07           FULL OUTER JOIN MYLIB.TESTFILE1 B
08           ON RRN(A) = RRN(B)

Lines 1 – 3: I am using a derived column to show if the data in the two records are the same.

Line 4: RRN and hash columns from TESTFILE.

Line 5: The same from TESTFILE1.

Lines 6 – 8: I am joining the two files with a FULL OUTER JOIN to include all the records from both files in the results.

The results are:

MATCH  RRN1  HASH 1                RRN2  HASH2
-----  ----  --------------------  ----  --------------------
Y         1  4B030C744BB45493F...     1  4B030C744BB45493F...
N         2  39E8E1A760DC66D14...     2  45E899ABBD40EC02E...
N         3  45E899ABBD40EC02E...     3  D380D0B49F1DB2166...
N         4  C727446C2AC6B3697...     4  82FEF9236EDBA4D66...
N         5  D380D0B49F1DB2166...     5  E2797A8B726A228B2...

Only the first record in both files match.

Let me now compare the values in the records to one another:

01  SELECT RRN(A) AS "RRN1",HASH_ROW(A) AS "HASH1",
02         RRN(B) AS "RRN2",HASH_ROW(B) AS "HASH2"
03    FROM MYLIB.TESTFILE A
04           FULL OUTER JOIN MYLIB.TESTFILE1 B
05           ON HASH_ROW(A) = HASH_ROW(B)

I do not need the derived Match column in the results this time as if a match is not found the results from that file will be null.

Line 5: The files are joined by their hash row values.

The results are:

RRN1  HASH 1                RRN2  HASH2
----  --------------------  ----  --------------------
   1  4B030C744BB45493F...     1  4B030C744BB45493F...
   2  39E8E1A760DC66D14...     -  -
   3  45E899ABBD40EC02E...     2  45E899ABBD40EC02E...
   4  6C727446C2AC6B369...     -  -
   5  D380D0B49F1DB2166...     -  -
   -  -                        4  82FEF9236EDBA4D66...
   -  -                        5  E2797A8B726A228B2...

The null values in the results are represented by the hyphens ( - ).

The first record from the files is the same. As is the third record from TESTFILE and the second record from TESTFILE1. All other records are not matched.

Would I use this? I feel I have been spoiled with what I wrote about last week, the COMPARE_FILE. When COMPARE_FILE is used with the COMPARE_DATA = 'QUICK' it will give me the same result as comparing the HASH_ROW of two different records, that the data is different. But unlike the COMPARE_DATA = 'YES' HASH_ROW cannot tell me what is different.

 

You can learn more about the changes to SQL's HASH_ROW function from the IBM website here.

 

This article was written for IBM i 7.4 TR2.

1 comment:

  1. I think the most likely place you would use this is for implementing optimistic locking. That is, where you don't retain a lock between when you first read the data, and when you execute an update/delete.

    You would:
    * Select the columns you are interested in, plus the row hash
    * Let the user change what they need
    * Then when updating/deleting, add an additional condition that says the row hash must match the value first read.
    * If the hashes differ, the underlying data has changed and you should probably refresh the screen for the user to decide what action to take.

    And if you are wondering why you would need optimistic locking, our experience was this became necessary when we started using foreign key constraints more widely. In order to validate a constraint, the database needs a lock on both the row you are changing, plus any linked rows in parent tables. So you can end up with this scenario:

    * Take a simple orders table, that includes a foreign key to a customers table
    * A staff member loads up customer details to make a change, but then goes out to lunch without exiting the program, holding a lock on the database row
    * New orders cannot be added for that customer, as the database cannot lock the row in the parent table to validate that the constraint has been satisfied.

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.