How to remove duplicate records from a file? is a question I have been asked by so many different people that I feel it must be on one of those interview question lists. What I mean by a duplicate record is one where the values in all of its fields are the same as those in another record in the file. I do not mean records with duplicate keys, as that is a whole other "can of worms".
In a well-built database each "master" file or table should have a unique key, which prevents duplicate keys and records:
A                                      UNIQUE
A          R CSTMSTR
Alas, in these examples I am going to have a DDS file that does not have any unique keys, or even keys:
A..........T.Name++++++RLen++TDpB......
A          R TESTFILER
A            F00          10A
A            F01          10A
A            F02          10A
A            F03          10A
A            F04          10A
A            F05          10A
A            F06          10A
A            F07          10A
A            F08          10A
A            F09          10A
It should come as no surprise that, as there are no unique keys, the file contains duplicate records:
F00        F01        F02        F03
One        One        One        One
Two        Two        Two        Two
Three      Three      Three      Three
Four       Four       Four       Four
Five       Five       Five       Five
One        One        One        One
Two        Two        Two        Two
Three      Three      Three      Three
Four       Four       Four       Four
Five       Five       Five       Five
One        One        One        One
Two        Two        Two        Two
Three      Three      Three      Three
Four       Four       Four       Four
Five       Five       Five       Five
Rather than delete the duplicate records from the original file, TESTFILE, I am going to create a duplicate file, TESTFILE2, and copy the unique data to it. The command to create a duplicate of any object is the Create Duplicate Object command, CRTDUPOBJ.
CRTDUPOBJ OBJ(TESTFILE) FROMLIB(MYLIB) OBJTYPE(*FILE) +
            NEWOBJ(TESTFILE2) CST(*NO) TRG(*NO)
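If I preferred to stay in SQL for this step too, I believe I could create the empty copy with a CREATE TABLE ... WITH NO DATA statement instead. This is only a sketch of the idea; unlike CRTDUPOBJ it will not bring across everything about the original object, so treat it as a rough equivalent:

CREATE TABLE TESTFILE2 AS
  (SELECT * FROM TESTFILE)
  WITH NO DATA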
The easiest way I could think of was to use a simple SQL statement:
INSERT INTO TESTFILE2 (SELECT DISTINCT * FROM TESTFILE)
Using SELECT DISTINCT * makes sure I am only selecting distinct (unique) records using all of the file's fields ( * ). Personally I think this is the easiest way, as this statement works the same whether the file has one field or several hundred fields.
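If I want to reassure myself that the duplicates really were dropped, a quick count of the two files will show the difference. This is just a sanity check I might run; with the example data above it would return 15 for TESTFILE and 5 for TESTFILE2:

SELECT (SELECT COUNT(*) FROM TESTFILE)  AS ORIGINAL_ROWS,
       (SELECT COUNT(*) FROM TESTFILE2) AS DISTINCT_ROWS
  FROM SYSIBM.SYSDUMMY1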
If I really had to, I could use the Open Query File, OPNQRYF, and Copy From Query File, CPYFRMQRYF, commands. I am reluctant to use these as IBM is encouraging us all to use SQL instead:
01  OPNQRYF FILE((TESTFILE)) +
              KEYFLD((F00) (F01) (F02) (F03) (F04) +
                     (F05) (F06) (F07) (F08) (F09)) +
              UNIQUEKEY(*ALL)

02  CPYFRMQRYF FROMOPNID(TESTFILE) TOFILE(TESTFILE2) +
                 MBROPT(*REPLACE)

03  CLOF OPNID(TESTFILE)
Line 1: I list all of the key fields to sort the file by. This can be a lot of work for a file with a lot of fields, and the KEYFLD parameter is limited to a maximum of 50 entries. The UNIQUEKEY(*ALL) parameter is essential, as it selects only unique keys, i.e. the first occurrence of each of our duplicate records.
Line 2: The unique keyed records are copied from the original file to the duplicated file.
Line 3: As I have used OPNQRYF I have to close the file using the Close File command, CLOF.
What if I had to remove the duplicate records using RPG? RPG is not a database language; in other words, it does not contain within the language the ability to sort the contents of a file. Therefore, I would have to provide a key external to the program for the program to use. Without an external key I would have to read the entire output file until I either found a match for the record from the original file or reached end of file, a very slow and tedious process.
So I built a logical file over my duplicate file, which I called TESTLF. The key is made up of every field within the file, and the UNIQUE keyword ensures that there can only be one record with that key. As the maximum number of key fields you can have on a file is 120, if the file contains more fields than that we could not use this approach.
A                                      UNIQUE
A          R TESTFILER                 PFILE(TESTFILE2)
A          K F00
A          K F01
A          K F02
A          K F03
A          K F04
A          K F05
A          K F06
A          K F07
A          K F08
A          K F09
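If I did not want to create a DDS logical file, I believe an SQL unique index over the duplicate file would give the same protection, subject to the same sort of limit on the number of key columns. This is just a sketch; the index name TESTFILE2_UNQ is my own invention:

CREATE UNIQUE INDEX TESTFILE2_UNQ
    ON TESTFILE2
       (F00, F01, F02, F03, F04,
        F05, F06, F07, F08, F09)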
Here is my RPG program:
01  dcl-f TESTFILE ;
02  dcl-f TESTLF usage(*output) rename(TESTFILER:TESTLFR) ;

03  dou %eof(TESTFILE) ;
04    read TESTFILER ;
05    write(e) TESTLFR ;
06  enddo ;

07  *inlr = *on ;
Line 1: This is the file with the duplicates in it.
Line 2: The output file is the logical file. It is opened for output, and I have to rename the record format as it is the same in both files.
Line 3: I am going to perform this loop until I reach end of file for TESTFILE.
Line 4: I read TESTFILE.
Line 5: Notice that after the WRITE is the error operation code extender, (E). The error condition, %ERROR, comes on whenever I try to write a record with a duplicate key to the output file, rather than the program ending in error. This also takes care of the last read of TESTFILE: when end of file is encountered the fields still hold the previous record, so the attempt to write that record to TESTLF again is rejected as a duplicate and the program carries on.
Line 6: Loop's end.
Once you understand how the error operation code extender is working to stop database errors from halting the program, you can see how these few lines of code do what we want: copy only the unique records to the output file.
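If I wanted to prove to myself that the RPG program produced the same result as the earlier SQL statement, something like the statement below should return no rows when the two copies agree. This is only a check I might run, assuming the release in use supports EXCEPT:

SELECT DISTINCT * FROM TESTFILE
EXCEPT
SELECT * FROM TESTFILE2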
For those of you still stuck having to use fixed definitions in your RPG, these are the file definitions:
     FFilename++IPEASF.....L.....A.Device+.Keywords+++++++++++++++++
01   FTESTFILE  IF   E             DISK
02   FTESTLF    O    E             DISK    rename(TESTFILER:TESTLFR)
No matter which of these alternatives was used I would do the following after them (a rough SQL sketch of these steps follows the list):
- Back up TESTFILE; it does not matter whether it is to tape or just a copy of the file in another library.
- Clear TESTFILE.
- Copy the data from TESTFILE2 into TESTFILE.
- Make a new logical file with a unique key over TESTFILE to prevent duplicate keyed records being written to TESTFILE again. The key does not have to contain every field in the file, just the fields that are enough to make each record unique.
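For those happy to do it all in SQL, those steps could look something like this. It is only a sketch: BACKUPLIB and the index name are made-up names, I have guessed that F00, F01, and F02 are enough to make a record unique, and you would still need to consider journaling, authorities, and so on:

-- 1. Back up TESTFILE to another library
CREATE TABLE BACKUPLIB.TESTFILE AS
  (SELECT * FROM TESTFILE)
  WITH DATA

-- 2. Clear TESTFILE
DELETE FROM TESTFILE

-- 3. Copy the de-duplicated data back
INSERT INTO TESTFILE
  SELECT * FROM TESTFILE2

-- 4. Prevent duplicate keys in the future with a unique index
CREATE UNIQUE INDEX TESTFILE_UNQ
    ON TESTFILE (F00, F01, F02)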
Which of these three methods would I use? The SQL approach, as it does not matter how many fields the file has; it works the same for one field or 300. If someone insists that it has to be an RPG program then I would give them this:
01  exec sql INSERT INTO TESTFILE2
               (SELECT DISTINCT * FROM TESTFILE) ;
02  *inlr = *on ;
Addendum
I have received from Paul R, Sumit Goyal, and others another way, using SQL, to remove the duplicate records:
DELETE FROM TESTFILE A
 WHERE RRN(A) > (SELECT MIN(RRN(B))
                   FROM TESTFILE B
                  WHERE A.F00 = B.F00
                    AND A.F01 = B.F01
                    AND A.F02 = B.F02
                    AND A.F03 = B.F03
                    AND A.F04 = B.F04
                    AND A.F05 = B.F05
                    AND A.F06 = B.F06
                    AND A.F07 = B.F07
                    AND A.F08 = B.F08
                    AND A.F09 = B.F09)
It is not too long if I only have ten columns/fields, but if the file/table had 100...
This article was written for IBM i 7.2, and should work for earlier releases too.
Removing duplicate records is needed very often but I've never seen a case where all columns are considered the key. A more practical example would be if you checked the first x columns for a unique key.
If I only checked a few of the columns/fields then I still might not have a duplicate record, just a duplicate key.
In my opinion duplicate keyed records/rows are worse, as it is a nightmare to work out which one is right, or perhaps they are both partially right.
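If you did want to look for duplicate keys on just a subset of the columns, a GROUP BY with HAVING would show where they are. A minimal sketch, using the first two columns of my example file:

SELECT F00, F01, COUNT(*) AS NBR_OF_ROWS
  FROM TESTFILE
 GROUP BY F00, F01
HAVING COUNT(*) > 1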
cpyf errlvl(*NOMAX) into file with unique key
If all you need to do is delete duplicate records, wouldn't SQL be a cleaner alternative?
For example:
Delete from libr.file01 f1
where rrn(f1) > (select min(rrn(f2))
from libr.file01 f2
where f2.fld01 = f1.fld01
and f2.fld02 = f1.fld02
and f2.fld03 = f1.fld03
and f2.fld04 = f1.fld04)
This way it is more practical and efficient.
Helpful
Very nice explanation thanks 🙏