This post is based on the presentation Thomas Leo Swint made to the Central Texas IBM i User Group last night. Thomas performed a series of tests to find the fastest method to "read" several million rows of data, and also to "read" the data from one table and output it to another table. When he shared his results with me, I was surprised by what he had discovered. I asked him to make a presentation to the CTXiUG, and for his permission to share his results here.
Thomas needed to copy millions of rows of data from one set of SQL tables to another, and the process needed to take the least amount of time possible. He created four test tables with their character columns defined in different ways. He compared the time taken to "copy" rows from each of these tables to another table using:
- A Copy File command, CPYF, embedded in an RPG program and executed by calling SQL's QCMDEXC procedure
- An RPG program with a SQL cursor performing a single row fetch
- An RPG program with a SQL cursor performing multiple row fetches
But before I start showing the code of these programs, I want to show the four different tables. All of the code is his; all I have done is remove extra lines that are not necessary for showing how this all works.
In his first table, which I am calling TCHAR, the columns are all defined as CHAR:
01  CREATE TABLE TLSLIB.TCHAR (
02    MSGID CHAR(7) CCSID 37 NOT NULL DEFAULT '' ,
03    FROMJOB CHAR(28) CCSID 37 NOT NULL DEFAULT '' ,
04    MSG_TEXT CHAR(200) CCSID 37 NOT NULL DEFAULT '' ,
05    MSG_TEXT2 CHAR(2000) CCSID 37 NOT NULL DEFAULT '' ,
06    MSG_TOKEN CHAR(500) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TCHARR ;
The concern with having large fixed length character columns is that the contents of these columns are often not that long, which wastes DASD space. To overcome this waste of space he created another table, which I am calling TVARCHAR, where the columns are defined as variable length character, VARCHAR:
01  CREATE TABLE TLSLIB.TVARCHAR (
02    MSGID VARCHAR(7) CCSID 37 NOT NULL DEFAULT '' ,
03    FROMJOB VARCHAR(28) CCSID 37 NOT NULL DEFAULT '' ,
04    MSG_TEXT VARCHAR(200) CCSID 37 NOT NULL DEFAULT '' ,
05    MSG_TEXT2 VARCHAR(2000) CCSID 37 NOT NULL DEFAULT '' ,
06    MSG_TOKEN VARCHAR(500) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVARCHARR ;
There is a potential performance issue when using columns defined as VARCHAR: when data is fetched from them, two input/output operations can occur instead of one. This is nicely explained in this IBM Support document, here. I have copied the relevant section from that document below:
Q1: If I insert only 50, 30, or 80 bytes into a VARCHAR column (defined as 254 with ALLOCATE(0) ), will I incur two I/Os?
A1: Yes, it will be two I/Os because ALLOCATE(0) was specified. To avoid an extra I/O, use ALLOCATE(30) or (50). We suggest allocating enough so most of your input will fit and only the long ones will overflow the allocated part.
How much space do you allocate for each VARCHAR column?
Thomas found that the consensus was that the allocated space should be big enough to accommodate 80% of the data in that column, not 80% of the column's size. I am not going to show how the 80% size was calculated; that will be the subject of a future post.
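To make the idea concrete, a query along these lines could find the smallest length that covers at least 80% of the values in a column, here MSG_TEXT. This is only a sketch of one possible approach, not necessarily the calculation Thomas used, and the CUME_DIST OLAP function it relies on needs IBM i 7.3 or later:

  -- Sketch only: find the smallest MSG_TEXT length that covers at least
  -- 80% of the rows, which is a candidate ALLOCATE value for that column
  SELECT MIN(Len) AS Allocate_80
    FROM (SELECT LENGTH(RTRIM(MSG_TEXT)) AS Len,
                 CUME_DIST() OVER (ORDER BY LENGTH(RTRIM(MSG_TEXT))) AS Pct
            FROM TLSLIB.TCHAR) AS T
   WHERE Pct >= 0.8 ;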
The file with the 80% allocated columns, which I am calling TVAR80, looks like this:
01  CREATE TABLE TLSLIB.TVAR80 (
02    MSGID VARCHAR(7) ALLOCATE(7) CCSID 37 NOT NULL DEFAULT '' ,
03    FROMJOB VARCHAR(28) ALLOCATE(20) CCSID 37 NOT NULL DEFAULT '' ,
04    MSG_TEXT VARCHAR(200) ALLOCATE(69) CCSID 37 NOT NULL DEFAULT '' ,
05    MSG_TEXT2 VARCHAR(2000) ALLOCATE(849) CCSID 37 NOT NULL DEFAULT '' ,
06    MSG_TOKEN VARCHAR(500) ALLOCATE(107) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVAR80R ;
The final table allocates 100% of each column's size. This was to see if the results would be the same as for the TCHAR table. I am calling this file TVAR100:
01  CREATE TABLE TLSLIB.TVAR100 (
02    MSGID VARCHAR(7) ALLOCATE(7) CCSID 37 NOT NULL DEFAULT '' ,
03    FROMJOB VARCHAR(28) ALLOCATE(28) CCSID 37 NOT NULL DEFAULT '' ,
04    MSG_TEXT VARCHAR(200) ALLOCATE(200) CCSID 37 NOT NULL DEFAULT '' ,
05    MSG_TEXT2 VARCHAR(2000) ALLOCATE(2000) CCSID 37 NOT NULL DEFAULT '' ,
06    MSG_TOKEN VARCHAR(500) ALLOCATE(500) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVAR100R ;
Thomas performed three rounds of tests with different numbers of rows:
- 1 million rows
- 5 million rows
- 15 million rows
I am not going to show the results from all of those tests, just the results from the tests where he used 15 million rows.
As I mentioned above, he used three RPG programs, which I will show below. The code I am going to show is not the full programs, just what I would call the "interesting" parts of them. I am also only showing one version of each program; there was a version of each program for each of the four files.
The first RPG program used a CPYF command executed by SQL's QCMDEXC procedure:
01  dcl-s CopyCommand char(250) ;
02  dcl-s StrTS timestamp ;
03  dcl-s EndTS timestamp ;

04  CopyCommand = 'CPYF FROMFILE(TLSLIB/TCHAR) ' +
05                'TOFILE(QTEMP/OUTFILE) ' +
06                'CRTFILE(*YES) ' ;

07  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

08  Exec SQL CALL QSYS2.QCMDEXC(:CopyCommand) ;

09  Exec SQL SET :EndTS = CURRENT_TIMESTAMP ;

10  Exsr $WritetoLogFile ;
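The $WritetoLogFile subroutine is not shown in any of these extracts. As a rough sketch of the kind of thing it would do, the elapsed time can be calculated from the two timestamps and written to a log table. The Elapsed variable and the TESTLOG table below are my assumptions, they are not from Thomas's programs:

  dcl-s Elapsed packed(20 : 6) ;

  begsr $WritetoLogFile ;
    // %diff with *MSECONDS returns the number of microseconds between
    // the two timestamps, dividing by 1,000,000 gives seconds
    Elapsed = %diff(EndTS : StrTS : *mseconds) / 1000000 ;

    // TESTLOG is a hypothetical log table
    Exec SQL INSERT INTO TLSLIB.TESTLOG (RUN_TS, ELAPSED_SECONDS)
               VALUES(:StrTS, :Elapsed) ;
  endsr ;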
The next program uses a SQL cursor to perform a single row fetch to retrieve data from the table, and then uses a single row insert to add the data to the output table.
01  dcl-ds @DataIn extname('TLSLIB/TCHAR') qualified ;
02  end-ds ;
03  dcl-ds @TestFileOut extname('TLSLIB/TCHAROUT') qualified ;
04  end-ds ;
05  dcl-s StrTS timestamp ;
06  dcl-s EndTS timestamp ;
07  dcl-s @SqlSuccessful ind ;

08  Exec SQL Declare TestFile_Cursor Cursor For
09             select * from TLSLIB.TCHAR ;

10  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

11  Exec SQL Open TestFile_Cursor ;
12  @SqlSuccessful = (SQLCOD = 0) ;

13  DoW @SqlSuccessful ;
14    Exec SQL Fetch TestFile_Cursor Into :@DataIn ;
15    @SqlSuccessful = (SQLCOD <> 100) ;

16    If @SqlSuccessful ;
17      @TestFileOut = @DataIn ;
18      Exec SQL Insert into TLSLIB.TCHAROUT
19                 values( :@TestFileOut ) ;
20    EndIf ;
21  Enddo ;

22  Exec SQL Close TestFile_Cursor ;
23  Exec SQL SET :EndTS = CURRENT_TIMESTAMP ;

24  Exsr $WritetoLogFile ;
The third program uses a SQL cursor to perform a multiple row fetch, and then a multiple row insert to add the data to the output table.
01  dcl-ds @DataIn extname('TLSLIB/TCHAR') qualified dim(6000) ;
02  end-ds ;
03  dcl-ds @TestFileOut extname('TLSLIB/TCHAROUT') qualified ;
04  end-ds ;
05  dcl-ds gd qualified ;
06    RowsCount int(10) ;
07  end-ds ;
08  dcl-s StrTS timestamp ;
09  dcl-s EndTS timestamp ;
10  dcl-s @SqlSuccessful ind ;
11  dcl-s NbrOfRows int(5) inz(%elem(@DataIn)) ;
12  dcl-s p_RowsCount int(10) ;

13  Exec SQL Declare TestFile_Cursor Cursor For
14             select * from TLSLIB.TCHAR ;

15  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

16  Exec SQL Open TestFile_Cursor ;
17  @SqlSuccessful = (SQLCOD = 0) ;

18  DoW @SqlSuccessful ;
19    Clear @DataIn ;

20    Exec SQL Fetch TestFile_Cursor
21               FOR :NbrOfRows ROWS Into :@DataIn ;
22    SQLCOD = SQLCOD ;
23    @SqlSuccessful = (SQLCOD <> 100) ;

24    If @SqlSuccessful ;
25      gd = GetSqlDiagnostics() ;
26      p_RowsCount = gd.RowsCount ;

27      Exec SQL
28        INSERT into TLSLIB.TCHAROUT :p_RowsCount ROWS
29          values( :@DataIn ) ;
30    EndIf ;
31  Enddo ;

32  Exec SQL Close TestFile_Cursor ;
33  Exec SQL SET :EndTS = CURRENT_TIMESTAMP ;

34  Exsr $WritetoLogFile ;
Line 1: The data structure array is defined with 6,000 elements. Defining it with a greater number of elements would have exceeded the maximum size allowed for the array.
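Line 25: The GetSqlDiagnostics procedure is part of the full program and is not shown here. I am assuming it wraps SQL's GET DIAGNOSTICS statement to return the number of rows retrieved by the multiple row fetch. A minimal sketch of such a procedure could look like this:

  // Sketch only: an assumed implementation of GetSqlDiagnostics
  dcl-proc GetSqlDiagnostics ;
    dcl-pi *n likeds(gd) ;
    end-pi ;

    dcl-ds Diagnostics likeds(gd) inz ;

    // ROW_COUNT gives the number of rows affected by the previous SQL
    // statement, here the number of rows returned by the fetch
    Exec SQL GET DIAGNOSTICS :Diagnostics.RowsCount = ROW_COUNT ;

    return Diagnostics ;
  end-proc ;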
I have broken out the results into a table, making it easy to compare the results for each type of table. Each program was run 10 times in different jobs. The times are the average of the 10 runs, and are in seconds.
Program | No. of tests | TCHAR | TVARCHAR | TVAR80 | TVAR100
SQL multi row | 10 | 150.780885 | 224.643632 | 271.863179 | 271.826870
CPYF | 10 | 446.536102 | 556.010700 | 708.202601 | 467.950940
SQL single row | 10 | 472.460167 | 620.534527 | 785.702390 | 774.258705
What conclusions can I draw from these results:
- Using a SQL multi row fetch is the fastest way to copy data from one table to another (I was surprised by how much faster it is compared to CPYF).
- Using VARCHAR is slower than just a straight CHAR (not surprised).
- Using the 80% allocation was the slowest of all copies (surprised).
- Only with the CPYF was the 100% allocation time close to the TCHAR number (surprised).
Why was RPG not used in these examples?
In earlier rounds of testing Thomas compared native RPG input, a SQL single row fetch, and a SQL multiple row fetch for input only. The RPG program was so much slower than the other two that it was not used in the next set of tests.
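By "RPG" here I mean native record level access, reading the table with a READ loop rather than with embedded SQL. Thomas's program is not shown in this post, but a minimal sketch of that style of input, reading the whole table in arrival sequence, would look something like this:

  // Sketch only, this is not Thomas's actual test program
  dcl-f TCHAR usage(*input) ;

  read TCHAR ;
  dow not %eof(TCHAR) ;
    // the fields from the record are now available to the program
    read TCHAR ;
  enddo ;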
Results for reading the table of 1 million rows:
Table | RPG | Single row fetch | Multiple row fetch
TCHAR | 56.258650 | 17.878956 | 7.920792
TVARCHAR | 56.335911 | 19.058225 | 9.734931
TVAR80 | 66.362741 | 21.619942 | 11.693264
These results reinforce my preference for using SQL multi row fetches for performing all kinds of database input and output.
Thank you to Thomas for allowing me to share his results.
Thanks Simon. Interesting methods. For the CPYF, we used the blocking factor to speed up the copy way back when. I wonder if it still works?
Yes, you can still use control blocking with CPYF. I did when I was copying large files... now I think I should replace those with SQL. See here how to increase the blocking for CPYF.
I totally agree. Some tests we did a long time ago showed that SQL was in fact using blocking similar to the maximum you could get with OVRDBF. I even built a specific FCPYF command that forced the blocking, with the factor calculated, because the buffer size available today was not implemented back then. Using it made CPYF's timings much closer than without OVRDBF, but SQL was still always the best, as long as you did not need CPYF's functionality.
Part 2 includes the CPYF with blocking versus standard CPYF, and SQL. Read it here.
Thanks for sharing! I was in a discussion about this very thing a couple of days ago... frustratingly, not everyone I work with is on board with using SQL for things like this. Now I have some proof to show.
Good findings. Embedded SQL is faster than a native RPG database call. Thanks for sharing.
ReplyDeleteInteressante
Besides the performance aspect, a valid reason to use the CPYF command is when you have identity columns in the table that are referenced as foreign keys in other tables. This is the only way to be 100% sure that the copied records have the same values in the source and target tables.
ReplyDeleteThanks for sharing.
ReplyDeleteThanks for sharing
ReplyDeleteThanks Simon
ReplyDeleteThanks Simon for this one.
ReplyDeleteSanjay