This post is based on the presentation Thomas Leo Swint made to the Central Texas IBM i User Group last night. Thomas performed a series of tests to find the fastest method to "read" several million rows of data, and also to "read" the data from one table and output it to another table. When he shared his results with me, I was surprised by what he had discovered. I asked him to make a presentation to the CTXiUG, and for his permission to share his results here.
Thomas needed to copy millions of rows of data from one set of SQL tables to another, and the process needed to take the least amount of time possible. He created four test tables with their character columns defined in different ways. He compared the time taken to "copy" rows from each of these tables to another table using:
- A Copy File command, CPYF, embedded in an RPG program and executed by calling SQL's QCMDEXC procedure
- An RPG program with a SQL cursor performing a single row fetch
- An RPG program with a SQL cursor performing multiple row fetches
But before I start showing the code of these programs, I want to show the four different tables. All of the code is his; all I have done is remove extra lines that are not necessary for showing how this all works.
In his first table, which I am calling TCHAR, the columns are all defined as CHAR:
01  CREATE TABLE TLSLIB.TCHAR (
02    MSGID CHAR(7) CCSID 37 NOT NULL DEFAULT '' ,
03    FROMJOB CHAR(28) CCSID 37 NOT NULL DEFAULT '' ,
04    MSG_TEXT CHAR(200) CCSID 37 NOT NULL DEFAULT '' ,
05    MSG_TEXT2 CHAR(2000) CCSID 37 NOT NULL DEFAULT '' ,
06    MSG_TOKEN CHAR(500) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TCHARR ;
The concern with having large fixed length character columns is that the contents of these columns are often not that long, which wastes DASD space. To overcome this waste of space he created another table, which I am calling TVARCHAR, where the columns are defined as variable length character, VARCHAR:
01  CREATE TABLE TLSLIB.TVARCHAR (
02    MSGID VARCHAR(7) CCSID 37 NOT NULL DEFAULT '' ,
03    FROMJOB VARCHAR(28) CCSID 37 NOT NULL DEFAULT '' ,
04    MSG_TEXT VARCHAR(200) CCSID 37 NOT NULL DEFAULT '' ,
05    MSG_TEXT2 VARCHAR(2000) CCSID 37 NOT NULL DEFAULT '' ,
06    MSG_TOKEN VARCHAR(500) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVARCHARR ;
There is a potential performance issue when using columns defined as VARCHAR: when data is fetched from them, two input/output operations can occur instead of one. This is nicely explained in this IBM Support document, here. I have copied the relevant section from that document below:
Q1: If I insert only 50, 30, or 80 bytes into a VARCHAR column (defined as 254 with ALLOCATE(0) ), will I incur two I/Os?
A1: Yes, it will be two I/Os because ALLOCATE(0) was specified. To avoid an extra I/O, use ALLOCATE(30) or (50). We suggest allocating enough so most of your input will fit and only the long ones will overflow the allocated part.
How much space do you allocate for each VARCHAR column?
Thomas found that the consensus was that the allocated space should be big enough to accommodate 80% of the data in that column, not 80% of the column's size. I am not going to show how the 80% size was calculated; that will be the subject of a future post.
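To make the idea concrete, a query along these lines could find the smallest length that covers at least 80% of the values in a column, here MSG_TEXT. This is only a sketch of one possible approach, not necessarily the calculation Thomas used, and the CUME_DIST OLAP function it relies on needs IBM i 7.3 or later:

  -- Sketch only: find the smallest MSG_TEXT length that covers at least
  -- 80% of the rows, which is a candidate ALLOCATE value for that column
  SELECT MIN(Len) AS Allocate_80
    FROM (SELECT LENGTH(RTRIM(MSG_TEXT)) AS Len,
                 CUME_DIST() OVER (ORDER BY LENGTH(RTRIM(MSG_TEXT))) AS Pct
            FROM TLSLIB.TCHAR) AS T
   WHERE Pct >= 0.8 ;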
The file with the 80% allocated columns, which I am calling TVAR80, looks like this:
01  CREATE TABLE TLSLIB.TVAR80 (
02    MSGID VARCHAR(7) ALLOCATE(7) CCSID 37 NOT NULL DEFAULT '' ,
03    FROMJOB VARCHAR(28) ALLOCATE(20) CCSID 37 NOT NULL DEFAULT '' ,
04    MSG_TEXT VARCHAR(200) ALLOCATE(69) CCSID 37 NOT NULL DEFAULT '' ,
05    MSG_TEXT2 VARCHAR(2000) ALLOCATE(849) CCSID 37 NOT NULL DEFAULT '' ,
06    MSG_TOKEN VARCHAR(500) ALLOCATE(107) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVAR80R ;
The final table allocates 100% of each column's size. This was to see if the results would be the same as for the TCHAR table. I am calling this file TVAR100:
01  CREATE TABLE TLSLIB.TVAR100 (
02    MSGID VARCHAR(7) ALLOCATE(7) CCSID 37 NOT NULL DEFAULT '' ,
03    FROMJOB VARCHAR(28) ALLOCATE(28) CCSID 37 NOT NULL DEFAULT '' ,
04    MSG_TEXT VARCHAR(200) ALLOCATE(200) CCSID 37 NOT NULL DEFAULT '' ,
05    MSG_TEXT2 VARCHAR(2000) ALLOCATE(2000) CCSID 37 NOT NULL DEFAULT '' ,
06    MSG_TOKEN VARCHAR(500) ALLOCATE(500) CCSID 37 NOT NULL DEFAULT '' )
07  RCDFMT TVAR100R ;
Thomas performed three rounds of tests with different numbers of rows:
- 1 million rows
- 5 million rows
- 15 million rows
I am not going to show the results from all of those tests, just the results from the tests where he used 15 million rows.
As I mentioned above, he used three RPG programs, which I will show below. The code I am going to show is not the full programs, just what I would call the "interesting" parts of them. I am also only showing one version of each program; there was a version of each program for each of the four files.
The first RPG program used a CPYF command executed by SQL's QCMDEXC procedure:
01  dcl-s CopyCommand char(250) ;
02  dcl-s StrTS timestamp ;
03  dcl-s EndTS timestamp ;

04  CopyCommand = 'CPYF FROMFILE(TLSLIB/TCHAR) ' +
05                'TOFILE(QTEMP/OUTFILE) ' +
06                'CRTFILE(*YES) ' ;

07  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

08  Exec SQL CALL QSYS2.QCMDEXC(:CopyCommand) ;

09  Exec SQL SET :EndTS = CURRENT_TIMESTAMP ;

10  Exsr $WritetoLogFile ;
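The $WritetoLogFile subroutine is not shown in any of these extracts. As a rough sketch of the kind of thing it would do, the elapsed time can be calculated from the two timestamps and written to a log table. The Elapsed variable and the TESTLOG table below are my assumptions, they are not from Thomas's programs:

  dcl-s Elapsed packed(20 : 6) ;

  begsr $WritetoLogFile ;
    // %diff with *MSECONDS returns the number of microseconds between
    // the two timestamps, dividing by 1,000,000 gives seconds
    Elapsed = %diff(EndTS : StrTS : *mseconds) / 1000000 ;

    // TESTLOG is a hypothetical log table
    Exec SQL INSERT INTO TLSLIB.TESTLOG (RUN_TS, ELAPSED_SECONDS)
               VALUES(:StrTS, :Elapsed) ;
  endsr ;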
The next program uses a SQL cursor to perform a single row fetch to retrieve data from the table, and then uses a single row insert to add the data to the output table.
01  dcl-ds @DataIn extname('TLSLIB/TCHAR') qualified ;
02  end-ds ;
03  dcl-ds @TestFileOut extname('TLSLIB/TCHAROUT') qualified ;
04  end-ds ;
05  dcl-s StrTS timestamp ;
06  dcl-s EndTS timestamp ;
07  dcl-s @SqlSuccessful ind ;

08  Exec SQL Declare TestFile_Cursor Cursor For
09             select * from TLSLIB.TCHAR ;

10  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

11  Exec SQL Open TestFile_Cursor ;
12  @SqlSuccessful = (SQLCOD = 0) ;

13  DoW @SqlSuccessful ;
14    Exec SQL Fetch TestFile_Cursor Into :@DataIn ;
15    @SqlSuccessful = (SQLCOD <> 100) ;

16    If @SqlSuccessful ;
17      @TestFileOut = @DataIn ;
18      Exec SQL Insert into TLSLIB.TCHAROUT
19                 values( :@TestFileOut ) ;
20    EndIf ;
21  Enddo ;

22  Exec SQL Close TestFile_Cursor ;
23  Exec SQL SET :EndTS = CURRENT_TIMESTAMP ;

24  Exsr $WritetoLogFile ;
The third program uses a SQL cursor to perform a multiple row fetch, and then a multiple row insert to add the data to the output table.
01  dcl-ds @DataIn extname('TLSLIB/TCHAR') qualified dim(6000) ;
02  end-ds ;
03  dcl-ds @TestFileOut extname('TLSLIB/TCHAROUT') qualified ;
04  end-ds ;
05  dcl-ds gd qualified ;
06    RowsCount int(10) ;
07  end-ds ;
08  dcl-s StrTS timestamp ;
09  dcl-s EndTS timestamp ;
10  dcl-s @SqlSuccessful ind ;
11  dcl-s NbrOfRows int(5) inz(%elem(@DataIn)) ;
12  dcl-s p_RowsCount int(10) ;

13  Exec SQL Declare TestFile_Cursor Cursor For
14             select * from TLSLIB.TCHAR ;

15  Exec SQL SET :StrTS = CURRENT_TIMESTAMP ;

16  Exec SQL Open TestFile_Cursor ;
17  @SqlSuccessful = (SQLCOD = 0) ;

18  DoW @SqlSuccessful ;
19    Clear @DataIn ;

20    Exec SQL Fetch TestFile_Cursor
21               FOR :NbrOfRows ROWS Into :@DataIn ;
22    SQLCOD = SQLCOD ;
23    @SqlSuccessful = (SQLCOD <> 100) ;

24    If @SqlSuccessful ;
25      gd = GetSqlDiagnostics() ;
26      p_RowsCount = gd.RowsCount ;

27      Exec SQL
28        INSERT into TLSLIB.TCHAROUT :p_RowsCount ROWS
29          values( :@DataIn ) ;
30    EndIf ;
31  Enddo ;

32  Exec SQL Close TestFile_Cursor ;
33  Exec SQL SET :EndTS = CURRENT_TIMESTAMP ;

34  Exsr $WritetoLogFile ;
Line 1: The data structure array is defined with 6,000 elements. Defining it with a greater number of elements would have exceeded the maximum size allowed for the array.
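Line 25: The GetSqlDiagnostics procedure is part of the full program and is not shown here. I am assuming it wraps SQL's GET DIAGNOSTICS statement to return the number of rows retrieved by the multiple row fetch. A minimal sketch of such a procedure could look like this:

  // Sketch only: an assumed implementation of GetSqlDiagnostics
  dcl-proc GetSqlDiagnostics ;
    dcl-pi *n likeds(gd) ;
    end-pi ;

    dcl-ds Diagnostics likeds(gd) inz ;

    // ROW_COUNT gives the number of rows affected by the previous SQL
    // statement, here the number of rows returned by the fetch
    Exec SQL GET DIAGNOSTICS :Diagnostics.RowsCount = ROW_COUNT ;

    return Diagnostics ;
  end-proc ;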
I have broken out the results into a table, making it easy to compare the results for each type of table. Each program was run 10 times in different jobs. The times are the average of the 10 runs, and are in seconds.
Program | No. of tests | TCHAR | TVARCHAR | TVAR80 | TVAR100
SQL multi row | 10 | 150.780885 | 224.643632 | 271.863179 | 271.826870
CPYF | 10 | 446.536102 | 556.010700 | 708.202601 | 467.950940
SQL single row | 10 | 472.460167 | 620.534527 | 785.702390 | 774.258705
What conclusions can I draw from these results:
- Using a SQL multi row fetch is the fastest way to copy data from one table to another (I was surprised by how much faster it is compared to CPYF).
- Using VARCHAR is slower than just a straight CHAR (not surprised).
- Using the 80% allocation was the slowest of all copies (surprised).
- Only with the CPYF was the 100% allocation time close to the TCHAR number (surprised).
Why was RPG not used in these examples?
In earlier rounds of testing Thomas compared native RPG input, a SQL single row fetch, and a SQL multiple row fetch for input only. The RPG program was so much slower than the other two that it was not used in the next set of tests.
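By "RPG" here I mean native record level access, reading the table with a READ loop rather than with embedded SQL. Thomas's program is not shown in this post, but a minimal sketch of that style of input, reading the whole table in arrival sequence, would look something like this:

  // Sketch only, this is not Thomas's actual test program
  dcl-f TCHAR usage(*input) ;

  read TCHAR ;
  dow not %eof(TCHAR) ;
    // the fields from the record are now available to the program
    read TCHAR ;
  enddo ;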
Results for reading the table of 1 million rows:
Table | RPG | Single row fetch | Multiple row fetch
TCHAR | 56.258650 | 17.878956 | 7.920792
TVARCHAR | 56.335911 | 19.058225 | 9.734931
TVAR80 | 66.362741 | 21.619942 | 11.693264
These results reinforce my preference for using SQL multi row fetches for performing all kinds of database input and output.
Thank you to Thomas for allowing me to share his results.
Thanks Simon. Interesting methods. For the CPYF, we used the blocking factor to speed up the copy way back when. I wonder if it still works?
Yes, you can still use control blocking with CPYF. I did when I was copying large files... now I think I should replace those with SQL. See here how to increase the blocking for CPYF.
I totally agree. Some tests we did a long time ago showed that SQL was in fact using blocking similar to the maximum you could get with OVRDBF. I even built a specific FCPYF command that forced the blocking, with the factor calculated, because the buffer size available today was not implemented back then. Using it made CPYF's timings much closer than without OVRDBF, but SQL was still always the best, as long as you did not need CPYF's functionality.
Part 2 includes the CPYF with blocking versus standard CPYF, and SQL. Read it here.
Thanks for sharing! I was in a discussion about this very thing a couple of days ago... frustratingly, not everyone I work with is on board with using SQL for things like this. Now I have some proof to show.
Good findings. Embedded SQL is faster than a native RPG database call. Thanks for sharing.
ReplyDeleteInteressante
Besides the performance aspect, a valid reason to use the CPYF command is when you have identity columns in the table that are referenced as foreign keys in other tables. This is the only way to be 100% sure that the copied records have the same values in the source and target tables.
ReplyDeleteThanks for sharing.
ReplyDeleteThanks for sharing
ReplyDeleteThanks Simon
ReplyDeleteThanks Simon for this one.
ReplyDeleteSanjay