Wednesday, November 22, 2023

Program to capture CPU usage over time

I have been asked multiple times if there is an easy way to capture a list of jobs with the amount of CPU they, with a program. The answer is, of course, "Yes". I thought it was time to give an example of a program example of how I would do this.

My scenario is that I need to capture all the jobs in this partition with the CPU percentage they are using, every five minutes. This data needs to be captured in an output file, or in this example it's a table. The information I need is:

  • CPU percent
  • Job name
  • Subsystem and the library it is found in
  • User name
  • Job type – batch, interactive, prestart, etc.
  • Function and function type
  • Job status
  • CPU time
  • Time the data was captured

Fortunately all of this can be captured from the ACTIVE_JOB_INFO table function.

First I need to create the output file. As I said before this will be a SQL DDL table. I can create a Table based on another using SQL, in this situation based on the columns returned from ACTIVE_JOB_INFO:

01  CREATE TABLE MYLIB.ACTIVE_JOB_HISTORY
02     FOR SYSTEM NAME "ACTJOBHIST"
03     (RUN_TIME,CPU_PERCENT,JOB_NAME,SUBSYSTEM,SUBSYSTEM_LIBRARY,
04      USER_NAME,JOB_TYPE,FUNCTION_TYPE,FUNCTION,JOB_STATUS,CPU_TIME)
05   AS
06   (SELECT CURRENT_TIMESTAMP,ELAPSED_CPU_PERCENTAGE,JOB_NAME,SUBSYSTEM,
07           SUBSYSTEM_LIBRARY_NAME,AUTHORIZATION_NAME,JOB_TYPE,FUNCTION_TYPE,
08           FUNCTION,JOB_STATUS,CPU_TIME
09      FROM TABLE(QSYS2.ACTIVE_JOB_INFO()))
10  DEFINITION ONLY ;

Lines 1 and 2: Create this table with a long SQL name and a short system name.

Lines 3 and 4: List of what the columns will be called in this new Table.

Lines 6 – 9: SQL statement to capture the columns of information I want from ACTIVE_JOB_INFO. The only exception is the first column, CURRENT_TIMESTAMP.

Line 10: This means that the Table is just created, no data is written to it.

Now I have my output file I need a program to fill it:

01  **free
02  ctl-opt main(Main) option(*nodebugio:*srcstmt) dftactgrp(*no) 
              alwnull(*usrctl) ;

03  dcl-proc Main ;
04    dcl-pr sleep int(10) extproc('sleep') ;
05      *n uns(10) value ;
06    end-pr ;
07    dcl-s Seconds int(10) inz(300) ; // 300 secs = 5 mins

08    dcl-s CurrentTime timestamp ;
09    dcl-s MessageText char(512) ;		
10    dcl-s Nothing char(1) ;

11    MessageText = 'ACTIVE_JOB started ' ;
12    exec sql CALL QSYS2.SEND_MESSAGE('CPF9898',512,:MessageText,'*LIBL','QCPFMSG') ;

13    dow (*on) ;
14      CurrentTime = %timestamp() ;

15      exec sql INSERT INTO ACTIVE_JOB_HISTORY
16                 SELECT :CurrentTime,ELAPSED_CPU_PERCENTAGE,JOB_NAME,SUBSYSTEM,
17                        SUBSYSTEM_LIBRARY_NAME,AUTHORIZATION_NAME,JOB_TYPE,FUNCTION_TYPE,
18                        FUNCTION,JOB_STATUS,CPU_TIME
19                   FROM TABLE(QSYS2.ACTIVE_JOB_INFO())
20                  WHERE ELAPSED_CPU_PERCENTAGE > 0
21                  ORDER BY ELAPSED_CPU_PERCENTAGE DESC ;

     // Reset stats
22      exec sql SELECT '1' INTO :Nothing
                   FROM TABLE(QSYS2.ACTIVE_JOB_INFO(RESET_STATISTICS => 'YES')) LIMIT 1 ;

23      sleep(Seconds) ;
24    enddo ;

25  on-exit ;
26    MessageText = 'ACTIVE_JOB ended ' ;
27    exec sql CALL QSYS2.SEND_MESSAGE('CPF9898',512,:MessageText,'*LIBL','QCPFMSG') ;
26  end-proc ;

Line 1: All my new code is always totally free RPG.

Line 2: My favorite control options. I also have the MAIN option as I will be using a Main procedure.

Line 3: Start of the Main procedure.

Lines 4 – 6: Procedure prototype of the external C procedure sleep. This allows us to "sleep", pause, the program for a number of seconds.

Line 7: Definition of the variable that will contain the number of seconds to sleep. It must be defined as a ten long integer. Here I am initializing the variable with 300 seconds, or 5 minutes.

Lines 8 – 10: Other variables, whose purpose I will describe later.

Lines 11 and 12: I want to send a message to the System Operator message queue that this job is starting. I am going to call this job and this program ACTIVE_JOB. I can add a message to the System Operator message queue using the SEND_MESSAGE SQL procedure. I need to pass this procedure the following parameters:

  1. Message id: I am using the message CPF9898 as this message has not default text, just the substitution parameter
  2. Message length: The substitution parameter is 512. If I pass just the length of the string for the message the rest of the message is filled with strange characters. If I use a parameter of 512 then I do not get those, and just the string I desire.
  3. Message text: This is where I am using the variable MessageText. On line 11 I move the string I want the message to be into the variable
  4. Message file library
  5. Message file

Line 13: Start of a never ending Do loop.

Line 14: I thought about using SQL's CURRENT_TIMESTAMP for the run time column in the output file. The danger is that the insert could take several microseconds which could result in records from the same "snapshot" having different timestamps. If I capture the timestamp in a variable I can use this variable for the run time for all the records of this "snapshot".

Lines 15 – 21: The Insert statement to add data to my output file. Things to notice are:

Line 16: I am using the RPG variable for the run time, therefore, it has to be prefixed with a colon so that SQL knows it is an RPG variable.

Line 20: I am only selecting rows from the Table function where the CPU percent is greater than zero.

Line 21: The order in which the data is returned from the Table function, and then inserted into the output file, is in CPU percent in descending order. In other words the job that is using the most CPU is the first to be inserted.

Line 22: I want my CPU percentage to be calculated for the time between this capture and the next, in five minutes time. Therefore, I need to reset the statistics. By using the RESET_STATISTICS parameter I tell the ACTIVE_JOB_INFO to reset itself.

Line 23: The program sleeps.

Line 24: End of the never ending Do loop.

I am sure you have noticed that there is no way to exit the Do loop, it just keeps going. The only way to end the program is to end the job. This is why I used a Main procedure. If I end the job in a cycle RPG program the program just ends. If a procedure ends, whether in a normal exit or when the job is ended the code in the ON-EXIT section is executed.

Line 25: Start of the ON-EXIT section, which will be executed when I cancel this job.

Lines 26 - 27: I write the message to the System Operator message queue that the job has ended.

I compile the above to create the program ACTIVE_JOB.

I can now submit it to batch so it can run and gather this information for me:

SBMJOB CMD(CALL PGM(MYLIB/ACTIVE_JOB)) 
         JOB(ACTIVE_JOB) 
         JOBQ(QS36EVOKE)

Notice the job queue I have used, QS36EVOKE. Why did I use this job queue rather than one of the QBATCH* job queues?

Most people have their QBATCH* job queues to have only one active job at a time. As this job is never ending it means that in a "single threaded" job queue all the jobs submitted later would just queue up behind it waiting for this job to finish.

The job queue Q36EVOKE still submits the job to the QBATCH subsystem. But it has a no maximum number of jobs it can have running in the subsystem. By using Q36EVOKE I know that the job ACTIVE_JOB will not hold any others up.

When I end the job ACTIVE_JOB I can look in the system operator message queue to see when it started and ended.

Here I am using the MESSAGE_QUEUE_INFO the return the messages from any job that has the pattern "ACTIVE_JOB" in its job name:

01  SELECT MSG_TIME AS TIME,MSGID AS MSGID,MSG_TYPE AS TYPE,
02         CAST(MSG_TEXT AS CHAR(100) CCSID 37) AS TEXT
03    FROM QSYS2.MESSAGE_QUEUE_INFO
04   WHERE MSGQ_NAME = 'QSYSOPR'
05     AND FROM_JOB LIKE '%ACTIVE_JOB%'

The results are:


TIME                        MSGID    TYPE           TEXT
--------------------------  -------  -------------  ----------------------------------------
2023-MM-DD 19:32:39.332633  CPF9898  INFORMATIONAL  ??ACTIVE_JOB started
2023-MM-DD 19:32:54.592587  CPC1125  COMPLETION     Job 999999/SIMON/ACTIVE_JOB was ended...
2023-MM-DD 19:32:54.609828  CPF9898  INFORMATIONAL  ??ACTIVE_JOB ended

Notice that the completion message comes before the job ended message. This shows that I ended the job, and the ON-EXIT group was executed afterwards.

Before I show the results I need to explain something. The first time the Do loop is executed no time has elapsed, therefore, no CPU has been used and there are no results to insert. When the Do loop loops the second time, time has elapsed and there is now data to insert into the output file.

The first five results from my testing looks like:

                            CPU_                                        SUBSYSTEM
RUN_TIME                    PERCENT  JOB_NAME                 SUBSYSTEM  _LIBRARY
--------------------------  -------  -----------------------  ---------  ---------
2023-MM-DD 21:32:48.386692    44.33  833496/USER_1/SOME_JOB   QBATCH     QSYS
2023-MM-DD 21:32:48.386692     0.98  886083/QUSER/QZDASOINIT  QUSRWRK    QSYS
2023-MM-DD 21:32:48.386692     0.17  872079/QUSER/QZRCSRVS    QUSRWRK    QSYS
2023-MM-DD 21:32:48.386692     0.16  771163/QUSER/QZRCSRVS    QUSRWRK    QSYS
2023-MM-DD 21:32:48.386692     0.16  872084/QUSER/QZRCSRVS    QUSRWRK    QSYS


USER_   JOB_  FUNCTION            JOB_    CPU_
NAME    TYPE  _TYPE     FUNCTION  STATUS  TIME
------  ----  --------  --------  ------  --------
USER_1  BCH   PGM       FUNCTION  RUN     13060167
QUSER   PJ    <NULL>    <NULL>    PSRW        3808
QUSER   PJ    <NULL>    <NULL>    TIMW       20562
USER_2  PJ    <NULL>    <NULL>    TIMW      169591
USER_3  PJ    <NULL>    <NULL>    RUN        19325

I left this program running for five days and found that it never used enough CPU for it to appear in the generated output.

There you go a simple program to gather what jobs are running in your partition, and how much of the CPU they use.

 

This article was written for IBM i 7.5, and should work for some earlier releases too.

1 comment:

  1. Instead of DoW (*On) I would suggest to write DoW not %Shtdn which allows the program to terminate nicely when performing a controlled end.

    ReplyDelete

To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.

Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.