Calculate different types of averages using SQL @ RPGPGM.COM

As we learned at school there are three types of averages:

Mean: Which is what people when they use the word "average" mean (pun is intended). It is calculated by adding several values together and dividing them by the total number of values.
Median: The midpoint of a sequence of numbers. This number may or may not be a value in the sequence.
Mode: The most frequent number of the set of data.

Fortunately Db2 for i gives us scalar functions for the first two, and we can easily determine the third.

Let me start with my list of values, that are all taken from field FLD1 in the file TESTFILE.

01  SELECT * 
02    FROM TESTFILE 
03   ORDER BY FLD1

The results are as follows:

Mean

The mean is easily calculated by using the AVG scalar function. This function has two parameters:

ALL/DISTINCT: This parameter is optional, with the default value of ALL.
- ALL = Calculate the mean for all of the values
- DISTINCT = Calculate the mean for a distinct set of the values
Expression: In this case it will be FLD1 which is the name of the field. This is required.

What do I mean by a "distinct set of values"? If I use the Select with Distinct you will see what I mean:

01  SELECT DISTINCT * 
02    FROM TESTFILE
03   ORDER BY FLD1

As you can see below the DISTINCT means that only unique numbers are returned.

Let me take this and put it into one Select statement:

01  SELECT AVG(FLD1) AS "Default",
02         AVG(ALL FLD1) AS "All",
03         AVG(DISTINCT FLD1) AS "Distinct"
04    FROM TESTFILE

Line 1: This is the "default" for the AVG function.

Line 2: The ALL as the first parameter in the AVG will return the same result as if I had omitted it.

Line 3: Calculates the mean over the Distinct values.

Results are:

Default                 All                     Distinct
----------------------- ----------------------- -----------------------
5.100000000000000000000 5.100000000000000000000 5.500000000000000000000

I am sure you will agree that these results have way too many decimal places to be practical. I just want two decimal places.

01  SELECT CAST(ROUND(AVG(FLD1),2) AS NUMERIC(10,2)) AS "Mean"
02    FROM TESTFILE

To get the result in a column with only two decimal places I have Cast the column to only have two decimal places. I have used the ROUND to round the results to just decimal places. This is not a big deal with these results but could be a big deal with a different set of values.

Mean
-----
 5.10

Median

Median is the midpoint between the highest and lowest numbers. With the data from my example TESTFILE I can calculated the median by using the Median scalar function:

01  SELECT MEDIAN(FLD1) AS "Median"
02    FROM TESTFILE

The returned result is:

Median
------
5.0

The result may confuse as there are several values of 5 in the file.

Let's look into this a bit deeper, I have a second file, TESTFILE2:

01  SELECT FLD1 
02    FROM TESTFILE2 
03   ORDER BY 1

The results show there are just two records in the file:

FLD1
----
   1
 100

When I use the Median function over those values…

01  SELECT MEDIAN(FLD1) AS "Median" 
02    FROM TESTFILE2

The result is not one of the two values, it is the midpoint between the two.

Median
------
  50.5

This is not a value in the file.

Mode

Mode is the most frequent value. But there can be multiple values that are equally frequent. I can use this statement to get a count of the occurrences of each value in TESTFILE:

01  SELECT FLD1,COUNT(*) AS "Count"
02    FROM TESTFILE
03   GROUP BY FLD1
04   ORDER BY COUNT(*) DESC

The results show that the value of 5 is the most popular:

FLD1  Count
----  -----
   5      4
   3      2
   8      1
   1      1
  10      1
   6      1

If I use the same SQL statement for the file TESTFILE2, changing line 3:

03     FROM TESTFILE2

I get two values with the same count:

FLD1  COUNT
----  -----
   1      1
 100      1

Therefore, the mode for TESTFILE2 is both 1 and 100.

If I just want to create a statement that returns just the mode value I need to get a bit more creative using a Common Table Expression, CTE.

01  WITH T1(FLD1,"COUNT") AS
02  (SELECT FLD1,COUNT(*)
03     FROM TESTFILE
04    GROUP BY FLD1),

05  T2(FLD1,"COUNT","RANK") AS
06  (SELECT T1.*,
07          DENSE_RANK() 
              OVER(ORDER BY "COUNT" DESC) 
              AS "RANK" 
08     FROM T1)

09  SELECT * FROM T2 
10   WHERE "RANK" = 1 
11   ORDER BY FLD1 ;

Lines 1 – 4: The first part of the CTE creates a virtual table in memory that contains a list of the values and the number of times they occur in the file TESTFILE. As I have called the column that will contain the value from the COUNT() function COUNT I have to enclose it in quotes so that the SQL engine will realize that it is the column's name and not a misplaced count function.

Lines 5 – 8: In the second part I am assigning a rank value to every column in the virtual table T1. I am using the DENSE_RANK to give a rank value to each of the rows in T1. I have ordered the rank in descending order so that the most frequent value(s) will be ranked first.

Lines 9 – 11: The final statement returns the results that are returned. I only want to return any result row that is ranked number 1.

The result for TESTFILE returns to me a single result, which is what I expect:

FLD1  COUNT  RANK
----  -----  ----
   5      4     1

For TESTFILE2 I need to only change line 3.

03     FROM TESTFILE2

For this file two results are returned as they are both valid values for the mode:

FLD1  COUNT  RANK
----  -----  ----
   1      1     1
 100      1     1

You can learn more about this from the IBM website:

AVG SQL scalar function
MEDIAN SQL scalar function

This article was written for IBM i 7.4, and should work for some earlier releases too.

RPGPGM.COM - From AS400 to IBM i

Pages

Wednesday, January 12, 2022