As we learned at school there are three types of averages:
- Mean: Which is what people when they use the word "average" mean (pun is intended). It is calculated by adding several values together and dividing them by the total number of values.
- Median: The midpoint of a sequence of numbers. This number may or may not be a value in the sequence.
- Mode: The most frequent number of the set of data.
Fortunately Db2 for i gives us scalar functions for the first two, and we can easily determine the third.
Let me start with my list of values, that are all taken from field FLD1 in the file TESTFILE.
01 SELECT * 02 FROM TESTFILE 03 ORDER BY FLD1 |
The results are as follows:
FLD1 ---- 1 3 3 5 5 5 5 6 8 10 |
Mean
The mean is easily calculated by using the AVG scalar function. This function has two parameters:
- ALL/DISTINCT: This parameter is optional, with the default value of ALL.
- ALL = Calculate the mean for all of the values
- DISTINCT = Calculate the mean for a distinct set of the values
- Expression: In this case it will be FLD1 which is the name of the field. This is required.
What do I mean by a "distinct set of values"? If I use the Select with Distinct you will see what I mean:
01 SELECT DISTINCT * 02 FROM TESTFILE 03 ORDER BY FLD1 |
As you can see below the DISTINCT means that only unique numbers are returned.
FLD1 ---- 1 3 5 6 8 10 |
Let me take this and put it into one Select statement:
01 SELECT AVG(FLD1) AS "Default", 02 AVG(ALL FLD1) AS "All", 03 AVG(DISTINCT FLD1) AS "Distinct" 04 FROM TESTFILE |
Line 1: This is the "default" for the AVG function.
Line 2: The ALL as the first parameter in the AVG will return the same result as if I had omitted it.
Line 3: Calculates the mean over the Distinct values.
Results are:
Default All Distinct ----------------------- ----------------------- ----------------------- 5.100000000000000000000 5.100000000000000000000 5.500000000000000000000 |
I am sure you will agree that these results have way too many decimal places to be practical. I just want two decimal places.
01 SELECT CAST(ROUND(AVG(FLD1),2) AS NUMERIC(10,2)) AS "Mean" 02 FROM TESTFILE |
To get the result in a column with only two decimal places I have Cast the column to only have two decimal places. I have used the ROUND to round the results to just decimal places. This is not a big deal with these results but could be a big deal with a different set of values.
Mean ----- 5.10 |
Median
Median is the midpoint between the highest and lowest numbers. With the data from my example TESTFILE I can calculated the median by using the Median scalar function:
01 SELECT MEDIAN(FLD1) AS "Median" 02 FROM TESTFILE |
The returned result is:
Median ------ 5.0 |
The result may confuse as there are several values of 5 in the file.
Let's look into this a bit deeper, I have a second file, TESTFILE2:
01 SELECT FLD1 02 FROM TESTFILE2 03 ORDER BY 1 |
The results show there are just two records in the file:
FLD1 ---- 1 100 |
When I use the Median function over those values…
01 SELECT MEDIAN(FLD1) AS "Median" 02 FROM TESTFILE2 |
The result is not one of the two values, it is the midpoint between the two.
Median ------ 50.5 |
This is not a value in the file.
Mode
Mode is the most frequent value. But there can be multiple values that are equally frequent. I can use this statement to get a count of the occurrences of each value in TESTFILE:
01 SELECT FLD1,COUNT(*) AS "Count" 02 FROM TESTFILE 03 GROUP BY FLD1 04 ORDER BY COUNT(*) DESC |
The results show that the value of 5 is the most popular:
FLD1 Count ---- ----- 5 4 3 2 8 1 1 1 10 1 6 1 |
If I use the same SQL statement for the file TESTFILE2, changing line 3:
03 FROM TESTFILE2 |
I get two values with the same count:
FLD1 COUNT ---- ----- 1 1 100 1 |
Therefore, the mode for TESTFILE2 is both 1 and 100.
If I just want to create a statement that returns just the mode value I need to get a bit more creative using a Common Table Expression, CTE.
01 WITH T1(FLD1,"COUNT") AS 02 (SELECT FLD1,COUNT(*) 03 FROM TESTFILE 04 GROUP BY FLD1), 05 T2(FLD1,"COUNT","RANK") AS 06 (SELECT T1.*, 07 DENSE_RANK() OVER(ORDER BY "COUNT" DESC) AS "RANK" 08 FROM T1) 09 SELECT * FROM T2 10 WHERE "RANK" = 1 11 ORDER BY FLD1 ; |
Lines 1 – 4: The first part of the CTE creates a virtual table in memory that contains a list of the values and the number of times they occur in the file TESTFILE. As I have called the column that will contain the value from the COUNT() function COUNT I have to enclose it in quotes so that the SQL engine will realize that it is the column's name and not a misplaced count function.
Lines 5 – 8: In the second part I am assigning a rank value to every column in the virtual table T1. I am using the DENSE_RANK to give a rank value to each of the rows in T1. I have ordered the rank in descending order so that the most frequent value(s) will be ranked first.
Lines 9 – 11: The final statement returns the results that are returned. I only want to return any result row that is ranked number 1.
The result for TESTFILE returns to me a single result, which is what I expect:
FLD1 COUNT RANK ---- ----- ---- 5 4 1 |
For TESTFILE2 I need to only change line 3.
03 FROM TESTFILE2 |
For this file two results are returned as they are both valid values for the mode:
FLD1 COUNT RANK ---- ----- ---- 1 1 1 100 1 1 |
You can learn more about this from the IBM website:
This article was written for IBM i 7.4, and should work for some earlier releases too.
No comments:
Post a Comment
To prevent "comment spam" all comments are moderated.
Learn about this website's comments policy here.
Some people have reported that they cannot post a comment using certain computers and browsers. If this is you feel free to use the Contact Form to send me the comment and I will post it for you, please include the title of the post so I know which one to post the comment to.