In recent years many of us have been dealing with data in different character sets, not just the default CCSID of the partition we are using. The most common of these I encounter is UTF8.
UTF8 can contain double byte characters, which take two bytes per character, as opposed to the standard single byte characters (some UTF8 characters need three or four bytes, but the ones in my examples all take two). If I am using a variable I have defined as UTF8 it is difficult to calculate the number of characters within it, as the double byte characters result in an over count of the number of characters present.
Among the additions to the RPG language in IBM i 7.5 Technology Refresh 1 and 7.4 TR7 are several features that make it possible to get a real character count from a UTF8 variable.
I am going to show several example programs to demonstrate how these new features work. Let me start with what I consider the most basic way. That is not a bad thing, I just mean that IMHO it is the easiest. I am going to break this program into parts so it will be easy to explain what is happening. Let me start at the "top":
01  **free
02  dcl-s String1 varchar(100) ccsid(*utf8) ;
03  dcl-s String2 varchar(100) ;
04  dcl-s Length uns(5) ;
05  dcl-c DoubleByte const('áâãåæçèéêëìíîïñòóôõøùúûý') ;
06  dcl-c SingleByte const('aaaaaceeeeiiiinooooouuuy') ;
07  String1 = SingleByte ;
08  Length = %len(String1) ;
09  dsply ('1. ' + %char(Length)) ;
Line 1: Why would anyone not code in totally free RPG?
Line 2: I have defined this variable as variable length character, VARCHAR, and with a CCSID of UTF8.
Line 3: This variable is also VARCHAR. As I have not given it a CCSID it will use the partition's default CCSID.
Line 4: This variable is defined as an unsigned integer. As it is going to contain the count of characters in a variable the number can only be zero or greater.
Line 5: This constant contains characters that are double byte when in a UTF8 variable.
Line 6: Single byte equivalents.
Line 7: I move the value in the constant of single byte characters to the UTF8 variable.
Line 8: I am using the %LEN built in function, BiF, under the assumption that the length of the variable is the number of characters it contains. As the variable is defined as VARCHAR I don't have to trim it as I would with a fixed length character variable.
Line 9: I use the display operation code to display the number returned by the %LEN BiF. I have used the %CHAR BiF to convert the number to a character version of the number.
DSPLY 1. 24
The value 24 is displayed, which is both the length of the variable and the number of characters it contains.
Next piece of the code is:
10  String1 = DoubleByte ;
11  Length = %len(String1) ;
12  dsply ('2. ' + %char(Length)) ;
Line 10: Here I am moving the characters that will be double byte characters into the UTF8 variable.
Line 11: I use %LEN to determine the length of String1.
DSPLY 2. 48
Notice that the result is 48, which is double the count of characters. This is because each of them is a double byte character in UTF8.
Next piece of code introduces us to a new BiF.
13  Length = %charcount(String1) ;
14  dsply ('3. ' + %char(Length)) ;
Line 13: The %CHARCOUNT BiF is what is new here. It returns the count of characters rather than the count of bytes.
DSPLY 3. 24
The result is 24 as there are 24 double byte characters in the variable.
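For anyone who wants to check the arithmetic away from an IBM i partition, here is a minimal sketch in Python (not RPG) using the same two constants. The function names byte_length and char_count are my own, made up for this illustration. I am assuming that counting Unicode code points gives the same answer as %CHARCOUNT for these particular strings, and that byte_length plays the role of %LEN on the UTF8 variable.

```python
import unicodedata

# The same constants as the RPG program. NFC normalization keeps each
# accented letter as one precomposed code point.
DOUBLE_BYTE = unicodedata.normalize("NFC", "áâãåæçèéêëìíîïñòóôõøùúûý")
SINGLE_BYTE = "aaaaaceeeeiiiinooooouuuy"


def byte_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def char_count(text: str) -> int:
    """Number of characters (Unicode code points) in the text."""
    return len(text)


print(byte_length(SINGLE_BYTE), char_count(SINGLE_BYTE))  # 24 24
print(byte_length(DOUBLE_BYTE), char_count(DOUBLE_BYTE))  # 48 24
```

The byte length doubles for the accented characters while the character count stays at 24, which matches the three DSPLY results above.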
The next problem I encounter with double byte characters is when I try to substring their contents. This program shows how I accurately get the data I want, regardless of whether the characters are double byte. I am going to break this program into parts again to make it easier to understand.
The goal of the program is to extract the first two characters from the variable String1.
01  **free
02  ctl-opt charcounttypes(*utf8) ;
03  dcl-s String1 varchar(100) ccsid(*utf8) ;
04  dcl-s String3 char(20) ;
05  dcl-c DoubleByte const('áâãåæçèéêëìíîïñòóôõøùúûý') ;
06  String1 = DoubleByte ;
07  String3 = %subst(String1 : 1 : 2) ;
08  dsply ('4. ' + String3) ;
Line 2: This is a new control option, CHARCOUNTTYPES. It gives the data types, in this case UTF8, that the natural character count applies to. I must have this present to have the lines with the new features in them compile.
Line 4: This is the variable I will be using for the output from the %SUBST BiF.
Line 7: I am substringing the first two positions from String1 into String3.
Line 8: I am using the display operation code to display the contents of String3, which displays:
DSPLY 4. á
Only one character was returned because, being a double byte character, it occupied the first two positions of String1.
09  String3 = %subst(String1 : 1 : 2 : *stdcharsize) ;
10  dsply ('5. ' + String3) ;
Line 9: The %SUBST BiF has a new fourth parameter. There are two possible values:
- *STDCHARSIZE: Uses the standard character size, which counts each double byte character as two bytes.
- *NATURAL: Uses the natural character size, which counts each double byte character as one character.
As I have used the standard character size option my result is:
DSPLY 5. á
Just a single character, as the standard character size regards it as two bytes.
Next example uses the *NATURAL option.
11  String3 = %subst(String1 : 1 : 2 : *natural) ;
12  dsply ('6. ' + String3) ;
This returned two characters, as it treated each character as a character rather than a byte.
DSPLY 6. áâ
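The difference between the two counting modes can be mimicked outside of RPG. This is a rough sketch in Python, not RPG, and the helper names first_bytes and first_chars are hypothetical ones I made up for it. I am assuming first_bytes behaves like a %SUBST that counts bytes, and first_chars like a %SUBST that counts characters.

```python
import unicodedata

# The same constant as the RPG program, kept as precomposed characters.
DOUBLE_BYTE = unicodedata.normalize("NFC", "áâãåæçèéêëìíîïñòóôõøùúûý")


def first_bytes(text: str, count: int) -> str:
    """Take the first `count` bytes of the UTF-8 encoding of the text."""
    raw = text.encode("utf-8")[:count]
    # decode() raises UnicodeDecodeError if the cut lands inside a character
    return raw.decode("utf-8")


def first_chars(text: str, count: int) -> str:
    """Take the first `count` characters of the text."""
    return text[:count]


print(first_bytes(DOUBLE_BYTE, 2))  # á
print(first_chars(DOUBLE_BYTE, 2))  # áâ
```

Two bytes hold only the first accented character, while two characters give me what I was really after.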
A new compiler directive has also been added to RPG, /CHARCOUNT. This allows me to set which type of character count I want to use.
In this example I want to use the natural character count.
13  /charcount natural
14  String3 = %subst(String1 : 1 : 2) ;
15  dsply ('7. ' + String3) ;
Line 13: Here is the new compiler directive stating that all substrings will use the natural character count from this point forward.
Line 14: I do not have to give the character count type I want to use in the %SUBST BiF.
The result is two characters, which is what I expected.
DSPLY 7. áâ
I can use the compiler directive to set it to the other counting method.
16  /charcount stdcharsize
17  String3 = %subst(String1 : 1 : 2) ;
18  dsply ('8. ' + String3) ;
Line 16: I use the CHARCOUNT compiler directive to set the count method to standard character size from here onwards.
The result is that only one character is returned, as the standard count treats each double byte character as two characters.
DSPLY 8. á
There is one more method I can use for setting the type of character count, a new control option CHARCOUNT. This only has two allowed values:
- *NATURAL
- *STDCHARSIZE
This is what a program would look like using this new control option:
01  **free
02  ctl-opt charcounttypes(*utf8) charcount(*natural) ;
03  dcl-s String1 varchar(100) ccsid(*utf8) ;
04  dcl-s String3 char(20) ;
05  dcl-c DoubleByte const('áâãåæçèéêëìíîïñòóôõøùúûý') ;
06  String1 = DoubleByte ;
07  String3 = %subst(String1 : 1 : 2) ;
08  dsply ('9. ' + String3) ;
Line 2: For the CHARCOUNT option to be used I also need a CHARCOUNTTYPES. In this program the character count option is for a natural count.
Line 7: Just a normal substring statement.
The result shows that the natural character count was used and there are two characters in String3.
DSPLY 9. áâ
Now using the standard character size count:
01  **free
02  ctl-opt charcounttypes(*utf8) charcount(*stdcharsize) ;
03  dcl-s String1 varchar(100) ccsid(*utf8) ;
04  dcl-s String3 char(20) ;
05  dcl-c DoubleByte const('áâãåæçèéêëìíîïñòóôõøùúûý') ;
06  String1 = DoubleByte ;
07  String3 = %subst(String1 : 1 : 2) ;
08  dsply ('10. ' + String3) ;
Line 2: The CHARCOUNT control option is set to the standard character size.
The result, as expected, is a single character.
DSPLY 10. á
We now have a plethora of ways of being able to perform a count of the characters in the UTF8 variable. Which do you think you would use?
You can learn more about this from the IBM website:
- %CHARCOUNT BiF
- /CHARCOUNT compiler directive
- CHARCOUNT control option
- Processing string data by the natural size of each character
- CHARCOUNTTYPES control option
This article was written for IBM i 7.5 TR1 and 7.4 TR7.
Since UTF-8 can have 3 or 4 bytes per character, it would be interesting to get some of those into a test string. I think an ellipsis or em dash are a couple of those.
Or just about any Unicode emoji character.
UTF8 is a variable-length encoding, a "character"/codepoint can have from 1 to 4 bytes.
It is surely a good format for interop.
But as a side effect you *cannot* know or guarantee the number of characters that can be stored in a field (because it will change according to the script used).
Depending on the application, it can be a problem, I would personally resort to a fixed sized character - say, every character is two bytes fixed, this will support a vast range of scripts - for db storage and representation.
Is there any sensible reasoning behind that overly complex implementation?
I can see the benefit of switching %subst from byte to char. But why should not everything else use just charcount(*natural) as a default? Ok, I can see problems with backwards compatibility, so then maybe make charcount(*natural) optional. But what sense does it make to have charcounttypes even an option?
Any reason why a newly written program would not use:
charcount(*natural);
charcounttypes(*all); // this will not compile