A friend told me about the old database he had to work with. There was one file with a field that contained two pieces of information. Any leading characters, A – Z, meant one thing and the numbers within it meant something else. Unfortunately there could be any number of leading characters, the numbers could be three of four long, and there could be characters at the end. The examples he shared with me were:
- 'A123B'
- 'ABC123'
- 'AB4679'
- 'BC0123D'
His plan was to add a new numeric field to the file, and update it with the numbers extracted from this original field.
"What would be a simple way to do this with SQL?" he asked.
Before I answer that question I need a file and data within it. The DDS code for my test file is:
01 A R TESTFILER 02 A ORIGINAL 10A 03 A NBR_CODE 5P 0 |
The file contains two fields:
Line 4: The original alphanumeric field.
Line 5: The equivalent of his new numeric field.
I can use SQL to insert the example data he gave me into the file:
01 INSERT INTO TESTFILE (ORIGINAL) 02 VALUES('A123B'),('ABC123'),('AB4679'),('BC0123D') ; |
This statement just added values into the first field, ORIGINAL. I can check using the following SQL statement:
01 SELECT * FROM TESTFILE |
The results for the above statement is:
ORIGINAL NBR_CODE -------- -------- A123B 0 ABC123 0 AB4679 0 BC0123D 0 |
It is obvious that I will need to substring the "number" from the character strings. The Substring SQL scalar function requires three pieces of information:
01 SUBSTR(ORIGINAL, 02 < start of string >, 03 < length of string >) |
Line 1: The first piece of information is the field name we are substring the information out of, which is ORIGINAL.
Line 2: Start position of the "number" I want to substring out of the field. This could be in different places, therefore, I can use a hardcoded value.
Line 3: As the length of the "number" can vary too I cannot hardcode it too.
The best way I know to look for a range of characters in one statement is to use a Regular Expression, specifically REGEXP_INSTR. It will return to me the first position in the string the desired character(s) are found.
It has a number of parameters. The ones I will be using are:
- source-string: The variable, or string, that will be searched. In this example this will be 'ORIGINAL' as that is the field I am looking within.
- search-pattern: The pattern I am looking for in the source string.
- starting-position: Where to start searching for the pattern.
- occurrence: The occurrence I want to check string for the search pattern. In this scenario it will always be the first occurrence.
The first thing I need to find is the position first number in the field. For that I would use:
01 SELECT ORIGINAL, 02 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1) 03 FROM TESTFILE |
Line 1: I want to display the field I am searching for the pattern in.
Line 2: What this REGEXP_INSTR says is:
- ORIGINAL: I am looking within this field.
- '[0-9]': This does need to be enclosed within apostrophes/single quotes. I am enclosing within the square brackets ( [ ] ) I am looking to any character that is in the range of zero to nine.
- 1: Start searching in the first position.
- 1: Stop when you find the first occurrence of a character in the desired range.
The results are:
ORIGINAL 00002 -------- ------ A123B 2 ABC123 4 AB4679 3 BC0123D 3 |
Which has returned the first "numeric" character in the field.
The next thing I need to find is harder. Where is the next alphabetic character?
01 SELECT ORIGINAL, 02 REGEXP_INSTR(ORIGINAL,'[A-Z ]', 03 (REGEXP_INSTR(ORIGINAL,'[0-9]',1,1)), 04 1) 05 FROM TESTFILE |
Lines 2 - 4: This looks a lot more complicated as I have a REGEXP_INSTR nested within another REGEXP_INSTR. Let me break this line down.
Line 2: This is the start of the REGEXP_INSTR expression. I am looking for any character that is in the range of A – Z or blank.
Line 3: Start position is anywhere after the where the first "numeric" character is found.
Line 4: I am looking for the first occurrence.
The results are:
ORIGINAL 00002 -------- ------ A123B 5 ABC123 7 AB4679 7 BC0123D 7 |
In the first result the letter 'B' is in the fifth position.
In the second and third results there are no letters following the "number", but there is a blank which is in the seventh position.
In the fourth result the letter 'D' is in the seventh position.
I think I have enough to be able to construct the REGEXP_INSTR expression to substring the "number" from the field.
01 SELECT ORIGINAL, 02 SUBSTR(ORIGINAL, 03 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1), 04 REGEXP_INSTR(ORIGINAL,'[A-Z ]', 05 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1),1) 06 - REGEXP_INSTR(ORIGINAL,'[0-9]',1,1)) 07 FROM TESTFILE |
Line 1: I want to display the content of the ORIGINAL field.
Lines 2 – 6: This is the part that substrings the "number" from ORIGINAL.
Line 2: I am substring-ing from ORIGINAL.
Line 3: The starting place is returned from this REGEXP_INSTR returns the position of the first "numeric" character.
Lines 4 – 6: This is where it gets complicated… my statement calculate the length of the "number". Which it can do:
< length of "number" > = < position of alphabetic or blank character after "number" > - < start position of "number" > |
Lines 4 and 5: This part determines the position of the first alphabetic or blank character after the "number". The REGEXP_INSTR scans for first A – Z or blank character that is found after the first part of the "number".
Line 6: From that value the starting position of the number is subtract. This is the length of the number.
When I execute this statement my results are:
ORIGINAL 00002 -------- ------ A123B 123 ABC123 123 AB4679 4679 BC0123D 0123 |
The second column are character values, not numbers.
I now need to update the file NBR_CODE field in the file TESTFILE. I can take the above Select statement and modify it into an Update statement:
01 UPDATE TESTFILE 02 SET NBR_CODE = 03 SUBSTR(ORIGINAL, 04 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1), 05 REGEXP_INSTR(ORIGINAL,'[A-Z ]', 06 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1),1) 07 - REGEXP_INSTR(ORIGINAL,'[0-9]',1,1)) |
Lines 3 – 7: This substring is the same as the one in the Select statement in line 2 – 6.
Once the above is executed I can check the contents of TESTFILE with the following:
01 SELECT * FROM TESTFILE |
The results are:
ORIGINAL NBR_CODE -------- --------- A123B 123 ABC123 123 AB4679 4679 BC0123D 123 |
The character "numbers" extracted by the substring statement are converted to numbers when the NBR_CODE is updated.
As you can see that this is a somewhat simple statement extracted the number from the character field and was used to update the new field in the file. All of which is what my friend had asked for.
This article was written for IBM i 7.5 TR3 and 7.4 TR9.
Hi Simon.
ReplyDeleteI think a better solution would be to use the REGEXP_SUBSTR to extract the number from the text, e.g.
01 UPDATE TESTFILE
02 SET NBR_CODE =
03 REGEXP_SUBSTR(ORIGINAL, '[A-Z ]*([0-9]*).*', 1, 1, 'i', 1 )
The "([0-9]*)" captures any consecutive digits after any number of letters or spaces, and the final '1' returns the captured group.
For more information, see here:
https://www.ibm.com/docs/en/i/7.5?topic=functions-regexp-substr
Thank you for all your great posts!
Best regards,
Christian
I love the examples of regexp* SQL tools in use as they can simply complicated code. Was there a reason REGEXP_SUBSTR was not selected as the preferred tool for extracting the data? Or is that a part two post?
ReplyDeleteMy biggest issue with REGEXP_SUBSTR is the occurrance parameter, in cases like this, you have to give it a value that is not one (1) to get the value out of the string.
```
with t1 (ORIG_VALUE) as (values ('A123B'),('ABC123'),('AB4679'),('BC0123D'))
select ORIG_VALUE
,REGEXP_SUBSTR(ORIG_VALUE,'[A-Z]*') Alpha
,REGEXP_SUBSTR(ORIG_VALUE,'\d*',1,REGEXP_INSTR(ORIG_VALUE,'[0-9]',1,1)) Numbers
from T1
```
-Matt
Hi Simon,
ReplyDeletethis should be easier with REGEXP_SUBSTR because you don't need to find begin and end of the number in the string.
Try: REGEXP_SUBSTR(ORIGINAL, '([0-9]+)', 1, 1, '', 1)
I can't try it out right now, but if my Regex-Fu isn't completely gone, it should work.
The () do form a group expression of at least 1 or more digits. And the BIF is extracting the first sequence of consecutive digits.
VALUES REGEXP_EXTRACT('ABC123','[0-9]+')
ReplyDeleteCan you squeeze out the non-digits and cast the result? I added a leading zero here in case no digits in the string.
ReplyDeletevalues(decimal('0' || regexp_replace('ABC123DEF', '[^\d]', ''), 5, 0));
RInger
with mydata(fld) as (
ReplyDeleteVALUES('A123B'),('ABC123'),('AB4679'),('BC0123D'))
select fld, dec(trim(translate(upper(fld), ' ', 'ABCDEFGHJIKLMNOPQRSTUVWXYZ')),10,0)
from mydata
FLD | 00002
A123B 123
ABC123 123
AB4679 4679
BC0123D 123
Another idea; strip out the letters. The regex process used here seems a bit over engineered to me.