The germ for the idea for this post came from a question I saw in a Facebook Group. The question was how to identify records in a file where a particular field contains "special" characters, using SQL.
As the question was specially about a DDS file, I created a file, TESTFILE, with one field, FIELD1. I am not going to give the code for this file, just its contents. I would use the following SQL statement to list the contents of this file:
SELECT FIELD1 FROM TESTFILE |
The results from this file are:
FIELD1 -------- abcde ABCDE 12345 ABC 123 A@ B# 4$ !@#$%^&* |
I have used the REGEXP_LIKE scalar function to return whether a character, or string of characters, is in the field. In this first statement I am going to return whether the field contains one of the characters I am testing for.
01 SELECT FIELD1, 02 REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]') 03 FROM TESTFILE |
Line 2: I am using just two of the possible parameters of REGEXP_LIKE:
- Name of the field
- List of the characters to test for
As I want to check for a list of possible characters, I list them within square brackets ( [ ] ).
Two columns are returned:
- Contents of FIELD1
- Whether the field contains one of characters I am testing for
Which returns:
FIELD1 00002 -------- ----- abcde false ABCDE false 12345 false ABC 123 false A@ true B# true 4$ true !@#$%^&* true |
If I wanted to return only those records that has at least one of the characters I am testing for in FIELD1, I would use:
01 SELECT FIELD1 FROM TESTFILE 02 WHERE REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]') |
Line 2: In this statement the REGEXP_LIKE is in the Where clause.
The results are:
FIELD1 --------- A@ B# 4$ !@#$%^&* |
If I wanted the opposite, a list of records without any of those test characters in FIELD1, I would make one simple change.
01 SELECT FIELD1 FROM TESTFILE 02 WHERE NOT REGEXP_LIKE(FIELD1,'[!@#$%^&*()_+]') |
Line 2: I have added the word NOT to the Where clause, to reverse its meaning from the previous statement.
The results are:
FIELD1 --------- abcde ABCDE 12345 ABC 123 |
What if I wanted to return the records that contain characters other than A-Z and 0-9?
If I used the method I have described above I would need to list every character that is not in the desired ranges. Which would be a laborious process. There must be an alternative.
Fortunately there is, I can use wildcard and "or" in the list of characters:
01 SELECT FIELD1 FROM TESTFILE 02 WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]') |
Let me describe which is meant in the REGEXP_LIKE values.
- . = The dot character is used as the wildcard character. This allows for a match to a character anywhere in FIELD1
- [A-Z] = Range of values from A to Z
- | = The pipe symbol is used to denote "or"
- [0-9] = Range of values from 0 to 9
The results from this statement are:
FIELD1 --------- abcde A@ B# !@#$%^&* |
All of the Regular Expressions are case sensitive, therefore, the first returned row is considered invalid as the characters are all lower case.
I consider 'abcde' to be valid, therefore, I need to add a third parameter to REGEXP_LIKE:
01 SELECT FIELD1 FROM TESTFILE 02 WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]','i') |
Line 2: The 'i' character tells REGEXP_LIKE to ignore case.
This change means that the record that contains 'abcde' is not consider invalid:
FIELD1 --------- A@ B# !@#$%^&* |
If you think the third parameter of REGEXP_LIKE is too cryptic for others to follow I can simply use the UPPER scalar function in its place.
01 SELECT FIELD1 FROM TESTFILE 02 WHERE NOT REGEXP_LIKE(UPPER(FIELD1),'.[A-Z]|[0-9]') |
Which produces the same results as the prior statement.
I have found the above very useful when scanning strings coming from non-IBM i sources for invalid characters before moving the data into the database.
This article was written for IBM i 7.5, and should work for some earlier releases too.
Hi Simon, very informative as always. In the example with the WHERE clause as WHERE NOT REGEXP_LIKE(FIELD1,'.[A-Z]|[0-9]'), shouldn't the value of 4$ also be listed in the results?
ReplyDeleteI had to look up what "4$" stands for, local currency symbol.
DeleteIn the original question the person did not want currency symbols either.
But this is a useful code to know and to use in future examples.
Thanks
thanks for sharing Simon
ReplyDelete