SIMPLIFY YOUR SAS ‘SEARCH’ OR ‘REPLACE’ ENGINE USING REGULAR EXPRESSION

Genproresearch
Nov 21, 2020

Author: Mr. Pranav Kurode — Clinical SAS Programmer

Ever worked with NONUNIFORMSTRING?

Finding, extracting, or replacing text within non-standard strings like the one above is usually difficult. Perl regular expressions are an advanced technique for solving this problem.

A regular expression is a sequence of characters that defines a search pattern. Many text-processing tasks in SAS can be performed using Perl regular expressions. These tasks can also be performed with traditional character functions, but Perl regular expressions often provide simpler solutions to complicated text manipulation tasks.

When performing a match, SAS searches the source string for the pattern you provide. For example, in prxmatch('/bike/', 'I have 1 bike'), the pattern "bike" is searched for in the source string "I have 1 bike". Perl regular expressions are composed of ordinary characters and special characters called metacharacters. Metacharacters perform tasks such as forcing the match to begin at a particular location or matching a particular set of characters. Some metacharacters are covered below.
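As a minimal sketch of the call described above (the DATA _NULL_ step and the variable name pos are illustrative choices, not part of the original example), the following writes the match position to the log:

```sas
data _null_;
    /* Search the source string for the pattern bike */
    pos = prxmatch('/bike/', 'I have 1 bike');
    /* pos is the starting position of the first match; 0 would mean no match */
    put pos=;   /* writes pos=10 to the log */
run;
```

The word "bike" starts at the tenth character of the source string, so a nonzero result doubles as a simple "found" flag.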

Functions :

PRXPARSE

PRXPARSE compiles a Perl regular expression so that other PRX functions can use it later. Each time you compile a regular expression, SAS assigns the next sequential number to the compiled pattern and returns it as a pattern ID. Other PRX functions, such as PRXMATCH and PRXCHANGE, can use this number to perform searches.

Syntax : prxparse(Perl-Regular-Expression)

Perl-Regular-Expression : String placed in quotation marks
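A common way to use PRXPARSE is to compile the pattern once and reuse the returned ID. The sketch below assumes that idiom; the variable names pattern_id and pos are hypothetical, and the check for a missing ID is a defensive habit rather than something the original article requires.

```sas
data _null_;
    /* Compile the pattern once, on the first iteration, and keep the ID */
    retain pattern_id;
    if _n_ = 1 then pattern_id = prxparse('/bike/');

    /* A missing ID would indicate that the pattern failed to compile */
    if missing(pattern_id) then do;
        put 'ERROR: invalid regular expression';
        stop;
    end;

    /* Reuse the compiled pattern through its ID */
    pos = prxmatch(pattern_id, 'I have 1 bike');
    put pattern_id= pos=;
run;
```

In a step that reads many observations with a SET statement, the same RETAIN and _N_ = 1 arrangement keeps the pattern ID available on every row.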

PRXMATCH

PRXMATCH locates the position in a string where a regular expression match is found. It returns the starting position of the first match of the pattern in the string. If the pattern is not found, the function returns zero.

Note: The "m" (match) operator sometimes appears in PRXMATCH patterns. It is the default operator, so "m/…/" is equivalent to "/…/".

Syntax : prxmatch(Pattern-ID | Perl-Regular-Expression, String)

Pattern-ID : Value returned from prxparse

String : A character variable or string in quotation marks

PRXCHANGE

PRXCHANGE substitutes one string for another. One advantage of PRXCHANGE over TRANWRD is that you can search for strings using wildcards. Wildcards are covered in the metacharacter section below.

Note that you need to use the "s" (substitution) operator in the regular expression to specify the search and replacement expressions.

Syntax : prxchange(Pattern-ID | Perl-Regular-Expression, Times, Old-string)

Times : The number of times to search for and replace a match. 1 means replace once. -1 means keep replacing until the end of the string is reached.

Old-string : The string in which you want to make the replacements.
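To make the Times argument concrete, here is a small sketch that runs the same substitution with -1 and with 1. The sample text and variable names are made up for illustration.

```sas
data _null_;
    length all_changed first_changed $20;
    /* -1: replace every occurrence of cat */
    all_changed   = prxchange('s/cat/dog/', -1, 'cat and cat');
    /* 1: replace only the first occurrence */
    first_changed = prxchange('s/cat/dog/', 1, 'cat and cat');
    put all_changed= first_changed=;
run;
```

The first call yields "dog and dog" and the second yields "dog and cat".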

Metacharacters used are:

Basic Syntax

Character : Description

/…/ : Starting and ending regex delimiters

() : Grouping

| : Alternation

Example 1:

Suppose the data contains the values "rat", "cat", and "bat". We are interested in "rat" and "cat".

In this case "at" is the same in all values; the only difference is the first letter: "r", "c", or "b".
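The programs in this section read a dataset named test1 with a character variable a. The article does not show how that dataset is built, so the step below is only an assumed setup (the length of 10 is an arbitrary choice) that lets the examples run as written:

```sas
/* Hypothetical input for the programs in this section */
data test1;
    length a $10;
    input a $;
    datalines;
rat
cat
bat
;
run;
```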

Program 1:

data program1;
    set test1;
    /* Compile the pattern described above and keep its pattern ID */
    parse = prxparse('/(r|c)at/');
    /* prxmatch searches variable a using the pattern ID returned by prxparse */
    match = prxmatch(parse, a);
    /* <----- OR -----> */
    /* match1 gives the same result; the pattern is passed directly
       instead of through the prxparse pattern ID */
    match1 = prxmatch('/(r|c)at/', a);
run;

Output 1:

Character Class

Character : Description

[…] : Matches any one character listed in the brackets

[^…] : Matches any one character not listed in the brackets

[a-z] : Matches any character in the range a to z

Example 2:

We will use the same data as in Example 1.

Program 2:

data program2;
    set test1;
    match2 = prxmatch('/[rc]at/', a);  /* Either r or c, followed by at */
    /* <----- OR -----> */
    match3 = prxmatch('/[^b]at/', a);  /* Any character other than b, followed by at */
run;

Output 2:

Position Matching

Character : Description

^ : Matches the beginning of the line

$ : Matches the end of the line
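One practical point about "$" in SAS: character variables are padded with trailing blanks up to their declared length, so an end-of-line anchor may not line up with the last visible character. The sketch below uses hypothetical variable names and an arbitrary length of 10 to compare three ways of writing the same test.

```sas
data _null_;
    length a $10;   /* a is stored as bat plus seven trailing blanks */
    a = 'bat';
    /* Trailing blanks follow the letters, so the anchor is not expected to match */
    padded  = prxmatch('/at$/', a);
    /* Allow optional whitespace before the end of the string */
    allowed = prxmatch('/at\s*$/', a);
    /* Remove the padding before matching */
    trimmed = prxmatch('/at$/', strip(a));
    put padded= allowed= trimmed=;
run;
```

Under these assumptions the first call is expected to return 0, while the other two find "at" at position 2. Either trimming the value or allowing \s* before "$" is a reasonable choice.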

Example 3:

We will use the same data as in Example 1.

Program 3:

data program3;
    set test1;
    match4 = prxmatch('/^[rc]/', a);  /* Starting with r or c */
    /* <----- OR -----> */
    match5 = prxmatch('/^[^b]/', a);  /* Not starting with b */
run;

Output 3:

Wildcards Class

Character : Description

. : Matches any character (except a newline)

\d : Matches a digit character [0-9]

\D : Matches any character except a digit

\w : Matches a word character: a letter, digit, or underscore [a-zA-Z0-9_]

\W : Matches a non-word character (anything other than a letter, digit, or underscore)

\t : Matches a tab character

\s : Matches a whitespace character, such as a blank space

\S : Matches any character except whitespace

Example 4 :

Program 4:

data program4;
    /* Each result is the position of the first match in the string */
    /* Matches the first digit, 1, in abc123 */
    num1 = prxmatch('m/\d/', "abc123");
    /* Matches the first non-digit, a, in abc123 */
    char2 = prxmatch('m/\D/', "abc123");
    /* Matches the word character a in abc123 */
    numchar1 = prxmatch('m/\w/', "abc123");
    /* Matches the word character 1 (a digit) in 123abc */
    numchar2 = prxmatch('m/\w/', "123abc");
    /* Matches the non-word character "*" in abc*123 */
    nonumchar = prxmatch('m/\W/', "abc*123");
    /* Matches the blank " " in abc 123 */
    blank = prxmatch('m/\s/', "abc 123");
    /* Matches the first non-blank character, a, in abc*123 */
    noblank = prxmatch('m/\S/', "abc*123");
run;

Output 4:

Repetition Factor (match as many times as possible)

Character : Description

* : Matches 0 or more times

+ : Matches 1 or more times

? : Matches 0 or 1 time

{n} : Matches exactly n times

{n,} : Matches at least n times

{n,m} : Matches at least n times but not more than m times
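"Match as many times as possible" means these quantifiers are greedy. As an illustration that goes slightly beyond the table, Perl syntax also lets you append "?" to a quantifier to make it match as few times as possible. The sample string and variable names below are invented for the sketch.

```sas
data _null_;
    length greedy lazy $20;
    /* \d+ is greedy: it consumes all five digits in one match */
    greedy = prxchange('s/\d+/#/', 1, 'a12345b');
    /* \d+? is lazy: it stops after the first digit */
    lazy   = prxchange('s/\d+?/#/', 1, 'a12345b');
    put greedy= lazy=;
run;
```

The greedy version produces "a#b", while the lazy version produces "a#2345b".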

Example 5:

Please note: If the data contains special symbols that are also regex metacharacters, escape them in the pattern with a backslash "\". For example, if your data contains "*", use "\*" to match it literally.
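A short sketch of the escaping rule, using a made-up string that contains an asterisk (the variable names are illustrative only):

```sas
data _null_;
    length cleaned $10;
    /* The backslash makes the asterisk a literal character, not a quantifier */
    pos = prxmatch('/\*/', 'abc*123');
    put pos=;   /* writes pos=4 to the log */
    /* Replace every literal asterisk with a hyphen */
    cleaned = prxchange('s/\*/-/', -1, 'abc*123');
    put cleaned=;
run;
```

Here cleaned becomes "abc-123". Other metacharacters such as ".", "+", "?", "(", and ")" follow the same rule when you need them as literal characters.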

Program 5:

data program5;
    /* \w+ matches one or more word characters */
    /* Here "ab" is two word characters (>= 1), so both are replaced by a single "*" */
    one_mor = prxchange('s/\w+/*/', -1, "ab%");
    /* \d* matches zero or more digits, then \w+ matches one or more word characters */
    /* The matched text is replaced by "*" */
    zer_mor = prxchange('s/\d*\w+/*/', -1, "ab%");
    /* Match exactly 2 word characters */
    /* The "i" modifier makes the match case-insensitive, so ab, Ab, and aB are treated alike */
    match2c = prxchange('s/\w{2}/1/i', -1, "Ab");
    /* Match at least 1 and at most 2 word characters followed by a digit */
    /* $1 refers to the text captured by the first (); similarly, $2 refers to the second () */
    match12c = prxchange('s/\w{1,2}(\d)/$1/i', -1, "Ab1");
    /* Match at least 1 word character */
    match1c = prxchange('s/\w{1,}/1/i', -1, "Ab");
run;

Output 5:

Code snippet:

Example 1:

In the following example, the "trt" variable has 5 values: TRT1, TRT2, TRT3, PROD1, and PROD3. We are interested in extracting TRT1, TRT2, and PROD1.

Program :

data b;
    set a;
    /* Subsetting IF: keep observations whose trt value contains 1 or 2 */
    if prxmatch('m/[12]/', trt) >= 1;
run;

Output :

Example 2:

Suppose a dataset contains 1000 values. In this example we consider a few distinct patterns. Each value consists of a domain name and a number, such as "AE 10 DM 12", "CM 11, DS 20", "Adverse Event 17", or "MH 17,20", and each output value should hold one domain followed by its number. For "Adverse Event 17", the value should display "AE 17"; for "MH 17,20", the values should display "MH 17" and "MH 20".

Data :
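The original data listing is not available here, so the step below is only an assumed reconstruction. It builds a dataset named domain with a character variable domN from the four sample values quoted above; the length of 40 is an arbitrary choice.

```sas
/* Hypothetical input: one value per observation, read as a whole line */
data domain;
    length domN $40;
    infile datalines truncover;
    input domN $40.;
    datalines;
AE 10 DM 12
CM 11, DS 20
Adverse Event 17
MH 17,20
;
run;
```

Reading with the $40. informat keeps the embedded blanks, which the patterns in the program rely on.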

Program :

data test;
    set domain;
    /* Two domains in one value, e.g. "AE 10 DM 12" or "CM 11, DS 20" */
    if prxmatch('/\w+\s\d+(,)?\s\w+\s\d+/', domN) >= 1 then do;
        var1 = prxchange('s/(\w+\s\d+)(,)?\s\w+\s\d+/$1/', -1, domN);
        var2 = prxchange('s/\w+\s\d+(,)?\s(\w+\s\d+)/$2/', -1, domN);
    end;
    /* Spelled-out domain name, e.g. "Adverse Event 17" becomes "AE 17" */
    if prxmatch('/\w{3,}\s\w{3,}\s\d+/', domN) >= 1 then
        var1 = prxchange('s/(\w)\w{2,}\s(\w)\w{2,}\s(\d+)/$1$2 $3/', -1, domN);
    /* One domain with two numbers, e.g. "MH 17,20" becomes "MH 17" and "MH 20" */
    if prxmatch('/\w+\s\d+,\d+/', domN) >= 1 then do;
        var1 = prxchange('s/(\w+)\s(\d+),\d+/$1 $2/', -1, domN);
        var2 = prxchange('s/(\w+)\s\d+,(\d+)/$1 $2/', -1, domN);
    end;
run;

Output :

Wish to know more? Always feel free to write to us at info@genproindia.com.


ORIGINAL SOURCE : https://genproresearch.com/knowledge/simplify-your-sas-search-or-replace-engine-using-regular-expression/
