What strings does this regex match? (a|b)*

  • Thread starter Thread starter shivajikobardan
  • Start date Start date
  • Tags Tags
    Match Strings
AI Thread Summary
The discussion centers around the behavior of the regular expression (a|b)*, which matches any string consisting of the characters 'a' and 'b', including mixed sequences and even the empty string. Participants clarify that the regex checks each character independently, allowing for combinations like "baaaaa" to be valid matches. The confusion arises from the interpretation of the '*' operator, which applies to the entire expression, allowing for any number of 'a's and 'b's in any order. To restrict matches to sequences of only one type of character (like "aa" or "bb" but not "baaaa"), a different regex format, such as (a*|b*), is suggested. The conversation emphasizes the importance of understanding how regex components interact, particularly the placement of parentheses and the implications of the '*' operator.
shivajikobardan
Messages
637
Reaction score
54
regepxal is showing infinite matches and shows "baaaaaaa" as a match. Doesn't (a|b) means a or b?
* means match 0 or more times. So, it should be matching aa,aaa,bb,bbb, , etc?
I'm not getting how baaaa is a match. Can anyone clarify please?
 
Technology news on Phys.org
Go through "baaaaa" character by character. Which character is not an 'a' or a 'b'?
Sometimes you have to see what it does before you realize that it is doing exactly what it was told to do.
CORRECTION: This is wrong. It is for the Perl regular expression [ab]*
 
Last edited:
FactChecker said:
Go through "baaaaa" character by character. Which character is not an 'a' or a 'b'?
Sometimes you have to see what it does before you realize that it is doing exactly what it was told to do.
I feel I'm not getting it. I feel I'm missing something that you have good grasp of. You choose a or b and repeat, so aaaa, bbbb should be only matches. Could you help me out? How'd this look like in FSM?
 
shivajikobardan said:
You choose a or b and repeat
You are misinterpreting the "repeat" part. As the regex is written, it means "for each character, match if it is a or b".

What you are thinking of, namely, "choose a or b; then match a string containing any number of that character and no others", would be expressed as the regex (a*|b*) (with stars inside the parentheses, instead of a star outside).
 
  • Like
Likes FactChecker and DaveC426913
Go through the string "baaaa" one character at a time and for each one, apply the test (a|b). As @PeterDonis says, the parenthesis forces that part to be evaluated first. Each character passes the test and is a match. Then apply the * to match as many as possible; that is all of them.
CORRECTION: This is wrong. It is for the Perl regular expression [ab]*
 
Last edited:
PeterDonis said:
You are misinterpreting the "repeat" part. As the regex is written, it means "for each character, match if it is a or b".

What you are thinking of, namely, "choose a or b; then match a string containing any number of that character and no others", would be expressed as the regex (a*|b*) (with stars inside the parentheses, instead of a star outside).
I can't get it till now. Can you suggest some readings instead? I feel a good chance there's something you guys know that I don't know.
 
FactChecker said:
Go through the string "baaaa" one character at a time and for each one, apply the test (a|b). As @PeterDonis says, the parenthesis forces that part to be evaluated first. Each character passes the test and is a match. Then apply the * to match as many as possible; that is all of them.
I am not getting it :(
 
shivajikobardan said:
I am not getting it :(
Consider what (a|b) matches:
baaaa Is 'b' an 'a' or a 'b'? Yes.
baaaa Is 'a' an 'a' or a 'b'? Yes.
baaaa Is 'a' an 'a' or a 'b'? Yes.
baaaa Is 'a' an 'a' or a 'b'? Yes.
baaaa Is 'a' an 'a' or a 'b'? Yes.
The regular expression (a|b) matches all 5 characters.
Since the * tells it to match as many as possible, (a|b)* matches all of baaaa.
CORRECTION: This is wrong. It is for the Perl regular expression [ab]*
 
Last edited:
shivajikobardan said:
I can't get it till now.
You have two operations that the regex is doing: (1) looking at a character and deciding whether it's a match; (2) deciding how many characters to match.

The "looking at a character and deciding if it's a match" operation, by itself, is straightforward: a single character matches if it's an a or a b.

The problem you are having comes with the "deciding how many characters to match" operation. You are thinking of that operation as follows: once you find a matching character, continue matching more characters as long as they are the same character as the first one that matched. In other words, once a match is found with an a, continue matching as long as there are a's, or once a match is found with a b, continue matching as long as there are b's; but stop matching if you started with a's but then encounter a b, or vice versa.

But that is not what the regex (a | b)* actually does. The regex that actually does what I just described in the last paragraph is the one I gave in my last post, (a* | b*). That regex will indeed match aa, aaa, bb, bbb, etc., but will not match baaaa.

What the regex (a | b)* actually does is to keep matching characters as long as each character is either an a or a b; it does not care if a's and b's are mixed. That is why baaaa is a match. The reason it doesn't care if a's and b's are mixed is that the "check for a match" operation, a | b, is inside the parentheses, but the "keep matching" operation, *, is outside the parentheses. That means it applies the a | b match criterion to each character independently during the "keep matching" operation, so it will keep matching as long as each character is either an a or a b, without caring about mixing the two.

Whereas in the regex (a* | b*), the "keep matching" operation is inside the parentheses, and there are two of them, one for a and one for b; that means that once it succeeds at the "check for a match" operation just once, it has to try the "keep matching" operation with just that character that it matched, for as long as it can.
 
  • #10
@FactChecker, @PeterDonis I am afraid you are both analysing a different regex, namely ^(a|b)+$. The regex in the OP will also match "gt 5*au!6".
 
Last edited:
  • #11
Thank you. The perfect way to think about this for me is this:

(X)*= X X X X X X

Here X=a or b
So,

a or b then a or b then a or b then....
 
  • #12
shivajikobardan said:
Thank you. The perfect way to think about this for me is this:

(X)*= X X X X X X

Here X=a or b
So,

a or b then a or b then a or b then....
No, if you look at what I posted in #10, (a|b)* will be a match for ANY string, even an empty string
it matches 0 or more occurences of either a or b: any string has 0 or more occurances of any character

If you want to only match a, b, aa, aaa, bb, bbb etc. but not e.g. baaaaa or the empty string then you need ^(a+|b+)$ (or if there is a possibility that your test string may contain newline characters \A(a+|b+)\z)
 
  • #13
pbuk said:
@FactChecker, @PeterDonis I am afraid you are both analysing a different regex, namely ^(a|b)+$. The regex in the OP will also match "gt 5*au!6".
I stand corrected. Actually, I was confusing it with the Perl regular expression, [ab]*
As a Perl regular expression, (a|b)* only matches one 'a' in 'baaaaxxxyyy'.
 
  • #14
pbuk said:
No, if you look at what I posted in #10, (a|b)* will be a match for ANY string, even an empty string
it matches 0 or more occurences of either a or b: any string has 0 or more occurances of any character

If you want to only match a, b, aa, aaa, bb, bbb etc. but not e.g. baaaaa or the empty string then you need ^(a+|b+)$ (or if there is a possibility that your test string may contain newline characters \A(a+|b+)\z)
yes it matches empty string as well. e, X, XX, XXX...where each X can be either a or b.
 
  • #15
pbuk said:
@FactChecker, @PeterDonis I am afraid you are both analysing a different regex, namely ^(a|b)+$. The regex in the OP will also match "gt 5*au!6".
Hm, yes, there should be a + instead of a * to ensure at least one matching character, and we should make explicit that we want to match the entire string.
 
  • #16
shivajikobardan said:
yes it matches empty string as well. e, X, XX, XXX...where each X can be either a or b.
You are not listening: (a|b)* is completely useless (as a complete regex) because it matches absolutely anything whether there are a or b characters in it or not.
 
Last edited:
  • #17
pbuk said:
You are not listening: (a|b)* is completely useless because it matches absolutely anything whether there are a or b characters in it or not.
Maybe worth stressing that it's only completely useless on its own (because, as you say, it matches any string containing a sequence of zero or more a or b, and anything contains zero or more as and bs). But it has its uses as a component of a larger regex (e.g. your ^(a|b)*$).
 
  • #18
Ibix said:
Maybe worth stressing that it's only completely useless on its own
Yes, that is closer to what I intended to convey: edited.
 
Last edited:
Back
Top