[General boards] [Winter 2019 courses] [Fall 2018 courses] [Summer 2018 courses] [Older or newer terms]

Substrings that match Regex


#1

When the question asks for all substrings that match a regex, are we following the way java works - 1. find the first matching substring and then start the matching process from the next index 2. find the longest possible substring?

For example, if we have “a2bbc1” and the regex “a.*[12]”, should the solution be a2bbc1 or {a2, a2bbc1}? Or if we have the regex “[abc]{2}”, should we only write bb or {bb, bc}?

Thanks!


#2

If the instructions say “all”, then write all of them. In your first example, that means both a2 and a2bb1. For your second example bb and bc both match, so you should write them both.

The only reason not to write both is if you are given code that matches regex in a way that does not backtrack. Then you would write the first substring that matches. Then starting from the end of that substring, look for another one.


#3

Thanks!

Does backtracking mean that the program will try to match as much as possible first, and if the last character doesn’t match, it restarts from the beginning?

What would the code look like when it does and does not backtrack?


#4

Interesting question! “Backtracking” might happen only when you use a quantifier. For example, when you try to match [abcd]+ddd against addd, the + quantifier is greedy, and it will prefer to consume as many characters as it can. So, at the beginning, [abcd]+ will eat up the whole string first, but then, ddd does not match, so [abcd]+ must drop at least one character. This is called backtracking. First it goes back to add, then ad, then a, when the whole regex finally matches.

There are modifiers to make the quantifiers non-greedy, namely the ? modifier. For example, matching [abcd]+?ddd against addd. The [abcd]+? part will prefer to consume as few characters as it can, so only a is used at the beginning. Regex ddd matches string ddd, so no backtracking will happen.

The modifier + can completely disable backtracking, and in this case, when the remaining part does not match, the regex simply fails. For example, matching .++@foo\.bar against aaa@foo.bar. .++ will consume as many characters as it can, and in this case, the whole string. But the remaining string (empty string) does not match @foo\.bar, so the regex will just not match. For another example, try matching [^@]++@foo\.bar against aaa@foo.bar. [^@]++ will consume as much as it can, and in this case, aaa. Regex @foo\.bar matches string @foo.bar, so the regex succeeds to match.

Oracle’s document on Regexes:
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html


#5

Thank you so much!!!


#6

To add to this, consider the string “abcbabcb”. If you are looking for matching substrings for the regex [ac]b[ac], a human will find:

abc
cba
abc

But a java program will only get
abc
abc

because the cba overlaps the first substring, requiring the program to backtrack to before the third character, which it will not do.


#7

Then is it right that:

  1. generally by default, before the matcher finds the first matching substring, it will do backtracking if the regex contain special characters such as + * ;
  2. after it finds a matching substring, it does not backtrack

#8

It seems we are talking about two different concepts.