Regular Expression Epiphany

Posted 4 Nov 2000 at 17:47 UTC by japhy

I just made an entry in my diary about regex reversal. But I just thought of an easier way! Please entertain this, and tell me if you can think of any fouled up cases.

Matching the last occurrence of something in a string is often inefficient:

# pseudocode
string = "123 foo 234 bar 345 quux 456 xyzzy 567 blat 678 krufty"
last_digits = match(string, /(\d+)\D*$/);

That regex will fail at '123', '234', '345', '456', and '567' before succeeding at '678'. Ick. WHAT IF you figured out the first "node" of the regular expression that can match -- in this case, '\d+' -- and then worked backwards from the end of the string until you found a point at which this first node matches? THEN keep moving backwards until you find where it fails to match, and try matching JUST ahead of that point. This is the concept of regex reversal anyway, but without the nastiness of ACTUALLY reversing anything.
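A sketch of that backward probe in pure Perl (purely illustrative -- a real engine would do this internally, and the "first node" test is hard-coded here for \d+):

```perl
my $string = "123 foo 234 bar 345 quux 456 xyzzy 567 blat 678 krufty";
my $last;

# Walk backwards from the end; find where the first node (\d+) matches
# but FAILS to match one character earlier, then try the full pattern
# at that point.
for (my $pos = length($string) - 1; $pos >= 0; $pos--) {
    if (substr($string, $pos) =~ /^\d/) {
        # the node still matches one character back -- keep moving
        next if $pos > 0 && substr($string, $pos - 1) =~ /^\d/;
        # the node fails just before $pos: try the real match here
        if (substr($string, $pos) =~ /^(\d+)\D*$/) {
            $last = $1;
            last;
        }
    }
}
# $last is '678', found without testing '123' through '567' first
```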

The problem is when you use the * quantifier. Hmm. If the user had done /(\d*)\D*$/, then \d* would keep matching even 0 digits, so we'd work to the BEGINNING of the string, and then we'd have to work back up the string. That sounds slow and icky, and is double the work. That's a worst-case, and it's a very bad one indeed.

So what if we check for a match right after a point where we matched the first node of LESS length than before? Whoa, let me explain myself.

string = "123 foo 234 bar 345 quux"
# start at the back of the string, and try matching \d*
# succeeds at EOS, matched 0 times
# succeeds at 'x', matched 0 times
# succeeds at 'u', matched 0 times
# succeeds at 'u', matched 0 times
# succeeds at 'q', matched 0 times
# succeeds at ' ', matched 0 times
# succeeds at '5', matched 1 times '5'
# succeeds at '4', matched 2 times '45'
# succeeds at '3', matched 3 times '345'
# succeeds at ' ', matched 0 times

Because 0 is less than 3, we stop backtracking for a moment and see if we can get ourselves a match. Now it tries to do the match of (\d*)\D*$ and succeeds. We must have found the last one (partly because of the anchor). I could make it so that I don't need to be explicit about the anchoring. That would mean a less intensive regex for the user to come up with.
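That stop-when-the-length-drops rule can also be sketched in pure Perl (again hard-coding the first node, \d*, as an assumption):

```perl
my $string = "123 foo 234 bar 345 quux";
my ($prev_len, $found) = (0, undef);

for (my $pos = length($string); $pos >= 0; $pos--) {
    # how many characters does the first node, \d*, match at $pos?
    my ($node) = substr($string, $pos) =~ /^(\d*)/;
    my $len = length $node;

    # the match length dropped: we just walked past the end of the
    # last digit run, so try the full pattern one character ahead
    if ($len < $prev_len) {
        if (substr($string, $pos + 1) =~ /^(\d*)\D*$/) {
            $found = $1;
            last;
        }
    }
    $prev_len = $len;
}
# $found is '345' (a full implementation would also need to handle a
# match anchored at position 0, which this sketch skips)
```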

This changes "longest, leftmost match" to "longest, rightmost match". I'm going to try to implement a solution in pure Perl.

Wish me luck.

I miss the point, posted 5 Nov 2000 at 01:58 UTC by mvw » (Journeyer)

Sorry, but I don't get it yet. You present examples where the example regexp performs unfavourably -- but won't it perform quite well on other examples? You would have to judge the performance on average over a wider range of examples.

And a general remark -- it is amazing what kind of knowledge about string matching and string alignment (that is, matching only part of the string -- think of dropping some letters before matching) exists in bioinformatics. I have a big book near my desk at work; I should mail you the title.

There is a lot more to it than Boyer-Moore or regexps.


What it's all about, posted 5 Nov 2000 at 02:40 UTC by japhy » (Journeyer)

Basically, the purpose is to speed up last-of-something matching. The reason I want this ability is because some regular expressions make it difficult to match the last of something in the manual many-failures way. That is, it becomes simpler to say "match ALL occurrences of PATTERN" and then get the last element from the list of matches.
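For the record, that brute-force approach is a one-liner in Perl:

```perl
my $string = "123 foo 234 bar 345 quux";
# match ALL occurrences of the pattern, then take the last element
my @all  = $string =~ /(\d+)/g;
my $last = $all[-1];
# $last is '345', but every earlier match was found and thrown away
```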

There are cases where the technique won't "work":

string = "/* this is a /* valid C comment */"
comment = last_match(string, REx_to_match_C_comments)

That will match "/* valid C comment */", for rather obvious reasons.
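To see it concretely, here is a sketch using the string-reversal ("sexeger") form of the technique; the reversed comment pattern happens to read the same as the forward one, since '/*' and '*/' swap roles:

```perl
my $string = "/* this is a /* valid C comment */";
my $rev    = reverse $string;

# The first comment found scanning backwards is the INNER '/*',
# not the outer one.
my ($m) = $rev =~ m{(/\*.*?\*/)};
my $last_comment = reverse $m;
# $last_comment is '/* valid C comment */' -- the "wrong" one
```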

Use greedy operators and look-ahead/behinds., posted 5 Nov 2000 at 08:24 UTC by dalke » (Journeyer)

What if you replaced the regexp with "/.*(\d+)\D*$/"? The .* would greedily eat everything to the end of the string. This fails, so it backs up until it finds a \d character.

Hmm, okay, that won't work since it will only get the "0" in "10". So use a look-behind assertion, like "/.*(?<!\d)(\d+)\D*$/". This says "go to the end. Backtrack until you find a digit which doesn't have a digit before it, then grab the digits and (because this is guaranteed to be the last set of digits) ignore everything up to the end of the line."
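A quick sketch exercising that look-behind regexp on a sample string (the string is mine; the regexp is as given above):

```perl
my $string = "123 foo 234 bar 345 quux 456 xyzzy 567 blat 678 krufty";

# .* races to the end of the string, then backtracks; the look-behind
# (?<!\d) forces the capture to start at the BEGINNING of the final
# digit run rather than partway through it.
my ($last) = $string =~ /.*(?<!\d)(\d+)\D*$/;
# $last is '678'
```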

This means the example you gave can be converted to an existing regexp with the expected performance.

Now, the generalization is more complicated, but rarely needed. Suppose you have a pattern 'P'; then

  /.*(?!P$).(P)$|^P$/

should give you exactly what you want. What it says is:
  • Go to the end of the string.
  • Back up until you get to a pattern which doesn't match P.
  • Skip one character and match P to the end of string.
  • If this fails, see if the whole string matches P (this would occur because there is no '.' available to do the second step).
Also, you do realize your proposal is best implemented purely as an optimization on existing regexp engines? Basically, you've got a special case which the regexp parser can be made to recognize and emit specialized code to handle it well.

You also said:

   comment = last_match(string, REx_to_match_C_comments)

That will match "/* valid C comment */", for rather obvious reasons.

Huh? It isn't obvious to me. Why can't REx_to_match_C_comments be !/\*.*?\*/! ? Or, if you don't like non-greedy operators, you could use zero-length lookahead assertions and do !/\*(?:(?!\*/).)*\*/! .

Disclaimer: I didn't actually test any of these regexps, but they should be close to properly working solutions.

Continued..., posted 5 Nov 2000 at 13:40 UTC by japhy » (Journeyer)

I meant my technique for matching the last of something wouldn't work in the C comment case, due to the fact that when it goes backwards and finds a /* and can match forward to a */, it thinks it has the last match.

The same problem will occur with your solution -- it will end up finding the SECOND /* in a string like "foo /* bar /* blat */ baaz".

Look-behind is not an option: it is only supported for constant-width patterns, because variable-width look-behind requires the very technique I'm trying to establish. ;)

Yes, I realize this is an optimization thing -- I'd expect it to be used by people that know what they're doing, and care about this sort of thing. Or maybe I just find it extremely interesting even though it's solely academic and won't be used.

agreement, requestioned, and a minor rant, posted 5 Nov 2000 at 19:56 UTC by dalke » (Journeyer)

    I meant my technique for matching the last of something
    wouldn't work in the C comment case, due to the fact that when
    it goes backwards 
I understand your point now. That's to be expected. This "reversed regexp" of yours will always have problems because of its implied non-greedy behaviour. This correlates exactly with the fact that "^.*" is greedy in the forward direction but acts non-greedy during backtracking.
    The same problem will occur with your solution -- it will end up 
    finding the SECOND /* in a string like "foo /* bar /* blat */ baaz
Do you mean my regexp for finding /* */ comments? In which case you are incorrect, as seen with:
  $ perl -ne 'print "$1\n" if m#(/\*.*?\*/)#'
  foo /* bar /* blat */ baaz
  /* bar /* blat */
If you mean my generalized regexp of /.*(?!P$).(P)$|^P$/ then you are correct, because it implements (I believe) exactly what you want it to do, with a non-greedy reverse search of the string. You're not allowed to change what you want part-way through without saying you're changing definitions :) .
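With P = \d+ substituted in, the generalized regexp can be exercised directly (the sample string is mine):

```perl
my $string = "123abc456";

# /.*(?!P$).(P)$|^P$/ with P = \d+ : either there is a final digit
# run at end-of-string whose preceding position is NOT itself a
# digit-run-to-EOS, or the whole string is one digit run.
my ($tail, $whole) = $string =~ /.*(?!\d+$).(\d+)$|^(\d+)$/;
my $last = defined $tail ? $tail : $whole;
# $last is '456' -- the '123' run is never captured
```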

I noticed you didn't actually comment about the generalized regexp. Doesn't it implement exactly the behaviour you are looking for? At most you said "Look-behind is not an option...", which doesn't apply since this regexp doesn't use look-behind. (I only used it to show a way to build a regexp which implemented your \d+\D* case.)

That being the case, your proposal doesn't even require changing the re engine. It calls for analyzing the regexp parse tree and converting it to a more appropriate form, then passing the modified result to the existing re engine; similar in functionality to 'study'.

As I recall from the Friedl regexp book, in most cases the overhead for doing all of these optimizations is more than is needed for languages like Perl. It's faster to work with a simple understanding of regexps than an optimized one. Only in a few cases (eg, parsing huge log files) is it better to spend the time optimizing the string.

    I'd expect it to be used by people that know what they're
    doing, and care about this sort of thing.
Hmm. I actually find that to be a somewhat derisive statement. Would you consider rewording it as "used by people interested in automatic regexp optimization"? As an example, my Martel project is a regexp engine to parse huge bioinformatics data files. I would love a good regexp optimizer so that Martel grammar writers wouldn't have to worry as much about how to write things for performance (eg, (a|ab) is better written as (ab?)).

Thus, I want people to use an optimizer while having to know less about how regexps work, not more. Just like I can use a good compiler without knowing the instruction set and timing details of the processor I'm on. Of course, in both cases the better you know the details the more likely the regexp/code will be optimized well.

Your text assumes people want to know all of the details of how something works to use it. I like the ability to forget how something works and just know that it does.

(Okay, minor rant finished :) )

Algorithms on strings, trees and sequences, posted 6 Nov 2000 at 11:48 UTC by mvw » (Journeyer)

The book I mentioned was:

Dan Gusfield: Algorithms on strings, trees and sequences - computer science and computational biology, Cambridge University Press, 1997

Quite interesting stuff for anyone who is involved in matching and alignment problems, with a broad range of pointers to original work.

Clarifications, etc., posted 6 Nov 2000 at 18:19 UTC by japhy » (Journeyer)

I didn't mean to speak down to you with the comment I made. I meant it to be "I'd expect that the only people that use this are the ones that care about the performance increase", and possibly also "the ones that know WHAT the process does". Perhaps that second comment is slightly off, but regardless, I wouldn't expect the average Joe to start questioning whether HIS regexes need to be rethought.

Your solution to the problem is almost "right", except that the matching pattern needn't be at the end of the string. Obviously, it would be if the pattern included a check to make sure it was, like in the /(\d+)\D*$/ regex. But if the regex were simply /(\d+)/, then a slight modification would need to be made. But this method seems inefficient:


  use Benchmark;
  $string  = "123abc456def" x 100;
  $rstring = reverse $string;

  # run the four methods, each for 5 seconds at least
  timethese(-5, {
      backward       => sub { my $match = reverse(reverse($string) =~ /(\d+)/) },
      backward_cache => sub { my $match = reverse($rstring =~ /(\d+)/) },
      forward        => sub { my ($match) = $string =~ /(?:.*(?!\d+).)?(\d+)/ },
      regular        => sub { my ($match) = $string =~ /(\d+)\D*$/ },
  });
  __END__

output:

  backward:        3406.19/s (n=17065)
  backward_cache: 10482.95/s (n=55350)
  forward:         3606.31/s (n=18284)
  regular:          475.90/s (n=2389)

The backward_cache method is about 3 times faster than the forward method, and the caching gives a serious efficiency increase. The normal regex is far slower than any of the others.

I would bet that the reason your method (named 'forward' here) is slower than the reversal method is that you have to backtrack a character at a time after having read all of the string. In fact, running a second benchmark where the string is only digits, I see that the regular method is the fastest (5700 Hz), the caching method is second (3700 Hz), the non-caching reversal method is next (2500 Hz), and the forward-backtrack method is slowest (46 Hz). However, this is a bad test, since one would not want to use this method on such data.

So now I've modified the test. There are 8 tests being run on these 4 methods. The tests use values of 100 or 500 characters for the pre-match, match, and post-match sections. The pre-match sections look like "a1" several times (this gives false matches). The match sections look like "1" several times. The post-match sections look like "a" several times. These 8 tests are the sets from (100,100,100) to (500,500,500) where the number denotes the length of the field.

Yes, I know this is a lot of work for a test, but I want to find the comparative results.

This is getting long and bulky here, so here's a link to the results, as well as an article I wrote a month ago regarding the first stages of this concept (I called it "sexeger") -- I'll probably be writing another article discussing the possibility of end-first matching.

The documents are at If they're not there right now, wait a few minutes. This latest benchmark will be called "advogato-comparitive". I'll include the Perl source for those interested.

Thanks for your time (and rant -- it was much appreciated).

Why not reverse the regexp?, posted 9 Nov 2000 at 02:19 UTC by jenglish » (Master)

Regular expressions are closed under reversal:

R(a . b) = R(b) . R(a)
R(a | b) = R(a) | R(b)
R(a*) = R(a)*
(where '.' is concatenation, '|' is union, and '*' is Kleene closure), so to find the last match in a string s, you could find the first match in R(s).

This might be useful if reversing the string is less expensive than searching for the last match using the original regexp.
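Those identities translate directly into a toy reverser over a small regexp AST (a sketch; the node shapes and function names here are mine, not from any real engine):

```perl
# Toy AST nodes: ['lit','a'], ['cat',$l,$r], ['alt',$l,$r], ['star',$x]
sub rev_re {
    my ($n) = @_;
    my $op = $n->[0];
    return $n                                        if $op eq 'lit';
    return ['cat', rev_re($n->[2]), rev_re($n->[1])] if $op eq 'cat';  # R(a.b) = R(b).R(a)
    return ['alt', rev_re($n->[1]), rev_re($n->[2])] if $op eq 'alt';  # R(a|b) = R(a)|R(b)
    return ['star', rev_re($n->[1])]                 if $op eq 'star'; # R(a*)  = R(a)*
}

sub to_str {
    my ($n) = @_;
    my $op = $n->[0];
    return $n->[1]                                              if $op eq 'lit';
    return to_str($n->[1]) . to_str($n->[2])                    if $op eq 'cat';
    return '(' . to_str($n->[1]) . '|' . to_str($n->[2]) . ')'  if $op eq 'alt';
    return '(' . to_str($n->[1]) . ')*'                         if $op eq 'star';
}

# R(ab*) comes out as (b)*a
my $ast      = ['cat', ['lit', 'a'], ['star', ['lit', 'b']]];
my $reversed = to_str(rev_re($ast));
# $reversed is '(b)*a'
```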

Doh!, posted 9 Nov 2000 at 02:21 UTC by jenglish » (Master)

... and of course that approach (regexp reversal) is exactly what your diary entry was talking about...

Is there any way to cancel a reply?
