Negated regular expressions in Oniguruma and Ruby
About a week ago, I submitted this feature request asking for the ability to negate regular expressions in Ruby. I have since implemented the feature myself by patching Ruby trunk and patching Oniguruma 5.9.2 beside it. And now, I shall explain to you how it all works.
Overview
Regular expressions can be negated either as a whole (/.../v
) or in part
/(?v:...)/
) through the negation flag (v
). The letter v was chosen
for the negation flag because it comes from the ed(1) program of UNIX
heritage, where it serves a global command that negates the sense of the
regular expression passed in. This v heritage can also be found in
grep(1), whose -v
command-line
option inverts its matching behavior.
Wholly negated regular expressions
This is the simplest part of the implementation. We match the negated regular expression as usual, treating it as if it were not negated, and then negate the result of that match if there were no matching errors.
Partly negated regular expressions
At parse time, a partly negated regular expression, say r in /(?v:r)/
,
is expanded into /(?:rN)?.*?/
where N is a special instruction
(OP_NEGATE
) that causes the regular expression matching engine to treat
the input matched thus far as a mismatch.
For example, the /a(?v:b)c/
parse tree looks like this after expansion:
PATTERN: /a(?v:b)c/ (US-ASCII)
<list:275ca40>
<string:275c9f0>a
<enclose:275c950> option:4
<quantifier:2755640>{0,1}
<string:275c9a0>b
<quantifier:2755690>{0,-1}?
<anychar:275af20>
<string:2755550>c
Since r is wrapped in an optional non-capturing group (/(?:rN)?/
), the
matching engine is able to continue onward when r does not match.
However, if r did not match because the matching engine reached
non-matching input characters, then those characters will obstruct the
matching engine from continuing onward. That is why /.*?/
is provided.
For example, when /a(?v:b)c/
is matched against the input "axyzc"
, the
non-matching input characters ("xyz"
) will obstruct the matching engine
from matching the rest of the regular expression (/c/
) against the rest
of the input ("xyzc"
). However, since /.*?/
consumes "xyz"
, the
matching engine can proceed to match /c/
against "c"
.
Conclusion
In all, this was a challenging and fun learning experience for me. I brushed up on my rusty (thanks Ruby!) knowledge of the C language and the GDB debugger while also adding some shiny new tools to my repertoire:
- Clewn for graphical operation of GDB through GVim
- Cscope for answering the inverse of questions I ask Ctags
I have submitted my patches simultaneously to the Ruby developers and to the author of Oniguruma, and now await their response. If all goes well, we might see this feature in Ruby and Oniguruma someday. Enjoy! :-]