A Ruby Regular Expression Engine
Recently I put some finishing touches on the exreg gem, a pure-Ruby implementation of a Unicode regular expression engine. It supports nearly all of the same functionality as Onigmo, Ruby's regular expression engine, with caveats listed in the README. Importantly, however, it uses a Thompson-style NFA virtual machine, which makes it immune to ReDoS attacks caused by catastrophic backtracking.
Background
Most of the technical background for this can be found in Russ Cox’s excellent series of blog posts entitled Regular Expression Matching Can Be Simple And Fast. In short, traditional backtracking regular expression engines (like Onigmo) can be tricked into taking exponential time on certain pathological inputs. This is because they try to explore all possible paths through the NFA until they find a match, and in some cases the number of paths grows exponentially with the size of the input.
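To make catastrophic backtracking concrete, here is a classic pathological pattern. This illustrates the general problem, not anything specific to Onigmo's internals:

```ruby
# A classic pathological pattern: nested quantifiers over the same character.
# On a pure backtracking engine, a failed match forces the engine to try
# every way of splitting the run of "a"s between the two quantifiers, which
# grows exponentially with the input length. (Recent CRuby versions mitigate
# many such patterns with an internal cache, and Ruby 3.2 added
# Regexp.timeout as a safety valve.)
pattern = /(a+)+b/

pattern.match?("a" * 10 + "b") # => true, found quickly
pattern.match?("a" * 10 + "c") # => false, after exploring many paths
```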
Thompson-style NFA engines, on the other hand, simulate all possible paths in parallel, thus making their execution linear with the size of the input. Note that this does not mean that they are inherently faster than backtracking engines; it just means that their runtime is bounded by a linear function of the input size.
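The parallel simulation can be sketched in a few lines of Ruby. This is a toy, hand-built NFA for `/a*b/`, not exreg's actual representation, but it shows why the runtime is linear: we advance a *set* of states per input character instead of backtracking through them one path at a time.

```ruby
# Toy NFA for /a*b/: state 0 loops on "a" and moves to state 1 on "b".
TRANSITIONS = {
  0 => { "a" => [0], "b" => [1] },
  1 => {}
}
ACCEPT = [1]

def nfa_match?(input)
  states = [0]
  input.each_char do |char|
    # Advance every live state in parallel; the set never exceeds the
    # number of NFA states, so the work per character is bounded.
    states = states.flat_map { |s| TRANSITIONS[s].fetch(char, []) }.uniq
    return false if states.empty?
  end
  states.any? { |s| ACCEPT.include?(s) }
end

nfa_match?("aaab") # => true
nfa_match?("aaba") # => false
```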
Implementation
Exreg follows a fairly standard architecture for a regular expression engine. It consists of a parser that converts a regular expression pattern into an abstract syntax tree (AST), a compiler that converts the AST into bytecode for a custom virtual machine, and the virtual machine itself that executes the bytecode against input strings. There are a couple of interesting implementation details worth mentioning, which I’ll cover below.
Unicode
Unicode support requires quite a large database of properties, codepoint sets, and case folding information. To keep this performant, Exreg has a rake task that generates a binary database from the Unicode Character Database (UCD) files it downloads when the gem is installed. It keeps offsets into this binary database in memory, but only loads the actual codepoint sets on demand. To make this work, I designed a custom binary format that is both compact and fast to load. Fortunately, some judicious use of pack/unpack made this fairly seamless.
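As a rough illustration of the pack/unpack approach (exreg's actual format is more involved), half-open codepoint ranges can be serialized as fixed-width integers and decoded lazily:

```ruby
# Hypothetical sketch: serialize half-open codepoint ranges as pairs of
# 32-bit little-endian unsigned integers. These ranges cover ASCII letters.
ranges = [[0x41, 0x5B], [0x61, 0x7B]]

# Each [min, max) pair becomes 8 bytes in the binary blob.
binary = ranges.flatten.pack("L<*")
binary.bytesize # => 16

# Loading on demand is just the inverse unpack.
loaded = binary.unpack("L<*").each_slice(2).to_a
loaded == ranges # => true
```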
USet
Codepoints are always stored in USet objects, which are collections of half-open ranges that can efficiently represent large sets of Unicode codepoints. USet supports standard set operations like union, intersection, and difference, as well as more specialized operations like case folding and inversion. This primitive is used throughout the compilation process, but is eliminated by the time the bytecode is generated.
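A minimal sketch of the half-open-range idea (hypothetical, not exreg's actual USet API): because the ranges are sorted and non-overlapping, membership testing is a binary search.

```ruby
# A set of codepoints stored as sorted, non-overlapping [min, max) ranges.
class RangeSet
  def initialize(ranges)
    @ranges = ranges
  end

  # Binary search over the ranges: O(log n) membership for arbitrarily
  # large codepoint sets.
  def include?(codepoint)
    !@ranges.bsearch { |min, max|
      codepoint < min ? -1 : (codepoint >= max ? 1 : 0)
    }.nil?
  end
end

letters = RangeSet.new([[0x41, 0x5B], [0x61, 0x7B]]) # ASCII letters
letters.include?("A".ord) # => true
letters.include?("0".ord) # => false
```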
ByteSet
Once the bytecode is generated, individual instructions that consume sets of bytes use ByteSet objects to check for a match. Because a single 256-bit integer would be inefficient to work with, we instead represent the set of bytes as an array of eight 32-bit integers. (Each 32-bit value fits within Ruby's immediate tagged integer range, so the integers are never heap-allocated.) This allows for very fast membership testing using bitwise operations.
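A minimal sketch of the bitwise scheme (not exreg's actual ByteSet API): each of the 256 possible bytes maps to one bit across eight 32-bit words.

```ruby
# 256-bit byte set stored as eight 32-bit words; byte b lives at
# word b >> 5, bit b & 31.
class ByteBitSet
  def initialize
    @words = Array.new(8, 0) # 8 * 32 = 256 bits, one per possible byte
  end

  def add(byte)
    @words[byte >> 5] |= (1 << (byte & 31))
  end

  # Membership is a shift, a mask, and a compare.
  def include?(byte)
    (@words[byte >> 5] & (1 << (byte & 31))) != 0
  end
end

set = ByteBitSet.new
set.add(0x61)
set.include?(0x61) # => true
set.include?(0x62) # => false
```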
Encoding
Onigmo supports dozens of encodings, with certain regular expression features only making sense in certain encodings. You can see the full set of supported features here and how they interact with encodings. Exreg, on the other hand, only supports Unicode encodings, effectively treating each regular expression as if it had (?u) at the start.
Ruby and Onigmo also do an odd dance to resolve which encoding to use whenever a string is about to be matched against a regular expression. You can get a sense of this from Kevin Menard's excellent RubyConf talk from a few years back, The Three-Encoding Problem. Exreg does not do this; instead, it assumes UTF-8 unless an explicit encoding is provided.
Exreg also makes the decision to iterate over strings a single byte at a time. This means that it encodes the byte representation of Unicode codepoints into the VM itself. Then when it matches against strings, it effectively treats them all as byte arrays. This has some technical tradeoffs, but largely simplifies the API.
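For example, matching the single codepoint é against UTF-8 input means matching its two-byte encoding. A toy byte-level matcher (hypothetical, not exreg's VM) makes the idea concrete:

```ruby
# The pattern codepoint is compiled down to its UTF-8 bytes ahead of time.
pattern_bytes = "\u00E9".bytes # é encodes to [0xC3, 0xA9] in UTF-8

# The input string is then treated as a plain byte array during matching.
def match_at?(haystack_bytes, pattern_bytes, offset)
  pattern_bytes.each_with_index.all? { |b, i| haystack_bytes[offset + i] == b }
end

input = "caf\u00E9".bytes
match_at?(input, pattern_bytes, 3) # => true
match_at?(input, pattern_bytes, 0) # => false
```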
Conclusion
I have been thinking about this problem for a very long time, and am happy to finally have it published. There are still a bunch of features that I would like to add, mostly to achieve parity with the Onigmo engine. Some of these would require implementing a backtracking engine as well, since backreferences and lookaround assertions usually require backtracking. It would be nice from a user perspective to be able to see which strategy has to be used for a given regular expression in order to determine if it is safe to use with untrusted input.
This is also a great place to introduce a JIT. The current VM is purely interpreted, which is not very fast. A JIT could compile frequently used regular expressions down to native code, which would speed things up considerably. This is something I may explore in the future, since it is already a bytecode VM with mostly mathematical operations.
Contributions or questions are always welcome, wherever you can find me on the internet (GitHub, email, etc.). Happy matching!