In the back of my head, I’ve always had a bunch of open-source projects that I wanted to build if I had the time. This has led to me sporadically working on a wide variety of projects without any real clear direction. This coming year I have decided to try to focus on specific goals in the hopes that it will help me stay more on track and actually complete a project for once. In an effort to be more transparent and maybe incentive others to help me reach them, I’m sharing them here. Without further ado, here are my 2023 open-source resolutions, in their order of priority.
Fortunately, the first goal is also the thing that Shopify pays me to work on. We want a new parser for the Ruby programming language. I’ve talked about this at length before so won’t dive too much into the reasons now. At a high level, we want a library that can be used by other projects outside of CRuby, we want it to be well-documented and maintainable, and we want it to be error tolerant. I’m hoping to deliver the first alpha version of this library that successfully parses most Ruby code sometime this spring.
This project is housed at Shopify/yarp and is already well on its way. If you’d like to contribute to this effort, I’ve opened up a bunch of issues on the project that are good places to start.
Syntax Tree AST updates
The main open-source project that I maintain is Syntax Tree. This library provides an object layer on top of the Ruby parser. It also provides tools to manipulate that layer and convert it to other formats. This allows formatting, linting, translation, compilation, etc. The syntax tree that Syntax Tree uses is a parse tree. In other words, the tree is representative of the syntax that was found as opposed to the semantics that were implied.
For example, both lines in the following Ruby code has very different syntax but the same exact semantics:
if foo then bar else baz end foo ? bar : baz
The AST that the new parser uses to represent these two lines is almost the same tree (only the tokens attached to the tree are different). This makes it easier to use for the consumer because you don’t want to have to worry about variations in syntax when you’re trying to compile the tree into your own format.
This year I’d like to make Syntax Tree’s AST match the AST used by Shopify/yarp. This should allow consumers of Syntax Tree to use the new parser without any changes to their code, because the object layer will be the same. To do this, I’ll need to make some breaking changes to Syntax Tree’s AST and release a new major version. This won’t impact the formatting at all, because the output should stay consistent. This is only about changing the internal representation of the tree.
I’m hoping to get this done by the end of the year. That being said, progress can be made on this already since the new parser’s nodes are largely already defined. If you’d like to help out with this effort, I’d love to have your help by opening issues or pull requests on Syntax Tree.
Syntax Tree compilation
Syntax Tree recently gained the ability to represent instruction sequences. It can convert a syntax tree into them, assemble them from a text format, disassemble them into the same format as used by CRuby, and even evaluate them by emulating the CRuby virtual machine. This is extremely powerful for a couple of reasons. Here are a couple of things I’m dreaming of with this functionality:
- I would love a visualizer for CRuby execution. This means walking through the process of tokenizing, parsing, compiling, optimizing, and executing. You can see a bare-bones version of this on ruby-syntax-tree.github.io/. I’d love to extend this to be more like Compiler Explorer. This would largely be for educational purposes to help people better understand how CRuby works.
- I would like to treat YARV as a compilation target. You can already see this in Syntax Tree today with the bf.rb file. This actually compiles brainf*** code directly into YARV instruction sequences, and allows you to evaluate it natively on the CRuby virtual machine. This is actually something Koichi discussed all of the way back in 2004 (see slide 46 from his RubyConf 2004 presentation).
- Treating YARV as a compilation target also means we could potentially write a C interpreter that would execute natively on YARV. This could potentially allow us to see through C extensions and allow us to perform more optimizations with YJIT and other JITs.
- YARV being a compilation target also means we could do something like prepack for Ruby. Since we can now compile our own instruction sequences, we could potentially modify them on the fly like I did with kddnewton/preval. With a combination of dynamic type analysis, partial evaluation, and rewriting the instruction sequences, we could build an entirely new kind of JIT all in Ruby.
- I would love to be able to execute Ruby code in an environment with complete control over which methods can be called. This would allow you to write a Ruby program that can be executed in a sandboxed environment. We can do this by having complete control over the Ruby virtual machine without having to write any C.
- First, we need to upstream the work done in Syntax Tree into the project that it originated from: kddnewton/yarv. I will probably want to move this project under the
ruby-syntax-treeGitHub organization once this is done.
- I’d like to get most of
ruby/specpassing for our toy VM. It’s mostly there, but at moment having some trouble with catch tables. I need to do more investigation into this.
- I’d like to instrument the toy VM such that we could build a visualizer on top of the execution of it. At the moment this happens whenever the value or frame stacks change, which is a pretty good start.
If you’d like to help me out with this project, pull requests making more specs pass would be dearly appreciated.
Syntax Tree translation
The Ruby ecosystem has developed 3 main syntax trees. Those are:
- CRuby’s syntax tree represented using
- whitequark/parser’s syntax tree, used to build
rubocopamong other tools.
- seattlerb/ruby_parser’s syntax tree, used to build
flogamong other tools.
A while ago when I started Syntax Tree I had the dream of translating the syntax tree used by Syntax Tree into these other syntax trees. (I really shouldn’t have named this library Syntax Tree…) This resulted in ruby-syntax-tree/syntax_tree-translator. This project is still a work in progress, but it is functioning to a certain extent. It finally got to the point where it can actually run a small subset of rubocop rules. This means that we can parse using Syntax Tree, translate into the parser syntax tree, and then run rubocop rules on it.
We want this because when the new parser is complete, it is going to be used by Shopify/ruby-lsp to provide language server functionality. The tree itself is cached in the background of the editor to speed up LSP requests. If we want to run other tools like rubocop or flog, it would be great to not have to reparse the entire file. This is where the translation comes in.
This effort has largely stalled while I’ve worked on other projects. I’d like to get back to it eventually. At the moment it’s running an older version of Syntax Tree and needs to be updated. If you’d like to contribute to this project, I’d love any and all pull requests to update the version of Syntax Tree or get more nodes translated correctly.
A while ago I started working on a regular expression engine written in just Ruby. That work was written at kddnewton/regular_expression. That project featured a couple of interesting things:
- a compiler that compiled regular expressions into a bytecode, similar to Onigmo
- a control-flow graph that was used to optimize the bytecode
- a JIT compiler that compiled the bytecode into x86-64 machine code
The project was a lot of fun to work on, but had some significant issues. It was using a very naive uncached NFA approach, which has some significant downsides. I’ve learned a lot about regular expressions and their implementations since then, and have consolidated that work into kddnewton/exreg.
I want this project to be completed for a couple of reasons:
- Even with the new Ruby 3.2 enhancements to regular expressions, there still isn’t the possibility of JIT-ing them. We can get even more performance out of our regular expressions with a JIT compiler, especially if is using SIMD instructions judiciously.
- Onigmo still doesn’t do powerset construction, even for simple regular expressions. This means it never resolves to a DFA, even if it could. For the majority of cases, this doesn’t really matter, especially with the new NFA caching. But it’s leaving a tiny bit of performance on the table that I’d like to gain.
- Having a regular expression engine written in Ruby would allow us to hook into it and instrument its execution. This feeds into my dream of a full Ruby visualizer. We could do really interesting things like showing the state machine inline with the editor through a language server.
To get this project completed, I need to do the following:
- finish implementing the other features of Onigmo’s regular expression engine in
- bring the JIT compiler back from
kddnewton/regular_expressionand get it working with
- add the language server functionality that we have in Syntax Tree into
exregso we can visualize it
Any help on this would be massively appreciated! This is the least likely project to happen this year because of the other priorities above.
I’m still going to be maintaining other projects and updating them as their dependencies change. However, I’m going to resisting working on any new features for them. Here are a couple of those projects that I’ll be maintaining:
I honestly do not know how much of this I’ll be able to get done in the next year. It’s a completely overambitious list of projects. But, it’s out in the open now, so hopefully I’ll be a little more focused this year than I have been in the past. Here’s the dream: if all of these things were able to be completed this year, we would have:
- A new parser and an object layer on top of it that can be used to power any of the existing tools that require a parser.
- A complete visualization tool for Ruby that would teach folks about how CRuby works. (Imagine Advent of YARV’s diagrams but completely animated and automated.)
- The ability to partially evaluate any Ruby code for education and analysis purposes.
- A regular expression engine that is written exclusively in Ruby that can be used for education, analysis, and experimentation.
Again, it’s probably overambitious. But it’s definitely fun to dream. As I’ve mentioned multiple times in this post, any help is appreciated.← Back to home