Kevin Newton

Prism: Ruby 3.3’s new error-tolerant parser

2024-01-23T00:00:00+00:00

Prism is a new library shipping as a default gem in Ruby 3.3.0 that provides access to the Prism parser, a new parser for the Ruby programming language. Prism is designed to be error tolerant, portable, maintainable, fast, and efficient.

Usage

To use the Prism parser through the Ruby bindings, you would require the prism library and then call any of the various parse methods on the Prism module. For example:

require "prism"
Prism.parse("1 + 2")

This method will return to you a parse result object, which contains the syntax tree corresponding to the parsed source code, lists of errors, warnings, and comments, as well as various other metadata related to the parse operation. Importantly this method will always return a parse result (as opposed to raising an exception when a syntax error is found), which makes it suitable for working on source code that may contain syntax errors.
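As a rough sketch of what working with that parse result can look like (assuming the prism gem bundled with Ruby 3.3 or later is available, and with variable names that are purely illustrative), the snippet below parses an incomplete expression and inspects the result instead of rescuing an exception:

```ruby
require "prism"

# Parse source that contains a syntax error; no exception is raised.
result = Prism.parse("1 +")

# The result still carries a syntax tree rooted at a program node.
root_type = result.value.type

# Errors are collected on the result instead of being raised.
error_messages = result.errors.map(&:message)

puts "parse failed: #{result.failure?}"
puts "root node: #{root_type}"
puts "first error: #{error_messages.first}"
```

The exact wording of the error message may differ between Prism versions, so tooling is probably better off relying on the presence of errors and their locations than on the text itself.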

History

Prism was originally designed in 2021. It originated at Shopify, where the need for a fast and efficient error-tolerant parser became quite evident. In 2021, Shopify was already heavily invested in CRuby, TruffleRuby, Sorbet, and various Ruby tooling. In total, Shopify developers were helping to maintain four different parsers for the Ruby programming language. This was a lot of work, and it was clear that the community would benefit from a single parser that could be used by all of these projects.

In consultation with the maintainers of all of these projects and more, the project went through various prototyping and design phases before eventually landing on the current design. This progressed over the course of a year and a half to get us to where we are today. In that time the project has been open sourced, and has been integrated into various projects in the Ruby ecosystem.

Design

As mentioned, Prism is designed to be error tolerant, portable, maintainable, fast, and efficient. The parser and nodes therein are designed to be as simple as possible to deal with from the perspective of an implementation or tooling. We will discuss each of these design goals in turn.

Error tolerance

Since Microsoft created Visual Studio Code and the language server protocol, error tolerance has been much more in the spotlight for programming languages. It has become table stakes for a good developer experience that the parser powering your editor is able to parse code that contains syntax errors, because most of the time that code is being written it is not in a completed state. Prism was designed and hand-written with error tolerance in mind for this reason. At a minimum, with a file containing myriad syntax errors, Prism will always return a list of the top-most statements.

As Prism has been developed, the team has worked closely with the team designing Ruby LSP, a language server for Ruby. This has allowed the developers to ensure that Prism is able to parse the code that Ruby LSP is sending it, and that the errors Prism is returning are useful to the end user. As we continue this work in Ruby 3.4.x, we will continue to iterate on and improve the error tolerance of Prism.

Portability

Prism was designed to be a replacement for all of the various parsers that had been developed over the years of Ruby’s lifetime. This includes CRuby’s parser, but also the parsers of all of the other Ruby implementations and third-party tools. Because of this, the developers of Prism have been consulting from the beginning with the maintainers of JRuby, TruffleRuby, IRB, and various other implementations and tools.

To that end, CRuby, JRuby, TruffleRuby, and Natalie have all integrated Prism as a replacement for their existing parsers. Within CRuby (the default Ruby implementation) it ships as an optional parser. JRuby and TruffleRuby are both working on making it their default parser in their next versions. Natalie has already made it its default parser.

Over the course of the Ruby programming language’s lifetime, there have been various other third-party parsers that have been developed. This includes whitequark/parser and seattlerb/ruby_parser. Both of these parsers have powered various tools and libraries over the years, including big names in the ecosystem like rubocop. We have been working with the developers of these tools to provide alternate options to include Prism as a backend in order to fully integrate the entire ecosystem into one cohesive effort.

Prism is a standalone library with no dependencies, which makes it easy to also ship bindings to other languages. As of writing this article, Prism is already powering tooling written in Ruby, C, C++, Rust, Java, and JavaScript. We are actively working with maintainers of libraries in all of these languages to ensure that Prism is a viable option for them.

Maintainability

Prism was designed to be as maintainable as possible in order for it to last as the default parser for the community. To that end, every node and field in the entire syntax tree is documented with comments and tests. Additionally a whole blog series has been written about the design and implementation of Prism to provide additional context. We hope that by continuing to invest in the maintainability of Prism, we can provide the community with a basis for all kinds of excellent developer tooling for years to come.

Parser design

Prism is a hand-written recursive descent parser. It is written in C99, and is designed to be portable to any platform that Ruby supports. It is structured as a large Pratt parser, with additional modification when the Ruby grammar changes precedence or associativity rules.

In general, Prism parses a superset of valid Ruby code. For example, in addition to parsing a constant path in the place of the name of a class, it will also parse any valid expression beginning with a constant. This would look like:

class foo.bar
end

We do this to enable good error recovery. By allowing the parser to parse expressions where they would normally not be permitted, we can recover from errors in a way that is more useful to the end user.

It is also beneficial to parse a superset because of incremental parsing. Incremental parsing refers to the ability to parse a subset of a file as it is being written. By parsing any kind of expression in any position (like above), we enable tools to represent more of the syntax tree even when it is in an invalid form. This becomes particularly important for linters and type checkers because they do not have to discard as much information whenever the file changes.

If you take the example from above, even though foo.bar is in an invalid location in the parse tree, typecheckers and linters can still process the method call as if it were valid. Then, if the user types additional characters to make it valid, the tool can keep around the method call node without having to reprocess it.
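A minimal sketch of how a tool might observe this, assuming the prism gem from Ruby 3.3 or later. The `collect_types` helper is hypothetical and not part of Prism. It parses the invalid class definition from above and lists whichever node types survived in the tree:

```ruby
require "prism"

# Invalid Ruby: a class name must be a constant path, not a method call.
source = "class foo.bar\nend\n"
result = Prism.parse(source)

# Hypothetical helper: gather the type of every node reachable from the root.
def collect_types(node, types = [])
  types << node.type
  node.compact_child_nodes.each { |child| collect_types(child, types) }
  types
end

node_types = collect_types(result.value)

puts "errors reported: #{result.errors.length}"
puts "node types: #{node_types.inspect}"
```

The exact shape of the recovered tree is an implementation detail that may change between Prism versions, so the sketch prints it rather than assuming it.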

Node design

The nodes in Prism’s syntax tree are designed to make it as simple as possible to compile, while retaining enough information to be able to recreate the source code at any point. With this in mind, Prism splits up a lot of nodes that other syntax trees generally keep together to make their intention as clear as possible. For example the following code:

@foo = 1
for @foo in 1..10 do end

In both of the lines above, the @foo instance variable is being written to. In the first line it is being written directly with the value of 1, in the second line it is being written indirectly with the current value of the iteration of the loop. In other syntax trees, this is usually represented with a single node type (instance variable write) with an optional value attached. This means that in order to compile and understand the node, the consumer always has to check if a value is present. In Prism, we split up these two cases into two separate nodes: InstanceVariableWriteNode and InstanceVariableTargetNode. The first node is used for direct writes, and the second node is used for indirect writes.

With these splits in place, the resulting compiler within CRuby ends up being a “flatter” compiler because there are fewer nested branches to deal with. This is intentional; one of the key tenets of designing the Prism nodes is that you never have to consult a child node to determine how to compile the parent node. We believe this will make it easier to maintain and extend the compiler in the future. We also end up saving on space because we don’t end up storing any null values in the nodes where it’s not possible for them to have a value.
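To make the split concrete, here is a small sketch (assuming the prism gem from Ruby 3.3 or later) that parses the two lines from the example above and prints the node type used for each kind of write. The variable names are illustrative only:

```ruby
require "prism"

source = <<~RUBY
  @foo = 1
  for @foo in 1..10 do end
RUBY

# The program body holds one node per top-level statement.
direct, loop_node = Prism.parse(source).value.statements.body

# The direct write has its own node type and always carries a value.
puts direct.type       # instance_variable_write_node
puts direct.value.type # integer_node

# The loop index is a separate target node type with no value attached.
puts loop_node.index.type # instance_variable_target_node
```

A consumer that only cares about direct writes can therefore handle the write node and never needs to check whether a value is present.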

Speed and efficiency

Lots of benchmarking has been done to ensure that Prism is as fast as possible and as efficient with memory as it can be, though there is a lot of room for improvement here. We have been benchmarking by parsing large suites of Ruby code and measuring both the time it takes to parse on its own, as well as the time it takes to reify the syntax tree into Ruby. This work will continue in the new year.

Testing

It has been massively important to our development efforts to build a robust test suite for Prism. Various test suites have been created over the years for the Ruby programming language, but few — if any — have been built with a parser in mind. In addition to our own set of fixtures that we have built over the regular course of development, we have also vendored parser test suites from whitequark/parser and seattlerb/ruby_parser. We have also been testing against the latest version of every released gem on rubygems.org, which has been a great source of bugs and edge cases.

In testing, we have used a combination of many different forms of tests. The first is regression tests: we take snapshots of syntax trees that are the result of parsing fixtures and on subsequent runs of the test suite we compare them against the saved version. This is useful for ensuring that we do not regress on syntax trees that we have already parsed correctly. The second is manual unit tests addressing both particular functionality and error tolerance. These are useful for testing specific edge cases and for ensuring we are able to recover from errors in a consistent manner. Finally, we have small test suites for specific features like regular expressions, encodings, and escape sequences. These test suites employ brute-force testing (i.e., testing every possible combination of values). For example, with encodings we test every codepoint in every encoding. These test suites ensure those concerns are handled correctly.

Finally, it has been very important to fuzz the various inputs to the Prism parser. As with any C project, there are many ways to introduce memory corruption bugs. We use AFL++ to fuzz the parser and lexer to ensure we never crash or read off the ends of the input. In conjunction with ASAN and various other memory sanitizers, we have been able to ensure that Prism is as stable as possible.

Challenges

There are many challenges in working with Ruby source code. The grammar itself is very complicated, and has been extended many times over the years. Beyond this, there are some specific challenges that we have faced in developing Prism.

Local variable reads and method calls are indistinguishable when they are represented using a single identifier. Unfortunately, this becomes quite significant because an identifier being a local variable can change the shape of the parse tree. As such, all local variable scopes must be resolved at parse time. Normally, this wouldn’t be particularly difficult. But certain structures can introduce local variables that are more complex than simple writes. As an example, regular expressions with named capture groups can introduce or modify local variables. The implication is that in order to properly parse Ruby code, Prism must therefore have a regular expression parser that parses as CRuby does. In code, this looks like:

/(?<foo>bar)/ =~ "bar"
foo /bar#/

In the code above, the first line introduces a local variable foo that is then used in the second line. The second line is a method call to the / method with bar as an argument. However, if foo is not introduced, this will be parsed as a method call to foo with a regular expression as an argument. This is a very subtle distinction, but it illustrates the importance of having all of the local variables resolved at parse time.
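A hedged sketch of how this difference shows up through the Ruby API, assuming the prism gem from Ruby 3.3 or later. Note that the sketch writes the second line with no space after the slash (`foo /bar#/`), since that is the spelling for which Ruby's lexer treats the slash as the possible start of a regular expression:

```ruby
require "prism"

# Named capture groups on the left of =~ introduce local variables.
with_capture = Prism.parse(%q{/(?<foo>bar)/ =~ "bar"} + "\nfoo /bar#/")

# Without the capture, foo is not a known local variable.
without_capture = Prism.parse("foo /bar#/")

locals = with_capture.value.locals
division = with_capture.value.statements.body.last
command = without_capture.value.statements.body.last

puts locals.inspect # the locals of the program, including foo
puts division.name  # the / method, called on the local variable foo
puts command.name   # foo, called with a regular expression argument
```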

Source code in Ruby can be encoded in any of the 90 ASCII-compatible encodings that CRuby supports. Therefore in order to properly parse Ruby code, Prism has to explicitly support every encoding that CRuby does. Fortunately Prism only needs a subset of each encoding’s functionality: just enough to determine if the subsequent bytes form an alphabetic, alphanumeric, or uppercase character. In code, this looks like:

# encoding: Shift_JIS

The name of the encoding can be any of the 154 aliases for the ASCII-compatible encodings. This must be resolved as soon as the encoding comment is encountered to ensure all subsequent strings and identifiers are parsed correctly.

Finally, Ruby has a very rich set of escape sequences that can be used in strings and regular expressions. These escape sequences can be used to represent any Unicode codepoint, as well as various other special characters. In order to properly parse Ruby code, Prism has to support all of these escape sequences and return the exact bytes that they represent. This makes it easier on individual implementations as they no longer have to parse escape sequences, but makes it more difficult to maintain on the Prism side.
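As an illustration of what this buys a consumer, the following sketch (assuming the prism gem from Ruby 3.3 or later) compares the raw content of a string literal with the unescaped bytes that Prism hands back. The sample string is made up for the example:

```ruby
require "prism"

# Single quotes keep the backslashes intact so the parser sees the escapes.
source = '"tab:\t snowman:\u2603"'
string_node = Prism.parse(source).value.statements.body.first

# The content as written in the source, escapes and all.
raw = string_node.content

# The string after Prism has processed the escape sequences.
unescaped = string_node.unescaped

puts raw
puts unescaped.bytes.length
```

An implementation can use the unescaped value directly instead of carrying its own escape sequence parser.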

APIs

Many APIs exist in Prism beyond just parsing that can be useful to a developer creating tooling on top of the Ruby syntax tree. Some APIs are novel, and exist to provide additional information. Others are replacements for existing workflows that have never had a standard API before.

One such existing workflow was to find all of the comments in a source file. Usually this was done with Ripper, but you can accomplish the same with Prism with less effort:

Prism.parse_comments(<<~RUBY)
# foo
# bar
RUBY

This will result in an array of comments, which looks like:

# =>
# [#<Prism::InlineComment ...>,
#  #<Prism::InlineComment ...>]
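Since the inspect output of a comment object is not very readable, a tool will usually pull the text and position out of each comment. A minimal sketch, assuming the prism gem from Ruby 3.3 or later and using only the location attached to each comment:

```ruby
require "prism"

comments = Prism.parse_comments(<<~RUBY)
  # foo
  # bar
RUBY

# Each comment knows where it came from, so its text can be sliced back out.
comment_text = comments.map { |comment| comment.location.slice }
comment_lines = comments.map { |comment| comment.location.start_line }

comment_text.zip(comment_lines).each do |text, line|
  puts "line #{line}: #{text}"
end
```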

Another common workflow was to determine if a source file was valid or not. This was frequently accomplished using either Ripper or RubyVM::InstructionSequence. Prism provides a simpler API for this:

Prism.parse_success?("1 + 2") # => true
Prism.parse_success?("1 +") # => false

By providing these additional APIs, it makes it easier for the consumer to write less code and to have a more consistent experience across different versions of Ruby.

Every node in the syntax tree itself has a common set of APIs as well. All nodes have their own class (as opposed to every other Ruby syntax tree which tends to use a single class with a type attribute). These classes all respond to their own named fields for children and attributes. Additionally they all respond to #child_nodes (which includes nil values) and #compact_child_nodes (which does not include nil values) to gather up all child nodes contained in the current parent node. You can leverage this common interface to walk over every node in the syntax tree:

def walk(node, indent = 0)
  puts "#{" " * indent}#{node.type}"
  node.compact_child_nodes.each { |child| walk(child, indent + 2) }
end

walk(Prism.parse("foo.bar(1); baz(2)").value)

The above code will output the following tree-like structure:

program_node
  statements_node
    call_node
      call_node
      arguments_node
        integer_node
    call_node
      arguments_node
        integer_node

Each node also responds to #copy, which is useful for treating nodes as immutable and generating new nodes with certain fields overridden. They all implement pattern matching with #deconstruct and #deconstruct_keys. Finally they all respond to #location, which allows the user to determine the exact location in the source code that the node was parsed from.
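Here is a short sketch of those three APIs used together, assuming the prism gem from Ruby 3.3 or later. The variable names are illustrative only:

```ruby
require "prism"

node = Prism.parse("foo.bar(1)").value.statements.body.first

# Pattern matching goes through #deconstruct_keys, so fields match by name.
receiver_name =
  case node
  in Prism::CallNode[name: :bar, receiver: Prism::CallNode[name:]]
    name
  end

# Locations report where in the source the node was parsed from.
location = node.location
span = [location.start_line, location.start_offset, location.end_offset]

# #copy builds a new node with the given fields overridden.
renamed = node.copy(name: :baz)

puts receiver_name # foo
puts span.inspect  # [1, 0, 10]
puts renamed.name  # baz
puts node.name     # bar (the original node is untouched)
```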

For working with subsets of nodes, nodes all implement the #accept method, which accepts a visitor object. Visitors implement the double-dispatch visitor pattern to allow for easy traversal of the syntax tree. Prism ships with Prism::Visitor and Prism::Compiler to provide a common set of visitors for common use cases. The Prism::Visitor class is useful for finding subsets of the nodes or generally querying output. The Prism::Compiler class is useful for transforming the syntax tree into a different form, like a bytecode or other representation. As an example, if you wanted to find all method calls in a syntax tree, you could:

class MethodCallFinder < Prism::Visitor
  attr_reader :calls

  def initialize(calls)
    @calls = calls
  end

  def visit_call_node(node)
    super
    calls << node.name
  end
end

calls = []
Prism.parse("foo.bar.baz").value.accept(MethodCallFinder.new(calls))

calls
# => [:foo, :bar, :baz]
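For contrast, here is a sketch of the other base class, Prism::Compiler, where each visit method returns a value. The ArithmeticCompiler class below is hypothetical and only handles integer literals, parentheses, and binary operators. It is meant to show the shape of a compiler, not a complete one:

```ruby
require "prism"

# Hypothetical compiler that folds integer arithmetic into a Ruby Integer.
class ArithmeticCompiler < Prism::Compiler
  def visit_program_node(node)
    visit(node.statements)
  end

  def visit_statements_node(node)
    # The value of a list of statements is the value of the last one.
    node.body.map { |statement| visit(statement) }.last
  end

  def visit_parentheses_node(node)
    visit(node.body)
  end

  def visit_integer_node(node)
    node.value
  end

  def visit_call_node(node)
    # Only binary operators with a receiver and one argument are handled.
    left = visit(node.receiver)
    right = visit(node.arguments.arguments.first)
    left.public_send(node.name, right)
  end
end

total = Prism.parse("(1 + 2) * 3").value.accept(ArithmeticCompiler.new)
puts total # 9
```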

Prism ships with some visitors and compilers already built in, which are useful on their own and as examples of manipulating the tree. It ships with the ability to convert syntax trees into a directed graph in the Graphviz format. It also provides a Prism::DesugarCompiler, which “desugars” syntax into equivalent syntax using fewer node types. Finally, it provides a Prism::MutationCompiler, which allows users to modify syntax trees, as you would when providing automated refactoring.

Future work

Now that we are shipping with Ruby 3.3.0, we will continue to develop Prism in harmony with the Ruby community to produce the best possible foundation for Ruby tooling going forward. In service to that goal, there are many directions that we are looking to take Prism in the future.

The first major goal of Prism is to achieve exact parity with CRuby’s current parser. Today, Prism parses all valid Ruby correctly, but there are still some edge cases where it fails to reject invalid Ruby code. We are working to close this gap as quickly as possible, and intend on having it closed by the time Ruby 3.4.0 ships. There are additionally some warnings, niceties in terms of error message ergonomics, and tweaks to error recovery that we are working on to ensure CRuby does not lose any functionality (like specific error recoveries or warnings) when and if they switch to using Prism as the default parser.

The second major goal of Prism in the new year is to increase adoption within the community. While we have already integrated many major tools and implementations, there are still many more places in the ecosystem that could benefit from Prism. This includes implementations like mruby and tools like Sorbet. We hope this year to work with the maintainers of these projects to ensure that Prism is a viable option for them.

Thirdly, we would like to improve documentation and the general developer experience when working with Prism. While we have worked hard to make this a good experience from the start, there is always room for improvement here. Ideally we would like to lower the bar as much as possible to make it approachable for anyone (regardless of experience level) to contribute to Prism.

Finally, we plan to spend time this year working on performance. While Prism is already quite fast, there are still some areas where we can improve. We will be looking at SIMD instructions and other low-level optimizations to optimize for specific target platforms. We will also be looking at optimizing memory layout and allocations to reduce the overall memory footprint of Prism.

Overall, we are very excited about Prism and the future of Ruby tooling that it enables. Already we are seeing a plethora of new tools and libraries being developed on top of Prism, and we hope that this trend continues with the release of Ruby 3.3.0.

Advent of Prism: Part 24 - Error tolerance

2023-12-24T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about error tolerance.

We have finally reached the end of our series. To date, we have covered 147 nodes in the prism syntax tree. As it turns out, this is 1 less than the total. The final node is MissingNode, which is the subject of today’s post. Before we get into that, however, we need to talk about error tolerance.

Error tolerance

Every example we have seen in this blog series so far has been a valid Ruby program. Parsing valid Ruby is actually not that difficult — it has been done correctly by many different tools over the years. Parsing invalid Ruby, however, is another challenge altogether.

Most of the time that code is being written, it is invalid. We are not talking about production code or code that has already been saved to disk (hopefully). We’re mostly talking about code that is in the middle of being edited. As you type, you introduce syntax errors until you get to the end of the current expression. Editors and linters want to be able to parse as you type, however. This means that they need to be able to parse invalid code.

Error tolerance is a field of study that involves parsing invalid code. It refers to the ability of the parser to “tolerate” syntax errors in the input and continue to parse the file to return a syntax tree. This is a difficult problem to solve, and ends up being a bit more art than science. However, there are some guardrails in place that we can talk about.

Let’s take, for example, the following code:

1 +

We know that this is invalid Ruby code, because the + operator is in the infix position and requires there to be an expression on the right-hand side. However, intuitively we know that this is a method call with a missing argument. We can translate that into our parser to allow it to “handle” this syntax error by determining if the token after the + operator could potentially begin an expression.

In this case it’s the newline token, so the subsequent token cannot begin an expression. When we encounter a situation like this, we can insert a MissingNode into the syntax tree. This node is a placeholder that represents the missing expression. It is a child of the + method call, and has no children or fields of its own. After inserting the missing node we log an error and then continue parsing as if nothing happened.

Here is what the AST looks like for 1 +:
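One way to inspect that tree yourself is the following minimal sketch, which assumes the prism gem from Ruby 3.3 or later and digs the missing node out of the arguments of the + call:

```ruby
require "prism"

result = Prism.parse("1 +")

# The only statement is the call to +, with 1 as its receiver.
call = result.value.statements.body.first

# The right-hand side was never written, so a placeholder stands in for it.
argument = call.arguments.arguments.first

puts call.name            # +
puts argument.class       # Prism::MissingNode
puts result.errors.length # at least one error is logged
```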

We have woven this kind of error tolerance into prism from the beginning. This has made it suitable for use in editors and linters, which is why it is the parse tree backing the ruby-lsp project. By providing a syntax tree regardless of errors, it means tools like RuboCop and Sorbet can still lint and type check the input file even if it is invalid. This means sections of the file can be cached so that they do not have to be re-parsed and re-processed when the file is edited. This would not be possible if the parser simply failed on the first instance of invalid input.

Ambiguous tokens

Another form of syntax error is ambiguous tokens. Consider, for example, the following code:

class Foo
  def bar
    self.
  end
end

As a developer, most people would read this as a missing method name being sent to self inside the bar method. However, it is perfectly valid Ruby to have self.end be separated by newlines and whitespace. This means there is an ambiguity here as to whether the end is a method name or the keyword that closes the def block.

If the end is parsed as a method name, then the class statement will not be closed. In this case a syntax error will be raised. CRuby recently developed a solution for this: insert a missing end token and see if it “fixes” the problem. This turns out to be a common enough pattern that this solves a lot of the ambiguity problems in the parser.

Prism has not yet implemented this kind of recovery, but it is first on our list of tasks for next year. If and when CRuby adopts prism as its primary parser, we could not in good conscience do so without parity or improvements in error tolerance.
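The ambiguity can be seen directly through the Ruby API. The sketch below (assuming the prism gem from Ruby 3.3 or later) first parses the valid form, where end really is a method name, and then the class from the example above, which is left unclosed:

```ruby
require "prism"

# A trailing dot followed by `end` on the next line is a valid method call.
valid = Prism.parse("self.\nend")
call = valid.value.statements.body.first

# Here the first `end` is consumed as a method name, so the class never closes.
ambiguous = Prism.parse(<<~RUBY)
  class Foo
    def bar
      self.
    end
  end
RUBY

puts valid.success?     # true
puts call.name          # end
puts ambiguous.failure? # true
```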

Wrapping up

There you have it, folks! After 24 days of posts, we have covered every piece of known Ruby syntax up to Ruby 3.3.0. Tomorrow this version of Ruby will be released, and I’m assuming shortly thereafter we will have more fun syntax coming down the pipe.

I wrote this series for a couple of reasons. I wanted to introduce you all to prism, so that you can use it if you want to build something on top of the Ruby syntax tree. I also wanted to introduce you all to all of the varieties of Ruby syntax that I have gotten to know through building prism. Finally, I wanted a snapshot in time of what Ruby looks like, so that I have something to point people to if they have questions in the future.

I have learned a lot about Ruby, AST/IR design, and parsing in this journey. I hope you have learned something too. Here are the main things I hope you take away from this series:

Ruby’s grammar is incredibly complex because it tries to allow you to express code in whatever natural way you feel is best. It has grown and will continue to grow organically over the years to fit the needs of the community. Although it is difficult to parse, it is a joy to read and write, which is far more important.
Usually the relative complexity of syntax and semantics are correlated, but not always! As an example, the binary one-character + operator consistently represents a single method call, but the binary two-character += operator represents a method call and an assignment.
Syntax that looks very similar can have very different meanings, depending on context. As a corollary, syntax that looks very different can have the same meaning, depending on context. Consider the if modifier, which can either be an if statement or a guard clause in a pattern match. Also consider the ternary ? and : markers, which can represent the same thing as an if.
Through hard work, dedication, and cooperation, we can create incredible tools and developer experiences for Rubyists everywhere.

Thank you so much for reading. I hope you have a wonderful holiday season!

Advent of Prism: Part 23 - Pattern matching (part 2)

2023-12-23T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about pattern matching.

Yesterday, we looked at the basics of pattern matching. Today we’re going to close out that discussion by talking about the more advanced features: destructuring and capturing. Let’s get into it.

`HashPatternNode`

It’s common to want to match against certain attributes of an object, even if they are method calls. For example, let’s say we have some kind of person class:

class Person
  attr_reader :name, :age

  def initialize(name, age)
    @name = name
    @age = age
  end
end

If we wanted to match against a specific name and age, we could do something like:

person = Person.new("Kevin", 33)

if (person.name in "Kevin") && (person.age in 33)
  puts "It's Kevin!"
end

This gets a bit verbose if you want to match against more than just 2 values. Fortunately, Ruby has a shorthand for this: the hash pattern. It looks like this:

case person
in { name: "Kevin", age: 33 }
  puts "It's Kevin!"
end

This indicates that we want to match against a hash with the keys name and age, and the values "Kevin" and 33 respectively. In order to get this working, we will need to implement a deconstruct_keys method on Person. That looks like:

class Person
  def deconstruct_keys(matching_keys)
    ((matching_keys || %i[name age]) & %i[name age]).to_h do |matching_key|
      [matching_key, public_send(matching_key)]
    end
  end
end

With this method in place, Ruby knows how to normalize a Person object into a hash. In doing so, it can then perform its matching as expected. This post is meant to discuss the parser aspects of pattern matching, but first let’s take a brief look into what deconstruct_keys is doing:

#deconstruct_keys is called whenever Ruby tries to match an object against a hash pattern
It is given the keys that are present in the hash pattern or nil if all keys should be matched
In our implementation, we ensure a default value of all keys and then intersect them with the known keys
Given we know the keys, we can then call public_send to get the values
This returns a hash of { name: name, age: age } in the case that all keys are matched against
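To see which keys Ruby actually asks for, here is a small sketch using a hypothetical TrackedPerson class that records every argument passed to #deconstruct_keys. It is plain Ruby and does not need prism:

```ruby
# Hypothetical variant of Person that records the keys Ruby requests.
class TrackedPerson
  attr_reader :name, :age, :requested

  def initialize(name, age)
    @name = name
    @age = age
    @requested = []
  end

  def deconstruct_keys(matching_keys)
    @requested << matching_keys
    { name: name, age: age }
  end
end

person = TrackedPerson.new("Kevin", 33)

# Only :name appears in the pattern, so only :name is requested.
by_name = (person in { name: "Kevin" })

# A **rest capture needs every key, so nil is passed instead of a list.
with_rest = (person in { age: Integer, **rest })

puts person.requested.inspect # [[:name], nil]
puts rest.inspect             # {:name=>"Kevin"} or {name: "Kevin"}, depending on Ruby version
```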

In terms of the actual syntax, every time you see a hash pattern you can know that #deconstruct_keys is going to be called on the match object before any matching occurs. This is significantly different from other patterns we have seen which do not usually trigger method calls on the object itself.

For the hash pattern itself, there are a couple of variations. Here are some examples:

case person
in Person[name: "Kevin"]               # (1)
in Person(age: 33)                     # (2)
in { name: /Kevin/ }                   # (3)
in age: Integer                        # (4)
in Person[**attributes]                # (5)
in Person[**nil]                       # (6)
in Person[name: Person[name: "Kevin"]] # (7)
end

We’ll talk about each of these in turn:

(1) You can optionally attach a constant path to a hash pattern which will first check the constant to see if it matches the class of the object using the #=== method.
(2) You can use [] or () to surround the attributes of the hash pattern after a constant.
(3) Keys in hash patterns must always be symbol labels but values can be any object that could be used in a pattern match.
(4) The braces can be omitted on hash patterns in most cases.
(5) You can use the double splat operator to capture all remaining keys in a hash pattern. This will assign them to a local variable if a name is present.
(6) You can use the double splat operator with nil to match against empty hashes.
(7) You can nest hash patterns inside of other patterns as the values of keys.
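A few of these variations can be tried out on a plain hash, which already knows how to deconstruct itself. This is only a sketch with made-up data, and it needs nothing beyond Ruby 3.1 or later:

```ruby
config = { name: "Kevin", age: 33 }

# Values are patterns, so a regular expression is matched with #===.
regexp_match = (config in { name: /Kev/ })

# **nil insists that no other keys remain, and :age is left over here.
exact_match = (config in { name: String, **nil })

# **rest captures whatever keys the pattern did not mention.
case config
in { name: String, **rest }
  leftover = rest
end

# Hash patterns nest as the values of other hash patterns.
nested = ({ owner: config } in { owner: { age: Integer } })

puts regexp_match # true
puts exact_match  # false
puts leftover[:age] # 33
puts nested       # true
```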

Let’s simplify the example first:

person in Person[name: "Kevin"]

So that we can look at the AST:

You can see we have pointers to the optional constant as well as the list of elements within the hash pattern to match against.

`ArrayPatternNode`

Normalizing to a hash is common, but sometimes objects more closely resemble arrays. For example, let’s say we have a Point class:

class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end
end

We can match against this class using an array pattern:

case point
in Point[5, 6]
  puts "found!"
end

This will call #deconstruct on the Point object, which must return an array. This is then matched against the array pattern. This method looks like:

class Point
  def deconstruct
    [x, y]
  end
end

Note that unlike #deconstruct_keys there is no argument to #deconstruct, so there is no way to limit the size of the resulting array in the case that only a couple of values are matched.

Most of the varieties of hash patterns also apply to array patterns as well. Here are some examples:

case point
in Point[5, *]        # (1)
in Point(5, *)        # (2)
in [5, 6]             # (3)
in 5, 6               # (4)
in [Integer, Integer] # (5)
in [5, [6, 7]]        # (6)
end

We’ll talk about each of these in turn:

(1) You can use the splat operator to capture all remaining elements in an array pattern. This will assign them to a local variable if a name is present.
(2) You can use [] or () to surround the elements of the array pattern.
(3) You do not have to match against a constant, you can match instead directly against an array.
(4) You can omit the surrounding [] on array patterns in most cases.
(5) You can use any pattern as an element of an array pattern. The value will always be matched with the #=== method.
(6) You can nest array patterns inside of other patterns as the elements of the array.
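Putting a few of these together, the following sketch repeats the Point class so that it is self-contained and then exercises several of the variations. It is plain Ruby (3.1 or later) and the variable names are illustrative:

```ruby
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  # Array patterns call this to normalize the object into an array.
  def deconstruct
    [x, y]
  end
end

point = Point.new(5, 6)

with_brackets = (point in Point[5, *])
with_parens = (point in Point(5, *))
by_class = (point in [Integer, Integer])
wrong_value = (point in Point[7, *])

# Nested array patterns with a named splat.
case [5, [6, 7, 8]]
in [first, [second, *rest]]
  nested = [first, second, rest]
end

puts with_brackets  # true
puts with_parens    # true
puts by_class       # true
puts wrong_value    # false
puts nested.inspect # [5, 6, [7, 8]]
```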

Simplifying our match a bit:

point in Point[5, 6]

Let’s take a look at the AST:

You can see that this is split up in much the same way as a multi target node where we have a list of requireds, posts, and an optional slot for rest. Note that it is only possible to use a single splat operator in an array pattern.

`FindPatternNode`

There is another way of matching against arrays that allows you to search for specific elements. This is called the find pattern. It looks like this:

integers in [*, 5, *]

This will return true if the array contains the value 5 at any position. We represent this kind of pattern with a FindPatternNode. Let’s take a look at the AST:

Note that all of the syntactic variations of the array pattern also apply here to the find pattern. The splats on the left and right of the pattern are required, and may optionally have names. The list of values in the middle can have as many sub patterns as you want.
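
Here is a short sketch of the find pattern in use, assuming Ruby 3.1 or newer (where the find pattern is no longer experimental). The method names are hypothetical. The second method names both splats so you can see what ends up on either side of the match.

```ruby
# True if 5 appears anywhere in the array.
def contains_five?(integers)
  case integers
  in [*, 5, *] then true
  else false
  end
end

# Named splats capture the elements before and after the first 5.
def around_five(integers)
  case integers
  in [*before, 5, *after] then [before, after]
  else nil
  end
end
```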

Local variable targeting

As we mentioned yesterday, reading local variables in patterns involves the use of the ^ operator. Writing local variables, on the other hand, involves only the name of the local variable. For example:

foo in bar

In this pattern match we are assigning the value of foo to the local variable bar. Here’s the AST for this example:

This gets much more powerful when combined with all of the other patterns we have learned about so far. For example, if you combine pinning, local variable targeting, and a find pattern, you can do:

integers in [*, value, ^(value + 1), *]

This will check within the array for a value that is followed by a value that is 1 greater than it. If it finds one, it will assign the value to the local variable value and return true. Here’s the AST for this example:

As you can see, pattern matching can get quite complex quite quickly.
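
As a runnable sketch of that last example, assuming Ruby 3.1 or newer (pinned expressions were added in 3.1), we can wrap the pattern in a hypothetical first_consecutive method:

```ruby
# Returns the first value that is immediately followed by value + 1,
# or nil if there is no such pair.
def first_consecutive(integers)
  case integers
  in [*, value, ^(value + 1), *] then value
  else nil
  end
end
```

Calling first_consecutive([1, 3, 4, 7]) skips 1 (because it is followed by 3) and settles on 3.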

`CapturePatternNode`

Writing to a local variable is very nice, especially when you want to use that value later. However, using this syntax does not allow you to pattern match the value you are about to write. That is where the => operator comes into play. Note that this is a different operator from the hash key/value pair delimiter and a different operator from the operator that triggers pattern matching in the first place.

Let’s take a look at an example:

person in Person[age: Integer => age]

In this example, we are matching against a Person object with an age key that is an Integer. If we find a match, we will assign the value of the age key to the local variable age. Here’s the AST for this example:

Note that only local variables can be written this way. Local variables at different depths can be written, though, so something like this is possible:

age = 30
self.then { person in Person[age: Integer => age] }

This is somewhat contrived, but it demonstrates that you can assign to an already existing local variable.
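
The following sketch shows both behaviors. It assumes Person is a Struct (which provides #deconstruct_keys for us); the source never defines Person, so treat that as an illustration choice, along with the method names.

```ruby
# Assumption: Person is a Struct, so #deconstruct_keys comes for free.
Person = Struct.new(:name, :age)

# Matches the age against Integer and captures it in one step.
def age_of(person)
  case person
  in Person[age: Integer => age] then age
  else nil
  end
end

# The capture writes to the `age` local declared outside of the block.
def age_captured_in_block(person)
  age = nil
  [person].each { |candidate| candidate in Person[age: Integer => age] }
  age
end
```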

Wrapping up

Today we looked at the more powerful features of pattern matching: destructuring and capturing. Here are the main takeaways:

Ruby allows you to define your own normalization functions named #deconstruct and #deconstruct_keys to form arrays and hashes, respectively, from your objects to match against.
The argument to #deconstruct_keys can be nil. In this case, all keys will be matched against.
You can write to local variables by simply listing the name of the local variable.
You can match against and capture the value of a field in a match by using the => operator.
The => operator is very overloaded.

Believe it or not, we only have a single node left in our tree. We’ll talk about it tomorrow. See you then!

Advent of Prism: Part 22 - Pattern matching (part 1)

2023-12-22T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about pattern matching.

Pattern matching was introduced in Ruby 2.7 as a way to match against a value and extract parts of it. It’s a very powerful feature that effectively allows you to replace syntactically complicated if/case statements with a more terse syntax. (To be clear: the syntax is less complicated in pattern matching but the semantics — if anything — are more complicated.)

The pattern matching grammar is a whole grammar unto itself. You can think of it as a mini-parser within the overall Ruby parser. Operators like |, ^, and => have different meanings, brackets and braces create different kinds of structures, and reads/writes are flipped from what you might expect. It’s a lot to take in, which is why pattern matching is split over two posts.

In this first part we’ll look at the nodes that trigger pattern matching, as well as introduce the basics of matching against individual values. We’ll also look at alternation and pinning. Tomorrow we’ll cover the more advanced concepts: destructuring and capturing. For now, let’s jump in.

Matching

There are three ways to trigger pattern matching: using a case ... in statement, using the binary in operator, or using the binary => operator. They each do different things, so we’ll look at each one in turn.

`CaseMatchNode`

When a case keyword is used, the parser first checks to see if there is a value associated with it. (Remember from Part 7 - Control-flow that case can optionally replace if/elsif chains by omitting the value.) If there is a value, then the parser parses it and then checks the subsequent keyword. If the keyword is when then a CaseNode is created and parsed. If the keyword is in then a CaseMatchNode is created and parsed. Here’s an example:

case foo
in Integer
  puts "foo is an integer"
end

The above code will call the foo method and then check if the return value is an Integer using Integer::=== (just like case ... when statements). If it is, then the puts statement will be executed. If it isn’t, the subsequent clause will be checked. In this case, because there are no more clauses, it will raise a NoMatchingPatternError. The AST for the above code looks like this:

You can see that the structure is very similar to a CaseNode. Initially we had it as the same node, but decided to split considering it has such significantly different semantics.

The CaseMatchNode contains a pointer to the value to match against as well as a flat list of clauses to check. Each clause is or contains an InNode node. It also contains an optional else clause, which is an ElseNode node. That looks like:

case foo
in Integer
  puts "foo is an integer"
else
  puts "foo is something else"
end

That AST looks like:

The else clause allows you to specify a default behavior, meaning a NoMatchingPatternError will not be raised. Note that this can initially be surprising for developers who are familiar with case ... when statements because this error raising behavior is specific to pattern matching.
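
A quick sketch of the difference, with hypothetical method names. The first two methods use case ... in with and without an else clause, and the third uses case ... when for comparison:

```ruby
# No else clause: a value that matches nothing raises NoMatchingPatternError.
def describe_strict(value)
  case value
  in Integer then "integer"
  end
end

# The else clause provides a default, so nothing is raised.
def describe_lenient(value)
  case value
  in Integer then "integer"
  else "something else"
  end
end

# case ... when without an else clause quietly evaluates to nil.
def describe_when(value)
  case value
  when Integer then "integer"
  end
end
```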

`InNode`

Every clause in a CaseMatchNode is or contains an InNode. It contains a pointer to the singular pattern to match against and the statements to execute if the pattern matches. For example:

case foo
in Integer
  puts "foo is an integer"
end

Importantly, in differs from when in that the pattern is singular and not a comma-separated list. This is further evidence that the pattern matching grammar differs somewhat significantly from the Ruby grammar. The AST for this example is a part of the CaseMatchNode AST above.

Guards

It is possible to add guard clauses to in clauses. These are conditions that will also be checked in addition to the pattern, after the pattern has run. They can begin with either an if or unless keyword. For example:

case foo
in Integer if foo > 10
  puts "foo is an integer greater than 10"
in Integer if foo > 5
  puts "foo is an integer greater than 5"
else
  puts "foo is something else"
end

These guards can be extremely powerful because you can reference values that you matched against. Fortunately for us, we already have a node that represents this kind of behavior: IfNode. In this case we reuse it. Here is the AST for this example (with the bodies of the InNode clauses stripped out):
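
Here is a sketch of guards that reference a value captured by the pattern itself, using both if and unless. The size_label method name is hypothetical, and the capture syntax (Integer => n) is covered in the next post.

```ruby
def size_label(value)
  case value
  in Integer => n if n > 10
    "integer greater than 10"
  in Integer => n unless n <= 5
    "integer greater than 5"
  else
    "something else"
  end
end
```

The guard only runs after its pattern has matched, so size_label("x") never evaluates n > 10.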

`MatchPredicateNode`

The in keyword can also be used as a binary operator. We call this a “match predicate” because it always returns true or false. Here is an example:

foo in Integer

This will call the Integer::=== method with the return value of the foo method call and return true or false depending on whether the value matches. Importantly, no error will be raised regardless of the outcome. The AST for this example looks like:

This is another case of a relatively simple AST that represents a relatively complicated semantic. Under the hood the entire pattern on the right-hand side is compiled into a set of requirements that are then checked against the value on the left-hand side.
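
If you have the prism gem available (it ships with Ruby 3.3), you can confirm which node each form produces. This is a minimal sketch; pattern_match_node is a hypothetical helper that returns the first top-level node.

```ruby
require "prism"

# Hypothetical helper: returns the first statement of the parsed program.
def pattern_match_node(source)
  Prism.parse(source).value.statements.body.first
end

predicate = pattern_match_node("foo in Integer")
predicate.class         # Prism::MatchPredicateNode
predicate.value.class   # Prism::CallNode (the call to foo)
predicate.pattern.class # Prism::ConstantReadNode (Integer)
```

The same helper returns a MatchRequiredNode for foo => Integer and a CaseMatchNode for a case ... in statement.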

`MatchRequiredNode`

The => operator is reused from hashes and rescues as a binary operator that we call “match required”. Here is an example:

foo => Integer

This is similar to the in operator, but it will raise a NoMatchingPatternError if the value does not match. The AST for this example looks like:

Again, this is a relatively simple AST that hides some real complexity. Lots of developers are initially confused by the difference between in and => because of the inconsistency with the rest of the language. As we’ve seen, usually operator/keyword pairs do the same thing and just have different precedence like and/&&, or/||, not/!. In this case, however, it’s important to remember that this keyword and operator have very different semantics.
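
To see the two side by side, here is a small sketch assuming Ruby 3.0 or newer. The method names are hypothetical.

```ruby
# `in` is a predicate: it evaluates to true or false and never raises.
def integer_match?(value)
  value in Integer
end

# `=>` is an assertion: it raises NoMatchingPatternError on a mismatch.
def require_integer(value)
  value => Integer
  value
end
```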

Patterns

Now that we’ve looked at the nodes that hold patterns, let’s look at some of the patterns themselves. In general you can match against most literal objects (numbers, strings, ranges, regular expressions, etc.). In every case the #=== method will be called on the pattern with the value to match against (under the hood in CRuby the checkmatch instruction does exactly this). For example:

foo in 1
foo in 1.0
foo in 1..10
foo in "foo"
foo in :foo
foo in :"foo"
foo in /foo/
foo in Foo

Matching against a single value is useful, but sometimes you want to match against multiple values. We’ll look at that next.
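
Because everything goes through #===, any object that responds to #=== can act as a pattern when referenced through a constant. The sketch below checks a value against several literal patterns plus one custom matcher. EvenNumber and matching_patterns are hypothetical names introduced for illustration.

```ruby
# Hypothetical matcher: a plain object with its own #=== implementation.
EvenNumber = Object.new
def EvenNumber.===(value)
  value.is_a?(Integer) && value.even?
end

# Collects the names of every pattern that the value matches.
def matching_patterns(value)
  matches = []
  matches << :integer_literal if (value in 1)
  matches << :range if (value in 1..10)
  matches << :string if (value in "foo")
  matches << :symbol if (value in :foo)
  matches << :regexp if (value in /foo/)
  matches << :constant if (value in EvenNumber)
  matches
end
```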

`AlternationPatternNode`

When you want to match against multiple values, you can use the | operator. This operator is different from the normal Ruby | method call. Instead, it indicates that the pattern on the left-hand side or the pattern on the right-hand side should check for a match. You can think of it as semantically similar to the commas in a case ... when statement. For example:

foo in 1 | 2

This will match if foo is either 1 or 2. The AST for this example looks like:

Note that | can be chained, in which case the parser will form a linked list of AlternationPatternNode nodes. For example:

foo in 1 | 1.0 | 1r | 1i

The AST for this example looks like:
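
As a runnable version of the chained example (one_like? is a hypothetical name):

```ruby
# Each alternative is tried from left to right using #===.
def one_like?(value)
  case value
  in 1 | 1.0 | 1r | 1i then true
  else false
  end
end
```

Note that the imaginary literal 1i only matches itself, while the first three alternatives all compare equal to the number one.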

Pinning

Matching against static values is nice, but it’s not nearly as powerful as matching against dynamic values. For example, let’s say you have some local variable that you want to match against the return value of a method. Let’s see how we can do that.

`PinnedVariableNode`

When you want to match against a variable value, you can use the ^ operator. This is called the “pin” operator, which “pins” the value within the pattern. For example:

bar = 5
foo in ^bar

This will call #=== on the value of the bar local variable to check if the return value of the foo method call matches. The AST for this example looks like:

Note that you can pin any kind of variable, so this could also be instance, class, or global variables. For example:

foo in ^@bar
foo in ^@@bar
foo in ^$bar

In all cases, the PinnedVariableNode will be used, which has a single pointer to the variable being pinned. Note that this syntax is how you read variables in pattern matching: by prefixing them with the ^ operator. We’ll see in our post tomorrow how writing variables looks an awful lot like reading variables everywhere else in Ruby.
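
The difference between reading and writing is easy to get wrong, so here is a sketch of both with hypothetical method names, assuming Ruby 3.0 or newer:

```ruby
# With the pin, the local is read and its value is used as the pattern.
def pinned_match?(value, expected)
  value in ^expected
end

# Without the pin, the local is written to instead.
def unpinned_write(value)
  expected = :untouched
  value in expected
  expected
end
```

Since the pinned value is compared with #===, pinning a range such as 1..10 checks for inclusion rather than equality.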

`PinnedExpressionNode`

Beyond pinning variables, you can also pin expressions. This looks like:

foo in ^(bar)

This will call the bar method and use its value within the pattern (i.e., it will call #=== on the return value). The AST for this example looks like:

Note that the parentheses are the only difference between PinnedVariableNode and PinnedExpressionNode in terms of syntax, though they have very different semantics. Note also that unlike everywhere else in Ruby, multiple statements are not allowed within the parentheses. So even though space is allowed between ^ and (, I encourage you to think of them as a single delimiter.
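
For example, a pinned expression lets you build the pattern at match time. This sketch assumes Ruby 3.1 or newer, and within_decade? is a hypothetical name:

```ruby
# The range is computed first, then matched with Range#===.
def within_decade?(year, start)
  year in ^(start..(start + 10))
end
```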

Wrapping up

Today we looked at the basics of pattern matching syntax. This includes all of the nodes that trigger pattern matching, as well as some of the more basic patterns. Here are some things to remember from today:

Pattern matching is triggered by case ... in statements, the binary in operator, and the binary => operator.
The binary in and => operators have very different semantics.
The | operator is used to match against multiple values.
Reading variables in patterns is done by prefixing them with the ^ operator.
Reading singular expressions in patterns is done by wrapping them in ^( and ).

Tomorrow we’ll close out our discussion of pattern matching by looking at destructuring and capturing. See you then!

Advent of Prism: Part 21 - Throws and jumps

2023-12-21T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about throws and jumps.

The terms “throw” and “jump” have more to do with the actual execution of Ruby than the parse tree, but they neatly categorize the nodes that we’re going to look at today.

Throws

“Throw” refers to throwing an exception. CRuby implements many of these using setjmp/longjmp, which are context-saving functions that allow you to break the execution flow of your C program much like you would with exceptions in Ruby. Ruby provides a couple of syntactic structures for handling these kinds of non-local control flow.

`BeginNode`

The parent node of any kind of exception handling is the BeginNode node. This node houses an optional set of statements as well as any number of rescue clauses, an optional ensure clause, and an optional else clause. Here is an example:

begin
  1
rescue
  2
end

This is represented by the following AST:

You can see the node has a statements field that is the optional StatementsNode holding the statements that should be executed. It also has a pointer to a rescue node that is the first rescue clause. If there are more rescue clauses, they are linked together in a linked list. The ensure and else clauses are not present in this example so you don’t see their fields.

Remember from our previous posts that this node is also used to represent rescue/else/ensure clauses being used in other contexts: class and module definitions, singleton class definitions, method definitions, and blocks and lambdas that use do/end.
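
As a sketch of how the clauses interact at runtime, here is a method definition that uses all three clauses without an explicit begin keyword. The method name and the trace array are hypothetical.

```ruby
# Records which clauses run, in order.
def run_with_handlers(trace)
  trace << :body
  yield
rescue ArgumentError
  trace << :rescue
else
  trace << :else
ensure
  trace << :ensure
end
```

When the block raises an ArgumentError the trace is [:body, :rescue, :ensure]. When it does not raise, the trace is [:body, :else, :ensure].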

`RescueNode`

When the rescue keyword is used as another clause in a begin statement, we represent it with the RescueNode node. This node has a list of exceptions to rescue, an optional variable to assign the exception to, an optional set of statements, and an optional consequent rescue clause. Here is an example that showcases all of that:

begin
  foo
rescue Exception1 => error
  warn error.message
rescue Exception2, Exception3 => @error
rescue *exception_list
rescue
  warn "unknown error"
end

The actual flow of this program works like this:

foo is called.
If foo raises an error, Ruby walks through the rescue clauses in order.
In the first rescue clause, the Exception1 constant is looked up. If it does not contain a class or module, a TypeError is raised. If it does, then it checks if it is in the ancestor chain of the exception that was raised. If it is, then the exception is assigned to the error local variable and the statements in the clause are executed. If it is not, then the error is reraised to trigger checking the subsequent clause.
In the second rescue clause, both the Exception2 and Exception3 constants are checked in the same manner. If either of them is in the ancestor chain of the exception that was raised, then the exception is assigned to the @error instance variable. Because there are no statements in this clause, nothing else happens. If neither of them is in the ancestor chain, then the error is reraised to trigger checking the subsequent clause.
In the third rescue clause, exception_list has #to_a called on it and then Ruby iterates over each element in the resulting array to check for classes or modules in the same way as the other exceptions. If any of them are in the ancestor chain, the code jumps out of the begin node. Otherwise the error is reraised to trigger checking the subsequent clause.
In the last rescue clause the error is implicitly checked against StandardError. If it is in the ancestor chain, then the body of the clause is executed. Otherwise the error is reraised.
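
The flow above can be sketched with a runnable variation of the example. Here the exception classes are defined up front, each clause returns a label so we can see which one handled the error, and classify is a hypothetical method name.

```ruby
Exception1 = Class.new(StandardError)
Exception2 = Class.new(StandardError)
Exception3 = Class.new(StandardError)

# Raises the given error and reports which rescue clause handled it.
def classify(error, exception_list)
  begin
    raise error
  rescue Exception1 => caught
    "first: #{caught.message}"
  rescue Exception2, Exception3
    "second"
  rescue *exception_list
    "third"
  rescue
    "fallback"
  end
end
```

An error outside of StandardError, such as NotImplementedError, is not handled by any of these clauses and propagates to the caller.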

A couple of important things to note here in terms of syntax:

The optional error handle is any target that we have seen so far, including call targets. This means you can have the error handle actually be a method call if you want.
The list of errors is a comma-separated list of (optionally splatted) expressions, not just constants. This is very powerful, but also a source of confusion. Remember that constant lookup itself can trigger method calls (through const_missing) so this can get quite dynamic.
If you omit any classes or modules to check against, Ruby implicitly checks against StandardError.
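
To illustrate the first point, here is a sketch where the error handle is a call target, so the exception is assigned by calling a setter method. ErrorLog and record_failure are hypothetical names.

```ruby
class ErrorLog
  attr_accessor :last_error
end

# `rescue => log.last_error` calls log.last_error=(exception).
def record_failure(log)
  raise ArgumentError, "bad input"
rescue => log.last_error
  log
end
```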

Let’s look at a slightly simpler example to see how this is represented in the AST:

begin
rescue Error1 => error
rescue Error2
  warn("error")
end

This is represented by the following AST:

Notice that the RescueNode nodes form a linked list, much like the if statements that we covered back in Part 7 - Control-flow. As we discussed back then, the two options we have for representing these kinds of nodes are a linked list or a flat list. We went with a linked list in this case because it’s not that common that you have more than a couple of rescue clauses, and it’s simpler to implement this way.

`RescueModifierNode`

When the rescue keyword is used as a modifier to an expression, we represent it with the RescueModifierNode node. Here’s an example:

foo rescue "error!"

This is semantically equivalent to:

begin
  foo
rescue StandardError
  "error!"
end

The example is represented by the following AST:

This relatively simple node is deceptively complex to parse, but easy to understand and compile. The rescue keyword actually breaks operator precedence rules and is allowed to be used as the modifier to any assignment expression. This means that you can do things like:

foo = bar rescue baz

and instead of being parsed as (foo = bar) rescue baz, it is parsed as foo = (bar rescue baz). This special path through the parser makes things complex, but tends to better match programmers’ intuition.
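
A small sketch shows the effect of that precedence. The method names are hypothetical, and risky always raises:

```ruby
def risky
  raise "boom"
end

# Parsed as value = (risky rescue "fallback"), so the fallback is assigned.
def assigned_with_modifier
  value = risky rescue "fallback"
  value
end
```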

`EnsureNode`

The ensure keyword is an optional clause on the begin statement that is always executed, even if an exception is raised. We represent it with the EnsureNode. Here is an example:

begin
  foo
ensure
  bar
end

This is represented by the following AST:

Effectively this node is just a wrapper around a set of statements. It is far more complicated to implement than to parse.
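
One piece of that complexity is that the ensure clause still runs when the body returns early, and its own value is discarded. A sketch, using a hypothetical ensure_order method:

```ruby
# The ensure clause runs after the return value has been computed.
def ensure_order(log)
  log << :body
  return :from_body
ensure
  log << :ensure
end
```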

`ReturnNode`

The last throw is the return keyword. In normal execution, the return keyword can be implemented using a leave instruction. However, you can also return from within blocks. In this case the virtual machine must jump all of the way out to the method, which is why this is a throw. First, here is an example:

def foo
  [1, 2, 3].each do |i|
    return i if i == 2
  end
end

This is a little contrived, but it demonstrates the point. This code will call the #each method on the array literal, and when the iteration variable i is equal to 2, it will return i from the method. This whole example is represented by the following AST:

You can see the ReturnNode in the bottom right of the diagram there. It has an optional set of arguments, which are the values to return from the method. If there are multiple values, they are grouped together into an array.
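
Both behaviors are easy to check with a couple of hypothetical methods:

```ruby
# return inside the block exits first_even itself, not just the block.
def first_even(numbers)
  numbers.each { |number| return number if number.even? }
  nil
end

# Multiple return values are grouped into a single array.
def pair
  return 1, 2
end
```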

Jumps

“Jump” refers to jumping around the instructions in a program. You can think of them effectively as goto statements. Ruby provides many keywords for jumping around, and they all have their own nodes in the parse tree. Let’s look at them one by one.

`BreakNode`

The break keyword jumps out of the current block. It can optionally accept a value to return from the block. Here is an example:

while true
  break 1
end

This is represented by the following AST:

The code above says to immediately break out of the loop and return 1. Any number of arguments can be passed to break — they end up being grouped together into an array if there are multiple. A common misconception is that break accepts parentheses; in reality if you use parentheses you’re actually just grouping together the first argument.
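
Here is a sketch of both points, with hypothetical method names. The first method passes two values to break, and the second uses parentheses that only group the argument:

```ruby
# The loop evaluates to the array [1, 2].
def break_values
  while true
    break 1, 2
  end
end

# The parentheses group the expression; they are not part of break itself.
def break_grouped
  [1, 2, 3].each { |number| break (number * 10) if number == 2 }
end
```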

`NextNode`

The next keyword jumps to the end of the current block, but not out of it. Like break, it can optionally accept any number of values to return from the block. Here is an example:

while true
  next 1
end

This is represented by the following AST:

The code above says to immediately jump to the end of the loop and return 1. This will actually loop indefinitely because the next keyword just keeps getting executed. Like break, next accepts any number of arguments, which are grouped together into an array if there are multiple.
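
Inside a block passed to a method like #map, the value given to next becomes the result of that iteration. A sketch with hypothetical method names:

```ruby
# Odd numbers are replaced with 0, even numbers are kept.
def next_values(numbers)
  numbers.map do |number|
    next 0 if number.odd?
    number
  end
end

# Multiple values are grouped into an array for each iteration.
def next_pairs(numbers)
  numbers.map { |number| next number, number * 2 }
end
```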

`RedoNode`

The redo keyword is effectively the opposite of the next keyword: it jumps back to the start of the current block. It does not accept any arguments. Here is an example:

while true
  redo
end

This will, of course, loop indefinitely. Parsing this is very simple; you only parse the keyword. The node itself is therefore relatively simple as well. Here is the AST for the above snippet:

`RetryNode`

The retry keyword is used to jump out of a rescue clause and back to the begin block. It does not accept any arguments. Here is an example:

begin
  foo
rescue
  retry
end

This retry will get triggered if foo raises an exception. It will then jump back to the begin block and try again. This is represented by the following AST:
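
In practice you usually want a limit so that retry cannot loop forever. This is a minimal sketch of that idea; with_retries and its limit parameter are hypothetical.

```ruby
# Re-runs the block until it succeeds or the attempt limit is reached.
def with_retries(limit)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    retry if attempts < limit
    raise
  end
end
```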

`YieldNode`

Using the yield keyword, you can trigger the execution of a block that was passed to the current method. It can optionally accept any number of arguments to pass to the block. Here is an example:

def foo
  yield 1
end

This is represented by the following AST:

Parsing the yield construct is much the same as the other keywords we’ve looked at so far. It also accepts a list of arguments that are comma-delimited.

Wrapping up

Throws and jumps allow you to issue non-local control flow within your program. They are very powerful constructs, and understanding their semantics will help you get a better picture of what Ruby is doing under the hood. Here are a couple of things to remember from today:

There are many ways to represent non-local control flow in Ruby
There is a lot of syntax that allows you to jump around statements in your program
break, next, yield, and return all accept arguments but none of them use parentheses

We’re almost at the end here! Tomorrow we’ll be looking at the first of two posts on pattern matching. See you then!

Advent of Prism: Part 20 - Alias and undef

2023-12-20T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about the alias and undef keywords.

These two keywords are not often used, largely because there are methods that can be called to do the same thing. However, they are still a part of the Ruby language.

`AliasMethodNode`

The alias keyword allows you to create an alias for a method. For example:

alias new_name old_name

This creates a new method called new_name that is an alias for the old_name method from the current context. This is represented by the following AST:

We represent the names of the methods with symbols even if they are bare words because they can also be symbols. A semantically equivalent example to the above using symbols would be:

alias :new_name :old_name

Any method name at all can be used, including those that are not valid Ruby identifiers. For example, the following is valid:

alias push <<

You can also use dynamic method names with interpolated symbols, as in:

new_prefix = "new"
old_prefix = "old"
alias :"#{new_prefix}_name" :"#{old_prefix}_name"

This is semantically equivalent to the first example. This is represented by:
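
Here is a sketch that combines these forms inside a class body. The Stack class is hypothetical and exists only to give the aliases something to point at.

```ruby
class Stack
  def initialize
    @items = []
  end

  def <<(item)
    @items << item
    self
  end

  def size
    @items.size
  end

  alias push <<              # bare words, including an operator name
  alias :add :push           # symbols
  alias :"item_count" :size  # quoted symbol
end
```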

`AliasGlobalVariableNode`

You can also alias global variables. For example:

alias $new_name $old_name

This is represented by:

This is particularly useful for providing longer names for global variables that are used often. As an example, see the English.rb library in Ruby’s standard library.

`UndefNode`

The undef keyword allows you to undefine a method. For example:

undef foo

This is represented by:

Much like the alias keyword, we use symbols to represent the method names even if they are bare words. undef accepts multiple method names, so the following is also valid:

undef :foo, :bar, :baz

This is represented by:

Finally, you can also use dynamic symbols, as in:

undef :"foo_#{bar}"
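
As a sketch of what undef does at runtime, here is a hypothetical subclass that undefines two inherited methods, mixing a bare word and a symbol:

```ruby
class Base
  def greet
    "hello"
  end

  def wave
    "wave"
  end
end

class Quiet < Base
  # Instances of Quiet no longer respond to either method,
  # even though Base still defines them.
  undef greet, :wave
end
```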

Wrapping up

The alias and undef keywords are not found very often but they are pieces of syntax that stretch back as far as Ruby 1.0. Here are a couple of things to remember from today:

alias can be used to create an alias for a method or a global variable
undef can be used to undefine one or more methods

In the next post, we’ll be looking at throws and jumps.

Advent of Prism: Part 19 - Blocks

2023-12-19T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about blocks and lambdas.

At long last, we have reached the point of talking about blocks and lambdas. These are major pieces of Ruby functionality that we have been deftly avoiding until now. Today, we’ll take a look.

`BlockNode`

Blocks in Ruby code are represented by braces or the do and end keywords. They can also optionally declare parameters. They then accept a set of statements that are saved and then executed later when the block is called (either through the yield keyword or by transforming it into a Proc and then calling #call). Here’s an example:

foo do
  1
end

This code is represented by the following AST:

As you can see from the diagram, blocks hold a pointer to their body as well as their local table. The body field can either be a StatementsNode (as we see in this example) or a BeginNode (like we saw with methods, classes, modules, and singleton classes). That would look like:

foo do
  1
rescue
end

which is represented by the following AST:

rescue and its corresponding else and ensure clauses can only be used when the keywords are being used as the bounds of the block, and not braces.

It’s also worth noting that semantically, there is no difference between the bounds of the block. Once they are parsed, they are exactly the same. However, in the parser they have different precedence. Braces are bound much more tightly than do and end. For example:

foo bar {} # send the block to `bar`
foo bar do end # send the block to `foo`

It’s not necessarily important for you to remember the specifics of how these are bound as much as it is to remember that they cannot be immediately substituted.
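
You can observe the difference by having each method report whether it received a block. The definitions of foo and bar below are hypothetical stand-ins for the calls in the example above.

```ruby
def foo(arg = nil, &block)
  [block ? :foo_got_block : :foo_no_block, arg]
end

def bar(&block)
  block ? :bar_got_block : :bar_no_block
end

# Braces bind to the nearest call, which is bar.
def braces_binding
  foo bar { }
end

# do/end binds to the outermost call, which is foo.
def do_end_binding
  foo bar do end
end
```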

`BlockParametersNode`

When blocks (or lambdas) declare parameters they are wrapped in a BlockParametersNode. These nodes are effectively a wrapper around a list of parameters. For example:

foo { |bar| }

This is represented by the following AST:

There are two differences from regular parameters nodes. The first is that they hold an inner location to their bounds (|| for blocks, () for lambdas). The second is that they hold a list of block locals. We’ll talk about these next.

`BlockLocalVariableNode`

In both blocks and lambdas, you can declare local variables that are only visible within the scope of the block or lambda. These declarations go right next to the declaration of the parameters themselves. For example:

foo { |; bar| }

The bar variable is then only visible within the block. This is semantically similar to:

foo do
  bar = nil
end

The main difference is that if bar is declared in an outer scope the block local will not overwrite it, while assigning nil to it will. These locals are represented by BlockLocalVariableNode nodes and go into the locals field on BlockParametersNode. The first example is represented by the following AST:

The actual syntax for these is that they are a semicolon-separated list of identifiers that follow a semicolon within the parameter list.
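
The shadowing behavior is easiest to see by comparing the two forms directly. In this sketch block_local_demo is a hypothetical method:

```ruby
def block_local_demo
  bar = :outer

  # bar is declared as a block local, so the outer bar is untouched.
  [1].each { |number; bar| bar = number }
  after_block_local = bar

  # Without the block local, the assignment overwrites the outer bar.
  [1].each { |number| bar = number }

  [after_block_local, bar]
end
```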

`LambdaNode`

Lambda literals are represented by the LambdaNode node. They look similar to blocks and function in much the same way — both function as closures around a set of parameters and a body. Here is an example:

-> (foo) { foo * 2 }

The syntax for a lambda literal begins with the -> token. It is then optionally followed by a parameter list. The parameter list can be optionally wrapped in parentheses. The parentheses are required if certain types of parameter types are used. This is followed by a body that is either wrapped in braces or the do and end keywords.

The example above is represented by the following AST:

Believe it or not, we’ve seen every node in this AST before except for the LambdaNode itself. On that node we have lots of internal locations, a pointer to a local table, a set of parameters, and a body. Much like blocks the body can be either a StatementsNode or a BeginNode.

Like blocks, lambdas can also declare block locals. These are represented by the same BlockLocalVariableNode nodes that we saw above. This looks like:

-> (; foo) {}

It’s important to note that these are lambda literals only and not calls to the Kernel#lambda method. Those are represented by CallNode nodes like all other method calls because they can be overridden depending on context.

`NumberedParametersNode`

The last piece of syntax we’re going to talk about today is numbered parameters. This is a special syntax that allows referencing positional parameters without explicitly declaring them. For example:

-> { _1 * 2 }

The syntax for numbered parameters is an underscore followed by a digit. The digit is the position of the parameter that you want to reference (1-indexed).

Numbered parameters are mutually exclusive with regular parameters. If you declare both in the same context, you’ll get a syntax error. You also cannot use them in nested contexts without a syntax error (e.g., -> { -> { _1 } }). Because of this mutual exclusivity we can be assured that the parameters field on BlockNode and LambdaNode will be nil when numbered parameters are used. We take advantage of that fact to provide some extra information for prism consumers. Here’s the AST for the above example:

As you can see, when numbered parameters are in use we use a NumberedParametersNode node to represent them. This node holds an integer that represents the number of parameters that are being referenced. Compilers can use this to set up the correct number of parameters for the block or lambda.
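
You can see a hint of this from Ruby itself: the arity of a lambda grows with the highest numbered parameter it references. A sketch with hypothetical method names:

```ruby
# References _1 only, so the lambda takes one argument.
def doubler
  -> { _1 * 2 }
end

# References _1 and _2, so the lambda takes two arguments.
def adder
  -> { _1 + _2 }
end
```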

As a brief aside, Matz recently accepted a proposal for it to be another reference to _1. It’s controversial to say the least.

Wrapping up

Blocks and lambdas play a foundational role in Ruby. They are used to execute a set of statements over a closure at a prescribed time. Knowing their syntax and semantics will allow you to take full advantage of them. Here are a couple of things to remember from today:

Blocks and lambdas can have local variables declared that are only visible within the block or lambda.
Numbered parameters are a special syntax that allows referencing positional parameters without explicitly declaring them.

That’s all for today. Tomorrow we’ll be looking at two interesting keywords: alias and undef.

Advent of Prism: Part 18 - Parameters

2023-12-18T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about parameters.

Parameters appear in three locations in the prism AST: method definitions, blocks, and lambdas. There is very little difference between the three, so they are all represented with ParametersNode. We’ll start there today.

`ParametersNode`

When parameters to a method, block, or lambda are declared, they are represented by a ParametersNode. Here’s an example:

def foo(bar)
end

This code is represented by the following AST:

You can see the ParametersNode in the middle of the diagram above. In this case it holds a bunch of empty lists except for the list of required parameters, which has a single node. We’ll go through each type of parameter that can be attached to this parent node in turn.
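If you want to poke at this node yourself, the following is a minimal sketch using the prism Ruby bindings. It assumes Ruby 3.3 or later (or the prism gem installed) and assumes the requireds, optionals, and rest readers on ParametersNode match the fields discussed in this post. The local variable names are made up for illustration.

```ruby
require "prism"

# Parse the same snippet as above and walk down to the method definition.
result = Prism.parse("def foo(bar)\nend")
def_node = result.value.statements.body.first

# The parameters field of the DefNode holds the ParametersNode.
params = def_node.parameters

# Only the list of required parameters is populated for this method.
required_names = params.requireds.map(&:name)
no_optionals = params.optionals.empty?
no_rest = params.rest.nil?
```

With this input, required_names should be [:bar] while the other lists stay empty.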

Positional

Certain parameters are “positional” in that they are bound to a specific position in the parameter list. These are the most common types of parameters, and were the only ones (besides blocks) until keyword parameters were introduced.

`RequiredParameterNode`

When positional parameters are declared before optionals/a rest, they are represented by a RequiredParameterNode. The first snippet in this post has an example of this, but to reiterate:

def foo(bar)
end

This node also represents parameters declared after optionals/a rest. Here’s an example:

def foo(*, bar)
end

This code is represented by the following AST:

In either of these two places, it’s also possible for the required parameter to be automatically destructured. (We saw this in Part 8 - Target writes). Here’s an example:

foo { |(bar,)| }

This makes use of the MultiTargetNode that we’ve already seen. The AST for this example looks like:

When Ruby executes this code, it first accepts the argument in its normal position on the stack. It then destructures it at the beginning of the execution of the block or method.

`ImplicitRestNode`

If you look at the AST in the above diagram, you’ll see a reference to an ImplicitRestNode. This is triggered when there is a trailing comma in a destructure list, as in the example above. It implies that the values should be spread and that the rest of the parameters should be ignored. That means the above is almost equivalent to:

foo { |(bar, *)| }

The difference comes in blocks and lambdas, where it changes the arity. For example:

def arity(&block) = block.arity

arity { |bar,| } # => 1
arity { |bar, *| } # => -2

Explaining why is beyond the scope of this blog post, but it’s worth noting that the difference exists.
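To see both the similarity and the difference in practice, here is a short sketch using a trailing comma at the top level of the block parameters (the same form as in the arity example). The data and variable names are only for illustration.

```ruby
pairs = [[1, 2], [3, 4]]

# A single plain parameter receives each element untouched.
whole = pairs.map { |bar| bar }

# A trailing comma spreads the element and ignores everything after the
# first value.
first_only = pairs.map { |bar,| bar }

# An explicit anonymous splat yields the same values...
with_splat = pairs.map { |bar, *| bar }

# ...but the two forms report different arities.
comma_arity = proc { |bar,| }.arity
splat_arity = proc { |bar, *| }.arity
```

Here whole is [[1, 2], [3, 4]], both first_only and with_splat are [1, 3], and the arities are 1 and -2 respectively.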

`OptionalParameterNode`

Optional positional parameters are declared using the = operator after an identifier indicating the name. Here’s an example:

def foo(bar = 1)
end

This code is represented by the following AST:

Much like destructuring, the values of these parameters are evaluated at the beginning of the method if they are not already present on the stack. They can even reference other variables in their default values (just not themselves), as in:

def foo(bar, baz = bar)
end

This can get particularly confusing when combined with destructuring because the order in which things are executed can get quite weird. As an exercise, think about what def foo((bar, baz), qux = bar); end should do, and then try it. The answer may surprise you.
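Leaving the exercise aside, the basic rules for default values can be checked with a small sketch. The method names and the log array are hypothetical. It shows a default referencing an earlier parameter, and that the default expression only runs when the argument is missing.

```ruby
# The default for baz is computed from bar, an earlier parameter.
def doubled_default(bar, baz = bar * 2)
  [bar, baz]
end

# The default expression has a visible side effect so we can tell
# whether it was evaluated.
def logged_default(log, value = (log << :default_evaluated; :fallback))
  value
end

computed = doubled_default(1)
overridden = doubled_default(1, 10)

log = []
given = logged_default(log, :given)
log_after_given = log.dup
missing = logged_default(log)
```

Here computed is [1, 2] and overridden is [1, 10]. The log stays empty when the argument is given, and holds a single :default_evaluated entry after the call that omits it.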

`RestParameterNode`

Parameters can declare a “rest” parameter, which will gather up all remaining positional arguments into an array. Here’s an example:

def foo(bar, *baz)
end

This says to assign the first argument to bar, and then group the rest into an array and assign that to baz. This code is represented by the following AST:

You may also omit the identifier and use just the * operator. This does the same thing without providing you a handle to access the values. It also enables you to forward the arguments to another method, as we saw in Part 15 - Call arguments.
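As a quick sanity check of the runtime behavior, here is a sketch with a named rest parameter and with a required parameter that follows an anonymous rest. The method names are invented for this example.

```ruby
# bar takes the first argument, baz gathers whatever is left.
def head_and_rest(bar, *baz)
  [bar, baz]
end

# A required parameter after an anonymous rest takes the last argument.
def last_only(*, bar)
  bar
end

only_head = head_and_rest(1)
head_and_tail = head_and_rest(1, 2, 3)
kinds = method(:head_and_rest).parameters
last = last_only(1, 2, 3)
```

Here only_head is [1, []], head_and_tail is [1, [2, 3]], kinds is [[:req, :bar], [:rest, :baz]], and last is 3.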

Keywords

When keyword parameters were first introduced, there was some difficulty in adoption. This was because their implementation implicitly allocated a hash under the hood and occasionally exposed it. Since Ruby 3, this has been solved and we have “true” keyword parameters. Let’s take a look.

`RequiredKeywordParameterNode`

Keywords can be required by not declaring a default value. That is represented using the RequiredKeywordParameterNode node. Here’s an example:

def foo(bar:)
end

This code is represented by the following AST:

This indicates the parameter bar is required and must be passed as a keyword argument.

`OptionalKeywordParameterNode`

Keywords can be optional by declaring a default value. That is represented using the OptionalKeywordParameterNode node. Here’s an example:

def foo(bar: 1)
end

This code is represented by the following AST:

Much like optional positional parameters, the default value is evaluated at the beginning of the method if it is not already present on the stack. Default values can also reference other parameters, but not themselves.

`KeywordRestParameterNode`

The remaining keywords that were not explicitly named can be grouped together into a hash using the ** operator. That is represented using the KeywordRestParameterNode node. Here’s an example:

def foo(bar:, **baz)
end

This code is represented by the following AST:

The name can be omitted, which will still gather up the remaining keywords into a hash, but will not provide you a handle to access the values. It also enables you to forward the keywords to another method.
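The three keyword forms covered so far can be combined in one method. The following sketch uses a hypothetical span method to show a required keyword, an optional keyword whose default references another parameter, and a keyword rest.

```ruby
# start is required, finish defaults to a value derived from start, and
# any other keywords are gathered into the extra hash.
def span(start:, finish: start + 10, **extra)
  { start: start, finish: finish, extra: extra }
end

defaults_used = span(start: 1)
everything = span(start: 1, finish: 2, label: "x")
kinds = method(:span).parameters

# Omitting a required keyword raises an ArgumentError.
missing_rejected =
  begin
    span(finish: 2)
    false
  rescue ArgumentError
    true
  end
```

Here defaults_used has finish equal to 11 and an empty extra hash, everything has extra equal to { label: "x" }, and kinds is [[:keyreq, :start], [:key, :finish], [:keyrest, :extra]].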

`NoKeywordsParameterNode`

In terms of keyword parameters, the last one to cover is the least commonly used: **nil. This syntax allows you to indicate that a method accepts no keywords. We represent this with the NoKeywordsParameterNode node. Here’s an example:

def foo(**nil)
end

This yields:

We store this in the keyword_rest position to indicate that it should apply to all keywords.
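Since this syntax is rare, a small sketch of what it does at runtime may help. The method name is made up. Note that an explicit hash is still accepted as a positional argument, while keywords are rejected.

```ruby
# Accept any positional arguments but declare that no keywords are allowed.
def positional_only(*args, **nil)
  args
end

plain = positional_only(1, 2)
explicit_hash = positional_only({ key: 1 })
kinds = method(:positional_only).parameters

# Passing keywords raises an ArgumentError.
keywords_rejected =
  begin
    positional_only(key: 1)
    false
  rescue ArgumentError
    true
  end
```

Here plain is [1, 2], explicit_hash is [{ key: 1 }], kinds is [[:rest, :args], [:nokey]], and keywords_rejected is true.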

Others

`BlockParameterNode`

When declaring that a set of parameters accepts a block, you can use the & operator. This is represented using the BlockParameterNode node. Here’s an example:

def foo(&bar)
end

This code is represented by the following AST:

As with the other parameters with unary prefix operators, the name itself is optional. Omitting it will still accept a block, but will not provide you a handle to access it. It will, however, enable you to forward the block to another method call.
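Here is a minimal sketch of the named form, using hypothetical method names. It shows that the block arrives as a Proc (or nil when no block is given) and that the handle can be passed along to another call.

```ruby
# Returns the block that was passed, or nil if there was none.
def capture(&bar)
  bar
end

# Forwards the named block to each after doubling the values.
def each_doubled(values, &bar)
  values.map { |value| value * 2 }.each(&bar)
end

without_block = capture
with_block = capture { |x| x + 1 }
kinds = method(:capture).parameters

seen = []
each_doubled([1, 2]) { |value| seen << value }
```

Here without_block is nil, with_block.call(1) is 2, kinds is [[:block, :bar]], and seen ends up as [2, 4].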

`ForwardingParameterNode`

The last parameter type is the ForwardingParameterNode. This is created when the ... parameter is declared within a parameter list. It indicates that all other parameters should be grouped so that they can later be forwarded. Here’s an example:

def foo(...)
end

This is represented by the following AST:

You cannot use a name for this parameter as it cannot be grouped into an object. You can only then reuse the ... operator to forward all of the arguments to another method call. It’s important to note that this is the only parameter that can only be found on method definitions, not blocks or lambdas.
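The following sketch shows forwarding in action. The target and wrapper methods are hypothetical. Positional arguments, keywords, and the block all pass through the wrapper untouched.

```ruby
# A method with a mix of parameter types to forward into.
def target(a, b = 2, c:, &blk)
  [a, b, c, blk&.call]
end

# The wrapper does not name anything; it only forwards.
def wrapper(...)
  target(...)
end

minimal = wrapper(1, c: 3)
full = wrapper(1, 5, c: 3) { :from_block }
```

Here minimal is [1, 2, 3, nil] and full is [1, 5, 3, :from_block].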

Wrapping up

Perhaps because method calls are so foundational to Ruby, parameters in Ruby are quite varied. Here are some things to remember from our overview of them:

Destructuring parameters and assigning default values to parameters are evaluated at the beginning of a method.
Default values for parameters can reference other parameters, but not themselves.
*, **, and & can be used without names to forward arguments to another method call.

Because we talked so much about parameters today, it is only fitting that tomorrow we talk about blocks and lambdas.

Advent of Prism: Part 17 - Scopes

2023-12-17T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about nodes that introduce a new scope.

“Scope” is a term that gets somewhat abused in programming languages. It can mean quite a lot of things. Our definition for today refers to local variables. For today’s post when we say “scope” we mean a new set of local variables. Let’s have a look at the nodes that introduce new scopes.
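To illustrate what a new set of local variables means, here is a small sketch that is meant to run at the top level of a script. The names are invented for the example. A local variable defined outside is invisible inside def and class bodies, while a block (which does not introduce a scope in this sense) can still see it.

```ruby
outer = 1

# def introduces a new scope, so outer is not a local variable in here.
def method_scope
  binding.local_variable_defined?(:outer)
end

# class also introduces a new scope.
class ScopeDemo
  SEES_OUTER = binding.local_variable_defined?(:outer)
end

# A block closes over the surrounding locals instead.
block_sees_outer = [nil].map { binding.local_variable_defined?(:outer) }.first
```

Here method_scope returns false, ScopeDemo::SEES_OUTER is false, and block_sees_outer is true.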

`DefNode`

When you define a method using the def keyword, we represent it with a DefNode. Here’s an example:

def foo
  1
end

This code is represented by the following AST:

There are a lot of fields on these nodes (more than any other in the AST!). The important ones here are:

name - the name of the method
locals - the local table for the method
body - the body of the method

Explicit receiver

In the first example, the implicit owner of this new method is the current value of self. That can be made explicit, however, with an expression that ends in a . or ::. Here’s an example:

def self.foo
  1
end

In this case the owner will be the singleton class of the current value of self. It does not need to be limited to the self keyword, though. It can be almost any Ruby expression (especially when wrapped in parentheses), as in:

def Object.foo
  1
end

Object.foo # => 1

The AST for the first explicit receiver example looks like:

Here we get even more fields, but the important one is receiver which points to the expression on which the method should be defined.

Other parsers have split this node up into two different nodes: one with an implicit receiver and one with an explicit receiver. We felt like this could be annoying for consuming tools because processing all method definitions is a very common task. We wanted them to be able to do it in one place.
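At runtime, an explicit receiver means the method lands on the singleton class of whatever the expression evaluates to. Here is a minimal sketch with made-up object and method names, including the parenthesized form for an arbitrary expression.

```ruby
greeter = Object.new
other = Object.new

# A local variable as the receiver.
def greeter.hello
  "hello"
end

# An arbitrary expression as the receiver, wrapped in parentheses.
def (greeter.itself).goodbye
  "goodbye"
end

defined_on_greeter = greeter.singleton_methods.sort
other_responds = other.respond_to?(:hello)
```

Here defined_on_greeter is [:goodbye, :hello] and other_responds is false, since only the one object received the methods.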

Single-line

The def keyword can also be used to define a method on a single line. Here’s an example:

def foo = 1

In this case the body of the method is the expression that follows the = sign. This is semantically equivalent to the following:

def foo
  1
end

In terms of the parser there are some eccentricities. For example based on the existing precedence, def foo = bar rescue baz would normally be parsed as (def foo = bar) rescue baz, but there is a different path through the parser that allows def foo = bar rescue baz to be parsed as def foo = (bar rescue baz). There is also a current debate on allowing and/or to be used in these kinds of methods as well.
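You can observe the rescue behavior from the outside. In this sketch (assuming a Ruby version that supports single-line method definitions, and with a hypothetical method name) the rescue modifier belongs to the method body. If it were attached to the def expression instead, the second call would raise.

```ruby
# The rescue modifier is part of the body: def ... = (Integer(text) rescue 0)
def parse_or_zero(text) = Integer(text) rescue 0

valid = parse_or_zero("12")
invalid = parse_or_zero("abc")
```

Here valid is 12 and invalid is 0.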

The AST for this example looks like:

You’ll notice that none of our examples today have any parameters on the methods. That’s because the subject of tomorrow’s post is parameters. We’ll come back to them.

Rescues

We haven’t gotten to rescues yet, but it’s important that we mention them here because you can use them in the body of a method definition. Here’s an example:

def foo
  bar
rescue
  1
end

This code says to execute the bar method call, rescue any errors that inherit from StandardError, and then return 1 in the case an error was raised. These rescue clauses can be chained together, and they can be combined with else and ensure clauses. We’ll see more of this when we get to the post on rescues. The important piece of this to note for today is that in the event that some of these clauses are present, the body field will be a BeginNode instead of a StatementsNode. As an illustration, the AST for the above example is:
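Setting the AST aside for a moment, here is how such a method behaves at runtime. This is a small sketch with a hypothetical method that combines rescue and ensure directly in the method body, with a log array to make the order of execution visible.

```ruby
# rescue and ensure attach directly to the method body, with no explicit
# begin/end needed.
def checked_divide(a, b, log)
  a / b
rescue ZeroDivisionError
  log << :rescued
  1
ensure
  log << :ensured
end

log = []
normal = checked_divide(6, 3, log)
rescued = checked_divide(1, 0, log)
```

Here normal is 2, rescued is 1, and log is [:ensured, :rescued, :ensured]. The ensure clause runs on both calls but does not change the return value.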

`ClassNode`

Classes that are defined with the class keyword are represented by a ClassNode. Here’s an example:

class Foo
end

This code is represented by the following AST:

This simplistic class has the following important fields:

constant_path - a pointer to the expression after the class keyword before the body
locals - the local table for the class
name - the name of the class. This could easily be derived from the constant_path node, but it requires descending down the tree in order to find the leaf node. We cache it here because all compilers need to know the name of the class in order to generate the correct name for the frame pushed by the class.

Classes can also have superclasses and a body. Here’s an example:

class Foo < Bar
  1
end

This is represented by the following AST:

It’s important to note two things from this example. First, Bar does not need to be a constant or constant path. It can be any expression that you want, such as a local variable or the result of a method call. For example:

superclass = Object
class Foo < superclass
end

This works just fine, and in fact is equivalent to the first class example above. The second thing to note is that any code can be placed inside of Foo, not just method definitions or method calls. In our example we have a single 1 as the body of the class. That actually changes the return value of the entire class .. end expression to be 1.
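Both points can be checked with a short sketch meant to run at the top level of a script. The class names are invented. The superclass comes from a local variable, and the value of the class expression is the value of the last expression in its body.

```ruby
# Any expression can supply the superclass; here it is a local variable.
base = Struct.new(:x)

# The class expression evaluates to the last expression in its body.
value = class DemoPoint < base
  :body_value
end

# An empty body makes the class expression evaluate to nil.
empty_value = class DemoEmpty
end

same_superclass = DemoPoint.superclass.equal?(base)
```

Here value is :body_value, empty_value is nil, and same_superclass is true.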

As with method definitions, classes can also have rescue clauses. That would look like:

class Foo
  bar
rescue
end

We’ll cover this more when we get to rescues.

The last piece of this to note is that classes can be defined on a constant path and not just a constant. For example:

class Foo::Bar::Baz
end

This will look up the Foo::Bar constant path, define a class, and then assign that class to the Baz constant on that namespace. The AST for this example looks like:
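At runtime, the lookup described above means the namespace has to exist before the class keyword runs. Here is a minimal sketch with hypothetical module names, again meant to run at the top level of a script.

```ruby
module Outer
  module Inner
  end
end

# Outer::Inner already exists, so the class is defined inside it.
class Outer::Inner::Leaf
end

leaf_name = Outer::Inner::Leaf.name

# If the namespace cannot be found, the lookup fails with a NameError.
missing_namespace_rejected =
  begin
    class MissingNamespace::Leaf
    end
    false
  rescue NameError
    true
  end
```

Here leaf_name is "Outer::Inner::Leaf" and missing_namespace_rejected is true.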

`ModuleNode`

Modules that are defined using the module keyword are represented by a ModuleNode. Here’s an example:

module Foo
end

The AST for this example looks like:

Parsing these expressions is effectively a simpler form of parsing classes. They also have a constant path, a local table, and a body. They can also be combined with rescue clauses in the same way. As with classes, they can have any expressions in their body, not just method definitions or method calls.

`SingletonClassNode`

The final scope that we’re going to talk about today is the singleton class expression. These expressions allow you to execute code within the singleton class of an object. For example:

class << self
  1
end

This code is represented by the following AST:

These nodes have a pointer to the expression that is used to find the singleton class, a pointer to the body of expressions that should be executed within the singleton class, and a local table. As with classes and modules, the body can be any expression, not just method definitions or method calls. Also as with classes and modules, they can be combined with rescue clauses.

It’s important to remember that self is not the only singleton class you can enter into. For example, let’s say you wanted to define a method on Object:

class << Object
  def foo
    1
  end
end

Object.foo

Now if you wanted to remove that method, you could:

class << Object
  undef foo
end

Entering into a singleton class of an object can be very powerful, especially when combined with metaprogramming.
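The same define-then-remove sequence works for any object, not only classes like Object. Here is a small sketch with a made-up config object, meant to run at the top level of a script.

```ruby
config = Object.new

# Define a method that only this one object has.
class << config
  def greet
    "hi"
  end
end

greeting = config.greet
owned_by_singleton = config.method(:greet).owner.equal?(config.singleton_class)

# Re-enter the singleton class to remove the method again.
class << config
  undef greet
end

responds_after_undef = config.respond_to?(:greet)
```

Here greeting is "hi", owned_by_singleton is true, and responds_after_undef is false.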

Wrapping up

As you can imagine, these four nodes are very common in Ruby code, so it’s important to understand their semantics. Here are some things to remember from today’s post:

def, class, module, and class << can be combined with rescue, else, and ensure clauses
The superclass of a class can be any expression
The receiver of a method definition can be any expression
You can enter into the singleton class of any object, not just self

After discussing method definitions today, tomorrow we’ll be rounding out method definitions by looking at method parameters.

Advent of Prism: Part 16 - Control-flow calls

2023-12-16T00:00:00+00:00

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about control-flow calls.

Today we’re going to be looking at the four nodes that represent control-flow calls. As we saw in Part 6 - Control-flow writes, the &&= and ||= operators are quite complex. When combined with method calls, they get even more complex. Let’s have a look.

`CallAndWriteNode`

When a method call is combined with the &&= operator, we create a CallAndWriteNode. When this is done, it actually represents two method calls in one node, much like the CallOperatorWriteNode. Here’s an example:

foo.bar &&= 1

This code is semantically similar to the following:

receiver = foo
result = receiver.bar

if result
  receiver.bar=(1)
else
  result
end

First, the receiver of the methods is cached on the stack. Then, the read method is called on the receiver (in this case #bar). If the result of the read method is truthy, then the write method is called on the receiver (in this case #bar=) with the right-hand side of the operator as the argument. Otherwise, the result of the read method is returned.

The important part to remember about this node is that it represents potentially two method calls, not just one. Static analyzers that want to find all method calls have to account for this, which is why we’ve chosen to split this node out from a regular CallNode. Here is the AST for foo.bar &&= 1:

The fields on this node are pretty much the same as CallOperatorWriteNode, with which we are already familiar, so we won’t go through all of them. The important ones to see here are read_name and write_name, which are the two methods that will be called.
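To watch the read and write methods being called, here is a sketch with a hypothetical AndRecorder class that logs every call to #bar and #bar=.

```ruby
# Records which of the two methods are called by &&=.
class AndRecorder
  attr_reader :calls

  def initialize(value)
    @value = value
    @calls = []
  end

  def bar
    @calls << :bar
    @value
  end

  def bar=(value)
    @calls << :bar=
    @value = value
  end
end

# A falsy read short-circuits: only the read method runs.
falsy = AndRecorder.new(nil)
falsy_result = (falsy.bar &&= 1)

# A truthy read (0 is truthy in Ruby) is followed by the write method.
truthy = AndRecorder.new(0)
truthy_result = (truthy.bar &&= 1)
```

Here falsy.calls is [:bar] and falsy_result is nil, while truthy.calls is [:bar, :bar=] and truthy_result is 1.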

`CallOrWriteNode`

When the ||= operator is combined with a method call, we create a CallOrWriteNode. This node is very similar to CallAndWriteNode, except that it represents a different control-flow path. Here’s an example:

foo.bar ||= 1

This code is semantically similar to the following:

receiver = foo
result = receiver.bar

if result
  result
else
  receiver.bar=(1)
end

First, the receiver of the methods is cached on the stack. Then, the read method is called on the receiver (in this case #bar). If the result of the read method is truthy, then the result of the read method is returned. Otherwise, the write method is called on the receiver (in this case #bar=) with the right-hand side of the operator as the argument.
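The steps above, including the receiver being evaluated only once, can be observed with a small sketch. The OrRecorder class and the find_settings lambda are hypothetical names used only for this example.

```ruby
# Records which of the two methods are called by ||=.
class OrRecorder
  attr_reader :calls

  def initialize
    @value = nil
    @calls = []
  end

  def bar
    @calls << :bar
    @value
  end

  def bar=(value)
    @calls << :bar=
    @value = value
  end
end

settings = OrRecorder.new
lookups = 0

# The receiver expression has a side effect so we can count evaluations.
find_settings = -> { lookups += 1; settings }

# First use: the read is falsy, so the write method is called as well.
first = (find_settings.call.bar ||= 1)
calls_after_first = settings.calls.dup

# Second use: the read is truthy, so only the read method is called.
second = (find_settings.call.bar ||= 2)
```

Here first and second are both 1, calls_after_first is [:bar, :bar=], the final settings.calls is [:bar, :bar=, :bar], and lookups is 2 (one receiver evaluation per statement).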

Again, the important part here is that two methods are called and not just one. Here is the AST for foo.bar ||= 1:

`IndexAndWriteNode`

As with all of the other pairs of method call nodes, we must have the equivalent for the [] form. When an index expression is combined with a &&= operator, we create an IndexAndWriteNode. Here’s an example:

foo[:bar] &&= 1

This code is semantically similar to the following:

receiver = foo
result = receiver.[](:bar)

if result
  receiver.[]=(:bar, 1)
else
  result
end

First, the receiver of the methods is cached on the stack. Then, the read method is called on the receiver (in this case #[]) with whatever arguments are present between the brackets. If the result of the read method is truthy, then the write method is called on the receiver (in this case #[]=) with the arguments inside the brackets and the right-hand side of the operator as the last argument. Otherwise, the result of the read method is returned.
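The same kind of recording works for the index form. In this sketch a hypothetical Grid class takes two arguments between the brackets, which makes it visible that #[]= receives those arguments plus the right-hand side as its last argument.

```ruby
# A two-dimensional store that logs every #[] and #[]= call.
class Grid
  attr_reader :calls

  def initialize
    @cells = {}
    @calls = []
  end

  def [](row, col)
    @calls << [:[], row, col]
    @cells[[row, col]]
  end

  def []=(row, col, value)
    @calls << [:[]=, row, col, value]
    @cells[[row, col]] = value
  end
end

grid = Grid.new

# The cell is empty, so the read is falsy and no write happens.
grid[0, 1] &&= 5
calls_when_empty = grid.calls.dup

# Once the cell holds a truthy value, &&= reads and then writes.
grid[0, 1] = 1
grid[0, 1] &&= 5
```

Here calls_when_empty is [[:[], 0, 1]], and the full log afterwards is [[:[], 0, 1], [:[]=, 0, 1, 1], [:[], 0, 1], [:[]=, 0, 1, 5]].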

In this case #[] will always be called and #[]= will optionally be called. Here is the AST for foo[:bar] &&= 1:

`IndexOrWriteNode`

Finally, if an index expression is combined with the ||= operator, we create an IndexOrWriteNode. Here’s an example:

foo[:bar] ||= 1

This code is semantically similar to the following:

receiver = foo
result = receiver.[](:bar)

if result
  result
else
  receiver.[]=(:bar, 1)
end

Surprisingly, this type of code is actually somewhat common. It is commonly used as a way of ensuring default values in arrays and hashes or as a manner of memoization. Here is the AST for foo[:bar] ||= 1:

As with the other Index* nodes, there are neither read_name nor write_name fields because the names of the methods are always #[] and #[]=, respectively.
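Here is a sketch of the memoization pattern mentioned above, along with one caveat that follows directly from the semantics: because the write happens whenever the read is falsy, a cached false or nil is recomputed every time. The lambdas and counters are invented for this example.

```ruby
computations = 0
cache = {}

# Memoize the square of n in a hash keyed by n.
square = ->(n) { cache[n] ||= (computations += 1; n * n) }

first = square.call(4)
second = square.call(4)

# Caveat: a falsy value is never treated as cached by ||=.
checks = 0
flags = {}
enabled = -> { flags[:enabled] ||= (checks += 1; false) }

enabled.call
enabled.call
```

Here first and second are both 16 and computations is 1, but checks is 2 because the stored false triggers the write path again.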

Wrapping up

As we’ve seen in the past, &&= and ||= are quite complex operators. When combined with call nodes, they can be downright confusing. However, you’ve now seen all of the possible places where they can appear, so hopefully they’ll be a little less daunting the next time you encounter them in production code. Here are some things to remember from today’s post:

&&= and ||= operators can trigger two method calls when used with a call expression, not just one.
||= is commonly used as a way of ensuring default values in arrays and hashes or as a manner of memoization.

We are finally done with method calls! Tomorrow we will be filling in some of the larger gaps in our knowledge to date: scopes. See you then!