Advent of Prism: Part 9 - Strings

This blog series is about how the prism Ruby parser works. If you’re new to the series, I recommend starting from the beginning. This post is about strings.

Individuals

StringNode

Strings that represent a contiguous sequence of characters in source and that do not contain interpolation are represented by StringNode. This includes a fairly large variety of syntax, including:

As you can see, there are many, many ways to express strings in Ruby code. The StringNode class is responsible for representing the plain string content within them all. This maps relatively closely to the way CRuby compiles them as well (with either putstring or putobject instructions). Let’s take a look at the AST for "foo":

string node

There are some interesting nuances to the things we have chosen to store on these nodes. We’ll go through each:

The most important of these for compilers and interpreters is the unescaped field. This field contains the actual bytes that were represented by the source code. For simple strings like all of the examples listed above, this will be a slice of the source file. However, in the event of escape sequences in strings that allow them, this can have a different value. Consider for example:

'foo\n'.bytes # => [102, 111, 111, 92, 110]
"foo\n".bytes # => [102, 111, 111, 10]

In the first example the \n escape sequence is not allowed, so it is represented as two characters: \ and n. In the second example the \n escape sequence is allowed, so it is represented as a single \n character. By providing the unescaped field it means individual implementations like CRuby, JRuby, and TruffleRuby can rely on the parser to give them the right bytes instead of having to re-implement the escape sequence logic themselves.

Overall, StringNode shows up in a lot of places, but because the semantics are the same in every case, we reuse the same type. By keeping around the *_loc fields, we allow other non-compiler tools to recreate the source as they see fit (e.g., for a character literal the opening_loc will point to a ?).

XStringNode

Compared to the relative complexity of the StringNode, XStringNode can only show up in a single place:

`foo`

This is an XStringNode, and it causes the Ruby implementation to call the #` method on the current self object. Most of the time, this isn’t overridden, so it ends up calling Kernel#`. This method is responsible for executing the command in the string and returning any output to stdout that the command may have produced as a string. Here is the AST for `foo`:

xstring node

You’ll notice this effectively has all of the same fields as a regular string node. For all intents and purposes, that’s what it is. It represents the string node that will be passed to the #` method. The only difference in the flags is that these strings are always frozen, so that flag doesn’t need to exist.

SymbolNode

Symbol nodes are listed in this post because of their relationship with strings. Symbols are represented by SymbolNode and can be present in many places:

Symbols have lots of special rules around what is an is not allowed inside of them. In general, they closely follow the lexer rules for what is allowed as a method name. Here is the AST for %s[foo]:

symbol node

You’ll notice it has all of the same fields as a string node. As with the XStringNode, the only difference is that there is no frozen flag.

Interpolated content

When creating strings, symbols, xstrings, and regular expressions, you are allowed to interpolate statements and variables into them. This creates a dynamic string that is determined at runtime. First, we’ll look at the kinds of content that can be interpolated.

EmbeddedStatementsNode

First, there are embedded statements. This is the much more common form of interpolation. Here is what it looks like:

"foo #{bar} baz"

The code says to create a string that is flanked by "foo " and " baz". The result of the bar expression is evaluated and then implicitly #to_s is called before it becomes part of the string. This is represented by an EmbeddedStatementsNode. When inside of a string-like node that allows interpolation, it is represented by the #{ operator followed by any number of statements followed by the } operator. Here is the AST for "foo #{bar} baz":

embedded statements node

The node itself is relatively simple. It contains inner locations for the #{ and } operators, as well as a pointer to a statements node that contains the statements that will be interpolated.

EmbeddedVariableNode

The far less common form of interpolation is embedded variables. Effectively it is a way to interpolate instance, class, or global variables into a string by omitting the braces. Here is what it looks like:

"#@foo #@@bar #$baz"

Semantically this is equivalent to:

"#{@foo} #{@@bar} #{$baz}"

We represent this as an EmbeddedVariableNode. These nodes hold an inner location for the # marker as well as a pointer to the node they are interpolating. Here is the AST for "#@foo":

embedded variable node

The nodes that can be in that position are InstanceVariableReadNode, ClassVariableReadNode, and GlobalVariableReadNode.

Containers

As you’ve already seen, when interpolation is present in a string, symbol, xstring, or regular expression, the result is a list of nodes contained in a node that is prefixed by Interpolated.

InterpolatedStringNode

When interpolation is present in a string, the result is an InterpolatedStringNode. This node contains a list of nodes that represent the content of the string. Again, here is the AST for "foo #{bar} baz":

interpolated string node

The node itself contains locations for the opening quote, the closing quote, and the list of parts of the string. Interpolated strings can also appear in %W lists, as in %W[foo\ #{bar}\ baz]:

interpolated string node in %W

You’ll notice the first element in the list is equivalent to the interpolated string we saw first. The only difference is that it does not contain the locations for the opening and closing quotes as they are not present.

InterpolatedXStringNode

In the same way that XStringNode nodes are the same as StringNode with the added semantic that they send the string to the #` method, InterpolatedXStringNode nodes are the same as InterpolatedStringNode with the same semantic. Here is the AST for `foo #{bar} baz`:

interpolated xstring node

InterpolatedSymbolNode

Symbols can have interpolation when they are wrapping in :" and ". They can also have interpolation when used inside hashes and keyword arguments wrapped in " and ":. Here is the AST for :"foo #{bar} baz":

interpolated symbol node

Heredocs

Heredocs are another way of representing strings. They are semantically equivalent to StringNode and InterpolatedStringNode, although their syntax is significantly different. Here is an example:

<<-FOO
  bar
FOO

After a heredoc has been declared, the content of the heredoc begins on the next newline. Syntactically, this can get quite complicated, because the next newline might be further than you think. For example:

foo(<<-BAR)
  baz
BAR

In this example, it’s semantically equivalent to passing a string into the foo method. In order to parse this correctly, prism sees the declaration of the heredoc and creates a save point. It then skips to the next newline and parses the content of the heredoc. When it finds the closing delimiter, it creates another save point. It then jumps back to the first save point, parses the rest of the line, and then jumps back to the second save point. If it sounds complicated, it’s because it is. Here is the AST for the first example:

heredoc node

You can see it just resolves to a string node. This allows compilers to not have to care about heredocs at all, since they are effectively resolved by the time they get the tree. Heredocs as a whole are one of the most difficult parts of both parsing and understanding Ruby. Because of various combinations of nodes, they can be even more complicated than originally intended, as in:

<<A+%
a
A
b

While this code is valid Ruby, it’s also the last snippet we know of that is still failing to parse in prism. Obviously we can’t have that, so by putting this into a blog post I’m promising myself I’ll fix it.

Heredocs have other semantics depending on the quotes they use and the indentation form they use. For example, a heredoc that begins with << requires the closing delimiter to be at the start of its own line. <<- indicates it can have any amount of whitespace before it. <<~ means to eliminate all common whitespace (ignoring blank lines) from every line in the heredoc.

Quoting can change the semantics as well. <<' means to disable interpolation (interpolation is allowed by default in heredocs). <<" means to keep interpolation (it is somewhat redundant). <<` means to transform the heredoc into an xstring.

There is so much hidden complexity in this feature that it could be its own blog series. Suffice to say, it’s not a blog series I’d like to write.

Escapes

Within strings, escape sequences make it easier to write certain bytes. They can also change the encoding of the string from the default encoding of the file. Here is a list of the escape sequences that are supported:

Listing out how each of these function is beyond the scope of this blog, but if you’re interested in learning you should check out the official CRuby documentation.

Encoding

Every string (and xstring and symbol) has an associated encoding. These encodings represent the way the bytes of the string should be interpreted when considered as characters/codepoints. The default encoding of a string is the encoding of the file it is in. The file encoding, in turn, defaults to UTF-8 but can be changed by a magic # encoding: comment at the top of the file.

String-like nodes can have a different encoding depending on their internal bytes and escape sequences. In general, if a string contains a unicode escape sequence (\u) that resolves to a codepoint that is not in the US-ASCII encoding, the string will be forced into the UTF-8 encoding. The only caveat is that if a file is encoded in US-ASCII and a byte is present that is not in the US-ASCII encoding, the string will be forced into the ASCII-8BIT encoding.

Encodings have much more nuance than that as a whole, but that’s enough for today.

Wrapping up

We made it! Strings have lots of complexity, but hopefully this post gives you a window into everything you might want to know or lookup later. Here are a couple of things to remember:

Tomorrow we’ll be looking at the other kind of string-like node: regular expressions. See you then!

← Back to home