Formatting Ruby: Part 1 - How ripper works
This post is part of a series about how ruby-syntax-tree/syntax_tree works under the hood. It’s part of an ongoing effort to spread the word about this project, document how it works, and explain some of the internals for anyone interested. This post specifically focuses on ripper, the Ruby standard library used to parse Ruby files. For an overview of this series, see the introduction post.
History
On October 20th, 2001, Minero Aoki released version 0.0.1 of ripper, a streaming Ruby parser. You can still go to his website and download the first version he released.
At the time, Ruby 1.7 had just come out. Ruby was still a tree-walk interpreter (it didn’t use an internal bytecode). There was work going on to create Ruby’s first bytecode, but it hadn’t yet landed on the master branch. Because Ruby’s tree was so central to how it functioned, there was a lot of interest in accessing, debugging, and manipulating it. Other projects at the time focused on taking the already parsed tree and returning it to the user, as in ruth and nodeDump.
Ripper took a different approach. Because Ruby’s parser was created through yacc, there was a parse.y
grammar file in the Ruby repository that encapsulated both how the parser was to be created and how the file should be lexically parsed. Ripper took that parse.y
grammar file, forked it, and manipulated it to stream nodes and tokens as they were parsed. This is still how the project works today.
Eventually, it became difficult to maintain a separate grammar file outside of the Ruby codebase, since it required keeping up with any relevant change in the upstream file. In Ruby 1.9, when YARV landed and Ruby became a bytecode interpreter, ripper was merged in as a standard library. Now instead of checking in the changes to the grammar file, the changes were moved into comments and a separate build step was added to replace the body of the production rule with the content of the comments. We’ll come back to this momentarily.
Over time, ripper has continued to be one of the most frequently changed libraries in core (because it changes whenever Ruby syntax changes), but also one of the least understood. It’s no surprise that from 2001 to today, it still says it’s in early-alpha. Nevertheless, many projects have been built on top of the functionality that ripper provides, including some that you may have heard of like prettier, rubyfmt, rufo, and yard.
A lot more can be written about the history of ripper and what it has meant for the community, but this post is meant to focus on its technical aspectss. If you’re interested in more history like this, see my previous talk on the history of parsing Ruby.
How it works
At a high-level, ripper is a streaming bottom-up parser. It works by modifying the actions of each production of the Ruby grammar file to dispatch an event on the parser object. When it dispatches an event, it calls a method that can be overridden by the user. The result of the called method is then passed up the tree. Let’s break down what that actually means.
Step 1. Building
Ripper is shipped as a native extension to CRuby. This means it has its own extconf.rb
and depend
file for configuring how it’s built. Digging into the depend
file, we can find the exact steps used to build ripper’s grammar file here: ext/ripper/depend. In these steps it is running a couple of file processing scripts written in Ruby to grab various data from around the Ruby codebase to create a new parse.y
file that ripper can use. Below is the list of scripts that it runs.
tools/id2token.rb
id2token.rb runs over the checked in parse.y
file and replaces instances of the RUBY_TOKEN(...)
pattern with their numeric equivalents. In this case a token represents a scanner token (e.g., the ==
operator, a symbol literal, the ->
lambda begin, etc.). To get a better sense of what it’s actually doing, you can run the file and diff its result with the original file, as in: diff parse.y <(tool/id2token.rb id.h parse.y)
. That results in the diff below.
1216,1238c1216,1238
< %token tUPLUS RUBY_TOKEN(UPLUS) "unary+"
< %token tUMINUS RUBY_TOKEN(UMINUS) "unary-"
< %token tPOW RUBY_TOKEN(POW) "**"
< %token tCMP RUBY_TOKEN(CMP) "<=>"
< %token tEQ RUBY_TOKEN(EQ) "=="
< %token tEQQ RUBY_TOKEN(EQQ) "==="
< %token tNEQ RUBY_TOKEN(NEQ) "!="
< %token tGEQ RUBY_TOKEN(GEQ) ">="
< %token tLEQ RUBY_TOKEN(LEQ) "<="
< %token tANDOP RUBY_TOKEN(ANDOP) "&&"
< %token tOROP RUBY_TOKEN(OROP) "||"
< %token tMATCH RUBY_TOKEN(MATCH) "=~"
< %token tNMATCH RUBY_TOKEN(NMATCH) "!~"
< %token tDOT2 RUBY_TOKEN(DOT2) ".."
< %token tDOT3 RUBY_TOKEN(DOT3) "..."
< %token tBDOT2 RUBY_TOKEN(BDOT2) "(.."
< %token tBDOT3 RUBY_TOKEN(BDOT3) "(..."
< %token tAREF RUBY_TOKEN(AREF) "[]"
< %token tASET RUBY_TOKEN(ASET) "[]="
< %token tLSHFT RUBY_TOKEN(LSHFT) "<<"
< %token tRSHFT RUBY_TOKEN(RSHFT) ">>"
< %token <id> tANDDOT RUBY_TOKEN(ANDDOT) "&."
< %token <id> tCOLON2 RUBY_TOKEN(COLON2) "::"
---
> %token tUPLUS 132 "unary+"
> %token tUMINUS 133 "unary-"
> %token tPOW 134 "**"
> %token tCMP 135 "<=>"
> %token tEQ 140 "=="
> %token tEQQ 141 "==="
> %token tNEQ 142 "!="
> %token tGEQ 139 ">="
> %token tLEQ 138 "<="
> %token tANDOP 148 "&&"
> %token tOROP 149 "||"
> %token tMATCH 143 "=~"
> %token tNMATCH 144 "!~"
> %token tDOT2 128 ".."
> %token tDOT3 129 "..."
> %token tBDOT2 130 "(.."
> %token tBDOT3 131 "(..."
> %token tAREF 145 "[]"
> %token tASET 146 "[]="
> %token tLSHFT 136 "<<"
> %token tRSHFT 137 ">>"
> %token <id> tANDDOT 150 "&."
> %token <id> tCOLON2 147 "::"
tools/pure_parser.rb
pure_parser.rb is a very simple script that replaces a couple of lines in the parse.y
file in order to support Bison 3. The generated diff from running this file over the parser is below.
1085,1086c1085,1086
< %define api.pure
< %define parse.error verbose
---
> %pure-parser
>
These first two scripts are the same scripts that are run to generate the main parse.y
file that Ruby uses (as you can see in common.mk). So at this point, we have a temporary file that exactly matches the file that is going to be passed into bison
to generate the Ruby parser.
ext/ripper/tools/preproc.rb
preproc.rb is responsible for transforming the default parse.y
file that bison uses into a different parse.y
file that bison will use to generate the ripper parser. It does 3 things while it’s parsing the file.
- First, it copies over the entire prelude of the file before the grammar. The prelude contains all of the required includes, struct definitions, and helper function forward declarations necessary to make parsing work. While it’s copying, it also makes a list of each of the lexer states and their associated descriptions.
- Next, it copies over the entire grammar section of the file. While it’s copying, it modifies the actions of each of the production rules. First, it finds the C comments within the action that match the pattern
%r</\*% *ripper(?:\[(.*?)\])?: *(.*?) *%\*/>
. Then, it grabs up the rest of the action and puts it behind an#if 0 ... #endif
macro call so that you can still see the original action. Then, it runs each of the comments through the ext/ripper/tools/dsl.rb file to get their C equivalent. We’ll come back to this in a second. - Finally, it copies over the user code section of the file. It runs
ERB
over that section in order to dump the lexer states into defined constants on theRipper
class so that they can be accessed in user code. These constants look something like the following code.
/* newline significant, +/- is an operator. */
rb_define_const(Ripper, "EXPR_END", INT2NUM(EXPR_END));
At this point, the file is ready to be fed into bison
to generate the compiler.
ext/ripper/tools/dsl.rb
It’s worth discussing quickly how the DSL within the ripper comments work. Effectively, they’re a quick way of describing the events that should be dispatched, as well as the values that should be passed up the tree. You can run the script yourself locally (ruby -r./ext/ripper/tools/dsl -e 'p DSL.new("...", []).generate'
) to see how it works. Below are a couple of examples along with the code they generate (after macro expansion):
$1
- In a production rule with a single element, this passes the value directly up to the parent without dispatching a single event:
$$ = $1;
foo!($1)
- In a production rule with a single element, this calls out to theripper_dispatch1
function with the IDripper_id_foo
(coming fromfoo!
), then passes the result of that value directly up to the parent:
{
VALUE v1, v2;
v1 = $1;
v2 = ripper_dispatch1(p, ripper_id_foo, v1);
$$ = v2;
}
The implementation of ripper_dispatch1
is relatively simple in that it first validates and then makes a function call.
static VALUE
ripper_dispatch1(struct parser_params *p, ID mid, VALUE a)
{
validate(a);
return rb_funcall(p->value, mid, 1, a);
}
That function call ends up calling out to Ripper#_dispatch_1
on whatever parser object ripper is currently using. That will be important later.
foo!($1, $2)
- In a production rule with two elements, this calls theripper_dispatch2
function, then passes the result of that value up to the parent:
{
VALUE v1, v2, v3;
v1 = $1;
v2 = $2;
v3 = ripper_dispatch2(p, ripper_id_foo, v1, v2);
$$ = v3;
}
foo!($1, bar($2))
- In a production rule with two elements, this first calls thebar
function but does not dispatch an event to ripper (notice the lack of a!
on the function call). Then it calls theripper_dispatch2
function like the previous example and passes it up the tree.
Step 2. Parsing
Now that the parser is built, it can be used to parse Ruby source for whatever purpose. Instantiating a parser is pretty simple:
Ripper.parse("1 + 2") # => nil
As we discussed in the previous section, each of the nodes in the tree as they’re being parsed are created through the dispatchN
family of functions, which make it up to user-space through a set of on_*
methods. So if you wanted to parse integers, you would define the on_int
method. (Note this is because on_int
is aliased to _dispatch_1
here.)
class Parser < Ripper
def on_int(value)
puts "Got an integer: #{value}"
end
end
Parser.new("1 + 2").parse # => nil
# Got an integer: 1
# Got an integer: 2
Unfortunately, you still receive a nil
for the result of the parse method. This is because by default the base Ripper
class does not provide implementations of the various event handlers needed to produce a useful tree. Instead, it provides subclasses for that purpose. The actual abstract syntax tree that Ruby uses internally for 1 + 2
is akin to the following:
So, in order to get a useful value out of your parser, you would need to implement on_program
, on_stmts_add
, on_stmts_new
, on_binary
, and on_int
. If you do define those methods, you can get the parse tree out. Let’s do that quickly below as an example:
class Parser < Ripper
def on_binary(left, operator, right) = [:binary, left, operator, right]
def on_int(value) = value.to_i
def on_program(stmts) = stmts
def on_stmts_add(stmts, stmt) = stmts << stmt
def on_stmts_new = []
end
Parser.parse("1 + 2") # => [[:binary, 1, :+, 2]]
If you look at the tree that I drew above, you see how we arrive at this final product. The order that the methods are called in is left to right, bottom to top. So if we trace the execution, we get the following:
on_stmts_new
is called and we return an empty array.on_int
is called with"1"
. We convert that to an integer with#to_i
and return it to move it up the tree.on_int
is called with"2"
. We do the same as the previous call.on_binary
is called with the result of the twoon_int
calls and a symbol literal representing the operator. For that, we return an array containing all of the information present as well as a:binary
symbol literal used to indicate the kind of node being represented.on_stmts_add
is called with the result of theon_stmts_new
andon_binary
calls. We push the result of the righthand operand (theon_binary
) onto the value of the lefthand operand (theon_stmts_new
). This results in an array of a single value containing the result of theon_binary
call.on_program
is called with the result of theon_stmts_add
call. We immediately return that value.
API
There are a number of APIs within ripper that you can use when processing dispatched events. First, it’s important to discuss a couple of helper APIs within ripper that represent a lot of important functionality. Then we’ll discuss the bigger APIs represented by the two subclasses that ripper ships with.
#lineno
and #column
These two methods represent the internal state of the parser when you are within a specific event handler. For example, if you’re handling an on_int
, then lineno
is the line in the source that the int was found on, and column
is the offset (in bytes, not characters) on that line at which the int was found. It’s important to know that these APIs were designed to reflect the internal state of the parser only, and are not meant as definitive location information. For example, if you’re processing an on_binary
event and there happens to be a comment after your righthand operand, the column
method will return a value after the comment has been read. So, if you want specific location information you need to get it on a case-by-case basis for each event handler.
#encoding
When Ruby first reads a source file, it assumes an encoding for that file. If it encounters an encoding magic comment, then it will change how its parsing the remainder of the file from that point on. encoding
is an accessor method to retrieve the encoding it is currently using at any given time. This is very important if you’re parsing files that aren’t in an encoding you might expect like UTF-8
or ASCII
.
#error?
Error handling is a massive subject in parser literature. Within parser generators like bison
, you can use the error
production to catch errors and perform some action when they are found. In the case that an error has been found, the error?
method will return true. This is very important to note because Ruby might be able to generate a parse tree for you that contains errored nodes internally.
#state
Because the lexical grammar of the Ruby language is not regular, CRuby maintains its own state machine within the lexer in order to accept different patterns at different locations. state
is your access to that information. Previously, this information wasn’t available, but in Ruby 2.5 it got exposed as part of the preproc.rb
script’s work.
::sexp_raw
Finally, we get to the big APIs. Ripper::sexp_raw
is the same thing is ::parse
except that it uses the Ripper::SexpBuilder
subclass of Ripper
instead of the main class. The main’s classes implementation of every dispatch*
method is to simply return the first argument. That doesn’t work for most use cases, so instead SexpBuilder
ships with a default implementation for every node that returns an array representing an s-expression for each node. You can check this out yourself on the command line by running:
Ripper.sexp_raw("1 + 2")
# => [:program, [:stmts_add, [:stmts_new], [:binary, [:@int, "1", [1, 0]], :+, [:@int, "2", [1, 4]]]]]
You can see how this mirrors the AST structure we mentioned earlier. You can also see that scanner events (like @int
, i.e. leaf nodes) contain their source location while parser events (like binary
, i.e., inner nodes) do not.
::sexp
Lastly, Ripper::sexp
is like Ripper::sexp_raw
except that it uses the Ripper::SexpBuilderPP
subclass of Ripper
. It can a couple of small differences, but the main one is that it treats arrays of elements differently. First, let’s take a look at the s-expressions it generates:
Ripper.sexp("1 + 2")
# => [:program, [[:binary, [:@int, "1", [1, 0]], :+, [:@int, "2", [1, 4]]]]]
You’ll notice there aren’t any stmts_*
nodes contained in this output. That’s because SexpBuilderPP
treats all of the various add/new
pairs as just arrays that can be concatenated whenever a new element is added. You can see how that’s implemented here. Depending on your use case, this tree could be easier to work with. However, if you need specific information about exactly which nodes were used in the source, this is not the method for you. For more information about dealing with lists in ripper, see this page in my kddnewton/ripper-docs
repository for more information.
Example parser
As a final note, let’s build an example parser to put a more practical look on what we’ve been talking about. It’s going to be relatively contrived in order to fit into a blog post, but it should give you a sense of what we’re talking about here.
We’re going to build a small parser that will return to us the documentation for any module declarations within our code. When I say documentation I mean comments immediately preceeding the module. If there’s a newline between them, then we don’t want them. First, let’s start by just trying to get a list of the modules declared within the file.
It’ll be easiest to tell which kinds of event handlers we need to define by using the ::sexp_raw
method. Let’s start by running that:
Ripper.sexp_raw(<<~RUBY)
module Foo::Bar
end
RUBY
# =>
# [:program,
# [:stmts_add,
# [:stmts_new],
# [:module,
# [:const_path_ref,
# [:var_ref, [:@const, "Foo", [1, 7]]],
# [:@const, "Bar", [1, 12]]],
# [:bodystmt,
# [:stmts_add, [:stmts_new], [:void_stmt]],
# nil,
# nil,
# nil]]]]
Remember that any event dispatched that only takes one argument (the dispatch1
family of nodes) we don’t have to define an event handler because it just passes up the first argument. For the rest of the nodes, we’ll need an explicit handler if we want its information. For our purposes, we want the names of the modules, which from the comment above we know we need on_const
, on_var_ref
, on_const_path_ref
, and on_module
. Let’s define those explicitly, as well as an on_program
to return the module names at the end.
class Parser < Ripper
def initialize(...)
super
@modules = []
end
def on_const(value) = value
def on_const_path_ref(left, right) = "#{left}::#{right}"
def on_var_ref(value) = value
def on_module(const, bodystmt) = @modules << const
def on_program(...) = @modules
end
We can test this out by running Parser::parse
on a source string, as in:
p Parser.parse(<<~RUBY)
module Foo
end
module Foo::Bar
end
RUBY
# => ["Foo", "Foo::Bar"]
Now that we have the module names, we can change @modules
to be a hash keyed off the names where the values should be the documentation. We’ll need to hook into the on_comment
scanner event handler to get the value of the comments, and change our event handlers a bit to better handle line numbers. All told, this is what the new parser looks like:
class Parser < Ripper
def initialize(...)
super
@comments = []
@modules = {}
end
def on_comment(value) = @comments << [value[1..-1].strip, lineno]
def on_const(value) = [value, lineno]
def on_const_path_ref(left, right) = ["#{left[0]}::#{right[0]}", left[1]]
def on_var_ref(value) = value
def on_module(const, bodystmt)
comments = []
lineno = const[1]
@comments.reverse_each do |comment|
break if (lineno - comment[1]) > 1
comments.unshift(comment[0])
lineno = comment[1]
end
@modules[const[0]] = comments.join("\n")
end
def on_program(...) = @modules
end
Again, we can test this out by running against our previous example, with some additional comments:
p Parser.parse(<<~RUBY)
# This is the first module.
module Foo
end
# This is an unassociated comment.
# This is the second module.
# This is definitely the second module.
module Foo::Bar
end
RUBY
# => {
# "Foo"=>"This is the first module.",
# "Foo::Bar"=>"This is the second module.\nThis is definitely the second module."
# }
Now we’ve got something that’s actually working. It still has some definite limitations (for example it doesn’t properly handle nested modules), but it’s good enough for a small example.
Wrapping up
If you’re interested in learning more about ripper, here are a couple of other things you can read:
In the next post in this series I’ll show how ruby-syntax-tree/syntax_tree uses ripper under the hood to build its own syntax tree representing Ruby source. So if you’re looking for a more practical application of the knowledge you got from this blog post, stay tuned.
← Back to home