Advent of YARV: Part 14 - Branching
This blog series is about how the CRuby virtual machine works. If you’re new to the series, I recommend starting from the beginning. This post is about branching.
Although this blog series isn’t about how CRuby compiles into YARV instructions, we’ve touched on it briefly in a couple of posts. From our first post where we covered how literals are compiling into putobject
instructions to our most recent post where we discussed the alias
and undef
methods, we’ve covered a lot of Ruby syntax without even knowing it. There is still one major piece of Ruby syntax that we haven’t covered: branching.
Braching refers to the ability to execute different code paths based on a condition. You can think of it like railroad tracks diverging. The logic that controls the branching can be conditional or unconditional. In this post, we’ll see how that maps back to both Ruby syntax and YARV instructions.
Consider for example, the if
statement:
if foo
bar
end
Using the instructions we’ve seen so far, we know that foo
will get compiled into a method call using the send
instruction. We also know that bar
will be compiled similarly. However, we need to skip executing the send
instruction for bar
if the result of the foo
method call is falsy.1 This means our logic needs to branch between two different instruction paths.
The first path will be if foo
returns a truthy value. In this case, we want to execute the send
instruction for bar
. The disassembly for this would look something like:
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,19)> (catch: false)
0000 putself ( 1)[Li]
0001 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 putself
0004 opt_send_without_block <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0006 leave
Note that this effectively boils down to foo; bar
. The second path will be if foo
returns a falsy value. In this case, we want to skip the send
instruction for bar
.
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,19)> (catch: false)
0000 putself ( 1)[Li]
0001 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 pop
0004 putnil
0005 leave
Note that this effectively boils down to foo; return
.
In order to marry these two paths, let’s look at the commonalities. The first two instructions are the same in both paths. The last instruction is also the same. Therefore, we need something that pops a value instead of executing the send
instruction for bar
if the result of the foo
method call is falsy. This will look like something to the effect of:
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,19)> (catch: false)
0000 putself ( 1)[Li]
0001 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 **************************************
0005 putself
0006 opt_send_without_block <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0008 leave
0009 putnil
0010 leave
The asterix in the previous instructions is where we need to insert our special instruction. To branch ahead based on the top value of the stack. But before we can get there, we first need to understand a little more about how instructions are compiled.
Compiling instructions
To compile instructions the VM walks through the syntax tree generated by the parser and builds up a linked list. This linked list includes both instructions and operands. This is why when you look at the disassembly, you can see that the numbers on the left-hand side are not sequential: they are counting operands as well.
This linked list allows the VM to perform various optimizations while it is being compiled. For example, we looked at the opt_str_uminus
instruction when we were looking at send
specializations. This changes a putstring
+ send
sequence of instructions into a single opt_str_uminus
instruction. This optimization changes the number of instructions in the list. A linked list structure allows this to happen more easily than in an array because nothing needs to be copied to perform the removal.
There is one other kind of object that can be an entry in the linked list: a label. Labels are locations in an instruction sequence that other instructions are allowed to branch to. You don’t see them in the disassembly because when the VM compiles the instructions into an array, they are no longer necessary.
Let’s look again at our previous example. We left off by looking at an instruction sequence like the following:
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,19)> (catch: false)
0000 putself ( 1)[Li]
0001 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 **************************************
0005 putself
0006 opt_send_without_block <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0008 leave
0009 putnil
0010 leave
In this case the VM will insert a label into the linked list between the first leave
and the putnil
instruction. This is the location where we want to branch to if the result of the foo
method call is falsy. The VM will then insert a branchunless
instruction that branches to this label. The final instruction sequence will look like:
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,19)> (catch: false)
0000 putself ( 1)[Li]
0001 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 branchunless 9
0005 putself
0006 opt_send_without_block <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0008 leave
0009 putnil
0010 leave
Notice that the branchunless
instruction has a single operand which is a 9
. That 9
is the offset from the start of the instruction sequence to the putnil
instruction. This is the label that the branchunless
instruction is conditionally branching to. With this new knowledge in mind, let’s look at today’s instructions.
branchunless
As we just saw branchunless
, accepts a single operand which is the offset to branch to. It pops a single value off the top of the stack and branches to the offset if the value is falsy. Otherwise it continues to the next instruction.
In Ruby:
class BranchUnless
attr_reader :label
def initialize(label)
@label = label
end
def call(vm)
vm.jump(label) unless vm.pop
end
end
In if foo then bar end
disassembly:
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,19)> (catch: false)
0000 putself ( 1)[Li]
0001 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 branchunless 9
0005 putself
0006 opt_send_without_block <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0008 leave
0009 putnil
0010 leave
branchif
branchif
does the opposite of branchunless
. It branches to its offset operand if the top value of the stack is truthy. Otherwise it continues to the next instruction.
In Ruby:
class BranchIf
attr_reader :label
def initialize(label)
@label = label
end
def call(vm)
vm.jump(label) if vm.pop
end
end
In while foo do bar end
disassembly:
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,20)> (catch: false)
0000 jump 10 ( 1)[Li]
0002 putnil
0003 pop
0004 jump 10
0006 putself
0007 opt_send_without_block <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0009 pop
0010 putself
0011 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0013 branchif 6
0015 putnil
0016 leave
branchnil
Like the previous two instructions, branchnil
accepts a label operand and pops a single value off the stack. This time it branches to the label if the value is nil
. Otherwise it continues to the next instruction. This is how Ruby implements the &.
operator.
In Ruby:
class BranchNil
attr_reader :label
def initialize(label)
@label = label
end
def call(vm)
vm.jump(label) if vm.pop.nil?
end
end
In foo&.bar
disassembly:
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,8)> (catch: false)
0000 putself ( 1)[Li]
0001 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 dup
0004 branchnil 8
0006 opt_send_without_block <calldata!mid:bar, argc:0, ARGS_SIMPLE>
0008 leave
jump
There are occasions when we want to unconditionally branch to a label. For example in a while
loop we want to jump back to the top of the loop. The jump
instruction does just that. It accepts a single operand which is the offset to jump to. It does not pop any values off the stack, and instead directs the VM to continue execution at the offset. It never continues to the next instruction.
In Ruby:
class Jump
attr_reader :label
def initialize(label)
@label = label
end
def call(vm)
vm.jump(label)
end
end
In while foo do bar end
disassembly:
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,20)> (catch: false)
0000 jump 10 ( 1)[Li]
0002 putnil
0003 pop
0004 jump 10
0006 putself
0007 opt_send_without_block <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0009 pop
0010 putself
0011 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0013 branchif 6
0015 putnil
0016 leave
You can see in the above disassembly that it first jumps directly to the predicate for the while
loop. Then it conditionally branches back to the top of the loop if the predicate is truthy.
opt_case_dispatch
The final instruction to look at today is the opt_case_dispatch
instruction. First, let’s look at which instructions a case statement compiles into. For example, in:
case foo
when :bar then bar
when :baz then baz
when :qux then qux
end
This code is going to conditionally execute one of the branches based on the value of the foo
method call. It will do this by first compiling a series of branchif
instructions that check the result of calling #===
on each of the conditions. It will then compile each of the bodies of the conditions into a series of instructions that unconditionally jump to the end when they are done. For example, the above code compiles into:
== disasm: #<ISeq:<main>@test.rb:1 (1,0)-(5,3)> (catch: false)
0000 putself ( 1)[Li]
0001 send <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>, nil
0004 putobject :bar ( 2)
0006 topn 1
0008 send <calldata!mid:===, argc:1, FCALL|ARGS_SIMPLE>, nil
0011 branchif 35
0013 putobject :baz ( 3)
0015 topn 1
0017 send <calldata!mid:===, argc:1, FCALL|ARGS_SIMPLE>, nil
0020 branchif 42
0022 putobject :qux ( 4)
0024 topn 1
0026 send <calldata!mid:===, argc:1, FCALL|ARGS_SIMPLE>, nil
0029 branchif 49
0031 pop ( 1)
0032 putnil
0033 jump 56
0035 pop ( 2)
0036 putself [Li]
0037 send <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>, nil
0040 jump 56
0042 pop ( 3)
0043 putself [Li]
0044 send <calldata!mid:baz, argc:0, FCALL|VCALL|ARGS_SIMPLE>, nil
0047 jump 56
0049 pop ( 4)
0050 putself [Li]
0051 send <calldata!mid:qux, argc:0, FCALL|VCALL|ARGS_SIMPLE>, nil
0054 jump 56
0056 leave
Notice that the first half is the “dispatch” part. It checks each of the conditions in turn and dispatches to the correct branch. The second half is the “body” part. This part contains each of the bodies of each of the conditions of the case statement.
One of the optimizations that can be done on this kind of code is to use the opt_case_dispatch
instruction. Presuming the #===
operator never gets redefined on the Symbol
class, we can store a hash where the keys are the symbols to check and the values are the labels to jump to. Then, instead of performing a bunch of instructions to dispatch to the correct branch, we can just look up the label in the hash and jump to it.
The opt_case_dispatch
instruction accepts a single operand, which is the compiled hash of conditions to labels. It first pops the value off the top of the stack, then looks up the label to jump to. If it is found, it jumps there, otherwise it jumps past all of the instructions. This optimization can be performed for symbols, integers, floats, nil
, true
, false
, and strings. Applying this optimization to the previous example results in:
== disasm: #<ISeq:<main>@test.rb:1 (1,0)-(5,3)> (catch: false)
0000 putself ( 1)[Li]
0001 opt_send_without_block <calldata!mid:foo, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 dup
0004 opt_case_dispatch <cdhash>, 31
0007 putobject :bar ( 2)
0009 topn 1
0011 opt_send_without_block <calldata!mid:===, argc:1, FCALL|ARGS_SIMPLE>
0013 branchif 34
0015 putobject :baz ( 3)
0017 topn 1
0019 opt_send_without_block <calldata!mid:===, argc:1, FCALL|ARGS_SIMPLE>
0021 branchif 39
0023 putobject :qux ( 4)
0025 topn 1
0027 opt_send_without_block <calldata!mid:===, argc:1, FCALL|ARGS_SIMPLE>
0029 branchif 44
0031 pop ( 1)
0032 putnil
0033 leave ( 4)
0034 pop ( 2)
0035 putself [Li]
0036 opt_send_without_block <calldata!mid:bar, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0038 leave ( 4)
0039 pop ( 3)
0040 putself [Li]
0041 opt_send_without_block <calldata!mid:baz, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0043 leave ( 4)
0044 pop
0045 putself [Li]
0046 opt_send_without_block <calldata!mid:qux, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0048 leave
Note that the other instructions are still in the instruction sequence. They have to be there as a fallback in case one of the basic operators was redefined. For example, if you were to write:
# please don't do this
class Symbol
def ===(other)
true
end
end
then the opt_case_dispatch
instruction would immediately exit.
Wrapping up
In this post we looked at all of the instructions that can branch within a single instruction sequence. We also looked at how the compiler uses these instructions to implement branching syntax like if
and case
. A couple of things to remember from this post:
- When compiling instructions, the compiler keeps instructions in a linked list. This allows it to easily insert new instructions and remove instructions from the sequence.
- Labels are used to mark the location of instructions that can be jumped to. They are not present in the final disassembly because they are not needed in the final instruction sequence.
In the next post we’ll talk about just a single instruction: defineclass
.
-
The term “falsy” refers
nil
orfalse
. Everything else is refered to as “truthy”. ↩