Wtf Ruby Pt 1: DSL Dyslexia
Preface
This summer I’ve been working on the Ruby language adding type annotations to the grammar. However, the Ruby codebase is rather challenging. There’s not a whole lot of documentation (at least in English), there’s a fair amount of C preprocessor hacking, and there’s a few tricky scripts attached. I’ve decided to document these challenges in this blog. I hope that they’re useful to anybody who decides to work on the Ruby codebase, and I also hope that I can use these notes to improve the codebase myself. If any Ruby maintainers read this, I apologize in advance for any seemingly critical language. I totally respect and appreciate your contributions to Ruby and to the open source community as a whole. Please understand that these criticisms are coming from a place of respect and learning.
DSL Dyslexia
Today I’m going to talk about Domain Specific Languages (DSLs), specifically the one in Ripper. Ripper, for those who don’t know, is a Ruby library that consists of a Ruby parser. Ripper allows developers to literally “rip apart” the Ruby code into tokens, a process called lexing, or into an Abstract Syntax Tree (AST), a process called parsing. An AST is essentially the Ruby code turned into a general, homogeneous form independent of syntax.
Now the way that Ripper is generated is rather interesting. Instead of rewriting the parser in Ruby, the developers of Ruby opted to take the existing Ruby parser and extend it to also work with Ripper. This makes the most sense, as Ruby has a rather complex grammar with a lot of little details. It would have been far too time consuming and painful to recreate the parser in Ruby. Furthermore, if new grammar rules were to be added, two parsers would have to be updated.
So how did the developers extend the Ruby parser? Well they implemented a DSL. Domain Specific Languages are a really interesting form of computer languages. They are built for very specific usecases, such as quickly writing data models, or defining objects in a graphics engine, as opposed to general purpose languages like Python, Ruby, C, etc. They can allow for immense expressiveness in a clear, concise manner. However, if they are not well documented or understood, they can be confusing, even alien. The Ruby community as a whole tends to use DSLs a fair amount. For instance, Ruby on Rails uses a DSL to help programmers write web applications with minimal boilerplate.
But back to the Ruby parser. The way that the developers implemented
this DSL was by adding comments into the Ruby parser file,
parse.y
. While this is a completely understandable solution, as a
result, a fair bit of complexity was introduced. You see, the Ruby
parser is created using Bison, Bison being a very popular parser
generator. Parser generators allow programmers to create a parser by
simply defining some grammar rules in…you guessed it, a DSL. So by
adding a second DSL through comments, the developers put two DSLs in
one file. That won’t be confusing, right?
Which brings us to this past day. I decided to read the code in
ext/ripper/tools
, where the Ripper DSL is parsed. There’s a fun
bit of code in generate.rb
where it takes the Ripper annotations:
/*% ripper[brace]: rb_ary_new3(1, get_value($1)) %*/
and, using a regular expression, or regex (%r</\*% *ripper(?:\[(.*?)\])?: *(.*?) *%\*/>
) splits them into:
$1 = "brace"
$2 = "rb_ary_new3(1,get_value($1))"
This is done implicitly in Ruby, as $n
(where n
is an integer) are
magic global variables that refer to the nth match in the regular
expression. Note that it’s rb_ary_new3(1...
, not $1. I actually had
to double check that I didn’t accidentally delete the dollar sign. I
don’t know why it’s there.
Anyways, what’s interesting is that generate.rb
initializes a DSL
object that takes these two variables, well, $2
and ($1 || "").split(",")
(which is probably an attempt to put $1 in an
array?). The DSL object’s constructor, as defined in dsl.rb
,
proceeds to do a few things which I’ll get to and then evaluates
$2
. Kinda like this:
eval("rb_ary_new3(1, get_value($1))")
Therefore all the annotations in parse.y should be valid Ruby!
But…$1-$9
aren’t defined. Well, actually…they are. Let’s take a
look at the code above this:
# create $1 == "$1", $2 == "$2", ...
re, s = "", ""
1.upto(9) do |n|
re << "(..)"
s << "$#{ n }"
end
/#{ re }/ =~ s
At first I didn’t really pay attention to this code. But the comment caught my eye the second time around. So I read it a little more carefully and realized that this code is automatically generating values for $1-$9!
How, pray tell is it doing that? Well one side effect of $n
being
magic global variables is that they cannot be directly
mutated. Therefore something like $1 = 'mike'
doesn’t
work. Therefore, to get around this issue, the developers created this
hack.
Basically, we build up to strings, re
and s
. re
is actually a
regular expression, one of the form
/(..)(..)(..)(..)(..)(..)(..)(..)(..)(..)/
, basically 10
(..)
in a row. The string s
on the other hand consists of
"$1$2$3$4$5$6$7$8$9"
. The final operation is a match operation,
namely it matches re
against s
. Now, at first glance, this appears
to do nothing. There’s no apparent side effects and no variables bound
except re
and s
, which are never used again. EXCEPT, if you
remember the meaning of $n
. You see, in regular expressions, (..)
means match exactly two characters, that’s it. So when you run re
,
basically 10 of these (..)
, against s
, you match two characters 10
times. For instance, the first match, which will be stored in $1
is…yep "$1"
, the second match, stored in $2
is "$2"
, and so
on.
To recap, this code generates a regular expression and generates a string to implicitly bind 10 global variables. Pretty gnarly.
But that’s not the end of it! After that, I started to wonder how in
the world the eval
function was working. After all, these functions
aren’t naturally in the Ruby scope, there aren’t any other files
imported in dsl.rb
, and there isn’t a binding passed along to
eval. Also, something weird would happen. I would try to print out the
output of the function call like such:
puts(rb_ary_new3(1, get_value($1)))
Only to get rb_ary_push($1, get_value($3))
as my output. Kinda weird…
So I scrolled down a little and found to my horror: def method_missing
. To quote Gary Bernhardt,
wat.
For those who don’t know, method_missing
is a rather infamous
function in Ruby. When you attempt to call a function that doesn’t
exist in Ruby, Ruby normally raises a MethodError
, which is fairly
standard, rather sensible, basically an overall normal thing to do. But
the demented, insane geniuses that are the Ruby developers decided
that this isn’t flexible enough, so they gave an alternative:
method_missing
. method_missing
, when defined, allows you to
implicitly catch the error. Instead of raising a MethodError
, Ruby
calls your method_missing
and passes it the name of the method you
tried to call, along with any potential arguments you might
passed. This is NOT advised. Most Ruby style guides heavily discourage
method_missing
. Granted, the developers of Ruby are not most Ruby
programmers. But still, it’s not exactly common behavior.
Anyways, this method_missing
does a few different things. For
instance, if your DSL annotations end in a bang, i.e. they match this
regex /!\z/
, then it outputs this in ripper.y:
{VALUE v1,v2;v1=$1;v2=dispatch1(assoclist_from_args,v1);$$=v2;}
or, if they begin with an id, you just get the event back, as something like:
{$$=idCOLON2}
I can’t actually find any example of this in parse.y
, but we’ll
assume it’s used somewhere. And for the rest you basically just get
the same function that you started with (the value of $2
). That’s why
the puts(rb_ary_new3(1, get_value($1)))
gives rb_ary_push($1, get_value($3))
as the output. This gives the following in ripper.y (the output of this DSL):
$$=rb_ary_push($1, get_value($3));
Oh yeah, and if you happen to have an annotation called opt_event
,
that actually does call a method called opt_event
. It basically does
something similar with a few minor changes (I believe a nil check and
some other stuff).
In conclusion, this has been an excellent exercise in interesting Ruby code. In all seriousness, I wouldn’t really advise writing Ruby code in this manner, but I have a serious amount of respect for the person who wrote this code and thought of it in this manner. I certainly would have written it in a more *ahem* pedestrian manner.
Indeed, I might decide to attempt a refactor of this code, or at the very least some heavy documentation. Right now I’m holding off on refactoring simply because I don’t know the whole picture and I don’t know whether my refactoring attempts would actually fix anything, or if it would just miserably fail due to some detail of which I’m not aware. The developers of Ruby are smart people and I can’t claim to know better than them.
In the next post I’ll explain where all those weird dispatch
functions came from and what they’re doing.
Hope this has been helpful!
Nicholas