November 26, 2005

ANTLR-based Ruby grammar

It looks like other people are showing interest in replacing the YACC-based Ruby parser with one based on ANTLR. This hits close to a project I've been working on in my spare time.

When I looked at JRuby's source code a few weeks ago, I thought I'd start trying to understand how the parser works. I had a few ideas I wanted to experiment with, but...

The YACC-based parser is too hard to fathom. There are too many scattered pieces with no clear dependencies between them. Making changes without breaking things would be even harder.

So I decided to start working on an ANTLR-based grammar to replace the current implementation -- pretty much what MenTaLguY is trying to do. I'm still in experimentation mode and trying to figure out how I'll deal with some of the features of Ruby that are not usually found in other languages.

One of the hardest aspects, I think, is how to parse %Q-delimited strings. Here's how they work (from the Pickaxe book):

Following the type character is a delimiter, which can be any nonalphabetic or nonmultibyte
character. If the delimiter is one of the characters (, [, {, or <, the literal
consists of the characters up to the matching closing delimiter, taking account of nested
delimiter pairs. For all other delimiters, the literal comprises the characters up to the
next occurrence of the delimiter character.
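
To make the quoted rule concrete, here is a minimal hand-written sketch in Ruby of the delimiter matching alone. It is not taken from JRuby or from any ANTLR grammar: the names CLOSERS and scan_percent_q are hypothetical, and the sketch deliberately ignores backslash escapes and #{} constructs, both of which a real lexer would have to handle.

```ruby
# Hypothetical helper for illustration only; not part of JRuby or ANTLR.
# Maps each "paired" opening delimiter to its closing counterpart.
CLOSERS = { "(" => ")", "[" => "]", "{" => "}", "<" => ">" }.freeze

# Scans a %Q literal beginning at src[start].
# Returns [body, index_just_past_the_literal], or nil when the input does not
# start with a well-formed %Q literal or the literal is never terminated.
# Assumption: backslash escapes and #{} are NOT handled here.
def scan_percent_q(src, start = 0)
  return nil unless src[start, 2] == "%Q"

  opener = src[start + 2]
  # Per the quoted rule: the delimiter must be nonalphabetic and single-byte.
  return nil if opener.nil? || opener.match?(/[[:alpha:]]/) || opener.bytesize > 1

  closer = CLOSERS[opener] # nil for non-paired delimiters such as | or !
  depth = 1
  i = start + 3
  while i < src.length
    ch = src[i]
    if closer && ch == opener
      depth += 1 # nested opening delimiter, only meaningful for paired ones
    elsif ch == (closer || opener)
      depth -= 1
      return [src[(start + 3)...i], i + 1] if depth.zero?
    end
    i += 1
  end
  nil # ran off the end of the input without finding the closing delimiter
end
```

With a paired delimiter, scan_percent_q("%Q(a (b) c) rest") yields the body "a (b) c"; with a plain one, scan_percent_q("%Q|a|b|") stops at the first following | and yields "a". The depth counter is the part that a context-free lexer rule cannot express directly when the delimiter is only known at run time.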

Additionally, %Q strings can contain #{} constructs, whose contents should be parsed as Ruby expressions. These can probably be handled in a manner similar to how this grammar for the E language implements what they call quasi-literals.
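
As a rough illustration of what separating the #{} constructs from the literal text involves, the sketch below splits an already-extracted string body into literal segments and expression sources by counting braces. The name split_interpolations is hypothetical, and this is not how the E grammar does it. The brace counting is naive: a } inside a nested string literal or comment would confuse it, which is exactly why the embedded text really needs to go through the Ruby expression parser.

```ruby
# Hypothetical helper for illustration only.
# Splits a string body into [:str, text] and [:code, source] segments.
# Returns nil if a #{ is never closed.
# Assumption: braces are balanced by simple counting; nested string
# literals containing braces are NOT handled correctly.
def split_interpolations(body)
  segments = []
  literal = String.new
  i = 0
  while i < body.length
    if body[i, 2] == '#{'
      depth = 1
      j = i + 2
      # Advance j until the brace opened by #{ is balanced.
      while j < body.length && depth > 0
        depth += 1 if body[j] == "{"
        depth -= 1 if body[j] == "}"
        j += 1
      end
      return nil if depth > 0 # unterminated #{

      segments << [:str, literal] unless literal.empty?
      literal = String.new
      # j now points just past the closing brace.
      segments << [:code, body[(i + 2)...(j - 1)]]
      i = j
    else
      literal << body[i]
      i += 1
    end
  end
  segments << [:str, literal] unless literal.empty?
  segments
end
```

Each :code segment would then be handed to the expression parser, while the :str segments become ordinary string tokens.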

I haven't yet decided how to deal with delimiters, but I have the feeling it may require writing a custom lexer for string contexts.