Deprogramming the LALR(1) Bias
Look around for LR(1) parser generators, and you will primarily find LALR(1) parser generators. There seems to be an unspoken assumption that LALR(1) is somehow better than LR(1), but look at the following pertinent facts:
- In terms of grammars that can be handled by various parser generation techniques, SLR(1) ⊂ LALR(1) ⊂ LR(1).
- SLR(1) and LALR(1) tables are always the same size, but LR(1) tables are potentially larger.
So then, why is there a bias toward LALR(1)? I suspect that it has to do with the well known and widely cited dragon book, which treats LALR(1) as the culmination of the parser generation algorithms it presents. There is one little detail that is not mentioned at all though: it is possible to compress LR(1) tables to be the same as LALR(1) tables, with the exception of avoiding table compression that would introduce LALR(1)'s reduce/reduce conflicts. Granted, the book is over 20 years old now, but I would not be surprised if the new edition preserves this omission.
I recently finished implementing a parser generator (as the basis for a GLR parser, as described by the Elkhound technical report). I was initially unable to wrap my head around the "efficient" method that the dragon book provides for LALR(1) parser generation, so I took the incremental approach of implementing SLR(1), converting that to LR(1), then finally converting that to LALR(1) -- the "easy, inefficient" method according to the dragon book. Well, when I got to the point of compressing LR(1) tables to LALR(1) tables, I questioned the necessity of compressing states when reduce/reduce conflicts would result. As near as I could tell, there is no fundamental requirement to do so, which means that compressed LR(1) tables that avoid such conflicts should be the gold standard, rather than LALR(1).
This seemed too obvious to have been overlooked by the compiler community, so I looked high and low for prior art. The best I found was a comp.compilers post by Chris F. Clark on 1 July 2003, which says, in part:
That brings us back to "full" or canonical LR(k), where k is >= 1.There is likely some discussion of this optimization somewhere in the primary literature, but several hours of searching failed to turn it up. The closest I found is Sönke Kannapinn's PhD thesis (in German, but the abstract is also in English), which comes to the same conclusion via a very different route.
LR(k) parsing preserves the needed left context to disambiguate
certain situations where LALR(k) finds the rules to conflict. The
canonical way of doing this is to not merge states with different
lookaheads in the first place. That causes an explosion in the
table size as many states are kept distinct simply because they
have different lookaheads, when in reality the lookahead for those
states will never be consulted. A more modern method for solving
the problem involves splitting the states only when a conflict is
detected. Then, if the grammar is LALR, no splitting will occur
and one has the same size machine as the LR(0), but as conflicts
arise the require left context to resolve, the tables slowly grow.
In summary, if you need to write a parser generator, make it LR(1) and use careful table compression, rather than using the less general LALR(1) approach.