11 January 2007

Parser generation algorithms of a bygone era

In my most recent post, I talked about the disadvantages of LALR parser generation as compared to the more general LR method. Although I now believe even more fervently that LR parser generation is the clear winner over LALR, I need to retract part of what I said, and expound on what really works. Indeed, I have since then transcended a technical travail, thanks in part to some most helpful comments from readers.

History
First, let us informally go over a bit of history. The earliest high level languages were a real challenge to create, partly because the designers did not at that time have adequate language parsing theory at their disposal. This gave rise to a flurry of research, and in 1965, Donald Knuth published the seminal LR paper (Knuth, 1965). This provided a foundation upon which to base general understanding of parsing, but the method for generating LR parsers was extremely computationally intensive, both in terms of space and time. (This method can be found in the dragon book.) Eventually, researchers began to publish derivative methods that were more practical. Of particular note are the SLR (DeRemer, 1971) and LALR (LaLonde, 1971) methods. The original yacc LALR parser generator came into use in about 1973. In ten years, we went from having little understanding of parsing theory, to having powerful tools. Perhaps this is why later work apparently went unnoticed. In particular, Pager (1977) seems to have been practically forgotten.

LALR was devised as a compromise method to get around the extreme cost of the standard LR method. However, Pager's method delivers full LR power, without any downside. Pager's method merges states during construction wherever possible, and the generated parser is approximately the same size as that generated by LALR. Computationally, the algorithm is on par with LALR. Finally, the algorithm is no more challenging to implement than LALR [1].

Travails
Now, back to my travails. I have been developing a parser generator as supporting software for Lyken, which is a language I am currently developing. Parser generation is a means to an end for the purposes of Lyken, so I was hoping to spend as little time as possible on this part of the project. In fact, I started out using lemon, which is an excellent LALR parser generator. However, as the Lyken grammar neared its final form, I found that the convolutions necessary to wedge it into a form lemon could handle had left me with an unmaintainable mess. Even small modifications were often requiring far-reaching changes, which was going to force me to do some substantial work to normalize the AST (abstract syntax tree) in order to make the guts of the compiler immune to superfluous parser changes [2].

Eventually, I decided to go back and look more carefully at what GLR parsing could do to alleviate the problems I was having. There is an excellent paper by Scott McPeak that describes in detail how to implement a practical GLR parser (McPeak 2002), and I was sufficiently convinced of GLR's ability to help me clean up the Lyken parser mess that I took a step back and implemented a parser generator from scratch [3]. The parser generator ended up taking about 200 hours to implement, not because it requires lots of coding, but because I had to learn a lot along the way.

I ran into a myriad of stumbling blocks, among which the most serious are listed below:
  • The only reference I initially had for SLR/LALR/LR parser generation was the dragon book. Opinions surely differ, but I found it to be rather inscrutable; full comprehension took many, many readings. I was not confident that I understood what I was reading, so I had to implement the algorithms as I read. Since the book builds up, from LR(0) to SLR(1) to LR(1) to LALR(1), I ended up implementing the first three variants (before realizing that LALR was not what I wanted).
  • The naive LR(1) algorithm requires exponential time and space. This was an issue 40 years ago, and to my great surprise (and mild shame, truth be told), it is still an issue now. Of course, during testing, I was using relatively small grammars, so I did not recognize the critical nature of this issue until just after I thought I was completely done. Since this was all Python code, there was some question as to whether C code would be fast enough. 25 hours of effort demonstrated that Python could only be blamed for 2-3 orders of magnitude slowdown, and I was looking at a 5+ order of magnitude problem.
  • The literature on LR parser generation is quite old at this point, so when I did start trying to find information on better approaches, the web was of surprisingly little help. I finally went to the trouble of re-joining the ACM so that I could access its digital library, where I found several papers of interest. At about the same time, Chris K pointed me at Pager's paper via Menhir, and from there the solution to LR parser generation unfolded rapidly.
  • There was an erratum for the algorithms presented in the Elkhound technical report, and I did not find the notice online until after I had discovered and fixed the problem myself. This may not sound like a big deal, but my first attempt at solving the problem was very wrong, and it masked other problems with my implementation that did not surface until I gained a sufficient understanding of the erratum to fix the problem correctly.
  • The Elkhound technical report does not discuss how to deal with ε-grammars, and I was unable to obtain any references on the subject. Thus, I had to reinvent that wheel. This, in combination with the erratum just mentioned, provided the setting for some difficult debugging.
Overall, the time requirements for this project were about what I estimated beforehand, but what actually took up that time surprised me in the extreme.

Footnotes
[1] In the interest of full disclosure, I mention here that Pager presents two methods for determining state compatibility during state merging. "Weak compatibility" is in practice perfectly adequate, and simple to implement, though there are some edge cases, rarely encountered in practice, in which a limited number of states are duplicated. "Strong compatibility" guarantees that the state machine is minimal, though it requires a bit more sophisticated programming, and appears to be included in the paper mainly for completeness, rather than due to practical need.

[2] Also of note, Lyken requires two tokens of lookahead in one case, but this was not a critical parser issue, since I was able to convert the lookahead within the scanner.

[3] Elkhound appears to be an excellent parser generator, but I am bootstrapping Lyken with Python. I had already gone through enough trouble gluing Lyken's C-based scanner/parser and the Python portions of the compiler that I felt it worthwhile to move to writing the bootstrap compiler purely in Python if I was going to be rewriting the parser anyway. Unfortunately, there were no parser generation tools for Python that met my needs, which is why I wrote one from scratch.

References
Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman (1986) Compilers: Principles, Techniques, and Tools. Addison-Wesley Publishing Company, Reading, MA.

DeRemer, F.L. (1971) Simple LR(k) grammars. Comm. ACM 14, 453-460.

Knuth, Donald E. (1965) On the translation of languages from left to right. Information and Control 8, 607-639.

LaLonde, W.R. (1971) An efficient LALR parser generator. Computer Systems Research Group, University of Toronto, Tech. Report CSRG-2.

McPeak, Scott (2002) Elkhound: A Fast, Practical GLR Parser Generator. University of California, Berkeley, Tech. Report UCB/CSD-2-1214.

Pager, David (1977) A Practical General Method for Constructing LR(k) Parsers. Acta Informatica 7, 249-268.