22 March 2007

A simple Parsing-based parser example

As requested by several people, I have uploaded a simple example parser that uses the Parsing module. It is pretty self-explanatory, so I encourage you to take a look at it, run it, and experiment with changes.

That done, I should say a bit more about how I actually use the Parsing module. Well, the first thing I did with it was to write a parser for a parser generator input language similar to what Elkhound supports. The parser translates the input to two output files, Token.py and Ast.py, which contain code that the Parsing module can generate a parser from. Here are a few example productions from Lyken's grammar specification:
token comment=xclass:"comment" {
    inline Init ${
        self.val = raw
    }$
}

fail pStmtList <{pExpr};

nonterm StmtList=[e] {
    -> Stmt(Stmt, semicolon);
    -> DelimitedExpr(DelimitedExpr) [pStmtList];

    -> ExtendStmt(StmtList, Stmt, semicolon);
    -> ExtendDelimitedExpr(StmtList, DelimitedExpr) [pStmtList];
}

nonterm Stmts {
    -> Empty;
    -> Stmt(Stmt);
    -> StmtList(StmtList);
}

start Module=[S] {
    -> Module(boi, DocStr, ModuleDecl, Version, InitialCBlock, Stmts);
}

There are numerous features not represented above, but the general idea should be apparent. Note that embedded code can be associated with productions. This lets me do some highly stylized code generation, yet still embed custom code where necessary.

One of the non-obvious clauses above is the "=[S]" that follows "start Module". This extra annotation says that the Module production provides an outer lexical scope. By supporting such custom annotations, I am able to automatically generate code that handles many aspects of semantic analysis. This is also one of the main reasons I haven't seen fit to release the grammar specification parser: it is not obvious to me how to generalize such features in a way that benefits everyone. At the moment I am of the opinion that the low-level docstring-based interface to the Parsing module is good for small- to medium-size parsers, and that for large parsers you are better off writing a custom translator, whose needs I cannot hope to guess.
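To make the docstring-based interface more concrete: the sketch below is not the Parsing module's actual API; it only illustrates the general mechanism of harvesting parser generator directives from class docstrings via introspection. All class and function names here, and the %token/%nonterm/%reduce directive spellings, are stand-ins for illustration.

```python
# Illustration of docstring-driven grammar specification. A generator walks
# a namespace, finds classes whose docstrings carry directives, and builds
# its token list and productions from them. Hypothetical names throughout.

class Token:
    "%token"

class IntTok(Token):
    "%token int"

class PlusTok(Token):
    "%token plus"

class Expr:
    "%nonterm"
    def reduceInt(self, i):
        "%reduce int"
    def reducePlus(self, a, plus, b):
        "%reduce Expr plus Expr"

def harvest(namespace):
    """Collect %token names and %reduce productions from docstrings."""
    tokens, productions = [], []
    for obj in namespace.values():
        if not isinstance(obj, type):
            continue
        doc = obj.__doc__ or ""
        if doc.startswith("%token") and obj is not Token:
            tokens.append(doc.split()[1])          # token name after %token
        elif doc.startswith("%nonterm"):
            for name, meth in vars(obj).items():
                mdoc = getattr(meth, "__doc__", "") or ""
                if mdoc.startswith("%reduce"):
                    # right-hand side symbols after %reduce
                    productions.append((obj.__name__, mdoc.split()[1:]))
    return tokens, productions

tokens, prods = harvest(dict(globals()))
```

The appeal of this style is that the grammar lives next to the reduction code in ordinary Python classes, so no separate grammar file or generation step is needed.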

19 March 2007

Parsing.py parser generator is now available

The parser generator I implemented has been quite stable for over a month now. It has the potential to be of use to others, so I am making it publicly available. Parsing.py is a stand-alone pure Python module. This makes it easy to maintain and use, but as a result it is substantially slower than C-based parser generators and parsers. That is the only negative thing I can think of to say though. In my obviously biased opinion, the Parsing module is extremely cool. If you need to implement a parser in Python, you should give it a serious look.

Here is a quick summary of what the Parsing module is:
  • True LR(1) parser generator. Python slowness aside, the algorithms used are extremely scalable; I am currently using it for a grammar with well over 500 productions.

  • Both standard LR (aka CFSM) and GLR parser drivers.

  • Tight Python integration. Parser generator directives are specified via docstrings. Rather than running a parser generator as a separate step, parser generation happens on the fly, and the results are cached in a pickle. On subsequent runs, as long as the pickle is still compatible with the parser specification, it is loaded and used directly.

  • Extensive error checking and logging. You can get a very clear idea of what is going wrong, during both parser generation and parsing, by enabling logfile output.
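The on-the-fly caching mentioned above can be sketched roughly as follows. This is not Parsing.py's actual code, and all names are hypothetical: the idea is simply to fingerprint the grammar specification and reuse the pickled tables only while the fingerprint still matches.

```python
import hashlib
import os
import pickle
import tempfile

def build_tables(spec_text):
    # Stand-in for the expensive LR(1) table construction step.
    return {"spec_hash": hashlib.sha1(spec_text.encode()).hexdigest(),
            "tables": "...generated parse tables..."}

def load_parser(spec_text, cache_path):
    """Return parse tables, reusing the pickle while it matches the spec."""
    fingerprint = hashlib.sha1(spec_text.encode()).hexdigest()
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cached = pickle.load(f)
        if cached["spec_hash"] == fingerprint:
            return cached, True              # cache hit: skip generation
    tables = build_tables(spec_text)         # cache miss: regenerate
    with open(cache_path, "wb") as f:
        pickle.dump(tables, f)
    return tables, False

cache = os.path.join(tempfile.mkdtemp(), "parser.pickle")
first, hit1 = load_parser("E -> E plus int", cache)   # first run: miss
second, hit2 = load_parser("E -> E plus int", cache)  # second run: hit
```

The practical effect is that only the first run after a grammar change pays the table-construction cost.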

The module is heavily documented via docstrings. The easiest way to view the documentation in a readable format is via the interactive Python command line (import Parsing; help(Parsing)). It is worth mentioning here that you need Python 2.5. As far as I know, the (... if ... else ...) conditional expression syntax is the only reason for this dependency, so if you want to use Parsing.py with an older Python interpreter, porting it should not cause you much trouble.
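The rewrite needed for such a port is mechanical. A hypothetical example (not taken from Parsing.py) of the 2.5 syntax and its older-interpreter equivalent:

```python
# Python 2.5+ conditional expression:
def clamp25(x, lo, hi):
    return lo if x < lo else (hi if x > hi else x)

# The same logic without the 2.5 syntax, usable on older interpreters:
def clamp24(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
```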

Okay, without further delay, here it is: http://www.canonware.com/download/Parsing/Parsing.py