I’ve been auditing a course on programming language implementation, and I’m particularly interested in parser generators. I just spent an afternoon reading about PLY, a parser generator for Python. It’s a pure-Python implementation of Lex and Yacc. Here is the PLY documentation I’ve been reading all afternoon.
PLY Lex
Writing a tokenizer basically means building a finite automaton, which is easy to do with the help of regular expressions. For PLY Lex, the following need to be defined:
- Tokens: The token types;
- Token definitions: each token is defined either by a variable holding a regular expression or by a method whose docstring is the regular expression. The naming convention is t_TOKENNAME, e.g. a SYMBOL token is defined by a variable or method named t_SYMBOL;
- Error method: define the t_error() method for error handling.
Finally, call the Lex build function, lex.lex(), to build the tokenizer. If you define everything in a class, pass an instance of that class as the module argument.
The code is listed below:
    class MyLexer:
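To get a feel for what the t_ naming convention buys you, here is a minimal sketch of my own that uses only the standard library. It is not PLY code and not the listing above: ToyLexer, build, and tokenize are made-up names, and this t_error takes (text, pos), whereas PLY's t_error receives a token object. It only imitates the idea of collecting t_NAME rules into one combined regular expression.

```python
import re


class ToyLexer:
    # Hypothetical token rules, named after the t_TOKENNAME convention.
    tokens = ("NUMBER", "PLUS", "TIMES")
    t_NUMBER = r"\d+"
    t_PLUS = r"\+"
    t_TIMES = r"\*"
    t_ignore = " \t"  # characters skipped between tokens

    def t_error(self, text, pos):
        # Toy error handler: stop at the first character no rule accepts.
        raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")

    def build(self):
        # Combine every t_NAME rule into one master pattern with named groups.
        parts = [f"(?P<{name}>{getattr(self, 't_' + name)})" for name in self.tokens]
        self.master = re.compile("|".join(parts))
        return self

    def tokenize(self, text):
        pos = 0
        result = []
        while pos < len(text):
            if text[pos] in self.t_ignore:
                pos += 1
                continue
            match = self.master.match(text, pos)
            if match is None:
                self.t_error(text, pos)
            # lastgroup is the name of the rule that matched.
            result.append((match.lastgroup, match.group()))
            pos = match.end()
        return result
```

Calling ToyLexer().build().tokenize("3 + 4 * 5") yields (type, text) pairs such as ("NUMBER", "3") and ("PLUS", "+"). The real library does far more (line tracking, token objects, rule ordering), so treat this as an illustration of the convention only.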
PLY Yacc
Yacc generates a table-driven LR parser: LALR(1) by default, or SLR when specified.
Yacc also uses docstrings to define the context-free grammar. Similarly, grammar rule functions follow the naming convention p_PRODUCT_NAME. It also writes a parser.out file, which describes the parser and any shift/reduce conflicts, for debugging purposes.
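The docstring-as-grammar idea can be imitated in a few lines. The sketch below is my own illustration using only the standard library. ToyGrammar and collect_productions are hypothetical names, and this is not how PLY is implemented. It just shows that a p_ prefix plus a docstring is enough to recover the productions.

```python
class ToyGrammar:
    # Each method's docstring holds one production, written "lhs : rhs symbols".
    def p_expression_plus(self, p):
        "expression : expression PLUS term"
        p[0] = p[1] + p[3]

    def p_expression_term(self, p):
        "expression : term"
        p[0] = p[1]

    def p_term_number(self, p):
        "term : NUMBER"
        p[0] = p[1]


def collect_productions(owner):
    """Return (lhs, rhs_symbols, function_name) for every p_* function on owner."""
    productions = []
    for name, obj in vars(owner).items():
        if name.startswith("p_") and callable(obj) and obj.__doc__:
            lhs, rhs = obj.__doc__.split(":", 1)
            productions.append((lhs.strip(), rhs.split(), name))
    return productions
```

collect_productions(ToyGrammar) returns the three rules in definition order. A real generator would go on to build the LR tables from them, which this sketch does not attempt.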
Yacc allows ambiguous grammars and can resolve the ambiguity through precedence declarations. One example, for arithmetic operations, from the documentation:
    expression : expression PLUS expression
This creates ambiguity when parsing expressions like
    3 + 4 * 5
With precedence declared, Yacc knows to group higher-precedence operators before lower-precedence ones, so 4 * 5 is grouped before the addition.
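To see what is at stake, the two readings of 3 + 4 * 5 give different answers: (3 + 4) * 5 is 35, while 3 + (4 * 5) is 23. The sketch below is my own stdlib-only illustration using precedence climbing, a hand-written technique, not the LALR table machinery that Yacc generates. PRECEDENCE, parse, and evaluate are made-up names, and both operators are assumed to be left-associative.

```python
import re

# Higher number binds tighter; both operators are left-associative.
PRECEDENCE = {"+": 1, "*": 2}


def tokenize(text):
    # Split into integer literals and the two supported operators.
    return re.findall(r"\d+|[+*]", text)


def parse(tokens, min_prec=1):
    """Parse tokens into a nested tuple tree using precedence climbing."""
    left = int(tokens.pop(0))
    while tokens and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand may only contain tighter operators.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


def evaluate(tree):
    """Evaluate a tree produced by parse()."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b
```

parse(tokenize("3 + 4 * 5")) produces ("+", 3, ("*", 4, 5)), which evaluates to 23. The other grouping, ("*", ("+", 3, 4), 5), evaluates to 35, which is the reading a precedence table rules out.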
One example (from the examples in the official PLY 3.14 release) of an expression definition with precedence declared:
    precedence = (
A collection of examples can be found here.
Afterthoughts
PLY is an interesting tool that I want to build something with. There’s also a variant based on PLY called PLYPlus that tries to provide a cleaner interface for programmers. Still, I have a hunch that it could be done better.
GCC used to use a Bison-generated parser for its front end, but it now uses a hand-written recursive-descent parser for performance reasons. So does Clang. As for parser generators in language implementations, as far as I know, Ruby uses a Yacc grammar for its parser, and Python uses ASDL to describe its abstract syntax tree; both are worth digging into when I have time.
I wonder why so few people mention using PLY as a tool for language processing. It could be quite handy when you are building something that parses a relatively complex grammar, needs a fast development cycle, and is not performance-critical. If I encounter a project like that in the future, I think PLY will be near the top of my list of tools.