Principles of top-down parsing

(Implementation of top-down parsing)

8th lecture, June 7, 2019

Language Theory and Compilers

http://www.sw.it.aoyama.ac.jp/2019/Compiler/lecture8.html

Martin J. Dürst

Today's Schedule

Leftovers, summary, and homework for last lecture
Top-down parsing
Recursive descent parsing
Implementation of recursive descent parsing
How to deal with various problems in a grammar:
- Limitation of number of operations
- Priority
- Left/right associativity
- Left recursion

Leftovers from Last Lecture

Summary of Last Lecture

flex homework: Lexical analysis/regular expressions for C
Regular expression for C comments
Grammars for various languages
How to construct a grammar
Results of parsing: Parse tree and abstract syntax tree

Comment about Grammars Collected Last Week

Many students submitted descriptions of syntax, not grammars
These students received fewer points

Last Week's Homework 1

(Removed due to circumstances)

Last Week's Homework 2

(Removed due to circumstances)

Last Week's Homework 3

(Removed due to circumstances)

About Ambiguous Grammars

Grammars that may produce more than one parse tree for the same input are called ambiguous grammars
Sometimes, it may seem okay to allow an ambiguous grammar (e.g. mathematically, a+(b+c) = (a+b)+c). But the two groupings can behave differently when an intermediate result overflows, and the grouping matters for non-associative operators (e.g. a-(b-c) ≠ (a-b)-c).
Some grammars can be changed to remove ambiguity
There is no general algorithm to remove ambiguity; there is also no algorithm to decide whether removal is possible for a given grammar
Ambiguous grammars are okay when a grammar only has to define which strings belong to a language, but are not suited for a programming language or a data format, where each input needs exactly one structure
Whether a grammar is ambiguous and whether its language requires a nondeterministic pushdown automaton for recognition are separate problems
(example: palindromes have an unambiguous grammar, but cannot be recognized by a deterministic pushdown automaton)

General Top-Down Parsing

Create parse tree starting with the start symbol of the grammar
Expand parse tree depth-first from the left
If there is a choice (of rewriting rules), try each one in turn
Once the parse tree reaches a terminal symbol, compare with input
- If there is a match, continue
- If there is no match, give up and try another choice (backtracking)

Main Points of Backtracking

Backtracking tries all possible pathways (similar to finding the exit of a labyrinth without a map)

Backtracking may be very slow, but this can be improved:

Change the grammar so that backtracking is reduced or eliminated
(ideally, the next token should be enough to select a single rewriting rule)
Use lookahead (check some more tokens) to eliminate some choices
Remember intermediate results (packrat parser)
- Similar to making marks in a labyrinth
- Example implementation: treetop (for Ruby)

Recursive Descent Parsing

Implementation of top-down parsing, easy to write by hand
Create a function for each non-terminal symbol of the grammar
In the function, proceed along the right side of the rewriting rule(s):
- For a terminal symbol, compare with input
- For a non-terminal symbol, call the corresponding function
- Use branching (if,...) for a choice (|) in the grammar
- Use repetition (while,...) for repetition in BNF
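Reason for the name: grammar rules are recursive, so the corresponding functions call each other recursively while descending the parse tree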
Example: A → variable '=' A | integer (assignment expression)

Recursive Descent Parsing: Simple Hand-Written Parser

Program files: scanner.h, scanner.c, parser1.c

How to compile: gcc scanner.c parser1.c && ./a

Details of Recursive Descent Parsing: Lexical Analysis

(see scanner.c)

Invariant:
- The next character is always in nextChar (one-character lookahead)
- As soon as a character is processed, the next character is read into nextChar
nextChar is a global variable (can be changed to a function parameter)
How to use from parser:
- Initialize with initScanner
- Read tokens with getNextToken
Implementation of getNextToken:
- One-character tokens: Direct decision
- Multiple-character tokens: Decide on first character, read the rest with a dedicated function

Details of Recursive Descent Parsing: Parsing

(see parser1.c)

Invariant (same as for lexical analysis):
- The next token to be looked at is always in nextToken (one-token lookahead)
- As soon as a token is processed, the next token is read into nextToken
nextToken is a global variable, but this can be changed to a function parameter
Overall usage:
- Initialize using initScanner and getNextToken
- Call the function corresponding to the start symbol of the grammar (e.g. Expression())
- Further process the returned value (abstract syntax tree or result of evaluation)

Details of Recursive Descent Parsing: Non-Terminal Symbols

Create a function for each non-terminal symbol of the grammar
In the function, deal with all rewriting rules for this non-terminal symbol
For each non-terminal on the right-hand side, call the corresponding function
For each terminal on the right-hand side, compare with nextToken

How to Deal with Left Recursion

Example of left recursion:

E → E '-' integer | integer

Wrong solution (change of associativity):

E → integer '-' E | integer

Correct solution:

E → integer Econtinued

Econtinued → '-' integer Econtinued | ε

In (E)BNF:

E → integer {'-' integer}

Differences between Grammars and Regular Expressions

Grammar:

Multiple rules
Non-terminal symbols, derivation from left-hand side to right-hand side
For simple grammars, *, (), and | are not available

Regular Expression:

Limited to one single rule
No non-terminal symbols, only right-hand side
For practical regular expressions, lots of metacharacters/functionality besides *, (), and | available

A simple regular expression corresponds to a single rewriting rule in a (BNF, ...) grammar

Homework

Deadline: June 13, 2019 (Thursday), 19:00

Where to submit: Box in front of room O-529 (building O, 5th floor)

Format: A4 double-sided printout of parser program. Stapled in upper left if more than one page, no cover page, no wrapping lines, legible font size, non-proportional font, portrait (not landscape), formatted (indents,...) for easy visibility, name (kanji and kana) and student number as a comment at the top

Collaboration: The same rules as for Computer Practice I (計算機実習 I) apply

Expand the top-down parser of parser1.c to correctly deal with the four basic arithmetic operations.
(scanner.h/c do not change, so no need to submit them)
(bonus problem) Add more operations to the top-down parser, and/or deal with parentheses.
(If you solve this problem, also submit the scanner.h/c files, but only one parser.c file for both problems.)
Bring your notebook computer to the next lecture. Check again that flex, bison, make, and gcc are installed.

Glossary

ambiguous grammar: 曖昧な文法
recursive descent parsing: 再帰的下向き構文解析
depth-first: 深さ優先
lookahead: 先読み
backtracking: バックトラック
labyrinth: 迷路
right associative: 右結合
invariant: 不変条件
left recursion: 左再帰