Principles of top-down parsing
(Implementation of Top-Down Parsing)
8th lecture, June 8, 2018 
Language Theory and Compilers 
http://www.sw.it.aoyama.ac.jp/2018/Compiler/lecture8.html
Martin J. Dürst

© 2005-18 Martin J. Dürst, Aoyama Gakuin University
Today's Schedule
  - Leftovers, summary, and homework from the last lecture
- Top-down parsing
- Recursive descent parsing
- Implementation of recursive descent parsing
- How to deal with various problems in a grammar: 
    
      - Limitation of number of operations
- Priority
- Left/right associativity
- Left recursion
 
 
Leftovers from Last Lecture
 
Summary of Last Lecture
  - flex homework: Lexical analysis/regular expressions for
  C
- Regular expression for C comments
- Grammars for various languages
- How to construct a grammar
- Results of parsing: Parse tree and abstract syntax tree
 
Last Week's Homework 1
(In the problems below, n, +, -, *, and / are terminal symbols, and any
other letters are non-terminal symbols. n denotes an arbitrary number, and the
other symbols denote the four basic arithmetic operations.)
For the three grammars below, construct all the possible parse trees for
words of length 5. Find the grammar that allows all and only those parse trees
that lead to correct results.
  - E → n | E - E
- E → n | n - E
- E → n | E - n
(Removed due to circumstances)
Last Week's Homework 2
Same as in problem 1 for the four grammars below.
  - E → n | E + E | E * E
- E → n | E + n | E * n
- E → T | T + T; T → n | n * n
- E → T | T * T; T → n | n + n
(Removed due to circumstances)
Last Week's Homework 3
(Bonus problem) Based on the knowledge obtained when solving problems 1 and
2, create a grammar that allows expressions with the four arithmetic
operations (without parentheses) to be calculated correctly. Check this grammar
with expressions of length 5.
(Removed due to circumstances)
 
About Ambiguous Grammars
  - Grammars that may produce more than one parse tree for the same input are
    called ambiguous grammars
- Some grammars can be changed to remove ambiguity
- There is no general algorithm to remove ambiguity; there is also no
    algorithm to decide whether removal is possible for a given grammar
- Ambiguous grammars are okay when only defining the syntax of a language,
    but are not suited for a programming language or a data format
- Whether a grammar is ambiguous and whether its language requires a
    nondeterministic pushdown automaton for recognition are separate
    problems
 (example: palindrome)
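
Ambiguity matters as soon as the parse tree is used to compute a result. The sketch below builds, by hand, the two parse trees that the ambiguous grammar E → n | E - E (grammar 1 of last week's homework) allows for an input of the form a - b - c, and evaluates both. The Node type and all function names are made up for this illustration.

```c
#include <stddef.h>

/* A parse tree node: a leaf holds a number (left and right are NULL),
   an inner node stands for "left - right". */
typedef struct Node {
    int value;
    const struct Node *left;
    const struct Node *right;
} Node;

/* Evaluate a tree bottom-up. */
int eval(const Node *n) {
    if (n->left == NULL) {
        return n->value;
    }
    return eval(n->left) - eval(n->right);
}

/* Tree 1: the left E is expanded again, giving (a - b) - c. */
int eval_left_grouping(int a, int b, int c) {
    Node na = { a, NULL, NULL }, nb = { b, NULL, NULL }, nc = { c, NULL, NULL };
    Node inner = { 0, &na, &nb };
    Node root = { 0, &inner, &nc };
    return eval(&root);
}

/* Tree 2: the right E is expanded again, giving a - (b - c). */
int eval_right_grouping(int a, int b, int c) {
    Node na = { a, NULL, NULL }, nb = { b, NULL, NULL }, nc = { c, NULL, NULL };
    Node inner = { 0, &nb, &nc };
    Node root = { 0, &na, &inner };
    return eval(&root);
}
```

For 5 - 3 - 1, eval_left_grouping(5, 3, 1) gives 1 and eval_right_grouping(5, 3, 1) gives 3. Only the first agrees with ordinary arithmetic, so a grammar that permits both trees cannot be used as is.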
 
General Top-Down Parsing
  - Create parse tree starting with the start symbol of the grammar
- Expand parse tree depth-first from the left
- If there is a choice (of rewriting rules), try each one in turn
 
- Once the parse tree reaches a terminal symbol, compare with input 
    
      - If there is a match, continue
- If there is no match, give up and try another choice
      (backtracking)
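
The procedure above can be sketched as a small recognizer in C. This is a minimal sketch under several assumptions: the grammar is E → n '-' E | n (grammar 2 of last week's homework), the input is a string consisting of the characters 'n' and '-', and the function names are invented for this example. It tries the alternatives in order and commits to the first one that succeeds, which is enough for this grammar but is a simplification of exploring every possible pathway.

```c
#include <string.h>

static const char *input;   /* the string being parsed */

/* Compare the terminal t with the input at position pos.
   Returns the position after the terminal, or -1 if there is no match. */
static int match(char t, int pos) {
    return (pos >= 0 && input[pos] == t) ? pos + 1 : -1;
}

/* E → n '-' E | n
   Returns the position after E, or -1 if no alternative matches. */
static int parse_E(int pos) {
    /* First choice: n '-' E */
    int p = match('n', pos);
    if (p >= 0) p = match('-', p);
    if (p >= 0) p = parse_E(p);
    if (p >= 0) return p;

    /* No match: give up, go back to pos, and try the second choice: n */
    return match('n', pos);
}

/* Returns 1 if the whole string s can be derived from E, else 0. */
int recognize(const char *s) {
    input = s;
    return parse_E(0) == (int)strlen(s);
}
```

For example, recognize("n-n-n") returns 1, while recognize("n-") returns 0: the first choice fails after the '-', the parser backtracks and matches a single n, but then input is left over.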
 
 
Main Points of Backtracking
Backtracking tries all possible pathways (similar to finding the exit of a
labyrinth without a map)
Backtracking may be very slow, but this can be improved:
  - Change the grammar so that backtracking is reduced or eliminated
 (ideally, the next token should be enough to select a single rewriting
  rule)
- Use lookahead (check some more tokens) to eliminate some
  choices
- Remember intermediate results (packrat parser) 
    
      - Similar to making marks in a labyrinth
- Example implementation: treetop (for Ruby)
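
Remembering intermediate results can be illustrated with a small memo table. The sketch below is not how treetop works internally; it is a hand-written C example under assumptions of its own: the grammar E → T '+' E | T, T → '(' E ')' | 'n', single-character tokens, and invented function names. Only T is memoized to keep the code short, whereas a packrat parser would memoize every non-terminal. Without the table, each failed first alternative of E parses the same T a second time, so the work doubles with every level of parentheses.

```c
#include <string.h>

#define MAX_LEN 64
#define UNKNOWN (-2)   /* memo entry not computed yet */
#define FAIL    (-1)   /* no parse at this position */

static const char *input;
static int memo_T[MAX_LEN + 1];  /* memo_T[pos]: result of parse_T(pos) */
static long calls_T;             /* number of non-memoized parse_T runs */

static int parse_E(int pos);

static int match(char t, int pos) {
    return (pos >= 0 && input[pos] == t) ? pos + 1 : FAIL;
}

/* T → '(' E ')' | 'n' */
static int parse_T(int pos) {
    if (memo_T[pos] != UNKNOWN) {
        return memo_T[pos];      /* the "mark in the labyrinth" */
    }
    calls_T++;
    int p = match('(', pos);
    if (p >= 0) p = parse_E(p);
    if (p >= 0) p = match(')', p);
    if (p < 0) p = match('n', pos);
    memo_T[pos] = p;
    return p;
}

/* E → T '+' E | T */
static int parse_E(int pos) {
    int p = parse_T(pos);
    if (p >= 0) p = match('+', p);
    if (p >= 0) p = parse_E(p);
    if (p >= 0) return p;
    return parse_T(pos);         /* backtrack; answered from the memo table */
}

/* Returns 1 if s is derived from E; inputs longer than MAX_LEN are rejected. */
int recognize(const char *s) {
    size_t len = strlen(s);
    if (len > MAX_LEN) return 0;
    input = s;
    calls_T = 0;
    for (size_t i = 0; i <= len; i++) memo_T[i] = UNKNOWN;
    return parse_E(0) == (int)len;
}

/* Number of times parse_T did real work during the last recognize(). */
long last_call_count(void) {
    return calls_T;
}
```

After recognize("((((n))))"), last_call_count() reports 5: one real run of parse_T per starting position, even though parse_E asks for each of them twice.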
 
 
Recursive Descent Parsing
  - Implementation of top-down parsing, easy to write by hand
- Create a function for each non-terminal symbol of the grammar
- In the function, proceed along the right side of the rewriting rule(s): 
    
      - For a terminal symbol, compare with input
- For a non-terminal symbol, call the corresponding function
- Use branching (if,...) for a choice (|) in
        the grammar
- Use repetition (while,...) for repetition in BNF
 
- Reason for name: Recursive grammar rules
 Example: A → variable '=' A | integer (assignment
    expression)
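
As a rough illustration of these points, the following sketch turns the example rule A → variable '=' A | integer into one C function. To avoid needing a scanner, it assumes that every token is a single character: a lowercase letter is a variable and a digit is an integer. The names parse_A and is_assignment_expression are invented here.

```c
#include <ctype.h>

static const char *next;   /* points at the next unread character */

/* A → variable '=' A | integer
   Returns 1 on success, 0 on a syntax error. */
static int parse_A(void) {
    if (islower((unsigned char)*next)) {    /* choice 1: variable '=' A */
        next++;                             /* terminal: variable */
        if (*next != '=') {
            return 0;                       /* terminal '=' expected */
        }
        next++;
        return parse_A();                   /* non-terminal: recursive call */
    }
    if (isdigit((unsigned char)*next)) {    /* choice 2: integer */
        next++;
        return 1;
    }
    return 0;                               /* neither choice applies */
}

/* Returns 1 if the whole string s is an assignment expression. */
int is_assignment_expression(const char *s) {
    next = s;
    return parse_A() && *next == '\0';
}
```

The first character is enough to select the rewriting rule, so no backtracking is needed. For example, is_assignment_expression("a=b=3") returns 1 and is_assignment_expression("a=b") returns 0.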
 
 
Recursive Descent Parsing: Simple Hand-Written Parser
Program files: scanner.h, scanner.c, parser1.c
How to compile: gcc scanner.c parser1.c && ./a
 
Details of Recursive Descent Parsing: Lexical Analysis
(see scanner.c)
  - Invariant: 
    
      - The next character is always in nextChar (one-character
        lookahead)
- As soon as a character is processed, the next character is read into
        nextChar
 
- nextChar is a global variable (can be changed to a function
    parameter)
- How to use from parser: 
    
      - Initialize with initScanner
- Read tokens with getNextToken
 
- Implementation of getNextToken:
      - One-character tokens: Direct decision
- Multiple-character tokens: Decide on first character, read the rest
        with a dedicated function
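
The following is a minimal sketch of a scanner that keeps this invariant; it is not the code of scanner.c. The names nextChar, initScanner, and getNextToken come from the slides, but their signatures here are assumptions, as are the Token type, the token kinds, the helper functions advance and readInteger, and the decision to read from a string rather than from standard input.

```c
#include <ctype.h>

typedef enum { TOK_INTEGER, TOK_MINUS, TOK_END, TOK_ERROR } TokenType;

typedef struct {
    TokenType type;
    int value;              /* only meaningful for TOK_INTEGER */
} Token;

static const char *source;  /* rest of the input after nextChar */
static int nextChar;        /* invariant: always the next unprocessed character */

/* Read the next character into nextChar ('\0' marks the end of input). */
static void advance(void) {
    nextChar = (unsigned char)*source;
    if (nextChar != '\0') {
        source++;
    }
}

/* Assumed signature: scan the string text. */
void initScanner(const char *text) {
    source = text;
    advance();              /* establish the invariant */
}

/* Multiple-character token: the first digit is already in nextChar. */
static Token readInteger(void) {
    Token t = { TOK_INTEGER, 0 };
    while (isdigit(nextChar)) {
        t.value = t.value * 10 + (nextChar - '0');
        advance();
    }
    return t;
}

Token getNextToken(void) {
    Token t = { TOK_ERROR, 0 };
    while (nextChar == ' ') {           /* skip spaces between tokens */
        advance();
    }
    if (nextChar == '\0') {
        t.type = TOK_END;
    } else if (nextChar == '-') {       /* one-character token: direct decision */
        t.type = TOK_MINUS;
        advance();
    } else if (isdigit(nextChar)) {     /* decide on the first character */
        t = readInteger();
    } else {
        advance();                      /* unknown character: skip, report error */
    }
    return t;
}
```

After initScanner("12 - 3"), successive calls to getNextToken return an integer token with value 12, a minus token, an integer token with value 3, and then the end token.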
 
 
Details of Recursive Descent Parsing: Parsing
(see parser1.c)
  - Invariant (same as for lexical analysis): 
    
      - The next token to be looked at is always in nextToken (one-token lookahead)
- As soon as a token is processed, the next token is read into
        nextToken
 
- nextToken is a global variable, but this can be changed to a
    function parameter
- Overall usage: 
    
      - Initialize using initScanner and getNextToken
- Call the function corresponding to the start symbol of the grammar
        (e.g. Expression())
- Further process the returned value (abstract syntax tree or result of
        evaluation)
 
 
Details of Recursive Descent Parsing: Non-Terminal Symbols
  - Create a function for each non-terminal symbol of the grammar
- In the function, deal with all rewriting rules for this non-terminal
    symbol
- For each non-terminal on the right-hand side, call the corresponding
    function
- For each terminal on the right-hand side, compare with
    nextToken
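
Put together, the parsing invariant and the one-function-per-non-terminal rule might look like the sketch below. It is not parser1.c. The grammar Expression → Number | Number '-' Number, Number → integer is a hypothetical one with a limited number of operations; the scanner is a minimal stand-in with assumed signatures; evaluate, syntaxError, and the Token type are invented for this example.

```c
#include <ctype.h>

typedef enum { TOK_INTEGER, TOK_MINUS, TOK_END, TOK_ERROR } TokenType;
typedef struct { TokenType type; int value; } Token;

/* Minimal stand-in scanner over a string (assumption, not scanner.c). */
static const char *source;

static void initScanner(const char *text) { source = text; }

static Token getNextToken(void) {
    Token t = { TOK_ERROR, 0 };
    while (*source == ' ') source++;
    if (*source == '\0') {
        t.type = TOK_END;
    } else if (*source == '-') {
        t.type = TOK_MINUS;
        source++;
    } else if (isdigit((unsigned char)*source)) {
        t.type = TOK_INTEGER;
        while (isdigit((unsigned char)*source)) {
            t.value = t.value * 10 + (*source - '0');
            source++;
        }
    } else {
        source++;
    }
    return t;
}

static Token nextToken;   /* invariant: the next token to be looked at */
static int syntaxError;   /* set to 1 when a comparison with nextToken fails */

/* Number → integer */
static int Number(void) {
    int value = 0;
    if (nextToken.type == TOK_INTEGER) {   /* terminal: compare with nextToken */
        value = nextToken.value;
        nextToken = getNextToken();        /* token processed: read the next one */
    } else {
        syntaxError = 1;
    }
    return value;
}

/* Expression → Number | Number '-' Number */
static int Expression(void) {
    int result = Number();                 /* non-terminal: call its function */
    if (nextToken.type == TOK_MINUS) {     /* branching for the choice */
        nextToken = getNextToken();
        result -= Number();
    }
    return result;
}

/* Overall usage: initialize, call the start symbol, process the result.
   *ok is set to 1 if text was parsed completely without errors. */
int evaluate(const char *text, int *ok) {
    initScanner(text);
    nextToken = getNextToken();
    syntaxError = 0;
    int result = Expression();
    if (nextToken.type != TOK_END) {
        syntaxError = 1;                   /* input left over */
    }
    *ok = !syntaxError;
    return result;
}
```

With this grammar, evaluate("12 - 5", &ok) returns 7 with ok set to 1, while "1 - 2 - 3" is reported as an error because the grammar allows at most one operation.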
 
How to Deal with Left Recursion
Example of left recursion:
E → E '-' integer | integer
Wrong solution (change of associativity):
E → integer '-' E | integer
Correct solution:
E → integer EE
EE → '-' integer EE | ε
In (E)BNF:
E → integer {'-' integer}
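
A function written directly from E → E '-' integer would call itself before reading any token and so would never terminate. The sketch below implements the three alternatives above so that their results can be compared. It is an illustration only, assuming well-formed input; the scanner is a minimal stand-in rather than scanner.c, and all function names (parseLoop, parseTail, parseRight, and the helpers) are invented here.

```c
#include <ctype.h>

typedef enum { TOK_INTEGER, TOK_MINUS, TOK_END, TOK_ERROR } TokenType;
typedef struct { TokenType type; int value; } Token;

/* Minimal stand-in scanner over a string (assumption, not scanner.c). */
static const char *source;
static Token nextToken;          /* one-token lookahead */

static Token getNextToken(void) {
    Token t = { TOK_ERROR, 0 };
    while (*source == ' ') source++;
    if (*source == '\0') {
        t.type = TOK_END;
    } else if (*source == '-') {
        t.type = TOK_MINUS;
        source++;
    } else if (isdigit((unsigned char)*source)) {
        t.type = TOK_INTEGER;
        while (isdigit((unsigned char)*source)) {
            t.value = t.value * 10 + (*source - '0');
            source++;
        }
    } else {
        source++;
    }
    return t;
}

static void start(const char *text) {
    source = text;
    nextToken = getNextToken();
}

/* Terminal "integer": return its value and read the next token. */
static int integer(void) {
    int value = 0;
    if (nextToken.type == TOK_INTEGER) {
        value = nextToken.value;
        nextToken = getNextToken();
    }
    return value;
}

/* E → integer {'-' integer}: the loop subtracts from left to right. */
int parseLoop(const char *text) {
    start(text);
    int result = integer();
    while (nextToken.type == TOK_MINUS) {
        nextToken = getNextToken();
        result -= integer();
    }
    return result;
}

/* EE → '-' integer EE | ε, with the value so far passed as a parameter. */
static int EE(int left) {
    if (nextToken.type == TOK_MINUS) {
        nextToken = getNextToken();
        int right = integer();
        return EE(left - right);
    }
    return left;                 /* ε: nothing more to subtract */
}

/* E → integer EE */
int parseTail(const char *text) {
    start(text);
    return EE(integer());
}

/* E → integer '-' E | integer: groups from the right (wrong for '-'). */
static int ERight(void) {
    int left = integer();
    if (nextToken.type == TOK_MINUS) {
        nextToken = getNextToken();
        return left - ERight();
    }
    return left;
}

int parseRight(const char *text) {
    start(text);
    return ERight();
}
```

For the input "7 - 2 - 1", parseLoop and parseTail both return 4, which is (7 - 2) - 1, while parseRight returns 6, which is 7 - (2 - 1). The grammar change in the wrong solution removes the left recursion but also changes the associativity.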
 
Differences between Grammars and Regular Expressions
Grammar:
  - Multiple rules
- Non-terminal symbols, derivation from left-hand side to right-hand
  side
- For simple grammars, *, (), and | are not available
Regular Expression:
  - Limited to one single rule
- No non-terminal symbols, only right-hand side
- For practical regular expressions, lots of metacharacters/functionality
    besides *, (), and | available
A simple regular expression corresponds to a single rewriting rule in a
(BNF, ...) grammar
 
Homework
Deadline: June 21, 2018 (Thursday), 19:00
Where to submit: Box in front of room O-529 (building O, 5th floor)
Format: A4 double-sided printout of parser program. Stapled in upper left if
more than one page, no cover page, no wrapping lines, legible font size,
non-proportional font, portrait (not landscape), formatted (indents,...) for
easy visibility, name (kanji and kana) and student number as a comment at the
top
Collaboration: The same rules as for Computer Practice I (計算機実習 I)
apply
  - Expand the top-down parser of parser1.c
    to correctly deal with the four basic arithmetic operations.
 (scanner.h/c do not change, so there is no need to submit
    them)
- (bonus problem) Add more operations to the top-down parser, and/or deal
    with parentheses.
 (If you solve this problem, also submit the scanner.h/c files, but only one parser.c file for both problems.)
- Bring your notebook computer to the next lecture. Check again that
    flex, bison, make, and gcc are installed.
 
Glossary
  - ambiguous grammar
- 曖昧な文法
- recursive descent parsing
- 再帰的下向き構文解析
- depth-first
- 深さ優先
- lookahead
- 先読み
- backtracking
- バックトラック
- labyrinth
- 迷路
- right associative
- 右結合
- invariant
- 不変条件
- left recursion
- 左再帰