Tools for Lexical Analysis

(字句解析ツール)

5th lecture, May 17, 2019

Language Theory and Compilers

http://www.sw.it.aoyama.ac.jp/2019/Compiler/lecture5.html

Martin J. Dürst

Today's Schedule

Homework from last lecture
Leftovers from last lecture
Automating lexical analysis
How to use cygwin
Overview of flex
flex exercises

Last Week's Homework 4

Bring your notebook PC (with flex, bison, gcc, make, diff, and m4 installed and usable)

Leftovers from Last Week

Missing Diagram: DFA Minimization

Missing Diagram: NFA for `r`|`s`

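Connect with ε transitions: the overall initial state to the initial states of r and s, and the accepting states of r and s to the overall accepting state.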

Missing Diagram: NFA for `r`*

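Connect with ε transitions: the overall initial state to the initial state of r, the accepting state of r to the overall accepting state, the overall initial state to the overall accepting state, and the accepting state of r to the initial state of r (backwards!).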
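The two constructions above (for r|s and for r*) can be sketched in C as follows. This is only an illustration: the types Edge and Fragment, the helpers newState, addEdge, nfaChar, nfaUnion, nfaStar, nfaMatches, and matchesExample, and the fixed array sizes are all hypothetical names and assumptions, not part of the lecture material. The sketch builds the NFA for (a|b)* and simulates it directly, without converting it to a DFA.

```c
#include <string.h>

#define MAX_STATES 64        /* assumption: the example never needs more */
#define MAX_EDGES  256
#define EPSILON    '\0'      /* edge label used for ε transitions */

typedef struct { int from, to; char label; } Edge;
typedef struct { int start, accept; } Fragment;   /* one (sub-)NFA */

static Edge edges[MAX_EDGES];
static int num_edges = 0, num_states = 0;

static int newState(void) { return num_states++; }

static void addEdge(int from, int to, char label)
{
    Edge e = { from, to, label };
    edges[num_edges++] = e;
}

/* NFA for a single character c: start --c--> accept */
Fragment nfaChar(char c)
{
    Fragment f;
    f.start = newState();
    f.accept = newState();
    addEdge(f.start, f.accept, c);
    return f;
}

/* NFA for r|s */
Fragment nfaUnion(Fragment r, Fragment s)
{
    Fragment f;
    f.start = newState();
    f.accept = newState();
    addEdge(f.start, r.start, EPSILON);    /* overall start -> start of r */
    addEdge(f.start, s.start, EPSILON);    /* overall start -> start of s */
    addEdge(r.accept, f.accept, EPSILON);  /* accept of r -> overall accept */
    addEdge(s.accept, f.accept, EPSILON);  /* accept of s -> overall accept */
    return f;
}

/* NFA for r* */
Fragment nfaStar(Fragment r)
{
    Fragment f;
    f.start = newState();
    f.accept = newState();
    addEdge(f.start, r.start, EPSILON);    /* enter r */
    addEdge(r.accept, f.accept, EPSILON);  /* leave r */
    addEdge(f.start, f.accept, EPSILON);   /* zero repetitions */
    addEdge(r.accept, r.start, EPSILON);   /* backwards: repeat r */
    return f;
}

/* Add every state reachable through ε transitions to the set. */
static void closure(unsigned char in_set[])
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int e = 0; e < num_edges; e++)
            if (edges[e].label == EPSILON
                && in_set[edges[e].from] && !in_set[edges[e].to]) {
                in_set[edges[e].to] = 1;
                changed = 1;
            }
    }
}

/* Simulate the NFA on s by tracking the set of current states. */
int nfaMatches(Fragment f, const char *s)
{
    unsigned char cur[MAX_STATES] = { 0 }, next[MAX_STATES];
    cur[f.start] = 1;
    closure(cur);
    for (; *s != '\0'; s++) {
        memset(next, 0, sizeof next);
        for (int e = 0; e < num_edges; e++)
            if (edges[e].label == *s && cur[edges[e].from])
                next[edges[e].to] = 1;
        closure(next);
        memcpy(cur, next, sizeof cur);
    }
    return cur[f.accept];
}

/* Build the NFA for (a|b)* from scratch and test one string. */
int matchesExample(const char *s)
{
    Fragment a, b;
    num_edges = num_states = 0;
    a = nfaChar('a');
    b = nfaChar('b');
    return nfaMatches(nfaStar(nfaUnion(a, b)), s);
}
```

For example, matchesExample("abba") and matchesExample("") both return 1, while matchesExample("abc") returns 0.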

Homework

Deadline (changed!): May 23, 2019 (Thursday), 19:00
(If you already submitted problems 1/2, just submit problem 3 on a separate sheet of paper.)

Where to submit: Box in front of room O-529 (building O, 5th floor)

Format: A4 single page (using both sides is okay; NO cover page), easily readable handwriting (NO printouts), name (kanji and kana) and student number at the top right

Construct the state transition diagram for the NFA corresponding to the following grammar
S → εA | bB | cB | cC, A → bC | aD | a | cS, B → aD | aC | bB | a, C → εA | aD | a
(Caution: In right linear grammars, ε is not allowed except in the rule S → ε)
(Hint: Create a new accepting state F)

Convert the following transition table to a right linear grammar

	0	1
→T	G	H
*G	K	L
*H	M	K
*K	K	K
*L	M	K
M	L	-

Construct the state transition diagram for the regular expression ab|c*d
(write down both the result of the procedure explained during this lecture (with all ε transitions) as well as a version that is as simple as possible)

Summary for Regular Languages

(Non-)Deterministic Finite Automata
(Left|Right) Linear Grammars
Regular Expressions

These all have the same power, describe/recognize regular languages, and can be converted into each other.

Compilation Stages

Lexical analysis
Parsing (syntax analysis)
Semantic analysis
Optimization (4th stage; may instead come 5th, after code generation)
Code generation (5th stage; may instead come 4th, before optimization)

Compiler Structure

Parsing is the core of the front end (analysis) or of the whole compiler
The parser repeatedly obtains the next token from the lexical analyzer
(with a function such as getNextToken())
The parser calls semantic analysis,... when needed
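The interplay between parser and lexical analyzer can be sketched in C as below. Only the idea of a getNextToken() function comes from the slide; the Token type, the token kinds, and the helpers initLexer and sumExpression are hypothetical. The "parser" is deliberately trivial: it accepts sums such as 12+30+4 and adds up the numbers.

```c
#include <ctype.h>

/* Hypothetical token kinds for a tiny language of sums such as "12+30+4". */
typedef enum { TOK_NUMBER, TOK_PLUS, TOK_END, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    int value;                       /* only meaningful for TOK_NUMBER */
} Token;

static const char *input_pos = "";   /* current position in the input */

/* Point the lexical analyzer at a new input string. */
void initLexer(const char *text)
{
    input_pos = text;
}

/* Lexical analyzer: skip blanks, then return the next token. */
Token getNextToken(void)
{
    Token tok = { TOK_END, 0 };

    while (*input_pos == ' ')
        input_pos++;

    if (*input_pos == '\0') {
        tok.kind = TOK_END;
    } else if (isdigit((unsigned char)*input_pos)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*input_pos))
            tok.value = tok.value * 10 + (*input_pos++ - '0');
    } else if (*input_pos == '+') {
        tok.kind = TOK_PLUS;
        input_pos++;
    } else {
        tok.kind = TOK_ERROR;
        input_pos++;
    }
    return tok;
}

/* "Parser" for: number ('+' number)*
   Returns the sum, or -1 on a syntax error. */
int sumExpression(const char *text)
{
    int sum = 0;
    Token tok;

    initLexer(text);
    tok = getNextToken();
    for (;;) {
        if (tok.kind != TOK_NUMBER)
            return -1;
        sum += tok.value;            /* stands in for semantic analysis */
        tok = getNextToken();
        if (tok.kind == TOK_END)
            return sum;
        if (tok.kind != TOK_PLUS)
            return -1;
        tok = getNextToken();
    }
}
```

With these definitions, sumExpression("12+30+4") returns 46 and sumExpression("1+") returns -1. The parser never looks at individual characters; it only sees tokens.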

Implementing Lexical Analysis

Lexical analysis is the first stage of a compiler
Analyzing/extracting the "words" (tokens) of a programming language
Processing is character by character
→ speed/efficiency is important
Structure of tokens is simple
→ Regular language is powerful enough
Efficient implementation is possible
(regular expressions → NFA → DFA → minimized DFA)

Choices:

Write a lexical analyzer by hand (tedious and error-prone)
Use a tool to automate creation of lexical analyzer (e.g. flex)
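To give an impression of the hand-written choice, here is a minimal sketch in C of a table-driven DFA for the regular expression [a-z][a-z0-9]*. The regular expression, the character classes, and the names transition, accepting, charClass, and isIdentifier are chosen for illustration only. Tools such as flex generate tables of this kind automatically.

```c
enum { CLASS_LETTER, CLASS_DIGIT, CLASS_OTHER, NUM_CLASSES };
enum { DEAD = -1 };                  /* no transition available */

/* Transition table of a DFA for [a-z][a-z0-9]*.
   State 0 is the initial state, state 1 is the only accepting state. */
static const int transition[2][NUM_CLASSES] = {
    /*            letter digit other */
    /* state 0 */ { 1,   DEAD, DEAD },
    /* state 1 */ { 1,   1,    DEAD },
};
static const int accepting[2] = { 0, 1 };

/* Map a character to its column in the transition table. */
static int charClass(char c)
{
    if (c >= 'a' && c <= 'z') return CLASS_LETTER;
    if (c >= '0' && c <= '9') return CLASS_DIGIT;
    return CLASS_OTHER;
}

/* Return 1 if the whole string is accepted by the DFA, 0 otherwise.
   Each input character costs one table lookup. */
int isIdentifier(const char *s)
{
    int state = 0;
    for (; *s != '\0'; s++) {
        state = transition[state][charClass(*s)];
        if (state == DEAD)
            return 0;
    }
    return accepting[state];
}
```

For example, isIdentifier("abc1") returns 1 and isIdentifier("1abc") returns 0. Even for this tiny language, every change to the regular expression means editing the table by hand, which is where the tedium and the errors come from.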

Overview of `flex`

Lexical analyzer generator
Open-source version of lex, with various extensions
lex: Lexical analyzer generator available with Unix (creator: Mike Lesk)
Easy to write/create lexical analyzers e.g. for compilers
Works well with parser generator bison

How to Use Cygwin

(reminder)

ls: list the files in a directory
mkdir: create a new directory
cd: change (working) directory
pwd: print (current) working directory
gcc: compile a C program
./a: execute a compiled program

Cygwin and Hard Disks

(reminder)

Assumption: cygwin is installed in C:\cygwin
Usually, only directories below C:\cygwin can be reached
The user's home directory is e.g. C:\cygwin\home\user1
This can be displayed as /home/user1 with pwd
How to escape C:\cygwin:
cd /cygdrive/c (change directory to the C drive of MS Windows)

`flex` Usage Steps

Create an input file for flex (a (f)lex file), with the extension .l (example: test.l)
Use flex to convert test.l to a C program:
$ flex test.l
(the output file is named lex.yy.c)
Compile lex.yy.c with a C compiler (maybe together with other files)
Execute the compiled program

Two Ways to Use `flex`

Independent file processing (use regular expressions to recognize or change parts of a file):
Call the yylex() function once from the main function
Combination with parser:
Repeatedly call yylex() from the parser, and return a token with return

In today's exercises and homework, we will use the first way (independent file processing).

Later in this course, we will use the second way (combination with a parser), together with bison.

Example of `flex` Input Format

        int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;

%%
int main(void)
{
        yylex();
        printf( "# of lines = %d, # of chars = %d\n",
                num_lines, num_chars );
}

int yywrap () { return 1; }
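For comparison, the counting done by the two rules above could be written by hand in C roughly as follows. This is a sketch, not what flex generates: the type Counts and the function countText are hypothetical, and the input is taken from a string instead of standard input to keep the example short.

```c
/* Counts of lines and characters, as in the flex example. */
typedef struct {
    int num_lines;
    int num_chars;
} Counts;

/* Look at each character once and update the counters. */
Counts countText(const char *text)
{
    Counts c = { 0, 0 };
    for (; *text != '\0'; text++) {
        if (*text == '\n') {     /* corresponds to the rule for \n */
            ++c.num_lines;
            ++c.num_chars;
        } else {                 /* corresponds to the rule for .  */
            ++c.num_chars;
        }
    }
    return c;
}
```

For the text "ab\ncd\n", countText reports 2 lines and 6 characters. In the flex version, the if/else is replaced by the two patterns, and the loop over the input is supplied by yylex().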

`flex` Exercise 1

Process and execute the flex program on the previous slide

Create a file test.l and copy the contents of the previous slide to the file
Create the file lex.yy.c with
$ flex test.l
Create the executable file a.exe with
$ gcc lex.yy.c
Execute the program with some input from standard input
$ ./a <file

Skeleton of `flex` Input Format

declarations,... (C programming language)
declarations,... (C programming language)
%%
regexp    statement (C programming language)
regexp    statement (C programming language)
%%
functions,... (C programming language)
functions,... (C programming language)

Structure of `flex` Input Format

Mixture of flex-specific instructions and C program fragments

Three main parts, separated by two %%:

Preparation/setup part:
- C #includes, #defines
- Definition and initialization of global variables
- Definition of regular expression components
Flex rules:
- Left side: Regular expressions to be recognized (lexical rules)
- Right side: Program fragments executed on recognition
Rest of C program (functions,...)

Newlines and indent can be significant!

How to Study `flex`

Read the manual: English, Japanese
To complete the homework, the use of the manual is necessary
Caution: A manual is not a novel.
Read the output of flex (lex.yy.c)
Compare the output of flex for different inputs
Read the source code of flex (flex also uses lexical analysis, which is written using flex)
Use options to output internal information (example: flex -v)

How `flex` Works

Convert each regular expression to an NFA,
and associate each accepting state with the corresponding C fragment
Combine all NFAs into a single large NFA (if multiple C fragments are available at an accepting state, use the earliest fragment in the .l file)
Convert this NFA to a DFA
Minimize the DFA
Create and initialize the necessary tables
Copy the program that executes the DFA
Copy the C fragments in the .l file
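The step from NFA to DFA in the list above can be illustrated with a small subset construction in C. This is a sketch under several assumptions, not flex's actual implementation: the NFA (for the regular expression a|ab, built with ε transitions as for r|s) is hard-coded, sets of NFA states are stored as bit masks, minimization is left out, and all names (StateSet, Dfa, epsilonClosure, findOrAdd, buildDfa, dfaAccepts, dfaStateCount) are hypothetical.

```c
#include <stdint.h>

#define NFA_STATES 7
#define ALPHABET   2                 /* symbol 0 is 'a', symbol 1 is 'b' */
#define MAX_DFA    (1 << NFA_STATES) /* upper bound on the number of subsets */
#define NFA_START  0
#define NFA_ACCEPT 6

typedef uint32_t StateSet;   /* bit i set: NFA state i is in the set */

/* NFA for a|ab: 0 -ε-> 1, 0 -ε-> 3, 1 -a-> 2, 3 -a-> 4, 4 -b-> 5,
   2 -ε-> 6, 5 -ε-> 6; state 6 is accepting. */
static const StateSet nfa_eps[NFA_STATES] = {
    (1u << 1) | (1u << 3), 0, 1u << 6, 0, 0, 1u << 6, 0
};
static const StateSet nfa_move[NFA_STATES][ALPHABET] = {
    { 0, 0 }, { 1u << 2, 0 }, { 0, 0 }, { 1u << 4, 0 },
    { 0, 1u << 5 }, { 0, 0 }, { 0, 0 }
};

typedef struct {
    int num_states;
    StateSet nfa_set[MAX_DFA];       /* NFA states behind each DFA state */
    int next[MAX_DFA][ALPHABET];     /* -1 means no transition */
    int accepting[MAX_DFA];
} Dfa;

/* Add all states reachable via ε transitions. */
static StateSet epsilonClosure(StateSet set)
{
    StateSet old;
    do {
        old = set;
        for (int s = 0; s < NFA_STATES; s++)
            if (set & (1u << s))
                set |= nfa_eps[s];
    } while (set != old);
    return set;
}

/* Return the DFA state for this set of NFA states, creating it if new. */
static int findOrAdd(Dfa *dfa, StateSet set)
{
    for (int i = 0; i < dfa->num_states; i++)
        if (dfa->nfa_set[i] == set)
            return i;
    int i = dfa->num_states++;
    dfa->nfa_set[i] = set;
    dfa->accepting[i] = (int)((set >> NFA_ACCEPT) & 1u);
    return i;
}

/* Subset construction: DFA states are created while the loop runs. */
void buildDfa(Dfa *dfa)
{
    dfa->num_states = 0;
    findOrAdd(dfa, epsilonClosure(1u << NFA_START));
    for (int d = 0; d < dfa->num_states; d++) {
        for (int c = 0; c < ALPHABET; c++) {
            StateSet target = 0;
            for (int s = 0; s < NFA_STATES; s++)
                if (dfa->nfa_set[d] & (1u << s))
                    target |= nfa_move[s][c];
            dfa->next[d][c] =
                target ? findOrAdd(dfa, epsilonClosure(target)) : -1;
        }
    }
}

/* Build the DFA and run it on a string over {a, b}. */
int dfaAccepts(const char *s)
{
    Dfa dfa;
    int state = 0;
    buildDfa(&dfa);
    for (; *s != '\0'; s++) {
        if (*s != 'a' && *s != 'b')
            return 0;
        state = dfa.next[state][*s - 'a'];
        if (state < 0)
            return 0;
    }
    return dfa.accepting[state];
}

/* Number of DFA states produced for the example NFA. */
int dfaStateCount(void)
{
    Dfa dfa;
    buildDfa(&dfa);
    return dfa.num_states;
}
```

For this NFA with 7 states, the resulting DFA has only 3 states. In a real generator, each accepting DFA state would additionally record which C fragment to execute.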

How the Program Created by `flex` Works

Read the input character by character
Match the longest possible input string with a regular expression
If multiple regular expressions match the same length input string,
use the earliest regular expression in the .l file
Always remember the last accepting state passed and the input up to that state; use this state/input when there is no next state
If a match is found, execute the corresponding C fragment
If no match is found:
- Output the first input character
- Start processing again with the next input character
Repeat starting with the first character after a match or the next character after a non-match
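The matching strategy described above (longest match, earliest rule on ties, remembering the last accepting state, echoing unmatched characters) can be sketched in C as follows. The four rules ("if", [a-z]+, [0-9]+, and [0-9]+"."[0-9]+), the hand-built DFA, the output format, and the names nextState, append, and scan are assumptions made for this illustration; the code that flex generates is table-driven and considerably more elaborate.

```c
#include <string.h>

#define NO_STATE (-1)
#define NO_RULE  (-1)

/* Rules in the order they would appear in the .l file:
   0: "if"   1: [a-z]+   2: [0-9]+   3: [0-9]+"."[0-9]+ */
static const char *const rule_name[] = { "IF", "ID", "INT", "FLOAT" };

/* Rule accepted in each DFA state. In state 2 both "if" and [a-z]+
   match; the earlier rule (0) wins. */
static const int accepting_rule[] = { NO_RULE, 1, 0, 1, 2, NO_RULE, 3 };

static int isLetter(char c) { return c >= 'a' && c <= 'z'; }
static int isDigit(char c)  { return c >= '0' && c <= '9'; }

/* Hand-built DFA for all four rules combined. */
static int nextState(int state, char c)
{
    switch (state) {
    case 0: if (c == 'i') return 1;
            if (isLetter(c)) return 3;
            if (isDigit(c)) return 4;
            break;
    case 1: if (c == 'f') return 2;
            if (isLetter(c)) return 3;
            break;
    case 2:
    case 3: if (isLetter(c)) return 3;
            break;
    case 4: if (isDigit(c)) return 4;
            if (c == '.') return 5;
            break;
    case 5:
    case 6: if (isDigit(c)) return 6;
            break;
    }
    return NO_STATE;
}

/* Append n characters of text to out, truncating if out is full. */
static void append(char *out, size_t out_size, const char *text, size_t n)
{
    size_t used = strlen(out);
    if (n > out_size - used - 1)
        n = out_size - used - 1;
    memcpy(out + used, text, n);
    out[used + n] = '\0';
}

/* Scan the input; matches become "[RULE:text]", anything else is echoed.
   The result is kept in a static buffer (fine for a sketch only). */
const char *scan(const char *input)
{
    static char out[256];
    out[0] = '\0';
    while (*input != '\0') {
        int state = 0, last_rule = NO_RULE;
        size_t len = 0, last_len = 0;

        /* Run the DFA as far as possible, remembering the last accept. */
        while ((state = nextState(state, input[len])) != NO_STATE) {
            len++;
            if (accepting_rule[state] != NO_RULE) {
                last_rule = accepting_rule[state];
                last_len = len;
            }
        }
        if (last_rule == NO_RULE) {          /* no match: echo one char */
            append(out, sizeof out, input, 1);
            input++;
        } else {                             /* back up to the last accept */
            append(out, sizeof out, "[", 1);
            append(out, sizeof out, rule_name[last_rule],
                   strlen(rule_name[last_rule]));
            append(out, sizeof out, ":", 1);
            append(out, sizeof out, input, last_len);
            append(out, sizeof out, "]", 1);
            input += last_len;
        }
    }
    return out;
}
```

For the input 12.x, the DFA reads 12. before getting stuck at x, then falls back to the last accepting state and reports only 12 as an integer; the period is echoed and x is scanned as a new token. The input iffy is recognized as one identifier rather than the keyword if followed by fy, because the longest match wins.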

`flex` Exercise 2

The table below shows how to escape various characters in XML
Create a program in flex (for this conversion, and) for the reverse conversion

Raw text	XML escapes
`'`	`&apos;`
`"`	`&quot;`
`&`	`&amp;`
`<`	`&lt;`
`>`	`&gt;`

`flex` Exercise 3: Detect Numbers

Create a program with flex to output the input without changes, except that numbers are enclosed with >>> and <<<

Example input:

abc123def345gh

Example output:

abc>>>123<<<def>>>345<<<gh

Hint: The string recognized by a regular expression is available with the variable yytext

`flex` Exercise 4 (Homework):
General Rules

Deadline: May 30, 2019 (Thursday), 19:00

Where to submit: Box in front of room O-529, available starting May 24

(start early, so that you can ask questions on May 24)

Format:

A4 double-sided printout of flex input file (.l file)
Stapled in upper left if more than two pages of output
NO cover page, Line numbers, NO wrapping lines, legible font size, non-proportional font, portrait (not landscape), formatted (indents,...) for easy visibility
Name (kanji and kana) and student number as a comment at the top

Collaboration: The same rules as for Computer Practice I (計算機実習 I) apply

`flex` Exercise 4 (Homework):
Lexical Analysis for XML

XML (see W3C XML Recommendation) is a generalization of HTML for document and data formats. XML is stricter than HTML (e.g. attribute values are always quoted,...). Using flex, create a program that takes an arbitrary XML file as input and outputs its syntactic components, one component per line.

Simple example input: <letter>Hello &amp; Happy World!</letter>

Simple example output:

Start tag: <letter>
  Contents: Hello
  Entity: &amp;
  Contents: Happy World!
End tag: </letter>

Details:

Input and output can be limited to US-ASCII only
White space/newline only content can be ignored
Start tag, end tag, contents, entity, comment, and processing instruction are required
Show the nesting level by indenting the output
Output a warning if the number of end tags does not match the number of start tags
Attributes (attribute names and attribute values) can be handled as an advanced problem (発展問題)

Frequent Problems with `flex`

If there is an error in the .l file, flex may still run without errors
Solution: Always start with flex:
> flex file.l && gcc lex.yy.c && ./a <input.txt
In the first part of the .l file (before the first %%), C program fragments have to be indented by at least one space
Don't forget int yywrap () { return 1; }

Hints for Homework

This homework will take considerably more time than the previous homeworks
You can ask questions about this homework next week
You can assume that the input is restricted to US-ASCII
In some cases, the order of the regular expressions (rules) is important
The text matched with a regular expression is available in the variable yytext
Example: putchar(yytext[0]); will output the first character of the matched text
Metacharacters in regular expressions have to be escaped with \ or quoted within ""

Announcement: Minitest

There will be a minitest (30 minutes) next week (Friday, May 24). Please prepare well!

Glossary

automate: 自動化する
parser: 構文解析器
lexical analyzer: 字句解析器
lexical analyzer generator: 字句解析器生成系 (生成器)
parser generator: 構文解析器生成系 (生成器)
extension: 拡張子
skeleton: 骨格
definition: 定義
initialization: 初期化
integer literal: 整数定数
character literal: 文字定数
