# Tools for Lexical Analysis

(字句解析ツール)

## Language Theory and Compilers

http://www.sw.it.aoyama.ac.jp/2019/Compiler/lecture5.html

### Martin J. Dürst

© 2005-19 Martin J. Dürst 青山学院大学

# Today's Schedule

• Homework from last lecture
• Leftovers from last lecture
• Automating lexical analysis
• How to use cygwin
• Overview of `flex`
• `flex` exercises

# Last Week's Homework 4

Bring your notebook PC (with `flex`, `bison`, `gcc`, `make`, `diff`, and `m4` installed and usable)

# Homework

Deadline (changed!): May 23, 2019 (Thursday), 19:00
(If you already submitted problems 1/2, just submit problem 3 on a separate sheet of paper.)

Where to submit: Box in front of room O-529 (building O, 5th floor)

Format: A4 single page (using both sides is okay; NO cover page), easily readable handwriting (NO printouts), name (kanji and kana) and student number at the top right

1. Construct the state transition diagram for the NFA corresponding to the following grammar
S → εA | bB | cB | cC, A → bC | aD | a | cS, B → aD | aC | bB | a, C → εA | aD | a
(Caution: In right linear grammars, ε is not allowed except in the rule S → ε)
(Hint: Create a new accepting state F)
2. Convert the following transition table to a right linear grammar
| state | 0 | 1 |
|-------|---|---|
| →T    | G | H |
| *G    | K | L |
| *H    | M | K |
| *K    | K | K |
| *L    | M | K |
| M     | L | - |
3. Construct the state transition diagram for the regular expression `ab|c*d`
(write down both the result of the procedure explained during this lecture (with all ε transitions) as well as a version that is as simple as possible)

# Summary for Regular Languages

• (Non-)Deterministic Finite Automata
• (Left|Right) Linear Grammars
• Regular Expressions

These all have the same power, describe/recognize regular languages, and can be converted into each other.

# Compilation Stages

1. Lexical analysis
2. Parsing (syntax analysis)
3. Semantic analysis
4. Optimization (or 5)
5. Code generation (or 4)

# Compiler Structure

• Parsing is the core of the front end (analysis) or of the whole compiler
• The parser repeatedly obtains the next token from the lexical analyzer
(with a function such as `getNextToken()`)
• The parser calls semantic analysis,... when needed

# Implementing Lexical Analysis

• Lexical analysis is the first stage of a compiler
• Analyzing/extracting the "words" (tokens) of a programming language
• Processing is character by character
→ speed/efficiency is important
• Structure of tokens is simple
→ Regular language is powerful enough
• Efficient implementation is possible
(regular expressions → NFA → DFA → minimized DFA)

Choices:

• Write a lexical analyzer by hand (tedious and error-prone)
• Use a tool to automate creation of lexical analyzer (e.g. `flex`)

# Overview of `flex`

• Lexical analyzer generator
• Open-source version of `lex`, with various extensions
• `lex`: Lexical analyzer generator available with Unix (creator: Mike Lesk)
• Easy to write/create lexical analyzers e.g. for compilers
• Works well with parser generator `bison`

# How to Use Cygwin

(reminder)

• `ls`: list the files in a directory
• `mkdir`: create a new directory
• `cd`: change (working) directory
• `pwd`: print (current) working directory
• `gcc`: compile a C program
• `./a`: execute a compiled program

# Cygwin and Harddisks

(reminder)

• Assumption: cygwin is installed in `C:\cygwin`
• Usually, only directories below `C:\cygwin` can be reached
• The user's home directory is e.g. `C:\cygwin\home\user1`
• This can be displayed as `/home/user1` with `pwd`
• How to escape `C:\cygwin`:
`cd /cygdrive/c` (change directory to the C drive of MS Windows)

# `flex` Usage Steps

1. Create an input file for `flex` (a (f)lex file), with the extension `.l` (example: `test.l`)
2. Use `flex` to convert `test.l` to a C program:
$ `flex test.l`
(the output file is named `lex.yy.c`)
3. Compile `lex.yy.c` with a C compiler (maybe together with other files)
4. Execute the compiled program

# Two Ways to Use `flex`

1. Independent file processing (use regular expressions to recognize or change parts of a file):

Call the `yylex()` function once from the `main` function

2. Combination with parser:

Repeatedly call `yylex()` from the parser, and return a token with `return`

In today's exercises and homework, we will use method 1.

Later in this course, we will use method 2 together with `bison`.

# Example of `flex` Input Format

```
    int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main(void)
{
    yylex();
    printf("# of lines = %d, # of chars = %d\n",
           num_lines, num_chars);
}

int yywrap() { return 1; }
```

# `flex` Exercise 1

Process and execute the `flex` program on the previous slide

1. Create a file `test.l` and copy the contents of the previous slide to the file
2. Create the file `lex.yy.c` with
$ `flex test.l`
3. Create the executable file `a.exe` with
$ `gcc lex.yy.c`
4. Execute the program with some input from standard input
$ `./a <file`

# Skeleton of `flex` Input Format

```
declarations,... (C program language)
%%
regexp    statement (C program language)
%%
functions,... (C program language)
```

# Structure of `flex` Input Format

Mixture of `flex`-specific instructions and C program fragments

Three main parts, separated by two `%%`:

1. Preparation/setup part:
• C `#include`s, `#define`s
• Definition and initialization of global variables
• Definition of regular expression components
2. Flex rules:
• Left side: Regular expressions to be recognized (lexical rules)
• Right side: Program fragments executed on recognition
3. Rest of C program (functions,...)

Newlines and indent can be significant!

# How to Study `flex`

• Read the manual: English, Japanese
To complete the homework, the use of the manual is necessary

Caution: A manual is not a novel.

• Read the output of `flex` (`lex.yy.c`)
• Compare the output of `flex` for different inputs
• Read the source code of `flex` (`flex` also uses lexical analysis, which is written using `flex`)
• Use options to output internal information (example: `flex -v`)

# How `flex` Works

• Convert each regular expression to an NFA,
and associate each accepting state with the corresponding C fragment
• Combine all NFAs into a single large NFA (if multiple C fragments are available at an accepting state, use the earliest fragment in the `.l` file)
• Convert this NFA to a DFA
• Minimize the DFA
• Create and initialize the necessary tables
• Copy the program that executes the DFA
• Copy the C fragments in the `.l` file

# How the Program Created by `flex` Works

• Read the input character by character
• Match the longest possible input string with a regular expression
• If multiple regular expressions match the same length input string,
use the earliest regular expression in the `.l` file
• Always remember the last accepting state passed and the input up to that state; use this state/input when there is no next state
• If a match is found, execute the corresponding C fragment
• If no match is found:
• Output the first input character
• Start processing again with the next input character
• Repeat starting with the first character after a match or the next character after a non-match

# `flex` Exercise 2

The table below shows how to escape various characters in XML
Create a program in `flex` for the reverse conversion (and, optionally, for this conversion itself)

| Raw text | XML escape |
|----------|------------|
| `'`      | `&apos;`   |
| `"`      | `&quot;`   |
| `&`      | `&amp;`    |
| `<`      | `&lt;`     |
| `>`      | `&gt;`     |

# `flex` Exercise 3: Detect Numbers

Create a program with `flex` to output the input without changes, except that numbers are enclosed with `>>>` and `<<<`

Example input:

`abc123def345gh`

Example output:

`abc>>>123<<<def>>>345<<<gh`

Hint: The string recognized by a regular expression is available with the variable `yytext`

# `flex` Exercise 4 (Homework): General Rules

Deadline: May 30, 2019 (Thursday), 19:00

Where to submit: Box in front of room O-529, available starting May 24

(start early, so that you can ask questions on May 24)

Format:

• A4 double-sided printout of `flex` input file (`.l` file)
• Stapled in upper left if more than two pages of output
• NO cover page; WITH line numbers; NO wrapped lines; legible font size; non-proportional font; portrait (not landscape); formatted (indents,...) for easy readability
• Name (kanji and kana) and student number as a comment at the top

Collaboration: The same rules as for Computer Practice I (計算機実習 I) apply

# `flex` Exercise 4 (Homework): Lexical Analysis for XML

XML (see W3C XML Recommendation) is a generalization of HTML for document and data formats. XML is stricter than HTML (e.g. attribute values are always quoted,...). Using `flex`, create a program that takes an arbitrary XML file as input and outputs its syntactic components, one component per line.

Simple example input: `<letter>Hello &amp; Happy World!</letter>`

Simple example output:

```
Start tag: <letter>
Contents: Hello
Entity: &amp;
Contents: Happy World!
End tag: </letter>
```

Details:

• Input and output can be limited to US-ASCII only
• White space/newline only content can be ignored
• Start tag, end tag, contents, entity, comment, and processing instruction are required
• Show the nesting level by indenting the output
• Output a warning if the number of end tags does not match the number of start tags
• Attributes (attribute names and attribute values) can be handled as an advanced problem (発展問題)

# Frequent Problems with `flex`

• If there is an error in the `.l` file, `flex` may still run without errors

Solution: Always rerun the whole chain, starting with `flex`:

$ `flex file.l && gcc lex.yy.c && ./a <input.txt`

• In the first part of the `.l` file (before the first `%%`), C program fragments have to be indented by at least one space
• Don't forget `int yywrap () { return 1; }`

# Hints for Homework

• This homework will take considerably more time than the previous homeworks
• You can assume that the input is restricted to US-ASCII
• In some cases, the order of the regular expressions (rules) is important
• The text matched with a regular expression is available in the variable `yytext`
Example: `putchar(yytext[0]);` will output the first character of the matched text
• Metacharacters in regular expressions have to be escaped with `\` or quoted within `""`

# Announcement: Minitest

There will be a minitest (30 minutes) next week (Friday, May 24). Please prepare well!

# Glossary

• automate
• parser
• lexical analyzer
• lexical analyzer generator
• parser generator
• extension
• skeleton
• definition
• initialization
• integer literal
• character literal