# Tools for Lexical Analysis

(字句解析ツール)

## Language Theory and Compilers

### Martin J. Dürst

# Today's Schedule

• Homework from last lecture
• Leftovers from last lecture
• Automating lexical analysis
• How to use cygwin
• Overview of `flex`
• `flex` exercises

# Last Week's Homework 4

Bring your notebook PC (with `flex`, `bison`, `gcc`, `make`, `diff`, and `m4` installed and usable)

# Homework

1. Construct the state transition diagram for the NFA corresponding to the following grammar
S → εA | bB | cB | cC, A → bC | aD | a | cS, B → aD | aC | bB | a, C → εA | aD | a
(Caution: In right linear grammars, ε is not allowed except in the rule S → ε)
(Hint: Create a new accepting state F)
2. Convert the following transition table to a right linear grammar
   |     | 0 | 1 |
   |-----|---|---|
   | →T  | G | H |
   | *G  | K | L |
   | *H  | M | K |
   | *K  | K | K |
   | *L  | M | K |
   | M   | L | - |
3. Construct the state transition diagram for the regular expression `ab|c*d`
(write down both the result of the procedure explained during this lecture (with all ε transitions) and a version that is as simple as possible)

# Summary for Regular Languages

• (Non-)Deterministic Finite Automata
• (Left|Right) Linear Grammars
• Regular Expressions

These all have the same power, describe/recognize regular languages, and can be converted into each other.

# Compilation Stages

1. Lexical analysis
2. Parsing (syntax analysis)
3. Semantic analysis
4. Optimization (or 5)
5. Code generation (or 4)

# Compiler Structure

• Parsing is the core of the front end (analysis) or of the whole compiler
• The parser repeatedly obtains the next token from the lexical analyzer
(with a function such as `getNextToken()`)
• The parser calls semantic analysis and the later stages when needed
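
The interaction between parser and lexical analyzer can be sketched in plain C as follows. This is only an illustration under assumptions: the token kinds, the `Lexer` struct, and the functions `get_next_token()` and `parse_sum()` are hypothetical names, and the "language" is just sums of non-negative integers. The addition inside the parser stands in for the semantic processing that the parser triggers when needed.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for a tiny language of sums. */
typedef enum { TOK_EOF, TOK_NUMBER, TOK_PLUS, TOK_UNKNOWN } TokenKind;

typedef struct {
    TokenKind kind;
    int value;            /* only meaningful for TOK_NUMBER */
} Token;

/* Hypothetical lexer state: the input string and the current position. */
typedef struct {
    const char *input;
    size_t pos;
} Lexer;

/* Hand-written stand-in for the lexical analyzer: skips blanks, then
   returns the next token and advances the position. */
Token get_next_token(Lexer *lx)
{
    Token t = { TOK_EOF, 0 };
    while (lx->input[lx->pos] == ' ')
        lx->pos++;
    unsigned char c = (unsigned char)lx->input[lx->pos];
    if (c == '\0')
        return t;
    if (isdigit(c)) {
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)lx->input[lx->pos]))
            t.value = t.value * 10 + (lx->input[lx->pos++] - '0');
        return t;
    }
    lx->pos++;
    t.kind = (c == '+') ? TOK_PLUS : TOK_UNKNOWN;
    return t;
}

/* Stand-in for the parser: pulls tokens one at a time and adds up the
   numbers of an input such as "1 + 20 + 3". Returns -1 on a syntax error. */
int parse_sum(const char *text)
{
    Lexer lx = { text, 0 };
    int sum = 0;
    Token t = get_next_token(&lx);
    for (;;) {
        if (t.kind != TOK_NUMBER)
            return -1;            /* a number is expected here */
        sum += t.value;           /* "semantic" action for a number */
        t = get_next_token(&lx);
        if (t.kind == TOK_EOF)
            return sum;
        if (t.kind != TOK_PLUS)
            return -1;            /* only '+' may follow a number */
        t = get_next_token(&lx);
    }
}
```

The parser never looks at individual characters; it only sees tokens. With `flex` and `bison`, the role of `get_next_token()` is played by `yylex()`.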

# Implementing Lexical Analysis

• Lexical analysis is the first stage of a compiler
• Analyzing/extracting the "words" (tokens) of a programming language
• Processing is character by character
→ speed/efficiency is important
• Structure of tokens is simple
→ Regular language is powerful enough
• Efficient implementation is possible
(regular expressions → NFA → DFA → minimized DFA)

Choices:

• Write a lexical analyzer by hand (tedious and error-prone)
• Use a tool to automate creation of lexical analyzer (e.g. `flex`)
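
To give an idea of why a DFA-based implementation is efficient, here is a minimal hand-written sketch of a table-driven DFA in C. The token (an identifier matching `[a-z][a-z0-9]*`), the state names, and the function `is_identifier()` are assumptions chosen for illustration; the sketch also assumes US-ASCII input.

```c
#include <stddef.h>

/* DFA for the regular expression [a-z][a-z0-9]* (a hypothetical
   identifier token). States: 0 = start, 1 = accepting, 2 = dead. */
enum { START = 0, IDENT = 1, DEAD = 2, NUM_STATES = 3 };

/* Character classes keep the transition table small. */
enum { CLS_LETTER = 0, CLS_DIGIT = 1, CLS_OTHER = 2, NUM_CLASSES = 3 };

static int char_class(char c)
{
    if (c >= 'a' && c <= 'z') return CLS_LETTER;
    if (c >= '0' && c <= '9') return CLS_DIGIT;
    return CLS_OTHER;
}

/* next_state[state][class] */
static const int next_state[NUM_STATES][NUM_CLASSES] = {
    /* START */ { IDENT, DEAD,  DEAD },
    /* IDENT */ { IDENT, IDENT, DEAD },
    /* DEAD  */ { DEAD,  DEAD,  DEAD },
};

/* Returns 1 if the whole string is an identifier, 0 otherwise.
   The loop does one table lookup per input character. */
int is_identifier(const char *s)
{
    int state = START;
    for (size_t i = 0; s[i] != '\0'; i++)
        state = next_state[state][char_class(s[i])];
    return state == IDENT;
}
```

Writing such tables by hand for all tokens of a real language is exactly the tedious and error-prone part that a tool can take over.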

# Overview of `flex`

• Lexical analyzer generator
• Open-source version of `lex`, with various extensions
• `lex`: Lexical analyzer generator available with Unix (creator: Mike Lesk)
• Easy to write/create lexical analyzers e.g. for compilers
• Works well with parser generator `bison`

# How to Use cygwin

• `ls`: list the files in a directory
• `mkdir`: create a new directory
• `cd`: change (working) directory
• `pwd`: print (current) working directory
• `gcc`: compile a C program
• `./a`: execute a compiled program

• Assumption: cygwin is installed in `C:\cygwin`
• Usually, only directories below `C:\cygwin` can be reached
• The user's home directory is e.g. `C:\cygwin\home\user1`
• This can be displayed as `/home/user1` with `pwd`
• How to escape `C:\cygwin`:
`cd /cygdrive/c` (change directory to the C drive of MS Windows)

# `flex` Usage Steps

1. Create an input file for `flex` (a (f)lex file), with the extension `.l` (example: `test.l`)
2. Use `flex` to convert `test.l` to a C program:
\$ `flex test.l`
(the output file is named `lex.yy.c`)
3. Compile `lex.yy.c` with a C compiler (maybe together with other files)
4. Execute the compiled program

# Two Ways to Use `flex`

1. Independent file processing (use regular expressions to recognize or change parts of a file):

Call the `yylex()` function once from the `main` function

2. Combination with parser:

Repeatedly call `yylex()` from the parser, and return a token with `return`

In today's exercises and homework, we will use 1.

Later in this course, we will use 2. together with `bison`.

# Example of `flex` Input Format

```
        int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main(void)
{
    yylex();
    printf( "# of lines = %d, # of chars = %d\n",
            num_lines, num_chars );
}

int yywrap () { return 1; }
```

# `flex` Exercise 1

Process and execute the `flex` program on the previous slide

1. Create a file `test.l` and copy the contents of the previous slide to the file
2. Create the file `lex.yy.c` with
\$ `flex test.l`
3. Create the executable file `a.exe` with
\$ `gcc lex.yy.c`
4. Execute the program with some input from standard input
\$ `./a <file`

# Skeleton of `flex` Input Format

```
declarations,... (C program language)
declarations,... (C program language)
%%
regexp    statement (C program language)
regexp    statement (C program language)
%%
functions,... (C program language)
functions,... (C program language)
```

# Structure of `flex` Input Format

Mixture of `flex`-specific instructions and C program fragments

Three main parts, separated by two `%%`:

1. Preparation/setup part:
• C `#include`s, `#define`s
• Definition and initialization of global variables
• Definition of regular expression components
2. Flex rules:
• Left side: Regular expressions to be recognized (lexical rules)
• Right side: Program fragments executed on recognition
3. Rest of C program (functions,...)

Newlines and indentation can be significant!

# How to Study `flex`

• Read the manual: English, Japanese
You will need the manual to complete the homework

Caution: A manual is not a novel.

• Read the output of `flex` (`lex.yy.c`)
• Compare the output of `flex` for different inputs
• Read the source code of `flex` (`flex` also uses lexical analysis, which is written using `flex`)
• Use options to output internal information (example: `flex -v`)

# How `flex` Works

• Convert each regular expression to an NFA,
and associate each accepting state with the corresponding C fragment
• Combine all NFAs into a single large NFA (if multiple C fragments are available at an accepting state, use the earliest fragment in the `.l` file)
• Convert this NFA to a DFA
• Minimize the DFA
• Create and initialize the necessary tables
• Copy the program that executes the DFA
• Copy the C fragments in the `.l` file
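
The step "Convert this NFA to a DFA" can be illustrated with a small subset construction in C. This is a sketch under stated assumptions, not the code `flex` actually uses: the example NFA (for `(a|b)*abb`, with an extra start state reached by one ε transition), the bitmask representation, and all names (`eps_closure`, `build_dfa`, `dfa_accepts`, ...) are made up for illustration.

```c
#include <stdint.h>

#define NFA_STATES 5      /* states 0..4 of the example NFA */
#define SYMBOLS    2      /* alphabet: 0 = 'a', 1 = 'b' */
#define MAX_DFA    32     /* upper bound: 2^5 subsets */

/* State sets are bitmasks: bit i set means NFA state i is in the set. */
typedef uint32_t StateSet;

/* Example NFA for (a|b)*abb with an extra start state:
   0 -ε-> 1, 1 -a-> {1,2}, 1 -b-> {1}, 2 -b-> {3}, 3 -b-> {4}; 4 accepts. */
static const StateSet nfa_eps[NFA_STATES] = { 1u << 1, 0, 0, 0, 0 };
static const StateSet nfa_delta[NFA_STATES][SYMBOLS] = {
    { 0, 0 },
    { (1u << 1) | (1u << 2), 1u << 1 },
    { 0, 1u << 3 },
    { 0, 1u << 4 },
    { 0, 0 },
};
static const StateSet nfa_accepting = 1u << 4;

/* Add every state reachable through ε transitions alone. */
static StateSet eps_closure(StateSet set)
{
    StateSet prev;
    do {
        prev = set;
        for (int s = 0; s < NFA_STATES; s++)
            if (set & (1u << s))
                set |= nfa_eps[s];
    } while (set != prev);
    return set;
}

/* Resulting DFA: each DFA state is one set of NFA states. */
StateSet dfa_sets[MAX_DFA];
int dfa_delta[MAX_DFA][SYMBOLS];
int dfa_count;

/* Return the index of `set` among the DFA states, adding it if new. */
static int dfa_state_for(StateSet set)
{
    for (int i = 0; i < dfa_count; i++)
        if (dfa_sets[i] == set)
            return i;
    dfa_sets[dfa_count] = set;
    return dfa_count++;
}

/* Subset construction; DFA state 0 is the start state. */
void build_dfa(void)
{
    dfa_count = 0;
    dfa_state_for(eps_closure(1u << 0));
    /* dfa_count grows while the loop runs, so it doubles as a worklist. */
    for (int d = 0; d < dfa_count; d++) {
        for (int sym = 0; sym < SYMBOLS; sym++) {
            StateSet target = 0;
            for (int s = 0; s < NFA_STATES; s++)
                if (dfa_sets[d] & (1u << s))
                    target |= nfa_delta[s][sym];
            dfa_delta[d][sym] = dfa_state_for(eps_closure(target));
        }
    }
}

/* Run the DFA on a string; any character other than 'a' is treated
   as 'b'. Returns 1 if the string is accepted. Call build_dfa() first. */
int dfa_accepts(const char *s)
{
    int state = 0;
    for (; *s != '\0'; s++)
        state = dfa_delta[state][*s == 'a' ? 0 : 1];
    return (dfa_sets[state] & nfa_accepting) != 0;
}
```

For this NFA, the construction yields five DFA states: {0,1}, {1,2}, {1}, {1,3}, and {1,4}. The states {0,1} and {1} behave identically, so the minimization step would merge them and leave four states.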

# How the Program Created by `flex` Works

• Read the input character by character
• Match the longest possible input string with a regular expression
• If multiple regular expressions match the same length input string,
use the earliest regular expression in the `.l` file
• Always remember the last accepting state passed and the input up to that state; use this state/input when there is no next state
• If a match is found, execute the corresponding C fragment
• If no match is found:
  • Output the first input character
  • Start processing again with the next input character
• Repeat starting with the first character after a match or the next character after a non-match
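
The matching loop described above can be sketched by hand in C. The two rules, the DFA states, and the function names (`next_state`, `accepting_rule`, `scan`) are assumptions made for this illustration; the real generated code is table-driven and considerably more involved. The sketch shows the longest match, the fallback to the last accepting state, and the copying of unmatched characters. It does not exercise the tie-breaking between rules, because no input is matched at the same length by both rules.

```c
#include <string.h>

/* Hypothetical rules, in the order they would appear in a .l file:
     rule 1: [0-9]+\.[0-9]+   -> output <FLOAT:text>
     rule 2: [0-9]+           -> output <INT:text>
   DFA states: 0 start, 1 after digits (accepts rule 2),
   2 after digits and '.', 3 after digits '.' digits (accepts rule 1). */
enum { NO_STATE = -1, NO_RULE = 0, RULE_FLOAT = 1, RULE_INT = 2 };

static int next_state(int state, char c)
{
    int digit = (c >= '0' && c <= '9');
    switch (state) {
    case 0:  return digit ? 1 : NO_STATE;
    case 1:  return digit ? 1 : (c == '.' ? 2 : NO_STATE);
    case 2:  return digit ? 3 : NO_STATE;
    case 3:  return digit ? 3 : NO_STATE;
    default: return NO_STATE;
    }
}

static int accepting_rule(int state)
{
    return state == 1 ? RULE_INT : state == 3 ? RULE_FLOAT : NO_RULE;
}

/* Append n bytes of s to the string in out; output that does not
   fit into the capacity cap is silently dropped. */
static void append(char *out, size_t cap, const char *s, size_t n)
{
    size_t used = strlen(out);
    if (used + n < cap) {
        memcpy(out + used, s, n);
        out[used + n] = '\0';
    }
}

/* Scan `in` and write the result to `out` (capacity `cap`, at least 1). */
void scan(const char *in, char *out, size_t cap)
{
    size_t pos = 0, len = strlen(in);
    out[0] = '\0';
    while (pos < len) {
        int state = 0, last_rule = NO_RULE;
        size_t i = pos, last_end = pos;
        /* Run the DFA as far as possible, remembering the last
           accepting state passed and the input position at that point. */
        while (i < len && (state = next_state(state, in[i])) != NO_STATE) {
            i++;
            if (accepting_rule(state) != NO_RULE) {
                last_rule = accepting_rule(state);
                last_end = i;
            }
        }
        if (last_rule != NO_RULE) {
            /* Match found: run the "C fragment" of the rule. */
            const char *tag = last_rule == RULE_INT ? "<INT:" : "<FLOAT:";
            append(out, cap, tag, strlen(tag));
            append(out, cap, in + pos, last_end - pos);
            append(out, cap, ">", 1);
            pos = last_end;       /* continue right after the match */
        } else {
            /* No match: copy one character to the output. */
            append(out, cap, in + pos, 1);
            pos++;
        }
    }
}
```

For the input `x12.5y7.z`, the scanner reads `7.` before it gets stuck on `z`; it then falls back to the last accepting state, reports `7` as an integer, and continues with the `.`, so the output is `x<FLOAT:12.5>y<INT:7>.z`.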

# `flex` Exercise 2

The table below shows how to escape various characters in XML
Create a `flex` program for the reverse conversion (and, optionally, one for this conversion)

| Raw text | XML escape |
|----------|------------|
| `'`      | `&apos;`   |
| `"`      | `&quot;`   |
| `&`      | `&amp;`    |
| `<`      | `&lt;`     |
| `>`      | `&gt;`     |
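
To make the table concrete, the forward direction (raw text to XML escapes) can be written in plain C as below. This is a hand-written sketch, not a `flex` solution; the function names `xml_escape_for()` and `xml_escape()` are hypothetical. The reverse direction, which the exercise asks for, is where regular expressions become convenient, because each escape is several characters long.

```c
#include <stddef.h>
#include <string.h>

/* Returns the XML escape for `c`, or NULL if `c` needs no escaping. */
static const char *xml_escape_for(char c)
{
    switch (c) {
    case '\'': return "&apos;";
    case '"':  return "&quot;";
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    default:   return NULL;
    }
}

/* Writes the escaped form of `in` to `out` (capacity `cap`, including
   the terminating '\0'). Returns 0 on success, -1 if `out` is too small. */
int xml_escape(const char *in, char *out, size_t cap)
{
    size_t used = 0;
    for (; *in != '\0'; in++) {
        const char *esc = xml_escape_for(*in);
        size_t n = esc ? strlen(esc) : 1;
        if (used + n >= cap)
            return -1;            /* no room for the text plus '\0' */
        memcpy(out + used, esc ? esc : in, n);
        used += n;
    }
    if (cap == 0)
        return -1;
    out[used] = '\0';
    return 0;
}
```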

# `flex` Exercise 3: Detect Numbers

Create a program with `flex` to output the input without changes, except that numbers are enclosed with `>>>` and `<<<`

Example input:

`abc123def345gh`

Example output:

`abc>>>123<<<def>>>345<<<gh`

Hint: The string recognized by a regular expression is available with the variable `yytext`

# `flex` Exercise 4 (Homework): General Rules

Format:

• A4 double-sided printout of `flex` input file (`.l` file)
• Stapled in upper left if more than two pages of output
• NO cover page, Line numbers, NO wrapping lines, legible font size, non-proportional font, portrait (not landscape), formatted (indents,...) for easy visibility
• Name (kanji and kana) and student number as a comment at the top

# `flex` Exercise 4 (Homework): Lexical Analysis for XML

XML (see W3C XML Recommendation) is a generalization of HTML for document and data formats. XML is stricter than HTML (e.g. attribute values are always quoted,...). Using `flex`, create a program that takes an arbitrary XML file as input and outputs its syntactic components, one component per line.

Simple example input: `<letter>Hello &amp; Happy World!</letter>`

Simple example output:

```
Start tag: <letter>
Contents: Hello
Entity: &amp;
Contents: Happy World!
End tag: </letter>
```

Details:

• Input and output can be limited to US-ASCII only
• White space/newline only content can be ignored
• Start tag, end tag, contents, entity, comment, and processing instruction are required
• Show the nesting level by indenting the output
• Output a warning if the number of end tags does not match the number of start tags
• Attributes (attribute names and attribute values) can be handled as an advanced problem (発展問題)

# Frequent Problems with `flex`

• If there is an error in the `.l` file, `flex` may still run without errors

Solution: Always rerun the whole chain starting with `flex`; with `&&`, each step runs only if the previous one succeeded:
`flex file.l && gcc lex.yy.c && ./a <input.txt`

• In the first part of the `.l` file (before the first `%%`), C program fragments have to be indented by at least one space (or enclosed in `%{` and `%}`)
• Don't forget `int yywrap () { return 1; }`

# Hints for Homework

• This homework will take considerably more time than the previous homework assignments
• You can assume that the input is restricted to US-ASCII
• In some cases, the order of the regular expressions (rules) is important
• The text matched with a regular expression is available in the variable `yytext`
Example: `putchar(yytext[0]);` will output the first character of the matched text
• Metacharacters in regular expressions have to be escaped with `\` or quoted within `""`

