Tools for Lexical Analysis


5th lecture, May 17, 2019

Language Theory and Compilers

Martin J. Dürst


© 2005-19 Martin J. Dürst 青山学院大学

Today's Schedule


Last Week's Homework 4

Bring your notebook PC (with flex, bison, gcc, make, diff, and m4 installed and usable)


Leftovers from Last Week


Missing Diagram: DFA Minimization


Missing Diagram: NFA for r|s

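Connect the overall start state to the start states of r and s, and the accepting states of r and s to the overall accepting state, with ε transitions.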


Missing Diagram: NFA for r*

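Connect the following with ε transitions: the overall start state to the start state of r, the accepting state of r to the overall accepting state, the overall start state to the overall accepting state, and the accepting state of r back to the start state of r (reverse direction!).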
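The two constructions above can be sketched in C as follows. This is only an illustration, not code from the lecture: the names `frag`, `nfa_symbol`, `nfa_alt`, and `nfa_star`, the global edge list, and the use of label 0 for ε are all assumptions made for this sketch.

```c
#define MAX_EDGES 64
#define EPSILON 0               /* assumption: label 0 stands for an ε transition */

/* One NFA transition: from --label--> to */
struct edge { int from, to; char label; };

struct edge edges[MAX_EDGES];
int n_edges = 0;
int n_states = 0;

/* An NFA fragment with one start state and one accepting state */
struct frag { int start, accept; };

int new_state(void) { return n_states++; }

void add_edge(int from, int to, char label)
{
    if (n_edges < MAX_EDGES) {  /* silently ignore edges beyond the fixed capacity */
        edges[n_edges].from = from;
        edges[n_edges].to = to;
        edges[n_edges].label = label;
        n_edges++;
    }
}

/* NFA for a single symbol c: start --c--> accept */
struct frag nfa_symbol(char c)
{
    struct frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, f.accept, c);
    return f;
}

/* NFA for r|s: new start and accepting states, four ε transitions */
struct frag nfa_alt(struct frag r, struct frag s)
{
    struct frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, r.start, EPSILON);
    add_edge(f.start, s.start, EPSILON);
    add_edge(r.accept, f.accept, EPSILON);
    add_edge(s.accept, f.accept, EPSILON);
    return f;
}

/* NFA for r*: new start and accepting states, four ε transitions */
struct frag nfa_star(struct frag r)
{
    struct frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, r.start, EPSILON);    /* enter r */
    add_edge(r.accept, f.accept, EPSILON);  /* leave r */
    add_edge(f.start, f.accept, EPSILON);   /* zero repetitions */
    add_edge(r.accept, r.start, EPSILON);   /* repeat r (the reverse edge) */
    return f;
}
```

Each construction adds exactly two states and four ε transitions, which is why the NFAs produced this way contain many more ε transitions than a hand-simplified diagram.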



Deadline (changed!): May 23, 2019 (Thursday), 19:00
(If you already submitted problems 1/2, just submit problem 3 on a separate sheet of paper.)

Where to submit: Box in front of room O-529 (building O, 5th floor)

Format: A4 single page (using both sides is okay; NO cover page), easily readable handwriting (NO printouts), name (kanji and kana) and student number at the top right

  1. Construct the state transition diagram for the NFA corresponding to the following grammar
    S → εA | bB | cB | cC, A → bC | aD | a | cS, B → aD | aC | bB | a, C → εA | aD | a
    (Caution: In right linear grammars, ε is not allowed except in the rule S → ε)
    (Hint: Create a new accepting state F)
  2. Convert the following transition table to a right linear grammar

             0    1
    →T       G    H
    *G       K    L
    *H       M    K
    *K       K    K
    *L       M    K
     M       L    -
  3. Construct the state transition diagram for the regular expression ab|c*d
    (write down both the result of the procedure explained during this lecture (with all ε transitions) as well as a version that is as simple as possible)


Summary for Regular Languages

Right linear grammars, finite automata (NFAs and DFAs), and regular expressions all have the same power: they describe/recognize regular languages, and can be converted into each other.
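As a small illustration of this equivalence, the regular expression a*b, the right linear grammar S → aS | b, and the DFA below all describe the same language. The example language and the names `transition`, `accepting`, and `dfa_accepts` are chosen for this sketch only and do not come from the lecture.

```c
/* Hypothetical example: table-driven DFA for the regular expression a*b
   over the alphabet {a, b}.
   State 0: start state, state 1: accepting state, state 2: dead state. */
enum { N_STATES = 3, N_SYMBOLS = 2 };

static const int transition[N_STATES][N_SYMBOLS] = {
    /*         a  b */
    /* 0 */  { 0, 1 },
    /* 1 */  { 2, 2 },
    /* 2 */  { 2, 2 },
};

static const int accepting[N_STATES] = { 0, 1, 0 };

/* Returns 1 if the DFA accepts input, 0 otherwise */
int dfa_accepts(const char *input)
{
    int state = 0;
    for (; *input != '\0'; input++) {
        int symbol;
        if (*input == 'a')
            symbol = 0;
        else if (*input == 'b')
            symbol = 1;
        else
            return 0;           /* symbol outside the alphabet: reject */
        state = transition[state][symbol];
    }
    return accepting[state];
}
```

The DFA needs only the current state and one table lookup per input character, which is the reason lexical analyzers are usually implemented this way.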


Compilation Stages

  1. Lexical analysis
  2. Parsing (syntax analysis)
  3. Semantic analysis
  4. Optimization (or 5)
  5. Code generation (or 4)


Compiler Structure


Implementing Lexical Analysis



Overview of flex


How to Use Cygwin



Cygwin and Harddisks



flex Usage Steps

  1. Create an input file for flex (a (f)lex file), with the extension .l (example: test.l)
  2. Use flex to convert test.l to a C program:
    $ flex test.l
    (the output file is named lex.yy.c)
  3. Compile lex.yy.c with a C compiler (maybe together with other files)
  4. Execute the compiled program


Two Ways to Use flex

  1. Independent file processing (use regular expressions to recognize or change parts of a file):

    Call the yylex() function once from the main function

  2. Combination with parser:

    Repeatedly call yylex() from the parser, and return a token with return

In today's exercises and homework, we will use 1.

Later in this course, we will use 2. together with bison.
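The second way of use can be pictured with a hand-written stand-in for the scanner. This sketch does not use flex at all: `next_token`, `count_tokens`, and the token codes are hypothetical names for illustration. The one detail taken from flex is the convention that the scanner returns 0 at the end of input.

```c
#include <ctype.h>

/* Hypothetical token codes; 0 means end of input, as with yylex() */
enum { TOK_EOF = 0, TOK_NUMBER = 1, TOK_OTHER = 2 };

/* Hand-written stand-in for a generated scanner: recognizes one token
   starting at *pos, advances *pos past it, and returns its code. */
int next_token(const char **pos)
{
    const char *p = *pos;

    while (*p == ' ' || *p == '\t' || *p == '\n')   /* skip whitespace */
        p++;
    if (*p == '\0') {
        *pos = p;
        return TOK_EOF;
    }
    if (isdigit((unsigned char)*p)) {               /* one or more digits */
        while (isdigit((unsigned char)*p))
            p++;
        *pos = p;
        return TOK_NUMBER;
    }
    *pos = p + 1;                                   /* any other single character */
    return TOK_OTHER;
}

/* The "parser" side: calls the scanner repeatedly, one token per call */
int count_tokens(const char *input)
{
    int n = 0;
    while (next_token(&input) != TOK_EOF)
        n++;
    return n;
}
```

In the first way of use there is no such loop in the caller: the scanner is called once and all work happens in the actions attached to the rules.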


Example of flex Input Format

        int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main(void)
{
    yylex();
    printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
}
int yywrap(void) { return 1; }
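For comparison, the effect of the two rules can be written by hand in plain C. This is only a sketch of what the rules compute, not the code that flex generates; the names `counts` and `count_lines_chars` are made up for this illustration.

```c
/* Result of scanning a text: number of newlines and number of characters */
struct counts { int lines; int chars; };

/* Hand-written equivalent of the two rules:
   a newline increments both counters, any other character only num_chars. */
struct counts count_lines_chars(const char *text)
{
    struct counts c;
    c.lines = 0;
    c.chars = 0;
    for (; *text != '\0'; text++) {
        if (*text == '\n')
            c.lines++;          /* corresponds to the rule for \n */
        c.chars++;              /* both rules count the character */
    }
    return c;
}
```

The flex version needs two rules because the pattern . matches any character except a newline.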


flex Exercise 1

Process and execute the flex program on the previous slide

  1. Create a file test.l and copy the contents of the previous slide to the file
  2. Create the file lex.yy.c with
    $ flex test.l
  3. Create the executable file a.exe with
    $ gcc lex.yy.c
  4. Execute the program with some input from standard input
    $ ./a <file


Skeleton of flex Input Format

declarations,... (C program language)
declarations,... (C program language)
%%
regexp    statement (C program language)
regexp    statement (C program language)
%%
functions,... (C program language)
functions,... (C program language)


Structure of flex Input Format

Mixture of flex-specific instructions and C program fragments

Three main parts, separated by two %%:

  1. Preparation/setup part:
  2. Flex rules:
  3. Rest of C program (functions,...)

Newlines and indent can be significant!


How to Study flex


How flex Works


How the Program Created by flex Works


flex Exercise 2

The table below shows how to escape various characters in XML.
Create a program in flex for the reverse conversion (and also for this conversion).

Raw text    XML escape
'           &apos;
"           &quot;
&           &amp;
<           &lt;
>           &gt;
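To make the table concrete, the forward direction (raw text to XML escapes) can be written in plain C as below. This is not a flex solution to the exercise, only a sketch of the mapping; the function names `xml_escape_char` and `xml_escape` and the buffer-based interface are assumptions made for this example.

```c
#include <string.h>

/* Returns the XML escape for c, or NULL if c needs no escaping */
const char *xml_escape_char(char c)
{
    switch (c) {
    case '\'': return "&apos;";
    case '"':  return "&quot;";
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    default:   return NULL;
    }
}

/* Writes the escaped form of in to out (capacity outsize, including the
   terminating '\0'). Returns 0 on success, -1 if out is too small. */
int xml_escape(const char *in, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;
    for (; *in != '\0'; in++) {
        const char *esc = xml_escape_char(*in);
        size_t len = esc ? strlen(esc) : 1;
        if (used + len + 1 > outsize)   /* keep room for the final '\0' */
            return -1;
        if (esc)
            memcpy(out + used, esc, len);
        else
            out[used] = *in;
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

In flex, each row of the table becomes one rule, so the pattern matching done here by the switch statement is left to the generated scanner.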


flex Exercise 3: Detect Numbers

Create a program with flex to output the input without changes, except that numbers are enclosed with >>> and <<<

Example input:


Example output:


Hint: The string recognized by a regular expression is available with the variable yytext


flex Exercise 4 (Homework):
General Rules

Deadline: May 30, 2019 (Thursday), 19:00

Where to submit: Box in front of room O-529, available starting May 24

(start early, so that you can ask questions on May 24)


Collaboration: The same rules as for Computer Practice I (計算機実習 I) apply


flex Exercise 4 (Homework):
Lexical Analysis for XML

XML (see W3C XML Recommendation) is a generalization of HTML for document and data formats. XML is stricter than HTML (e.g. attribute values are always quoted,...). Using flex, create a program that takes an arbitrary XML file as input and outputs its syntactic components, one component per line.

Simple example input: <letter>Hello &amp; Happy World!</letter>

Simple example output:

Start tag: <letter>
  Contents: Hello
  Entity: &amp;
  Contents: Happy World!
End tag: </letter>



Frequent Problems with flex


Hints for Homework


Announcement: Minitest

There will be a minitest (30 minutes) next week (Friday, May 24). Please prepare well!



lexical analyzer
lexical analyzer generator: 字句解析器生成系 (生成器)
parser generator: 構文解析器生成系 (生成器)
integer literal
character literal