Tools for Lexical Analysis

(字句解析ツール)

5th lecture, May 18, 2018

Language Theory and Compilers

http://www.sw.it.aoyama.ac.jp/2018/Compiler/lecture5.html

Martin J. Dürst

AGU

© 2005-18 Martin J. Dürst 青山学院大学

Today's Schedule

 

Last Week's Homework 1

Construct the state transition diagram for the NFA corresponding to the following grammar
S → εA | bB | cB | cC, A → bC | aD | a | cS, B → aD | aC | bB | a, C → εA | aD | a
(Caution: In right linear grammars, ε is not allowed except in the rule S → ε)
(Hint: Create a new accepting state F)

 

Last Week's Homework 2

Convert the result of last week's homework 3 (after rewriting, see this handout) to a right linear grammar

Removed due to circumstances

 

Last Week's Homework 3

Construct the state transition diagram for the regular expression ab|c*d
(write down both the result of the procedure explained during this lecture (with all ε transitions) as well as a version that is as simple as possible)

 

Last Week's Homework 4

Bring your notebook PC (with flex, bison, gcc, make, diff, and m4 installed and usable)

 

Leftovers from Last Week

 

Summary up to Now

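Regular (right linear) grammars, finite automata (NFA and DFA), and regular expressions all have the same power: they describe/recognize exactly the regular languages, and each can be converted into the others.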

 

Compilation Stages

  1. Lexical analysis
  2. Parsing (syntax analysis)
  3. Semantic analysis
  4. Optimization (or 5)
  5. Code generation (or 4)

 

Compiler Structure

 

Implementing Lexical Analysis

Choices:

 

Overview of flex

 

How to Use Cygwin

(reminder)

 

Cygwin and Harddisks

(reminder)

 

flex Usage Steps

  1. Create an input file for flex (a (f)lex file), with the extension .l (example: test.l)
  2. Use flex to convert test.l to a C program:
    $ flex test.l
    (the output file is named lex.yy.c)
  3. Compile lex.yy.c with a C compiler (maybe together with other files)
  4. Execute the compiled program

 

Two Ways to Use flex

  1. Independent file processing (use regular expressions to recognize or change parts of a file):

    Call the yylex() function once from the main function

  2. Calling the lexical analyzer from the parser:

    The parser calls yylex() repeatedly; each call hands one token back to the parser through a return statement in a rule's action

In today's exercises and homework, we will use approach 1.

In the second half of this course, we will use approach 2 together with bison.
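The second way of using a lexical analyzer can be sketched in plain C without flex. The code below is only an illustration under stated assumptions: next_token() is a hand-written stand-in for yylex(), count_tokens() is a stand-in for the parser, and the token codes TOK_NUMBER, TOK_IDENT, and TOK_OTHER as well as struct scanner are hypothetical names that do not come from flex or bison. The one convention borrowed from yylex() is that 0 signals the end of the input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; 0 means "end of input", as with yylex(). */
enum { TOK_END = 0, TOK_NUMBER = 1, TOK_IDENT = 2, TOK_OTHER = 3 };

/* Scanner state: the part of the input that has not been read yet. */
struct scanner {
    const char *pos;
};

/* Hand-written stand-in for yylex(): skips blanks, then returns the
   code of the next token and advances past it. */
int next_token(struct scanner *s)
{
    assert(s != NULL && s->pos != NULL);
    while (*s->pos == ' ' || *s->pos == '\t' || *s->pos == '\n')
        s->pos++;
    if (*s->pos == '\0')
        return TOK_END;
    if (isdigit((unsigned char)*s->pos)) {
        while (isdigit((unsigned char)*s->pos))
            s->pos++;
        return TOK_NUMBER;
    }
    if (isalpha((unsigned char)*s->pos)) {
        while (isalnum((unsigned char)*s->pos))
            s->pos++;
        return TOK_IDENT;
    }
    s->pos++;               /* any other single character */
    return TOK_OTHER;
}

/* Stand-in for the parser: calls the scanner until it reports the end
   of the input, and counts the tokens it received. */
int count_tokens(const char *input)
{
    struct scanner s = { input };
    int n = 0;
    while (next_token(&s) != TOK_END)
        n++;
    return n;
}
```

For the input "x1 = 42 + y", count_tokens() receives five tokens: an identifier, another character, a number, another character, and an identifier. With flex, the token-recognizing loops are replaced by regular expressions, and only the return statements remain in the rule actions.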

 

Example of flex Input Format

        int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main(void) {
    yylex();
    printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
    return 0;
}
int yywrap(void) { return 1; }
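For comparison, here is a rough hand-written C equivalent of the two rules above. It is a sketch, not what flex actually generates: the name count_text() and struct counts are made up for this illustration, and it works on a C string rather than on standard input (so, unlike the flex program, it stops at the first NUL byte).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type holding the two counters of the flex example. */
struct counts {
    int num_lines;
    int num_chars;
};

/* Counts newlines and characters in text, one character at a time. */
struct counts count_text(const char *text)
{
    struct counts c = { 0, 0 };
    assert(text != NULL);
    for (const char *p = text; *p != '\0'; p++) {
        if (*p == '\n')
            ++c.num_lines;  /* rule "\n": a newline also counts as a line */
        ++c.num_chars;      /* both rules: every character is counted */
    }
    return c;
}
```

Calling count_text("ab\ncd\n") gives 2 lines and 6 characters, the same numbers the flex program reports for a file with that content. The flex version needs no explicit loop and no if: the pattern matching is done by the generated automaton.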

 

flex Exercise 1

Process and execute the flex program on the previous slide

  1. Create a file test.l and copy the contents of the previous slide to the file
  2. Create the file lex.yy.c with
    $ flex test.l
  3. Create the executable file a.exe with
    $ gcc lex.yy.c
  4. Execute the program with some input from standard input
    $ ./a <file

 

Skeleton of flex Input Format

declarations,... (C programming language)
declarations,... (C programming language)
%%
regexp statement (C programming language)
regexp statement (C programming language)
%%
functions,... (C programming language)
functions,... (C programming language)

 

Structure of flex Input Format

Mixture of flex-specific instructions and C program fragments

Three main parts, separated by two %%:

  1. Preparation/setup part:
  2. Flex rules:
  3. Rest of C program (functions,...)

Newlines and indentation can be significant!

 

How to Study flex

 

How flex Works

 

How the Program Created by flex Works

 

flex Exercise 2

The table below shows how to escape various characters in XML.
Create a flex program for the reverse conversion (and one for the forward conversion shown in the table)

Raw text    XML escape
'           &apos;
"           &quot;
&           &amp;
<           &lt;
>           &gt;
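To make the table concrete, the forward conversion (raw text to XML escapes) can be written by hand in C as sketched below. This is not the flex program the exercise asks for; the names escape_for() and xml_escape() and the buffer-based interface are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the XML escape for c, or NULL if c needs no escaping. */
static const char *escape_for(char c)
{
    switch (c) {
    case '\'': return "&apos;";
    case '"':  return "&quot;";
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    default:   return NULL;
    }
}

/* Writes the escaped form of in to out, which has room for out_size
   bytes including the terminating '\0'.
   Returns 1 on success, 0 if out is too small. */
int xml_escape(const char *in, char *out, size_t out_size)
{
    size_t used = 0;
    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        const char *esc = escape_for(*in);
        size_t len = esc ? strlen(esc) : 1;
        if (used + len + 1 > out_size)
            return 0;               /* no room for this piece plus '\0' */
        if (esc)
            memcpy(out + used, esc, len);
        else
            out[used] = *in;
        used += len;
    }
    if (used + 1 > out_size)
        return 0;
    out[used] = '\0';
    return 1;
}
```

With a sufficiently large buffer, the input a<b & "c" becomes a&lt;b &amp; &quot;c&quot;. In flex, each case of the switch corresponds to one rule, and all other characters are copied unchanged.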

 

flex Exercise 3: Detect Numbers

Create a program with flex to output the input without changes, except that numbers are enclosed with >>> and <<<

Example input:

abc123def345gh

Example output:

abc>>>123<<<def>>>345<<<gh

Hint: The string matched by a regular expression is available in the variable yytext

 

flex Exercise 4 (Homework):
General Rules

Deadline: May 31, 2018 (Thursday), 19:00

Where to submit: Box in front of room O-529 (building O, 5th floor)

(start early, so that you can ask questions on May 25)

Format:

Collaboration: The same rules as for Computer Practice I (計算機実習 I) apply

 

flex Exercise 4 (Homework):
Lexical Analysis for Dates and Times

ISO 8601 is the standard format for dates and times (see Wikipedia article). Detect dates and times in ISO 8601 format, and output them one line per item. As an example, if you find 2018-05-18 in the input, then output it as:

year: 2018, month: 5, day: 18

Make sure your program ignores impossible dates (e.g. 2018-02-29, 2018-04-31, 2018-05-32,...). NEW for future: USE regular expressions to distinguish impossible dates; USE C ONLY for output.

Detect all the date and time formats in the table below.

Format        Items                Example
YYYY-MM-DD    year/month/day       2014-05-09
YYYY-MM       year/month           2014-05
YYYY          year                 2014
YYYY-DDD      year/day             2014-094
YYYY-Www-D    year/week/weekday    2014-W15-3
HH:MM:SS      hour/minute/second   12:23:47
HH:MM         hour/minute          12:23

Advanced exercise (extra credit): Also detect other ISO 8601 formats (combinations of dates and times, times with timezones, durations)
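Which YYYY-MM-DD dates count as impossible can be stated precisely with a small C reference check. This is only a sketch for checking your own test cases: it is not the approach asked for in the homework, and the function names is_leap_year(), days_in_month(), and is_valid_date() are made up for this illustration. It assumes the Gregorian leap-year rule.

```c
#include <assert.h>

/* Gregorian rule: every 4th year is a leap year, except century years
   that are not divisible by 400. */
int is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Number of days in the given month (1..12) of the given year. */
int days_in_month(int year, int month)
{
    static const int days[12] =
        { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
    assert(month >= 1 && month <= 12);
    if (month == 2 && is_leap_year(year))
        return 29;
    return days[month - 1];
}

/* Returns 1 if year-month-day is a possible calendar date, 0 otherwise. */
int is_valid_date(int year, int month, int day)
{
    if (month < 1 || month > 12)
        return 0;
    return day >= 1 && day <= days_in_month(year, month);
}
```

For example, is_valid_date(2018, 2, 29), is_valid_date(2018, 4, 31), and is_valid_date(2018, 5, 32) all return 0, while is_valid_date(2016, 2, 29) returns 1.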

 

Frequent Problems with flex

 

Hints for Homework

 

Announcement

There will be a minitest (ca. 30 minutes) next week. Please prepare well!

 

Glossary

automate
自動化する
parser
構文解析器
lexical analyzer
字句解析器
lexical analyzer generator
字句解析器生成系 (生成器)
parser generator
構文解析器生成系 (生成器)
extension
拡張子
skeleton
骨格
definition
定義
initialization
初期化
integer literal
整数定数
character literal
文字定数