# Importance, classification, and definition of formal languages; finite automata

(形式言語の重要性、種類、定義)

## Language Theory and Compiler

### Martin J. Dürst

© 2006-16 Martin J. Dürst 青山学院大学

# Today's Schedule

• Last week's homework
• Definitions for formal language theory
• Definitions, operations, and properties for words
• Definitions, operations, and properties for languages
• Automata, grammars, and derivation

# Course Contents

| | Theory | Compilers | Other applications |
|---|---|---|---|
| Front end | language theory, automata | lexical analysis, parsing | regular expressions, text/data formats |
| Back end | | optimization, code generation | |

# Importance of Formal Language Theory

• Model for data formats and programming languages
• Model for computation and recognition

# Basic Terms

Terms for formal languages:

• A word is composed of symbols following some rules
• A word is a string/sequence of symbols
• Example: Words such as a, abc, aaabbb, and abcba can be created using symbols a, b, and c
• The empty word (ε) is also a word

# Definition of Word

• Words and languages are defined over a finite set of symbols (or letters) Σ
• Σ is called the alphabet (example: Σ = {a, b, c})
• A word over Σ is a sequence of 0 or more symbols from Σ
• The number of symbols in a word is called the length of a word
• The length of a word w is written |w|
• Example: |abcaba| = 6; |ε| = 0
• Symbols are also words of length 1 (example: b)

# Concatenation Operation for Words

• A new word can be created by lining up two words after each other
• This is called the concatenation operation on words
• The concatenation operation is written without an explicit symbol (similar to multiplication in high school)
Example: The concatenation of words w and v is written wv
• Application example: With w = abc and v = cba, wv = abccba
• The concatenation of a word (or symbol) with itself is written using an exponent: $w^2$ = ww = abcabc, $a^5$ = aaaaa, ...

# Properties of Concatenation

• Associativity: For any words w, v, and u: (wv)u = w(vu)
• The neutral element is ε: wε = εw = w
• Commutativity doesn't hold: in general, wv ≠ vw (example: abccba ≠ cbaabc)
• The length of a concatenation is the sum of the lengths of its operands: |wv| = |w| + |v|
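The properties above can be checked on concrete words. The following is a minimal sketch that assumes words are modelled as Python strings and the empty word ε as the empty string; the helper name `power` is made up for this illustration.

```python
# Words are modelled as Python strings; the empty word ε is "" (an assumption
# of this sketch, not something the slides prescribe).
EPSILON = ""
w, v, u = "abc", "cba", "ab"


def power(word: str, n: int) -> str:
    """Concatenate `word` with itself n times; power(word, 0) is the empty word."""
    return word * n


print(w + v)                            # concatenation wv
print(power(w, 2), power("a", 5))       # exponent notation
print((w + v) + u == w + (v + u))       # associativity
print(w + EPSILON == w == EPSILON + w)  # ε is the neutral element
print(w + v == v + w)                   # commutativity fails for these words
print(len(w + v) == len(w) + len(v))    # lengths add up
```

Checking a few words does not prove the properties, of course; it only shows what they say.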

# Definition of Language

A language over Σ is a set of words over Σ

Examples of languages over Σ = {a, b, c}:

• Empty set: {}
• Set containing only the empty word: {ε}
• Σ (set of words of length 1 over Σ): {a,b,c}
• Set of words of length 3 (over Σ) (size of set: 27)
• Set of all words over Σ
• Set of (all) words (over Σ) starting with a
• Set of words representing the weather at each of the prefectural governments for each of the days of next week, if a stands for sunny, b stands for cloudy, and c stands for rainy/snowy (length of each word: 7; number of words: at most 47, one per prefecture)
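Some of the example languages above can be written down directly. This is a small sketch assuming finite languages are modelled as Python sets of strings and infinite ones as membership tests; `words_of_length` and `starts_with_a` are names invented here.

```python
from itertools import product

SIGMA = {"a", "b", "c"}


def words_of_length(alphabet: set, n: int) -> set:
    """All words of exactly length n over the given alphabet."""
    return {"".join(symbols) for symbols in product(sorted(alphabet), repeat=n)}


def starts_with_a(word: str) -> bool:
    """Membership test for the (infinite) language of words starting with a."""
    return set(word) <= SIGMA and word.startswith("a")


empty_language = set()       # {} contains no word at all
only_empty_word = {""}       # {ε} contains exactly one word
length_3 = words_of_length(SIGMA, 3)

print(len(empty_language), len(only_empty_word), len(length_3))
print(starts_with_a("abcba"), starts_with_a("cba"))
```

Note that the empty set and the set containing only the empty word are different languages: the first has size 0, the second size 1.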

# More Examples of Languages

• Σ = {a,..., z}, set of keywords of the programming language C
• Σ = {0,..., 9, a,..., f, A,..., F, x,...}, set of integer literals of C
• Σ = {characters from ASCII or Unicode}, (set of) grammatically correct C programs
• Σ = {a,..., z, (, ), ¬, ∧, ∨}, set of all well-formed formulæ of predicate logic
• Σ = {Latin letters,...}, set of all English words
• Σ = {Latin letters,...}, set of all French words
• Σ = {Kanji, Kana,...}, set of all Japanese words
• Σ = {Kanji, Kana,...}, set of all grammatically correct Japanese sentences

# Operations on Languages

Operations on languages are combinations of operations on sets and operations on words.

• Set union of languages
• Set intersection of languages
• Set difference of languages
• Concatenation operation for languages: For languages A and B, their concatenation AB is the set { wv | w ∈ A, v ∈ B }

As with words, we write $L^2$ for LL, and so on.

• Kleene closure: Concatenating the same language 0 or more times

written L*; $L^* = \bigcup_{i=0}^{\infty} L^i$, where $L^0 = \{ε\}$

Example: L = {a, b} => L* = {ε, a, b, aa, ab, ba, bb, aaa, ...}
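Concatenation and Kleene closure of languages can be sketched in a few lines. Because L* is infinite for any L containing a non-empty word, the sketch only collects words up to a length bound; the function names and the `max_len` parameter are assumptions of this illustration.

```python
def concat(A: set, B: set) -> set:
    """Concatenation AB = { wv | w in A, v in B }."""
    return {w + v for w in A for v in B}


def language_power(L: set, n: int) -> set:
    """L^n, with L^0 = {ε} (ε is the empty string here)."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def kleene_up_to(L: set, max_len: int) -> set:
    """All words of L* whose length does not exceed max_len."""
    found = {""}
    frontier = {""}
    while frontier:
        # Extend the newest words by one more word of L, keep only new short ones.
        frontier = {x for x in concat(frontier, L) if len(x) <= max_len} - found
        found |= frontier
    return found


L = {"a", "b"}
print(sorted(language_power(L, 2)))
print(sorted(kleene_up_to(L, 2), key=lambda x: (len(x), x)))
```

Cutting off by length is safe because concatenation never shortens a word, so every word of L* within the bound is built from prefixes that are also within the bound.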

# Main Problems in Formal Language Theory

• How can languages be defined in a way that is simple and easy to understand?
• How can words be produced from definitions of languages?
• How can we decide whether some sequence of symbols is a word in some language?
• How can such decisions be implemented easily and executed quickly?
• How can we associate syntax with semantics?

# Languages and Automata and Grammars

• An automaton is a model for a machine that accepts/recognizes/distinguishes words in a given language
• A grammar is a set of rules to create a language
• There are many different kinds of automata and grammars
• These different kinds have different ranges of languages that can be accepted/generated
• Language theory distinguishes mainly four types of languages
• There is an ordered subset relationship between these four types
• For each type of language, there is a corresponding type of automaton and a corresponding type of grammar

# Table of Formal Language Types

(Chomsky hierarchy)

| Grammar (Japanese) | Grammar | Type | Language type | Automaton |
|---|---|---|---|---|
| 句構造文法 | phrase structure grammar (psg) | 0 | phrase structure language | Turing machine |
| 文脈依存文法 | context-sensitive grammar (csg) | 1 | context-sensitive language | linear-bounded automaton |
| 文脈自由文法 | context-free grammar (cfg) | 2 | context-free language | push-down automaton |
| 正規文法 | regular grammar (rg) | 3 | regular language | finite state automaton |

• The Turing machine is a model for computation in general
• Context-free languages are used for parsing
• Regular languages are used for lexical analysis

# Types of Automata

Automata types are distinguished by the restrictions on their "external memory":

0. The external memory is a tape of unlimited length: Turing machine

1. The external memory is a tape of limited length (bounded linearly by the length of the input): linear-bounded automaton

2. The external memory is a stack where only the top can be accessed: push-down automaton

3. There is no external memory: finite state automaton
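To make the last case concrete, here is a minimal sketch of a finite state automaton. It is not taken from the slides: it recognizes the earlier example language "words over {a, b, c} starting with a", and the state names and the table layout are assumptions of this illustration. The only thing the machine remembers is its current state.

```python
SIGMA = {"a", "b", "c"}
START = "start"
ACCEPTING = {"starts_with_a"}

# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {(START, "a"): "starts_with_a"}
TRANSITIONS.update({(START, s): "dead" for s in "bc"})
TRANSITIONS.update({("starts_with_a", s): "starts_with_a" for s in SIGMA})
TRANSITIONS.update({("dead", s): "dead" for s in SIGMA})


def accepts(word: str) -> bool:
    """Run the automaton over the word; accept if it ends in an accepting state."""
    state = START
    for symbol in word:
        if symbol not in SIGMA:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


print(accepts("abcba"), accepts("cba"), accepts(""))
```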

# Example of a Grammar for a Formal Language

• S, A: nonterminal symbols (upper case)
• a, o, y: terminal symbols (lower case)
• rewriting rules:

S → a S o

S → A

A → y a

• S: initial symbol (start symbol)

Example of derivation of a word from the grammar:

S → a S o → a a S o o → a a A o o → a a y a o o

S →* a a y a o o

# Definition of Grammar

• A finite set of nonterminal symbols N (usually upper case)
• A finite set of terminal symbols Σ (usually lower case, N ∩ Σ = {})
• A finite set of rewriting rules P (also called production rules)
• A start symbol S (S ∈ N, the symbol on the left side of the first rewriting rule if not explicitly specified)

A grammar is defined as a quadruple (N, Σ, P, S)
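The quadruple can be written down as a small data structure. The sketch below assumes one character per symbol and reads the rules of the earlier example grammar as S → aSo, S → A, A → ya; the class name `Grammar` and the function `check` are made up for illustration.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    nonterminals: frozenset   # N
    terminals: frozenset      # Σ
    rules: tuple              # P, as (left-hand side, right-hand side) pairs
    start: str                # S


def check(g: Grammar) -> list:
    """Return a list of violated conditions (empty if the quadruple is well formed)."""
    problems = []
    if g.nonterminals & g.terminals:
        problems.append("N and Σ must be disjoint")
    if g.start not in g.nonterminals:
        problems.append("the start symbol must be in N")
    symbols = g.nonterminals | g.terminals
    for lhs, rhs in g.rules:
        if not set(lhs + rhs) <= symbols:
            problems.append(f"rule {lhs} → {rhs} uses an unknown symbol")
    return problems


example = Grammar(
    nonterminals=frozenset("SA"),
    terminals=frozenset("aoy"),
    rules=(("S", "aSo"), ("S", "A"), ("A", "ya")),
    start="S",
)
print(check(example))
```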

# Rewriting Rules

A rewriting rule has the form α → β.

α is the left-hand side, β is the right-hand side.

α and β are sequences of zero or more nonterminal and terminal symbols.

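# Derivation

• The process of creating words from a grammar
• Start from the initial symbol
• One derivation step applies one rewriting rule once:
• In the current sequence of (non)terminal symbols,
• identify a subsequence that is the same as the left-hand side of a rewriting rule
• Replace this subsequence with the right-hand side of the rewriting rule
• If more than one derivation step is possible, choose arbitrarily
(different choices create different words)
• The derivation ends as soon as the result consists of terminal symbols only
→ The result is a word of the language (defined by the grammar)
• If no rewriting rule is applicable, the derivation fails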
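The derivation process can be sketched as a search that tries every possible choice instead of picking one arbitrarily. The code below is an illustration under several assumptions: one character per symbol, upper case for nonterminals, and a small grammar with the rules S → aSo, S → A, A → ya. The length bound `max_len` is only valid as a cut-off because none of these rules shortens the sequence.

```python
from collections import deque

RULES = [("S", "aSo"), ("S", "A"), ("A", "ya")]


def one_step(form: str, rules: list) -> set:
    """All sequences reachable by applying one rule once at one position."""
    results = set()
    for lhs, rhs in rules:
        position = form.find(lhs)
        while position != -1:
            results.add(form[:position] + rhs + form[position + len(lhs):])
            position = form.find(lhs, position + 1)
    return results


def derive_words(start: str, rules: list, max_len: int) -> set:
    """Words (terminal-only sequences) derivable without exceeding max_len."""
    words, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form.islower():            # terminal symbols only: derivation ends
            words.add(form)
            continue
        for successor in one_step(form, rules):
            if len(successor) <= max_len and successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return words


print(sorted(derive_words("S", RULES, 8), key=len))
```

A sequence that still contains a nonterminal but has no successors corresponds to a failed derivation; the search simply drops it.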

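# Example of Grammar and Derivation

S → aba (1), S → aDTa (2), T → CDTa (3), T → CDa (4),
DC → QC (5), QC → QD (6), QD → CD (7),
aC → aa (8), Da → ba (9), Db → bb (10)

(Numbers are the numbers of the rewriting rules, underlines show where a rewriting rule is applied; both are usually omitted, including in homework)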
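A derivation with this grammar can be replayed mechanically from a list of rule numbers. The sketch below reads the ten rules as S → aba, S → aDTa, T → CDTa, T → CDa, DC → QC, QC → QD, QD → CD, aC → aa, Da → ba, Db → bb, and always applies a rule at the leftmost matching position, which is a simplification: a derivation is free to choose any matching position. The function name `replay` and the chosen rule sequence are illustrative only.

```python
RULES = {
    1: ("S", "aba"), 2: ("S", "aDTa"), 3: ("T", "CDTa"), 4: ("T", "CDa"),
    5: ("DC", "QC"), 6: ("QC", "QD"), 7: ("QD", "CD"),
    8: ("aC", "aa"), 9: ("Da", "ba"), 10: ("Db", "bb"),
}


def replay(start: str, rule_numbers: list) -> list:
    """Apply the numbered rules in order, each at the leftmost matching position."""
    form = start
    steps = [form]
    for number in rule_numbers:
        lhs, rhs = RULES[number]
        if lhs not in form:
            raise ValueError(f"rule {number} ({lhs} → {rhs}) is not applicable to {form}")
        form = form.replace(lhs, rhs, 1)   # replace the first occurrence only
        steps.append(form)
    return steps


print(" → ".join(replay("S", [2, 4, 5, 6, 7, 8, 9, 10])))
```

Trying a rule sequence that gets stuck, for example `replay("S", [2, 8])`, raises an error, which corresponds to a rule that is not applicable at that point.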

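# Types of Grammars

0. No particular restrictions: phrase structure grammar, (Chomsky) type 0 grammar

1. All rules have the form αAβ → αγβ (α and β are sequences of zero or more, γ is a sequence of one or more (non)terminal symbols): context-sensitive grammar, type 1 grammar

2. All rules have the form A → γ (γ is a sequence of zero or more (non)terminal symbols): context-free grammar, type 2 grammar

3. All rules have the form A → aB or A → a (A → Ba or A → a is also possible): regular grammar, type 3 grammar

(In all cases, S → ε is also allowed as a special case)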
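The restrictions on rule shapes can be turned into a rough classifier. This is a sketch under stated assumptions: one character per symbol, upper case for nonterminals, lower case for terminals, and only the right-linear form A → aB / A → a for type 3. The function names are invented here. It reports the most restrictive type whose rule shape all rules satisfy; the language itself may still belong to a more restrictive type.

```python
def is_regular(lhs: str, rhs: str) -> bool:
    """A → aB or A → a (right-linear form only)."""
    if not (len(lhs) == 1 and lhs.isupper()):
        return False
    if len(rhs) == 1:
        return rhs.islower()
    return len(rhs) == 2 and rhs[0].islower() and rhs[1].isupper()


def is_context_free(lhs: str, rhs: str) -> bool:
    """A → γ with a single nonterminal on the left-hand side."""
    return len(lhs) == 1 and lhs.isupper()


def is_context_sensitive(lhs: str, rhs: str) -> bool:
    """αAβ → αγβ with γ of length one or more."""
    for i, symbol in enumerate(lhs):
        if not symbol.isupper():
            continue
        alpha, beta = lhs[:i], lhs[i + 1:]
        gamma_length = len(rhs) - len(alpha) - len(beta)
        if gamma_length >= 1 and rhs.startswith(alpha) and rhs.endswith(beta):
            return True
    return False


def chomsky_type(rules: list, start: str = "S") -> int:
    """Most restrictive type (3, 2, 1, or 0) whose shape all rules satisfy."""
    # S → ε is allowed as a special case for every type.
    rules = [(lhs, rhs) for lhs, rhs in rules if not (lhs == start and rhs == "")]
    for type_number, test in ((3, is_regular), (2, is_context_free), (1, is_context_sensitive)):
        if all(test(lhs, rhs) for lhs, rhs in rules):
            return type_number
    return 0


print(chomsky_type([("S", "aS"), ("S", "b")]))
print(chomsky_type([("S", "aSo"), ("S", "A"), ("A", "ya")]))
print(chomsky_type([("S", "aDTa"), ("DC", "QC"), ("aC", "aa")]))
print(chomsky_type([("AB", "BA")]))
```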

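# Homework

1. For L = { a, cb, ac }, list the 10 shortest words of L*.
Advanced problem (optional): List all words of length 4 in L*.
2. Using the grammar from "Example of Grammar and Derivation", write the derivations of three words that differ from the example and from each other (include all intermediate steps; numbers and underlines are not needed). Guess what kind of language this grammar defines, and explain it.
(Hint: If the guess is not easy, there is most probably a problem with your derivations)
Advanced problem (optional): Try to prove your guess.
3. (No need to submit, but if you were not able to do this, make sure to bring your notebook computer next time)
Install cygwin on your notebook computer (detailed instructions with screenshots). During installation, make sure to select gcc, flex, bison, diff, make, and m4. If cygwin was installed previously, make sure to check and update it.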

# Glossary

word
derivation

classification

symbol

empty word

alphabet
アルファベット
(word/language) over Σ
Σ 上の (語・言語)
concatenation (operation)

associativity

neutral element

commutativity

prefectural government (building)

keyword

well-formed formula

Kleene closure
クリーン閉包
rule

type of language

Chomsky hierarchy
チョムスキー階層
phrase structure language

context-sensitive language

context-free language

regular language

Turing machine
チューリング機械
linear-bounded automaton

push-down automaton
プッシュダウンオートマトン
finite state automaton

external memory

nonterminal symbol

upper case (letter)

lower case (letter)

terminal symbol

rewriting rule/production rule

initial/start symbol

derivation