Importance, Definition, and Classification of Formal Languages

(形式言語の重要性、定義、分類)

2nd lecture, April 15, 2022

Language Theory and Compiler

http://www.sw.it.aoyama.ac.jp/2022/Compiler/lecture2.html

Martin J. Dürst

Today's Schedule

Last week's homework
Definitions for formal language theory
- Definitions, operations, and properties for words
- Definitions, operations, and properties for languages
Automata, grammars, and derivation

Seating

One person per table
Odd rows (from front, 1st, 3rd, 5th,...): Sit left
Even rows (from front, 2nd, 4th,...): Sit right
Do not leave tables at the front empty

1 　☑口口　☑口口　☑口口　☑口口　☑口口　☑口口
2 　口口☑　口口☑　口口☑　口口☑　口口☑　口口☑
3 　☑口口　☑口口　☑口口　☑口口　☑口口　☑口口
4 　口口☑　口口☑　口口☑　口口☑　口口☑　口口☑
5 　☑口口　☑口口　☑口口　☑口口　☑口口　☑口口
6 　口口☑　口口☑　口口☑　口口☑　口口☑　口口☑

Covid Precautions

Every morning, measure your body temperature
If you have an elevated temperature (above 37.5°), contact the health center
Observe social distance
Always wear a mask (correctly!)
Regularly wash/disinfect your hands thoroughly
Eat/drink quietly, alone
Get your third vaccination

Example Answers for Homework

Problem: For the one-line C program fragment below, based on the examples given in this lecture, write down:

the result of lexical analysis
the result of parsing
the output of the compiler (in assembly language; comments are not needed; use SUB for subtraction, and DIV for division)

grade = english - absent * 5 + math / 3;

Output of lexical analysis: [removed]
Output of parsing:
Compiler output (other solutions possible):

[removed]

Course Contents

	Theory	Compilers	Other applications
Front end	language theory, automata	lexical analysis, parsing	regular expressions, text/data formats
Back end		optimization, code generation

Importance of Formal Language Theory

Developed for natural languages
Model for data formats and programming languages
Model for computation and recognition

Basic Concept: Word

A word is composed of symbols following some rules
A word is a sequence of symbols
Example 1: Words such as a, abc, aaabbb, and abcba can be created using symbols a, b, and c
Example 2: ❄☀☔ and ☀☀☀ are words that can be created with the symbols ❄, ☀, and ☔
The empty word (ε) is also a word

Definition of Word

Words and languages are defined over a finite set of symbols (or letters) Σ
Σ is called the alphabet (example: Σ = {a, b, c})
A word over Σ is a sequence of 0 or more symbols from Σ
The number of symbols in a word is called the length of the word
The length of a word w is written |w|
Example: |abcaba| = 6; |❄☀☔| = 3; |ε| = 0
Symbols are also words, of length 1
(∀s∈Σ: |s|=1; examples: |b|=1, |☀|=1)

Concatenation Operation on Words

A new word can be created by putting two words one after another
This is called concatenation on words
The concatenation operation is represented without an explicit symbol
(similar to multiplication in high school)
Example: The concatenation of words w and v
is written wv
Application example 1: w = abc, v = cba
⇒ wv = abccba
Application example 2: t = ❄☀, z = ☀☔
⇒ zt = ☀☔❄☀
The concatenation of a word (or symbol) with itself is written using an exponent:
w² = ww = abcabc, a⁵ = aaaaa, a¹ = a, w⁰ = ε, ...
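
As a minimal sketch, words can be modelled as Python strings (a choice made here for illustration, not something the lecture prescribes). Concatenation then corresponds to `+` and the exponent notation to string repetition. The names `EPSILON`, `concat`, and `power` are hypothetical helpers.

```python
# Words modelled as Python strings; the empty word ε is the empty string.
EPSILON = ""

def concat(w: str, v: str) -> str:
    """Concatenation wv: the symbols of w followed by the symbols of v."""
    return w + v

def power(w: str, n: int) -> str:
    """w^n: w concatenated with itself n times; w^0 is the empty word."""
    return w * n

w, v = "abc", "cba"
print(concat(w, v))             # abccba
print(power(w, 2))              # abcabc
print(power("a", 5))            # aaaaa
print(power(w, 0) == EPSILON)   # True
```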

Properties of Concatenation

Associativity: For any words w, v, and u:
(wv)u = w(vu)
Neutral element: ε (εw = w = wε)
Commutativity does not hold: in general, wv ≠ vw
(example: abccba ≠ cbaabc)
The length of a concatenation
is the sum of the lengths of its operands:
|wv| = |w| + |v|
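
The properties above can be spot-checked on a small sample of words. This is only a sketch with words modelled as Python strings; a finite check like this is an illustration, not a proof. The helper `words_up_to` is a hypothetical name.

```python
from itertools import product

def words_up_to(alphabet: str, max_len: int) -> list[str]:
    """All words over the alphabet with length 0..max_len."""
    return ["".join(p)
            for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)]

sample = words_up_to("abc", 2)   # 1 + 3 + 9 = 13 words

# Associativity: (wv)u = w(vu)
assert all((w + v) + u == w + (v + u)
           for w in sample for v in sample for u in sample)

# Neutral element: εw = w = wε
assert all("" + w == w == w + "" for w in sample)

# Length of a concatenation: |wv| = |w| + |v|
assert all(len(w + v) == len(w) + len(v)
           for w in sample for v in sample)

# Commutativity fails: one counterexample is enough
w, v = "abc", "cba"
print(w + v, v + w, w + v != v + w)   # abccba cbaabc True
```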

Definition of Language

A language over Σ is a set of words over Σ

Examples of languages over Σ = {a, b, c}:

Empty set: {}
Set containing only the empty word: {ε}
Σ (set of words of length 1 over Σ): {a,b,c}
Set of all possible words of length 3 (over Σ)
(size of set: |Σ|³ = 3³ = 27)
Set of all possible words of length n (over Σ)
(size of set: |Σ|ⁿ)
Set of all possible words over Σ
Set of all possible words over Σ where the number of a's is odd and the number of c's is even
Set of (all) words (over Σ) starting with a
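
Some of these examples can be sketched in Python, assuming that finite languages are modelled as sets of strings and infinite languages as membership predicates. Both modelling choices and all names below are illustrative assumptions.

```python
from itertools import product

SIGMA = ("a", "b", "c")

def words_of_length(n: int, sigma=SIGMA) -> set[str]:
    """Set of all words of length n over sigma (finite, size |sigma|^n)."""
    return {"".join(p) for p in product(sigma, repeat=n)}

def odd_a_even_c(w: str) -> bool:
    """Membership test for the (infinite) language of words with an
    odd number of a's and an even number of c's."""
    return w.count("a") % 2 == 1 and w.count("c") % 2 == 0

def starts_with_a(w: str) -> bool:
    """Membership test for the (infinite) language of words starting with a."""
    return w.startswith("a")

empty_language = set()       # {}
only_empty_word = {""}       # {ε}; note that this set has one element
length3 = words_of_length(3)

print(len(empty_language), len(only_empty_word))   # 0 1
print(len(length3))                                # 27
print(sorted(w for w in length3 if odd_a_even_c(w)))
```

An infinite language cannot be stored as a set, which is why the last two languages are given as predicates that decide membership for a single word.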

More Examples of Languages

Σ = {a,..., z}, set of keywords
of the programming language C
({auto, break, case, char, const,..., do, double,...})
Σ = {0,..., 9, a,..., f, A,..., F, x,...},
set of integer literals of C
({0, 1, 2, 054, 86400, 0x7F, ...})
Σ = {characters from ASCII or Unicode},
(set of) grammatically correct C programs
Σ = {a,..., z, (, ), ¬, ∧, ∨}, set of all
well-formed formulæ of predicate logic

Even More Examples of Languages

Σ = {Latin letters,...}, set of all English words
Σ = {Latin letters,...}, set of all French words
Σ = {Latin letters,..., space,...}, set of all correct English sentences
Σ = {Kanji, Kana,...}, set of all Japanese words
Σ = {Kanji, Kana,...}, set of all grammatically correct Japanese sentences
Σ = {❄, ☀, ☔}, set of words representing the weather at each of the prefectural government locations for each of the days of next week (size of each word: 7; number of words: 47)

Operations on Languages

Operations on languages are combinations of operations on sets and operations on words.

Set union on languages
Set intersection on languages
Set difference on languages
Concatenation operation for languages:
For languages A and B, their concatenation AB is the set { wv | w∈A, v∈B }
Example: A = { ab, abc }, B = { a, ca },
AB = { aba, abca, abcca } (abca arises twice, as ab·ca and as abc·a, so in general |AB| ≤ |A| · |B|)
As for words, we write L² for LL, L¹ for L,
L⁰ for {ε}, ...
Kleene closure: Concatenating the same language 0 or more times
written L^*; L^* = L⁰ ∪ L¹ ∪ L² ∪ L³ ∪ ... = ⋃_{i=0}^{∞} Lⁱ
Example: L = {a, b}
⇒ L^* = {ε, a, b, aa, ab, ba, bb, aaa, ...}
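
The concatenation, power, and Kleene closure of languages can be sketched as follows, assuming finite languages are modelled as Python sets of strings. Because L^* is infinite in general, the hypothetical helper `kleene_up_to` only enumerates the words up to a given length.

```python
def concat_lang(A: set[str], B: set[str]) -> set[str]:
    """AB = { wv | w in A, v in B }."""
    return {w + v for w in A for v in B}

def power_lang(L: set[str], n: int) -> set[str]:
    """L^n, with L^0 = {ε}."""
    result = {""}
    for _ in range(n):
        result = concat_lang(result, L)
    return result

def kleene_up_to(L: set[str], max_len: int) -> set[str]:
    """All words of L^* whose length is at most max_len."""
    found = {""}        # L^0 = {ε} is always part of L^*
    frontier = {""}
    while frontier:
        # Extend the newest words by one more word of L, keep short new ones.
        frontier = {w for w in concat_lang(frontier, L)
                    if len(w) <= max_len} - found
        found |= frontier
    return found

A, B = {"ab", "abc"}, {"a", "ca"}
print(sorted(concat_lang(A, B)))                  # ['aba', 'abca', 'abcca']
print(sorted(kleene_up_to({"a", "b"}, 2), key=lambda w: (len(w), w)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```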

Terms used for Natural Languages
and Formal Languages

Unit		Smallest Unit	Sequence	Set	Classification
natural language	Japanese	(単)語	文、文書	(自然)言語	(大)語族、語族、語派、語群
	English	word	sentence, text	(natural) language	language macrofamily, family, group,...
formal language	Japanese	記号 (文字など)	語	(形式)言語	言語 (族)
	English	symbol (letter,...)	word	(formal) language	language type,...

Main Problems in Formal Language Theory

How to define languages in a way that is
simple and easy to understand?
How to produce words
from definitions of languages?
How to decide whether some sequence of symbols is a word in some language?
How to implement such decisions easily,
and execute them quickly?
How to associate syntax with semantics?

Languages, Automata, and Grammars

An automaton is a model for a machine that accepts/recognizes/distinguishes words in a given language
A grammar is a set of rules to create (the words of) a language
There are many different types of automata and grammars
These different types have different ranges of languages that can be accepted/generated
Language theory mainly distinguishes four families (types) of languages
For each type of language, there is a corresponding type of automaton and a corresponding type of grammar

Table of Formal Language Types

(Chomsky hierarchy)

Language (Japanese)	Grammar	Type	Language type	Automaton
句構造言語	phrase structure grammar (psg)	0	phrase structure language	Turing machine
文脈依存言語	context-sensitive grammar (csg)	1	context-sensitive language	linear-bounded automaton
文脈自由言語	context-free grammar (cfg)	2	context-free language	push-down automaton
正規言語	regular grammar (rg)	3	regular language	finite state automaton

The Turing machine is a model for computation in general
Context-free languages are used for parsing
Regular languages are used for lexical analysis
There is an ordered subset relationship between these four types
(Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0)

Types of Automata

Automata types are distinguished by the restrictions on their "external memory":

0. The external memory is a tape of unlimited length: Turing machine

1. The external memory is a tape of limited length: linear-bounded automaton

2. The external memory is a stack (only the top can be accessed): push-down automaton

3. There is no external memory: finite state automaton
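
To make the last case concrete, here is a minimal sketch of a finite state automaton in Python. It accepts the language over Σ = {a, b, c} mentioned earlier (odd number of a's, even number of c's). The state names and the table layout are assumptions made for this illustration; the point is that the only thing remembered between symbols is the current state.

```python
# State names: first letter = parity of a's, second = parity of c's
# (e = even, o = odd). These names are illustrative.
START = "ee"
ACCEPTING = {"oe"}            # odd number of a's, even number of c's
TRANSITIONS = {
    ("ee", "a"): "oe", ("ee", "b"): "ee", ("ee", "c"): "eo",
    ("oe", "a"): "ee", ("oe", "b"): "oe", ("oe", "c"): "oo",
    ("eo", "a"): "oo", ("eo", "b"): "eo", ("eo", "c"): "ee",
    ("oo", "a"): "eo", ("oo", "b"): "oo", ("oo", "c"): "oe",
}

def accepts(word: str) -> bool:
    """Run the automaton over the word, one symbol at a time."""
    state = START
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return False      # symbol is not in the alphabet
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

for w in ["a", "", "abcc", "ac", "aaa"]:
    print(repr(w), accepts(w))
```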

Example of a Grammar for a Formal Language

S, B: nonterminal symbols (upper case)
a, o, y: terminal symbols (lower case)
rewriting rules:
S → a S o

S → B

B → y a
S: start symbol (initial symbol)

Example of derivation of a word from the grammar:

S → a S o → a a S o o → a a B o o → a a y a o o

S ⇒ a a y a o o, other derivations: a y a o, aaayaooo, ...

(single steps in a derivation are written with →, the overall result with ⇒)
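
The derivation above can be replayed mechanically. The sketch below assumes sentential forms are written as Python strings without spaces; `derive` is a hypothetical helper that applies S → a S o exactly n times, then S → B, then B → y a.

```python
def derive(n: int) -> list[str]:
    """Return all sentential forms of the derivation that uses
    S → aSo n times, followed by S → B and B → ya."""
    form = "S"
    steps = [form]
    for _ in range(n):
        form = form.replace("S", "aSo", 1)   # rule S → a S o
        steps.append(form)
    form = form.replace("S", "B", 1)         # rule S → B
    steps.append(form)
    form = form.replace("B", "ya", 1)        # rule B → y a
    steps.append(form)
    return steps

print(" → ".join(derive(2)))   # S → aSo → aaSoo → aaBoo → aayaoo
print(derive(1)[-1])           # ayao
print(derive(3)[-1])           # aaayaooo
```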

Definition of Grammar

The four components defining a grammar:

A finite set of nonterminal symbols N (usually upper case)
A finite set of terminal symbols Σ (usually lower case, N ∩ Σ = {})
A finite set of rewriting rules P (also called production rules)
A start symbol S (S ∈ N, the symbol on the left side of the first rewriting rule if not explicitly specified)

A grammar is a quadruple (N, Σ, P, S)

Rewriting Rule

(also: production rule)

Each rewriting rule is written as α → β
α is called the left-hand side,
β is called the right-hand side
α is a sequence of (nonterminal/terminal) symbols, with at least one nonterminal symbol
β is a sequence of 0 or more
(nonterminal/terminal) symbols
Examples: aD → aDDb, EF → abc, F → Fb,
D → ε
Counterexamples: bc → Dc, ε → b
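
The two conditions on α and β can be checked mechanically. In this sketch, the sets N = {D, E, F} and Σ = {a, b, c} are assumptions chosen to fit the examples above, symbols are single characters, and ε is the empty string. `is_valid_rule` is a hypothetical helper.

```python
NONTERMINALS = {"D", "E", "F"}   # assumed N for the examples
TERMINALS = {"a", "b", "c"}      # assumed Σ for the examples

def is_valid_rule(lhs: str, rhs: str) -> bool:
    """A rule α → β is valid if both sides use only known symbols
    and α contains at least one nonterminal symbol."""
    symbols = NONTERMINALS | TERMINALS
    only_known_symbols = all(s in symbols for s in lhs + rhs)
    has_nonterminal = any(s in NONTERMINALS for s in lhs)
    return only_known_symbols and has_nonterminal

examples = [("aD", "aDDb"), ("EF", "abc"), ("F", "Fb"), ("D", "")]
counterexamples = [("bc", "Dc"), ("", "b")]

for lhs, rhs in examples + counterexamples:
    print(f"{lhs or 'ε'} → {rhs or 'ε'}: {is_valid_rule(lhs, rhs)}")
```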

How to Apply Rewriting Rules

In the current sequence of (non)terminals,
find a subsequence that matches
the left-hand side of a rewriting rule
Replace this subsequence
with the right-hand side of the rewriting rule
If more than one rewriting rule can be applied, select one
(different choices may produce different words)
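
One rewriting step can be sketched as a function that returns every sentential form reachable by applying one rule at one matching position. Sentential forms are modelled as strings and rules as (left-hand side, right-hand side) pairs; the rules used are those of the earlier example grammar, and `single_steps` is a hypothetical name.

```python
RULES = [("S", "aSo"), ("S", "B"), ("B", "ya")]

def single_steps(form: str, rules: list[tuple[str, str]]) -> list[str]:
    """All results of applying one rule at one matching position."""
    results = []
    for lhs, rhs in rules:
        pos = form.find(lhs)
        while pos != -1:
            # Replace this occurrence of the left-hand side only.
            results.append(form[:pos] + rhs + form[pos + len(lhs):])
            pos = form.find(lhs, pos + 1)
    return results

print(single_steps("aSo", RULES))   # ['aaSoo', 'aBo']
print(single_steps("ya", RULES))    # [] (only terminals, nothing to rewrite)
# "SS" cannot be derived in this grammar; it is used here only to show
# that one rule may match at several positions.
print(single_steps("SS", [("S", "B")]))   # ['BS', 'SB']
```

Selecting one element of the returned list corresponds to the choice mentioned in the last point above.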

Derivation

Process of creating words from a grammar
Start from the start symbol
Repeatedly apply a rewriting rule
When the sequence contains only terminal symbols, the derivation is complete
→ The result is a word of the language
defined by this grammar
If there are still some nonterminal symbols,
but rewriting is impossible, the derivation fails
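
The derivation process can be sketched as a breadth-first search over sentential forms, here for the earlier example grammar (S → a S o, S → B, B → y a). The length bound `max_len` and all names are assumptions of this sketch. Pruning by length is only safe because none of these rules shortens a sentential form.

```python
from collections import deque

RULES = [("S", "aSo"), ("S", "B"), ("B", "ya")]
NONTERMINALS = {"S", "B"}

def derive_words(start: str, rules, nonterminals, max_len: int) -> set[str]:
    """Words (of length <= max_len) derivable from the start symbol."""
    words = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if not any(s in nonterminals for s in form):
            words.add(form)          # only terminals left: derivation complete
            continue
        # Try every rule at every matching position. If nothing matches,
        # this form is silently dropped: the derivation has failed.
        for lhs, rhs in rules:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return words

print(sorted(derive_words("S", RULES, NONTERMINALS, 6), key=len))
# ['ya', 'ayao', 'aayaoo']
```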

Example of Grammar and Derivation

Grammar:

1. S → dcd
2. S → dHRd
3. R → GHRd
4. R → GHd
5. HG → GH
6. dG → dd
7. Hd → cd
8. Hc → cc

Example of derivation:
S →₂ dHRd →₄ dHGHdd →₅ dGHHdd →₇ dGHcdd →₈ dGccdd →₆ ddccdd

(numbers indicate the rewriting rule that is applied, the underlined parts indicate where the rules are applied)
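
The example derivation can be checked step by step. This sketch assumes that the rules are numbered 1 to 8 in the order in which they are listed, which is consistent with the subscripts in the derivation. In each step of this particular derivation the left-hand side occurs exactly once, so replacing the first occurrence is sufficient. `apply_rule` is a hypothetical helper.

```python
RULES = {
    1: ("S", "dcd"),
    2: ("S", "dHRd"),
    3: ("R", "GHRd"),
    4: ("R", "GHd"),
    5: ("HG", "GH"),
    6: ("dG", "dd"),
    7: ("Hd", "cd"),
    8: ("Hc", "cc"),
}

def apply_rule(form: str, number: int) -> str:
    """Apply rule `number` to the first occurrence of its left-hand side."""
    lhs, rhs = RULES[number]
    if lhs not in form:
        raise ValueError(f"rule {number} ({lhs} → {rhs}) does not match {form}")
    return form.replace(lhs, rhs, 1)

forms = ["S"]
for number in [2, 4, 5, 7, 8, 6]:    # rule numbers from the derivation
    forms.append(apply_rule(forms[-1], number))

print(" → ".join(forms))
# S → dHRd → dHGHdd → dGHHdd → dGHcdd → dGccdd → ddccdd
```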

Summary of this Lecture

A word over an alphabet Σ is
a sequence of 0 or more symbols from Σ
A language over an alphabet Σ is
a set of words over Σ
A grammar is a set of rewriting rules that makes it possible to produce all words of a language (and only those), starting from a single start symbol
An automaton is a machine that accepts all words of a language (and only those)

Homework Submission

Deadline: April 21, 2022 (Thursday), 18:40

Format: A4 single page (using both sides is okay; NO cover page), easily readable handwriting (NO printouts), name (kanji and kana) and student number at the top right

Where to submit: Box in front of room O-529 (building O, 5th floor)

Homework Problem 1

For the language L = { qt, sq, s },
list the 10 shortest words of L^*.

Additional problem (solution voluntary):
List all words of L^* of length 4.

Homework Problem 2

Using the grammar from the slide "Example of Grammar and Derivation", find 3 words (different from each other and from ddccdd) produced by that grammar.

Give the full derivation for each word
(rule numbers and underlines not needed).

Guess and explain
what language this grammar defines.

Hint: If your guess is not simple,
maybe you have made a mistake in the derivations.

Additional problem (solution voluntary):
Prove or justify your guess.

Homework Problem 3

(no need to submit, but contact me by e-mail if you have any problems)

Install cygwin on your notebook computer (detailed instructions with images).

Make sure that you select/install gcc (gcc-core), flex, bison, diff (diffutils), make, and m4.

If you have an earlier cygwin installation, make sure to check/update.

Homework Returns

No need to wait, you can come to my office in the afternoon or at a later date
Order is by points, decreasing
When your name is called, raise your hand very visibly, then come to the front
Only take your own homework, never any other
Homework without kana will be handed back at the end

Glossary

word: 語
derivation: 導出
classification: 分類
symbol: 記号
empty word: 空語
alphabet: アルファベット
(word/language) over Σ: Σ 上の (語・言語)
concatenation (operation): 連結 (演算)
associativity: 結合性 (結合律が成立つこと)
neutral element: 単位元
commutativity: 可換性
prefectural government (building): 県庁
keyword: 予約語
well-formed formula: 整論理式
Kleene closure: クリーン閉包
rule: 規則
type of language: 言語族
Chomsky hierarchy: チョムスキー階層
phrase structure language: 句構造言語
context-sensitive language: 文脈依存言語
context-free language: 文脈自由言語
regular language: 正規言語
Turing machine: チューリング機械
linear-bounded automaton: 線形束縛オートマトン
push-down automaton: プッシュダウンオートマトン
finite state automaton: 有限オートマトン
external memory: 外部メモリ
nonterminal symbol: 非終端記号
upper case (letter): 大文字
lower case (letter): 小文字
terminal symbol: 終端記号
rewriting rule/production rule: 書き換え規則・生成規則
initial/start symbol: 初期記号・開始記号
quadruple: 四字組
left-hand side: 左辺
right-hand side: 右辺
subsequence: 部分列
