(Importance, Definition, and Classification of Formal Languages)

http://www.sw.it.aoyama.ac.jp/2022/Compiler/lecture2.html

© 2005-22 Martin J. Dürst 青山学院大学

- Last week's homework
- Definitions for formal language theory
- Definitions, operations, and properties for words
- Definitions, operations, and properties for languages

- Automata, grammars, and derivation

- One person per table
- Odd rows (from front, 1st, 3rd, 5th,...): Sit left
- Even rows (from front, 2nd, 4th,...): Sit right
- Do not leave tables at the front empty

1 ☑口口 ☑口口 ☑口口 ☑口口 ☑口口 ☑口口

2 口口☑ 口口☑ 口口☑ 口口☑ 口口☑ 口口☑

3 ☑口口 ☑口口 ☑口口 ☑口口 ☑口口 ☑口口

4 口口☑ 口口☑ 口口☑ 口口☑ 口口☑ 口口☑

5 ☑口口 ☑口口 ☑口口 ☑口口 ☑口口 ☑口口

6 口口☑ 口口☑ 口口☑ 口口☑ 口口☑ 口口☑

- Every morning, measure your body temperature
- If you have increased temperature (above 37.5°), contact the health center
- Observe social distance
- Always wear a mask (correctly!)
- Regularly wash/disinfect your hands thoroughly
- Eat/drink quietly, alone
- Get your third vaccination

Problem: For the one-line C program fragment below, based on the examples given in this lecture, write down:

- the result of lexical analysis
- the result of parsing
- the output of the compiler (in assembly language; comments are not needed; use `SUB` for subtraction and `DIV` for division)

grade = english - absent * 5 + math / 3;

- Output of lexical analysis:
- id("grade"), equal, id("english"), hyphen, id("absent"), asterisk, int(5), plus, id("math"), slash, int(3), semicolon
- Output of parsing:

- Compiler output (other solutions possible):

    LOAD R1, english
    LOAD R2, absent
    CONST R3, 5
    MUL R2, R2, R3
    SUB R1, R1, R2
    LOAD R2, math
    CONST R3, 3
    DIV R2, R2, R3
    ADD R1, R1, R2
    STORE grade, R1
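The lexical analysis step of this solution can be sketched in a few lines of Python. This is only an illustration, not the lecture's own implementation: the token names are taken from the solution above, while `TOKEN_SPEC`, `MASTER`, `tokenize`, and the regular expressions are assumptions made for this sketch.

```python
import re

# Token names follow the solution above; the patterns are assumptions.
TOKEN_SPEC = [
    ("id", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("int", r"[0-9]+"),
    ("equal", r"="),
    ("hyphen", r"-"),
    ("asterisk", r"\*"),
    ("plus", r"\+"),
    ("slash", r"/"),
    ("semicolon", r";"),
    ("skip", r"\s+"),  # white space separates tokens but produces none
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return the tokens of `source`, written in the notation used above."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        kind = match.lastgroup
        text = match.group()
        if kind == "id":
            tokens.append(f'id("{text}")')
        elif kind == "int":
            tokens.append(f"int({text})")
        elif kind != "skip":
            tokens.append(kind)
        pos = match.end()
    return tokens


print(", ".join(tokenize("grade = english - absent * 5 + math / 3;")))
```

The printed token list should match the "Output of lexical analysis" line above. Parsing and code generation are not covered by this sketch.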

| | Theory | Compilers | Other applications |
|---|---|---|---|
| Front end | language theory, automata | lexical analysis, parsing | regular expressions, text/data formats |
| Back end | | optimization, code generation | |

- Developed for natural languages
- Model for data formats and programming languages
- Model for computation and recognition

- A *word* is composed of *symbols* following some rules
- A word is a sequence of symbols
- Example 1: Words such as `a`, `abc`, `aaabbb`, and `abcba` can be created using symbols `a`, `b`, and `c`
- Example 2: ❄☀☔ and ☀☀☀ are words that can be created with the symbols ❄, ☀, and ☔
- The *empty word* (`ε`) is also a word

- A *word* or a *language* is defined using a finite set of symbols (or letters) `Σ`
- `Σ` is called the *alphabet* (example: `Σ` = {`a`, `b`, `c`})
- A word over `Σ` is a sequence of 0 or more symbols from `Σ`
- The number of symbols in a word is called the *length* of the word
- The length of a word `w` is written |`w`|
- Example: |`abcaba`| = 6; |❄☀☔| = 3; |`ε`| = 0
- Symbols are also words, of length 1 (∀ `s` ∈ `Σ`: |`s`| = 1; examples: |`b`| = 1, |☀| = 1)

- A new word can be created by putting two words one after another
- This is called *concatenation* of words
- The concatenation operation is represented without an explicit symbol (similar to multiplication in high school)
- Example: The concatenation of words `w` and `v` is written `wv`
- Application example 1: `w` = `abc`, `v` = `cba` ⇒ `wv` = `abccba`
- Application example 2: `t` = ❄☀, `z` = ☀☔ ⇒ `zt` = ☀☔❄☀

- The concatenation of a word (or symbol) with itself is written using an exponent. For `w` = `abc`: `w`^{2} = `ww` = `abcabc`, `a`^{5} = `aaaaa`, `a`^{1} = `a`, `w`^{0} = `ε`, ...

- Associativity: For any words `w`, `v`, and `u`: (`wv`)`u` = `w`(`vu`)
- Neutral element: `ε` (`εw` = `w` = `wε`)
- Commutativity does not hold in general: `wv` ≠ `vw` (example: `abccba` ≠ `cbaabc`)
- The length of a concatenation is the sum of the lengths of its operands: |`wv`| = |`w`| + |`v`|
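These properties can be checked with ordinary Python strings, which behave like words over the alphabet of characters. The snippet is only a sketch: the lecture writes concatenation without an operator, whereas Python uses `+`, and the names `w`, `v`, `u`, and `epsilon` are chosen here to mirror the notation above.

```python
w = "abc"
v = "cba"
u = "ab"
epsilon = ""  # the empty word

# Concatenation and exponent notation (w^2 = ww, w^0 = empty word)
assert w + v == "abccba"
assert w * 2 == "abcabc"
assert w * 0 == epsilon

# Associativity and neutral element
assert (w + v) + u == w + (v + u)
assert epsilon + w == w == w + epsilon

# Commutativity does not hold for these two words
assert w + v != v + w

# The length of a concatenation is the sum of the lengths
assert len(w + v) == len(w) + len(v)

print("all properties hold for this example")
```

Passing assertions for one example do not prove the properties in general; they only illustrate them.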

A *language* over `Σ` is a *set of
words* over `Σ`

Examples of languages over `Σ` = {`a`, `b`, `c`}:

- Empty set: {}
- Set containing only the empty word: {`ε`}
- `Σ` (set of words of length 1 over `Σ`): {`a`, `b`, `c`}
- Set of all possible words of length 3 (over `Σ`) (size of set: 27)
- Set of all possible words of length `n` (over `Σ`) (size of set: |`Σ`|^{n})
- Set of all possible words over `Σ`
- Set of all possible words over `Σ` where the number of `a`s is odd and the number of `c`s is even
- Set of (all) words (over `Σ`) starting with `a`
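The finite examples in this list can be enumerated directly. The following sketch uses `itertools.product` from the Python standard library; `SIGMA`, `words_of_length`, and `odd_a_even_c` are names made up for this illustration.

```python
from itertools import product

SIGMA = ("a", "b", "c")


def words_of_length(n, alphabet=SIGMA):
    """Return all words of length n over the alphabet, as a list of strings."""
    return ["".join(symbols) for symbols in product(alphabet, repeat=n)]


def odd_a_even_c(word):
    """Membership test: odd number of a's and even number of c's."""
    return word.count("a") % 2 == 1 and word.count("c") % 2 == 0


length_3 = words_of_length(3)
print(len(length_3))  # 27, i.e. |Σ|^3
print(len([w for w in length_3 if w.startswith("a")]))
print(len([w for w in length_3 if odd_a_even_c(w)]))
```

Infinite languages, such as the set of all words over `Σ`, cannot be listed completely this way. They can only be enumerated up to some chosen length, or described by a membership test such as `odd_a_even_c`.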

- `Σ` = {a,..., z}, set of keywords of the programming language C ({`auto`, `break`, `case`, `char`, `const`, ..., `do`, `double`, ...})
- `Σ` = {0,..., 9, a,..., f, A,..., F, x,...}, set of integer literals of C ({`0`, `1`, `2`, `054`, `86400`, `0x7F`, ...})
- `Σ` = {characters from ASCII or Unicode}, (set of) grammatically correct C programs
- `Σ` = {a,..., z, (, ), ¬, ∧, ∨}, set of all well-formed formulæ of predicate logic
- `Σ` = {Latin letters,...}, set of all English words
- `Σ` = {Latin letters,...}, set of all French words
- `Σ` = {Latin letters,..., space,...}, set of all correct English sentences
- `Σ` = {Kanji, Kana,...}, set of all Japanese words
- `Σ` = {Kanji, Kana,...}, set of all grammatically correct Japanese sentences
- `Σ` = {❄, ☀, ☔}, set of words representing the weather at each of the prefectural government locations for each of the days of next week (size of each word: 7; number of words: 47)

Operations on languages are combinations of operations on sets and operations on words.

- Set union on languages
- Set intersection on languages
- Set difference on languages
- Concatenation operation for languages:
  - For languages `A` and `B`, their concatenation `AB` is the set {`wv` | `w` ∈ `A`, `v` ∈ `B`}
  - Example: `A` = {`ab`, `abc`}, `B` = {`a`, `ca`}, `AB` = {`aba`, `abca`, `abcca`}
  - |`AB`| ≤ |`A`| · |`B`| (in the example, `abca` arises both as `ab` followed by `ca` and as `abc` followed by `a`, so |`AB`| = 3 < 4)
  - As for words, we write `L`^{2} for `LL`, `L`^{1} for `L`, `L`^{0} for {`ε`}, ...
- Kleene closure: Concatenating the same language 0 or more times
  - Written `L`^{*}; `L`^{*} = `L`^{0} ∪ `L`^{1} ∪ `L`^{2} ∪ `L`^{3} ∪ ... = ⋃_{i=0}^{∞} `L`^{i}
  - Example: `L` = {`a`, `b`} ⇒ `L`^{*} = {`ε`, `a`, `b`, `aa`, `ab`, `ba`, `bb`, `aaa`, ...}
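Concatenation of languages and the Kleene closure can be tried out with Python sets of strings. This is a sketch under stated assumptions: `concat`, `power`, and `kleene_up_to` are names invented here, and because `L`^{*} is infinite in general, `kleene_up_to` only collects the words up to a given length.

```python
def concat(A, B):
    """Concatenation of languages: every word of A followed by every word of B."""
    return {w + v for w in A for v in B}


def power(L, n):
    """L^n, with L^0 = {empty word}."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def kleene_up_to(L, max_len):
    """All words of L* whose length is at most max_len."""
    found = {""}
    frontier = {""}
    while frontier:
        longer = {w for w in concat(frontier, L) if len(w) <= max_len}
        frontier = longer - found
        found |= frontier
    return found


A = {"ab", "abc"}
B = {"a", "ca"}
print(sorted(concat(A, B)))  # ['aba', 'abca', 'abcca']
print(sorted(kleene_up_to({"a", "b"}, 2), key=lambda w: (len(w), w)))
```

Using sets means that a word produced in two different ways, such as `abca` in the example, is stored only once.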


- How to `define` languages in a way that is simple and easy to understand?
- How to `produce` words from definitions of languages?
- How to `decide` whether some sequence of symbols is a word in some language?
- How to `implement` such decisions easily, and execute them quickly?
- How to `associate` syntax with semantics?

- An *automaton* is a model for a machine that *accepts*/*recognizes*/distinguishes words in a given language
- A *grammar* is a set of rules to create (the words of) a language
- There are many different types of automata and grammars
- These different types have different ranges of languages that can be accepted/generated
- Language theory distinguishes mainly four types of language families
- For each type of language, there is a corresponding type of automaton and a corresponding type of grammar

(Chomsky hierarchy)

| 言語 (language) | Grammar | Type | Language type | Automaton |
|---|---|---|---|---|
| 句構造言語 | phrase structure grammar (psg) | 0 | phrase structure language | Turing machine |
| 文脈依存言語 | context-sensitive grammar (csg) | 1 | context-sensitive language | linear-bounded automaton |
| 文脈自由言語 | context-free grammar (cfg) | 2 | context-free language | push-down automaton |
| 正規言語 | regular grammar (rg) | 3 | regular language | finite state automaton |

- The Turing machine is a model for computation in general
- Context-free languages are used for parsing
- Regular languages are used for lexical analysis
- There is an ordered subset relationship between these four types

(Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0)

Automata types are distinguished by the restrictions on their "external memory":

0. The external memory is a tape of unlimited length: Turing machine

1. The external memory is a tape of limited length: linear-bounded automaton

2. The external memory is a stack (only the top can be accessed): push-down automaton

3. There is no external memory: finite state automaton
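The simplest case, an automaton without external memory, fits in a few lines. The sketch below is a finite state automaton for one of the example languages listed earlier: words over {`a`, `b`, `c`} with an odd number of `a`s and an even number of `c`s. The state names and the functions `step` and `accepts` are choices made for this illustration only.

```python
# A state records (parity of a's, parity of c's) seen so far.
START = ("even_a", "even_c")
ACCEPTING = {("odd_a", "even_c")}
FLIP = {
    "even_a": "odd_a", "odd_a": "even_a",
    "even_c": "odd_c", "odd_c": "even_c",
}


def step(state, symbol):
    """Transition function: next state after reading one symbol."""
    a_parity, c_parity = state
    if symbol == "a":
        return (FLIP[a_parity], c_parity)
    if symbol == "b":
        return state
    if symbol == "c":
        return (a_parity, FLIP[c_parity])
    raise ValueError(f"symbol {symbol!r} is not in the alphabet")


def accepts(word):
    """Run the automaton on the word; only the current state is remembered."""
    state = START
    for symbol in word:
        state = step(state, symbol)
    return state in ACCEPTING


for word in ["a", "abcc", "", "ac", "aa"]:
    print(repr(word), accepts(word))
```

The automaton has four states in total, however long the input word is. This is what "no external memory" amounts to here.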

- `S`, `B`: *nonterminal symbols* (upper case)
- `a`, `o`, `y`: *terminal symbols* (lower case)
- *rewriting rules*: `S` → `aSo`, `S` → `B`, `B` → `ya`
- `S`: *start symbol* (initial symbol)

Example of *derivation* of a word from the grammar:

`S` → `aSo` → `aaSoo` → `aaBoo` → `aayaoo`

`S` ⇒ `aayaoo`; other words that can be derived: `ayao`, `aaayaooo`, ...

(single steps in a derivation are written with →, the overall result with ⇒)
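The derivations of this small grammar (`S` → `aSo`, `S` → `B`, `B` → `ya`) can be explored mechanically. The following is a minimal sketch, assuming one character per symbol; `RULES`, `NONTERMINALS`, and `derive_words` are names introduced here. The length limit is a valid way to prune the search only because no rule of this grammar makes a sequence shorter.

```python
from collections import deque

RULES = [("S", "aSo"), ("S", "B"), ("B", "ya")]
NONTERMINALS = {"S", "B"}


def derive_words(max_len, start="S"):
    """Breadth-first search over derivations; returns all words up to max_len."""
    words = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if not any(symbol in NONTERMINALS for symbol in form):
            words.add(form)  # only terminal symbols left: derivation complete
            continue
        for lhs, rhs in RULES:
            index = form.find(lhs)
            while index != -1:  # try every position where the rule matches
                new_form = form[:index] + rhs + form[index + len(lhs):]
                if len(new_form) <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
                index = form.find(lhs, index + 1)
    return words


print(sorted(derive_words(8), key=len))  # ['ya', 'ayao', 'aayaoo', 'aaayaooo']
```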

The four components defining a grammar:

- A finite set of nonterminal symbols `N` (usually upper case)
- A finite set of terminal symbols `Σ` (usually lower case, `N` ∩ `Σ` = {})
- A finite set of rewriting rules `P` (also called production rules)
- A start symbol `S` (`S` ∈ `N`, the symbol on the left side of the first rewriting rule if not explicitly specified)

A grammar is a quadruple (`N`, `Σ`, `P`, `S`)

(also: production rule)

- Each rewriting rule is written as `α` → `β`
- `α` is called the left-hand side, `β` is called the right-hand side
- `α` is a sequence of (nonterminal/terminal) symbols, with at least one nonterminal symbol
- `β` is a sequence of 0 or more (nonterminal/terminal) symbols
- Examples: `aD` → `aDDb`, `EF` → `abc`, `F` → `Fb`, `D` → `ε`
- Counterexamples: `bc` → `Dc`, `ε` → `b` (in both cases the left-hand side contains no nonterminal symbol)
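The conditions on the two sides of a rule can be written as a small check. This sketch assumes one character per symbol and represents `ε` by the empty string; the function `is_valid_rule` and the sets `N` and `T` are made up for the illustration.

```python
def is_valid_rule(lhs, rhs, nonterminals, terminals):
    """Check the form of a rewriting rule lhs -> rhs.

    The left-hand side needs at least one nonterminal symbol;
    the right-hand side may be any sequence, including the empty one.
    """
    symbols = nonterminals | terminals
    only_known_symbols = all(s in symbols for s in lhs + rhs)
    has_nonterminal = any(s in nonterminals for s in lhs)
    return only_known_symbols and has_nonterminal


N = {"D", "E", "F"}  # nonterminal symbols used in the examples
T = {"a", "b", "c"}  # terminal symbols used in the examples

for lhs, rhs in [("aD", "aDDb"), ("EF", "abc"), ("F", "Fb"), ("D", ""),
                 ("bc", "Dc"), ("", "b")]:
    print(f"{lhs or 'ε'} → {rhs or 'ε'}:", is_valid_rule(lhs, rhs, N, T))
```

The four examples pass the check and the two counterexamples fail it.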

- In the current sequence of (non)terminals, find a subsequence that matches the left-hand side of a rewriting rule
- Replace this subsequence with the right-hand side of the rewriting rule
- If more than one rewriting rule can be applied, select one (different choices may produce different words)

- Process of creating words from a grammar
- Start from the start symbol
- Repeatedly apply a rewriting rule
- When the sequence contains only terminal symbols, the derivation is complete → The result is a word of the language defined by this grammar
- If there are still some nonterminal symbols, but rewriting is impossible, the derivation fails

Grammar:

1. `S` → `dcd`
2. `S` → `dHRd`
3. `R` → `GHRd`
4. `R` → `GHd`
5. `HG` → `GH`
6. `dG` → `dd`
7. `Hd` → `cd`
8. `Hc` → `cc`

Example of derivation:

`S` →_{2} `dHRd` →_{4} `dHGHdd` →_{5} `dGHHdd` →_{7} `dGHcdd` →_{8} `dGccdd` →_{6} `ddccdd`

(numbers indicate the rewriting rule that is applied, the underlined parts indicate where the rules are applied)
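The example derivation can be replayed step by step. In this sketch the rules are numbered in the order in which the grammar lists them, and each rule is applied at the leftmost matching position, which is enough for this particular derivation; `RULES`, `apply_rule`, and `replay` are illustrative names.

```python
# Rules numbered in the order in which the grammar above lists them.
RULES = {
    1: ("S", "dcd"),
    2: ("S", "dHRd"),
    3: ("R", "GHRd"),
    4: ("R", "GHd"),
    5: ("HG", "GH"),
    6: ("dG", "dd"),
    7: ("Hd", "cd"),
    8: ("Hc", "cc"),
}


def apply_rule(form, number):
    """Apply rule `number` at the leftmost position where its left-hand side matches."""
    lhs, rhs = RULES[number]
    index = form.find(lhs)
    if index == -1:
        raise ValueError(f"rule {number} ({lhs} → {rhs}) does not apply to {form}")
    return form[:index] + rhs + form[index + len(lhs):]


def replay(rule_numbers, start="S"):
    """Return every intermediate sequence of a derivation, start symbol included."""
    forms = [start]
    for number in rule_numbers:
        forms.append(apply_rule(forms[-1], number))
    return forms


print(" → ".join(replay([2, 4, 5, 7, 8, 6])))
# S → dHRd → dHGHdd → dGHHdd → dGHcdd → dGccdd → ddccdd
```

A derivation that needs a rule applied at a position other than the leftmost match would require an extra position argument, which this sketch leaves out.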

- A word over an alphabet `Σ` is a sequence of 0 or more symbols from `Σ`
- A *language* over an alphabet `Σ` is a *set of words* over `Σ`
- A `grammar` is a set of `rewriting rules` that produce all words of a language (and only those), starting from a single `start symbol`
- An `automaton` is a machine that `accepts` all words of a language (and only those)

Deadline: April 21, 2022 (Thursday), 18:40

Format: A4 single page (using both sides is okay; NO cover page), easily readable handwriting (NO printouts), name (kanji and kana) and student number at the top right

Where to submit: Box in front of room O-529 (building O, 5th floor)

For the language `L` = {`qt`, `sq`, `s`}, list the 10 shortest words of `L`^{*}.

Additional problem (solution voluntary):

List all words of `L`^{*} of length 4.

Using the grammar from the slide "Example of Grammar and Derivation", find 3 words (different from each other and from `ddccdd`) produced by that grammar.

Give the full derivation for each word (rule numbers and underlines not needed).

Guess and explain what language this grammar defines.

Hint: If your guess is not simple, maybe you have made a mistake in the derivations.

Additional problem (solution voluntary):

Prove or justify your guess.

(no need to submit, but contact me by e-mail if you have any problems)

Install cygwin on your notebook computer (detailed instructions with images).

Make sure that you select/install **gcc (gcc-core)**, **flex**, **bison**, **diff (diffutils)**, **make**, and **m4**.

If you have an earlier cygwin installation, make sure to check/update.

- No need to wait, you can come to my office in the afternoon or at a later date
- Order is by points, decreasing
- When your name is called, raise your hand very visibly, then come to the front
- Only take your own homework, never any other
- Homeworks without kana will be handed back at the end

- word
- 語
- derivation
- 導出
- classification
- 分類
- symbol
- 記号
- empty word
- 空語
- alphabet
- アルファベット
- (word/language) over `Σ`
- `Σ`上の (語・言語)
- concatenation (operation)
- 連結 (演算)
- associativity
- 結合性 (結合律が成立つこと)
- neutral element
- 単位元
- commutativity
- 可換性
- prefectural government (building)
- 県庁
- keyword
- 予約語
- well-formed formula
- 整論理式
- Kleene closure
- クリーン閉包
- rule
- 規則
- type of language
- 言語族
- Chomsky hierarchy
- チョムスキー階層
- phrase structure language
- 句構造言語
- context-sensitive language
- 文脈依存言語
- context-free language
- 文脈自由言語
- regular language
- 正規言語
- Turing machine
- チューリング機械
- linear-bounded automaton
- 線形束縛オートマトン
- push-down automaton
- プッシュダウンオートマトン
- finite state automaton
- 有限オートマトン
- external memory
- 外部メモリ
- nonterminal symbol
- 非終端記号
- upper case (letter)
- 大文字
- lower case (letter)
- 小文字
- terminal symbol
- 終端記号
- rewriting rule/production rule
- 書き換え規則・生成規則
- initial/start symbol
- 初期記号・開始記号
- quadruple
- 四字組
- left-hand side
- 左辺
- right-hand side
- 右辺
- subsequence
- 部分列