# Importance, Definition, and Classification of Formal Languages

(形式言語の重要性、定義、分類)

## Language Theory and Compiler

### Martin J. Dürst

© 2006-17 Martin J. Dürst 青山学院大学

# Today's Schedule

• Last week's homework
• Definitions for formal language theory
• Definitions, operations, and properties for words
• Definitions, operations, and properties for languages
• Automata, grammars, and derivation

# Course Contents

| | Theory | Compilers | Other applications |
|---|---|---|---|
| Front end | language theory, automata | lexical analysis, parsing | regular expressions, text/data formats |
| Back end | | optimization, code generation | |

# Importance of Formal Language Theory

• Model for data formats and programming languages
• Model for computation and recognition

# Basic Terms

Terms for formal languages:

• A word is composed of symbols following some rules
• A word is a string/sequence of symbols
• Example: Words such as a, abc, aaabbb, and abcba can be created using symbols a, b, and c
• The empty word (ε) is also a word

# Definition of Word

• Words and languages are defined over a finite set of symbols (or letters) Σ
• Σ is called the alphabet (example: Σ = {a, b, c})
• A word over Σ is a sequence of 0 or more symbols from Σ
• The number of symbols in a word is called the length of a word
• The length of a word w is written |w|
• Example: |abcaba| = 6; |ε| = 0
• Symbols are also words, of length 1
(∀s∈Σ: |s|=1; example: |b|=1)

# Concatenation Operation for Words

• A new word can be created by lining up two words after each other
• This is called the concatenation operation on words
• The concatenation operation is represented without an explicit symbol
(similar to multiplication in high school)
Example: The concatenation of words w and v is written wv
• Application example: w = abc, v = cba ⇒ wv = abccba
• The concatenation of a word (or symbol) with itself is written using an exponent: w² = ww = abcabc, a⁵ = aaaaa, ...

# Properties of Concatenation

• Associativity: For any words w, v, and u: (wv)u = w(vu)
• The neutral element is ε: wε = εw = w
• Commutativity does not hold: in general wv ≠ vw (example: abccba ≠ cbaabc)
• The length of a concatenation is the sum of the lengths of its operands: |wv| = |w| + |v|
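
The definitions and properties above can be checked mechanically. The following is a minimal sketch that assumes words are modelled as Python strings (so concatenation is `+`, the exponent is `*`, and ε is the empty string); the names `EPSILON`, `w`, `v`, and `u` are chosen for illustration only.

```python
# Assumption of this sketch: a word over Σ = {a, b, c} is a Python string.
EPSILON = ""  # the empty word ε

w, v, u = "abc", "cba", "b"

# Concatenation has no explicit symbol on the slides; in Python it is +.
assert w + v == "abccba"

# Exponent notation: w² = ww, a⁵ = aaaaa.
assert w * 2 == "abcabc"
assert "a" * 5 == "aaaaa"

# Length: |abcaba| = 6, |ε| = 0.
assert len("abcaba") == 6
assert len(EPSILON) == 0

# Associativity: (wv)u = w(vu)
assert (w + v) + u == w + (v + u)

# Neutral element: wε = εw = w
assert w + EPSILON == w
assert EPSILON + w == w

# Commutativity does not hold for these two words: wv ≠ vw
assert w + v != v + w

# Length of a concatenation: |wv| = |w| + |v|
assert len(w + v) == len(w) + len(v)
```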

# Definition of Language

A language over Σ is a set of words over Σ

Examples of languages over Σ = {a, b, c}:

• Empty set: {}
• Set containing only the empty word: {ε}
• Σ (set of words of length 1 over Σ): {a,b,c}
• Set of all possible words of length 3 (over Σ) (size of set: 27)
• Set of all possible words of length n (over Σ) (size of set: |Σ|ⁿ)
• Set of all possible words over Σ
• Set of (all) words (over Σ) starting with a
• Set of words representing the weather at each of the prefectural government locations for each of the days of next week, if a stands for sunny, b stands for cloudy, and c stands for rainy/snowy (size of each word: 7; number of words: 47)
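
Some of the finite examples above can be enumerated directly. This is a sketch only: it assumes words are Python strings and languages are Python sets, and the helper names `words_of_length` and `starts_with_a` are invented for illustration. Infinite languages (such as the set of all words over Σ) cannot be stored as a set, so one of them is represented here by a membership test instead.

```python
from itertools import product

SIGMA = ("a", "b", "c")  # the alphabet Σ from the examples


def words_of_length(n, sigma=SIGMA):
    """Return the set of all words of length n over sigma."""
    return {"".join(symbols) for symbols in product(sigma, repeat=n)}


def starts_with_a(word):
    """Membership test for the (infinite) language of words starting with a."""
    return word.startswith("a")


print(len(words_of_length(3)))  # 27, i.e. |Σ|³
print(words_of_length(0))       # {''}, the set containing only the empty word
```

Filtering a finite language with the membership test gives the intersection of the two languages, for example `{w for w in words_of_length(3) if starts_with_a(w)}`.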

# More Examples of Languages

• Σ = {a,..., z}, set of keywords of the programming language C
• Σ = {0,..., 9, a,..., f, A,..., F, x,...}, set of integer literals of C
• Σ = {characters from ASCII or Unicode}, (set of) grammatically correct C programs
• Σ = {a,..., z, (, ), ¬, ∧, ∨}, set of all well-formed formulæ of predicate logic
• Σ = {Latin letters,...}, set of all English words
• Σ = {Latin letters,...}, set of all French words
• Σ = {Kanji, Kana,...}, set of all Japanese words
• Σ = {Kanji, Kana,...}, set of all grammatically correct Japanese sentences

# Operations on Languages

Operations on languages are combinations of operations on sets and operations on words.

1. Set union of languages
2. Set intersection of languages
3. Set difference of languages
4. Concatenation operation for languages: For languages A and B, their concatenation AB is the set { wv | w ∈ A, v ∈ B }

As for words, we write L² for LL, ...

5. Kleene closure: Concatenating the same language 0 or more times

written L*; L* = L⁰ ∪ L¹ ∪ L² ∪ ... (the union of Lⁱ for all i ≥ 0, where L⁰ = {ε})

Example: L = {a, b} => L* = {ε, a, b, aa, ab, ba, bb, aaa, ...}
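
The language operations can be sketched with Python sets. Union, intersection, and difference are the built-in `|`, `&`, and `-`; the functions `concat`, `power`, and `kleene_up_to` below are hypothetical helpers written for this illustration. Because L* is usually infinite, the sketch only computes the words of L* up to a given length.

```python
def concat(A, B):
    """Concatenation of languages: AB = { wv | w ∈ A, v ∈ B }."""
    return {w + v for w in A for v in B}


def power(L, n):
    """L to the n-th power: L concatenated with itself n times (n = 0 gives {ε})."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def kleene_up_to(L, max_len):
    """Return all words of L* whose length is at most max_len."""
    result = {""}      # ε is always in L*
    frontier = {""}    # words found in the previous round
    while frontier:
        longer = {w for w in concat(frontier, L) if len(w) <= max_len}
        frontier = longer - result   # keep only words not seen before
        result |= frontier
    return result


L = {"a", "b"}
print(sorted(kleene_up_to(L, 2), key=lambda w: (len(w), w)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The loop stops once a round adds no new words, which is guaranteed because only finitely many words have length at most `max_len`.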

# Main Problems in Formal Language Theory

• How can languages be defined in a way that is simple and easy to understand?
• How can words be produced from definitions of languages?
• How can we decide whether some sequence of symbols is a word in some language?
• How can such decisions be implemented easily and executed quickly?
• How can we associate syntax with semantics?

# Languages and Automata and Grammars

• An automaton is a model for a machine that accepts/recognizes/distinguishes words in a given language
• A grammar is a set of rules to create (the words of) a language
• There are many different types of automata and grammars
• These different types have different ranges of languages that can be accepted/generated
• Language theory distinguishes mainly four types of languages
• There is an ordered subset relationship between these four types
• For each type of language, there is a corresponding type of automaton and a corresponding type of grammar

# Table of Formal Language Types

(Chomsky hierarchy)

| 文法 | Grammar | Type | Language type | Automaton |
|---|---|---|---|---|
| 句構造文法 | phrase structure grammar (psg) | 0 | phrase structure language | Turing machine |
| 文脈依存文法 | context-sensitive grammar (csg) | 1 | context-sensitive language | linear-bounded automaton |
| 文脈自由文法 | context-free grammar (cfg) | 2 | context-free language | push-down automaton |
| 正規文法 | regular grammar (rg) | 3 | regular language | finite state automaton |

• The Turing machine is a model for computation in general
• Context-free languages are used for parsing
• Regular languages are used for lexical analysis

# Types of Automata

Automata types are distinguished by the restrictions on their "external memory":

0. The external memory is a tape of unlimited length: Turing machine

1. The external memory is a tape of limited length (bounded by a linear function of the length of the input): linear-bounded automaton

2. The external memory is a stack where only the top can be accessed: push-down automaton

3. There is no external memory: finite state automaton
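
As an illustration of the simplest case (no external memory), here is a sketch of a finite state automaton that accepts the language "words over {a, b, c} starting with a". The state names `start`, `yes`, and `no` and the table layout are assumptions made for this example, not notation from the lecture. The only thing the machine remembers between symbols is its current state.

```python
# Transition table: DELTA[state][symbol] gives the next state.
DELTA = {
    "start": {"a": "yes", "b": "no", "c": "no"},
    "yes":   {"a": "yes", "b": "yes", "c": "yes"},
    "no":    {"a": "no", "b": "no", "c": "no"},
}
ACCEPTING = {"yes"}


def accepts(word):
    """Run the automaton over word and report whether it ends in an accepting state."""
    state = "start"
    for symbol in word:
        state = DELTA[state][symbol]
    return state in ACCEPTING


print(accepts("abc"))  # True
print(accepts("bca"))  # False
print(accepts(""))     # False: the empty word does not start with a
```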

# Example of a Grammar for a Formal Language

• S, A: nonterminal symbols (upper case)
• a, o, y: terminal symbols (lower case)
• rewriting rules:

S → a S o

S → A

A → y a

• S: start symbol (initial symbol)

Example of derivation of a word from the grammar:

S → a S o → a a S o o → a a A o o → a a y a o o

S ⇒ a a y a o o

(single steps in a derivation are written with →, the overall result with ⇒)
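
The derivation shown above can be replayed in a few lines. This sketch assumes that sequences of symbols are Python strings with one character per symbol; the function name `derive` is hypothetical. It applies the first rule n times, then the second rule, then the third, and records every intermediate sequence.

```python
def derive(n):
    """Apply S → a S o n times, then S → A, then A → y a; return all steps."""
    current = "S"
    steps = [current]
    for _ in range(n):
        current = current.replace("S", "aSo")  # rule S → a S o
        steps.append(current)
    current = current.replace("S", "A")        # rule S → A
    steps.append(current)
    current = current.replace("A", "ya")       # rule A → y a
    steps.append(current)
    return steps


print(" → ".join(derive(2)))
# S → aSo → aaSoo → aaAoo → aayaoo
```

Every word obtained this way consists of n copies of a, then ya, then n copies of o.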

# Definition of Grammar

• A finite set of nonterminal symbols N (usually upper case)
• A finite set of terminal symbols Σ (usually lower case, N ∩ Σ = {})
• A finite set of rewriting rules P (also called production rules)
• A start symbol S (S ∈ N, the symbol on the left side of the first rewriting rule if not explicitly specified)

A grammar is defined as a quadruple (N, Σ, P, S)

# Rewriting Rule

(also: production rule)

• Each rewriting rule is written as α → β
• α is called the left-hand side, β is called the right-hand side
• α is a sequence of (nonterminal/terminal) symbols, with at least one nonterminal symbol
• β is a sequence of 0 or more (nonterminal/terminal) symbols
• Counterexamples: bc → Dc, ε → b
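
The quadruple (N, Σ, P, S) and the conditions on rewriting rules can be written down as a small data structure. This is one possible sketch, not a standard representation: the class name `Grammar`, its field names, and the `check` method are invented here, and every symbol is assumed to be a single character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    rules: tuple             # P, as (left-hand side, right-hand side) string pairs
    start: str               # S

    def check(self):
        """Return a list of violated conditions (empty if the grammar is well formed)."""
        problems = []
        if self.nonterminals & self.terminals:
            problems.append("N and Σ are not disjoint")
        if self.start not in self.nonterminals:
            problems.append("start symbol is not in N")
        symbols = self.nonterminals | self.terminals
        for lhs, rhs in self.rules:
            if not all(s in symbols for s in lhs + rhs):
                problems.append(f"unknown symbol in {lhs!r} → {rhs!r}")
            if not any(s in self.nonterminals for s in lhs):
                problems.append(f"no nonterminal on the left-hand side of {lhs!r} → {rhs!r}")
        return problems


good = Grammar(frozenset("SA"), frozenset("aoy"),
               (("S", "aSo"), ("S", "A"), ("A", "ya")), "S")
bad = Grammar(frozenset("D"), frozenset("bc"),
              (("bc", "Dc"), ("", "b")), "D")

print(good.check())       # []
print(len(bad.check()))   # 2: both counterexample rules are rejected
```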

# Derivation


• Process of creating words from a grammar
• Starting from the start symbol
• In each derivation step, one rewriting rule is applied as follows:
  • In the current sequence of (non)terminal symbols, find a subsequence that is equal to the left-hand side of a rewriting rule
  • Replace the subsequence with the right-hand side of the rewriting rule
• If more than one rewriting rule can be applied, choose one
(different choices may produce different words)
• When the sequence contains only terminal symbols, the derivation is complete
→ The result is a word of the language defined by this grammar
• If there are still some nonterminal symbols, but there is no matching (left-hand side of a) production rule, the derivation fails
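
The derivation process described above can be sketched as a search that tries every possible choice of rule and position. The code below is an illustration under stated assumptions: symbols are single characters, the rules are those of the earlier example grammar (S → a S o, S → A, A → y a), and the names `successors`, `is_word`, and `words_up_to` are hypothetical. Cutting off sequences longer than `max_len` is only safe because none of these rules makes a sequence shorter.

```python
from collections import deque

RULES = [("S", "aSo"), ("S", "A"), ("A", "ya")]
NONTERMINALS = {"S", "A"}


def successors(form, rules=RULES):
    """Return every sequence reachable from form by one derivation step."""
    result = set()
    for lhs, rhs in rules:
        start = form.find(lhs)
        while start != -1:
            # Replace this occurrence of the left-hand side with the right-hand side.
            result.add(form[:start] + rhs + form[start + len(lhs):])
            start = form.find(lhs, start + 1)
    return result


def is_word(form):
    """A derivation is complete when only terminal symbols remain."""
    return not any(symbol in NONTERMINALS for symbol in form)


def words_up_to(max_len, start="S"):
    """Collect all words of length at most max_len derivable from the start symbol."""
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        form = queue.popleft()
        if is_word(form):
            words.add(form)
            continue
        for nxt in successors(form):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words


print(sorted(words_up_to(6), key=len))  # ['ya', 'ayao', 'aayaoo']
```

A sequence that still contains nonterminals but has no successors corresponds to a failed derivation; it is simply dropped by the search.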

# Example of Grammar and Derivation

Grammar:

1. S → aba
3. T → CDTa
4. T → CDa
5. DC → CD
6. aC → aa
7. Da → ba
8. Db → bb

Example of derivation:

(numbers indicate the rewriting rule that is applied, the underlined parts indicate where the rules are applied; not necessary (e.g. for homework))

# Types of Grammars

Grammar types are distinguished by restrictions on rewriting rules:

0. No restrictions: Phrase structure grammar, (Chomsky) type 0 grammar

1. αAβ → αγβ, where α and β are sequences of 0 or more (non)terminals, and γ is a sequence of 1 or more (non)terminals:
Context-sensitive grammar, (Chomsky) type 1 grammar

2. A → γ, where γ is a sequence of 1 or more (non)terminals:
Context-free grammar, (Chomsky) type 2 grammar

3. A → aB or A → a (alternative: A → Ba or A → a):
Regular grammar, (Chomsky) type 3 grammar

(for all types, S → ε is also allowed)
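
The restrictions above can be turned into a simple check on the shape of each rule. The following sketch makes several simplifying assumptions that should be kept in mind: symbols are single characters, any symbol not in the given nonterminal set is treated as a terminal, only the A → aB / A → a form of regular rules is recognized (not the alternative A → Ba), and the S → ε exception is not handled. The function names `rule_type` and `grammar_type` are hypothetical.

```python
def rule_type(lhs, rhs, nonterminals):
    """Return the highest type number (3, 2, 1, or 0) whose rule shape lhs → rhs matches."""
    if len(lhs) == 1 and lhs in nonterminals and len(rhs) >= 1:
        # A → a, or A → aB
        if len(rhs) == 1 and rhs not in nonterminals:
            return 3
        if len(rhs) == 2 and rhs[0] not in nonterminals and rhs[1] in nonterminals:
            return 3
        return 2  # A → γ with |γ| ≥ 1
    # Type 1: look for a split lhs = αAβ such that rhs = αγβ with |γ| ≥ 1.
    for i, symbol in enumerate(lhs):
        if symbol in nonterminals:
            alpha, beta = lhs[:i], lhs[i + 1:]
            if len(rhs) >= len(lhs) and rhs.startswith(alpha) and rhs.endswith(beta):
                return 1
    return 0


def grammar_type(rules, nonterminals):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(lhs, rhs, nonterminals) for lhs, rhs in rules)


N = {"S", "A"}
print(grammar_type([("S", "aSo"), ("S", "A"), ("A", "ya")], N))  # 2
print(grammar_type([("S", "aS"), ("S", "a")], N))                # 3
```

Note that the check looks only at the form of the rules. A language may still belong to a more restricted type if some other grammar with more restricted rules generates it.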

# Homework

Deadline: April 20, 2017 (Thursday), 19:00

Where to submit: Box in front of room O-529 (building O, 5th floor)

Format: A4 single page (using both sides is okay; NO cover page), easily readable handwriting (NO printouts), name (kanji and kana) and student number at the top right

1. For the language L = { a, cb, ac }, list the 10 shortest words of L*
Additional problem (solution voluntary): List all words of L* of length 4
2. Using the grammar from the slide "Example of Grammar and Derivation", find 3 words (different from each other and from aabbaa). Give the full derivation for each word (rule numbers and underlines not needed). Guess and explain what language this grammar defines (Hint: If your guess is not simple, maybe you have made a mistake in the derivations).
3. (no need to submit, but bring your notebook PC with you to the next lecture if you have any problems)
Install cygwin on your notebook computer (detailed instructions with images). Make sure that you select/install gcc, flex, bison, diff, make, and m4. If you have an earlier cygwin installation, make sure to check/update.

# Glossary

• word
• derivation
• classification
• symbol
• empty word
• alphabet: アルファベット
• (word/language) over Σ: Σ 上の (語・言語)
• concatenation (operation)
• associativity
• neutral element
• commutativity
• prefectural government (building)
• keyword
• well-formed formula
• Kleene closure: クリーン閉包
• rule
• type of language
• Chomsky hierarchy: チョムスキー階層
• phrase structure language
• context-sensitive language
• context-free language
• regular language
• Turing machine: チューリング機械
• linear-bounded automaton
• push-down automaton: プッシュダウンオートマトン
• finite state automaton
• external memory
• nonterminal symbol
• upper case (letter)
• lower case (letter)
• terminal symbol
• rewriting rule/production rule
• initial/start symbol