Algorithms for String Matching

(文字列照合のアルゴリズム)

Data Structures and Algorithms

11th lecture, December 19, 2019

http://www.sw.it.aoyama.ac.jp/2019/DA/lecture11.html

Martin J. Dürst

AGU

© 2009-19 Martin J. Dürst 青山学院大学

Today's Schedule

 

Summary of Last Lecture

 

Leftovers of Last Lecture

 

Scheduling of Makeup Lecture

 

Term Final Exam

Date and time: January 30, 09:30-10:55

Coverage:
  Complete contents of lecture and handouts
  You do not need to be able to write Ruby code, but you need to be able to understand it and to write your own pseudocode
Type of problems:
  Similar to problems in Discrete Mathematics I or Computer Practice I
Past exams: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018
How to view example solutions:
  Press the [表示 (S)] (show) button or the [S] key. To revert, press the [非表示 (P)] (hide) button or the [P] key.
  Sometimes, more than one key press is needed to start switching.
  Some images and example solutions are missing.
Important points:
  Read problems carefully (distinguish between calculation, proof, explanation, ...)
  Be able to explain concepts in your own words
  Combine and apply knowledge from different lectures
  Write clearly
  Answers can be in Japanese or English

 

Overview of String Matching

Goal: To find a short pattern p in a long text t

Background: String matching algorithms in various programming languages

Task: Find "pattern" in the text below

...paternpattnetrnternapatternatnetnpttepneretanetpat...
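
As a point of reference for the task above, here is a minimal Ruby sketch of the most direct approach: try every start position in the text and compare character by character. The method name `naive_search` is an assumption made for this illustration and is not taken from the lecture materials.

```ruby
# Naive string matching: try each start position in turn.
# Returns the index of the first match, or nil if there is none.
# (naive_search is a hypothetical name used only for this sketch.)
def naive_search(text, pattern)
  n = text.length
  m = pattern.length
  (0..n - m).each do |start|
    j = 0
    # advance j while the characters agree
    j += 1 while j < m && text[start + j] == pattern[j]
    return start if j == m   # all m characters matched
  end
  nil
end

text = "paternpattnetrnternapatternatnetnpttepneretanetpat"
puts naive_search(text, "pattern").inspect
```

In the worst case this compares up to m characters at each of about n start positions, which is what the algorithms in the rest of the lecture try to avoid.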

 

Simplistic Implementation

 

Overview of the Rabin-Karp Algorithm

 

Selecting the Hash Function

 

Speeding up the Hash Function

 

Example of Rabin-Karp

Pattern: 081205

Text: 28498608120598743297

b = 10, d = 9

 (for manual calculation, casting out nines is helpful)
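
The example above can be replayed with a short Ruby sketch. It assumes that the text and pattern consist of decimal digits, that b is the base of the positional hash, and that d is the modulus; the method name `rabin_karp` is hypothetical, and this is not the code from the lecture's own files. Because d = 9 is very small, many windows share the pattern's hash value, so every hash hit is verified by a direct comparison.

```ruby
# Rabin-Karp sketch for digit strings (assumption: characters are 0-9).
# Returns the start indices of all verified matches.
def rabin_karp(text, pattern, b = 10, d = 9)
  n = text.length
  m = pattern.length
  return [] if m == 0 || m > n
  t = text.each_char.map(&:to_i)
  pat = pattern.each_char.map(&:to_i)
  high = b.pow(m - 1, d)   # weight of the leading digit, modulo d
  hp = pat.reduce(0) { |h, x| (h * b + x) % d }       # hash of the pattern
  ht = t[0, m].reduce(0) { |h, x| (h * b + x) % d }   # hash of the first window
  matches = []
  (0..n - m).each do |i|
    # equal hashes are only a hint; confirm with a real comparison
    matches << i if ht == hp && text[i, m] == pattern
    # rolling update: drop digit t[i], append digit t[i + m]
    ht = ((ht - t[i] * high) * b + t[i + m]) % d if i < n - m
  end
  matches
end

puts rabin_karp("28498608120598743297", "081205").inspect
```

The rolling update costs constant time per window, so the hash of each new window does not have to be recomputed from all m digits.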

Example of implementing the Rabin-Karp algorithm in Excel: BRabin-Karp.xls

Pseudocode of Rabin-Karp algorithm: Bstringmatch.rb

 

Overview of Knuth-Morris-Pratt Algorithm

 

Details of Knuth-Morris-Pratt Algorithm

 

Time Complexity of Knuth-Morris-Pratt Algorithm

 

Precomputation for Knuth-Morris-Pratt Algorithm

For every position x in the pattern,
assuming that p[x] does not match
but that p[0] ... p[x-1] (length x) already match,
the length of the longest proper prefix of the already matching part that is also a suffix of it
indicates the position in the pattern to compare next
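
The precomputation described above can be sketched in Ruby as follows. The method name `kmp_next` is an assumption for this illustration. The sketch uses a common refinement: when the character at the fallback position equals the character that just failed, it falls back further, and -1 means "move on to the next text character". It is a sketch under these assumptions, not necessarily identical to the lecture's own pseudocode.

```ruby
# Builds the table of next comparison positions for a pattern.
# nxt[x] is the pattern position to compare next when pattern[x]
# does not match; -1 means: advance in the text and restart at 0.
# (kmp_next is a hypothetical name used only for this sketch.)
def kmp_next(pattern)
  return [] if pattern.empty?
  nxt = Array.new(pattern.length, 0)
  nxt[0] = -1
  k = -1   # fallback position for the prefix ending before position x
  x = 0
  while x < pattern.length - 1
    # shorten the candidate prefix until it can be extended by pattern[x]
    k = nxt[k] while k >= 0 && pattern[x] != pattern[k]
    x += 1
    k += 1
    # same character would fail again, so reuse the earlier fallback
    nxt[x] = pattern[x] == pattern[k] ? nxt[k] : k
  end
  nxt
end

puts kmp_next("aoyaoa").inspect
```

The table depends only on the pattern, so it is computed once before the text is scanned.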

Pseudocode for Knuth-Morris-Pratt algorithm: Bstringmatch.rb

 

Precomputation Example

Precomputation for pattern aoyaoa (? is any non-matching character):

Position  not matched part               next
0         text     ?
          pattern  a o y a o a
          next       a o y a o a         -1
1         text     a ?
          pattern  a o y a o a
          next       a o y a o a          0
2         text     a o ?
          pattern  a o y a o a
          next         a o y a o a        0
3         text     a o y ?
          pattern  a o y a o a
          next?          a o y a o a
          next             a o y a o a   -1
4         text     a o y a ?
          pattern  a o y a o a
          next             a o y a o a    0
5         text     a o y a o ?
          pattern  a o y a o a
          next           a o y a o a      2

Green indicates matches, red indicates non-matches, blue indicates next comparison (maybe with red ?).
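
To show how such a table is used during the search, here is a hedged Ruby sketch. The method name `kmp_search` is an assumption for this illustration, and the table for aoyaoa is passed in as a literal taken from the next column of the example (-1, 0, 0, -1, 0, 2). The position in the text never moves backwards; only the position in the pattern is reset.

```ruby
# Knuth-Morris-Pratt search with a precomputed table nxt.
# Returns the index of the first match, or nil if there is none.
# (kmp_search is a hypothetical name used only for this sketch.)
def kmp_search(text, pattern, nxt)
  i = 0   # position in the text, never decreases
  j = 0   # position in the pattern
  while i < text.length
    if j == -1 || text[i] == pattern[j]
      # match (or restart signalled by -1): advance in both
      i += 1
      j += 1
      return i - j if j == pattern.length
    else
      # mismatch: keep i, continue at the precomputed pattern position
      j = nxt[j]
    end
  end
  nil
end

table = [-1, 0, 0, -1, 0, 2]   # values from the example for "aoyaoa"
puts kmp_search("aoyaoyaoa", "aoyaoa", table).inspect
```

In this call the first mismatch happens at pattern position 5, so the search continues at pattern position 2 without re-reading any text character.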

 

Overview of Boyer-Moore Algorithm

 

Idea Details

Two guidelines for shifting the pattern:

  1. Pattern-internal comparison
    (back-to-front version of Knuth-Morris-Pratt algorithm)
  2. The rightmost position in the pattern of the non-matching character in the text
    text e f g h i j k a m n o
    pattern a o g a k u
    next pattern position a o g a k u

Select the larger shift
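
The second guideline can be sketched in Ruby as follows. This is a simplified variant that uses only the rightmost-position rule (often called the bad-character rule) and omits the pattern-internal comparison of the first guideline, so it is not the full Boyer-Moore algorithm. The method names `rightmost_positions` and `bad_character_search` are assumptions made for this illustration.

```ruby
# For each character, the rightmost position at which it occurs in the pattern.
def rightmost_positions(pattern)
  table = {}
  pattern.each_char.with_index { |c, i| table[c] = i }   # later positions overwrite earlier ones
  table
end

# Search using only the rightmost-position rule.
# Returns the index of the first match, or nil if there is none.
def bad_character_search(text, pattern)
  m = pattern.length
  return nil if m == 0 || m > text.length
  right = rightmost_positions(pattern)
  s = 0   # current start of the pattern in the text
  while s <= text.length - m
    j = m - 1
    # compare from the back of the pattern towards the front
    j -= 1 while j >= 0 && pattern[j] == text[s + j]
    return s if j < 0
    r = right.fetch(text[s + j], -1)   # -1 if the text character is not in the pattern
    s += [1, j - r].max                # always shift by at least one position
  end
  nil
end

puts rightmost_positions("aogaku").inspect
puts bad_character_search("efghaogakuijk", "aogaku").inspect
```

When the non-matching text character does not occur in the pattern at all, the pattern is shifted past it completely, which is why this family of algorithms can skip large parts of the text.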

 

Time Complexity of Boyer-Moore Algorithm

 

String Matching Context

 

String Matching and Character Encoding

 

Character Encodings and Byte Patterns

 

Outlook

 

Summary

Homework:

 

Glossary

string matching
文字列照合
pattern
パターン
text
文書
substring
部分文字列
simplistic
素朴な、単純すぎる
nucleotide
ヌクレオチド
amino acid
アミノ酸
precomputation
事前計算
casting out nines
九去法
prefix
接頭部分列
suffix
末尾部分列
character encoding
文字コード
lexical analysis
字句解析
finite automaton (pl. automata)
有限オートマトン
regular expression
正規表現