Algorithms for String Matching

(文字列照合のアルゴリズム)

Data Structures and Algorithms

11th lecture, December 8, 2022

https://www.sw.it.aoyama.ac.jp/2022/DA/lecture11.html

Martin J. Dürst

AGU

© 2009-22 Martin J. Dürst 青山学院大学

 

Today's Schedule

 

Leftovers from Last Lecture

 

Summary of Last Lecture

 

Remaining Schedule

 

Term Final Exam

Coverage:
Complete contents of lecture and handouts (incl. programs)
You do not need to write Ruby code, but you must be able to read and understand it, and to write your own pseudocode
Type of problems:
Similar to problems in Discrete Mathematics I
Past exams: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 (online), 2021
How to view example solutions:
Press [表示 (S)] button or [S] key. To revert, press [非表示 (P)] button or press [P] key.
Sometimes, more than one key press is needed to start switching.
Some images and example solutions are missing.
Important points:
Read problems carefully (distinguish between calculation, proof, explanation,...)
Be able to explain concepts in your own words
Combine and apply knowledge from different lectures
Write clearly
Answers can be in Japanese or English

 

Overview of String Matching

Goal: To find a short pattern p in a long text t

Task: Find "pattern" in the text below

...paternpattnetrnternapatternatnetnpttepneretanetpat...  

Background: String matching algorithms in various programming languages
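
As a baseline for the goal stated above, the following is a minimal Ruby sketch of a simplistic search: it tries every start position in the text and compares the pattern there. The method name `naive_search` is made up for this illustration. The text is the part of the task string between the two ellipses. Ruby's built-in `String#index` is shown for comparison.

```ruby
# Simplistic string matching: try every start position s in the text.
# Returns the index of the first match, or nil if there is none.
# (naive_search is a name chosen for this sketch only.)
def naive_search(text, pattern)
  n = text.length
  m = pattern.length
  (0..n - m).each do |s|
    # Compare the m characters starting at s with the pattern.
    return s if text[s, m] == pattern
  end
  nil
end

text = "paternpattnetrnternapatternatnetnpttepneretanetpat"
puts naive_search(text, "pattern")   # prints 20
puts text.index("pattern")           # built-in search, prints 20
```

In the worst case this compares up to m characters at each of about n start positions, which is the behavior the algorithms in the rest of the lecture try to improve on.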

 

Simplistic Implementation

 

Overview of the Rabin-Karp Algorithm

 

Selecting the Hash Function

 

Speeding up the Hash Function

 

Example of Rabin-Karp

Pattern: 081205

Text: 28498608120598743297

b = 10, d = 9

 (for manual calculation, casting out nines is helpful)

Example of implementing Rabin-Karp algorithm in Excel: BRabin-Karp.xls

Pseudocode of Rabin-Karp algorithm: Bstringmatch.rb
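
The example above can be traced with the following Ruby sketch. It is not the content of Bstringmatch.rb. It is an independent illustration that assumes text and pattern consist of decimal digits only, so that each character can be used directly as a digit in base b. The names `rabin_karp` and `high` are chosen for this sketch. With b = 10 and d = 9, the hash of a window is its digit sum modulo 9 (casting out nines), so the pattern 081205 hashes to 0+8+1+2+0+5 = 16, and 16 mod 9 = 7.

```ruby
# Rabin-Karp sketch for strings of decimal digits (an assumption of
# this example). b is the base, d the modulus of the hash function.
# Returns the start indices of all matches.
def rabin_karp(text, pattern, b: 10, d: 9)
  n = text.length
  m = pattern.length
  return [] if m.zero? || m > n

  txt = text.chars.map(&:to_i)
  pat = pattern.chars.map(&:to_i)
  high = b.pow(m - 1, d)   # weight of the leading digit, modulo d

  hp = pat.reduce(0) { |h, x| (h * b + x) % d }        # pattern hash
  ht = txt[0, m].reduce(0) { |h, x| (h * b + x) % d }  # first window

  matches = []
  (0..n - m).each do |i|
    # Equal hashes are only a hint; confirm with a real comparison.
    matches << i if ht == hp && text[i, m] == pattern
    next if i == n - m
    # Rolling update: drop txt[i], shift by one digit, append txt[i + m].
    ht = ((ht - txt[i] * high) * b + txt[i + m]) % d
  end
  matches
end

p rabin_karp("28498608120598743297", "081205")   # prints [6]
```

The modulus d = 9 is convenient for manual calculation but produces many hash collisions. Each collision costs one extra direct comparison and does not affect correctness.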

 

Overview of Knuth-Morris-Pratt Algorithm

 

Details of Knuth-Morris-Pratt Algorithm

 

Time Complexity of Knuth-Morris-Pratt Algorithm

 

Precomputation for Knuth-Morris-Pratt Algorithm

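For each position x in the pattern,
assuming that p[x] does not match
but that p[0] ... p[x-1] (length x) already match,
the length of the longest proper prefix of the already matching part that is also a suffix of it
indicates the position in the pattern to match next.
If the pattern character at that position is the same as p[x], it cannot match either, so the entry of that position is used instead (-1 means: move on to the next character of the text).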

Pseudocode for Knuth-Morris-Pratt algorithm: Bstringmatch.rb

 

Precomputation Example

Precomputation for pattern aoyaoa (? is any non-matching character):

Green indicates matches, red indicates non-matches, blue indicates next comparison (maybe with red ?).

Position not matched  Part     Characters           Next
0                     text     ?
                      pattern  a o y a o a
                      next       a o y a o a        -1
1                     text     a ?
                      pattern  a o y a o a
                      next       a o y a o a        0
2                     text     a o ?
                      pattern  a o y a o a
                      next         a o y a o a      0
3                     text     a o y ?
                      pattern  a o y a o a
                      next?          a o y a o a
                      next             a o y a o a  -1
4                     text     a o y a ?
                      pattern  a o y a o a
                      next             a o y a o a  0
5                     text     a o y a o ?
                      pattern  a o y a o a
                      next           a o y a o a    2
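
The table above can be reproduced with the following Ruby sketch. It is an independent illustration, not the content of Bstringmatch.rb, and the names `kmp_next`, `kmp_search`, and `border` are chosen for this example. It first computes, for each prefix length, the length of the longest proper prefix that is also a suffix, and then skips positions whose pattern character equals the one that just failed to match.

```ruby
# Precomputation: nxt[x] is the pattern position to compare next when
# p[x] does not match; -1 means "advance to the next text character".
def kmp_next(pattern)
  m = pattern.length
  return [] if m.zero?

  # border[x]: length of the longest proper prefix of pattern[0...x]
  # that is also a suffix of it (border[0] = -1 by convention).
  border = Array.new(m + 1)
  border[0] = -1
  k = -1
  m.times do |x|
    k = border[k] while k >= 0 && pattern[k] != pattern[x]
    k += 1
    border[x + 1] = k
  end

  nxt = Array.new(m)
  nxt[0] = -1
  (1...m).each do |x|
    k = border[x]
    # If p[k] equals p[x], it would fail too, so reuse the entry for k.
    nxt[x] = pattern[k] == pattern[x] ? nxt[k] : k
  end
  nxt
end

# Search: the text index i never moves backwards.
def kmp_search(text, pattern)
  m = pattern.length
  return 0 if m.zero?
  nxt = kmp_next(pattern)
  j = 0
  text.each_char.with_index do |c, i|
    j = nxt[j] while j >= 0 && pattern[j] != c
    j += 1
    return i - m + 1 if j == m
  end
  nil
end

p kmp_next("aoyaoa")                  # prints [-1, 0, 0, -1, 0, 2]
p kmp_search("aoyaoyaoa", "aoyaoa")   # prints 3
```

The second call shows the entry for position 5 at work: after a o y a o matches and the sixth character does not, the comparison continues at pattern position 2 without going back in the text.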

 

Overview of Boyer-Moore Algorithm

 

Idea Details

Two guidelines for shifting the pattern:

  1. Pattern-internal comparison
    (back-to-front version of Knuth-Morris-Pratt algorithm)
  2. The rightmost position in the pattern of the non-matching character in the text

    text e f g h i j k a m n o
    pattern a o g a k u
    next pattern position a o g a k u

Select the larger shift
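
The second guideline can be sketched in Ruby as follows. This is a deliberate simplification: it implements only guideline 2 (the rightmost position in the pattern of the non-matching text character) and omits guideline 1, so it is not the complete Boyer-Moore algorithm. The names `bad_char_search` and `last` are chosen for this illustration.

```ruby
# Boyer-Moore style search using only the rightmost-occurrence rule.
# Returns the index of the first match, or nil if there is none.
def bad_char_search(text, pattern)
  n = text.length
  m = pattern.length
  return 0 if m.zero?

  # last[c]: rightmost position of character c in the pattern.
  last = {}
  pattern.each_char.with_index { |c, i| last[c] = i }

  s = 0                       # current start position in the text
  while s <= n - m
    j = m - 1                 # compare from the back of the pattern
    j -= 1 while j >= 0 && pattern[j] == text[s + j]
    return s if j < 0         # all m characters matched

    # Align the rightmost occurrence of the non-matching text
    # character with its position in the text; shift at least by 1.
    r = last.fetch(text[s + j], -1)
    s += [1, j - r].max
  end
  nil
end

p bad_char_search("xxaogakuxx", "aogaku")    # prints 2
p bad_char_search("efghijkamno", "aogaku")   # prints nil
```

In the first call, the first comparison is u against a. The rightmost a in aogaku is at position 3, so the pattern shifts by 5 - 3 = 2 and then matches. A full implementation would also compute the shift from guideline 1 and take the larger of the two.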

 

Time Complexity of Boyer-Moore Algorithm

 

String Matching Context

 

String Matching and Character Encoding

 

Character Encodings and Byte Patterns

 

Other String Matching Problems

 

Outlook

 

Summary

 

Homework

 

Glossary

string matching
文字列照合
pattern
パターン
text
文書
substring
部分文字列
simplistic
素朴な、単純すぎる
nucleotide
ヌクレオチド
amino acid
アミノ酸
precomputation
事前計算
casting out nines
九去法
prefix
接頭部分列
suffix
末尾部分列
character encoding
文字コード
index
索引
lexical analysis
字句解析
finite automaton (pl. automata)
有限オートマトン
regular expression
正規表現