Algorithms for String Matching


Data Structures and Algorithms

11th lecture, December 19, 2019

Martin J. Dürst


© 2009-19 Martin J. Dürst 青山学院大学

Today's Schedule


Summary of Last Lecture


Leftovers of Last Lecture


Scheduling of Makeup Lecture


Term Final Exam

Date and time: January 30, 09:30-10:55

Scope: complete contents of lectures and handouts
You do not need to be able to write Ruby code, but you need to be able to understand it and to write your own pseudocode
Type of problems:
  Similar to problems in Discrete Mathematics I or Computer Practice I
Past exams: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018
How to view example solutions:
  Press the [表示 (S)] (show) button or the [S] key. To revert, press the [非表示 (P)] (hide) button or the [P] key.
  Sometimes, more than one key press is needed to start switching.
  Some images and example solutions are missing.
Important points:
  Read problems carefully (distinguish between calculation, proof, explanation, ...)
  Be able to explain concepts in your own words
  Combine and apply knowledge from different lectures
  Write clearly
  Answers can be in Japanese or English


Overview of String Matching

Goal: To find a short pattern p in a long text t

Background: String matching algorithms in various programming languages

Task: Find "pattern" in the text below



Simplistic Implementation
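
As a point of comparison for the algorithms that follow, a simplistic matcher can be sketched as below. This is only an illustrative sketch, not the code from the lecture; the method name naive_search is an assumption. It tries every alignment of the pattern against the text, so the worst case is O(nm) character comparisons for a text of length n and a pattern of length m.

```ruby
# Brute-force string matching (hypothetical helper, for illustration only).
# Returns the index of the first occurrence of pattern in text, or nil.
def naive_search(text, pattern)
  n = text.length
  m = pattern.length
  # Try every possible start position of the pattern in the text.
  (0..n - m).each do |start|
    offset = 0
    # Compare character by character until a mismatch or the pattern ends.
    offset += 1 while offset < m && text[start + offset] == pattern[offset]
    return start if offset == m
  end
  nil
end

p naive_search("find the pattern here", "pattern")
p naive_search("find the pattern here", "missing")
```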


Overview of the Rabin-Karp Algorithm


Selecting the Hash Function


Speeding up the Hash Function


Example of Rabin-Karp

Pattern: 081205

Text: 28498608120598743297

b = 10, d = 9

 (for manual calculation, casting out nines is helpful)

Example implementation of the Rabin-Karp algorithm in Excel: BRabin-Karp.xls

Pseudocode of Rabin-Karp algorithm: Bstringmatch.rb
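
The example above (b = 10, d = 9) can be tried out with the following sketch. It is not the content of Bstringmatch.rb; the method names rk_hash and rabin_karp are assumptions, and the code assumes that text and pattern consist of decimal digits only. With d = 9 the hash is the digit sum modulo 9 (casting out nines), so many windows share the pattern's hash value and every hash hit has to be confirmed by a direct comparison.

```ruby
# Hash of a digit string: its value in base b, taken modulo d (Horner's rule).
def rk_hash(str, b, d)
  str.each_char.reduce(0) { |h, c| (h * b + c.to_i) % d }
end

# Rabin-Karp sketch for digit strings; returns all match positions.
def rabin_karp(text, pattern, b = 10, d = 9)
  n = text.length
  m = pattern.length
  return [] if m == 0 || m > n

  high = (b**(m - 1)) % d           # weight of the leading digit, modulo d
  hp = rk_hash(pattern, b, d)       # hash of the pattern
  ht = rk_hash(text[0, m], b, d)    # hash of the first text window
  matches = []
  (0..n - m).each do |i|
    # Equal hashes are only a hint; confirm with a real comparison.
    matches << i if ht == hp && text[i, m] == pattern
    next if i == n - m
    # Rolling update: drop the leading digit, shift, append the next digit.
    ht = ((ht - text[i].to_i * high) * b + text[i + m].to_i) % d
  end
  matches
end

p rk_hash("081205", 10, 9)
p rabin_karp("28498608120598743297", "081205")
```

A larger modulus d, ideally a prime, would reduce the number of false hash hits; d = 9 is convenient only because it makes manual calculation easy.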


Overview of Knuth-Morris-Pratt Algorithm


Details of Knuth-Morris-Pratt Algorithm


Time Complexity of Knuth-Morris-Pratt Algorithm


Precomputation for Knuth-Morris-Pratt Algorithm

For each position x in the pattern,
assume that p[x] does not match
but that p[0] ... p[x-1] (length x) already match.
The length of the longest proper prefix of the already matching part that is also a suffix of it
indicates the position in the pattern to match next.

Pseudocode for Knuth-Morris-Pratt algorithm: Bstringmatch.rb
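
The precomputation described above can be sketched as follows. This is an illustrative sketch rather than the content of Bstringmatch.rb; the names kmp_next and kmp_search are assumptions (next itself is a reserved word in Ruby, so the table is called nxt). The sketch also applies a common refinement: if the character at the next position equals the character that just failed, that comparison would fail too, so the entry is taken from the next position's own entry. An entry of -1 means that the text position advances and matching restarts at the beginning of the pattern.

```ruby
# Precomputation: nxt[x] is the pattern position to compare next
# after a mismatch at p[x]; -1 means "advance in the text".
def kmp_next(pattern)
  m = pattern.length
  return [] if m == 0
  nxt = Array.new(m, 0)
  nxt[0] = -1
  k = 0  # length of the longest prefix that is also a suffix of pattern[0...x]
  (1...m).each do |x|
    if pattern[x] == pattern[k]
      nxt[x] = nxt[k]   # same character would fail again, so skip further
    else
      nxt[x] = k
      # Fall back to shorter prefixes until pattern[x] can extend one.
      k = nxt[k] while k >= 0 && pattern[x] != pattern[k]
    end
    k += 1
  end
  nxt
end

# Search: returns the index of the first match, or nil.
def kmp_search(text, pattern)
  return 0 if pattern.empty?
  nxt = kmp_next(pattern)
  i = 0  # position in the text (never moves backwards)
  x = 0  # position in the pattern
  while i < text.length
    if text[i] == pattern[x]
      i += 1
      x += 1
      return i - x if x == pattern.length
    else
      x = nxt[x]
      if x < 0   # no prefix can be reused: move on in the text
        i += 1
        x = 0
      end
    end
  end
  nil
end

p kmp_next("aoyaoa")
p kmp_search("aoyaoyaoa", "aoyaoa")
```

Because i never decreases, the search itself needs a number of comparisons that is linear in the length of the text.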


Precomputation Example

Precomputation for pattern aoyaoa (? is any non-matching character):

Position  not matched part                next
0         text     ?
          pattern  a o y a o a
          next       a o y a o a          -1
1         text     a ?
          pattern  a o y a o a
          next       a o y a o a          0
2         text     a o ?
          pattern  a o y a o a
          next         a o y a o a        0
3         text     a o y ?
          pattern  a o y a o a
          next?          a o y a o a
          next             a o y a o a    -1
4         text     a o y a ?
          pattern  a o y a o a
          next             a o y a o a    0
5         text     a o y a o ?
          pattern  a o y a o a
          next           a o y a o a      2

Green indicates matches, red indicates non-matches, blue indicates the next comparison (possibly with the red ?).


Overview of Boyer-Moore Algorithm


Idea Details

Two guidelines for shifting the pattern:

  1. Pattern-internal comparison
    (back-to-front version of Knuth-Morris-Pratt algorithm)
  2. The rightmost position in the pattern of the non-matching character in the text
    text                  e f g h i j k a m n o
    pattern                   a o g a k u
    next pattern position         a o g a k u

Select the larger shift
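
The second guideline can be sketched on its own as follows. This is a deliberately simplified sketch that uses only the rightmost-position rule and omits the pattern-internal comparison, so it is not the full Boyer-Moore algorithm; the names rightmost_positions and bm_bad_char_search are assumptions. The pattern is compared back to front, and on a mismatch it is shifted so that the rightmost occurrence of the non-matching text character in the pattern lines up with that character (always by at least one position).

```ruby
# For each character, the rightmost position at which it occurs in the pattern.
def rightmost_positions(pattern)
  table = {}
  pattern.each_char.with_index { |c, i| table[c] = i }  # later positions overwrite
  table
end

# Simplified Boyer-Moore using only the rightmost-position guideline.
# Returns the index of the first match, or nil.
def bm_bad_char_search(text, pattern)
  m = pattern.length
  return 0 if m == 0
  last = rightmost_positions(pattern)
  s = 0  # current start of the pattern in the text
  while s <= text.length - m
    j = m - 1
    # Compare back to front.
    j -= 1 while j >= 0 && pattern[j] == text[s + j]
    return s if j < 0
    # Align the rightmost occurrence of the text character (or -1 if absent).
    r = last.fetch(text[s + j], -1)
    s += [j - r, 1].max
  end
  nil
end

p rightmost_positions("aogaku")["a"]
p bm_bad_char_search("aoyama gakuin aogaku", "aogaku")
```

When the non-matching text character does not occur in the pattern at all, the pattern jumps past it completely, which is why Boyer-Moore can skip large parts of the text.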


Time Complexity of Boyer-Moore Algorithm


String Matching Context


String Matching and Character Encoding


Character Encodings and Byte Patterns








string matching
amino acid
casting out nines
character encoding
lexical analysis
finite automaton (pl. automata)
regular expression