(Algorithms for String Matching)

http://www.sw.it.aoyama.ac.jp/2019/DA/lecture11.html

© 2009-19 Martin J. Dürst 青山学院大学

- Summary and leftovers of last lecture
- Overview of string matching
- Simplistic implementation
- Rabin-Karp algorithm
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- String matching and character encoding
- Summary

- A *hash table* implements a dictionary ADT using a *hash function*
- The hash function converts the keys into a random-like distribution in a repeatable way
- Hash tables allow search, insertion, and deletion all in (average) time `O`(1)
- The main methods for conflict resolution are *chaining* and *open addressing*
- In Ruby and many other programming languages, hash tables are very convenient data structures
- The implementation of hashing in Ruby uses open addressing starting with Ruby 2.4 (before: chaining)

- Rescheduling the lecture originally scheduled for December 12, 2019
- Choices:
  - Weekday or Saturday
  - 1st to 5th period or 6th period
  - This year or next year

- Please note: The contents of the makeup class are part of the final exam!

Date and time: January 30, 09:30-10:55

- Coverage:
  - Complete contents of lecture and handouts
  - No need to be able to write Ruby code, but need to be able to understand it, and to **write your own pseudocode**
- Type of problems:
  - Similar to problems in Discrete Mathematics I or Computer Practice I
  - Past exams: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018
  - How to view example solutions:
    - Press the [表示 (S)] (show) button or the [S] key. To revert, press the [非表示 (P)] (hide) button or the [P] key.
    - Sometimes, more than one key press is needed to start switching.
    - Some images and example solutions are missing.
- Important points:
  - Read problems carefully (distinguish between calculation, proof, explanation,...)
  - Be able to explain concepts in **your own** words
  - **Combine** and **apply** knowledge from different lectures
  - Write **clearly**
  - Answers can be in Japanese or English

Goal: To find a short *pattern* `p` in a long *text* `t`

- Text `t` (haystack): String of length `n`
- Pattern `p` (needle): String of length `m`
- Search pattern inside text (existence/location/number)
- The location is usually called *shift*
- The substring of shift `s` and length `m` in text `t` is written `t`_{s}
- The character in `t` at position `s` is written `t`[`s`]

Background: String matching algorithms in various programming languages

Task: Find "pattern" in the text below

...paternpattnetrnternapatternatnetnpttepneretanetpat...
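As a small illustration of the three kinds of questions (existence, location, number), the following sketch uses Ruby's built-in string methods on the text above. The variable names are chosen here for illustration, and the leading and trailing `...` of the text are left out.

```ruby
# Text and pattern from the task above (without the surrounding "...").
text    = "paternpattnetrnternapatternatnetnpttepneretanetpat"
pattern = "pattern"

exists = text.include?(pattern)   # existence
shift  = text.index(pattern)      # location (shift s) of the first occurrence
# number of occurrences, counting overlapping ones via a zero-width lookahead
count  = text.scan(/(?=#{Regexp.escape(pattern)})/).length

# t_s in the notation above: the substring of shift s and length m
substring = text[shift, pattern.length]

puts "exists=#{exists} shift=#{shift} count=#{count} substring=#{substring}"
```

These built-in methods hide the algorithm that is actually used; the algorithms in the rest of this lecture show different ways such a search can be implemented.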

- Compare the pattern with each substring of length `m` of the text
- There are `n`-`m`+1 = `O`(`n`) substrings
- Comparing the pattern with a substring takes time `O`(`m`)
  - text: e f a n y a k a m n
  - pattern: a o y a m a
  - next pattern position: a o y a m a
- A simplistic implementation will take time `O`(`n`) · `O`(`m`) = `O`(`n``m`)
- If the comparison is stopped at the first non-matching character:
  - In the general case, time will be close to `O`(`n`)
  - In the worst case, time will be `O`(`n``m`)

    Actual example: `t` = aaa....aaaab, `p` = aa..ab
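The simplistic approach can be sketched in Ruby as follows. This is a minimal sketch, not the code from the lecture's program files; the method name `naive_search` and the comparison counter are introduced here only to make the cost visible.

```ruby
# Simplistic string matching: try every shift s, stop each comparison
# at the first mismatch. Returns [shift or nil, number of character comparisons].
def naive_search(text, pattern)
  n = text.length
  m = pattern.length
  comparisons = 0
  (0..n - m).each do |s|          # n-m+1 candidate shifts (none if m > n)
    j = 0
    while j < m
      comparisons += 1
      break if text[s + j] != pattern[j]
      j += 1
    end
    return [s, comparisons] if j == m
  end
  [nil, comparisons]
end

# Worst-case style input: every shift matches until the last pattern character.
p naive_search("aaaaaaaaab", "aaab")   # → [6, 28]
```

In the worst-case example, each of the 7 shifts tried needs all `m` = 4 comparisons, which is the `O`(`n``m`) behavior described above.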

- Define a hash function `hf`()
- Calculate the hash value `hf`(`p`) of pattern `p`
- Calculate the hash value of each substring of length `m` of text `t`
- Compare with `hf`(`p`)
- If the hash values are equal, check the actual substring against the pattern
- A simplistic implementation is `O`(`n``m`)
- This can be improved by selecting an appropriate hash function

- Use a hash function so that `hf`(`t`_{s+1}) can be easily calculated from `hf`(`t`_{s})
- The hash of the pattern is

  $$\mathrm{hf}(p) = \left(p[0]\cdot b^{m-1} + p[1]\cdot b^{m-2} + \dots + p[m-2]\cdot b^{1} + p[m-1]\cdot b^{0}\right) \bmod d$$

  (`b` is the size of the alphabet, `d` is an arbitrarily selected divisor)
- The hash of the candidate substrings is:

  $$\mathrm{hf}(t_s) = \left(t[s]\cdot b^{m-1} + t[s+1]\cdot b^{m-2} + \dots + t[s+m-2]\cdot b^{1} + t[s+m-1]\cdot b^{0}\right) \bmod d$$

  $$\mathrm{hf}(t_{s+1}) = \left(t[s+1]\cdot b^{m-1} + t[s+2]\cdot b^{m-2} + \dots + t[s+m-1]\cdot b^{1} + t[s+m]\cdot b^{0}\right) \bmod d$$

- Using the properties of the modulo function (Discrete Mathematics I, Modular Arithmetic):

  $$\mathrm{hf}(t_{s+1}) = \left(\left(\mathrm{hf}(t_s) - t[s]\cdot b^{m-1}\right)\cdot b + t[s+m]\right) \bmod d$$

  $$\phantom{\mathrm{hf}(t_{s+1})} = \left(\left(\mathrm{hf}(t_s) - t[s]\cdot (b^{m-1} \bmod d)\right)\cdot b + t[s+m]\right) \bmod d$$

- `hf`(`t`_{s+1}) can be calculated from `hf`(`t`_{s}) in time `O`(1)
- Therefore, the overall time is `O`(`n`) (as long as equal hash values that require a full comparison are rare)
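The rolling update of the hash value can be sketched in Ruby as follows. This is a minimal sketch under assumptions, not the lecture's Bstringmatch.rb: the method name `rabin_karp`, the default values for `b` and `d`, and the use of character codes (`ord`) as character values are choices made here for illustration.

```ruby
# Rabin-Karp search with a rolling hash; returns all shifts where pattern occurs.
# b and d are the base and divisor from the formulas above (defaults are assumptions).
def rabin_karp(text, pattern, b: 256, d: 1_000_003)
  n = text.length
  m = pattern.length
  return [] if m.zero? || m > n
  high = b.pow(m - 1, d)                 # b^(m-1) mod d, computed once
  hp = 0                                 # hf(p)
  ht = 0                                 # hf(t_s), starting with s = 0
  m.times do |i|
    hp = (hp * b + pattern[i].ord) % d
    ht = (ht * b + text[i].ord) % d
  end
  matches = []
  (0..n - m).each do |s|
    # equal hash values are only candidates; confirm with a real comparison
    matches << s if hp == ht && text[s, m] == pattern
    next if s == n - m                   # no further substring to hash
    # O(1) update: remove t[s], shift by b, add t[s+m]
    ht = ((ht - text[s].ord * high) * b + text[s + m].ord) % d
  end
  matches
end

p rabin_karp("abababa", "aba")   # → [0, 2, 4]
```

Ruby's `%` returns a non-negative result for a positive divisor, so the subtraction in the update step needs no extra correction; in languages where the remainder can be negative, `d` would have to be added before taking the remainder.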

Pattern: 081205

Text: 28498608120598743297

`b` = 10, `d` = 9

(for manual calculation, casting out nines is helpful)
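The example above can be checked with a short Ruby sketch that uses the digit values directly as character values. The variable names are chosen here for illustration; with `b` = 10 and `d` = 9, each hash value is simply the digit sum modulo 9, which is why casting out nines works for the manual calculation.

```ruby
text    = "28498608120598743297"
pattern = "081205"
b = 10
d = 9
m = pattern.length

digits = text.chars.map(&:to_i)
high = b.pow(m - 1, d)                                   # b^(m-1) mod d (here: 1)
hp = pattern.chars.reduce(0) { |h, c| (h * b + c.to_i) % d }

# hash of the first substring, then rolling updates for s = 1 .. n-m
h = digits.first(m).reduce(0) { |acc, x| (acc * b + x) % d }
hashes = [h]
(0...digits.length - m).each do |s|
  h = ((h - digits[s] * high) * b + digits[s + m]) % d
  hashes << h
end

candidates = hashes.each_index.select { |s| hashes[s] == hp }   # equal hash values
real = candidates.select { |s| text[s, m] == pattern }          # confirmed matches

puts "hf(p) = #{hp}"                      # hf(p) = 7
puts "hashes = #{hashes.inspect}"         # [1, 8, 8, 5, 7, 8, 7, 7, 7, 4, 6, 0, 6, 6, 5]
puts "candidates = #{candidates.inspect}" # [4, 6, 7, 8]
puts "matches = #{real.inspect}"          # [6]
```

With such a small divisor, three of the four candidates (shifts 4, 7, and 8) are false hits that are only rejected by the full comparison; a larger `d` makes such false hits rarer.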

Example of implementing the Rabin-Karp algorithm in Excel: BRabin-Karp.xls

Pseudocode of Rabin-Karp algorithm: Bstringmatch.rb

- A simplistic implementation compares the same text character many times
- Basic idea: Use knowledge from previous comparisons
- Precompute the pattern shifts by pattern-internal comparisons

- Compare the current text location with the current pattern location
- As long as there is a match, continue comparison without changing `s` (i.e. compare the next character in the text with the next character in the pattern)
- If the end of the pattern is reached, there is a successful match
- If there is no match:
  - At the start of the pattern, we move the pattern by one (increase `s` by one) (i.e. compare the next character in the text with the first character in the pattern)
    - text: e f g h i j k l m n
    - pattern: a o y a m a
    - next pattern position: a o y a m a
  - In other positions, move the position in the pattern to the left according to the precomputed value (i.e. keep the current position in the text, move the pattern to the right)
    - text: e f a o y a k l m n o p
    - pattern: a o y a m a
    - next pattern position: a o y a m a

- As a result of a comparison, either of two actions is taken:
  - Shift the pattern to the right by one or more characters (maximum `n`-`m` times)
  - Shift the comparison position in the text to the right by one character (maximum `n` times)
- The total number of operations is about 2`n`, so the time complexity is `O`(`n`)
- Except for the precomputation, the time complexity does not depend on `m`
- Advantage: The characters in the text are accessed strictly in order (left to right)
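The matching procedure described above can be sketched in Ruby as follows. This is a minimal sketch, not the lecture's Bstringmatch.rb; the method names `kmp_next_table` and `kmp_search` are introduced here, and the table follows the convention used in this lecture, where -1 means "move on in the text and restart at the beginning of the pattern".

```ruby
# next_pos[x]: pattern position to compare next when pattern[x] does not match.
def kmp_next_table(pattern)
  m = pattern.length
  return [] if m.zero?
  next_pos = Array.new(m, 0)
  next_pos[0] = -1
  x = 0
  k = -1
  while x < m - 1
    # fall back within the pattern until pattern[x] can be extended (or k == -1)
    k = next_pos[k] while k >= 0 && pattern[x] != pattern[k]
    x += 1
    k += 1
    # same character at the next position: that comparison would fail again, skip it
    next_pos[x] = pattern[x] == pattern[k] ? next_pos[k] : k
  end
  next_pos
end

# Returns the shift of the first match, or nil if there is none.
def kmp_search(text, pattern)
  n = text.length
  m = pattern.length
  next_pos = kmp_next_table(pattern)
  i = 0                                   # position in the text (never decreases)
  j = 0                                   # position in the pattern
  while i < n && j < m
    j = next_pos[j] while j >= 0 && text[i] != pattern[j]
    i += 1
    j += 1
  end
  j == m ? i - m : nil
end

p kmp_next_table("aoyaoa")              # → [-1, 0, 0, -1, 0, 2]
p kmp_search("xxaoyaoyaoa", "aoyaoa")   # → 5
```

The text index `i` only ever increases, which reflects the advantage named above: the text is read strictly from left to right.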

For each position `x` in the pattern, assume that `p`[`x`] does not match but that `p`[0] ... `p`[`x`-1] (length `x`) already match. Then the length of the longest proper prefix of the already matching part that is also a suffix of that part indicates the position in the pattern to match next.

- In addition, if the character at the position in the pattern to be checked next is the same as the character that just failed to match, the additional comparison can be omitted
- Moving the position in the text by one and the pattern to the start again is expressed as -1

Pseudocode for Knuth-Morris-Pratt algorithm: Bstringmatch.rb

Precomputation for pattern `aoyaoa` (? is any non-matching character):

| Position not matched | Row | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | next |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | text | ? | | | | | | | | | | |
| | pattern | a | o | y | a | o | a | | | | | |
| | next | | a | o | y | a | o | a | | | | -1 |
| 1 | text | a | ? | | | | | | | | | |
| | pattern | a | o | y | a | o | a | | | | | |
| | next | | a | o | y | a | o | a | | | | 0 |
| 2 | text | a | o | ? | | | | | | | | |
| | pattern | a | o | y | a | o | a | | | | | |
| | next | | | a | o | y | a | o | a | | | 0 |
| 3 | text | a | o | y | ? | | | | | | | |
| | pattern | a | o | y | a | o | a | | | | | |
| | next? | | | | a | o | y | a | o | a | | |
| | next | | | | | a | o | y | a | o | a | -1 |
| 4 | text | a | o | y | a | ? | | | | | | |
| | pattern | a | o | y | a | o | a | | | | | |
| | next | | | | | a | o | y | a | o | a | 0 |
| 5 | text | a | o | y | a | o | ? | | | | | |
| | pattern | a | o | y | a | o | a | | | | | |
| | next | | | | a | o | y | a | o | a | | 2 |

Each `next` row shows the pattern at its new position. The pattern character in the column of ? is compared next; for -1, the comparison continues with the text character after ?. The `next?` row for position 3 is a candidate that is skipped: it would compare `p`[0] = a with the text character that has just failed to match `p`[3] = a.
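The precomputed values can also be obtained directly from the definition, without any clever bookkeeping. The following Ruby sketch does this in a deliberately slow but easy-to-check way; the method name `next_by_definition` is introduced here for illustration and is not taken from the lecture's program files.

```ruby
# For each position x: find the longest proper prefix of pattern[0, x] that is
# also a suffix of it, skipping candidates whose next character equals
# pattern[x] (that comparison would fail again). -1 if no candidate is left.
def next_by_definition(pattern)
  (0...pattern.length).map do |x|
    matched = pattern[0, x]                     # already matching part, length x
    len = (x - 1).downto(0).find do |l|
      matched[0, l] == matched[x - l, l] &&     # prefix of length l == suffix of length l
        pattern[l] != pattern[x]                # next comparison has a chance to succeed
    end
    len.nil? ? -1 : len
  end
end

p next_by_definition("aoyaoa")   # → [-1, 0, 0, -1, 0, 2]
```

For the pattern `aoyaoa`, this gives the same values as the `next` column of the table above.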

- Start comparing at the end of the pattern
- Consider the actual text character when comparing (i.e. not only match/no match, but match/no match with 'a'/no match with 'b'/...)
- Increase the distance by which the pattern can be shifted
- Example: If the last character of the pattern doesn't match the text, and the candidate character in the text does not appear in the pattern, then the pattern can be shifted by `m` at once
  - text: e f g h i j k l m n o p q r
  - pattern: a o y a m a
  - next pattern position: a o y a m a

Two guidelines for shifting the pattern:

- Pattern-internal comparison (back-to-front version of Knuth-Morris-Pratt algorithm)
- The rightmost position in the pattern of the non-matching character in the text
  - text: e f g h i j k a m n o
  - pattern: a o g a k u
  - next pattern position: a o g a k u

Select the larger shift
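The second guideline alone is already enough for a working (if simplified) search. The following Ruby sketch implements only this rule, often called the bad-character rule, and leaves out the pattern-internal comparison; it is therefore not the full Boyer-Moore algorithm, and the method name `bad_character_search` is an assumption made here for illustration.

```ruby
# Simplified Boyer-Moore: compare from the end of the pattern and shift using
# only the rightmost position in the pattern of the non-matching text character.
# Returns the shift of the first match, or nil.
def bad_character_search(text, pattern)
  n = text.length
  m = pattern.length
  return 0 if m.zero?
  last = {}                                        # rightmost position of each character
  pattern.each_char.with_index { |c, i| last[c] = i }
  s = 0
  while s <= n - m
    j = m - 1
    j -= 1 while j >= 0 && pattern[j] == text[s + j]   # compare back to front
    return s if j < 0
    # line up the rightmost occurrence of the text character with position j;
    # characters not in the pattern count as position -1; always move at least 1
    shift = j - last.fetch(text[s + j], -1)
    s += [shift, 1].max
  end
  nil
end

p bad_character_search("xxaoyamaxx", "aoyama")       # → 2
p bad_character_search("efghijklmnopqr", "aoyama")   # → nil
```

In the second call, the text characters compared first (`j` and `p`) do not appear in the pattern, so the pattern moves by `m` = 6 each time and only two characters of the text are examined.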

- In the worst case, same as Knuth-Morris-Pratt algorithm: `O`(`n`)
- If `m` is relatively small compared to `b`, in most cases: `O`(`n`/`m`)

- Relative size of `n` and `m` (in general, `m` ≪ `n`)
- Number of searches, number of patterns (if there are many searches with the same text or pattern, then that text or pattern can be preprocessed to make the search more efficient)
- Number of characters (size of alphabet, `b`)
  - Sequence of bits: `b` = 2
  - Genetics: `b` = 4 (nucleotides) or `b` = 21 (amino acids)
  - Western documents: `b` ≅ 26 to 256
  - East Asian documents: `b` ≅ several thousand

- In some character encodings, there is a large number of characters
- Implementation becomes simpler if working on bytes instead of characters
- For some character encodings, working byte-by-byte is impossible
  - Possible: UTF-8
  - Impossible: `iso-2022-jp` (JIS), `Shift_JIS` (SJIS), `EUC-JP` (EUC)

- UTF-8: **0**xxxxxxx; **110**xxxxx **10**xxxxxx; **1110**xxxx **10**xxxxxx **10**xxxxxx; **11110**xxx **10**xxxxxx **10**xxxxxx **10**xxxxxx
- EUC-JP: 0xxxxxxx; 1xxxxxxx 1xxxxxxx
- Shift_JIS: 0xxxxxxx; 1xxxxxxx xxxxxxxx
- iso-2022-jp: 0xxxxxxx; 0xxxxxxx 0xxxxxxx
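The byte patterns above explain why byte-by-byte matching is safe for UTF-8 but not for Shift_JIS. The following Ruby sketch illustrates this with one character, 表 (U+8868), whose second byte in Shift_JIS has the same value as the ASCII backslash. The variable names are chosen here for illustration.

```ruby
kanji = "\u8868"                         # the character 表, as a UTF-8 string
sjis  = kanji.encode("Shift_JIS")
backslash = "\\".b                       # the single byte 0x5C

puts sjis.bytes.map { |x| format("%02X", x) }.join(" ")    # 95 5C
puts kanji.bytes.map { |x| format("%02X", x) }.join(" ")   # E8 A1 A8

# Byte-by-byte search: the Shift_JIS text contains no backslash character,
# but its second byte matches anyway (a false match).
false_match_sjis = sjis.b.include?(backslash)    # true
false_match_utf8 = kanji.b.include?(backslash)   # false

# In UTF-8, every byte of a multi-byte character has its top bit set,
# so it can never be confused with an ASCII character.
all_high = kanji.bytes.all? { |x| x >= 0x80 }

puts "Shift_JIS false match: #{false_match_sjis}"
puts "UTF-8 false match: #{false_match_utf8}"
```

`String#b` returns a copy of the string that is treated as raw bytes, which is what a byte-oriented matching algorithm would see.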

- The lexical analysis of a program is advanced string matching
- Not only fixed character strings as patterns
- Using finite automata
- Patterns are specified using regular expressions
- These are topics of the course "Language Theory and Compilers" (3rd year spring term)
- Regular expressions can be used for text processing in Ruby as well as many other programming languages
- Ruby string matching is reported to have bad worst-case performance; improving it could be a topic for graduation research

- A simplistic implementation of string matching is `O`(`n``m`) in the worst case
- The Rabin-Karp algorithm is `O`(`n`), using a hash function that can be extended to 2D matching
- The Knuth-Morris-Pratt algorithm is `O`(`n`), and views the text strictly in input order
- The Boyer-Moore algorithm is `O`(`n`/`m`) in most cases

Homework:

- Review the Rabin-Karp algorithm using BRabinKarp.xls and Bstringmatch.rb
- Review the Knuth-Morris-Pratt algorithm using Bstringmatch.rb (in particular `show_kmp_preparation`)
- Review matrix multiplication (linear algebra)

- string matching: 文字列照合
- pattern: パターン
- text: 文書
- substring: 部分文字列
- simplistic: 素朴な、単純すぎる
- nucleotide: ヌクレオチド
- amino acid: アミノ酸
- precomputation: 事前計算
- casting out nines: 九去法
- prefix: 接頭部分列
- suffix: 末尾部分列
- character encoding: 文字コード
- lexical analysis: 字句解析
- finite automaton (pl. automata): 有限オートマトン
- regular expression: 正規表現