Regular Expressions
Inside Out

http://www.sw.it.aoyama.ac.jp/2017/pub/Ruby_Kaigi/

Ruby Kaigi 2017 , Hiroshima, Japan, September 19, 2017

Martin J. DÜRST (マーティンと呼んでください)

duerst@it.aoyama.ac.jp, Aoyama Gakuin University

Abstract

Regular expressions are a very important part of the toolkit of every Ruby programmer. This talk will help improve your understanding of regular expressions, including how to use them from Ruby, and how they are implemented.

Examples will include things Ruby can do but other programming languages can't, huge regular expressions, substitutions that change as we go, and performance improvements for future Ruby versions.

For Best Viewing

These slides have been created in HMTL, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). The slides have been optimized for a screen size of 1920x1080pixels, but can easily be viewed on other screens, too. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

Text/Translation

[これは話す予定の原稿です。残念ながら、そのままぴったり話す保証はできません。
スライドの最後に用語集がありますので参考にしてください。]

[This is [the English translation of] the manuscript. Of course, I can't guarantee that I'll read it word-for-word. At the end of the slides, there is a glossary.]

Introduction

Topics

Original Title: Regular Expressions from the Outside and from the Inside

Updated Title: Regular Expressions Inside Out

Originally, long list of topics

Reduced more and more, three topics left

この発表のテーマは正規表現です。
題名を途中でちょっと単純化したので、前の題名での発表を期待している方にはお詫びを申し上げます。
昨日このような T-shirt を着ました。[IETF 79 (2009) の T-shirt で、赤の背景に鯉が書いてあるもの] その関係かどうかわかりませんが、昨日広島カープが優勝しました。
発表の提案を出した時には色々なネタがありましたが、どんどん厳選して、三つに絞りました。

The topic of this talk is regular expressions. I simplified the title a bit, so I want to appologize to listeners who expected the original title. Yesterday, I was wearing this T-shirt. [T-shirt of IETF 79 (2009) with a carp drawn on a redbackground.] I'm not sure there's a connection, but anyway, the Hiroshima Carps baseball team won the championship yesterday. When I submitted the talk proposal, I had a lot of topics, but I had to reduced this to three topics.

Outline

Introduction
Regular expression pitfalls:
The smallest regular expression
Effective use of regular expressions:
Efficient pure Ruby Unicode normalization
Regular expression implementation internals:
Improving character property check implemetation
Conclusions

三つのネタはご覧の通りです。

This outline shows the three topics.

Conventions Used

Code is mostly green, monospaceputs 'Hello Ruby!'

Variable parts are orange
puts "some string"

Results are indicated with "⇒"
1 + 1 ⇒ 2

Matching of regular expressions is indicated with ≟, ≅, or ≇
'abc' ≅ /a/; 'def' ≇ /a/

Matching substrings are indicated with orage background

'abcba' ≅ /b/

この発表も「設定より規約」を採用します。
ご覧のような規約を使っています。
(コードは緑の等幅フォントです。
変更可能な部分は橙色です。
必要な場合、文字コードを青の添え字で明記します。
実行結果は矢印の後に示します。
正規表現のマッチはご覧の記号で表します。)

This presentation is also using "convention over configuration". This slide shows you the conventions we'll use throughout the talk. (Code is in green monospace font. Variable parts are in orange. Where necessary, the character encoding of a string is indicated with a blue subsccript. Results are show after a double arrow. We'll use two symbols to show when regular expressions match or don't match.)

Audience Self-Intro

Who has used regular expressions?
Who has had problems with using regular expressions?
Who has looked at the implementation of regular expressions?

皆さんにちょっと自己紹介していただきたいです。
正規表現を使った方、手を挙げてください。
正規表現で手こずったことがある方、手を挙げてください。
正規表現の実装を除きのぞき程度でも見たことがある方、手を挙げてください。

Let's give everybody a chance to introduce yourselves. Who has used regular expressions? Who has had problems with using regular expressions? Who has looked at the implementation of regular expressions?

Speaker Self-Intro

From Switzerland, living and working in Japan
Ruby Committer
W3C Internationalization Interest Group Chair, IETF WG member,...
Professor at Aoyama Gakuin University (青山学院大学)
Teaching: Algorithms, Compilers, C programming, Math for CS students, Functional Programming,..., in Japanese and English
Research topics: Software Internationalization,
World Wide Web Architecture, Programming Languagues

自己紹介です。時間の都合、飛ばします。

This is some information about myself. I don't have much time, so let's move on.

Contributions to Ruby

Mainly in the following areas:

Encoding conversion (String#encode, Ruby 1.9)
Unicode normalization (String#unicode-normalize, Ruby 2.2)
Non-ASCII case conversion (String#upcase,..., Ruby 2.4)
Unicode version updates

Ruby コミッタとして主にご覧のあたりを担当しています。

As a Ruby committer, I'm mostly responsible for the areas listed on this slide.

Ruby Versions and Unicode Versions

Year (`y`)	Ruby version (`V`_Ruby)	Supported Unicode version (`V`_Unicode)
	published around Christmas	published in Summer
2014	2.2	7.0.0
2015	2.3	8.0.0
2016	2.4	9.0.0
2017	2.5	10.0.0 (as of September 6)

NOTE about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Update to new Unicode versions therefore only happens for new Ruby versions.

RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '10.0.0'

V_Unicode = y - 2007; V_Ruby = (y - 1992) · 0.1

V_Ruby = 1.5 + V_Unicode · 0.1 V_Unicode = V_Ruby · 10 - 15

Don't extrapolate too far!

ご覧のように、Ruby のバージョンと
対応している Unicode のバージョンには
2014 年以来、規則性が見られますが、
これがどこまで続くのかは保証できません。

As you can see, starting in 2014, there is some regularity between the Ruby version and the Unicode version it supports. But there is no guarantee that this relationship stays the same going forward.

Ruby Regular Expression Engine

Onigmo
Onigumo (おにぐも)
鬼雲? (devil's cloud)
鬼蜘蛛? (devil's spider)
- Dangerous Japanese spider
- Name used in various Manga

現在 Ruby が使っている正規表現のエンジンは「おにぐも」です。
意味についていろいろ憶測ができます。

The regular expression engine used by Ruby is called "Onigmo'. The name may mean various things; I can only speculate.

Outline

Introduction
Regular expression pitfalls: ⇐ We are here now
The smallest regular expression
Effective use of regular expressions:
Efficient pure Ruby Unicode normalization
Regular expression implementation internals:
Improving character property check implemetation
Conclusions

次のネタに移ります。目次では現在ご覧のところになります。

Let's move to the next section of this talk. You can see where we are in the table of contents.

Regular Expression Pitfalls

次に、正規表現のよくある落とし穴の一つを見てみたいと思います。

Next, let's look at a common pitfall with regular expressions.

Example Regular Expression

/stuff_between_slashes/

簡単に言うと、正規表現をスラッシュの間のものですよね。

In very easy terms, a regular expression is stuff between slashes.

The Smallest Regular Expression

何かよく理解しようとしているとき、
先ずその中でも極端なものを理解してみるのはお勧めします。
例えば一番小さいもの。
それでは、一番小さい正規表現はどのようなものでしょうか。
後にもっと大きいものが出ますので、しばらく我慢してください。

When trying to understand something, make sure you understand extreme cases, such as the smallest one. So let's look at the smallest regular expression. Don't worry, we'll look at some much bigger ones later.

The Smallest Regular Expression

//

これは一番小さい正規表現です。

This is the smalles regular expression.

`//`

スラッシュ二つだけです。何も入ってない。

Only two slashes. Nothing inside.

`//`

It's empty!

空っぽです。何もありません。

It's empty. There's nothing.

`//`

What does it match?

何にマッチするでしょう。

What does it match?

`//`

It will match nothing, yes?

何もないので、何もマッチできないよね。

There's nothing, so it won't match anything, yes?

`//`

It will match nothing, yes?

If you agree, raise your LEFT hand.

If you disagree, raise your RIGHT hand.

皆さん、これで良かったら左手を上げてください。
違ううと思ったら、右手を挙げてください。

If you think that's correct, please raise your left hand. If you think that's wrong, please raise your right hand.

`//`

It will match nothing, yes?
Sorry, wrong!

何もマッチしないというのは間違っています。

It's wrong that this doesn't match anything.

`//`

It will not match any character?

でも文字はハッチできないよね?

But it won't possibly match any characters, will it?

`//`

It will not match any character?
Better!

少しずつ分かってきたかもしれません。

It looks like we are getting closer.

`''` ≟ `//`

Maybe the empty string?

文字をマッチしなかったら、もしかしてから文字列をマッチしますか。

If it doesn't match any characters, maybe it will match the emty string?

`' '` ≅ `//`

Yes, the empty string!

あたり。空の正規表現は空の文字列とマッチします。

Yes indeed, the empty regular expression matches the empty string.

`'ab'` = `''+'a'+''+'b'+''`

Many, many empty strings!

でもこれだ明けではありません。
文字列にはからの文字列が一杯隠れています

But that's not all. A plain string may hide many empty strings.

`//` matches
all these strings!

// = ' '+'a'+' '+'b'+' '

空の正規表現はこの隠れている空の文字列を全てマッチします。

An empty regular expression matches all these hidden empty strings.

So `//` does not match any characters

空の正規表現は文字とは一切マッチしませんが、

An empty regular expression indeed doesn't match any character at all.

But `//` matches `''` between characters!

' a b ' ≅ //

文字の間 (そして最初と最後) にマッチします。

But it matches between characters (and at the start and at the end).

Applications for `//`?

空の文字列にマッチするのは
おそらく応用がないと皆さんがお考えでしょう。
本当ににそうでしょうか。

You probably thing there are no applications for matching empty strings. But is that really true?

`'abcde'`

非常につまらない文字列 abcde を見てみ見ましょう。

Let's take a very boring string, abcde.

Say it's Really Hot

非常に暑い日と仮定しましょう。

Let's say we have a very hot day.

Let's Cool it Down

ちょっと涼しくしましょう。

Let's try to feel a bit cooler.

Cool it Down with Ruby

'abcde'.gsub //, '☃'

Ruby でちょっと涼しくしましょう。

Let's try to feel a bit cooler with the help of Ruby.

Cool it Down with Ruby

'abcde'.gsub //, '☃'⇒ '☃a☃b☃c☃d☃e☃'

これで涼しくなったでしょうよね。

I hope you feel cooler already.

Heat it up with Ruby

'abcde'.gsub //, '♨'⇒ '♨a♨b♨c♨d♨e♨'

逆もできます。日本の温泉っていいよね。

We can also feel hotter. I can really recommend Japanese hot springs.

Hidden Empty Regular Expressions

" xyz"≅/a+|b*/

じゃ、冗談の話を置いておいて、なぜ空の正規表現は大事なのでしょうか。
「普通」の正規表現に空の正規表現が含まれることがあります。
それを見抜けないと、意外なところで訳の分からないマッチが起こります。
よくある正規表現の落とし穴です。

Jokes apart, why are empty regular expressions important? Well, "everyday" regular expressions can contain empty regular expressions as subexpressions. If you are not aware of these, you may get unexpected matches. This is a frequent pitfall for regular expressions.

Outline

Introduction
Regular expression pitfalls:
The smallest regular expression
Effective use of regular expressions: ⇐ We are here now
Efficient pure Ruby Unicode normalization
Regular expression implementation internals:
Improving character property check implemetation
Conclusions

次のネタに移ります。目次では現在ご覧のところになります。

Let's move to the next section of this talk. You can see where we are in the table of contents.

Effective Use of Regular Expressions

正規表現の有効な使い方について一つ例を紹介したいと思います。

I’ll show you an example of effective use of regular expressions.

Look Outside

Consider more than just =~ (aka. match) and sub

slice
split
scan
gsub

正規表現そのものより、
正規表現の外も考えると正規表現を効果的に使える方法が見つかるかもしれません。
正規表現を使うメソッドは沢山あります。
何かのプログラムを考えているときに、
よく scan とか gsub とかのドキュメントを読んで、
何か効果的な使い方がないのかを見てみ見ます。
これからこのような例を一つ紹介したいと思います。

In order to use regular expressions effectively, it may be necessary to look not at regular expressions themselves, but outside. There are many methods using regular expressions. When I'm thinking about a problem, I'm often reading the documentation of methods such as scan and gsub in the hope that I find some efficent ay of solving the problem. I want to show you an example.

Efficient Pure Ruby Unicode Normalization

For more details, see also
Implementing Normalization in Pure Ruby - the Fast and Easy Way
(IUC 37, Santa Clara, CA, U.S.A., 22 October 2013)
(together with Ayumu Nojima (野島歩))

純粋な Rubyで効率的なユニコードの正規化の実装についてです。
数年前に研究室の野島歩と一緒に行った研究の成果です。

This is about efficient Unicode normalization in pure Ruby. This is the result of reseach done together with my student Ayumu Nojima a few years ago..

Unicode Normalization in Ruby

String#unicode_normalize :nfc|:nfd|:nfkc|:nfkd

String#unicode_normalized? :nfc|:nfd|:nfkc|:nfkd

現在の String クラスの unicode_normalize メゾッドなどの実装についてです。

It's about the current implementation of unicode_normalie etc. in the String class.

Four Forms

Original	NFD `:nfd`	NFC `:nfc`	NFKD `:nfkd`	NFKC `:nfkc`
R + ̄ + ̣; Ṛ + ̄	R + ̣ + ̄	Ṝ	R + ̣ + ̄	Ṝ
Å	A + ̊	Å	A + ̊	Å
晴	晴	晴	晴	晴
가; ᄀ + ᅡ	ᄀ + ᅡ	가	ᄀ + ᅡ	가
⁴₄④¾⑷⒋	⁴₄④¾⑷⒋	⁴₄④¾⑷⒋	4443⁄4(4)4.	4443⁄4(4)4.

ご覧のように、ユニコード正規化には四種類あります。

As you can see, there are four forms of Unicode Normalization.

What's Needed for Normalization

See Unicode Standard Annex #15: Unicode Normalization Forms

Convert code units (UTF-8/UTF-16) to scalar values (U+BA98)
Look up decompositions (Ṛ → R + ̣)
Look up combining classes ( ̣ → 220)
Search for composable character pairs (Ṛ + ̄ → Ṝ)
Convert back to code units

⇒ Lots of calculations and lookups

⇒ Slow in scripting language

仕様書を分析すると、色々な操作が必要が、正規表現はどこにも見当たりません。逆に Ruby で書くと遅くなる操作がいっぱいあります。

Reading the specification, one finds lots of operations, but no place where one could use a regular expression. On the other hand, these operatons won't run fast if they are written in Ruby.

Example: German to NFC

Letters in core German affected by normalization:

Ä, ä, Ö, ö, Ü, ü

Converting German to NFC:

string.gsub "A\u0308", "Ä"
      .gsub "a\u0308", "ä"
      ...

じゃー、実際の応用から考えましょう。
例えばドイツ語の場合、正規化に関連する文字は6個しかありません。
正規表現の使い道が見えたかもしれません。

But let's look at actual use. For example in German, there are only 6 characters affected by normalization. Maybe there's a way to use regular expressions here.

To explain the idea with a simplistic example, let's take a look at German. Core German (not including loanwords) only has six letters that are affected by normalization. Normalization to NFC could be done simply by replacing all occurrences of "A" followed by a combining diaeresis (U+0308) with "Ä", and so on for the other umlaut characters.

Of course, this approach is not directly usable. Even for German, the string would have to be scanned six times, which is way slower than just scanning it one time, and the solution would have to be hand-tailored to different languages and language combinations.

But we can make things more automatic using more of Ruby's primitives.

Example: Single Method Call

string.gsub /[AaOoUu]\u0308/ do |c|
           case c
           when "A\u0308"
               "Ä"
           when "a\u0308"

               "ä"
           ...
           end
       end

効率を上げるために、一つの gsub の呼び出しにまとめます。でも case はまだちょっと違和感があります。

To make this more efficient, we'll use only one call to gsub. But the 'case' expression still looks somewhat strange.

Example: Single Method Call

string.gsub /[AaOoUu]\u0308/ {
           "A\u0308" => "Ä",
           "a\u0308" => "ä"
           ...
       }

gsub の説明をよく見ると、Hash が使えることが思いつくかもしれません。

If we look at the documentation of 'gsub' carefully, we may see that we can use a hash.

One Little Problem Left

There are other languages than German
Some with many more relevant characters
Algorithm not limited to characters in actual languages
Number of combining characters around a base character not limited

残念ながら、まだドイツ語にしか対応できていません。
他の言語の文字もたくさんありますし、
一つの基底文字に付けられる結合文字の数の制限はありません。

We are done with German, but there are other languages, with many characters. Also, the number of combining characters around a base character is not limited.

A Better Regular Expression

Essentially /[BaseCharacters][CombiningCharacters]+/
In file lib/unicode_normalize/tables.rb
(caution: not part of standard library, but core)
Generated by template/unicode_norm_gen.tmpl
Regular expression part about 200 lines

実際に必要になる正規表現はでかくなりますし、自動生成が必要です。

The actual regular expression is quite big, and has to be generated automatically.

A Better Hash

Infinite-size Hash not an option
Could go back to block, but would be inefficient
So let's be bold:
What about an empty Hash?

Hash のところでは正規表現みたいに
組合せをコンパクトに表現できないです。
無限の Hash は無理なので、空の Hash はどうでしょうか。

Regular expressions allow us to express combinations, but that doesn't work for a Hash. So instead of an infinite-size Hash, why not be bold and use an empty Hash..

Building the Hash

Start empty
Define default block
When an entry is found:
- use it
When an entry is not found:
- create it (actual normalization goes here)
- add it
- use it
Hash works as a cache

幸いながら、Ruby の Hash にキーが見つからない場合に呼ばれるブロックが使えます。
これによって、Hash を徐々に成長させ、キャッシュとして使うことが可能です。

Fortunately, Ruby Hashes have default blocks that get called when a key is not found. So we can calculate the normalization and let the Hash grow and usi it as a cache

Code Skeleton

NFC_HASH = Hash.new do |h, run|   # create hash for NFC
    h[run] = nfc_run run          # add nfc for run to hash
end

class String
    def unicode_normalize         # add nfc to String class
        gsub NFC_REGEXP, NFC_HASH
    end
end

Actual code is at lib/unicode_normalize/normalize.rb,
with some stub code in string.c

結果として、ご覧のような構造の実装となります。

As a result, we get the following code sceleton.

Performance Experiments

Machine
- Panasonic Let's note notebook CF-J10
- Intel Core i7-2620M CPU 2.7GHz, 8MB main memory
Software
- Windows 7 Professional 64bit, Service Pack 1
- Ruby Installer Ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
Data
- Randomly downloaded Wikipedia pages
- German (89KB), Japanese (223KB), Vietnamese (85KB), Korean (59KB)

ご覧の通りに、実験を行いました。

We conducted experiments under the conditions above.

Results

(German, more at GitHub, milliseconds)

	eprun	unicode_utils (1.4.0)	twitter_cldr (2.4.2)
NFD	12.48	75.19	3884
NFC	15.60	191.73	4072
NFKD	18.56	118.71	8627
NFKC	22.00	215.75	10577

Fastest implementation in pure scripting language!

結果として、仕様書通りの実装に比べて、大幅に早くなりました。

As a result, we are a lot faster than implementations that just followed the specification.

Why so Fast

Normalization of each run calculated only once
After warmup, everything runs in C
Regular expression serves as quick check

Gets faster if implementation (gsub, Regexp, Hash) get faster

早さの理由はいくつかあります。
とくに、プログラムは純粋な Ruby なのに、
実行中、ほとんどのとき C 関数内です。

There are many reasons why we are faster. In particular, while the code we wrote is pure Ruby, the execution mostly happens in C functions.

Ruby Only

gsub with Hash
since Ruby 1.9
Hash with updates
Not available in Perl, Python, JavaScript,...

この実装で使った機能は Ruby 独特なものなので、他の言語ではなかなかまねできません。

The functionality we used in this implementation is specific to Ruby. Therefore, it cannot easily be reproduced in other languages.

Personal Storyline

NFC-like proposal in draft-duerst-i18n-norm-00 in 1997
Definition of NFC/NFKC by Unicode Consortium in 1999
Implemented normalization in Perl and C
gsub with Hash introduced at a Ruby Committer meeting in 2007
Student research 2010-2013
Published as eprun gem in 2013
Added to Ruby in Ruby 2.2 (2014)

以上の説明は結構簡単に見えるかもしれません。
しかし、長い間ユニコードの正規化についてかかわったこと、
ちょうどまつもとさんと初めて出会った時で、
初めての Ruby コミッターの会合でgsub に Hash が追加されたことも影響したかもしれません。

The explanations up to here may sound quite easy. But this idea may not have been possible without my longtime involvement in Unicode normalization and the fact that just the first time I met Matz and participated in a Ruby committer meeting, Hash was added to gsub.

Outline

Introduction
Regular expression pitfalls:
The smallest regular expression
Effective use of regular expressions:
Efficient pure Ruby Unicode normalization
Regular expression implementation internals: ⇐ We are here now
Improving character property check implemetation
Conclusions

次のネタに移ります。目次では現在ご覧のところになります。

Let's move to the next section of this talk. You can see where we are in the table of contents.

Inside Regular Expressions

次に、正規表現の実装の内部について少しお話をしたいです。

Next, I want to talk a bit about the internals of regular expression implementation.

Improving Unicode Property Support

(together with Takumi Koyama (小山拓美))

ユニコード・プロパーティの実装の改善についてです。
去年度に研究室の小山拓美君と行った研究です。

This is about improving Unicode Property support. This is the result of reseach done together with my student Takumi Koyama last (academic) year.

Slide from Ruby Kaigi 2016

Unicode provides a wide range of character properties
Most available in Regexp
Does this string contain a Hiragana character?
'Юに코δ' =~ /\p{Hiragana}/
What script is 'Ю'?
sorry, impossible! 不可能!
Currently looking at this with a student, hopefully
- For Ruby ~2.5
- Use less memory
- Faster
- More properties
- More ways to use

これは去年の Ruby 会議の発表で見せたスライドです。
メモリ節約、加速、などなどずいぶん大胆な予想でした。

This is a slide I showed in my presentation at last year's Ruby Kaigi. Less memory, faster,...:I made some quite bold predictions.

Properties in Regular Expressions

Check for a character with a property with \p{}
- Examples (positive):
  \p{Digit} (match a digit)
  \p{Hiragana} (match a Hiragana)
- Examples (negative):
  \p{^Digit} or \P{Digit} (match everything else than a digit)
  \p{^Hiragana} or \P{Hiragana} (match everything else than a Hiragana)
Usable inside character class
- Example (or):
  [\p{Digit}\p{Hiragana}] (match a digit or a Hiragana)
- Example (and):
  [\p{Tamil}&&\p{Digit}] (match Tamil digits: ௦௧௨௩௪௫௬௭௮௯)

正規表現内、文字のプロパーティはバックスラッシュpで表します。
文字クラスの中でも使えるし、その場合は論理積も使えます。

In regular expressions, we can check for character properties with backslash-p. We can use this also inside character classes, where logical 'and' is also available.

Importance of Unicode Properties

Unicode:
- Code point assignments (equivalent to ISO 10646)
- Properties
- Algorithms (using properties)
Internationalization in Ruby:
- Encodings
- Regular Expressions ⇒ Properties
  Somebody: I have an internationalization problem.
  Matz: You can use regular expressions.

ASCII などの文字コードではプロパーティは少ないが、
ユニコードの場合はとても大切な役割を補っています。
Ruby では、松本さんがよく言うように、
国際化機能には多くの場合、
正規表現やプロパーティが使われる。

In legacy encodings such as ASCII, there are not many properties. But character properties are very central to Unicode. For Ruby, properties are also imprtant, because, as Matz often says, regular expressions are the main tool for internationalization issues.

Types of Properties

Boolean properties (二値プロパティ):
- The property is either true or false
- Examples: Uppercase, Lowercase,...
Enumerated properties (多値プロパティ):
- The property can have multiple values
- Examples: Bidi Class, East Asian Width

プロパーティは主に二値プロパーティと多値プロパーティに分類できます。

Properties can be classified into Boolean properties and enumerated, i.e. multi-valued, properties.

Currently Supported Properties

List of supported properties and values
Very good support for binary properties (55/60)
Not much support for multi-valued properties (7/43)
Why this difference?

(as of Unicode 9.0.0)

Onigmo を通じて、Ruby はで現在、
二値プロパーティのサポートが非常に良いが、
多値プロパーティのサポートは足りません。
後で分かるように、これは Onigmo の実装と関係しています。

Due to Onigmo, Ruby currently has very good support for Boolean properties. But the support for enumerated properties is not up to par. As we will see soon, the reason for this lies in the details of the Onigmo implemenation.

Property Data

Essentially, huge matrix (think spreadsheet)
- One character per row
  - Includes unassigned code points
  - 17 · 2¹⁶ = 1'114'112 rows
    (Excel maximum: 1'048'576 rows)
- One property per column (100++ columns)
Direct storage very inefficient
Compression needed

プロパーティのデータは膨大です。
110万の文字番号・行で、100以上のプロパーティ・列のスプレッドシートで考えてください。
そのまま保存すると効率が悪いので、何かしらの圧縮する必要がある。

The data needed for property support is huge. Immagine a spreadsheet with over one million code points/rows and over 100 properties/columns. It's very inefficient to store this data directly. Therefore, some kind of compression is needed.

Current Implementation: Data for Boolean Property

From enc/unicode/10.0.0/name2ctype.h

/* 'Radical': Binary Property */
static const OnigCodePoint CR_Radical[] = {
        3,                /* number of intervals */
        0x2e80, 0x2e99,   /* ⺀ ~ ⺙ */
        0x2e9b, 0x2ef3,   /* ⺛ ~ ⺀ */
        0x2f00, 0x2fd5,   /* ⼀ ~ ⺀ */
}; /* CR_Radical */

Values are boundaries of intervals
Inside the intervals, the property is true
Outside the intervals, the property is false

文字が漢字の部首であるかどうかを示す「Radical」のプロパーティの実装に必要なデータです。
プログラム言語は C です。
値は区間の境界線を表している。
区間内にプロパーティは「真」で、外は「偽」になります。

This is the data used for implementing the "Radical" property, which indicates whether a character is a Kanji radical or not. The values are the boundaries of intervals. Inside the boundaries, the property is true, outside, it is false.

Current Implementation: Inversion Lists

Values are boundaries of intervals

        0x2e80, 0x2e99,   /* ⺀ ~ ⺙ */

This means:
"Radical" starts at U+2e80 and goes up to U+2e99, and so on
In other words:
Invert from non-Radical to Radical at U+2e80
Invert from Radical to non-Radical at U+2e99
This is called an Inversion List (反転リスト)

データの値は区間の境界線。
境界線で値の有無が反転します。
そのため、このようなデータ構造を「反転リスト」と言います。

The data values are the boundaries of the intervals. At the boundaries, the value of the property is inverted. This kind of data structure is therefore called "Inversion List".

Current Implementation: Enumerated Properties

Inversion lists work well for binary properties
Enumerated properties cannot be implemented directly
Enumerated properties are decomposed:
One inversion list for each property value
Example: Script Property
- Example values: Latin, Greek, Arabic, Tamil, Hiragana, Katakana,...
- Total number of values: 138 in Unicode 10.0.0

反転リストは二値プロパーティに最適です。
しかし、直接多値プロパーティには使えません。
多値プロパーティの場合にはプロパーティ値ごとに反転リストを用意する必要があります。
例えば「スクリプト」(文字体系・文字種ともいう) の場合、138個の反転リストが必要になります。

Inversion lists work well for binary properties. But they cannot be used directly for enumerated properties. For enumerated properties, an inversion list has to be provided for each property value. As an example, the Script Property needs 138 inversion lists.

Additional Example: Data for Katakana

/* 'Katakana': Script */
static const OnigCodePoint CR_Katakana[] = {
        8,                /* number of intervals */
        0x30a1, 0x30fa,   /* ァ ~ ヺ */
        0x30fd, 0x30ff,   /* ヽ ~ ヿ */
        0x31f0, 0x31ff,   /* ㇰ ~ ㇿ */
        0x32d0, 0x32fe,   /* ㋐ ~ ㋾ */
        0x3300, 0x3357,   /* ㌀ ~ ㍗ */
        0xff66, 0xff6f,   /* ｦ ~ ｯ */
        0xff71, 0xff9d,   /* ｱ ~ ﾝ */
        0x1b000, 0x1b000, /* KATAKANA LETTER ARCHAIC E */
}; /* CR_Katakana */

ここで例として、スクリプトプロパーティのプロパーティ値がカタカナのための反転リストを示します。

Here's another example, the inversion list for the property value "Katakana" of the property "Script".

Current Implementation: Property Check

From regcomp.c, function onig_is_in_code_range:

  for (low = 0, high = n; low < high; ) {
    x = (low + high) >> 1;
    if (code > data[x * 2 + 1])
      low = x + 1;
    else
      high = x;
  }
  return ((low < n && code >= data[low * 2]) ? 1 : 0);

Binary search!

Execution time dependent on length of inversion list

ご覧のように、プロパーティのチェックには
反転リストの二分探索が使われます。
実行時間は反転リストの長さに依存しています。

As you can see from this code, properties are checked by using binary search on an inversion list. The execution time depends on the length of the inversion list.

Current Implementation: Data Size

enc/unicode/10.0.0/name2ctype.h is 38'088 lines, 767'874 bytes
535 inversion lists (Unicode 9.0.0)
30'018 intervals (Unicode 9.0.0)
Shortest inversion list: 1 interval

Longest inversion list: 791 intervals

/* 'Grapheme_Base': Derived Property */
static const OnigCodePoint CR_Grapheme_Base[] = {
        791,
        0x0020, 0x007e,

OnigCodePoint is 4 bytes
Total memory usage:
(535 + 30'018 · 2) · 4 = 252'284 bytes
Compiled into enc/unicode.o (688'263 bytes on Cygwin)

このスライドでは、現在の実装に必要なメモリの量を計算しています。
ユニコードバージョン 9 には535の反転リスト、3万ほどの区間があります。
結果として、全体に必要なメモリは250KBです。

This slide shows the calculations for the total memory needed by the current implementation. For Unicode Version 9, there are 535 inversion lists, about 30'000 intervals. This results in a total memory usage of about 250KB.

File Size

enc/unicode/10.0.0/name2ctype.h is 38'088 lines, 767'874 bytes

Trying to download from https://svn.ruby-lang.org/:

An Exception Has Occurred

Display of files larger than 512 KB disallowed by configuration

HTTP Response Status

403 Forbidden

Fortunately, checkout with svn (or github) works

ファイルの大きさはご覧のようにも把握できます。

One more way to show how large the file is.

New Approach: Equivalence Classes

Core observations:
- Look at rows (horizontal), not columns (vertical)
- There are many characters where all properties are the same
Group such characters into equivalence classes
Unicode 9.0.0: 3747 equivalence classes

See Feature #13240

さて、現在の実装は改善できるでしょうか。
次の考え方で挑戦してみました。
まず、でかいスプレッドシートで、列のではなく、行に注目します。
すべてのプロパーティが同じの行が多く見かけます。
このような行を同値類にまとめます。
ユニコードバージョン9には 3747 の同値類があります。

So can we improve on the current implementation? We'll try with the following approach. First, in our big spreadsheet, we look not at columns, but rows. We find that many rows are exactly identical. We group such rows/characters into equivalence classes. In Unicode Version 9, there are 3747 equivalence classes.

New Approach: Property Representation

Each property is represented with a number of bits
Examples:
Uppercase: Boolean property ⇒ 1 bit
Script: 137 property values (scripts) ⇒ 8 bits
Properties are grouped into words (32 bits)
In Unicode 9.0.0, 192 bits (24 bytes, 6 words) needed
Each property-value combination(e.g. Script=Katakana) is identified by
1. Word number
2. Bit mask for property
3. Bit pattern for property value

プロパーティごとに数ビットを使います。
二値プロパーティの場合には1ビットになります。
スクリプトの場合、プロパーティ値は137ありますので、8 ビットが必要となります。
それぞれのプロパーティのビット数を考慮して、ワードに詰めます。
ユニコード9では6ワードで足ります。

We represent each property with a number of bits. Boolean propertios need only one bit. The Script property, which has 137 varues, needs 8 bits.

New Approach: Property Check

  /* get Property Value data */
  propVal = propValInfo[ctype];
  /* get bit pattern for equivalence class */
  return (propertyData[equivClass]
                      /* select word */
                      [propVal.word]
          /* extract property value bit pattern */
          & propVal.mask
           /* check for matching property value pattern */
         ) == propVal.value ? 1 : 0;

Constant-time lookup!

プロパーティ値チェックの処理は少々複雑ですが、一定時間で可能です。

The processing needed to check for a property value is slightly complicated, but can be done in constant time.

Property Check Example

/* data for equivalence class of Hiragana 「こ」 */ 10000111001000000011000010100010

/* mask for Script property */ & 00000000111111110000000000000000

/* bit pattern for property value Hiragana */ = 00000000001000000000000000000000

/* result is same as bit pattern ⇒「こ」 is Hiragana */ == 00000000001000000000000000000000

実際の処理の一例です。平仮名の「こ」の字が平仮名かどうかのチェックです。

This is an example with actual data. We are checking whether the Hiragana character "ko" is a Hiragana or not.

New Approach: Data Size (first try)

Data for equivalence classes:
3747 · 24 = 89'928 bytes
Codepoint → equivalence class:
17 · 2¹⁶ · 2 = 2'228'224 bytes
Total:
2'318'152 bytes
Way too much!
Improvements needed!

この新しい実装で、文字番号ごとの同値類を記憶する必要がありますので、
そのままでは必要なメモリが多すぎます。もうちょっと改善が必要です。

In this new implementation, we need to know the equivalence class of each code point. That still needs too much memory. We need another improvement.

New Approach: Two-Step Equivalence Class Lookup

Group equivalence class information into slices of 256 characters
Find slices that are exactly the same
Examples:
- All Kanji
- All empty
Structure as a two-layer DAG (direct acyclic graph)

同値類の探索を二段階にすると、メモリの節約ができます。

By using a two-step lookup for equivalance classes, we can save a lot of memory.

New Approach: Data Size (second try)

Codepoint → slice:
17 · 2¹⁶/256 · 1 = 4'352 bytes
Slices:
210 · 256 · 2 = 107'520 bytes
Equivalence classes:
3747 · 24 = 89'928 bytes
Property Values:
1009 · 3 · 4 = 12'108 bytes
Total:
213'908 bytes
Looks good!

結果として、213KB ぐらいのメモリで済みます。

As a result, we happen to get by with 213KB of data.

Performance Experiments

Machine
- 3.33 GHz Intel Core i5
Software
- Windows 10 Enterprise 64 bit OS
- x86_64 Cygwin
- ruby 2.5.0dev (2016-12-27 trunk 57204)
Data
- 10 characters:
  a, Б, س, क, ∉, こ, 青, 가, , 
- Run property checks 10 million times
- Innermost method call only

実行時間の測定はご覧の通りの条件で行いました。

This is the setup for our performance experiments.

Processing Speed

Graph of processing speed and number of intervals.

ご覧のように、現在の実装では、反転リストの区間が多くなるにつれ、処理が遅くなります。
二分探索から予想できるように、大体区間の数の対数に比例しています。
新しい実装では、全てのプロパーティ値の場合、
古い実装の一番早い実行時間とほぼ同等になります。

As you can see, in the current implementation, processing gets slower if the number of intervals in the inversion list increases. Consistent with binary lookup, the increase is proportional to the logarithm of the number of intervals. In the new implementation, for every property value, we achieve about the same low time as for the property value with the lowest time in the current implemenation.

Property Discrimination

Regular expression's \p checks for one specific value
Example: Is this a Hiragana?
Sometimes, we want to know e.g. the script of a character
Does this string contain a Hiragana character?
'Юに코δ' =~ /\p{Hiragana}/
What script is 'Ю '?
sorry, impossible with inversion lists! 不可能!
Becomes possible with equivalence classes

See Feature#13241

正規表現ではプロパーティの特定な値は確認できますが、
ある文字のあるプロパーティはどの値なのか調べることはできません。
同値類を使った実装ではこのような機能の追加は簡単になります。

With regular expressions, you can check characters for specific property values. However, it's not possible to ask what property value for a property that a character has. With an implementation based on eqivalence classes, providing this functionality becomes easy.

Old versus New

		Old (current)	New (future?)
Properties Supported	Boolean Properties	55	56
	Enumerated Properties	7	20
	Property Values	554	1009
Memory Usage (bytes)		252'284	213'908
Lookup Speedup		depending on list length	up to 9 times faster
Discrimination Speed		very slow	very fast

現在の実装と新実装を比べると、ご覧のようになります。

This is a comparison of the old and the new implementation.

Slide from Ruby Kaigi 2016

Unicode provides a wide range of character properties
Most available in Regexp
Does this string contain a Hiragana character?
'Юに코δ' =~ /\p{Hiragana}/
What script is 'Ю'?
sorry, impossible! 不可能!
Currently looking at this with a student, hopefully
- For Ruby ~2.5 ×
- Use less memory ○
- Faster ○
- More properties ○
- More ways to use ○

Ruby Kaigi 2016 のスライドに戻りますと、予想の多くは達成できました。

Comming back to the slide from Ruby Kaigi 2016, we see that most of our predictions were realized.

Problems

Not yet full implementation
Need to coordiante with Onigmo
Combinations of property checks may get slower
- Inversion lists allow logical operations
  (E.g. inversion list ∧ inversion list → inversion list)
- For equivalence classes, logical operation at run time
May be solved by creating custom properties for frequently used combinations
(e.g. Kana: Hiragana ∨ Katakana)
Research ongoing

残念ながら、新しい実装に問題点もいくつかありますが、研究は現在も続けています。

The new implementation also has some problems, but we are continuing research.

Acknowledgments

Takumi Koyama (小山拓美) for his help with Unicode Properties
Ayumu Nojima (野島歩) for his help with Unicode Normalization
Many other students for help with research and implementations
Openclipart for the World icon
Yui Naruse (成瀬ゆい), Nobu Nakada (中田伸悦) and many other Ruby committers for help and support
Matz (まつもとゆきひろ) for Ruby, a programmer's best friend
Amaya, Opera 12.17, and coderay for slide production and display
Nozomi Kikuchi (菊池のぞみ) and SVuGy (sorry, unpublished version) for creating the graphs

本研究と発表、そして Ruby そのものにあったって、多くの方々に深く感謝します。

I want to deeply thank the many people who have contributed to this research and this preparation, as well as Ruby in general.

Summary

Empty regular expression:
Be careful about //
Outside of regular expressions:
Consider various methods/arguments for effective regular expression use
String#gsub with Hash ⇒ efficient pure Ruby imlementation of String#unicode_normalize
Inside of regular expressions:
\p character property checks may get faster in the future

まとめです。
空の正規表現は要注意です。
正規表現の外側も考慮すると有効な実装に津奈良ります。
文字のプロパーティのチェックがもしかして将来早くなります。

I'm summarizing. Be careful with empty regular expressions. Consider methods outside regular expresions for effective regular expression use. Hopefully, \p regular expression lookup will get faster.

Q & A

Send questions and comments to Martin Dürst
(mailto:duerst@it.aoyama.ac.jp)
or open a bug report or feature request

The latest version of this presentation is available at:

http://www.sw.it.aoyama.ac.jp/2017/pub/Ruby_Kaigi/

Glossary / 用語集

(incomplete)

regular expression: 正規表現
convention over configuration: 設定より規約
character encoding: 文字コード
(Unicode)(character) normalization: 正規化
interval: 区間
inversion list: 反転リスト
equivalence class: 同値類

Regular Expressions Inside Out

Ruby Kaigi 2017, Hiroshima, Japan, September 19, 2017

Martin J. DÜRST (マーティンと呼んでください)

Abstract

For Best Viewing

Text/Translation

Introduction

Topics

Outline

Conventions Used

Audience Self-Intro

Speaker Self-Intro

Contributions to Ruby

Ruby Versions and Unicode Versions

Ruby Regular Expression Engine

Outline

Regular Expression Pitfalls

Example Regular Expression

The Smallest Regular Expression

The Smallest Regular Expression

//

//

//

//

//

//

//

//

'' ≟ //

' ' ≅ //

'ab' = ''+'a'+''+'b'+''

// matches all these strings!

So // does not match any characters

But // matches '' between characters!

Applications for //?

'abcde'

Say it's Really Hot

Let's Cool it Down

Cool it Down with Ruby

Cool it Down with Ruby

Heat it up with Ruby

Hidden Empty Regular Expressions

Outline

Effective Use of Regular Expressions

Look Outside

Efficient Pure Ruby Unicode Normalization

Unicode Normalization in Ruby

Four Forms

What's Needed for Normalization

Example: German to NFC

Example: Single Method Call

Example: Single Method Call

One Little Problem Left

A Better Regular Expression

A Better Hash

Building the Hash

Code Skeleton

Performance Experiments

Results

Why so Fast

Ruby Only

Personal Storyline

Outline

Inside Regular Expressions

Improving Unicode Property Support

Slide from Ruby Kaigi 2016

Properties in Regular Expressions

Importance of Unicode Properties

Types of Properties

Currently Supported Properties

Property Data

Current Implementation: Data for Boolean Property

Current Implementation: Inversion Lists

Current Implementation: Enumerated Properties

Additional Example: Data for Katakana

Current Implementation: Property Check

Current Implementation: Data Size

File Size

An Exception Has Occurred

HTTP Response Status

Regular Expressions
Inside Out

Ruby Kaigi 2017 , Hiroshima, Japan, September 19, 2017

`//`

`//`

`//`

`//`

`//`

`//`

`//`

`//`

`''` ≟ `//`

`' '` ≅ `//`

`'ab'` = `''+'a'+''+'b'+''`

`//` matches
all these strings!

So `//` does not match any characters

But `//` matches `''` between characters!

Applications for `//`?

`'abcde'`