Regular Expressions
Inside Out

Ruby Kaigi 2017, Hiroshima, Japan, September 19, 2017

Martin J. DÜRST (マーティンと呼んでください), Aoyama Gakuin University

Ruby Programming Language

© 2017 Martin J. Dürst, Aoyama Gakuin University


Regular expressions are a very important part of the toolkit of every Ruby programmer. This talk will help improve your understanding of regular expressions, including how to use them from Ruby, and how they are implemented.

Examples will include things Ruby can do but other programming languages can't, huge regular expressions, substitutions that change as we go, and performance improvements for future Ruby versions.

For Best Viewing

These slides have been created in HMTL, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). The slides have been optimized for a screen size of 1920x1080pixels, but can easily be viewed on other screens, too. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.



[This is [the English translation of] the manuscript. Of course, I can't guarantee that I'll read it word-for-word. At the end of the slides, there is a glossary.]




Original Title: Regular Expressions from the Outside and from the Inside

Updated Title: Regular Expressions Inside Out

Originally, long list of topics

Reduced more and more, three topics left

昨日このような T-shirt を着ました。[IETF 79 (2009) の T-shirt で、赤の背景に鯉が書いてあるもの] その関係かどうかわかりませんが、昨日広島カープが優勝しました。

The topic of this talk is regular expressions. I simplified the title a bit, so I want to appologize to listeners who expected the original title. Yesterday, I was wearing this T-shirt. [T-shirt of IETF 79 (2009) with a carp drawn on a redbackground.] I'm not sure there's a connection, but anyway, the Hiroshima Carps baseball team won the championship yesterday. When I submitted the talk proposal, I had a lot of topics, but I had to reduced this to three topics.




This outline shows the three topics.


Conventions Used

Code is mostly green, monospace
puts 'Hello Ruby!'

Variable parts are orange
puts "some string"

Results are indicated with "⇒"
1 + 12

Matching of regular expressions is indicated with ≟, ≅, or ≇
'abc'/a/; 'def'/a/

Matching substrings are indicated with orage background



This presentation is also using "convention over configuration". This slide shows you the conventions we'll use throughout the talk. (Code is in green monospace font. Variable parts are in orange. Where necessary, the character encoding of a string is indicated with a blue subsccript. Results are show after a double arrow. We'll use two symbols to show when regular expressions match or don't match.)


Audience Self-Intro


Let's give everybody a chance to introduce yourselves. Who has used regular expressions? Who has had problems with using regular expressions? Who has looked at the implementation of regular expressions?


Speaker Self-Intro


This is some information about myself. I don't have much time, so let's move on.


Contributions to Ruby

Mainly in the following areas:

Ruby コミッタとして主にご覧のあたりを担当しています。

As a Ruby committer, I'm mostly responsible for the areas listed on this slide.


Ruby Versions and Unicode Versions

Year (y) Ruby version (VRuby) Supported Unicode version (VUnicode)
published around Christmas published in Summer
2014 2.2 7.0.0
2015 2.3 8.0.0
2016 2.4 9.0.0
2017 2.5 10.0.0 (as of September 6)

NOTE about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Update to new Unicode versions therefore only happens for new Ruby versions.


VUnicode = y - 2007; VRuby = (y - 1992) · 0.1

VRuby = 1.5 + VUnicode · 0.1      VUnicode = VRuby · 10 - 15

Don't extrapolate too far!

ご覧のように、Ruby のバージョンと
対応している Unicode のバージョンには
2014 年以来、規則性が見られますが、

As you can see, starting in 2014, there is some regularity between the Ruby version and the Unicode version it supports. But there is no guarantee that this relationship stays the same going forward.


Ruby Regular Expression Engine

現在 Ruby が使っている正規表現のエンジンは「おにぐも」です。

The regular expression engine used by Ruby is called "Onigmo'. The name may mean various things; I can only speculate.




Let's move to the next section of this talk. You can see where we are in the table of contents.


Regular Expression Pitfalls


Next, let's look at a common pitfall with regular expressions.


Example Regular Expression



In very easy terms, a regular expression is stuff between slashes.


The Smallest Regular Expression


When trying to understand something, make sure you understand extreme cases, such as the smallest one. So let's look at the smallest regular expression. Don't worry, we'll look at some much bigger ones later.


The Smallest Regular Expression



This is the smalles regular expression.




Only two slashes. Nothing inside.



It's empty!

空っぽです。 何もありません。

It's empty. There's nothing.



What does it match?


What does it match?


It will match nothing, yes?


There's nothing, so it won't match anything, yes?



It will match nothing, yes?

If you agree, raise your LEFT hand.

If you disagree, raise your RIGHT hand.


If you think that's correct, please raise your left hand. If you think that's wrong, please raise your right hand.



It will match nothing, yes?
Sorry, wrong!


It's wrong that this doesn't match anything.



It will not match any character?


But it won't possibly match any characters, will it?



It will not match any character?


It looks like we are getting closer.



Maybe the empty string?


If it doesn't match any characters, maybe it will match the emty string?



Yes, the empty string!


Yes indeed, the empty regular expression matches the empty string.


'ab' = ''+'a'+''+'b'+''

Many, many empty strings!


But that's not all. A plain string may hide many empty strings.


// matches
all these strings!

// = ''+'a'+''+'b'+''


An empty regular expression matches all these hidden empty strings.


So // does not match any characters


An empty regular expression indeed doesn't match any character at all.


But // matches '' between characters!


文字の間 (そして最初と最後) にマッチします。

But it matches between characters (and at the start and at the end).


Applications for //?


You probably thing there are no applications for matching empty strings. But is that really true?



非常につまらない文字列 abcde を見てみ見ましょう。

Let's take a very boring string, abcde.


Say it's Really Hot


Let's say we have a very hot day.


Let's Cool it Down


Let's try to feel a bit cooler.

Cool it Down with Ruby

'abcde'.gsub //, '☃​'

Ruby でちょっと涼しくしましょう。

Let's try to feel a bit cooler with the help of Ruby.


Cool it Down with Ruby

'abcde'.gsub //, '☃​'


I hope you feel cooler already.

Heat it up with Ruby

'abcde'.gsub //, '♨​'


We can also feel hotter. I can really recommend Japanese hot springs. 


Hidden Empty Regular Expressions

"xyz" /a+|b*/


Jokes apart, why are empty regular expressions important? Well, "everyday" regular expressions can contain empty regular expressions as subexpressions. If you are not aware of these, you may get unexpected matches. This is a frequent pitfall for regular expressions.




Let's move to the next section of this talk. You can see where we are in the table of contents.


Effective Use of Regular Expressions


I’ll show you an example of effective use of regular expressions.


Look Outside

Consider more than just =~ (aka. match) and sub

よく scan とか gsub とかのドキュメントを読んで、

In order to use regular expressions effectively, it may be necessary to look not at regular expressions themselves, but outside. There are many methods using regular expressions. When I'm thinking about a problem, I'm often reading the documentation of methods such as scan and gsub in the hope that I find some efficent ay of solving the problem. I want to show you an example.


Efficient Pure Ruby Unicode Normalization

For more details, see also
Implementing Normalization in Pure Ruby - the Fast and Easy Way
(IUC 37, Santa Clara, CA, U.S.A., 22 October 2013)
(together with Ayumu Nojima (野島 歩))

純粋な Rubyで効率的なユニコードの正規化の実装についてです。
数年前に研究室の野島 歩と一緒に行った研究の成果です。

This is about efficient Unicode normalization in pure Ruby. This is the result of reseach done together with my student Ayumu Nojima a few years ago..


Unicode Normalization in Ruby

String#unicode_normalize   :nfc|:nfd|:nfkc|:nfkd

String#unicode_normalized? :nfc|:nfd|:nfkc|:nfkd

現在の String クラスの unicode_normalize メゾッドなどの実装についてです。

It's about the current implementation of unicode_normalie etc. in the String class.


Four Forms

Original NFD :nfd NFC :nfc NFKD :nfkd NFKC :nfkc
R + ̄ + ̣; Ṛ + ̄ R + ̣ + ̄ R + ̣ + ̄
A + ̊ Å A + ̊ Å
가; ᄀ + ᅡ ᄀ + ᅡ ᄀ + ᅡ
⁴₄④¾⑷⒋ ⁴₄④¾⑷⒋ ⁴₄④¾⑷⒋ 4443⁄4(4)4. 4443⁄4(4)4.


As you can see, there are four forms of Unicode Normalization.


What's Needed for Normalization

See Unicode Standard Annex #15: Unicode Normalization Forms

⇒ Lots of calculations and lookups

⇒ Slow in scripting language

仕様書を分析すると、色々な操作が必要が、正規表現はどこにも見当たりません。逆に Ruby で書くと遅くなる操作がいっぱいあります。

Reading the specification, one finds lots of operations, but no place where one could use a regular expression. On the other hand, these operatons won't run fast if they are written in Ruby.


Example: German to NFC

Letters in core German affected by normalization:

Ä, ä, Ö, ö, Ü, ü

Converting German to NFC:

string.gsub "A\u0308", "Ä"
      .gsub "a\u0308", "ä"


But let's look at actual use. For example in German, there are only 6 characters affected by normalization. Maybe there's a way to use regular expressions here.

To explain the idea with a simplistic example, let's take a look at German. Core German (not including loanwords) only has six letters that are affected by normalization. Normalization to NFC could be done simply by replacing all occurrences of "A" followed by a combining diaeresis (U+0308) with "Ä", and so on for the other umlaut characters.

Of course, this approach is not directly usable. Even for German, the string would have to be scanned six times, which is way slower than just scanning it one time, and the solution would have to be hand-tailored to different languages and language combinations.

But we can make things more automatic using more of Ruby's primitives.


Example: Single Method Call

string.gsub /[AaOoUu]\u0308/ do |c|
           case c
           when "A\u0308"
           when "a\u0308"

効率を上げるために、一つの gsub の呼び出しにまとめます。でも case はまだちょっと違和感があります。

To make this more efficient, we'll use only one call to gsub. But the 'case' expression still looks somewhat strange.


Example: Single Method Call

string.gsub /[AaOoUu]\u0308/ {
           "A\u0308" => "Ä",
           "a\u0308" => "ä"

gsub の説明をよく見ると、Hash が使えることが思いつくかもしれません。

If we look at the documentation of 'gsub' carefully, we may see that we can use a hash.


One Little Problem Left


We are done with German, but there are other languages, with many characters. Also, the number of combining characters around a base character is not limited.


A Better Regular Expression


The actual regular expression is quite big, and has to be generated automatically.


A Better Hash

Hash のところでは正規表現みたいに
無限の Hash は無理なので、空の Hash はどうでしょうか。

Regular expressions allow us to express combinations, but that doesn't work for a Hash. So instead of an infinite-size Hash, why not be bold and use an empty Hash..


Building the Hash

幸いながら、Ruby の Hash にキーが見つからない場合に呼ばれるブロックが使えます。
これによって、Hash を徐々に成長させ、キャッシュとして使うことが可能です。

Fortunately, Ruby Hashes have default blocks that get called when a key is not found. So we can calculate the normalization and let the Hash grow and usi it as a cache


Code Skeleton

NFC_HASH = do |h, run|   # create hash for NFC
    h[run] = nfc_run run          # add nfc for run to hash

class String
    def unicode_normalize         # add nfc to String class
        gsub NFC_REGEXP, NFC_HASH

Actual code is at lib/unicode_normalize/normalize.rb,
with some stub code in string.c


As a result, we get the following code sceleton.


Performance Experiments


We conducted experiments under the conditions above.



(German, more at GitHub, milliseconds)

eprun unicode_utils (1.4.0) twitter_cldr (2.4.2)
NFD 12.48  75.19  3884
NFC 15.60 191.73  4072
NFKD 18.56 118.71  8627
NFKC 22.00 215.75 10577

Fastest implementation in pure scripting language!


As a result, we are a lot faster than implementations that just followed the specification.


Why so Fast

Gets faster if implementation (gsub, Regexp, Hash) get faster

とくに、プログラムは純粋な Ruby なのに、
実行中、ほとんどのとき C 関数内です。

There are many reasons why we are faster. In particular, while the code we wrote is pure Ruby, the execution mostly happens in C functions.


Ruby Only

この実装で使った機能は Ruby 独特なものなので、他の言語ではなかなかまねできません。

The functionality we used in this implementation is specific to Ruby. Therefore, it cannot easily be reproduced in other languages.


Personal Storyline

初めての Ruby コミッターの会合でgsub に Hash が追加されたことも影響したかもしれません。

The explanations up to here may sound quite easy. But this idea may not have been possible without my longtime involvement in Unicode normalization and the fact that just the first time I met Matz and participated in a Ruby committer meeting, Hash was added to gsub.




Let's move to the next section of this talk. You can see where we are in the table of contents.


Inside Regular Expressions


Next, I want to talk a bit about the internals of regular expression implementation.


Improving Unicode Property Support

(together with Takumi Koyama (小山 拓美))

去年度に研究室の小山 拓美君と行った研究です。

This is about improving Unicode Property support. This is the result of reseach done together with my student Takumi Koyama last (academic) year.


Slide from Ruby Kaigi 2016

これは去年の Ruby 会議の発表で見せたスライドです。

This is a slide I showed in my presentation at last year's Ruby Kaigi. Less memory, faster,...:I made some quite bold predictions.


Properties in Regular Expressions


In regular expressions, we can check for character properties with backslash-p. We can use this also inside character classes, where logical 'and' is also available.


Importance of Unicode Properties

ASCII などの文字コードではプロパーティは少ないが、
Ruby では、松本さんがよく言うように、

In legacy encodings such as ASCII, there are not many properties. But character properties are very central to Unicode. For Ruby, properties are also imprtant, because, as Matz often says, regular expressions are the main tool for internationalization issues.


Types of Properties


Properties can be classified into Boolean properties and enumerated, i.e. multi-valued, properties.


Currently Supported Properties

(as of Unicode 9.0.0)

Onigmo を通じて、Ruby はで現在、
後で分かるように、これは Onigmo の実装と関係しています。

Due to Onigmo, Ruby currently has very good support for Boolean properties. But the support for enumerated properties is not up to par. As we will see soon, the reason for this lies in the details of the Onigmo implemenation.


Property Data


The data needed for property support is huge. Immagine a spreadsheet with over one million code points/rows and over 100 properties/columns. It's very inefficient to store this data directly. Therefore, some kind of compression is needed.


Current Implementation: Data for Boolean Property

From enc/unicode/10.0.0/name2ctype.h

/* 'Radical': Binary Property */
static const OnigCodePoint CR_Radical[] = {
        3,                /* number of intervals */
        0x2e80, 0x2e99,   /* ⺀ ~ ⺙ */
        0x2e9b, 0x2ef3,   /* ⺛ ~ ⺀ */
        0x2f00, 0x2fd5,   /* ⼀ ~ ⺀ */
}; /* CR_Radical */

プログラム言語は C です。

This is the data used for implementing the "Radical" property, which indicates whether a character is a Kanji radical or not. The values are the boundaries of intervals. Inside the boundaries, the property is true, outside, it is false.


Current Implementation: Inversion Lists


The data values are the boundaries of the intervals. At the boundaries, the value of the property is inverted. This kind of data structure is therefore called "Inversion List".


Current Implementation: Enumerated Properties

例えば「スクリプト」(文字体系・文字種ともいう) の場合、138個の反転リストが必要になります。

Inversion lists work well for binary properties. But they cannot be used directly for enumerated properties. For enumerated properties, an inversion list has to be provided for each property value. As an example, the Script Property needs 138 inversion lists.


Additional Example: Data for Katakana

/* 'Katakana': Script */
static const OnigCodePoint CR_Katakana[] = {
        8,                /* number of intervals */
        0x30a1, 0x30fa,   /* ァ ~ ヺ */
        0x30fd, 0x30ff,   /* ヽ ~ ヿ */
        0x31f0, 0x31ff,   /* ㇰ ~ ㇿ */
        0x32d0, 0x32fe,   /* ㋐ ~ ㋾ */
        0x3300, 0x3357,   /* ㌀ ~ ㍗ */
        0xff66, 0xff6f,   /* ヲ ~ ッ */
        0xff71, 0xff9d,   /* ア ~ ン */
        0x1b000, 0x1b000, /* KATAKANA LETTER ARCHAIC E */
}; /* CR_Katakana */


Here's another example, the inversion list for the property value "Katakana" of the property "Script".


Current Implementation: Property Check

From regcomp.c, function onig_is_in_code_range:

  for (low = 0, high = n; low < high; ) {
    x = (low + high) >> 1;
    if (code > data[x * 2 + 1])
      low = x + 1;
      high = x;
  return ((low < n && code >= data[low * 2]) ? 1 : 0);

Binary search!

Execution time dependent on length of inversion list


As you can see from this code, properties are checked by using binary search on an inversion list. The execution time depends on the length of the inversion list.


Current Implementation: Data Size

ユニコードバージョン 9 には535の反転リスト、3万ほどの区間があります。

This slide shows the calculations for the total memory needed by the current implementation. For Unicode Version 9, there are 535 inversion lists, about 30'000 intervals. This results in a total memory usage of about 250KB.


File Size

enc/unicode/10.0.0/name2ctype.h is 38'088 lines, 767'874 bytes

Trying to download from

An Exception Has Occurred

Display of files larger than 512 KB disallowed by configuration

HTTP Response Status

403 Forbidden

Fortunately, checkout with svn (or github) works


One more way to show how large the file is.


New Approach: Equivalence Classes

See Feature #13240

ユニコードバージョン9には 3747 の同値類があります。

So can we improve on the current implementation? We'll try with the following approach. First, in our big spreadsheet, we look not at columns, but rows. We find that many rows are exactly identical. We group such rows/characters into equivalence classes. In Unicode Version 9, there are 3747 equivalence classes.


New Approach: Property Representation

スクリプトの場合、プロパーティ値は137ありますので、8 ビットが必要となります。

We represent each property with a number of bits. Boolean propertios need only one bit. The Script property, which has 137 varues, needs 8 bits.


New Approach: Property Check

  /* get Property Value data */
  propVal = propValInfo[ctype];
  /* get bit pattern for equivalence class */
  return (propertyData[equivClass]
                      /* select word */
          /* extract property value bit pattern */
& propVal.mask /* check for matching property value pattern */ ) == propVal.value ? 1 : 0;

Constant-time lookup!


The processing needed to check for a property value is slightly complicated, but can be done in constant time.


Property Check Example

    /* data for equivalence class of Hiragana 「こ」 */

    /* mask for Script property */
&   00000000111111110000000000000000

    /* bit pattern for property value Hiragana */
=   00000000001000000000000000000000

    /* result is same as bit pattern ⇒「こ」 is Hiragana */
==  00000000001000000000000000000000


This is an example with actual data. We are checking whether the Hiragana character "ko" is a Hiragana or not.


New Approach: Data Size (first try)


In this new implementation, we need to know the equivalence class of each code point. That still needs too much memory. We need another improvement.


New Approach: Two-Step Equivalence Class Lookup


By using a two-step lookup for equivalance classes, we can save a lot of memory.


New Approach: Data Size (second try)

結果として、213KB ぐらいのメモリで済みます。

As a result, we happen to get by with 213KB of data.


Performance Experiments


This is the setup for our performance experiments.


Processing Speed

Graph of processing speed and number of intervals.


As you can see, in the current implementation, processing gets slower if the number of intervals in the inversion list increases. Consistent with binary lookup, the increase is proportional to the logarithm of the number of intervals. In the new implementation, for every property value, we achieve about the same low time as for the property value with the lowest time in the current implemenation.


Property Discrimination

See Feature#13241


With regular expressions, you can check characters for specific property values. However, it's not possible to ask what property value for a property that a character has. With an implementation based on eqivalence classes, providing this functionality becomes easy.


Old versus New

Old (current) New (future?)
Properties Supported Boolean Properties 55 56
Enumerated Properties 7 20
Property Values 554 1009
Memory Usage (bytes) 252'284 213'908
Lookup Speedup depending on list length up to 9 times faster
Discrimination Speed very slow very fast


This is a comparison of the old and the new implementation.


Slide from Ruby Kaigi 2016

Ruby Kaigi 2016 のスライドに戻りますと、予想の多くは達成できました。

Comming back to the slide from Ruby Kaigi 2016, we see that most of our predictions were realized.




The new implementation also has some problems, but we are continuing research.



本研究と発表、そして Ruby そのものにあったって、多くの方々に深く感謝します。

I want to deeply thank the many people who have contributed to this research and this preparation, as well as Ruby in general.




I'm summarizing. Be careful with empty regular expressions. Consider methods outside regular expresions for effective regular expression use. Hopefully, \p regular expression lookup will get faster.


Q & A

Send questions and comments to Martin Dürst
or open a bug report or feature request

The latest version of this presentation is available at:


Glossary / 用語集


regular expression
convention over configuration
character encoding
(Unicode)(character) normalization
inversion list
equivalence class