Slicing and Dicing Unicode Properties

http://www.sw.it.aoyama.ac.jp/2018/pub/IUC42_properties/

Internationalization and Unicode Conference 42
Santa Clara, CA, U.S.A., September 11, 2018

Martin J. DÜRST

duerst@it.aoyama.ac.jp, Aoyama Gakuin University

Ruby Programming Language

© 2017-8 Martin J. Dürst, Aoyama Gakuin University

Abstract

Unicode defines a large number of character properties for character classification and algorithms. This talk compares the two main ways of implementing property support, and discusses their advantages and disadvantages based on actual implementation experience. This talk is suited for users dealing with Unicode properties as well as for implementers.

Examples of Unicode character properties range from general category (letter, number, symbol,...) to specialized properties for tasks such as bidirectional layout and normalization. Many programming languages, libraries, and regular expression engines provide access to these properties. We provide some insights and results based on experimental work on the Onigmo regular expression engine that is used by the programming language Ruby.

Leaving special properties such as character name aside, the two main ways of implementing property support are inversion lists and folded tries. Moving from the former to the later, we succeeded to increase the number of properties covered from 62 to 76, and the number of property values covered from 554 to 1009, all while reducing the memory necessary from 240kB to 214kB. In addition, elimination of binary search made raw property checks up to 9 times faster, and lookup of specific property values from characters up to 65 times faster.

Comparing the two methods from a conceptual viewpoint, it is interesting to observe that when arranging property values in a huge table with characters as rows and properties as columns, inversion lists look at this table one column a time, whereas folded tries look at it one row at a time. Folded tries take advantage of the fact that most property value combinations are shared by a large number of characters, whereas inversion lists take advantage of the fact that subsequent characters often share the same property value. This understanding can lead to further improvements.

The work described in this presentation was done jointly with Takumi Koyama in 2016/17.

For Best Viewing

These slides have been created in HMTL, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). The slides have been optimized for a screen size of 1920x1080pixels, but can easily be viewed on other screens, too. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

 

Introduction

Motivation:
Slide from Ruby Kaigi 2016

 

Outline

 

Audience Self-Intro

 

Audience Self-Intro

 

Audience Self-Intro

 

Audience Self-Intro

 

Audience Self-Intro

 

Speaker Self-Intro

 

Ruby

Ruby - A Programmer's Best Friend

 

Basic Ruby

5.times { puts "Hello Ruby!" }

 

Conventions Used

Code is mostly green, monospace
puts 'Hello Ruby!'

Variable parts are orange
puts "some string"

Results are indicated with "⇒"
1 + 12

Matching of regular expressions is indicated with ≟, ≅, or ≇
'abc'/a/; 'def'/a/

Matching substrings are indicated with orage background

'abcba'/b/

This slide shows you the conventions we'll use throughout the talk. (Code is in green monospace font. Variable parts are in orange. Where necessary, the character encoding of a string is indicated with a blue subsccript. Results are show after a double arrow. We'll use two symbols to show when regular expressions match or don't match.)

 

Our Contributions to Ruby

Mainly in the following areas:

As a Ruby committer, I'm mostly responsible for the areas listed on this slide.

 

Ruby Versions and Unicode Versions

Year (y) Ruby version (VRuby) Supported Unicode version (VUnicode)
published around Christmas published in Summer
2014 2.2 7.0.0
2015 2.3 8.0.0
2016 2.4 9.0.0
2017 2.5 10.0.0
2018 (planned) 2.6 11.0.0

NOTE about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Update to new Unicode versions therefore only happens for new Ruby versions.

RbConfig::CONFIG["UNICODE_VERSION"]'10.0.0'

VUnicode = y - 2007; VRuby = (y - 1992) · 0.1

VRuby = 1.5 + VUnicode · 0.1      VUnicode = VRuby · 10 - 15

Don't extrapolate too far!

 

Georgian Mtavruli

 

Outline

 

Unicode is ...

 

Unicode is ...

 

Unicode is ...

 

Unicode is ...

 

Examples of Properties and Algorithms

For a list of properties, see Unicode Character Database, UAX #44, in particular Table 9

For some background on Unicode properties, see The Unicode Property Model, UTR #23 (caution: very dry reading)

 

Properties and Regular Expressions

See also Unicode Regular Expressions, UTS #18

 

Properties in Regular Expressions

 

How to Think about Properties

 

How to Organize Property Data

Properties
Category Bidi Name ...
Cha- 1
rac- 2
ters a
b
c
...

⇒ All properties represented in a big table

 

How Big is this Table?

 

How Big is this Table?

 

Any Better Ideas?

Properties of properties that we may be able to exploit:

 

Exploiting Property Values

Properties
P1 P2 P3 ...
Cha- 1 T T F
rac- 2 T T T
ters a F T F
b F T T
c F T F
T F T
...

 

Exploiting Sparsity

Properties
P1 P2 P3 ...
Cha- 1
rac- 2
ters a Z
b
c
X
...

 

Exploiting Sparsity: Example 1

 

Exploiting Sparsity: Example 2

 

Exploiting Dependence

Properties
P1 P2 P3 ...
Cha- 1 0 2
rac- 2 99 101
ters a 5 7
b 44 46
c 7 9
20 22
...

 

Exploiting Regularity

Properties
P1 P2 P3 ...
Cha- 1 1
rac- 2 2
ters 3 3
4 4
5 5
...

 

Exploiting Continuity

Properties
P1 P2 P3 ...
Cha- 1 x 7
rac- 2 x 7
ters a y 7
b y 7
c y 7
z 9
...

 

Exploiting Equivalence

Properties
P1 P2 P3 ...
Cha- 1 1 X Q
rac- 2 2 Y P
ters a 1 X Q
b 3 Z P
c 2 Y P
3 Z P
...

 

Any Better Ideas? 

How each exploit works:

 

Outline

 

Improving Unicode Property Support in Ruby

 

Importance of Unicode Properties in Ruby

 

Ruby Regular Expression Engine

 

Properties in Regular Expressions

In regular expressions, we can check for character properties with backslash-p. We can use this also inside character classes, where logical operations are also available.

 

Currently Supported Properties

(as of Unicode 9.0.0)

Due to Onigmo, Ruby currently has very good support for Boolean properties. But the support for enumerated properties is not up to par. As we will see soon, the reason for this lies in the details of the Onigmo implemenation.

 

Current Implementation: Data for Boolean Property

From enc/unicode/10.0.0/name2ctype.h

/* 'Radical': Binary Property */
static const OnigCodePoint CR_Radical[] = {
        3,                /* number of intervals */
        0x2e80, 0x2e99,   /* ⺀ ~ ⺙ */
        0x2e9b, 0x2ef3,   /* ⺛ ~ ⺀ */
        0x2f00, 0x2fd5,   /* ⼀ ~ ⺀ */
}; /* CR_Radical */

This is the data used for implementing the "Radical" property, which indicates whether a character is a Kanji radical or not. The values are the boundaries of intervals. Inside the boundaries, the property is true, outside, it is false.

 

Current Implementation: Inversion Lists

The data values are the boundaries of the intervals. At the boundaries, the value of the property is inverted. This kind of data structure is therefore called "Inversion List".

 

Current Implementation: Enumerated Properties


Inversion lists work well for binary properties. But they cannot be used directly for enumerated properties. For enumerated properties, an inversion list has to be provided for each property value. As an example, the Script Property needs 138 inversion lists.

 

Additional Example: Data for Katakana

/* 'Katakana': Script */
static const OnigCodePoint CR_Katakana[] = {
        8,                /* number of intervals */
        0x30a1, 0x30fa,   /* ァ ~ ヺ */
        0x30fd, 0x30ff,   /* ヽ ~ ヿ */
        0x31f0, 0x31ff,   /* ㇰ ~ ㇿ */
        0x32d0, 0x32fe,   /* ㋐ ~ ㋾ */
        0x3300, 0x3357,   /* ㌀ ~ ㍗ */
        0xff66, 0xff6f,   /* ヲ ~ ッ */
        0xff71, 0xff9d,   /* ア ~ ン */
        0x1b000, 0x1b000, /* KATAKANA LETTER ARCHAIC E */
}; /* CR_Katakana */

Here's another example, the inversion list for the property value "Katakana" of the property "Script".

 

Current Implementation: Property Check

From regcomp.c, function onig_is_in_code_range:

  for (low = 0, high = n; low < high; ) {
    x = (low + high) >> 1;
    if (code > data[x * 2 + 1])
      low = x + 1;
    else
      high = x;
  }
  return ((low < n && code >= data[low * 2]) ? 1 : 0);

Binary search!

Execution time dependent on length of inversion list

As you can see from this code, properties are checked by using binary search on an inversion list. The execution time depends on the length of the inversion list.

 

Current Implementation: Data Size

This slide shows the calculations for the total memory needed by the current implementation. For Unicode Version 9, there are 535 inversion lists, about 30'000 intervals. This results in a total memory usage of about 250KB.

 

File Size

enc/unicode/10.0.0/name2ctype.h is 38'088 lines, 767'874 bytes

Trying to download from https://svn.ruby-lang.org/:


An Exception Has Occurred

Display of files larger than 512 KB disallowed by configuration

HTTP Response Status

403 Forbidden

Fortunately, checkout with svn (or github) works

One more way to show how large the file is.

 

New Approach: Equivalence Classes

See Feature #13240

So can we improve on the current implementation? We'll try with the following approach. First, in our big spreadsheet, we look not at columns, but rows. We find that many rows are exactly identical. We group such rows/characters into equivalence classes. In Unicode Version 9, there are 3747 equivalence classes.

 

New Approach: Property Representation

We represent each property with a number of bits. Boolean propertios need only one bit. The Script property, which has 137 varues, needs 8 bits.

 

New Approach: Property Check

  /* get Property Value data */
  propVal = propValInfo[ctype];
  /* get bit pattern for equivalence class */
  return (propertyData[equivClass]
                      /* select word */
                      [propVal.word]
          /* extract property value bit pattern */
& propVal.mask /* check for matching property value pattern */ ) == propVal.value ? 1 : 0;

Constant-time lookup!

The processing needed to check for a property value is slightly complicated, but can be done in constant time.

 

Property Check Example

    /* data for equivalence class of Hiragana 「こ」 */
    10000111001000000011000010100010

    /* mask for Script property */
&   00000000111111110000000000000000

    /* bit pattern for property value Hiragana */
=   00000000001000000000000000000000

    /* result is same as bit pattern ⇒「こ」 is Hiragana */
==  00000000001000000000000000000000

This is an example with actual data. We are checking whether the Hiragana character "ko" is a Hiragana or not.

 

New Approach: Data Size (first try)

In this new implementation, we need to know the equivalence class of each code point. That still needs too much memory. We need another improvement.

 

New Approach: Two-Step Equivalence Class Lookup

By using a two-step lookup for equivalance classes, we can save a lot of memory.

 

New Approach: Data Size (second try)

As a result, we happen to get by with 213KB of data.

 

Performance Experiments

This is the setup for our performance experiments.

 

Processing Speedsparsity

Graph of processing speed and number of intervals.

As you can see, in the current implementation, processing gets slower if the number of intervals in the inversion list increases. Consistent with binary lookup, the increase is proportional to the logarithm of the number of intervals. In the new implementation, for every property value, we achieve about the same low time as for the property value with the lowest time in the current implemenation.

 

Property Discrimination

See Feature#13241

With regular expressions, you can check characters for specific property values. However, it's not possible to ask what property value for a property that a character has. With an implementation based on eqivalence classes, providing this functionality becomes easy.

 

Old versus New

Old (current) New
Properties Supported Boolean Properties 55 56
Enumerated Properties 7 20
Property Values 554 1009
Memory Usage (bytes) 252'284 213'908
Lookup Speedup depending on list length up to 9 times faster
Discrimination Speed very slow very fast

This is a comparison of the old and the new implementation.

 

Slide from Ruby Kaigi 2016

Comming back to the slide from Ruby Kaigi 2016, we see that most of our predictions were realized.

 

Future Work

The new implementation also has some problems, but we are continuing research.

 

Acknowledgments

I want to deeply thank the many people who have contributed to this research and this preparation.

 

Summary

 

Q & A

For more questions and comments, please contact me
(mailto:duerst@it.aoyama.ac.jp)



The latest version of this presentation is available at:

http://www.sw.it.aoyama.ac.jp/2018/pub/IUC42_properties/