Implementing Normalization
in Pure Ruby
- the Fast and Easy Way -

http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm

IUC 37, Santa Clara, CA, U.S.A., 22 October 2013

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

Ruby Programming Language

© 2013 Martin J. Dürst, Aoyama Gakuin University

Abstract

Unicode normalization eliminates encoding variants for what is essentially the same character. We present a new efficient way to implement normalization in scripting languages such as Ruby. Our implementation takes advantage of two facts: First, that large percentages of many kinds of text do not need to be changed when normalizing, and second, that the number of different sequences that need to be normalized in a certain language or group of languages is always much lower than the number of theoretically possible sequences. We exploit each of these facts with built-in functionality of Ruby. Specially-crafted regular expressions capture sequences of characters in possible need for normalization, and a hash structure is used to look up their normalization. The hash structure starts empty and is updated by a default block of code whenever a new sequence of characters needing normalization is identified. As a result, no actual normalization code is executed once the hash structure incorporates the character sequences used in a particular language.

Performance tests with many different implementations and a wide range of languages (German, Vietnamese, Korean,...) with different normalization needs show that our method is one or more orders of magnitude faster than straightforward implementations in scripting languages, and in some cases close to the speed of implementations in compiled languages. Because the code is written in pure Ruby, it is very easily portable and configurable, and similars techniques may be useful for other internationalization operations such as case conversions.

About the Slides

The most up-to-date version of the slides with links is available at http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm.

These slides are created in HMTL, for projection with Opera (≤12.16 Windows/Mac/Linux, use F11). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

Main Points

Audience

Speaker Intro

Why Normalization

Fundamental for Unicode: Are two strings the same or not

Four Forms

Original NFD NFC NFKD NFKC
R + ̄ + ̣; Ṛ + ̄ R + ̣ + ̄ R + ̣ + ̄
A + ̊ Å A + ̊ Å
가; ᄀ + ᅡ ᄀ + ᅡ ᄀ + ᅡ
⁴₄④¾⑷⒋ ⁴₄④¾⑷⒋ ⁴₄④¾⑷⒋ 4443⁄4(4)4. 4443⁄4(4)4.

Algorithmic Definition

Works on characters one by one

What's needed for Normalization

⇒ Lots of calculations and lookups

⇒ Slow in scripting language

Ruby

Ruby - A Programmer's Best Friend

 

Why Pure Implementation

 

How to Speed up Scripts

Scripts work in two layers:

Powerful primitives

Scripting languages are inherently slow because they are interpreted. Very simply put, every line of the program is analysed again every time it is executed. The implementation of a scripting language usually consists of two parts. One is the interpretation engine for the scripting language commands. The other is the implementation of the powerful primitives.

The way to speed up programs in a scripting language is to use the powerful scripting primitives wherever possible. Let's look at an example.

A Script Example

Problem: Replace all occurrences of "&lt;" in a string by "<"

Low-level solution:

Solution (in a scripting language such as Ruby):

This shows an example that we will be able to use later.

The problem is unescaping for XML/HTML. Adding "&gt;", "&amp;", "&apos;", and "&quot;" into the mix, the problem is solved by the better students in my sophomore programming class in around 100 lines of C code. Such a solution can be transferred quite mechanically to Ruby, too, it would be very slow, apart from being very unwieldy.

The real solution in Ruby (or any similar scripting language) is trivial. It uses the gsub operation, which globally searches for a string in another string and replaces it with a third string.

 

Normalization: Analyzing the Problem

Now we come to the basic idea underlying our implementation. This idea is based on an analysis of the actual data patterns occurring in text normalization. The number of combinations of characters that may be involved in normalization is potentially unlimited. Each of these character combinations has to be dealt with, and it's impossible to predict which ones will appear in the data. Any kind of optimization or shortcut seems to be futile.

However, the situation is not that bleak. Texts usually consists of only a single natural language, maybe a few natural languages in some cases. And each natural language has only a limited number of characters or character combinations affected by normalization. Even in large text corpora, the number of character combinations affected by normalization is bound to be limited by the orthographic needs of the languages involved. On the other hand, the same normalization pattern may appear repeatedly.

This is the core idea for our simple and efficient implementation of normalization.

Example: German to NFC

Letters in core German affected by normalization:

Ä, ä, Ö, ö, Ü, ü

Converting German to NFC:

string.gsub! "A\u0308", "Ä"
string.gsub! "a\u0308", "ä"
...

To explain the idea with a simplistic example, let's take a look at German. Core German (not including loanwords) only has six letters that are affected by normalization. Normalization to NFC could be done simply by replacing all occurrences of "A" followed by a combining diaeresis (U+0308) with "Ä", and so on for the other umlaut characters.

Of course, this approach is not directly usable. Even for German, the string would have to be scanned six times, which is way slower than just scanning it one time, and the solution would have to be hand-tailored to different languages and language combinations.

But we can make things more automatic using more of Ruby's primitives.

 

Regular Expressions

Normalization Runs

Normalization Run Examples

Diacritics: R + ̄ + ̣; Ṛ + ̄ ; Singleton: ; CJK Compat: ; Hangul: ; Compat:

Replacement

 

Building the Hash

 

Code Skeleton

NFC_HASH = Hash.new do |h, run| # create hash for NFC
  h[run] = nfc_run run          # add nfc for run to hash
end

class String
  def nfc(string)               # add nfc to String class
    string.gsub NFC_REGEXP, NFC_HASH
  end
end

 

Preprocessing

 

Actual Normalization

Based on performance tests

 

Ruby Specifics

Security: DOS Attacks

 

Publication

Testing for Correctness

 

Performance Experiments

 

Results

(German, more at GitHub, milliseconds)

eprun unicode_utils (1.4.0) twitter_cldr (2.4.2)
NFD 12.48  75.19  3884
NFC 15.60 191.73  4072
NFKD 18.56 118.71  8627
NFKC 22.00 215.75 10577

 

Cross-Language

(Perl v5.14.2 built for cygwin-thread-multi-64int, in milliseconds)

eprun Perl (native) Perl (pure)
NFD 12.48 47.7 683.4
NFC 15.60 30.3 253.3
NFKD 18.56 48.8 742.4
NFKC 22.00 36.0 259.0

 

Why so Fast

Gets faster if implementation (gsub, Regexp, Hash) get faster

 

Demo

Depending on the time available, we will show a demo of testing, and/or some performance demos.

Conclusions

 

Q&A

Future Work

Acknowledgments

Ayumu NOJIMA for parts of implementation, performance measurement, and testing.

Cameron Dutro for interesting discussions and encouragement to publish the code.

The IME Pad for facilitating character input.

Amaya and Web technology for slide editing and display.