Design Considerations for
Internationalization in Ruby 2.2

IUC 38, Santa Clara, CA, U.S.A., 4 November 2014

Martin J. DÜRST

Aoyama Gakuin University

Ruby Programming Language

© 2014 Martin J. Dürst, Aoyama Gakuin University


Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses how to add internationalization functionality such as Unicode normalization, case conversion, and number formatting to Ruby in a true Ruby way.

Since Ruby 1.9, Ruby has had a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often used with UTF-8.

Support for internationalization facilities beyond character encoding is available in various external libraries, but not yet in the Ruby core. As a result, libraries and applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. An internationalization library may add separate functions for case conversion for the whole range of Unicode characters.

We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way.

This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

[These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.]






Conventions Used

(Ruby) code is green, monospace
puts "Hello Ruby!"

Variable parts are orange
puts "some string"

Encoding is indicated with a subscript

Results are indicated with "⇒": 1 + 1 ⇒ 2



Ruby - A Programmer's Best Friend


Ruby Implementations


Basic Ruby

5.times { puts "Hello Ruby!" }


Ruby Library Hierarchy




Current I18N Support

(since Ruby 1.9)

  1. Strings are sequences of characters (codepoints)
  2. Strings tagged with encoding
    "Юに코δ"UTF-8.encoding ⇒ UTF-8
  3. Encoding conversion
    "abcδ".encode 'SJIS' ⇒ "abcδ"SJIS
  4. Regular expressions:
    "Юに코δ".match /δ/
  5. Source files are UTF-8 by default
    (since Ruby 2.0, US-ASCII before)

Point 2 is fundamentally different from other programming languages, which in general use just a single encoding internally (Java, JavaScript: UTF-16; Python: 8-bit/UTF-16/UTF-32; Perl: 8-bit/UTF-8; C/C++: mb/wc)
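The points above can be tried out directly. The following is a minimal sketch, assuming a Ruby 2.0 or later interpreter with UTF-8 source files; the variable names are chosen only for illustration.

```ruby
s = "Юに코δ"

# Point 2: every String carries its encoding as a tag.
utf8_tag = s.encoding                 # Encoding::UTF_8 for a UTF-8 source file

# Point 3: encode returns a new String, tagged with the target encoding.
# Note that Ruby resolves the name 'SJIS' through Encoding.find, so the
# reported encoding name may be that of an alias.
sjis = "abcδ".encode('SJIS')
back = sjis.encode('UTF-8')           # round trip back to UTF-8

# Point 4: regular expressions work on characters, not bytes.
position = (s =~ /δ/)                 # character index of the match

puts utf8_tag, sjis.encoding, back, position
```

The same characters are represented by different bytes before and after conversion, while the round trip restores the original string.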

Important Encodings

In Ruby, all encodings are equal, but some encodings are more equal than others:

[Adapted from George Orwell's Animal Farm.]


Missing I18N Support


External Libraries

Most I18N support is currently available via various libraries:


Design Considerations


Case Studies


Case Study: Normalization

Unicode Normalization Forms NFC, NFD, NFKC, NFKD

Defined in UAX #15, yesterday's tutorial

Example: NFC of R (U+0052) + combining dot below (U+0323) + combining macron (U+0304) is Ṝ (U+1E5C)
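The example can be checked in code. This is a minimal sketch, assuming Ruby 2.2 or later, where the String#unicode_normalize and String#unicode_normalized? methods are available; the variable names are only for illustration.

```ruby
# R followed by combining dot below (U+0323) and combining macron (U+0304)
decomposed = "R\u0323\u0304"

# NFC composes the sequence into the single codepoint U+1E5C.
composed = decomposed.unicode_normalize(:nfc)

# NFD goes the other way and restores the three codepoints.
redecomposed = composed.unicode_normalize(:nfd)

# The compatibility forms additionally fold characters such as the
# "fi" ligature (U+FB01) into their plain equivalents.
folded = "\uFB01".unicode_normalize(:nfkc)

puts composed.codepoints.inspect
puts redecomposed.codepoints.inspect
puts decomposed.unicode_normalized?(:nfc)
puts folded
```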


Normalization in Other Languages


Normalization in Ruby Libraries



We can do better!


I18N Functionality Where It Belongs


Monkey Patching

⇒ Libraries are reluctant to patch core classes

⇒ Adding functionality to language core


Memory Footprint

⇒ Dynamic loading


Dynamic Loading to the Rescue (0)

Starting point: Add method to class String

class String
  def unicode_normalize(form)
    # implementation goes here
  end
end


Dynamic Loading to the Rescue (1)

Step 1: Separate invocation and implementation

class String
  def unicode_normalize(form)
    UnicodeNormalize.normalize(self, form)
  end
end

Implementation in separate module String::UnicodeNormalize (including internal methods)


Dynamic Loading to the Rescue (2)

Step 2: Conditionally require implementation (only on first call)

class String # reopen class String to add new methods
  def unicode_normalize(form = :nfc)
    unless defined? UnicodeNormalize
      require 'unicode_normalize/normalize.rb'
    end
    UnicodeNormalize.normalize(self, form)
  end
end


Dynamic Loading to the Rescue (3)

Speed optimization: Dynamically redefine method without require

class String # reopen class String to add new methods
  def unicode_normalize(form = :nfc)
    unless defined? UnicodeNormalize
      require 'unicode_normalize/normalize.rb'
      # replace this method with a version that skips the defined? check
      String.send :define_method, :unicode_normalize, ->(form = :nfc) do
        UnicodeNormalize.normalize(self, form)
      end
    end
    UnicodeNormalize.normalize(self, form)
  end
end

We have to use metaprogramming (send, define_method) to redefine the method because Ruby doesn't allow us to directly define a method inside a method.
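The load-once-then-redefine pattern can be tried without the actual implementation file. The following is a self-contained sketch under stated assumptions: DemoLoader, DemoImplementation, and demo_normalize are hypothetical names, and DemoLoader.load_implementation stands in for the require call. The stand-in implementation simply delegates to the unicode_normalize method of Ruby 2.2 or later.

```ruby
# Hypothetical stand-in for require 'unicode_normalize/normalize.rb':
# defines the implementation module on demand and counts how often it runs.
module DemoLoader
  @loads = 0

  class << self
    attr_reader :loads

    def load_implementation
      @loads += 1
      impl = Module.new
      impl.define_singleton_method(:normalize) do |string, form|
        string.unicode_normalize(form)   # assumes Ruby 2.2 or later
      end
      Object.const_set(:DemoImplementation, impl)
    end
  end
end

class String
  # First call: load the implementation, then replace this method with a
  # version that goes straight to the implementation.
  def demo_normalize(form = :nfc)
    unless defined? DemoImplementation
      DemoLoader.load_implementation
      String.send(:define_method, :demo_normalize, ->(form = :nfc) do
        DemoImplementation.normalize(self, form)
      end)
    end
    DemoImplementation.normalize(self, form)
  end
end

first  = "R\u0323\u0304".demo_normalize        # triggers the load
second = "R\u0323\u0304".demo_normalize(:nfd)  # uses the redefined method
puts first, second, DemoLoader.loads
```

Programs that never call the method pay no memory cost for the implementation, and programs that call it often skip the defined? check after the first call.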


Three New Methods


Why the unicode_ prefix?


Symbols for Normalization Forms


Pure Ruby vs. Extension


Efficient Pure Ruby


Normalization in Various Languages

Ruby clearly is shorter. But it's not shortness for shortness' sake, it just feels better. Ruby is a programmer's best friend!


Case Study: Casing


Casing in Ruby Now


Widening Method Interface

In languages with strict typing (C++, Java,...), this can be achieved by method overloading.
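Ruby has no method overloading, but a single method with optional arguments can serve the same purpose. The following is a rough sketch only: the method name demo_upcase and the option symbols :ascii and :turkic are hypothetical choices for illustration, not the interface proposed in this presentation, and the Turkic handling covers just the two letters i and ı.

```ruby
class String
  # Hypothetical widened interface: the same method name, with optional
  # symbols selecting the behavior.
  def demo_upcase(*options)
    if options.include?(:ascii)
      # Explicitly restricted to ASCII letters.
      tr('a-z', 'A-Z')
    elsif options.include?(:turkic)
      # Language-dependent: dotted i maps to İ, dotless ı maps to I.
      tr('iı', 'İI').upcase
    else
      # No options: whatever the built-in method does.
      upcase
    end
  end
end

puts "Résumé".demo_upcase(:ascii)
puts "kitap".demo_upcase(:turkic)
puts "ruby".demo_upcase
```

Existing callers that pass no arguments are unaffected, which is what makes this kind of widening backwards compatible.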


Implementation Reuse


One Encoding or All


Default: To Unicode or not to Unicode


Case Study: Properties


Case Study: Iterators


Existing String Iterators

Strings can be cut up, and iterated over, in four ways:

Lines: "Юに코δ".lines.to_a
["Юに코δ"]

Characters: "Юに코δ".chars.to_a
["Ю", "に", "코", "δ"]

Codepoints: "Юに코δ".codepoints.to_a
[1070, 12395, 53076, 948]

Bytes: "Юに코δ".bytes.to_a
[208, 174, 227, 129, 171, 236, 189, 148, 206, 180]

As the result of chars shows, Ruby treats characters as Strings of length 1. They were integers up to Ruby 1.8. Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints.

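each_line, each_char, each_codepoint, and each_byte are older names for lines, chars, codepoints, and bytes. In Ruby 1.9, the methods lines, chars, codepoints, and bytes returned Enumerators; since Ruby 2.0 they return Arrays, while the each_ variants return Enumerators when called without a block. Here we just use to_a to produce arrays in either case. Enumerators can be used directly for iteration with each, or combined with other iterators, such as map for mapping and select/reject for selection.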

The need for several enumerators on a single object, resulting from the change of string/character representation between Ruby 1.8 and 1.9, was one of the main motivators for introducing enumerators into Ruby. This is an interesting example of how internationalization concerns can affect more 'fundamental' language features. The orthogonality resulting from separating what to enumerate over and how to iterate results in very expressive code.
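The separation between what to enumerate over and how to iterate can be sketched as follows. This assumes Ruby 1.9 or later with a UTF-8 source file; the variable names are only for illustration.

```ruby
s = "Юに코δ"

# What to enumerate over: called without a block, each of these
# methods returns an Enumerator.
chars      = s.each_char
codepoints = s.each_codepoint
bytes      = s.each_byte

# How to iterate: any Enumerable method can be chained onto any of them.
hex             = codepoints.map { |cp| format('U+%04X', cp) }
non_ascii_bytes = bytes.select { |b| b >= 0x80 }
indexed         = chars.with_index.map { |ch, i| "#{i}:#{ch}" }

puts hex.inspect
puts non_ascii_bytes.size
puts indexed.inspect
```

None of the iteration methods needs to know whether it is working on characters, codepoints, or bytes.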



Implementation Status

Christmas tree

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.




Q & A

Send questions and comments to Martin Dürst,

The latest version of this presentation is available at:


Ayumu Nojima (野島 歩) and Kimihito Matsui (松井 仁人) for help with research and implementations. Yui Naruse (成瀬 ゆい), Nobu Nakada (中田 伸悦) and many other Ruby committers for help and support. Matz (まつもと ゆきひろ) for Ruby, a programmer's best friend.

The IME Pad for facilitating character input.

Image credits: Christmas tree: Frits Ahlefeldt-Laurvig