Design Considerations for
Internationalization in Ruby 2.2

http://www.sw.it.aoyama.ac.jp/2014/pub/RubyI18N/

IUC 38, Santa Clara, CA, U.S.A., 4 November 2014

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

Ruby Programming Language

© 2014 Martin J. Dürst, Aoyama Gakuin University

Abstract

Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses how to add internationalization functionality such as Unicode normalization, case conversion, and number formatting to Ruby in a true Ruby way.

Since Ruby 1.9, Ruby has a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often used with UTF-8.

Support for internationalization facilities beyond character encoding is available in various external libraries, but not yet in the Ruby core. As a result, libraries and applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. An internationalization library may add separate functions for case conversion for the whole range of Unicode characters.

We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way.

This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

[These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.]

 

Introductions

 

Overview

 

Conventions Used

(Ruby) code is green, monospace
puts "Hello Ruby!"

Variable parts are orange
puts "some string"

Encoding is indicated with a subscript

Results are indicated with "⇒": 1 + 1 ⇒ 2

 

Ruby

Ruby - A Programmer's Best Friend

 

Ruby Implementations

 

Basic Ruby

5.times { puts "Hello Ruby!" }

 

Ruby Library Hierarchy

 

Object-Orientation

 

Current I18N Support

(since Ruby 1.9)

  1. Strings are sequences of characters (codepoints)
    "Юに코δ".length ⇒ 4
  2. Strings tagged with encoding
    "Юに코δ"UTF-8.encoding ⇒ UTF-8
  3. Encoding conversion
    "abcδ".encode 'SJIS' ⇒ "abcδ"SJIS
  4. Regular expressions:
    "Юに코δ".match /δ/
  5. Source files are UTF-8 by default
    (since Ruby 2.0, US-ASCII before)

Point 2 differs fundamentally from most other programming languages, which in general use just a single encoding internally (Java, JavaScript: UTF-16; Python: 8-bit/UTF-16/UTF-32; Perl: 8-bit/UTF-8; C/C++: mb/wc).
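The points listed on this slide can be tried out directly. The following is a minimal sketch, assuming a UTF-8 source file; the method name encoding_tour is made up for this illustration, and the sample strings are the ones used on the slide.

```ruby
# Minimal sketch; encoding_tour is a hypothetical helper, not part of Ruby.
def encoding_tour
  text = "Юに코δ"                 # tagged as UTF-8 (default source encoding)
  sjis = "abcδ".encode("SJIS")   # transcoded copy; the receiver is unchanged
  {
    length:        text.length,     # counted in characters
    bytesize:      text.bytesize,   # counted in bytes
    encoding:      text.encoding,   # every string carries its encoding
    sjis_encoding: sjis.encoding,
    sjis_bytesize: sjis.bytesize,   # "abc" takes 3 bytes, "δ" takes 2
    round_trip:    sjis.encode("UTF-8") == "abcδ",
    match:         text.match(/δ/)[0]
  }
end

encoding_tour.each { |key, value| puts "#{key}: #{value}" }
```

Comparing length with bytesize shows that strings are handled as characters rather than bytes, and the two encoding entries show that each string object carries its own encoding tag.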

Important Encodings

In Ruby, all encodings are equal, but some encodings are more equal than others:

[Adapted from George Orwell's Animal Farm.]

 

Missing I18N Support

 

External Libraries

Most I18N support is currently available via various libraries:

 

Design Considerations

 

Case Studies

 

Case Study: Normalization

Unicode Normalization Forms NFC, NFD, NFKC, NFKD

Defined in UAX #15 (see also yesterday's tutorial)

Example: NFC of R +  ̣ (U+0323) +  ̄ (U+0304) is Ṝ (U+1E5C)
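This example can be checked with String#unicode_normalize and String#unicode_normalized?, the methods this presentation goes on to propose (available in Ruby 2.2 and later). A minimal sketch; the helper name normalization_demo is made up for this illustration, and the input is written with escapes so that the combining characters stay visible.

```ruby
# Minimal sketch; normalization_demo is a hypothetical helper.
def normalization_demo
  # R + COMBINING DOT BELOW (U+0323) + COMBINING MACRON (U+0304)
  decomposed = "R\u0323\u0304"
  composed   = decomposed.unicode_normalize(:nfc)
  {
    composed_codepoints: composed.codepoints.map { |cp| format("U+%04X", cp) },
    decomposed_length:   decomposed.length,
    composed_length:     composed.length,
    already_nfc:         decomposed.unicode_normalized?(:nfc),
    back_to_nfd:         composed.unicode_normalize(:nfd) == decomposed
  }
end

normalization_demo.each { |key, value| puts "#{key}: #{value.inspect}" }
```

The three-codepoint input composes into a single precomposed codepoint, and normalizing that result to NFD gives back the original sequence.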

 

Normalization in Other Languages

 

Normalization in Ruby Libraries

 

Observations

We can do better!

 

I18N Functionality Where It Belongs

 

Monkey Patching

⇒ Libraries are reluctant to patch core classes

⇒ Adding functionality to language core

 

Memory Footprint

⇒ Dynamic loading

 

Dynamic Loading to the Rescue (0)

Starting point: Add method to class String

class String
  def unicode_normalize(form)
    # implementation goes here
  end
end

 

Dynamic Loading to the Rescue (1)

Step 1: Separate invocation and implementation

class String
  def unicode_normalize(form)
    UnicodeNormalize.normalize(self, form)
  end
end

Implementation in separate module String::UnicodeNormalize (including internal methods)

 

Dynamic Loading to the Rescue (2)

Step 2: Conditionally require implementation (only on first call)

class String #reopen class String to add new methods
  def unicode_normalize(form = :nfc)
    unless defined? UnicodeNormalize
      require 'unicode_normalize/normalize.rb'
    end
    UnicodeNormalize.normalize(self, form)
  end
end
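The guard in Step 2 relies on two pieces of Ruby behavior: defined? returns nil for a constant that has not been loaded yet, and require loads a file only once. The following sketch demonstrates both with a throwaway file; the module LazyImplDemo, the file name, and the helper lazy_require_trace are assumptions made for this illustration only.

```ruby
require "tmpdir"

# Minimal sketch: writes a throwaway implementation file, then loads it
# lazily. LazyImplDemo and lazy_impl_demo.rb are hypothetical names.
def lazy_require_trace
  Dir.mktmpdir do |dir|
    path = File.join(dir, "lazy_impl_demo.rb")
    File.write(path, "module LazyImplDemo\n  def self.ready?\n    true\n  end\nend\n")
    before = defined?(LazyImplDemo)   # nothing loaded yet
    first  = require(path)            # loads the file
    second = require(path)            # file already loaded, nothing happens
    after  = defined?(LazyImplDemo)   # the constant exists now
    [before, first, second, after]
  end
end

LAZY_REQUIRE_TRACE = lazy_require_trace
p LAZY_REQUIRE_TRACE
```

Because require is idempotent, the defined? check is not needed for correctness; it only avoids the cost of asking require on every call.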

 

Dynamic Loading to the Rescue (3)

Speed optimization: Dynamically redefine method without require

class String #reopen class String to add new methods
  def unicode_normalize(form = :nfc)
    unless defined? UnicodeNormalize
      require 'unicode_normalize/normalize.rb'
    end
    String.send(:define_method,
      :unicode_normalize, ->(form = :nfc) do
        UnicodeNormalize.normalize(self, form)
      end
    )
    UnicodeNormalize.normalize(self, form)
  end
end

We use metaprogramming (send, define_method) to redefine the method from inside its own body, so that later calls skip the defined? check.
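The self-replacing pattern can be observed in isolation. In the sketch below, all names (SlowStart, Announcer, load!, shout) are hypothetical stand-ins: SlowStart.load! plays the role of require and counts how often the slow path runs, and a small class is used instead of String to keep the example from touching core classes.

```ruby
# Hypothetical stand-in for the separately loaded implementation module.
module SlowStart
  @loads = 0
  class << self
    attr_reader :loads

    def load!              # plays the role of `require`
      @loads += 1
    end

    def shout(str)
      str.upcase + "!"
    end
  end
end

class Announcer
  attr_reader :text

  def initialize(text)
    @text = text
  end

  def shout
    SlowStart.load!        # slow path, taken on the first call only
    # Replace this method with a version that goes straight to the
    # implementation; the lambda runs with self set to the receiver.
    Announcer.send(:define_method, :shout, -> { SlowStart.shout(text) })
    SlowStart.shout(text)
  end
end

first = Announcer.new("hello")
puts first.shout
puts first.shout
puts Announcer.new("ruby").shout
puts SlowStart.loads
```

The replacement is installed on the class, not on a single object, so every instance benefits after the first call on any instance.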

 

Three New Methods

 

Why the unicode_ prefix?

 

Symbols for Normalization Forms

 

Pure Ruby vs. Extension

 

Efficient Pure Ruby

 

Normalization in Various Languages

Ruby is clearly shorter. But this is not shortness for its own sake; it just feels better. Ruby is a programmer's best friend!

 

Case Study: Casing

 

Casing in Ruby Now

 

Widening Method Interface

In languages with strict typing (C++, Java, ...), widening a method's interface can be achieved by method overloading.
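In Ruby, the same widening can be done with optional arguments, so that existing calls keep their meaning. The following is only a sketch of that idea: the module CasingSketch, its option names :unicode and :turkic, and its two-entry mapping table are assumptions made for this illustration, not the interface or data of any Ruby release.

```ruby
# Hypothetical module, not part of Ruby: it only illustrates how optional
# arguments widen an interface without changing existing calls.
module CasingSketch
  # Tiny illustrative table; a real implementation would use Unicode data.
  EXTRA_UPPER = {
    "\u00E9" => "\u00C9",  # é => É
    "\u03B4" => "\u0394"   # δ => Δ
  }.freeze
  KNOWN_OPTIONS = [:unicode, :turkic].freeze

  def self.upcase(str, *options)
    unknown = options - KNOWN_OPTIONS
    raise ArgumentError, "unknown option(s): #{unknown.inspect}" unless unknown.empty?

    str.each_char.map do |ch|
      if ch == "i" && options.include?(:turkic)
        "\u0130"                                  # İ, capital I with dot above
      elsif options.empty?
        ch.tr("a-z", "A-Z")                       # no options: ASCII only
      else
        EXTRA_UPPER.fetch(ch) { ch.tr("a-z", "A-Z") }
      end
    end.join
  end
end

puts CasingSketch.upcase("r\u00E9sum\u00E9")
puts CasingSketch.upcase("r\u00E9sum\u00E9", :unicode)
puts CasingSketch.upcase("iki", :turkic)
```

A call without options behaves like the ASCII-only method, while the extra arguments opt in to wider or language-dependent behavior.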

 

Implementation Reuse

 

One Encoding or All

 

Default: To Unicode or not to Unicode

 

Case Study: Properties

 

Case Study: Iterators

 

Existing String Iterators

Strings can be cut up, and iterated over, in four ways:

Lines: "Юに코δ".lines.to_a
⇒ ["Юに코δ"]

Characters: "Юに코δ".chars.to_a
⇒ ["Ю", "に", "코", "δ"]

Codepoints: "Юに코δ".codepoints.to_a
⇒ [1070, 12395, 53076, 948]

Bytes: "Юに코δ".bytes.to_a
⇒ [208, 174, 227, 129, 171, 236, 189, 148, 206, 180]

As the result of chars shows, Ruby treats characters as Strings of length 1. They were integers up to Ruby 1.8. Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints.

each_line, each_char, each_codepoint, and each_byte are older names for lines, chars, codepoints, and bytes. In Ruby 1.9, lines, chars, codepoints, and bytes returned Enumerators; since Ruby 2.0 they return Arrays, and the each_ variants return Enumerators when called without a block. Here we just use to_a to produce arrays. Enumerators can be used directly for iteration with each, or combined with other iterators: map for mapping, select/reject for selection.

The need for several enumerators on a single object, resulting from the change of string/character representation between Ruby 1.8 and 1.9, was one of the main motivators for introducing enumerators into Ruby. This is an interesting example of how internationalization concerns can affect more 'fundamental' language features. The orthogonality resulting from separating what to enumerate over and how to iterate results in very expressive code.
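The separation of what to enumerate over from how to iterate can be sketched as follows. The helper name enumerator_tour is made up for this illustration; each_char, each_codepoint, and each_byte are called without a block here, so that each returns an Enumerator.

```ruby
# Minimal sketch; enumerator_tour is a hypothetical helper.
def enumerator_tour(text = "Юに코δ")
  chars = text.each_char          # no block given: an Enumerator
  {
    enumerator_class: chars.class,
    # "what": characters; "how": map together with an index
    indexed:    text.each_char.with_index.map { |ch, i| "#{i}:#{ch}" },
    # "what": codepoints; "how": select those at or above U+3000
    high:       text.each_codepoint.select { |cp| cp >= 0x3000 }
                    .map { |cp| cp.chr("UTF-8") },
    # "what": bytes; "how": just count them
    byte_count: text.each_byte.count,
    # Enumerators also allow external iteration, one element at a time
    first_two:  [chars.next, chars.next]
  }
end

enumerator_tour.each { |key, value| puts "#{key}: #{value.inspect}" }
```

The same iteration methods (map, select, count, next) work unchanged whether the string is cut up into characters, codepoints, or bytes.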

 

 

Implementation Status

[Image: Christmas tree]

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.

 

Conclusions

 

Q & A

Send questions and comments to Martin Dürst, duerst@it.aoyama.ac.jp

The latest version of this presentation is available at:

http://www.sw.it.aoyama.ac.jp/2014/pub/RubyI18N/

Acknowledgments

Ayumu Nojima (野島 歩) and Kimihito Matsui (松井 仁人) for help with research and implementations. Yui Naruse (成瀬 ゆい), Nobu Nakada (中田 伸悦) and many other Ruby committers for help and support. Matz (まつもと ゆきひろ) for Ruby, a programmer's best friend.

The IME Pad for facilitating character input.

Image credits: Christmas tree: Frits Ahlefeldt-Laurvig