Internationalization in Ruby 2.4

40th Internationalization and Unicode Conference

Santa Clara, California, U.S.A., November 3, 2016

Martin J. DÜRST

Aoyama Gakuin University

A Ruby Crystal, Symbol of the Ruby Programming Language

© 2016 Martin J. Dürst, Aoyama Gakuin University


Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.

Since Ruby 1.9, Ruby has a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.

Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-sensitive matching in regular expressions.

We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design and the implementation of the new functionality for Ruby 2.4 will be described.

This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

For Best Viewing

These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.







Ruby Basics



Ruby - A Programmer's Best Friend


Ruby Implementations

This tutorial is about MRI/C-Ruby, the reference implementation


Basic Ruby

3.times { puts 'Hello Ruby!' }

Hello Ruby!
Hello Ruby!
Hello Ruby!


Conventions Used in This Talk

Code is mostly green, monospace
puts 'Hello Ruby!'

Variable parts are orange
puts "some string"

Encoding is indicated with a subscript
'Юに코δ'UTF-8, 'ユニコード'SJIS

Results are indicated with "⇒"
1 + 1 ⇒ 2


Frequent Example Юに코δ


Up and Running


String Basics

Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
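
A minimal sketch of the note above, using the deck's example string. The constant name EXAMPLE is illustrative only. It shows that indexing a String yields another String, and that String and Array answer the same slicing call.

```ruby
# Indexing a String returns a String of length 1; there is no separate character class.
EXAMPLE = 'Юに코δ'

first = EXAMPLE[0]
puts first.class      # String
puts EXAMPLE.length   # counts characters, not bytes

# Duck typing: String and Array both respond to [start, length].
[EXAMPLE, EXAMPLE.chars].each do |sequence|
  p sequence[1, 2]
end
```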


Encoding Basics

Just use Unicode, just use UTF-8
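
A short sketch of what "just use UTF-8" looks like in practice, assuming a UTF-8 source file (the default since Ruby 2.0). The constant names are illustrative. Every String carries its own encoding, and transcoding is always explicit.

```ruby
# String literals in a UTF-8 source file are UTF-8 strings.
UTF8_TEXT = 'ユニコード'
puts UTF8_TEXT.encoding   # UTF-8
puts UTF8_TEXT.length     # number of characters
puts UTF8_TEXT.bytesize   # number of bytes (3 per character here)

# Transcoding creates a new String tagged with the target encoding.
SJIS_TEXT = UTF8_TEXT.encode('Shift_JIS')
puts SJIS_TEXT.encoding   # Shift_JIS
puts SJIS_TEXT.bytesize   # 2 bytes per character here

# Converting back gives the original text.
puts SJIS_TEXT.encode('UTF-8') == UTF8_TEXT
```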


Ruby Likes UTF-8


Ruby Versions


Ruby Versions and Unicode Versions

Year (y)   Ruby version (VRuby)          Unicode version (VUnicode)
           (published around Christmas)  (published in summer)
2014       2.2                           7.0.0
2015       2.3                           8.0.0
2016       2.4                           9.0.0

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.


VUnicode = y - 2007

VRuby = 1.5 + VUnicode · 0.1

VUnicode = VRuby · 10 - 15

Don't extrapolate too far!
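
The rule of thumb above can be checked against the table with a few lines of Ruby. The method names are made up for this sketch, and it only claims to hold for the three years listed.

```ruby
# VUnicode = y - 2007
def unicode_major_version(year)
  year - 2007
end

# VRuby = 1.5 + VUnicode * 0.1, written as one division to avoid floating-point noise.
def ruby_version(year)
  (15 + unicode_major_version(year)) / 10.0
end

[2014, 2015, 2016].each do |year|
  puts "#{year}: Ruby #{ruby_version(year)}, Unicode #{unicode_major_version(year)}.0.0"
end
```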


New in Ruby 2.4:

Non-ASCII Case Conversion

Case Conversion Functions in Ruby


Case Conversion in Ruby 2.3

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'


Case Conversions NOT in Ruby 2.3

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase

Case Conversion up to and including Ruby 2.3 is ASCII-only!


But in Ruby 2.4!
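
A quick way to see the difference, assuming Ruby 2.4 or later and a UTF-8 source file. The constant name TEXT is illustrative.

```ruby
# From Ruby 2.4 on, upcase and downcase also map non-ASCII letters.
TEXT = 'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'

upper = TEXT.upcase
puts upper

# Every letter in this sample has a simple one-to-one mapping,
# so lowercasing the result gives the lowercased original.
puts upper.downcase == TEXT.downcase
```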


Case Conversion Around the World


Case Distinction History


Modern Case Usage

(details vary by language)

der Gefangene floh - the prisoner fled, but
der gefangene Floh - the captive flea


Isn't ASCII-only Case Conversion Enough?


But: Backwards Compatibility?


Backwards Compatibility Problems


Backwards Compatibility: :ascii Option

Use if you find a case where you really don't want to convert non-ASCII characters

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
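
A small sketch contrasting the two behaviours side by side (Ruby 2.4 or later). The constant name WORD is illustrative.

```ruby
WORD = 'Résumé'

# Default from Ruby 2.4 on: full Unicode case mapping.
puts WORD.upcase

# With :ascii, only the letters a-z are converted, as in Ruby 2.3 and earlier.
puts WORD.upcase(:ascii)
```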


Implementation Choices

Use a library?

Integrate ICU?

Write new code?


Where to Get the Data From?

Data and other specifications available from the Unicode Consortium:





Special Cases: Not 1-to-1


Special Case: Simple Case Mapping

Not implemented!


Special Case: Turkic


Special Case: Lithuanian


Special Case: Case Folding


Special Case: Titlecase


More Special Cases




12 Methods to Implement

String (functional)   String (destructive)   Symbol
upcase                upcase!                upcase
downcase              downcase!              downcase
capitalize            capitalize!            capitalize
swapcase              swapcase!              swapcase

Not dealt with: String#casecmp
Why: Includes sorting
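
For orientation, a brief sketch of the three families from the table, assuming Ruby 2.4 or later. The constant name is illustrative.

```ruby
SAMPLE_WORD = 'résumé'

# Functional variants return a new String and leave the receiver alone.
puts SAMPLE_WORD.upcase
puts SAMPLE_WORD.capitalize
puts SAMPLE_WORD.swapcase

# Destructive variants modify the receiver; they return nil when nothing changed.
copy = SAMPLE_WORD.dup
copy.upcase!
puts copy
p copy.upcase!   # nil, because the string is already uppercase

# Symbols offer the same four methods (functional only).
p :résumé.upcase
```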


Internally, a Single Function

Flags to indicate operation needed
(in file include/ruby/oniguruma.h):

#define ONIGENC_CASE_UPCASE     (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE   (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE  (1<<15) /* titlecase mapping */

Usage to indicate operation type:

upcase (upcasing needed)

downcase (downcasing needed)

capitalize (changed to ONIGENC_CASE_DOWNCASE after first character)

swapcase (both upcasing and downcasing needed)
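
The idea of one routine steered by flags can be modelled in a few lines of Ruby. This is only a hypothetical model, not MRI's C code: the method name case_map_sketch is made up, it leans on the built-in per-character upcase and downcase (Ruby 2.4 or later), and it uses upcasing as a stand-in for true titlecase mapping. The flag values are copied from the slide.

```ruby
CASE_UPCASE    = 1 << 13
CASE_DOWNCASE  = 1 << 14
CASE_TITLECASE = 1 << 15

# Hypothetical single entry point: the flags decide which mapping is applied.
def case_map_sketch(str, flags)
  str.each_char.map do |ch|
    up = ch.upcase
    down = ch.downcase
    mapped =
      if (flags & CASE_UPCASE) != 0 && up != ch
        up
      elsif (flags & CASE_DOWNCASE) != 0 && down != ch
        down
      else
        ch
      end
    # capitalize: after the first character, switch to plain downcasing.
    flags = CASE_DOWNCASE if (flags & CASE_TITLECASE) != 0
    mapped
  end.join
end

puts case_map_sketch('résumé FOO', CASE_UPCASE)                   # like upcase
puts case_map_sketch('résumé FOO', CASE_DOWNCASE)                 # like downcase
puts case_map_sketch('résumé FOO', CASE_UPCASE | CASE_TITLECASE)  # like capitalize
puts case_map_sketch('résumé FOO', CASE_UPCASE | CASE_DOWNCASE)   # like swapcase
```

For plain letters like these, the four calls agree with the built-in methods. Characters with multi-character or titlecase mappings need the extra data discussed later in the talk.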


Option Handling

Flags also used for options:

Corresponding flags:

#define ONIGENC_CASE_FOLD                (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI  (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN     (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY          (1<<22) /* limited to ASCII */
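
At the Ruby level, these flags surface as option symbols passed to the case conversion methods. A short sketch, assuming Ruby 2.4 or later; :fold is only accepted by downcase and downcase!, and a :lithuanian symbol is accepted as well.

```ruby
# Plain lowercase mapping keeps the sharp s.
puts 'Straße'.downcase

# Case folding goes further and is meant for comparisons, not for display.
puts 'Straße'.downcase(:fold)

# Turkic languages lowercase I to dotless ı.
puts 'I'.downcase(:turkic)
puts 'I'.downcase
```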


String Expansion

Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")

Common to all casing operations
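
Expansion is easy to observe from Ruby (2.4 or later): the result can have more characters than the input. The constant name is illustrative.

```ruby
# U+FB03 LATIN SMALL LIGATURE FFI is a single character.
LIGATURE = "\uFB03"
puts LIGATURE.length          # 1
puts LIGATURE.upcase          # FFI
puts LIGATURE.upcase.length   # 3

# The sharp s expands as well.
puts 'ß'.upcase               # SS
```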


Handling Encodings: The Ruby Way

[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)


Implementation Choice: UTF-8 only or Primitives


Implementation Choice: New or Reused Primitive

⇒ New primitive


The case_map Primitive


Implementations of case_map Primitive



The Primitive of Primitives:


More case_map Primitives

Students (sophomores/juniors/seniors) at Aoyama Gakuin University


So What about Shift_JIS and Friends?

For East Asian encodings
(Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW,...)

data could be shared between //i and case mapping

but case folding for //i only works for ASCII

None of the main Japanese committers thought this was needed anymore

Talk to me if you need it
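
What this means for users can be sketched as follows, on the assumption (consistent with the slide) that MRI only maps ASCII letters in Shift_JIS strings. The constant names are illustrative.

```ruby
# Fullwidth "ａ" (U+FF41) followed by ASCII "bc".
UTF8_MIXED = "\uFF41bc"
SJIS_MIXED = UTF8_MIXED.encode('Shift_JIS')

# In UTF-8, the fullwidth letter is upcased along with the ASCII letters.
puts UTF8_MIXED.upcase

# In Shift_JIS, only the ASCII letters change (assumption stated above).
puts SJIS_MIXED.upcase.encode('UTF-8')
```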


Reusing Case Folding Data


Folding Data: Before and After

in enc/unicode/9.0.0/casefold.h

/*  before  */
  {0x0041, {1, {0x0061}}},  /*  A → a  */
  {0x00df, {2, {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1, {0x01c6}}},  /*  DŽ → dž  */
  {0x01c5, {1, {0x01c6}}},  /*  Dž → dž  */
  {0xab73, {1, {0x13a3}}},  /*  Ꭳ → ꭳ (Cherokee)  */

/*  after  */
  {0x0041, {1|F|D, {0x0061}}},  /*  A → a  */
  {0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1|F|D|ST|I(8), {0x01c6}}},  /*  DŽ → dž  */
  {0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}},  /*  Dž → dž  */
  {0xab73, {1|F|U, {0x13a3}}},  /*  Ꭳ → ꭳ (Cherokee)  */


Folding Data: Flags

(squeezed into an int where only 2 bits were used)

see enc/unicode.c

/*  data is available here  */
/*  (flags are the same as for options)  */
/*  data is in special additional array  */
/*  index into special array
    (size: around 420 words only)  */
#define I(n) OnigSpecialIndexEncode(n)


Small Implementation Detail

(or my attempt at using the Takahashi method)


seems useful


seems useful


seems useful


Who would use swapcase?





Well, I did, when testing swapcase!


Why swapcase?


Python has it ?! (Matz)


To revert accidental Caps Lock output ?! (on Unicode list)

implementing swapcase

must be easy
UPPER ⇒ upper
lower ⇒ LOWER

But what about titlecase?

Dz, Dž, Lj, Nj
ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ
ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ
ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ


Choice 1
⇓ leave as is

preferred by Unicode Consortium
(never ever need any new standardization)

preserves reversibility
(X.swapcase.swapcase == X)
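
The reversibility property is simple to state in code. A minimal sketch with a made-up helper name, assuming Ruby 2.4 or later; it only exercises ordinary letters, not the titlecase characters this part of the talk is about.

```ruby
# True when swapping case twice gives back the original text.
def swapcase_reversible?(text)
  text.swapcase.swapcase == text
end

puts 'Résumé'.swapcase
puts swapcase_reversible?('Résumé')
puts swapcase_reversible?('UPPER lower')
```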


Choice 2
⇓ upcase


Choice 3
⇓ downcase


Choice 4
⇓ swap

proposed by Nobuyoshi Nakada


swap ⇒"UNgla"

useless?, but 'correct'
additional effort for implementation
additional effort for testing


Commit Date
April 1st, 2016

Japan Time 20:58:33 ⇒ same date in most timezones
please draw your own conclusions



Test-Driven Development


Data-Driven Testing


413 tests, 2'212'391 assertions, 0 failures, 0 errors, 0 skips
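
The large assertion count comes from generating checks out of data rather than writing them by hand. A minimal sketch of that idea, not the actual Ruby test suite: the sample lines follow the UnicodeData.txt field layout (field 12 is the uppercase mapping, field 13 the lowercase mapping), and all method and constant names are made up. Requires Ruby 2.4 or later.

```ruby
SAMPLE_UNICODE_DATA = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
  00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
  03B4;GREEK SMALL LETTER DELTA;Ll;0;L;;;;;N;;;0394;;0394
DATA

# An empty mapping field means the character maps to itself.
def mapped_char(field, fallback)
  field.empty? ? fallback : [field.hex].pack('U')
end

# Turns each data line into the expected upcase/downcase results.
def case_expectations(data)
  data.each_line.map do |line|
    fields = line.chomp.split(';', -1)
    char = [fields[0].hex].pack('U')
    { char: char,
      upper: mapped_char(fields[12], char),
      lower: mapped_char(fields[13], char) }
  end
end

# Runs two checks per data line and collects failures.
def run_case_checks(data)
  assertions = 0
  failures = []
  case_expectations(data).each do |e|
    assertions += 2
    failures << "upcase #{e[:char]}" unless e[:char].upcase == e[:upper]
    failures << "downcase #{e[:char]}" unless e[:char].downcase == e[:lower]
  end
  [assertions, failures]
end

assertions, failures = run_case_checks(SAMPLE_UNICODE_DATA)
puts "#{assertions} assertions, #{failures.size} failures"
```

Characters with special (multi-character) mappings would need SpecialCasing.txt in addition; the sample deliberately avoids them.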


Continuous Integration



Ideas, Problems, Questions

In No Particular Order


Character Properties


Locale-Aware Formatting

What I want:

loc = 'de-CH' (German as used in Switzerland)




Well, Just use a Library

Internationalization support in libraries:


Example: Unicode Normalization

Libraries avoid monkey patching

⇒ not Ruby-like (using a library does not feel like Ruby)
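
For comparison, this is what the Ruby-like form looks like for normalization, using the built-in String#unicode_normalize (available since Ruby 2.2). The constant names are illustrative.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
DECOMPOSED = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE: one code point.
COMPOSED = "\u00E9"

puts DECOMPOSED == COMPOSED                          # false: different code points
puts DECOMPOSED.unicode_normalize(:nfc) == COMPOSED  # true after composing
puts COMPOSED.unicode_normalize(:nfd) == DECOMPOSED  # true after decomposing
puts DECOMPOSED.unicode_normalized?(:nfc)            # false
```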


Locales and Case Mappings

Possible solution:

loc = 'tr'
'Türkiye'.upcase loc
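
Passing a locale string as above is a proposal, not something Ruby 2.4 accepts. What does exist is the :turkic option, so the proposed behaviour can be approximated with a small wrapper. The helper name upcase_for and its language list are assumptions made for this sketch.

```ruby
# Hypothetical helper: picks the existing :turkic option from a language tag.
def upcase_for(str, locale)
  language = locale.to_s.split(/[-_]/).first
  %w[tr az].include?(language) ? str.upcase(:turkic) : str.upcase
end

puts upcase_for('Türkiye', 'tr')      # dotted capital İ
puts upcase_for('Türkiye', 'de-CH')   # plain capital I
```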


Encodings: Less is More?







More information about case conversion implementation internals:
(video at


Q & A

Send questions and comments to Martin Dürst
or open a bug report or feature request for Ruby

The latest version of this presentation is available at: