Internationalization in Ruby 2.4

http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/

40th Internationalization and Unicode Conference

Santa Clara, California, U.S.A., November 3, 2016

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

A Ruby Crystal, Symbol of the Ruby Programming Language

© 2016 Martin J. Dürst, Aoyama Gakuin University

Abstract

Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.

Since version 1.9, Ruby has had a pervasive, if somewhat unique, framework for character encoding that allows different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.

Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-insensitive matching in regular expressions.

We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design and the implementation of the new functionality for Ruby 2.4 will be described.

This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

For Best Viewing

These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

 

Introduction

Introductions

 

Overview

 

Ruby Basics

 

Ruby

Ruby - A Programmer's Best Friend

 

Ruby Implementations

This tutorial is about MRI/C-Ruby, the reference implementation

 

Basic Ruby

3.times { puts 'Hello Ruby!' }

Hello Ruby!
Hello Ruby!
Hello Ruby!

 

Conventions Used in This Talk

Code is mostly green, monospace
puts 'Hello Ruby!'

Variable parts are orange
puts "some string"

Encoding is indicated with a subscript
'Юに코δ'_UTF-8, 'ユニコード'_SJIS

Results are indicated with "⇒"
1 + 1 ⇒ 2

 

Frequent Example Юに코δ

 

Up and Running

 

String Basics

Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
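
The points in this note can be tried out directly. The following is a minimal sketch using the talk's example string; the helper name first_and_last is made up for illustration.

```ruby
str = 'Юに코δ'

# Indexing a String yields another String; there is no separate character class.
first = str[0]
puts first.class
puts first.length

# Where it makes sense, String and Array respond to the same messages,
# so a method written against that shared interface accepts both (duck typing).
def first_and_last(seq)
  [seq[0], seq[-1]]
end

p first_and_last(str)         # elements taken from the String
p first_and_last(str.chars)   # elements taken from an Array of one-character Strings
```

Both calls go through the same method body, which only relies on [] being available.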

 

Encoding Basics

Just use Unicode, just use UTF-8

 

Ruby Likes UTF-8

 

Ruby Versions

 

Ruby Versions and Unicode Versions

Year (y)   Ruby version (VRuby)         Unicode version (VUnicode)
           published around Christmas   published in summer
2014       2.2                          7.0.0
2015       2.3                          8.0.0
2016       2.4                          9.0.0

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.

RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '9.0.0'

VUnicode = y - 2007

VRuby = 1.5 + VUnicode · 0.1

VUnicode = VRuby · 10 - 15

Don't extrapolate too far!
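
As a quick check, the three rule-of-thumb formulas can be evaluated against the table. This is only a sketch; the method names are made up here, and the formulas are only meant to hold for the years shown.

```ruby
require 'rbconfig'

# VUnicode = y - 2007
def unicode_major(year)
  year - 2007
end

# VRuby = 1.5 + VUnicode * 0.1 (rounded to hide floating-point noise)
def ruby_version(unicode_major)
  (1.5 + unicode_major * 0.1).round(1)
end

# VUnicode = VRuby * 10 - 15
def unicode_from_ruby(ruby_version)
  (ruby_version * 10 - 15).round
end

[2014, 2015, 2016].each do |year|
  u = unicode_major(year)
  puts "#{year}: Ruby #{ruby_version(u)}, Unicode #{u}.0.0"
end

# The interpreter itself reports the Unicode version it was built with.
puts RbConfig::CONFIG['UNICODE_VERSION']
```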

 

New in Ruby 2.4:

Non-ASCII Case Conversion

Case Conversion Functions in Ruby

 

Case Conversion in Ruby 2.3

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'

 

Case Conversions NOT in Ruby 2.3

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'

Case Conversion up to and including Ruby 2.3 is ASCII-only!

But in Ruby 2.4!

 

Case Conversion Around the World

 

Case Distinction History

 

Modern Case Usage

(details vary by language)

German:
der Gefangene floh - the prisoner fled, but
der gefangene Floh - the captive flea

 

Isn't ASCII-only Case Conversion Enough?

 

But: Backwards Compatibility?

 

Backwards Compatibility Problems

 

Backwards Compatibility: :ascii Option

Use if you find a case where you really don't want to convert non-ASCII characters

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
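
The difference between the default and the :ascii option can be reproduced on Ruby 2.4 or later with a shorter string. This is a minimal sketch; the string is written with \u escapes so that the accented letters are known to be single precomposed codepoints.

```ruby
text = "R\u00E9sum\u00E9"      # 'Résumé' with precomposed é (U+00E9)

full  = text.upcase            # Unicode-aware from Ruby 2.4 on
ascii = text.upcase(:ascii)    # pre-2.4 behavior: only a-z are converted

puts full
puts ascii
puts full == ascii
```

Code that depends on the old ASCII-only result can keep it by passing :ascii explicitly.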

 

Implementation Choices

Use a library?

Integrate ICU?

Write new code?

 

Where to Get the Data From?

Data and other specifications available from the Unicode Consortium:

UnicodeData.txt

CaseFolding.txt

SpecialCasing.txt
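
To give an idea of what this data looks like, here is a sketch that reads the simple case mappings out of three sample lines in the UnicodeData.txt format (semicolon-separated fields, with the simple uppercase, lowercase, and titlecase mappings in the last three fields). The struct and method names are made up for illustration; this is not the script used to generate Ruby's tables.

```ruby
# Sample lines in UnicodeData.txt format.
SAMPLE = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
  01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5
DATA

SimpleMapping = Struct.new(:code, :upper, :lower, :title)

def parse_simple_mappings(text)
  to_codepoint = ->(hex) { hex.empty? ? nil : hex.to_i(16) }
  text.each_line.map do |line|
    fields = line.chomp.split(';', -1)   # -1 keeps trailing empty fields
    SimpleMapping.new(fields[0].to_i(16),
                      to_codepoint[fields[12]],   # simple uppercase mapping
                      to_codepoint[fields[13]],   # simple lowercase mapping
                      to_codepoint[fields[14]])   # simple titlecase mapping
  end
end

parse_simple_mappings(SAMPLE).each do |m|
  cells = m.to_a.map { |cp| cp ? format('U+%04X', cp) : '-' }
  puts cells.join('  ')
end
```

Mappings that are not 1-to-1 or that depend on language or context are not in these fields; they come from SpecialCasing.txt.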

 

Special Cases: Not 1-to-1

 

Special Case: Simple Case Mapping

Not implemented!

 

Special Case: Turkic

 

Special Case: Lithuanian

 

Special Case: Case Folding

 

Special Case: Titlecase

 

More Special Cases

 

Implementation

 

12 Methods to Implement

String (functional)   String (destructive)   Symbol
upcase                upcase!                upcase
downcase              downcase!              downcase
capitalize            capitalize!            capitalize
swapcase              swapcase!              swapcase

Not dealt with: String#casecmp
Why: Includes sorting

 

Internally, a Single Function

Flags to indicate operation needed
(in file include/ruby/oniguruma.h):

#define ONIGENC_CASE_UPCASE     (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE   (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE  (1<<15) /* titlecase mapping */

Usage to indicate operation type:

upcase:      ONIGENC_CASE_UPCASE
(upcasing needed)

downcase:    ONIGENC_CASE_DOWNCASE
(downcasing needed)

capitalize:  ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE
(changed to   ONIGENC_CASE_DOWNCASE after first character)

swapcase:    ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE
(both upcasing and downcasing needed)
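
The idea of one function steered by flags can be modeled in a few lines of Ruby. This is a toy model restricted to ASCII letters, not the actual C implementation; the constant values are copied from the defines above, and the method name toy_case_map is made up.

```ruby
UPCASE    = 1 << 13
DOWNCASE  = 1 << 14
TITLECASE = 1 << 15

# One function for all four operations; the flags say what is needed.
def toy_case_map(str, flags)
  result = +''
  str.each_char do |ch|
    if ch.between?('a', 'z') && (flags & UPCASE) != 0
      result << (ch.ord - 32).chr        # lowercase letter, upcasing requested
    elsif ch.between?('A', 'Z') && (flags & DOWNCASE) != 0
      result << (ch.ord + 32).chr        # uppercase letter, downcasing requested
    else
      result << ch
    end
    # capitalize: after the first character, switch to plain downcasing
    flags = DOWNCASE if (flags & TITLECASE) != 0
  end
  result
end

puts toy_case_map('hello World', UPCASE)               # upcase
puts toy_case_map('hello World', DOWNCASE)             # downcase
puts toy_case_map('hello World', TITLECASE | UPCASE)   # capitalize
puts toy_case_map('hello World', UPCASE | DOWNCASE)    # swapcase
```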

 

Option Handling

Flags also used for options:

Corresponding flags:

#define ONIGENC_CASE_FOLD                (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI  (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN     (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY          (1<<22) /* limited to ASCII */
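
On the Ruby side, these options are passed as symbols. The sketch below uses the option symbols documented for the case conversion methods in Ruby 2.4 and later (:fold, :turkic, :lithuanian, :ascii); which C flag each symbol sets is assumed from the names above.

```ruby
puts 'I'.downcase                     # default mapping
puts 'I'.downcase(:turkic)            # Turkic: dotless ı (U+0131)
puts 'i'.upcase(:turkic)              # Turkic: dotted İ (U+0130)
puts "Stra\u00DFe".downcase           # ß is already lowercase
puts "Stra\u00DFe".downcase(:fold)    # case folding expands ß
puts "Stra\u00DFe".upcase(:ascii)     # ASCII-only leaves ß alone
```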

 

String Expansion

Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")

Common to all casing operations
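
String expansion is easy to observe from Ruby 2.4 on. In this small sketch the ligature is written as an escape so that it is certain to be the single codepoint U+FB03.

```ruby
ligature = "\uFB03"              # LATIN SMALL LIGATURE FFI, one codepoint
upper    = ligature.upcase

puts upper
puts "#{ligature.length} -> #{upper.length} characters"

german = "Stra\u00DFe"           # 'Straße'
puts german.upcase
puts "#{german.length} -> #{german.upcase.length} characters"
```

Because the result can be longer than the input, the implementation cannot simply overwrite characters in place.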

  

Handling Encodings: The Ruby Way

[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)

 

Implementation Choice:
           UTF-8 only or Primitives

 

Implementation Choice:
           New or Reused Primitive

⇒ New primitive

 

The case_map Primitive

 

Implementations of case_map Primitive

Examples:

 

The Primitive of Primitives:
           onigenc_unicode_case_map

 

More case_map Primitives

Students (sophomores/juniors/seniors) at Aoyama Gakuin University

 

So What about Shift_JIS and Friends?

For East Asian encodings
(Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW,...)

data could be shared between //i and case mapping

but case folding for //i only works for ASCII

None of the main Japanese committers thought this was needed anymore

Talk to me if you need it

 

Reusing Case Folding Data

 

Folding Data: Before and After

in enc/unicode/9.0.0/casefold.h

/*  before  */
  {0x0041, {1, {0x0061}}},  /*  A → a  */
  {0x00df, {2, {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1, {0x01c6}}},  /*  DŽ → dž  */
  {0x01c5, {1, {0x01c6}}},  /*  Dž → dž  */
  {0xab73, {1, {0x13a3}}},  /*  Ꭳ → ꭳ (Cherokee)  */

/*  after  */
  {0x0041, {1|F|D, {0x0061}}},  /*  A → a  */
  {0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1|F|D|ST|I(8), {0x01c6}}},  /*  DŽ → dž  */
  {0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}},  /*  Dž → dž  */
  {0xab73, {1|F|U, {0x13a3}}},  /*  Ꭳ → ꭳ (Cherokee)  */

 

Folding Data: Flags

(squeezed into an int where only 2 bits were used)

see enc/unicode.c

/*  data is available here  */
/*  (flags are the same as for options)  */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/*  data is in special additional array  */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/*  index into special array
    (size: around 420 words only)  */
#define I(n) OnigSpecialIndexEncode(n)
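
The following sketch shows how a length, a set of flags, and an index can share one integer. The flag values are the ones from the defines shown earlier; the position and width of the index field are assumptions made for this illustration and may differ from what OnigSpecialIndexEncode actually does.

```ruby
UPCASE    = 1 << 13
DOWNCASE  = 1 << 14
TITLECASE = 1 << 15
FOLD      = 1 << 19

LENGTH_MASK = 0b111                                  # assumed: low bits hold the length
INDEX_SHIFT = 3                                      # assumed position of the index
INDEX_WIDTH = 10                                     # assumed width of the index
INDEX_MASK  = ((1 << INDEX_WIDTH) - 1) << INDEX_SHIFT

def pack_entry(length, flags, index = 0)
  length | flags | (index << INDEX_SHIFT)
end

def unpack_entry(packed)
  length = packed & LENGTH_MASK
  index  = (packed & INDEX_MASK) >> INDEX_SHIFT
  flags  = packed & ~(LENGTH_MASK | INDEX_MASK)
  [length, flags, index]
end

# Modeled on the entry for U+01C4 above: 1|F|D|ST|I(8)
packed = pack_entry(1, FOLD | DOWNCASE | TITLECASE, 8)
length, flags, index = unpack_entry(packed)
puts format('packed: 0x%06x', packed)
puts "length=#{length} index=#{index} downcase=#{(flags & DOWNCASE) != 0} upcase=#{(flags & UPCASE) != 0}"
```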

 

Small Implementation Detail

(or my attempt at using the Takahashi method)

upcase

seems useful

downcase

seems useful

capitalize

seems useful

swapcase

Who would use swapcase?

 

Nobody?

Well, I did, when testing swapcase!

 

Why swapcase?

Python has it ?! (Matz)

To revert accidental Caps Lock output ?! (on Unicode list)

implementing swapcase

must be easy
UPPER ⇒ upper
lower ⇒ LOWER

But what about titlecase?

Dz, Dž, Lj, Nj
ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ
ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ
ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ

 

Choice 1
"DžunGLA".swapcase
⇓ leave as is
"DžUNgla"

preferred by Unicode Consortium
(never ever need any new standardization)

preserves reversibility
(X.swapcase.swapcase == X)

 

Choice 2
"DžunGLA".swapcase
⇓ upcase
"DŽUNgla"

 

Choice 3
"DžunGLA".swapcase
⇓ downcase
"džUNgla"

 

Choice 4
"DžunGLA".swapcase
⇓ swap
"dŽUNgla"

proposed by Nobuyoshi Nakada

 

Implemented
swap ⇒ "dŽUNgla"

useless?, but 'correct'
additional effort for implementation
additional effort for testing
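
The reversibility argument from Choice 1 can be checked on any Ruby from 2.4 on. This is a sketch; the helper name is made up, and the titlecase letter is written as an escape so that it is the single codepoint U+01C5.

```ruby
def swapcase_reversible?(str)
  str.swapcase.swapcase == str
end

plain     = 'unGLA'
titlecase = "\u01C5unGLA"     # starts with the titlecase digraph U+01C5

puts swapcase_reversible?(plain)
puts swapcase_reversible?(titlecase)

# Show what the digraph turned into, codepoint by codepoint.
p titlecase.swapcase.codepoints.map { |cp| format('U+%04X', cp) }
```

If swapping behaves as described on this slide, the titlecase digraph comes back as two separate codepoints, so the second check is expected to report that the round trip is not exact.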

 

Commit Date
April 1st, 2016

(April Fools' Day)
Japan Time 20:58:33 ⇒ same date in most timezones
please draw your own conclusions

 

Testing

Test-Driven Development

Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb

Data-Driven Testing

Files:
test/ruby/enc/test_case_comprehensive.rb

413 tests, 2'212'391 assertions, 0 failures, 0 errors, 0 skips
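
The idea behind data-driven testing is that assertions are generated from a table rather than written by hand. Here is a minimal sketch with a tiny hand-made table; it is not the content of the test files listed above, and the names are made up.

```ruby
# Expected uppercase mappings: source => expected result.
UPCASE_DATA = {
  'a'      => 'A',
  "\u00E9" => "\u00C9",   # é => É
  "\u00DF" => 'SS',       # ß => SS (string expansion)
  "\u03C9" => "\u03A9",   # ω => Ω
  "\u01C6" => "\u01C4"    # dž => DŽ (digraphs)
}.freeze

# Returns the table entries for which the block does not give the expected result.
def failed_mappings(table)
  table.reject { |source, expected| yield(source) == expected }
end

failures = failed_mappings(UPCASE_DATA) { |s| s.upcase }
puts "#{UPCASE_DATA.size} assertions, #{failures.size} failures"
```

Feeding a full Unicode data file into such a loop is how a small number of test methods can produce millions of assertions.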

 

Continuous Integration

 

Future:

Ideas, Problems, Questions

In No Particular Order

 

Character Properties

 

Locale-Aware Formatting

What I want:

loc = Locale.new 'de-CH' (German as used in Switzerland)

1.2345678E5.to_s ⇒ "123456.78"

1.2345678E5.to_s(loc) ⇒ "123'456,78"
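
Neither a Locale class nor a locale argument to to_s exists in Ruby; the lines above describe a wish. The sketch below only imitates the desired output with a hypothetical helper that takes the separators explicitly, and it assumes numbers small enough that to_s does not switch to exponent notation.

```ruby
# Hypothetical helper; separators are passed in instead of coming from a locale.
def format_grouped(number, group_sep:, decimal_sep:)
  sign = number.negative? ? '-' : ''
  int_part, frac_part = number.abs.to_s.split('.')
  # Group digits in threes, counting from the right.
  grouped = int_part.reverse.scan(/\d{1,3}/).join(group_sep).reverse
  sign + (frac_part ? grouped + decimal_sep + frac_part : grouped)
end

puts 1.2345678E5.to_s
puts format_grouped(1.2345678E5, group_sep: "'", decimal_sep: ',')
```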

 

Well, Just Use a Library

Internationalization support in libraries:

 

Example: Unicode Normalization

Libraries avoid monkey patching

⇒ not Ruby-like (using a library does not feel like Ruby)
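
For comparison, this is what the Ruby-like style looks like when the functionality is a method on String itself. The sketch uses the built-in String#unicode_normalize, available since Ruby 2.2; it is shown only as an illustration of the calling style.

```ruby
decomposed = "e\u0301"                            # e + COMBINING ACUTE ACCENT
composed   = decomposed.unicode_normalize(:nfc)   # a method call on the string itself

puts decomposed.length
puts composed.length
puts composed == "\u00E9"
puts decomposed.unicode_normalized?(:nfc)
```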

 

Locales and Case Mappings

Possible solution:

loc = Locale.new 'tr'
'Türkiye'.upcase loc
⇒ 'TÜRKİYE'
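
Until something like a Locale class exists, a similar effect can be approximated with the option symbols that Ruby 2.4 already accepts. The lookup table and the method name below are hypothetical; only upcase with :turkic and :lithuanian is existing functionality.

```ruby
# Hypothetical mapping from primary language subtag to case-mapping options.
CASE_OPTIONS_BY_LANGUAGE = {
  'tr' => [:turkic],
  'az' => [:turkic],
  'lt' => [:lithuanian]
}.freeze

def upcase_for(str, language_tag)
  primary = language_tag.split('-').first.downcase(:ascii)
  str.upcase(*CASE_OPTIONS_BY_LANGUAGE.fetch(primary, []))
end

puts upcase_for("T\u00FCrkiye", 'tr')      # Turkic rules: i becomes dotted İ
puts upcase_for("T\u00FCrkiye", 'de-CH')   # default rules: i becomes I
```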

 

Encodings: Less is More?

 

Acknowledgments

 

Conclusions

 

References

More information about case conversion implementation internals:
http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/
(video at http://rubykaigi.org/2016/presentations/duerst.html)

 

Q & A

Send questions and comments to Martin Dürst
(mailto:duerst@it.aoyama.ac.jp)
or open a bug report or feature request for Ruby



The latest version of this presentation is available at:

http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/