Ups and Downs of Ruby Internationalization

http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/

Ruby Kaigi 2016, Kyoto, Japan, September 8, 2016

Martin J. DÜRST

(Please call me Martin)

duerst@it.aoyama.ac.jp

Abstract

Currently many of Ruby's String methods, such as upcase and downcase, are limited to ASCII and ignore the rest of the world. This is finally going to change in Ruby 2.4, where this functionality will be extended to cover full Unicode. You will get to know what will change, how your programs may be affected, and how these changes are implemented behind the scenes. We will also look at the overall state of internationalization functionality in Ruby, and potential future directions.

For Best Viewing

These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Text in gray, like this, is a comment or note that does not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but instead as empty boxes, question marks, or separated rather than composed.

Introduction

Outline

Introduction
Upcase and downcase
Special cases
Implementation
Future of Ruby Internationalization

Some Conventions Used

Code is mostly green, monospace
puts "Hello Ruby!"

Variable parts are orange
puts "some string"

Encoding is indicated with a _subscript
"Юに코δ"_UTF-8, "ユニコード"_SJIS

Results are indicated with "⇒"
1 + 1 ⇒ 2

Audience Self-Intro

Who uses letters other than A-Z?
Who uses encodings other than US-ASCII?
Who uses encodings other than UTF-8?

Speaker Self-Intro

From Switzerland, living in Japan
Ruby Committer
W3C Internationalization Interest Group Chair
IETF WG member,...
Professor at Aoyama Gakuin University (青山学院大学)
(teaching algorithms, Compilers, C programming, Math for CS students,..., in Japanese and English)
Research topics: Software internationalization, World Wide Web Architecture, Programming Languages

Contributions to Ruby

Mainly in the following areas:

Encoding conversion (String#encode, Ruby 1.9)
Unicode normalization (String#unicode_normalize, Ruby 2.2)
Non-ASCII case conversion (String#upcase,..., Ruby 2.4)
Unicode version updates

Ruby Versions and Unicode Versions

Year (`y`)	Ruby version (`V`_Ruby)	Unicode version (`V`_Unicode)
	published around Christmas	published in Summer
2014	2.2	7.0.0
2015	2.3	8.0.0
2016	2.4	9.0.0 (as of yesterday)

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to a new Unicode version therefore only happen with a new Ruby version.

RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '9.0.0'

V_Unicode = y - 2007

V_Ruby = 1.5 + V_Unicode · 0.1

V_Unicode = V_Ruby · 10 - 15

Don't extrapolate too far!
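
The rule of thumb above can be checked against the table with a few lines of Ruby. This is only a sketch: the method names are made up for illustration, and the Ruby version is computed in tenths so that floating-point rounding does not get in the way.

```ruby
# Year => [Ruby version, Unicode major version], taken from the table above.
RELEASES = {
  2014 => ["2.2", 7],
  2015 => ["2.3", 8],
  2016 => ["2.4", 9],
}.freeze

# V_Unicode = y - 2007
def unicode_major_for_year(year)
  year - 2007
end

# V_Ruby = 1.5 + V_Unicode * 0.1, computed in tenths to stay exact.
def ruby_version_for_unicode(unicode_major)
  tenths = 15 + unicode_major
  "#{tenths / 10}.#{tenths % 10}"
end

RELEASES.each do |year, (ruby, unicode)|
  predicted_unicode = unicode_major_for_year(year)
  predicted_ruby = ruby_version_for_unicode(predicted_unicode)
  puts "#{year}: predicted Ruby #{predicted_ruby} / Unicode #{predicted_unicode}.0.0, " \
       "table says Ruby #{ruby} / Unicode #{unicode}.0.0"
end
```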

Ups and Downs:

Upcase and Downcase

Case Conversions in Ruby 2.3

'Ruby Kaigi 2016'.upcase ⇒ 'RUBY KAIGI 2016'
'Ruby Kaigi 2016'.downcase ⇒ 'ruby kaigi 2016'
'Ruby Kaigi 2016'.capitalize ⇒ 'Ruby Kaigi 2016'
'Ruby Kaigi 2016'.swapcase ⇒ 'rUBY kAIGI 2016'

'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'.upcase
⇒ 'RéSUMé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'
'Юрий Соколов'.upcase (aka funny.falcon)
⇒ 'Юрий Соколов'

Case Conversions NOT in Ruby 2.3

'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢĲŇŐŃÆŁĨŻÀŤÏŌŅ'
'Юрий Соколов'.upcase
⇒ 'ЮРИЙ СОКОЛОВ'

But in Ruby 2.4!

Case Conversion Around the World

Many more Latin letters than just A-Z
Other scripts:
- Cyrillic, Greek
- Coptic, Armenian [, Georgian]
- Cherokee, Deseret, Osage
- Old Hungarian, Warang Citi, Glagolitic, Adlam
More minority scripts may introduce case distinction
from surrounding majority scripts

Case Distinction History

Originally: Style difference, depending on medium
- Upper case for stone inscriptions (SPQR)
- Lower case for wax tablets,...?
Functional distinction since ~15th century

Modern Case Usage

(details vary by language)

ALL UPPER CASE
- EMPHASIS
- Acronyms, abbreviations (DRY, SQL)
First letter upper case
- Start of sentence
- Words in titles
- Proper nouns/adjectives (Kyoto, Japanese)
- Nouns
- Honorifics
Lower case: everything else

German:
der Gefangene floh 捕虜が逃げた the prisoner fled, but
der gefangene Floh 捕まった蚤 the captive flea

Isn't ASCII-only Case Conversion Enough?

Already in other languages (Python, Perl, Java, ...)
Already in Ruby (Regexp: //i)
Data is available from Unicode Consortium
It's a good idea in general

But: Backwards Compatibility?

Idea: Option for new functionality
'Résumé'.upcase ⇒ 'RéSUMé'
'Résumé'.upcase :unicode ⇒ 'RÉSUMÉ'
Matz felt option was not necessary
Lots of data is ASCII-only
For non-ASCII data, you hopefully used a gem
(which you can now eliminate)
Check early
grep your code base for upcase and friends
Test early (preview of 2.4 should come out during this RubyKaigi)

Backwards Compatibility: Problems to Look Out For

Explicit ASCII-only case conversion
E.g. DNS servers
(but you used Encoding::ASCII_8BIT there anyway, didn't you)
Exact matches after conversion
1. Allowed non-ASCII in userids (e.g. Соколов)
2. downcased with Ruby 2.3 to help users (Соколов in DB)
3. Used exact match
4. In Ruby 2.4, соколов will not match Соколов anymore
Localization: See Turkic, Lithuanian special cases
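
The exact-match problem in steps 1 to 4 can be reproduced with the `:ascii` option, which keeps the Ruby 2.3 behavior. The sketch below assumes Ruby 2.4 or later; the hash standing in for the database and the method name `find_user` are hypothetical.

```ruby
# Stand-in for a user table whose keys were normalized under Ruby 2.3,
# where downcase only touched ASCII letters (so the key stays "Соколов").
USERS_BY_ID = { "Соколов".downcase(:ascii) => "record for Соколов" }.freeze

# Look up a user id, trying the full-Unicode key first and falling back
# to the legacy ASCII-only key so that old records still match.
def find_user(userid)
  USERS_BY_ID[userid.downcase] || USERS_BY_ID[userid.downcase(:ascii)]
end

p USERS_BY_ID["Соколов".downcase]   # the naive Ruby 2.4 lookup misses the old key
p find_user("Соколов")              # the fallback still finds the record
```

A cleaner long-term fix is probably to re-normalize the stored keys once, rather than to keep the fallback forever.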

Backwards Compatibility: `:ascii` Option

Use if you find a case where you really don't want to convert non-ASCII characters
(どうしても ASCII のみに限定したいとき)

'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢĲŇŐŃÆŁĨŻÀŤÏŌŅ'

'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'.upcase :ascii
⇒ 'RéSUMé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'

'Юрий Соколов'.upcase
⇒ 'ЮРИЙ СОКОЛОВ'

'Юрий Соколов'.upcase :ascii
⇒ 'Юрий Соколов'

Where Do We Get the Data From?

Data and other specifications available from the Unicode Consortium:

UnicodeData.txt

CaseFolding.txt

SpecialCasing.txt

Special Cases: Not 1-to-1

Number of characters not preserved
'ß'.upcase ⇒ 'SS' (German sz/sharp s)
'ﬃ'.upcase ⇒ "FFI" (ﬃ ligature)
Not necessarily reversible
'ß'.upcase.downcase ⇒ 'ss' (not 'ß')
'σ'.upcase ⇒ 'Σ' (Greek sigma)
'ς'.upcase ⇒ 'Σ' (Greek final sigma)
'ς'.upcase.downcase ⇒ 'σ' (not 'ς')
Implemented!
'Σ'.downcase should be context-dependent
Not yet implemented!
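
The length changes and the lost round trips can be observed directly. A small sketch, assuming Ruby 2.4 or later; code points are spelled out because the glyphs are easy to confuse.

```ruby
# Each pair is [input, uppercase form given on the slide].
EXPANDING = [
  ["\u00DF", "SS"],   # ß, German sharp s
  ["\uFB03", "FFI"],  # ﬃ ligature
].freeze

EXPANDING.each do |input, expected|
  upper = input.upcase
  puts "#{input} (#{input.length} char) -> #{upper} (#{upper.length} chars), " \
       "matches slide: #{upper == expected}"
end

# Round trips do not restore the original string.
["\u00DF", "\u03C2"].each do |input|   # ß and Greek final sigma ς
  round_trip = input.upcase.downcase
  puts "#{input} -> #{input.upcase} -> #{round_trip} " \
       "(restored: #{round_trip == input})"
end
```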

Special Case: Simple Case Mapping

Defined by Unicode
Excludes mappings that change string length
Feels outdated

⇒Not implemented!

Special Case: Turkic

Usual:
'i'.upcase ⇒ 'I'
'I'.downcase ⇒ 'i'
Turkish, Azerbaijani, and related languages when written in Latin script
'i'.upcase ⇒ 'İ' (uppercase I with dot)
'İ'.downcase ⇒ 'i'
'ı'.upcase ⇒ 'I' (i without dot)
'I'.downcase ⇒ 'ı'
Implemented!
'Türkiye'.upcase :turkic ⇒ 'TÜRKİYE'
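
The two i pairs can be checked side by side. A short sketch, assuming Ruby 2.4 or later; the dotted capital and the dotless small letter are written as escapes so that they cannot be mistaken for the ASCII letters.

```ruby
word = "Türkiye"
puts word.upcase            # default mapping: i -> I
puts word.upcase(:turkic)   # Turkic mapping: i -> İ (U+0130)

# Under :turkic, dotted and dotless i form two separate pairs.
pairs = { "i" => "\u0130", "\u0131" => "I" }   # i/İ and ı/I
pairs.each do |lower, upper|
  consistent = lower.upcase(:turkic) == upper && upper.downcase(:turkic) == lower
  puts "#{lower} <-> #{upper}: #{consistent}"
end
```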

Special Case: Lithuanian

Usual:
'Í'.downcase ⇒ 'í' (accent replaces dot)
Lithuanian:
'Í'.downcase :lithuanian ⇒ 'i̇́' (accent above visible dot: i + U+0307 + U+0301)
Not yet implemented!

Special Case: Case Folding

Case mapping (大文字小文字変換、略: 大小変換):
- Change from one form to another
- upcase/downcase/capitalize/swapcase
Case folding (大小畳込み)
- Eliminate case-related differences
- For comparison, sorting
- In general same as downcase
- But: ß → ss, ﬃ → ffi, ς → σ
- Upcase for Cherokee
Implemented! with option on downcase
'ß'.downcase :fold ⇒ 'ss'
'ﬃ'.downcase :fold ⇒ 'ffi'
'ς'.downcase :fold ⇒ 'σ'
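
Folding is meant for comparison rather than display. The helper below is a hypothetical sketch (its name is not part of Ruby) and assumes Ruby 2.4 or later.

```ruby
# Compare two strings while ignoring case, by folding both sides first.
def same_ignoring_case?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

p same_ignoring_case?("Straße", "STRASSE")   # ß folds to ss
p same_ignoring_case?("\uFB03", "FFI")       # the ﬃ ligature folds to ffi
p "Straße".downcase == "STRASSE".downcase    # plain downcase keeps the ß
```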

Special Case: Titlecase

Some characters have three case forms:
- Upper case: Ǆ (Croatian/Serbian)
- Lower case: ǆ
- Title case: ǅ
Important for capitalize
- "ǆungla".capitalize ⇒ "ǅungla" (titlecase ǅ, not uppercase Ǆ)
Implemented!
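
Because the three forms look nearly identical in many fonts, printing code points makes the difference visible. A sketch assuming Ruby 2.4 or later:

```ruby
lower = "\u01C6ungla"   # ǆungla, starting with the lowercase digraph U+01C6

{ "capitalize" => lower.capitalize, "upcase" => lower.upcase }.each do |name, result|
  first = format("U+%04X", result.codepoints.first)
  puts "#{name}: #{result} (first code point #{first})"
end
```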

Special Cases: There are More

Not implemented (yet?)

Implementation

This must be easy
Just a big tr('ABC...', 'abc...')
Yes, but very big (~1200 characters)
Watch out for special cases

Methods to Implement

`String` (functional)	`String` (destructive)	`Symbol`
`upcase`	`upcase!`	`upcase`
`downcase`	`downcase!`	`downcase`
`capitalize`	`capitalize!`	`capitalize`
`swapcase`	`swapcase!`	`swapcase`

Not dealt with: String#casecmp
Why: Includes sorting, very difficult for all of Unicode

`string.c`: `Init_String`

Tells Ruby what C functions to use for each method

rb_define_method(rb_cString, "upcase", rb_str_upcase, -1);
rb_define_method(rb_cString, "downcase", rb_str_downcase, -1);
rb_define_method(rb_cString, "capitalize", rb_str_capitalize, -1);
rb_define_method(rb_cString, "swapcase", rb_str_swapcase, -1);

rb_define_method(rb_cString, "upcase!", rb_str_upcase_bang, -1);
rb_define_method(rb_cString, "downcase!", rb_str_downcase_bang, -1);
rb_define_method(rb_cString, "capitalize!", rb_str_capitalize_bang, -1);
rb_define_method(rb_cString, "swapcase!", rb_str_swapcase_bang, -1);

rb_define_method(rb_cSymbol, "upcase", sym_upcase, -1);
rb_define_method(rb_cSymbol, "downcase", sym_downcase, -1);
rb_define_method(rb_cSymbol, "capitalize", sym_capitalize, -1);
rb_define_method(rb_cSymbol, "swapcase", sym_swapcase, -1);

`string.c`: `sym_upcase`

static VALUE
sym_upcase(int argc, VALUE *argv, VALUE sym)
{
    return rb_str_intern(
        rb_str_upcase(argc, argv, rb_sym2str(sym)));
}

Equivalent in Ruby:

class Symbol
  def upcase (*args)
    to_s.upcase(*args).to_sym
  end
end

`string.c`: `rb_str_upcase`

static VALUE
rb_str_upcase(int argc, VALUE *argv, VALUE str)
{
    str = rb_str_dup(str);
    rb_str_upcase_bang(argc, argv, str);
    return str;
}

Equivalent in Ruby:

class String
  def upcase (*args)
    str = dup
    str.upcase!(*args)  # upcase! returns nil if nothing changed
    str                 # so return the copy explicitly
  end
end

`string.c`: `rb_str_upcase_bang`

Here the real work starts

static VALUE
rb_str_upcase_bang(int argc, VALUE *argv, VALUE str)
{
    ...
    /*  set flags for upcase  */
    OnigCaseFoldType flags = ONIGENC_CASE_UPCASE;  
    /*  check options  */
    flags = check_case_options(argc, argv, flags);
    ...
    /*  shortcuts for ASCII-only  */
    if (...) { ...  }
    else if (flags&ONIGENC_CASE_ASCII_ONLY)
        rb_str_ascii_casemap(str, &flags, enc);
    else  /*  actual hard work  */
        str_shared_replace(str, rb_str_casemap(str,&flags,enc));

    if (ONIGENC_CASE_MODIFIED&flags) return str;
    return Qnil;
}

`include/ruby/oniguruma.h`: `OnigCaseFoldType`

Flags used to indicate operation needed
(upcase/downcase/capitalize/swapcase):

#define ONIGENC_CASE_UPCASE     (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE   (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE  (1<<15) /* titlecase mapping */

Usage to indicate operation type:

upcase: ONIGENC_CASE_UPCASE
(upcasing needed)

downcase: ONIGENC_CASE_DOWNCASE
(downcasing needed)

capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE
(changed to ONIGENC_CASE_DOWNCASE after first character)

swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE
(both upcasing and downcasing needed)
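
How these flag combinations drive all four methods can be modelled in a few lines of Ruby. This is an ASCII-only sketch, not the actual C code: the method name is made up, and the bit position used for MODIFIED is an assumption (the slides do not show it).

```ruby
UPCASE    = 1 << 13
DOWNCASE  = 1 << 14
TITLECASE = 1 << 15
MODIFIED  = 1 << 18   # assumed value, for illustration only

# Returns [converted string, whether anything was modified].
def ascii_casemap_sketch(str, flags)
  result = +""
  str.each_char do |ch|
    code = ch.ord
    if code.between?(0x41, 0x5A) && (flags & DOWNCASE) != 0
      code += 0x20            # A-Z -> a-z
      flags |= MODIFIED
    elsif code.between?(0x61, 0x7A) && (flags & UPCASE) != 0
      code -= 0x20            # a-z -> A-Z
      flags |= MODIFIED
    end
    result << code.chr(Encoding::UTF_8)
    # capitalize: after the first character, switch to downcasing
    flags = (flags & ~(TITLECASE | UPCASE)) | DOWNCASE if (flags & TITLECASE) != 0
  end
  [result, (flags & MODIFIED) != 0]
end

p ascii_casemap_sketch("ruby Kaigi", UPCASE)               # upcase
p ascii_casemap_sketch("ruby Kaigi", DOWNCASE)             # downcase
p ascii_casemap_sketch("ruby Kaigi", TITLECASE | UPCASE)   # capitalize
p ascii_casemap_sketch("ruby Kaigi", UPCASE | DOWNCASE)    # swapcase
```

The second return value mirrors why the bang methods can return nil: if no character changed, the MODIFIED bit is never set.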

`string.c`: `check_case_options`

Common to upcase/downcase/capitalize/swapcase

Checks case options, sets flags, or produces error messages

Possible options:

:fold (for case folding; only on downcase)
:turkic
:lithuanian (not yet implemented; usable with :turkic)
:ascii

Corresponding flags:

#define ONIGENC_CASE_FOLD                (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI  (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN     (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY          (1<<22) /* limited to ASCII */

`string.c`: `rb_str_casemap`

Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")

Common to all casing operations

Linked list of buffers (b₁→b₂→b₃→...)
Repeatedly calls encoding-specific primitive
to fill as much as possible of next buffer
For buffer b_x, allocates
bytes_to_still_be_converted · x + 20 bytes
Example:
We need a 3rd buffer, and need to convert 5 more bytes,
so we allocate length(b₃) = 5 · 3 + 20 = 35 bytes
Until no new buffer is needed
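
The buffer sizing rule can be played through in Ruby. The following is a simplified model, not a port of the C code: it maps one character at a time with `String#upcase` and opens a new buffer whenever the next mapped character would not fit. All names are made up for illustration.

```ruby
# Capacity rule from the slide: remaining bytes * buffer number + 20.
def next_buffer_capacity(bytes_remaining, buffer_number)
  bytes_remaining * buffer_number + 20
end

# Returns one hash per buffer with its capacity and the bytes written to it.
def simulate_casemap(source)
  pending = source.chars
  buffers = []
  until pending.empty?
    remaining = pending.sum(&:bytesize)
    capa = next_buffer_capacity(remaining, buffers.size + 1)
    data = +""
    while (ch = pending.first)
      mapped = ch.upcase
      break if data.bytesize + mapped.bytesize > capa
      data << mapped
      pending.shift
    end
    buffers << { capa: capa, data: data }
  end
  buffers
end

puts next_buffer_capacity(5, 3)   # the example from the slide

# U+0390 is 2 bytes in UTF-8; full case mapping expands it to 3 code points.
simulate_casemap("\u0390" * 10).each_with_index do |buffer, i|
  puts "b#{i + 1}: capacity #{buffer[:capa]}, used #{buffer[:data].bytesize}"
end
```

Growing the multiplier with each buffer means that strings which expand a lot need only a few extra allocations, while the common case (no expansion) fits into the first buffer.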

`string.c`: `rb_str_casemap`

static VALUE
rb_str_casemap(VALUE source, OnigCaseFoldType *flags, rb_encoding *enc)
{
    /*  general preparations  */
    while (source_current < source_end) {
        /*  calculate next buffer length  */
        size_t capa = (source_end - source_current)
               * ++buffer_count + 20;
        ...  /*  prepare and link next buffer  */
        buffer_length_or_invalid = enc->case_map(flags,
               &source_current, source_end,
               current_buffer->space,
               current_buffer->space+current_buffer->capa,
               enc);
        ...  /*  check for errors (invalid input string)  */
    }

    /*  prepare for copy to final location  */
    while (...)
        /*  copy current_buffer and move to next buffer  */
    /*  cleanup and return  */
}

`enc->case_map` and Encoding Primitives

enc is the encoding of our string, a struct OnigEncodingTypeST
case_map is a function pointer, an encoding primitive [1]
Primitives work like methods (polymorphism), but are implemented in C
Primitives make Ruby work with many different character encodings
Examples:
- "Résumé"_UTF-8.upcase calls
  onigenc_unicode_case_map in enc/unicode.c
  as defined with OnigEncodingDefine in enc/utf_8.c
- "Résumé"_UTF-16LE.upcase calls
  onigenc_unicode_case_map in enc/unicode.c
  as defined with OnigEncodingDefine in enc/utf_16le.c
- "Résumé"_ISO-8859-1.upcase calls
  case_map in enc/iso_8859_1.c
  as defined with OnigEncodingDefine in the same file
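
From the Ruby side, the effect of the per-encoding primitives is that the same call works whatever encoding the string is in. A small check, assuming Ruby 2.4 or later:

```ruby
word = "Résumé"

%w[UTF-8 UTF-16LE ISO-8859-1].each do |name|
  encoded = word.encode(name)
  upper = encoded.upcase   # dispatches to the case_map primitive of that encoding
  puts "#{name}: result encoding #{upper.encoding}, #{upper.bytesize} bytes, " \
       "as UTF-8: #{upper.encode('UTF-8')}"
end
```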

[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)

Implementation Choice: UTF-8 only or Primitive

Matz would have been fine with
- Full Unicode case conversion for UTF-8
- ASCII-only for all other encodings
Used primitives to obtain
- A more complete implementation
- Experience about pros/cons of using primitives

Implementation Choice: New or Reused Primitive

Each encoding uses 13 primitives and 5 data items
3 primitives are used for case folding with //i
- mbc_case_fold
- apply_all_case_fold
- get_case_fold_codes_by_str
No good way to reuse any of these

⇒ New primitive

The `case_map` Primitive

Input/output parameters:
- OnigCaseFoldType flags
- Start of source
Input parameters:
- End of source
- Start of destination
- End of destination
- Encoding (to call other primitives)
Output parameters:
- Byte count of conversion result
Most complex 'primitive', although not by much

A Simple `case_map` Primitive (`enc/iso_8859_1.c`)

static int case_map (...)
{
    /*  initializations  */

    while (*pp<end && to<to_end) {
        code = *(*pp)++;
        if (code==SHARP_s)
            /*  German ß special case  */
        else if (/* have upper case && want lower case */)
            code += 0x20, flags |= ONIGENC_CASE_MODIFIED;
        else if (/* lower without upper: ª, º, µ, ÿ */)  ;
        else if ((EncISO_8859_1_CtypeTable[code]&BIT_CTYPE_LOWER)
                 && (flags&ONIGENC_CASE_UPCASE))
            code -= 0x20, flags |= ONIGENC_CASE_MODIFIED;
        *to++ = code;
        if (flags&ONIGENC_CASE_TITLECASE)
            /* titlecase → lowercase for capitalize */
    }
    /*  cleanup and return  */
}

Good example when creating a new primitive!

More `case_map` Primitives and Who Wrote Them

Students (sophomores/juniors/seniors 二・三・四年生) at Aoyama Gakuin University

ISO-8859-2: Yushiro Ishii (石井優史朗)
ISO-8859-3: Kanon Shindo (新藤海音)
ISO-8859-4: Kotaro Yoshida (吉田孝太郎)
ISO-8859-5: Masaru Onodera (小野寺俊)
ISO-8859-7: Kosuke Kurihara (栗原光祐)
ISO-8859-9: Kazuki Iijima (飯島一貴)
ISO-8859-10: Toya Hosokawa (細川登陽)
ISO-8859-13: Takuya Miyamoto (宮本拓弥)
ISO-8859-14: Yutaro Tada (多田悠太朗)
ISO-8859-15: Maho Harada (原田真帆)
ISO-8859-16: Satoshi Kayama (香山智志)
Windows-1250, -1257: Sho Koike (小池翔)
Windows-1251: Shunsuke Sato (佐藤駿介)
Windows-1252: Serina Tai (田井芹奈)
Windows-1253: Takumi Koyama (小山拓美)

So What about Shift_JIS and Friends?

For East Asian encodings
(Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW,...)

data could be shared between //i and case mapping

but case folding for //i only works for ASCII

None of the main Japanese committers thought this was needed anymore
(日本人からもう不要と言われた)

Talk to me if you need it

The Primitive of Primitives: `onigenc_unicode_case_map`

Works for UTF-8, UTF-16[BE|LE], UTF-32[BE|LE]
140 lines long 'monster function'
Same structure as simpler primitives:
- Big while loop, one source character at a time
- Carefully updating ONIGENC_CASE_MODIFIED flag
- Deal with special cases 'by hand'
- Reuse existing data where possible
~30 if/else if/else
Lots of |/& with flag bits
2 gotos
gperf-created hash lookups:
onigenc_unicode_fold_lookup
onigenc_unicode_unfold1_lookup

Reusing Case Folding Data

Onig[uruma|gmo] has data for case folding
Folding is very close to downcase
There is also unfolding (why?), which is close to upcase
That's almost all we need

Folding Data: Before and After

in enc/unicode/9.0.0/casefold.h

/*  before  */
  {0x0041, {1, {0x0061}}},  /*  A → a  */
  {0x00df, {2, {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1, {0x01c6}}},  /*  Ǆ → ǆ  */
  {0x01c5, {1, {0x01c6}}},  /*  ǅ → ǆ  */
  {0xab73, {1, {0x13a3}}},  /*  ꭳ → Ꭳ (Cherokee)  */

/*  after  */
  {0x0041, {1|F|D, {0x0061}}},  /*  A → a  */
  {0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1|F|D|ST|I(8), {0x01c6}}},  /*  Ǆ → ǆ  */
  {0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}},  /*  ǅ → ǆ  */
  {0xab73, {1|F|U, {0x13a3}}},  /*  ꭳ → Ꭳ (Cherokee)  */

Folding Data: Flags

(squeezed into an int where only 2 bits were used)

see enc/unicode.c

/*  data is available here  */
/*  (flags are the same as for options)  */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/*  data is in special additional array  */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/*  index into special array
    (size: around 420 words only)  */
#define I(n) OnigSpecialIndexEncode(n)
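
To illustrate how a length, an index, and several flags can share one integer, here is a bit-packing sketch in Ruby. The flag values for upcase, downcase, titlecase and fold are the ones shown on the slides; the value of UP_SPECIAL and the field widths (3 bits for the count, 10 bits for the index) are assumptions chosen so that the fields end just below bit 13.

```ruby
# Flag bits as shown on the slides.
UPCASE    = 1 << 13
DOWNCASE  = 1 << 14
TITLECASE = 1 << 15
FOLD      = 1 << 19
# Assumed for this sketch; not shown on the slides.
UP_SPECIAL = 1 << 16
COUNT_BITS = 3    # low bits: number of code points in the folding
INDEX_BITS = 10   # next bits: index into the special-case array
COUNT_MASK = (1 << COUNT_BITS) - 1
INDEX_MASK = ((1 << INDEX_BITS) - 1) << COUNT_BITS

def pack_entry(count, flags, special_index = 0)
  count | (special_index << COUNT_BITS) | flags
end

def unpack_entry(word)
  {
    count: word & COUNT_MASK,
    special_index: (word & INDEX_MASK) >> COUNT_BITS,
    flags: word & ~(COUNT_MASK | INDEX_MASK),
  }
end

# ß → ss from the "after" table: 2|F|ST|SU|I(1)
entry = pack_entry(2, FOLD | TITLECASE | UP_SPECIAL, 1)
p unpack_entry(entry)
```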

Small Implementation Detail

(or my attempt at using the Takahashi method)

`upcase`

seems useful

`downcase`

seems useful

`capitalize`

seems useful

`swapcase`

Who has used `swapcase`?

Nobody?

Well, I did, when testing swapcase!

Why `swapcase`?

Python has it ?! (Matz)

To revert accidental Caps Lock output ?! (on Unicode list)

implementing `swapcase`

must be easy
UPPER ⇒ upper
lower ⇒ LOWER

But what about titlecase?

ǲ, ǅ, ǈ, ǋ
ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ
ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ
ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ

Choice 1
`"ǅunGLA".swapcase`
⇓ leave as is
`"ǅUNgla"`

preferred by Unicode Consortium
(never ever need any new standardization)

preserves reversibility
(X.swapcase.swapcase == X)

Choice 2
`"ǅunGLA".swapcase`
⇓ upcase
`"ǄUNgla"`

Choice 3
`"ǅunGLA".swapcase`
⇓ downcase
`"ǆUNgla"`

Choice 4
`"ǅunGLA".swapcase`
⇓ swap
`"dŽUNgla"`

proposed by Nobu (中田さんの提案)

Implemented
swap ⇒`"dŽUNgla"`

useless?, but 'correct'
additional effort for implementation
additional effort for testing
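
The implemented choice can be observed as follows. A sketch assuming Ruby 2.4 or later; code points are printed because the digraph and the two-letter sequence look alike in many fonts.

```ruby
word = "\u01C5unGLA"   # ǅunGLA, starting with the titlecase digraph U+01C5
swapped = word.swapcase

puts swapped
puts swapped.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")

# Reversibility check: only Choice 1 would guarantee that this prints true.
puts swapped.swapcase == word
```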

Commit Date
April 1st, 2016

(April Fools' Day)
Japan Time 20:58:33 ⇒ same date in most timezones
please draw your own conclusions

Testing

Test-Driven Development

Write small example test
Verify that it doesn't work
Implement
Enjoy that it works
Rinse and repeat

Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb

Data-Driven Testing

Test
- every character (except ranges in UnicodeData.txt)
- of every encoding
- for all option combinations
- for (almost) all methods
Data provided by Unicode
Identical to data used for implementation ?!

Files:
test/ruby/enc/test_case_comprehensive.rb

413 tests, 2212391 assertions, 0 failures, 0 errors, 0 skips
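
The idea behind the comprehensive test can be shown for a single encoding. The sketch below is not the actual test file: it takes every ISO-8859-1 byte, uses the UTF-8 result as the reference, and skips characters whose uppercase form does not exist in ISO-8859-1. The method name is made up.

```ruby
# Bytes whose ISO-8859-1 upcase disagrees with the UTF-8 result, ignoring
# results that ISO-8859-1 cannot represent at all (e.g. ÿ → Ÿ).
def latin1_upcase_mismatches
  (0x00..0xFF).reject do |byte|
    latin1 = byte.chr(Encoding::ISO_8859_1)
    expected = latin1.encode("UTF-8").upcase
    begin
      expected.encode("ISO-8859-1") == latin1.upcase
    rescue Encoding::UndefinedConversionError
      true   # target character missing from ISO-8859-1; skip this byte
    end
  end
end

mismatches = latin1_upcase_mismatches
puts "checked 256 bytes, #{mismatches.size} mismatches"
```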

Continuous Integration

Commit early, commit often
- Advice (and scolding) from hardcore Ruby hackers
- Keep code reasonably clean, and motivation high
- More commits → higher chance to attend Ruby Kaigi for free
- But: Don't want to affect Ruby build or execution
Solution:
- Make use of new functionality dependent on special option
- Used :lithuanian (because last to be actually implemented)
- Test with option protection
- Remove option protection

Future:

Ideas, Problems, Questions

In No Particular Order

Character properties
Locale-aware formatting
What to do with encodings?

Character Properties

Unicode provides a wide range of character properties
Most available in Regexp
Does this string contain a Hiragana character?
'Юに코δ' =~ /\p{Hiragana}/
What script is 'Ю'?
sorry, impossible! 不可能!
Currently looking at this with a student, hopefully
- For Ruby ~2.5
- Use less memory
- Faster
- More properties
- More ways to use
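
Until there is a direct way to ask for the script of a character, one workaround is to try a list of candidate scripts one by one through Regexp. This is a brute-force sketch with a hand-picked list of scripts, not a proposal for the future API; the names are made up and Ruby 2.4 or later is assumed.

```ruby
CANDIDATE_SCRIPTS = %w[Latin Cyrillic Greek Hiragana Katakana Han Hangul].freeze

# One precompiled \p{...} pattern per candidate script.
SCRIPT_PATTERNS = CANDIDATE_SCRIPTS.map { |name|
  [name, Regexp.new("\\p{#{name}}")]
}.to_h.freeze

# Returns the first candidate script that matches, or nil if none does.
def script_of(char)
  SCRIPT_PATTERNS.each { |name, pattern| return name if pattern.match?(char) }
  nil
end

"Юに코δ".each_char { |ch| puts "#{ch}: #{script_of(ch)}" }
```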

Locale-Aware Formatting

What I want:

loc = Locale.new 'de-CH' (German as used in Switzerland)

1.2345678E5.to_s ⇒ "123456.78"

1.2345678E5.to_s(loc) ⇒ "123'456,78"
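
Without a `Locale` class (Ruby core has none), the wished-for output can be approximated by passing the separators explicitly. The helper below is hypothetical; it only handles the plain decimal output of `to_s` (no exponent notation) and single-character separators, and the separators are the ones from the slide.

```ruby
def format_number(number, group_separator:, decimal_separator:)
  sign = number.negative? ? "-" : ""
  integer_part, fraction_part = number.abs.to_s.split(".")
  # Group digits in threes from the right by working on the reversed string.
  grouped = integer_part.reverse.scan(/\d{1,3}/).join(group_separator).reverse
  formatted = fraction_part ? "#{grouped}#{decimal_separator}#{fraction_part}" : grouped
  sign + formatted
end

DE_CH = { group_separator: "'", decimal_separator: "," }.freeze

puts format_number(1.2345678E5, **DE_CH)
```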

Well, Just use a Library

Internationalization support in libraries:

Pure Ruby:
C extensions:
- ICU as a gem: icu, ffi-icu

Example: Unicode Normalization

UnicodeUtils
```
UnicodeUtils.nfkc string
```

ActiveSupport::Multibyte

ActiveSupport::Multibyte::Chars.new(string).normalize :kc

TwitterCLDR

TwitterCldr::Normalization::NFKC.normalize string

Native (since Ruby 2.2)
string.unicode_normalize :nfkc

Libraries avoid monkey patching

⇒ not Ruby-like (ライブラリを使うと Ruby らしくない)

Locales and Case Mappings

Possible solution (解決案):

loc = Locale.new 'tr'
'Türkiye'.upcase loc ⇒ 'TÜRKİYE'
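
One way such a locale object could be connected to the existing options is sketched below. Everything here is hypothetical: the class name, the table of languages, and the helper method are made up, and the real design might look quite different.

```ruby
# Hypothetical sketch; no such class exists in Ruby core.
class SketchLocale
  # Languages written in Latin script that use the Turkic i/İ and ı/I pairs.
  CASE_OPTIONS = { "tr" => [:turkic], "az" => [:turkic] }.freeze

  def initialize(tag)
    @language = tag.split("-").first.downcase
  end

  # Options to pass on to upcase/downcase/capitalize/swapcase.
  def case_options
    CASE_OPTIONS.fetch(@language, [])
  end
end

def upcase_for(string, locale)
  string.upcase(*locale.case_options)
end

puts upcase_for("Türkiye", SketchLocale.new("tr"))
puts upcase_for("Türkiye", SketchLocale.new("de-CH"))
```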

Encodings: Less is More?

We discovered flaky support for current encodings
(//i case folding: all encodings not at end of
test/ruby/enc/test_regex_casefold.rb)
The world is moving to Unicode
Matz wants to move to UTF-8, slowly but steadily
Do we let other encodings die slowly?
Or get rid of them in a single step (Ruby3.0?)

Acknowledgments

Kimihito Matsui (松井仁人) and many other students for help with research and implementations
Openclipart for the World icon
Yui Naruse (成瀬ゆい), Nobu Nakada (中田伸悦) and many other Ruby committers for help and support
Matz (まつもとゆきひろ) for Ruby, a programmer's best friend
Amaya, Opera 12.17, and coderay for slide production and display
The IME Pad for easy character input

Conclusions

Full Unicode case mapping (mostly) implemented
- Options for backward compatibility, special conventions, case folding
- Space efficient implementation by reusing Regexp data
- Available in Ruby trunk now, please test!
More internationalization work needed
Tell me what you want most

Q & A

Send questions and comments to Martin Dürst
(mailto:duerst@it.aoyama.ac.jp)
or open a bug report or feature request

The latest version of this presentation is available at:

http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/
