Ups and Downs of Ruby Internationalization

http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/

Ruby Kaigi 2016, Kyoto, Japan, September 8, 2016

Martin J. DÜRST

(Please call me Martin)

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

Ruby Programming Language

© 2016 Martin J. Dürst, Aoyama Gakuin University

Abstract

Currently many of Ruby's String methods, such as upcase and downcase, are limited to ASCII and ignore the rest of the world. This is finally going to change in Ruby 2.4, where this functionality will be extended to cover full Unicode. You will get to know what will change, how your programs may be affected, and how these changes are implemented behind the scenes. We will also look at the overall state of internationalization functionality in Ruby, and potential future directions.

For Best Viewing

These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

 

Introduction

Outline

 

Some Conventions Used

Code is mostly green, monospace
puts "Hello Ruby!"

Variable parts are orange
puts "some string"

Encoding is indicated with a subscript
"Юに코δ"UTF-8, "ユニコード"SJIS

Results are indicated with "⇒"
1 + 1 ⇒ 2

 

Audience Self-Intro

 

Speaker Self-Intro

 

Contributions to Ruby

Mainly in the following areas:

 

Ruby Versions and Unicode Versions

Year (y)   Ruby version (VRuby)         Unicode version (VUnicode)
           published around Christmas   published in summer
2014       2.2                          7.0.0
2015       2.3                          8.0.0
2016       2.4                          9.0.0 (as of yesterday)

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.

RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '9.0.0'

VUnicode = y - 2007

VRuby = 1.5 + VUnicode · 0.1

VUnicode = VRuby · 10 - 15

Don't extrapolate too far!
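
The rule of thumb above can be written out as a small Ruby sketch. The method names are made up for illustration, and the formulas are only known to hold for the three years in the table.

```ruby
# Sketch of the rule of thumb from the slide; only checked against 2014-2016.
def unicode_major_version(year)
  year - 2007
end

# Ruby version as a string, using integer tenths to avoid float rounding.
def ruby_version_for(unicode_major)
  tenths = 15 + unicode_major
  "#{tenths / 10}.#{tenths % 10}"
end

[2014, 2015, 2016].each do |year|
  v = unicode_major_version(year)
  puts "#{year}: Ruby #{ruby_version_for(v)}, Unicode #{v}.0.0"
end
```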

 

Ups and Downs:

Upcase and Downcase

Case Conversions in Ruby 2.3

 

Case Conversions NOT in Ruby 2.3

But in Ruby 2.4!

 

Case Conversion Around the World

 

Case Distinction History

 

Modern Case Usage

(details vary by language)

German:
der Gefangene floh (the prisoner fled), but
der gefangene Floh (the captive flea)

 

Isn't ASCII-only Case Conversion Enough?

 

But: Backwards Compatibility?

 

Backwards Compatibility: Problems to Look Out For

 

Backwards Compatibility: :ascii Option

Use this option when you really do not want to convert non-ASCII characters

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'

'Юрий Соколов'.upcase
⇒ 'ЮРИЙ СОКОЛОВ'

'Юрий Соколов'.upcase :ascii
⇒ 'Юрий Соколов'

 

Where Do We Get the Data From?

Data and other specifications available from the Unicode Consortium:

UnicodeData.txt

CaseFolding.txt

SpecialCasing.txt

 

Special Cases: Not 1-to-1

 

Special Case: Simple Case Mapping

Not implemented!

 

Special Case: Turkic

 

Special Case: Lithuanian

 

Special Case: Case Folding

 

Special Case: Titlecase

 

Special Cases: There are More

Not implemented (yet?)

 

Implementation

 

Methods to Implement

String (functional) String (destructive) Symbol
upcase upcase! upcase
downcase downcase! downcase
capitalize capitalize! capitalize
swapcase swapcase! swapcase

Not dealt with: String#casecmp
Why: Includes sorting, very difficult for all of Unicode

 

string.c: Init_String

Tells Ruby what C functions to use for each method

rb_define_method(rb_cString, "upcase", rb_str_upcase, -1);
rb_define_method(rb_cString, "downcase", rb_str_downcase, -1);
rb_define_method(rb_cString, "capitalize", rb_str_capitalize, -1);
rb_define_method(rb_cString, "swapcase", rb_str_swapcase, -1);

rb_define_method(rb_cString, "upcase!", rb_str_upcase_bang, -1);
rb_define_method(rb_cString, "downcase!", rb_str_downcase_bang, -1);
rb_define_method(rb_cString, "capitalize!", rb_str_capitalize_bang, -1);
rb_define_method(rb_cString, "swapcase!", rb_str_swapcase_bang, -1);

rb_define_method(rb_cSymbol, "upcase", sym_upcase, -1);
rb_define_method(rb_cSymbol, "downcase", sym_downcase, -1);
rb_define_method(rb_cSymbol, "capitalize", sym_capitalize, -1);
rb_define_method(rb_cSymbol, "swapcase", sym_swapcase, -1);

 

string.c: sym_upcase

static VALUE
sym_upcase(int argc, VALUE *argv, VALUE sym)
{
    return rb_str_intern(
               rb_str_upcase(argc, argv, rb_sym2str(sym)));
}

Equivalent in Ruby:

class Symbol
  def upcase(*args)
    to_s.upcase(*args).to_sym
  end
end

 

string.c: rb_str_upcase

static VALUE
rb_str_upcase(int argc, VALUE *argv, VALUE str)
{
    str = rb_str_dup(str);
    rb_str_upcase_bang(argc, argv, str);
    return str;
}

Equivalent in Ruby:

class String
  def upcase(*args)
    str = dup
    str.upcase!(*args)  # returns nil if nothing changed, so ignore the result
    str
  end
end

 

string.c: rb_str_upcase_bang

Here the real work starts

static VALUE
rb_str_upcase_bang(int argc, VALUE *argv, VALUE str)
{
    ...
    /*  set flags for upcase  */
    OnigCaseFoldType flags = ONIGENC_CASE_UPCASE;  
    /*  check options  */
    flags = check_case_options(argc, argv, flags);
    ...
    /*  shortcuts for ASCII-only  */
    if (...) { ...  }
    else if (flags&ONIGENC_CASE_ASCII_ONLY)
        rb_str_ascii_casemap(str, &flags, enc);
    else  /*  actual hard work  */
        str_shared_replace(str, rb_str_casemap(str,&flags,enc));

    if (ONIGENC_CASE_MODIFIED&flags) return str;
    return Qnil;
}

 

include/ruby/oniguruma.h: OnigCaseFoldType

Flags used to indicate operation needed
(upcase/downcase/capitalize/swapcase):

#define ONIGENC_CASE_UPCASE     (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE   (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE  (1<<15) /* titlecase mapping */

Usage to indicate operation type:

upcase:      ONIGENC_CASE_UPCASE
(upcasing needed)

downcase:    ONIGENC_CASE_DOWNCASE
(downcasing needed)

capitalize:  ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE
(changed to   ONIGENC_CASE_DOWNCASE after first character)

swapcase:    ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE
(both upcasing and downcasing needed)
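
To make the flag combinations concrete, here is a minimal Ruby model of them. The bit values are copied from the #define lines above; the constant names and the helper flags_after_first_character are made up for this sketch and only mirror what the slide says about capitalize, not the actual C code.

```ruby
# Bit values copied from the #define lines above.
CASE_UPCASE    = 1 << 13
CASE_DOWNCASE  = 1 << 14
CASE_TITLECASE = 1 << 15

# Initial flags per operation, as listed on the slide.
INITIAL_FLAGS = {
  upcase:     CASE_UPCASE,
  downcase:   CASE_DOWNCASE,
  capitalize: CASE_TITLECASE | CASE_UPCASE,
  swapcase:   CASE_UPCASE | CASE_DOWNCASE
}.freeze

# Hypothetical helper: what capitalize does to its flags once the first
# character has been handled (titlecase/upcase off, downcase on).
def flags_after_first_character(flags)
  return flags if (flags & CASE_TITLECASE).zero?
  (flags & ~(CASE_TITLECASE | CASE_UPCASE)) | CASE_DOWNCASE
end

puts flags_after_first_character(INITIAL_FLAGS[:capitalize]) == CASE_DOWNCASE
```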

 

string.c: check_case_options

Common to upcase/downcase/capitalize/swapcase

Checks case options, sets flags, or produces error messages

Possible options:

Corresponding flags:

#define ONIGENC_CASE_FOLD                (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI  (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN     (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY          (1<<22) /* limited to ASCII */

 

string.c: rb_str_casemap

Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")

Common to all casing operations

 

string.c: rb_str_casemap

static VALUE
rb_str_casemap(VALUE source, OnigCaseFoldType *flags, rb_encoding *enc)
{
    /*  general preparations  */
    while (source_current < source_end) {
        /*  calculate next buffer length  */
        size_t capa = (source_end - source_current)
               * ++buffer_count + 20;
        ...  /*  prepare and link next buffer  */
        buffer_length_or_invalid = enc->case_map(flags,
               &source_current, source_end,
               current_buffer->space,
               current_buffer->space+current_buffer->capa,
               enc);
        ...  /*  check for errors (invalid input string)  */
    }

    /*  prepare for copy to final location  */
    while (...)
        /*  copy current_buffer and move to next buffer  */
    /*  cleanup and return  */
}
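
The buffer strategy in the outline above can be modelled in a few lines of Ruby. This is only a sketch under assumptions: the method name is made up, the block stands in for enc->case_map applied to one character, and it assumes a single mapped character always fits into an empty buffer (true here, since every buffer holds at least 20 bytes).

```ruby
# Hypothetical model of the buffer chain in rb_str_casemap. The block plays
# the role of enc->case_map for a single character.
def chunked_casemap(source)
  chars = source.chars
  pos = 0
  buffers = []
  while pos < chars.length
    remaining_bytes = chars[pos..-1].sum(&:bytesize)
    # Same growth rule as on the slide: later buffers get more generous.
    capa = remaining_bytes * (buffers.length + 1) + 20
    buffer = String.new(encoding: source.encoding)
    while pos < chars.length
      mapped = yield chars[pos]
      break if buffer.bytesize + mapped.bytesize > capa  # buffer full, start a new one
      buffer << mapped
      pos += 1
    end
    buffers << buffer
  end
  # Copy all buffers to the final location.
  buffers.join
end

puts chunked_casemap("straße") { |c| c.upcase }
```

Making each buffer proportionally larger than the previous one keeps the number of buffers small even when the output grows much more than expected.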

 

enc->case_map and Encoding Primitives

[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)

 

Implementation Choice: UTF-8 only or Primitive

 

Implementation Choice: New or Reused Primitive

⇒ New primitive

 

The case_map Primitive

 

A Simple case_map Primitive (enc/iso_8859_1.c)

static int case_map (...)
{
    /*  initializations  */

    while (*pp<end && to<to_end) {
        code = *(*pp)++;
        if (code==SHARP_s)
            /*  German ß special case  */
        else if (/* have upper case && want lower case */)
            code += 0x20, flags |= ONIGENC_CASE_MODIFIED;
        else if (/* lower without upper: ª, º, µ, ÿ */)  ;
        else if ((EncISO_8859_1_CtypeTable[code]&BIT_CTYPE_LOWER)
                 && (flags&ONIGENC_CASE_UPCASE))
            code -= 0x20, flags |= ONIGENC_CASE_MODIFIED;
        *to++ = code;
        if (flags&ONIGENC_CASE_TITLECASE)
            /*  titlecase → lowercase for capitalize  */
    }
    /*  cleanup and return  */
}

Good example when creating a new primitive!
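
For readers who prefer Ruby to C, the same logic can be sketched on ISO-8859-1 bytes. This is an illustrative model, not a translation of enc/iso_8859_1.c: the method names are made up, only upcase and downcase are covered, and mapping ß to "SS" when upcasing is an assumption about the elided special case.

```ruby
SHARP_S = 0xDF

# A-Z, and À-Þ except the multiplication sign ×.
def latin1_upper?(code)
  (0x41..0x5A).cover?(code) || ((0xC0..0xDE).cover?(code) && code != 0xD7)
end

# a-z, and à-þ except the division sign ÷. Lowercase letters without an
# uppercase counterpart in this encoding (ª, º, µ, ÿ) are not included.
def latin1_lower_with_upper?(code)
  (0x61..0x7A).cover?(code) || ((0xE0..0xFE).cover?(code) && code != 0xF7)
end

# mode is :upcase or :downcase; input and output are UTF-8 strings.
def latin1_case_map(str, mode)
  out = []
  str.encode("ISO-8859-1").each_byte do |code|
    if code == SHARP_S && mode == :upcase
      out << 0x53 << 0x53          # assumed: ß becomes "SS"
    elsif latin1_upper?(code) && mode == :downcase
      out << code + 0x20
    elsif latin1_lower_with_upper?(code) && mode == :upcase
      out << code - 0x20
    else
      out << code                  # everything else is copied unchanged
    end
  end
  out.pack("C*").force_encoding("ISO-8859-1").encode("UTF-8")
end

puts latin1_case_map("Résumé", :upcase)
```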

 

More case_map Primitives and Who Wrote Them

Students (sophomores/juniors/seniors) at Aoyama Gakuin University

 

So What about Shift_JIS and Friends?

For East Asian encodings
(Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW,...)

data could be shared between //i and case mapping

but case folding for //i only works for ASCII

None of the main Japanese committers thought this was needed anymore

Talk to me if you need it

 

The Primitive of Primitives: onigenc_unicode_case_map

 

Reusing Case Folding Data

 

Folding Data: Before and After

in enc/unicode/9.0.0/casefold.h

/*  before  */
  {0x0041, {1, {0x0061}}},  /*  A → a  */
  {0x00df, {2, {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1, {0x01c6}}},  /*  DŽ → dž  */
  {0x01c5, {1, {0x01c6}}},  /*  Dž → dž  */
  {0xab73, {1, {0x13a3}}},  /*  ꭳ → Ꭳ (Cherokee)  */

/*  after  */
  {0x0041, {1|F|D, {0x0061}}},  /*  A → a  */
  {0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1|F|D|ST|I(8), {0x01c6}}},  /*  DŽ → dž  */
  {0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}},  /*  Dž → dž  */
  {0xab73, {1|F|U, {0x13a3}}},  /*  ꭳ → Ꭳ (Cherokee)  */

 

Folding Data: Flags

(squeezed into an int where only 2 bits were used)

see enc/unicode.c

/*  data is available here  */
/*  (flags are the same as for options)  */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/*  data is in special additional array  */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/*  index into special array
    (size: around 420 words only)  */
#define I(n) OnigSpecialIndexEncode(n)
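
How flags, length, and special-array index can share one integer is sketched below in Ruby. The flag bit values come from the earlier slides; the position and width of the index field (shift 3, 10 bits) and all Ruby names are assumptions made for this illustration.

```ruby
# Flag bits as given on the earlier slides.
CASE_DOWNCASE  = 1 << 14
CASE_TITLECASE = 1 << 15
CASE_FOLD      = 1 << 19

# Assumed layout (not shown on the slide): the low 3 bits hold the length,
# the next 10 bits hold the index into the special array.
SPECIAL_INDEX_SHIFT = 3
SPECIAL_INDEX_BITS  = 10
LENGTH_MASK         = (1 << SPECIAL_INDEX_SHIFT) - 1
SPECIAL_INDEX_MASK  = ((1 << SPECIAL_INDEX_BITS) - 1) << SPECIAL_INDEX_SHIFT

# Stand-in for the I(n) macro.
def special_index_encode(n)
  n << SPECIAL_INDEX_SHIFT
end

def decode_entry(packed)
  {
    length:            packed & LENGTH_MASK,
    special_index:     (packed & SPECIAL_INDEX_MASK) >> SPECIAL_INDEX_SHIFT,
    fold:              (packed & CASE_FOLD) != 0,
    downcase:          (packed & CASE_DOWNCASE) != 0,
    titlecase_special: (packed & CASE_TITLECASE) != 0
  }
end

# The DŽ entry from the "after" table: 1|F|D|ST|I(8)
packed = 1 | CASE_FOLD | CASE_DOWNCASE | CASE_TITLECASE | special_index_encode(8)
p decode_entry(packed)
```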

 

Small Implementation Detail

(or my attempt at using the Takahashi method)

upcase

seems useful

downcase

seems useful

capitalize

seems useful

swapcase

Who has used swapcase?

Nobody?

Well, I did, when testing swapcase!

Why swapcase?

Python has it ?! (Matz)

To revert accidental Caps Lock output ?! (on Unicode list)

implementing swapcase

must be easy
UPPER ⇒ upper
lower ⇒ LOWER

But what about titlecase?

Dz, Dž, Lj, Nj
ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ
ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ
ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ

Choice 1
"DžunGLA".swapcase
⇓ leave as is
"DžUNgla"

preferred by Unicode Consortium
(never ever need any new standardization)

preserves reversibility
(X.swapcase.swapcase == X)

Choice 2
"DžunGLA".swapcase
⇓ upcase
"DŽUNgla"

Choice 3
"DžunGLA".swapcase
⇓ downcase
"džUNgla"

Choice 4
"DžunGLA".swapcase
⇓ swap
"dŽUNgla"

proposed by Nobu (Nakada-san)

Implemented
swap ⇒ "dŽUNgla"

useless?, but 'correct'
additional effort for implementation
additional effort for testing

Commit Date
April 1st, 2016

(April Fools' Day)
Japan Time 20:58:33 ⇒ same date in most timezones
please draw your own conclusions

Testing

Test-Driven Development

Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb

Data-Driven Testing

Files:
test/ruby/enc/test_case_comprehensive.rb

413 tests, 2212391 assertions, 0 failures, 0 errors, 0 skips
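
The data-driven idea can be illustrated with a few lines in UnicodeData.txt format. This is a sketch, not the content of test_case_comprehensive.rb: the sample is embedded instead of read from the real file, only simple one-to-one mappings are checked, and the method names are made up.

```ruby
# Four lines in UnicodeData.txt format (semicolon-separated fields;
# fields 12, 13, 14 are the simple upper-, lower-, and titlecase mappings).
SAMPLE_UNICODE_DATA = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
  00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
  042E;CYRILLIC CAPITAL LETTER YU;Lu;0;L;;;;;N;CYRILLIC CAPITAL LETTER IU;;;044E;
DATA

# Converts a hexadecimal code point such as "00E9" to a UTF-8 string.
def to_char(hex)
  [hex.to_i(16)].pack("U")
end

# Returns a list of mismatches between the data and String#upcase/downcase.
def case_mapping_failures(data)
  failures = []
  data.each_line do |line|
    fields = line.chomp.split(";", -1)
    char = to_char(fields[0])
    upper = fields[12]
    lower = fields[13]
    failures << "upcase U+#{fields[0]}" if !upper.empty? && char.upcase != to_char(upper)
    failures << "downcase U+#{fields[0]}" if !lower.empty? && char.downcase != to_char(lower)
  end
  failures
end

puts case_mapping_failures(SAMPLE_UNICODE_DATA).empty? ? "all mappings match" : "mismatches found"
```

Because every assertion is generated from the data, the same test picks up new characters automatically when the Unicode version is updated.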

 

Continuous Integration

 

Future:

Ideas, Problems, Questions

In No Particular Order

 

Character Properties

 

Locale-Aware Formatting

What I want:

loc = Locale.new 'de-CH'  # German as used in Switzerland

1.2345678E5.to_s ⇒ "123456.78"

1.2345678E5.to_s(loc) ⇒ "123'456,78"
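
Since no Locale class exists in Ruby core, the wished-for output can only be approximated by hand today. The following sketch uses a made-up helper with the separators passed in explicitly; it handles non-negative numbers only and is not a proposal for the actual API.

```ruby
# Hypothetical helper: group the integer part in threes and swap the
# decimal separator. Defaults follow the de-CH example on the slide.
def format_grouped(number, group: "'", decimal: ",")
  int_part, frac_part = number.to_s.split(".")
  grouped = int_part.reverse.scan(/\d{1,3}/).join(group).reverse
  frac_part ? "#{grouped}#{decimal}#{frac_part}" : grouped
end

puts format_grouped(1.2345678E5)
```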

 

Well, Just use a Library

Internationalization support in libraries:

 

Example: Unicode Normalization

Libraries avoid monkey patching

⇒ using a library is not Ruby-like

 

Locales and Case Mappings

Possible solution:

loc = Locale.new 'tr'
'Türkiye'.upcase loc
⇒ 'TÜRKİYE'
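
Until something like Locale exists, a similar effect can be had with the option symbols Ruby 2.4 already accepts. In this sketch the lookup table and the method upcase_for are made up for illustration; only String#upcase with :turkic or :lithuanian is existing Ruby functionality.

```ruby
# Hypothetical mapping from language tags to options Ruby 2.4 already accepts.
CASE_OPTION_FOR_LANGUAGE = {
  "tr" => :turkic,      # Turkish
  "az" => :turkic,      # Azerbaijani
  "lt" => :lithuanian   # Lithuanian
}.freeze

# Upcases a string according to the primary language subtag, if we know it.
def upcase_for(string, language_tag)
  primary = language_tag.split("-").first.downcase
  option = CASE_OPTION_FOR_LANGUAGE[primary]
  option ? string.upcase(option) : string.upcase
end

puts upcase_for("Türkiye", "tr")
puts upcase_for("Türkiye", "de-CH")
```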

 

Encodings: Less is More?

 

Acknowledgments

 

Conclusions

 

Q & A

Send questions and comments to Martin Dürst
(mailto:duerst@it.aoyama.ac.jp)
or open a bug report or feature request



The latest version of this presentation is available at:

http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/