Internationalization in Ruby 1.9

IUC 33, San Jose, CA, U.S.A., October 2009

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University


© 2009 Martin J. Dürst, Aoyama Gakuin University

Outline

Abstract

Ruby is a purely object-oriented scripting language that is easy for beginners to learn and highly appreciated by experts for its productivity and depth. Internationalization of Ruby took a big leap forward this January with the release of Ruby 1.9.1, the first stable release of the Ruby 1.9 series. While previous versions of Ruby mostly treated text data as byte sequences, strings in Ruby 1.9 are sequences of characters. Because Ruby tags each string with encoding information internally, different applications can choose different internationalization models.

The presentation will give a short overview of Ruby as a programming language and introduce the new internationalization features in detail. We will concentrate on how to use Ruby with Unicode, which in Ruby's case means UTF-8. We will also discuss internationalization support in Ruby on Rails, the popular Web application framework written in Ruby.

Assumptions

Talk assumes that you know

Slides available at http://www.sw.it.aoyama.ac.jp/2009/pub/IUC33-ruby1.9/

Some parts of this talk are based on:

What is Ruby?

Ruby History

Ruby Highlights

The Past: Internationalization in Ruby 1.8

The problem:

Good news:

Now much better: Ruby 1.9

Internationalization in Ruby 1.9: The Architecture

In Japan, the term Multilingualization (M17N) is mostly used instead of Internationalization (I18N)

Original proposal: Yukihiro Matsumoto and Masahiko Nawate, Multilingual Text Manipulation Method for Ruby Language, IPSJ Journal, Vol. 46, No. 11, Nov. 2005 (in Japanese).

UCS: Universal Code Set

CSI: Code Set Independent

In Defense of the CSI Approach

Main Problems of CSI

Internationalization in Ruby 1.9: Functionality

String Basics

String Iteration

Transcoding: String#encode

Converting a string to another encoding:
s2 = string.encode(to_enc)
Explicit from-encoding:
s2 = s.encode(to_enc, from_enc)
Convert to default_internal (by default no errors):
s2 = s.encode
Options:
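s.encode(..., invalid: :replace, undef: :replace)
(options for how to handle transcoding errors: invalid byte sequences in the source, characters undefined in the target encoding)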
Destructive conversion:
s1.encode!(...)
 
Changing encoding without changing bytes:
s.force_encoding enc
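
The calls above can be tried out with a short script. This is only a sketch: the sample string and the variable names are made up for illustration, and the error-handling options shown (undef: :replace, replace:) are those of current Ruby releases.

```ruby
# "Dürst", written with a \u escape so the literal is tagged UTF-8
# regardless of the source file encoding.
utf8 = "D\u00FCrst"

# Transcoding: same characters, different bytes.
latin1 = utf8.encode("ISO-8859-1")
puts latin1.encoding        # ISO-8859-1
puts utf8.bytesize          # 6 (ü takes two bytes in UTF-8)
puts latin1.bytesize        # 5 (ü takes one byte in ISO-8859-1)

# Explicit from-encoding: the bytes are read as ISO-8859-1,
# whatever encoding the string happens to be tagged with.
raw = [0x44, 0xFC, 0x72, 0x73, 0x74].pack("C*")   # tagged ASCII-8BIT
from_latin1 = raw.encode("UTF-8", "ISO-8859-1")
puts from_latin1 == utf8    # true

# Error handling: ü does not exist in US-ASCII, so replace it.
ascii = utf8.encode("US-ASCII", undef: :replace, replace: "?")
puts ascii                  # D?rst

# Destructive conversion changes the receiver itself.
copy = utf8.dup
copy.encode!("UTF-16BE")
puts copy.bytesize          # 10

# force_encoding changes only the tag; the bytes stay as they are.
relabelled = latin1.dup.force_encoding("UTF-8")
puts relabelled.bytes.to_a == latin1.bytes.to_a   # true
puts relabelled.valid_encoding?                   # false: 0xFC is not valid UTF-8
```

Note the difference between the last two cases: encode! rewrites the bytes so that the characters stay the same, while force_encoding keeps the bytes and may leave a string that is no longer valid in its tagged encoding.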

Setting the Source File Encoding

IMPORTANT: Declare your source encoding

Using the Source File Encoding

Internationalization of Ruby Identifiers

Unicode Character Escapes

Input and Output

Default Encodings

Setting Default Encodings

Internationalization in Ruby 1.9: Behind the Scenes

What Defines an Encoding

Important Encodings

These three are built in, others are loaded dynamically

US-ASCII

ASCII-8BIT

UTF-8

Other Encodings Supported

Tell me if you need something else!

Why New Transcoding Library

Transcoding Library Architecture

Conversion Data Example

One table per byte
(Shift_JIS ⇒ UTF-8, leading byte)

Byte value    Action
0x00
...
0x61 ('a')    Copy input (0x61, 'a')
...
0xC3 ('ﾃ')    output "\xEF\xBE\x83"
...
0xE0          Go to table for 0xE0...
...

Problems with one Table per Byte

Two Tables per Byte

Two Tables Example

(Shift_JIS ⇒ UTF-8, leading byte)

offsets

Byte value    offset
0x00          0
...           0
0x61 ('a')    0
...
0xC3 ('ﾃ')    23
...
0xE0          42
...

infos

offset    info
0         0
...
23        output "\xEF\xBE\x83"
...
42        goto table for 0xE0...
...
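
The two-table lookup can be modelled in a few lines of Ruby. This is a simplified illustration, not the transcoding library's actual data layout: the constant and method names (OFFSETS, INFOS, action_for, convert_single_byte) and the symbols :copy, :output, :goto are hypothetical, and only the rows from the example are filled in (plus the assumption that all bytes 0x00 to 0x7F share offset 0).

```ruby
# offsets: one small entry per byte value, pointing into infos.
OFFSETS = Array.new(256)                   # nil = row not filled in for this sketch
(0x00..0x7F).each { |b| OFFSETS[b] = 0 }   # assumed: ASCII range shares one info
OFFSETS[0xC3] = 23
OFFSETS[0xE0] = 42

# infos: each distinct action is stored only once.
INFOS = {
  0  => [:copy],                       # info 0 in the example: copy the input byte
  23 => [:output, "\xEF\xBE\x83"],     # UTF-8 bytes of U+FF83 (halfwidth katakana TE)
  42 => [:goto, :table_for_0xE0],      # leading byte of a two-byte character
}.freeze

# Look up the action for a leading byte: byte -> offset -> info.
def action_for(byte)
  offset = OFFSETS[byte]
  offset && INFOS.fetch(offset)
end

# Convert a byte that forms a complete character on its own.
# Returns nil if more input is needed or the row is not filled in.
def convert_single_byte(byte)
  kind, arg = action_for(byte)
  case kind
  when :copy   then byte.chr.force_encoding("UTF-8")
  when :output then arg.dup.force_encoding("UTF-8")
  end
end

p convert_single_byte(0x61)   # "a"
p convert_single_byte(0xC3)   # the halfwidth katakana TE
p action_for(0xE0)            # [:goto, :table_for_0xE0]
```

Because many byte values map to the same action, the offsets table can use small entries while each full-size info is stored only once.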

Advantages of Two Tables per Byte

Next Steps for Transcoding

Unicode in Ruby on Rails

Internationalization in Ruby 1.9: Advice

Future Work

Acknowledgements

Conclusions

Questions & Answers

Colophon

or how these slides were produced,
and how they are best viewed