Tutorial

Internationalization and Localization
in Ruby and Ruby on Rails

http://www.sw.it.aoyama.ac.jp/2012/pub/RubyRails/

IUC 36, Santa Clara, CA, U.S.A., 22 October 2012

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

Ruby Programming Language

© 2012 Martin J. Dürst, Aoyama Gakuin University

Overview

Abstract

Ruby is a purely object-oriented scripting language designed to make programming fun and efficient. Ruby on Rails is the groundbreaking web application framework built using the programming language Ruby. This tutorial will help you understand the basics for internationalization and localization in Ruby and Ruby on Rails.

The tutorial will start with a discussion of how character encoding works in Ruby and how to make the best use of it both in throw-away scripts and in long-running applications. We will show how in Ruby, all character encodings are equal, but UTF-8 is more equal than others, and should be used with preference.

Ruby on Rails also preferably uses UTF-8, because this is the best choice for web applications. Ruby on Rails comes with its own internationalization and localization framework. As is typical for Ruby on Rails, this framework is very simple but easily extensible. We will discuss both the basic framework and several helpful extensions, e.g. for handling timeliness or for translating user interface texts.

The tutorial assumes that participants have some experience with programming and Web applications. Experience with Ruby and/or Ruby on Rails is a plus, but is not a precondition for attending.

[Text appearing in gray is commentary that does not show up in presentation mode. The best way to view the slides as they were presented is with Opera, pressing F11.]

 

Introductions

 

Goals

 

Conventions Used

(Ruby) code is green, monospace
puts "Hello Ruby!"

Variable parts are orange
puts "some string"

Encodings are indicated with a subscript

Results are indicated with "⇒": 1 + 1 ⇒ 2

 

Frequent Example Юに코δ

 

Ruby

Ruby - A Programmer's Best Friend

 

Basic Ruby

5.times { puts "Hello Ruby!" }

 

Ruby Versions

 

Ruby 1.8 and before

 

Ruby 1.9 and later

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative in introducing new Unicode versions as bug fixes. New versions therefore only get added on a minor version upgrade (e.g. 1.9.1, 1.9.2, 1.9.3,...).

We still can get 10 as a result with:
"Юに코δ".bytesize ⇒ 10

 

Ruby Implementations

This tutorial is about MRI/C-Ruby, which is the unofficial standard

 

Ruby Internationalization Goals

 

String Representation

Strings are instances of the String class

Each string is:

  1. A sequence of characters
  2. Together with a character encoding
    (e.g.: US-ASCII, UTF-8, GB18030, ISO-8859-1, Shift_JIS,...)
    "Юに코δ".encoding.to_s ⇒ "UTF-8"

Fundamentally different from other programming languages.

Java, JavaScript: UTF-16; Python: 8-bit/UTF-16/UTF-32; Perl: 8-bit/UTF-8; C/C++: mb/wc
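A minimal sketch of the "characters plus encoding" model described on this slide. The constant names and the helper describe are assumptions made for this illustration. The example string is written with \u escapes (it spells Юに코δ), so the snippet does not depend on the encoding of the source file.

```ruby
# The tutorial's example string; \u escapes always produce UTF-8.
UTF8_EXAMPLE = "\u042E\u306B\uCF54\u03B4"

# Same bytes with a different label: force_encoding changes only the
# encoding tag of the (duplicated) string, never its bytes.
LATIN1_VIEW = UTF8_EXAMPLE.dup.force_encoding("ISO-8859-1")

# Hypothetical helper: summarize how Ruby sees a string.
def describe(string)
  { encoding: string.encoding.to_s,
    characters: string.length,
    bytes: string.bytesize }
end

p describe(UTF8_EXAMPLE)
p describe(LATIN1_VIEW)
```

Both strings hold the same ten bytes. Only the encoding label decides whether Ruby counts four characters or ten.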

 

Objects with Encoding

 

Setting an Encoding

(Very simplified, more details later)

 

Terminology: Transcoding

 

Terminology: Other

 

Encoding Clash

"Юに코δ"UTF-8 + "Юに코δ"UTF-8
⇒ "Юに코δЮに코δ"UTF-8

"Юに코δ"UTF-8 + "Юに코δ"UTF-16
⇒ Encoding::CompatibilityError

Another example:
"Dürst"ISO-8859-1 == "Dürst"ISO-8859-2 ⇒ false

Trying to combine strings with different encodings, as here with concatenation (+), leads to an exception. There are some exceptions (sic!) to this rule that we will look at later. The reasoning for the error here is that transcoding should not happen without the programmer being aware of it.

Trying to compare two character-by-character identical strings in different encodings will produce false, even if these strings are, as in the above example, also byte-for-byte identical. Again, the reason for the result is that encoding mismatches should be detected early. In addition, a simple byte-for-byte comparison could produce false positives.
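The two clashes described above can be reproduced with a short sketch. The constant names and the helper concat_outcome are assumptions made for this illustration. The strings are built with escapes and force_encoding instead of the subscript notation used on the slides.

```ruby
UTF8_TEXT  = "\u042E\u306B\uCF54\u03B4"
UTF16_TEXT = UTF8_TEXT.encode("UTF-16")

# Identical bytes (0xFC is u-umlaut in both encodings), different labels.
LATIN1_NAME = "D\xFCrst".dup.force_encoding("ISO-8859-1")
LATIN2_NAME = "D\xFCrst".dup.force_encoding("ISO-8859-2")

# Hypothetical helper: name of the resulting encoding, or of the
# exception class if the concatenation is refused.
def concat_outcome(a, b)
  (a + b).encoding.to_s
rescue Encoding::CompatibilityError => e
  e.class.to_s
end

puts concat_outcome(UTF8_TEXT, UTF8_TEXT)
puts concat_outcome(UTF8_TEXT, UTF16_TEXT)
puts LATIN1_NAME == LATIN2_NAME
```

If the two strings really are meant to be combined, one of them has to be transcoded explicitly first, for example with UTF16_TEXT.encode("UTF-8").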

 

String Pieces

Strings can be cut up, and iterated over, in four ways:

Lines: "Юに코δ".lines.to_a
["Юに코δ"]

Characters: "Юに코δ".chars.to_a
["Ю", "に", "코", "δ"]

Codepoints: "Юに코δ".codepoints.to_a
[1070, 12395, 53076, 948]

Bytes: "Юに코δ".bytes.to_a
[208, 174, 227, 129, 171, 236, 189, 148, 206, 180]

As the result of chars shows, Ruby treats characters as Strings of length 1. They were integers up to Ruby 1.8. Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints.

each_line, each_char, each_codepoint, and each_byte are older names for lines, chars, codepoints, and bytes. The methods lines, chars, codepoints, and bytes return Enumerators. Here we just use to_a to produce arrays. Enumerators can be used directly for iteration with each, or with separate iterators for mapping with map, selection with select/reject.

The need for several enumerators on a single object, resulting from the change of string/character representation between Ruby 1.8 and 1.9, was one of the main motivators for introducing enumerators into Ruby. This is an interesting example of how internationalization concerns can affect more 'fundamental' language features. The orthogonality resulting from separating what to enumerate over and how to iterate results in very expressive code.
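A minimal sketch of separating what to enumerate over (characters, codepoints) from how to iterate (map, reject). The helper names are assumptions made for this illustration. The each_* forms are used without a block because they return Enumerators in all Ruby versions from 1.9 on. The sample string spells Юに코δ.

```ruby
SAMPLE = "\u042E\u306B\uCF54\u03B4"

# Enumerate codepoints, then map each one to a "U+XXXX" label.
def codepoint_labels(string)
  string.each_codepoint.map { |cp| format("U+%04X", cp) }
end

# Enumerate characters, then map each one to its size in bytes.
def bytes_per_character(string)
  string.each_char.map { |ch| ch.bytesize }
end

# Enumerate characters, then drop the ASCII ones.
def non_ascii_characters(string)
  string.each_char.reject { |ch| ch.ascii_only? }
end

p codepoint_labels(SAMPLE)
p bytes_per_character(SAMPLE)
p non_ascii_characters("a\u03B4b")
```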

 

Operations on Strings

Fully internationalized:

Not internationalized:

 

Regular Expressions

Oniguruma: Very potent regular expression engine (caution: fork between Ruby Oniguruma and main branch)

Literals: /.../options

Encoding options: n: ASCII, e: EUC-JP, s: SJIS, u: UTF-8

For UTF-8, implements character classes/scripts

However, not fully UTS#18 (Unicode Regular Expressions) compatible

 

Regular Expression Details

UTS #18 recommends: \s, \p{space}, \p{Whitespace} → Unicode whitespace

In Ruby: \s → ASCII whitespace; \p{space}, \p{Whitespace} → Unicode whitespace

Examples:

"abc def" =~ /\s/ ⇒ 3 (i.e. found)
"abc\u00A0def" =~ /\s/ ⇒ nil
"abc\u00A0def" =~ /\p{space}/ ⇒ 3 (i.e. found)
"abc\u00A0def" =~ /\p{Whitespace}/ ⇒ 3 (i.e. found)

Keeping \s to mean ASCII whitespace only was done for backwards compatibility. This can be explained as follows: If somebody wrote a script doing some processing where they wanted to match ASCII whitespace characters, they used \s. If Ruby changed \s to suddenly match more characters than before, the meaning of that program would change. Maybe it would change just in the right way. But there's also a good chance that it would change in ways not intended by the programmer. (See also https://bugs.ruby-lang.org/issues/7154.)
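The difference between \s and \p{Space} can be checked with a short sketch. The helper names are assumptions made for this illustration. U+00A0 (NO-BREAK SPACE) serves as the non-ASCII whitespace character, as in the examples above.

```ruby
# Index of the first ASCII whitespace character, or nil.
def ascii_space_index(text)
  text =~ /\s/
end

# Index of the first Unicode whitespace character, or nil.
def unicode_space_index(text)
  text =~ /\p{Space}/
end

p ascii_space_index("abc def")
p ascii_space_index("abc\u00A0def")
p unicode_space_index("abc\u00A0def")
```

Code that has to split or trim text containing non-ASCII whitespace therefore needs to ask for \p{Space} explicitly.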

 

Encodings

 

Encodings as Objects

String to encoding:
e = Encoding.find "WinDOwS-1252"

Encoding to string: e.to_s ⇒ "Windows-1252"

Methods most often accept both Strings and Encodings to indicate encodings

Constants for Encodings: Encoding::Windows_1252, Encoding::WINDOWS_1252

(We see that encoding names are case-insensitive, and are canonicalized. The internal canonicalization is specific to each encoding, and can start with a lower-case letter. Encodings are also exposed as constants in the Encoding module. There is always a constant with all letters in uppercase. There is also a constant that is the same or close to the internal canonical name, but starts with an uppercase letter. For the constants, hyphens are converted to underscores.)
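A minimal sketch of the points on this slide: case-insensitive lookup, canonical names, the two constant spellings, and methods that accept either a String or an Encoding. The helper names are assumptions made for this illustration.

```ruby
# Look up an encoding by (case-insensitive) label and return its canonical name.
def canonical_name(label)
  Encoding.find(label).to_s
end

# String#encode accepts the target encoding as a String or as an Encoding object.
def encoded_bytes(string, encoding)
  string.encode(encoding).bytes.to_a
end

puts canonical_name("WinDOwS-1252")
puts Encoding::Windows_1252.equal?(Encoding::WINDOWS_1252)
p encoded_bytes("A", "UTF-16BE")
p encoded_bytes("A", Encoding::UTF_16BE)
```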

 

Information about Encodings

List all encodings:

Encoding.list.join("\n")

Count the encodings supported:

Encoding.list.length

Ruby currently supports close to 100 encodings

 

Properties of Encodings

 

Encoding Aliases

Aliases are only separate names, not separate encodings

To list all encoding names:

Encoding.name_list.join("\n")

To list all (active) aliases:

Encoding.aliases.each { |a, e| puts a + " => " + e }
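A small sketch showing that an alias is only another name for the same Encoding object. The helper aliases_of is an assumption made for this illustration, and CP1252 is used as one well-known alias of Windows-1252.

```ruby
# Hypothetical helper: all alias names that point to the given encoding.
def aliases_of(encoding)
  Encoding.aliases.select { |_alias_name, target| target == encoding.name }.keys.sort
end

p aliases_of(Encoding::Windows_1252)

# Looking up the alias returns the very same object, not a copy.
puts Encoding.find("CP1252").equal?(Encoding::Windows_1252)
```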

 

ASCII-compatible Encodings

Most, but not all encodings are ASCII-compatible:

Encoding::UTF_8.ascii_compatible?
⇒ true

Encoding::UTF_16.ascii_compatible?
⇒ false

ASCII-compatible means ASCII characters stay ASCII bytes:

Encoding::Shift_JIS.ascii_compatible?
⇒ true

Encoding::ISO_2022_JP.ascii_compatible?
⇒ false
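What "ASCII characters stay ASCII bytes" means can be seen by transcoding a pure ASCII string and looking at the bytes. The helper ascii_bytes_preserved? is an assumption made for this illustration.

```ruby
# True if "abc" is still the three bytes 0x61 0x62 0x63 after transcoding.
def ascii_bytes_preserved?(encoding_name)
  "abc".encode(encoding_name).bytes.to_a == [97, 98, 99]
end

puts ascii_bytes_preserved?("Shift_JIS")
puts ascii_bytes_preserved?("UTF-16BE")

p "abc".encode("UTF-16BE").bytes.to_a
```

In UTF-16BE every ASCII character is preceded by a zero byte, so byte-oriented code that looks for ASCII delimiters cannot be applied to it directly.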

 

Non-ASCII-compatible Encodings

List of non-ASCII-compatible encodings:

Encoding.list.reject(&:ascii_compatible?)

UTF-16BE, UTF-16LE, UTF-16

UTF-32BE, UTF-32LE, UTF-32

UTF-7

ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-KDDI

CP50220, CP50221

 

Almost ASCII-compatible Encodings

 

Dummy Encodings

List of dummy encodings:

Encoding.list.select(&:dummy?)

UTF-16, UTF-32, UTF-7

ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-KDDI

CP50220, CP50221

Dummy encodings are encodings which are treated as opaque byte sequences. But they are labeled, and may be transcoded to some other encodings.
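A short sketch of "labeled and transcodable": data in a dummy encoding can be carried around with its label and converted to a fully supported encoding when it needs to be processed. The helper round_trip_via is an assumption made for this illustration.

```ruby
# Transcode text into the named encoding and back to UTF-8.
def round_trip_via(encoding_name, text)
  text.encode(encoding_name).encode("UTF-8")
end

puts Encoding::UTF_16.dummy?
puts Encoding::UTF_16BE.dummy?

labeled = "abc".encode("UTF-16")
puts labeled.encoding
puts round_trip_via("UTF-16", "abc")
```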

 

Exceptional Encodings

 

Important Encodings

In Ruby, all encodings are equal, but some encodings are more equal than others:

[Adapted from George Orwell's Animal Farm.]

 

US-ASCII Encoding

 

ASCII-Only Data

ASCII-only data is frequent in programs.

Ruby treats this specially.

This also makes it easy for programmers who don't think about encoding when working only with US-ASCII. Internally, Ruby caches whether a string is ASCII-only or not, to increase performance.

"Data"SJIS.ascii_only? ⇒ true

"データ"SJIS.ascii_only? ⇒ false

 

No Clashes

ASCII-only data does not cause Encoding::CompatibilityError,
even if encoding isn't US-ASCII

"Юに코δ"UTF-8 + "Data"SJIS ⇒ "Юに코δData"UTF-8

"Data"SJIS + "Юに코δ"UTF-8 ⇒ "DataЮに코δ"UTF-8

Works similarly for other string operations.

Testing for Clashes

Test with Encoding.compatible?(string1, string2)
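A minimal sketch of testing for clashes before combining strings. The constant names and the helper compatible_encoding_name are assumptions made for this illustration, and Shift_JIS stands in for the SJIS subscript on the slides. Encoding.compatible? returns the encoding the combined string would have, or nil if the strings clash.

```ruby
UTF8_WORD     = "\u042E\u306B\uCF54\u03B4"
ASCII_SJIS    = "Data".encode("Shift_JIS")
KATAKANA_SJIS = "\u30C7\u30FC\u30BF".encode("Shift_JIS")

# Hypothetical helper: name of the common encoding, or nil on a clash.
def compatible_encoding_name(a, b)
  encoding = Encoding.compatible?(a, b)
  encoding && encoding.to_s
end

p ASCII_SJIS.ascii_only?
p KATAKANA_SJIS.ascii_only?
p compatible_encoding_name(UTF8_WORD, ASCII_SJIS)
p compatible_encoding_name(UTF8_WORD, KATAKANA_SJIS)
p((ASCII_SJIS + UTF8_WORD).encoding)
```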

 

ASCII-8BIT

Alias: BINARY

Use for binary data

Use when high-bit bytes' semantics are unknown (but 7-bit bytes are ASCII)
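A small sketch of labeling data as binary. The constant PNG_SIGNATURE is an assumption made for this illustration. It holds the first eight bytes of a PNG file, which mix high-bit bytes with ASCII letters.

```ruby
# Eight bytes of binary data, explicitly labeled with the BINARY alias.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".dup.force_encoding("BINARY")

puts PNG_SIGNATURE.encoding.equal?(Encoding::ASCII_8BIT)
puts PNG_SIGNATURE.valid_encoding?            # every byte sequence is valid
puts PNG_SIGNATURE.length == PNG_SIGNATURE.bytesize

# ASCII-only strings can still be combined with binary data.
puts (PNG_SIGNATURE + "abc").encoding.equal?(Encoding::BINARY)
```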

 

UTF-8

 

System Encodings

 

Source Encoding

 

Non-ASCII Identifiers

π = 3.14...

Possible, but not recommended (possible exceptions: basic education, special terminology)

Problems:

 

Locale Encoding

 

Filesystem Encoding

 

External encoding

 

Internal encoding

 

Encoding Integrity Check

"abc\xFE"UTF-8.valid_encoding? ⇒ false
"abc\xC0\x80"UTF-8.valid_encoding? ⇒ false
"abc\xC2\x80"UTF-8.valid_encoding? ⇒ true
"abc\xCE\xA2"UTF-8.valid_encoding? ⇒ true (there is no uppercase final Sigma (yet!?))

Checks for code structure, not unassigned codepoints
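The four checks above can be run as a short sketch. The helper valid_utf8? is an assumption made for this illustration. It labels a copy of the given bytes as UTF-8 before checking them.

```ruby
# Label a copy of the bytes as UTF-8 and check their structure.
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

p valid_utf8?("abc\xFE")      # 0xFE never occurs in UTF-8
p valid_utf8?("abc\xC0\x80")  # overlong two-byte form
p valid_utf8?("abc\xC2\x80")  # U+0080, well-formed
p valid_utf8?("abc\xCE\xA2")  # U+03A2, well-formed but unassigned
```

Because only the byte structure is checked, validated input may still contain unassigned codepoints such as U+03A2.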

 

How to Avoid Invalid Data

 

Transcoding Overview

 

Transcoding String to String

Unfortunately, force_encoding is destructive; there is no non-destructive equivalent.

 

Explicit to and from Encodings

Example: Force UTF-8 "double-encoding" (not recommended)

"Юに코δ"UTF-8.force_encoding('iso-8859-1').encode('UTF-8')

Equivalent:

"Юに코δ"UTF-8.encode('UTF-8', 'iso-8859-1')

Note order of arguments: to, from

(I always disliked the order of arguments in iconv, but it was unavoidable here, because usually, just the 'to' encoding is needed, so that has to come first.)
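A minimal sketch comparing the two forms above. The helper names are assumptions made for this illustration. Since force_encoding is destructive, the first helper works on a copy. The sample string spells Юに코δ.

```ruby
ORIGINAL = "\u042E\u306B\uCF54\u03B4"

# Relabel a copy as ISO-8859-1, then transcode to UTF-8.
def double_encode_via_relabel(string)
  string.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Same effect in one call: encode(to, from) ignores the string's own label.
def double_encode(string)
  string.encode("UTF-8", "ISO-8859-1")
end

puts double_encode(ORIGINAL) == double_encode_via_relabel(ORIGINAL)
puts double_encode(ORIGINAL).length   # one character per original byte

# The damage can be undone by reversing the two steps.
puts double_encode(ORIGINAL).encode("ISO-8859-1").force_encoding("UTF-8") == ORIGINAL
```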

 

Transcoding Failures

 

Fallback Options

Additional options for newline conversion and XML escaping
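The bullet points of this slide are not preserved here, so the following is only a sketch of options that String#encode accepts for replacement, XML escaping and newline conversion. The helper names are assumptions made for this illustration.

```ruby
# Without options, an unconvertible character raises an exception.
def to_latin1_strict(string)
  string.encode("ISO-8859-1")
end

# Replace invalid or unconvertible characters instead of raising.
def to_latin1_lossy(string)
  string.encode("ISO-8859-1", invalid: :replace, undef: :replace, replace: "?")
end

# Escape &, < and > for use as XML text.
def xml_text(string)
  string.encode(xml: :text)
end

# Convert CR LF and CR line ends to LF.
def unix_newlines(string)
  string.encode(universal_newline: true)
end

p to_latin1_lossy("D\u00FCrst \u042E").bytes.to_a
puts xml_text("a<b&c")
p unix_newlines("a\r\nb\rc")
```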

 

Transcoding on Input and Output

Methods that open files/streams take an encoding: option:

"external": External encoding

"external:internal": External and internal encoding

If 'internal' is undefined (here or otherwise), then external is used for labeling input.

If 'internal' is defined, then conversion takes place.

Caution: Has to be set explicitly for stdin, stdout,...
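A minimal sketch of the two forms of the encoding: option, using a temporary file that contains "Dürst" in ISO-8859-1. The helper names and the file name prefix are assumptions made for this illustration. The labeling behavior shown assumes that no default internal encoding is set.

```ruby
require "tempfile"

# "external" only: the bytes are read unchanged and labeled.
def read_labeled(path, external)
  File.open(path, "r", encoding: external) { |file| file.read }
end

# "external:internal": the data is transcoded while reading.
def read_transcoded(path, external, internal)
  File.open(path, "r", encoding: "#{external}:#{internal}") { |file| file.read }
end

Tempfile.create("latin1-sample") do |file|
  file.binmode
  file.write("D\xFCrst".b)   # five bytes, 0xFC is u-umlaut in ISO-8859-1
  file.close

  labeled   = read_labeled(file.path, "ISO-8859-1")
  converted = read_transcoded(file.path, "ISO-8859-1", "UTF-8")
  puts labeled.encoding, labeled.bytesize
  puts converted.encoding, converted.bytesize
end
```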

 

Operations on IOs

 

Command-line Options

 

Adding an Encoding or Transcoding

 

Encoding Models Overview

 

Encoding Models for Applications

An encoding model describes which encoding(s) can be used in what part of an application (e.g. externally, internally). It defines the conditions and restrictions with respect to string processing that the application (programmer) has to maintain. When creating a Ruby application, it is important to choose the appropriate encoding model.

 

Any Single Encoding

 

One Single Encoding

For applications that use more character semantics outside the ASCII range, or that keep data for a long time, the ideal solution is to use only a single encoding. This should be UTF-8 because that covers all of Unicode and works best with Ruby.

 

One Single Internal Encoding

If data (e.g. files) in other encodings also have to be handled, it will in most cases be best to adopt the model of most other programming languages: Use Unicode (i.e. UTF-8 for Ruby) inside, and convert on input/output.

 

Mixed Encodings

Ruby would also allow the creation of an application using many different encodings internally at the same time. However, this is not what the encoding model of Ruby was created for, and it should be avoided if at all possible.

 

Encoding Models for Libraries

Important: Don't touch system encodings (except maybe for a framework)

 

Libraries for Unicode Support

Ruby leaves open some important i18n support:

These can be done with:

 

UnicodeUtils

By Stefan Lang, in pure Ruby

Example:

require "unicode_utils/upcase"

UnicodeUtils.upcase("weiß") => "WEISS"

UnicodeUtils.upcase("i", :tr) => "İ"

 

ActiveSupport

 

Ruby on Rails

 

Rails and Encodings

 

Rails I18n Framework - Message Translation

 

Text Replacement: Templates

In templates, replace

<h1>Hello Rails!</h1>

with

<h1><%= t 'welcome.rails' %></h1>

welcome.rails is a structured key used for looking up the translated string
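To illustrate what "structured key" means, here is a plain Ruby sketch that resolves a dotted key against nested translation data. This is not the Rails implementation. The constant TRANSLATIONS, its contents (including the German text) and the helper lookup are hypothetical and only mimic the shape of a locale file.

```ruby
# Hypothetical translation data, nested by locale and key segments.
TRANSLATIONS = {
  "en" => { "welcome" => { "rails" => "Hello Rails!" } },
  "de" => { "welcome" => { "rails" => "Hallo Rails!" } }
}

# Walk down the nested hashes, one key segment at a time.
# Returns nil if any segment is missing.
def lookup(locale, key)
  key.split(".").inject(TRANSLATIONS[locale]) do |node, segment|
    node.is_a?(Hash) ? node[segment] : nil
  end
end

puts lookup("en", "welcome.rails")
puts lookup("de", "welcome.rails")
p lookup("en", "welcome.missing")
```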

 

Text Replacement: Backends

 

Possible Future Work

 

Q & A

Send questions and comments to Martin Dürst, duerst@it.aoyama.ac.jp

A new version of this tutorial is available at:

http://www.sw.it.aoyama.ac.jp/2015/pub/RubyRails