31st Internationalization and Unicode Conference, October 2007, San Jose, CA, USA

Internationalization of the Ruby Scripting Language

Martin J. Dürst
Department of Integrated Information Technology,
College of Science and Engineering, Aoyama Gakuin University
Sagamihara, Kanagawa, Japan

Keywords: Ruby (Scripting Language), Internationalization, Transcoding Policy, Unicode


Ruby is a purely object-oriented scripting language that is rapidly growing in popularity due to its high productivity. Because it was invented in Japan, some basic internationalization features are available, but there is still a lot of work to do. This paper gives a short overview of the most important features of Ruby, and introduces the available internationalization-related features, concentrating on how to use Ruby with Unicode, which in Ruby's case means UTF-8. An analysis and outlook of planned directions for further internationalization work is also given.

An up-to-date version of this paper as well as slides used for the talk connected to this paper are available at http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-ruby.

1 Introduction

A programming language should make it possible to write programs easily and conveniently, at a high abstraction level, with results that are compact and readable. In the view of the author of this paper, as well as many others, the scripting language Ruby has managed to move a significant step closer to this goal. Programming is fun again, more fun than ever.

Programming languages, and in particular scripting languages, deal with text, and in this day and age of the World Wide Web, text is no longer just 7-bit characters. Internationalization and Unicode support are crucial for everyday programming tasks.

This paper discusses various aspects of Ruby internationalization. We start with a short overview of Ruby itself. Then we discuss the core internationalization features of Ruby 1.8, the current stable version, in Section 3, including shortcomings. In Section 4, we look at the CSI (Character Set Independent) model underlying the internationalization architecture for future versions of Ruby, contrast it with the UCS (Universal Character Set) model that is often used in combination with Unicode, and discuss the shortcomings of the CSI approach. In Section 5 we describe the progress towards implementing the new internationalization architecture in Ruby 1.9. In Section 6, we propose transcoding policies as a measure to implement the UCS model on top of the CSI architecture and to improve the usability of the CSI model.

2 The Ruby Scripting Language

This section gives a short overview of the Ruby scripting language, with particular attention to the features relevant for internationalization. Ruby's core functionality is close to that of Perl [Perl], and it borrows many syntactical idioms from Perl, but avoids many of Perl's shortcomings. Ruby also uses concepts from programming languages such as Smalltalk and Lisp. The language is continuously being improved, with work on internationalization one of the main items in the current cycle.

2.1 History and Importance

Ruby is the creation of Yukihiro Matsumoto, who started working on it in 1993. The idea was to create a purely object-oriented scripting language that was easy to use. In the view of this author and many others, this was highly successful.

Ruby started to become more widely known in 1999 in Japan, and around 2000, it was 'discovered' for the rest of the world by Dave Thomas, who wrote the standard Pickaxe book, now in its second edition [RubyPrag]. If you are already familiar with a programming language and with the basic concepts of object-oriented programming, then this is the first Ruby book that you should buy.

Ruby became even more well-known in 2004 when David Heinemeier Hansson published the first version of Ruby on Rails [Rails], a high-productivity Web application framework that is based on the principle of convention over configuration and takes extensive advantage of Ruby.

Ruby comes with an extensive range of built-in classes, standard libraries, a framework for installing extensions called RubyGems, a debugger, a profiler, an interactive shell, documentation tools, a testing framework, and many other productivity tools. Ruby has been ported to a large number of platforms, starting with Linux/Unix and including Microsoft Windows (native and Cygwin) as well as many others.

In the meantime, there are already several implementations of the Ruby interpreter. Besides the main implementation in C, there is JRuby, implemented in Java, Rubinius, based on the Smalltalk-80 VM architecture, and IronRuby, implemented on the Microsoft Common Language Runtime. Many well-known programmers in the agile software community use Ruby, and there is even a book that predicts that Ruby will replace Java [JavaRuby].

2.2 Syntax

One of the features that makes Ruby so attractive is its concise syntax. This is achieved on two levels. On a micro-level, parentheses around method arguments and semicolons at the end of a line can be left out. This is achieved at the cost of higher parser complexity, something which does not bother the Ruby programmer. It often produces syntax that is more reminiscent of configuration settings than an actual programming language. This kind of style is also called Internal Domain Specific Language [DSL], Internal referring to the fact that this DSL is just a natural part of the host language Ruby.

At a higher level, concise syntax is achieved by using convenience methods on the root class Object and a default object that forms the context at the start of a script. In Java, printing to standard output has to be written

System.out.println("Hello World!");

after creating an application class to execute this command. In Ruby, this is simply written

print "Hello World!"

without the need for any further code. This allows for Perl-like one-liners while keeping the object-oriented concepts clean.

For control structures, Ruby mostly uses familiar syntax, although constructs such as if/elsif/else and while use an explicit end rather than brackets or implicit delimitation. A control structure particular to Ruby, and widely used, is the block. A block is a function-like piece of code that can be passed to a method and then can be called repeatedly by this method. A very simple example is the times method on integers:

5.times { print "Hello World!" }

This repeats the statement in the block (between { and }) five times. In object-oriented terms, the object 5 is asked to execute the block as many times as 'itself'. Blocks are widely used for iteration, for higher-order functions/methods, and for providing optional behavior.
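How a method calls a block that was passed to it can be sketched as follows. The method name each_with_position is made up for this illustration; yield is the standard Ruby keyword for invoking the block given to the current method.

```ruby
# Hypothetical helper: calls the given block once per element,
# passing the element and its position.
def each_with_position(items)
  position = 0
  items.each do |item|
    yield item, position   # run the caller's block
    position += 1
  end
end

lines = []
each_with_position(["Hello", "World"]) { |word, i| lines << "#{i}: #{word}" }
puts lines
```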

As the 5.times example above showed, being fully object oriented means that even integers and strings are objects, and that methods can directly be applied to constants. So it is possible to write "Hello".upcase (returning the string HELLO).

Another syntactical feature frequently used is string interpolation. Using the delimiters #{ and } within a string makes it possible to insert replacement strings into the original string. This feature is not limited to variable values (as e.g. in Perl), but can also be used with method results. As an example, the following short program

who = "Unicode Conference"
print "Hello #{who}!"

will print Hello Unicode Conference!

2.3 Type System and Extensibility

Ruby has a very flexible type system. All classes are derived from the Object root class. Each object is an instance of some class, and the class defines what methods the object can execute.

In contrast to objects, variables do not have types, they are just references to objects. As a consequence, data structures such as Arrays can contain mixtures of objects. Here is an example of an Array constant that mixes integers, floating point numbers, and strings:

[1, 2, 3.14159, "four", "five and a quarter"]

Because variables are not typed, they also do not need to be declared. Variable scoping is lexical, but a variable lives as long as it may be used, e.g. in a block. In the computer science literature, this is called a closure.

What makes Ruby even more flexible is the fact that methods are called strictly by name, independent of the class hierarchy. This is called duck typing, because any object, of whatever class, is considered a duck as long as it walks and quacks like a duck. This is different from languages such as C++ and Java, where a method has to be declared in a common superclass in order for this method to be used with objects of different types.
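A minimal sketch of duck typing, with class names invented for this illustration: the two classes share no superclass other than Object, yet both can be used wherever a quack method is expected.

```ruby
# Two unrelated classes (hypothetical names) that both respond to quack.
class Duck
  def quack
    "Quack!"
  end
end

class Robot
  def quack
    "Beep, quack."
  end
end

# No common superclass or interface declaration is needed:
# the call succeeds for any object that has a quack method.
sounds = [Duck.new, Robot.new].map { |thing| thing.quack }
puts sounds
```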

While the class hierarchy can only be extended, not changed, any class (and with it its subclasses) can easily be extended by adding new methods. Also, existing methods can be redefined. This is often combined with renaming the existing method and calling it from the new definition to avoid losing the existing functionality. This kind of extensibility is crucial, because it reduces the 'reinventing the wheel' phenomenon: It is not necessary to reimplement or subclass a class just to add some features. This kind of extensibility also serves as a great catalyst and testbed for new methods that may eventually become built-in methods.
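The following sketch illustrates both kinds of extension on a class invented for this purpose (Greeter and its methods are not part of any library): a method is added, and an existing method is redefined after its old definition has been kept under a new name.

```ruby
# Hypothetical class standing in for a library class we do not control.
class Greeter
  def greet(name)
    "Hello #{name}"
  end
end

# Reopen the class: add one method and redefine another.
class Greeter
  def shout(name)
    greet(name).upcase
  end

  alias_method :plain_greet, :greet   # keep the old definition reachable

  def greet(name)
    plain_greet(name) + "!"           # new behavior built on the old one
  end
end

g = Greeter.new
puts g.greet("Unicode Conference")    # Hello Unicode Conference!
puts g.shout("Ruby")                  # HELLO RUBY!
```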

Modules can be used to add a collection of methods to a class, in a scaled-down version of multiple inheritance. They also serve as a namespacing mechanism.

Ruby also makes it easy to write glue code to a library written in C, providing additional extensibility and a performance boost when needed.

2.4 Metaprogramming

Metaprogramming means that a program can program itself, i.e. change itself. The basic working mode of the Ruby interpreter is that class and method definitions are just program code that creates classes and methods instead of integers, Strings, Arrays and the like.
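As a small sketch of this working mode, the class below (a made-up example, unrelated to the langtag RubyGem described later) generates its accessor methods in a loop that runs while the class body is being evaluated. define_method is a standard method of Ruby classes.

```ruby
# Hypothetical class: the accessor methods are generated by ordinary
# code that runs while the class body is being evaluated.
class LanguageInfo
  [:language, :script, :region].each do |part|
    define_method(part) { @parts[part] }                        # reader
    define_method("#{part}=") { |value| @parts[part] = value }  # writer
  end

  def initialize
    @parts = {}
  end
end

info = LanguageInfo.new
info.script = "Latn"
puts info.script                               # Latn
puts LanguageInfo.method_defined?(:region=)    # true
```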

Metaprogramming is used frequently in Ruby to increase extensibility, to write more compact and consistent programs, and to provide DSL functionality.

3 Current Ruby Internationalization

This section describes the state of Ruby internationalization for Ruby 1.8, the currently stable and prevalent version. The latest subversion is 1.8.6, but there are no significant changes regarding internationalization within version 1.8.

3.1 Core Features

In Ruby 1.8, String objects are treated just as sequences of bytes. Accessing a String with array-style indexing returns the corresponding byte as an integer, which can quickly lead to data corruption. Some support for treating strings as sequences of characters is available with regular expressions. There are four different character encoding modes for regular expressions:
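The byte-oriented view and the risk of corruption can be sketched as follows. This is an illustration under an explicit assumption: it uses getbyte, bytesize, byteslice and valid_encoding?, which belong to Ruby versions released after 1.8 (where plain indexing and length behaved in this byte-wise manner), so it does not run on Ruby 1.8 itself.

```ruby
s = [0x9752, 0x5C71].pack('U*')   # "青山": two characters, six UTF-8 bytes

puts s.bytesize                   # 6, the value length returned in Ruby 1.8
puts s.getbyte(0)                 # 233 (0xE9), first byte of the first character

# Cutting at a byte offset that is not a character boundary
# leaves an incomplete UTF-8 sequence.
broken = s.byteslice(0, 2)
puts broken.valid_encoding?       # false
```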

  1. A generic single-byte mode, used for raw byte sequences and US-ASCII, and also usable for other single-byte encodings.
  2. An EUC mode, usually used for EUC-JP, but also usable for character encodings such as EUC-KR.
  3. A Shift_JIS mode.
  4. A UTF-8 mode.

These modes can be set on regular expressions by using the option letters n (none), e (EUC), s (SJIS), and u (UTF-8). As an example, the regular expression /.../n matches three bytes, while /.../u matches three UTF-8 characters. According to Matsumoto, UTF-8 was added around 1997, which is quite early for Japanese software.

The UTF-8 mode indicates that for Ruby, Unicode means UTF-8. Other encoding forms such as UTF-16 are not really under consideration. For people used to working with UTF-16, this may be unfortunate, but having all Unicode support be centered around a single encoding definitely eliminates many problems, and leads to a concentration of forces. For a scripting language, the US-ASCII compatibility of UTF-8 [PropUTF8] is definitely a big asset.

The above modes can also be set globally with the -K option or by setting the $KCODE global variable. The 'K' is derived from Kanji. However, these features have to be used with care. Due to their global nature, they may easily affect libraries where another setting is desired.

The -K option and the $KCODE global variable have a few other effects. They influence what can be used as a variable name. With $KCODE = 'utf-8', for example, any sequence of (UTF-8-encoded) non-ASCII characters can be used as a variable name. A similar feature is available in XML, and is very helpful for education [XML2003] and for dealing with complex local terminologies [Murata]. Also, the above settings may affect whether strings are being escaped or not on output.

Another feature helpful when working with UTF-8 is the pair formed by the pack method for Array and the unpack method for String. They make it possible to convert between an array of integers, representing Unicode scalar values, and a string encoded in UTF-8. As an example, [0x9752, 0x5C71].pack('U*') produces the string '青山'. This can also be used for detecting UTF-8; for details, see [RubyWay, Section 4.2.3].
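A short sketch of the conversion in both directions, using only the 'U*' and 'C*' templates of pack and unpack:

```ruby
codepoints = [0x9752, 0x5C71]
text = codepoints.pack('U*')            # scalar values -> UTF-8 string
puts text                               # 青山
puts text.unpack('U*') == codepoints    # true: the conversion round-trips
puts text.unpack('C*').length           # 6: the same string seen as bytes
```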

3.2 Standard Library

For more support, the programmer has to refer to libraries. The standard jcode library uses the built-in regular expression support to add and change some methods of the String class to make them work with character semantics. As an example, string.length returns the string length in bytes, whereas string.jlength, added by the jcode library, returns the length in characters. string.chop! is one of the modified methods, and removes the last character, rather than the last byte, if the jcode library is loaded. However, not all methods that one would expect are changed.

Some code conversion facilities are also available as standard libraries. nkf (Network Kanji Filter) is a library well-known in Japan, and is available directly from Ruby. A higher-level interface, called Kconv, based on NKF, is also available, and adds some conversion methods directly to the String class. However, these libraries only cover Japanese encodings (including parts of UTF-8), and by default also perform MIME decoding, which should be dealt with separately.

Another standard conversion library available in Ruby is iconv. This is a simple interface to the POSIX iconv library. However, it is only supported to the extent that an iconv library is installed natively, and can therefore not be relied upon.

Because iconv in general covers a much wider range of character encodings, but nkf/Kconv are preferred by Japanese programmers for how they handle some of the conversion details, we have explored the possibility of a higher-level wrapper over these two libraries, with added programmer convenience [Shima2007]. An alpha-quality version of this work has been published as [sconv].

3.3 Other Libraries

The list of other libraries that provide parts of the functionality necessary for internationalization is long. Hal Fulton [RubyWay, Chapter 4] provides a good start, including short explanations of the basic issues, which makes it recommended reading for people without experience in internationalization.

A library for doing Unicode normalization by Masato Yoshida is available at http://www.yoshidam.net/unicode-0.1.tar.gz, with a description at http://www.yoshidam.net/unicode.txt. However, this is based on Ruby 1.4 and Unicode as of 1999, and so is seriously outdated.

Another Unicode library, based on Unicode version 4.1.0, by Rob Leslie, is available at ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2. It redefines various operators for normalization and other operations.

The library currently most up-to-date for basic Unicode support is most probably ActiveSupport::Multibyte, part of Ruby on Rails 1.2. See http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html for more documentation. It supports Unicode semantics for strings encoded in UTF-8. Instead of just writing string.length, which gives the length in bytes, string.chars.length will give the string length in terms of characters. The chars method creates a proxy for each string, and delegates the methods to a handler that knows about the encoding. http://wiki.rubyonrails.org/rails/pages/HowToUseUnicodeStrings gives a short introduction for how to use Unicode in Ruby on Rails, which is also useful for Ruby in general. Another useful resource for using Unicode in Ruby on Rails is http://ruby.org.ee/wiki/Unicode_in_Ruby/Rails.

Another approach is ICU4R, providing ICU bindings for Ruby [ICU4R]. This provides a lot of functionality, but uses its own classes, e.g. UString instead of String. This means that advanced functionality is available, but at a high overhead for the programmer.

3.4 Language Tag Support

The author has implemented two small libraries, described in this and the next subsection, both available as RubyGems. The langtag RubyGem [langtag] provides parsing, wellformedness checking, and read/write accessors by component for IETF language tags [BCP47]. Below is a little example program showing some usage examples. The results are indicated with => in comments (starting with #).

require 'langtag'
tag = Langtag.new('de-Latn-ch')
tag.script                      # => 'Latn'
tag.wellformed?                 # => true
tag.region = 'at'
print tag                       # => de-Latn-at

3.5 Character Escapes

The charesc RubyGem [charesc] provides character escapes for Ruby strings. Up to version 1.8, Ruby only provides an escape syntax for bytes, \xHH for the hexadecimal case. Using metaprogramming and string interpolation, this limitation can be overcome. In Ruby, the method const_missing is called whenever a constant (an identifier starting with an upper-case letter) is not found, usually just to produce an error message. Redefining this method, we pretend that there are constants of the form UABCD, with the corresponding Unicode character U+ABCD as a value. To simplify syntax, we also allowed repeated sequences of hexadecimal digits, separated by the character u, to represent strings of Unicode characters. Again, we show some usage examples.

require 'charesc'
print "#{U677Eu672Cu884Cu5F18}"       # => 松本行弘
print U9752u5C71u5B66u9662u5927u5B66  # => 青山学院大学
print "Martin J. D#{U00FC}rst"        # => Martin J. Dürst

The third line above shows that the constants do not have to be interpolated, they can be used directly where appropriate. The escapes also work if $KCODE is set to SHIFT_JIS or EUC_JP, but only for characters available in these encodings. For $KCODE set to none/US-ASCII, UTF-8 is produced.
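The const_missing technique can be sketched in a few lines. This is not the source of the charesc RubyGem: the module name EscapeSketch and the exact pattern are assumptions made for this illustration, and the constants are resolved only inside that module rather than globally, so the sketch always produces UTF-8 and ignores $KCODE.

```ruby
# Hypothetical sketch, not the charesc source. Constants are resolved
# inside this module only, so the global namespace is left untouched.
module EscapeSketch
  PATTERN = /\AU(\h{4,6}(?:u\h{4,6})*)\z/

  def self.const_missing(name)
    match = PATTERN.match(name.to_s)
    return super unless match           # unknown constant: usual NameError
    # "9752u5C71" -> ["9752", "5C71"] -> [0x9752, 0x5C71] -> UTF-8 string
    match[1].split('u').map { |hex| hex.to_i(16) }.pack('U*')
  end
end

puts EscapeSketch::U9752u5C71                  # 青山
puts "Martin J. D#{EscapeSketch::U00FC}rst"    # Martin J. Dürst
```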

4 The Ruby Internationalization Masterplan

This section is heavily based on a 2005 paper [M17N] by Matsumoto, as well as direct discussions with him. To those who can read it, the above paper in some parts reads like one of the Unicode critiques that were so popular in Japan in the mid-1990s. A Unicode-centered model for character encoding is rejected in favor of a character set independent model, relegating Unicode to one coded character set among many. The next subsections explain these two models in detail.

4.1 The UCS Model

Character encoding and processing on the Web as well as in many applications and programming languages all follow one and the same basic model, most clearly documented in Section 4.3 of [Charmod]. While this document speaks indirectly, in terms of specifications, the overall idea is very simple: All character processing occurs in (some encoding form of) Unicode; external data is converted to/from this encoding form as necessary, and other character encodings are secondary, simply being used to transmit (subsets of) Unicode. In more abstract terms, this model is called the Universal Character Set model, but in practical terms, Unicode is the only viable character set for this purpose.

Programming languages using the UCS model include Java and C#, which use UTF-16 internally, Perl, which mostly uses UTF-8, and Python, which can be compiled for both UTF-16 and UCS-4.

The main advantages of the UCS model are its conceptual simplicity and the possibility to provide rich functionality based on well-defined character semantics. Also, using a single encoding means that implementations can be highly optimized.

4.2 The CSI Model

The Character Set Independent model tries to treat various different character encodings as equals. This model was popular with big software companies in the 1980s and early 1990s, before a UCS model using Unicode became possible. At that time, the choice of encoding usually happened at compile time. In [M17N], the encoding is a property of a String, and may be different for each string. Further implementation details are discussed in Section 4.3.

Matsumoto gives the following reasons for choosing the CSI model:

  1. Size of code tables for character encoding conversion, and code conversion speed,
  2. Difficulties to deal with encoding variants,
  3. Limitations with respect to characters not encoded in Unicode.

Among these, the memory use for code conversion tables is less and less of an issue, even for devices such as mobile phones, where the newest versions already support Unicode. Conversion speed is also not an issue, because the necessary table lookups are very fast compared to the overhead of a scripting language such as Ruby. The limitations with respect to characters not (yet) encoded in Unicode can be addressed by mapping such characters to one of the private use areas of Unicode; this is easier than implementing a new encoding.

What remains, and what is most often mentioned in direct conversation, are the difficulties with encoding variants. As an example, the encoding usually labeled as Shift_JIS, used on the Windows PC and the Macintosh for over 20 years, has accumulated a lot of variants. These are due to additions and changes in the underlying coded character set, and to the addition of vendor-specific and user-specific characters in the reserved area. Also, various vendors and libraries differ slightly in how they map the core characters of Shift_JIS to Unicode and back. For more details on Shift_JIS variants, please see [XMLja].
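The mapping differences can be made visible with a short experiment. This is a sketch under assumptions: it relies on String#encode and on the encoding labels "Shift_JIS" and "Windows-31J" as found in Ruby versions released after this paper, not in Ruby 1.8. With the conversion tables bundled with recent Ruby releases, the expectation is U+301C (WAVE DASH) for Shift_JIS and U+FF5E (FULLWIDTH TILDE) for Windows-31J; this should be verified on the installation at hand.

```ruby
# Bytes 0x81 0x60 are the "wave dash" position in the Shift_JIS family.
bytes = "\x81\x60"

# Decode the same two bytes under two differently labelled variants.
as_sjis  = bytes.dup.force_encoding("Shift_JIS").encode("UTF-8")
as_cp932 = bytes.dup.force_encoding("Windows-31J").encode("UTF-8")

printf("Shift_JIS:   U+%04X\n", as_sjis.ord)
printf("Windows-31J: U+%04X\n", as_cp932.ord)
puts as_sjis == as_cp932    # false when the two tables disagree
```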

A particularly thorny issue for Japanese encodings (and similar for Korean encodings) is the character encoded as byte 0x5C. It is used both as a backslash (\) and as a Yen symbol (¥). In its syntactic function as a backslash, e.g. in programming languages, it may look as it pleases, but may not change code position. In its function as a currency symbol, it has to look right or economic confusion may result.

While Shift_JIS is a particularly serious example, such issues are not at all limited to Shift_JIS or to Japanese encodings [ICUcompare]. Often, just a few characters out of a few hundred or even a few thousand are affected, and frequently, the characters affected are rare, or the differences are visually minor. Also, many of the issues do not appear in simple round-trip scenarios, i.e. if the same conversion library is used for converting in both directions.

Often, conversion to Unicode would be the right thing, but doing it correctly would require too much work, maybe including manual intervention. Keeping the data in a legacy encoding does not mean that the legacy encoding is better suited for representing the data. But it allows one to gloss over ambiguities that otherwise would have to be resolved explicitly. In particular for a scripting language, where a lot of programs are written for one-time or casual use, the ability to not have to transcode is felt to be an important feature.

Being able to stay vague when needed is a feature often attributed to the Japanese language. Compared to Japanese, using English often means that things that are implicit or unknown have to be made explicit. To some extent, there may be some connection between the use of vagueness in the Japanese language and the desire for glossing over ambiguities in character encodings.

4.3 Implementing the CSI Model

[M17N] describes an implementation of the CSI model. Each string carries its encoding as a property. Each encoding is represented by a C-level data structure containing mostly function pointers. Each function pointer implements a particular primitive for the specific encoding. The data structure also contains a few flags and other information about the encoding in question, such as the length of the shortest and longest byte sequence representing a character. No recompilation is needed except when a new encoding is added, and even in that case, it may be done by dynamic linking, avoiding recompilation of the interpreter itself.

Strings with different encodings can co-exist in a single execution. In many cases, at least two encodings will coexist, a binary encoding (which can also be understood as no encoding) and a single platform-specific encoding. If strings or regular expressions with different encodings are used in the same operation, an exception is generated. Exceptions are not very useful in this case. It is for example impossible to use them to implement a general policy of "convert if you see an encoding conflict". The intention of using exceptions is to make programmers understand and fix their code.
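The behavior described above can be sketched as follows. The sketch assumes the API of Ruby versions released after this paper (String#encode and the exception class Encoding::CompatibilityError), which may differ in detail from the design in [M17N].

```ruby
yama_utf8 = [0x5C71].pack('U*')              # "山" in UTF-8
yama_sjis = yama_utf8.encode("Shift_JIS")    # the same character in Shift_JIS

begin
  yama_utf8 + yama_sjis                      # mixed encodings, non-ASCII data
  outcome = "combined"
rescue Encoding::CompatibilityError => e
  outcome = "rejected: #{e.class}"
end
puts outcome

# The fix belongs in the program: transcode explicitly before combining.
fixed = yama_utf8 + yama_sjis.encode("UTF-8")
puts fixed.length                            # 2
```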

The number of functions that have to be implemented for each encoding is kept very low in order to make implementing new encodings easy. The exact number has been fluctuating over the past few years, but always stayed between 15 and 20. The functions are very low-level, such as "starting from byte p, assuming that this is the start of a byte sequence representing a character, tell me how long this byte sequence is". With these basic functions, it is possible to implement all basic string functionality.

4.4 Critique of the CSI Model

The CSI model is not without problems. This subsection discusses these problems as perceived by the author. A first problem is performance, on multiple levels. Encoding-dependent function dispatch incurs a relatively small performance penalty, because it can be done once per string operation, where the overhead of using an interpreted language is already significant.

More serious performance problems can be expected due to the use of an extremely small set of primitives. As an example, finding the length of a string is currently implemented by repeatedly calling the function to find the next character. Adding a new primitive that calculates string length directly would increase the number of primitives, but increase performance for finding the length of a string in characters. While it is difficult to judge exactly when and where such performance problems will surface, the Law of Leaky Abstractions [JoelSW] virtually guarantees that they will.

The best way to deal with this conflict is to separate between core primitives and extended primitives. The core primitives are those primitives that have to be implemented for each encoding. They can be kept extremely low in number. The extended primitives are operations that are used frequently, can be expressed in terms of the core primitives, but may be performed significantly faster if implemented directly. For each extended primitive, a default implementation in terms of core primitives is made available.

When implementing a new encoding, as soon as the core primitives are implemented, the encoding can be used, because for the extended primitives, the default implementations are used. If the encoding is used more frequently, and if performance problems show up, more and more extended primitives can be implemented for the specific encoding. Overall, this results in a triangle-shaped oblique approach as used for the fast implementation of the Graphic User Interface Application Framework on several windowing platforms [Wei1993].
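The split between core and extended primitives can be modelled in Ruby itself. This is an illustration only: the real structures are C-level records of function pointers, all class and method names below are invented, and the UTF-8 primitive assumes well-formed input.

```ruby
# Illustrative model only; names are hypothetical.
class EncodingSketch
  # Core primitive: byte length of the character starting at index pos.
  # Every encoding has to provide this.
  def char_bytes(bytes, pos)
    raise NotImplementedError
  end

  # Extended primitive with a default implementation
  # expressed in terms of the core primitive.
  def length(bytes)
    count = 0
    pos = 0
    while pos < bytes.size
      pos += char_bytes(bytes, pos)
      count += 1
    end
    count
  end
end

# Implements only the core primitive; length uses the default.
class Utf8Sketch < EncodingSketch
  def char_bytes(bytes, pos)
    lead = bytes[pos]
    if lead < 0x80
      1
    elsif lead < 0xE0
      2
    elsif lead < 0xF0
      3
    else
      4
    end
  end
end

# Also implements the extended primitive directly: no per-character loop.
class SingleByteSketch < EncodingSketch
  def char_bytes(_bytes, _pos)
    1
  end

  def length(bytes)
    bytes.size
  end
end

utf8_bytes = [0x9752, 0x5C71].pack('U*').bytes   # two characters, six bytes
puts Utf8Sketch.new.length(utf8_bytes)           # 2
puts SingleByteSketch.new.length(utf8_bytes)     # 6
```

A new encoding becomes usable as soon as char_bytes exists; a direct length can be added later if profiling shows that the default loop is too slow.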

Another problem with the CSI model is that it leads to a lowest common denominator approach to string functionality, whereas the UCS model tends towards a greatest common divisor approach. In the CSI model, only a minimum of primitives that can be applied to all encodings is implemented. This leaves the functionality virtually at the same level as the old-style interfaces designed for the ASCII-only days. On the other hand, the UCS model encourages covering more and more functionality.

4.5 Importance and Needs of Unicode

Supporting Unicode means more than just making it possible to write UCS-oriented programs on a CSI infrastructure by using UTF-8 instead of Shift_JIS or EUC-JP, for example. Unicode is not just a union of existing character encoding repertoires convenient for round-tripping. It is also not just one coded character set among many. Unicode is at the forefront of encoding rare scripts, it constitutes the lowest level of Web architecture, and it provides a wide range of character-related properties, conventions, and standards. Thorough support for Unicode, not only as an encoding, but including advanced functionality, is expected from any programming language and application platform.

Tim Bray [World30] has helped a lot to make Unicode well known in the Ruby community. Since the implementation of UTF-8 around 1997, Unicode has been on an equal footing with other encodings such as Shift_JIS and EUC-JP. But this is not enough. To paraphrase George Orwell, all encodings are equal, but Unicode (meaning UTF-8 in the case of Ruby) has to become more equal than others.

5 Internationalization in Ruby 1.9

This section looks at the ongoing implementation of internationalization in Ruby 1.9 and beyond. If everything goes well, some basic pieces may be finished by Christmas 2007, but this is not yet certain. This work is ongoing as we speak, so any details should be checked before use.

5.1 Basic Approach

The principal approach in Ruby 1.9 is that, following the lines of the CSI model above, strings and regular expressions carry a character encoding as a property. This encoding is available as a String with the encoding method. The encoding is taken into account for basic string operations such as length or indexing, to make sure that these operations work on characters rather than bytes. As a small but important detail, indexing operations that extract a single character no longer return an integer, but simply a very short string.
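A short sketch of these character-based operations. It reflects the behavior of Ruby versions released after this paper; since the implementation was still in flux at the time of writing, details such as the exact return value of the encoding method are assumptions with respect to the 2007 snapshot.

```ruby
s = [0x9752, 0x5C71].pack('U*')     # "青山"

puts s.encoding                     # UTF-8: the encoding travels with the string
puts s.length                       # 2 characters
puts s.bytesize                     # 6 bytes
puts s[0].class                     # String, not an integer
puts s[0] == [0x9752].pack('U')     # true: a one-character string
```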

For working on bytes, a binary encoding is provided. Currently, as in Ruby 1.8, the binary encoding is also used for US-ASCII. Work is ongoing to distinguish simple binary data from US-ASCII data, at least by checking the range of byte values. This is used to relax the restriction on combining strings with different encodings. A string with an ASCII-compatible encoding that only contains US-ASCII data is considered to be combinable with another string with a US-ASCII compatible encoding. US-ASCII compatible here means that US-ASCII codepoints are expressed as US-ASCII byte values and that the encoding has no switching states.
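The relaxed rule can be sketched as follows, again assuming the API of later released Ruby versions (String#encode and ascii_only?), which may differ from the state of the implementation described here.

```ruby
ascii_in_sjis = "abc".encode("Shift_JIS")      # only US-ASCII bytes
kanji_in_utf8 = [0x9752, 0x5C71].pack('U*')    # non-ASCII data in UTF-8

puts ascii_in_sjis.ascii_only?                 # true

# Allowed although the encodings differ:
# one operand contains nothing but US-ASCII data.
combined = ascii_in_sjis + kanji_in_utf8
puts combined.encoding                         # UTF-8
puts combined.length                           # 5
```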

5.2 Encoding Indicator

The $KCODE global variable is being replaced by a comment convention at the start of a file. The start of the file here means the first line, or the second line if the first line contains a path to the ruby binary with options (the so-called shebang, e.g. #!/usr/local/bin/ruby). As an example, the line

# -*- coding: utf-8 -*-

indicates that the file is encoded in UTF-8. This convention only affects the actual file, not other files. String constants and regular expressions in the file are given this encoding, and the encoding is checked for correctness for identifiers.

5.3 Character Escapes

In Section 3.5, we presented our charesc RubyGem [charesc] that uses metaprogramming to make Unicode-based character escapes with reasonably short syntax available in Ruby 1.8. For Ruby 1.9, the implementation of character escapes in the Ruby core syntax is planned. The syntax allows two styles, \u followed by four hexadecimal digits, and \u followed by one to six hexadecimal digits enclosed in brackets. Here are some of the examples from Section 3.5, rewritten for the new syntax:

print "\u677E\u672C\u884C\u5F18"       # => 松本行弘
print "Martin J. D\u00FCrst"           # => Martin J. Dürst
print "Martin J. D\u{FC}rst"           # => Martin J. Dürst

It is as of yet undecided what these escapes will produce in encodings other than UTF-8. However, it seems clear that no syntax for other encodings (e.g. something like \s for Shift_JIS or \e for EUC-JP) is needed, for many reasons. There is no equivalent of a Unicode scalar value for these encodings. Byte-wise notation is already available, and \x90\xc2 is too close to a potential \s90c2 to warrant implementation. Also, such a scheme would not scale to other encodings.

5.4 Regular Expressions

Ruby 1.9 uses a new (for Ruby) regular expression engine, called Oniguruma [Oniguruma]. Oniguruma is a very powerful regular expression engine, including the capability to use recursion in regular expressions. Oniguruma is relevant to Ruby internationalization in several different ways. First, Oniguruma adopted the CSI approach proposed in [M17N], and this implementation is now used by Ruby, too. The code moved from the Ruby core to Oniguruma, and with the adoption of Oniguruma by Ruby, this code is now in some sense re-imported.

Second, Oniguruma is important because, like the old regular expression implementation in Ruby 1.8, it is well internationalized compared to the rest of Ruby as currently available. In particular, Oniguruma supports a wide range of character properties, including Unicode general categories and scripts. As an example, the regular expression /\p{Sm}\p{Hiragana}/ would match a mathematical symbol followed by a Hiragana character. For details, please see [OniSyntax, Section 3].
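
The property example can be tried out as follows. This is a small sketch assuming UTF-8 strings and a UTF-8 source file; the constant and method names are hypothetical.

```ruby
# The regular expression from the text: a mathematical symbol (general
# category Sm) followed by a Hiragana character.
MATH_THEN_HIRAGANA = /\p{Sm}\p{Hiragana}/

# Hypothetical helper: true if the pattern occurs anywhere in the text.
def math_then_hiragana?(text)
  !(text =~ MATH_THEN_HIRAGANA).nil?
end

puts math_then_hiragana?("\u2211\u3042")  # N-ARY SUMMATION + HIRAGANA A  => true
puts math_then_hiragana?("x\u3042")       # Latin letter is not Sm        => false
puts math_then_hiragana?("\u2211\u30A2")  # KATAKANA A is not Hiragana    => false
```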

The problem with the current implementation of character properties is that except for Unicode-based encodings such as UTF-8, very few properties are supported. As an example, EUC-JP and Shift_JIS only support \p{Hiragana} and \p{Katakana} even though these encodings also contain Han ideographs as well as Greek and Cyrillic characters. Also, properties not supported in an encoding lead to parse errors, rather than just not matching anything.

5.5 Ongoing Work

While a start has been made for the new Ruby internationalization framework in Ruby 1.9, a very large number of questions remain to be answered [M17Nquestions]. Many of these questions come from the fact that a CSI architecture allowing multiple internal encodings at the same time poses questions regarding encoding at every corner.

6 Transcoding Policies

This section looks at what the author thinks may become a crucial piece in making the CSI model of Ruby 1.9 work easily and smoothly for all kinds of programming needs. This in particular includes the UCS model important for Internet and Web-related computing and familiar to many programmers coming from other languages. The material presented in this section is currently just a proposal. Whether and how it will be implemented will still have to be explored and discussed.

It is not too difficult to implement the UCS model on top of a CSI architecture. The two main pieces needed are a universal encoding such as UTF-8 and transcoding functionality. But even then, a basic difference remains: Transcoding is performed automatically and aggressively in the UCS model, whereas in a CSI model, it is mostly avoided or delayed. The question is how this difference can be expressed and implemented. Here, we propose to do this with transcoding policies.

In short, a transcoding policy specifies under what circumstances what kinds of transcoding should or should not take place. In general, such a policy will only apply for cases where the program does not specify the exact details. Some basic policies, e.g. a basic CSI policy and a basic UCS policy, may be made available as command options. Finer control should be made possible by exposing the policy as a Ruby object.

6.1 Usage of Conversion Policies

In this subsection, we provide a short overview of where transcoding policies may be useful. Data enters and exits an application at various locations. Each such location is a candidate for transcoding, and therefore for the use of a policy.

In many applications, the bulk of data is read from and written to files. In modern operating systems, console input and output are abstracted as standard input and output files, but these can also be replaced by real files by using redirection. So we obtain real console input and output, redirected console input and output, real and redirected error output, and general file input and output as possible distinct locations where a policy may be applied. In many cases, the same policy will apply to several of these locations, so some kind of defaulting mechanism may be needed.
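
For file input, one way such a per-location choice might look is sketched below. It assumes a Ruby version in which File.open accepts an "external:internal" encoding pair in the mode string; the helper read_as_utf8 and the temporary file name are hypothetical, and the sketch is not the policy mechanism proposed here, only an illustration of transcoding at one entry point.

```ruby
require "tempfile"

# Hypothetical helper: read a file whose bytes are in the +external+ encoding
# and hand the program UTF-8 strings (transcoding happens while reading).
def read_as_utf8(path, external)
  File.open(path, "r:#{external}:UTF-8") { |file| file.read }
end

Tempfile.create("policy-demo") do |tmp|
  tmp.binmode                                 # write raw bytes, no conversion
  tmp.write("\u3042\u3044".encode("EUC-JP"))  # stored on disk as EUC-JP
  tmp.flush

  text = read_as_utf8(tmp.path, "EUC-JP")
  puts text.encoding.name        # => UTF-8
  puts text == "\u3042\u3044"    # => true
end
```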

A next location to consider is the encoding of directory paths and filenames in the file system. In this case, the encoding to use may be derived from some system information or from locale settings in an environment variable.

Different C-level extensions and libraries may use different character encodings, and a policy or several policies may be used to bridge the gap between Ruby internals and these extensions. Different implementations of Ruby use different character encodings natively. For example, Java uses UTF-16 exclusively for strings, so for JRuby, a special policy may be needed.

When combining strings with different encodings, a pure CSI application will throw an exception, whereas a UCS application will want to convert the strings to an encoding that encompasses both encodings. This again can be selected by a policy. A special, more relaxed policy should be in force when reporting errors, to avoid secondary errors and endless loops.
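
The difference between the two behaviors can be sketched as follows. Everything here is hypothetical: the method combine and the policy names :csi and :ucs are not Ruby API, and the sketch assumes a Ruby version with Encoding.compatible? and String#encode.

```ruby
# Hypothetical sketch of policy-dependent string combination.
def combine(a, b, policy)
  # Strings that Ruby already considers combinable need no policy decision.
  return a + b if Encoding.compatible?(a, b)

  case policy
  when :csi
    # Pure CSI behavior: refuse to combine.
    raise Encoding::CompatibilityError,
          "cannot combine #{a.encoding} and #{b.encoding}"
  when :ucs
    # UCS behavior: convert both to an encoding that encompasses both.
    a.encode("UTF-8") + b.encode("UTF-8")
  else
    raise ArgumentError, "unknown policy: #{policy.inspect}"
  end
end

sjis = "\u3042".encode("Shift_JIS")
euc  = "\u3044".encode("EUC-JP")

puts combine(sjis, euc, :ucs).encoding.name   # => UTF-8
```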

6.2 Conversion Patterns

In the previous subsection, we have looked at the various locations where a character encoding policy may be applied. In this subsection, we discuss the various options that a transcoding policy may provide.

For some cases, a fixed source or target encoding may be provided. As an example, the encoding of filenames on a modern Microsoft Windows system is fixed to UTF-16. For a CSI system, internal transcodings would simply be prohibited. An alternative would be to allow conversion to a wider encoding (e.g. from US-ASCII to UTF-8), or to allow conversion if no non-convertible codepoints are found in the actual data.
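
The last alternative, allowing a conversion only if the actual data contains no non-convertible codepoints, could be checked along the following lines. This is a minimal sketch assuming String#encode as available in Ruby 1.9 and later; the helper convertible? is hypothetical.

```ruby
# Hypothetical helper: true if every character of +text+ can be expressed
# in the +target+ encoding, determined by attempting the conversion.
def convertible?(text, target)
  text.encode(target)
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end

puts convertible?("abc", "US-ASCII")           # => true
puts convertible?("D\u00FCrst", "ISO-8859-1")  # => true
puts convertible?("\u677E", "ISO-8859-1")      # => false
```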

For both explicit and automatic conversions, it may be helpful to be able to indicate what fallbacks should be used in case a character or character sequence cannot be transcoded. Alternatives may include simply dropping the data, replacing it with a formal replacement character or with an appropriate visual equivalent (e.g. '?'), or using some kind of escaping convention, e.g. XML numeric character references or Ruby character escaping syntax.
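
These fallback alternatives can be sketched with the options of String#encode, assuming a Ruby version (1.9 or later) that provides them. The method names and the AUTHOR constant are hypothetical and chosen only for this illustration.

```ruby
AUTHOR = "Martin J. D\u00FCrst"   # U+00FC cannot be expressed in US-ASCII

# Drop data that cannot be transcoded.
def drop_unconvertible(text)
  text.encode("US-ASCII", undef: :replace, replace: "")
end

# Replace it with a visual equivalent.
def question_mark(text)
  text.encode("US-ASCII", undef: :replace, replace: "?")
end

# Escape it as an XML numeric character reference.
def xml_reference(text)
  text.encode("US-ASCII", xml: :text)
end

# Escape it using Ruby character escape syntax.
def ruby_escape(text)
  text.encode("US-ASCII", fallback: ->(char) { format("\\u{%X}", char.ord) })
end

puts drop_unconvertible(AUTHOR)  # => Martin J. Drst
puts question_mark(AUTHOR)       # => Martin J. D?rst
puts xml_reference(AUTHOR)       # => Martin J. D&#xFC;rst
puts ruby_escape(AUTHOR)         # => Martin J. D\u{FC}rst
```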

Transcoding policies may also be used to deal with illegal code sequences (frequently detected when transcoding) and with normalization. For normalization, one approach is to create 'guaranteed normalized' versions of encodings, and to use the policy logic to invoke normalization checks when combining two strings in such an encoding. Another approach is to use the policy logic to normalize at certain well-defined interfaces.
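
The second approach, normalizing at a well-defined interface, might look like the following sketch. It assumes a Ruby version that provides String#scrub and String#unicode_normalize; the method name clean_at_interface, the choice of UTF-8 as the incoming encoding, and the choice of NFC are assumptions made for the illustration.

```ruby
# Hypothetical interface function: treat incoming bytes as UTF-8, replace
# illegal code sequences with U+FFFD, and normalize the result to NFC.
def clean_at_interface(bytes)
  text = bytes.dup.force_encoding("UTF-8")
  text = text.scrub("\uFFFD") unless text.valid_encoding?
  text.unicode_normalize(:nfc)
end

decomposed = "Du\u0308rst"   # u followed by COMBINING DIAERESIS
broken     = "D\xFFrst".b    # 0xFF is never legal in UTF-8

puts clean_at_interface(decomposed) == "D\u00FCrst"   # => true
puts clean_at_interface(broken) == "D\uFFFDrst"       # => true
```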

7 Conclusions and Future Work

Ruby is a scripting language that is gaining popularity rapidly due to its high productivity. We described Ruby internationalization for the current version (Ruby 1.8) including some work of the author. We described and analyzed the CSI model which forms the base for the internationalization of the next version (1.9) of Ruby. We also proposed transcoding policies as a general conceptual framework for unifying the CSI model and the UCS model. A lot of work remains to be done to fully implement internationalization as planned for Ruby 1.9, including character conversions and Unicode functionality.


Acknowledgments

My warmest thanks go to Yukihiro Matsumoto (Matz) and many others for Ruby and some great discussions, to Takuya Shimada for his work on sconv, and to Kazunari Ito and many others for providing a great research environment at Aoyama Gakuin University.


References

Addison Phillips and Mark Davis, Tags for Identifying Languages, RFC 4646, September 2006, and Matching of Language Tags, RFC 4647, September 2006, collectively BCP 47, available at http://www.ietf.org/rfc/bcp/bcp47.
Martin J. Dürst, Character Escapes for Ruby, RubyGem available at http://rubyforge.org/projects/charesc/.
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, and Tex Texin, Character Model for the World Wide Web 1.0: Fundamentals, W3C Recommendation 15 February 2005, available at http://www.w3.org/TR/charmod/.
Bert Bos, Håkon Wium Lie, Chris Lilley, and Ian Jacobs, Cascading Style Sheets, level 2 - CSS2 Specification, W3C Recommendation 12-May-1998, available at http://www.w3.org/TR/REC-CSS2.
Martin Fowler, Domain Specific Language, available at http://www.martinfowler.com/bliki/DomainSpecificLanguage.html.
Martin J. Dürst, Fun with Regular Expressions: An Implementation of the Unicode Bidi Algorithm, 26th Internationalization & Unicode Conference, September 2004, San Jose, CA, U.S.A., presentation only, available at http://www.w3.org/2004/Talks/IUC26bidi.
ICU Project, Detailed Character Set Comparison, available at http://www.icu-project.org/charts/charset/roundtripIndex.html.
Nikolai Lugovoi, ICU4R - ICU Unicode bindings for Ruby, available at http://icu4r.rubyforge.org/.
Bruce Tate, From Java To Ruby - Things Every Manager Should Know, Pragmatic Bookshelf, June 2006.
Joel Spolsky, Joel on Software, Apress, 2004, also available at http://www.joelonsoftware.com/articles/LeakyAbstractions.html.
Martin J. Dürst, Support for IETF Language Tags (BCP 47), RubyGem available at http://rubyforge.org/projects/langtag/.
Yukihiro Matsumoto and Masahiko Nawate, Multilingual Text Manipulation Method for Ruby Language, IPSJ Journal, Vol. 46, No. 11, Nov. 2005 (in Japanese).
Various, m17n Questions, Web page (in Japanese) available at http://pub.cozmixng.org/~the-rwiki/rw-cgi.rb?cmd=view;name=m17nQuestions.
Makoto Murata, One project, four schema languages; medley or melee?, International World Wide Web Conference Developers' Day Keynote, Chiba, Japan, May 2005, available at http://www2005.org/keynotes/makoto.pdf.
Tim Bray, Dave Hollander, Andrew Layman, and Richard Tobin, Namespaces in XML 1.1, W3C Recommendation 4 February 2004, available at http://www.w3.org/TR/xml-names11.
Kiyomi Kosako, Oniguruma, Web page available at http://www.geocities.jp/kosako3/oniguruma/.
Kiyomi Kosako, Oniguruma Regular Expressions Version 5.6.0, Web page available at http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt.
Larry Wall, Tom Christiansen and Jon Orwant, Programming Perl (3rd Edition), O'Reilly, 2000.
Martin J. Dürst, The Properties and Promises of UTF-8, Proceedings of the 11th International Unicode Conference, San Jose, CA, USA, September 1997, available at http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
François Yergeau, Gavin Nicol, Glenn Adams, and Martin Dürst, Internationalization of the Hypertext Markup Language, RFC 2070 (historical, superseded by [HTML4]), January 1997, available at http://www.ietf.org/rfc/rfc2070.txt.
Martin Dürst and Michel Suignard, Internationalized Resource Identifiers (IRIs), RFC 3987, IETF Proposed Standard January 2005, available at http://www.ietf.org/rfc/rfc3987.txt.
Dave Thomas, David Heinemeier Hansson et al., Agile Web Development with Rails, Second Edition, The Pragmatic Programmers, 2006.
Dave Thomas, with Chad Fowler and Andy Hunt, Programming Ruby - The Pragmatic Programmers' Guide (Second Edition), The Pragmatic Bookshelf, 2005.
James Edward Gray III, Yukihiro Matsumoto and Koichi Sasada, The Ruby VM: Episode IV, Online Interview at http://blog.grayproductions.net/articles/the_ruby_vm_episode_iv.
Hal Fulton, The Ruby Way, Second Edition, Addison-Wesley, 2006.
Takuya Shimada and Martin J. Dürst, Sconv: A convenience layer for character encoding conversion, RubyGem available at http://rubyforge.org/projects/sconf/.
Takuya Shimada, Kazunari Ito, and Martin J. Dürst, A Study Concerning Multilingual Processing in Ruby, Proceedings of the 69th Annual Meeting of the Information Processing Society of Japan (IPSJ), Tokyo, March 2007 (in Japanese).
André Weinand, Objektorientierte Architektur für grafische Benutzungsoberflächen, Springer, Berlin, 1993 (in German).
Tim Bray, The World in 30 Minutes, RubyKaigi 2007, available at http://www.tbray.org/talks/rubykaigi2007.pdf.
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau, Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation August 2006 (First edition February 1998), available at http://www.w3.org/TR/REC-xml.
Martin J. Dürst, Internationalization of XML – Past, Present, Future, XML 2003, Philadelphia, USA, November 2003, available at http://www.idealliance.org/papers/dx_xml03/papers/03-06-02/03-06-02.html.
Makoto Murata et al., XML Japanese Profile (Second Edition), W3C Member Submission 24 March 2005, available at http://www.w3.org/TR/japanese-xml/.