Ruby M17N

Ruby Kaigi'08, Tsukuba, Japan, 2008/6/21

成瀬 ゆい / Martin J. Dürst

© 2008 成瀬 ゆい / Martin J. Dürst


Slides, Notes

Character Encoding Conversion

[see Colophon for how to view best]

String#encode Usage

Hint: See test/ruby/test_transcode.rb

Changing the encoding of a string:
s2 = string.encode(to_enc)
Explicit from-encoding (avoiding .force_encoding):
s2 = s1.encode(to_enc, from_enc)
s.encode(..., invalid: :ignore)
(more options to come)
Destructive conversion:
(never returns nil)
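
The calls above can be tried out as in the following sketch. It assumes a current Ruby release rather than the 2008 development version. The encoding names and sample strings are illustrative choices, not taken from the slides. Because `invalid: :replace` is the option value known to exist in released Ruby, the sketch uses it in place of the `:ignore` value shown above.

```ruby
# Illustrative only: encodings and sample data are assumptions of this sketch.
s1 = "\u30C6\u30B9\u30C8"            # "テスト" as a UTF-8 string
s2 = s1.encode("Shift_JIS")          # returns a new string, s1 is unchanged

# Bytes that carry no useful encoding label: name the source encoding
# explicitly instead of calling force_encoding first.
raw = "\x83e\x83X\x83g".b            # Shift_JIS bytes for "テスト", labeled binary
s3  = raw.encode("UTF-8", "Shift_JIS")

# Invalid input bytes: replace them instead of raising an exception.
cleaned = "abc\xFF".encode("UTF-16BE", invalid: :replace, replace: "?")

# Destructive variant: converts in place and returns the receiver.
s4 = "abc".dup
result = s4.encode!("UTF-16BE")
```

Here `s3` equals `s1`, and `result` is the same object as `s4`, which fits the note that the destructive form never returns nil.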

Why New Library

Naming and Files

Naming caution: String#encode, rest: trans

Layer Structure

str_encode(_bang): String method implementations
str_transcode: Parameter analysis/conversion, memory handling,...
transcode_dispatch: Conversion selection
transcode_loop: Byte-by-byte loop

(top-down, in transcode.c)

UTF-8 Conversion Hub

Transcoding Core

Conversion Data Example

One table per byte
(Shift_JIS ⇒ UTF-8, first byte)
Byte value     Action
0x61 ('a')     Copy input (0x61, 'a')
0xC3 ('ﾃ')     Output "\xEF\xBE\x83"
0xE0           Go to table for 0xE0...
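
The table-driven lookup can be sketched in Ruby as below. This is not the C data layout used in transcode.c. The action encoding (`:copy`, a String, or a nested Hash) and all names are invented for illustration. The nested table uses lead byte 0x82 with one known entry ("あ"), standing in for the 0xE0 table, which would work the same way.

```ruby
# Second-byte table for lead byte 0x82 (only one entry filled in).
SECOND_BYTE_82 = {
  0xA0 => "\xE3\x81\x82".b          # Shift_JIS 0x82 0xA0 is "あ" (U+3042)
}.freeze

# First-byte table. Actions are a convention made up for this sketch:
#   :copy  -> copy the input byte
#   String -> emit these UTF-8 bytes
#   Hash   -> continue in that table with the next input byte
FIRST_BYTE_TABLE = {
  0x61 => :copy,
  0xC3 => "\xEF\xBE\x83".b,         # halfwidth katakana TE (U+FF83)
  0x82 => SECOND_BYTE_82
}.freeze

def one_table_transcode(bytes)
  out = "".b
  table = FIRST_BYTE_TABLE
  bytes.each_byte do |byte|
    action = table.fetch(byte) do
      raise ArgumentError, format("no entry for byte 0x%02X", byte)
    end
    case action
    when :copy
      out << byte                   # append the byte unchanged
      table = FIRST_BYTE_TABLE
    when String
      out << action                 # append the precomputed output bytes
      table = FIRST_BYTE_TABLE
    when Hash
      table = action                # multibyte character: descend one level
    end
  end
  unless table.equal?(FIRST_BYTE_TABLE)
    raise ArgumentError, "truncated multibyte sequence"
  end
  out.force_encoding("UTF-8")
end
```

A call such as `one_table_transcode("a\xC3\x82\xA0".b)` walks the first-byte table three times and the nested table once.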

Problems with one Table per Byte

Two Tables per Byte

Two Tables Example

(Shift_JIS ⇒ UTF-8, first byte)

Byte value     Offset
0x00           0
...            0
0x61 ('a')     0
0xC3 ('ﾃ')     23
0xE0           42

Offset         Info
0              0
23             Output
42             Go to table for 0xE0...
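
The split into an offset table and an info table can be sketched as follows. The names, the `Tables` struct, the compact indices 0 to 3 (instead of offsets such as 23 and 42), and the reading of info entry 0 as "copy the input byte" are all assumptions of this sketch. As before, lead byte 0x82 stands in for 0xE0.

```ruby
# One node = a 256-entry offset table plus a short info table.
Tables = Struct.new(:offsets, :info)

# Hypothetical helper: bytes not listed in assignments point at :invalid.
def tables(info, assignments)
  offsets = Array.new(256, info.index(:invalid))
  assignments.each do |bytes, index|
    Array(bytes).each { |b| offsets[b] = index }
  end
  Tables.new(offsets.freeze, info.freeze)
end

SECOND_TABLES_82 = tables(
  [:invalid, "\xE3\x81\x82".b],     # info: 0 = invalid, 1 = output "あ"
  { 0xA0 => 1 }
)

FIRST_TABLES = tables(
  [:copy, :invalid, "\xEF\xBE\x83".b, SECOND_TABLES_82],
  { (0x00..0x7F) => 0, 0xC3 => 2, 0x82 => 3 }
)

def two_table_transcode(bytes)
  out = "".b
  current = FIRST_TABLES
  bytes.each_byte do |byte|
    # Step 1: byte value -> small index. Step 2: index -> action.
    action = current.info[current.offsets[byte]]
    case action
    when :copy
      out << byte
      current = FIRST_TABLES
    when String
      out << action
      current = FIRST_TABLES
    when Tables
      current = action
    else
      raise ArgumentError, format("invalid byte 0x%02X", byte)
    end
  end
  unless current.equal?(FIRST_TABLES)
    raise ArgumentError, "truncated multibyte sequence"
  end
  out.force_encoding("UTF-8")
end
```

In this sketch all 128 ASCII bytes share a single info entry, so the first level needs 256 small offsets but only four info entries.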

Advantages of Two Tables per Byte

Currently Supported Encodings

Tell us what you need!

Next Steps

(order mostly insignificant)


Slides, Notes, Questions and Answers


or how these slides were produced,
and how they are best viewed