Ruby M17N

Ruby Kaigi'08, Tsukuba, Japan, 2008/6/21

成瀬 ゆい / Martin J. Dürst

http://www.sw.it.aoyama.ac.jp/2008/pub/RubyKaigiM17N.html

© 2008 成瀬 ゆい / Martin J. Dürst

Naruse-san's Part

Slides, Notes

Character Encoding Conversion

[see Colophon for how to view best]

String#encode Usage

Hint: See test/ruby/test_transcode.rb

Changing the encoding of a string:
s2 = string.encode(to_enc)
Explicit from-encoding (avoiding .force_encoding):
s2 = s1.encode(to_enc, from_enc)
Options:
s.encode(..., invalid: :ignore)
(more options to come)
Destructive conversion:
s1.encode!(...)
(never returns nil)
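
A minimal sketch of the calls listed above, written against released Ruby (1.9 and later). One assumption to flag: the `invalid: :ignore` option on the slide predates the released API, so the sketch substitutes `invalid: :replace` with an empty replacement string, which drops invalid bytes. All variable names are illustrative.

```ruby
# Shift_JIS bytes: halfwidth katakana "ﾃ" (0xC3) followed by ASCII "a" (0x61).
sjis = [0xC3, 0x61].pack("C*").force_encoding("Shift_JIS")

# Changing the encoding of a string (source encoding is taken from the string).
utf8 = sjis.encode("UTF-8")

# Explicit from-encoding: the bytes are only tagged as binary, and no
# force_encoding call is needed before converting.
raw = [0xC3, 0x61].pack("C*")
utf8_explicit = raw.encode("UTF-8", "Shift_JIS")

# Options: 0xFF is never valid in UTF-8. Replacing it with "" drops it
# (assumed stand-in for the slide's `invalid: :ignore`).
broken = [0x61, 0xFF, 0x62].pack("C*").force_encoding("UTF-8")
cleaned = broken.encode("ISO-8859-1", invalid: :replace, replace: "")

# Destructive conversion: encode! converts in place and returns the receiver.
in_place = [0xC3, 0x61].pack("C*").force_encoding("Shift_JIS")
result = in_place.encode!("UTF-8")

p utf8.bytes              # UTF-8 bytes of "ﾃa"
p utf8 == utf8_explicit   # both routes give the same string
p cleaned.bytes           # the invalid byte is gone
p result.equal?(in_place) # same object, not nil
```

Because `encode!` returns the receiver even when nothing changes, it can be chained safely, unlike destructive methods that return nil when they make no change.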

Why New Library

Naming and Files

Naming caution: String#encode, rest: trans

Layer Structure

str_encode(_bang): String method implementations
str_transcode: Parameter analysis/conversion, memory handling,...
transcode_dispatch: Conversion selection
transcode_loop: Byte-by-byte loop

(top-down, in transcode.c)
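
The layering can be pictured with a toy model in Ruby. This is only a sketch: the method names echo the C function names on the slide with a `toy_` prefix, but every body, the `TOY_CONVERTERS` registry, and the single ISO-8859-1 to UTF-8 converter are invented for illustration and do not reflect the real contents of transcode.c.

```ruby
# Hypothetical registry: [from, to] => action applied to each input byte.
# ISO-8859-1 byte values equal Unicode code points, so pack("U") suffices.
TOY_CONVERTERS = {
  ["ISO-8859-1", "UTF-8"] => lambda { |byte| [byte].pack("U").bytes }
}.freeze

# transcode_loop: byte-by-byte loop.
def toy_transcode_loop(input_bytes, converter)
  input_bytes.flat_map { |byte| converter.call(byte) }
end

# transcode_dispatch: conversion selection.
def toy_transcode_dispatch(from, to)
  TOY_CONVERTERS.fetch([from, to]) do
    raise ArgumentError, "no toy converter from #{from} to #{to}"
  end
end

# str_transcode: parameter analysis/conversion and output buffer handling.
def toy_str_transcode(string, to, from = nil)
  from_name = (from || string.encoding).to_s
  converter = toy_transcode_dispatch(from_name, to.to_s)
  output_bytes = toy_transcode_loop(string.bytes, converter)
  output_bytes.pack("C*").force_encoding(to.to_s)
end

# str_encode: the String method layer.
def toy_encode(string, to, from = nil)
  toy_str_transcode(string, to, from)
end

# "café" in ISO-8859-1
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding("ISO-8859-1")
utf8 = toy_encode(latin1, "UTF-8")
p utf8.bytes
p utf8 == latin1.encode("UTF-8")
```

Each layer only knows about the one below it, which is why the slide lists them top-down.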

UTF-8 Conversion Hub
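
The hub idea can be observed from Ruby code. The sketch below assumes a released Ruby with `Encoding::Converter` (which postdates this talk) and asks it which steps it would use between two non-Unicode encodings. It then checks that converting through UTF-8 by hand gives the same result as a single `encode` call.

```ruby
# Steps Ruby would take from ISO-8859-1 to EUC-JP.
path = Encoding::Converter.search_convpath("ISO-8859-1", "EUC-JP")
path.each { |from, to| puts "#{from} -> #{to}" }

# Halfwidth katakana "ﾃ" plus ASCII "a" in Shift_JIS.
sjis = [0xC3, 0x61].pack("C*").force_encoding("Shift_JIS")

direct  = sjis.encode("EUC-JP")                  # one call
via_hub = sjis.encode("UTF-8").encode("EUC-JP")  # explicit detour through UTF-8

p direct == via_hub
```

With a hub, supporting a new encoding needs only converters to and from UTF-8, rather than one converter per pair of encodings.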

Transcoding Core

Conversion Data Example

One table per byte
(Shift_JIS ⇒ UTF-8, first byte)

Byte value    Action
0x00
...
0x61 ('a')    Copy input (0x61, 'a')
...
0xC3 ('ﾃ')    Output "\xEF\xBE\x83"
...
0xE0          Go to table for 0xE0...
...
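
The table above can be modeled in Ruby as follows. This is a sketch under stated assumptions: the action encoding (`:copy`, a String, a nested Array, or nil) is invented here, the helper names are hypothetical, and the mapping data is obtained from Ruby's own Shift_JIS converter rather than hard-coded.

```ruby
# Every table has 256 slots indexed by byte value; each slot holds the
# complete action for that byte:
#   :copy   copy the input byte       String  output these bytes
#   Array   go to that next table     nil     invalid or unmapped input

# Ask Ruby's converter for the UTF-8 bytes of a Shift_JIS byte sequence.
def sjis_to_utf8(*bytes)
  bytes.pack("C*").force_encoding("Shift_JIS").encode("UTF-8").b
rescue EncodingError
  nil
end

# Lead bytes of two-byte Shift_JIS characters.
def sjis_lead?(byte)
  (0x81..0x9F).cover?(byte) || (0xE0..0xFC).cover?(byte)
end

FIRST_BYTE = Array.new(256) do |byte|
  if sjis_lead?(byte)
    # Table for the second byte after this lead byte.
    Array.new(256) { |trail| sjis_to_utf8(byte, trail) }
  else
    output = sjis_to_utf8(byte)
    output == [byte].pack("C") ? :copy : output
  end
end

# Byte-by-byte loop driven purely by table lookups.
def one_table_transcode(input_bytes)
  out = String.new # empty binary buffer
  table = FIRST_BYTE
  input_bytes.each do |byte|
    action = table[byte]
    case action
    when :copy
      out << byte.chr
      table = FIRST_BYTE
    when String
      out << action
      table = FIRST_BYTE
    when Array
      table = action
    else
      raise ArgumentError, format("cannot convert byte 0x%02X", byte)
    end
  end
  raise ArgumentError, "input ends inside a character" unless table.equal?(FIRST_BYTE)
  out.force_encoding("UTF-8")
end

p FIRST_BYTE[0x61]
p FIRST_BYTE[0xC3].bytes
p one_table_transcode([0x61, 0xC3, 0xE0, 0x40]).bytes
```

Note that the 128 ASCII slots all repeat the same `:copy` action, and every second-level table repeats nil for each invalid trail byte.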

Problems with one Table per Byte

Two Tables per Byte

Two Tables Example

(Shift_JIS ⇒ UTF-8, first byte)

offsets

Byte value    Offset
0x00          0
...           0
0x61 ('a')    0
...
0xC3 ('ﾃ')    23
...
0xE0          42
...

infos

Offset    Info
0         0
...
23        output "\xEF\xBE\x83"
...
42        goto table for 0xE0...
...
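
The split into `offsets` and `infos` can be sketched in Ruby as below. The action encoding (`:copy`, a String, `[:goto, byte]`, nil) and all helper names are assumptions made for this illustration. The mapping data comes from Ruby's own converter, so the computed offsets will not match the illustrative values 23 and 42 on the slide.

```ruby
# UTF-8 bytes for a single Shift_JIS byte, or nil if it cannot be converted.
def sjis_byte_to_utf8(byte)
  [byte].pack("C").force_encoding("Shift_JIS").encode("UTF-8").b
rescue EncodingError
  nil
end

# Lead bytes of two-byte Shift_JIS characters.
def sjis_lead_byte?(byte)
  (0x81..0x9F).cover?(byte) || (0xE0..0xFC).cover?(byte)
end

# Full action for every first byte (the one-table-per-byte layout).
actions = Array.new(256) do |byte|
  if sjis_lead_byte?(byte)
    [:goto, byte] # placeholder for "go to table for this lead byte"
  else
    output = sjis_byte_to_utf8(byte)
    output == [byte].pack("C") ? :copy : output
  end
end

# Split one action table into a small-integer offsets table (256 entries)
# and an infos table that stores each distinct action only once.
def split_table(actions)
  infos = []
  seen = {}
  offsets = actions.map do |action|
    seen[action] ||= begin
      infos << action
      infos.size - 1
    end
  end
  [offsets, infos]
end

offsets, infos = split_table(actions)

# Lookup is now two steps: byte -> offset -> info.
p offsets[0x00]
p offsets[0x61]
p infos[offsets[0xC3]].bytes
puts "distinct infos: #{infos.size} of #{actions.size} slots"
```

Each entry in `offsets` fits in a single byte, and the larger action records are stored once no matter how many byte values share them.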

Advantages of Two Tables per Byte

Currently Supported Encodings

Tell us what you need!

Next Steps

(order mostly insignificant)

Naruse-san's Summary

Slides, Notes, Questions and Answers

Colophon

or how these slides were produced,
and how they are best viewed