IRIs Beyond the Napkin: A Survey of
Internationalized Resource Identifier
Issues and Implementation

IUC 34, Santa Clara, CA, U.S.A., 20 October 2010

Martin J. DÜRST and Addison PHILLIPS

duerst@it.aoyama.ac.jp / addison@lab126.com

Aoyama Gakuin University / Lab126

IETF

© 2010 Martin J. Dürst, Aoyama Gakuin University and Addison Phillips, Lab126

Overview

Abstract

If the Latin alphabet is not your (or your customer's) main script, there are many good reasons for including non-Latin characters in a Web address (URL/URI). This presentation will tell you why, when, and how you can and should do this, and provide the background needed to make things work for servers and clients.

Non-ASCII characters have been used in Web addresses for more than a decade. Such Web addresses are called Internationalized Resource Identifiers (IRIs) and have been specified in RFC 3987 since 2005. Early this year, the IETF chartered a Working Group to update RFC 3987.

The presentation will first explain the basic rules for working with IRIs, in particular the conversion to URIs via UTF-8 and percent-encoding. To provide a deeper understanding, we will then concentrate on the major issues that the IRI Working Group is working on addressing:

[Text appearing in gray consists of comments that do not show up in presentation mode. The best way to view the slides as they were presented is with Opera, pressing F11.]

Speakers' Introduction

Addison:

Martin:

Slides available at: http://www.sw.it.aoyama.ac.jp/2010/pub/IUC34-iri-napkin

What's an IRI

In full: Internationalized Resource Identifier

Version of URI/URL where non-ASCII characters are allowed

URIs are often also called URLs (Uniform/Universal Resource Locators), although strictly speaking, URLs are a subset of URIs.

Examples: http://räksmörgås.josefsson.org,
http://納豆.w3.mag.keio.ac.jp, http://بوابة.تونس/
The first internationalized top-level domain names were finally introduced by ICANN starting in mid-2010, after extensive agonizing, deliberation, and testing.

http://ja.wikipedia.org/wiki/青山学院大学,
http://www.sw.it.aoyama.ac.jp/Dürst/,
http://www.w3.org/People/Dürst/
These just work in all modern browsers (which doesn't include IE6), but may not always be displayed as such.

Why Internationalized?

Would you like to type ωωω.γουγλ.κομ ?

Identifiers in a well-known script are easier to:

How IRIs Work

The extended character repertoire is essentially the only difference between URIs and IRIs, and conversion is easy using UTF-8 and percent-encoding. However, as in many other areas of Unicode and internationalization, the details can be surprisingly tricky.
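
As a rough illustration of the conversion just described, the following Python sketch percent-encodes every non-ASCII character as its UTF-8 bytes. The function name iri_to_uri is hypothetical, and the sketch deliberately skips the tricky details (host names, which may need IDNA processing instead, and normalization), so it should not be read as the complete RFC 3987 mapping.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 byte sequence.

    Simplified sketch (an assumption, not the full RFC 3987 procedure):
    ASCII characters are copied unchanged, and the host part gets no
    special treatment.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters, including existing %XX escapes, pass through.
            out.append(ch)
        else:
            # One %XX triplet per byte of the character's UTF-8 encoding.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


print(iri_to_uri("http://www.w3.org/People/Dürst/"))
# prints: http://www.w3.org/People/D%C3%BCrst/
```

The "ü" (U+00FC) becomes the two UTF-8 bytes C3 BC, hence %C3%BC.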

IETF IRI WG

IETF: Internet Engineering Task Force

WG: Working Group

BOF November 2009

Chartered early 2010

Was supposed to be done by June 2010 (charter)

Serious work started recently

Documents being Updated

RFC 3987 - Internationalized Resource Identifiers (IRIs)

Latest Draft: draft-ietf-iri-3987bis-02

RFC 4395 - Guidelines and Registration Procedures for New URI/IRI Schemes

Latest Draft: draft-ietf-iri-4395bis-irireg-00

There is a chance that we will create additional documents when we split up some work (list of WG documents).

Registration Guidelines

Clarified or new:

Main Issues for IRIs

Decomposition of an IRI

The term resource identifier is used generically, to denote URIs and IRIs.

scheme://userinfo@host:port/path?query#fragment

The most important parts:

JavaScript also has access to authority (userinfo/host/port together)
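
To make the decomposition concrete, here is a small Python sketch using the standard library's urllib.parse.urlsplit. The example address and the helper name decompose are made up for illustration; Python calls the authority "netloc".

```python
from urllib.parse import urlsplit


def decompose(iri: str) -> dict:
    """Split a resource identifier into the parts named in the template above."""
    parts = urlsplit(iri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,   # userinfo, host and port together
        "userinfo": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Hypothetical example address, chosen only to exercise every part.
example = "http://user@www.example.org:8080/People/Dürst/?lang=de#top"
for name, value in decompose(example).items():
    print(f"{name:10} {value}")
```

The non-ASCII path is returned as-is; urlsplit only separates the components and does not percent-encode anything.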

To Punycode or not to Punycode
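
For readers who have not seen Punycode in action, the sketch below converts an internationalized host name to its ASCII form using Python's built-in "idna" codec. That codec implements the older IDNA 2003 rules, and the helper name host_to_ascii is hypothetical; this only illustrates the mechanism and takes no position on when the conversion should be applied.

```python
def host_to_ascii(host: str) -> str:
    """Convert each non-ASCII label of a host name to its xn-- (Punycode) form.

    Uses Python's built-in "idna" codec, which follows IDNA 2003; newer
    IDNA rules may map some names differently.
    """
    return host.encode("idna").decode("ascii")


ascii_host = host_to_ascii("räksmörgås.josefsson.org")
print(ascii_host)                                  # first label starts with xn--
print(ascii_host.encode("ascii").decode("idna"))   # converts back to the original
```

Only the label containing non-ASCII characters changes; the ASCII labels "josefsson" and "org" are left alone.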

 

Query Part

Details need to be worked out.

Bidirectionality

Conventions for Bidi Display

Normalization

Browser Quirks and Other Legacy

Scheme Names

Warning: This is not part of current work!

How You Can Contribute

Conclusions

Q & A

Further Material

There are lots of links throughout the talk; please use them!

Some older material (more background information):