31st Internationalization and Unicode Conference, October 2007, San Jose, CA, USA

IRIs and IDNs:
Testing, Implementation, and Specification Evolvement

Martin J. Dürst
Department of Integrated Information Technology,
College of Science and Engineering, Aoyama Gakuin University
Sagamihara, Kanagawa, Japan
mailto:duerst@it.aoyama.ac.jp
http://www.sw.it.aoyama.ac.jp

Keywords: International Resource Identifiers, IRI, Internationalized Domain Names, IDN

Abstract

Internationalized Resource Identifiers (IRIs) are the internationalized version of Web addresses. The IRI specification has been available since 2005, and the specifications for Internationalized Domain Names (IDNs) since 2003. Implementations of IRIs and IDNs in the major browsers are well advanced, but implementations for toolkits and APIs currently still leave quite a bit to be desired. This paper gives a general introduction to IRIs and IDNs, stressing the role of Unicode and UTF-8. It reports on an implementation effort by the author and his group for IRIs and IDNs in the widely used Web toolkit Curl, and on progress with automatic testing and automatic generation of tests for IRIs and IDNs.

An up-to-date version of this paper as well as slides used for the talk connected to this paper are available at http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-iri.

1 Introduction

Internationalization of the Internet and the World Wide Web addressed, in a first wave, the internationalization of content. Internationalization of content means making sure that any email or Web page can be written with the characters that are customary for the language(s) of the user. Milestones for the current infrastructure include MIME for email and HTML Internationalization [RFC2070] for the World Wide Web. The overall policy for the IETF, the body setting Internet Standards, is given by [RFC2277]. The W3C (World Wide Web Consortium) observed a similar policy, but wrote it down much later [Charmod]. Being able to use the characters and language of your choice in electronic mail and in Web pages is definitely the most important part of internationalization. However, there is more. A second wave of internationalization is following, dealing with identifiers.

At a time when the Internet was mostly used by academics and technical people, the need for the internationalization of Internet identifiers was not felt that strongly. Also, there was a widespread concern that it would in effect be counterproductive, because it would no longer allow 'everybody' to access any content whatsoever. The next section discusses the motivation for the internationalization of Internet identifiers in more detail.

1.1 Motivation for Identifier Internationalization

It took quite some time for the general Internet and World Wide Web community to accept the idea of identifiers using characters other than US-ASCII. The restriction to the basic Latin letters A-Z and a-z, the digits 0-9, and a few widely used special characters was seen as a guarantee that an identifier would always be usable, by everybody. For domain names, the effort to create an internationalized version took hold seriously at the IETF only after China started to use non-ASCII domain names. For Web addresses, the long development time of the IRI specification [RFC3987] is to a significant degree due to the time it took to get the general idea of internationalized identifiers accepted widely enough.

If US-ASCII is indeed all that is needed for identifiers, why internationalize them? To understand this, it is best to think about a situation similar to what people not very familiar with the Latin alphabet experience. One way to do this is to imagine that domain names and Web addresses used Greek characters rather than Latin ones. Most people are somewhat familiar with Greek characters, so one could claim this should not create any difficulties. Of course, this is far from being true. So what advantages are there to using the most familiar script for these identifiers?

Examining the various uses of domain names and Web addresses, one finds that ideally, such identifiers are easy to read, easy to memorize, easy to write down, easy to type, easy to transmit orally (e.g. over the phone), and easy to guess.

All these functions are very clearly easier for readers familiar with English and the Latin script if the identifiers use the Latin script and are based on English words. Using Greek letters would be a real pain, independent of whether English words are transcribed or whether the Greek language is used. Turning things around, Greeks and others using non-Latin scripts must feel similarly about having to use US-ASCII identifiers.

But if Greeks use Greek identifiers, Chinese use Chinese identifiers, and so on, will there not be a danger that the Internet and the World Wide Web become fragmented along script boundaries? Well, as far as language is concerned, a Chinese Web page is already mostly unreadable to somebody who does not read Chinese. Giving it a Chinese address, which makes it easier for Chinese readers to access while making access more difficult for people who do not read and write Chinese, therefore makes a lot of sense.

This leaves out people not familiar with a certain script. While making it more difficult for them is appropriate, because they will not use these addresses very often, access should not be made impossible. There are two solutions. The first solution is to take advantage of the fact that it is possible to use multiple identifiers for the same domain or Web page. A Web page written in both Chinese and English can thus easily be made available under both a Latin and a Chinese identifier. The second solution is to use some internal encoding based on US-ASCII only. This is described in Section 2.3.

1.2 Importance of Unicode

Unicode [TUS5] is central to any kind of work on internationalized Internet or Web identifiers and addresses. While such identifiers have to be transported in all kinds of encodings (including paper and similar media, where the term encoding is not really appropriate), a common frame of reference is needed. Unicode provides this frame of reference by assigning a unique number to each of the world's characters. Unicode also provides encodings, such as UTF-8, for use when a unique encoding is needed.

Unicode was also important, in a more fundamental way, for getting the idea of internationalized identifiers accepted. Without Unicode, well-internationalized software would have been much more difficult to write, and internationalization at the content level would have taken more time. This would have raised more doubts about the possibility of internationalizing identifiers.

1.3 Overview

The remainder of this paper is organized as follows. Section 2 provides an overview of the basic workings of IRIs, the internationalized version of URIs, and of IDNs, the internationalized version of domain names. Section 3 discusses IRI testing, and presents a framework for generating tests. Section 4 presents the current state of implementation in browsers and tools, including an implementation of IRIs in the URI tool cURL. Issues under consideration for the update of the IRI and IDN specifications are discussed in Section 5, and the internationalization of email addresses in Section 6.

2 How IRIs and IDNs Work

This section gives an overview of how IRIs (Internationalized Resource Identifiers) and IDNs (Internationalized Domain Names) work. Please see also [MulAddr] for further examples.

IRIs are internationalized versions of URIs (also called URLs; some minor differences in terminology between URIs and URLs can safely be ignored here). IDNs are internationalized versions of domain names. Here we start with an extremely short summary of domain names and URIs.

2.1 URIs and Domain Names

A full domain name is used to identify a host on the Internet. Domain names consist of labels separated by dots; each label consists of letters, digits, and hyphens. Examples of domain names are www.example.com, www.sw.it.aoyama.ac.jp, and www.w3.org. Domain names form a hierarchy of domains, with the top-level domain on the right side, in our examples com (commercial/companies), jp (Japan), or org (organizations). The domain name system is responsible for converting domain names, understandable to humans, into the numeric Internet addresses used for directing data traffic. Without a functioning and stable domain name system, almost all Internet applications would break down very quickly.
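
As a minimal illustration of this conversion, the following sketch in Python (Python is used for all code sketches in this paper; they are illustrations, not part of any specification) asks the system resolver for the numeric addresses behind one of the example domain names:

  import socket

  # Ask the system resolver (which in turn uses the DNS) for the
  # numeric Internet addresses behind a human-readable domain name.
  for family, _, _, _, sockaddr in socket.getaddrinfo('www.w3.org', 'http'):
      print(family.name, sockaddr)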

URI stands for Uniform (or Universal) Resource Identifier. A resource is, on purpose, a rather vague concept: in many cases it is just an electronic document, but it is potentially anything that may need to be identified, including real-world objects and humans. Uniform stands for the fact that URIs follow certain very general syntax rules to simplify processing. Universal denotes the fact that URIs are designed to be extensible, to potentially denote anything. The general syntax for URIs is:

scheme:hierarchical-part?query#fragment-identifier

An example of an actual URI that contains all four parts is:

http://www.w3.org/2005/11/Translations/Query?titleMatch=HTML&lang=fr#xhtml1-2

Often the query and fragment identifier parts, and their preceding separators (? and #), are absent. Schemes indicate how to interpret and access the resource; they also restrict the syntax further. Hierarchical parts in many schemes contain domain names; in the above example, the domain name is www.w3.org. For many schemes, the syntax follows a generic model, with the domain name (and possibly some other information) after a double slash (//), and a directory/filename-like hierarchy indicated by further slashes. Query parts are used to query information resources such as databases. Fragment identifiers serve to indicate subresources, e.g. certain positions or parts of a document.
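
The generic syntax can be observed with Python's standard library, which splits a URI along exactly these parts; the example URI from above is used:

  from urllib.parse import urlsplit

  parts = urlsplit('http://www.w3.org/2005/11/Translations/Query'
                   '?titleMatch=HTML&lang=fr#xhtml1-2')
  print(parts.scheme)    # http
  print(parts.netloc)    # www.w3.org (contains the domain name)
  print(parts.path)      # /2005/11/Translations/Query
  print(parts.query)     # titleMatch=HTML&lang=fr
  print(parts.fragment)  # xhtml1-2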

2.2 Syntax Extension

The first step to understanding how IRIs and IDNs work is to notice that they have to be usable not only in electronic form, e.g. in various protocols and formats, but also in non-electronic form, e.g. on paper, on the side of a bus, or over the phone. This is no different for URIs and domain names, but does not need to be stressed so much in that case. IRIs and IDNs thus first have to be defined as sequences of characters, independent of encoding. Only after that can it be decided how to encode them in various protocols and formats.

Defining IRIs and IDNs in terms of characters is, in a first approximation, very easy. The restriction to US-ASCII characters is removed, allowing all Unicode characters. For general users, this is often all they need to know. However, for implementers, and when writing specifications, more details are needed. Unicode includes an extremely large number of characters of very diverse origins and nature, some of them not visible, others visually extremely close to each other, and so on, as well as unassigned codepoints. IDNs and IRIs take different approaches here.

The IDN specification [RFC3490] is rather strict about which Unicode characters it allows. In particular, it does not allow the registration of domain names with characters not yet assigned in Unicode 3.2, the newest version at the time the IDN specification was created. It also excludes a fair number of characters that were deemed inappropriate in domain names. The IRI specification is much less restrictive. This is due to the fact that IRIs are composed of many different pieces, derived from different identifiers with different restrictions. For all components except the query part, private use characters are excluded because they have no meaning in the global context of the Internet, but apart from that, there are not many restrictions.

A special issue for IRI and IDN syntax is the treatment of syntactically relevant characters. For domain names, this is only the dot ('.') separating domain labels. For URIs, this includes the colon (':'), slash ('/'), question mark ('?'), and hash mark ('#'), among others, depending on URI scheme and part. One question is whether non-ASCII variants of these characters should be accepted as equivalents or not. In general, the answer is NO, to simplify processing. However, IDN requires that several East Asian full stop variants be recognized as equivalents of the dot [RFC3490, Section 3.1]. Another question is whether additional characters, e.g. the cent sign ('¢'), should be made syntax-significant. The answer here too is NO.
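
This dot equivalence can be observed directly: Python's built-in idna codec, an implementation of [RFC3490], treats the ideographic full stop as a label separator:

  # U+3002 IDEOGRAPHIC FULL STOP is accepted as a label separator,
  # as required by RFC 3490, Section 3.1.
  print('www。example。jp'.encode('idna'))  # b'www.example.jp'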

There is also the possibility of keeping parts of an identifier ASCII-only. For IDNs, this has not been done on the specification level, but for political reasons, there are unfortunately still no non-ASCII top-level domains. However, we hope that this will change soon. For IRIs, the scheme part has stayed restricted to US-ASCII, because the number of schemes is expected to stay low, and schemes need to be registered in a central registry (http://www.iana.org/assignments/uri-schemes.html).

A specific aspect of character identity in Unicode is normalization. Unicode allows many characters with diacritics to be encoded either precomposed (e.g. U+00FC for 'ü') or decomposed (e.g. U+0075 'u' followed by U+0308 COMBINING DIAERESIS). Unless it is clear which components of the infrastructure have to take care of this multiplicity of representations, IRIs (and IDNs) will not be resolved correctly. Again, the IDN specification is very clear here. It requires the use of NFKC [UTR15] by clients before the actual resolution of a domain name. For IRIs, this is again less clear-cut. The specification requires the use of NFC under certain circumstances, and suggests the use of NFC or NFKC in others, but cannot go further because it might otherwise interfere with the normalization requirements of individual components.
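
The 'ü' example just given can be checked in a few lines of Python, using the normalization support in the standard library:

  import unicodedata

  precomposed = '\u00FC'  # 'ü' as a single code point
  decomposed = 'u\u0308'  # 'u' followed by COMBINING DIAERESIS

  print(precomposed == decomposed)                                # False
  print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True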

2.3 Encoding of IRIs and IDNs

Once the syntax of IRIs and IDNs is defined in abstract terms, we have to decide how to actually transport them electronically. Here, we can distinguish two cases. The first case is that they are transported as part of a document such as a Web page or an electronic mail message. The second case is that they are used in a resolution protocol.

For both IRIs and IDNs, the solution to the first case is the same, and is simple: They are included in these documents in exactly the same way as any other characters would be. To take the example of a Web page, any characters that can be encoded in the encoding of the Web page can be encoded directly. For other characters, HTML provides numeric character references. As an example, for an IRI containing Japanese characters and Korean Hangul, and encoded in a Japanese encoding such as Shift_JIS or EUC-JP, the Japanese characters would be encoded natively, but numeric character references (e.g. &#44032; for the Hangul syllable 가) would be used for Hangul. Some formats, such as text/plain emails, do not provide an equivalent of numeric character references. We will come back to this case below.
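
This mixture of native encoding and numeric character references falls out naturally from Python's xmlcharrefreplace error handler; a small sketch with a hypothetical IRI:

  # Japanese characters fit into EUC-JP and are encoded natively;
  # Hangul does not fit and becomes numeric character references.
  iri = 'http://example.org/日本語/가나'
  print(iri.encode('euc_jp', errors='xmlcharrefreplace'))
  # b'http://example.org/\xc6\xfc\xcb\xdc\xb8\xec/&#44032;&#45208;'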

For resolution, it is obviously not very appropriate to send identifiers in all kinds of encodings and with various escaping conventions. The encodings and conventions would have to be identified in the resolution protocol, and the server would have to be able to deal with all these encodings and conventions. A more uniform solution, putting the burden of conversion on the client, is therefore necessary.

For IDNs, the encoding chosen for representing labels of internationalized domain names in the actual DNS protocol is called Punycode [RFC3492]. Punycode is a rather elaborate algorithm that converts a sequence of Unicode characters into a somewhat longer sequence of US-ASCII characters in an efficient way. Punycode labels are distinguished from usual US-ASCII labels by prefixing them with xn--. As an example, the fictitious IDN www.résumé.jp, converted to Punycode, results in www.xn--rsum-bpad.jp. Punycode has been designed especially for domain names, where a label cannot contain more than 63 US-ASCII characters. Good compression (or low expansion) for names with characters from the same script was therefore the primary requirement.
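
Python's standard codecs include both raw Punycode and the complete IDNA ToASCII operation, so the example above can be reproduced directly:

  # Punycode for a single label (without the xn-- prefix) ...
  print('résumé'.encode('punycode'))     # b'rsum-bpad'
  # ... and the full IDNA conversion of a whole domain name.
  print('www.résumé.jp'.encode('idna'))  # b'www.xn--rsum-bpad.jp'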

For IRIs, there are a large number of protocols that are potentially involved, but the classical case is HTTP. HTTP uses URIs, or parts of URIs, to identify the resource requested from a server. For IRIs, the encoding used in many resolution scenarios is therefore a conversion to a URI. This conversion is carefully defined in the IRI specification [RFC3987, Section 3]. URIs already come with a mechanism to encode arbitrary byte values, using a percent character ('%') followed by two hexadecimal digits. To convert from an IRI to a URI, the Unicode characters of the IRI are encoded as UTF-8, and the resulting bytes are encoded as just described. As an example, the IRI component résumé.html will result in the corresponding URI component r%C3%A9sum%C3%A9.html. By fixing the encoding used in this conversion to UTF-8, US-ASCII characters remain as they are, and all Unicode characters can be converted. UTF-8 is also the encoding of choice for new protocols that have to transmit IRIs but do not suffer the limitations of HTTP. However, it should be made clear that the use of UTF-8 for this conversion does not mean that IRIs always have to be represented in UTF-8; if the context uses another encoding, then that encoding is used.
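
The component example just given can be reproduced with Python's percent-escaping function, which by default encodes non-ASCII characters as UTF-8 before escaping:

  from urllib.parse import quote

  # IRI component -> URI component: UTF-8 first, then percent-escaping.
  print(quote('résumé.html'))  # r%C3%A9sum%C3%A9.html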

Both Punycode and the conversion of IRIs to URIs result in a pure US-ASCII form of the respective identifiers. The main purpose of this form is use in legacy protocols (e.g. the DNS and HTTP), out of view of human users. However, these forms may also be used as fallbacks for users who are not familiar with a certain script, for cases when certain Unicode characters cannot be included in a document, and for cases when an exact check is needed, e.g. for security reasons. In general, however, users very much dislike these representations, and strongly prefer to see 'the real thing', i.e. the actual Unicode characters that they can easily understand.

3 Testing

3.1 Testing Requirements

Testing IRIs and IDNs is important because it helps improve implementations and move specifications forward on the standards track. Testing of IRIs involves many different aspects. The Character Model for the World Wide Web 1.0: Resource Identifiers [CharmodResid] lists the following axes:

  a) use of IRIs in several document formats;
  b) use of IRIs in several locations in the same document format;
  c) use of non-ASCII characters in different parts of an IRI (e.g. domain name part, path part);
  d) use of IRIs in documents with various widely used character encodings and with characters from various scripts;
  e) use of document-specific escapes in IRIs;
  f) use of IRIs with various URI schemes;
  g) setup of various servers for IRIs;
  h) the translation of IRIs into URIs.

Over the years, IRI tests for various purposes have been created and made available at various locations. An overview is given at http://www.w3.org/International/iri-edit/testing.html; if some tests are not listed there, please inform the author.

3.2 A Testing Framework

In [Yone2006], we have presented a framework for creating such tests. The framework concentrates on IRI resolution, as opposed to other parts of the IRI lifecycle such as IRI generation or IRI transmission. IRI transmission simply means transmission of Unicode characters; this is not an IRI-specific issue, and in general works well. IRI generation in many cases is naturally restricted by IRI resolution; in the long term, IRIs that cannot be resolved will simply not be generated.

3.2.1 Methodology

The methodological part of the framework starts with identifying a technology, for example (X)HTML [HTML4, XHTML1] or CSS [CSS2]. For that technology, the locations in the protocol or format that may contain IRIs are then identified. For HTML, this includes the href attribute of the <a> element, the src attribute of the <img> element, and the profile attribute of the <head> element.

Next, for each location, we evaluate whether an URI or IRI in that location will actually be dereferenced. Dereferencing the href attribute of the <a> element is what happens when following a link, and the src attribute of the <img> element gets dereferenced whenever an image is included in a Web page, so tests for these locations will be designed in the next step. On the other hand, the profile attribute of the <head> element is not dereferenced, at least not by the average browser, so we refrain from creating any tests.

For each location identified, we then create a simple, easy-to-use test, consisting of the files needed to check whether a URI or IRI in the location in question can be correctly resolved. The tests usually consist of one main file containing information about the test and instructions about how to judge the test result, and one or more auxiliary files, one of them being the file that has to be dereferenced for the test to be successful.

3.2.2 Testing Dimensions

Using scripting and templating techniques, the framework then allows us to create a multidimensional test matrix. One dimension is the document format involved, covering point a) in Section 3.1. The different locations in a document form the second dimension, covering point b) above. These two dimensions are 'hand-generated' as explained above. A list of test strings representing terms or words in various languages and with various character encodings forms a third dimension, covering point d) above. These test strings were originally collected for testing the Apache fileiri module [mod_fileiri]. A fourth dimension is unrelated to the list in Section 3.1, and addresses frequent failure cases.
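
To give a rough idea of how such a matrix can be expanded from templates, here is a toy sketch in Python (the actual framework is written in Ruby, see Section 3.3; all names and templates below are illustrative only):

  # One generated test file per (format, location, test string) triple.
  TEMPLATES = {
      ('html', 'a-href'):  '<a href="{iri}">follow this link</a>',
      ('html', 'img-src'): '<img src="{iri}" alt="test image">',
      ('css', 'url'):      'body {{ background-image: url("{iri}") }}',
  }
  TEST_STRINGS = {'fr': ('résumé', 'iso-8859-1'), 'ja': ('日本語', 'shift_jis')}

  for (fmt, location), template in TEMPLATES.items():
      for lang, (word, enc) in TEST_STRINGS.items():
          with open(f'test-{fmt}-{location}-{lang}.{fmt}', 'w', encoding=enc) as f:
              f.write(template.format(iri=word + '-ok.html'))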

3.2.3 Frequent Failure Cases

Most older browsers do not actually refuse to dereference syntactically incorrect URIs (including those containing non-ASCII characters, i.e. IRIs). Instead, they take the binary representation of the URI/IRI, in whatever encoding the Web page happens to be, and convert that to a URI using percent-escaping. To make it easy to distinguish this frequent failure case from a simple network failure, we make an alternate file available at the wrong but frequently requested location. The alternate file clearly differs from the correctly dereferenced file, e.g. by containing the words 'wrong' or 'failure' rather than 'okay' or 'success' and by using colors such as red rather than green.
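
The difference between the correct conversion and this frequent failure case is easy to demonstrate in Python; here for a page encoded in iso-8859-1:

  from urllib.parse import quote

  iri_path = 'résumé.html'
  # Correct: percent-escape the UTF-8 bytes of the IRI.
  print(quote(iri_path))                       # r%C3%A9sum%C3%A9.html
  # Frequent failure: percent-escape the raw legacy-encoded bytes.
  print(quote(iri_path.encode('iso-8859-1')))  # r%E9sum%E9.html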

The introduction of these alternate files to signal frequent failure cases is very helpful for human observers, but creates a problem for automatic tests, where a simple 404 Not Found is much clearer. The fourth dimension of our tests therefore consists of a group of tests for human inspection, which includes the alternate files, and a group of identical tests for automatic testing, which excludes the alternate files. Both of these groups use legacy encodings (i.e. encodings not based on Unicode), and are therefore labeled LegacyGroup1 (for humans) and LegacyGroup2 (for machines). A third group uses the same test strings, but encoded in UTF-8. In this case, the frequent failure case produces the correct dereferencing request, and so there is only one group.

These tests have been published at http://www.sw.it.aoyama.ac.jp/2005/iritest/, but this location has not been widely publicized.

3.3 Testing Framework Update

The testing framework described in Section 3.2 was written in the scripting language Perl [Perl]. For templating, we used string interpolation triggered by on-the-fly code evaluation. This often made it difficult to understand which actual value of which variable would be interpolated. Also, we did not use any of the object-oriented features of Perl because of their high overhead, both conceptually and in terms of lines of code.

Recently, we have rewritten this framework in the scripting language Ruby [RubyPrag]. For templating, we took advantage of Ruby's straightforward object-orientation and its more powerful string interpolation. Not only can variables be interpolated into a string, but it is also possible to interpolate the results of method calls. We combine this with inheritance and singleton objects to derive specific tests by extension from generic test patterns. Object-orientation is also used to abstract other dimensions of the testing framework. As a result, our framework is now considerably more compact, easier to understand, and more flexible and extensible.

To the three test groups mentioned in Section 3.2, which we now call modalities, we have added three more. The first is a basic US-ASCII-only modality for cross-checking. The other two modalities test decimal and hexadecimal numeric character references (e.g. &#1234; and &#xABCD;). These last two modalities cover point e) in Section 3.1. Other modalities for other document-specific escapes could also be created.

We are currently working on the automatic generation of tests for ftp, in order to cover point f) in Section 3.1. We also plan to add some exploratory tests to decide which features of the current IRI specification may have to be removed when moving the specification ahead on the IETF standards track (see also Section 5). One specific example is tests related to normalization, e.g. for Vietnamese. We plan to publish the newest version of the test suite at http://www.sw.it.aoyama.ac.jp/2005/iritest/ in time for the conference.

4 Implementation

4.1 Implementation Status in Browsers

IRIs and IDNs are widely implemented in modern browsers, but sometimes some settings or plugins are necessary. Internet Explorer 6, currently still the most widely used browser, requires a setting for using IRIs (Tools → Internet Options → Advanced → Browsing → Always send URLs as UTF-8). The setting is on by default in most language versions. In an earlier version of Internet Explorer, the setting was reportedly off for East Asian language versions. A plugin is required for IDNs. Internet Explorer 7 supports both IRIs and IDNs from the start.

In newer versions of Firefox, IDNs are supported, but IRIs require a setting. Enter about:config into the location/address bar, then search for network.standard-url.encode-utf8 and make sure it is set to true.

A setting for IRIs is also available in Opera, and is on by default (Tools → Preferences → Advanced → Network → Encode international Web addresses with UTF-8). IDNs are also supported in recent versions.

The author did not find a setting for Safari for Windows, but this is not a problem, because both IRIs and IDNs seem to be well supported.

Overall, the implementation status for browsers is good and getting better, but the same is not necessarily true for tools and libraries.

4.2 Implementing IRIs in Curl

Curl [cURL] is an extremely powerful open-source command line tool for downloading resources identified by URIs, as well as for other, related operations. A large number of schemes are supported, and libcurl, the library part of Curl, is used in bindings for over 30 programming languages.

Currently, the publicly available version of cURL does not yet support IRIs or IDNs. We have recently produced a proof-of-concept implementation of IRIs and IDNs in Curl [Koba2007]. The implementation mainly consists of two parts, detection of the character encoding used and conversion as described above in Section 2.3. Of these two parts, the character encoding detection is the trickier one.

It is not feasible to automatically detect character encodings for short strings as they appear in IRIs and IDNs. The two remaining alternatives are therefore explicit specification via a command line option, and obtaining information from the operating system. For the command line option, the main problem was that most option letters were already in use, leaving us with -Z as the only alternative. Obtaining information from the operating system was implemented for Microsoft Windows and for Unix/Linux. For Windows, a remaining problem is that Windows applications use the so-called ANSI Code Page, whereas MS-DOS applications use the so-called OEM Code Page. Because these two are the same for Japanese systems, we were unable to fully test our approach.
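
A rough Python equivalent (hypothetical; the actual cURL implementation [Koba2007] is in C) of obtaining the encoding information from the operating system looks as follows:

  import locale
  import sys

  # Adopt the user's locale, then ask for its preferred text encoding.
  locale.setlocale(locale.LC_CTYPE, '')
  print(locale.getpreferredencoding())  # e.g. 'UTF-8', or 'cp932' on
                                        # a Japanese Windows system
  print(sys.getfilesystemencoding())    # encoding used for file names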

5 Specification Evolvement

The specifications for both IRIs and IDNs are being maintained by the IETF. In this section, we discuss the known issues that need to be resolved for the next round of updates of these specifications. Both the IRI and the IDN specification are currently at the stage of Proposed Standard. The next stage in the IETF standardization process is Draft Standard, followed by (full) Standard. In contrast to the practice in other standards organizations, IETF specifications are already implemented when they reach Proposed Standard. The later stages are mainly used for minor updates and corrections, and for removing features that have not actually been implemented and therefore do not seem needed.

5.1 IRI Specification Evolvement

A new version of the IRI specification is available as Internet Draft draft-duerst-iri-bis-xx.txt [IRIbis]. At the time of writing, xx is 00, but by the time of the conference, it should be 01. Internet Drafts are working documents of the IETF, used to develop and discuss specifications before they are published as RFCs.

An issues list for this update is kept at http://www.w3.org/International/iri-edit/#Issues. Issues with numbers lower than 100 have been resolved before publishing [RFC3987], and are therefore no longer relevant. The archive of the mailing list discussing the current update is at http://lists.w3.org/Archives/Public/public-iri/; this location also provides subscription instructions. We invite the reader to subscribe to this mailing list and provide comments on the current work so that we can make sure the IRI specification addresses everybody's needs.

In the following, we discuss those issues that we think will require most time for resolution, or will otherwise result in potentially significant changes.

5.1.1 Normalization

The current IRI specification requires the use of NFC when converting an IRI in a non-Unicode (legacy) encoding to a URI. There have been various comments on this point. One issue is that conversion from a legacy encoding to Unicode may be done on a document as a whole, in which case the information about the original encoding may be lost. Another issue is that most Web applications include some code for character encoding conversion, but far fewer applications include code for normalization.

For many scenarios, normalization does not cause any problems, due to the fact that NFC was carefully designed to be close to the most frequent current practice. However, for some languages, in particular Vietnamese, and for some platforms, in particular the Macintosh, it can create some problems. Vietnamese is written with the Latin script, but uses a large number of diacritics and combinations of diacritics. Unicode contains all the necessary precomposed characters, but these do not fit into a single 8-bit code page. Therefore 8-bit code pages such as windows-1258 are not in NFC, and transcoding tools often just convert codepoints without normalization.
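
A small sketch of what such codepoint-wise transcoding produces, and of the repair that NFC performs (windows-1258 represents many Vietnamese letters as a base letter plus combining marks):

  import unicodedata

  # 'ế' as it comes out of naive transcoding from windows-1258:
  decomposed = '\u00EA\u0301'  # 'ê' + COMBINING ACUTE ACCENT
  precomposed = unicodedata.normalize('NFC', decomposed)
  print(precomposed == '\u1EBF')  # True: the single code point 'ế'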

The Macintosh creates issues with respect to normalization because it traditionally uses more decomposition than other systems. As an example, filenames are stored decomposed in the file system. This is not a problem when using a Macintosh as a server, because the file system also accepts precomposed names, similar to the Windows file system, which accepts different capitalizations of names. However, it may become a problem for a client on the Macintosh. If IRIs are sent in decomposed form, then servers on other operating systems may not be able to resolve them. A statement such as "A good HTTP/WebDAV server should accept any form of Unicode, of course." (see http://lists.apple.com/archives/macnetworkprog/2005/Jan/msg00005.html) is simply wishful thinking, and far from reality.

5.1.2 Legacy Extended IRIs

The IRI specification [RFC3987] was developed over a time period of more than five years. During this time, the exact definition of an IRI changed slightly. Some other specifications were keen on using IRIs, but had to use their own, abbreviated, definition as long as the IRI specification was not available as an RFC. As a consequence, in particular [XML] and some related specifications currently contain definitions that allow spaces and some other characters not allowed in IRIs.

[RFC3987] contains a paragraph permitting implementations to deal with such cases (search for "Systems accepting IRIs MAY also deal with" in Section 3.1). However, it is difficult to replace an ad-hoc definition in a specification with a reference to this text. The XML Core Working Group therefore has proposed to create a definition for Legacy Extended IRIs, with appropriate syntax and usage warnings. We are currently working on integrating this proposal into [IRIbis]. This will make sure that IRIs are converging, rather than diverging. The key to a successful resolution of this issue is to make sure that the needs of potential other specifications currently using a home-brewed definition of IRIs are covered, and that it is clear that the usage of the characters that are allowed in addition to those in IRIs proper is discouraged.

5.1.3 IRIs and Forms

In the case of HTTP GET requests, data input into HTML forms is transmitted in the query part of a URI or IRI. The HTML4 specification contains the oldest description of a conversion from Web addresses containing non-ASCII characters to pure ASCII-only URIs along the lines now in the IRI specification [HTML4, Appendix B.2.1]. However, the HTML4 specification also contains a provision for transmitting non-ASCII form data: The character encoding used is that of the Web page containing the form, unless the <form> element contains an accept-charset attribute with a different value [HTML4, Section 17.3]. This convention is well established and crucial for non-ASCII form data.

The IRI spec currently does not explicitly discuss in what cases UTF-8 has to be used, and in what cases the form-specific convention can be used. In general, this has not been a problem, but it should be clarified to make sure implementations are in sync.
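
The difference between the two conventions can be made concrete with a small Python sketch; the same (hypothetical) form field is encoded once as from a UTF-8 page, once as from an iso-8859-1 page:

  from urllib.parse import urlencode

  form = {'titleMatch': 'résumé'}
  # UTF-8 page: matches the generic IRI-to-URI conversion.
  print(urlencode(form))                         # titleMatch=r%C3%A9sum%C3%A9
  # iso-8859-1 page: the HTML4 form convention uses the page's encoding.
  print(urlencode(form, encoding='iso-8859-1'))  # titleMatch=r%E9sum%E9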

5.1.4 IRIs in Context

In some contexts, such as plain-text email messages, URIs are automatically detected so that they can easily be followed. In general, this can be extended to IRIs without problems, but some cases, such as IRIs in scripts without spaces between words, may need more clarifications [IRIreco].

5.1.5 Security issues

This paper has not discussed security issues in great detail, because we concentrated on IRIs, while security issues are more acute for IDNs. Except for the domain name part of an IRI, IRIs are created under a single authority that will not want to mislead users of its IRIs. However, the security issues discussed in the IRI spec have to be carefully reevaluated based on the experience gained since 2005.

5.2 IDN Specifications

The Proposed Standard version of the IDN specification [RFC3490] and the related specifications are two years older than the corresponding IRI specification. Work is also underway to update them. The main issue is the fact that only Unicode characters that were defined in Unicode Version 3.2 are allowed in IDNs. This excludes recently added scripts and characters as well as future additions. A more flexible model is needed to avoid repeated upgrades. The interested reader is invited to consult [IDNAissues] and [IDNAbis].

6 Related Work: Email Address Internationalization

With the work on IRIs and IDNs progressing, the Internet identifiers most widely used by the end user but not yet internationalized are electronic mail (email) addresses. This has not gone unnoticed. After some preliminary discussions, the IETF formed the EAI Working Group in March 2006. EAI stands for Email Address Internationalization.

The approach taken by this working group is fundamentally different from the approaches taken for IRIs and IDNs. Neither was an existing escaping syntax used (similar to percent-encoding for IRIs), nor was a new, specialized encoding created (similar to punycode for IDNs). Instead, it was decided to finally abolish the long-standing, and in many ways no longer true, assumption that email data paths, in particular for email headers, were limited to 7 bit. The overall approach is described in [RFC4952].

Breaking with the 7-bit limitation made it possible to start with a clean slate. UTF-8 was the obvious choice of encoding for email addresses and other textual data in headers, as well as for email addresses in the SMTP protocol itself. This is a great step forward from the current state, where non-ASCII data, e.g. in a Subject: header field, has to be encoded using [RFC2047], resulting in cryptic character sequences only decodable by email-specific software. As an example, =?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?= is used to encode Patrik Fältström. If there are no basic Latin characters, the encoding looks even more cryptic.
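
Decoding such an [RFC2047] encoded word indeed takes email-specific software; in Python, the standard email package provides it:

  from email.header import decode_header

  for raw, charset in decode_header('=?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?='):
      print(raw.decode(charset))  # Patrik Fältström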

Because this work challenges some heretofore rather fundamental assumptions, the first round of specifications will have experimental status. They may be approved by the IESG by the end of this year or early next year. The pointers to the newest versions can always be found on the Charter page of the EAI Working Group at http://www.ietf.org/html.charters/eai-charter.html.

7 Conclusions and Future Work

We have given an overview of the definition and workings of IRIs (Internationalized Resource Identifiers) and IDNs (Internationalized Domain Names), concentrating on testing, implementation in a widely used tool, and current issues for specification updates.

A lot of work is still needed to obtain the full benefit of internationalized identifiers. While IRIs and IDNs work with modern browsers, many Web tools and Web sites still have difficulties processing them. On the political level, non-ASCII top-level domains should be introduced as soon as possible, to allow IDNs to be fully non-ASCII.

Acknowledgements

My warmest thanks go to Kazuhiro Yonekawa and Takeo Kobayashi for their help on IRI testing and on implementing IRIs in Curl, and to Kazunari Ito and many others for providing a great research environment at Aoyama Gakuin University.

References

[Charmod]
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, and Tex Texin, Character Model for the World Wide Web 1.0: Fundamentals, W3C Recommendation 15 February 2005, available at http://www.w3.org/TR/charmod/.
[CharmodResid]
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, and Tex Texin, Character Model for the World Wide Web 1.0: Resource Identifiers, W3C Candidate Recommendation 22 November 2004, available at http://www.w3.org/TR/charmod-resid/.
[CSS2]
Bert Bos, Håkon Wium Lie, Chris Lilley, and Ian Jacobs, Cascading Style Sheets, level 2 - CSS2 Specification, W3C Recommendation 12-May-1998, available at http://www.w3.org/TR/REC-CSS2.
[cURL]
Daniel Stenberg, cURL groks URLs, available at http://curl.haxx.se/.
[HTML4]
Dave Raggett, Arnaud Le Hors, and Ian Jacobs, HTML 4.01 Specification, W3C Recommendation 24 December 1999, available at http://www.w3.org/TR/html4.
[IDNAbis]
P. Fältström, The Unicode Codepoints and IDN, Internet Draft, May 2007, work in progress, available at http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-02.txt.
[IDNAissues]
J. Klensin, Proposed Issues and Changes for IDNA - An Overview, Internet Draft, July 2007, work in progress, available at http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-02.txt.
[IRIreco]
Yoshiro Yoneya, IRI Recognition in Applications, Internet Draft draft-yoneya-iri-recognition-00.txt, February 2007 (expired).
[IRIbis]
Martin Dürst and Michel Suignard, Internationalized Resource Identifiers (IRIs), Internet-Draft, July 2007, work in progress, available at http://www.ietf.org/internet-drafts/draft-duerst-iri-bis-00.txt (to find the current version, replace 00 with 01, 02, and so on).
[Koba2007]
Takeo Kobayashi, Kazunari Ito, and Martin J. Dürst, Internationalization of the Data Transfer Tool curl - Processing IRIs, Proceedings of the 69th Annual Meeting of the Information Processing Society of Japan (IPSJ), Tokyo, March 2007 (in Japanese).
[mod_fileiri]
Martin Dürst, mod_fileiri: new Apache module under development, available at http://www.w3.org/2003/06/mod_fileiri/.
[MulAddr]
Richard Ishida, An Introduction to Multilingual Web Addresses, available at http://www.w3.org/International/articles/idn-and-iri/.
[Perl]
Larry Wall, Tom Christiansen and Jon Orwant, Programming Perl (3rd Edition), O'Reilly, 2000.
[RFC2047]
K. Moore, MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text, RFC 2047, November 1996, available at http://www.ietf.org/rfc/rfc2047.
[RFC2070]
François Yergeau, Gavin Nicol, Glenn Adams, and Martin Dürst, Internationalization of the Hypertext Markup Language, RFC 2070 (historical, superseded by [HTML4]), January 1997, available at http://www.ietf.org/rfc/rfc2070.
[RFC2277]
H. Alvestrand, IETF Policy on Character Sets and Languages, RFC 2277, Best Current Practice, January 1998, available at http://www.ietf.org/rfc/rfc2277.
[RFC3490]
P. Fältström, P. Hoffman, and A. Costello, Internationalizing Domain Names in Applications (IDNA), Proposed Internet Standard, RFC 3490, March 2003, available at http://www.ietf.org/rfc/rfc3490.
[RFC3492]
A. Costello, Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA), Proposed Internet Standard, RFC 3492, March 2003, available at http://www.ietf.org/rfc/rfc3492.
[RFC3987]
Martin Dürst and Michel Suignard, Internationalized Resource Identifiers (IRIs), IETF Proposed Standard, RFC 3987, January 2005, available at http://www.ietf.org/rfc/rfc3987.
[RFC4952]
J. Klensin and Y. Ko, Overview and Framework for Internationalized Email, RFC 4952, July 2007, available at http://www.ietf.org/rfc/rfc4952.
[RubyPrag]
Dave Thomas, with Chad Fowler and Andy Hunt, Programming Ruby - The Pragmatic Programmers' Guide (Second Edition), The Pragmatic Bookshelf, 2005.
[TUS5]
The Unicode Consortium, The Unicode Standard 5.0, Addison-Wesley, 2006, available also at http://www.unicode.org/versions/Unicode5.0.0/.
[UTR15]
Mark Davis and Martin Dürst, Unicode Normalization Forms, Unicode Standard Annex #15, last updated October 2006, available at http://www.unicode.org/reports/tr15/.
[XHTML1]
XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition) - Reformulation of HTML 4 in XML 1.0, W3C Recommendation 26 January 2000, revised 1 August 2002, available at http://www.w3.org/TR/xhtml1.
[XML]
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau, Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation August 2006 (First edition February 1998), available at http://www.w3.org/TR/REC-xml.
[Yone2006]
Kazuhiro Yonekawa, Kazunari Ito, and Martin J. Dürst, Providing an Environment for Testing IRIs in HTML and CSS, Proceedings of the 68th Annual Meeting of the Information Processing Society of Japan (IPSJ), Tokyo, March 2006 (in Japanese).