28th Internationalization and Unicode Conference, September 2005, Orlando, FL, USA

Exploring Better Source Editing for Bidirectional XHTML and XML

Martin J. Dürst, Shiro Horie, and Yusaku Wada
Department of Integrated Information Technology,
College of Science and Engineering, Aoyama Gakuin University
Sagamihara, Kanagawa, Japan

Keywords: Bidirectional text, Arabic, Hebrew, XML, XHTML, HTML, source editor, Unicode Bidirectional Algorithm


(X)HTML and XML owe a significant amount of their popularity and rapid adoption to the fact that they can be viewed and edited easily in source form with a plain-text editor. However, while this is true for all the scripts and languages written left-to-right, for scripts and languages written right-to-left, such as Arabic and Hebrew, there are very serious obstacles to source editing. The root of the problem is that syntax-significant characters, such as angle brackets and quotes, are weak or neutral, which may lead to very confusing display situations.

This paper looks at ways to more easily edit the source of bidirectional XHTML and XML documents by exploring various simulated changes to the Unicode bidirectional algorithm. Examples include the change of syntax-significant characters to strong LTR (or RTL) type, and the embedding of overall elements and/or element content. These simulated changes are explored with respect to several types of documents, in particular documents with LTR or RTL element and attribute names and with different kinds of element and attribute content.

We invite you to test the results described in this paper via the Web interface (at http://www.sw.it.aoyama.ac.jp/cgi-bin/bidi-source-test). An up-to-date version of this paper as well as slides used for the talk connected to this paper are available (at http://www.sw.it.aoyama.ac.jp/2005/pub/IUC28-bidi).

1 Introduction

1.1 The Problem

Bidirectional text denotes text that is written in part left-to-right (LTR) and in part right-to-left (RTL, e.g. for Arabic and Hebrew). Such a mixture can be laid out on a line in various different ways. The Unicode Bidirectional Algorithm [UAX#9] (hereafter simply called bidi algorithm) defines how to actually do this in a reproducible and therefore interoperable way.

The bidi algorithm is carefully designed to produce the behavior generally expected by the user without intervention. Its design is aimed at the general bulk of natural-language running text, from personal notes and Web pages to novels and poems. It also allows controlled intervention when necessary. However, the bidi algorithm is not designed for structured text such as specific data formats or programming languages.

In fact, when the bidi algorithm is applied to a structured format such as XML [XML] or (X)HTML [HTML4, XHTML1], some very strange and confusing artefacts can appear. Running text has relatively little punctuation, and this punctuation follows the direction of the surrounding text. Structured formats, on the other hand, use punctuation to indicate the basic structure of the format. If this punctuation follows the direction of the surrounding text, the structure of the format can easily become obscured.

Here is a simple example piece of XHTML to show what can happen. First, a purely left-to-right version:

<p>Hello, <img src='foo.jpg' alt="world icon"> World!</p>

Now let's replace the text in the alt attribute and after the image with some (meaningless) Arabic or Hebrew:

<p>Hello, <img src='foo.jpg' alt="ابةتث جحخد"> ذرزسش!</p>

As can be seen (marked red), the double quote character and the angle bracket at the end of the img element have been exchanged, and the angle bracket was turned around because it is one of the characters that get mirrored by the bidi algorithm. Not only that, but also the text intended as the alt attribute and the text before the exclamation mark have been exchanged.

Even with this single, simple example, it is clear that editing marked-up documents with bidirectional text and data in this form would be very difficult. Rather than being able to concentrate on the content and the markup structure, users would be constantly confused by the unpredictable and inappropriate behavior of the bidi algorithm.

This problem has been known for quite some time, but, as far as we are aware, has not been addressed or solved. It is only recently, with the advent of otherwise Unicode-enabled XML editors, that this problem has really caught the necessary attention. For a recent draft proposal, see [bidi-editing].

1.2 Goals

The overall goal of our research is the easy and natural reading and editing of XML and (X)HTML in source form. For element and attribute syntax, this can be broken down in more detail as follows. Syntactically significant characters (such as double quote and angle bracket above) should not be displaced in confusing ways. Quotes around attributes should delimit attribute values on both sides. The equal sign should separate attribute name and attribute value. White space should separate element name and attribute name/value pairs. The element name should appear at the start of a tag. Element content should appear between start tag and end tag. Elements should appear nested the way they are actually nested.

Similar considerations apply to other syntactic constructs of XML syntax, such as entity references and numeric character references, CDATA marked sections, processing instructions, comments, and so on, as well as the components of the DTD. (For a short introduction to XML syntax, see Section 2.1.)

In the above paragraphs, we have used words such as easy, natural, or start. None of these words indicates a direction, and this is on purpose. It is very much possible that when displaying a formatted text written entirely in Arabic or Hebrew, every single syntactic element is displayed right-to-left. The above example, using UPPER CASE for right-to-left characters, could very well be displayed as follows:


This points to another important goal: Not to change the actual source text, but only the way it is displayed. Some changes to the source would create syntax errors. Others could change the meaning of the document, in particular the display of the final text (which we call target here). It may be possible to introduce very specific, well-identified changes to the source that could be filtered out before parsing (a simple example would be the convention to use HTML bidi markup for indicating directionality of the target, and Unicode bidi control characters for indicating the directionality of the source). However, a separate filtering step complicates the simple edit-parse cycle, and virtually makes re-editing of processed data impossible because of the difficulty of re-introducing source-specific bidi indicators.

Not changing the actual source text therefore leads to another, less well-defined goal: As much as possible, make the source display automatic, i.e. avoid the need for human intervention and settings. This seems possible for documents that are virtually only right-to-left (the same way this is currently possible for completely left-to-right documents). But we think that for documents with mixed text, different individuals may have somewhat different preferences. This is one reason why we are collecting feedback via the Web interface (at http://www.sw.it.aoyama.ac.jp/cgi-bin/bidi-source-test). We invite you to try it out.

1.3 Mixtures of LTR and RTL

Besides purely LTR documents and almost purely RTL documents (true RTL-only documents are very rare because numbers are always written LTR even in otherwise RTL texts), there is a wide range of mixtures of different directionality, such as the following.

Ideally, display of XML source should adapt to these different mixtures in a way that feels natural to a wide range of users.

1.4 The Importance of Source

Some might argue that most HTML is edited in WYSIWYG (what you see is what you get) editors, rather than in source form. Various editing tools also offer intermediate views between source and WYSIWYG. Examples are the tagged view of XMetaL [XMetaL] (showing tags as little flags in the displayed text) or nvu [nvu] (showing tags as boxes, on the side for block-level elements), and other kinds of structured editors.

However, there are several arguments that show that good source editing is important. A large percentage of authors are highly visually oriented. However, a small but significant group of authors prefers, to some extent or even completely, to work directly with the document source. Many of them are highly knowledgeable and sophisticated power users who drive the adoption and development of new technology.

Even for users working mostly in a visually-oriented mode, occasionally viewing, and editing, source helps to understand the concepts underlying the format they use. It can also greatly simplify some editing operations such as the reorganization of nested structures. It also serves as a reminder that any visual display is only one of many different ways to display the document or data being worked on. In addition, source viewing and editing is very important for education.

All the above arguments are related to the fact that the data that is actually exchanged, or being processed by all kinds of programs, is the source, and nothing but the source. The availability of Web pages in readable textual HTML source was also one of the major factors for the rapid growth of the World Wide Web in the mid-1990s, and its continuing popularity and growth since then.

The success of HTML was repeated some years later with XML, which again grew extremely fast and became extremely popular. It also avoided the main problems of the HTML source code availability, namely the tendency to blindly copy and paste source code, leading to buggier and buggier code. Many other formats have also benefited from the availability of a textual or almost textual format, among them Internet (electronic) mail and protocols such as HTTP. In other cases, the availability of an alternate textual format has improved interoperability of applications. Examples include "comma separated values" (CSV) for spreadsheet-like data, or RTF (Rich Text Format) for Microsoft Word documents.

In one particular way, even despite the current problems, viewing source can be helpful for editing bidi text. This is due to the fact that, with certain exceptions, source formats allow the addition of line breaks wherever white space (spaces, tabs, and so on) is allowed. Because bidi reordering is always restricted to a single line, putting text of different directionality on different lines can clarify the text and make it easier to edit while not affecting the (markup) structure. For more about dealing with the different paragraph/line structure of source and target, please see Section 4.6.

Addressing the source display problem for bidirectional text will help to realize the potential of XML and similar formats for education, easy data interchange, and rapid development for all scripts and languages. It should also lead to greater understanding of how to deal with semistructured, structured, and WYSIWYG views of the same document or data.

1.5 Notation

U+hhhh (with hhhh being between four and six hexadecimal digits) is used to denote Unicode codepoints. For examples in Arabic and Hebrew, we are using artificial text, mostly letters in alphabetic order. Sometimes, we are using UPPER CASE to denote RTL characters.

1.6 Overview

The rest of this paper is organized as follows. First, we provide an overview of some base technology, in particular XML and the Unicode Bidirectional Algorithm (Section 2) and bidirectional features in HTML and CSS (Section 3). Section 4 is devoted to the components of our solution, and Section 5 to the simulation of our solution using HTML. Section 6 discusses conclusions and future work.

2 Background

This section gives background information on various topics related to bidirectional source editing of HTML and XML. Readers already familiar with these topics may skip the respective sections. We first introduce the basic components of XML syntax using a simple example. We then give an overview of the bidi algorithm.

2.1 XML Example

The following is a very small example of XML (Extensible Markup Language [XML]) that gives a good impression of how XML source code looks, and shows XML's salient features:

<authors>
  <author position='associate professor'>Martin J. Dürst</author>
  <author position='senior student'>堀江 史郎</author>
  <author position='senior student'>和田 雄策</author>
</authors>

The pieces between < and >, such as <authors> or </author>, are called tags, which mark up the data or document. A tag starting with </ is an end tag, a tag starting with < only is a start tag. Start tags and end tags come in pairs, except in the case of empty tags, where the slash is put at the end of the tag (e.g. <authors/>, maybe to indicate that there are no authors). The first thing in a start tag (or empty tag), and the only thing in an end tag, is the element name. The element name may be followed by attribute-name=attribute-value pairs. In XML, attribute values always have to be quoted, either with single or double quotes. Element content comes between the start tag and the end tag. It can consist of elements only (with white space for easier reading) or of text only, or it can be mixed (both elements and text). Elements always have to be correctly nested, i.e. <i><b></b></i> is allowed, but <i><b></i></b> is not.

XML allows quite a few other constructs, such as comments (started by <!-- and ended by -->), processing instructions (started by <? and ended by ?>), CDATA sections (started by <![CDATA[ and ended by ]]>) and a DTD (Document Type Definition), a complex construct defining the allowable element structure and attributes in a particular type of document. However, these constructs are less frequent than the XML workhorses, elements and attributes, and similar in syntactic structure so that the results of our work can easily be applied to them.

XML documents can use arbitrary element and attribute names (not limited to the US-ASCII repertoire, i.e. also including Arabic and Hebrew). This makes XML very suited to create custom formats for all kinds of textual documents and data. With good element and attribute names, an XML document is in many ways self-documenting. An XML document that conforms to the basic syntactic rules of XML is called well-formed (a sequence of characters that is not well-formed, on the other hand, can't be an XML document, even if it looks similar to an XML document). An XML document that conforms to the allowable element structure and attributes given in the DTD included in or referenced from that XML document is said to be valid. Not all XML parsers perform validation, but all of them are required to perform well-formedness checking.

For bidirectional display of XML source, well-formedness is important because it is the basic syntactic constructs that have to be displayed in an understandable way to the user. But please note that during source editing, there are often situations where, for some time, a document is not well-formed, e.g. because the user is adding another element, and source editing does not allow adding both the start tag and the end tag at the same time. So one requirement of a solution to bidirectional display of XML source is a certain degree of robustness to temporary non-well-formedness.
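The difference between well-formed and temporarily non-well-formed source can be checked mechanically. The following is a minimal sketch using the XML parser in the Python standard library; the helper name is_well_formed is our own choice for illustration and is not part of any tool described in this paper.

```python
import xml.etree.ElementTree as ET

def is_well_formed(source):
    """Return True if `source` parses as a well-formed XML document."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

# Correctly nested elements are accepted.
print(is_well_formed("<i><b></b></i>"))
# Overlapping elements are rejected.
print(is_well_formed("<i><b></i></b>"))
# A start tag whose end tag has not been typed yet is rejected as well.
print(is_well_formed("<author position='senior student'>"))
```

An editor cannot simply refuse to display the second and third inputs, which is why the display logic has to tolerate them.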

HTML, up to [HTML4], predates XML; it was an application (a certain fixed set of elements and attributes, with defined ways to combine them) of SGML, the predecessor of XML. XHTML (1.0) [XHTML1] is a reformulation of HTML 4 in XML, introducing some minor syntax changes (backwards-compatible if following the relevant guidelines) to achieve XML well-formedness.

2.2 The Unicode Bidirectional Algorithm

Not all scripts are written in the same direction. The Latin script (used for English and many other languages) is written in horizontal lines left to right (LTR). The same applies for most other scripts of the world. In some cases, such as Chinese and Japanese, horizontal LTR lines are not the only possibility, but they are widely used, in particular in technical contexts and on computers. On the other hand, some widely used scripts, such as Arabic and Hebrew, are written in horizontal lines right to left (RTL). Combining LTR and RTL scripts in the same text, in particular in the same line, leads to various ambiguities that have to be resolved carefully for reproducible and interoperable results.

The Unicode Bidirectional Algorithm [UAX#9], here simply called bidi algorithm, defines how LTR and RTL scripts are combined. We provide a short explanation of the most important concepts. It is crucial to understand that the text is always stored in logical order, i.e. (roughly) the way it is spoken or typed. Storing text in visual order, as it was done at the advent of digital text processing in particular for Hebrew, can simplify some simple display cases. However, storing text in visual order makes non-visual rendering (text-to-speech conversion), more complex rendering operations such as paragraph reflow, and high-level text processing virtually impossible.

Each character has a directionality, given as a property in the Unicode Character Database. Roughly, this can be strong, either L (left, for letters of scripts commonly written LTR) or R (right, for letters of scripts commonly written RTL), or weak or neutral for punctuation and similar characters that are used with many scripts. Both European-Arabic digits (123...) and Arabic-Indic digits (١٢٣...) are weak (types EN and AN, respectively) but are always displayed LTR, which explains why the bidi algorithm is needed even for Arabic-only or Hebrew-only texts. Digits are also the main culprits for the complication of the bidi algorithm, which tries to make sure that things such as decimal fractions and monetary amounts are displayed naturally.
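The directionality property can be inspected directly. The sketch below uses the unicodedata module of the Python standard library; the helper name describe is an assumption made for this illustration. Right-to-left characters are written as escapes so that the listing itself is not reordered.

```python
import unicodedata

def describe(ch):
    """Return (bidi class, mirrored flag) for a single character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

samples = [
    "a",        # Latin letter
    "\u05d0",   # HEBREW LETTER ALEF
    "\u0627",   # ARABIC LETTER ALEF
    "1",        # European digit
    "\u0661",   # ARABIC-INDIC DIGIT ONE
    "<", ">", '"', "=",   # XML syntax characters
]
for ch in samples:
    bidi_class, mirrored = describe(ch)
    print(f"U+{ord(ch):04X}  {bidi_class:<3} mirrored={mirrored}")
```

The syntax characters all report the neutral class ON, and the angle brackets are additionally mirrored, which matches the artefacts seen in the example of Section 1.1.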

The implicit part of the bidi algorithm, i.e. without any additional instructions, makes sure that (sequences of) words of L characters are displayed LTR, (sequences of) words of R characters are displayed RTL, and punctuation follows the surrounding characters that have their own direction. The implicit part of the algorithm needs one additional parameter: whether to display a sequence of LTR and RTL segments starting at the left or at the right. This is called the base directionality. A base directionality of RTL is best for occasional LTR text fragments in an overall RTL text, and vice versa. The bidi algorithm takes the first strong letter to determine the base directionality but allows this to be overridden (see [UAX#9, HL1]). The main means at the disposal of an author to influence the implicit part of the bidi algorithm are the LEFT-TO-RIGHT MARK (LRM, U+200E) and the RIGHT-TO-LEFT MARK (RLM, U+200F). These (formatting) characters have a strong directionality and therefore can affect nearby weak or neutral characters, but otherwise have no width and are purely transparent for other operations.

The explicit part of the bidi algorithm deals with two more complex situations: First, multiple embeddings (e.g. English in Arabic or Hebrew in English), and second, cases where even strong character directionality has to be overridden (e.g. for part numbers). Each embedding or override is introduced with a special formatting character, of which there are overall four, starting with LEFT-TO-RIGHT EMBEDDING (LRE, U+202A), and continuing, respectively, with RLE (U+202B), LRO (U+202D), and RLO (U+202E). These are ended with a POP DIRECTIONAL FORMATTING (PDF, U+202C), which ends the most recent still active override or embedding. This functionality is limited to a paragraph.
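The marks and the explicit formatting characters are ordinary code points and can be handled like any other character. The following sketch lists them and shows how an embedding or override is paired with its closing PDF; the helper names embed and override are hypothetical and only serve this illustration.

```python
import unicodedata

LRM, RLM = "\u200e", "\u200f"
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"

def embed(text, direction="ltr"):
    """Wrap `text` in an explicit embedding, closed by PDF."""
    return (LRE if direction == "ltr" else RLE) + text + PDF

def override(text, direction="ltr"):
    """Wrap `text` in an explicit override, closed by PDF."""
    return (LRO if direction == "ltr" else RLO) + text + PDF

for ch in (LRM, RLM, LRE, RLE, LRO, RLO, PDF):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  ({unicodedata.bidirectional(ch)})")
```

Note that LRM and RLM report the strong classes L and R, whereas the embedding and override characters have bidi classes of their own.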

3 Bidirectionality in HTML and CSS

This section very briefly describes the solutions taken in HTML (see [HTML4, Section 8.2], also applying to XHTML) and CSS (see [CSS2, Section 9.10]) to the problem of bidi text. The HTML solution goes back to [RFC2070]. An understanding of these solutions is helpful for three reasons: First, HTML is a prime example of the kind of formats for which we are trying to develop better bidirectional source editing. Second, it is helpful to understand how we have used HTML to explore and prototype different ways of displaying bidirectional source. Third, in particular CSS may be useful to indicate user display preferences to a source editor.

3.1 HTML: Named Character Entities

HTML provides the two named (character) entities &lrm; (for LRM) and &rlm; (for RLM) for easy visual identification and for use in encodings that do not include these characters. These characters can also appear directly in the source code. These named entities can be viewed as pure syntactic convenience, because they are replaced with the actual characters by an XML parser.
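That the named entities are only a notation for the characters themselves can be seen with any HTML-aware decoder. As a small sketch, the html module of the Python standard library resolves both names to the corresponding marks; the function name resolve_entities is our own and purely illustrative.

```python
import html

def resolve_entities(text):
    """Replace HTML character references in `text` with the characters they name."""
    return html.unescape(text)

resolved = resolve_entities("abc!&rlm; def&lrm;")
# Print the code points so that the invisible marks become visible.
print([f"U+{ord(ch):04X}" for ch in resolved if ch in "\u200e\u200f"])
```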

3.2 HTML: Base Directionality, Embeddings, and Overrides

The basic directionality of an element can be indicated by the dir attribute, with values dir='rtl' and dir='ltr' (the latter being the default). This attribute is inherited on the block level, i.e. paragraph and higher, to set the base directionality for a whole document or part of a document with a single setting on a high-level element. It serves both to indicate the base directionality of a paragraph (also called the paragraph embedding) as well as the embedding of inline (smaller than paragraph) elements.

For overrides, HTML has the bdo (bidirectional override) element. The dir attribute is used to indicate the direction of the override.

3.3 HTML: Markup, not Formatting Characters

Why does HTML use markup, not formatting characters, for base directionality, embeddings, and overrides? First, this makes these settings visible and usable even in encodings that do not include the corresponding characters. Second, the structure of HTML can be used to inherit the base directionality, which simplifies settings. Simply setting the base directionality of a Web page is often all that is needed, besides a few marks (named character entities). Third, embeddings and overrides virtually always correspond to document structures (examples, citations, part numbers,...) that are marked up anyway. HTML very strongly discourages the use of formatting characters for embeddings and overrides because a mixture of markup and formatting characters can lead to very badly defined situations if the tags and characters are not properly nested. For additional background, see also [bidi-markup].

3.4 CSS

If HTML defines markup for bidi rendering, why should CSS provide its own functionality? There are several reasons for this: First, to allow defining the rendering semantics of the HTML bidi markup (see [CSS2, sample HTML stylesheet]). Second, to define the rendering semantics for other XML markup. Third, to tweak HTML rendering in special situations (such as when rendering Yiddish written in Hebrew with Latin letters and therefore with a different directionality). For further explanations, see also [bidi-css].

CSS properties may also be used to describe settings or preferences for (certain types of) source documents. To define the directionality of start-tags and end-tags, additional pseudo-elements such as :start-tag and :end-tag may be necessary.

CSS defines two properties to indicate bidirectional rendering: direction, with values of ltr or rtl (or inherit) to define the direction of some text, and unicode-bidi, with values normal, embed, or bidi-override (or inherit). A value of normal means no additional embedding level. A value of embed means an actual embedding; bidi-override an actual override. The reason for having two separate properties is that this makes it easier to define rendering behavior for cases such as HTML.

4 Solution Components

In this section, we describe the components of our solution for easy source editing of bidirectional text. When looking at source code including bidirectional text, there are several different levels of reordering problems that need to be addressed. The first level is the character level. Other levels are element content, tags including attribute structure, overall elements, specific data types, and a 'semantic level'.

It is important to note that we do not change, nor intend to change, the bidi algorithm, but we are taking advantage of its provisions for applying a higher-level protocol. Indeed, the explanation in the relevant section of the bidi algorithm [UAX#9, Higher-Level Protocols] explicitly mentions XML source editing as an example. On the other hand, in our discussion and in the implementation of our simulation (see Section 5) we sometimes prefer to talk about changing properties, or using bidirectional marks (LRM or RLM), rather than using embeddings, even though the bidi algorithm favors embeddings. This is done for simplicity and because the former can always be replaced by the latter.

4.1 Character Level

Most of the confusion caused by displaying bidi source code directly with the bidi algorithm, as designed for running text, is due to syntactically significant characters being displaced because they are of some weak or neutral type. We therefore change the properties of the syntax-significant characters to be strong (LTR for a start, but later also RTL). What counts as a syntax-significant character depends on context and is to some extent a matter of choice.

As an example, & is syntax-significant (introducing an entity or a numeric character reference) in content and attribute values, and forbidden in many other contexts, but is not syntax-significant in comments, CDATA sections, and processing instructions. # is syntax-significant only after a &, and ; only the first time it appears thereafter. < is syntax-significant in element content, but not in attribute values. The list goes on. A colon (:) as part of an element name may be seen as a syntax-significant separator for XML Namespaces [Namespaces1.1], or may be seen just as part of the element name. There are also syntax-significant white space characters that need to be considered, in particular in start tags.
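One way to approximate a strong LTR property for display purposes is to surround each syntax-significant character with LEFT-TO-RIGHT MARKs. The following is a deliberately simplified sketch under that assumption, not the implementation of our simulation: the function name mark_syntax_chars and the chosen character set are illustrative, and the context rules are reduced to a single inside-a-tag flag (for example, an = inside a quoted attribute value is still treated as significant).

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK

def mark_syntax_chars(source):
    """Return a display string in which syntax characters are wrapped in LRMs.

    Angle brackets are always treated as significant; =, quotes, and /
    only between an opening < and the next >.
    """
    out = []
    in_tag = False
    for ch in source:
        if ch == "<":
            in_tag = True
        significant = ch in "<>" or (in_tag and ch in "='\"/")
        out.append(LRM + ch + LRM if significant else ch)
        if ch == ">":
            in_tag = False
    return "".join(out)

display = mark_syntax_chars("<p>a=b</p>")
# Removing the marks again yields the unchanged source text.
print(display.replace(LRM, "") == "<p>a=b</p>")
```

The stored source is never modified; only the string handed to the display layer carries the additional marks.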

4.2 Tag Level

A single tag, from the starting < to the ending >, can be rendered in different directions. It is important that different attribute-value pairs are separated by white space, that attribute names and attribute values are separated by =, and that attribute values are quoted on both sides (none of it guaranteed!). All of this can be achieved at the character level. However, if there is a high percentage of RTL characters, in particular in element and attribute names, it may be more natural to reorder a whole tag to read RTL rather than LTR.

How to decide the tag reading direction still needs some testing and experimentation. Approaches include always using LTR, using the directionality of (the first strong character of) the element name, or using overall or per-element-name settings.
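The second approach, taking the direction from the first strong character of the element name, can be sketched as follows. The function name first_strong_direction and the LTR default for names without any strong character are assumptions made for this example.

```python
import unicodedata

def first_strong_direction(name, default="ltr"):
    """Return 'ltr' or 'rtl' according to the first strong character in `name`."""
    for ch in name:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            return "rtl"
    return default  # no strong character found

print(first_strong_direction("author"))
print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd"))  # a Hebrew element name
```

A per-element-name setting could be layered on top by consulting a lookup table before falling back to this function.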

4.3 Element Level

Both the directionality of the element content as well as the directionality of the overall element including start and end tag need to be considered. For consistent editing, it is desirable that the element content always comes between start tag and end tag. However, there are situations where this is not automatically the case. Here is a simple example. The following source:

<p>hello <span style='color: blue'>world אבגדה</span> ןנסעף</p>

is correctly rendered as follows:

hello world אבגדה ןנסעף

As can be seen, the span colored blue is separated into two pieces. It is totally unclear where the start and end tags should be placed, but on the other hand, it is preferable to somehow indicate such a situation, because it is most probably a consequence of inadequate markup (either an embedding should be used on the <span>, or the two pieces should be marked up separately).

4.4 Specific Datatypes

Some specific datatypes may require particular bidirectional treatment. Typical examples are Web addresses. [RFC3987] defines in Section 4 that Internationalized Resource Identifiers (IRIs) are to be rendered as if in an LTR context. For (X)HTML, the elements and attributes that are Web addresses and have to be treated as IRIs if they contain non-ASCII characters are well known, and an editor can apply the necessary LTR context.

For arbitrary XML formats, the information about which elements and attributes to treat as IRIs or otherwise in a special way has to be provided externally. For element content, it is easy to do this with CSS, but for attribute values, some additional CSS selector syntax would be needed.

4.5 Semantic Level

As discussed above, (X)HTML includes various means to influence the bidi behavior of the target using markup. Other formats may use similar features. For lack of a better term, we call this the semantic level. A good solution for bidirectional source editing should not only make the existing markup easily readable, but should also take into account the semantic level to help represent the textual content in a way that closely reflects the target behavior.

As an example, in a simple rendering, an &rlm; in the source code is just rendered as a series of LTR characters, without any effect on its surroundings. Taking the example from [bidi-inline], the original text without &rlm; reads like this:

The title is "مفتاح معايير الويب!" in Arabic.

Here, the exclamation mark is at the wrong end of the Arabic text; the intended display looks as follows:

The title is "مفتاح معايير الويب!‏" in Arabic.

This can be achieved by inserting an &rlm; after the exclamation mark. However, a straightforward display of the source will look as below, not showing to the author that the problem has been solved.

The title is "مفتاح معايير الويب!&rlm;" in Arabic.

Only if the &rlm; is actually made to behave like a RIGHT-TO-LEFT MARK, displaying as below, will it be easy to understand for the author.

The title is "مفتاح معايير الويب!‎&rlm;‎" in Arabic.
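One possible way to obtain such a display, sketched here under our own assumptions rather than as the method used in the simulation, is to place the actual mark in front of the visible entity reference when building the display string. The name show_marks is hypothetical, and only the two named entities are handled.

```python
import re

MARKS = {"lrm": "\u200e", "rlm": "\u200f"}

def show_marks(visible_source):
    """Prefix each &lrm; or &rlm; with the real mark, for display only."""
    return re.sub(
        r"&(lrm|rlm);",
        lambda m: MARKS[m.group(1)] + m.group(0),
        visible_source,
    )

display = show_marks('title is "ABC!&rlm;" in Arabic.')
# The entity text stays readable; an invisible RLM now precedes it.
print(display.count("\u200f"), display.count("&rlm;"))
```

Because the mark is strong, the neutral character in front of it (the exclamation mark in the example above) is resolved in the direction the author intended, while the entity reference itself remains visible as LTR text.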

Similar considerations apply for base directionality, embeddings, and overrides. Before we can discuss them in detail, we have to look at the issue of line separation (see next section).

4.6 Line Separation

For running text, line layout is taken into consideration only at the end of the bidi algorithm. [Dür2004, slide 9] gives an overview of the steps of the bidi algorithm and the text units they apply to. However, this is different in the case of source code. On the one hand, source code does not know the concept of paragraphs; for the purpose of the bidi algorithm, each line of the source code is treated as a paragraph. Source code lines are also usually rather short, and in many editors are not wrapped even if they are wider than the available display width.

So a simple solution to the rendering of bidi source code is to look only at single lines. However, this is not sufficient. Markup constructs and paragraphs of the target are usually spread over more than one line of source. One line of source can also contain pieces of several paragraphs of the target, although this is not good practice.

Separating bidi source into many short lines with as much as possible a single directionality in each line is sometimes a good way to overcome various kinds of confusing artifacts, because vertically, text is always arranged in logical order. To take this into account, our simulation also offers a mode where only an empty line in the input is actually interpreted as a linebreak. Adding linebreaks (interpreted as spaces in the final display) however has to be done with care. Sometimes spaces inadvertently disappear in the final display [bidi-space].
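The mode in which only an empty line counts as a linebreak can be sketched as a small preprocessing step. This is an illustration under our own assumptions (the function name display_lines is hypothetical, and single line breaks are replaced by one space), not the code of the simulation.

```python
import re

def display_lines(source):
    """Split `source` at empty lines and join the lines in between with spaces."""
    blocks = re.split(r"\n(?:[ \t]*\n)+", source.strip("\n"))
    return [
        " ".join(line.strip() for line in block.splitlines() if line.strip())
        for block in blocks
    ]

text = "<p>hello\nworld</p>\n\n<p>next</p>\n"
for line in display_lines(text):
    print(line)
```

Each resulting line is then treated as one paragraph by the bidi algorithm, so directional runs that the author placed on separate source lines are reordered together.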

For the author, one of the problems of the line separation in source is that it is more difficult to understand the overall bidi structure of a paragraph. Even with editing tools for running text, this is a problem to some extent because the author works on a fixed width, but during editing, frequent reflow provides some visual experience of the bidi structure. On the other hand, in the case of source editing, explicit embeddings and overrides are visible. The author has to work more mentally than by hands-on experience to get the bidi structure right, but this applies to source editing in general.

Parsing the source including semantic level information across lines and then mapping this information to a bidi structure including embeddings and overrides for a single line is the most difficult part of our solution. For the moment, we concentrate on a correct and easy-to-understand implementation, but for use in an actual editor, we will have to work on speed.
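The core difficulty, carrying parsing state from one source line to the next, can be sketched as follows. This is a deliberately simplified Python illustration and not the parser used in our simulation: it only tracks whether the scanner is inside a tag, and it ignores comments, CDATA sections, and `>` characters inside attribute values. The function name and the labels `markup` and `content` are hypothetical.

```python
def segment_lines(source: str) -> list[list[tuple[str, str]]]:
    """Label the pieces of each line as 'markup' or 'content'.

    The inside-a-tag state is carried across line boundaries, so a tag
    that starts on one line and ends on the next is labeled correctly
    on both lines.
    """
    in_tag = False  # state that survives from line to line
    result = []
    for line in source.splitlines():
        segments = []
        buf = ""
        for ch in line:
            if ch == "<" and not in_tag:
                if buf:
                    segments.append(("content", buf))
                buf = ch
                in_tag = True
            elif ch == ">" and in_tag:
                segments.append(("markup", buf + ch))
                buf = ""
                in_tag = False
            else:
                buf += ch
        if buf:
            # The line ended in the middle of a tag or of content.
            segments.append(("markup" if in_tag else "content", buf))
        result.append(segments)
    return result
```

Each per-line list of segments could then be mapped to embeddings or overrides for that single line; that mapping is left out here.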

5 Simulation

To what extent a goal such as easy and natural reading and editing has been reached is difficult to judge without serious experimentation and testing. In order to get feedback from a wider audience with actual day-to-day experience with bidi scripts, we have created a simulation using XHTML.

This simulation is available via the Web interface (at http://www.sw.it.aoyama.ac.jp/cgi-bin/bidi-source-test). Interested readers are strongly encouraged to test the simulation both with (X)HTML and XML source that they already have as well as with new examples, and to provide feedback. In particular, we are looking for real-life examples with Arabic or Hebrew element and attribute names and examples with complex bidi markup, e.g. several nested embeddings or overrides. Hand-produced source code is preferred to automatically generated code (which is often needlessly complex). However, we understand that there is a chicken-and-egg problem, i.e. it is difficult to manually produce bidi source code with the current editors.

5.1 Using XHTML for Simulation

There are many reasons for using XHTML for the simulation. XHTML can be displayed anywhere, its bidi behavior is widely understood and widely implemented, and it can be used for remote (over the Web) testing and feedback. XHTML's bidi behavior is easily influenced with markup, which means that debugging is easy (all the benefits of source availability apply). There are additional synergy benefits because intuitive display of (X)HTML source is also one of the goals of our project.

The use of (X)HTML for the simulation also occasionally created some confusion (and may also confuse the reader). To avoid the confusion, we adopt distinctive terms. We call the source that is being simulated the visible source, and the source of the XHTML used for the simulation the hidden source. The following table shows a very simple example:

Table 1: Terms for source editing and simulation
Term            Example
target          hello (displayed with emphasis)
visible source  <em>hello</em>
hidden source   &lt;em>hello&lt;/em>

The example only shows the change from formatted text to marked up text when moving from target to visible source, and the escaping necessary for the additional hidden source level. Our simulation essentially takes the visible source as input, and produces the hidden source. This hidden source is then displayed to look like the visible source. The above example does not include any bidi-related changes.
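The escaping step from visible source to hidden source in Table 1 can be sketched in Python as follows. The function name is hypothetical; the sketch escapes only `&` and `<`, which matches the table, where `>` is left as it is.

```python
def to_hidden_source(visible: str) -> str:
    """Escape visible source so that a browser displays it literally."""
    # '&' has to be escaped first; otherwise the '&' of '&lt;' would be
    # escaped a second time.
    return visible.replace("&", "&amp;").replace("<", "&lt;")


print(to_hidden_source("<em>hello</em>"))  # prints &lt;em>hello&lt;/em>
```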

5.2 Implementing Bidi-related Changes

Implementing bidi-related simulation changes requires careful parsing of the source across lines, escaping of syntax-significant characters, surrounding of syntax-significant characters with directionality marks to make them strong, and use of embeddings or overrides to conserve the overall markup structure as well as the information on the semantic level.
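Two of these steps, escaping and surrounding syntax-significant characters with directionality marks, can be combined in a small Python sketch. The set of characters treated as syntax-significant, the function name, and the use of the `&lrm;` and `&rlm;` entity references (instead of the literal mark characters) are assumptions for illustration only.

```python
# Characters treated as syntax-significant in this sketch (an assumption).
SYNTAX_CHARS = set("<>=\"'")
# Characters that have to be escaped in the hidden source.
ESCAPES = {"<": "&lt;", "&": "&amp;"}


def strengthen_syntax(visible: str, base: str = "ltr") -> str:
    """Escape visible source and give syntax characters a strong direction.

    Each syntax-significant character is surrounded by left-to-right
    marks for base 'ltr' and by right-to-left marks otherwise.
    """
    mark = "&lrm;" if base == "ltr" else "&rlm;"
    out = []
    for ch in visible:
        escaped = ESCAPES.get(ch, ch)
        if ch in SYNTAX_CHARS:
            out.append(mark + escaped + mark)
        else:
            out.append(escaped)
    return "".join(out)
```

The sketch works character by character and does not yet add the embeddings or overrides that are needed to conserve the overall markup structure.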

As an example, we take the use of an &rlm; in Section 4.5, where the visible source should look as follows:

The title is "مفتاح معايير الويب!‎&rlm;‎" in Arabic.

A <span> element with dir='rtl' has to be used to make the &rlm; text behave like itself, i.e. as an RTL component. The resulting hidden source looks as follows:

The title is "مفتاح معايير الويب!<span dir="rtl">‎&amp;rlm;‎</span>" in Arabic.

For full generality, it is also necessary to surround the actual &rlm; text with &lrm;s, i.e. to change &amp;rlm; to &lrm;&amp;rlm;&lrm; to make sure that the weak characters & and ; are not displaced.
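The transformation described for this example can be sketched in Python. The function name is hypothetical, and the sketch handles only the `&rlm;` reference; a corresponding treatment of other references is not shown.

```python
def hide_rlm_reference(visible: str) -> str:
    """Produce hidden source in which a visible '&rlm;' behaves as RTL."""
    # Escape first, so that '&rlm;' in the visible source becomes '&amp;rlm;'.
    escaped = visible.replace("&", "&amp;").replace("<", "&lt;")
    # Surround the escaped text with &lrm; so that '&' and ';' are not
    # displaced, and wrap everything in an RTL span.
    wrapped = '<span dir="rtl">&lrm;&amp;rlm;&lrm;</span>'
    return escaped.replace("&amp;rlm;", wrapped)
```

Because the replacement happens after escaping, an `&amp;rlm;` that appears literally in the visible source is escaped to `&amp;amp;rlm;` and is not wrapped.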

5.3 Configuration

We have contemplated several levels of configuration, and implemented some of them in our simulation. There is a switch for the base directionality of the overall source code. It affects the directionality of the syntax-significant characters as well as the base directionality of the source code lines. In an actual implementation, it could be a per-system option, a per-user option, or a per-document option.

We are also working on an option to use the directionality of the first strong character (of the element name for tags, and so on) to determine directions, and we plan to look into using CSS (if necessary with some extensions) for configuration.
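Determining the direction from the first strong character can be sketched with Python's standard `unicodedata` module. The function name and the `default` parameter are hypothetical; the bidi classes L, R, and AL are the strong types defined in [UAX#9].

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Return 'ltr' or 'rtl' according to the first strong character.

    Falls back to `default` if the text contains no strong character,
    for example if it consists only of digits and punctuation.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):  # Hebrew is R, Arabic is AL
            return "rtl"
    return default
```

For a tag, `text` would be the element name; the `default` value could come from the base directionality switch described above.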

6 Conclusions and Future Work

We have explored better ways to display and edit bidirectional (X)HTML and XML sources. We have shown how it is possible to use the Unicode Bidirectional Algorithm with some higher-level settings to achieve a much more natural and easily readable source display. We have also described the use of XHTML for simulating our display improvements and making them testable over the Web.

Interested parties are highly encouraged to use our simulation (at http://www.sw.it.aoyama.ac.jp/cgi-bin/bidi-source-test). After evaluating our solution based on the feedback from the simulation, we plan to implement our solution in some actual editor(s). We also hope that our results can be used for other views of XML and (X)HTML, such as structured and tagged views.

We are also thinking about working on other source code formats. Source code of programming languages in some cases allows non-ASCII text only in very limited contexts (character and string constants, for example). In that case, adaptation may turn out to be a straightforward simplification. In other cases, where non-ASCII characters can be used in identifiers or even in keywords, the complexity may be higher than for XML because there may be a greater variety of syntactic constructs.


Acknowledgments

Many thanks go to Richard Ishida (outreach material on the W3C Web site and discussions), Yannis Haralambous (pointing out the problem years ago), the many contributors to the Unicode Bidirectional Algorithm, the first author's former colleagues at the W3C, and all the people who provide a great research environment at Aoyama Gakuin University.


References

Richard Ishida, FAQ: CSS vs. markup for bidi support, article available at http://www.w3.org/International/questions/qa-bidi-css-markup.
Richard Ishida, Editing requirements for XML markup and RTL scripts, article (rough draft) available at http://people.w3.org/rishida/articles/bidi-editing.html.
Richard Ishida, What you need to know about the bidi algorithm and inline markup, article available at http://www.w3.org/International/articles/inline-bidi-markup.
Richard Ishida, FAQ: Bidi formatting codes vs. markup in (X)HTML, article available at http://www.w3.org/International/questions/qa-bidi-controls.
Richard Ishida, FAQ: Bidi space loss, article available at http://www.w3.org/International/questions/qa-bidi-space.
Bert Bos, Håkon Wium Lie, Chris Lilley, and Ian Jacobs, Cascading Style Sheets, level 2 - CSS2 Specification, W3C Recommendation 12-May-1998, available at http://www.w3.org/TR/REC-CSS2.
Martin J. Dürst, Fun with Regular Expressions: An Implementation of the Unicode Bidi Algorithm, 26th Internationalization & Unicode Conference, September 2004, San Jose, CA, U.S.A., presentation only, available at http://www.w3.org/2004/Talks/IUC26bidi.
Dave Raggett, Arnaud Le Hors, and Ian Jacobs, HTML 4.01 Specification, W3C Recommendation 24 December 1999, available at http://www.w3.org/TR/html4.
Tim Bray, Dave Hollander, Andrew Layman, and Richard Tobin, Namespaces in XML 1.1, W3C Recommendation 4 February 2004, available at http://www.w3.org/TR/xml-names11.
nvu, see http://www.nvu.com/.
François Yergeau, Gavin Nicol, Glenn Adams, and Martin Dürst, Internationalization of the Hypertext Markup Language, RFC 2070 (historical, superseded by [HTML4]), January 1997, available at http://www.ietf.org/rfc/rfc2070.txt.
Martin Dürst and Michel Suignard, Internationalized Resource Identifiers (IRIs), RFC 3987, IETF Proposed Standard January 2005, available at http://www.ietf.org/rfc/rfc3987.txt.
Mark Davis, The Bidirectional Algorithm, Unicode Standard Annex #9, 1999-08-17 - 2005-03-25 (repeatedly updated), available at http://www.unicode.org/reports/tr9/.
XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition) - Reformulation of HTML 4 in XML 1.0, W3C Recommendation 26 January 2000, revised 1 August 2002, available at http://www.w3.org/TR/xhtml1.
XMetal, see http://www.xmetal.com.
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau, Extensible Markup Language (XML) 1.0 (Third Edition), W3C Recommendation February 2004 (First edition February 1998), available at http://www.w3.org/TR/REC-xml.