Tutorial
Equivalence, Mapping, and Normalization

http://www.sw.it.aoyama.ac.jp/2014/pub/NormEquivTut

IUC 38, Santa Clara, CA, U.S.A., 3 November 2014

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

   

© 2013-4 Martin J. Dürst, Aoyama Gakuin University

About the Slides

The most up-to-date version of the slides, as well as additional materials, is available at http://www.sw.it.aoyama.ac.jp/2014/pub/NormEquivTut.

These slides are created in HTML, for projection with Opera (≤12.16, Windows/Mac/Linux; use F11). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

Abstract

The wealth of characters in Unicode means that there are many ways in which characters or strings can be equivalent, similar, or otherwise related. In this tutorial, you will learn about all these relationships, in order to be able to use this knowledge for dealing with Unicode data or programs handling Unicode data.

Character relationships and similarities in Unicode range from linguistic and semantic similarities at the 'top' to equivalent representations in different character encodings and Unicode encoding forms at the bottom, with numerical and case equivalences, compatibility and canonical equivalences, and graphic similarities in the middle. The wealth of equivalences and relationships is due to the rich history of human writing as well as to the realities of character encoding policies and decisions.

Each of these relationships is ignorable in some processing contexts, but may be crucial in others. Processing contexts may range from use as identifiers (e.g. user ids and passwords) to searching and sorting. For most of the equivalences, data is available in the Unicode Standard and its associated data files, but the use of this data or the functions provided by various libraries requires understanding the background of the equivalences. When testing for equivalence of two strings, the general strategy is to normalize both strings to a form that eliminates accidental (in the given context) differences, and then compare the strings on a binary level.

The tutorial will not only look at officially defined equivalences, but will also discuss variants that may be necessary in practice to cover specialized needs. We will also discuss the relationships between various classes of equivalences, necessary to avoid pitfalls when combining them, and the stability of the equivalences over time and under various operations such as string concatenation.

The tutorial assumes that participants have a basic understanding about the scope and breadth of Unicode, possibly from attending tutorials earlier in the day.
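The normalize-then-compare strategy described above can be sketched in Python with the standard unicodedata module (the function name is ours, not from the slides):

```python
import unicodedata

def canonically_equal(a, b):
    """Normalize both strings to NFC, then compare on the binary level."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# U+00C5 (precomposed Å) vs. A + U+030A COMBINING RING ABOVE
print(canonically_equal("\u00C5", "A\u030A"))  # True
print("\u00C5" == "A\u030A")                   # False: binary comparison alone fails
```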

Audience

 

Main Points

 

Speaker Normalization

Let's normalize the speaker, or in other words see how the speaker was involved in normalization!

 

Tell me the Difference

What's the difference between the following characters:

৪ and 8 and ∞
Bengali 4, 8, and infinity

க௧
Tamil ka and Tamil 1

and
same codepoint, different language preference/font

樂, 樂 and 樂
different Korean pronunciation, different codepoints

 

A Big List

Linguistic and Semantic Similarities color, colour, Farbe, 色 not discussed
Transliterations/Transcriptions Putin, Путин, 푸틴, プーチン, 普京, پوتین, بوتين
Accidental Graphical Similarities T, ⊤, ┳, ᅮ this talk's topic
Script Similarities K, Κ, К; ヘ、へ
Cross-script 'Equivalences' な, ナ
Numeric 'Equivalence' 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including a few Han ideographs)
Case 'Equivalence' g, G; dž, Dž, DŽ; Σ, ς, σ
Compatibility Equivalence ⁴, ₄, ④, ⒋, 4
Canonical Equivalence Å, Å; Ṝ, Ṛ +  ̄, R + ̣ + ̄
Unicode Encoding Forms UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71 not discussed
(Legacy) Encodings ISO-8859-X, Shift_JIS,...

Similarities and equivalences range from binary (at the bottom) to semantic (at the top). In this tutorial, we assume a uniform Unicode Encoding Form (e.g. UTF-8) and don't deal with linguistic and semantic similarities, but discuss all the layers in between.

 

Why all these Similarities and Variants?

 

Why Does it Matter?

(actual case!)

 

Applications

 

Equivalences/Mappings Defined by Unicode

It's Compatibility Equivalence officially, but we will use Kompatibility Equivalence (as in NFKC/NFKD) hereafter for better distinction from Canonical Equivalence.

 

Definitions by Others

 

Something Familiar: Casing

A ↔ a

...

Z ↔ z

Is it actually that easy?
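It is not: simple one-to-one mappings break down in several well-known cases, which can be checked in Python (str.upper/str.lower implement the full Unicode case mappings):

```python
# German sharp s uppercases to two characters (full case mapping):
print("ß".upper())          # 'SS'
print(len("ß"), len("SS"))  # 1 2 — casing can change string length

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to
# i + U+0307 COMBINING DOT ABOVE:
print("İ".lower() == "i\u0307")  # True

# Round-tripping is not guaranteed:
print("SS".lower())         # 'ss', not 'ß'
```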

 

Special Casings

 

Case Equivalence/Mapping

 

Data for Casing

 

Equivalence vs. Mapping

Equivalence is often defined using a mapping. As an example, an equivalence FOOeq may be defined with a mapping FOOmap. Two strings A and B are FOOequivalent if (and only if) the FOOmapping of A and the FOOmapping of B are equal. Put the other way round, every mapping defines an equivalence.
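As a sketch in Python, with case folding standing in for the FOOmapping (the function names are illustrative only):

```python
def foo_equivalent(a, b, foo_map=str.casefold):
    """A and B are FOO-equivalent iff their FOO-mappings are binary-equal."""
    return foo_map(a) == foo_map(b)

print(foo_equivalent("Strasse", "STRASSE"))  # True
print(foo_equivalent("ß", "ss"))             # True: casefold maps ß to 'ss'
```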

 

Security Conditions:

 

Idempotence

A function is idempotent if applying it more than once gives the same result as applying it once.

f(f(x)) = f(x) ⇔ f is idempotent

Caution: The combination of two independently idempotent functions may not be an idempotent function.

f(f(x)) = f(x), g(g(x)) = g(x) ↛ f(g(f(g(x)))) = f(g(x)), i.e. f(g(x)) may not be idempotent

Examples:
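A toy illustration (not from the slides): stripping whitespace and truncating to a fixed length are each idempotent, but their composition is not:

```python
strip = str.strip                  # idempotent
truncate = lambda s: s[:3]         # idempotent
h = lambda s: truncate(strip(s))   # composition of the two

x = "a  b c"
assert strip(strip(x)) == strip(x)
assert truncate(truncate(x)) == truncate(x)

print(repr(h(x)))     # 'a  ' — truncation leaves trailing spaces
print(repr(h(h(x))))  # 'a'   — different result, so h is not idempotent
```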

 

Fixpoints

 

Numerical Equivalence

 

Security Confusables

Multilevel complexity because not only characters, but also scripts are involved. Not necessarily equivalence relations, because it may be possible for a to be confusable with b, and b with c, but not a with c.

 

Sorting

Sorting is related to equivalence and mappings, but is a talk of its own (going on now in track 1).

 

Canonical Equivalence

From Unicode Standard Annex (UAX) #15, Section 1.1:

Canonical equivalence is a fundamental equivalency between characters or sequences of characters that represent the same abstract character, and when correctly displayed should always have the same visual appearance and behavior.

(emphasis added by presenter)

 

Why Canonical Equivalence

 

Combining Characters

 

Canonical Combining Class

 

Canonical Ordering

 

Precomposed Characters

 

Hangul Syllables

 

Normalization Form D (NFD)

(D stands for Decomposed)

NFD is the result of applying Canonical Decomposition:

  1. Apply Decomposition Mappings
    1. Code charts, using ≡, or UnicodeData file (6th field, when no <...>)
    2. Hangul syllable decomposition (algorithmic)

    until you find no more decompositions

  2. Canonical Reordering
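Both steps can be observed with Python's standard unicodedata module:

```python
import unicodedata

# Step 1: decomposition mappings are applied recursively.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u00C5")])
# ['0x41', '0x30a'] — A + COMBINING RING ABOVE

# Step 2: canonical reordering sorts combining marks by combining class:
# dot above (ccc 230) entered before dot below (ccc 220) gets reordered.
s = "q\u0307\u0323"
print(unicodedata.normalize("NFD", s) == "q\u0323\u0307")  # True
```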

 

(Dis)Advantages of NFD

Advantages of NFD:

Disadvantages of NFD:

Usage Layers

 

Different Mindsets

 

Normalization Form C (NFC)

(C stands for Composed)

Definition:

  1. Canonical Decomposition (see NFD)
  2. Canonical Recomposition

Advantages of NFC:

Disadvantages of NFC:

The phrase "in some ways actually too close" refers to the fact that NFC is so close to actual usage on the Web that there is little motivation to actively normalize content. This somewhat defeats the purpose of a normalization form, as it means that some content out there remains unnormalized.
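In Python, sketched with unicodedata (note that singletons such as the Angstrom sign also normalize to the ordinary precomposed letter):

```python
import unicodedata

# Decomposed A + U+030A COMBINING RING ABOVE recomposes to U+00C5:
print(unicodedata.normalize("NFC", "A\u030A") == "\u00C5")  # True

# The singleton U+212B ANGSTROM SIGN also maps to U+00C5:
print(unicodedata.normalize("NFC", "\u212B") == "\u00C5")   # True
```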

 

Canonical Recomposition

 

Composition Exclusions

 

Stability

 

Normalization Corrigenda

(4 out of 9 for all of Unicode)

 

Kompatibility Equivalence

From Unicode Standard Annex (UAX) #15, Section 1.1:

Compatibility equivalence is a weaker equivalence between characters or sequences of characters that represent the same abstract character, but may have a different visual appearance or behavior.

(emphasis added by presenter)

 

All Things Kompatibility

<tag> indicates and classifies kompatibility in UnicodeData.txt
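The tag can be inspected with unicodedata.decomposition(), which returns that UnicodeData.txt field as a string:

```python
import unicodedata

print(unicodedata.decomposition("⁴"))  # '<super> 0034' — tagged: kompatibility
print(unicodedata.decomposition("Å"))  # '0041 030A'   — no tag: canonical
print(unicodedata.decomposition("a"))  # ''            — no decomposition at all
```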

 

Kompatibility Decomposition

Normalization Form KD (NFKD)

Definition:

  1. Kompatibility Decomposition
  2. Canonical Reordering

Normalization Form KC (NFKC)

Definition:

  1. Kompatibility Decomposition
  2. Canonical Reordering
  3. Canonical Recomposition
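The kompatibility normalization forms fold the superscript, circled, and fullwidth variants from the earlier examples down to the plain digit:

```python
import unicodedata

for ch in "\u2074\u2463\u248B\uFF14":   # ⁴ ④ ⒋ ４
    print(ch, "→", unicodedata.normalize("NFKC", ch))
# ⁴ → 4,  ④ → 4,  ⒋ → 4.,  ４ → 4
```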

Take Care with Normalization Forms

Similar concerns may apply to other kinds of mappings

Because of the wide range of phenomena encoded by Unicode, there is always the chance that different phenomena interact in strange and unpredictable ways. Before assuming no interaction, check carefully. If you don't find any interactions, don't assume that this will necessarily remain so in future versions.

 

Limits of Normalization

 

Case Study: From IDNA2003 to Precis

 

Case Study: Windows File System vs. IDNs

 

Strategy

 

Questions and Answers

Questions may be unnormalized, but answers will be normalized!

 

Acknowledgments

Mark Davis for many years of collaboration (and some disagreements) on normalization, and for proposing a wider approach to the topic of normalization.

The IME Pad for facilitating character input.