Equivalence, Mapping, and Normalization


IUC 37, Santa Clara, CA, U.S.A., 20 October 2013

Martin J. DÜRST


Aoyama Gakuin University


© 2013 Martin J. Dürst, Aoyama Gakuin University

About the Slides

The most up-to-date version of the slides, as well as additional materials, is available at http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut.

These slides are created in HTML, for projection with Opera (≤12.16 Windows/Mac/Linux, use F11). Text in gray, like this, consists of comments/notes that do not appear on the slides. Please note that, depending on the browser and OS you use, some of the characters and character combinations may not display as intended and may instead appear as empty boxes, question marks, or separated rather than composed.


Main Points

Speaker Normalization

Let's normalize the speaker, or in other words see how the speaker was involved in normalization!


Tell me the Difference

What's the difference between the following characters:

৪ and 8 and ∞
Bengali 4, 8, and infinity

Tamil ka and 1

meaning 'bone', same codepoint, different language preference/font

樂, 樂 and 樂
different Korean pronunciation, different codepoints


A Big List

Linguistic and Semantic Similarities color, colour, Farbe, 色
Accidental Graphical Similarities T, ⊤, ┳, ᅮ
Script Similarities K, Κ, К; ヘ、へ
Cross-script 'Equivalences' な, ナ
Numeric 'Equivalence' 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including a few Han ideographs)
Case 'Equivalence' g, G; dž, Dž, DŽ; Σςσ
Compatibility Equivalence ⁴, ₄, ④, ⒋, 4
Canonical Equivalence Å, Å; Ṝ, Ṛ +  ̄, R + ̣ + ̄
Unicode Encoding Forms UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71
(Legacy) Encodings ISO-8859-X, Shift_JIS,...

Similarities and equivalences range from binary (at the bottom) to semantic (at the top). In this tutorial, we assume a uniform Unicode Encoding Form (e.g. UTF-8) and don't deal with linguistic and semantic similarities, but discuss all the layers in between.
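As a small illustration of the bottom layer, the following sketch encodes one two-character string in two Unicode Encoding Forms and prints the code units in hexadecimal. The string is an assumption: it is the pair of code points U+9752 U+5C71 that corresponds to the byte values in the list above. The helper name hex_units is made up for this example.

```python
# Sketch: the same text in two Unicode Encoding Forms.
# Assumption: the text is U+9752 U+5C71, matching the bytes quoted in the list.
text = "\u9752\u5c71"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte order mark


def hex_units(data: bytes, size: int) -> str:
    """Format bytes as upper-case hex, grouped into code units of `size` bytes."""
    return " ".join(data[i:i + size].hex().upper() for i in range(0, len(data), size))


print(hex_units(utf8_bytes, 1))   # E9 9D 92 E5 B1 B1
print(hex_units(utf16_bytes, 2))  # 9752 5C71

# Different byte sequences, but they decode to the same characters.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == text
```

The two byte sequences differ, yet they stand for identical text, which is why the rest of the tutorial can fix one encoding form and ignore this layer.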


Why all these Similarities and Variants?



Equivalences/Mappings Defined by Unicode

It's Compatibility Equivalence officially, but we will use Kompatibility Equivalence (as in NFKC/NFKD) hereafter for better distinction with Canonical Equivalence.


Equivalence vs. Mapping

Equivalence is often defined using a mapping. As an example, an equivalence FOOeq may be defined with a mapping FOOmap: two strings A and B are FOO-equivalent if (and only if) FOOmap(A) and FOOmap(B) are equal. Put the other way round, every mapping defines an equivalence.
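The relation between a mapping and the equivalence it induces can be sketched in a few lines. Here Python's built-in str.casefold stands in for FOOmap; this choice is an assumption made for illustration (any function from strings to strings would do), and the name equivalent is hypothetical.

```python
from typing import Callable


def equivalent(a: str, b: str, mapping: Callable[[str], str]) -> bool:
    """A and B are equivalent under `mapping` iff their images are equal."""
    return mapping(a) == mapping(b)


foo_map = str.casefold  # stand-in for FOOmap (illustrative assumption)

# Capital sigma, final sigma and small sigma all fold to the same string.
sigmas = ["\u03a3", "\u03c2", "\u03c3"]
print(all(equivalent(sigmas[0], s, foo_map) for s in sigmas))  # True

# The three case variants of the dz-with-caron digraph (U+01C4..U+01C6).
dz_forms = ["\u01c4", "\u01c5", "\u01c6"]
print(all(equivalent(dz_forms[0], s, foo_map) for s in dz_forms))  # True

# Strings with different images are not equivalent.
print(equivalent("g", "k", foo_map))  # False
```

Because the comparison is plain string equality on the mapped values, the result is automatically reflexive, symmetric and transitive, which is what makes it an equivalence.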



Sorting is related to equivalence and mappings, but would be a talk of its own.




Case Equivalence/Mapping


Other Equivalences/Mappings


Canonical Equivalence

From Unicode Standard Annex (UAX) #15, Section 1.1:

Canonical equivalence is a fundamental equivalency between characters or sequences of characters that represent the same abstract character, and when correctly displayed should always have the same visual appearance and behavior.

(emphasis added by presenter)
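A minimal sketch of how canonical equivalence can be tested in practice: two strings are compared after full canonical decomposition, using Python's standard unicodedata module. The sample strings are spelled out as code point escapes; the A-with-ring and R-with-dot-below-and-macron forms follow the canonical equivalence examples in the list earlier, while the last R form (macron before dot below) is added here as an assumption to show reordering.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Precomposed letter, ANGSTROM SIGN, and A + COMBINING RING ABOVE.
a_ring_forms = ["\u00c5", "\u212b", "A\u030a"]

# R with dot below and macron: fully precomposed, partly precomposed,
# fully decomposed, and fully decomposed with the two marks swapped.
r_forms = ["\u1e5c", "\u1e5a\u0304", "R\u0323\u0304", "R\u0304\u0323"]

for forms in (a_ring_forms, r_forms):
    # All spellings differ as code point sequences ...
    assert len(set(forms)) == len(forms)
    # ... yet each one is canonically equivalent to the first.
    assert all(canonically_equivalent(forms[0], other) for other in forms)

# Canonical combining classes of dot below and macron; marks with different
# non-zero classes are put into a fixed order during normalization.
print([unicodedata.combining(c) for c in "\u0323\u0304"])  # [220, 230]
```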

Why Canonical Equivalence

Combining Characters


Canonical Combining Class


Canonical Ordering


Precomposed Characters

Hangul Syllables


Normalization Form D (NFD)

(D stands for Decomposed)

Apply Canonical Decomposition, consisting of:

Advantages of NFD:

Disadvantages of NFD:


Usage Layers

Different Mindsets


Normalization Form C (NFC)

(C stands for Composed)

After Canonical Decomposition (see NFD):

Advantages of NFC:

Disadvantages of NFC:

The phrase "in some ways actually too close" refers to the fact that NFC is so close to actual usage on the Web that there isn't too much motivation to actively normalize content. This is not exactly the purpose of a normalization form, as it means that there is still some content out there that is not normalized.
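Since some content out there is not normalized, it can be useful to check before normalizing. The sketch below does this with Python's unicodedata module; it assumes Python 3.8 or later, where unicodedata.is_normalized is available. The three sample strings are illustrative choices, not taken from the slides.

```python
import unicodedata

# Three spellings of the same letter: precomposed, decomposed, ANGSTROM SIGN.
samples = ["\u00c5", "A\u030a", "\u212b"]

for s in samples:
    already_nfc = unicodedata.is_normalized("NFC", s)
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    code_points = " ".join(f"U+{ord(c):04X}" for c in s)
    print(f"{code_points}: NFC already? {already_nfc}, "
          f"NFC length {len(nfc)}, NFD length {len(nfd)}")

# Only the precomposed letter is already in NFC; all three normalize to it.
assert [unicodedata.is_normalized("NFC", s) for s in samples] == [True, False, False]
assert {unicodedata.normalize("NFC", s) for s in samples} == {"\u00c5"}
```

Checking first avoids copying strings that are already normalized, which is the common case for content that is close to NFC to begin with.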


Canonical Recomposition


Composition Exclusions




Normalization Corrigenda

(4 out of 9 for all of Unicode)


NFC/NFD Variants

Kompatibility Equivalence

From Unicode Standard Annex (UAX) #15, Section 1.1:

Compatibility equivalence is a weaker equivalence between characters or sequences of characters that represent the same abstract character, but may have a different visual appearance or behavior.

(emphasis added by presenter)

All Things Kompatibility

<tag> indicates and classifies kompatibility in the Unicode Data file
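The tags can be inspected directly. The following sketch uses Python's unicodedata.decomposition, which returns the decomposition field of the Unicode Data file including the tag, and compares NFC with NFKC for a few variants of the digit four. The character selection is an assumption for illustration; the fullwidth digit in particular is added here.

```python
import unicodedata

# Superscript, subscript, circled, digit-with-full-stop and fullwidth four.
variants = ["\u2074", "\u2084", "\u2463", "\u248b", "\uff14"]

for ch in variants:
    field = unicodedata.decomposition(ch)     # e.g. '<super> 0034'
    tag = field.split()[0]                    # the <tag> part of the field
    nfc = unicodedata.normalize("NFC", ch)    # canonical forms ignore tagged mappings
    nfkc = unicodedata.normalize("NFKC", ch)  # kompatibility forms apply them
    print(f"U+{ord(ch):04X} {tag:9} NFC unchanged: {nfc == ch}, NFKC: {nfkc!r}")

# Tagged decompositions are applied by NFKC/NFKD only.
assert all(unicodedata.normalize("NFC", ch) == ch for ch in variants)
assert [unicodedata.normalize("NFKC", ch) for ch in variants] == ["4", "4", "4", "4.", "4"]
```

Note that the result is not always a single character: the digit with full stop maps to two characters, and in every case the distinction carried by the tag is lost after the mapping.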


Kompatibility Decomposition

Normalization Form KD (NFKD)

Normalization Form KC (NFKC)

Take Care with Normalization Forms

Similar concerns may apply to other kinds of mappings

Because of the wide range of phenomena encoded by Unicode, there is always the chance that different phenomena interact in strange and unpredictable ways. Check carefully before assuming that there is no interaction. Even if you don't find any interactions, don't assume that this will necessarily remain so in all future versions.
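One documented pitfall of this kind, stated in UAX #15 rather than taken from these slides, is that normalization forms are not closed under concatenation: joining two normalized strings does not necessarily give a normalized string. A minimal sketch, with the helper name is_nfc chosen for this example:

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if normalizing to NFC leaves the string unchanged."""
    return unicodedata.normalize("NFC", s) == s


left = "a"        # plain letter, in NFC
right = "\u0301"  # lone COMBINING ACUTE ACCENT, also in NFC on its own

assert is_nfc(left) and is_nfc(right)

joined = left + right
print(is_nfc(joined))  # False: the pair composes to U+00E1

# Re-normalizing after concatenation restores NFC.
assert unicodedata.normalize("NFC", joined) == "\u00e1"
```

Code that concatenates or splices text therefore has to re-check, or re-normalize, around the join point instead of relying on its inputs having been normalized.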


Case Study: From IDNA2003 to Precis


Case Study: Windows File System vs. IDNs

Case Study: Normalization Break-in

(actual case!)


More about Normalization Implementation

Talk tomorrow (Tuesday, 22 October), Track 3, Session 4 (14:30 – 15:20):

Implementing Normalization in Pure Ruby - the Fast and Easy Way

Questions and Answers

Questions may be unnormalized, but answers will be normalized!


Mark Davis for many years of collaboration (and some disagreements) on normalization, and for proposing a wider approach to the topic of normalization.

The IME Pad for facilitating character input.

Amaya and Web technology for slide editing and display.