Tutorial
Equivalence, Mapping, and Normalization

http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut

IUC 37, Santa Clara, CA, U.S.A., 20 October 2013

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

   

© 2013 Martin J. Dürst, Aoyama Gakuin University

About the Slides

The most up-to-date version of the slides, as well as additional materials, is available at http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut.

These slides are created in HTML, for projection with Opera (≤12.16 Windows/Mac/Linux, use F11). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but instead appear as empty boxes, question marks, or separated rather than composed.

Audience

Main Points

Speaker Normalization

Let's normalize the speaker, or in other words see how the speaker was involved in normalization!

 

Tell me the Difference

What's the difference between the following characters:

৪ and 8 and ∞
Bengali digit 4, European digit 8, and the infinity sign

க௧
Tamil letter ka and Tamil digit 1

骨 and 骨
meaning 'bone', same codepoint, different language preference/font

樂, 樂 and 樂
different Korean pronunciation, different codepoints

 

A Big List

Linguistic and Semantic Similarities color, colour, Farbe, 色
Accidental Graphical Similarities T, ⊤, ┳, ᅮ
Script Similarities K, Κ, К; ヘ、へ
Cross-script 'Equivalences' な, ナ
Numeric 'Equivalence' 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including a few Han ideographs)
Case 'Equivalence' g, G; dž, Dž, DŽ; Σςσ
Compatibility Equivalence ⁴, ₄, ④, ⒋, 4
Canonical Equivalence Å, Å; Ṝ, Ṛ +  ̄, R + ̣ + ̄
Unicode Encoding Forms UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71
(Legacy) Encodings ISO-8859-X, Shift_JIS,...

Similarities and equivalences range from binary (at the bottom) to semantic (at the top). In this tutorial, we assume a uniform Unicode Encoding Form (e.g. UTF-8) and don't deal with linguistic and semantic similarities, but discuss all the layers in between.
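As a quick check of the bottom layer, the following Python sketch encodes one string in two Unicode Encoding Forms and prints the bytes in hexadecimal. It assumes that the bytes in the Unicode Encoding Forms entry of the list above belong to the two characters U+9752 U+5C71; this is inferred from the UTF-16 code units and is not stated on the slide.

```python
# Assumption: the example bytes in the list encode U+9752 U+5C71.
text = "\u9752\u5c71"

# The same abstract string, serialized in two encoding forms.
utf8_hex = text.encode("utf-8").hex(" ").upper()
utf16_hex = text.encode("utf-16-be").hex(" ", 2).upper()

print("UTF-8: ", utf8_hex)    # E9 9D 92 E5 B1 B1
print("UTF-16:", utf16_hex)   # 9752 5C71
```

The byte sequences differ, but both decode to the same code points, which is why the rest of the tutorial can ignore this layer once a single encoding form is fixed.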

 

Why all these Similarities and Variants?

 

Applications

Equivalences/Mappings Defined by Unicode

Officially it is Compatibility Equivalence, but from here on we write Kompatibility Equivalence (as in NFKC/NFKD) to distinguish it more clearly from Canonical Equivalence.

 

Equivalence vs. Mapping

Equivalence is often defined using a mapping. As an example, an equivalence FOOeq may be defined with a mapping FOOmap. Two strings A and B are FOOequivalent if (and only if) the FOOmapping of A and the FOOmapping of B are equal. Put the other way round, every mapping defines an equivalence.
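The relation between a mapping and its equivalence can be sketched in a few lines of Python. The names equivalence_from, foo_map, and foo_eq are hypothetical and only mirror FOOmap and FOOeq above; Python's built-in case folding is used as a stand-in mapping purely for illustration.

```python
from typing import Callable


def equivalence_from(mapping: Callable[[str], str]) -> Callable[[str, str], bool]:
    """Return the equivalence relation induced by a mapping."""
    def equivalent(a: str, b: str) -> bool:
        # A and B are equivalent if and only if their mapped forms are equal.
        return mapping(a) == mapping(b)
    return equivalent


# Hypothetical FOOmap: case folding as provided by str.casefold.
foo_map = str.casefold
foo_eq = equivalence_from(foo_map)

print(foo_eq("Straße", "STRASSE"))  # True: both fold to "strasse"
print(foo_eq("color", "colour"))    # False: the mapped strings differ
```

Any function from strings to strings can be passed in, which is the sense in which every mapping defines an equivalence.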

 

Sorting

Sorting is related to equivalence and mappings, but would be a talk of its own.

 

Casing

 

Case Equivalence/Mapping

 

Other Equivalences/Mappings

 

Canonical Equivalence

From Unicode Standard Annex (UAX) #15, Section 1.1:

Canonical equivalence is a fundamental equivalency between characters or sequences of characters that represent the same abstract character, and when correctly displayed should always have the same visual appearance and behavior.

(emphasis added by presenter)
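A minimal sketch of what this means in practice, using Python's standard unicodedata module and the Å example from the big list. The helper name canonically_equivalent is made up for this illustration; it compares the NFD forms of both strings.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


a_ring = "\u00c5"         # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"       # ANGSTROM SIGN
a_plus_ring = "A\u030a"   # A followed by COMBINING RING ABOVE

# Different code point sequences ...
print(a_ring == angstrom, a_ring == a_plus_ring)    # False False
# ... but the same abstract character.
print(canonically_equivalent(a_ring, angstrom))     # True
print(canonically_equivalent(a_ring, a_plus_ring))  # True

nfc_of_angstrom = unicodedata.normalize("NFC", angstrom)
nfd_of_a_ring = unicodedata.normalize("NFD", a_ring)
```

Comparing NFC forms instead of NFD forms gives the same answers; the choice of NFD here is arbitrary.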

Why Canonical Equivalence

Combining Characters

 

Canonical Combining Class

 

Canonical Ordering
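The effect of canonical ordering can be observed with the R example from the big list. The sketch below uses only Python's unicodedata module; the variable names are illustrative. Combining marks with different canonical combining classes are sorted into a fixed order, while marks with the same class keep their relative order.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
MACRON = "\u0304"     # COMBINING MACRON
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
GRAVE = "\u0300"      # COMBINING GRAVE ACCENT

# Canonical combining classes: below marks (220) sort before above marks (230).
ccc_dot_below = unicodedata.combining(DOT_BELOW)
ccc_macron = unicodedata.combining(MACRON)
print(ccc_dot_below, ccc_macron)  # 220 230

# Different classes: both input orders end up in the same canonical order.
nfd_1 = unicodedata.normalize("NFD", "R" + DOT_BELOW + MACRON)
nfd_2 = unicodedata.normalize("NFD", "R" + MACRON + DOT_BELOW)
print(nfd_1 == nfd_2)  # True

# Same class (230 for both): the order is significant and is preserved.
nfd_3 = unicodedata.normalize("NFD", "a" + ACUTE + GRAVE)
nfd_4 = unicodedata.normalize("NFD", "a" + GRAVE + ACUTE)
print(nfd_3 == nfd_4)  # False
```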

 

Precomposed Characters

Hangul Syllables
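Hangul syllables are decomposed by arithmetic rather than by table lookup. The following Python sketch follows the algorithm given in the Unicode Standard (Section 3.12); the constant and function names are chosen for this illustration, and the result is cross-checked against unicodedata.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_hangul(syllable: str) -> str:
    """Decompose one precomposed Hangul syllable into conjoining jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed Hangul syllable
    leading = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    # t_index 0 means "no trailing consonant".
    trailing = chr(T_BASE + t_index) if t_index else ""
    return leading + vowel + trailing


han = "\ud55c"  # HANGUL SYLLABLE HAN
jamo = decompose_hangul(han)
print([f"U+{ord(c):04X}" for c in jamo])           # ['U+1112', 'U+1161', 'U+11AB']
print(jamo == unicodedata.normalize("NFD", han))   # True
```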

 

Normalization Form D (NFD)

(D stands for Decomposed)

Apply Canonical Decomposition, consisting of:

Advantages of NFD:

Disadvantages of NFD:

 

Usage Layers

Different Mindsets

 

Normalization Form C (NFC)

(C stands for Composed)

After Canonical Decomposition (see NFD):

Advantages of NFC:

Disadvantages of NFC:

The phrase "in some ways actually too close" refers to the fact that NFC is so close to actual usage on the Web that there isn't too much motivation to actively normalize content. This is not exactly the purpose of a normalization form, as it means that there is still some content out there that is not normalized.

 

Canonical Recomposition

 

Composition Exclusions

 

Stability

 

Normalization Corrigenda

(4 out of 9 for all of Unicode)

 

NFC/NFD Variants

Kompatibility Equivalence

From Unicode Standard Annex (UAX) #15, Section 1.1:

Compatibility equivalence is a weaker equivalence between characters or sequences of characters that represent the same abstract character, but may have a different visual appearance or behavior.

(emphasis added by presenter)

All Things Kompatibility

<tag> indicates and classifies kompatibility in the Unicode Data file
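These tags can be inspected directly. The sketch below uses Python's unicodedata module on four of the "4" variants from the big list; the variable names are illustrative. unicodedata.decomposition returns the raw decomposition field from the Unicode Data file, including the tag.

```python
import unicodedata

variants = ["\u2074", "\u2084", "\u2463", "\u248b"]  # ⁴ ₄ ④ ⒋

# Raw decomposition field, e.g. "<super> 0034".
tags = {ch: unicodedata.decomposition(ch) for ch in variants}
# Kompatibility mappings are applied by NFKC (and NFKD) ...
nfkc = {ch: unicodedata.normalize("NFKC", ch) for ch in variants}
# ... but not by NFC (or NFD).
nfc = {ch: unicodedata.normalize("NFC", ch) for ch in variants}

for ch in variants:
    print(f"U+{ord(ch):04X}  {tags[ch]:<20} NFKC={nfkc[ch]!r}  NFC={nfc[ch]!r}")
```

Note that ⒋ maps to the two characters "4." under NFKC, so a kompatibility mapping can change the length of a string.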

 

Kompatibility Decomposition

Normalization Form KD (NFKD)

Normalization Form KC (NFKC)

Take Care with Normalization Forms

Similar concerns may apply to other kinds of mappings

Because of the wide range of phenomena encoded by Unicode, there is always the chance that different phenomena interact in strange and unpredictable ways. Check carefully before assuming that there is no interaction. Even if you don't find any interactions, don't assume that this will necessarily remain so in all future versions.
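One well-known interaction of this kind, chosen here as an illustration and not necessarily the case the slide had in mind, is that normalization forms are not closed under concatenation. The Python sketch below shows two strings that are each in NFC while their concatenation is not; the helper is_nfc is defined only for this example.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """A string is in NFC if normalizing it to NFC leaves it unchanged."""
    return unicodedata.normalize("NFC", s) == s


left = "A"
right = "\u030a"  # COMBINING RING ABOVE on its own

print(is_nfc(left), is_nfc(right))  # True True
print(is_nfc(left + right))         # False: the pair composes to U+00C5

joined_nfc = unicodedata.normalize("NFC", left + right)
```

Code that joins, splits, or truncates normalized strings may therefore have to normalize again at the seams.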

 

Case Study: From IDNA2003 to Precis

 

Case Study: Windows File System vs. IDNs

Case Study: Normalization Breakin

(actual case!)

Strategy

More about Normalization Implementation

Talk tomorrow (Tuesday, 22 October), Track 3, Session 4 (14:30 – 15:20):

Implementing Normalization in Pure Ruby - the Fast and Easy Way

Questions and Answers

Questions may be unnormalized, but answers will be normalized!

Acknowledgments

Mark Davis for many years of collaboration (and some disagreements) on normalization, and for proposing a wider approach to the topic of normalization.

The IME Pad for facilitating character input.

Amaya and Web technology for slide editing and display.