Tutorial
Equivalence, Mapping, and Normalization

https://www.sw.it.aoyama.ac.jp/2020/pub/IUC44EMN

IUC44, Virtual Conference, 14 October 2020

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

   

© 2013-20 Martin J. Dürst, Aoyama Gakuin University

 

About the Slides

The most up-to-date version of the slides is available at https://www.sw.it.aoyama.ac.jp/2020/pub/IUC44EMN

These slides are created in HTML, for projection with a very old version (≤12.18) of Opera (please contact me if you cannot find a copy). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed. To check the identity of a character, use e.g. Richard Ishida's Unicode Converter and the Unicode code charts.

Abstract

The multitude of characters available in Unicode means that there are many ways in which characters or strings can be equivalent, similar, or otherwise related. In this tutorial, you will learn about all these relationships, in order to be able to better work with Unicode data and programs handling Unicode data. The tutorial assumes that participants have a basic understanding of the scope and breadth of Unicode, possibly from attending tutorials earlier in the day.

Character relationships and similarities in Unicode range from linguistic and semantic similarities at one end to the same character being represented in different character encodings or Unicode encoding forms at the other end. In the middle, numerical and case equivalences, compatibility and canonical equivalences, graphic similarities, and many others can be found. This sometimes bewildering wealth of characters, equivalences, and relationships is due to the rich history of human writing as well as to the realities of character encoding policies and decisions.

The tutorial will give some guidance to help users navigate equivalences and differences for their use cases and applications. Each of these many equivalences or relationships can or should be ignored in some processing contexts, but may be crucial in others. Contexts may range from use as identifiers (e.g. user ids and passwords, with security consequences) to searching and sorting. For most of the equivalences, data is available in the Unicode Standard and its associated data files, or is provided by other standards such as IDNA and PRECIS. But the use of this data and the functions provided by various libraries requires understanding of the background of the equivalences.

When testing for equivalence of two strings, the general strategy is to map or normalize both strings to a form that eliminates accidental (in the given context) differences, and then compare the strings on a binary level. The tutorial will not only look at officially defined equivalences, but will also discuss variants that may be necessary in practice to cover specialized needs. We will also discuss the relationships between various classes of equivalences, necessary to avoid pitfalls when combining them, and the stability of the equivalences over time and under various operations such as string concatenation.

 

Outline

 

Introduction

Audience

 

Main Points

 

Speaker Normalization

Let's normalize the speaker, or in other words see how the speaker was involved in normalization!

 

Tell me the Difference

What's the difference between the following characters:

৪ and 8 and ∞
Bengali 4, 8, and infinity   

க௧
Tamil ka and Tamil 1   

and
same codepoint, different language preference/glyph structure   

樂, 樂 and 樂
different Korean pronunciation, different codepoints, should look the same

 

A Big List

Linguistic and Semantic Similarities color, colour, Farbe, 色 not discussed
Transliterations/Transcriptions Putin, Путин, 푸틴, プーチン, 普京, پوتین, بوتين
Accidental Graphical Similarities T, ⊤, ┳, ᅮ this talk's topic
Script Similarities K, Κ, К; ヘ、へ
Cross-script 'Equivalences' な, ナ
Numeric 'Equivalence' 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including Han ideographs)
Case 'Equivalence' g, G; dž, Dž, DŽ; Σ, ς, σ
Compatibility Equivalence ⁴, ₄, ④, ⒋, 4
Canonical Equivalence Å, Å; Ṝ, Ṛ +  ̄, R + ̣ + ̄
Font Differences A, A, A, A, A not discussed
Unicode Encoding Forms UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71
(Legacy) Encodings ISO-8859-X, Shift_JIS,...

Similarities and equivalences range from binary (at the bottom) to semantic (at the top). In this tutorial, we assume a uniform Unicode Encoding Form (e.g. UTF-8) and do not deal with linguistic and semantic similarities, but discuss all the in-between layers.
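
As a small illustration of the bottom layer of the list, the following sketch reproduces the byte values given in the "Unicode Encoding Forms" row. Python and its standard codecs are an assumption made for this example; the slides do not prescribe a language.

```python
# The two characters behind the byte values in the list: U+9752 U+5C71.
text = "\u9752\u5c71"

# Same string, two encoding forms (bytes.hex with a separator needs Python 3.8+).
utf8_hex = text.encode("utf-8").hex(" ").upper()
utf16_hex = text.encode("utf-16-be").hex(" ", 2).upper()

# Decoding either byte sequence gives back the identical string.
same_text = (
    text.encode("utf-8").decode("utf-8")
    == text.encode("utf-16-be").decode("utf-16-be")
)

print(utf8_hex)   # E9 9D 92 E5 B1 B1
print(utf16_hex)  # 9752 5C71
```

Once everything is decoded to one encoding form, this layer disappears, which is why the rest of the tutorial can ignore it.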

 

Why all these Similarities and Variants?

 

Why Does it Matter?

(actual case!)

See also: https://github.com/reinderien/mimic

 

Applications

 

Equivalences/Mappings Defined by Unicode

Officially it is Compatibility Equivalence, but hereafter we will write Kompatibility Equivalence (with K, as in NFKC/NFKD) to distinguish it more clearly from Canonical Equivalence.

 

Case

Something Familiar: Casing

A ↔ a

...

Z ↔ z

Is it actually that easy?
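
A few cases suggest that the answer is no. The sketch below uses Python's built-in string methods, which is an assumption of this example and not something taken from the slides; they apply the default (locale-independent) Unicode case mappings.

```python
# One character can become two: sharp s has no single-character uppercase
# in the default mapping.
sharp_s_upper = "\u00df".upper()                    # "SS"

# Context matters: capital sigma lowercases to final sigma at the end of a word.
final_sigma = "\u039f\u0394\u039f\u03a3".lower()    # ends in U+03C2, not U+03C3

# Three cases instead of two: the digraph dz-with-caron has a titlecase form.
dz_title = "\u01c6".title()                         # U+01C5
dz_upper = "\u01c6".upper()                         # U+01C4

# Length can change when lowercasing, too: capital I with dot above.
dotted_i_lower = "\u0130".lower()                   # "i" + U+0307

print(sharp_s_upper, final_sigma, dz_title, dz_upper, len(dotted_i_lower))
```

Language-specific rules (for example Turkish dotted and dotless i) are not covered by these default mappings and need locale-aware libraries.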

 

Special Casings

 

Case Equivalence/Mapping

 

Data for Casing

 

A Bit of Theory

Equivalence vs. Mapping

Equivalence is often defined using a mapping. As an example, an equivalence FOOeq may be defined with a mapping FOOmap. Two strings A and B are FOOequivalent if (and only if) the FOOmapping of A and the FOOmapping of B are equal. Put the other way round, every mapping defines an equivalence.
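
The relationship between a mapping and the equivalence it defines can be sketched in a few lines. The helper name `equivalent` is hypothetical, and the choice of Python's `str.casefold` and `unicodedata.normalize` as example mappings is an assumption of this sketch.

```python
import unicodedata
from typing import Callable


def equivalent(mapping: Callable[[str], str], a: str, b: str) -> bool:
    """a and b are equivalent under `mapping` iff their mapped forms are equal."""
    return mapping(a) == mapping(b)


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# Case equivalence, defined via case folding.
case_eq = equivalent(str.casefold, "Stra\u00dfe", "STRASSE")   # True

# Canonical equivalence, defined via NFC.
canon_eq = equivalent(nfc, "\u00c5", "A\u030a")                # True

# NFC does not fold case, so these stay different.
not_eq = equivalent(nfc, "\u00c5", "a\u030a")                  # False
```

The comparison itself is always a plain binary comparison; all the knowledge about what counts as an accidental difference lives in the mapping.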

 

Security Conditions for Mappings

 

Idempotence

A function f is idempotent if applying it once or several times gives the same result.

f is idempotent ⇔ f(f(x)) = f(x) for all x

Caution: The composition of two independently idempotent functions is not necessarily idempotent.

f(f(x)) = f(x), g(g(x)) = g(x) ↛ f(g(f(g(x)))) = f(g(x)), i.e. x ↦ f(g(x)) may not be idempotent

Examples:
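
One deliberately simple toy example, which is not taken from the slides: truncating to three characters and stripping trailing whitespace are each idempotent, but truncating after stripping is not. The function names below are hypothetical.

```python
from typing import Callable, Iterable


def is_idempotent_on(func: Callable[[str], str], samples: Iterable[str]) -> bool:
    """Check f(f(x)) == f(x) on the given samples only.

    A False result proves non-idempotence; a True result is merely evidence.
    """
    return all(func(func(s)) == func(s) for s in samples)


def f(s: str) -> str:
    return s[:3]          # truncate to at most three characters


def g(s: str) -> str:
    return s.rstrip()     # remove trailing whitespace


def f_after_g(s: str) -> str:
    return f(g(s))


samples = ["ab cd", "ab ", "abc", "  ", ""]

f_ok = is_idempotent_on(f, samples)                  # True
g_ok = is_idempotent_on(g, samples)                  # True
composed_ok = is_idempotent_on(f_after_g, samples)   # False

# The counterexample: "ab cd" -> "ab " -> "ab"
once = f_after_g("ab cd")
twice = f_after_g(once)
```

Truncation creates new trailing whitespace that the first strip could not see. Combinations of Unicode mappings can fail in the same way, when one step produces input that an earlier step would have changed.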

 

Fixpoint

 

Various Equivalences

 

Numerical Equivalence
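
As a hedged illustration using a handful of the sevens from the big list: Python's `unicodedata` module (an assumption of this sketch) exposes the numeric properties from the Unicode Character Database. All of the characters below have the numeric value 7, but only some of them are decimal digits that can be used in positional notation.

```python
import unicodedata

# ASCII, Extended Arabic-Indic, Devanagari, Bengali, superscript,
# subscript, circled, and fullwidth seven.
sevens = "7\u06f7\u096d\u09ed\u2077\u2087\u2466\uff17"

# Every one of them has the numeric value 7.
values = [unicodedata.numeric(ch) for ch in sevens]

# Only some are decimal digits; superscripts and circled digits are not.
decimal_digits = [ch for ch in sevens if unicodedata.decimal(ch, None) is not None]

# int() accepts decimal digits from any script, even mixed.
parsed = int("\u06f7\u096d")   # 77

print(values, len(decimal_digits), parsed)
```

Whether accepting such digits is desirable depends on the application; for identifiers or security-relevant input, a stricter policy may be appropriate.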

 

Security Confusables

Confusability is complex on several levels, because not only characters but also scripts are involved. It is not necessarily an equivalence relation, because it need not be transitive: a may be confusable with b, and b with c, without a being confusable with c.

 

Sorting

Sorting is related to equivalence and mappings, but is a talk of its own.

 

Canonical Equivalence

 

Canonical Equivalence (≡)

From Unicode Standard Annex (UAX) #15, Section 1.1:

Canonical equivalence is a fundamental equivalency between characters or sequences of characters that represent the same abstract character, and when correctly displayed should always have the same visual appearance and behavior.

(emphasis added by presenter)

 

Why Canonical Equivalence

 

Combining Characters

⇒Canonical Ordering

 

Canonical Combining Class

 

Canonical Ordering

 

Canonical Ordering Example

Example (ş̤̂̏, ccc in (), characters being exchanged)

original s  ̂ (230)  ̧ (202)  ̏ (230)  ̤ (220)
after first step s  ̧ (202)  ̂ (230)  ̏ (230)  ̤ (220)
after second step s  ̧ (202)  ̂ (230)  ̤ (220)  ̏ (230)
final s  ̧ (202)  ̤ (220)  ̂ (230)  ̏ (230)

4 diacritics, 24 permutations, which fall into two groups of 12 canonically equivalent sequences each (the two marks with ccc 230 cannot be exchanged)
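
The example can be retraced with a short sketch. It assumes Python's `unicodedata.combining`, which returns the canonical combining class, and uses a simple exchange sort; `canonical_order` is a hypothetical helper, not a library function, and real implementations are more efficient.

```python
import unicodedata
from collections import Counter
from itertools import permutations


def canonical_order(s: str) -> str:
    """Exchange adjacent characters while ccc(first) > ccc(second) > 0."""
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:   # characters with ccc 0 (starters) are never moved
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)


# circumflex (230), cedilla (202), double grave (230), diaeresis below (220)
marks = ["\u0302", "\u0327", "\u030f", "\u0324"]
original = "s" + "".join(marks)
ordered = canonical_order(original)

# Group all 24 permutations of the four marks by their canonical order.
groups = Counter(canonical_order("s" + "".join(p)) for p in permutations(marks))

print([f"{ord(c):04X}" for c in ordered])
print(sorted(groups.values()))
```

The result agrees with `unicodedata.normalize("NFD", original)`, because none of the characters in this example has a decomposition mapping.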

 

Precomposed Characters

 

Hangul Syllables

 

Normalization Form D (NFD)

(D stands for Decomposed)

NFD is the result of Canonical Decomposition:

  1. Apply Decomposition Mappings
    1. Code charts, using ≡, or UnicodeData file (6th field, when no <...>)
      (apply repeatedly, because all decompositions are singletons or binary)
    2. Hangul syllable decomposition (algorithmic)

    until you find no more decompositions

  2. Apply Canonical Reordering
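
The algorithmic Hangul step can be sketched as follows. The constants are those given in the Unicode Standard for Hangul syllable decomposition; the function name `decompose_hangul` is hypothetical, and Python's `unicodedata` is used only to cross-check the result.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT      # 11172 precomposed syllables


def decompose_hangul(ch: str) -> str:
    """Decompose one precomposed Hangul syllable into conjoining jamo."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch                                  # not a Hangul syllable
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trailing = T_BASE + s_index % T_COUNT
    # T_BASE itself stands for "no trailing consonant".
    return chr(leading) + chr(vowel) + (chr(trailing) if trailing != T_BASE else "")


han = decompose_hangul("\ud55c")   # U+1112 U+1161 U+11AB

# Cross-check against the library for every precomposed syllable.
all_match = all(
    decompose_hangul(chr(cp)) == unicodedata.normalize("NFD", chr(cp))
    for cp in range(S_BASE, S_BASE + S_COUNT)
)
```

Because the mapping is pure arithmetic, no table entries are needed for the 11,172 syllables.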

 

Evaluation of NFD

Advantages:

Disadvantages:

Usage Layers

 

Different Mindsets

⇒Normalization Form C

 

Normalization Form C (NFC)

(C stands for Composed)

Definition:

  1. Canonical Decomposition (see NFD,
    includes decomposition mappings and canonical reordering)
  2. Canonical Recomposition
    ('reverse' of decomposition mappings;
    see also below)
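
In practice, both forms are usually obtained from a library. The sketch below assumes Python's `unicodedata.normalize` and uses the canonically equivalent sequences from the big list; the last R variant, with the two marks in non-canonical order, is added here for illustration.

```python
import unicodedata

# A with ring: precomposed, ANGSTROM SIGN, and A + combining ring above.
a_ring_variants = ["\u00c5", "\u212b", "A\u030a"]
a_ring_nfd = {unicodedata.normalize("NFD", s) for s in a_ring_variants}
a_ring_nfc = {unicodedata.normalize("NFC", s) for s in a_ring_variants}

# R with dot below and macron: fully precomposed, partly precomposed,
# fully decomposed, and decomposed with the marks in the "wrong" order.
r_variants = ["\u1e5c", "\u1e5a\u0304", "R\u0323\u0304", "R\u0304\u0323"]
r_nfd = {unicodedata.normalize("NFD", s) for s in r_variants}
r_nfc = {unicodedata.normalize("NFC", s) for s in r_variants}

print(a_ring_nfd == {"A\u030a"}, a_ring_nfc == {"\u00c5"})
print(r_nfd == {"R\u0323\u0304"}, r_nfc == {"\u1e5c"})
```

Each set collapses to a single string, which is exactly what makes a binary comparison after normalization a test for canonical equivalence.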

 

Evaluation of NFC

Advantages:

Disadvantages:

The phrase "in some ways actually too close" refers to the fact that NFC is close enough to actual usage on the Web that there isn't too much motivation or pressure to actively normalize content. This is not exactly the purpose of a normalization form, as it means that there is still some content out there that is not normalized.

 

Limits of NFC and Canonical Equivalence

 

Canonical Recomposition

 

Composition Exclusions
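
A hedged illustration, assuming Python's `unicodedata` (with `is_normalized`, available from Python 3.8): DEVANAGARI LETTER QA (U+0958) is one of the characters listed as a composition exclusion, so NFC leaves it decomposed even though a precomposed code point exists. The ANGSTROM SIGN is shown for comparison as a singleton decomposition, which is never recomposed either.

```python
import unicodedata

qa = "\u0958"                                   # precomposed, but excluded
qa_nfc = unicodedata.normalize("NFC", qa)       # U+0915 U+093C
qa_is_nfc = unicodedata.is_normalized("NFC", qa)

angstrom_nfc = unicodedata.normalize("NFC", "\u212b")   # U+00C5, not U+212B

print([f"{ord(c):04X}" for c in qa_nfc], qa_is_nfc, f"{ord(angstrom_nfc):04X}")
```

So "composed" in NFC does not mean "as short as possible": a single precomposed character can itself fail to be in NFC.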

 

Stability of NFD and NFC

 

Normalization Corrigenda

(4 out of 9 corrigenda for all of Unicode)

 

Kompatibility Equivalence

 

Kompatibility Equivalence (≈)

From Unicode Standard Annex (UAX) #15, Section 1.1:

Compatibility equivalence is a weaker equivalence between characters or sequences of characters that represent the same abstract character, but may have a different visual appearance or behavior.

(emphasis added by presenter)

 

All Things Kompatibility

<tag> indicates and classifies kompatibility in UnicodeData.txt (6th field)
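
The tags can be inspected directly. This sketch assumes Python's `unicodedata.decomposition`, which returns the decomposition field of UnicodeData.txt as a string; the helper `decomposition_kind` is hypothetical.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Classify a character by its decomposition field."""
    field = unicodedata.decomposition(ch)
    if not field:
        return "none"
    if field.startswith("<"):
        return field[1:field.index(">")]   # the kompatibility tag
    return "canonical"                     # no tag: canonical decomposition


# The fours from the big list, plus a canonical and an undecomposable character.
kinds = {
    f"U+{ord(ch):04X}": decomposition_kind(ch)
    for ch in "\u2074\u2084\u2463\u248b\uff14\u00c5A"
}
print(kinds)
```

The tag only classifies the kind of difference; NFKD and NFKC treat all tagged decompositions alike.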

 

Kompatibility Decomposition

 

Normalization Form KD (NFKD)

Definition:

  1. Kompatibility Decomposition
  2. Canonical Reordering

 

Normalization Form KC (NFKC)

Definition:

  1. Kompatibility Decomposition
  2. Canonical Reordering
  3. Canonical Recomposition
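
A short sketch of the effect, assuming Python's `unicodedata.normalize` and the variants of the digit four from the big list:

```python
import unicodedata

# superscript, subscript, circled, "4." (digit four full stop), fullwidth
fours = "\u2074\u2084\u2463\u248b\uff14"

nfkc = unicodedata.normalize("NFKC", fours)   # "4444.4"
nfc = unicodedata.normalize("NFC", fours)     # unchanged

# Kompatibility normalization includes canonical normalization.
ring_nfkd = unicodedata.normalize("NFKD", "\u212b")   # "A" + U+030A
ring_nfkc = unicodedata.normalize("NFKC", "\u212b")   # U+00C5

print(nfkc, nfc == fours)
```

Note how much information is lost: after NFKC, an exponent can no longer be told apart from an ordinary digit, which is why these forms are better suited to matching than to storing text.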

 

Take Care with Normalization Forms

Similar concerns may apply to other kinds of mappings

Because of the wide range of phenomena encoded by Unicode, there is always the chance that different phenomena interact in strange and unpredictable ways. Check carefully before assuming that there is no interaction. And if you don't find any interactions, don't assume that this will necessarily remain so in all future versions.
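
One interaction that is easy to overlook is string concatenation: two strings that are each in NFC can concatenate to a string that is not. The sketch assumes Python 3.8 or later for `unicodedata.is_normalized`; `nfc_concat` is a hypothetical helper.

```python
import unicodedata

a = "e"
b = "\u0301"   # combining acute accent on its own

each_is_nfc = (
    unicodedata.is_normalized("NFC", a) and unicodedata.is_normalized("NFC", b)
)
concat_is_nfc = unicodedata.is_normalized("NFC", a + b)


def nfc_concat(left: str, right: str) -> str:
    """Concatenate and renormalize, since normalization is not closed under +."""
    return unicodedata.normalize("NFC", left + right)


joined = nfc_concat(a, b)   # U+00E9

print(each_is_nfc, concat_is_nfc, f"{ord(joined):04X}")
```

Renormalizing the whole result is the simple approach taken here; for long strings, only the region around the boundary would actually need to be reexamined.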

 

Case Studies

 

Case Study: IETF from IDNA2003 to Precis

 

IETF Ideal

⇒ Strong preference for one-stop shopping

 

IDNA Development

 

IDNA 2003

(RFC 3490)
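
For experimentation: Python's built-in "idna" codec implements IDNA 2003 including its Nameprep mapping step, so it can be used to observe the mappings this approach applies. The domain names below are made up for illustration.

```python
# Case is mapped away before conversion to the ASCII-compatible form.
upper = "B\u00dcCHER.example".encode("idna")
lower = "b\u00fccher.example".encode("idna")

# Sharp s is mapped to "ss", so the result is plain ASCII.
sharp_s = "stra\u00dfe.example".encode("idna")

print(upper, lower, sharp_s)
```

The mapping of sharp s to "ss" is one of the places where later specifications made different decisions, so results from this codec should not be taken as representative of IDNA 2008 or UTS #46 processing.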

 

IDNA 2008

(RFCs 5890–5894)

 

IDNA 2008 Version Adjustments

 

Unicode IDNA Compatibility Processing

(UTS #46)

 

Precis

 

Precis Components

Use fixpoint because idempotency unclear
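
The fixpoint idea can be sketched generically: reapply the combined mapping until the output no longer changes, and give up after a bounded number of rounds. This is an illustrative sketch and not the PRECIS algorithm itself; `apply_until_fixpoint`, the limit `max_rounds`, and the two example mappings are assumptions made here.

```python
import unicodedata
from typing import Callable


def apply_until_fixpoint(
    mapping: Callable[[str], str], s: str, max_rounds: int = 10
) -> str:
    """Reapply `mapping` until it no longer changes the string."""
    for _ in range(max_rounds):
        mapped = mapping(s)
        if mapped == s:
            return s
        s = mapped
    raise ValueError("no fixpoint reached within max_rounds")


# Toy mapping that is not idempotent: strip trailing whitespace, then truncate.
def strip_then_truncate(s: str) -> str:
    return s.rstrip()[:3]


toy = apply_until_fixpoint(strip_then_truncate, "ab cd")   # "ab"


# A Unicode-flavoured combination: case folding followed by NFKC.
def fold_then_nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s.casefold())


folded = apply_until_fixpoint(fold_then_nfkc, "Stra\u00dfe \u2163")
```

By construction, the returned string is stable under the mapping, whether or not the mapping itself is idempotent.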

 

Case Study 2: File Systems and IDNs/IRIs

 

Strategy: When and How to Apply

 

Strategy: What to Apply

 

More Advice on Normalization

String Matching for the Web (W3C Working Group Note Feb. 2019)

Update (9th October 2020) under review, please send comments

 

Questions and Answers

This is your time for questions!

Questions may be unnormalized, but answers will be normalized!

You can also send questions by email to duerst@it.aoyama.ac.jp

 

Acknowledgments

Mark Davis for many years of collaboration (and some disagreements) on normalization, and for proposing a wider approach to the topic of normalization.

Too many people from Unicode, W3C, and the IETF to list them all.

The IME Pad for facilitating character input.

Amaya and Opera 12.18 for slide production and display.