Tutorial
Equivalence, Mapping, and Normalization

https://www.sw.it.aoyama.ac.jp/2020/pub/IUC44EMN

IUC44, Virtual Conference, 14 October 2020

Martin J. DÜRST

duerst@it.aoyama.ac.jp

About the Slides

The most up-to-date version of the slides is available at https://www.sw.it.aoyama.ac.jp/2020/pub/IUC44EMN

These slides were created in HTML, for projection with a very old version (≤12.18) of Opera (please contact me if you cannot find a copy). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but may instead appear as empty boxes or question marks, or apart rather than composed. To check the identity of a character, use e.g. Richard Ishida's Unicode Converter and the Unicode code charts.

Abstract

The multitude of characters available in Unicode means that there are many ways in which characters or strings can be equivalent, similar, or otherwise related. In this tutorial, you will learn about all these relationships, in order to be able to better work with Unicode data and programs handling Unicode data. The tutorial assumes that participants have a basic understanding of the scope and breadth of Unicode, possibly from attending tutorials earlier in the day.

Character relationships and similarities in Unicode range from linguistic and semantic similarities at one end to the same character being represented in different character encodings or Unicode encoding forms at the other end. In the middle, numerical and case equivalences, compatibility and canonical equivalences, graphic similarities, and many others can be found. This sometimes bewildering wealth of characters, equivalences, and relationships is due to the rich history of human writing as well as to the realities of character encoding policies and decisions.

The tutorial will give some guidance to help users navigate equivalences and differences for their use cases and applications. Each of these many equivalences or relationships can or should be ignored in some processing contexts, but may be crucial in others. Contexts may range from use as identifiers (e.g. user ids and passwords, with security consequences) to searching and sorting. For most of the equivalences, data is available in the Unicode Standard and its associated data files, or is provided by other standards such as IDNA and PRECIS. But the use of this data and the functions provided by various libraries requires understanding of the background of the equivalences.

When testing for equivalence of two strings, the general strategy is to map or normalize both strings to a form that eliminates accidental (in the given context) differences, and then compare the strings on a binary level. The tutorial will not only look at officially defined equivalences, but will also discuss variants that may be necessary in practice to cover specialized needs. We will also discuss the relationships between various classes of equivalences, necessary to avoid pitfalls when combining them, and the stability of the equivalences over time and under various operations such as string concatenation.

Outline

Introduction
Example: CasE
A Bit of Theory
Various Equivalences
Canonical Equivalence (NFD and NFC)
Kompatibility Equivalence (NFKD and NFKC)
Case Studies
Questions

Introduction

Audience

Basic knowledge about Unicode
(e.g. you attended some tutorials today)
Responsible for Unicode data
or programs dealing with Unicode data

Main Points

Lots of characters: Lots of similarities
Why should you care?
Equivalences and mappings:
- Casing
- Numbers
- Security confusables
- Sorting
- Canonical/Kompatibility Equivalence
- Normal forms
Technical details
History and politics
Strategic advice

Speaker Normalization

Interested in Unicode since ca. 1992
First proposal for Internationalization of Domain Names (draft-duerst-dns-i18n-00, 1996)
Instigator of NFC (see draft-duerst-i18n-norm, 1997)
Co-creator of UAX #15, Unicode Normalization Forms
Advocate of normalization in the IETF and W3C
Three normalization implementations (charlint, checking for XML1.1, eprun (now part of Ruby))
Implemented full Unicode case mapping/folding in Ruby
W3C Internationalization Interest Group Chair since 1998

Let's normalize the speaker, or in other words see how the speaker was involved in normalization!

Tell me the Difference

What's the difference between the following characters:

৪ and 8 and ∞
Bengali 4, 8, and infinity

க௧
Tamil ka and Tamil 1

骨 and 骨
same codepoint, different language preference/glyph structure

樂, 樂 and 樂
different Korean pronunciation, different codepoints, should look the same

A Big List

Linguistic and Semantic Similarities	color, colour, Farbe, 色	not discussed
Transliterations/Transcriptions	Putin, Путин, 푸틴, プーチン, 普京, پوتین, بوتين	not discussed
Accidental Graphical Similarities	T, ⊤, ┳, ᅮ	this talk's topic
Script Similarities	K, Κ, К; ヘ、へ
Cross-script 'Equivalences'	な, ナ
Numeric 'Equivalence'	7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎７⓻❼➆➐ (59 in total, not including Han ideographs)
Case 'Equivalence'	g, G; ǆ, ǅ, Ǆ; Σ, ς, σ
Compatibility Equivalence	⁴, ₄, ④, ⒋, ４
Canonical Equivalence	Å, Å; Ṝ, Ṛ + ̄, R + ̣ + ̄
Font Differences	A, A, A, A, A	not discussed
Unicode Encoding Forms	UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71
(Legacy) Encodings	ISO-8859-X, Shift_JIS,...

Similarities and equivalences range from binary (at the bottom) to semantic (at the top). In this tutorial, we assume a uniform Unicode Encoding Form (e.g. UTF-8) and do not deal with linguistic and semantic similarities, but discuss all the in-between layers.

Why all these Similarities and Variants?

Historical cultural evolution
- Scripts and characters borrowed
- Changed by writing tools and customs
- From stylistic to orthographic distinctions
- Independent quest for simple graphics
Encoding realities
- Encoding structure choices
- Round-tripping requirements
- Encoding compromises
- Encoding accidents

Why Does it Matter?

(actual case!)

Possible to hijack user account
How:
- Create account with compatibility equivalent to the targeted one (e.g. ᴮᴵᴳᴮᴵᴿᴰ)
- Ask for a password reset
- Log in and take over
Why:
- Username normalization incomplete
- ᴮᴵᴳᴮᴵᴿᴰ → BIGBIRD
- BIGBIRD → bigbird
Correct: ᴮᴵᴳᴮᴵᴿᴰ → bigbird
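
The two-step failure above can be reproduced with Ruby's built-in String#downcase and String#unicode_normalize. This is only a sketch of the general mechanism, not the code of the affected service; the method names broken_canonical and better_canonical are made up for illustration.

```ruby
# Hypothetical username mappings; the method names are illustrative only.

# Lowercases first, then applies NFKC. The modifier letters have no
# lowercase mapping, so the capitals produced by NFKC survive one pass.
def broken_canonical(name)
  name.downcase.unicode_normalize(:nfkc)
end

# Applies NFKC first, then lowercases. This settles in one pass for this
# input; it is not a proof of idempotence for all inputs.
def better_canonical(name)
  name.unicode_normalize(:nfkc).downcase
end

# MODIFIER LETTER CAPITAL B, I, G, B, I, R, D
attacker = "\u1D2E\u1D35\u1D33\u1D2E\u1D35\u1D3F\u1D30"

once  = broken_canonical(attacker)  # what sign-up would store
twice = broken_canonical(once)      # what a later processing step computes
puts once, twice, better_canonical(attacker)
```

The account takeover becomes possible exactly because once and twice differ: the stored name and the name used for the password reset no longer refer to the same account.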

Applications

Identification:
- Domain Names/Filenames/Usernames/Passwords
- Identifiers in formats (JSON/XML/...) and programming languages
- Security implications!
Searching
Sorting

Equivalences/Mappings Defined by Unicode

Canonical Equivalence (NFC/NFD)
Kompatibility Equivalence (NFKC/NFKD)
Case equivalence: Case folding/mapping
Sorting: Sorting keys
Numeric Values
Accidental graphical similarities: confusability skeletons
Unicode Identifier and Pattern Syntax (UTS #31)

It's Compatibility Equivalence officially, but we will use Kompatibility Equivalence (as in NFKC/NFKD) hereafter for better distinction with Canonical Equivalence.

Case

Something Familiar: Casing

A ↔ a

...

Z ↔ z

Is it actually that easy?

Special Casings

Case is only relevant for Latin, Greek, Cyrillic, Armenian, Deseret, Old Hungarian, Cherokee (?), and ancient Georgian
Unicode knows three cases: lowercase, Titlecase, and UPPERCASE
Case may be language/locale-dependent
(e.g. English: i↔I; Turkish: i↔İ, ı↔I)
Case may depend on typographic tradition
(accents on French uppercase letters,...)
Case may not be character-to-character
(e.g. ß↔SS)
Case may be context-dependent
e.g. σ↔Σ, but at the end of a word: ς↔Σ
Very vaguely related: Japanese カタカナ↔ひらがな
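
Two of these special cases can be observed directly with Ruby's String#upcase and String#downcase, which implement full Unicode case mapping. The following is a small sketch; the :turkic option is the one documented for these methods.

```ruby
# Full case mapping may change the string length.
puts "Straße".upcase         # the single character ß becomes SS

# The default mapping is language-neutral; Turkic rules must be requested.
puts "i".upcase              # LATIN CAPITAL LETTER I
puts "i".upcase(:turkic)     # LATIN CAPITAL LETTER I WITH DOT ABOVE
puts "I".downcase(:turkic)   # LATIN SMALL LETTER DOTLESS I
```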

Case Equivalence/Mapping

Case Folding:
- Aggressively maps case-related strings together
- Target is lower case (uppercase T/Τ/Т look the same in Latin, Greek, and Cyrillic, but lowercase t/τ/т differ; see Wikipedia)
- Suited e.g. for search
Default Case Mapping:
- Not as aggressive as case folding
- Goes both ways
- Use if no context information available
- Use after context-specific mappings
Language-dependent Case Mappings
Simple Case Mapping:
- Character-to-character only
- No change of string length
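
The difference between case folding and default case mapping can be sketched in Ruby, where String#downcase accepts a :fold option. The helper name caseless_equal? is an assumption made for this example.

```ruby
# Caseless comparison: map both sides with Unicode case folding, then
# compare the results for equality.
def caseless_equal?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

puts caseless_equal?("Straße", "STRASSE")
puts caseless_equal?("ΟΔΟΣ", "οδος")   # capital sigma vs. final sigma

# Default lowercasing is less aggressive than folding.
puts "ß".downcase           # unchanged
puts "ß".downcase(:fold)    # folded to two characters
```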

Data for Casing

Columns 13 (upper), 14 (lower), and 15 (title) of UnicodeData.txt:

01C4;Ǆ;Lu;...;    ;01C6;01C5
01C5;ǅ;Lt;...;01C4;01C6;01C5
01C6;ǆ;Ll;...;01C4;    ;01C5

Not only Lu/Lt/Ll characters have case mappings
SpecialCasing.txt:
Contains all special cases
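
As a sketch of how the simple case mappings can be read from this file, the following Ruby code parses three lines in UnicodeData.txt format. The constant and method names are assumptions for this example; a real program would read the complete file.

```ruby
# Three lines in UnicodeData.txt format (semicolon-separated fields).
CASE_SAMPLE = <<~DATA
  01C4;LATIN CAPITAL LETTER DZ WITH CARON;Lu;0;L;<compat> 0044 017D;;;;N;LATIN CAPITAL LETTER D Z HACEK;;;01C6;01C5
  01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5
  01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5
DATA

# Returns { code point => { upper:, lower:, title: } }; nil means that the
# column is empty, i.e. there is no simple mapping of that kind.
def simple_case_mappings(data)
  data.each_line.map { |line|
    fields = line.chomp.split(';', -1)   # -1 keeps trailing empty fields
    # Columns 13 to 15 (counting from 1) are array indices 12 to 14.
    upper, lower, title = fields[12, 3].map { |f| f.empty? ? nil : f.hex }
    [fields[0].hex, { upper: upper, lower: lower, title: title }]
  }.to_h
end

simple_case_mappings(CASE_SAMPLE).each do |cp, m|
  puts format('U+%04X %p', cp, m)
end
```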

A Bit of Theory

Equivalence vs. Mapping

Equivalence is defined between pairs of strings
Two strings HELLO and hello are equivalent (or not)
Mapping is defined from one string to another
String Good Bye maps to string good bye
Equivalence is often defined using mapping
Example: For the FOO equivalence, with equivalence FOOeq and mapping FOOmap,
FOOmap(A) = FOOmap(B) ⇔ FOOeq(A, B)

Equivalence is often defined using a mapping. As an example, an equivalence FOOeq may be defined with a mapping FOOmap. Two strings A and B are FOOequivalent if (and only if) the FOOmapping of A and the FOOmapping of B are equal. Put the other way round, every mapping defines an equivalence.
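
The construction of FOOeq from FOOmap can be written generically. The following Ruby sketch uses hypothetical names (equivalence_from, CASE_EQ, NFC_EQ) and two mappings that Ruby provides out of the box.

```ruby
# Builds an equivalence test from any mapping (a callable taking a string):
# two strings are equivalent if and only if their mapped forms are equal.
def equivalence_from(mapping)
  ->(a, b) { mapping.call(a) == mapping.call(b) }
end

CASE_EQ = equivalence_from(->(s) { s.downcase(:fold) })
NFC_EQ  = equivalence_from(->(s) { s.unicode_normalize(:nfc) })

puts CASE_EQ.call("HELLO", "hello")
puts NFC_EQ.call("\u00C5", "A\u030A")   # precomposed vs. A + combining ring
```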

Security Conditions for Mappings

Stable mapping
- Disallowed characters may become allowed
- Additional mappings may be added if they don't collapse existing data
Store mapped form (but may want to keep display form, too)
Make sure mapping is idempotent, or check for a fixpoint

Idempotence

A function f is idempotent if applying it once gives the same result as applying it repeatedly.

f is idempotent ⇔ f(f(x)) = f(x) for all x

Caution: The composition of two independently idempotent functions may not be an idempotent function

f(f(x)) = f(x), g(g(x)) = g(x) ↛ f(g(f(g(x)))) = f(g(x)), i.e. x ↦ f(g(x)) may not be idempotent

Examples:

Canonical decomposition and Kompatibility decomposition
Normalization and case mapping

Fixpoint

Replacement for idempotence
Apply a function until the result does not change anymore

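Sketch in Ruby:

def fixpoint(function, start)
  # Apply function (a callable) until the result stops changing.
  # Caution: loops forever if no fixpoint is ever reached.
  current = start
  loop do
    result = function.call(current)
    return result if result == current
    current = result
  end
end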

Caution: Fixpoints may not exist
(not usually a problem for I18N)
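
Because fixpoints may not exist, a defensive variant can cap the number of iterations. The following Ruby sketch is one possible shape; the block-based signature and the limit keyword are assumptions of this example.

```ruby
# Fixpoint iteration with an iteration cap, so that a mapping without a
# fixpoint raises an error instead of looping forever.
def fixpoint(start, limit: 10)
  current = start
  limit.times do
    result = yield(current)
    return result if result == current
    current = result
  end
  raise ArgumentError, "no fixpoint within #{limit} steps"
end

# Lowercasing followed by NFKC is not idempotent for U+2122 TRADE MARK SIGN:
# the first pass yields capital letters, the second pass lowercases them.
puts fixpoint("\u2122") { |s| s.downcase.unicode_normalize(:nfkc) }
```

The cap turns a silent hang into a visible error, which is usually the safer failure mode for identifier processing.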

Various Equivalences

Numerical Equivalence

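Fields 6 (decimal), 7 (digit), and 8 (numeric, including fractions) of UnicodeData.txt, counting from 0 as in UAX #44 (columns 7 to 9 when counting from 1 as on the other slides)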
(ASCII) hexadecimal digits in PropList.txt
Other ways to represent numbers:
- Greek, Hebrew: Letters with specific values
- Numbers as spoken out
- Han ideographs

Security Confusables

Described in UTS #39: Unicode Security Mechanisms
Data at http://www.unicode.org/Public/security/latest/
Detecting:
- Visible lookalikes
- Script confusables
- Number confusables
- ...
Purpose is spoofing detection
Apply whenever user is in control of creating identifier
Better be safe than sorry

Multilevel complexity because not only characters, but also scripts are involved. Not necessarily using equivalence relations, because it may be possible for a to be confusable to b, and b to c, but not a to c.
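
The skeleton idea from UTS #39 can be sketched in a few lines of Ruby. The table below is a toy excerpt with five Cyrillic letters, chosen for illustration; it is not the real confusables data, and the procedure is simplified.

```ruby
# Toy prototype table in the spirit of confusables.txt (illustrative only).
TOY_CONFUSABLES = {
  "\u0430" => "a",  # CYRILLIC SMALL LETTER A
  "\u0435" => "e",  # CYRILLIC SMALL LETTER IE
  "\u043E" => "o",  # CYRILLIC SMALL LETTER O
  "\u0440" => "p",  # CYRILLIC SMALL LETTER ER
  "\u0441" => "c"   # CYRILLIC SMALL LETTER ES
}.freeze

# Simplified skeleton: NFD, replace each character by its prototype, NFD again.
def toy_skeleton(s)
  mapped = s.unicode_normalize(:nfd).each_char.map { |c| TOY_CONFUSABLES.fetch(c, c) }.join
  mapped.unicode_normalize(:nfd)
end

# Two strings are treated as confusable if their skeletons are identical.
def toy_confusable?(a, b)
  toy_skeleton(a) == toy_skeleton(b)
end

puts toy_confusable?("paypal", "p\u0430yp\u0430l")   # Cyrillic а in place of a
```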

Sorting

Sorting is not only about equivalence (=), but about ordering (≤)
Mechanism: Mapping to sort keys
String A sorts before B if sort_key(A) < sort_key(B)
Sorting is language/locale-dependent (viewer, not data!)
Sorting uses levels, such as:
- Base character (A<B)
- Diacritics (u<ü)
- Case (g<G or G<g)
- Others (space, articles,...)

Sorting is related to equivalence and mappings, but is a talk of its own.
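
The idea of multi-level sort keys can be illustrated with a toy key in Ruby. This is a deliberate simplification for illustration, not the Unicode Collation Algorithm, and it ignores language-specific tailoring; the names toy_sort_key and WORDS are made up.

```ruby
# Toy three-level sort key: later levels only matter when all earlier
# levels compare equal.
def toy_sort_key(s)
  nfd = s.unicode_normalize(:nfd)
  [
    nfd.gsub(/\p{Mn}/, '').downcase(:fold),  # level 1: base characters only
    nfd.downcase(:fold),                      # level 2: adds diacritics
    s                                         # level 3: adds case (here: uppercase first)
  ]
end

WORDS = ["zebra", "\u00DCber", "uber", "apple", "Uber", "\u00FCber"].freeze

puts WORDS.sort_by { |w| toy_sort_key(w) }
```

A plain WORDS.sort would order by code point instead, putting all uppercase ASCII before lowercase and the words with ü after zebra.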

Canonical Equivalence

Canonical Equivalence (≡)

From Unicode Standard Annex (UAX) #15, Section 1.1:

Canonical equivalence is a fundamental equivalency between characters or sequences of characters that represent the same abstract character, and when correctly displayed should always have the same visual appearance and behavior.

(emphasis added by presenter)

Why Canonical Equivalence

Order of combining characters
R + ̣ + ̄ vs. R + ̄ + ̣
Precomposed vs. Decomposed
Ṝ vs. Ṛ + ̄ vs. R + ̣ + ̄
Singletons
Å vs. Å, 樂 vs. 樂 vs. 樂
Hangul syllables vs. Jamos
가 vs. ᄀ + ᅡ
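
Each of these four cases can be checked with a canonical equivalence test built on NFD. The Ruby sketch below uses String#unicode_normalize; the helper name canonically_equivalent? is an assumption for this example.

```ruby
# Two strings are canonically equivalent if their NFD forms are identical.
def canonically_equivalent?(a, b)
  a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)
end

# Order of combining characters: dot below and macron in either order.
puts canonically_equivalent?("R\u0323\u0304", "R\u0304\u0323")
# Precomposed vs. decomposed.
puts canonically_equivalent?("\u1E5C", "R\u0323\u0304")
# Singleton: ANGSTROM SIGN vs. LATIN CAPITAL LETTER A WITH RING ABOVE.
puts canonically_equivalent?("\u212B", "\u00C5")
# Hangul syllable vs. conjoining Jamo.
puts canonically_equivalent?("\uAC00", "\u1100\u1161")
```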

Combining Characters

Many scripts use base characters and diacritics
Diacritics and similar characters are called Combining Characters
Often, all combinations may appear (e.g. Arabic, Hebrew)
Often, combinations are limited per language,
but open-ended for the script (Latin)
Separate encoding seems advantageous
(nothing else available in Unicode 1.0)
Order meaningful if e.g. all diacritics above
Order arbitrary if e.g. above and below

⇒Canonical Ordering

Canonical Combining Class

A character property, integer between 0 and 254
All base characters have ccc=0
Only combining marks (but not all of them) have ccc≠0
ccc is the same for combining characters that
- Go on the same side of the base character
  (e.g. below: 220, above: 230)
- Therefore interact graphically, so their relative order must be maintained
UnicodeData file (4th field, 0 when no entry)
(see also DerivedCombiningClass.txt)

Canonical Ordering

For any two consecutive characters ab
reorder as ba if ccc(a) > ccc(b) > 0
until you find no more such reorderings
Very similar to bubblesort (but not the same!)
Local operation: most characters have ccc=0, and don't move

Canonical Ordering Example

Example (ş̤̂̏, ccc in (), characters being exchanged)

original	s	̂ (230)	̧ (202)	̏ (230)	̤ (220)
after first step	s	̧ (202)	̂ (230)	̏ (230)	̤ (220)
after second step	s	̧ (202)	̂ (230)	̤ (220)	̏ (230)
final	s	̧ (202)	̤ (220)	̂ (230)	̏ (230)

4 diacritics, 24 permutations, which fall into two equivalence classes of 12 permutations each (only the relative order of the two marks with ccc=230 matters)
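
The example can be replayed with a direct implementation of the reordering rule. In the following Ruby sketch, the ccc table covers only the four marks of the example; real code would take the values from UnicodeData.txt or DerivedCombiningClass.txt. The names are assumptions for this example.

```ruby
# ccc values for the marks used in the example only.
EXAMPLE_CCC = { 0x0302 => 230, 0x0327 => 202, 0x030F => 230, 0x0324 => 220 }.freeze

# Exchanges adjacent code points a, b while ccc(a) > ccc(b) > 0, and repeats
# until no more exchanges happen. Characters with ccc=0 never move.
def canonical_order(codepoints, ccc = EXAMPLE_CCC)
  cps = codepoints.dup
  loop do
    swapped = false
    (0...cps.length - 1).each do |i|
      a = ccc.fetch(cps[i], 0)
      b = ccc.fetch(cps[i + 1], 0)
      next unless a > b && b > 0
      cps[i], cps[i + 1] = cps[i + 1], cps[i]
      swapped = true
    end
    return cps unless swapped
  end
end

# s followed by the four marks in the order of the "original" row.
EXAMPLE_ORIGINAL = [0x73, 0x0302, 0x0327, 0x030F, 0x0324].freeze

puts canonical_order(EXAMPLE_ORIGINAL).map { |cp| format('%04X', cp) }.join(' ')
```

For this input, the result agrees with what String#unicode_normalize(:nfd) produces, because nothing in the string needs to be decomposed.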

Precomposed Characters

Legacy encodings have characters including diacritics
(Example: ISO-8859-1: ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ...ñòóôõö÷øùúûüýþÿ)
Limited transcoding technology
Limited display technology
European national pride and voting power
Unicode - ISO 10646 merger
Precomposed characters in Unicode 1.1
Equivalences between precomposed characters and combining sequences defined
⇒Normalization Form D

Hangul Syllables

Korean is written in square syllables (Hangul, 한글)
Syllables consist of
- Consonant(s) + vowel(s) (가)
- Consonant(s) + vowel(s) + consonant(s) (민)
Individual pieces (Conjoining Jamo) are also encoded:
- Leading consonant(s) (ᄍ)
- Vowel(s) (ᅱ)
- Trailing consonant(s) (ᆹ)
For all L, V, and T in modern use, every LV and LVT syllable is encoded (11,172 in total)
Syllables with historical L (ᅑ), V (ᆑ), and T (ᇷ) must use Jamo

Normalization Form D (NFD)

(D stands for Decomposed)

NFD is the result of Canonical Decomposition:

Apply Decomposition Mappings
1. Code charts, using ≡, or UnicodeData file (6th field, when no <...>)
  (apply repeatedly, because all decompositions are singletons or binary)
2. Hangul syllable decomposition (algorithmic)
until you find no more decompositions
Apply Canonical Reordering
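
Hangul syllable decomposition needs no table, only arithmetic on code points. The following Ruby sketch uses the constants from the Unicode Standard; the method name decompose_hangul is an assumption for this example.

```ruby
# Constants for algorithmic Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT   # number of precomposed syllables

# Returns the Jamo code points for a precomposed Hangul syllable, or the
# code point itself (in an array) for anything else.
def decompose_hangul(cp)
  index = cp - S_BASE
  return [cp] unless (0...S_COUNT).cover?(index)
  l = L_BASE + index / (V_COUNT * T_COUNT)
  v = V_BASE + (index % (V_COUNT * T_COUNT)) / T_COUNT
  t = T_BASE + index % T_COUNT
  t == T_BASE ? [l, v] : [l, v, t]   # index 0 in the T position means "no trailing consonant"
end

puts decompose_hangul(0xAC00).map { |cp| format('%04X', cp) }.join(' ')
puts decompose_hangul(0xBBFC).map { |cp| format('%04X', cp) }.join(' ')
```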

Evaluation of NFD

Advantages:

Flat and straightforward
Fast operations

Disadvantages:

Most data is not in NFD
Difficult to force everybody to use it

Usage Layers

topmost: User application: Meaningful characters
higher: DNS/HTTP/SMTP/FTP: Bytes or characters
lower: TCP/IP: Byte streams
bottom: Electric current, light, electromagnetic waves

Different Mindsets

Multilingual Editor:
- Character boundaries, complex display
- Preferred internal normalization
- Normalization not a major burden
Application Protocols (e.g. IETF):
- Textual content just transported as bytes+charset info
- Identifiers (mostly) ASCII, comparison bytewise or ASCII-case insensitive
Internationalized Identifiers:
- E.g. XML element names (W3C)
- Byte or codepoint comparison
- Normalization close to actual practice desired

⇒Normalization Form C

Normalization Form C (NFC)

(C stands for Composed)

Definition:

Canonical Decomposition (see NFD,
includes decomposition mappings and canonical reordering)
Canonical Recomposition
('reverse' of decomposition mappings;
see also below)
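
The relationship between the two forms can be tried out in Ruby, whose String#unicode_normalize and String#unicode_normalized? support :nfc and :nfd. The constant names below are made up for this small sketch.

```ruby
# The same name with u + COMBINING DIAERESIS (NFD) and with precomposed ü (NFC).
NFD_NAME = "Du\u0308rst"
NFC_NAME = NFD_NAME.unicode_normalize(:nfc)

puts NFD_NAME.length, NFC_NAME.length        # NFC needs one code point fewer
puts NFD_NAME.bytesize, NFC_NAME.bytesize    # and fewer bytes in UTF-8
puts NFC_NAME.unicode_normalized?(:nfc)
puts NFC_NAME.unicode_normalize(:nfd) == NFD_NAME
```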

Evaluation of NFC

Advantages:

Very close to usage on Web
(design goal; in some ways actually too close)
More compact than NFD

Disadvantages:

More complex than NFD

The phrase "in some ways actually too close" refers to the fact that NFC is close enough to actual usage on the Web that there isn't too much motivation or pressure to actively normalize content. This is not exactly the purpose of a normalization form, as it means that there is still some content out there that is not normalized.

Limits of NFC and Canonical Equivalence

Designed mainly for Latin
Koreans don't like it for historical Korean
Macintosh file system NFD (roughly: Hangul NFC, rest NFD)
Canonical equivalence of compatibility Han Ideographs
(use variant sequences)
Reordering may be too strict or too loose
Reordering may not match typing order or rendering order
Stability guarantees may be interpreted purely formally
Arabic, Thai,.....: not ideal
(new work for rendering Arabic: Unicode Arabic Mark Ordering Algorithm)

Canonical Recomposition

Works repeatedly pairwise
Starts with base character, ends before next base character
When combination exists, combine and continue with next combining character
Skip additional characters of same combining class after combining failure
Data from code charts, using ≡, or UnicodeData.txt (field 6, when no <...>)
Exclude Composition Exclusions
Hangul syllable composition (algorithmic)

Composition Exclusions

Singletons (nothing to recombine)
Script specific:
(precomposed rarely used)
- Indic (क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਖ਼ਗ਼ਜ਼ੜਫ਼ଡ଼ଢ଼)
- Tibetan
- Hebrew (שׁשׂשּׁשּׂאַאָאּבּגּדּהּוּ...)
Non-starter decompositions
(cases with oddball ccc)
Post Composition Exclusions (⫝̸, some musical symbols)
Data from CompositionExclusions.txt
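
The effect of the exclusions is easy to observe: an excluded character does not come back when it is decomposed and then recomposed. The Ruby sketch below uses a hypothetical helper named recomposes? to test this for one character of each kind.

```ruby
# True if the character survives a round trip through NFD and NFC.
def recomposes?(char)
  char.unicode_normalize(:nfd).unicode_normalize(:nfc) == char
end

puts recomposes?("\u00C5")   # LATIN CAPITAL LETTER A WITH RING ABOVE: ordinary composite
puts recomposes?("\u212B")   # ANGSTROM SIGN: singleton
puts recomposes?("\u0958")   # DEVANAGARI LETTER QA: script-specific exclusion
puts recomposes?("\u2ADC")   # FORKING: post composition exclusion
```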

Stability of NFD and NFC

Normalized text should stay so in future Unicode versions
Okay for NFD
Needs some work for NFC
Problem is new precomposed characters
old: q + ̂, new: q̂ (precomposed)
New (post composition) precomposed characters:
- Need to be excluded from composition (NFC)
  (unless both parts are new)
- Discourages encoding in the first place
Careful stability guarantees
Effective stop for encoding precomposed characters
- Strengthens compromise between Unicode and ISO
- Replaced by NamedSequences.txt
Danger to interpret stability guarantee strictly formally
(anything can be added if it's not defined to be equivalent)

Normalization Corrigenda

(4 out of 9 corrigenda for all of Unicode)

Yod with Hiriq Normalization
(forgotten post composition exclusion)
U+F951 Normalization
(wrong data entry: 陋 ≈電 rather than 陋≈陋)
Five CJK Canonical Mapping Errors
(some data entries simply wrong)
Normalization Idempotency
(difference between intent/sample implementation and description)

Kompatibility Equivalence

Kompatibility Equivalence (≈)

From Unicode Standard Annex (UAX) #15, Section 1.1:

Compatibility equivalence is a weaker equivalence between characters or sequences of characters that represent the same abstract character, but may have a different visual appearance or behavior.

(emphasis added by presenter)

May not conserve semantics: 2³=8, but 2³ ≈ 23
Is not necessarily very consistent (e.g. ④≈4, but ➃≁4)
Use different characters when semantically different (e.g. Mathematical notation)
Use markup/styling for stylistic differences (e.g. emphasis)
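
The first two points can be verified with a kompatibility equivalence test built on NFKD. This is a Ruby sketch; the helper name kompat_equivalent? is an assumption for this example.

```ruby
# Two strings are kompatibility equivalent if their NFKD forms are identical.
def kompat_equivalent?(a, b)
  a.unicode_normalize(:nfkd) == b.unicode_normalize(:nfkd)
end

puts kompat_equivalent?("2\u00B3", "23")   # SUPERSCRIPT THREE loses its meaning
puts kompat_equivalent?("\u2463", "4")     # CIRCLED DIGIT FOUR
puts kompat_equivalent?("\u2783", "4")     # DINGBAT CIRCLED SANS-SERIF DIGIT FOUR
```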

All Things Kompatibility

<tag> indicates and classifies kompatibility in UnicodeData.txt (6th field)

<noBreak>: Difference in breaking properties only
<super>: Superscripts
<sub>: Subscripts
<fraction>: Fractions
<circle>: Circled letters
<square>: Square blocks
<font>: HEBREW LETTER WIDE; MATHEMATICAL/ARABIC MATHEMATICAL
<wide>: Fullwidth (double width) variants
<narrow>: Halfwidth variants
<small>: Small variants
<vertical>: Vertical variants
<isolated>, <final>, <initial>, <medial>: Arabic contextual variants and ligatures
<compat>: General compatibility (e.g. spacing/nonspacing), ligatures, double characters (e.g. double prime), roman numerals, script variants: long s, greek theta, space width variants (EM space,...), parenthesized, with full stop/comma, IDEOGRAPHIC TELEGRAPH; (CJK/KANGXI) radicals, HANGUL LETTER variants

Kompatibility Decomposition

Individual kompatibility decompositions proceed in a single step
(e.g. FDFA ≈ 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648 0633 0644 0645)
However, kompatibility and canonical decomposition need to be applied repeatedly:
- First compatibility decomposition, then canonical decomposition:
  U+01C4, LATIN CAPITAL LETTER DZ WITH CARON ≈<compat> 0044 017D ≡> 0044 005A 030C
- First canonical decomposition, then compatibility decomposition:
  U+0385, GREEK DIALYTIKA TONOS ≡> 00A8 0301 ≈<compat> 0020 0308 0301
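
Both chains can be checked against Ruby's String#unicode_normalize, which applies the mappings repeatedly as required. The helper hex_codepoints is made up for this sketch and only formats the output.

```ruby
# Formats a string as space-separated four-digit hexadecimal code points.
def hex_codepoints(s)
  s.codepoints.map { |cp| format('%04X', cp) }.join(' ')
end

puts hex_codepoints("\u01C4".unicode_normalize(:nfd))    # no canonical decomposition
puts hex_codepoints("\u01C4".unicode_normalize(:nfkd))   # compat, then canonical
puts hex_codepoints("\u0385".unicode_normalize(:nfd))    # canonical only
puts hex_codepoints("\u0385".unicode_normalize(:nfkd))   # canonical, then compat
```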

Normalization Form KD (NFKD)

Definition:

Kompatibility Decomposition
Canonical Reordering

Normalization Form KC (NFKC)

Definition:

Kompatibility Decomposition
Canonical Reordering
Canonical Recomposition

Take Care with Normalization Forms

Normalization forms interact with case conversion
Normalization forms interact with string operations
e.g. concatenation
Normalization forms interact with markup
e.g. ≮≡<+ ̸
Normalization forms interact with escaping
Normalization interacts with Unicode Versions
(but greatest care has been taken to limit this)

Similar concerns may apply to other kinds of mappings

Because of the wide range of phenomena encoded by Unicode, there is always the chance that different phenomena interact in strange and unpredictable ways. Before assuming no interaction, carefully check. If you don't find any interactions, don't assume that will be necessarily so in all future versions.
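
Two of these interactions, concatenation and markup, can be demonstrated in a few lines of Ruby. The helper name normalized_concat is an assumption for this sketch.

```ruby
# Concatenation of two NFC strings is not necessarily NFC, so the result
# is normalized again.
def normalized_concat(a, b)
  (a + b).unicode_normalize(:nfc)
end

base = "e"
mark = "\u0301"   # COMBINING ACUTE ACCENT

puts base.unicode_normalized?(:nfc), mark.unicode_normalized?(:nfc)
puts (base + mark).unicode_normalized?(:nfc)
puts normalized_concat(base, mark).length

# Markup: a "<" followed by COMBINING LONG SOLIDUS OVERLAY composes to
# NOT LESS-THAN, so the "<" disappears as a separate character.
puts ("<" + "\u0338").unicode_normalize(:nfc).include?("<")
```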

Case Studies

Case Study: IETF from IDNA2003 to Precis

IETF (Internet Engineering Task Force) need for 'normalized' identifiers
Internationalized Domain Names (IDNs, in IDNA)
User names, passwords, nicknames
(SASL, XMPP,...)
Internationalized Email Addresses

IETF Ideal

Not experts on characters
Avoid endless discussion on specific scripts,...
Assume problem is already solved elsewhere

⇒ Strong preference for one-stop shopping

IDNA Development

Design team for character issues
IDNA 2003
IDNA 2008
TR 46

Precis (framework):
- Allows characters based on character properties
- Width mapping (<wide>/<narrow>)
- Additional mappings
- Case mapping
- Normalization (NFC or NFKC)

IDNA 2003

(RFC 3490)

Case folding
NFKC
Nameprep (RFC 3491), based on Stringprep (RFC 3454)
Wide repertoire (what is not forbidden is allowed)
Based on Unicode 3.2 (fixed version, e.g. without Mongolian)

IDNA 2008

(RFCs 5890 to 5894)

Mappings (incl. case) outside spec
Result must be NFC, lower case
Contextual rules (e.g. for Malayalam)
Narrow repertoire (what's not allowed is forbidden)
Inclusion/exclusion based on character properties
Base version is Unicode 5.2.0
Semi-automatic version adjustments

IDNA 2008 Version Adjustments

Updated to Unicode 6.0.0 in RFC 6452
- Very short (4 pages)
- Core statement: No change to RFC 5892 is needed based on the changes made in Unicode 6.0.
Got stuck for a long time at Unicode 7.0.0
- 'Thinking' at that time documented in
  https://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-05
- 7 different proposals for how to proceed
- Original stumbling block is U+08A1, ARABIC LETTER BEH WITH HAMZA ABOVE (new in Unicode 7.0.0)
- Problem: Both ARABIC LETTER BEH and HAMZA ABOVE already exist, but there is no equivalence
- Document kept growing (up to 35 pages)
- Not really a problem, because higher layer (registries) can take care
Moved on with RFC 8753 (IANA tables currently at Unicode version 11.0.0)

Unicode IDNA Compatibility Processing

(UTS #46)

Mixture between IDNA 2003 (for backwards compatibility)
and IDNA 2008
Updated for new Unicode versions, currently at Unicode version 13.0.0

Precis

Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols
Problem statement (RFC 6885, March 2013)
Updated specifications (Oct. 2017):
- Framework (RFC 8264)
  Building blocks and how to specify their combination
  (includes Identifiers and FreeForm)
- Usernames (CaseMapped and CasePreserved), OpaqueString, and Passwords (RFC 8265)
- Nicknames (RFC 8266)

Precis Components

Width mapping
Additional mappings (incl. character removals)
Case mappings
Normalization
Directionality (Bidi) restrictions
Character (type) restrictions
Script restrictions
Zero-length prohibition
Distinguishing between enforcement and comparison (e.g. for case distinctions)

Use fixpoint because idempotency unclear

Case Study 2: File Systems and IDNs/IRIs

Windows file system and IDNs are case-insensitive
Apple file system is case-sensitive
Windows file system keeps original casing for display
(needs data in two forms for efficiency)
IDNs don't keep original casing
(on the Web, there's no IBM, only ibm)
Windows file system does not normalize
IDNs and Apple file system normalize (but differently)

Strategy: When and How to Apply

To check equivalence, use mappings and compare for equality
Mapping can appear at system boundary or on actual equivalence check
System boundary may vary
- All of the Internet/WWW
- Specific servers, clients
- Some kinds/types of data
Check for consistency between system components
Check for consistency between software versions

Strategy: What to Apply

Case mapping (prefer lower case)
Pre-normalization tweaks
Normalization (Canonical or Kompatibility)
Post-normalization tweaks
Checks for disallowed characters
Checks for disallowed sequences (e.g. script mixing)

More Advice on Normalization

String Matching for the Web (W3C Working Group Note Feb. 2019)

Update (9th October 2020) under review, please send comments

Questions and Answers

This is your time for questions!

Questions may be unnormalized, but answers will be normalized!

You can also send questions by email to duerst@it.aoyama.ac.jp

Acknowledgments

Mark Davis for many years of collaboration (and some disagreements) on normalization, and for proposing a wider approach to the topic of normalization.

Too many people from Unicode, W3C, and the IETF to list them all.

The IME Pad for facilitating character input.

Amaya and Opera 12.18 for slide production and display.
