Tutorial
Equivalence, Mapping, and Normalization
http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut
IUC 37, Santa Clara, CA, U.S.A.,
21 October 2013
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2013 Martin J. Dürst, Aoyama Gakuin University
About the Slides
The most up-to-date version of the slides, as well as additional materials,
is available at http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut.
Audience
- Basic knowledge about Unicode (e.g. you attended some tutorials
today)
- Responsible for Unicode data or programs dealing with Unicode data
Main Points
- Unicode Normalization in Context
- Some technical details
- Some history and politics
- Some strategy
Speaker Normalization
Tell me the Difference
What's the difference between the following characters:
৪ and 8 and ∞
Bengali 4, 8, and infinity
க and ௧
Tamil ka and 1
骨 and 骨
meaning 'bone', same codepoint, different language preference/font
樂, 樂 and 樂
different Korean pronunciation, different codepoints
A Big List
Linguistic and Semantic Similarities | color, colour, Farbe, 色
Accidental Graphical Similarities | T, ⊤, ┳, ᅮ
Script Similarities | K, Κ, К; ヘ、へ
Cross-script 'Equivalences' | な, ナ
Numeric 'Equivalence' | 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including a few Han ideographs)
Case 'Equivalence' | g, G; dž, Dž, DŽ; Σςσ
Compatibility Equivalence | ⁴, ₄, ④, ⒋, 4
Canonical Equivalence | Å, Å; Ṝ, Ṛ + ̄, R + ̣ + ̄
Unicode Encoding Forms | UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71
(Legacy) Encodings | ISO-8859-X, Shift_JIS,...
Why all these Similarities and Variants?
- Historical cultural evolution
  - Scripts and characters borrowed
  - Changed by writing tools and customs
  - From stylistic to orthographic distinctions
  - Independent quest for simple graphics
- Encoding realities
  - Encoding structure choices
  - Round-tripping requirements
  - Encoding compromises
  - Encoding accidents
Applications
- Identification:
  - Domain Names/Filenames/Usernames/Passwords
  - Element names/attribute names/class names in XML/HTML/CSS
  - Variable names in programming languages
- Searching
- Sorting
Equivalences/Mappings Defined by Unicode
- Canonical Equivalence
  - Normalization Form D (NFD)
  - Normalization Form C (NFC)
- Kompatibility Equivalence
  - Normalization Form KD (NFKD)
  - Normalization Form KC (NFKC)
- Case equivalence: Case folding
- Sorting: Sort keys
- Numeric Values
- Accidental graphical similarities: confusability skeletons
Equivalence vs. Mapping
- Equivalence is defined between pairs of strings
Two strings A and B are equivalent (or not)
- Mapping is defined from one string to another
String C maps to string D
- Equivalence is often defined using mapping
Example: for equivalence FOOeq and mapping FOOmap,
FOOmap(A) = FOOmap(B) ⇔ FOOeq(A, B)
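A minimal sketch of this pattern in Python, using NFC normalization (via the standard unicodedata module) as the mapping; the names foo_map and foo_eq are illustrative, not part of any API:

```python
import unicodedata

def foo_map(s):
    # Illustrative mapping: normalization to NFC.
    return unicodedata.normalize('NFC', s)

def foo_eq(a, b):
    # The equivalence induced by the mapping:
    # A and B are equivalent iff they map to the same string.
    return foo_map(a) == foo_map(b)

print(foo_eq('\u00C5', 'A\u030A'))   # True: Å vs. A + combining ring above
```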
Sorting
- Sorting isn't just about equivalence (=), but about ordering (≤)
- Mapping to sort keys serves the same purpose:
  String A sorts before B if sort_key(A) ≤ sort_key(B)
- Sorting is language/locale-dependent (viewer, not data)
- Sorting uses levels, such as:
  - Base character (A<B)
  - Diacritics (u<ü)
  - Case (g<G or G<g)
  - Others (space, articles,...)
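A small Python sketch of sort keys using the standard locale module; full multi-level Unicode collation as described above would need a UCA implementation such as ICU, and the availability of the 'de_DE.UTF-8' locale on the machine is an assumption:

```python
import locale

# Assumption: a German UTF-8 locale is installed on this system;
# otherwise setlocale() raises locale.Error.
locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')

words = ['Zebra', 'Äpfel', 'Apfel', 'über']
# strxfrm() maps each string to a locale-dependent sort key;
# comparing the keys yields the locale's ordering.
print(sorted(words, key=locale.strxfrm))
```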
Casing
- Case is only relevant for Latin, Greek, Cyrillic, Armenian, and ancient
Georgian
- Unicode knows three cases: lowercase, titlecase, and uppercase
- Case may be language/locale-dependent
(e.g. English: i↔I; Turkish: i↔İ, ı↔I)
- Case may not be character-to-character
(e.g. ß↔SS)
- Case may be context-dependent
e.g. σ↔Σ, but at the end of a word: ς↔Σ
- Case may depend on typographic tradition
(accents on French uppercase letters,...)
Case Equivalence/Mapping
- Case Folding:
  - Aggressively maps case-related strings together
  - To lower case
  - Suited e.g. for search
- Default Case Mapping:
  - Not as aggressive as case folding
  - Goes both ways
  - Use if no context information available
  - Use after context-specific mappings
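Python's built-in case operations illustrate the difference; they implement the default (untailored) Unicode mappings, so locale-specific tailoring such as Turkish dotless i is not covered:

```python
# Case folding: aggressive, one-way, suited for search.
print('MASSE'.casefold() == 'maße'.casefold())   # True: ß folds to 'ss'

# Default case mapping: not character-to-character (ß ↔ SS).
print('ß'.upper())                               # 'SS'

# Context-dependent mapping: capital sigma lowercases to final sigma
# at the end of a word (Python 3.3+ applies the Final_Sigma rule).
print('ΟΔΥΣΣΕΥΣ'.lower())                        # ends in final ς
```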
Other Equivalences/Mappings
- Numerical equivalence
- Character similarities for spoofing detection
Canonical Equivalence
From Unicode Standard Annex (UAX) #15, Section 1.1:
Canonical equivalence is a fundamental equivalency between
characters or sequences of characters that represent the same abstract
character, and when correctly displayed should always have the same
visual appearance and behavior.
Why Canonical Equivalence
- Order of combining characters
R + ̣ + ̄ vs. R + ̄ + ̣
- Precomposed vs. Decomposed
Ṝ vs. Ṛ + ̄ vs. R + ̣ + ̄
- Singletons
Å vs. Å, 樂 vs. 樂 vs. 樂
- Hangul syllables vs. Jamos
가 vs. ᄀ + ᅡ
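These equivalences can be checked with Python's unicodedata module: all canonically equivalent variants become binary-identical after normalization:

```python
import unicodedata

# Ṝ precomposed, Ṛ + macron, and R + dot below + macron are all
# canonically equivalent: one normalized form for all three.
variants = ['\u1E5C', '\u1E5A\u0304', 'R\u0323\u0304']
print(len({unicodedata.normalize('NFD', v) for v in variants}))  # 1

# Singleton: U+212B ANGSTROM SIGN normalizes to U+00C5 Å.
print(unicodedata.normalize('NFC', '\u212B') == '\u00C5')        # True
```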
Combining Characters
- Many scripts use base characters and diacritics
- Diacritics and similar characters are called Combining Characters
- Often, all combinations may appear (e.g. Arabic, Hebrew)
- Often, combinations limited per language,
but open-ended for the script (Latin)
- Separate encoding seems advantageous
  (was the only one in Unicode 1.0)
- Order meaningful if e.g. all diacritics above
- Order arbitrary if e.g. above and below
- ⇒ Canonical Ordering
Canonical Combining Class
- A character property, integer between 0 and 254
- All base characters have ccc=0
- Only combining marks (but not all of them) have ccc≠0
- Ccc is the same for combining characters that go on the same side of the base character
  (therefore, their relative order is significant and must be maintained)
- Data: UnicodeData file (4th field, 0 when no entry)
  (see also DerivedCombiningClass.txt)
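In Python, unicodedata.combining() exposes this property directly; here for the base letter and the diacritics used in the example on the next slide:

```python
import unicodedata

# ccc of a base letter and of four combining marks:
for ch in ['s', '\u0327', '\u0324', '\u0302', '\u030F']:
    print('U+%04X ccc=%3d' % (ord(ch), unicodedata.combining(ch)))
# s: 0, cedilla: 202, diaeresis below: 220,
# circumflex: 230, double grave: 230
```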
Canonical Ordering
- For any two consecutive characters ab, reorder as ba if ccc(a) > ccc(b) > 0; repeat until no more such reorderings are possible
- Very similar to bubblesort (but not the same!)
- Local operation: most characters have ccc=0, and don't move
- Example (ş̤̂̏, ccc in parentheses):

original | s | ̂ (230) | ̧ (202) | ̏ (230) | ̤ (220)
after first step | s | ̧ (202) | ̂ (230) | ̏ (230) | ̤ (220)
after second step | s | ̧ (202) | ̂ (230) | ̤ (220) | ̏ (230)
final | s | ̧ (202) | ̤ (220) | ̂ (230) | ̏ (230)
4 diacritics, 24 permutations, of which 12 are equivalent
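A naive Python sketch of this reordering (unoptimized; real implementations use a stable sort over non-starters), checked against the library's NFD on the example above:

```python
import unicodedata

def canonical_order(s):
    # Exchange adjacent characters a, b whenever ccc(a) > ccc(b) > 0,
    # until no such pair remains (bubble-sort-like, stable for equal ccc).
    chars = list(s)
    changed = True
    while changed:
        changed = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                changed = True
    return ''.join(chars)

s = 's\u0302\u0327\u030F\u0324'   # the example from the table above
assert canonical_order(s) == unicodedata.normalize('NFD', s)
```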
Precomposed Characters
- Legacy encodings had characters including diacritics
- Limited transcoding technology
- Limited display technology
- European national pride and voting power
- Unicode - ISO 10646 merger
- Precomposed characters in Unicode 1.1
- Equivalences between precomposed characters and combining sequences
defined
- ⇒ Normalization Form D
Hangul Syllables
- Korean is written in square syllables (Hangul, 한글)
- Syllables consist of
- Consonant(s) + vowel(s) (가)
- Consonant(s) + vowel(s) + consonant(s) (민)
- Individual pieces (Conjoining Jamo) are also encoded:
- Leading consonant(s) (ᄍ)
- Vowel(s) (ᅱ)
- Trailing consonant(s) (ᆹ)
- For all L, V, and T in modern use, every LV and LVT syllable is encoded
(over 10'000)
- Syllables with historical L (ᅑ), V (ᆑ), and T (ᇷ) must use Jamo
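Because the LV/LVT syllables are arranged arithmetically, decomposition is purely computational; a Python sketch following the constants and formulas in the Unicode Standard:

```python
# Algorithmic Hangul syllable decomposition (Unicode Standard, chapter 3).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588
S_COUNT = 19 * N_COUNT               # 11172 encoded LV/LVT syllables

def decompose_hangul(ch):
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch                    # not a precomposed Hangul syllable
    l = chr(L_BASE + s_index // N_COUNT)
    v = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    t = chr(T_BASE + t_index) if t_index else ''
    return l + v + t

print(decompose_hangul('가'))   # leading jamo + vowel jamo
print(decompose_hangul('민'))   # leading + vowel + trailing jamo
```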
Normalization Form D (NFD)
(D stands for Decomposed)
Apply Canonical Decomposition, consisting of:
- Recursive replacement of characters by their canonical decomposition mappings
  (plus algorithmic Hangul syllable decomposition)
- Canonical Reordering
Advantages of NFD:
- Flat and straightforward
- Fast operations
Disadvantages of NFD:
- Most data is not in NFD
- Difficult to force everybody to use it
Usage Layers
- Electric current, light, electromagnetic waves
- TCP/IP: Byte streams
- DNS/HTTP/SMTP/FTP: Bytes or characters
- User application: Meaningful characters
Different Mindsets
- Multilingual Editor:
  - Character boundaries, complex display
  - Preferred internal normalization
  - Normalization not a major burden
- Application Protocols (e.g. IETF):
  - Textual content just transported as bytes + charset info
  - Identifiers ASCII only, comparison bytewise or ASCII-case-insensitive
- Internationalized Identifiers:
  - E.g. XML element names (W3C)
  - Byte or codepoint comparison
  - Normalization close to actual practice desired
- ⇒ Normalization Form C
Normalization Form C (NFC)
(C stands for Composed)
After Canonical Decomposition (see NFD), apply:
- Canonical Recomposition (see next slide)
Advantages of NFC:
- Very close to usage on Web
(design goal, in some ways actually too close)
- More compact
Disadvantages of NFC:
- More complicated and somewhat slower (recomposition with exclusions)
Canonical Recomposition
- Works repeatedly pairwise
- Starts with base character and first combining character
- Only look at first character of each combining class
(to maintain canonical equivalence)
- When combination exists, combine and restart
- Exclude Composition Exclusions
- Data from code charts (decompositions marked with ≡) or from the UnicodeData file (6th field, when it contains no <tag>)
- Hangul syllable composition (algorithmic)
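A quick Python illustration of recomposition, and of one entry from the exclusion list discussed on the next slide:

```python
import unicodedata

# Decomposed sequence recombines under NFC:
print(unicodedata.normalize('NFC', 'e\u0301'))    # 'é' (U+00E9)

# No precomposed character exists, so the sequence stays decomposed:
print(unicodedata.normalize('NFC', 'q\u0302'))    # 'q' + U+0302

# Composition exclusion: क + ़ does NOT recompose to U+0958 (क़):
print('\u0958' in unicodedata.normalize('NFC', '\u0915\u093C'))  # False
```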
Composition Exclusions
- Singletons (nothing to recombine)
- Script-specific (precomposed rarely used):
  - Indic (क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਖ਼ਗ਼ਜ਼ੜਫ਼ଡ଼ଢ଼)
  - Tibetan
  - Hebrew (שׁשׂשּׁשּׂאַאָאּבּגּדּהּוּ...)
- Non-starter decompositions (cases with oddball ccc)
- Post Composition Exclusions
- Data from CompositionExclusions.txt
Stability
- Normalized text should stay so in future Unicode versions
- Okay for NFD
- Needs some work for NFC
- Problem is new precomposed characters
old: q + ̂, new: q̂ (precomposed)
- New (post composition) precomposed characters:
- Need to be excluded from composition (NFC)
(unless both parts are new)
- Discourages encoding in the first place
- Careful stability guarantees
- Effective stop for encoding precomposed characters
- Strengthens compromise between Unicode and ISO
- Named character sequences
Normalization Corrigenda
(4 out of 9 for all of Unicode)
NFC/NFD Variants
- Macintosh file system: NFD (roughly: Hangul NFC, rest NFD)
- Traditional Hangul (improve display)
- Compatibility Han Ideographs (use variant selectors)
Kompatibility Equivalence
From Unicode Standard Annex (UAX) #15, Section 1.1:
Compatibility equivalence is a weaker equivalence between characters
or sequences of characters that represent the same abstract character, but
may have a different visual appearance or behavior.
- May not preserve semantics: 2³ = 8, but 2³ ≈ 23
- Is not necessarily very consistent (e.g. ④≈4, but ➃≁4)
- Use different characters when semantically different (e.g. Mathematical
notation)
- Use markup/styling for stylistic differences (e.g.
emphasis)
A <tag> indicates and classifies the kompatibility decomposition in the UnicodeData file:
- <noBreak>: Difference in breaking properties only
- <super>: Superscripts
- <sub>: Subscripts
- <fraction>: Fractions
- <circle>: Circled letters
- <square>: Square blocks
- <font>: HEBREW LETTER WIDE ...; MATHEMATICAL .../ARABIC MATHEMATICAL ...
- <wide>: Fullwidth (double width) variants
- <narrow>: Halfwidth variants
- <small>: Small variants
- <vertical>: Vertical variants
- <isolated>, <final>, <initial>, <medial>: Arabic contextual variants and ligatures
- <compat>: General compatibility (e.g. spacing/nonspacing), ligatures, double characters (e.g. double prime), roman numerals, script variants (long s, Greek theta), space width variants (EM space, ...), parenthesized, with full stop/comma, IDEOGRAPHIC TELEGRAPH ...; (CJK/KANGXI) radicals, HANGUL LETTER variants
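In Python, unicodedata.decomposition() returns this raw decomposition field, including the tag (an empty string means no decomposition mapping):

```python
import unicodedata

# Raw decomposition mappings; the last character (Å) has a
# canonical decomposition, hence no <tag>.
for ch in '⁴④ﬁ４Å':
    print('U+%04X %s' % (ord(ch), unicodedata.decomposition(ch) or '(none)'))
```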
Kompatibility Decomposition
- Kompatibility decompositions as such proceed in one go (e.g. FDFA ≈ 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648 0633 0644 0645)
- However, kompatibility and canonical decomposition need to be applied repeatedly:
  - First compatibility decomposition, then canonical decomposition:
    U+01C4 LATIN CAPITAL LETTER DZ WITH CARON
    ≈ <compat> 0044 017D, then ≡ 0044 005A 030C
  - First canonical decomposition, then compatibility decomposition:
    U+0385 GREEK DIALYTIKA TONOS
    ≡ 00A8 0301, then ≈ <compat> 0020 0308 0301
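The two examples above, verified with Python's unicodedata (codepoints shown as hex):

```python
import unicodedata

# U+01C4: kompatibility step (D + Ž), then canonical step (Ž ≡ Z + caron).
print(['%04X' % ord(c) for c in unicodedata.normalize('NFKD', '\u01C4')])
# ['0044', '005A', '030C']

# U+0385: canonical step (¨ + ´), then kompatibility step (¨ ≈ space + ̈).
print(['%04X' % ord(c) for c in unicodedata.normalize('NFKD', '\u0385')])
# ['0020', '0308', '0301']
```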
Normalization Form KD (NFKD)
- Kompatibility Decomposition
- Canonical Reordering
Normalization Form KC (NFKC)
- Kompatibility Decomposition
- Canonical Reordering
- Canonical Recomposition
Take Care with Normalization Forms
- Normalization forms interact with case conversion
- Normalization forms interact with string operations
e.g. concatenation
- Normalization forms interact with markup
  e.g. ≮ ≡ < + ̸
- Normalization forms interact with escaping
- Normalization interacts with Unicode Versions
(but greatest care has been taken to limit this)
Similar concerns may apply to other kinds of mappings
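The concatenation pitfall, illustrated in Python: two strings that are each in NFC whose concatenation is not:

```python
import unicodedata

def is_nfc(s):
    return unicodedata.normalize('NFC', s) == s

a, b = 'e', '\u0301'            # 'e' and a lone combining acute accent
print(is_nfc(a), is_nfc(b))     # True True: each piece is in NFC
print(is_nfc(a + b))            # False: concatenation is no longer NFC
print(unicodedata.normalize('NFC', a + b))  # 'é' (U+00E9)
```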
Case Study: From IDNA2003 to PRECIS
- IETF need for 'normalized' identifiers
- IDNA 2003:
  - Case folding
  - NFKC
  - Wide repertoire (what's not forbidden is allowed)
  - Based on Unicode 3.2
- IDNA 2008:
  - Mappings (incl. case) outside spec
  - NFC
  - Contextual rules
  - Narrow repertoire (what's not allowed is forbidden)
  - Based on character properties
- PRECIS (framework):
  - Allowed characters based on character properties
  - Width mapping (<wide>/<narrow>)
  - Additional mappings
  - Case mapping
  - Normalization (NFC or NFKC)
Case Study: Windows File System vs. IDNs
- Both are case-insensitive
- Windows File System keeps original casing for display
(needs data in two forms for efficiency)
- IDNs don't keep original casing
(on the Web, there's no IBM, only ibm)
Case Study: Normalization Break-in
(actual case!)
- Username normalization was different for
  - Account creation
  - Login
  - Actual access
- Create an account whose name is compatibility-equivalent to the targeted one
- Log in and take over
Strategy
- To check equivalence, use (one of) the corresponding mappings and compare
for equality
- Mapping can appear at system boundary or on actual equivalence check
- System boundary may vary
- All of the Internet/WWW
- Specific servers, clients
- Some kinds/types of data
- Check for consistency between system components
- Check for consistency between software versions
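A minimal Python sketch of this strategy; the boundary function and its particular combination of case folding with NFC are illustrative choices, not a prescription:

```python
import unicodedata

def to_internal_form(identifier):
    # Hypothetical boundary mapping: case folding plus NFC.
    # (Which mappings to pick depends on the system's requirements.)
    return unicodedata.normalize('NFC', identifier.casefold())

# After mapping at the boundary, plain equality suffices inside:
stored = to_internal_form('Åström')               # precomposed input
query  = to_internal_form('A\u030Astro\u0308m')   # decomposed input
print(stored == query)                            # True
```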
More about Normalization Implementation
Talk tomorrow (Tuesday, 22 October), Track 3, Session 4 (14:30 – 15:20):
Implementing Normalization in Pure Ruby - the Fast and Easy Way
Questions and Answers
Questions may be unnormalized, but answers will be normalized!
Acknowledgments
Mark Davis for many years of collaboration (and some disagreements) on normalization, and for proposing a wider approach to the topic of normalization.
The IME Pad for facilitating character input.
Amaya and Web technology for slide editing and display.