Tutorial
Equivalence, Mapping, and Normalization
http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut
IUC 37, Santa Clara, CA, U.S.A.,
21 October 2013
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2013 Martin J. Dürst, Aoyama Gakuin University
About the Slides
The most up-to-date version of the slides, as well as additional materials,
is available at http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut.
Audience
- Basic knowledge about Unicode (e.g. you attended some tutorials
today)
- Responsible for Unicode data or programs dealing with Unicode data
Main Points
- Unicode Normalization in Context
- Some technical details
- Some history and politics
- Some strategy
Speaker Normalization
Tell me the Difference
What's the difference between the following characters:
৪ and 8 and ∞
Bengali 4, 8, and infinity
க and ௧
Tamil ka and 1
骨 and 骨
meaning 'bone', same codepoint, different language preference/font
樂, 樂 and 樂
different Korean pronunciation, different codepoints
A Big List
Linguistic and Semantic Similarities | color, colour, Farbe, 色
Accidental Graphical Similarities | T, ⊤, ┳, ᅮ
Script Similarities | K, Κ, К; ヘ、へ
Cross-script 'Equivalences' | な, ナ
Numeric 'Equivalence' | 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including a few Han ideographs)
Case 'Equivalence' | g, G; dž, Dž, DŽ; Σςσ
Compatibility Equivalence | ⁴, ₄, ④, ⒋, 4
Canonical Equivalence | Å, Å; Ṝ, Ṛ + ̄, R + ̣ + ̄
Unicode Encoding Forms | UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71
(Legacy) Encodings | ISO-8859-X, Shift_JIS,...
Why all these Similarities and Variants?
- Historical cultural evolution
  - Scripts and characters borrowed
  - Changed by writing tools and customs
  - From stylistic to orthographic distinctions
  - Independent quest for simple graphics
- Encoding realities
  - Encoding structure choices
  - Round-tripping requirements
  - Encoding compromises
  - Encoding accidents
Applications
- Identification:
  - Domain Names/Filenames/Usernames/Passwords
  - Element names/attribute names/class names in XML/HTML/CSS
  - Variable names in programming languages
- Searching
- Sorting
Equivalences/Mappings Defined by Unicode
- Canonical Equivalence
  - Normalization Form D (NFD)
  - Normalization Form C (NFC)
- Kompatibility Equivalence
  - Normalization Form KD (NFKD)
  - Normalization Form KC (NFKC)
- Case equivalence: Case folding
- Sorting: Sort keys
- Numeric Values
- Accidental graphical similarities: confusability skeletons
Equivalence vs. Mapping
- Equivalence is defined between pairs of strings
Two strings A and B are equivalent (or not)
- Mapping is defined from one string to another
String C maps to string D
- Equivalence is often defined using mapping
Example: for equivalence FOOeq and mapping FOOmap,
FOOmap(A) = FOOmap(B) ⇔ FOOeq(A, B)
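A minimal sketch of this pattern in Python, using NFC normalization (via the standard unicodedata module) as the mapping; the names foo_map and foo_eq are illustrative, not part of any API:

```python
import unicodedata

def foo_map(s):
    # Illustrative mapping: normalization to NFC.
    return unicodedata.normalize('NFC', s)

def foo_eq(a, b):
    # The equivalence induced by the mapping:
    # A and B are equivalent iff they map to the same string.
    return foo_map(a) == foo_map(b)

print(foo_eq('\u00C5', 'A\u030A'))   # True: Å vs. A + combining ring above
```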
Sorting
- Sorting isn't just about equivalence (=), but about ordering (≤)
- Mapping to sort keys serves the same purpose:
  String A sorts before B if sort_key(A) ≤ sort_key(B)
- Sorting is language/locale-dependent (viewer, not data)
- Sorting uses levels, such as:
  - Base character (A<B)
  - Diacritics (u<ü)
  - Case (g<G or G<g)
  - Others (space, articles,...)
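A small Python sketch of sort keys using the standard locale module; full multi-level Unicode collation as described above would need a UCA implementation such as ICU, and the availability of the 'de_DE.UTF-8' locale on the machine is an assumption:

```python
import locale

# Assumption: a German UTF-8 locale is installed on this system;
# otherwise setlocale() raises locale.Error.
locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')

words = ['Zebra', 'Äpfel', 'Apfel', 'über']
# strxfrm() maps each string to a locale-dependent sort key;
# comparing the keys yields the locale's ordering.
print(sorted(words, key=locale.strxfrm))
```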
Casing
- Case is only relevant for Latin, Greek, Cyrillic, Armenian, and ancient
Georgian
- Unicode knows three cases: lowercase, titlecase, and uppercase
- Case may be language/locale-dependent
(e.g. English: i↔I; Turkish: i↔İ, ı↔I)
- Case may not be character-to-character
(e.g. ß↔SS)
- Case may be context-dependent
e.g. σ↔Σ, but at the end of a word: ς↔Σ
- Case may depend on typographic tradition
(accents on French uppercase letters,...)
Case Equivalence/Mapping
- Case Folding:
  - Aggressively maps case-related strings together
  - To lower case
  - Suited e.g. for search
- Default Case Mapping:
  - Not as aggressive as case folding
  - Goes both ways
  - Use if no context information available
  - Use after context-specific mappings
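Python's built-in case operations illustrate the difference; they implement the default (untailored) Unicode mappings, so locale-specific tailoring such as Turkish dotless i is not covered:

```python
# Case folding: aggressive, one-way, suited for search.
print('MASSE'.casefold() == 'maße'.casefold())   # True: ß folds to 'ss'

# Default case mapping: not character-to-character (ß ↔ SS).
print('ß'.upper())                               # 'SS'

# Context-dependent mapping: capital sigma lowercases to final sigma
# at the end of a word (Python 3.3+ applies the Final_Sigma rule).
print('ΟΔΥΣΣΕΥΣ'.lower())                        # ends in final ς
```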
Other Equivalences/Mappings
- Numerical equivalence
- Character similarities for spoofing detection
Canonical Equivalence
From Unicode Standard Annex (UAX) #15, Section 1.1:
Canonical equivalence is a fundamental equivalency between
characters or sequences of characters that represent the same abstract
character, and when correctly displayed should always have the same
visual appearance and behavior.
Why Canonical Equivalence
- Order of combining characters
R + ̣ + ̄ vs. R + ̄ + ̣
- Precomposed vs. Decomposed
Ṝ vs. Ṛ + ̄ vs. R + ̣ + ̄
- Singletons
Å vs. Å, 樂 vs. 樂 vs. 樂
- Hangul syllables vs. Jamos
가 vs. ᄀ + ᅡ
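These equivalences can be checked with Python's unicodedata module: all canonically equivalent variants become binary-identical after normalization:

```python
import unicodedata

# Ṝ precomposed, Ṛ + macron, and R + dot below + macron are all
# canonically equivalent: one normalized form for all three.
variants = ['\u1E5C', '\u1E5A\u0304', 'R\u0323\u0304']
print(len({unicodedata.normalize('NFD', v) for v in variants}))  # 1

# Singleton: U+212B ANGSTROM SIGN normalizes to U+00C5 Å.
print(unicodedata.normalize('NFC', '\u212B') == '\u00C5')        # True
```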
Combining Characters
- Many scripts use base characters and diacritics
- Diacritics and similar characters are called Combining Characters
- Often, all combinations may appear (e.g. Arabic, Hebrew)
- Often, combinations limited per language,
but open-ended for the script (Latin)
- Separate encoding seems advantageous
  (was the only one in Unicode 1.0)
- Order meaningful if e.g. all diacritics above
- Order arbitrary if e.g. above and below
- ⇒ Canonical Ordering
Canonical Combining Class
- A character property, integer between 0 and 254
- All base characters have ccc=0
- Only combining marks (but not all of them) have ccc≠0
- Ccc is the same for combining characters that go on the same side of the base character
  (therefore, their relative order is significant and must be maintained)
- Data: UnicodeData file (4th field, 0 when no entry)
  (see also DerivedCombiningClass.txt)
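In Python, unicodedata.combining() exposes this property directly; here for the base letter and the diacritics used in the example on the next slide:

```python
import unicodedata

# ccc of a base letter and of four combining marks:
for ch in ['s', '\u0327', '\u0324', '\u0302', '\u030F']:
    print('U+%04X ccc=%3d' % (ord(ch), unicodedata.combining(ch)))
# s: 0, cedilla: 202, diaeresis below: 220,
# circumflex: 230, double grave: 230
```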
Canonical Ordering
- For any two consecutive characters ab, reorder as ba if ccc(a) > ccc(b) > 0; repeat until no more such reorderings are possible
- Very similar to bubblesort (but not the same!)
- Local operation: most characters have ccc=0, and don't move
- Example (ş̤̂̏, ccc in parentheses):

original | s | ̂ (230) | ̧ (202) | ̏ (230) | ̤ (220)
after first step | s | ̧ (202) | ̂ (230) | ̏ (230) | ̤ (220)
after second step | s | ̧ (202) | ̂ (230) | ̤ (220) | ̏ (230)
final | s | ̧ (202) | ̤ (220) | ̂ (230) | ̏ (230)
4 diacritics, 24 permutations, of which 12 are equivalent
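A naive Python sketch of this reordering (unoptimized; real implementations use a stable sort over non-starters), checked against the library's NFD on the example above:

```python
import unicodedata

def canonical_order(s):
    # Exchange adjacent characters a, b whenever ccc(a) > ccc(b) > 0,
    # until no such pair remains (bubble-sort-like, stable for equal ccc).
    chars = list(s)
    changed = True
    while changed:
        changed = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                changed = True
    return ''.join(chars)

s = 's\u0302\u0327\u030F\u0324'   # the example from the table above
assert canonical_order(s) == unicodedata.normalize('NFD', s)
```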
Precomposed Characters
- Legacy encodings had characters including diacritics
- Limited transcoding technology
- Limited display technology
- European national pride and voting power
- Unicode - ISO 10646 merger
- Precomposed characters in Unicode 1.1
- Equivalences between precomposed characters and combining sequences
defined
- ⇒ Normalization Form D
Hangul Syllables
- Korean is written in square syllables (Hangul, 한글)
- Syllables consist of
- Consonant(s) + vowel(s) (가)
- Consonant(s) + vowel(s) + consonant(s) (민)
- Individual pieces (Conjoining Jamo) are also encoded:
- Leading consonant(s) (ᄍ)
- Vowel(s) (ᅱ)
- Trailing consonant(s) (ᆹ)
- For all L, V, and T in modern use, every LV and LVT syllable is encoded
(over 10'000)
- Syllables with historical L (ᅑ), V (ᆑ), and T (ᇷ) must use Jamo
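Because the LV/LVT syllables are arranged arithmetically, decomposition is purely computational; a Python sketch following the constants and formulas in the Unicode Standard:

```python
# Algorithmic Hangul syllable decomposition (Unicode Standard, chapter 3).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588
S_COUNT = 19 * N_COUNT               # 11172 encoded LV/LVT syllables

def decompose_hangul(ch):
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch                    # not a precomposed Hangul syllable
    l = chr(L_BASE + s_index // N_COUNT)
    v = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    t = chr(T_BASE + t_index) if t_index else ''
    return l + v + t

print(decompose_hangul('가'))   # leading jamo + vowel jamo
print(decompose_hangul('민'))   # leading + vowel + trailing jamo
```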
Normalization Form D (NFD)
(D stands for Decomposed)
Apply Canonical Decomposition, consisting of:
- Recursive replacement of characters by their canonical decomposition mappings
  (plus algorithmic Hangul syllable decomposition)
- Canonical Reordering
Advantages of NFD:
- Flat and straightforward
- Fast operations
Disadvantages of NFD:
- Most data is not in NFD
- Difficult to force everybody to use it
Usage Layers
- Electric current, light, electromagnetic waves
- TCP/IP: Byte streams
- DNS/HTTP/SMTP/FTP: Bytes or characters
- User application: Meaningful characters
Different Mindsets
- Multilingual Editor:
  - Character boundaries, complex display
  - Preferred internal normalization
  - Normalization not a major burden
- Application Protocols (e.g. IETF):
  - Textual content just transported as bytes + charset info
  - Identifiers ASCII only, comparison bytewise or ASCII-case-insensitive
- Internationalized Identifiers:
  - E.g. XML element names (W3C)
  - Byte or codepoint comparison
  - Normalization close to actual practice desired
- ⇒ Normalization Form C
Normalization Form C (NFC)
(C stands for Composed)
After Canonical Decomposition (see NFD), apply:
- Canonical Recomposition (see next slide)
Advantages of NFC:
- Very close to usage on Web
(design goal, in some ways actually too close)
- More compact
Disadvantages of NFC:
- More complicated and somewhat slower (recomposition with exclusions)
Canonical Recomposition
- Works repeatedly pairwise
- Starts with base character and first combining character
- Only look at first character of each combining class
(to maintain canonical equivalence)
- When combination exists, combine and restart
- Exclude Composition Exclusions
- Data from code charts (decompositions marked with ≡) or from the UnicodeData file (6th field, when it contains no <tag>)
- Hangul syllable composition (algorithmic)
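A quick Python illustration of recomposition, and of one entry from the exclusion list discussed on the next slide:

```python
import unicodedata

# Decomposed sequence recombines under NFC:
print(unicodedata.normalize('NFC', 'e\u0301'))    # 'é' (U+00E9)

# No precomposed character exists, so the sequence stays decomposed:
print(unicodedata.normalize('NFC', 'q\u0302'))    # 'q' + U+0302

# Composition exclusion: क + ़ does NOT recompose to U+0958 (क़):
print('\u0958' in unicodedata.normalize('NFC', '\u0915\u093C'))  # False
```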
Composition Exclusions
- Singletons (nothing to recombine)
- Script-specific (precomposed rarely used):
  - Indic (क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਖ਼ਗ਼ਜ਼ੜਫ਼ଡ଼ଢ଼)
  - Tibetan
  - Hebrew (שׁשׂשּׁשּׂאַאָאּבּגּדּהּוּ...)
- Non-starter decompositions (cases with oddball ccc)
- Post Composition Exclusions
- Data from CompositionExclusions.txt
Stability
- Normalized text should stay so in future Unicode versions
- Okay for NFD
- Needs some work for NFC
- Problem is new precomposed characters
old: q + ̂, new: q̂ (precomposed)
- New (post composition) precomposed characters:
- Need to be excluded from composition (NFC)
(unless both parts are new)
- Discourages encoding in the first place
- Careful stability guarantees
- Effective stop for encoding precomposed characters
- Strengthens compromise between Unicode and ISO
- Named character sequences
Normalization Corrigenda
(4 out of 9 for all of Unicode)
NFC/NFD Variants
- Macintosh file system: NFD (roughly: Hangul NFC, rest NFD)
- Traditional Hangul (improve display)
- Compatibility Han Ideographs (use variant selectors)
Kompatibility Equivalence
From Unicode Standard Annex (UAX) #15, Section 1.1:
Compatibility equivalence is a weaker equivalence between characters
or sequences of characters that represent the same abstract character, but
may have a different visual appearance or behavior.
- May not preserve semantics: 2³ = 8, but 2³ ≈ 23
- Is not necessarily very consistent (e.g. ④≈4, but ➃≁4)
- Use different characters when semantically different (e.g. Mathematical
notation)
- Use markup/styling for stylistic differences (e.g.
emphasis)
A <tag> indicates and classifies the kompatibility decomposition in the UnicodeData file:
- <noBreak>: Difference in breaking properties only
- <super>: Superscripts
- <sub>: Subscripts
- <fraction>: Fractions
- <circle>: Circled letters
- <square>: Square blocks
- <font>: HEBREW LETTER WIDE ...; MATHEMATICAL .../ARABIC MATHEMATICAL ...
- <wide>: Fullwidth (double width) variants
- <narrow>: Halfwidth variants
- <small>: Small variants
- <vertical>: Vertical variants
- <isolated>, <final>, <initial>, <medial>: Arabic contextual variants and ligatures
- <compat>: General compatibility (e.g. spacing/nonspacing), ligatures, double characters (e.g. double prime), roman numerals, script variants (long s, Greek theta), space width variants (EM space, ...), parenthesized, with full stop/comma, IDEOGRAPHIC TELEGRAPH ...; (CJK/KANGXI) radicals, HANGUL LETTER variants
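In Python, unicodedata.decomposition() returns this raw decomposition field, including the tag (an empty string means no decomposition mapping):

```python
import unicodedata

# Raw decomposition mappings; the last character (Å) has a
# canonical decomposition, hence no <tag>.
for ch in '⁴④ﬁ４Å':
    print('U+%04X %s' % (ord(ch), unicodedata.decomposition(ch) or '(none)'))
```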
Kompatibility Decomposition
- Kompatibility decompositions as such proceed in one go (e.g. FDFA ≈ 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648 0633 0644 0645)
- However, kompatibility and canonical decomposition need to be applied repeatedly:
  - First compatibility decomposition, then canonical decomposition:
    U+01C4 LATIN CAPITAL LETTER DZ WITH CARON
    ≈ <compat> 0044 017D, then ≡ 0044 005A 030C
  - First canonical decomposition, then compatibility decomposition:
    U+0385 GREEK DIALYTIKA TONOS
    ≡ 00A8 0301, then ≈ <compat> 0020 0308 0301
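The two examples above, verified with Python's unicodedata (codepoints shown as hex):

```python
import unicodedata

# U+01C4: kompatibility step (D + Ž), then canonical step (Ž ≡ Z + caron).
print(['%04X' % ord(c) for c in unicodedata.normalize('NFKD', '\u01C4')])
# ['0044', '005A', '030C']

# U+0385: canonical step (¨ + ´), then kompatibility step (¨ ≈ space + ̈).
print(['%04X' % ord(c) for c in unicodedata.normalize('NFKD', '\u0385')])
# ['0020', '0308', '0301']
```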
Normalization Form KD (NFKD)
- Kompatibility Decomposition
- Canonical Reordering
Normalization Form KC (NFKC)
- Kompatibility Decomposition
- Canonical Reordering
- Canonical Recomposition
Take Care with Normalization Forms
- Normalization forms interact with case conversion
- Normalization forms interact with string operations
e.g. concatenation
- Normalization forms interact with markup
  e.g. ≮ ≡ < + ̸
- Normalization forms interact with escaping
- Normalization interacts with Unicode Versions
(but greatest care has been taken to limit this)
Similar concerns may apply to other kinds of mappings
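The concatenation pitfall, illustrated in Python: two strings that are each in NFC whose concatenation is not:

```python
import unicodedata

def is_nfc(s):
    return unicodedata.normalize('NFC', s) == s

a, b = 'e', '\u0301'            # 'e' and a lone combining acute accent
print(is_nfc(a), is_nfc(b))     # True True: each piece is in NFC
print(is_nfc(a + b))            # False: concatenation is no longer NFC
print(unicodedata.normalize('NFC', a + b))  # 'é' (U+00E9)
```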
Case Study: From IDNA2003 to PRECIS
- IETF need for 'normalized' identifiers
- IDNA 2003:
  - Case folding
  - NFKC
  - Wide repertoire (what's not forbidden is allowed)
  - Based on Unicode 3.2
- IDNA 2008:
  - Mappings (incl. case) outside spec
  - NFC
  - Contextual rules
  - Narrow repertoire (what's not allowed is forbidden)
  - Based on character properties
- PRECIS (framework):
  - Allowed characters based on character properties
  - Width mapping (<wide>/<narrow>)
  - Additional mappings
  - Case mapping
  - Normalization (NFC or NFKC)
Case Study: Windows File System vs. IDNs
- Both are case-insensitive
- Windows File System keeps original casing for display
(needs data in two forms for efficiency)
- IDNs don't keep original casing
(on the Web, there's no IBM, only ibm)
Case Study: Normalization Break-in
(actual case!)
- Username normalization was different for
  - Account creation
  - Login
  - Actual access
- Create an account whose name is compatibility-equivalent to the targeted one
- Log in and take over
Strategy
- To check equivalence, use (one of) the corresponding mappings and compare
for equality
- Mapping can appear at system boundary or on actual equivalence check
- System boundary may vary
- All of the Internet/WWW
- Specific servers, clients
- Some kinds/types of data
- Check for consistency between system components
- Check for consistency between software versions
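A minimal Python sketch of this strategy; the boundary function and its particular combination of case folding with NFC are illustrative choices, not a prescription:

```python
import unicodedata

def to_internal_form(identifier):
    # Hypothetical boundary mapping: case folding plus NFC.
    # (Which mappings to pick depends on the system's requirements.)
    return unicodedata.normalize('NFC', identifier.casefold())

# After mapping at the boundary, plain equality suffices inside:
stored = to_internal_form('Åström')               # precomposed input
query  = to_internal_form('A\u030Astro\u0308m')   # decomposed input
print(stored == query)                            # True
```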
More about Normalization Implementation
Talk tomorrow (Tuesday, 22 October), Track 3, Session 4 (14:30 – 15:20):
Implementing Normalization in Pure Ruby - the Fast and Easy Way
Questions and Answers
Questions may be unnormalized, but answers will be normalized!
Acknowledgments
Mark Davis for many years of collaboration (and some disagreements) on normalization, and for proposing a wider approach to the topic of normalization.
The IME Pad for facilitating character input.
Amaya and Web technology for slide editing and display.