Tutorial
Equivalence, Mapping, and Normalization
https://www.sw.it.aoyama.ac.jp/2020/pub/IUC44EMN
IUC44, Virtual Conference, 14 October 2020
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2013-20 Martin J. Dürst, Aoyama Gakuin University
About the Slides
The most up-to-date version of the slides is available at https://www.sw.it.aoyama.ac.jp/2020/pub/IUC44EMN
These slides are created in HTML, for projection with a very
old version (≤12.18) of Opera (please contact me if you cannot find a copy).
Texts in gray, like this one, are comments/notes which do not appear on the
slides. Please note that depending on the browser and OS you use, some of the
characters and character combinations may not display as intended, but e.g. as
empty boxes, question marks, or apart rather than composed. To check the
identity of a character, use e.g. Richard Ishida's Unicode Converter and the Unicode code charts.
Abstract
The multitude of characters available in Unicode means that
there are many ways in which characters or strings can be equivalent, similar,
or otherwise related. In this tutorial, you will learn about all these
relationships, in order to be able to better work with Unicode data and
programs handling Unicode data. The tutorial assumes that participants have a
basic understanding of the scope and breadth of Unicode, possibly from
attending tutorials earlier in the day.
Character relationships and similarities in Unicode range from
linguistic and semantic similarities at one end to the same character being
represented in different character encodings or Unicode encoding forms at the
other end. In the middle, numerical and case equivalences, compatibility and
canonical equivalences, graphic similarities, and many others can be found.
This sometimes bewildering wealth of characters, equivalences, and
relationships is due to the rich history of human writing as well as to the
realities of character encoding policies and decisions.
The tutorial will give some guidance to help users navigate
equivalences and differences for their use cases and applications. Each of
these many equivalences or relationships can or should be ignored in some
processing contexts, but may be crucial in others. Contexts may range from use
as identifiers (e.g. user ids and passwords, with security consequences) to
searching and sorting. For most of the equivalences, data is available in the
Unicode Standard and its associated data files, or is provided by other
standards such as IDNA and PRECIS. But the use of this data and the functions
provided by various libraries requires understanding of the background of the
equivalences.
When testing for equivalence of two strings, the general
strategy is to map or normalize both strings to a form that eliminates
accidental (in the given context) differences, and then compare the strings on
a binary level. The tutorial will not only look at officially defined
equivalences, but will also discuss variants that may be necessary in practice
to cover specialized needs. We will also discuss the relationships between
various classes of equivalences, necessary to avoid pitfalls when combining
them, and the stability of the equivalences over time and under various
operations such as string concatenation.
Outline
- Introduction
- Example: Case
- A Bit of Theory
- Various Equivalences
- Canonical Equivalence (NFD and NFC)
- Kompatibility Equivalence (NFKD and NFKC)
- Case Studies
- Questions
Introduction
Audience
- Basic knowledge about Unicode
(e.g. you attended some tutorials today)
- Responsible for Unicode data
or programs dealing with Unicode data
Main Points
- Lots of characters: Lots of similarities
- Why should you care?
- Equivalences and mappings:
- Casing
- Numbers
- Security confusables
- Sorting
- Canonical/Kompatibility Equivalence
- Normal forms
- Technical details
- History and politics
- Strategic advice
Speaker Normalization
Let's normalize the speaker, or in other words see how the
speaker was involved in normalization!
Tell me the Difference
What's the difference between the following characters:
৪ and 8 and ∞
Bengali 4, 8, and infinity
க௧
Tamil ka and Tamil 1
骨 and 骨
same codepoint, different language preference/glyph structure
樂, 樂 and 樂
different Korean pronunciation, different codepoints, should look the same
A Big List
Linguistic and Semantic Similarities | color, colour, Farbe, 色 | not discussed
Transliterations/Transcriptions | Putin, Путин, 푸틴, プーチン, 普京, پوتین, بوتين | not discussed
Accidental Graphical Similarities | T, ⊤, ┳, ᅮ | this talk's topic
Script Similarities | K, Κ, К; ヘ、へ | this talk's topic
Cross-script 'Equivalences' | な, ナ | this talk's topic
Numeric 'Equivalence' | 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including Han ideographs) | this talk's topic
Case 'Equivalence' | g, G; dž, Dž, DŽ; Σ, ς, σ | this talk's topic
Compatibility Equivalence | ⁴, ₄, ④, ⒋, 4 | this talk's topic
Canonical Equivalence | Å, Å; Ṝ, Ṛ + ̄, R + ̣ + ̄ | this talk's topic
Font Differences | A, A, A, A, A | not discussed
Unicode Encoding Forms | UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71 | not discussed
(Legacy) Encodings | ISO-8859-X, Shift_JIS,... | not discussed
Similarities and equivalences range from binary (at the bottom)
to semantic (at the top). In this tutorial, we assume a uniform Unicode
Encoding Form (e.g. UTF-8) and do not deal with linguistic and semantic
similarities, but discuss all the in-between layers.
Why all these Similarities and Variants?
- Historical cultural evolution
- Scripts and characters borrowed
- Changed by writing tools and customs
- From stylistic to orthographic distinctions
- Independent quest for simple graphics
- Encoding realities
- Encoding structure choices
- Round-tripping requirements
- Encoding compromises
- Encoding accidents
Why Does it Matter? (actual case!)
- Possible to hijack user account
- How:
- Create account with compatibility equivalent to the targeted one
(e.g. ᴮᴵᴳᴮᴵᴿᴰ)
- Ask for a password reset
- Log in and take over
- Why:
- Username normalization incomplete
- ᴮᴵᴳᴮᴵᴿᴰ → BIGBIRD
- BIGBIRD → bigbird
- Correct: ᴮᴵᴳᴮᴵᴿᴰ → bigbird
See also: https://github.com/reinderien/mimic
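The difference between the incomplete and the correct pipeline can be reproduced with Python's unicodedata module; a minimal sketch (the username 'bigbird' and the attack scenario are from the slide above, the code itself is illustrative):

```python
import unicodedata

spoofed = '\u1D2E\u1D35\u1D33\u1D2E\u1D35\u1D3F\u1D30'  # ᴮᴵᴳᴮᴵᴿᴰ (MODIFIER LETTER CAPITAL ...)

# Incomplete pipeline: case folding alone leaves the modifier letters untouched,
# so the spoofed name does not collide with the target ...
assert spoofed.casefold() != 'bigbird'

# ... but NFKC maps the modifier letters to plain ASCII letters, so the correct
# pipeline (kompatibility normalization, then case folding) exposes the collision.
assert unicodedata.normalize('NFKC', spoofed).casefold() == 'bigbird'
```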
Applications
- Identification:
- Domain Names/Filenames/Usernames/Passwords
- Identifiers in formats (JSON/XML/...) and programming languages
- Security implications!
- Searching
- Sorting
Equivalences/Mappings Defined by Unicode
- Canonical Equivalence (NFC/NFD)
- Kompatibility Equivalence (NFKC/NFKD)
- Case equivalence: Case folding/mapping
- Sorting: Sorting keys
- Numeric Values
- Accidental graphical similarities: confusability skeletons
- Unicode Identifier and Pattern Syntax (UTS #31)
It's Compatibility Equivalence officially, but
we will use Kompatibility Equivalence (as in NFKC/NFKD)
hereafter for better distinction with Canonical Equivalence.
Case
Something Familiar: Casing
A ↔ a
...
Z ↔ z
Is it actually that easy?
Special Casings
- Case is only relevant for Latin, Greek, Cyrillic, Armenian, Deseret, Old
Hungarian, Cherokee (?), and ancient Georgian
- Unicode knows three cases: lowercase, Titlecase, and UPPERCASE
- Case may be language/locale-dependent
(e.g. English: i↔I; Turkish: i↔İ, ı↔I)
- Case may depend on typographic tradition
(accents on French uppercase letters,...)
- Case may not be character-to-character
(e.g. ß↔SS)
- Case may be context-dependent
e.g. σ↔Σ, but at the end of a word: ς↔Σ
- Very vaguely related: Japanese カタカナ↔ひらがな
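Several of these special cases can be observed directly in Python, whose str methods implement the Unicode default case mappings (not the language-dependent ones); a sketch:

```python
# Case mapping is not always character-to-character:
assert 'ß'.upper() == 'SS'              # German sharp s uppercases to two letters

# Case mapping can be context-dependent:
assert 'ΣΑΣ'.lower() == '\u03c3\u03b1\u03c2'   # word-final sigma becomes ς, not σ

# Default mappings are not language-dependent:
assert 'i'.upper() == 'I'               # even for Turkish text, where İ would be expected
```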
Case Equivalence/Mapping
- Case Folding:
- Aggressively maps case-related strings together
- Target is lower case (T/Τ/Т but t/τ/т, see Wikipedia)
- Suited e.g. for search
- Default Case Mapping:
- Not as aggressive as case folding
- Goes both ways
- Use if no context information available
- Use after context-specific mappings
- Language-dependent Case Mappings
- Simple Case Mapping:
- Character-to-character only
- No change of string length
Data for Casing
A Bit of Theory
Equivalence vs. Mapping
- Equivalence is defined between pairs of strings
  Two strings HELLO and hello are equivalent (or not)
- Mapping is defined from one string to another
  String Good Bye maps to string good bye
- Equivalence is often defined using mapping
  Example: For the FOO equivalence, with equivalence FOOeq and mapping FOOmap,
  FOOmap(A) = FOOmap(B) ⇔ FOOeq(A, B)
Equivalence is often defined using a mapping. As an example, an
equivalence FOOeq may be defined with a mapping FOOmap. Two strings
A and B are FOOequivalent if (and only if) the FOOmapping
of A and the FOOmapping of B are equal. Put the other way
round, every mapping defines an equivalence.
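This definition pattern translates directly into code; a sketch using NFC as the mapping (the function name is illustrative):

```python
import unicodedata

def nfc_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent iff their NFC forms are binary-equal.
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

assert nfc_equivalent('\u00C5', 'A\u030A')   # Å precomposed vs. A + combining ring above
assert not nfc_equivalent('\u00C5', 'A')
```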
Security Conditions for Mappings
- Stable mapping
  - Disallowed characters may become allowed (but not the reverse)
  - Additional mappings may be added if they don't collapse existing data
- Store mapped form (but may want to keep display form, too)
- Make sure mapping is idempotent, or check for a fixpoint
Idempotence
A function f is idempotent if applying it once or several times gives the same result:
f(f(x)) = f(x) ⇔ f is idempotent
Caution: The combination of two individually idempotent functions may not be idempotent:
f(f(x)) = f(x), g(g(x)) = g(x) ↛ f(g(f(g(x)))) = f(g(x)),
i.e. f(g(x)) may not be idempotent
Examples:
- Canonical decomposition and Kompatibility decomposition
- Normalization and case mapping
Fixpoint
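If you cannot prove that a combined mapping is idempotent, you can iterate it to a fixpoint; a minimal sketch (the iteration limit and error handling are assumptions for illustration):

```python
def to_fixpoint(f, s, max_rounds=5):
    # Apply the mapping f repeatedly until the result no longer changes.
    for _ in range(max_rounds):
        t = f(s)
        if t == s:
            return s                  # fixpoint reached: f(s) == s
        s = t
    raise ValueError('mapping did not reach a fixpoint')

# A trivially idempotent mapping reaches its fixpoint after one extra round:
assert to_fixpoint(str.casefold, 'HELLO') == 'hello'
```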
Various Equivalences
Numerical Equivalence
- Columns 6 (decimal), 7 (digit), and 8 (fraction) of UnicodeData.txt
- (ASCII) hexadecimal digits in PropList.txt
- Other ways to represent numbers:
- Greek, Hebrew: Letters with specific values
- Numbers as spoken out
- Han ideographs
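Python's unicodedata module exposes these three UnicodeData.txt columns as decimal(), digit(), and numeric(); a sketch:

```python
import unicodedata

assert unicodedata.decimal('\u0667') == 7     # ٧ ARABIC-INDIC DIGIT SEVEN (column 6)
assert unicodedata.digit('\u2077') == 7       # ⁷ SUPERSCRIPT SEVEN (column 7)
assert unicodedata.numeric('\u00BC') == 0.25  # ¼ VULGAR FRACTION ONE QUARTER (column 8)

# Superscripts have a digit value but no decimal value:
try:
    unicodedata.decimal('\u2077')
    assert False, 'expected ValueError'
except ValueError:
    pass
```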
Security Confusables
Multilevel complexity because not only characters, but also
scripts are involved. Not necessarily equivalence relations, because a
may be confusable with b, and b with c, but not a with c (confusability is not transitive).
Sorting
- Sorting is not only about equivalence (=), but about ordering (≤)
- Mechanism: Mapping to sort keys
String A sorts before B if sort_key(A) ≤ sort_key(B)
- Sorting is language/locale-dependent (viewer, not data!)
- Sorting uses levels, such as:
- Base character (A<B)
- Diacritics (u<ü)
- Case (g<G or G<g)
- Others (space, articles,...)
Sorting is related to equivalence and mappings, but is a talk
of its own.
Canonical Equivalence
Canonical Equivalence (≡)
From Unicode
Standard Annex (UAX) #15, Section 1.1:
Canonical equivalence is a fundamental equivalency between
characters or sequences of characters that represent the same abstract
character, and when correctly displayed should always have the same
visual appearance and behavior.
(emphasis added by presenter)
Why Canonical Equivalence
- Order of combining characters
R + ̣ + ̄ vs. R + ̄ + ̣
- Precomposed vs. Decomposed
Ṝ vs. Ṛ + ̄ vs. R + ̣ + ̄
- Singletons
Å vs. Å, 樂 vs. 樂 vs. 樂
- Hangul syllables vs. Jamos
가 vs. ᄀ + ᅡ
Combining Characters
- Many scripts use base characters and diacritics
- Diacritics and similar characters are called Combining Characters
- Often, all combinations may appear (e.g. Arabic, Hebrew)
- Often, combinations are limited per language,
but open-ended for the script (Latin)
- Separate encoding seems advantageous
(nothing else available in Unicode 1.0)
- Order meaningful if e.g. all diacritics above
- Order arbitrary if e.g. above and below
⇒Canonical Ordering
Canonical Combining Class
- A character property, integer between 0 and 254
- All base characters have ccc=0
- Only combining marks (but not all of them) have ccc≠0
- Ccc is the same for combining characters that go on the same side of the base character
  (e.g. below: 220, above: 230)
- Therefore, the relative order of marks with the same ccc needs to be maintained
- UnicodeData file (4th field, 0 when no entry)
  (see also DerivedCombiningClass.txt)
Canonical Ordering
- For any two consecutive characters ab, reorder as ba if ccc(a) > ccc(b) > 0,
  until you find no more such reorderings
- Very similar to bubblesort (but not the same!)
- Local operation: most characters have ccc=0, and don't move
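The reordering step can be sketched directly from this description, using unicodedata.combining() for ccc (an illustrative sketch, not the optimized algorithm real libraries use; it does not decompose):

```python
import unicodedata

def canonical_reorder(s: str) -> str:
    chars = list(s)
    done = False
    while not done:
        done = True
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:              # exchange only non-starters that are out of order
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                done = False
    return ''.join(chars)

# s + circumflex (ccc 230) + cedilla (ccc 202) reorders to s + cedilla + circumflex,
# which is what NFD produces for this already-decomposed input:
s = 's\u0302\u0327'
assert canonical_reorder(s) == unicodedata.normalize('NFD', s) == 's\u0327\u0302'
```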
Canonical Ordering Example
Example (ş̤̂̏; ccc in parentheses):

original          | s | ̂ (230) | ̧ (202) | ̏ (230) | ̤ (220)
after first step  | s | ̧ (202) | ̂ (230) | ̏ (230) | ̤ (220)
after second step | s | ̧ (202) | ̂ (230) | ̤ (220) | ̏ (230)
final             | s | ̧ (202) | ̤ (220) | ̂ (230) | ̏ (230)

4 diacritics, 24 permutations, of which 12 and 12 are equivalent
(the two marks with ccc 230 must keep their relative order)
Precomposed Characters
- Legacy encodings have characters including diacritics
(Example: ISO-8859-1:
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ...ñòóôõö÷øùúûüýþÿ)
- Limited transcoding technology
- Limited display technology
- European national pride and voting power
- Unicode - ISO 10646 merger
- Precomposed characters in Unicode 1.1
- Equivalences between precomposed characters and combining sequences
defined
- ⇒Normalization Form D
Hangul Syllables
- Korean is written in square syllables (Hangul, 한글)
- Syllables consist of
- Consonant(s) + vowel(s) (가)
- Consonant(s) + vowel(s) + consonant(s) (민)
- Individual pieces (Conjoining Jamo) are also encoded:
- Leading consonant(s) (ᄍ)
- Vowel(s) (ᅱ)
- Trailing consonant(s) (ᆹ)
- For all L, V, and T in modern use, every LV and LVT syllable is encoded
  (11,172 in total)
- Syllables with historical L (ᅑ), V (ᆑ), and T (ᇷ) must use Jamo
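The mapping between precomposed syllables and Jamo is purely arithmetic; a sketch of the decomposition, following the constants in the Unicode Standard:

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def decompose_hangul(syllable: str) -> str:
    # Split one precomposed Hangul syllable into its conjoining Jamo.
    index = ord(syllable) - S_BASE
    l = L_BASE + index // (V_COUNT * T_COUNT)
    v = V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT
    t = index % T_COUNT
    jamo = chr(l) + chr(v)
    if t != 0:                         # LV syllables have no trailing consonant
        jamo += chr(T_BASE + t)
    return jamo

assert decompose_hangul('가') == '\u1100\u1161'                      # LV syllable
assert decompose_hangul('민') == unicodedata.normalize('NFD', '민')  # LVT syllable
```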
Normalization Form D (NFD)
(D stands for Decomposed)
NFD is the result of Canonical Decomposition:
- Apply Decomposition Mappings
  (from the code charts, marked with ≡, or from the UnicodeData file:
  6th field, when there is no <...> tag;
  apply repeatedly, because all decompositions are singletons or binary)
- Hangul syllable decomposition (algorithmic)
- Repeat until you find no more decompositions
- Apply Canonical Reordering
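A few NFD results, using Python's unicodedata (note that decomposition is applied repeatedly, and that singletons decompose too); a sketch:

```python
from unicodedata import normalize

# Two-step decomposition: Ṝ → Ṛ + macron → R + dot below + macron
assert normalize('NFD', '\u1E5C') == 'R\u0323\u0304'

# Singleton: ANGSTROM SIGN decomposes to Å (U+00C5),
# which in turn decomposes to A + combining ring above
assert normalize('NFD', '\u212B') == 'A\u030A'
```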
Evaluation of NFD
Advantages:
- Flat and straightforward
- Fast operations
Disadvantages:
- Most data is not in NFD
- Difficult to force everybody to use it
Usage Layers
- topmost: User application: Meaningful characters
- higher: DNS/HTTP/SMTP/FTP: Bytes or characters
- lower: TCP/IP: Byte streams
- bottom: Electric current, light, electromagnetic waves
Different Mindsets
- Multilingual Editor:
- Character boundaries, complex display
- Preferred internal normalization
- Normalization not a major burden
- Application Protocols (e.g. IETF):
- Textual content just transported as bytes+charset info
- Identifiers (mostly) ASCII, comparison bytewise or ASCII-case
insensitive
- Internationalized Identifiers:
- E.g. XML element names (W3C)
- Byte or codepoint comparison
- Normalization close to actual practice desired
⇒Normalization Form C
Normalization Form C (NFC)
(C stands for Composed)
Definition:
- Canonical Decomposition (see NFD,
includes decomposition mappings and canonical reordering)
- Canonical Recomposition
('reverse' of decomposition mappings;
see also below)
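NFC first decomposes fully, then recomposes; a sketch of its effect:

```python
from unicodedata import normalize

assert normalize('NFC', 'e\u0301') == '\u00E9'         # e + combining acute → é
assert normalize('NFC', 'R\u0323\u0304') == '\u1E5C'   # R + dot below + macron → Ṝ

# Already-composed input is unchanged:
assert normalize('NFC', '\u00E9') == '\u00E9'
```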
Evaluation of NFC
Advantages:
- Very close to usage on Web
(design goal; in some ways actually too close)
- More compact than NFD
Disadvantages:
The phrase "in some ways actually too close" refers to the fact
that NFC is close enough to actual usage on the Web that there isn't too much
motivation or pressure to actively normalize content. This is not exactly the
purpose of a normalization form, as it means that there is still some content
out there that is not normalized.
Limits of NFC and Canonical Equivalence
- Designed mainly for Latin
- Koreans don't like it for historical Korean
- Macintosh file system uses NFD (roughly: Hangul NFC, rest NFD)
- Canonical equivalence of compatibility Han Ideographs
(use variant sequences)
- Reordering may be too strict or too loose
- Reordering may not match typing order or rendering order
- Stability guarantees may be interpreted purely formally
- Arabic, Thai,.....: not ideal
(new work for rendering Arabic: Unicode Arabic Mark Ordering
Algorithm)
Canonical Recomposition
- Works repeatedly pairwise
- Starts with the base character, ends before the next base character
- When a combination exists, combine and continue with the next combining character
- Skip additional characters of the same combining class after a combining failure
- Data from the code charts (using ≡) or UnicodeData.txt
  (6th field, when there is no <...> tag)
- Exclude Composition Exclusions
- Hangul syllable composition (algorithmic)
Composition Exclusions
- Singletons (nothing to recombine)
- Script specific:
(precomposed rarely used)
- Indic (क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਖ਼ਗ਼ਜ਼ੜਫ਼ଡ଼ଢ଼)
- Tibetan
- Hebrew (שׁשׂשּׁשּׂאַאָאּבּגּדּהּוּ...)
- Non-starter decompositions
(cases with oddball ccc)
- Post Composition Exclusions (⫝̸, some musical symbols)
- Data from CompositionExclusions.txt
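The exclusions are visible in NFC output; a sketch:

```python
from unicodedata import normalize

# Singleton exclusion: ANGSTROM SIGN recomposes to Å (U+00C5), never back to U+212B
assert normalize('NFC', '\u212B') == '\u00C5'

# Script-specific exclusion: DEVANAGARI LETTER QA (U+0958) decomposes to
# KA + nukta and stays decomposed under NFC
assert normalize('NFC', '\u0958') == '\u0915\u093C'
assert normalize('NFC', '\u0915\u093C') == '\u0915\u093C'
```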
Stability of NFD and NFC
- Normalized text should stay so in future Unicode versions
- Okay for NFD
- Needs some work for NFC
- Problem is new precomposed characters
old: q + ̂, new: q̂ (precomposed)
- New (post composition) precomposed characters:
- Need to be excluded from composition (NFC)
(unless both parts are new)
- Discourages encoding in the first place
- Careful
stability guarantees
- Effective stop for encoding precomposed characters
- Danger to interpret stability guarantee strictly formally
(anything can be added if it's not defined to be equivalent)
Normalization Corrigenda
(4 out of 9 corrigenda for all of Unicode)
Kompatibility Equivalence
Kompatibility Equivalence (≈)
From Unicode
Standard Annex (UAX) #15, Section 1.1:
Compatibility equivalence is a weaker equivalence between characters
or sequences of characters that represent the same abstract character, but
may have a different visual appearance or behavior.
(emphasis added by presenter)
- May not conserve semantics: 2³=8, but 2³ ≈ 23
- Is not necessarily very consistent (e.g. ④≈4, but ➃≁4)
- Use different characters when semantically different (e.g. Mathematical
notation)
- Use markup/styling for stylistic differences (e.g.
emphasis)
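Both the semantic loss and the inconsistency can be checked directly; a sketch:

```python
from unicodedata import normalize

# Semantics not conserved: 2³ becomes 23
assert normalize('NFKC', '2\u00B3') == '23'

# Not fully consistent: ④ (U+2463) has a compatibility decomposition,
# ➃ (U+2783, a dingbat) does not
assert normalize('NFKC', '\u2463') == '4'
assert normalize('NFKC', '\u2783') == '\u2783'
```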
All Things Kompatibility
<tag> indicates and classifies kompatibility in UnicodeData.txt (6th field):
- <noBreak>: Difference in breaking properties only
- <super>: Superscripts
- <sub>: Subscripts
- <fraction>: Fractions
- <circle>: Circled letters
- <square>: Square blocks
- <font>: HEBREW LETTER WIDE; MATHEMATICAL/ARABIC MATHEMATICAL
- <wide>: Fullwidth (double width) variants
- <narrow>: Halfwidth variants
- <small>: Small variants
- <vertical>: Vertical variants
- <isolated>, <final>, <initial>, <medial>: Arabic contextual variants and ligatures
- <compat>: General compatibility (e.g. spacing/nonspacing), ligatures, double
  characters (e.g. double prime), roman numerals, script variants: long s, greek
  theta, space width variants (EM space,...), parenthesized, with full stop/comma,
  IDEOGRAPHIC TELEGRAPH; (CJK/KANGXI) radicals, HANGUL LETTER variants
Kompatibility Decomposition
- Individual kompatibility decompositions proceed in a single step
  (e.g. FDFA ≈ 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A
  0647 0020 0648 0633 0644 0645)
- However, kompatibility and canonical decomposition need to be applied repeatedly:
  - First compatibility decomposition, then canonical decomposition:
    U+01C4, LATIN CAPITAL LETTER DZ WITH CARON
    ≈<compat> 0044 017D ≡> 0044 005A 030C
  - First canonical decomposition, then compatibility decomposition:
    U+0385, GREEK DIALYTIKA TONOS
    ≡> 00A8 0301 ≈<compat> 0020 0308 0301
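Both orderings end at the same NFKD result; a sketch:

```python
from unicodedata import normalize

# U+01C4 Ǆ: compat → 0044 017D, then canonical → 0044 005A 030C
assert normalize('NFKD', '\u01C4') == '\u0044\u005A\u030C'

# U+0385: canonical → 00A8 0301, then compat → 0020 0308 0301
assert normalize('NFKD', '\u0385') == '\u0020\u0308\u0301'
```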
Normalization Form KD (NFKD)
Definition:
- Kompatibility Decomposition
- Canonical Reordering
Normalization Form KC (NFKC)
Definition:
- Kompatibility Decomposition
- Canonical Reordering
- Canonical Recomposition
Take Care with Normalization Forms
- Normalization forms interact with case conversion
- Normalization forms interact with string operations
e.g. concatenation
- Normalization forms interact with markup
  e.g. ≮ ≡ < + ̸
- Normalization forms interact with escaping
- Normalization interacts with Unicode Versions
(but greatest care has been taken to limit this)
Similar concerns may apply to other kinds of mappings
Because of the wide range of phenomena encoded by Unicode,
there is always the chance that different phenomena interact in strange and
unpredictable ways. Before assuming no interaction, carefully check. If you
don't find any interactions, don't assume that will be necessarily so in all
future versions.
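The concatenation interaction is easy to trip over: concatenating two NFC strings need not yield an NFC string; a sketch:

```python
from unicodedata import normalize

a, b = 'e', '\u0301'                   # each string is individually in NFC
assert normalize('NFC', a) == a and normalize('NFC', b) == b

# ... but their concatenation is not: NFC composes across the boundary
assert normalize('NFC', a + b) == '\u00E9' != a + b
```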
Case Studies
Case Study: IETF from IDNA2003 to Precis
- IETF (Internet Engineering Task Force)
need for 'normalized' identifiers
- Internationalized Domain Names (IDNs, in IDNA)
- User names, passwords, nicknames
(SASL, XMPP,...)
- Internationalized Email Addresses
IETF Ideal
- Not experts on characters
- Avoid endless discussion on specific scripts,...
- Assume problem is already solved elsewhere
⇒ Strong preference for one-stop shopping
IDNA Development
- Design team for character issues
- IDNA 2003
- IDNA 2008
- TR 46
- Precis (framework):
- Allows characters based on character properties
- Width mapping (<wide>/<narrow>)
- Additional mappings
- Case mapping
- Normalization (NFC or NFKC)
IDNA 2003
(RFC 3490)
- Case folding
- NFKC
- Nameprep (RFC 3491), based on Stringprep (RFC 3454)
- Wide repertoire (what is not forbidden is allowed)
- Based on Unicode 3.2 (fixed version, e.g. without Mongolian)
IDNA 2008
(RFC 5890-4)
- Mappings (incl. case) outside spec
- Result must be NFC, lower case
- Contextual rules (e.g. for Malayalam)
- Narrow repertoire (what's not allowed is forbidden)
- Inclusion/exclusion based on character properties
- Base version is Unicode 5.2.0
- Semi-automatic version adjustments
IDNA 2008 Version Adjustments
- Updated to Unicode 6.0.0 in RFC 6452
- Very short (4 pages)
- Core statement: No change to RFC 5892 is needed based on the
changes made in Unicode 6.0.
- Got stuck for a long time at Unicode 7.0.0
- 'Thinking' at that time documented in
https://tools.ietf.org/html/draft-klensin-idna-5892upd-unicode70-05
- 7 different proposals for how to proceed
- Original stumbling block is U+08A1, ARABIC LETTER BEH WITH HAMZA
ABOVE (new in Unicode 7.0.0)
- Problem: Both ARABIC LETTER BEH and HAMZA ABOVE already exist, but
there is no equivalence
- Document kept growing (up to 35 pages)
- Not really a problem, because higher layer (registries) can take
care
- Moved on with RFC 8753 (IANA tables currently at Unicode version 11.0.0)
Unicode IDNA Compatibility Processing
(UTS #46)
- Mixture between IDNA 2003 (for backwards compatibility)
and IDNA 2008
- Updated for new Unicode versions, currently at Unicode version 13.0.0
Precis
- Preparation, Enforcement, and Comparison of Internationalized Strings in
Application Protocols
- Problem statement (RFC 6885, March 2013)
- Updated specifications (Oct. 2017):
- Framework (RFC 8264): Building blocks and how to specify their combination
  (includes Identifiers and FreeForm)
- Usernames (CaseMapped and CasePreserved), OpaqueString, and Passwords
(RFC 8265)
- Nicknames (RFC 8266)
Precis Components
- Width mapping
- Additional mappings (incl. character removals)
- Case mappings
- Normalization
- Directionality (Bidi) restrictions
- Character (type) restrictions
- Script restrictions
- Zero-length prohibition
- Distinguishing between enforcement and comparison (e.g. for case
distinctions)
Use fixpoint because idempotency unclear
Case Study 2: File Systems and IDNs/IRIs
- Windows file system and IDNs are case-insensitive
- Apple file system is case-sensitive
- Windows file system keeps original casing for display
(needs data in two forms for efficiency)
- IDNs don't keep original casing
(on the Web, there's no IBM, only ibm)
- Windows file system does not normalize
- IDNs and Apple file system normalize (but differently)
Strategy: When and How to Apply
- To check equivalence, use mappings and compare for equality
- Mapping can appear at system boundary or on actual equivalence check
- System boundary may vary
- All of the Internet/WWW
- Specific servers, clients
- Some kinds/types of data
- Check for consistency between system components
- Check for consistency between software versions
Strategy: What to Apply
- Case mapping (prefer lower case)
- Pre-normalization tweaks
- Normalization (Canonical or Kompatibility)
- Post-normalization tweaks
- Checks for disallowed characters
- Checks for disallowed sequences (e.g. script mixing)
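The steps above can be combined into a small preparation function; a minimal sketch (the function name and the disallowed set are assumptions for illustration, not from any standard):

```python
import unicodedata

DISALLOWED = {'\u0000', '\u202E'}   # hypothetical example set (NUL, RIGHT-TO-LEFT OVERRIDE)

def prepare(s: str) -> str:
    s = s.casefold()                        # case mapping (prefer lower case)
    s = unicodedata.normalize('NFKC', s)    # normalization (Kompatibility here)
    s = s.casefold()                        # NFKC can introduce new case-able characters
    s = unicodedata.normalize('NFKC', s)    # ... and case folding can denormalize: repeat
    if any(c in DISALLOWED for c in s):     # check for disallowed characters
        raise ValueError('disallowed character')
    return s

# The spoofed username from the earlier slide collapses to the plain form:
assert prepare('\u1D2E\u1D35\u1D33\u1D2E\u1D35\u1D3F\u1D30') == 'bigbird'   # ᴮᴵᴳᴮᴵᴿᴰ
```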
More Advice on Normalization
String Matching
for the Web (W3C Working Group Note Feb. 2019)
Update (9 October 2020) under review, please send comments
Questions and Answers
This is your time for questions!
Questions may be unnormalized, but answers will be normalized!
You can also send questions by email to duerst@it.aoyama.ac.jp
Acknowledgments
Mark Davis for many years of collaboration (and some disagreements) on
normalization, and for proposing a wider approach to the topic of normalization.
Too many people from Unicode, W3C, and the IETF to list them all.
The IME Pad for facilitating character input.
Amaya and Opera 12.18 for slide production and display.