# Balanced Trees

(平衡木)

## Data Structures and Algorithms

### 9th lecture, November 28, 2017

http://www.sw.it.aoyama.ac.jp/2019/DA/lecture9.html

### Martin J. Dürst

© 2009-19 Martin J. Dürst, Aoyama Gakuin University

# Today's Schedule

• Leftovers and summary of last lecture
• Balanced trees for internal memory
  • 2-3-4 trees
  • Red-black trees
  • AVL trees
• Balanced trees for secondary storage
  • B trees
  • B+ trees

# Summary of Last Lecture

• A dictionary is an ADT allowing the insertion, deletion, and search of data items using a key
• With a simplistic implementation, some operations take O(n) time
• With a binary search tree, all operations are O(log n) on average, but O(n) in the worst case
• In contrast to sorting, this cannot be improved using randomization:
  • For quicksort, the algorithm can randomly select a pivot
  • The order of insertions and deletions for a dictionary is externally determined

# Strengthening or Weakening Invariants of Binary Trees

• For the implementation of a priority queue, we
  • Weakened the total order (of a list or array) to a local order (between parent and child only)
  • Strengthened the shape of a binary tree to a complete binary tree
• We have to consider strengthening or weakening invariants to improve the worst-case performance of a binary search tree

# Top-Down 2-3-4 Trees

• Each (internal) node has 2, 3, or 4 children
• A node with k children stores k-1 keys and data items
(if all nodes have 2 children, a 2-3-4 tree is equal to a perfect binary search tree)
• The keys in the internal nodes separate the key ranges in the subtrees
• The tree is of uniform height
• In the lowest layer of the tree, the nodes have no children
(implemented as a single unique empty node)
• Operations are generalizations of the same operation on a binary search tree

# Search in 2-3-4 Trees

• Start from the root node
• If the key being searched for is found in the current node, then return the corresponding data item
• Select the subtree based on this node's keys, and continue recursively
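The search above can be sketched in Ruby (the lecture's implementation language). `Node234` and `search234` are illustrative names, not taken from the lecture's 9234tree.rb, and for brevity the key itself stands in for the stored data item:

```ruby
# Minimal 2-3-4 node: `keys` is a sorted array of 1-3 keys; `children` is
# nil for a node in the lowest layer, or an array of keys.size + 1 subtrees.
Node234 = Struct.new(:keys, :children)

def search234(node, key)
  return nil if node.nil?
  # Position of the first key >= the search key selects the subtree.
  i = node.keys.index { |k| key <= k } || node.keys.size
  return node.keys[i] if node.keys[i] == key    # key found in this node
  return nil if node.children.nil?              # lowest layer: not present
  search234(node.children[i], key)              # continue in subtree i
end
```

For example, with a root holding keys 3 and 7 over leaves [1, 2], [5], and [8, 9], the search for 5 descends into the middle subtree.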

# Insertion into 2-3-4 Trees

• Basic operation: Search downwards, insert new data item into leaf node
• If the leaf node is already a 4-node (i.e. it already holds 3 keys), it has to be split
• When a node is split, its middle key and data item are inserted into the parent node
• This may trigger further splits in parents, potentially up to the root
• To avoid splits after insertion (difficult to implement),
nodes with 4 children are split preemptively on the way from the root to the leaf
• This is the reason for the name top-down 2-3-4 tree
(there are other variants)
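The preemptive split can be sketched as follows (a minimal illustration with hypothetical names; the node type is redefined so the example is self-contained). Child `i` of `parent` is a 4-node; its middle key moves up and the child becomes two 2-nodes. The top-down pass guarantees that `parent` still has room:

```ruby
Node234 = Struct.new(:keys, :children)   # keys: 1-3 sorted keys per node

def split_child(parent, i)
  child = parent.children[i]
  raise "only 4-nodes are split" unless child.keys.size == 3
  # Left and right halves keep one key each; children (if any) are divided.
  left  = Node234.new(child.keys[0, 1], child.children && child.children[0, 2])
  right = Node234.new(child.keys[2, 1], child.children && child.children[2, 2])
  parent.keys.insert(i, child.keys[1])   # middle key moves into the parent
  parent.children[i, 1] = [left, right]  # the 4-node becomes two 2-nodes
  parent
end
```

Splitting the child [1, 2, 3] of a parent holding [10] yields a parent [2, 10] over children [1] and [3], with the sibling untouched.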

# Deletion from 2-3-4 Trees

• More complicated than insertion (as for binary search trees)
• Find the data item to be deleted, using search
• If the item to be deleted is not in a leaf, exchange it with an item in a leaf
• Remove the item in the leaf
• If this results in a leaf node without data items, move (borrow) items from neighboring leaves
• If the situation cannot be fixed by moving, merge some nodes
• If the situation cannot be fixed by merging, address the problem one layer higher
• If the problem cannot be solved in the top layer, reduce the number of layers

# Efficiency of 2-3-4 Trees

• Maximum number of data items in a 2-3-4 tree of height h: n = 4^h − 1
• Minimum number of data items in a 2-3-4 tree of height h: n = 2^h − 1
• ⇒ The height of the tree is O(log n)
• The time needed for each operation is proportional to the height of the tree and therefore O(log n)
(even in the worst case)
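These closed forms can be checked by summing the keys level by level (a quick numeric check; h counts the layers of nodes holding keys):

```ruby
# Every node a 4-node: 3 keys per node, 4**level nodes on each level.
def max_items(h)
  (0...h).sum { |level| 3 * 4**level }   # = 4**h - 1
end

# Every node a 2-node: 1 key per node, 2**level nodes on each level.
def min_items(h)
  (0...h).sum { |level| 2**level }       # = 2**h - 1
end

(1..6).each do |h|
  raise unless max_items(h) == 4**h - 1 && min_items(h) == 2**h - 1
end
```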

# Implementation of 2-3-4 Tree

• Implementation in Ruby: 9234tree.rb
• Implementation of 2-3-4 trees is quite complicated
• Some memory (in nodes with 2 or 3 children) is unused
• Therefore, other balanced trees have been proposed

# Red-Black Trees

• Implementation of a 2-3-4 tree with a binary tree
• The edges of the original tree are black
• Nodes with 3 or 4 children are split into multiple nodes, coloring the internal edges red
• Two consecutive red edges are forbidden
• If this invariant is violated, rotations are used for restoration
• If only black edges are counted, the tree is of uniform height
• When all edges are considered, the maximum depth of a leaf is at most twice the minimum depth (O(log n))
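The node-splitting step can be illustrated for a single 2-3-4 node (a hypothetical sketch; `red` marks whether the edge to the parent is red, and the 3-node case arbitrarily uses the left-leaning variant):

```ruby
RBNode = Struct.new(:key, :left, :right, :red)

# Turn the sorted keys of one 2-3-4 node into a small binary subtree,
# coloring red the edges that were internal to the original node.
def split_node(keys)
  case keys.size
  when 1   # 2-node: a single black node
    RBNode.new(keys[0], nil, nil, false)
  when 2   # 3-node: black node with one red child
    RBNode.new(keys[1], RBNode.new(keys[0], nil, nil, true), nil, false)
  when 3   # 4-node: black node with two red children
    RBNode.new(keys[1],
               RBNode.new(keys[0], nil, nil, true),
               RBNode.new(keys[2], nil, nil, true), false)
  end
end
```

Because each original node contributes at most one red edge on any root-to-leaf path, counting black edges only recovers the uniform height of the 2-3-4 tree.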

# AVL Trees

• Proposed by Adelson-Velskii and Landis (Адельсон-Вельский and Ландис) in 1962
• Oldest (binary) balanced tree
• Invariant: At each internal node, the difference between the heights of the subtrees is 1 or less
• The difference between the heights of the left and the right subtrees (-1, 0, 1) is stored in each internal node and kept up to date
• The tree height is limited to 1.44 log₂ n
• Searching is slightly faster than for a red-black tree
• Insertion and deletion are slightly more complicated than for a red-black tree
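The 1.44 log₂ n bound follows from the sparsest trees that still satisfy the AVL invariant: one subtree is one level shorter than the other, so the minimal node counts grow like the Fibonacci numbers. A small numeric check (illustrative only):

```ruby
# Fewest nodes in an AVL tree of height h.
def min_nodes(h)
  return 0 if h <= 0
  return 1 if h == 1
  min_nodes(h - 1) + min_nodes(h - 2) + 1   # two subtrees plus the root
end

(1..25).each do |h|
  n = min_nodes(h)                          # even the sparsest tree of
  raise unless h <= 1.44 * Math.log2(n + 2) # height h obeys the bound
end
```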

# Secondary Storage

| | Internal Memory | External Storage (random access) | External Storage (linear access) |
|---|---|---|---|
| Access principle | random | random | linear |
| Technology | dynamic RAM | SSD, HD | magnetic tape |
| Unit of access | word | page/sector | record |
| Example unit size | 32/64 bits (4/8 bytes) | 512/1024/2048/4096/... bytes | varying |
| Access speed | nanoseconds | micro/milliseconds | seconds or minutes |

# B-Trees

• Variant of 2-3-4 trees
• Suited for external random access storage (SSD, HD)
• Each page is a node in the tree
• Maximise number of keys per page
• Minimum number of keys per page is about half of the maximum

# B+ Trees

Starting with a B-tree, all data (except keys) is moved to the lowest layer of the tree

⇒ The number of keys and child nodes per internal node increases
(for practical applications, the size of a key is much smaller than the size of the data)

⇒ The height of the tree shrinks

(the overall access time is dominated by the number of pages that have to be fetched from secondary memory)

# Definition of Variables for B+ Trees

• n: Overall number of data items (example: 50,000)
• Lp: Page size (example: 1024 bytes)
• Lk: Key size (example: 4 bytes)
• Ld: Data size (one item, except key) (example: 240 bytes)
• Lpp: Size of page number (page reference) (example: 4 bytes)
• αmin: minimum occupancy (usually 0.5)

# Items per Page for B+ Trees

(⌊a⌋ is the floor function of a, the greatest integer smaller than or equal to a:
⌊a⌋∈ℤ ∧ ⌊a⌋≦a ∧ ¬∃b: b∈ℤ ∧ ⌊a⌋<b≦a)

• dmax = ⌊Lp / (Lk + Ld)⌋ (example: 4)
(maximum number of data items per leaf page)
• dmin = ⌊dmax αmin⌋ (example: 2)
(minimum number of data items per leaf page)
• kmax = ⌊Lp / (Lk + Lpp)⌋ (example: 128)
(maximum number of children per internal node)
• kmin = ⌊kmax αmin⌋ (example: 64)
(minimum number of children per internal node)
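Plugging the example values into these formulas (a check with ad-hoc variable names; Ruby's integer division already floors):

```ruby
lp, lk, ld, lpp, a_min = 1024, 4, 240, 4, 0.5   # example sizes in bytes

d_max = lp / (lk + ld)          # max data items per leaf page
d_min = (d_max * a_min).floor   # min data items per leaf page
k_max = lp / (lk + lpp)         # max children per internal node
k_min = (k_max * a_min).floor   # min children per internal node

p [d_max, d_min, k_max, k_min]  # => [4, 2, 128, 64]
```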

# Number of Nodes for B+ Trees

(⌈a⌉ is the ceiling function of a, the smallest integer greater than or equal to a:
⌈a⌉∈ℤ ∧ a≦⌈a⌉ ∧ ¬∃b: b∈ℤ ∧ a≦b<⌈a⌉)

• Ndmax = ⌈n / dmin⌉ (example: 25,000)
(maximum number of leaf pages)
• Ndmin = ⌈n / dmax⌉ (example: 12,500)
(minimum number of leaf pages)
• Nkmax = ⌈Ndmax / kmin⌉ + ⌈Ndmax / kmin²⌉ + ...
(maximum number of internal nodes)
(example: 391 + 7 + 1 = 399; height of the B+ tree: 4; total number of nodes: 25,399)
• Nkmin = ⌈Ndmin / kmax⌉ + ⌈Ndmin / kmax²⌉ + ...
(minimum number of internal nodes)
(example: 98 + 1 = 99; height of the B+ tree: 3; total number of nodes: 12,599)
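The level-by-level sums can be reproduced with a short script (ad-hoc names; each level needs the previous page count divided by the fan-out, rounded up, until a single root page remains):

```ruby
def level_counts(pages, fanout)
  counts = []
  while pages > 1
    pages = (pages.to_f / fanout).ceil   # pages needed one level up
    counts << pages
  end
  counts
end

n, d_max, d_min, k_max, k_min = 50_000, 4, 2, 128, 64

nd_max = (n.to_f / d_min).ceil             # 25,000 leaf pages at worst
nd_min = (n.to_f / d_max).ceil             # 12,500 leaf pages at best
nk_max = level_counts(nd_max, k_min).sum   # 391 + 7 + 1 = 399 internal nodes
nk_min = level_counts(nd_min, k_max).sum   # 98 + 1 = 99 internal nodes
```

The tree height is the number of layers, i.e. 1 (for the leaves) plus the number of internal levels: 4 in the worst case and 3 in the best case here.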

# Summary

• Balanced search trees are important for efficient implementation of dictionary ADTs
• 2-3-4 trees and B(+) trees increase the degree of the nodes, but keep the tree height uniform
• Red-black trees and AVL trees impose limitations on the variation of the tree height
• Balanced trees make it possible to implement the basic operations of a dictionary ADT in O(log n) worst-case time
• B-trees and B+ trees are extremely important for the implementation of file systems and databases on secondary storage

# Glossary

red-black tree

AVL tree
AVL 木
secondary storage

B-tree
B 木
B+ tree
B+木
strengthen

weaken

uniform

lowest layer

occupancy

floor function

ceiling function

degree