Balanced Trees


Data Structures and Algorithms

9th lecture, November 25, 2015

http://www.sw.it.aoyama.ac.jp/2017/DA/lecture9.html

Martin J. Dürst

© 2009-17 Martin J. Dürst, Aoyama Gakuin University

Today's Schedule

• Summary of last lecture, leftovers
• Balanced trees for internal use
  • 2-3-4 trees
  • Red-black trees
  • AVL trees
• Balanced trees for secondary storage
  • B trees
  • B+ trees

Summary of Last Lecture

• A dictionary is an ADT allowing the insertion, deletion, and search of data items using a key
• With a simplistic implementation, some operations take O(n) time
• With a binary search tree, all operations are O(log n) on average, but O(n) in the worst case
• Unlike for sorting, this cannot be improved using randomization:
  • For quicksort, the algorithm can randomly select a pivot
  • The order of insertions and deletions for a dictionary is externally determined
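The worst case is easy to provoke. The following is a minimal sketch, assuming a plain unbalanced binary search tree that stores keys only; the names BstNode, bst_insert, bst_height, and build_bst are hypothetical and not taken from the lecture code. Inserting keys in sorted order produces a chain whose height equals n, while a shuffled order usually gives a much flatter tree.

```ruby
# Minimal unbalanced binary search tree (keys only), for illustration.
BstNode = Struct.new(:key, :left, :right)

# Iterative insertion; returns the (possibly new) root.
def bst_insert(root, key)
  new_node = BstNode.new(key)
  return new_node if root.nil?
  node = root
  loop do
    if key < node.key
      if node.left.nil?
        node.left = new_node
        break
      end
      node = node.left
    else
      if node.right.nil?
        node.right = new_node
        break
      end
      node = node.right
    end
  end
  root
end

# Height = number of nodes on the longest path from the root downwards.
def bst_height(node)
  return 0 if node.nil?
  1 + [bst_height(node.left), bst_height(node.right)].max
end

def build_bst(keys)
  keys.inject(nil) { |root, key| bst_insert(root, key) }
end

n = 1000
sorted_height   = bst_height(build_bst((1..n).to_a))
shuffled_height = bst_height(build_bst((1..n).to_a.shuffle(random: Random.new(42))))
puts "sorted input:   height #{sorted_height}"
puts "shuffled input: height #{shuffled_height}"
```

Since the caller decides the insertion order, the tree cannot shuffle it away; this is why the remaining slides change the invariants of the tree instead.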

Strengthening or Weakening the Invariants of Binary Trees

• For the implementation of a priority queue, we:
  • Weakened the total order (of a list or array) to a local order (between parent and child only)
  • Strengthened the shape of a (general) binary tree to a complete binary tree
• We have to consider strengthening or weakening invariants to improve worst-case performance of a binary search tree

Top-Down 2-3-4 Trees

• Each (internal) node has 2, 3, or 4 children
• A node with k children stores k-1 keys and data items
(if all nodes have 2 children, a 2-3-4 tree is equal to a binary search tree)
• The keys in the internal nodes separate the key ranges in the subtrees
• The tree is of uniform height
• In the lowest layer of the tree, the nodes have no children
(implemented as a single unique empty node)
• Operations are generalizations of the same operation on a binary search tree

Search in 2-3-4 Trees

• Start from the root node
• If the key being searched for is found in the current node, then return the corresponding data item
• Otherwise, select the subtree based on this node's keys, and continue recursively
• If the lowest layer is reached without finding the key, the key is not in the tree
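The search steps above can be sketched as follows. This is an illustrative sketch only: the node layout (Node234 with parallel keys and values arrays, and an empty children array in the lowest layer instead of a shared empty node) and the names search234 and leaf234 are assumptions, not the layout used in 9234tree.rb.

```ruby
# Hypothetical node layout: keys and values are parallel sorted arrays;
# children holds keys.size + 1 subtrees, or is empty in the lowest layer.
Node234 = Struct.new(:keys, :values, :children)

def search234(node, key)
  until node.nil?
    # Index of the first key that is >= the searched key.
    i = node.keys.index { |k| k >= key } || node.keys.size
    return node.values[i] if i < node.keys.size && node.keys[i] == key
    node = node.children[i]   # nil once we step below the lowest layer
  end
  nil                         # key not found
end

# Helper that builds a node of the lowest layer from a hash of key => data.
def leaf234(pairs)
  Node234.new(pairs.keys, pairs.values, [])
end

root = Node234.new([20], ["twenty"],
                   [leaf234({ 5 => "five", 10 => "ten" }),
                    leaf234({ 30 => "thirty", 40 => "forty", 50 => "fifty" })])

p search234(root, 40)   # found in the right subtree
p search234(root, 20)   # found in the root
p search234(root, 7)    # not present
```

With at most three keys per node, a linear scan inside the node is sufficient; the cost of a search is dominated by the number of layers visited.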

Insertion into 2-3-4 Trees

• Basic operation: Search downwards, insert new data item into leaf node
• If a leaf node is already full (4 children, i.e. 3 keys), it has to be split
• If a node has to be split, its middle key and data item have to be inserted into the parent node
• This may trigger further splits in parents, potentially up to the root
• To avoid splits after insertion (difficult to implement),
nodes with 4 children are split preemptively on the way from the root to the leaf
• This is the reason for the name top-down 2-3-4 tree
(there are other variants)
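A minimal sketch of top-down insertion with preemptive splitting is shown below. It makes several simplifying assumptions that are not from the lecture: only keys are stored (no data items), keys are distinct, the lowest layer is marked by an empty children array, and the class and method names (Tree234, split_child, child_index, leaf_depths) are hypothetical.

```ruby
class Tree234
  Node = Struct.new(:keys, :children) do
    def leaf?; children.empty?; end
    def full?; keys.size == 3; end     # 3 keys = 4 children
  end

  def initialize
    @root = Node.new([], [])
  end

  def insert(key)
    if @root.full?
      # Splitting the root is the only way the tree grows in height.
      @root = Node.new([], [@root])
      split_child(@root, 0)
    end
    node = @root
    until node.leaf?
      i = child_index(node, key)
      if node.children[i].full?
        split_child(node, i)             # preemptive split on the way down
        i += 1 if key > node.keys[i]     # middle key moved up to position i
      end
      node = node.children[i]
    end
    node.keys.insert(child_index(node, key), key)  # leaf is guaranteed not full
  end

  # All keys in ascending order.
  def to_a(node = @root)
    return node.keys.dup if node.leaf?
    result = []
    node.children.each_with_index do |child, i|
      result.concat(to_a(child))
      result << node.keys[i] if i < node.keys.size
    end
    result
  end

  # Depth of every leaf; a single distinct value means uniform height.
  def leaf_depths(node = @root, depth = 1)
    return [depth] if node.leaf?
    node.children.flat_map { |child| leaf_depths(child, depth + 1) }
  end

  private

  def child_index(node, key)
    node.keys.index { |k| k >= key } || node.keys.size
  end

  # Splits the full child at position i; the parent is known not to be full.
  def split_child(parent, i)
    child = parent.children[i]
    left  = Node.new([child.keys[0]], child.children.take(2))
    right = Node.new([child.keys[2]], child.children.drop(2))
    parent.keys.insert(i, child.keys[1])
    parent.children[i, 1] = [left, right]
  end
end

tree = Tree234.new
(1..10).each { |k| tree.insert(k) }
p tree.to_a               # keys come back in sorted order
p tree.leaf_depths.uniq   # prints [3]: all leaves are at the same depth
```

Because every full node met on the way down is split before descending into it, the parent of a node being split always has room for the middle key, so no split ever has to propagate back up.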

Deletion from 2-3-4 Trees

• More complicated than insertion (as with binary search trees)
• Find data item to be deleted, using search
• If the item to be deleted is not in a leaf, exchange with an item in a leaf
• Remove the item in the leaf
• If this results in a leaf node without data items, move (borrow) items from neighboring leaves
• If the situation cannot be fixed using moving, merge some nodes
• If the situation cannot be fixed using merging, address the problem one layer higher
• If the problem cannot be solved in the top layer, reduce the number of the layers

Efficiency of 2-3-4 Trees

• Maximum number of data items in a 2-3-4 tree of height h: n = 4^h - 1
• Minimum number of data items in a 2-3-4 tree of height h: n = 2^h - 1
• ⇒ The height of the tree is O(log n)
• The time needed for each operation is proportional to the height of the tree and therefore O(log n)
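Solving the two formulas above for h gives log4(n+1) ≤ h ≤ log2(n+1). As a small worked example, the following sketch computes the smallest and largest possible height for a given n with integer arithmetic only; the function name height_range_234 is hypothetical.

```ruby
# Returns [min_height, max_height] of a 2-3-4 tree holding n >= 1 data items,
# using 2^h - 1 <= n <= 4^h - 1.
def height_range_234(n)
  min_h = 0
  min_h += 1 while 4**min_h - 1 < n          # smallest h with 4^h - 1 >= n
  max_h = 0
  max_h += 1 while 2**(max_h + 1) - 1 <= n   # largest h with 2^h - 1 <= n
  [min_h, max_h]
end

[15, 50_000, 1_000_000].each do |n|
  min_h, max_h = height_range_234(n)
  puts "n = #{n}: height between #{min_h} and #{max_h}"
end
```

The upper bound is only twice the lower bound, which is what makes every operation O(log n) in the worst case.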

Implementation of 2-3-4 Tree

• Implementation in Ruby: 9234tree.rb
• Implementation of 2-3-4 trees is quite complicated
• Some memory (in nodes with 2 or 3 children) is unused
• Therefore, other balanced trees have been proposed

Red-Black-Trees

• Implementation of a 2-3-4 tree with a binary tree
• The edges of the original tree are black
• Nodes with 3 or 4 children are split into multiple nodes, coloring the internal edges red
• Two consecutive red edges are forbidden
• If this invariant is violated, rotations are used for restoration
• If only black edges are counted, the tree is of uniform height
• When all edges are considered, the maximum depth of a leaf is at most twice the minimum depth
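The two invariants can be checked mechanically. The sketch below assumes the common representation in which the color of an edge is stored in the child node it leads to; RbNode and black_height are hypothetical names. The example tree encodes a 2-3-4 tree whose root holds the key 20, with a 3-node (5, 10) on the left and a 4-node (30, 40, 50) on the right.

```ruby
# color is the color of the edge from the parent to this node.
RbNode = Struct.new(:key, :color, :left, :right)

# Returns the number of black edges on every path below (and including) the
# edge into node; raises if one of the two invariants is violated.
def black_height(node, parent_red = false)
  return 0 if node.nil?
  red = node.color == :red
  raise "two consecutive red edges at key #{node.key}" if red && parent_red
  left  = black_height(node.left, red)
  right = black_height(node.right, red)
  raise "black height differs below key #{node.key}" unless left == right
  left + (red ? 0 : 1)
end

# 3-node (5, 10): black 10 with a red child 5.
left_subtree  = RbNode.new(10, :black, RbNode.new(5, :red), nil)
# 4-node (30, 40, 50): black 40 with two red children.
right_subtree = RbNode.new(40, :black, RbNode.new(30, :red), RbNode.new(50, :red))
rb_root = RbNode.new(20, :black, left_subtree, right_subtree)

p black_height(rb_root)   # prints 2, the height of the underlying 2-3-4 tree
```

Counting only black edges recovers the uniform height of the 2-3-4 tree; the red edges can at most double the length of a path, because two of them may never follow each other.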

AVL-Trees

• Proposed by Adelson-Velskii and Landis (Адельсон-Вельский and Ландис) in 1962
• Oldest (binary) balanced tree
• Invariant: At each internal node, the difference between the heights of the subtrees is 1 or less
• The difference between the heights of the left and the right subtrees (-1, 0, 1) is stored in each internal node and kept up to date
• The tree height is limited to about 1.44 log2(n)
• Searching is slightly faster than for a red-black-tree
• Insertion and deletion are slightly more complicated than for a red-black-tree
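The height bound can be made plausible by counting the fewest nodes an AVL tree of height h can have. The sparsest such tree consists of a root, a sparsest subtree of height h-1, and a sparsest subtree of height h-2, so N(h) = N(h-1) + N(h-2) + 1 with N(0) = 0 and N(1) = 1. The following sketch (hypothetical function name avl_min_nodes, height counted in nodes) evaluates this recurrence and compares h with log2(n).

```ruby
# Fewest nodes an AVL tree of height h can have (height counted in nodes).
def avl_min_nodes(h)
  a, b = 0, 1                  # minimum node counts for heights 0 and 1
  return a if h == 0
  (h - 1).times { a, b = b, a + b + 1 }
  b
end

[5, 10, 20, 30].each do |h|
  n = avl_min_nodes(h)
  puts format("h = %2d  n = %8d  h / log2(n) = %.3f", h, n, h / Math.log2(n))
end
```

The ratio approaches 1 / log2(φ) ≈ 1.44 (φ being the golden ratio) as h grows, because N(h) grows like the Fibonacci numbers.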

Secondary Storage

                    Internal Memory          External (Secondary) Storage
Access principle    random                   random                         linear
Technology          dynamic RAM              SSD, HD                        magnetic tape
Unit of access      word                     page                           record
Example unit size   32/64 bits (4/8 bytes)   512/1024/2048/4096/... bytes   varying
Access speed        nanoseconds              milliseconds                   seconds or minutes

B-Trees

• Variant of 2-3-4 trees
• Each page is a node in the tree
• Maximise the number of keys per page
• The minimum number of keys per page is about half of the maximum

Page of a B-Tree

| ref. to subtree | key | data | ref. to subtree | key | data | ref. to subtree | ... | key | data | ref. to subtree |

B+ Trees

Starting with a B-tree, all data (except keys) is moved to the lowest layer of the tree

⇒ The number of keys and child nodes per internal node increases
(for practical applications, the size of a key is much smaller than the size of the data)

⇒ The height of the tree shrinks
(the overall access time is dominated by the number of pages that have to be fetched from secondary storage)

Internal Page of a B+ Tree

| ref. to subtree | key | ref. to subtree | key | ref. to subtree | key | ref. to subtree | ... | key | ref. to subtree |

Leaf Page of a B+ Tree

| key | data | key | data | key | data | ... | key | data |

Definition of Variables for B+ Trees

• n: Overall number of data items (example: 50,000)
• Lp: Page size (example: 1024 bytes)
• Lk: Key size (example: 4 bytes)
• Ld: Data size (one item, except key) (example: 240 bytes)
• Lpp: Size of page number (page reference) (example: 4 bytes)
• αmin: minimum occupancy (usually 0.5)

Items per Page for B+ Trees

(⌊a⌋ is the floor function of a, the greatest integer smaller than or equal to a:
⌊a⌋∈ℤ ∧ ⌊a⌋≦a ∧ ¬∃b: b∈ℤ ∧ ⌊a⌋<b≦a)

• dmax = ⌊Lp / (Lk + Ld)⌋ (example: 4)
(maximum number of data items per leaf page)
• dmin = ⌊dmax αmin⌋ (example: 2)
(minimum number of data items per leaf page)
• kmax = ⌊Lp / (Lk + Lpp)⌋ (example: 128)
(maximum number of children per internal node)
• kmin = ⌊kmax αmin⌋ (example: 64)
(minimum number of children per internal node)
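The example values can be reproduced with a few lines of integer arithmetic. In this sketch the variable names are hypothetical stand-ins for the symbols defined above (page_size for Lp, key_size for Lk, data_size for Ld, ref_size for Lpp, min_occupancy for αmin).

```ruby
page_size     = 1024   # Lp, bytes per page
key_size      = 4      # Lk, bytes per key
data_size     = 240    # Ld, bytes per data item (without key)
ref_size      = 4      # Lpp, bytes per page reference
min_occupancy = 0.5    # αmin

# Integer division in Ruby already rounds down, i.e. it is the floor function
# for positive operands.
d_max = page_size / (key_size + data_size)
d_min = (d_max * min_occupancy).floor
k_max = page_size / (key_size + ref_size)
k_min = (k_max * min_occupancy).floor

puts "dmax = #{d_max}, dmin = #{d_min}, kmax = #{k_max}, kmin = #{k_min}"
```

The large gap between dmax (4) and kmax (128) illustrates why moving the data out of the internal pages increases the fan-out so much.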

Number of Nodes for B+ Trees

(⌈a⌉ is the ceiling function of a, the smallest integer greater than or equal to a:
⌈a⌉∈ℤ ∧ a≦⌈a⌉ ∧ ¬∃b: b∈ℤ ∧ a≦b<⌈a⌉)

• Ndmax = ⌈n / dmin⌉ (example: 25,000)
(maximum number of leaf pages)
• Ndmin = ⌈n / dmax⌉ (example: 12,500)
(minimum number of leaf pages)
• Nkmax = ⌈Ndmax / kmin⌉ + ⌈Ndmax / kmin^2⌉ + ... (continue until a term equals 1, the root)
(maximum number of internal nodes)
(example: 391 + 7 + 1 = 399; height of B+ tree: 4; total number of nodes: 25,399)
• Nkmin = ⌈Ndmin / kmax⌉ + ⌈Ndmin / kmax^2⌉ + ... (continue until a term equals 1, the root)
(minimum number of internal nodes)
(example: 98 + 1 = 99; height of B+ tree: 3; total number of nodes: 12,599)
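The two sums can be checked with a short sketch that builds the internal layers bottom-up, dividing the number of pages in one layer by the fan-out to get the number of pages in the layer above. For positive integers this yields the same terms as the sums above. The function names ceil_div and internal_layers are hypothetical, and the input values are the example values from the preceding slides.

```ruby
# Ceiling of a / b for positive integers.
def ceil_div(a, b)
  (a + b - 1) / b
end

# Number of internal nodes in each layer above `leaves` leaf pages,
# from the layer just above the leaves up to the root.
def internal_layers(leaves, fanout)
  layers = []
  count = leaves
  while count > 1
    count = ceil_div(count, fanout)
    layers << count
  end
  layers
end

n = 50_000
d_min, d_max = 2, 4       # data items per leaf page
k_min, k_max = 64, 128    # children per internal node

worst = internal_layers(ceil_div(n, d_min), k_min)   # sparsest tree
best  = internal_layers(ceil_div(n, d_max), k_max)   # densest tree

puts "worst case: layers #{worst.inspect}, internal nodes #{worst.sum}, height #{worst.size + 1}"
puts "best case:  layers #{best.inspect}, internal nodes #{best.sum}, height #{best.size + 1}"
```

Even in the sparsest case, any of the 50,000 items is reached by reading at most 4 pages.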

Summary

• 2-3-4 trees and B(+) trees increase the degree of a binary tree, but keep the tree height uniform (all leaves are at the same depth)
• Red-black-trees and AVL-trees impose limitations on the variation of the tree height
• Balanced trees make it possible to implement the basic operations on a dictionary ADT in O(log n) worst-case time
• B-trees and B+ trees are extremely important for the implementation of file systems and databases on secondary storage

Glossary

• red-black-tree
• AVL-tree
• secondary storage
• B-tree
• B+ tree
• strengthen
• weaken
• uniform
• lowest layer
• occupancy
• floor function
• ceiling function
• degree