Balanced Trees

(平衡木)

Data Structures and Algorithms

9th lecture, November 12, 2015

http://www.sw.it.aoyama.ac.jp/2015/DA/lecture9.html

Martin J. Dürst

© 2009-15 Martin J. Dürst 青山学院大学

Today's Schedule

• Summary of last lecture
• Balanced trees for internal use
  • 2-3-4 tree
  • red-black-tree
  • AVL-tree
• Balanced trees for secondary storage
  • B-tree
  • B+ tree

Summary of Last Lecture

• A dictionary is an ADT allowing the insertion, deletion, and search of data items using a key
• With a simplistic implementation, some operations take O(n) time
• With a binary search tree, all operations are O(log n) on average, but O(n) in the worst case
• Unlike for sorting, this cannot be improved using randomization
(for quicksort, we can randomly select a pivot, but the order of insertions and deletions for a dictionary is determined externally)
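The worst case is easy to provoke: inserting keys in sorted order turns an unbalanced binary search tree into a linked list. The following is a minimal sketch; the names BSTNode, bst_insert, and bst_height are made up for this illustration and are not taken from the lecture's code.

```ruby
# Minimal unbalanced binary search tree (illustrative names only).
BSTNode = Struct.new(:key, :left, :right)

# Insert key iteratively and return the (possibly new) root.
def bst_insert(root, key)
  return BSTNode.new(key) if root.nil?
  node = root
  loop do
    if key < node.key
      if node.left.nil?
        node.left = BSTNode.new(key)
        break
      end
      node = node.left
    else
      if node.right.nil?
        node.right = BSTNode.new(key)
        break
      end
      node = node.right
    end
  end
  root
end

# Height = number of nodes on the longest path from the root downwards.
def bst_height(node)
  return 0 if node.nil?
  1 + [bst_height(node.left), bst_height(node.right)].max
end

sorted_root = nil
(1..100).each { |k| sorted_root = bst_insert(sorted_root, k) }

shuffled_root = nil
(1..100).to_a.shuffle(random: Random.new(1)).each do |k|
  shuffled_root = bst_insert(shuffled_root, k)
end

puts bst_height(sorted_root)    # 100: every node has only a right child
puts bst_height(shuffled_root)  # much smaller for a random insertion order
```

The shuffled order is only for comparison; a dictionary cannot choose the order in which its clients insert keys.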

Strengthening or Weakening the Invariants of Binary Trees

• For the implementation of a priority queue, we
  • weakened the total order (of a binary search tree) to a local order (between parent and child only)
  • strengthened the shape of a (general) binary tree to a complete binary tree
• To improve the worst-case performance of a binary search tree, we likewise have to consider strengthening or weakening its invariants

Top-Down 2-3-4 Trees

• Each (internal) node has 2, 3, or 4 children
• A node with k children stores k-1 keys and data items
(if all nodes have 2 children, a 2-3-4 tree is equal to a binary search tree)
• The keys in the internal nodes separate the key ranges in the subtrees
• The tree is of uniform height
• In the lowest layer of the tree, the nodes have no children
(implemented as a single unique empty node)

Search in 2-3-4 Trees

• Start from the root node
• If the key being searched for is found in the current node, then return the corresponding data item
• Otherwise, select the subtree based on this node's keys, and continue recursively

(each operation on a 2-3-4 tree is a generalization of the same operation on a binary search tree)
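The search steps above can be sketched as follows. The node layout (Node234 with parallel keys and values arrays) is an assumption made for this example, and the lowest layer is represented by an empty children array rather than by the single shared empty node mentioned earlier.

```ruby
# Hypothetical node layout: keys and values are parallel sorted arrays;
# children is empty in a leaf, otherwise it holds keys.size + 1 subtrees.
Node234 = Struct.new(:keys, :values, :children)

def search234(node, key)
  return nil if node.nil?
  # Find the first position whose key is not smaller than the search key.
  i = 0
  i += 1 while i < node.keys.size && key > node.keys[i]
  return node.values[i] if i < node.keys.size && node.keys[i] == key
  return nil if node.children.empty?   # lowest layer reached: not found
  search234(node.children[i], key)     # continue in the selected subtree
end

leaf = ->(keys) { Node234.new(keys, keys.map { |k| "item#{k}" }, []) }
root = Node234.new([10, 20], ["item10", "item20"],
                   [leaf.([3, 7]), leaf.([12, 15, 18]), leaf.([25])])

p search234(root, 15)  # "item15"
p search234(root, 20)  # "item20"
p search234(root, 8)   # nil
```

With exactly one key per node, the loop degenerates to the single comparison of a binary search tree.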

Insertion into 2-3-4 Trees

• Basic operation: Search downwards, insert new data item into leaf node
• If there are already 3 data items in the leaf node, this node has to be split
• If a node has to be split, a key and data item have to be inserted into the parent node
• This may trigger further splits in parents, potentially up to the root
• To avoid splits after insertion (difficult to implement),
nodes with 4 children are split preemptively on the way from the root to the leaf
• This version of 2-3-4 trees is called top-down 2-3-4 tree
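A possible sketch of top-down insertion is shown below. It is not the lecture's 9234tree.rb; the class TNode and the helper names are invented for this example, data items are left out, and keys are assumed to be distinct.

```ruby
# Hypothetical node: sorted keys (at most 3) and children (empty in a leaf).
class TNode
  attr_accessor :keys, :children
  def initialize(keys = [], children = [])
    @keys = keys
    @children = children
  end

  def leaf?
    children.empty?
  end

  def full?
    keys.size == 3
  end
end

# Split the full child at position i of parent: the middle key moves up,
# the two remaining keys become two nodes with one key each.
def split_child(parent, i)
  child = parent.children[i]
  left  = TNode.new([child.keys[0]], child.children.take(2))
  right = TNode.new([child.keys[2]], child.children.drop(2))
  parent.keys.insert(i, child.keys[1])
  parent.children[i, 1] = [left, right]
end

# Insert key and return the (possibly new) root.
def insert234(root, key)
  if root.full?                     # splitting the root adds one layer
    root = TNode.new([], [root])
    split_child(root, 0)
  end
  node = root
  until node.leaf?
    i = node.keys.count { |k| k < key }   # subtree to descend into
    if node.children[i].full?             # split preemptively on the way down
      split_child(node, i)
      i += 1 if key > node.keys[i]        # choose left or right half
    end
    node = node.children[i]
  end
  # The leaf is guaranteed to have room, so no split propagates upwards.
  node.keys.insert(node.keys.count { |k| k < key }, key)
  root
end

# Helpers for checking the result.
def inorder(node)
  return node.keys.dup if node.leaf?
  out = []
  node.children.each_with_index do |c, i|
    out.concat(inorder(c))
    out << node.keys[i] if i < node.keys.size
  end
  out
end

def leaf_depths(node, depth = 1)
  return [depth] if node.leaf?
  node.children.flat_map { |c| leaf_depths(c, depth + 1) }
end

tree = TNode.new
[50, 20, 70, 10, 30, 60, 80, 40, 90, 5].each { |k| tree = insert234(tree, k) }
p inorder(tree)            # keys in ascending order
p leaf_depths(tree).uniq   # one value only: the tree is of uniform height
```

Because every full node met on the way down is split before descending, the parent of a node being split always has room for the key that moves up.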

Deletion from 2-3-4 Trees

• More complicated than insertion (same as binary search tree)
• Find data item to be deleted, using search
• If the item to be deleted is not in a leaf, exchange with an item in a leaf
• Remove the item in the leaf
• If this results in a leaf node without data items, move (borrow) items from neighboring leaves
• If the situation cannot be fixed using moving, merge some nodes
• If the situation cannot be fixed using merging, address the problem one layer higher
• If the problem cannot be solved in the top layer, reduce the number of the layers

Efficiency of 2-3-4 Trees

• Maximum number of data items in a 2-3-4 tree of height h: n = 4^h - 1
• Minimum number of data items in a 2-3-4 tree of height h: n = 2^h - 1
• ⇒ The height of the tree is O(log n)
• The time needed for each operation is proportional to the height of the tree and therefore O(log n)
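The two bounds can be turned around to give the possible heights for a given number of items n, since 2^h - 1 ≤ n ≤ 4^h - 1. The function below is a small illustration (height_bounds is a made-up name) that uses only integer arithmetic.

```ruby
# Smallest and largest possible height of a 2-3-4 tree with n data items,
# derived from 2^h - 1 <= n <= 4^h - 1.
def height_bounds(n)
  h_min = 0
  h_min += 1 while 4**h_min - 1 < n          # smallest h with 4^h - 1 >= n
  h_max = 0
  h_max += 1 while 2**(h_max + 1) - 1 <= n   # largest h with 2^h - 1 <= n
  [h_min, h_max]
end

[15, 1_000, 1_000_000].each do |n|
  lo, hi = height_bounds(n)
  puts "n = #{n}: height between #{lo} and #{hi}"
end
# n = 15: height between 2 and 4
# n = 1000: height between 5 and 9
# n = 1000000: height between 10 and 19
```

Both bounds grow logarithmically in n, and the upper bound is about twice the lower bound.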

Implementation of 2-3-4 Tree

• Implementation in Ruby: 9234tree.rb, 9driver.rb
• Implementation of 2-3-4 trees is quite complicated
• Some memory (in nodes with 2 or 3 children) is unused
• Therefore, other balanced trees have been proposed

Red-Black-Trees

• Implementation of a 2-3-4 tree with a binary tree
• The edges of the original tree are black
• Nodes with 3 or 4 children are split into multiple nodes, coloring the internal edges red
• Two consecutive red edges are impossible/forbidden
• If this invariant is violated, rotations are used for restoration
• If only black edges are counted, the tree is of uniform height
• When all edges are considered, the maximum depth of a leaf is at most twice the minimum depth
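The two invariants (no two consecutive red edges, uniform height when counting only black edges) can be checked with a short recursive function. This is a sketch under an assumed representation: each node stores in a field red whether the edge from its parent is red; RBNode and black_height are names chosen for this example.

```ruby
# Hypothetical node: red is true when the edge from the parent is red.
RBNode = Struct.new(:key, :red, :left, :right)

# Returns the number of black edges on every path from above node down to
# an empty subtree (counting the edge leading into node), or nil if one of
# the invariants is violated.
def black_height(node, parent_red = false)
  return 0 if node.nil?
  return nil if node.red && parent_red        # two consecutive red edges
  l = black_height(node.left, node.red)
  r = black_height(node.right, node.red)
  return nil if l.nil? || r.nil? || l != r    # black height is not uniform
  l + (node.red ? 0 : 1)
end

# A 2-3-4 node with keys 10, 20, 30 becomes a black node with two red edges.
four_node = RBNode.new(20, false, RBNode.new(10, true), RBNode.new(30, true))

# Two red edges in a row: not allowed.
red_red = RBNode.new(20, false, RBNode.new(10, true, RBNode.new(5, true)))

# Only black edges, but the left side is deeper than the right side.
lopsided = RBNode.new(20, false, RBNode.new(10, false))

p black_height(four_node)  # 1
p black_height(red_red)    # nil
p black_height(lopsided)   # nil
```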

AVL-Trees

• Proposed by Adelson-Velskii and Landis (Адельсон-Вельский and Ландис) in 1962
• Oldest (binary) balanced tree
• Invariant: At each internal node, the difference between the heights of the subtrees is 1 or less
• The difference between the heights of the left and the right subtrees (-1, 0, 1) is stored in each internal node and kept up to date
• The tree height is limited to about 1.44 log₂ n
• Searching is slightly faster than for a red-black-tree
• Insertion and deletion are slightly more complicated than for a red-black-tree
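The factor 1.44 can be made plausible by looking at the sparsest AVL trees. A tree of height h with the fewest possible nodes consists of a root, a minimal subtree of height h-1, and a minimal subtree of height h-2. The sketch below (min_avl_nodes is a made-up name; height is counted in nodes on the longest path) compares h with log₂ n for such trees. For very small trees the simplified bound is not exact, so the example starts at h = 10.

```ruby
# Minimum number of nodes in an AVL tree of height h
# (the empty tree has height 0, a single node has height 1).
def min_avl_nodes(h)
  a, b = 0, 1                  # values for heights 0 and 1
  return a if h == 0
  (h - 1).times { a, b = b, a + b + 1 }
  b
end

[10, 20, 30].each do |h|
  n = min_avl_nodes(h)
  ratio = h / Math.log2(n)
  puts format("h = %d, n = %d, h / log2(n) = %.3f", h, n, ratio)
end
```

The printed ratios stay below 1.44 and approach it slowly as h grows.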

Secondary Storage

                 | Internal Memory        | Secondary Storage            | Secondary Storage
Access principle | random                 | random                       | linear
Technology       | dynamic RAM            | HD, SSD                      | magnetic tape
Access unit      | word                   | page                         | record
Unit size        | 32/64 bits (4/8 bytes) | 512/1024/2048/4096/... bytes | varying
Access time      | nanoseconds            | milliseconds                 | seconds or minutes

B-Trees

• Variant of 2-3-4 tree
• Each page is a node in the tree
• Maximise the number of keys per page
• The minimum number of keys per page is about half of the maximum

Page of a B-Tree

| ref. to subtree | key, data | ref. to subtree | key, data | ref. to subtree | ... | key, data | ref. to subtree |

B+ Trees

Starting with a B-tree, all data (except keys) is moved to the lowest layer of the tree

⇒ The number of keys and child nodes per internal node increases
(for practical applications, the size of a key is much smaller than the size of the data)

⇒ The height of the tree shrinks

(the overall access time is dominated by the number of pages that have to be fetched from secondary memory)
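To get a feeling for the effect, the following sketch compares the number of children per internal page for both kinds of tree. The sizes (1024-byte pages, 4-byte keys, 240-byte data items, 4-byte page references) are example values, and the B-tree page layout (m keys with data plus m + 1 references) is an assumption based on the page diagram above.

```ruby
page_size = 1024   # bytes per page
key_size  = 4      # bytes per key
data_size = 240    # bytes per data item (without key)
ref_size  = 4      # bytes per page reference

# B-tree page (assumed layout): m keys with data and m + 1 references
# must fit, i.e. m * (key + data + ref) + ref <= page_size.
b_keys   = (page_size - ref_size) / (key_size + data_size + ref_size)
b_fanout = b_keys + 1

# B+ tree internal page: only keys and references.
bplus_fanout = page_size / (key_size + ref_size)

puts b_fanout      # 5
puts bplus_fanout  # 128
```

With these values, an internal page of the B+ tree has about 25 times as many children, so far fewer layers are needed for the same number of data items.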

Internal Page of a B+ Tree

| ref. to subtree | key | ref. to subtree | key | ref. to subtree | ... | key | ref. to subtree |

Leaf Page of a B+ Tree

| key, data | key, data | key, data | ... | key, data |

Definition of Variables for B+ Trees

• n: Overall number of data items (example: 50,000)
• Lp: Page size (example: 1024 bytes)
• Lk: Key size (example: 4 bytes)
• Ld: Data size (one item, except key) (example: 240 bytes)
• Lpp: Size of page number (page reference) (example: 4 bytes)
• αmin: minimum occupancy (usually 0.5)

Items per Page for B+Trees

(⌊a⌋ is the floor function of a, the greatest integer less than or equal to a)

• dmax = ⌊Lp / (Lk + Ld)⌋ (example: 4)
(maximum number of data items per leaf)
• dmin = ⌊dmax αmin⌋ (example: 2)
(minimum number of data items per leaf)
• kmax = ⌊Lp / (Lk + Lpp)⌋ (example: 128)
(maximum number of children per internal node)
• kmin = ⌊kmax αmin⌋ (example: 64)
(minimum number of children per internal node)
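The example values can be reproduced with a few lines of Ruby. The variable names are chosen for this sketch; the comments give the corresponding symbols from the slides.

```ruby
page_size = 1024   # Lp
key_size  = 4      # Lk
data_size = 240    # Ld
ref_size  = 4      # Lpp
alpha_min = 0.5    # minimum occupancy

# Integer division of positive integers rounds down, like the floor function.
d_max = page_size / (key_size + data_size)
d_min = (d_max * alpha_min).floor
k_max = page_size / (key_size + ref_size)
k_min = (k_max * alpha_min).floor

puts "leaf:     #{d_min}..#{d_max} data items"   # leaf:     2..4 data items
puts "internal: #{k_min}..#{k_max} children"     # internal: 64..128 children
```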

Number of Nodes for B+Trees

(⌈a⌉ is the ceiling function of a, the smallest integer greater than or equal to a)

• Ndmax = ⌈n / dmin⌉ (example: 25,000)
(maximum number of leaves)
• Ndmin = ⌈n / dmax⌉ (example: 12,500)
(minimum number of leaves)
• Nkmax = ⌈Ndmax / kmin⌉ + ⌈Ndmax / kmin^2⌉ + ...
(maximum number of internal nodes)
(example: 391 + 7 + 1 = 399; height of B+tree: 4; total number of nodes: 25,399)
• Nkmin = ⌈Ndmin / kmax⌉ + ⌈Ndmin / kmax^2⌉ + ...
(minimum number of internal nodes)
(example: 98 + 1 = 99; height of B+tree: 3; total number of nodes: 12,599)
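The sums above end with the term that equals 1, which is the root. A small sketch that counts the internal nodes layer by layer (ceil_div and internal_layers are names made up for this example) reproduces the figures; dividing the node count of each layer by the number of children gives the same values as the terms of the sums.

```ruby
# Ceiling of a / b for positive integers.
def ceil_div(a, b)
  (a + b - 1) / b
end

# Number of internal nodes per layer, from the layer just above the leaves
# up to the root, if every internal node has fanout children.
def internal_layers(leaves, fanout)
  layers = []
  nodes = leaves
  while nodes > 1
    nodes = ceil_div(nodes, fanout)
    layers << nodes
  end
  layers
end

n = 50_000
d_min, d_max = 2, 4
k_min, k_max = 64, 128

max_leaves = ceil_div(n, d_min)                 # 25000
min_leaves = ceil_div(n, d_max)                 # 12500
worst = internal_layers(max_leaves, k_min)      # [391, 7, 1]
best  = internal_layers(min_leaves, k_max)      # [98, 1]

puts "worst case: #{worst.sum} internal nodes, height #{worst.size + 1}, #{max_leaves + worst.sum} nodes in total"
puts "best case:  #{best.sum} internal nodes, height #{best.size + 1}, #{min_leaves + best.sum} nodes in total"
```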

Summary

• 2-3-4 trees and B(+) trees allow nodes of higher degree than a binary tree, but keep the tree height uniform (all leaves are at the same depth)
• Red-black-trees and AVL-trees impose limitations on the variation of the tree height
• Balanced trees make it possible to implement the basic operations of a dictionary ADT in O(log n) time
• B-trees and B+ trees are extremely important for the implementation of file systems and databases on secondary storage

Glossary

• red-black-tree
• AVL-tree: AVL 木
• secondary storage
• B-tree: B 木
• B+ tree: B+木
• strengthen
• weaken
• uniform
• lowest layer
• occupancy
• floor function
• ceiling function
• degree