# Divide and Conquer, Mergesort

(分割統治法、マージソート)

## Data Structures and Algorithms

### 6th lecture, October 25, 2018

http://www.sw.it.aoyama.ac.jp/2018/DA/lecture6.html

### Martin J. Dürst

© 2009-18 Martin J. Dürst 青山学院大学

# Today's Schedule

• Leftovers, summary of last lecture, homework
• The importance of sorting
• Simple sorting algorithms: Bubble sort, selection sort, insertion sort
• Loops in Ruby
• Divide and conquer
• Merge sort
• Summary

# Summary of Last Lecture

• A priority queue is an important ADT
• Implementing a priority queue with an array or a linked list is not efficient
• In a heap, each parent has higher priority than its children
• In a heap, the highest priority item is at the root of a complete binary tree
• A heap is an efficient implementation of a priority queue
• Many data structures are defined using invariants
• The operations heapify_up and heapify_down are used to restore heap invariants
• A heap can be used for sorting, using heap sort

# Last Week's Homework

(no need to submit, but bring the sorting cards)

1. Complete the report (deadline October 24, 2018 (Wednesday), 19:00)
2. Cut the sorting cards, and bring them with you to the next lecture
3. Shuffle the sorting cards, and try to find a fast way to sort them. Play against others (who is fastest?).
4. Find five different applications of sorting (no need to submit)
5. Implement joining two (normal) heaps (no need to submit)
• Add the items of the smaller heap to the bigger heap (6heapmerge.rb)
• Use a binomial heap (binomial queue)
6. Think about the time complexity of creating a heap:
`heapify_down` will be called n/2 times and may take up to O(log n) each time.
Therefore, one guess for the overall time complexity is O(n log n).
However, this upper bound can be improved by careful analysis.
(no need to submit)

# Report: Manual Sorting: Problems Seen

• 218341.368 seconds (⇒about 61 hours)
• 61010·103·1010 (units? way too big)
• O(40000) (how many seconds could this be)
• Calulation of actual time backwards from big-O notation: 1second/operation, n=5000, O(n2) ⇒ 25'000'000 seconds?
• A O(n) algorithm (example: "5 seconds per page")
• For 12 people, having only one person work towards the end of the algorithm
• For humans, binary sorting is constraining (sorting into 3~10 parts is better)
• Using bubble sort (868 days without including breaks or sleep)
• Prepare 10¹⁰ boxes (problems: space, cost, distance for walking)
• Forgetting time for preparation, cleanup, breaks,...
• Submitting just a program
• Report too short

# Homework 3: Computational Complexity of `heapify_all`

• How `heapify_all` works: Apply `heapify_down` starting with lower layers
• The complexity of `heapify_down` is O(log n) or lower
• `heapify_all` may be Θ(n log n), but we should check
• Analysis for each layer:
| Layer | Size | Operations of `heapify_down` | Operations per layer |
|-------|------|------------------------------|----------------------|
| 0 (bottom) | n/2 | 0 (unnecessary) | 0·n/2 = 0 |
| 1 | n/4 | 1 | 1·n/4 |
| 2 | n/8 | 2 | 2·n/8 |
| 3 | n/16 | 3 | 3·n/16 |
| i | n/2^(i+1) | i | i·n/2^(i+1) |

Total: ∑₀≤i<log₂ n i·n/2^(i+1) = n/2 · ∑₀≤i<log₂ n i/2^i < n/2 · 2 = n = O(n) [6heapsum.rb]

• Conclusions:
• Time complexity can be lower than suggested by simple guessing
• It is possible to build a heap directly in O(n) time, but
adding items one-by-one will take O(n log n) in the worst case
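The layer-by-layer construction analyzed above can be sketched in Ruby. This is an illustrative max-heap version, not the course's actual heap code; the names `heapify_down` and `heapify_all` follow the lecture, the details are assumptions:

```ruby
# Restore the heap invariant by sinking the item at index i
# (max-heap stored in an array of size n; children of i are at 2i+1 and 2i+2).
def heapify_down(heap, i, n)
  loop do
    left    = 2 * i + 1
    right   = 2 * i + 2
    largest = i
    largest = left  if left  < n && heap[left]  > heap[largest]
    largest = right if right < n && heap[right] > heap[largest]
    return if largest == i                       # invariant holds, stop sinking
    heap[i], heap[largest] = heap[largest], heap[i]
    i = largest
  end
end

# Build a heap in place: apply heapify_down to every internal node,
# starting with the lowest layers -- O(n) overall, as the analysis above shows.
def heapify_all(heap)
  n = heap.size
  (n / 2 - 1).downto(0) { |i| heapify_down(heap, i, n) }
  heap
end

heapify_all([3, 9, 2, 7, 5])  # root is now the maximum item
```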

# Derivation

∑₀≤i≤∞ i/2^i =

= 0/1 + 1/2 + 2/4 + 3/8 + 4/16 + 5/32 + 6/64 + 7/128 + 8/256 + 9/512 + ...
< 1/2 + 2/4 + 4/8 + 4/8 + 8/32 + 8/32 + 8/32 + 8/32 + 16/512 + ...
= 1/2 + 1·2/4 + 2·4/8 + 4·8/32 + 8·16/512 + 16·32/131072 + ...
= 1/2 + 2^1/2^2 + 2^3/2^3 + 2^5/2^5 + 2^7/2^9 + 2^9/2^17 + 2^11/2^33 + 2^13/2^65 + ...
= 1/2 + ∑₀≤k≤∞ 2^((1+2k)−(2^k+1))
= 1/2 + ∑₀≤k≤∞ 2^(2k−2^k)
< 3.254
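The bound is easy to check numerically; the snippet below is an illustrative check (separate from the 6heapsum.rb mentioned earlier) showing that the partial sums stay well under 3.254:

```ruby
# Partial sums of ∑ i/2^i converge quickly towards 2,
# comfortably below the rough upper bound of 3.254 derived above.
sum = 0.0
(1..60).each { |i| sum += i / 2.0**i }
puts sum  # close to 2.0
```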

# Importance of Sorting

• Make output easy to understand and check (search by humans)
• Group related items together
• Preparation for search (example: binary search, index in databases, ...)
• Use as component in more complicated algorithms

# Simple Sorting Algorithms

• Bubble sort
• Selection sort
• Insertion sort

# Bubble Sort

• Compare neighboring items,
exchange if not in order
• Pass through the data from start to end, repeatedly
• The number of passes needed to fully order the data is O(n)
• The number of comparisons (and potential exchanges) in each pass is O(n)
• Time complexity is O(n²)

Possible improvements:

• Alternatively pass back and forth
• Remember the place of the last exchange to limit the range of exchanges
• Work in parallel

Pseudocode/example implementation: 6sort.rb
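6sort.rb is not reproduced here; a minimal sketch of the idea (illustrative only, without the improvements listed above) could look as follows:

```ruby
# Bubble sort sketch: repeatedly pass through the array,
# exchanging neighboring items that are not in order.
def bubble_sort(a)
  (a.length - 1).times do               # O(n) passes
    (a.length - 1).times do |i|         # O(n) comparisons per pass
      a[i], a[i + 1] = a[i + 1], a[i] if a[i] > a[i + 1]
    end
  end
  a
end

bubble_sort([5, 1, 4, 2, 8])  # => [1, 2, 4, 5, 8]
```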

# Various Ways to Loop in Ruby

• Looping a fixed number of times
• Looping with an index
• Many others, ...

# Looping a Fixed Number of Times

Syntax:

```ruby
number.times do
  # some work
end
```

Example:

```ruby
(length-1).times do
  # bubble
end
```

# Looping with an Index

Syntax:

```ruby
start.upto(limit) do |index|
  # some work using index
end
```

Example:

```ruby
0.upto(length-2) do |i|
  # select
end
```

# Selection Sort

• Find the smallest element, and exchange it with the first element
• Continue finding the smallest and exchanging it with the first element of the rest of the array
• The area at the start of the array that is fully sorted will get larger and larger
• Number of exchanges: O(n)
• Work needed to find smallest element: O(n)
• Overall time complexity: O(n²)
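The steps above can be sketched in Ruby (an illustration, not the lecture's 6sort.rb):

```ruby
# Selection sort sketch: find the smallest item of the unsorted rest
# and exchange it with the first element of that rest.
def selection_sort(a)
  0.upto(a.length - 2) do |i|
    min = i
    (i + 1).upto(a.length - 1) { |j| min = j if a[j] < a[min] }
    a[i], a[min] = a[min], a[i]   # at most one exchange per position: O(n) exchanges
  end
  a
end
```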

# Details of Time Complexity for Selection Sort

• The number of comparisons to find the minimum of n elements is n-1
• The size of the unsorted area initially is n elements, at the end 2 elements
• ∑₂≤i≤n (n−i+1) = (n−1) + (n−2) + ... + 2 + 1 = n · (n−1) / 2 = O(n²)

# Insertion Sort

• View the first element of the array as sorted (sorted area of length 1)
• Take the second element of the array and insert it at the right place into the sorted area
→ sorted area of length 2
• Continue with the following elements, making the sorted area longer and longer
• To insert an element into the already sorted area,
move any elements greater than the new element to the right by one
• The (worst-case) time complexity is O(n²)
• Insertion sort is fast if the data is already (almost) sorted
• Insertion sort can be used if data items are added into an already sorted array

Improvement: Using a sentinel: Add a first data item that is guaranteed to be smaller than any real data items. This saves one index check.
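An illustrative Ruby sketch, without the sentinel improvement (the index check a sentinel would save is marked in a comment):

```ruby
# Insertion sort sketch: grow the sorted area at the start of the array
# by inserting each following element at its place, moving greater
# elements one position to the right.
def insertion_sort(a)
  1.upto(a.length - 1) do |i|
    item = a[i]
    j = i - 1
    while j >= 0 && a[j] > item   # "j >= 0" is the check a sentinel saves
      a[j + 1] = a[j]             # move greater element to the right
      j -= 1
    end
    a[j + 1] = item
  end
  a
end
```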

# Details of Time Complexity for Insertion Sort

• The number of elements to be inserted is n
• The maximum number of comparisons/moves when inserting data item number i is i−1
• ∑₂≤i≤n (i−1) = 1 + 2 + ... + (n−2) + (n−1) = n · (n−1) / 2 = O(n²)

# Comparison: Selection Sort vs. Insertion Sort

| | Selection sort | Insertion sort |
|---|---|---|
| handling first item | O(n) | O(1) |
| handling last item | O(1) | O(n) |
| initial area | perfectly sorted | sorted, but some items still missing |
| rest of data | greater than any items in sorted area | any size possible |
| advantage | only O(n) exchanges | fast if (almost) sorted |
| disadvantage | always same speed | may get slower if many moves needed |

# Divide and Conquer

(Latin: divide et impera)

• Term of military strategy and tactics
• Problem solving method:
Solve a problem by dividing it into smaller problems
• Important principle for programming in general
(e.g. split a bigger program into various functions)
• Important design principle for algorithms and data structures

# Merge Sort (without recursion)

• Split the items to be sorted into two halves
• Separately sort each half
• Combine the two halves by merging them

# Merge

• Two-way merge and multi-way merge
• Create one sorted sequence from two or more sorted sequences
• Repeatedly select the smaller/smallest item from the input sequences
• When only one sequence is left, copy the rest of the items
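A two-way merge might be sketched like this (illustrative; a multi-way merge would repeatedly select the smallest head of more than two inputs):

```ruby
# Two-way merge sketch: repeatedly take the smaller head of the two
# sorted input arrays; when one side runs out, copy the rest.
# Note: this consumes (mutates) its inputs via Array#shift.
def merge(left, right)
  result = []
  until left.empty? || right.empty?
    result << (left.first <= right.first ? left.shift : right.shift)
  end
  result + left + right   # at most one of these is still non-empty
end

merge([1, 3, 5], [2, 4, 6, 7])  # => [1, 2, 3, 4, 5, 6, 7]
```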

# Merge Sort

• Recursively split the items to be sorted into two halves
• Parts with only 1 item are sorted by definition
• Combine the parts (in the reverse order of splitting them) by merging
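These three steps translate directly into a short recursive sketch (illustrative; it carries its own small two-way merge helper, here called `merge_two`, so the block is self-contained):

```ruby
# Recursive merge sort sketch.
def merge_sort(a)
  return a if a.length <= 1   # parts with only 1 item are sorted by definition
  mid = a.length / 2          # split: index calculation only
  merge_two(merge_sort(a[0...mid]), merge_sort(a[mid..-1]))
end

# Create one sorted array from two sorted arrays.
def merge_two(left, right)
  result = []
  until left.empty? || right.empty?
    result << (left.first <= right.first ? left.shift : right.shift)
  end
  result + left + right       # copy the rest of the remaining input
end

merge_sort([5, 2, 8, 1, 9, 3])  # => [1, 2, 3, 5, 8, 9]
```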

# Time Complexity of Merge Sort

• Split is possible in O(1) time (index calculation only)
• Merging n items takes O(n) time
• Recurrence:
M(1) = 0
M(n) = 1 + 2·M(n/2)(*) + n
• Discovering a pattern by repeated substitution:
M(n) = 1 + 2·M(n/2) + n =
= 1 + 2·(1 + 2·M(n/2/2) + n/2) + n =
= 1 + 2 + 4·M(n/4) + n + n =
= 1 + 2 + 4·(1 + 2·M(n/4/2) + n/4) + n + n =
= 1 + 2 + 4 + 8·M(n/8) + n + n + n =
= 2^k − 1 + 2^k·M(n/2^k) + k·n
• Using M(1) = 0: n/2^k = 1 ⇒ k = log₂ n
• M(n) = n − 1 + n·log₂ n
• Asymptotic time complexity: O(n log n)

(*) more exactly, M(⌈n/2⌉) + M(⌊n/2⌋) rather than 2 M(n/2)
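For n a power of two, the closed form can be checked against the recurrence directly (a quick sketch; `m` simply evaluates the recurrence above):

```ruby
# M(1) = 0;  M(n) = 1 + 2·M(n/2) + n   (n a power of two)
def m(n)
  n == 1 ? 0 : 1 + 2 * m(n / 2) + n
end

# Closed form derived above: M(n) = n − 1 + n·log₂ n
def m_closed(n)
  n - 1 + n * Math.log2(n).to_i
end

m(8)         # => 31
m_closed(8)  # => 31
```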

# Properties of Merge Sort

• Merging means copying all elements
⇒ We need twice the memory of the original data
• Merge sort is better suited for external memory than for internal memory
• External memory:
• Punchcards
• Magnetic tapes
• Hard disks (HD)
• Solid state drives (SSD)

# Summary

• Simple sorting algorithms:
• Bubble sort (easiest to implement)
• Selection sort (only O(n) data exchanges)
• Insertion sort (fast when already (almost) sorted)
• Simple sorting algorithms are all O(n²)
• Merge sort is based on divide and conquer
• Merge sort is O(n log n) (same as heap sort)

# Homework for Next Time

• Using the sorting cards, play with your friends to see which algorithms may be faster.
(Example: Two players, one player uses selection sort, one player uses insertion sort, who wins?)

# Glossary

bubble sort
バブル整列法、バブルソート
selection sort
選択整列法、選択ソート
insertion sort
挿入整列法、挿入ソート
sentinel
番兵
index
添字、インデックス
divide and conquer
分割統治法
military strategy
軍事戦略
tactics
戦術
design principle
設計原則
merge sort
マージソート
merge
併合、マージ
2-way merge
2 ウェイ併合
multiway merge
マルチウェイ併合
external memory
外部記憶
internal memory
内部記憶
punchcard
パンチカード
magnetic tape
磁気テープ
hard disk
ハードディスク