堆(Heap)不是一个序列式容器,在C++中也不属于STL容器是组件
堆是幕后英雄,在python中有一个模块为heapq,意思是用heap实现的 priority queue,实际上是一个小顶堆。在排序算法中有堆排序,在Topk算法中,大顶堆\小顶堆更是功不可没
本文参考python源代码,了解堆的实现与使用
Usage:
heap = [] # creates an empty heap
heappush(heap, item) # pushes a new item on the heap
item = heappop(heap) # pops the smallest item from the heap
item = heap[0] # smallest item on the heap without popping it
heapify(x) # transforms list into a heap, in-place, in linear time
item = heapreplace(heap, item) # pops and returns smallest item, and adds
# new item; the heap size is unchanged
对于一个想变成Heap的数组heap=[]
,heapq 模块主要有5个函数:heappush()
,把一个元素放入堆中;heappop()
,从堆中取出一个元素;heapify()
,把一个列表变成一个堆;nlargest()
和 nsmallest()
分别提供列表中最大或最小的N个元素。
heappush()
和heappop()
,代码和注释:def cmp_lt(x, y):
# Use __lt__ if available; otherwise, try __le__.
# In Py3.x, only __lt__ will be called.
return (x < y) if hasattr(x, '__lt__') else (not y <= x)
def heappush(heap, item):
"""Push item onto heap, maintaining the heap invariant."""
heap.append(item)
_siftdown(heap, 0, len(heap)-1)
def heappop(heap):
"""Pop the smallest item off the heap, maintaining the heap invariant."""
lastelt = heap.pop() # raises appropriate IndexError if heap is empty
if heap:
returnitem = heap[0]
heap[0] = lastelt
_siftup(heap, 0)
else:
returnitem = lastelt
return returnitem
其中push的实现方法是,在数组尾部插入,然后将这个最尾部的叶子节点不断冒泡到合适位置_siftdown()
pop的实现方法是,将数组尾部的叶子节点取出,替换到顶部(这时明显跟节点已被取走,堆的性质已被打破),这时调整整个堆,把parent的节点中最小的子节点child上升到parent节点上来,直到叶子节点,然后再将刚才拿出来那个最后一个叶子节点放回他该在的位置_siftup()
def _siftdown(heap, startpos, pos):
newitem = heap[pos]
# Follow the path to the root, moving parents down until finding a place
# newitem fits.
while pos > startpos:
parentpos = (pos - 1) >> 1
parent = heap[parentpos]
if cmp_lt(newitem, parent):
heap[pos] = parent
pos = parentpos
continue
break
heap[pos] = newitem
def _siftup(heap, pos):
endpos = len(heap)
startpos = pos
newitem = heap[pos]
# Bubble up the smaller child until hitting a leaf.
childpos = 2*pos + 1 # leftmost child position
while childpos < endpos:
# Set childpos to index of smaller child.
rightpos = childpos + 1
if rightpos < endpos and not cmp_lt(heap[childpos], heap[rightpos]):
childpos = rightpos
# Move the smaller child up.
heap[pos] = heap[childpos]
pos = childpos
childpos = 2*pos + 1
# The leaf at pos is empty now. Put newitem there, and bubble it up
# to its final resting place (by sifting its parents down).
heap[pos] = newitem
_siftdown(heap, startpos, pos)
_siftdown
的实现很简单,新插入的元素是newitem,从叶子节点开始,不断向父节点挑战,只要父节点不小于新元素,就降级为自己的子节点,直到某一次挑战失败,将newitem赋值到当前节点上
_siftup
也挺暴力的,从起始节点开始(此时该起始节点处起始值是空的(已被pop出去),各个子节点群龙无首争此帝位),不断在自己的子节点中拿出最小的赋值给自己,然后向下寻找,一直到最后一个节点,将最后一个节点赋值成新元素,再进行一次冒泡
def heapify(x):
"""Transform list into a heap, in-place, in O(len(x)) time."""
n = len(x)
# Transform bottom-up. The largest index there's any point to looking at
# is the largest with a child index in-range, so must have 2*i + 1 < n,
# or i < (n-1)/2. If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
# j-1 is the largest, which is n//2 - 1. If n is odd = 2*j+1, this is
# (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
for i in reversed(xrange(n//2)):
_siftup(x, i)
因为一个生成好的heap显然先一半元素必然是后一半元素中某一个的父,那么只要遍历后一半数组,显然前一半也会被调整为一个heap了
heappushpop()
def heappushpop(heap, item):
"""Fast version of a heappush followed by a heappop."""
if heap and cmp_lt(heap[0], item):
item, heap[0] = heap[0], item
_siftup(heap, 0)
return item
放之前先比较了,新元素和堆顶元素的大小,如果比堆顶元素还小的话,那就不用放了
如果大于堆顶元素的话,该函数是heappop()
的一个变种,将堆顶pop出来,不用叶子元素替代堆顶了,而用新元素,然后照样进行堆的调整
nsmallest()
和nlargest()
def nlargest(n, iterable):
"""Find the n largest elements in a dataset.
Equivalent to: sorted(iterable, reverse=True)[:n]
"""
if n < 0:
return []
it = iter(iterable)
result = list(islice(it, n))
if not result:
return result
heapify(result)
_heappushpop = heappushpop
for elem in it:
_heappushpop(result, elem)
result.sort(reverse=True)
return result
def nsmallest(n, iterable):
"""Find the n smallest elements in a dataset.
Equivalent to: sorted(iterable)[:n]
"""
if n < 0:
return []
it = iter(iterable)
result = list(islice(it, n))
if not result:
return result
_heapify_max(result)
_heappushpop = _heappushpop_max
for elem in it:
_heappushpop(result, elem)
result.sort()
return result
TopK中,先从原list中取出k个元素生成一个大小为k的Heap,然后将原List在这个Heap上pushpop一遍,小的元素一定会被push进去又被pop出来的,剩下的一定是K个大的元素啦
BottomK中,做了一个反向,Heap模块中凡是形如_heapify_max
、_heappop_max
的,都不再使用小顶堆而换成了大顶堆
貌似原来不是这样的,参考Python heapq 模块的实现
[1]Python heapq 模块的实现
[2]Python heapq doc
[3]Python heapq source
[4]Python使用heapq实现小顶堆(TopK大)、大顶堆(BtmK小)