序列式适配器-Heap

堆(Heap)不是一个序列式容器,在C++中也不属于STL容器是组件

堆是幕后英雄,在python中有一个模块为heapq,意思是用heap实现的 priority queue,实际上是一个小顶堆。在排序算法中有堆排序,在Topk算法中,大顶堆\小顶堆更是功不可没

本文参考python源代码,了解堆的实现与使用

一个序列式适配器总要考虑用什么序列式容器来做基础,从源代码看出,python的堆也是用List来做基础的,参考下面这段注释

Usage:

heap = [] # creates an empty heap
heappush(heap, item) # pushes a new item on the heap
item = heappop(heap) # pops the smallest item from the heap
item = heap[0] # smallest item on the heap without popping it
heapify(x) # transforms list into a heap, in-place, in linear time
item = heapreplace(heap, item) # pops and returns smallest item, and adds
                               # new item; the heap size is unchanged

对于一个想变成Heap的数组heap=[],heapq 模块主要有5个函数:heappush(),把一个元素放入堆中;heappop(),从堆中取出一个元素;heapify(),把一个列表变成一个堆;nlargest()nsmallest() 分别提供列表中最大或最小的N个元素。

其中最重要的显然是heappush()heappop(),代码和注释:

def cmp_lt(x, y):
    # Use __lt__ if available; otherwise, try __le__.
    # In Py3.x, only __lt__ will be called.
    return (x < y) if hasattr(x, '__lt__') else (not y <= x)

def heappush(heap, item):
    """Push item onto heap, maintaining the heap invariant."""
    heap.append(item)
    _siftdown(heap, 0, len(heap)-1)

def heappop(heap):
    """Pop the smallest item off the heap, maintaining the heap invariant."""
    lastelt = heap.pop() # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup(heap, 0)
    else:
        returnitem = lastelt
    return returnitem

其中push的实现方法是,在数组尾部插入,然后将这个最尾部的叶子节点不断冒泡到合适位置_siftdown()

pop的实现方法是,将数组尾部的叶子节点取出,替换到顶部(这时明显跟节点已被取走,堆的性质已被打破),这时调整整个堆,把parent的节点中最小的子节点child上升到parent节点上来,直到叶子节点,然后再将刚才拿出来那个最后一个叶子节点放回他该在的位置_siftup()

_siftdown()和_siftup()的代码:

def _siftdown(heap, startpos, pos):
    newitem = heap[pos]
    # Follow the path to the root, moving parents down until finding a place
    # newitem fits.
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if cmp_lt(newitem, parent):
            heap[pos] = parent
            pos = parentpos
            continue
        break
    heap[pos] = newitem

def _siftup(heap, pos):
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    # Bubble up the smaller child until hitting a leaf.
    childpos = 2*pos + 1 # leftmost child position
    while childpos < endpos:
        # Set childpos to index of smaller child.
        rightpos = childpos + 1
        if rightpos < endpos and not cmp_lt(heap[childpos], heap[rightpos]):
            childpos = rightpos
        # Move the smaller child up.
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    # The leaf at pos is empty now. Put newitem there, and bubble it up
    # to its final resting place (by sifting its parents down).
    heap[pos] = newitem
    _siftdown(heap, startpos, pos)

_siftdown的实现很简单,新插入的元素是newitem,从叶子节点开始,不断向父节点挑战,只要父节点不小于新元素,就降级为自己的子节点,直到某一次挑战失败,将newitem赋值到当前节点上

_siftup也挺暴力的,从起始节点开始(此时该起始节点处起始值是空的(已被pop出去),各个子节点群龙无首争此帝位),不断在自己的子节点中拿出最小的赋值给自己,然后向下寻找,一直到最后一个节点,将最后一个节点赋值成新元素,再进行一次冒泡

有了前面的基础,heapify也就很简单了

def heapify(x):
    """Transform list into a heap, in-place, in O(len(x)) time."""
    n = len(x)
    # Transform bottom-up. The largest index there's any point to looking at
    # is the largest with a child index in-range, so must have 2*i + 1 < n,
    # or i < (n-1)/2. If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
    # j-1 is the largest, which is n//2 - 1. If n is odd = 2*j+1, this is
    # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
    for i in reversed(xrange(n//2)):
        _siftup(x, i)

因为一个生成好的heap显然先一半元素必然是后一半元素中某一个的父,那么只要遍历后一半数组,显然前一半也会被调整为一个heap了

下面是一个技巧的放一个拿一个(不是拿一个放一个哦)函数:heappushpop()

def heappushpop(heap, item):
    """Fast version of a heappush followed by a heappop."""
    if heap and cmp_lt(heap[0], item):
        item, heap[0] = heap[0], item
        _siftup(heap, 0)
    return item

放之前先比较了,新元素和堆顶元素的大小,如果比堆顶元素还小的话,那就不用放了

如果大于堆顶元素的话,该函数是heappop()的一个变种,将堆顶pop出来,不用叶子元素替代堆顶了,而用新元素,然后照样进行堆的调整

然后,这个模块实现了topk算法,nsmallest()nlargest()

def nlargest(n, iterable):
    """Find the n largest elements in a dataset.

    Equivalent to: sorted(iterable, reverse=True)[:n]
    """
    if n < 0:
        return []
    it = iter(iterable)
    result = list(islice(it, n))
    if not result:
        return result
    heapify(result)
    _heappushpop = heappushpop
    for elem in it:
        _heappushpop(result, elem)
    result.sort(reverse=True)
    return result

def nsmallest(n, iterable):
    """Find the n smallest elements in a dataset.

    Equivalent to: sorted(iterable)[:n]
    """
    if n < 0:
        return []
    it = iter(iterable)
    result = list(islice(it, n))
    if not result:
        return result
    _heapify_max(result)
    _heappushpop = _heappushpop_max
    for elem in it:
        _heappushpop(result, elem)
    result.sort()
    return result

TopK中,先从原list中取出k个元素生成一个大小为k的Heap,然后将原List在这个Heap上pushpop一遍,小的元素一定会被push进去又被pop出来的,剩下的一定是K个大的元素啦

BottomK中,做了一个反向,Heap模块中凡是形如_heapify_max_heappop_max的,都不再使用小顶堆而换成了大顶堆

貌似原来不是这样的,参考Python heapq 模块的实现

参考

[1]Python heapq 模块的实现
[2]Python heapq doc
[3]Python heapq source
[4]Python使用heapq实现小顶堆(TopK大)、大顶堆(BtmK小)