Floyd's Sampling Algorithm

Original link: https://buttondown.com/jaffray/archive/floyds-sampling-algorithm/

Floyd's sampling algorithm is a surprisingly unintuitive method for choosing a random k-element subset of a set of n elements. Unlike algorithms such as reservoir sampling, its logic is not immediately obvious. The algorithm iterates over candidate elements and, for each one, draws a random index between 1 and the current element's value. If that index is already in the chosen set, it adds the current element; otherwise, it adds the randomly drawn index. There are two ways to understand why it works. One perspective shows that the algorithm maintains a uniform distribution by exhibiting a mapping between k-subsets and (k+1)-subsets at each iteration. The other, from Paul, connects it to the Fisher-Yates shuffle: Floyd's algorithm essentially simulates the last k swaps of an upward Fisher-Yates shuffle, building the sample directly by either keeping an element in the "tail" (the chosen set) or replacing it with a new one. This connection reveals the algorithm's elegance: it is a direct implementation of the shuffling process, focused only on the final sample.


Original Article

I love sampling algorithms. Here's the sampling algorithm that I find most magical. We want to generate a subset of {1, 2, ..., n} of size k.

import random

def floyd(n, k):
    s = set()
    for i in range(n - k + 1, n + 1):
        t = random.randint(1, i)
        if t in s:
            s.add(i)
        else:
            s.add(t)
    return s

I learned about this algorithm the canonical way all good algorithm lore is imparted: one of my old coworkers wrote this comment in some code I was looking at once:

// Use Robert Floyd's algorithm to generate numSplits distinct integers
// between 0 and len(splitKeys), just because it's so cool!
chosen := make(map[int]struct{})
for j := len(splitKeys) - numSplits; j < len(splitKeys); j++ {
    t := fit.rng.Intn(j + 1)
    if _, alreadyChosen := chosen[t]; !alreadyChosen {
        // Insert T.
        chosen[t] = struct{}{}
    } else {
        // Insert J.
        chosen[j] = struct{}{}
    }
}

Unlike, say, Reservoir Sampling or something, this algorithm sort of resists an easy intuition to me. The branch makes it weird, and makes the usual combinatorial or visual arguments tricky. In fact, I find even a casual intuition hard to grasp. Here are two possible ways to think about what makes it work.

First Perspective

Here's one way to think about it: each iteration of the algorithm takes a uniformly distributed k-subset of [1..n] plus a uniform random draw t from [1..n+1], and produces a (k+1)-subset of [1..n+1]. If we can show that the map from (k-subset, t) pairs to (k+1)-subsets is (k+1)-to-1, then the resulting set is also uniformly distributed.

So, we need to show that each possible (k+1)-subset S of [1..n+1] has exactly k+1 inputs that map to it:

  • sets containing n+1 come from the k-subset S \ {n+1} paired with any of the k+1 elements of S as the draw t (if t is already in the set, the algorithm adds n+1; if t = n+1 itself, it adds t directly), and
  • sets not containing n+1 come from one of the k+1 ways of removing an element from S and then drawing that missing element as t.

And that's it; the result is uniformly distributed.
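This counting can be verified exhaustively for small n and k. Every full sequence of draws is equally likely (each step's draw is uniform over a fixed range, so every sequence has the same product of probabilities), which means plain counting over all sequences should find every k-subset the same number of times. A quick sketch; the enumeration harness here is mine, not from the post:

```python
import itertools
from collections import Counter
from math import comb

def floyd_with_draws(n, k, draws):
    # Run Floyd's algorithm with a fixed sequence of draws standing
    # in for random.randint(1, i) at each step.
    s = set()
    for t, i in zip(draws, range(n - k + 1, n + 1)):
        if t in s:
            s.add(i)
        else:
            s.add(t)
    return frozenset(s)

n, k = 5, 3
# Enumerate every possible draw sequence: t ranges over [1, i] at step i.
counts = Counter(
    floyd_with_draws(n, k, draws)
    for draws in itertools.product(
        *[range(1, i + 1) for i in range(n - k + 1, n + 1)]
    )
)

assert len(counts) == comb(n, k)       # every 3-subset of {1..5} shows up...
assert len(set(counts.values())) == 1  # ...equally often
```

For n = 5, k = 3 there are 3 · 4 · 5 = 60 draw sequences and C(5,3) = 10 subsets, so each subset appears exactly 6 times.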

This is not so bad, I think it basically makes sense to me. But I also learned of another way of thinking about this algorithm while trying to write this post.

Second Perspective

This one comes from Paul, and it notes a connection between this algorithm and the Fisher-Yates shuffle for permuting an array uniformly at random. Specifically, the "upward" variation that looks like this:

import random

def shuffle(a):
    for i in range(0, len(a)):
        j = random.randint(0, i)
        a[i], a[j] = a[j], a[i]

This builds up a permutation incrementally: after step i, we have a uniformly random permutation of a[0..i], which the next step extends to a permutation of a[0..i+1].
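That invariant can be checked exhaustively for a tiny array (the harness below is mine): the shuffle is driven entirely by the draws j, each full draw sequence is equally likely, and there are exactly n! of them, so the shuffle is uniform exactly when each permutation comes from exactly one sequence:

```python
import itertools
from collections import Counter

def shuffle_with_draws(n, draws):
    # The upward Fisher-Yates shuffle with a fixed draw sequence,
    # draws[i] standing in for random.randint(0, i).
    a = list(range(n))
    for i, j in enumerate(draws):
        a[i], a[j] = a[j], a[i]
    return tuple(a)

n = 4
counts = Counter(
    shuffle_with_draws(n, draws)
    for draws in itertools.product(*[range(i + 1) for i in range(n)])
)
assert len(counts) == 24           # all 4! permutations occur...
assert set(counts.values()) == {1} # ...each from exactly one draw sequence
```

In other words, the upward shuffle is a bijection from draw sequences to permutations.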

Imagine plucking off the last k elements of an array we just shuffled: those form a uniformly random sample of k elements. The observation here is that Floyd's algorithm is just doing the last k swaps of that shuffle, directly.

If you call the last k positions of the array the "tail," then at each tail step i, either we:

  • swap a[i] with an a[j] that's also in the tail, in which case the value i stays in the tail forever and is part of the sample, or
  • swap it with an a[j] outside the tail, in which case that value lives in the tail forever and is part of the sample.

These two cases correspond directly to the two branches of Floyd's algorithm, and following either process we wind up with the same sample (up to some reordering).
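Put in code, sampling by "doing only the last k swaps and reading off the tail" looks something like this. The function name is mine; the shortcut rests on the observation above that the earlier swaps only touch prefix positions (j ≤ i < n − k) and so can be skipped without changing which set of values lands in the tail:

```python
import random

def sample_via_tail(n, k):
    # Start from the identity permutation of [1..n]; do only the last
    # k swaps of the upward shuffle, then read off the tail. The set
    # of tail values matches Floyd's output distribution.
    a = list(range(1, n + 1))
    for i in range(n - k, n):
        j = random.randint(0, i)
        a[i], a[j] = a[j], a[i]
    return set(a[n - k:])

sample = sample_via_tail(10, 4)
assert len(sample) == 4 and sample <= set(range(1, 11))
```

This does k swaps and k random draws, just like Floyd's algorithm, but pays O(n) to materialize the array; Floyd's version keeps only the chosen set.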
