Came up with an algorithm for sorting an array of large objects; can anyone tell me what this algorithm is called? (Couldn't find it on Google)

Question

Fra*_*ier 3 c++ arrays sorting algorithm

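I needed to sort an array of large objects, and it got me thinking: is there a way to reduce the number of swaps?

So I used quicksort (any other fast sort should work here too) to sort indices into the array rather than the elements themselves; indices are cheap to swap. Then I used those indices to swap the actual objects into their final places. Unfortunately, this uses O(n) extra space to store the indices. The code below illustrates the algorithm (which I call IndexSort). In my tests it appears to be faster than plain quicksort for arrays of large objects.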

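#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>
using namespace std;

template <class Itr>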
void IndexSort(Itr begin, Itr end)
{
    const size_t count = end - begin;

    // Create indices
    vector<size_t> ind(count);
    iota(ind.begin(), ind.end(), 0);

    // Sort indices
    sort(ind.begin(), ind.end(), [&begin] (const size_t i, const size_t j)
    {
        return begin[i] < begin[j];
    });

    // Create indices to indices. This provides
    // constant time search in the next step.
    vector<size_t> ind2(count);
    for(size_t i = 0; i < count; ++i)
        ind2[ind[i]] = i;

    // Swap the objects into their final places
    for(size_t i = 0; i < count; ++i)
    {
        if( ind[i] == i )
            continue;

        swap(begin[i], begin[ind[i]]);

        const size_t j = ind[i];

        swap(ind[i], ind[ind2[i]]);
        swap(ind2[i], ind2[j]);
    }
}


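I have now measured the swaps (of the large objects) performed by both quicksort and IndexSort, and found that quicksort performs far more of them. So I understand why IndexSort could be faster.

But can anyone with a more academic background explain why and how this algorithm actually works? (It is not intuitive to me, even though I somehow came up with it.)

Thanks!

Edit: The following code was used to verify the results of IndexSort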

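#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <vector>
using namespace std;

// A class whose objects will be large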
struct A
{
    int id;
    char data[1024];

    // Use the id to compare less than ordering (for simplicity)
    bool operator < (const A &other) const
    {
        return id < other.id;
    }

    // Copy assign all data from another object
    // (guarded, because memcpy on overlapping memory is undefined)
    A &operator = (const A &other)
    {
        if( this != &other )
            memcpy(this, &other, sizeof(A));
        return *this;
    }
};

int main()
{
    const size_t arrSize = 1000000;

    // Create an array of objects to be sorted
    vector<A> randArray(arrSize);
    for( auto &item: randArray )
        item.id = rand();

    // arr1 will be sorted using quicksort
    vector<A> arr1(arrSize);
    copy(randArray.begin(), randArray.end(), arr1.begin());

    // arr2 will be sorted using IndexSort
    vector<A> arr2(arrSize);
    copy(randArray.begin(), randArray.end(), arr2.begin());

    {
        // Measure time for this
        sort(arr1.begin(), arr1.end());
    }

    {
        // Measure time for this
        IndexSort(arr2.begin(), arr2.end());
    }

    // Check if IndexSort yielded the same result as quicksort
    if( memcmp(arr1.data(), arr2.data(), sizeof(A) * arr1.size()) != 0 )
        cout << "sort failed" << endl;

    return 0;
}


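Edit: Made the test less pathological; reduced the size of the large object class to just 1024 bytes (plus one int), and increased the number of objects to be sorted to one million. IndexSort is still significantly faster than quicksort.

Edit: This definitely needs more testing. But it makes me wonder: what if std::sort could check the object size at compile time and, depending on some size threshold, choose either the existing quicksort implementation or this IndexSort implementation?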
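
The size-based dispatch imagined here could be sketched roughly as follows with C++17 `if constexpr`. This is only an illustration under stated assumptions: the name `SizeAwareSort` and the 256-byte cut-off `kLargeObjectBytes` are hypothetical, std::sort does nothing of the kind, and a real threshold would have to be measured. The large-object branch is a compact index sort that needs random-access iterators.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <utility>
#include <vector>

// Assumed cut-off in bytes; purely illustrative, not a measured value.
constexpr std::size_t kLargeObjectBytes = 256;

// Hypothetical wrapper: picks a strategy from sizeof(value_type) at compile time.
template <class Itr>
void SizeAwareSort(Itr begin, Itr end)
{
    using T = typename std::iterator_traits<Itr>::value_type;

    if constexpr (sizeof(T) < kLargeObjectBytes)
    {
        std::sort(begin, end);  // small objects: sort them directly
    }
    else
    {
        // Large objects: sort cheap indices first.
        assert(end - begin >= 0);
        const std::size_t count = static_cast<std::size_t>(end - begin);
        std::vector<std::size_t> ind(count);
        std::iota(ind.begin(), ind.end(), std::size_t{0});
        std::sort(ind.begin(), ind.end(),
                  [begin](std::size_t i, std::size_t j) { return begin[i] < begin[j]; });

        // Then walk each cycle of the permutation, swapping objects into place.
        for (std::size_t i = 0; i < count; ++i)
        {
            std::size_t cur = i;
            while (ind[cur] != i)
            {
                const std::size_t next = ind[cur];
                std::swap(begin[cur], begin[next]);
                ind[cur] = cur;  // slot `cur` now holds its final object
                cur = next;
            }
            ind[cur] = cur;      // last slot of the cycle (or a fixed point)
        }
    }
}
```

Because the branch is chosen at compile time, only one of the two code paths is instantiated for a given element type.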

Also, IndexSort could be described as an "in-place tag sort" (see samgak's answer and my comments below).

Answer 1

sam*_*gak 5

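It appears to be a tag sort:

For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.

One way to work around this problem, which works well when complex records (such as in a relational database) are being sorted by a relatively small key field, is to create an index into the array and then sort the index, rather than the entire array. (A sorted version of the entire array can then be produced with one pass, reading from the index, but often even that is unnecessary, as having the sorted index is adequate.) Because the index is much smaller than the entire array, it may fit easily in memory where the entire array would not, effectively eliminating the disk-swapping problem. This procedure is sometimes called "tag sort".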
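
The procedure quoted above (sort an index, then produce the sorted array in one pass) might look roughly like this in C++. The helper names `SortedIndex` and `GatherSorted` are made up for illustration; unlike the IndexSort in the question, this sketch builds a separate sorted copy instead of permuting in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the indices of `items` in sorted order
// without moving any element of `items`.
template <class T>
std::vector<std::size_t> SortedIndex(const std::vector<T>& items)
{
    std::vector<std::size_t> index(items.size());
    std::iota(index.begin(), index.end(), std::size_t{0});
    std::sort(index.begin(), index.end(),
              [&items](std::size_t i, std::size_t j) { return items[i] < items[j]; });
    return index;
}

// Hypothetical helper: a single pass over the index builds the sorted copy,
// so each object is copied exactly once.
template <class T>
std::vector<T> GatherSorted(const std::vector<T>& items,
                            const std::vector<std::size_t>& index)
{
    assert(index.size() == items.size());
    std::vector<T> sorted;
    sorted.reserve(items.size());
    for (std::size_t i : index)
        sorted.push_back(items[i]);
    return sorted;
}
```

When only the order is needed, the result of `SortedIndex` can be used directly and the gather pass skipped, as the quote notes.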

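As noted above, tag sort can be used to sort large amounts of data that cannot fit in memory. However, even when the data does fit in memory, it still requires fewer memory read and write operations for arrays of large objects, as your solution illustrates, because entire objects are not copied on every exchange.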
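
One way to see why so few object moves are needed: the sorted index is a permutation, and any permutation splits into independent cycles. Following each cycle writes every displaced object into its final slot exactly once, at the cost of one temporary per cycle. The sketch below shows that idea; `ApplyIndexInPlace` is a hypothetical name, and this is a variant of the question's final loop that needs only one index array, not the asker's exact code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: rearranges `items` so that the new items[i] is the old
// items[index[i]]. `index` must be a permutation of 0..n-1; it is consumed
// (reset to the identity) as the cycles are resolved.
template <class T>
void ApplyIndexInPlace(std::vector<T>& items, std::vector<std::size_t>& index)
{
    assert(index.size() == items.size());
    for (std::size_t start = 0; start < items.size(); ++start)
    {
        if (index[start] == start)
            continue;                              // already in its final place

        T held = std::move(items[start]);          // one temporary per cycle
        std::size_t hole = start;
        while (index[hole] != start)
        {
            const std::size_t from = index[hole];
            items[hole] = std::move(items[from]);  // fill the hole with its final object
            index[hole] = hole;                    // mark this slot as done
            hole = from;
        }
        items[hole] = std::move(held);             // close the cycle
        index[hole] = hole;
    }
}
```

A cycle of length k costs k + 1 object moves here, whereas a comparison sort that swaps the objects directly pays three moves per swap and typically performs O(n log n) swaps.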

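Implementation detail: your implementation sorts only the indices and refers back to the original array of objects through the index when comparing. Another way to implement it is to store index/sort-key pairs in the sort buffer and compare using the sort keys. This means you can do the sort without having the entire array of objects in memory at once.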
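
The index/sort-key variant could be sketched like this. The `Record` type and the function name `SortByKeyPairs` are assumptions made for the example; the point is that once the pairs are built, the sort itself never touches the large records.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record type, assumed for illustration only.
struct Record
{
    int id;            // the small sort key
    char data[1024];   // the bulky payload
};

// Builds (key, index) pairs, sorts them, and returns the indices in sorted order.
inline std::vector<std::size_t> SortByKeyPairs(const std::vector<Record>& records)
{
    std::vector<std::pair<int, std::size_t>> tags;
    tags.reserve(records.size());
    for (std::size_t i = 0; i < records.size(); ++i)
        tags.emplace_back(records[i].id, i);

    // std::pair compares the key first and the index second,
    // so records with equal keys keep their original relative order.
    std::sort(tags.begin(), tags.end());

    std::vector<std::size_t> order;
    order.reserve(tags.size());
    for (const auto& tag : tags)
        order.push_back(tag.second);
    assert(order.size() == records.size());
    return order;
}
```

Keeping the keys next to the indices also tends to make comparisons more cache friendly, since each comparison reads two small adjacent pairs instead of dereferencing into widely separated large objects.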

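One example of a tag sort is the LINQ to Objects sorting algorithm in .NET:

Sort is somewhat flexible in that it lets you supply a comparison delegate. It does not, however, let you supply a swap delegate. In many cases that is fine. However, if you are sorting large structures (value types), or if you want to do an indirect sort (often referred to as a tag sort), a swap delegate is a very useful thing to have. The LINQ to Objects sorting algorithm, for example, uses a tag sort internally. You can verify that by examining the source, which is available in the .NET Reference Source. Letting you pass a swap delegate would make things much more flexible.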
