C++ - Vector-based argsort implementation is inefficient compared with numpy's

Question

Pag*_*vid 6 c++ python performance numpy

Here is a comparison I made. np.argsort was timed on a float32 ndarray of 1,000,000 elements.

In [1]: import numpy as np

In [2]: a = np.random.randn(1000000)

In [3]: a = a.astype(np.float32)

In [4]: %timeit np.argsort(a)
86.1 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here is a C++ program that performs the same procedure on a vector, following this answer.

#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
#include <cstdlib>  // rand, RAND_MAX
int main()
{
  std::vector<float> numbers;
  for (int i = 0; i != 1000000; ++i) {
    numbers.push_back((float)rand() / (RAND_MAX));
  }

  double e1 = (double)cv::getTickCount();

  std::vector<size_t> idx(numbers.size());
  std::iota(idx.begin(), idx.end(), 0);

  std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
                                               { return numbers[a] < numbers[b];});

  double e2 = (double)cv::getTickCount();
  std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
  return 0;
}

It prints Finished in 525.908 milliseconds. and is much slower than the numpy version. So can anyone explain what makes np.argsort so fast? Thanks.

Edit1: np.__version__ returns 1.15.0, running on Python 3.6.6 |Anaconda custom (64-bit), and g++ --version prints 8.2.0. The operating system is Manjaro Linux.

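~~Edit2: I tried compiling with the g++ flags -O2 and -O3, and got results of 216.515 ms and 205.017 ms respectively. That is an improvement, but still slower than the numpy version. (See this question.)~~ This was struck out because I had mistakenly run the test with my laptop's DC adapter unplugged, which slows it down. In a fair comparison, the C-array and vector versions perform about the same (roughly 100 ms).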

Edit3: Another approach is to replace the vector with a C-style array: float numbers[1000000];. The running time is then about 100 ms (+/- 5 ms). The full code is here:

#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
#include <cstdlib>  // rand, RAND_MAX
int main()
{
  //std::vector<float> numbers;
  float numbers[1000000];
  for (int i = 0; i != 1000000; ++i) {
    numbers[i] = ((float)rand() / (RAND_MAX));
  }

  double e1 = (double)cv::getTickCount();

  std::vector<size_t> idx(1000000);
  std::iota(idx.begin(), idx.end(), 0);

  std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
                                               { return numbers[a] < numbers[b];});

  double e2 = (double)cv::getTickCount();
  std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
  return 0;
}
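
For readers who want to reproduce the measurement without the OpenCV dependency, the sketch below wraps the same index sort in a function and times it with std::chrono. The names argsort, random_floats and time_argsort_ms are hypothetical helpers introduced here, not part of the original program, and the normally distributed input is an assumption chosen to mirror np.random.randn from the numpy session.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: returns the indices that sort `values` in ascending order.
std::vector<std::size_t> argsort(const std::vector<float>& values) {
  std::vector<std::size_t> idx(values.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  // Compare indices by the values they point at, as in the original program.
  std::sort(idx.begin(), idx.end(),
            [&values](std::size_t a, std::size_t b) { return values[a] < values[b]; });
  return idx;
}

// Hypothetical helper: n normally distributed floats (assumed input, similar to np.random.randn).
std::vector<float> random_floats(std::size_t n, unsigned seed) {
  std::mt19937 gen(seed);
  std::normal_distribution<float> dist(0.0f, 1.0f);
  std::vector<float> out(n);
  for (float& x : out) x = dist(gen);
  return out;
}

// Times a single argsort call in milliseconds with a monotonic clock.
// The result is written to idx_out so the work is observable by the caller.
double time_argsort_ms(const std::vector<float>& values,
                       std::vector<std::size_t>& idx_out) {
  const auto start = std::chrono::steady_clock::now();
  idx_out = argsort(values);
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_argsort_ms(random_floats(1000000, 42), idx) from main and printing the return value gives a number comparable to the one above. As Edit2 notes, the result depends heavily on the optimization flags and on the machine's power state, so build with -O2 or -O3 before comparing against numpy.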

Answer 1

sch*_*312 3

I took your implementation and measured it with 10000000 items. It took about 1.7 seconds.

Now I introduce a class

class valuePair {
  public:
    valuePair(int idx, float value) : idx(idx), value(value){};
    int idx;
    float value;
};

with the initialization

std::vector<valuePair> pairs;
for (int i = 0; i != 10000000; ++i) {
    pairs.push_back(valuePair(i, (double)rand() / (RAND_MAX)));
}

and the sorting is then done with

std::sort(pairs.begin(), pairs.end(), [&](const valuePair &a, const valuePair &b) { return a.value < b.value; });

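This code reduces the running time to 1.1 seconds. I think this is due to better cache locality, since each value is stored next to its index and the comparator no longer has to look it up in a separate array, but it is still far from the Python result.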
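
The snippets above sort the pairs but stop short of producing the index array that np.argsort returns. A minimal sketch of the complete round trip might look like the following. The names ValueIndex and argsort_pairs are hypothetical, the 32-bit index is an assumption that the input has fewer than 2^32 elements, and reserve is added here only to avoid repeated reallocation while filling the vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the valuePair class: the value and its original
// position are stored side by side, so a comparison touches one contiguous record.
struct ValueIndex {
  float value;
  std::uint32_t idx;  // assumes fewer than 2^32 elements
};

// Hypothetical helper: argsort by sorting (value, index) records and then
// copying the indices out in sorted order.
std::vector<std::size_t> argsort_pairs(const std::vector<float>& values) {
  std::vector<ValueIndex> pairs;
  pairs.reserve(values.size());  // one allocation instead of repeated growth
  for (std::size_t i = 0; i < values.size(); ++i) {
    pairs.push_back(ValueIndex{values[i], static_cast<std::uint32_t>(i)});
  }

  // Sort the records by value; std::sort is not stable, so the relative
  // order of equal values is unspecified.
  std::sort(pairs.begin(), pairs.end(),
            [](const ValueIndex& a, const ValueIndex& b) { return a.value < b.value; });

  // Extract the permutation, which is what np.argsort would return.
  std::vector<std::size_t> idx(pairs.size());
  for (std::size_t i = 0; i < pairs.size(); ++i) {
    idx[i] = pairs[i].idx;
  }
  return idx;
}
```

Building the records and copying the indices back out are extra passes over the data, so a fair timing against the index-sort version should include them.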
