For a C++ Vector3 utility implementation, is an array faster than a struct or a class?

Wen*_*nyu 5 c++ optimization performance physics utility

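Out of curiosity, I implemented a vector3 utility in three ways: as an array (with a typedef), as a class, and as a struct.

This is the array implementation: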

typedef float newVector3[3];

namespace vec3{
    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void dot(const newVector3& first, const newVector3& second, float& out_result);
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
}

// implementations, nothing fancy...really
namespace vec3{

    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] + second[0];
        out_newVector3[1] = first[1] + second[1];
        out_newVector3[2] = first[2] + second[2];
    }

    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] - second[0];
        out_newVector3[1] = first[1] - second[1];
        out_newVector3[2] = first[2] - second[2];
    }

    void dot(const newVector3& first, const newVector3& second, float& out_result){
        out_result = first[0]*second[0] + first[1]*second[1] + first[2]*second[2];
    }

    // note: as posted, this multiplies component-wise; it is not the mathematical cross product
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] * second[0];
        out_newVector3[1] = first[1] * second[1];
        out_newVector3[2] = first[2] * second[2];
    }
}
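
Note that the `cross` shown above multiplies the inputs component by component. For comparison, here is a minimal sketch of the standard cross product for the same array typedef. The names `crossProduct` and `checkCrossProduct` are hypothetical and not part of the question, and the sketch assumes the output array does not alias either input.

```cpp
#include <cassert>

typedef float newVector3[3];

// Hypothetical helper (not from the question): out = a x b.
// Assumes `out` is a different array from `a` and `b`.
void crossProduct(const newVector3& a, const newVector3& b, newVector3& out)
{
    out[0] = a[1] * b[2] - a[2] * b[1];
    out[1] = a[2] * b[0] - a[0] * b[2];
    out[2] = a[0] * b[1] - a[1] * b[0];
}

// Small self-check: for the unit axes, x cross y gives z.
void checkCrossProduct()
{
    const newVector3 x = {1.0f, 0.0f, 0.0f};
    const newVector3 y = {0.0f, 1.0f, 0.0f};
    newVector3 z;
    crossProduct(x, y, z);
    assert(z[0] == 0.0f && z[1] == 0.0f && z[2] == 1.0f);
}
```

The true formula needs six multiplications and three subtractions instead of three multiplications, so absolute timings would differ from the ones reported here, although the struct-versus-array comparison is about how the result is returned rather than about the arithmetic.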

A class implementation:

class Vector3{
private:
    float x;
    float y;
    float z;

public:
    // constructors
    Vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }

    Vector3(const Vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
};

Of course, it also contains the other functions that usually appear in a Vector3 class.

Finally, the struct implementation:

struct s_vector3{
    float x;
    float y;
    float z;

    // constructors
    s_vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }

    s_vector3(const s_vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
};
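
The test code calls a static `s_vector3::cross`, which is among the omitted functions. A minimal sketch of what such a member might look like follows. Its signature and body are assumptions: it returns the result by value (which is what the test's `newVector3Struct = s_vector3::cross(...)` implies) and mirrors the component-wise arithmetic of the array version so that both do comparable work. The helper `checkStructCross` is hypothetical.

```cpp
#include <cassert>

struct s_vector3 {
    float x;
    float y;
    float z;

    s_vector3(float new_x, float new_y, float new_z)
        : x(new_x), y(new_y), z(new_z) {}

    // Assumed signature: inputs by const reference, result by value.
    // Assumed body: same component-wise arithmetic as vec3::cross.
    static s_vector3 cross(const s_vector3& first, const s_vector3& second) {
        return s_vector3(first.x * second.x,
                         first.y * second.y,
                         first.z * second.z);
    }
};

// Usage check: returns a fresh object instead of writing to an out parameter.
void checkStructCross() {
    s_vector3 v(2.0f, 3.0f, 4.0f);
    s_vector3 r = s_vector3::cross(v, v);
    assert(r.x == 4.0f && r.y == 9.0f && r.z == 16.0f);
}
```

The design difference from the array version is the interface: the array functions write into a caller-supplied output, while this member constructs and returns a new object that the caller then copies or assigns.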

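Again, I have omitted some other common Vector3 functions. Now I let all three of them create 9,000,000 new objects and do 9,000,000 cross products (I write a large chunk of data after each one finishes to flush the cache, so that the cache does not help them).

This is the test code: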

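#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <ctime>

const int K_OPERATION_TIME = 9000000;
const size_t bigger_than_cachesize = 20 * 1024 * 1024;

void cleanCache()
{
    // flush the cache by writing a buffer much larger than it
    // (20 Mi longs, i.e. 80 to 160 MB depending on sizeof(long))
    long *p = new long[bigger_than_cachesize];
    for(size_t i = 0; i < bigger_than_cachesize; i++)
    {
       p[i] = rand();
    }
    delete[] p; // release the buffer again
}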

int main(){

    cleanCache();
    // first, the Vector3 struct
    std::clock_t start;
    double duration;

    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i,i,i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

    cleanCache();
    // second, the Vector3 array implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        newVector3 newVector3Array = {float(i), float(i), float(i)}; // explicit cast avoids a narrowing error in C++11
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The array implementation of Vector3 takes %f seconds.\n", duration);

    cleanCache();
    // Third, the Vector3 class implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        Vector3 newVector3Class = Vector3(i,i,i);
        newVector3Class = Vector3::cross(newVector3Class, newVector3Class);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The class implementation of Vector3 takes %f seconds.\n", duration);


    return 0;
}

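The result is surprising.

The struct and class implementations finish the task in about 0.23 seconds, whereas the array implementation takes only 0.08 seconds!

If arrays really have such a significant performance advantage, then despite their ugly syntax they would be worth using in many cases.

So I would really like to make sure: is this supposed to happen? Thanks!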

ead*_*ead 7

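Short answer: it depends. As you have observed, there is a difference when compiling without optimization.

When I compile with optimization turned on (-O2 or -O3), so that all functions are inlined, there is no difference (but read on, it is not quite that simple).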

 Optimization    Time in seconds (struct vs. array)
    -O0              0.27 vs. 0.12
    -O1              0.14 vs. 0.04
    -O2              0.00 vs. 0.00
    -O3              0.00 vs. 0.00

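There is no guarantee about which optimizations your compiler can or will perform, so the complete answer is "it depends on your compiler". My starting point is to trust the compiler to do the right thing; otherwise I should start programming in assembly. Only if this part of the code is a real bottleneck is it worth thinking about helping the compiler.

When compiled with -O2, your code takes 0.0 seconds for both versions, but that is because the optimizer sees that the values are never used, so it simply throws away the whole code!

Let's make sure that this does not happen: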

#include <ctime>
#include <cstdio>

const int K_OPERATION_TIME = 1000000000;

int main(){
    std::clock_t start;
    double duration;

    start = std::clock();

    double checksum=0.0;
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i,i,i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
        checksum+=newVector3Struct.x +newVector3Struct.y+newVector3Struct.z; // actually using the result of cross-product!
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

    // second, the Vector3 array implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        newVector3 newVector3Array = {float(i), float(i), float(i)}; // explicit cast avoids a narrowing error in C++11
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
        checksum+=opResult[0] +opResult[1]+opResult[2];  // actually using the result of cross-product!
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The array implementation of Vector3 takes %f seconds.\n", duration);

    printf("Checksum: %f\n", checksum);
}

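You will notice the following changes:

  1. No cache is involved (there are no cache misses), so I simply removed the code responsible for flushing it.
  2. There is no performance difference between a class and a struct (after compilation there is really no difference; the whole public/private syntactic sugar is only skin deep), so I look only at the struct.
  3. The result of the cross product is actually used, so it cannot be optimized away.
  4. There are now 1e9 iterations, to get meaningful timings.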
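
Point 2 above can be checked directly. The following is a minimal sketch, assuming a C++11 compiler; the names `StructVec`, `ClassVec`, and `checkSameFootprint` are hypothetical and do not appear in the question. It verifies that a struct with public members and a class with private members holding the same three floats are both standard-layout types and occupy the same number of bytes.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the struct version: public data members.
struct StructVec {
    float x, y, z;
};

// Hypothetical stand-in for the class version: same data, but private.
class ClassVec {
public:
    ClassVec(float nx, float ny, float nz) : x(nx), y(ny), z(nz) {}
    float sum() const { return x + y + z; }
private:
    float x, y, z;
};

// Both types are standard-layout, and on typical compilers the same size.
void checkSameFootprint() {
    assert(std::is_standard_layout<StructVec>::value);
    assert(std::is_standard_layout<ClassVec>::value);
    assert(sizeof(StructVec) == sizeof(ClassVec));
}
```

Access specifiers are enforced at compile time only. They leave no trace in the generated code, which is why timing one of the two is enough.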

With these changes we see the following timings (Intel compiler):

 Optimization    Time in seconds (struct vs. array)
    -O0              33.2 vs. 17.1
    -O1              19.1 vs. 7.8
    -Os              19.2 vs. 7.9
    -O2              0.7 vs. 0.7
    -O3              0.7 vs. 0.7

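I am a little disappointed that -Os performs so badly, but as you can see, with full optimization there is no difference between the struct and the array!

Personally I like -Os a lot, because it produces assembly that I am able to understand, so let's take a look at why it is so slow.

The most obvious thing, even without looking at the generated assembly: s_vector3::cross returns an s_vector3 object, but we assign the result to an already existing object. If the optimizer does not see that the old object is no longer used, it might not be able to apply RVO (return value optimization). So let us replace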

newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum+=newVector3Struct.x +newVector3Struct.y+newVector3Struct.z;

with:

s_vector3 r = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum+=r.x +r.y+r.z; 

The result now: 2.14 (struct) vs. 7.9 (array). That is quite an improvement!
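
The difference between the two patterns can be made visible with an instrumented type. This is a minimal sketch, not the code that was timed: `CountedVec`, `makeSquared`, and the other names are hypothetical, and the counters only show which special member functions run, not how long they take.

```cpp
#include <cassert>

// Hypothetical instrumented vector: counts copy constructions and assignments.
struct CountedVec {
    static int copies;
    static int assignments;
    float x, y, z;

    CountedVec(float nx, float ny, float nz) : x(nx), y(ny), z(nz) {}
    CountedVec(const CountedVec& o) : x(o.x), y(o.y), z(o.z) { ++copies; }
    CountedVec& operator=(const CountedVec& o) {
        x = o.x; y = o.y; z = o.z;
        ++assignments;
        return *this;
    }
};
int CountedVec::copies = 0;
int CountedVec::assignments = 0;

// Returns a new object by value, like s_vector3::cross in the benchmark.
CountedVec makeSquared(const CountedVec& v) {
    return CountedVec(v.x * v.x, v.y * v.y, v.z * v.z);
}

// Pattern 1: assign into an existing object; operator= has to run.
float assignIntoExisting() {
    CountedVec v(1.0f, 2.0f, 3.0f);
    v = makeSquared(v);
    return v.x + v.y + v.z;
}

// Pattern 2: initialize a fresh object; the result can be built in place.
float initializeFresh() {
    CountedVec v(1.0f, 2.0f, 3.0f);
    CountedVec r = makeSquared(v);
    return r.x + r.y + r.z;
}

// Usage check: same sums, but only the first pattern performs an assignment.
void checkPatterns() {
    assert(assignIntoExisting() == 14.0f);
    assert(CountedVec::assignments == 1);
    assert(initializeFresh() == 14.0f);
    assert(CountedVec::assignments == 1); // unchanged by the second pattern
}
```

In the first pattern a temporary is created and then copied member by member into `v`. In the second, the compiler is allowed (and since C++17 required) to construct the returned object directly in `r`, so no assignment takes place at all.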

My takeaway: the optimizer does a great job, but we can help it if needed.