Wen*_*nyu 5 c++ optimization performance physics utility
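Out of curiosity, I implemented a vector3 utility in three ways: as an array (with a typedef), as a class, and as a struct.
Here is the array implementation: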
typedef float newVector3[3];
namespace vec3{
    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void dot(const newVector3& first, const newVector3& second, float& out_result);
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3);

    // implementations, nothing fancy...really
    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3)
    {
        out_newVector3[0] = first[0] + second[0];
        out_newVector3[1] = first[1] + second[1];
        out_newVector3[2] = first[2] + second[2];
    }
    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] - second[0];
        out_newVector3[1] = first[1] - second[1];
        out_newVector3[2] = first[2] - second[2];
    }
    void dot(const newVector3& first, const newVector3& second, float& out_result){
        out_result = first[0]*second[0] + first[1]*second[1] + first[2]*second[2];
    }
    // NOTE: as posted, this multiplies component-wise; it is not a true cross product.
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] * second[0];
        out_newVector3[1] = first[1] * second[1];
        out_newVector3[2] = first[2] * second[2];
    }
} // namespace vec3
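The `cross` function as posted multiplies the two vectors component by component. For reference, a conventional cross product for the same array typedef could look like the sketch below. The namespace `vec3_sketch` and the helper `crossSelfCheck` are hypothetical names introduced only for this illustration; they are not part of the original post, and the timings in this thread were not measured with this version.

```cpp
#include <cassert>

typedef float newVector3[3];

namespace vec3_sketch { // hypothetical namespace, not from the post
// Conventional 3D cross product: out = first x second.
// The components are computed into locals first, so `out` may alias an input.
inline void cross(const newVector3& first, const newVector3& second, newVector3& out)
{
    const float x = first[1] * second[2] - first[2] * second[1];
    const float y = first[2] * second[0] - first[0] * second[2];
    const float z = first[0] * second[1] - first[1] * second[0];
    out[0] = x;
    out[1] = y;
    out[2] = z;
}
} // namespace vec3_sketch

// Small self-check: the x axis crossed with the y axis gives the z axis.
inline void crossSelfCheck()
{
    const newVector3 ex = {1.0f, 0.0f, 0.0f};
    const newVector3 ey = {0.0f, 1.0f, 0.0f};
    newVector3 r;
    vec3_sketch::cross(ex, ey, r);
    assert(r[0] == 0.0f && r[1] == 0.0f && r[2] == 1.0f);
}
```

With a true cross product, crossing a vector with itself (as the test loop in the question does) always yields the zero vector, which is worth keeping in mind when reading benchmark results.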
A class implementation:
class Vector3{
private:
    float x;
    float y;
    float z;
public:
    // constructors
    Vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }
    Vector3(const Vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
    // ... other members omitted in the post (e.g. the static cross() used in the test)
};
Of course, it also contains the other functions that usually appear in a Vector3 class.
Finally, the struct implementation:
struct s_vector3{
    float x;
    float y;
    float z;
    // constructors
    s_vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }
    s_vector3(const s_vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
    // ... other members omitted in the post (e.g. the static cross() used in the test)
};
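Again, I have omitted some other common Vector3 functions. Now I let all three create 9,000,000 new objects and perform 9,000,000 cross products (after each variant finishes I write a large chunk of data to flush the cache, so that caching does not help the next one).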
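The post does not show the omitted `cross` member of the struct and class. Judging from how the test code calls it (`s_vector3::cross(a, b)` with the result assigned to a vector), it is presumably a static function that returns its result by value. The sketch below is one hypothetical shape for it; the type name `s_vector3_sketch`, the helper `crossUsage`, and the component-wise body (chosen to mirror the posted `vec3::cross`) are all assumptions.

```cpp
#include <cassert>

// Hypothetical stand-in for the posted struct, renamed to avoid clashing with it.
struct s_vector3_sketch {
    float x;
    float y;
    float z;

    s_vector3_sketch(float new_x, float new_y, float new_z)
        : x(new_x), y(new_y), z(new_z) {}

    // Assumed shape of the omitted member: static, inputs by const reference,
    // result returned by value. The body mirrors the component-wise product
    // of the posted vec3::cross; the real implementation is not shown in the post.
    static s_vector3_sketch cross(const s_vector3_sketch& first,
                                  const s_vector3_sketch& second)
    {
        return s_vector3_sketch(first.x * second.x,
                                first.y * second.y,
                                first.z * second.z);
    }
};

// Usage: the result is a new object, unlike the array version's out-parameter.
inline void crossUsage()
{
    const s_vector3_sketch a(1.0f, 2.0f, 3.0f);
    const s_vector3_sketch b(4.0f, 5.0f, 6.0f);
    const s_vector3_sketch r = s_vector3_sketch::cross(a, b);
    assert(r.x == 4.0f && r.y == 10.0f && r.z == 18.0f);
}
```

The difference in calling convention (return by value here, out-parameter in the array version) is the main structural difference between the variants being timed.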
Here is the test code:
#include <ctime>
#include <cstdio>
#include <cstdlib>
// s_vector3, Vector3, newVector3 and vec3 are the definitions shown above
const int K_OPERATION_TIME = 9000000;
const size_t bigger_than_cachesize = 20 * 1024 * 1024;
void cleanCache()
{
    // flush the cache by writing a buffer much larger than it
    // (20 Mi elements of type long, i.e. well over 20 MB)
    long *p = new long[bigger_than_cachesize];
    for(size_t i = 0; i < bigger_than_cachesize; i++)
    {
        p[i] = std::rand();
    }
    delete[] p; // release the buffer again
}
int main(){
    cleanCache();
    // first, the Vector3 struct
    std::clock_t start;
    double duration;
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i,i,i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::printf("The struct implementation of Vector3 takes %f seconds.\n", duration);
    cleanCache();
    // second, the Vector3 array implementation
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        const float f = static_cast<float>(i); // avoid int-to-float narrowing in the braces
        newVector3 newVector3Array = {f, f, f};
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::printf("The array implementation of Vector3 takes %f seconds.\n", duration);
    cleanCache();
    // third, the Vector3 class implementation
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        Vector3 newVector3Class = Vector3(i,i,i);
        newVector3Class = Vector3::cross(newVector3Class, newVector3Class);
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::printf("The class implementation of Vector3 takes %f seconds.\n", duration);
    return 0;
}
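The result was surprising. The struct and class implementations finish the task in about 0.23 seconds, whereas the array implementation takes only 0.08 seconds!
If arrays really have such a significant performance advantage, they would be worth using in many situations despite the ugly syntax.
So I would really like to know for sure: is this supposed to happen? Thanks!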
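Short answer: it depends. As you observed, there is a difference when compiling without optimization.
When I compile with optimization turned on (-O2 or -O3), so that all functions are inlined, there is no difference (but read on, it is not quite that simple).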
Optimization    Time in seconds (struct vs. array)
-O0             0.27 vs. 0.12
-O1             0.14 vs. 0.04
-O2             0.00 vs. 0.00
-O3             0.00 vs. 0.00
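There is no guarantee about which optimizations your compiler can or will perform, so the complete answer is "it depends on your compiler". As a starting point I would trust my compiler to do the right thing; otherwise I should start programming in assembly. Only if this part of the code is a real bottleneck is it worth thinking about helping the compiler.
If compiled with -O2, your code takes exactly 0.0 seconds for both versions, but that is because the optimizer sees that the values are never used, so it simply throws away the whole code!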
Let's make sure that this does not happen:
#include <ctime>
#include <cstdio>
// s_vector3, newVector3 and vec3::cross are the definitions from the question
const int K_OPERATION_TIME = 1000000000;
int main(){
    std::clock_t start;
    double duration;
    // first, the Vector3 struct
    start = std::clock();
    double checksum=0.0;
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i,i,i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
        checksum+=newVector3Struct.x +newVector3Struct.y+newVector3Struct.z; // actually using the result of cross-product!
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::printf("The struct implementation of Vector3 takes %f seconds.\n", duration);
    // second, the Vector3 array implementation
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        const float f = static_cast<float>(i); // avoid int-to-float narrowing in the braces
        newVector3 newVector3Array = {f, f, f};
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
        checksum+=opResult[0] +opResult[1]+opResult[2]; // actually using the result of cross-product!
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::printf("The array implementation of Vector3 takes %f seconds.\n", duration);
    std::printf("Checksum: %f\n", checksum); // printing the checksum keeps both loops alive
}
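You will notice the following change: there are now 1e9 iterations, in order to get meaningful timings. With this change we see the following timings (Intel compiler):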
Optimization    Time in seconds (struct vs. array)
-O0             33.2 vs. 17.1
-O1             19.1 vs. 7.8
-Os             19.2 vs. 7.9
-O2             0.7 vs. 0.7
-O3             0.7 vs. 0.7
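I am a bit disappointed that -Os performs so badly, but as you can see, once the code is optimized (-O2 or -O3) there is no difference between the struct and the array!
Personally I like -Os a lot, because it produces assembly that I am able to understand, so let's take a look at why it is so slow.
The most obvious thing, without even looking at the generated assembly: s_vector3::cross returns an s_vector3 object, but we assign the result to an already existing object. If the optimizer does not see that the old object is no longer used, it may not be able to apply RVO. So let's replace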
newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum+=newVector3Struct.x +newVector3Struct.y+newVector3Struct.z;
with:
s_vector3 r = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum+=r.x +r.y+r.z;
The result now: 2.14 (struct) vs. 7.9 (array) seconds, which is quite an improvement!
My take on it: the optimizer does a great job, but we can help it if needed.
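To illustrate why initializing a new object can be cheaper than assigning to an existing one, the sketch below counts how often the assignment operator runs in each case. `Counted`, `makeCounted`, and the two helper functions are hypothetical names for this illustration and do not come from the post. Assigning a returned value to an existing object goes through `operator=`, whereas initializing a new object from the returned value lets the compiler construct it in place (copy elision, which is guaranteed for this form since C++17 and commonly performed by earlier compilers).

```cpp
#include <cassert>

// Hypothetical vector-like type that counts calls to its assignment operator.
struct Counted {
    float x, y, z;
    static int assignments; // how often operator= has run

    Counted(float a, float b, float c) : x(a), y(b), z(c) {}
    Counted(const Counted& o) : x(o.x), y(o.y), z(o.z) {}
    Counted& operator=(const Counted& o)
    {
        ++assignments;
        x = o.x;
        y = o.y;
        z = o.z;
        return *this;
    }
};
int Counted::assignments = 0;

// Returns its result by value, like the static cross() in the thread.
inline Counted makeCounted(float v) { return Counted(v, v, v); }

// Pattern from the slow version: assign the result to an existing object.
inline int assignmentsWhenAssigning()
{
    Counted::assignments = 0;
    Counted c(0.0f, 0.0f, 0.0f);
    c = makeCounted(1.0f); // a temporary is created, then operator= copies it
    return Counted::assignments;
}

// Pattern from the faster version: initialize a fresh object with the result.
inline int assignmentsWhenInitializing()
{
    Counted::assignments = 0;
    Counted r = makeCounted(1.0f); // initialization, operator= is not involved
    (void)r;
    return Counted::assignments;
}

inline void countedCheck()
{
    assert(assignmentsWhenAssigning() == 1);
    assert(assignmentsWhenInitializing() == 0);
}
```

With full optimization and inlining the extra assignment usually disappears for a trivial type like this, which matches the equal -O2/-O3 timings above; the counter only makes visible what the source code asks for before the optimizer steps in.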