编辑:我编辑了问题及其标题更精确.
考虑以下源代码:
#include <vector>
struct xyz {
xyz() { } // empty constructor, but the compiler doesn't care
xyz(const xyz& o): v(o.v) { }
xyz& operator=(const xyz& o) { v=o.v; return *this; }
int v; // <will be initialized to int(), which means 0
};
std::vector<xyz> test() {
return std::vector<xyz>(1024); // will do a memset() :-(
}
Run Code Online (Sandbox Code Playgroud)
...我怎么能避免vector <>分配的内存用它的第一个元素的副本进行初始化,这是一个O(n)操作我宁愿为了速度而跳过,因为我的默认构造函数什么都不做?
如果不存在通用的解决方案,那么g ++特定的解决方案就可以做到(但我找不到任何属性来执行此操作).
编辑:生成的代码如下(命令行:arm-elf-g ++ - 4.5 -O3 -S -fno-verbose-asm -o-test.cpp | arm-elf-c ++ filt | grep -vE'^ [[: space:]] …
有人可以访问iPhone 3GS或Pandora,请测试我刚写的下面的汇编程序吗?
它应该在NEON矢量FPU上非常快速地计算正弦和余弦.我知道它编译得很好,但没有足够的硬件我就无法测试它.如果你可以计算一些正弦和余弦,并将结果与sinf()和cosf()的结果进行比较,那将会有所帮助.
谢谢!
#include <math.h>
/// Computes the sine and cosine of two angles
/// in: angles = Two angles, expressed in radians, in the [-PI,PI] range.
/// out: results = vector containing [sin(angles[0]),cos(angles[0]),sin(angles[1]),cos(angles[1])]
static inline void vsincos(const float angles[2], float results[4]) {
static const float constants[] = {
/* q1 */ 0, M_PI_2, 0, M_PI_2,
/* q2 */ M_PI, M_PI, M_PI, M_PI,
/* q3 */ 4.f/M_PI, 4.f/M_PI, 4.f/M_PI, 4.f/M_PI,
/* q4 */ -4.f/(M_PI*M_PI), -4.f/(M_PI*M_PI), -4.f/(M_PI*M_PI), -4.f/(M_PI*M_PI),
/* q5 …Run Code Online (Sandbox Code Playgroud) 使用arm-elf-gcc-4.5 -O3 -march = armv7-a -mthumb -mfpu = neon -mfloat-abi = softfp编译该程序时:
#include <arm_neon.h>
extern float32x4_t cross(const float32x4_t& v1, const float32x4_t& v2) {
float32x4x2_t
xxyyzz1(vzipq_f32(v1, v1)),
xxyyzz2(vzipq_f32(v2, v2));
float32x2_t
xx1(vget_low_f32(xxyyzz1.val[0])),
yy1(vget_high_f32(xxyyzz1.val[0])),
zz1(vget_low_f32(xxyyzz1.val[1])),
xx2(vget_low_f32(xxyyzz2.val[0])),
yy2(vget_high_f32(xxyyzz2.val[0])),
zz2(vget_low_f32(xxyyzz2.val[1]));
float32x2_t
x(vmls_f32(vmul_f32(yy1, zz2), zz1, yy2)),
y(vmls_f32(vmul_f32(zz1, xx2), xx1, zz2)),
z(vmls_f32(vmul_f32(xx1, yy2), yy1, xx2));
return vcombine_f32(vuzp_f32(x, y).val[0], z);
}
Run Code Online (Sandbox Code Playgroud)
......这就是我得到的.注意两个用@ <<<标记的无用指令
_Z5crossRK19__simd128_float32_tS1_:
vldmia r0, {d16-d17}
vldmia r1, {d22-d23}
vmov q10, q8 @ v4sf
vmov q9, q11 @ v4sf
vzip.32 q8, q10
vzip.32 q11, q9
vmov d24, …Run Code Online (Sandbox Code Playgroud) 我如何指示python在内部存储我的字符串的预先散列版本,以便在使用我的字符串作为键执行dict/set查找时它将使用该值?
我记得几周前读过它,但目前在python docs中找不到它: - /
考虑以下NEON优化功能:
void mat44_multiply_neon(float32x4x4_t& result, const float32x4x4_t& a, const float32x4x4_t& b) {
// Make sure "a" is mapped to registers in the d0-d15 range,
// as requested by NEON multiply operations below:
register float32x4_t a0 asm("q0") = a.val[0];
register float32x4_t a1 asm("q1") = a.val[1];
register float32x4_t a2 asm("q2") = a.val[2];
register float32x4_t a3 asm("q3") = a.val[3];
asm volatile (
"\n\t# multiply two matrices...\n\t"
"# result (%q0,%q1,%q2,%q3) = first column of B (%q4) * first row of A (q0-q3)\n\t"
"vmul.f32 %q0, %q4, …Run Code Online (Sandbox Code Playgroud) arm ×3
assembly ×3
gcc ×3
neon ×3
optimization ×2
c++ ×1
dictionary ×1
hash ×1
iphone-3gs ×1
performance ×1
python ×1
string ×1