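I have C code that processes three consecutive values at a time from an 1800-element array. The code compiled by ICC 14.0 is about 68% slower than the code generated by MSVC (roughly 1600 CPU cycles for MSVC versus 2700 for ICC). I don't understand why. Can anyone help? Even with the Intel compiler's -O3 switch set, the timing doesn't change. The CPU is Ivy Bridge.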
    #include <iostream>
    #include <intrin.h>  // declares __rdtsc() on MSVC; other compilers may need a different header, e.g. <x86intrin.h>

    int main(){
        int data[1200];

        //Dummy-populate data
        for(int y=0; y<1200; y++){
            data[y] = y/2 + 7;
        }

        int counter = 0;

        //Just to repeat the test
        while(counter < 10000){
            int Accum = 0;
            long long start = 0;
            long long end = 0;
            int p = 0;

            start = __rdtsc();

            //Timed loop: multiply each pair of neighbouring values and accumulate
            while(p < 1200){
                unsigned int level1 = data[p];
                unsigned int factor = data[p + 1];
                Accum += (level1 * factor);
                p = p + 2;
            }

            end = __rdtsc();

            std::cout << (end - start) << " " << Accum << std::endl;
            counter++;
        }
    }
Solution
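ICC does poorly here because it computes the address for every data[n] access, a la mov edi,dword ptr [rsp+rax*4+44h] … all that run-time multiplication is expensive. You should be able to avoid it by recoding so that the indices are constants (you could also use *p_data++ three times, but that introduces a sequencing issue that may hurt performance).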
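    // Assumes data is an array of 1800 unsigned values and that
    // Accum1 and Accum2 are accumulators initialised to 0 beforehand.
    for (unsigned* p_data = &data[0], *p_end = data + 1800; p_data < p_end; p_data += 3)
    {
        unsigned level1 = p_data[0];
        unsigned level2 = p_data[1];
        unsigned factor = p_data[2];

        Accum1 += level1 * factor;
        Accum2 += level2 * factor;
    }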
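Before timing a rewrite like this, it is worth checking that the pointer-walking loop computes the same result as the indexed one. Below is a minimal sketch of that check for the two-value loop from the question. The function names `make_data`, `sum_indexed` and `sum_pointer` are made up for illustration and do not come from the original post, and the array is assumed to hold `unsigned` values. It only shows that the two forms are equivalent; whether the second is actually faster depends on the compiler and has to be measured.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: fills n values the same way the question does (y/2 + 7).
std::vector<unsigned> make_data(std::size_t n) {
    std::vector<unsigned> data(n);
    for (std::size_t y = 0; y < n; ++y) {
        data[y] = static_cast<unsigned>(y / 2 + 7);
    }
    return data;
}

// Index-based loop, as in the question: each iteration reads data[p] and data[p + 1].
unsigned sum_indexed(const unsigned* data, std::size_t n) {
    unsigned accum = 0;
    for (std::size_t p = 0; p + 1 < n; p += 2) {
        accum += data[p] * data[p + 1];
    }
    return accum;
}

// Pointer-walking loop with constant offsets, in the style of the answer.
unsigned sum_pointer(const unsigned* data, std::size_t n) {
    unsigned accum = 0;
    const unsigned* p_end = data + (n - n % 2);  // ignore a trailing unpaired element
    for (const unsigned* p = data; p < p_end; p += 2) {
        accum += p[0] * p[1];
    }
    return accum;
}
```

Calling both functions on `make_data(1200)` and comparing the return values confirms the rewrite did not change the arithmetic; the timing code from the question can then be wrapped around either call.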