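Is it beneficial to perform complex multiplication and division with SSE instructions?
I know that addition and subtraction perform better with SSE. Can someone tell me how to perform complex multiplication with SSE to get better performance?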
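Solution
Complex multiplication of c1 = c1a + c1b*i and c2 = c2a + c2b*i (a is the real part, b the imaginary part) is defined as: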
((c1a * c2a) - (c1b * c2b)) + ((c1b * c2a) + (c1a * c2b))i
So the 2 components of the resulting complex number are
((c1a * c2a) - (c1b * c2b)) and ((c1b * c2a) + (c1a * c2b))i
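As a scalar baseline for the definition above, a minimal sketch in plain C could look like this. The `complexf` type and the `complexf_mul` name are hypothetical helpers introduced only for illustration; they are not part of the original answer.

```c
/* Hypothetical helper type: one complex number as a (real, imaginary) pair. */
typedef struct {
    float a; /* real part */
    float b; /* imaginary part */
} complexf;

/* Scalar reference: (x.a + x.b*i) * (y.a + y.b*i). */
static complexf complexf_mul(complexf x, complexf y)
{
    complexf r;
    r.a = (x.a * y.a) - (x.b * y.b); /* real part */
    r.b = (x.b * y.a) + (x.a * y.b); /* imaginary part */
    return r;
}
```

Each product costs four multiplies, one subtract and one add, which is the work the SSE version spreads across vector lanes.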
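So, assuming you use 8 floats to represent 4 complex numbers laid out as follows:
c1a, c1b, c2a, c2b
c3a, c3b, c4a, c4b
and you want to compute (c1 * c3) and (c2 * c4) at the same time, your SSE code would look something like the following.
(Note that I used MSVC under Windows, but the principle is the same elsewhere.)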
    // Each array must hold four floats, because movaps moves 16 bytes at a time.
    // The input values are arbitrary test data.
    __declspec( align( 16 ) ) float c1c2[]       = { 1.0f, 2.0f, 3.0f, 4.0f };
    __declspec( align( 16 ) ) float c3c4[]       = { 4.0f, 3.0f, 2.0f, 1.0f };
    __declspec( align( 16 ) ) float mulfactors[] = { -1.0f, 1.0f, -1.0f, 1.0f };
    __declspec( align( 16 ) ) float res[]        = { 0.0f, 0.0f, 0.0f, 0.0f };

    __asm
    {
        movaps xmm0, xmmword ptr [c1c2]        // Load c1 and c2 into xmm0.
        movaps xmm1, xmmword ptr [c3c4]        // Load c3 and c4 into xmm1.
        movaps xmm4, xmmword ptr [mulfactors]  // Load multiplication factors into xmm4.

        movaps xmm2, xmm1
        movaps xmm3, xmm0
        shufps xmm2, xmm1, 0xA0                // Change order to c3a c3a c4a c4a and store in xmm2.
        shufps xmm1, xmm1, 0xF5                // Change order to c3b c3b c4b c4b and store in xmm1.
        shufps xmm3, xmm0, 0xB1                // Change order to c1b c1a c2b c2a and store in xmm3.

        mulps xmm0, xmm2
        mulps xmm3, xmm1
        mulps xmm3, xmm4                       // Negate lanes 0 and 2 so the add subtracts in the real parts.

        addps xmm0, xmm3                       // Add together.

        movaps xmmword ptr [res], xmm0         // Store back out.
    };

    // Scalar reference results to check the SSE output against.
    float res1a = (c1c2[0] * c3c4[0]) - (c1c2[1] * c3c4[1]);
    float res1b = (c1c2[1] * c3c4[0]) + (c1c2[0] * c3c4[1]);
    float res2a = (c1c2[2] * c3c4[2]) - (c1c2[3] * c3c4[3]);
    float res2b = (c1c2[3] * c3c4[2]) + (c1c2[2] * c3c4[3]);

    if ( res1a != res[0] ||
         res1b != res[1] ||
         res2a != res[2] ||
         res2b != res[3] )
    {
        _exit( 1 );
    }
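MSVC only accepts `__asm` blocks when targeting 32-bit x86, so the same sequence may be easier to reuse when written with the SSE intrinsics from `<xmmintrin.h>`, which GCC, Clang and MSVC all provide. The following is a sketch of that translation, not the answer's own code: the function name `complex_mul_2x` is hypothetical, and it uses unaligned loads and stores as a simplifying assumption so the caller does not need 16-byte aligned arrays.

```c
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_shuffle_ps, ... */

/*
 * Multiplies two pairs of complex numbers at once (hypothetical helper).
 *   x   = { c1a, c1b, c2a, c2b }
 *   y   = { c3a, c3b, c4a, c4b }
 *   out = { re(c1*c3), im(c1*c3), re(c2*c4), im(c2*c4) }
 */
static void complex_mul_2x(const float x[4], const float y[4], float out[4])
{
    const __m128 c1c2  = _mm_loadu_ps(x);
    const __m128 c3c4  = _mm_loadu_ps(y);
    const __m128 signs = _mm_setr_ps(-1.0f, 1.0f, -1.0f, 1.0f);

    /* Same shuffle immediates as the shufps instructions in the asm version. */
    const __m128 y_re = _mm_shuffle_ps(c3c4, c3c4, 0xA0); /* c3a c3a c4a c4a */
    const __m128 y_im = _mm_shuffle_ps(c3c4, c3c4, 0xF5); /* c3b c3b c4b c4b */
    const __m128 x_sw = _mm_shuffle_ps(c1c2, c1c2, 0xB1); /* c1b c1a c2b c2a */

    const __m128 p0 = _mm_mul_ps(c1c2, y_re);
    /* Multiply, then negate lanes 0 and 2 so the add subtracts in the real parts. */
    const __m128 p1 = _mm_mul_ps(_mm_mul_ps(x_sw, y_im), signs);

    _mm_storeu_ps(out, _mm_add_ps(p0, p1));
}
```

With the inputs { 1, 2, 3, 4 } and { 4, 3, 2, 1 }, the products are (1 + 2i)(4 + 3i) = -2 + 11i and (3 + 4i)(2 + i) = 2 + 11i.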
What I've done above is simplify the math a little. Assume the following inputs:
c1a c1b c2a c2b
c3a c3b c4a c4b
By rearranging them I end up with the following vectors:
0 => c1a c1b c2a c2b
1 => c3b c3b c4b c4b
2 => c3a c3a c4a c4a
3 => c1b c1a c2b c2a
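To see how the shuffle immediates produce vectors 1, 2 and 3, here is a small plain-C model of `shufps dst, src, imm`. It is only an illustration of the lane selection (the name `shufps_model` is hypothetical and no SSE is involved): the two low result lanes are picked from `dst`, the two high lanes from `src`, each by a 2-bit field of the immediate.

```c
/*
 * Plain-C model of `shufps dst, src, imm` (illustration only, not SSE code).
 * Bits 1:0 and 3:2 of imm select lanes from dst for result lanes 0 and 1;
 * bits 5:4 and 7:6 select lanes from src for result lanes 2 and 3.
 */
static void shufps_model(float dst[4], const float src[4], unsigned imm)
{
    float r[4];
    r[0] = dst[(imm >> 0) & 3];
    r[1] = dst[(imm >> 2) & 3];
    r[2] = src[(imm >> 4) & 3];
    r[3] = src[(imm >> 6) & 3];
    /* Copy through a temporary so dst and src may be the same array. */
    for (int i = 0; i < 4; ++i)
        dst[i] = r[i];
}
```

Decoding the three immediates this way gives lane selections 0,0,2,2 for 0xA0, 1,1,3,3 for 0xF5 and 1,0,3,2 for 0xB1, which match vectors 2, 1 and 3 respectively.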
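I then multiply 0 and 2 together to get:
0 => c1a * c3a, c1b * c3a, c2a * c4a, c2b * c4a
Next I multiply 3 and 1 together to get:
3 => c1b * c3b, c1a * c3b, c2b * c4b, c2a * c4b
Finally, I flip the signs of two of the floats in 3 (the first and the third):
3 => -(c1b * c3b), c1a * c3b, -(c2b * c4b), c2a * c4b
So I can add 0 and 3 together and get:
(c1a * c3a) - (c1b * c3b), (c1b * c3a) + (c1a * c3b), (c2a * c4a) - (c2b * c4b), (c2b * c4a) + (c2a * c4b)
Which is what we were after :)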
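The sign flip on vector 3 is done above by multiplying with -1.0f and 1.0f. A commonly used alternative, which is not part of the answer above, is to XOR the sign bits of the lanes to be negated. The sketch below assumes the same lane layout; `negate_even_lanes` is a hypothetical helper name.

```c
#include <xmmintrin.h> /* SSE intrinsics: _mm_setr_ps, _mm_xor_ps */

/* Hypothetical helper: negate lanes 0 and 2 of v by flipping their sign bits. */
static __m128 negate_even_lanes(__m128 v)
{
    /* -0.0f has only the sign bit set; 0.0f has no bits set. */
    const __m128 sign_mask = _mm_setr_ps(-0.0f, 0.0f, -0.0f, 0.0f);
    return _mm_xor_ps(v, sign_mask);
}
```

This replaces one floating-point multiply with a bitwise operation; whether that is measurably faster depends on the CPU and the surrounding code, so it is worth profiling rather than assuming.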