我创建了一个使用SIMD进行64位* 64位到128位的功能.目前我已经使用SSE2(acutally SSE4.1)实现了它.这意味着它可以同时运行两个64b * 64b到128b的产品.同样的想法可以扩展到AVX2或AVX512,同时提供四个或八个64b * 64到128b的产品.
我的算法基于 http://www.hackersdelight.org/hdcodetxt/muldws.c.txt
我的算法基于 http://www.hackersdelight.org/hdcodetxt/muldws.c.txt
该算法进行一次无符号乘法,一次有符号乘法和两次有符号*无符号乘法.使用_mm_mul_epi32和_mm_mul_epu32可以轻松完成签名的* signed和unsigned * unsigned操作.但混合签名和未签名的产品给我带来了麻烦.
例如,考虑一下.
int32_t x = 0x80000000; uint32_t y = 0x7fffffff; int64_t z = (int64_t)x*y;
双字产品应为0xc000000080000000.但是如果你假设编译器知道如何处理混合类型,你怎么能得到这个?这就是我提出的:
int64_t sign = x<0; sign*=-1; //get the sign and make it all ones uint32_t t = abs(x); //if x<0 take two's complement again uint64_t prod = (uint64_t)t*y; //unsigned product int64_t z = (prod ^ sign) - sign; //take two's complement based on the sign
使用SSE可以这样做
__m128i xh; //(xl2,xh2,xl1,xh1) high is signed,low unsigned __m128i yl; //(yh2,yl2,yh2,yl2) __m128i xs = _mm_cmpgt_epi32(_mm_setzero_si128(),xh); // get sign xs = _mm_shuffle_epi32(xs,0xA0); // extend sign __m128i t = _mm_sign_epi32(xh,xh); // abs(xh) __m128i prod = _mm_mul_epu32(t,yl); // unsigned (xh2*yl2,xh1*yl1) __m128i inv = _mm_xor_si128(prod,xs); // invert bits if negative __m128i z = _mm_sub_epi64(inv,xs); // add 1 if negative
这给出了正确的结果.但是我必须做两次(平时一次)它现在是我功能的重要部分.使用SSE4.2,AVX2(四个128位产品),甚至AVX512(八个128位产品)是否有更有效的方法?
编辑:基于@ElderBug的评论,看起来这样做的方法不是使用SIMD而是使用mul指令.对于它的价值,万一有人想要看到它有多复杂,这里是完整的工作功能(我只是让它工作,所以我没有优化它,但我不认为这是值得的).
void muldws1_sse(__m128i x,__m128i y,__m128i *lo,__m128i *hi) { __m128i lomask = _mm_set1_epi64x(0xffffffff); __m128i xh = _mm_shuffle_epi32(x,0xB1); // x0l,x0h,x1l,x1h __m128i yh = _mm_shuffle_epi32(y,0xB1); // y0l,y0h,y1l,y1h __m128i xs = _mm_cmpgt_epi32(_mm_setzero_si128(),xh); __m128i ys = _mm_cmpgt_epi32(_mm_setzero_si128(),yh); xs = _mm_shuffle_epi32(xs,0xA0); ys = _mm_shuffle_epi32(ys,0xA0); __m128i w0 = _mm_mul_epu32(x,y); // x0l*y0l,y0l*y0h __m128i w3 = _mm_mul_epi32(xh,yh); // x0h*y0h,x1h*y1h xh = _mm_sign_epi32(xh,xh); yh = _mm_sign_epi32(yh,yh); __m128i w1 = _mm_mul_epu32(x,yh); // x0l*y0h,x1l*y1h __m128i w2 = _mm_mul_epu32(xh,y); // x0h*y0l,x1h*y0l __m128i yinv = _mm_xor_si128(w1,ys); // invert bits if negative w1 = _mm_sub_epi64(yinv,ys); // add 1 __m128i xinv = _mm_xor_si128(w2,xs); // invert bits if negative w2 = _mm_sub_epi64(xinv,xs); // add 1 __m128i w0l = _mm_and_si128(w0,lomask); __m128i w0h = _mm_srli_epi64(w0,32); __m128i s1 = _mm_add_epi64(w1,w0h); // xl*yh + w0h; __m128i s1l = _mm_and_si128(s1,lomask); // lo(wl*yh + w0h); __m128i s1h = _mm_srai_epi64(s1,32); __m128i s2 = _mm_add_epi64(w2,s1l); //xh*yl + s1l __m128i s2l = _mm_slli_epi64(s2,32); __m128i s2h = _mm_srai_epi64(s2,32); //arithmetic shift right __m128i hi1 = _mm_add_epi64(w3,s1h); hi1 = _mm_add_epi64(hi1,s2h); __m128i lo1 = _mm_add_epi64(w0l,s2l); *hi = hi1; *lo = lo1; }
它变得更糟.在AVX512之前没有_mm_srai_epi64内在/指令所以我必须自己制作.
static inline __m128i _mm_srai_epi64(__m128i a,int b) { __m128i sra = _mm_srai_epi32(a,32); __m128i srl = _mm_srli_epi64(a,32); __m128i mask = _mm_set_epi32(-1,-1,0); __m128i out = _mm_blendv_epi8(srl,sra,mask); }
我上面的_mm_srai_epi64的实现是不完整的.我想我正在使用Agner Fog的Vector Class Library.如果你查看文件vectori128.h,你会发现
static inline Vec2q operator >> (Vec2q const & a,int32_t b) { // instruction does not exist. Split into 32-bit shifts if (b <= 32) { __m128i bb = _mm_cvtsi32_si128(b); // b __m128i sra = _mm_sra_epi32(a,bb); // a >> b signed dwords __m128i srl = _mm_srl_epi64(a,bb); // a >> b unsigned qwords __m128i mask = _mm_setr_epi32(0,-1); // mask for signed high part return selectb(mask,srl); } else { // b > 32 __m128i bm32 = _mm_cvtsi32_si128(b-32); // b - 32 __m128i sign = _mm_srai_epi32(a,31); // sign of a __m128i sra2 = _mm_sra_epi32(a,bm32); // a >> (b-32) signed dwords __m128i sra3 = _mm_srli_epi64(sra2,32); // a >> (b-32) >> 32 (second shift unsigned qword) __m128i mask = _mm_setr_epi32(0,-1); // mask for high part containing only sign return selectb(mask,sign,sra3); } }
解决方法
使用各种指令考虑整数乘法的吞吐量限制的正确方法是根据每个周期可以计算多少“产品位”.
mulx生产一个64×64 – >每个周期结果128;这是每周期64×64 = 4096“产品位”
如果你将SIMD上的乘数拼凑成32×32的指令 – > 64位乘法,你需要能够在每个周期得到4个结果来匹配mulx(4x32x32 = 4096).如果除了乘法之外没有算术,你只需要在AVX2上收支平衡.不幸的是,正如你已经注意到的那样,除了乘法之外还有很多算术,所以这是当前一代硬件上的非首发.