Solution
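What you want is a "non-temporal access", which tells the processor that the value you are reading now will not be needed again for a while. The processor then avoids caching that value.
See page 49 of the PDF linked above. It uses the Intel intrinsics to stream data around the cache.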
On the read side, processors, until recently, lacked support aside from weak hints using non-temporal access (NTA) prefetch instructions. There is no equivalent to write-combining for reads, which is especially bad for uncacheable memory such as memory-mapped I/O. Intel, with the SSE4.1 extensions, introduced NTA loads. They are implemented using a small number of streaming load buffers; each buffer contains a cache line. The first movntdqa instruction for a given cache line will load a cache line into a buffer, possibly replacing another cache line. Subsequent 16-byte aligned accesses to the same cache line will be serviced from the load buffer at little cost. Unless there are other reasons to do so, the cache line will not be loaded into a cache, thus enabling the loading of large amounts of memory without polluting the caches. The compiler provides an intrinsic for this instruction:
#include <smmintrin.h>
__m128i _mm_stream_load_si128 (__m128i *p);
This intrinsic should be used multiple times, with addresses of 16-byte blocks passed as the parameter, until each cache line is read. Only then should the next cache line be started. Since there are a few streaming read buffers it might be possible to read from two memory locations at once.
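As a rough sketch of the access pattern the quote describes (this is my own illustration, not code from the PDF): read all four 16-byte blocks of one cache line with _mm_stream_load_si128 before moving on to the next line. The helper name sum_streaming, the 64-byte cache-line size, and the requirement that the buffer is 64-byte aligned and a whole number of cache lines long are all assumptions made for the example. The target attribute is GCC/Clang-specific and is only there so the file compiles without -msse4.1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (movntdqa) */

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Hypothetical helper: sums n_lines cache lines of 32-bit integers,
 * reading each line completely before starting the next one. */
__attribute__((target("sse4.1")))
int32_t sum_streaming(const int32_t *src, size_t n_lines)
{
    /* Assumption: the buffer starts on a cache-line boundary. */
    assert(((uintptr_t)src % CACHE_LINE) == 0);

    __m128i acc = _mm_setzero_si128();
    for (size_t line = 0; line < n_lines; ++line) {
        /* Some headers declare the parameter as non-const, hence the cast. */
        __m128i *p = (__m128i *)(src + line * (CACHE_LINE / sizeof *src));

        /* Four 16-byte aligned loads cover one 64-byte cache line. */
        __m128i a = _mm_stream_load_si128(p + 0);
        __m128i b = _mm_stream_load_si128(p + 1);
        __m128i c = _mm_stream_load_si128(p + 2);
        __m128i d = _mm_stream_load_si128(p + 3);

        acc = _mm_add_epi32(acc, _mm_add_epi32(_mm_add_epi32(a, b),
                                               _mm_add_epi32(c, d)));
    }

    /* Horizontal sum of the four 32-bit lanes. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Whether the loads actually bypass the caches depends on the CPU and the memory type. On ordinary write-back memory the hint may simply be treated as a normal load, so measure before relying on it.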
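This would be perfect if, when reading, the buffers are read in linear order through memory. You use streaming reads to do that. When you want to modify them, the buffers are modified in linear order, and you can use streaming writes for that if you don't expect to read them again any time soon from the same thread.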
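For the write side, a minimal sketch under similar assumptions might look like the following. It walks the buffers linearly and writes the results with _mm_stream_si128 (the SSE2 non-temporal store), then issues _mm_sfence so the write-combined stores are ordered before anything that follows. The helper name add_constant_streaming and the 16-byte alignment requirement are mine, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128 (movntdq), _mm_load_si128 */

/* Hypothetical helper: dst[i] = src[i] + k for n_blocks 16-byte blocks
 * (4 ints each), writing the results with non-temporal stores. */
void add_constant_streaming(int32_t *dst, const int32_t *src,
                            size_t n_blocks, int32_t k)
{
    /* Assumption: both buffers are 16-byte aligned. */
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    const __m128i kv = _mm_set1_epi32(k);

    for (size_t i = 0; i < n_blocks; ++i) {
        __m128i v = _mm_add_epi32(_mm_load_si128(s + i), kv);
        _mm_stream_si128(d + i, v);  /* store with a non-temporal hint */
    }

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

This only pays off when the written data really is not read again soon; if the same thread touches it right afterwards, ordinary stores are likely the better choice.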