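We created a very simple C program that first accesses a large array and then calls CLFLUSH to flush the array's entire virtual address range. We measure the latency CLFLUSH takes to flush the whole array. The array size is an input to the program, and we vary it from 1MB to 40MB in steps of 2MB.
In our understanding, CLFLUSH should flush content that is in the cache. So we expected the latency of flushing the whole array to first increase linearly with the array size, and then to stop increasing once the array size exceeds 20MB, which is the size of the LLC on the machine running our program.
However, the experimental result is quite surprising, as shown in the figure. The latency does not stop increasing after the array size exceeds 20MB.
We are wondering whether CLFLUSH could bring an address into the cache before flushing it out, if the address is not already in the cache?
We also tried searching the Intel Software Developer's Manual for an explanation of what CLFLUSH does when the address is not in the cache.
Below is the data we used to draw the figure. The first column is the array size in KB, and the second column is the latency of flushing the whole array in seconds.
Any suggestions are appreciated.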
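[Edited]
The previous code was unnecessary. CLFLUSH can be issued much more easily from user space, with similar performance. So I deleted the messy code to avoid confusion.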
SCENARIO=Read Only
1024,0.00158601
3072,0.00299244
5120,0.00464945
7168,0.00630479
9216,0.00796194
11264,0.00961576
13312,0.01126760
15360,0.01300500
17408,0.01480760
19456,0.01696180
21504,0.01968410
23552,0.02300760
25600,0.02634970
27648,0.02990350
29696,0.03403090
31744,0.03749210
33792,0.04092470
35840,0.04438390
37888,0.04780050
39936,0.05163220

SCENARIO=Read and Write
1024,0.00200558
3072,0.00488687
5120,0.00775943
7168,0.01064760
9216,0.01352920
11264,0.01641430
13312,0.01929260
15360,0.02217750
17408,0.02516330
19456,0.02837180
21504,0.03183180
23552,0.03509240
25600,0.03845220
27648,0.04178440
29696,0.04519920
31744,0.04858340
33792,0.05197220
35840,0.05526950
37888,0.05865630
39936,0.06202170
Answer
See section 7.5.7 here: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
In general, CLFLUSHOPT throughput is higher than that of CLFLUSH,
because CLFLUSHOPT orders itself with respect to a smaller set of
memory traffic as described above and in Section 7.5.6. The
throughput of CLFLUSHOPT will also vary. When using CLFLUSHOPT,
flushing modified cache lines will experience a higher cost than
flushing cache lines in non-modified states. CLFLUSHOPT will provide
a performance benefit over CLFLUSH for cache lines in any coherence
states. CLFLUSHOPT is more suitable to flush large buffers (e.g.
greater than many KBytes), compared to CLFLUSH. In single-threaded
applications, flushing buffers using CLFLUSHOPT may be up to 9X
better than using CLFLUSH with Skylake microarchitecture.
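That section also explains that flushing modified data is slower, which obviously comes from the writeback penalty.
As for the increasing latency: are you measuring the total time it takes to walk the address range and clflush each line? In that case the time depends linearly on the array size, even past the LLC size. Even if a line is not in the cache, the clflush still has to be processed by the execution engine and the memory unit, and the entire cache hierarchy has to be looked up for each line whether or not the line is present.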