VC++: Performance drop x20 when there are more threads than CPUs, but not under g++

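A simple multithreaded C++11 program in which all threads lock the same mutex in a tight loop.

When it uses 8 threads (the number of logical CPUs), it reaches about 5 million locks/sec.

But add just one extra thread and performance drops to 200,000 locks/sec!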

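Edit:

Under g++ 4.8.2 (Ubuntu x64): even with 100 threads there is no performance drop at all! (And performance is more than twice as high, but that's another story.)
So this seems to be a problem specific to the VC++ mutex implementation.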

I reproduced it with the following code (Windows 7 x64):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono;

// Each worker locks the shared mutex in a tight loop and bumps the counter.
void thread_loop(std::mutex* mutex, std::atomic<std::uint64_t>* counter)
{
    while (true)
    {
        std::unique_lock<std::mutex> ul(*mutex);
        ++(*counter);
    }
}

int main()
{
    int threads = 9;  // one more than the 8 logical CPUs
    std::mutex mutex;
    std::atomic<std::uint64_t> counter{0};

    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        std::thread(&thread_loop, &mutex, &counter).detach();  // workers never exit

    std::cout << "Started " << threads << " threads.." << std::endl;
    while (true)
    {
        // Reset, wait one second, then print the number of locks taken in that second.
        counter = 0;
        std::this_thread::sleep_for(seconds(1));
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}

The VS 2013 profiler tells me that most of the time (95.7%) is wasted in a tight loop (line 697 in rtlocks.cpp):

while (IsBlocked() && spinWait._SpinOnce())
{
//_YieldProcessor is called inside _SpinOnce
}

What could be the cause? How can it be improved?

OS: Windows 7 x64

CPU: i7 3770, 4 cores (x2 with hyper-threading)

Solution

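With 8 threads your code spins, but each thread acquires the lock before its time slice runs out, so it never has to be suspended.

As you add more threads, contention increases, and so does the chance that a thread fails to acquire the lock within its time slice. When that happens, the thread is suspended and a context switch to another thread takes place; the suspended thread then has to be checked and woken up later before it can try again.

All of this switching, suspending and waking requires transitions from user mode to kernel mode, which are expensive, so performance suffers badly.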
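The two phases described above (spin first, block only if spinning fails) can be sketched roughly as follows. This is a simplified illustration built on top of std::mutex, not the code in rtlocks.cpp; the helper names `lock_spin_then_block` and `run_contended` and the `max_spins` parameter are made up for this example.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: try to take the mutex a bounded number of times,
// then fall back to a blocking lock(). max_spins is an illustrative knob,
// not a value taken from the VC++ runtime.
inline void lock_spin_then_block(std::mutex& m, int max_spins)
{
    for (int i = 0; i < max_spins; ++i)
    {
        if (m.try_lock())
            return;                  // fast path: acquired while spinning
        std::this_thread::yield();   // let another runnable thread use the core
    }
    m.lock();                        // slow path: block until the mutex is free
}

// Runs `threads` workers that each take the lock `iterations` times
// and returns the final value of the protected counter.
inline std::uint64_t run_contended(int threads, int iterations, int max_spins)
{
    std::mutex m;
    std::uint64_t counter = 0;       // plain integer, protected by m
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i)
            {
                lock_spin_then_block(m, max_spins);
                // Adopt the already-held lock so it is released on scope exit.
                std::lock_guard<std::mutex> guard(m, std::adopt_lock);
                ++counter;
            }
        });
    for (auto& th : pool)
        th.join();
    return counter;
}
```

Calling `run_contended` with a thread count above and below the number of logical CPUs, and timing each call, is one way to see how often the slow path is hit on a given machine.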

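To improve things, either reduce the number of threads contending for the lock or increase the number of available cores. In your example you are using a std::atomic number, so you do not need a lock at all to call ++ on it, because it is already thread safe.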
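A minimal sketch of both suggestions, assuming the only shared state is the counter: the mutex is dropped and the atomic is incremented directly, and the number of workers is capped at the number of logical CPUs. The function names `capped_thread_count` and `count_without_lock` are hypothetical and chosen only for this example.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: never start more workers than there are logical CPUs.
inline unsigned capped_thread_count(unsigned requested)
{
    unsigned hw = std::thread::hardware_concurrency();  // 0 means "unknown"
    if (hw == 0)
        return requested;
    return std::min(requested, hw);
}

// Each worker increments the shared atomic `iterations` times without any mutex.
// Returns the final counter value, which equals threads * iterations.
inline std::uint64_t count_without_lock(unsigned threads, std::uint64_t iterations)
{
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (std::uint64_t i = 0; i < iterations; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // atomic, no lock needed
        });
    for (auto& th : pool)
        th.join();
    return counter.load();
}
```

For instance, `count_without_lock(capped_thread_count(9), 1000000)` would run at most one worker per logical CPU. The threads still compete for the same cache line, so this is not free, but no thread has to be suspended while waiting for a lock.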
