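I have a very simple parallel for loop that just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop gets. I assumed this was due to some cache thrashing, so I played around with schedules, chunk sizes, __restrict__, nesting the parallel for inside a parallel block, and flushes. Then I noticed that reading the array to do a reduction is also slower.
This should obviously be very simple and should speed up nearly linearly. What am I missing here?
Full code: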
    #include <omp.h>
    #include <vector>
    #include <iostream>
    #include <ctime>

    void tic(), toc();

    int main(int argc, const char *argv[])
    {
        const int COUNT = 100;
        const size_t sz = 250000 * 200;
        std::vector<int> vec(sz, 1);

        std::cout << "max threads: " << omp_get_max_threads() << std::endl;

        std::cout << "serial reduction" << std::endl;
        tic();
        for(int c = 0; c < COUNT; ++ c) {
            double sum = 0;
            for(size_t i = 0; i < sz; ++ i)
                sum += vec[i];
        }
        toc();

        int *const ptr = vec.data();
        const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

        std::cout << "parallel reduction" << std::endl;
        tic();
        for(int c = 0; c < COUNT; ++ c) {
            double sum = 0;
            #pragma omp parallel for default(none) reduction(+:sum)
            for(int i = 0; i < sz_i; ++ i)
                sum += ptr[i];
        }
        toc();

        std::cout << "serial memset" << std::endl;
        tic();
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                vec[i] = 0;
        }
        toc();

        std::cout << "parallel memset" << std::endl;
        tic();
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none)
            for(int i = 0; i < sz_i; ++ i)
                ptr[i] = 0;
        }
        toc();

        return 0;
    }

    static clock_t ClockCounter;

    void tic()
    {
        ClockCounter = std::clock();
    }

    void toc()
    {
        ClockCounter = std::clock() - ClockCounter;
        std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
    }
Running this yields:
    g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
    ./omp_test
    max threads: 12
    serial reduction
      elapsed clock ticks: 1790000
    parallel reduction
      elapsed clock ticks: 19690000
    serial memset
      elapsed clock ticks: 3860000
    parallel memset
      elapsed clock ticks: 20800000
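If I compile with -O2, g++ is able to optimize the serial reduction away and I get zero time, hence -O1. Also, adding omp_set_num_threads(1); makes the times more similar, although there are still some differences: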
    g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
    ./omp_test
    max threads: 1
    serial reduction
      elapsed clock ticks: 1770000
    parallel reduction
      elapsed clock ticks: 7370000
    serial memset
      elapsed clock ticks: 2290000
    parallel memset
      elapsed clock ticks: 3550000
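This should be fairly obvious; I feel like I am not seeing something very basic. My CPU is an Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz with hyperthreading, but the same behavior is observed on a colleague's i5 with 4 cores and no hyperthreading. We are both running Linux.
EDIT
It seems that one mistake was in the timing. Running with: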
    static double ClockCounter;

    void tic()
    {
        ClockCounter = omp_get_wtime();//std::clock();
    }

    void toc()
    {
        ClockCounter = omp_get_wtime()/*std::clock()*/ - ClockCounter;
        std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
    }
yields more "reasonable" times (now in seconds, although the label still says clock ticks):
    g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
    ./omp_test
    max threads: 12
    serial reduction
      elapsed clock ticks: 1.80974
    parallel reduction
      elapsed clock ticks: 2.07367
    serial memset
      elapsed clock ticks: 2.37713
    parallel memset
      elapsed clock ticks: 2.23609
But there is still no speedup; it is just no longer slower.
EDIT2:
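As user8046 suggested, the code is heavily memory-bound. And as Z boson suggested, the serial code is easily optimized away, so it is not certain what is being measured here. So I made a small change, moving the sum outside of the loop so that it does not get reset to zero at every iteration over c. I also replaced the reduction operation with sum += F(vec[i]) and the memset operation with vec[i] = F(i). Running as: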
    g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
    ./omp_test
    max threads: 12
    serial reduction
      elapsed clock ticks: 23.9106
    parallel reduction
      elapsed clock ticks: 3.35519
    serial memset
      elapsed clock ticks: 43.7344
    parallel memset
      elapsed clock ticks: 6.50351
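Computing the square root adds more work for the threads, and finally there is some reasonable speedup (about 7x, which makes sense, as the hyperthreaded cores share memory lanes).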