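OpenMP Tutorial (Introductory)
0x01 Introduction
OpenMP is one of the most widely used parallel programming models today; it lets programmers write parallel software with relatively little effort. Before using OpenMP, you should be familiar with the following:
How to write C/C++ programs. OpenMP supports C/C++ and Fortran, but C/C++ is what we will generally use.
How to link a program against a library.
The figure below shows where OpenMP sits in the system stack:
0x02 Core Syntax
Most OpenMP constructs are compiler directives of the form #pragma omp construct [clause [clause]...], for example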
#pragma omp parallel num_threads(4)
Function prototypes and types are declared in the header omp.h:
#include <omp.h>
0x03 Compiling
> gcc -fopenmp test.c
> export OMP_NUM_THREADS=n # n is the number of processors in your computer
> ./a.out
0x04 Example Program
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main(){
    int nthreads, tid;
    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);
        /* Only master thread does this */
        if (tid == 0){
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */
    return 0;
}
Output:
Hello World from thread = 0
Hello World from thread = 1
Hello World from thread = 13
Hello World from thread = 5
Hello World from thread = 14
Hello World from thread = 15
Hello World from thread = 3
Number of threads = 16
Hello World from thread = 8
Hello World from thread = 9
Hello World from thread = 11
Hello World from thread = 6
Hello World from thread = 2
Hello World from thread = 4
Hello World from thread = 10
Hello World from thread = 12
Hello World from thread = 7
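0x05 Shared-Memory Programs
The figure above shows several threads in a shared memory space. A process owns multiple threads, and the threads communicate by reading from and writing to shared memory. The OS scheduler decides when each thread runs, and synchronization mechanisms coordinate the order of execution so that the program produces correct results.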
0x06 Creating Threads
double A[1000];
omp_set_num_threads(4);            // request 4 threads
#pragma omp parallel               // each thread runs the block below
{
    int ID = omp_get_thread_num(); // each thread gets its own thread ID
    pooh(ID, A);
}
The figure below shows how this program executes.

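Now for a concrete example:
The integral \(\int_0^1\frac{4}{1+x^2}dx\) can be evaluated analytically and equals \(\pi\). For more complicated integrals, however, a computer cannot apply a closed-form antiderivative and has to approximate the result numerically: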
\[\sum_{i=0}^{N-1}F(x_i)\Delta x\approx\pi
\]
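The smaller \(\Delta x\) is, the closer the result is to the exact value, but the longer the program runs. The terms of the sum are independent of one another, however, so they can be computed in parallel to shorten the run time.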
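Spelled out for the programs that follow, assuming \(N\) equal-width intervals sampled at their midpoints (which is what the expression (i+0.5)*step in the code computes):

\[
\int_0^1\frac{4}{1+x^2}\,dx = 4\arctan(x)\Big|_0^1 = \pi,
\qquad
\Delta x=\frac{1}{N},\quad x_i=\left(i+\tfrac{1}{2}\right)\Delta x,\quad F(x)=\frac{4}{1+x^2},
\]
\[
\pi\approx\Delta x\sum_{i=0}^{N-1}F(x_i).
\]

In the code, num_steps plays the role of \(N\), step is \(\Delta x\), and sum accumulates the values \(F(x_i)\) before the final multiplication by step.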
First, the serial program:
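#include <stdio.h>
#include <time.h>
static long num_steps = 100000000;
double step;
int main ()
{
    long i; double x, pi, sum = 0.0;
    clock_t start = clock();
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;              /* midpoint of the i-th interval */
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
    printf("time=%.3fms\n", 1000.0*(clock() - start)/CLOCKS_PER_SEC);
    printf("pi=%f\n", pi);
    return 0;
}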
Result:
time=652.356ms
pi=3.141593
Now the parallelized program:
#include <omp.h>
#include <stdio.h>
static long num_steps=1000000000;
double step;
#define NUM_THREADS 16
int main(void){
    int i, nthreads;
    double pi, sum[NUM_THREADS];
    double start, end;
    start = omp_get_wtime(); // wall-clock time; clock() can add up the CPU time of all threads
    step = 1.0/(double)num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds; // only the master thread writes the shared variable
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            sum[id] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++) pi += sum[i]*step;
    end = omp_get_wtime();
    printf("time=%f\n", (end - start)*1000); // milliseconds
    printf("pi=%f\n", pi);
    return 0;
}
Output:
NUM_THREADS=2
time=293.483000
pi=3.141593
NUM_THREADS=4
time=159.554000
pi=3.141593
Note: the program above suffers from false sharing. The elements of sum[] sit next to each other in memory, so different threads keep writing to the same cache line, which is why the run time does not fall linearly as NUM_THREADS increases. For more on false sharing, see https://zhuanlan.zhihu.com/p/55917869
0x07 Synchronization
Synchronization coordinates the execution of one or more threads. The two most common forms are the barrier and mutual exclusion.
Barrier
#pragma omp parallel
{
    int id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier // barrier
    B[id] = big_calc2(id, A);
}
Execution continues past the barrier only after every thread has reached it.
Critical
float res;
#pragma omp parallel
{
    float B;
    int i, id, nthrds;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i = id; i < niters; i += nthrds) { // niters: total number of iterations
        B = big_job(i);
        #pragma omp critical // Threads wait their turn. Only one at a time calls consume()
        res += consume(B);
    }
}
Atomic
#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);
    #pragma omp atomic
    X += tmp;
}
Loop
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        do...;
    }
}
Reduction
reduction(op:list)
Inside a parallel or a work-sharing construct:
A local copy of each list variable is made and initialized depending on the "op" (e.g. 0 for "+").
Updates occur on the local copy.
Local copies are reduced into a single value and combined with the original global value.