OpenMP: From Getting Started to Giving Up (1)

OpenMP Tutorial (Getting Started)

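0x01 Introduction

OpenMP is one of the most widely used parallel programming models today. It lets programmers write parallel software with relatively little effort. Before using OpenMP, you should be familiar with the following:

How to write C/C++ programs. OpenMP supports C/C++ and Fortran, but we will use C/C++ throughout.

How to link a program against a library.

The figure below shows where OpenMP sits in the system stack: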

0x02 Core syntax

Most of OpenMP consists of compiler directives of the form #pragma omp construct [clause [clause] ...], for example

#pragma omp parallel num_threads(4)

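Function prototypes and types are declared in the header file <omp.h>. It ships with compilers that support OpenMP (GCC, for example); if your toolchain lacks it, you need to download and install an OpenMP-capable compiler or runtime first.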

0x03 Compiling

> gcc -fopenmp test.c

> export OMP_NUM_THREADS=n # n is the number of processors (cores) in your computer

> ./a.out

0x04 Example program

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
    int nthreads, tid;

    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0){
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */

    return 0;
}

Output (16 threads; the order of the lines changes from run to run):

Hello World from thread = 0

Hello World from thread = 1

Hello World from thread = 13

Hello World from thread = 5

Hello World from thread = 14

Hello World from thread = 15

Hello World from thread = 3

Number of threads = 16

Hello World from thread = 8

Hello World from thread = 9

Hello World from thread = 11

Hello World from thread = 6

Hello World from thread = 2

Hello World from thread = 4

Hello World from thread = 10

Hello World from thread = 12

Hello World from thread = 7

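0x05 Shared-memory programs

The figure above shows several threads in a shared memory space. One process owns multiple threads, and the threads communicate by reading from and writing to shared memory. The OS scheduler decides when each thread runs, while synchronization mechanisms coordinate the order of execution so that the program behaves correctly.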

0x06 Creating threads

double A[1000];
omp_set_num_threads(4);            // request 4 threads
#pragma omp parallel               // every thread executes the block below
{
    int ID = omp_get_thread_num(); // each thread gets its own ID (takes no argument)
    pooh(ID, A);                   // A is shared, ID is private
}

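The program above runs as follows: at the parallel region the master thread forks a team of (up to) four threads, each thread calls pooh(ID, A) with its own ID while all of them share the single array A, and at the closing brace the threads join so that only the master thread continues.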

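Now for a concrete example.

The integral \(\int_0^1\frac{4}{1+x^2}dx\) can be evaluated analytically and equals \(\pi\). For more complicated integrals, however, no closed form may be available, and we have to approximate the value numerically: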

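\[\sum_{i=0}^{N-1}F(x_i)\Delta x\approx\pi\]

where \(F(x)=\frac{4}{1+x^2}\), \(\Delta x=1/N\), and \(x_i=(i+0.5)\Delta x\) is the midpoint of the \(i\)-th subinterval.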

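The smaller \(\Delta x\) is, the closer the result gets to the exact value, but the longer the program takes to run. The terms of the sum are independent of one another, though, so they can be computed in parallel to shorten the run time.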

First, the serial program:

#include <stdio.h>

static long num_steps = 100000000;
double step;

int main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i=0;i< num_steps; i++){
        x=(i+0.5)*step;             /* midpoint of the i-th subinterval */
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("pi=%f\n", pi);
    return 0;
}

Result:

time=652.356ms

pi=3.141593

Now the parallel version:

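#include <stdio.h>
#include <omp.h>

static long num_steps=1000000000;
double step;
#define NUM_THREADS 16

int main(void){
    double start,end;
    int i, nthreads;
    double pi,sum[NUM_THREADS];
    start=omp_get_wtime(); /* wall-clock time; clock() would add up the CPU time of all threads */
    step=1.0/(double)num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i,id,nthrds;
        double x;
        id=omp_get_thread_num();
        nthrds=omp_get_num_threads();
        if(id==0) nthreads=nthrds; /* only thread 0 writes the shared variable; the team may be smaller than requested */
        for(i=id,sum[id]=0.0;i<num_steps;i+=nthrds){ /* cyclic distribution of the iterations */
            x=(i+0.5)*step;
            sum[id]+=4.0/(1.0+x*x);
        }
    }
    for(i=0,pi=0.0;i<nthreads;i++) pi+=sum[i]*step; /* combine the partial sums */
    end=omp_get_wtime();
    printf("time=%f\n",(end-start)*1000.0); /* milliseconds */
    printf("pi=%f\n",pi);
    return 0;
}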

Output:

NUM_THREADS=2

time=293.483000

pi=3.141593

NUM_THREADS=4

time=159.554000

pi=3.141593

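Note: the program above suffers from false sharing. The elements of sum[] sit next to each other in memory and therefore share cache lines, so every update by one thread invalidates that line in the other cores' caches. This is one reason why time does not drop linearly as NUM_THREADS increases. For more on false sharing, see https://zhuanlan.zhihu.com/p/55917869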

0x07 Synchronization

Synchronization coordinates the execution of one or more threads. The two most common forms are the barrier and mutual exclusion.

Barrier

#pragma omp parallel
{
    int id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier   // every A[id] is written before any thread moves on
    B[id] = big_calc2(id, A);
}

Threads continue past the barrier only after every thread in the team has reached it.

Critical

float res = 0.0f;
#pragma omp parallel
{
    float B;
    int i, id, nthrds;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i = id; i < niters; i += nthrds) { // niters: total number of jobs (the bound was lost in the original; name assumed)
        B = big_job(i);
        #pragma omp critical // Threads wait their turn. Only one at a time calls consume()
        res += consume(B);
    }
}

Atomic

#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);
    #pragma omp atomic   // protects only the update of X
    X += tmp;
}

Loop

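#pragma omp parallel
{
    #pragma omp for   // the loop iterations are divided among the threads
    for (i = 0; i < N; i++) {   // N: iteration count (the bound was lost in the original; name assumed)
        /* do ... */
    }
}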

Reduction

reduction(op : list)

Inside a parallel or a work-sharing construct:

A local copy of each list variable is made and initialized depending on the "op" (e.g. 0 for "+").

Updates occur on the local copy.

Local copies are reduced into a single value and combined with the original global value.
