Parallel computing
Parallel computing: many threads solving a problem by working together.
The key to working together is communication.
00. Map
Each task reads from and writes to a specific data element (one-to-one).
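A minimal sketch of a map kernel (the name square and the assumption that the launch exactly covers the array are mine):
__global__ void square(float *d_out, const float *d_in)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_out[i] = d_in[i] * d_in[i];  // each thread maps one input element to one output element
}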
01. Gather
Each calculation gathers input data elements from different places to compute one output result (many-to-one reads).
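A gather sketch (the index array d_idx is illustrative, not from the original):
__global__ void gather(float *d_out, const float *d_in, const int *d_idx)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // each thread computes WHERE TO READ from: the input locations can be arbitrary
    d_out[i] = d_in[d_idx[i]];
}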
02. Scatter
Each task computes where to write its output; writes from different threads may collide.
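A scatter sketch, mirroring the gather above (again, d_idx is illustrative):
__global__ void scatter(float *d_out, const float *d_in, const int *d_idx)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // each thread computes WHERE TO WRITE: if two threads share d_idx[i], the writes collide
    d_out[d_idx[i]] = d_in[i];
}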
03. Stencil
Each task reads input from a fixed neighborhood in an array and updates every element.
Data is reused: neighboring tasks read overlapping inputs.
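A minimal 3-point stencil sketch (the kernel name and the boundary handling are my assumptions):
__global__ void stencil3(float *d_out, const float *d_in, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i > 0 && i < n - 1)  // interior elements only; a real stencil also defines the boundary
        d_out[i] = (d_in[i - 1] + d_in[i] + d_in[i + 1]) / 3.0f;  // fixed neighborhood; each d_in element is reused by 3 threads
}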
04. Transpose
Tasks reorder data elements in memory.
Transpose is not just the mathematical matrix transpose; it is any reordering of data elements in memory.
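A 2D transpose sketch (a naive version, using a 2D launch; names and bounds checks are mine):
__global__ void transpose(float *d_out, const float *d_in, int rows, int cols)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column in the input
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row in the input
    if (x < cols && y < rows)
        d_out[x * rows + y] = d_in[y * cols + x];   // element (y, x) moves to (x, y): a reordering in memory
}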
A stencil can be seen as a kind of gather, but a stencil must update every element of the array; a guard like (i % 2) means the operation does not touch every element, so it is a gather rather than a stencil.
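A sketch of that distinction (the kernel name is mine):
__global__ void average_odd(float *d_out, const float *d_in, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i % 2 && i < n - 1)  // only odd indices are written, so this is NOT a stencil
        d_out[i] = (d_in[i - 1] + d_in[i + 1]) / 2.0f;  // but it still GATHERS from a fixed neighborhood
}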
05. Parallel communication patterns summary
06. GPU hardware
Only threads within the same block can cooperate and communicate (via shared memory and barriers).
Because the programmer has no control over when or where thread blocks run, the following program can print its 16 lines in any order: there are 16! possible output orderings.
#include <stdio.h>

#define NUM_BLOCK 16
#define BLOCK_WIDTH 1

__global__ void hello()
{
    printf("hello from a thread in block %d\n", blockIdx.x);
}

int main()
{
    // launch the kernel
    hello<<<NUM_BLOCK, BLOCK_WIDTH>>>();
    // force the printf()s to flush
    cudaDeviceSynchronize();
    printf("That's done!\n");
    return 0;
}
07. What CUDA guarantees
(1) All threads in a block run on the same SM at the same time.
(2) All blocks in a kernel finish before any block from the next kernel runs.
08. GPU Memory Model
09. Synchronization
GPU barrier (__syncthreads()): a point in the program where every thread in the block stops and waits; once all threads have arrived, they may proceed.
__global__ void shift_1_index()
{
    // shift each element down by 1 index
    int idx = threadIdx.x;
    __shared__ int array[128];  // shared memory, visible to the whole thread block
    array[idx] = threadIdx.x;
    __syncthreads();            // wait until every element has been written
    // note: __syncthreads() must be reached by ALL threads in the block,
    // so keep the barriers outside the divergent if
    int temp = (idx < 127) ? array[idx + 1] : array[idx];  // "temp" is local, private to each thread
    __syncthreads();            // wait until every thread has read its neighbor
    array[idx] = temp;          // now it is safe to overwrite
}
Because the CPU launches kernels in order (into the default stream), there is an implicit synchronization between kernels: all blocks of one kernel finish before any block of the next kernel starts (guarantee (2) above).
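A minimal sketch of this guarantee in use (the kernel names step1/step2 and the 128-thread launch are illustrative):
__global__ void step1(int *d) { d[threadIdx.x] += 1; }
__global__ void step2(int *d) { d[threadIdx.x] *= 2; }

void run_pipeline(int *d_data)  // d_data: device array of at least 128 ints
{
    step1<<<1, 128>>>(d_data);
    step2<<<1, 128>>>(d_data);  // safe: step2 never starts before every block of step1 has finished
}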
10. Write efficient programs
High-level strategy: a modern GPU can perform on the order of 3 trillion math operations per second, but only if we keep it fed with data.
(1) Maximize arithmetic intensity = math / memory (make the numerator bigger and the denominator smaller).
Maximize work per thread == maximize useful compute operations per thread.
Minimize memory per thread == minimize time spent on memory per thread.
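As a worked example (my numbers, for illustration): SAXPY computes y[i] = a * x[i] + y[i], which is 2 math operations (one multiply, one add) per 12 bytes moved (read x[i], read y[i], write y[i], at 4 bytes each), an arithmetic intensity of 2/12 ≈ 0.17 operations per byte, so SAXPY is memory-bound.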
11. Minimize time spent on memory
Move frequently-accessed data into fast memory.
A CUDA thread can access three memory spaces:
local: private to each thread; fastest (lives in registers or L1 cache)
shared: shared by all threads in a block; fast (on-chip)
global: shared by all threads in all blocks; slow (off-chip DRAM)
Speed: local > shared >> global >> CPU memory (host memory)
// Using different memory spaces in CUDA
#include <stdio.h>

/**********************
 * using local memory *
 **********************/

// a __device__ or __global__ function runs on the GPU
__global__ void use_local_memory_GPU(float in)
{
    float f;  // variable "f" is in local memory and private to each thread
    f = in;   // parameter "in" is in local memory and private to each thread
    // ... real code would presumably do other stuff here ...
}

/***********************
 * using global memory *
 ***********************/

// a __global__ function runs on the GPU & can be called from the host
__global__ void use_global_memory_GPU(float *array)
{
    // "array" is a pointer into global memory on the device
    array[threadIdx.x] = 2.0f * (float) threadIdx.x;
}

/***********************
 * using shared memory *
 ***********************/

// (for clarity, hardcoding 128 threads/elements and omitting out-of-bounds checks)
__global__ void use_shared_memory_GPU(float *array)
{
    // local variables, private to each thread
    int i, index = threadIdx.x;
    float average, sum = 0.0f;

    // __shared__ variables are visible to all threads in the thread block
    // and have the same lifetime as the thread block
    __shared__ float sh_arr[128];

    // copy data from "array" in global memory to sh_arr in shared memory.
    // here, each thread is responsible for copying a single element.
    sh_arr[index] = array[index];  // MAP operation
    __syncthreads();               // ensure all the writes to shared memory have completed

    // now, sh_arr is fully populated. Let's find the average of all previous elements
    for (i = 0; i < index; i++) { sum += sh_arr[i]; }  // GATHER; reading sh_arr (shared) is much faster than array (global)
    average = sum / (index + 1.0f);

    // if array[index] is greater than the average of array[0..index-1], replace it with the average.
    // since array[] is in global memory, this change will be seen by the host (and potentially
    // other thread blocks, if any)
    if (array[index] > average) { array[index] = average; }

    // the following code has NO EFFECT: it modifies shared memory, but
    // the resulting modified data is never copied back to global memory
    // and vanishes when the thread block completes
    sh_arr[index] = 3.14f;
}
int main(int argc, char **argv)
{
    /*
     * First, call a kernel that shows using local memory
     */
    use_local_memory_GPU<<<1, 128>>>(2.0f);

    /*
     * Next, call a kernel that shows using global memory
     */
    float h_arr[128];  // convention: h_ variables live on the host
    float *d_arr;      // convention: d_ variables live on the device (GPU global memory)

    // allocate global memory on the device, place result in "d_arr"
    cudaMalloc((void **) &d_arr, sizeof(float) * 128);
    // now copy data from host memory "h_arr" to device memory "d_arr"
    cudaMemcpy((void *) d_arr, (void *) h_arr, sizeof(float) * 128, cudaMemcpyHostToDevice);
    // launch the kernel (1 block of 128 threads)
    use_global_memory_GPU<<<1, 128>>>(d_arr);  // modifies the contents of the array at d_arr
    // copy the modified array back to the host, overwriting the contents of h_arr
    cudaMemcpy((void *) h_arr, (void *) d_arr, sizeof(float) * 128, cudaMemcpyDeviceToHost);
    // ... do other stuff ...

    /*
     * Next, call a kernel that shows using shared memory
     */
    // as before, pass in a pointer to data in global memory
    use_shared_memory_GPU<<<1, 128>>>(d_arr);
    // copy the modified array back to the host (note the direction: DeviceToHost)
    cudaMemcpy((void *) h_arr, (void *) d_arr, sizeof(float) * 128, cudaMemcpyDeviceToHost);
    // ... do other stuff ...
    return 0;
}
Note that in this example we ship the data to the GPU, run only a single kernel, then copy the result back. Often we will run several kernels on the GPU, one after another. When this happens there is no need to copy the intermediate results back to the host: run each kernel in sequence, leave the intermediate data on the GPU in global memory, and only copy the final result back to the host.
12. Coalesce memory access
The GPU is at its most efficient when adjacent threads read or write contiguous global memory locations (coalesced access).
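A sketch of the difference (the kernel names and the stride parameter are illustrative, not from the original):
__global__ void coalesced(float *g, float a)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    g[i] = a;           // adjacent threads touch adjacent addresses: coalesced, fast
}

__global__ void strided(float *g, float a, int stride)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    g[i * stride] = a;  // adjacent threads touch addresses "stride" apart: not coalesced, slow
}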
gputimer.h
#ifndef __GPU_TIMER_H__
#define __GPU_TIMER_H__

struct GpuTimer
{
    cudaEvent_t start;
    cudaEvent_t stop;

    GpuTimer()
    {
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
    }

    ~GpuTimer()
    {
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    void Start()
    {
        cudaEventRecord(start, 0);
    }

    void Stop()
    {
        cudaEventRecord(stop, 0);
    }

    float Elapsed()
    {
        float elapsed;
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsed, start, stop);
        return elapsed;
    }
};

#endif /* __GPU_TIMER_H__ */
Atomic operations: atomic1.cu below does NOT use atomic operations, so concurrent increments are lost.
#include <stdio.h>
#include "gputimer.h"
#define NUM_THREADS 1000000
#define ARRAY_SIZE 10
#define BLOCK_WIDTH 1000
void print_array(int * arr, int num){
int i = 0;
for(;i<num;i++){
printf("%d \n", arr[i]);
}
printf("\n");
}
__global__ void increment_naive(int* g){
// find thrad index
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// each thread to increment consecutive elements, wrapping at ARRAY_SIZE
idx = idx % ARRAY_SIZE;
g[idx] += 1; // not atomic operation, when g[idx] == 100, and 50 threads read it in and output the same value == 101
}
int main(int argc, char const *argv[])
{
Gputimer timer;
printf("%d total threads in %d blocks writing into %d array elements\n",NUM_THREADS, BLOCK_WIDTH, ARRAY_SIZE);
// declare and allocate host memory
int h_arr[ARRAY_SIZE];
const int ARRAY_BYTES = ARRAY_BYTES * sizeof(int);
// declare, allocate and intialize value as zero in GPU global memory.
int* d_arr;
cudeMalloc((void**)&d_arr, ARRAY_BYTES);
cudaMemset((void*)d_arr, 0, ARRAY_BYTES);
// launch the kernel
timer.start();
increment_naive<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_arr);
timer.stop();
// copy back to host memory
cudeMemcpy(h_arr, d_arr, ARRAY_BYTES, cudeMemcpyDeviceToHost)
print_array(h_arr, ARRAY_SIZE);
printf("time elapsed = %g ms\n", timer.Elapsed());
// free GPU memory
cudaFree(d_arr);
return 0;
}
The result: the printed counts vary from run to run and fall short of the expected NUM_THREADS / ARRAY_SIZE = 100000 per element, because increments are lost. So we need atomic memory operations, such as atomicAdd().
atomicMin: minimum
atomicXor: bitwise XOR
atomicCAS: compare-and-swap
#include <stdio.h>
#include "gputimer.h"

#define NUM_THREADS 1000000
#define ARRAY_SIZE 10
#define BLOCK_WIDTH 1000

void print_array(int *arr, int num)
{
    for (int i = 0; i < num; i++) {
        printf("%d\n", arr[i]);
    }
    printf("\n");
}

__global__ void increment_naive(int *g)
{
    // find thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread increments consecutive elements, wrapping at ARRAY_SIZE
    idx = idx % ARRAY_SIZE;
    g[idx] += 1;  // NOT atomic: the concurrent read-modify-write loses updates
}

__global__ void increment_atomic(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread increments consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    atomicAdd(&g[i], 1);  // atomic read-modify-write: no updates are lost
}

int main(int argc, char const *argv[])
{
    GpuTimer timer;
    printf("%d total threads in %d blocks writing into %d array elements\n",
           NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, ARRAY_SIZE);

    // declare and allocate host memory
    int h_arr[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    // declare, allocate, and zero-initialize GPU global memory
    int *d_arr;
    cudaMalloc((void **) &d_arr, ARRAY_BYTES);
    cudaMemset((void *) d_arr, 0, ARRAY_BYTES);

    // launch the kernel
    timer.Start();
    // increment_naive<<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_arr);
    increment_atomic<<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_arr);
    timer.Stop();

    // copy back to host memory
    cudaMemcpy(h_arr, d_arr, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_arr, ARRAY_SIZE);
    printf("time elapsed = %g ms\n", timer.Elapsed());

    // free GPU memory
    cudaFree(d_arr);
    return 0;
}
The atomic version produces the correct counts but takes longer: that is the cost of the atomic operations, which serialize access to each memory location.
13. Limits of atomic memory operations
- Only certain operations and data types are supported (mostly integer types).
You can implement user-defined atomic functions on top of atomicCAS(); see the sketch after this list.
- There is still no ordering constraint among threads.
Because floating-point arithmetic is non-associative ((a + b) + c != a + (b + c) for floats), a parallel floating-point sum can give slightly different results from run to run.
- Atomics serialize access to memory.
If you are not careful, the program will be very slow.
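As a sketch of building a custom atomic from atomicCAS(): the loop below implements atomicAdd for double, following the well-known pattern from the CUDA C Programming Guide (native double atomicAdd only exists on compute capability 6.0 and later; the function name atomicAddDouble is mine):
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *) address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // try to swap in (assumed + val); atomicCAS returns the value it actually saw
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // if another thread changed *address in the meantime, old != assumed: retry
    } while (assumed != old);
    return __longlong_as_double(old);
}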