Parallel computing
Parallel computing: many threads solving a problem by working together.
The key to working together is communication.
00. Map
Each task reads from and writes to a specific data element (one-to-one).
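A minimal sketch of a map kernel (the name square and the assumption that the launch exactly covers the array are mine):
__global__ void square(float *d_out, const float *d_in)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_out[i] = d_in[i] * d_in[i];  // each thread maps one input element to one output element
}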
01. Gather
Each calculation gathers input data elements from different places to compute one output result (many-to-one reads).
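A gather sketch (the index array d_idx is illustrative, not from the original):
__global__ void gather(float *d_out, const float *d_in, const int *d_idx)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // each thread computes WHERE TO READ from: the input locations can be arbitrary
    d_out[i] = d_in[d_idx[i]];
}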
02. Scatter
Each task computes where to write its output; writes from different threads may collide.
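A scatter sketch, mirroring the gather above (again, d_idx is illustrative):
__global__ void scatter(float *d_out, const float *d_in, const int *d_idx)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // each thread computes WHERE TO WRITE: if two threads share d_idx[i], the writes collide
    d_out[d_idx[i]] = d_in[i];
}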
03. Stencil
Each task reads input from a fixed neighborhood in an array and updates every element.
Data is reused: neighboring tasks read overlapping inputs.
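A minimal 3-point stencil sketch (the kernel name and the boundary handling are my assumptions):
__global__ void stencil3(float *d_out, const float *d_in, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i > 0 && i < n - 1)  // interior elements only; a real stencil also defines the boundary
        d_out[i] = (d_in[i - 1] + d_in[i] + d_in[i + 1]) / 3.0f;  // fixed neighborhood; each d_in element is reused by 3 threads
}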
04. Transpose
Tasks reorder data elements in memory.
Transpose is not just the mathematical matrix transpose; it is any reordering of data elements in memory.
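A 2D transpose sketch (a naive version, using a 2D launch; names and bounds checks are mine):
__global__ void transpose(float *d_out, const float *d_in, int rows, int cols)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column in the input
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row in the input
    if (x < cols && y < rows)
        d_out[x * rows + y] = d_in[y * cols + x];   // element (y, x) moves to (x, y): a reordering in memory
}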
A stencil can be seen as a kind of gather, but a stencil must update every element of the array; a guard like (i % 2) means the operation does not touch every element, so it is a gather rather than a stencil.
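A sketch of that distinction (the kernel name is mine):
__global__ void average_odd(float *d_out, const float *d_in, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i % 2 && i < n - 1)  // only odd indices are written, so this is NOT a stencil
        d_out[i] = (d_in[i - 1] + d_in[i + 1]) / 2.0f;  // but it still GATHERS from a fixed neighborhood
}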
05. Parallel communication patterns summary
06. GPU hardware
Only threads within the same block can cooperate and communicate (via shared memory and barriers).
Because the programmer has no control over when or where thread blocks run, the following program can print its 16 lines in any order: there are 16! possible output orderings.
#include <stdio.h>

#define NUM_BLOCK 16
#define BLOCK_WIDTH 1

__global__ void hello()
{
    printf("hello from a thread in block %d\n", blockIdx.x);
}

int main()
{
    // launch the kernel
    hello<<<NUM_BLOCK, BLOCK_WIDTH>>>();
    // force the printf()s to flush
    cudaDeviceSynchronize();
    printf("That's done!\n");
    return 0;
}
07. What CUDA guarantees
(1) All threads in a block run on the same SM at the same time.
(2) All blocks in a kernel finish before any block from the next kernel runs.
08. GPU Memory Model
09. Synchronization
GPU barrier (__syncthreads()): a point in the program where every thread in the block stops and waits; once all threads have arrived, they may proceed.
__global__ void shift_1_index()
{
    // shift each element down by 1 index
    int idx = threadIdx.x;
    __shared__ int array[128];  // shared memory, visible to the whole thread block
    array[idx] = threadIdx.x;
    __syncthreads();            // wait until every element has been written
    // note: __syncthreads() must be reached by ALL threads in the block,
    // so keep the barriers outside the divergent if
    int temp = (idx < 127) ? array[idx + 1] : array[idx];  // "temp" is local, private to each thread
    __syncthreads();            // wait until every thread has read its neighbor
    array[idx] = temp;          // now it is safe to overwrite
}
Because the CPU launches kernels in order (into the default stream), there is an implicit synchronization between kernels: all blocks of one kernel finish before any block of the next kernel starts (guarantee (2) above).
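A minimal sketch of this guarantee in use (the kernel names step1/step2 and the 128-thread launch are illustrative):
__global__ void step1(int *d) { d[threadIdx.x] += 1; }
__global__ void step2(int *d) { d[threadIdx.x] *= 2; }

void run_pipeline(int *d_data)  // d_data: device array of at least 128 ints
{
    step1<<<1, 128>>>(d_data);
    step2<<<1, 128>>>(d_data);  // safe: step2 never starts before every block of step1 has finished
}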
10. Write efficient programs
High-level strategy: a modern GPU can perform on the order of 3 trillion math operations per second, but only if we keep it fed with data.
(1) Maximize arithmetic intensity = math / memory (make the numerator bigger and the denominator smaller).
Maximize work per thread == maximize useful compute operations per thread.
Minimize memory per thread == minimize time spent on memory per thread.
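As a worked example (my numbers, for illustration): SAXPY computes y[i] = a * x[i] + y[i], which is 2 math operations (one multiply, one add) per 12 bytes moved (read x[i], read y[i], write y[i], at 4 bytes each), an arithmetic intensity of 2/12 ≈ 0.17 operations per byte, so SAXPY is memory-bound.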
11. Minimize time spent on memory
Move frequently-accessed data into fast memory.
A CUDA thread can access three memory spaces:
local: private to each thread; fastest (lives in registers or L1 cache)
shared: shared by all threads in a block; fast (on-chip)
global: shared by all threads in all blocks; slow (off-chip DRAM)
Speed: local > shared >> global >> CPU memory (host memory)
// Using different memory spaces in CUDA
#include <stdio.h>

/**********************
 * using local memory *
 **********************/

// a __device__ or __global__ function runs on the GPU
__global__ void use_local_memory_GPU(float in)
{
    float f;  // variable "f" is in local memory and private to each thread
    f = in;   // parameter "in" is in local memory and private to each thread
    // ... real code would presumably do other stuff here ...
}

/***********************
 * using global memory *
 ***********************/

// a __global__ function runs on the GPU & can be called from the host
__global__ void use_global_memory_GPU(float *array)
{
    // "array" is a pointer into global memory on the device
    array[threadIdx.x] = 2.0f * (float) threadIdx.x;
}

/***********************
 * using shared memory *
 ***********************/

// (for clarity, hardcoding 128 threads/elements and omitting out-of-bounds checks)
__global__ void use_shared_memory_GPU(float *array)
{
    // local variables, private to each thread
    int i, index = threadIdx.x;
    float average, sum = 0.0f;

    // __shared__ variables are visible to all threads in the thread block
    // and have the same lifetime as the thread block
    __shared__ float sh_arr[128];

    // copy data from "array" in global memory to sh_arr in shared memory.
    // here, each thread is responsible for copying a single element.
    sh_arr[index] = array[index];  // MAP operation
    __syncthreads();               // ensure all the writes to shared memory have completed

    // now, sh_arr is fully populated. Let's find the average of all previous elements
    for (i = 0; i < index; i++) { sum += sh_arr[i]; }  // GATHER; reading sh_arr (shared) is much faster than array (global)
    average = sum / (index + 1.0f);

    // if array[index] is greater than the average of array[0..index-1], replace it with the average.
    // since array[] is in global memory, this change will be seen by the host (and potentially
    // other thread blocks, if any)
    if (array[index] > average) { array[index] = average; }

    // the following code has NO EFFECT: it modifies shared memory, but
    // the resulting modified data is never copied back to global memory
    // and vanishes when the thread block completes
    sh_arr[index] = 3.14f;
}
int main(int argc, char **argv)
{
    /*
     * First, call a kernel that shows using local memory
     */
    use_local_memory_GPU<<<1, 128>>>(2.0f);

    /*
     * Next, call a kernel that shows using global memory
     */
    float h_arr[128];  // convention: h_ variables live on the host
    float *d_arr;      // convention: d_ variables live on the device (GPU global memory)

    // allocate global memory on the device, place result in "d_arr"
    cudaMalloc((void **) &d_arr, sizeof(float) * 128);
    // now copy data from host memory "h_arr" to device memory "d_arr"
    cudaMemcpy((void *) d_arr, (void *) h_arr, sizeof(float) * 128, cudaMemcpyHostToDevice);
    // launch the kernel (1 block of 128 threads)
    use_global_memory_GPU<<<1, 128>>>(d_arr);  // modifies the contents of the array at d_arr
    // copy the modified array back to the host, overwriting the contents of h_arr
    cudaMemcpy((void *) h_arr, (void *) d_arr, sizeof(float) * 128, cudaMemcpyDeviceToHost);
    // ... do other stuff ...

    /*
     * Next, call a kernel that shows using shared memory
     */
    // as before, pass in a pointer to data in global memory
    use_shared_memory_GPU<<<1, 128>>>(d_arr);
    // copy the modified array back to the host (note the direction: DeviceToHost)
    cudaMemcpy((void *) h_arr, (void *) d_arr, sizeof(float) * 128, cudaMemcpyDeviceToHost);
    // ... do other stuff ...
    return 0;
}
Note that in this example we ship the data to the GPU, run only a single kernel, then copy the result back. Often we will run several kernels on the GPU, one after another. When this happens there is no need to copy the intermediate results back to the host: run each kernel in sequence, leave the intermediate data on the GPU in global memory, and only copy the final result back to the host.
12. Coalesce memory access
The GPU is at its most efficient when adjacent threads read or write contiguous global memory locations (coalesced access).
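A sketch of the difference (the kernel names and the stride parameter are illustrative, not from the original):
__global__ void coalesced(float *g, float a)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    g[i] = a;           // adjacent threads touch adjacent addresses: coalesced, fast
}

__global__ void strided(float *g, float a, int stride)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    g[i * stride] = a;  // adjacent threads touch addresses "stride" apart: not coalesced, slow
}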
gputimer.h
#ifndef __GPU_TIMER_H__
#define __GPU_TIMER_H__

struct GpuTimer
{
    cudaEvent_t start;
    cudaEvent_t stop;

    GpuTimer()
    {
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
    }

    ~GpuTimer()
    {
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    void Start()
    {
        cudaEventRecord(start, 0);
    }

    void Stop()
    {
        cudaEventRecord(stop, 0);
    }

    float Elapsed()
    {
        float elapsed;
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsed, start, stop);
        return elapsed;
    }
};

#endif /* __GPU_TIMER_H__ */
Atomic operations: atomic1.cu below does NOT use atomic operations, so concurrent increments are lost.
#include <stdio.h>
#include "gputimer.h"
#define NUM_THREADS 1000000
#define ARRAY_SIZE 10
#define BLOCK_WIDTH 1000
void print_array(int * arr, int num){
int i = 0;
for(;i<num;i++){
printf("%d \n", arr[i]);
}
printf("\n");
}
__global__ void increment_naive(int* g){
// find thrad index
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// each thread to increment consecutive elements, wrapping at ARRAY_SIZE
idx = idx % ARRAY_SIZE;
g[idx] += 1; // not atomic operation, when g[idx] == 100, and 50 threads read it in and output the same value == 101
}
int main(int argc, char const *argv[])
{
Gputimer timer;
printf("%d total threads in %d blocks writing into %d array elements\n",NUM_THREADS, BLOCK_WIDTH, ARRAY_SIZE);
// declare and allocate host memory
int h_arr[ARRAY_SIZE];
const int ARRAY_BYTES = ARRAY_BYTES * sizeof(int);
// declare, allocate and intialize value as zero in GPU global memory.
int* d_arr;
cudeMalloc((void**)&d_arr, ARRAY_BYTES);
cudaMemset((void*)d_arr, 0, ARRAY_BYTES);
// launch the kernel
timer.start();
increment_naive<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_arr);
timer.stop();
// copy back to host memory
cudeMemcpy(h_arr, d_arr, ARRAY_BYTES, cudeMemcpyDeviceToHost)
print_array(h_arr, ARRAY_SIZE);
printf("time elapsed = %g ms\n", timer.Elapsed());
// free GPU memory
cudaFree(d_arr);
return 0;
}
The result: the printed counts vary from run to run and fall short of the expected NUM_THREADS / ARRAY_SIZE = 100000 per element, because increments are lost. So we need atomic memory operations, such as atomicAdd().
atomicMin: minimum
atomicXor: bitwise XOR
atomicCAS: compare-and-swap
#include <stdio.h>
#include "gputimer.h"

#define NUM_THREADS 1000000
#define ARRAY_SIZE 10
#define BLOCK_WIDTH 1000

void print_array(int *arr, int num)
{
    for (int i = 0; i < num; i++) {
        printf("%d\n", arr[i]);
    }
    printf("\n");
}

__global__ void increment_naive(int *g)
{
    // find thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread increments consecutive elements, wrapping at ARRAY_SIZE
    idx = idx % ARRAY_SIZE;
    g[idx] += 1;  // NOT atomic: the concurrent read-modify-write loses updates
}

__global__ void increment_atomic(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread increments consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;
    atomicAdd(&g[i], 1);  // atomic read-modify-write: no updates are lost
}

int main(int argc, char const *argv[])
{
    GpuTimer timer;
    printf("%d total threads in %d blocks writing into %d array elements\n",
           NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, ARRAY_SIZE);

    // declare and allocate host memory
    int h_arr[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    // declare, allocate, and zero-initialize GPU global memory
    int *d_arr;
    cudaMalloc((void **) &d_arr, ARRAY_BYTES);
    cudaMemset((void *) d_arr, 0, ARRAY_BYTES);

    // launch the kernel
    timer.Start();
    // increment_naive<<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_arr);
    increment_atomic<<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_arr);
    timer.Stop();

    // copy back to host memory
    cudaMemcpy(h_arr, d_arr, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_arr, ARRAY_SIZE);
    printf("time elapsed = %g ms\n", timer.Elapsed());

    // free GPU memory
    cudaFree(d_arr);
    return 0;
}
The atomic version produces the correct counts but takes longer: that is the cost of the atomic operations, which serialize access to each memory location.
13. Limits of atomic memory operations
- Only certain operations and data types are supported (mostly integer types).
You can implement user-defined atomic functions on top of atomicCAS(); see the sketch after this list.
- There is still no ordering constraint among threads.
Because floating-point arithmetic is non-associative ((a + b) + c != a + (b + c) for floats), a parallel floating-point sum can give slightly different results from run to run.
- Atomics serialize access to memory.
If you are not careful, the program will be very slow.
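As a sketch of building a custom atomic from atomicCAS(): the loop below implements atomicAdd for double, following the well-known pattern from the CUDA C Programming Guide (native double atomicAdd only exists on compute capability 6.0 and later; the function name atomicAddDouble is mine):
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *) address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // try to swap in (assumed + val); atomicCAS returns the value it actually saw
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // if another thread changed *address in the meantime, old != assumed: retry
    } while (assumed != old);
    return __longlong_as_double(old);
}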