Unified Memory#
CUDA Unified Memory (UM) offers a unified view of memory shared by CPUs and GPUs: CPUs can access GPU memory directly, GPUs can access CPU main memory directly, and (on Pascal and later architectures) memory pages migrate automatically between the two. It can simplify the task of programming GPUs, but it is not in itself a technique for achieving higher performance.
Introduction#
Unified Memory (UM) support was introduced to the CUDA programming model with CUDA version 6.0 in 2014. It simplifies memory management by providing access to both CPU and GPU memories through the same pointers, without the need to copy data manually from one to the other.

UM creates a single virtual memory space directly accessible by all GPUs and CPUs in the system, with automatic page migration for data locality. Data allocated in UM can be read or written through a single pointer by code running on either a CPU or GPU.
To allocate Unified Memory statically:
__device__ __managed__ float a[10];
To allocate Unified Memory dynamically:
cudaError_t cudaMallocManaged(void** ptr, size_t size);
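As a hedged illustration of the static form (the array name, size and kernel here are our own, not from the course), a __managed__ variable can be touched directly from both host and device code, with a synchronisation before the host reads values produced by the GPU:
#include <cstdio>

__device__ __managed__ float a[10];   // statically allocated managed array

__global__ void scale(float s)
{
    if (threadIdx.x < 10)
        a[threadIdx.x] *= s;          // device writes the managed array directly
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        a[i] = (float)i;              // host writes the same array, no cudaMemcpy
    scale<<<1, 32>>>(2.0f);
    cudaDeviceSynchronize();          // wait before the host reads the results
    printf("a[5] = %f\n", a[5]);      // prints 10.0
    return 0;
}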
Simple Example:
We run our codes on joules, which has two Quadro RTX 5000 GPUs with 16 GB of memory each.
Explicit Memory Management:
float *a_host, *a_dev;
a_host=(float *)malloc(asize);
cudaMalloc((void **)&a_dev, asize);
memset(a_host, 0, asize);
cudaMemcpy(a_dev, a_host, asize, cudaMemcpyHostToDevice);
kernel<<<...>>>(a_dev, asize);
cudaMemcpy(a_host, a_dev, asize, cudaMemcpyDeviceToHost);
free(a_host);
cudaFree(a_dev);
With Unified Memory:
float *a;
cudaMallocManaged(&a, asize);
memset(a, 0, asize);
kernel<<<...>>>(a, asize);
cudaDeviceSynchronize();
cudaFree(a);
Consider the second code running on joules. cudaMallocManaged() reserves the memory area, and the first page fault triggers the physical mapping; this can happen on the CPU or the GPU (here it is the memset() that does it). If the data is already mapped somewhere else, a page fault triggers its migration (the kernel call in the example above). The key point is that the system automatically migrates data allocated in managed memory between host and device, so that it looks like CPU memory to code running on the CPU and like GPU memory to code running on the GPU.
Note that it doesn’t resolve the technical limitations that require data transfer over PCI Express or NVLINK. Rather, it’s a change in who’s doing the memory management. So, data still needs to be copied to the GPU, but the programmer does not need to worry about manually copying data back and forth.
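For reference, a minimal compilable sketch of the Unified Memory version might look as follows (the kernel body, array size and launch configuration are illustrative assumptions, not taken from the course code):
#include <cstdio>
#include <cstring>

__global__ void kernel(float *a, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += 1.0f;                      // first GPU touch triggers page migration
}

int main(void)
{
    const size_t n = 1 << 20;
    const size_t asize = n * sizeof(float);
    float *a;
    cudaMallocManaged(&a, asize);          // one pointer, valid on host and device
    memset(a, 0, asize);                   // first CPU touch maps the pages on the host
    kernel<<<(n + 255) / 256, 256>>>(a, n);
    cudaDeviceSynchronize();               // wait before the CPU touches the data again
    printf("a[0] = %f\n", a[0]);           // prints 1.0
    cudaFree(a);
    return 0;
}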
Timeline#
The initial target was the NVIDIA Kepler architecture, which had several limitations, such as managed allocations being restricted to the size of the GPU physical memory. Pascal GPUs such as the NVIDIA Titan X and the NVIDIA Tesla P100, together with CUDA 8.0, were the first to fully support UM, adding 49-bit virtual addressing, on-demand page migration and more. Support has since been improved further on Volta GPUs.

To query if UM is supported on a GPU:
int cudaDeviceProp::managedMemory
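For example, a minimal check on device 0 could be (error handling omitted):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.managedMemory)
    printf("Device 0 supports Unified Memory\n");

// Equivalently, via the attribute interface:
int managed = 0;
cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, 0);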
The main features added over this timeline are:
On-demand page migration:
Touching managed data triggers a page fault and a migration if the data is not resident in the memory of the processor that requested it. Page migration is completely invisible to the developer: the Page Migration Engine manages all data movement automatically, so pages that have previously migrated to the host (device) are moved back to the device (host) when they are needed there.

Much like CPUs, GPUs have multiple levels of TLBs (Translation Lookaside Buffers) to perform address translations. When a GPU accesses a page that is not resident in the local GPU memory, the translation for this page generates a fault message and locks the TLBs for the corresponding SM (Streaming Multiprocessor). This means any outstanding translations can proceed, but any new translations are stalled until all faults are resolved. This is necessary to keep the SM's view of memory consistent, since during page fault processing the driver may modify the page table and add or revoke access to pages. The GPU can generate many faults concurrently, and it is possible to get multiple fault messages for the same page. The Unified Memory driver processes these faults, removes duplicates, updates the mappings and transfers the data.
Note that on systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows that the pages are resident on that GPU. Upon launching a kernel, the CUDA runtime must migrate all pages previously migrated to host memory back to device memory, since these older GPUs cannot page fault.
Oversubscription:
On pre-Pascal architectures, any attempt to allocate more data than the GPU memory can hold results in an "out of memory" error. From Pascal onwards, the size of the GPU memory is no longer a limitation. Assume the GPU has 16 GB of memory and we would like to allocate 64 GB of data with cudaMallocManaged(). Obviously, not all of this data can be physically resident in GPU memory at the same time. However, as kernel code touches this allocation, any pages that are not physically resident are brought into residency in GPU memory by the demand paging system. When the GPU memory runs out, pages that have not been touched recently are evicted and are no longer physically resident in the GPU memory map. If the GPU code touches an evicted page, that page is migrated back to the GPU again.
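A hedged sketch of oversubscription on a 16 GB GPU (the touch kernel, sizes and launch configuration are our own, and the host must have enough memory to back the evicted pages):
__global__ void touch(char *a, size_t n)
{
    // grid-stride loop: touching a non-resident page faults it into GPU memory;
    // once the 16 GB of device memory are full, cold pages are evicted
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        a[i] = 1;
}

// host code (inside main):
const size_t bytes = 64ULL << 30;          // 64 GB, four times the GPU memory
char *a;
if (cudaMallocManaged(&a, bytes) != cudaSuccess)
    return 1;                              // pre-Pascal GPUs would fail here
touch<<<1024, 256>>>(a, bytes);
cudaDeviceSynchronize();
cudaFree(a);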
Concurrency:
Concurrent access to data from CPU and GPU causes a fatal segmentation fault on pre-Pascal architectures. There, the data in UM is only coherent as long as the CPU does not touch entries in managed data while the GPU is executing a kernel; the CPU can safely operate on it only after cudaDeviceSynchronize(). From Pascal onwards, the CPU and the GPU can access managed memory simultaneously, since both can handle page faults: if GPU code touches the data it updates it, and if CPU code touches the same page it updates that page too. Since kernel launches are asynchronous, the programmer does not know when and how the GPU code runs relative to the CPU code; they run at the same time and there is no way to know which access happens first. It is therefore up to the programmer to ensure that there are no race conditions caused by simultaneous accesses.
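A hedged sketch of safe simultaneous access on Pascal or later (the kernel, sizes and launch configuration are illustrative): the GPU updates one managed allocation while the CPU initialises another, and the host reads the kernel's output only after synchronising:
const size_t n = 1 << 20;
int *a, *b;
cudaMallocManaged(&a, n * sizeof(int));
cudaMallocManaged(&b, n * sizeof(int));

kernel<<<256, 256>>>(a, n);               // asynchronous launch: the GPU works on a[]
for (size_t i = 0; i < n; i++)
    b[i] = 0;                             // CPU touches a different managed allocation
                                          // (on pre-Pascal GPUs this would be a fatal fault)
cudaDeviceSynchronize();                  // required before the CPU reads a[]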
Atomic memory operations:
System-wide atomics allow us to access a managed allocation from all processors atomically. So, multiple processors will be able to do atomic operations on managed allocations even if those allocations are currently resident somewhere else.
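A hedged sketch (the counter name and kernel are our own): the _system suffix, available on compute capability 6.0 and later, makes the atomic visible to the CPU and other GPUs as well:
__device__ __managed__ int counter = 0;

__global__ void count_hits(void)
{
    atomicAdd_system(&counter, 1);        // atomic with respect to the CPU and all GPUs
}

// host side, e.g. in another thread running concurrently with the kernel:
// __sync_fetch_and_add(&counter, 1);     // CPU atomic on the same managed variable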
Access counters:
Enabled on Volta Architectures with CUDA 9.2, access counters are for counting remote accesses going between GPU and remote processors. These counters can be used internally to notify the driver when a certain page is accessed too often remotely so the driver can decide to move it to local memory. With access counters only hot pages will be moved to the GPU. If there are some pages that are not accessed frequently, they are not migrated.
ATS (Address Translation Service) support:
CUDA 9.2 also introduces ATS support. It allows the GPU to directly access the CPU's page tables, so there is a single page table shared across CPU and GPU, and that single page table can point to either CPU or GPU memory. This gives the GPU full access to CPU memory, for example to memory allocated directly with malloc().
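On such a system (e.g. POWER9 with Volta over NVLink), a plain malloc() pointer can be passed straight to a kernel. The following is only a hedged sketch with assumed names, and it will not work on machines without ATS support:
const size_t n = 1 << 20;
float *a = (float *)malloc(n * sizeof(float));   // ordinary system allocator
for (size_t i = 0; i < n; i++)
    a[i] = 0.0f;
kernel<<<256, 256>>>(a, n);                      // GPU dereferences the malloc'ed pointer
cudaDeviceSynchronize();
free(a);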
UM Languages Support#
Taken from this GTC presentation:
The languages covered, each with a managed-memory code example in the presentation, are C/C++, Fortran, Python, and OpenACC.
Background#
The traditional memory access model on GPUs relies on the programmer allocating data on both the CPU (e.g. with malloc()) and the GPU (with cudaMalloc()) and explicitly copying data before and after the kernel launch. Using cudaMemcpy(), we copy the input data to the device with the kind parameter cudaMemcpyHostToDevice and copy the result back to the host with cudaMemcpyDeviceToHost:
cudaError_t cudaMemcpy(void * dst, const void * src, size_t count, enum cudaMemcpyKind kind);
Two important milestones in the process that led to the release of Unified Memory are:
Zero-Copy Memory (Pinned Host Memory)#
Host data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked (pinned) host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory.
Since CUDA version 2.0, the transfer between pageable and pinned host arrays can be avoided by allocating the host arrays directly in pinned memory (zero-copy memory). Both the host and the device can access zero-copy memory, so GPU threads can directly access and process data located there. Pinned host memory is allocated with cudaMallocHost() or cudaHostAlloc(); alternatively, already-allocated pageable memory can be pinned afterwards with cudaHostRegister(). We then acquire a device pointer that refers to the same memory with cudaHostGetDevicePointer(), and no explicit copying is needed.
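A hedged sketch of the zero-copy pattern (the kernel and sizes are illustrative):
const size_t n = 1 << 20;
float *h_ptr, *d_ptr;
cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocMapped);  // pinned, mapped host memory
cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);                     // device alias of the same memory
kernel<<<256, 256>>>(d_ptr, n);          // GPU reads/writes host memory over the bus
cudaDeviceSynchronize();
cudaFreeHost(h_ptr);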
Unified Virtual Addressing (UVA)#
UVA has been available since CUDA version 4.0 on Fermi and later architectures. It provides a single virtual address space for all memory in the system and enables pointers to be accessed from GPU code no matter where in the system the memory resides. It is very similar to UM, but they are not the same: the main difference is that UVA does not automatically migrate data from one physical location to another, as UM does.
Allocations via cudaHostAlloc() are automatically portable across all devices for which UVA is enabled, and pointers returned by cudaHostAlloc() can be used directly from within kernels. The runtime can identify where the data is stored based on the value of the pointer. We do not need to call cudaHostGetDevicePointer() after cudaHostAlloc(), as the same address can be reused in the device memory space. An immediate consequence is that the cudaMemcpyKind flag in the cudaMemcpy() argument list becomes obsolete and can be replaced by the generic cudaMemcpyDefault.
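For instance, with UVA the copy direction can be left to the runtime (sizes are illustrative):
const size_t n = 1 << 20;
float *h_ptr, *d_ptr;
cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocDefault); // pinned host memory
cudaMalloc((void **)&d_ptr, n * sizeof(float));                          // device memory
// The runtime infers the host-to-device direction from the pointer values:
cudaMemcpy(d_ptr, h_ptr, n * sizeof(float), cudaMemcpyDefault);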
Performance Considerations#
Migrations are triggered by page faults. Page faulting may involve TLB invalidation, data migrations and page table updates, so each page fault carries some overhead that impacts overall performance. Relying on page faults to move large amounts of data, page by page, with overhead on each page, is inefficient, and misuse of UM can slow a code down dramatically. CUDA 8.0 introduces APIs for explicit prefetching (cudaMemPrefetchAsync()) and for giving the runtime memory usage hints (cudaMemAdvise()).
Prefetching#
Data prefetching means migrating data to a processor’s memory and mapping it in that processor’s page tables before the processor begins accessing that data. The intent of data prefetching is to avoid faults while also establishing data locality. The prefetching operation can be issued in a separate CUDA stream and overlapped with some compute work executing on the GPU.
cudaError_t cudaMemPrefetchAsync(const void *devPtr, size_t count, int dstDevice, cudaStream_t stream);
This call prefetches memory to the specified destination device, which can be any GPU or the CPU (cudaCpuDeviceId). It is analogous to cudaMemcpyAsync(); however, cudaMemPrefetchAsync() has more work to do than cudaMemcpyAsync() (it also updates the page tables on the CPU and GPU), which means the call can take substantially longer to return than an ordinary asynchronous copy.
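A hedged sketch of the typical pattern on device 0 (the kernel, buffer a and sizes are illustrative), issuing the prefetches in a stream so they can overlap with other work:
cudaStream_t s;
cudaStreamCreate(&s);

cudaMemPrefetchAsync(a, bytes, 0, s);               // migrate to GPU 0 before the kernel
kernel<<<256, 256, 0, s>>>(a, n);                   // no page faults on first GPU access
cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId, s); // bring the results back to the host
cudaStreamSynchronize(s);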
Explicit Memory Hints#
cudaMemAdvise() allows us to inform the CUDA runtime how the data will actually be used by giving it hints. Note that these are only suggestions to the runtime: there is no guarantee that the UM system will apply them, and they do not trigger data movement by themselves.
cudaError_t cudaMemAdvise(const void *devPtr, size_t count, enum cudaMemoryAdvise advice, int device);
where advice can take the following values:
cudaMemAdviseSetReadMostly:
This implies that the data is mostly going to be read by most processors and only occasionally written to. The UM system makes a local copy of the data in a processor's memory when that processor accesses it, so processors can read the data simultaneously without page faults. Writes are expensive: to maintain coherency, a write by any processor invalidates all copies except the one written and forces a migration.
cudaMemAdviseSetPreferredLocation:
This suggests to the runtime that it would be best if the data were resident on a particular processor. Setting the preferred location does not cause data to migrate to that location immediately. If the data is already in its preferred location and another processor touches it, the UM system can establish a mapping so that read/write requests go over the bus (PCIe or NVLink) without the data being migrated. On the other hand, if the data is not in its preferred location, or if a direct mapping cannot be established, it is migrated to the processor accessing it.
cudaMemAdviseSetAccessedBy:
This says that a particular processor should access the data through a mapping instead of a migration (a short usage sketch follows this list).
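A hedged sketch of the three hints applied to a managed buffer (the buffer a, its size and the choice of device 0 are our own):
// a is a managed allocation of `bytes` bytes; device 0 is the GPU we run on
cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, 0);               // read-duplicate on access
cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, 0);        // prefer GPU 0 memory
cudaMemAdvise(a, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId); // let the CPU map it instead of migrating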
There are a couple of ways to increase the performance of UM applications, such as the warp-per-page approach or overlapping kernels and prefetches using CUDA streams. See this page for details. There are also useful tips here.
Example#
One particular use case of UM is to reduce the complexity of sharing dynamic data structures that contain pointers, such as linked lists, between CPU and GPU. For C++ programmers, it also eliminates the need for deep copies when classes with copy constructors are used appropriately. We will not go into details here; we will just illustrate a simple code, taken from the ORNL CUDA training page, that increments an array (of size 4 GB) on the GPU.
Kernel Code:
__global__ void inc(int *array, size_t n){
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < n){
    array[idx]++;
    idx += blockDim.x*gridDim.x;
  }
}
By using host pageable and device memory#
int *h_array, *d_array;
alloc_bytes(h_array, ds*sizeof(h_array[0])); // malloc()
cudaMalloc(&d_array, ds*sizeof(d_array[0]));
memset(h_array, 0, ds*sizeof(h_array[0]));
cudaMemcpy(d_array, h_array, ds*sizeof(h_array[0]), cudaMemcpyHostToDevice);
inc<<<256, 256>>>(d_array, ds);
cudaMemcpy(h_array, d_array, ds*sizeof(h_array[0]), cudaMemcpyDeviceToHost);
checkResult();

By using Unified Memory#
int *h_array;
alloc_bytes(h_array, ds*sizeof(h_array[0])); // cudaMallocManaged()
memset(h_array, 0, ds*sizeof(h_array[0]));
inc<<<256, 256>>>(h_array, ds);
cudaDeviceSynchronize();
checkResult();
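alloc_bytes() is a helper defined in the ORNL exercise and not shown here; to reproduce the snippets without it, the call can simply be replaced by the allocation named in the accompanying comment, for example:
h_array = (int *)malloc(ds * sizeof(h_array[0]));     // host pageable version
// or, for the Unified Memory version:
cudaMallocManaged(&h_array, ds * sizeof(h_array[0]));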

The total migration size is 4GB, which matches the setup. The profiler also reports the minimum and the maximum migration sizes. The minimum size usually equals the OS page size, which is 4KB on joules. In practice, the transfer size is not fixed to the OS page size and can vary significantly; the driver chooses it depending on how the memory is accessed. As we can see from the profiler output, the driver has transferred chunks of up to 996 KB.
The number 15226 above on the second line is not the total number of faults, but rather the number of page fault groups. The faults are written to a special buffer in system memory and multiple faults forming a group are processed simultaneously by the Unified Memory driver. We can check the total number of faults for each group:
[bgursoy@joules hw6]$ nvprof --print-gpu-trace --log-file UM.txt ./a.out
Recall that on the GPU the kernel launches without any migration overhead; when it accesses any absent pages, the GPU stalls execution of the accessing threads, and the Page Migration Engine migrates the pages to the device before resuming the threads. This means that the cost of the migrations is included in the kernel run time. In this kernel, every page in the array written by the CPU is accessed by the CUDA kernel on the GPU, causing the kernel to wait on a lot of page migrations.
Below is the output from NVIDIA visual profiler nvvp to show the timeline with page faults and migration events. We just call:
[bgursoy@joules hw6]$ nvvp ./a.out

Initialize the data in a kernel#
To eliminate the migration overhead during kernel execution in the above example, we move the data initialisation to the GPU in another CUDA kernel, simply replacing the memset() in the host code with a call to an init kernel.
__global__ void init(int *array, size_t n) {
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < n){
    array[idx]=0;
    idx += blockDim.x*gridDim.x;
  }
}
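The host code then becomes (a sketch following the earlier snippets; only the initialisation line changes):
int *h_array;
alloc_bytes(h_array, ds*sizeof(h_array[0])); // cudaMallocManaged()
init<<<256, 256>>>(h_array, ds);             // first touch now happens on the GPU
inc<<<256, 256>>>(h_array, ds);              // no host-to-device page faults during this kernel
cudaDeviceSynchronize();
checkResult();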

There are still device-to-host page faults, but this is due to the loop at the end of the program that checks the results on the CPU.
This gives a speedup of up to 3x compared to the UM-only code, due to removing the data migration to the GPU.

By using Unified Memory with prefetching#
int *h_array;
alloc_bytes(h_array, ds*sizeof(h_array[0])); // cudaMallocManaged()
memset(h_array, 0, ds*sizeof(h_array[0]));
cudaMemPrefetchAsync(h_array, ds*sizeof(h_array[0]), 0);
inc<<<256, 256>>>(h_array, ds);
cudaMemPrefetchAsync(h_array, ds*sizeof(h_array[0]), cudaCpuDeviceId);
cudaDeviceSynchronize();
checkResult();

We can see that there are no longer any GPU page faults reported, and the host to device and device to host transfers are shown as just 2MB transfers.
We can observe a speedup of up to 4x compared to the UM-only code, which is almost the same performance as the original code with explicit copies.

For large arrays#
We will quickly illustrate how UM handles very large data transfer between GPU and CPU for an array of size 20GB.

Here, we can observe UM oversubscription in the timeline.
By using CUPTI API#
We can also use CUPTI to collect UM counters about the page transfers during the application run with the CUPTI Activity API.
UM counters need to be configured before enabling the activity with CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER. There is an example in the CUPTI samples provided with the NVIDIA CUDA SDK. We apply the same approach to the example code above that increments the array elements.
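As a hedged sketch of the configuration step, following the layout of the CUPTI unified_memory sample, the byte-transfer counters can be enabled like this (error checking omitted):
#include <cupti.h>

CUpti_ActivityUnifiedMemoryCounterConfig config[2];
config[0].scope    = CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_SCOPE_PROCESS_SINGLE_DEVICE;
config[0].kind     = CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_BYTES_TRANSFER_HTOD;
config[0].deviceId = 0;
config[0].enable   = 1;
config[1]          = config[0];
config[1].kind     = CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_BYTES_TRANSFER_DTOH;

// configure the counters, then enable the UM counter activity kind
cuptiActivityConfigureUnifiedMemoryCounter(config, 2);
cuptiActivityEnable(CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER);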
We modified it a bit to print more activity information as follows.

Following is some part of the output for the large array.

The values of the flags reported for the cause of the migration in this output are:
2: The unified memory was migrated to guarantee data coherence, e.g. CPU/GPU page faults on Pascal+ GPUs and kernel launches on pre-Pascal GPUs.
3: The unified memory was speculatively migrated by the UVM driver before being accessed by the destination processor, in order to improve performance (prefetching).
4: The unified memory was migrated to the CPU because it was evicted to make room for another block of memory on the GPU.