Simplifying CUDA Memory Management

Taking Advantage of C++ Features

I occasionally like playing around with CUDA C++, mostly because I don’t really play computer games anymore and I really need to justify the amount of money I’ve spent on graphics cards over the years. Regrettably, one such project (which I will write about later) likely took up more of my time than I would otherwise have spent one-shotting dragons with a bow and arrow in the most re-released game ever made.

Here I will attempt to explain how I used CUDA Unified Memory and a custom allocator to simplify memory management, specifically allocation and de-allocation, in cases where performance is not the primary concern. I won’t go into any of the asynchronous CUDA techniques, to keep things simple.

I’ve tried to include enough references and notes to be as useful as possible; if I’ve missed something, let me know.

See the gist on GitHub.

Memory Management In CUDA

Memory management in CUDA C++ can be arduous at best, and there are several options you have for managing memory transfers:

  • Standard pageable transfers, copying between host and device with cudaMemcpy.
  • Pinned (page-locked) host memory, which makes transfers faster.
  • Unified (managed) memory, where the CUDA runtime migrates pages between host and device for you.

I believe that if performance is your primary concern, you should be using pinned memory1, otherwise unified memory simplifies things considerably. For brevity, I won’t include a paged transfer example in this post, but it will be in the GitHub gist.

Some notes about the code in this post:

  • All of the CUDA specific functions I’ve written will be in the c9::cuda namespace.
  • There is a check_error function that wraps CUDA calls; if the error code returned by the call is not cudaSuccess, it throws an exception. The full version will be in the gist.
  • Hopefully the CUDA kernel will be self-explanatory, but c9::cuda::elementwise_add takes two input arrays of the same size and writes the sum of each pair of elements into an output array. This will also be in the gist; minimal sketches of both are shown after this list.
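
For reference, here are minimal sketches of both, based only on the descriptions above (the real versions live in the gist, and assume <stdexcept> and <cstddef> are included):

/*  One possible check_error: throws if the CUDA call did not succeed. */
inline auto check_error(cudaError_t code) -> void {
    if (code != cudaSuccess) {
        throw std::runtime_error { cudaGetErrorString(code) };
    }
}

/*  One possible elementwise_add: each thread sums a single pair of
    elements. */
__global__ void elementwise_add(const int* a, const int* b, int* out,
        std::size_t size)
{
    auto i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        out[i] = a[i] + b[i];
    }
}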

For more information on Unified Memory, see the Nvidia article Unified Memory for CUDA Beginners.

A Unified Memory Transfer

This is our motivating example. It creates and initialises three arrays backed by CUDA managed memory (Unified Memory), executes the kernel and prints the results.

/*  requires #include <array>, <algorithm>, <iostream> and <iterator> */

constexpr auto array_size = 3;

/*  Create placeholder variables that will point to memory allocated
    through CUDA. */
int * memory_a = nullptr;
int * memory_b = nullptr;
int * memory_out = nullptr;

/*  Allocated using unified memory */
check_error(cudaMallocManaged((void**)&memory_a,
        array_size * sizeof(int)));
check_error(cudaMallocManaged((void**)&memory_b,
        array_size * sizeof(int)));
check_error(cudaMallocManaged((void**)&memory_out,
        array_size * sizeof(int)));

/*  This looks slightly awkward, but what we are doing here is using 
    "placement new" to create arrays within the memory regions we have
    allocated. It is a convenient way to initialise the memory with the
    values we want, as well as access the result data.
    https://en.cppreference.com/w/cpp/language/new#Placement_new */
auto a = new(memory_a) std::array<int, array_size>{1, 2, 3};
auto b = new(memory_b) std::array<int, array_size>{4, 5, 6};
auto out = new(memory_out) std::array<int, array_size>{0, 0, 0};

c9::cuda::elementwise_add<<<1, array_size>>>(
        a->data(), 
        b->data(), 
        out->data(), 
        a->size());

/* Allow the kernel to finish executing */
check_error(cudaDeviceSynchronize());

/*  Another slightly awkward bit of C++: it writes each element to
    standard out, separated by a comma and a space, without having to
    write a for-loop. The irony of having to write an explanation in the
    same amount of space as a traditional loop is not lost on me.... */
std::copy(std::begin(*out), std::end(*out),
        std::ostream_iterator<int>(std::cout, ", "));
std::cout << "\n";   

/*  Arrays are trivially destructible and do not require a
    delete-expression, they can be removed by simply de-allocating their
    storage.
    See https://en.cppreference.com/w/cpp/language/destructor */
check_error(cudaFree(memory_a));
check_error(cudaFree(memory_b));
check_error(cudaFree(memory_out));

The example uses C-style casts to convert the address of each int* pointer (an int**) to void**, which is not idiomatic C++. You would typically reach for static_cast, but static_cast cannot convert an int** to a void** because the two pointer types are unrelated. You would therefore have to use reinterpret_cast, which is generally frowned upon. I’m going to stick to C-style casts here!
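
For illustration only, here are the two forms that do compile, using the same memory_a and memory_b pointers as above:

/*  static_cast<void**>(&memory_a) will not compile, because int** and
    void** are unrelated pointer types. These two forms do compile: */
check_error(cudaMallocManaged((void**)&memory_a,               /* C-style */
        array_size * sizeof(int)));
check_error(cudaMallocManaged(reinterpret_cast<void**>(&memory_b),
        array_size * sizeof(int)));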

When the kernel is launched, the device will attempt to read each array from its own memory. The device’s memory manager will find that the data is not resident and has to be retrieved from the host. This generates a “page fault” (because host and device memory is organised in “pages” – this is not special to CUDA; the same mechanism applies to other devices on the host, such as hard drives) and informs the CUDA memory manager to migrate the memory to the device.

Running the program should produce the following:

5, 7, 9

In any case, there is a fair amount of code that can be abstracted here, notably:

  • The cudaMallocManaged calls (and their casts) used to allocate each array.
  • The placement-new step used to initialise the memory.
  • The cudaFree calls used to release the memory.

This can all be achieved through a custom C++ allocator.

You can also prefetch unified memory pages to the device; I’ll come back to this in the Prefetching section below.

C++ Allocators

Allocators are to Standard Library containers as cake is to Civil Servants – without them they simply wouldn’t work. For example, whenever you use a std::vector<T>, there is an allocator behind the scenes controlling how the memory is managed. In fact, the actual type signature for a vector is std::vector<T, Allocator = std::allocator<T>>2, which is helpfully hidden from you. Put very basically, std::allocator<T> is responsible for allocating and de-allocating memory. Custom allocators can be used to allocate memory from a specific region of memory (e.g. stack memory, shared memory), or even memory pools (where it makes sense).
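
For example, these two declarations name exactly the same type; the allocator parameter is normally just left at its default:

#include <vector>

/*  std::vector<int> is shorthand for std::vector<int, std::allocator<int>> */
auto v1 = std::vector<int>{1, 2, 3};
auto v2 = std::vector<int, std::allocator<int>>{1, 2, 3};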

For information on the default allocator, see the C++ Reference entry for std::allocator.

Our Custom Allocator

Since C++11, writing an allocator is much simpler than it used to be and we only have to write a handful of functions and definitions that the standard library expects. I would highly recommend reading Howard Hinnant’s allocator boilerplate for more detail.

While the boilerplate claims that you don’t have to define pointer and size_type, I couldn’t get this to work in practice (with nvcc). I have defined them here as simply as possible.

In any case, we need to define the following:

/*  The allocator class. Requires #include <cstddef> (for std::size_t)
    and <stdexcept> (for std::runtime_error). */
template <typename T>
class unified_alloc
{
public:
    using value_type = T;
    using pointer = value_type*;
    using size_type = std::size_t;

    unified_alloc() noexcept = default;

    template <typename U>
    unified_alloc(unified_alloc<U> const&) noexcept {}

    auto allocate(size_type n, const void* = 0) -> value_type* {
        value_type * tmp;
        auto error = cudaMallocManaged((void**)&tmp, n * sizeof(T));
        if (error != cudaSuccess) {
            throw std::runtime_error { cudaGetErrorString(error) };
        }
        return tmp;
    }

    auto deallocate(pointer p, size_type n) -> void {
        if (p) {
            auto error = cudaFree(p);
            if (error != cudaSuccess) {
                throw std::runtime_error { cudaGetErrorString(error) };
            }
        }
    }
};

template <class T, class U>
auto operator==(unified_alloc<T> const &, unified_alloc<U> const &)
    -> bool
{
    return true;
}

template <class T, class U>
auto operator!=(unified_alloc<T> const &, unified_alloc<U> const &)
    -> bool
{
    return false;
}

While the C++ allocator requirements don’t specify which type of exception should be thrown, the standard library typically throws std::bad_alloc. The disadvantage of std::bad_alloc is that it doesn’t carry a message, which means you can’t use it to propagate any error message from the CUDA runtime.
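
For comparison, a bad_alloc-throwing allocate would look something like the sketch below; the CUDA error string is lost, which is why the allocator above throws std::runtime_error instead:

    auto allocate(size_type n, const void* = 0) -> value_type* {
        value_type * tmp;
        if (cudaMallocManaged((void**)&tmp, n * sizeof(T)) != cudaSuccess) {
            throw std::bad_alloc {};  /* no way to attach the CUDA message */
        }
        return tmp;
    }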

With respect to the equality operators: because this allocator is stateless, there is nothing that distinguishes one allocator of this type from another, so they always compare equal. What the comparison really expresses is whether one allocator is able to release memory allocated by another3 4.
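
As a quick illustration of what that equality means in practice:

/*  Because the allocator is stateless, any two instances are
    interchangeable: memory obtained from one can be released by another. */
auto alloc_a = unified_alloc<int>{};
auto alloc_b = unified_alloc<int>{};
auto * p = alloc_a.allocate(16);
alloc_b.deallocate(p, 16);   /* fine: alloc_a == alloc_b */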

Using our Allocator

Now that we have our allocator, we can use it with most of the standard library containers. The most useful one in this case is std::vector.

#include <vector>

auto data = std::vector<int, c9::cuda::unified_alloc<int>>{1, 2, 3};

This is somewhat inconvenient as we are repeating our type argument. However, we can create an alias template:

template <typename T>
using unified_vector = std::vector<T, unified_alloc<T>>;

And now, we can simply write:

auto data = c9::cuda::unified_vector<int>{};

Finally, our motivating example can be reduced to the following:

/* Create vectors as you would normally */
auto a = c9::cuda::unified_vector<int>{1, 2, 3};
auto b = c9::cuda::unified_vector<int>{4, 5, 6};
auto c = c9::cuda::unified_vector<int>(3); /* Sized initialisation */

/*  Launch kernel and synchronise device. */
c9::cuda::elementwise_add<<<1, a.size()>>>(
        a.data(), 
        b.data(), 
        c.data(),
        a.size());
check_error(cudaDeviceSynchronize());

/* Print contents of output vector */
std::copy(std::begin(c), std::end(c),
        std::ostream_iterator<int>(std::cout, ", "));
std::cout << "\n";

This vastly reduces the amount of code you have to write to handle Unified Memory. The benefits of this are:

  • No explicit cudaMallocManaged or cudaFree calls, and no raw pointers or casts.
  • Initialisation works exactly as it would for an ordinary std::vector.
  • The memory is released automatically when the vector goes out of scope.
  • The containers still work with the standard algorithms, as the printing example shows.

Prefetching

As mentioned earlier, you are able to prefetch memory from the host to the device using cudaMemPrefetchAsync. You can create a wrapper function that only exists for containers backed by our unified allocator such that if used on an ordinary container, a compile time error will be produced. This can be done with some template trickery.

There are probably a million ways you could do this, but I think this is broken up enough to be understandable:

/*  Define a default type trait; any instantiation of this with a type will
    contain a value of false:

        is_unified<int>::value == false
*/
template<typename T>
struct is_unified : std::false_type{};

/*  A specialisation of the above type trait. If the passed in type is in
    itself a template, and the inner type is our unified allocator, then
    the trait type will contain a true value:

        is_unified<std::vector<int>>::value == false
        is_unified<c9::cuda::unified_vector<int>>::value == true

    Remembering that the actual signature for both the stdlib and our CUDA 
    vector is something like:

        vector<int, allocator<int>>
*/
template<template<typename, typename> typename Outer, typename Inner>
struct is_unified<Outer<Inner, unified_alloc<Inner>>> : std::true_type{};


/*  A helper variable template that tells us whether the passed-in type
    uses the unified allocator, without having to spell out the type
    trait directly */
template<typename T>
constexpr static auto is_unified_v = is_unified<T>::value;
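
/*  A couple of illustrative compile-time checks of the trait, assuming
    the unified_vector alias defined earlier and #include <vector>: */
static_assert(is_unified_v<unified_vector<int>>,
        "a unified_vector uses the unified allocator");
static_assert(!is_unified_v<std::vector<int>>,
        "an ordinary std::vector does not");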


/*  This uses template substitution to generate a function that only exists
    for types that use the unified allocator. If is_unified_v<T> is false,
    std::enable_if_t does not name a type, the substitution fails, and
    because a failed substitution is not an error (SFINAE), the function
    simply does not exist for that type.

    get_current_device is a utility function that uses the CUDA API to get
    the ID of the current device. It will be in the gist. */
template <typename T, typename = std::enable_if_t<is_unified_v<T>>>
auto prefetch(T const & container,  cudaStream_t stream = 0, 
        int device = get_current_device())
{
    using value_type = typename T::value_type;
    auto p = container.data();
    if (p) {
        check_error(cudaMemPrefetchAsync(p, container.size() *
            sizeof(value_type), device, stream));
    }
}

cudaMemPrefetchAsync has a default argument for the last parameter, the CUDA stream. Because it is an asynchronous function, you can use CUDA streams to group operations together, such that operations in different streams can be performed concurrently5. The default argument represents the default stream; this can be thought of as the stream that all of the synchronous operations use.

Now we should be able to do the following:

auto data = c9::cuda::unified_vector<int>{1, 2, 3};
c9::cuda::prefetch(data);
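
And if you want the prefetch to happen on a stream other than the default one (see the note on streams above), something like the following should work:

/*  Illustrative only: prefetch on an explicitly created stream. */
cudaStream_t stream;
check_error(cudaStreamCreate(&stream));
c9::cuda::prefetch(data, stream);
check_error(cudaStreamSynchronize(stream));
check_error(cudaStreamDestroy(stream));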

If you were to try it on an ordinary vector, you would get a compile time error:

auto data = std::vector<int>{1, 2, 3};
c9::cuda::prefetch(data);

Results in:

./cudavector.cu(244): error: no instance of function template "c9::cuda::prefetch" matches the argument list
            argument types are: (std::vector<int, std::allocator<int>>)

1 error detected in the compilation of "./cudavector.cu".

Conclusion

Hopefully you’ll be able to see that, with these techniques, managing unified memory transfers can be made much easier.

Footnotes & References

  1. NVIDIA Developer: Maximizing Unified Memory Performance in CUDA 

  2. C++ Reference: std::vector 

  3. C++ Reference: Allocator Named Requirements: Relationships between instances 

  4. StackOverflow: Why C++ custom allocator needs comparison operators? 

  5. NVIDIA Developer: CUDA 7 Streams Simplify Concurrency