KernelAbstractions.jl is a package that allows you to write GPU-like kernels that target different execution backends. It is intended to be a minimal and performant library that explores ways to best write heterogeneous code.
KernelAbstractions.jl is focused on performance portability. It is GPU-biased, and therefore the kernel language has several constructs that are necessary for good performance on the GPU but may hurt performance on the CPU.
Kernel functions have to be marked with the `@kernel` macro. Inside the `@kernel` macro you can use the kernel language. As an example, the `mul2` kernel below will multiply each element of the array `A` by `2`. It uses the `@index` macro to obtain the global linear index of the current workitem.
```julia
@kernel function mul2(A)
    I = @index(Global)
    A[I] = 2 * A[I]
end
```
You can construct a kernel for a specific backend by calling the kernel function with the first argument being the device kind, the second argument being the size of the workgroup, and the third argument being a static `ndrange`. The second and third arguments are optional. After instantiating the kernel, you can launch it by calling the kernel object with the right arguments and some keyword arguments that configure the specific launch. The example below creates a kernel with a static workgroup size of `16` and a dynamic `ndrange`. Since the `ndrange` is dynamic, it has to be provided at launch time as a keyword argument.
```julia
A = ones(1024, 1024)
kernel = mul2(CPU(), 16)
event = kernel(A, ndrange=size(A))
wait(event)
all(A .== 2.0)
```
All kernel launches are asynchronous: each launch produces an event token that has to be waited upon before reading or writing memory that was passed as an argument to the kernel. See the documentation on dependencies for a full explanation.
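The event-based launch discipline above can be sketched as follows. This is a minimal sketch, assuming the event API shown in the previous example; the `add1!` kernel is illustrative, and the `dependencies` keyword is how one launch is ordered after another in this API:

```julia
using KernelAbstractions

@kernel function add1!(A)
    I = @index(Global)
    A[I] += 1
end

A = zeros(64)
kernel = add1!(CPU(), 16)

# First launch returns an event token immediately; the kernel may
# still be running when this line finishes.
ev1 = kernel(A, ndrange=size(A))

# The second launch reads and writes the same memory, so it must not
# start before the first finishes: pass the first event as a dependency.
ev2 = kernel(A, ndrange=size(A), dependencies=(ev1,))

# Only after waiting is it safe to read A on the host.
wait(ev2)
```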
- Functions inside kernels are forcefully inlined, except when marked with `@noinline`.
- Floating-point multiplication, addition, subtraction are marked contractable.
- The kernels are automatically bounds-checked against either the dynamic or statically provided `ndrange`.
- Functions like `Base.sin` are mapped to GPU-compatible implementations.
- `@scratch` has been renamed to `@private`, and the semantics have changed. Instead of denoting how many dimensions are implicit on the GPU, you only ever provide the explicit number of dimensions that you require. The implicit CPU dimensions are appended.
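The new `@private` semantics can be sketched as below. This is a hypothetical `colsum!` kernel, not from the original text: only the one explicit dimension the workitem needs is declared, and any implicit CPU dimensions are appended by the runtime:

```julia
using KernelAbstractions

@kernel function colsum!(out, A)
    I = @index(Global)
    # Declare exactly the private storage this workitem requires:
    # one explicit dimension of size 1. No implicit GPU dimensions
    # are spelled out; implicit CPU dimensions are appended for you.
    acc = @private eltype(A) (1,)
    acc[1] = zero(eltype(A))
    for j in 1:size(A, 2)
        acc[1] += A[I, j]
    end
    out[I] = acc[1]
end
```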