GPU Computation Technology Preview
One of the major research and development efforts at Fabric Engine is the implementation of generic GPU computation as a primary feature of the Fabric Engine Core. GPU computation in the Fabric Engine Core will allow Fabric Engine operators (the user-created functions that are responsible for high-performance, native, parallel computation) to run, unmodified, on GPU cores. While GPU cores are individually slower than CPU cores, a typical high-end graphics card has hundreds if not thousands of such cores available, whereas a typical high-end CPU has at most eight cores. For highly parallel operations, correctly exploiting GPU cores can result in enormous performance gains.
The purpose of this document is to provide a high-level outline of the development work that has been done to date to enable preliminary GPU computation within the Fabric Engine Core when running on AMD HSA hardware and to provide preliminary performance results obtained from running a GPU computation-enabled build of Fabric Engine.
Background
Fabric Engine is a software development platform that enables the execution of high-performance, native code within a context that automatically leverages parallel computation. The code that drives computation in Fabric Engine is written in a language called KL. KL is a high-level, procedural language with a familiar syntax that provides many language features well suited to computational problems, such as a rich type system with structures, fixed- and variable-length arrays, and dictionaries; operator overloading with a simple syntax; function polymorphism; and constructs for parallel computation. The KL compiler uses LLVM as its code generation layer. By using LLVM, the KL compiler is able to target a wide variety of hardware platforms with minimal change.
The AMD HSA technology platform has the goal of providing a heterogeneous computation platform in which both CPU and GPU cores access and manipulate memory identically. HSA will enable complex data structures with pointer indirection to be shared between the CPU and GPU. Not only will no copying of data between different memory spaces be necessary, but the pointers embedded in a complex data structure will be usable without change on both CPU and GPU cores.
In collaboration with AMD, the Fabric Engine development team has extended the KL compiler and Fabric Engine Core execution environment to support GPU computation on high-end AMD GPUs. The primary means by which this preliminary work was possible was the availability of an LLVM back end for AMD GPU hardware.
Development Work
The largest change that was required in order to support GPU computation was support for multiple memory spaces within the KL compiler. Unlike a desktop PC CPU, which has only a single memory space, a typical GPU has several different address spaces; this is reflected in the __global, __local, __private and __constant qualifiers in the OpenCL language. The KL compiler was originally developed under the assumption that all values reside in the same memory space, and this assumption had to be broken in order to support GPU computation.
The KL compiler was designed so that all internal functionality (for example, the implementation of an array indexing operator) is provided in a platform-neutral way through the use of automatically generated LLVM IR rather than dependence on external, pre-compiled libraries. This design decision was made in order to enable maximal use of the LLVM optimizer's inlining ability. As it turned out, this decision was also critical for the implementation of GPU compute.
Rather than forcing programmers working with KL to decorate their source code with memory space specifiers as they have to do in OpenCL, the modified KL compiler internally tracks the memory space associated with all variables and function arguments and lazily generates LLVM IR with the appropriate memory spaces as needed. If, for instance, the user writes a function that modifies a value through a reference to a variable, and that function is called on two different values that reside in two different memory spaces, a different version of the function is generated in LLVM IR for each memory space. This behaviour is completely transparent to the user who is writing the operator in KL; as far as the user is concerned, there is only one function. By automatically taking care of this critical detail, the KL language remains as user-friendly as possible while at the same time avoiding cumbersome workarounds such as hidden moves of data between different memory spaces.
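The lazy, per-memory-space specialization described above can be sketched roughly as follows. This is a toy Python model, not the KL compiler's implementation: the class name, the mangling scheme and the memory-space names are all assumptions made for illustration, and "emitting LLVM IR" is reduced to recording a mangled name.

```python
from dataclasses import dataclass, field

# Hypothetical memory-space names, borrowed from the OpenCL qualifiers.
GLOBAL, LOCAL, PRIVATE, CONSTANT = "global", "local", "private", "constant"


@dataclass
class SpecializingCompiler:
    """Toy model of lazy, per-memory-space function specialization."""
    cache: dict = field(default_factory=dict)
    emitted: list = field(default_factory=list)

    def specialize(self, name, arg_spaces):
        """Return the mangled name of `name` specialized for `arg_spaces`.

        A new version is "generated" only the first time a given combination
        of argument memory spaces is requested.
        """
        key = (name, tuple(arg_spaces))
        if key not in self.cache:
            mangled = name + "__" + "_".join(arg_spaces)
            self.emitted.append(mangled)  # stands in for emitting LLVM IR
            self.cache[key] = mangled
        return self.cache[key]


compiler = SpecializingCompiler()
# The user wrote a single function, here called `scale`; it is called on a
# value in private memory and on values in global memory.
a = compiler.specialize("scale", [PRIVATE])
b = compiler.specialize("scale", [GLOBAL])
c = compiler.specialize("scale", [GLOBAL])  # cache hit, nothing new emitted
```

From the caller's point of view there is still one `scale`; only the set of generated versions grows, and only when a new combination of memory spaces actually occurs.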
In addition to the modifications to the KL compiler, a mechanism had to be added to the Fabric dependency graph to allow the data associated with dependency graph nodes to lie within GPU memory. This was necessary because the AMD toolchain we are working with does not yet have support for a unified memory space between the GPU and the CPU, and it is imperative for maximal performance that the results of GPU computations not be copied back to the CPU if they are not needed there. Since the Fabric Engine Creation Platform’s scene graph does its rendering using OpenGL VBOs, a feature was added that allowed the scene graph to directly access the VBO ID of a memory block containing the result of a GPU computation without copying that result back to the CPU.
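The idea of keeping results in GPU memory and copying them back only on demand can be illustrated with a small Python model. Everything here is hypothetical (the `NodeData` class, the `vbo_id` handle, the readback counter), and plain lists stand in for GPU buffers; it is a sketch of the bookkeeping, not of the Fabric dependency graph itself.

```python
class NodeData:
    """Toy model of dependency-graph node data that may live in GPU memory."""

    def __init__(self, values):
        self._host = list(values)  # CPU-side copy
        self._device = None        # stands in for a GPU buffer / VBO
        self.vbo_id = None         # hypothetical handle the renderer can bind
        self.readbacks = 0         # number of GPU -> CPU copies performed

    def run_on_gpu(self, kernel, vbo_id):
        """Apply `kernel` element-wise and leave the result on the "GPU"."""
        self._device = [kernel(v) for v in self._host]
        self._host = None          # the host copy is now stale
        self.vbo_id = vbo_id

    def to_host(self):
        """Copy back to the CPU only when a CPU consumer asks for the data."""
        if self._host is None:
            self._host = list(self._device)
            self.readbacks += 1
        return self._host


positions = NodeData([1.0, 2.0, 3.0])
positions.run_on_gpu(lambda v: 2.0 * v, vbo_id=7)

# Rendering path: the scene graph only needs the buffer handle.
handle = positions.vbo_id
copies_after_render = positions.readbacks

# CPU path: a single readback happens, on first access only.
host_values = positions.to_host()
positions.to_host()
```

In this model the rendering path never triggers a copy, which mirrors the motivation given above for exposing the VBO ID directly.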
Results
As a preliminary demo of this work, builds of the Fabric Engine Core and the Fabric Engine Creation Platform were created with support for GPU computation on the AMD HSA hardware platform. A sample scene that shows a model whose vertex positions and normals are modified by a Fabric operator written in KL was modified to allow the user to choose to compile the operator for either the CPU or GPU. The effect is modulated by an amount between 0 and 1, and the amount is animated to continually go from 0 to 1 and then back; this allows for simpler measurement of the average frames per second. An image of this application is given in the figure below.
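One simple way to animate an amount that goes from 0 to 1 and back is a triangle wave. The sketch below assumes a linear ramp and a two-second period; neither is stated in this document, and the function name is made up for illustration.

```python
def animated_amount(t, period=2.0):
    """Triangle wave: 0 -> 1 over the first half period, 1 -> 0 over the second."""
    phase = (t % period) / period  # position within the cycle, in [0, 1)
    return 1.0 - abs(2.0 * phase - 1.0)


samples = [animated_amount(t) for t in (0.0, 0.5, 1.0, 1.5, 2.0)]
```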
The animated scene was run on a workstation with an AMD A10-5800K APU with both integrated graphics and a discrete Radeon HD 7800 card; however, only the discrete card was used for GPU computation and OpenGL rendering for these tests.
Two different deformations were run against four different models. The models increase in number of vertices by about one order of magnitude at each step. In the case of both deformations, the KL code that drives the deformation is identical in the CPU and GPU case; the only difference is that the Fabric Engine Core was told to run the deformation on the GPU rather than on the CPU.
The first deformation was a simple push of each vertex position in the direction of the normal, modulated by the animated amount parameter. This is a relatively simple computation; the KL source code for the operator is the following:
operator simplePushDeformOp(
  Vec3 origPositions<>,
  Vec3 origNormals<>,
  Index index,
  Scalar amount,
  io Vec3 position,
  io Vec3 normal
)
{
  position = origPositions[index] + amount*origNormals[index];
  normal = origNormals[index];
}
The performance results are as follows:
Cow model:
- Model size (vertices) : 3,599
- CPU FPS (average) : 500
- CPU Vertices Per Second: 1,799,500
- GPU FPS (average) : 300
- GPU Vertices Per Second : 1,079,700
bunny_ptx model:
- Model size (vertices) : 34,835
- CPU FPS (average) : 370
- CPU Vertices Per Second: 12,888,950
- GPU FPS (average) : 260
- GPU Vertices Per Second : 9,057,100
Hebemissin model:
- Model size (vertices) : 191,796
- CPU FPS (average) : 110
- CPU Vertices Per Second: 21,097,560
- GPU FPS (average) : 270
- GPU Vertices Per Second : 51,784,920
demon model:
- Model size (vertices) : 1,367,264
- CPU FPS (average) : 24
- CPU Vertices Per Second: 32,814,336
- GPU FPS (average) : 210
- GPU Vertices Per Second : 287,125,440
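The "Vertices Per Second" figures above are the model size multiplied by the average FPS. A short Python sketch that reproduces them from the reported sizes and frame rates (the variable and function names are illustrative only):

```python
# (model, vertices, CPU FPS, GPU FPS) as reported for the simple push deformation
RESULTS = [
    ("Cow", 3_599, 500, 300),
    ("bunny_ptx", 34_835, 370, 260),
    ("Hebemissin", 191_796, 110, 270),
    ("demon", 1_367_264, 24, 210),
]


def vertices_per_second(vertices, fps):
    """Throughput: every vertex is deformed once per frame."""
    return vertices * fps


table = {
    name: (vertices_per_second(n, cpu), vertices_per_second(n, gpu))
    for name, n, cpu, gpu in RESULTS
}
for name, (cpu_vps, gpu_vps) in table.items():
    print(f"{name:12s} CPU {cpu_vps:>13,}  GPU {gpu_vps:>13,}")
```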
The second deformation was a push that is modulated by a standard Perlin noise function as well as the animated amount parameter. This is a more complex deformation to compute; for large models the performance is significantly lower when performing this deformation. The source code for the deformation is:
function Scalar fade(Scalar t) {
  return t * t * t * (t * (t * 6 - 15) + 10);
}
function Scalar lerp(Scalar t, Scalar a, Scalar b) {
  return a + t * (b - a);
}
function Scalar grad(Integer hash, Scalar x, Scalar y, Scalar z) {
  Integer h = hash & 15;
  Scalar u = h < 8 ? x : y, v = h < 4 ? y : h==12||h==14 ? x : z;
  return ((h&1) == 0 ? u : -u) + ((h&2) == 0 ? v : -v);
}
function Scalar pnoise (Scalar ix, Scalar iy, Scalar iz) {
  Scalar x = ix, y = iy, z = iz;
  const Scalar p[] = [
    151, 160, 137, 91, 90, 15, 131, 13, 201, 95, 96, 53,
    // ... remaining constants, total of 512
  ];
  Integer X = Integer(floor(x)) & 255, Y = Integer(floor(y)) & 255, Z = Integer(floor(z)) & 255;
  x -= floor(x); y -= floor(y); z -= floor(z);
  Scalar u = fade(x), v = fade(y), w = fade(z);
  Integer A = p[X]+Y, AA = p[A]+Z, AB = p[A+1]+Z,
          B = p[X+1]+Y, BA = p[B]+Z, BB = p[B+1]+Z;
  return lerp(w,lerp(v,lerp(u, grad(p[AA ], x, y, z), grad(p[BA ], x-1, y, z)),
                       lerp(u, grad(p[AB ], x, y-1, z), grad(p[BB ], x-1, y-1, z))),
                lerp(v, lerp(u, grad(p[AA+1], x, y, z-1 ), grad(p[BA+1], x-1, y, z-1)),
                        lerp(u, grad(p[AB+1], x, y-1, z-1), grad(p[BB+1], x-1, y-1, z-1))));
}
operator simplePushDeformOp(
  Vec3 origPositions<>, Vec3 origNormals<>, Index index, Scalar amount,
  io Vec3 position, io Vec3 normal
) {
  // The noise is sampled at the original (undeformed) vertex position.
  Vec3 oldPos = origPositions[index];
  Scalar n = 0.3*pnoise(0.3*oldPos.x, 0.3*oldPos.y, 0.3*oldPos.z);
  n = 3.0*pnoise(n*oldPos.x, n*oldPos.y, n*oldPos.z);
  position = origPositions[index] + amount*(n+1)/2*origNormals[index];
  normal = origNormals[index];
}
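For checking the deformation on the CPU outside Fabric Engine, the listing above can be ported to plain Python along these lines. This is a sketch under two loud assumptions: the permutation table is elided in the listing, so a seeded shuffle of 0..255 (repeated to 512 entries) stands in for it, and the noise is assumed to be sampled at the original vertex position. Because of the stand-in table, the values will not match the KL operator's output; only the structure is the same.

```python
import math
import random

# ASSUMPTION: stand-in permutation table, since the listing elides the real one.
_perm = list(range(256))
random.Random(0).shuffle(_perm)
P = _perm + _perm  # 512 entries, so P[i + 1] never runs off the end


def fade(t):
    return t * t * t * (t * (t * 6 - 15) + 10)


def lerp(t, a, b):
    return a + t * (b - a)


def grad(h, x, y, z):
    h &= 15
    u = x if h < 8 else y
    v = y if h < 4 else (x if h in (12, 14) else z)
    return (u if (h & 1) == 0 else -u) + (v if (h & 2) == 0 else -v)


def pnoise(x, y, z):
    # Lattice cell (wrapped to 0..255) and position within the cell.
    X, Y, Z = (int(math.floor(c)) & 255 for c in (x, y, z))
    x, y, z = x - math.floor(x), y - math.floor(y), z - math.floor(z)
    u, v, w = fade(x), fade(y), fade(z)
    A = P[X] + Y
    AA, AB = P[A] + Z, P[A + 1] + Z
    B = P[X + 1] + Y
    BA, BB = P[B] + Z, P[B + 1] + Z
    near = lerp(v, lerp(u, grad(P[AA], x, y, z), grad(P[BA], x - 1, y, z)),
                lerp(u, grad(P[AB], x, y - 1, z), grad(P[BB], x - 1, y - 1, z)))
    far = lerp(v, lerp(u, grad(P[AA + 1], x, y, z - 1), grad(P[BA + 1], x - 1, y, z - 1)),
               lerp(u, grad(P[AB + 1], x, y - 1, z - 1), grad(P[BB + 1], x - 1, y - 1, z - 1)))
    return lerp(w, near, far)


def noise_push(orig_pos, orig_normal, amount):
    """Python counterpart of the noise-modulated push for a single vertex."""
    px, py, pz = orig_pos
    n = 0.3 * pnoise(0.3 * px, 0.3 * py, 0.3 * pz)
    n = 3.0 * pnoise(n * px, n * py, n * pz)
    scale = amount * (n + 1) / 2
    return tuple(p + scale * c for p, c in zip(orig_pos, orig_normal))
```

Two properties hold regardless of the table used and make convenient sanity checks: the noise is zero at integer lattice points, and an amount of 0 leaves the vertex where it was.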
The performance results are as follows:
Cow model:
- Model size (vertices) : 3,599
- CPU FPS (average) : 370
- CPU Vertices Per Second: 1,331,630
- GPU FPS (average) : 250
- GPU Vertices Per Second : 899,750
bunny_ptx model:
- Model size (vertices) : 34,835
- CPU FPS (average) : 130
- CPU Vertices Per Second: 4,528,550
- GPU FPS (average) : 180
- GPU Vertices Per Second : 6,270,300
Hebemissin model:
- Model size (vertices) : 191,796
- CPU FPS (average) : 35
- CPU Vertices Per Second: 6,712,860
- GPU FPS (average) : 170
- GPU Vertices Per Second : 32,605,320
demon model:
- Model size (vertices) : 1,367,264
- CPU FPS (average) : 5.9
- CPU Vertices Per Second: 8,066,857.6
- GPU FPS (average) : 44
- GPU Vertices Per Second : 60,159,616
For small models, the CPU is faster: the cow model runs faster on the CPU under both deformations, as does the bunny_ptx model under the simple push. This is because the overhead of the OpenCL interface used to drive the GPU computation (which is not used for the CPU computation) dominates the actual computation for small models. For medium to large models, however, the GPU computation is significantly faster for both deformations. For the largest model, the difference approaches an order of magnitude (210 fps versus 24 fps for the simple push), and for the complex deformation on the largest model it produces a jump from 5.9 fps to 44 fps, which is the difference between non-real-time and real-time effects previews.
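The GPU-over-CPU speedups implied by the two result tables can be computed directly from the reported average frame rates. The Python sketch below uses only those reported numbers; the dictionary layout and names are illustrative.

```python
# (model, deformation) -> (CPU FPS, GPU FPS), as reported in the results above
FPS = {
    ("Cow", "simple"): (500, 300),
    ("bunny_ptx", "simple"): (370, 260),
    ("Hebemissin", "simple"): (110, 270),
    ("demon", "simple"): (24, 210),
    ("Cow", "noise"): (370, 250),
    ("bunny_ptx", "noise"): (130, 180),
    ("Hebemissin", "noise"): (35, 170),
    ("demon", "noise"): (5.9, 44),
}

# A speedup above 1.0 means the GPU build ran faster than the CPU build.
speedup = {key: gpu / cpu for key, (cpu, gpu) in FPS.items()}
gpu_wins = sorted(key for key, s in speedup.items() if s > 1.0)

for (model, deformation), s in speedup.items():
    print(f"{model:12s} {deformation:7s} {s:5.2f}x")
```

The ratio for a given model is the same whether it is computed from FPS or from vertices per second, since both columns share the same vertex count.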
Here is a video illustrating our results. Note that our video capture software severely reduces GPU performance.
Conclusion
The technology preview of GPU computation for Fabric Engine demonstrates that users of the Fabric Engine platform will be able to gain access to powerful GPU technology for computation without any changes to their application, and still see up to an order-of-magnitude performance increase.
