2D Graphics on Modern GPU

Is the traditional 2D imaging model nearing the end of its usefulness, or does it have a shiny future in the “modern graphics” world? I spent a week on a research retreat in a cottage in the woods to answer this question, as it shapes the future of UI toolkits. Performant UI must use GPU effectively, and it’s increasingly common to write UI directly in terms of GPU rendering, without a 2D graphics API as in the intermediate layer. Is that the future, or perhaps a mistake?

I have found that, if you can depend on modern compute capabilities, it seems quite viable to implement 2D rendering directly on GPU, with very promising quality and performance. The prototype I built strongly resembles a software renderer, just running on an outsized multicore GPU with wide SIMD vectors, much more so than rasterization-based pipelines.

Writing a 2D renderer is a fairly ambitious project, doubly so to make it run efficiently on GPU. I deliberately reduced the scope of the project in a number of ways to make it viable. Most importantly, I targeted only Metal 2.1, which is only a year old and only at 42.5% share among worldwide macOS users. Thus, I am targeting the near future of GPU and ignoring the past. My faithful readers will no doubt be curious how much of this work can be adapted to older GPUs, but I believe that’s a much more complex question, and one I deliberately did not address. (Other work like PathFinder is more appropriate.)

That said, I think the capabilities of Metal 2.1 are fairly mainstream and more so in coming years. From my perspective, it finally lets us program a GPU as if it were a big SIMD computer, which is basically what it’s been under the hood for a long time. Looking at newer features in, for example, CUDA 10, I don’t see anything that would profoundly change the way I would approach this problem. I believe this is a very attractive target for research and practice in efficient GPU implementation of classical algorithms.

It’s not surprising that 2D graphics can be efficiently implemented on GPU. The excellent 2014 Massively-Parallel Vector Graphics was a major inspiration to this work, and there have been follow-ups such as Li 2016 that promise even more performance. But both of these papers seemed very complex and focused narrowly on path rendering.

I had great fun implementing my prototype and learned a lot. Along the way I kept a notes document that recounts some of my struggles and touches on some deeper topics that I will only briefly touch on in this blog post. The code is available and could be fun to play with or look at.

I’ll stress again that this is a research prototype, not a finished product. The most promising use case for the work is likely CAD tools, where there might be quite complex scenes and it’s not necessarily practical to organize the UI drawing around GPU primitives (as opposed to 3D games, for example).

The architecture

Rendering starts with a “scene graph,” which is an on-GPU serialization of 2D drawing operations. It has a tree structure, in that operations like clipping are represented as a node with children; in the case of clipping, one child for the clip mask and another for the contents being clipped. It also has a graph structure, in that multiple instances can be shared by reference (with appropriate transform nodes to change their position). (Note: I didn’t get around to implementing much of the graph structure, but the prototype is designed to accommodate it without much trouble. I’m describing it anyway because it’s important to the motivation).

The imaging model allows per-pixel operations only; operations like blur are purposefully excluded. Thus, a simplistic approach to parallel rendering would be for each pixel in the target framebuffer to traverse the scene graph, applying the computation at each node (each of which can be seen as a small functional program), and finally writing the pixel color computed at the root node. That approach is of course quite inefficient, but forms the basis of what the code actually does.

To achieve performance, the code divides the target framebuffer into fixed size tiles, currently 16x16 pixels each. There are two passes, a tiling pass that creates a command list for each tile, and a rendering pass that consumes the command list, evaluating all 256 pixels of the tile in parallel, each pixel sequentially evaluating the commands for the tile.

Note that the approach to tiling is similar to PathFinder, but with important differences in the details. PathFinder renders intermediate alpha masks to a mask texture buffer, requiring a write and a read of global device memory, but I do all the blending in local memory in the rendering shader, as in MPVG. Minimizing global memory traffic is a major shared theme.

A 16x16 tile should be close to the sweet spot. It results in 128x96 (12k total) tiles for a typical 2048x1536 window. It’s not a huge amount of work to generate the tiles, but with fewer tiles it would be harder to exploit parallelism in the tiling phase. Similarly, if graphic elements are much smaller than the tile size, there would be wasted work during rendering, as (for the most part) the entire tile needs to be rendered for any element that touches the tile. But again, 16x16 is a good size for a threadgroup dispatch, to exploit parallelism within the tile, and the savings from the leftover parts of a tile would be offset by per-tile overhead as tiles get smaller. It’s always possible to tune such things, but it’s not reasonable to expect any big wins. (I will note, though, that MPVG uses a quadtree structure, which basically amounts to adapting tile size to the workload. There are potential savings, but I also think it adds a lot to their overall complexity.)

The rendering kernel (similar to a fragment shader) is fairly straightforward - it’s basically just computing signed area coverage for fills, distance fields for strokes, texture sampling for images and pre-rendered glyphs, and blends for clipping and compositing. The functional program represented by the scene graph is flattened into a linear sequence of operations, filtered of course to only those elements that touch the tile. For blend groups, nesting in the scene graph is represented by push/pop operations, with an explicit, local stack. (Again disclosure: I didn’t get too far into actually implementing blend groups, but it should be easy to see how they’d work).

Thus, most of the interesting parts are in tiling. That’s all about efficiently traversing the scene graph, quickly skipping over parts of the graph that don’t touch the tile being generated.

Similar to the simplistic rendering strategy above, a simple approach to tile generation would be to have a thread per tile (~12k threads), each of which sequentially traverses the scene graph. That’s a lot less work than doing a per-pixel traversal, but is still not great. As I’ll describe in the next section, the key to performance is a good serialization format for the scene graph, and SIMD techniques for extracting more parallelism from the traversal. The basic structure is there, though; the traversal of the scene graph and generation of tiles is at heart sequential, not relying on tricky GPU-compute techniques such as sorting.

Serialization

It’s often said that GPU is bad at data structures, but I’d turn that around. Most, but not all, data structures are bad at GPU. An extreme example is a linked list, which is still considered reasonable on CPU, and is the backbone of many popular data structures. Not only does it force sequential access, but it also doesn’t hide the latency of global memory access, which can be as high as 1029 cycles on a modern GPU such as Volta.

Well known to game developers, what is efficient on GPU is a structure-of-arrays approach. In particular, the tiling phase spends a lot of time looking at bounding boxes, to decide what belongs in each tile. Each group node in the graph has an array of bounding boxes of its children, 8 bytes each, and a separate array for the child contents. The core of the tiling pass is consuming those bounding box arrays, only dropping down to traverse the child when there’s an intersection.

Other than that, the serialization format is not that exotic, broadly similar to FlatBuffers or Cap’n Proto. As a digression, I find it amusing that the word for packing a data structure into a byte buffer is “serialization” even when it’s designed to be accessed in parallel. Maybe we should come up with a better term, as “parallel-friendly serialization” is an oxymoron.

While I mostly focused on parallel read access, I’m also intrigued by the possibility of generating the scene graph in parallel, which obviously means doing allocations in a multithread-friendly way. Nical has a good blog post on some of the issues.

Exploiting SIMD for bounding box culling

Now we get to the heart of the algorithm: going through an array of bounding boxes, looking for those that intersect a subset of tiles.

The core computational model provided by shader languages is an independent thread per tiny grain of work (vertex, fragment, etc.), and the compiler and hardware conspire mightily in support of that illusion. You’ll hear numbers like 2560 cores, and it’s very difficult to wrap one’s mind around that. For workloads typical of shadertoy, you don’t have to think too much about it, it magically gets through an impressive amount of computation per pixel.

The reality is very different. It’s also useful to think of a GPU as a SIMD computer with dozens of cores, each of which has a SIMD width of hundreds of bits. If you write code optimized for, say, a 24 core computer with 512 bit SIMD, or 12 cores x 1024 bits wide, that’s likely to run well on an Intel Iris 640. That’s not actually what it is, but the details are shrouded in mystery, so I tell myself these simplified stories to keep myself comfortable. Note that these numbers aren’t that different than a high end desktop or server chip. (Also see the notes doc for why I have two different numbers here, kind of a fun story that kept me up a bit one night)

In keeping with the 3D graphics tradition of clear and consistent naming, the SIMD concept is called SIMD groups on Metal, warps on Nvidia, wavefronts on AMD, and subgroups on Vulkan. (But note that there is an important distinction between pure SIMD and the “SIMT” concept in newer Nvidia models, see this presentation on cooperative groups for more detail.)

So for running the tiling kernel, instead of having a few hundred or a couple thousand independent threads traversing the bounding box array, there are actually a few dozen “SIMD groups”, each of which is, say, 16 wide. In the simple version of the code, all the ALU’s in a SIMD group load the same bounding box, test against it, and go to the next iteration of the loop. We want to do better.

The current code gives the 16-wide SIMD group responsibility for a block of tiles (16 wide, 1 tall, in a typical threadgroup geometry). On an iteration of the loop, each ALU loads a different bounding box, then tests for intersection against a 256x16 region of the target frame buffer. It then shares the result of that test with the other ALU’s in the SIMD group (using the simd_ballot intrinsic). There’s another pass with finer grained checking, but in the common case where no bounding boxes intersect the 256x16 region, it can immediately go to the next iteration. This is literally a 16x increase in theoretical bandwidth for consuming the bounding boxes, and I see that borne out by measurement.

In similar fashion, the SIMD approach crunches through the segments of filled and stroked paths, quickly sifting to assign them to the relevant tiles. Inside the tiler are a number of other optimizations; for example tiles in the interior of a filled path just get a constant color. (This logic is similar to PathFinder and was inspired by it).

The performance is impressive. I haven’t done careful benchmarking yet, but the Ghostscript tiger, the standard benchmark of 2D graphics renders in a 2048x1536 window in 2.8ms of GPU time on Intel Iris 640 integrated graphics. (A fun fact, this is a 500x speedup over results I got 20 years ago). More careful empirical evaluation is needed, especially as methodology of GPU performance can be quite tricky. Also, there are a bunch more things that can be done to improve performance further.

Basically, I have confidence that it will render any reasonable UI scene, up to a high level of complexity, smoothly at 60 frames per second. It should be especially nice for data visualization, CAD, and of course tools for graphic artists. An especially nice feature is that the GPU does basically all the heavy lifting, freeing up the CPU for application logic.

Imaging model

The prototype mostly does just does fills and strokes of vector paths, but the architecture of the renderer is designed to accommodate a full 2D graphics imaging model. Basically, it can handle any operation that works on a pixel at a time. Those include:

Soft masking and blending
The PhotoShop blend modes (also present in PDF and other imaging models)
Tone mapping
Color space conversions, including CMYK
Halftone effects
Gradients
Images
A wide variety of distance-field rendering techniques

Probably the most important effect that is not included in this set is image-based blurring. That said, it is possible to get analytic or approximate blurring of many shapes, for example this approximate blurred rounded rectangle, which can easily be adapted.

I’m particularly interested in the rendering quality. All antialiasing and blending in the prototype is done in a linear sRGB colorspace, which makes for especially clear vector shapes without rope-like visual artifacts. In the notes document are more ideas about improving distance field rendering (hint: never use smoothstep).

I’m mostly focused on making high resolution (4k and even higher) rendering fast, but an intriguing topic is to lavish compute power on making the finest possible images for lower resolution. One idea is to apply RGB subpixel rendering techniques (similar to ClearType), but for general vector graphics, not just fonts. There are more ideas (including a link to a code sketch) in the notes doc.

Implications

The 2D rendering engine is a fairly central component of any graphics-intensive application. Its performance and quality characteristics can have profound implication for the rest of the system. As one example, if rendering is very slow, the system around it develops workarounds like rendering layers to textures and compositing them, which generally solves smooth scrolling but creates other problems. This work reopens the question: what should a system look like when rendering is really fast?

One such related topic is immediate mode vs retained mode UI, a longstanding controversy, with passionate defenders on both sides. To be very clear, this renderer will work well with both. But I think there’s a special affinity for retained mode, as I hope to explain briefly.

Very often in UI, the biggest challenge in performance is traversing the entire UI state in order to determine the new appearance. Immediate mode GUI solves this by writing the UI logic in a fast language, so that it reliably comes in under the time budget. But another approach is to minimize the work by only touching the parts of UI state that actually changed. In classical 2D, that often manifests as “damage regions,” so that only a subregion of the screen is repainted. That’s not very effective for scrolling or things like animation of layer opacity, and many people believe that damage regions are obsolete (I disagree, mostly for reasons of power consumption, but that’s a story for another day).

A related approach is to retain parts of the scene graph (also commonly called “display list”), updating only those that have actually changed. Then the renderer redraws the screen based on the updated graph. Updated parameters can include translation (for scrolling) or alpha, so only a tiny amount of data need be uploaded to the GPU from frame to frame. Flutter is a good modern approach to this, and its “layers” are one of the keys to its performance.

The piet-metal approach is designed to support this approach, by hosting the scene graph on the GPU, so that the process of painting a frame does not rely on replaying the scene graph data structure resident on the CPU into GPU drawing commands. For simple scenes, this may not matter much, but for very complex visuals the difference might be significant.

Discussion

The week in the woods was extremely rewarding, and I recommend the format. Stones Throw Farm was a great setting for the research retreat.

To be clear, what I have now is a research prototype. It only implements a subset of the imaging model, and only works on relatively recent GPU hardware. But I believe it has some very appealing properties, making it especially useful as the groundwork for next-generation UI.

I believe the venerable 2D imaging model has lots of life left in it, as there is compelling evidence (not just my own work) that it can be implemented efficiently on GPU. I did the work largely to inform what to include and exclude in the piet API - anything that cannot efficiently be implemented on GPU is off the table. I plan to go forward on the existing piet/druid plans, confident that I can use existing platform-based drawing libraries like Direct2D for now, and that highly performant GPU-based implementations are at least possible.

This work has benefitted from discussions with many, though of course the mistakes I’ve made are my own. In particular, thanks to Allan MacKinnon and his Spinel work for inspiring me to consider compute for rendering, Patrick Walton for many stimulating discussions, and Brian Merchant (our Google Summer of Code student on this project) for asking provoking questions.