Jekyll2024-01-24T23:20:51+00:00https://raphlinus.github.io/feed.xmlRaph Levien’s blogBlog of Raph Levien.
A note on Metal shader converter2023-06-12T18:05:42+00:002023-06-12T18:05:42+00:00https://raphlinus.github.io/gpu/2023/06/12/shader-converter<p>At WWDC, Apple introduced <a href="https://developer.apple.com/metal/shader-converter/">Metal shader converter</a>, a tool for converting shaders from DXIL (the main compilation target of HLSL in DirectX12) to Metal. While it is no doubt useful for reducing the cost of porting games from DirectX to Metal, I feel it does not move us any closer to a world of robust GPU infrastructure, and in many ways just adds more underspecified layers of complexity.</p>
<p>The specific feature I’m salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when <a href="https://en.wikipedia.org/wiki/Direct3D#Direct3D_11">Direct3D 11</a> and Shader Model 5 were first introduced. This barrier is not supported in Metal, and of the major GPU APIs, Metal is the only one that doesn’t support it. That holds back WebGPU’s performance (see <a href="https://github.com/gpuweb/gpuweb/discussions/3935">gpuweb#3935</a> for discussion), as WebGPU must be portable across the major APIs.</p>
<p>I’ve discussed the value of this barrier in my blog post <a href="https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portable.html">Prefix sum on portable compute shaders</a>, but I’ll briefly recap. Among other things, it enables a single-pass implementation of prefix sum, using a technique such as decoupled look-back or the <a href="https://dl.acm.org/doi/10.1145/2980983.2908089">SAM prefix sum</a> algorithm. A single-pass implementation can achieve the same throughput as memcpy, while a more traditional tree-reduction approach can at best achieve 2/3 that throughput, as it has to read the entire input in two separate dispatches. Further, tree reduction can actually be more complex to implement in practice, as the number of dispatches varies with the input size (it is typically <code class="language-plaintext highlighter-rouge">2 * ceil(log(n) / log(threadgroup size))</code>). Prefix sum, in turn, is an important primitive for advanced compute workloads. There are a number of instances of it in the <a href="https://github.com/linebender/vello">Vello</a> pipeline, and it’s also commonly used in stream compaction, decoding of variable length data streams, and compression.</p>
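<p>To make this concrete, here is a small Rust sketch (illustrative only, not Vello code) of what the prefix sum primitive computes, along with the dispatch-count formula above for tree reduction:</p>

```rust
// Exclusive prefix sum: out[i] is the sum of input[..i].
fn prefix_sum(input: &[u32]) -> Vec<u32> {
    let mut sum = 0;
    input
        .iter()
        .map(|&x| {
            let s = sum;
            sum += x;
            s
        })
        .collect()
}

// Dispatch count for a tree-reduction scan: ceil(log(n) / log(wg))
// reduction levels on the way down, and the same number back up.
fn tree_dispatches(n: u64, wg: u64) -> u32 {
    let mut levels = 0;
    let mut m = n;
    while m > 1 {
        m = (m + wg - 1) / wg;
        levels += 1;
    }
    2 * levels
}

fn main() {
    assert_eq!(prefix_sum(&[1, 2, 3, 4]), vec![0, 1, 3, 6]);
    // 1M elements, 256-wide threadgroups: 3 levels, so 6 dispatches.
    assert_eq!(tree_dispatches(1 << 20, 256), 6);
}
```

<p>A single-pass formulation produces the same result in one dispatch, reading and writing each element once, which is why its throughput can match memcpy.</p>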
<p>I believe there are other important techniques that are similarly unlocked by the availability of these primitives. For example, Nanite’s advanced compute pipelines schedule work through job queues, and in general it is not possible to reliably coordinate work between different threadgroups (even within the same dispatch) without such a barrier.</p>
<h2 id="complexity-and-reasoning">Complexity and reasoning</h2>
<p>The GPU ecosystem exists at the knife edge of being strangled by complexity. A big part of the problem is that features tend to inhabit a quantum superposition of existing and not existing. Typically there is an anemic core, surrounded by a cloud of optional features. The Vulkan ecosystem is notorious for this: the <a href="https://vulkan.gpuinfo.org/listfeaturesextensions.php">extension list at vulkan.gpuinfo.org</a> currently lists 146 extensions.</p>
<p>The widespread use of shader translation makes the situation even worse. When writing HLSL that will be translated into other shader languages, it’s no longer sufficient to consider <a href="https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/d3d11-graphics-reference-sm5">Shader Model 5</a> to be a baseline; rather, the developer needs to keep in mind all the features that don’t translate to other languages. In some cases, the semantics change subtly (the rules for the various flavors of “count leading zeros” vary when the input is 0), and in other cases, like these device-scoped barriers, the feature is missing entirely.</p>
<p>A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see <a href="https://github.com/gpuweb/gpuweb/issues/2229">gpuweb#2229</a>). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal’s semantics are based on C++, I’d personally love to see this resolved by adopting <a href="https://en.cppreference.com/w/cpp/atomic/atomic_ref">std::atomic_ref</a> from C++20 (Metal is still based on C++14). I’ll also note that the official Metal shader compiler tool generates <a href="https://gist.github.com/raphlinus/a8e0a3a3683127149b746eb37822bdc8">reasonable IR</a> for this pattern. It’s concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it’s probably not a big problem in practice.</p>
<p>I understand the incentives, but overall I find it disappointing that Metal chases shiny new features like ray-tracing, while failing to provide a solid, spec-compliant foundation for GPU compute.</p>
<h2 id="onward">Onward</h2>
<p>The Metal announcements from WWDC move us no closer to a world of robust GPU infrastructure. But there is much we can still do.</p>
<p>For one, there <em>is</em> a GPU infrastructure stack that is based on careful specification and conformance testing, and has two high quality, open source implementations enabling deployment to almost all reasonably current GPU hardware. I speak of course of WebGPU. It’s lacking the shiny features – raytracing, bindless, and cooperative matrix operations (marketed as “tensor cores” and quite important for maximum performance in AI workloads) – but what is there should work.</p>
<p>For two, we can cheer on the work of Asahi Linux. They have recently announced <a href="https://asahilinux.org/2023/06/opengl-3-1-on-asahi-linux/">OpenGL 3.1 support</a> on Apple Silicon, and an intent to implement Vulkan. That work may be highly challenging, as obviously that implies implementing barriers which the Apple GPU engineers haven’t been able to manage. But they have done consistently impressive work so far, and I certainly hope they succeed. If nothing else, their work will result in much better public documentation of the hardware’s capabilities and limitations.</p>
<p>I have a recommendation for Apple as well. I hope that they document which HLSL features are expected to work and which are not. Currently their documentation (which is admittedly beta) just says “Some features not supported,” which I personally find not very useful. I would also like to give them credit for clarifying the <a href="https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf">Metal Shading Language Specification</a> with respect to the scope of the <code class="language-plaintext highlighter-rouge">mem_device</code> flag to <code class="language-plaintext highlighter-rouge">threadgroup_barrier</code>. It now says, “The flag ensures the GPU correctly orders the memory operations to device memory for threads in the threadgroup,” which to a very careful reader does indicate threadgroup scope and no guarantee at device scope. Previously it <a href="https://github.com/gpuweb/gpuweb/pull/2297">said</a> “Ensure correct ordering of memory operations to device memory,” which could easily be misinterpreted as providing a device scope guarantee.</p>
<p>I am optimistic in the long term about having really good, portable infrastructure for GPU compute, but it is clear that we have a long way to go.</p>Simplifying Bézier paths2023-04-18T13:07:42+00:002023-04-18T13:07:42+00:00https://raphlinus.github.io/curves/2023/04/18/bezpath-simplify<p>Finding the optimal Bézier path to fit some source curve is, surprisingly, not yet a completely solved problem. Previous posts have given good solutions for specific instances: <a href="https://raphlinus.github.io/curves/2021/03/11/bezier-fitting.html">Fitting cubic Bézier curves</a> primarily addressed Euler spirals (which are very smooth), while <a href="https://raphlinus.github.io/curves/2022/09/09/parallel-beziers.html">Parallel curves of cubic Béziers</a> rendered parallel curves. In this post, I describe refinements of the ideas to solve the much more general problem of simplifying arbitrary paths. Along with the theoretical ideas, a reasonably solid implementation is landing in <a href="https://github.com/linebender/kurbo">kurbo</a>, a Rust library for 2D shapes.</p>
<p>The techniques in this post, and code in kurbo, come close to producing <em>the</em> globally optimum Bézier path to approximate the source curve, with fairly decent performance (as well as a faster option that’s not always minimal in the number of segments), though as we’ll see, there is considerable subtlety to what “optimum” means. Potential applications are many:</p>
<ul>
<li>Simplification of Bézier paths for more efficient publishing or editing</li>
<li>Conversion of fonts into cubic outlines</li>
<li>Tracing of bitmap images into vectors</li>
<li>Rendering of offset curves</li>
<li>Conversion from other curve types including NURBS, piecewise spiral</li>
<li>Distortions and transforms including perspective transform</li>
</ul>
<p>While the primary motivation is simplification of an existing Bézier path into a new one with fewer segments, the techniques are quite general. One of the innovations is the <code class="language-plaintext highlighter-rouge">ParamCurveFit</code> trait, which is designed for efficient evaluation of any smooth curve segment for curve fitting. Implementations of this trait are provided for parallel curves and path simplification, but it is intentionally open ended and can be implemented by user code.</p>
<p>This work is an opportunity to revisit some of the topics from my <a href="https://levien.com/phd/thesis.pdf">thesis</a>, where I also explored systematic search for optimal Bézier fits. The new work is several orders of magnitude faster and more systematic about not getting stuck in local minima.</p>
<h2 id="the-paramcurvefit-trait">The ParamCurveFit trait</h2>
<p>At its best, a trait in Rust is a way to model some fact about the problem, then serves as an interface or abstraction boundary between pieces of code. The <code class="language-plaintext highlighter-rouge">ParamCurveFit</code> trait models an arbitrary curve for the purpose of generating a Bézier path approximating that source curve. The key insight is to sample both the position and derivative of the source curve for given parameters, and there is also a mechanism for detecting cusps (particularly important for parallel curves).</p>
<p>The curve fitting process uses those position and derivative samples to compute candidate approximation curves, then measures the error between the source curve and the approximation.</p>
<p>Two implementations are provided: parallel curves of cubic Béziers, and arbitrary Bézier paths for simplification. Implementing the trait is not very difficult, so it should be possible to add more source curves and transformations, taking advantage of powerful, general mechanisms to compute an optimized Bézier path.</p>
<p>The core cubic Bézier fitting algorithm is based on measuring the area and moment of the source curve. A default implementation is provided by the <code class="language-plaintext highlighter-rouge">ParamCurveFit</code> trait implementing numerical integration based on derivatives (Green’s theorem), but if there’s a better way to compute area and moment, source curves can provide their own implementation.</p>
<p>For Bézier paths, an efficient analytic implementation is possible. No blog post of mine is complete without some reference to a monoid, and this one does not disappoint. It’s relatively straightforward to compute area and moments from a cubic Bézier using symbolic evaluation of Green’s theorem. The area and moments of a <em>sequence</em> of Bézier segments is the sum of the individual values for each segment. Thus, we precompute the prefix sum of the areas and moments for all segments in a path, then can query an arbitrary range in O(1) time by taking the difference between the end point and the start point.</p>
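<p>The monoid structure can be sketched as follows; the type and field names here are illustrative rather than kurbo’s actual API, which tracks more moments:</p>

```rust
// Illustrative sketch (not kurbo's actual types): per-segment area and
// moment integrals form a monoid under addition, so a prefix sum over
// the segments answers any contiguous range query in O(1).
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Moments {
    area: f64,
    moment_x: f64,
}

// sums[i] holds the totals for segments[..i]; one extra leading zero.
fn prefix_sums(per_segment: &[Moments]) -> Vec<Moments> {
    let mut acc = Moments::default();
    let mut sums = vec![acc];
    for m in per_segment {
        acc.area += m.area;
        acc.moment_x += m.moment_x;
        sums.push(acc);
    }
    sums
}

// Totals for segments in start..end, by differencing prefix sums.
fn range_moments(sums: &[Moments], start: usize, end: usize) -> Moments {
    Moments {
        area: sums[end].area - sums[start].area,
        moment_x: sums[end].moment_x - sums[start].moment_x,
    }
}

fn main() {
    let segs = [
        Moments { area: 1.0, moment_x: 0.5 },
        Moments { area: 2.0, moment_x: 1.0 },
        Moments { area: 3.0, moment_x: 1.5 },
    ];
    let sums = prefix_sums(&segs);
    // Segments 1..3 combined, without re-integrating anything.
    assert_eq!(range_moments(&sums, 1, 3), Moments { area: 5.0, moment_x: 2.5 });
}
```

<p>The payoff is that the fitting search can evaluate a candidate subdivision range without touching any of the underlying segments.</p>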
<h2 id="error-metrics">Error metrics</h2>
<p>The problem of finding an optimal Bézier path approximation is always relative to some error metric. In general, an optimal path is one with a minimal number of segments while still meeting the error bound, and a minimal error compared to other paths with the same number of segments. A curve fitting algorithm will evaluate the error metric many times, at least once for each candidate approximation, to determine whether it meets the error bound. The simplest technique subdivides in half when the bound is not met, but a more sophisticated approach attempts to optimize the positions of the subdivision points as well.</p>
<p>The main error metric used for curve fitting is <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance">Fréchet distance</a>. Intuitively, it captures the idea of the maximum distance between two curves while also preserving orientation of direction along a path (the related Hausdorff metric does not preserve orientation and so a curve with a sharp zigzag may have a small Hausdorff metric measured against a straight line). The difference is shown in the following illustration:</p>
<p><img src="/assets/hausdorf_vs_frechet.svg" alt="Comparison of Hausdorff and Fréchet distance metrics" class="center" /></p>
<p>Computing the exact Fréchet distance between two curves is not in general tractable, so we have to use approximations. It is important for the approximation to not underestimate, as this will yield a result that exceeds the error bound.</p>
<p>The classic <a href="https://ieeexplore.ieee.org/iel5/38/4055906/04055919">Tiller and Hanson</a> paper on parallel curves proposed a practical, reasonably accurate, and efficient error metric to approximate Fréchet distance. It samples the source curve at n points, and for each point casts a ray along the normal from the point, detecting intersections with the cubic approximation. The maximum distance from a source curve point to the corresponding intersection is the metric.</p>
<p>Unfortunately, there is a case Tiller-Hanson handles poorly, and equally unfortunately, it does come up when doing simplification of arbitrary paths. Consider a superellipse shape, approximated (badly) by a cubic Bézier with a loop. The rays strike only parts of the approximating curve and miss the loop entirely.</p>
<p><img src="/assets/simplify-t-h.svg" alt="Failure of Tiller-Hanson metric" /></p>
<p>Increasing the sampling density helps a little but doesn’t guarantee that the approximation will be well covered. Indeed, it is the high curvature in the source curve that makes this coverage more uneven.</p>
<p>A more robust metric is to parametrize the curves by arc length, and measure the distance between samples that share a corresponding portion of the total arc length. This ensures that all parts of both curves are considered. When curves are close (which is a valid assumption for curve fitting), it closely approximates Fréchet distance, though it can overestimate (because nearest points don’t necessarily coincide under arc length parametrization) and underestimate due to sampling.</p>
<p><img src="/assets/simplify-arc.svg" alt="Arc length parametrization is much more accurate" /></p>
<p>However, arc length parametrization is considerably slower (about a factor of 10) because it requires inverse arc length computations. The approach implemented in <a href="https://github.com/linebender/kurbo/pull/269">kurbo#269</a> is to classify whether the source curve is “spicy” (considering the deltas between successive normal angles) and use the more robust computation only in those cases.</p>
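<p>As a sketch of the arc length approach, here is a simplified Rust version operating on polyline flattenings of the two curves (the kurbo implementation works with the curves more directly); samples are matched by fraction of total arc length:</p>

```rust
fn dist(p: (f64, f64), q: (f64, f64)) -> f64 {
    ((p.0 - q.0).powi(2) + (p.1 - q.1).powi(2)).sqrt()
}

// Cumulative arc length at each vertex of a polyline.
fn cum_arclen(pts: &[(f64, f64)]) -> Vec<f64> {
    let mut out = vec![0.0];
    for w in pts.windows(2) {
        let prev = *out.last().unwrap();
        out.push(prev + dist(w[0], w[1]));
    }
    out
}

// Point at arc length s, by linear interpolation between vertices.
fn eval_at(pts: &[(f64, f64)], cum: &[f64], s: f64) -> (f64, f64) {
    let i = cum.partition_point(|&c| c < s).max(1).min(pts.len() - 1);
    let seg = cum[i] - cum[i - 1];
    let t = if seg == 0.0 { 0.0 } else { (s - cum[i - 1]) / seg };
    (
        pts[i - 1].0 + t * (pts[i].0 - pts[i - 1].0),
        pts[i - 1].1 + t * (pts[i].1 - pts[i - 1].1),
    )
}

// Max distance between points at matching fractions of total arc length.
fn arclen_distance(a: &[(f64, f64)], b: &[(f64, f64)], n: usize) -> f64 {
    let (ca, cb) = (cum_arclen(a), cum_arclen(b));
    let (la, lb) = (*ca.last().unwrap(), *cb.last().unwrap());
    (0..=n)
        .map(|k| {
            let f = k as f64 / n as f64;
            dist(eval_at(a, &ca, f * la), eval_at(b, &cb, f * lb))
        })
        .fold(0.0, f64::max)
}

fn main() {
    // Two parallel horizontal segments, 1 apart: the metric is exactly 1.
    let a = [(0.0, 0.0), (2.0, 0.0)];
    let b = [(0.0, 1.0), (2.0, 1.0)];
    assert!((arclen_distance(&a, &b, 8) - 1.0).abs() < 1e-12);
}
```

<p>Unlike ray casting, every sample on the approximation is paired with a sample on the source, so a loop or bump cannot escape measurement.</p>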
<h2 id="finding-subdivision-points">Finding subdivision points</h2>
<p>An extremely common approach is adaptive subdivision. Compute an approximation, evaluate the error metric, and subdivide in half when it is exceeded. This approach is simple, robust, and performant (the total number of evaluations is within a factor of 2 of the number of segments). However, it tends to produce results with more segments than absolutely needed. In the limit, it tends to be about 1.5 times the optimum.</p>
<p>Fancier techniques try to optimize the subdivision points to reduce the number of segments. That is equivalent to dividing the source curve into ranges such that each range is just barely below the threshold; ironically, it basically amounts to <em>maximizing</em> the error metric right up to the constraint.</p>
<p>One technique is given in section 9.6.4 of my <a href="https://levien.com/phd/thesis.pdf">thesis</a>. Basically you start on one end and for each segment find a subdivision point from the last point to one that’s just barely under the error threshold. Under the assumption that errors are monotonic (which is not always going to be the case), this finds the global minimum number of segments needed. The last segment will have an error well below the threshold. Then, another search finds the minimum error for which this process yields the same number of segments. Again, if error is monotonic, the result is the Fréchet distance of all segments being equal, which is (at least roughly) equivalent to the overall Fréchet distance being minimized.</p>
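<p>The first, greedy phase of that search can be sketched as follows, with a toy monotonic error function standing in for a real curve fitting error; the second phase is then a scalar search over the tolerance itself:</p>

```rust
// Sketch of the greedy phase: `err(t0, t1)` is assumed to be the fitting
// error over parameter range [t0, t1], monotonic in t1 and tending to
// zero as the range shrinks.
fn next_subdivision(err: &impl Fn(f64, f64) -> f64, t0: f64, tol: f64) -> f64 {
    if err(t0, 1.0) <= tol {
        return 1.0;
    }
    // Bisect for the farthest endpoint still under the error threshold.
    let (mut lo, mut hi) = (t0, 1.0);
    for _ in 0..30 {
        let mid = 0.5 * (lo + hi);
        if err(t0, mid) <= tol {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    lo
}

fn count_segments(err: &impl Fn(f64, f64) -> f64, tol: f64) -> usize {
    let (mut t, mut n) = (0.0, 0);
    while t < 1.0 {
        t = next_subdivision(err, t, tol);
        n += 1;
    }
    n
}

fn main() {
    // Toy error: quadratic in range length, so each segment can span
    // sqrt(tol) of the parameter range; sqrt(0.02) ≈ 0.1414, needing 8.
    let err = |t0: f64, t1: f64| (t1 - t0).powi(2);
    assert_eq!(count_segments(&err, 0.02), 8);
}
```

<p>With a monotonic error, each greedy step is a one-dimensional bisection, so the total cost is the segment count times a modest number of error evaluations.</p>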
<p>For smooth source curves, monotonic error is a reasonable assumption. Even so, the above technique seems to work fairly robustly, producing fewer segments than simple adaptive subdivision, though it is somewhere around 50x slower.</p>
<p>There may be scope to improve this optimization process further. Crates like <a href="https://crates.io/crates/argmin">argmin</a> implement general purpose multidimensional optimization algorithms, and it’s worth exploring whether those could produce equally good results with fewer evaluations.</p>
<h2 id="bumps">Bumps</h2>
<p>Something unexpected arose during testing: the resulting “optimum” simplified paths had bumps. Obviously at first I thought this was some kind of failure of the algorithm, but now I think something more subtle is happening.</p>
<p>The solution below is with an error tolerance of 0.15, which produces a total of 43 segments:</p>
<p><img src="/assets/simplify-bump.svg" alt="Simplified path showing a bump" /></p>
<p>Near the bottom is a bump. It almost looks like a discontinuity (and other similar examples even more so), but on closer examination it is simply one very long and one very short control arm (i.e. the distance between a control point and its corresponding endpoint), which creates a high degree of curvature variation:</p>
<p><img src="/assets/simplify-bump-zoom.png" alt="Closeup of bump in previous illustration" /></p>
<p>The underlying problem is that Fréchet distance optimizes only for a distance metric, and does not by itself guarantee a low angle (or curvature) error. In many cases, these objectives are not in tension – the curve that minimizes distance error also smoothly hugs the source curve. But there are cases where there is in fact a tradeoff, and when such tradeoffs exist, aggressively optimizing for one causes the other to suffer.</p>
<p>For this particular range of the test data, I believe the Fréchet-minimizing cubic Bézier approximation does indeed exhibit the bump; tweaking the parameters will certainly improve smoothness, but also result in a greater distance from the source curve.</p>
<p>This state of affairs is not acceptable in most applications. It is evidence that Fréchet does not capture all aspects of the objective function. Section 9.2 of the thesis discusses this point.</p>
<p><img src="/assets/simplify-distance-error.png" alt="Figure 9.2 from the thesis: Two circle approximations with equal distance error" /></p>
<p>Given that aggressively optimizing Fréchet may give undesirable results, what is to be done? The most systematic approach would be to design an error metric that takes both distance and angle into account, and calibrate it to correlate strongly with perception. That is perhaps a tall order, requiring research to properly establish tuning parameters, and likely with complexity and runtime performance implications to evaluate. Even so, I think it should be pursued.</p>
<p>A simpler approach, implemented in the current code, is to observe that these bumpy cubic Béziers can be recognized by their parameters; when the distance from the endpoints to the control points is roughly equal to the chord length, curves can have cusps or nearly so. These are the δ values from the core quartic-based curve fitting solver. A simple approach is to exclude curves with δ values greater than some threshold (0.85 works well), which also improves performance by decreasing the number of error evaluations needed, but a more sophisticated approach is to multiply the error by a penalty factor for larger δ. One advantage of the latter approach is that it is still possible to fit every actual exact Bézier input.</p>
<p>Applying this tweak gives a much smoother result, though it does require one more segment (44 rather than 43 previously).</p>
<p><img src="/assets/simplify-smooth.svg" alt="Smoother simplified path" /></p>
<p>Another point worth making is that the number of path segments needed scales very gently as the error bound is tightened; changing the threshold from 0.15 to 0.05 increases the number of segments only from 44 to 60. In the limit, the error scales as O(n^-6) in the number of segments. This extremely fast convergence is one reason that cubic Béziers can be considered a universal representation of curved paths; while some other representation such as NURBS may be able to represent conic sections exactly, any such smooth curve can be approximated to an extremely tight tolerance using a modest number of cubic segments. The path simplification technique in this blog (and in kurbo) gives a practical way to actually attain this degree of accuracy.</p>
<p><img src="/assets/simplify-smooth-0_05.svg" alt="Smooth simplified path, error 0.05" /></p>
<p>And taking the error down to 0.01 requires only 90 segments. This creates an extremely close fit to the source curve, still without requiring an excessive number of segments.</p>
<p><img src="/assets/simplify-smooth-0_01.svg" alt="Smooth simplified path, error 0.01" /></p>
<p>These are vector images, so please feel free to open them in a new tab and zoom in to inspect them more carefully. You should see all solutions display an impressive amount of control over the curvature variation afforded by cubic Béziers. The code is <a href="https://github.com/linebender/kurbo/pull/269">available</a>, so also feel free to experiment with it with your own test data and for your own applications. I’m especially interested in cases where the algorithm doesn’t perform well; it hasn’t been carefully validated yet.</p>
<h2 id="low-pass-filtering">Low pass filtering</h2>
<p>The “best” curve depends on the use case. When the source curve is an authoritative source of truth, then making each segment G1 continuous with it at the endpoints is reasonable. However, when the source curve is noisy, perhaps because it’s derived from a scanned image or digitizer samples, then an optimum simplified path may have angle deviations relative to the source that make the overall curve smoother.</p>
<p>I haven’t been working from noisy data and haven’t done experiments, but I do suggest a possibility: use a global optimizing technique such as that provided by <a href="https://crates.io/crates/argmin">argmin</a>, and jointly optimize both the location of the subdivision points and a delta to be applied to the angle (equally on both sides of a subdivision, so the resulting curve remains G1). Another possibility is to explicitly apply a low-pass filter, tuned so the amount of smoothing is consistent with the amount of simplification. In any case, using the existing code with no further tuning may yield less than optimum results.</p>
<h2 id="discussion">Discussion</h2>
<p>The current code in kurbo is likely considerably better than what’s in your current drawing tool, but curve fitting remains work in progress. The core primitive feels solid, but applying it might require different tuning depending on the specifics of the use case. I invite collaboration along these lines.</p>
<p>Many of the points in this blog were discussed in Zulip threads, particularly <a href="https://xi.zulipchat.com/#narrow/stream/260979-kurbo/topic/curve.20offsets">curve offsets</a> and <a href="https://xi.zulipchat.com/#narrow/stream/260979-kurbo/topic/More.20thoughts.20on.20cubic.20fitting">More thoughts on cubic fitting</a>. The Zulip is a great place to discuss these ideas, and we welcome newcomers and people bringing questions.</p>
<p>Thanks to Siqi Wang for insightful questions and making test data available.</p>Moving from Rust to C++2023-04-01T13:00:42+00:002023-04-01T13:00:42+00:00https://raphlinus.github.io/rust/2023/04/01/rust-to-cpp<p><strong>Note to readers: this is an April Fools post, intended to satirize bad arguments made against Rust. I have mixed feelings about it now, as I think people who understood the context appreciated the humor, but some people were confused. That’s partly because it’s written in a very persuasive style, and I mixed in some good points. For a <em>good</em> discussion of C++ safety in particular, see JF Bastien’s talk <a href="https://www.youtube.com/watch?v=Gh79wcGJdTg">Safety and Security: The Future of C++</a>.</strong></p>
<p>I’ve been involved in Rust and the Rust community for many years now. Much of my work has been related to creating infrastructure for building GUI toolkits in Rust. However, I have found my frustrations with the language growing, and pine for the stable, mature foundation provided by C++.</p>
<h2 id="choice-of-build-systems">Choice of build systems</h2>
<p>One of the more limiting aspects of the Rust ecosystem is the near-monoculture of the Cargo build system and package manager. While other build systems are used to some extent (including <a href="https://mmapped.blog/posts/17-scaling-rust-builds-with-bazel.html">Bazel</a> when integrating into larger polyglot projects), they are not well supported by tools.</p>
<p>In C++, by contrast, there are many choices of build systems, allowing each developer to pick the one best suited for their needs. A very common choice is CMake, but there’s also Meson, Blaze and its variants, and of course it’s always possible to fall back on Autotools and make. The latter is especially useful if we want to compile on Unix systems such as AIX and DEC OSF/1 AXP. If none of those suffice, there are plenty of other choices including SCons, and no doubt new ones will be created on a regular basis.</p>
<p>We haven’t firmly decided on a build system for the Linebender projects yet, but most likely it will be CMake with plans to evaluate other systems and migrate, perhaps Meson.</p>
<h2 id="safety">Safety</h2>
<p>Probably the most controversial aspect of this change is giving up the safety guarantees of the Rust language. However, I do not think this will be a big problem in practice, for three reasons.</p>
<p>First, I consider myself a good enough programmer that I can avoid writing code with safety problems. Sure, I’ve been responsible for some CVEs (including <a href="https://nvd.nist.gov/vuln/detail/CVE-2016-2414">font parsing code in Android</a>), but I’ve learned from that experience, and am confident I can avoid such mistakes in the future.</p>
<p>Second, I think the dangers from memory safety problems are overblown. The Linebender projects are primarily focused on 2D graphics, partly games and partly components for creating GUI applications. If a GUI program crashes, it’s not that big a deal. In the case that the bug is due to a library we use as a dependency, our customers will understand that it’s not our fault. Memory safety bugs are not fundamentally different than logic errors and other bugs, and we’ll just fix those as they surface.</p>
<p>Third, the C++ language is evolving to become safer. We can use modern C++ techniques to avoid many of the dangers of raw pointers (though string views can outlive their backing strings, and closures as used in UI can have very interesting lifetime patterns). The <a href="https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines">C++ core guidelines</a> are useful, even if they’re honored in the breach. And, as discussed in the next section, the language itself can be expected to improve.</p>
<h2 id="language-evolution">Language evolution</h2>
<p>One notable characteristic of C++ is the rapid adoption of new features, these days with a new version every 3 years. C++20 brings us modules, an innovative new feature, and one I’m looking forward to actually being implemented fairly soon. Looking forward, C++26 will likely have stackful coroutines, the ability to embed binary file contents to initialize arrays, a safe range-based for loop, and many other goodies.</p>
<p>By comparison, the pace of innovation in Rust has become more sedate. It used to be quite rapid, and async has been a major effort, but recently landed features such as generic associated types have more of the flavor of making combinations of existing features work as expected rather than bringing any fundamentally new capabilities; the forthcoming “type alias impl trait” is similar. Here, it is clear that Rust is held back by its commitment to not break existing code (enacted by the edition mechanism), while C++ is free to implement backwards compatibility breaking changes in each new version.</p>
<p>C++ has even more exciting changes to look forward to, including potentially a new syntax as proposed by Herb Sutter <a href="https://github.com/hsutter/cppfront">cppfront</a>, and even the new C++ compatible Carbon language.</p>
<p>Fortunately, we have excellent leadership in the C++ community. Stroustrup’s <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2739r0.pdf">paper on safety</a> is a remarkably wise and perceptive document, showing a deep understanding of the problems C++ faces, and presenting a compelling roadmap into the future.</p>
<h2 id="community">Community</h2>
<p>I’ll close with some thoughts on community. I try not to spend a lot of time on social media, but I hear that the Rust community can be extremely imperious, constantly demanding that projects be rewritten in Rust, and denigrating all other programming languages. I will be glad to be rid of that, and am sure the C++ community will be much nicer and friendlier. In particular, the C++ subreddit is well known for their <a href="https://www.reddit.com/r/cpp/comments/128xpps/comment/jekwww0">sense of humor</a>.</p>
<p>The Rust community is also known for its code of conduct and other policies promoting diversity. I firmly believe that programming languages should be free of politics, focused strictly on the technology itself. In computer science, we have moved beyond prejudice and discrimination. I understand how some identity groups may desire more representation, but I feel it is not necessary.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Rust was a good experiment, and there are aspects I will look back on fondly, but it is time to take the Linebender projects to a mature, production-ready language. I look forward to productive collaboration with others in the C++ community.</p>
<p>We’re looking for people to help with the C++ rewrite. If you’re interested, please join our <a href="https://xi.zulipchat.com">Zulip instance</a>.</p>
<p>Discuss on <a href="https://news.ycombinator.com/item?id=35400047">Hacker News</a> and <a href="https://old.reddit.com/r/rust/comments/128m5wn/moving_from_rust_to_c/">/r/rust</a>.</p>Note to readers: this is an April Fools post, intended to satirize bad arguments made against Rust. I have mixed feelings about it now, as I think people who understood the context appreciated the humor, but some people were confused. That’s partly because it’s written in a very persuasive style, and I mixed in some good points. For a good discussion of C++ safety in particular, see JF Bastien’s talk Safety and Security: The Future of C++.Requiem for piet-gpu-hal2023-01-07T17:12:42+00:002023-01-07T17:12:42+00:00https://raphlinus.github.io/rust/gpu/2023/01/07/requiem-piet-gpu-hal<p><a href="https://github.com/linebender/vello">Vello</a> is a new GPU accelerated renderer for 2D graphics that relies heavily on compute shaders for its operation. (It was formerly known as piet-gpu, but we renamed it recently because it is no longer based on the Piet render context abstraction, and has been substantially rewritten.) As such, it depends heavily on having good infrastructure for writing and running compute shaders, consistent with its goals of running portably and reliably across a wide range of GPU hardware. Unfortunately, there hasn’t been a great off-the-shelf solution for that, so we’ve spent a good part of the last couple years building our own hand-rolled GPU abstraction layer, piet-gpu-hal.</p>
<p>During that time, the emerging <a href="https://www.w3.org/TR/webgpu/">WebGPU</a> standard, along with the allied <a href="https://www.w3.org/TR/WGSL/">WGSL</a> shader language, showed promise for providing such infrastructure, but it has seemed immature, and also has limitations that would have prevented us from doing experiments involving advanced GPU features. This motivated continued work on developing our own abstraction layer, so much so that the <a href="https://github.com/linebender/vello/blob/main/doc/vision.md#why-not-wgpu">piet-gpu vision</a> document has a section entitled “Why not wgpu?”.</p>
<p>Recently, however, we switched <a href="https://github.com/linebender/vello">Vello</a> to using <a href="https://wgpu.rs">wgpu</a> for its GPU infrastructure and deleted piet-gpu-hal entirely. This was the right decision, but there was something special about piet-gpu-hal and I hope its vision is realized some day. This post talks about the reasons for the change and the tradeoffs involved.</p>
<p>In the process of doing piet-gpu-hal, we learned a <em>lot</em> about portable compute infrastructure. We will apply that knowledge to improving WebGPU implementations to better suit our needs.</p>
<h2 id="goals">Goals</h2>
<p>The goals of piet-gpu-hal were admirable, and I believe there is still a niche for something that implements them. Essentially, it was to provide a lightweight yet powerful runtime for running compute shaders on GPU. Those goals are fulfilled to a large extent by WebGPU; it’s just not quite as lightweight and not quite as powerful, but on the other hand the developer experience is much better.</p>
<p>The most important goal, especially compared with WebGPU implementations, is to reduce the cost of shader compilation by precompiling them to the intermediate representation (IR) expected by the GPU, rather than compiling them at runtime. This avoids anywhere from about 1MB to about 20MB of binary size for the shader compilation infrastructure, and somewhere around 10ms to 100ms of startup time to compile the shaders.</p>
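<p>To make the startup saving concrete, here is a minimal Rust sketch of the ahead-of-time flow: the application embeds precompiled SPIR-V bytes (in a real build, via <code class="language-plaintext highlighter-rouge">include_bytes!</code> on output from glslangValidator), and at startup only needs to reinterpret them as words, with no shader compiler in the binary. The function name and the fabricated one-word module are illustrative.</p>

```rust
// SPIR-V modules are streams of little-endian 32-bit words; the first
// word is the magic number 0x0723_0203.
fn spirv_words(bytes: &[u8]) -> Result<Vec<u32>, &'static str> {
    if bytes.len() % 4 != 0 {
        return Err("SPIR-V length must be a multiple of 4");
    }
    let words: Vec<u32> = bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    if words.first() != Some(&0x0723_0203) {
        return Err("not a SPIR-V module (bad magic)");
    }
    Ok(words)
}

fn main() {
    // Stand-in for `include_bytes!("kernel.spv")` produced ahead of time.
    let module: &[u8] = &0x0723_0203u32.to_le_bytes();
    let words = spirv_words(module).expect("valid module");
    assert_eq!(words, vec![0x0723_0203]);
}
```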
<p>An additional goal was to unlock the powerful capabilities present on many (but not all) GPUs by runtime query. For our needs, the most important ones are subgroups, descriptor indexing, and detection of unified memory (avoiding the need for separate staging buffers). However, for the coming months we are prioritizing getting things working well for the common denominator, which is well represented by WebGPU, as it’s generally possible to work around the lack of such features by doing a bit more shuffling around in memory.</p>
<h2 id="implementation-choices">Implementation choices</h2>
<p>In this section, I’ll talk a bit about implementation choices made in piet-gpu-hal and their consequences.</p>
<h3 id="shader-language">Shader language</h3>
<p>After considering the alternatives, we landed on GLSL as the authoritative source for writing shaders. HLSL was in the running, and it seems to be the most popular choice (primarily in the game world) for writing portable shaders, but ultimately GLSL won because it is capable of expressing <em>all</em> of what Vulkan can do. In particular, I wanted to experiment with the Vulkan memory model.</p>
<p>Another choice considered seriously was <a href="https://github.com/EmbarkStudios/rust-gpu">rust-gpu</a>. That looks promising, and it has many desirable properties, not least being able to run the same code on CPU and GPU, but just isn’t mature enough. Hopefully that will change. I think porting Vello to it would be an interesting exercise, and would shake out many of the issues needing to be solved to use it in production.</p>
<p>Another appealing choice would be <a href="https://www.circle-lang.org/">Circle</a>, a C++ dialect that targets compute shaders among other things. It might have made it over the edge if it were released as open source; someone should buy out Sean’s excellent work and relicense it.</p>
<h3 id="ahead-of-time-compilation">Ahead of time compilation</h3>
<p>From the GLSL source, we also had a pipeline to compile this to shader IR. For all targets, the first step was <a href="https://github.com/KhronosGroup/glslang">glslangValidator</a> to compile the GLSL to SPIR-V. And for Vulkan, that was sufficient.</p>
<p>For DX12, the next step was <a href="https://github.com/KhronosGroup/SPIRV-Cross">spirv-cross</a> to convert the SPIR-V to HLSL, followed by <a href="https://github.com/microsoft/DirectXShaderCompiler">DXC</a> to convert this to DXIL. We really wanted to use DXC, especially over relying on the system-provided shader compiler (some arbitrary version of FXC), both to access advanced features in Shader Model 6 such as wave operations, and also because FXC is somewhat buggy and we have evidence it will cause problems. In fact, I see the current reliance on FXC to be one of the biggest risks for WebGPU working well on Windows. (In the longer term, it is likely that both Tint and naga will compile WGSL directly to DXIL and perhaps DXBC, which will solve these problems, but will take a while.)</p>
<p>For Metal, we used spirv-cross to convert SPIR-V to Metal Shading Language. We intended to go one step further, to AIR (also known as metallib), using the <a href="https://developer.apple.com/documentation/metal/shader_libraries/building_a_library_with_metal_s_command-line_tools">command-line tools</a>, but we didn’t actually do that, as it would have made setting up the CI more difficult.</p>
<p>All this was controlled through a simple hand-rolled ninja file. At first, we ran this by hand and committed the results, but it was tricky to make sure all the tools were in place (which would have been considerably worse had we required both DXC with the <a href="https://www.wihlidal.com/blog/pipeline/2018-09-16-dxil-signing-post-compile/">signing DLL</a> in place and the Metal tools), and it also resulted in messy commits, prone to merge conflicts. Our solution to this was to run <a href="https://github.com/linebender/vello/blob/480e5a5e2fb1ed5c38da083bfa00c1ae6b9b2486/doc/shader_compilation.md">shader compilation in CI</a>, which was better but still added considerably to the friction, as it was easy to be in the wrong branch for committing PRs.</p>
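<p>For flavor, here is a sketch of what such a ninja file looks like. The rules below are illustrative, not the project’s actual file: the tool flags are typical invocations of glslangValidator, spirv-cross, and DXC, and the file names are made up.</p>

```ninja
# Illustrative hand-rolled shader build; paths and flags are assumptions.
rule glsl_to_spv
  command = glslangValidator -V $in -o $out

rule spv_to_hlsl
  command = spirv-cross $in --hlsl --shader-model 60 --output $out

rule hlsl_to_dxil
  command = dxc -T cs_6_0 $in -Fo $out

rule spv_to_msl
  command = spirv-cross $in --msl --output $out

build gen/kernel.spv: glsl_to_spv shader/kernel.comp
build gen/kernel.hlsl: spv_to_hlsl gen/kernel.spv
build gen/kernel.dxil: hlsl_to_dxil gen/kernel.hlsl
build gen/kernel.msl: spv_to_msl gen/kernel.spv
```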
<h3 id="gpu-abstraction-layer">GPU abstraction layer</h3>
<p>The core of piet-gpu-hal was an abstraction layer over GPU APIs. Following the tradition of gfx-hal, we called this a HAL (hardware abstraction layer), but that’s not quite accurate. Vulkan, Metal, and DX12 are all hardware abstraction layers, responsible among other things for compiling shader IR into the actual ISA run by the hardware, and dispatching compute and other work.</p>
<p>The design was based loosely on gfx-hal, but more tuned to our needs. To summarize, gfx-hal was a layer semantically very close to Vulkan, so that the Vulkan back-end was a very thin layer, and the Metal back-end resembled MoltenVK, with the DX12 back-end also simulating Vulkan semantics in terms of the underlying API. This ultimately wasn’t a very satisfying approach, and for wgpu was <a href="https://gfx-rs.github.io/2021/08/18/release-0.10.html#pure-rust-graphics">deprecated in favor of wgpu-hal</a>.</p>
<p>We only implemented the subset needed for compute, which turned out to be a significant limitation because we found we also needed to run the rasterization pipeline for some tasks such as image scaling. The need to add new functionality to the HAL was also a major friction point. We <em>were</em> able to add features like runtime query for advanced capabilities such as subgroup size control, and a polished timer query implementation on all platforms (the latter is still not quite there for wgpu). Overall it worked pretty well.</p>
<p>I make one general observation: all these APIs and HALs are very object oriented. There are objects representing the adapter, device, command lists, and resources such as buffers and images. Managing these objects is nontrivial, especially because the lifetimes are complex due to the asynchronous nature of GPU work submission; you can’t destroy a resource until all command buffers referencing it have completed. These patterns are fairly cumbersome, and especially don’t translate easily to Rust.</p>
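<p>As a CPU-side illustration of the lifetime problem, here is a minimal deferred-destruction scheme in Rust: each resource is tagged with the submission that last references it, and is only dropped once the GPU reports that submission complete. The names are hypothetical, not piet-gpu-hal’s actual API.</p>

```rust
/// Resources scheduled for destruction, tagged with the submission index
/// that must complete before it is safe to drop them.
struct Deferred<T> {
    pending: Vec<(u64, T)>,
}

impl<T> Deferred<T> {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Schedule `res` for destruction once submission `submit_ix` completes.
    fn defer(&mut self, submit_ix: u64, res: T) {
        self.pending.push((submit_ix, res));
    }

    /// Called with the last submission the GPU reports complete; drops
    /// everything no longer referenced and returns how many were freed.
    fn collect(&mut self, completed_ix: u64) -> usize {
        let before = self.pending.len();
        self.pending.retain(|&(ix, _)| ix > completed_ix);
        before - self.pending.len()
    }
}

fn main() {
    let mut d = Deferred::new();
    d.defer(1, "staging buffer");
    d.defer(3, "atlas image");
    // Submission 2 has completed: only the submission-1 resource is freed.
    assert_eq!(d.collect(2), 1);
    assert_eq!(d.collect(3), 1);
}
```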
<p>For others seeking to build cross-platform GPU infrastructure, I suggest exploring a more declarative, data-oriented approach. In that approach, build a declarative render graph using simple value types as much as possible, then write specialized engines for each back-end API. This should lead to more straightforward code with less dynamic dispatching, and also resolve the need to find common-denominator abstractions. We are moving in this direction in Vello, and may explore it further as we encounter the need for native back-ends.</p>
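<p>A minimal Rust sketch of what that declarative, data-oriented direction can look like; the types are illustrative stand-ins, not Vello’s real graph. Because the graph is plain data, each back-end can walk it with its own specialized engine, and no common-denominator trait abstraction is needed.</p>

```rust
// A render graph as simple value types: no trait objects, no per-API
// resource handles in the description itself.
#[derive(Clone, Debug, PartialEq)]
enum Pass {
    Compute { shader: &'static str, workgroups: (u32, u32, u32) },
    Download { buffer: &'static str },
}

#[derive(Clone, Debug, Default)]
struct Graph {
    passes: Vec<Pass>,
}

impl Graph {
    fn compute(mut self, shader: &'static str, wg: (u32, u32, u32)) -> Self {
        self.passes.push(Pass::Compute { shader, workgroups: wg });
        self
    }
    fn download(mut self, buffer: &'static str) -> Self {
        self.passes.push(Pass::Download { buffer });
        self
    }
}

/// A toy "engine" that just describes the passes; a real one would record
/// Vulkan, Metal, or DX12 commands from the same value-typed graph.
fn run_debug(graph: &Graph) -> Vec<String> {
    graph.passes.iter().map(|p| format!("{p:?}")).collect()
}

fn main() {
    let graph = Graph::default()
        .compute("pathseg_scan", (64, 1, 1))
        .download("ptcl");
    assert_eq!(run_debug(&graph).len(), 2);
}
```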
<h2 id="webgpu-pain-points">WebGPU pain points</h2>
<p>Overall the task of porting piet-gpu to WGSL went well, but we ran into some issues. We expect these to be fixed and improved, but at the moment the experience is a bit rough. In particular, the demo is not running on DX12 at all, either in Chrome Canary for the WebGPU version or through wgpu. The main sticking point is uniformity analysis, which only very recently has a good solution in WGSL (<a href="https://github.com/gpuweb/gpuweb/pull/3586">workgroupUniformLoad</a>) and the implementation hasn’t fully landed.</p>
<h2 id="collaboration-with-community">Collaboration with community</h2>
<p>A major motivation for switching to WGSL and WebGPU is to interoperate better with other projects in that ecosystem. We already have a demo of <a href="https://github.com/linebender/vello/tree/main/examples/with_bevy">Bevy interop</a>, which is particularly exciting.</p>
<p>One such project is <a href="https://github.com/wgsl-analyzer/wgsl-analyzer">wgsl-analyzer</a>, which gives us interactive language server features as we develop WGSL shader code: errors and warnings, inline type hints, go to definition, and more. Thanks especially to Daniel McNab, they’ve been super responsive to our needs and we maintain a working configuration. I strongly recommend using such tools; the old way sometimes felt like submitting shader code on a stack of punchcards to the shader compiler.</p>
<p>Obviously another major advantage is deploying to the web using WebGPU, which we have working in at least prototype form. With an <a href="https://groups.google.com/a/chromium.org/g/blink-dev/c/VomzPhvJCxI/m/SUhU9Z0vAgAJ">intent to ship</a> from Chrome and active involvement from Mozilla and Apple, the prospect of WebGPU shipping in usable state on real browsers seems close.</p>
<h2 id="precompiled-shaders-on-roadmap">Precompiled shaders on roadmap?</h2>
<p>The goals of a lightweight runtime for GPU compute remain compelling, and in our roadmap we plan to return to precompiled shaders so it is possible to use Vello without a need to carry along runtime shader compilation infrastructure, and also to more aggressively exploit advanced GPU features. There are two avenues we are exploring.</p>
<p>One is to add support for precompiled shaders to either wgpu or wgpu-hal. We have an <a href="https://github.com/gfx-rs/wgpu/issues/3103">issue</a> with some thoughts. The advantage of this approach is that it potentially benefits many applications developed on WebGPU, for example native binary distributions of games, provided it is possible to identify all <a href="https://therealmjp.github.io/posts/shader-permutations-part1/">permutations</a> of shaders which will actually be used and bundle the IR for each.</p>
<p>The other is to add additional native renderer back-ends. The new architecture is much more declarative and less object oriented, so the module that runs the render graph can be written directly in terms of the target API rather than going through an abstraction layer. If there is a desire to ship Vello for a targeted application (for example, animation playback on Android), this will be the best way to go.</p>
<h2 id="other-gpu-compute-clients">Other GPU compute clients</h2>
<p>One thing we were watching for was whether there was any interest in using piet-gpu-hal for other applications. Matt Keeter did some <a href="https://github.com/mkeeter/fidget/blob/1b41b6b8e4bdb017e2ca28c151391a4a080b581a/jitfive/src/metal.rs">experimentation</a> with it, but otherwise there was not much interest. Sometimes the “portable GPU compute” space feels a bit lonely.</p>
<p>An intriguing potential application space is machine learning. It would be an ambitious but doable project to get, say, Stable Diffusion running on portable compute using either piet-gpu-hal or something like it, so that very little runtime (probably less than a megabyte of code) would be required. Related projects include <a href="https://kompute.cc/">Kompute.cc</a>, which runs machine learning workloads but is Vulkan only, and also <a href="https://google.github.io/mediapipe/">MediaPipe</a>.</p>
<p>One downside to trying to implement machine learning workloads in terms of portable compute shaders is that it doesn’t get access to neural accelerators such as the <a href="https://github.com/hollance/neural-engine">Apple Neural Engine</a>. When running in native Vulkan, you <em>may</em> get access to <a href="https://www.khronos.org/assets/uploads/developers/presentations/Cooperative_Matrix_May22.pdf">cooperative matrix</a> features, which on Nvidia are branded “tensor cores,” but for the most part these are proprietary vendor extensions and it is not clear if and when they might be exposed through WebGPU. Even so, at least on Nvidia hardware it seems likely that using these features can unlock very high performance.</p>
<p>Going forward, one approach I find particularly promising for running machine learning is <a href="https://github.com/webonnx/wonnx">wonnx</a>, which implements the ONNX spec on top of WebGPU. No doubt in the first release, performance will lag highly tuned native implementations considerably, but once such a thing exists as a viable open source project, I think it will be improved rapidly. And WebGPU is not standing still…</p>
<h2 id="beyond-webgpu-10">Beyond WebGPU 1.0</h2>
<p>WebGPU 1.0 is basically a “least common denominator” of current GPUs. This has advantages and disadvantages. It means that there are fewer choices (and fewer permutations), and that code written in WGSL can run on all modern GPUs. The downside is that there are a number of features that most (but not all) current GPUs have that can speed things further, and those features are not available.</p>
<p>The likely path through this forest is to define extensions to WebGPU. These can start as privately implemented extensions, running in native, then, hopefully based on that experience, proposed for standardization on the web. They would be optional, meaning that more shader permutations will need to be written to make use of them when available, or fall back when not. One such extension, fp16 arithmetic, has already been standardized, though we don’t yet exploit it in Vello.</p>
<p>In Vello, we have identified three promising candidates for further extension.</p>
<p>First, <a href="https://chunkstories.xyz/blog/a-note-on-descriptor-indexing/">descriptor indexing</a>, which is the ability to create an array of textures, then have a shader dynamically access textures from that array. Without it, a shader can only have a fixed number of textures bound. To work around that limitation, we plan to have an image atlas, copy the source images into that atlas using rasterization stages, then access regions (defined by uv quads) from the single binding. We don’t expect performance to be that bad, as such copies are fairly cheap, but it does require extra memory and is not ideal. For fully native, the GPU world is moving to <a href="https://alextardif.com/Bindless.html">bindless</a>, which is popular in DX12, and the recent <a href="https://www.khronos.org/blog/vk-ext-descriptor-buffer">descriptor buffer</a> extension makes Vulkan work the same way. It is likely that WebGPU will standardize on something more like descriptor indexing than full bindless, because the latter is basically raw pointers and thus unsafe by design. In any case, see the <a href="https://github.com/linebender/vello/issues/176">image resources</a> issue for more discussion.</p>
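<p>The atlas workaround can be illustrated with a toy shelf packer in Rust: each source image gets a region of one big texture, and the shader receives a normalized uv quad instead of its own binding. This is a deliberate simplification of what a real allocator would do.</p>

```rust
// Normalized texture coordinates of a region within the atlas.
#[derive(Debug, Clone, Copy, PartialEq)]
struct UvRect { u0: f32, v0: f32, u1: f32, v1: f32 }

/// Simple shelf packer: fill left to right, start a new shelf when a row
/// is full. Real allocators are smarter, but the interface is the point.
struct ShelfAtlas {
    size: u32, // atlas is size x size texels
    cursor_x: u32,
    shelf_y: u32,
    shelf_h: u32,
}

impl ShelfAtlas {
    fn new(size: u32) -> Self {
        Self { size, cursor_x: 0, shelf_y: 0, shelf_h: 0 }
    }

    /// Allocate a w x h region, returning its uv quad, or None if full.
    fn alloc(&mut self, w: u32, h: u32) -> Option<UvRect> {
        if self.cursor_x + w > self.size {
            // Start a new shelf below the current one.
            self.shelf_y += self.shelf_h;
            self.cursor_x = 0;
            self.shelf_h = 0;
        }
        if self.shelf_y + h > self.size || w > self.size {
            return None;
        }
        let (x, y) = (self.cursor_x, self.shelf_y);
        self.cursor_x += w;
        self.shelf_h = self.shelf_h.max(h);
        let s = self.size as f32;
        Some(UvRect {
            u0: x as f32 / s,
            v0: y as f32 / s,
            u1: (x + w) as f32 / s,
            v1: (y + h) as f32 / s,
        })
    }
}

fn main() {
    let mut atlas = ShelfAtlas::new(1024);
    let a = atlas.alloc(512, 256).unwrap();
    let b = atlas.alloc(512, 128).unwrap();
    assert_eq!(a.u1, 0.5);
    assert_eq!(b.u0, 0.5); // packed to the right of `a` on the same shelf
}
```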
<p>Second, device-scoped barriers, which unlock single-pass (decoupled look-back) prefix sum techniques. They are not present in Metal, but otherwise should be straightforward to add. I wrote about this in my <a href="https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portable.html">Prefix sum on portable compute shaders</a> blog post. In the meantime, we are using multiple dispatches, which is much more portable, but not quite as performant.</p>
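<p>Here is a CPU model, in Rust, of that multiple-dispatch fallback for exclusive prefix sum: reduce each “workgroup” to a partial sum, scan the partials, then rescan each chunk with its offset, three dispatches where a device-scoped barrier would allow one. The workgroup size is illustrative, and each loop stands in for a parallel shader stage.</p>

```rust
/// Exclusive prefix sum computed as three "dispatches", mirroring the
/// portable multi-dispatch structure rather than doing one serial pass.
fn multi_dispatch_prefix_sum(input: &[u32], wg: usize) -> Vec<u32> {
    // Dispatch 1: per-workgroup reduction.
    let partials: Vec<u32> = input.chunks(wg).map(|c| c.iter().sum()).collect();
    // Dispatch 2: exclusive scan of the partial sums.
    let mut offsets = Vec::with_capacity(partials.len());
    let mut acc = 0u32;
    for p in &partials {
        offsets.push(acc);
        acc += p;
    }
    // Dispatch 3: local scan within each workgroup, plus its offset.
    let mut out = Vec::with_capacity(input.len());
    for (chunk, base) in input.chunks(wg).zip(offsets) {
        let mut local = base;
        for x in chunk {
            out.push(local);
            local += x;
        }
    }
    out
}

fn main() {
    let data = [1, 2, 3, 4, 5, 6];
    assert_eq!(multi_dispatch_prefix_sum(&data, 2), vec![0, 1, 3, 6, 10, 15]);
}
```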
<p>Third, subgroups, also known as SIMD groups, wave operations, and warp operations. Within a workgroup, especially for stages resembling prefix sum, Vello uses a lot of workgroup shared memory and barriers. With subgroups, it’s possible to reduce that traffic, in some cases dramatically. That should make the biggest difference on Intel GPUs, which have relatively slow shared memory.</p>
<p>Unfortunately, there is a tricky aspect to subgroups, which is that in most cases there is no control in advance over the subgroup size; the shader compiler picks it on the basis of a heuristic. The <a href="https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_subgroup_size_control.html">subgroup size control</a> extension, which is mandatory in Vulkan 1.3, fixes this, and allows writing code specialized to a particular subgroup size. Otherwise, it will be necessary to write a lot of conditional code and hope that the compiler constant-folds the control flow based on subgroup size. Another challenge is that it is more difficult to test code for a particular subgroup size, in contrast to workgroup shared memory which is quite portable. More experimentation is needed to determine whether subgroups <em>without</em> the size control extension work well across a wide range of hardware.</p>
<p>And on the topic of that experimentation, it’s difficult to do so without adequate GPU infrastructure. I may find myself reaching for the archived version of piet-gpu-hal.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Choosing the right GPU infrastructure depends on the goals, as sadly there is not yet a good consensus choice for a GPU runtime suitable for a wide variety of workloads including compute. For the goals of researching the cutting edge of performance, hand-rolled infrastructure was the right choice, and piet-gpu-hal served that well. For the goal of lowering the friction for developing our engine, and also interoperating with other projects, WebGPU and wgpu are a better choice. Our experience with the port suggests that the performance and features are good enough, and that it is a good experience all-around.</p>
<p>We hope to make Vello useful enough to use in production within the next few months. For many applications, WebGPU will be an appropriate infrastructure. For others, where the overhead of runtime shader compilation is not acceptable, we have a path forward but will need to consider alternatives. Either ahead-of-time shader compilation can be retrofitted to wgpu, or we will explore a more native approach.</p>
<p>In any case, we look forward to productive development and collaboration with the broader community.</p>Raph’s reflections and wishes for 20232022-12-31T14:44:42+00:002022-12-31T14:44:42+00:00https://raphlinus.github.io/personal/2022/12/31/raph-2023<p>This post reflects a bit on 2022 and contains wishes for 2023. It mixes project and technical stuff, which I write about fairly frequently, and more personal reflections, which I don’t.</p>
<h2 id="reflections-on-2022">Reflections on 2022</h2>
<p>Overall 2022 was a good year for me, though it was stressful and challenging in some ways. A lot of the stuff I did I’ve talked about in this blog, but there are a few other things, and quite a few more things in the pipeline. It was not a year of shipping anything big, and that is one thing I’d like to see different in 2023.</p>
<p>What I do straddles a number of more traditional categories. I do research, but most of my output is blog posts and code, not academic papers. My area of focus is primarily 2D graphics and GUI infrastructure, and these are to a large extent neglected step-children of academic computer science - there aren’t conferences or journals that specialize in the topic, nor are there decent textbooks. As a result, knowledge tends to be somewhat arcane, and just as much “lore” shared among practitioners as a literature. Even so, I find the field intellectually very stimulating and consider this odd situation to be an opportunity.</p>
<p>My portfolio of projects is very ambitious, as it basically includes an entire Rust UI toolkit plus a lot of the supporting technologies for that. It’s arguably too much for one person to take on, but I’m trying my best to make it work, largely by fostering a community around the projects. In late 2022, a big step was setting up weekly office hours, one hour a week, where we check in and discuss the various projects. I think that’s working well and look forward to continuing it.</p>
<h2 id="on-happiness">On happiness</h2>
<p>I’m not one of those “quantified self” people, but I have noticed that my happiness tends to correlate pretty directly with how much code I’m writing. I’m sure some of that is simply because if there’s stressful stuff going on that gets in the way of coding, that makes me unhappy, but obviously I just really enjoy it.</p>
<p>In particular, I love solving deep puzzles, and I find plenty of opportunity for that. An especially enjoyable track is adapting algorithms to be massively parallel so they run efficiently on GPUs (especially compute). Sometimes that leads to friction; my projects often have a “rocket science” nature to them, making them hard to contribute to.</p>
<p>Aside from puzzle-solving, which is largely a solitary activity, I also like the aspects of teaching, getting people up to speed, especially in topics that are not easily accessible through a standard computer science education or textbooks. Writing, including this blog, is a big part of that.</p>
<p>To a large extent, I will try to make time in 2023 to focus on these things I really enjoy, that I find make me happy. I am extremely fortunate that my paying job, as a research software engineer on Google Fonts, lets me do that.</p>
<h2 id="vello">Vello</h2>
<p>Of the various projects in flight (and there are many!), one is clearly rising to the top of the stack: Vello, the GPU-accelerated 2D renderer. We made a lot of progress last year, but it’s still not quite ready to ship. Part of it is that it’s trying to solve a very hard problem, and part of it is that some solid GPU infrastructure needs to be in place for it to fly. A lot of time and energy last year went into GPU infrastructure in various ways, both pushing piet-gpu-hal forward, and then doing a full rewrite into WGSL and WebGPU (<a href="https://github.com/raphlinus/raphlinus.github.io/issues/86">writeup to come</a>).</p>
<p>For those who might be confused by the name, Vello is a rename of piet-gpu but still fundamentally the same project. The old name didn’t really fit, as “Piet” refers to a trait/method abstraction for the traditional 2D graphics API, and we’re moving away from that. The new name is intended to evoke both vellum (parchment, as in books and manuscripts) and velocity.</p>
<p>It is clear to me that there is strong demand for a good, cross-platform 2D rendering engine. There are a few other interesting things going on in the space, but I’ll be honest, what I see, I <em>want</em> to compete with. There are big missing pieces (the most important by far is the ability to import images), but I see a pretty clear path to getting all that done.</p>
<p>So this is by far the biggest goal for the year: get Vello to a usable state. That involves shipping a crate, doing a proper writeup, which will probably be a 20-30 page self-standing report, and doing quantitative performance evaluation against other renderers.</p>
<h2 id="xilem">Xilem</h2>
<p>Another major initiative is Xilem. This was originally just a reactive layer, intended to be generic over the underlying widget tree, but is emerging as an umbrella for the larger project.</p>
<p>My hypothesis is (still) that Xilem is the best known reactive architecture for Rust. If I am correct, that means it is more concise, more ergonomic, more efficient, and better integrated with async than comparable work on Dioxus, Sycamore, pax-lang, and the Elm variants. That would be an exciting result, in which case I hope and expect some interesting systems will be built on the architecture. If I am wrong (which is entirely possible), then those other reactive approaches are all likely good enough for practical use, and have more evidence behind them than Xilem currently does. I feel strongly that it is worthwhile to test the hypothesis, and the only way to do that is to try to build real systems on the architecture. That experience will also likely drive improvements to the Xilem architecture, and I expect we’ll learn something interesting in any case. [Note: this paragraph is a rewrite of the original; see <a href="https://github.com/raphlinus/raphlinus.github.io/pull/88">raphlinus#88</a> for the diff]</p>
<p>Progress on Xilem will have a considerably different texture than Vello. I hope that a huge fraction of the actual implementation work will be done by the community. Most of that work is most decidedly <em>not</em> rocket science, as I expect lots of it to be straightforward adaptation of the existing Druid widget set into the new architecture (and I’m making certain decisions explicitly to make that easier). I’m trying hard not to <a href="https://devblogs.microsoft.com/oldnewthing/20091201-00/?p=15843">lick the cookie</a> too much, and encourage other people to take on subprojects. I’m also trying to foster a culture where everyone in the community feels empowered to review PR’s and keep things moving, as having that block on me has been difficult.</p>
<p>One thing I am looking forward to working on is immutable data structures for efficiently (sparsely) diffing collections, which will hopefully realize the promise of my <a href="https://www.youtube.com/watch?v=DSuX-LIAU-I">RustLab 2020 talk</a>. This is a problem I feel has never been fully solved in UI toolkits - either you use complex and fragile mechanisms to incrementally update the UI, or you end up diffing the whole collection every time - and I believe using solid computer science to solve it, plus good Rust API design, would be very satisfying.</p>
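<p>To give a flavor of the problem (this is an illustrative toy, not Xilem’s design), one simple approach is to stamp each element with the generation in which it last changed, so that a diff against a snapshot generation yields only the mutated indices. A real solution would use tree-structured generations so the diff can also skip whole unchanged subtrees instead of scanning linearly, as this sketch does.</p>

```rust
/// A collection whose elements remember the generation in which they
/// were last mutated, enabling cheap change detection between snapshots.
struct Versioned<T> {
    generation: u64,
    items: Vec<(u64, T)>, // (generation last mutated, value)
}

impl<T> Versioned<T> {
    fn new(items: Vec<T>) -> Self {
        Self {
            generation: 0,
            items: items.into_iter().map(|v| (0, v)).collect(),
        }
    }

    /// Mutate one element, stamping it with a fresh generation.
    fn set(&mut self, ix: usize, value: T) {
        self.generation += 1;
        self.items[ix] = (self.generation, value);
    }

    /// Indices mutated since `since`: the sparse diff a UI layer consumes.
    fn changed_since(&self, since: u64) -> Vec<usize> {
        self.items
            .iter()
            .enumerate()
            .filter(|(_, (g, _))| *g > since)
            .map(|(i, _)| i)
            .collect()
    }
}

fn main() {
    let mut list = Versioned::new(vec!["a", "b", "c"]);
    let snapshot = list.generation;
    list.set(1, "B");
    assert_eq!(list.changed_since(snapshot), vec![1]);
}
```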
<p>I consider Xilem to be more speculative and riskier than Vello. The extent to which it succeeds is largely based on how well the community can organize around getting the work done; if it gated on me, it’s a good question when it would all get done, especially with other projects competing for my attention.</p>
<p>The best place to learn more about Xilem is my <a href="https://www.youtube.com/watch?v=zVUTZlNCb8U">High Performance Rust UI</a> talk. I go over the goals and motivations of the project, and there’s good Q&A at the end. There’s a bit more about the Rust language aspects, particularly trying to do ergonomic API design, in my <a href="https://youtu.be/Phk0C-kLlho">RustLab 2022 keynote</a>. In any case, I will be writing more.</p>
<p>Obviously Xilem depends on having good 2D rendering, so clearly time spent getting Vello production-ready contributes toward the overall success of the project.</p>
<p>To the people who want a good GUI toolkit they can use right now: I apologize, and ask for patience. In the mean time, you might check out Iced, Egui, or Slint. Those are all pretty good right now, and continually improving. I personally find the end goal of Xilem to be extremely compelling, but even in the best case it will take a while to get there.</p>
<h2 id="curves-and-other-research">Curves and other research</h2>
<p>Over the holiday break, I let my mind roam more freely than usual, and I found myself coming back to various problems in curves. Probably the most satisfying work was refining the Bézier curve fitting ideas (see <a href="https://github.com/linebender/kurbo/pull/230">kurbo#230</a> for the code, hopefully writeup coming before too long). I also have what I consider a very promising idea to <a href="https://github.com/linebender/spline/issues/26">improve hyperbeziers</a>, which have been on the back burner for a couple years, and an idea for <a href="https://github.com/raphlinus/raphlinus.github.io/issues/79">robust boolean path operations</a>.</p>
<p>But probably the juiciest bit of work will be perfecting the path geometry parts in Vello. In particular, I have a fairly compelling prototype of a combined flatten + offset operation in terms of <a href="https://raphlinus.github.io/curves/2021/02/19/parallel-curves.html">Euler spirals</a>, which I feel can be implemented efficiently on GPU as a compute shader and integrated into the Vello pipeline. That will improve handling of strokes (to handle all the join/cap options), and also serve as the basis for stem darkening of font outlines. Fun fact: the <a href="http://raphlinus.github.io/curves/2022/09/09/parallel-beziers.html">parallel curve</a> work I did a few months ago was motivated by trying to get a good GPU implementation, but ultimately I believe that particular work is best suited for creative vector graphics applications, while the Euler spiral approach is better suited for GPU.</p>
<h2 id="non-goals">Non-goals</h2>
<p>A year ago, I said, only half as a joke, that my main New Year’s resolution for 2022 would be to not write a shader language. I’m pleased to report that I have succeeded. I will renew that vow for next year as well.</p>
<p>In many ways, it would make sense to start designing a shader language. My process for writing compute shaders is basically to design them in a fictional high level language with operations like prefix sum, stream compaction, etc., then translate by hand into the much lower level WGSL (formerly GLSL). If a good high level language existed (Futhark is the closest I’ve seen so far, but rust-gpu is also a contender), it would in theory streamline that work and let me write more efficiently.</p>
<p>But I think the advice in <a href="https://blog.dhsdevelopments.com/dont-write-a-programming-language">Don’t write a programming language</a> is pretty sound. A programming language is a multi-year endeavor, and with a poor chance of success even in the most favorable conditions. In addition, I’ve found that everything in GPU land is at least 5 times harder than it is on the CPU. Case in point, I had trouble even getting Futhark to run on the GPU hardware of my main development machines, as it doesn’t have any compute shader back-ends, only OpenCL (which is basically dead) and CUDA (not much help on AMD or Apple M1 hardware). And that’s a relatively successful effort with over 8 years of experience!</p>
<p>I am also unlikely to write papers for academic journals and conferences. Perhaps it’s sour grapes, but I haven’t had a good experience, and feel that the (nontrivial) time and effort doesn’t really pay off. I’ll continue to use this blog, and publication of open source software, as the primary way I communicate research results. That said, I had one good experience of being asked to collaborate on an ASPLOS paper, and I remain open to collaborations. For someone with an incentive to publish academically, I think there’s quite a bit of raw material to work with.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I really feel that a lot of the research effort of the past year and beyond is poised to pay off in the next. Most especially, I am excited about shipping Vello, as I think that will advance the state of 2D rendering and also points the way for doing other interesting things in GPU compute. I also really look forward to working deeply with the community forming around Xilem to push that forward and hopefully make it reality.</p>
<p>There’s lots more that could be said, especially about the many individual subprojects, but in this reflection I hoped to give a general overview.</p>
<p>Discussion on <a href="https://www.reddit.com/r/rust/comments/zzw2wr/raphs_reflections_and_wishes_for_2023/">/r/rust</a>.</p>
<p>In any case, I wish everybody a happy and healthy 2023.</p>This post reflects a bit on 2022 and contains wishes for 2023. It mixes project and technical stuff, which I write about fairly frequently, and more personal reflections, which I don’t.Minikin retrospective2022-11-08T18:40:42+00:002022-11-08T18:40:42+00:00https://raphlinus.github.io/text/2022/11/08/minikin<p>There’s a lot of new interest in open source text layout projects, including <a href="https://github.com/pop-os/cosmic-text">cosmic-text</a>, and the <a href="https://github.com/dfrg/parley">parley</a> work being done by Chad Brokaw. I’m hopeful we’ll get a good Rust solution soon.</p>
<p>I encourage people working on text to study existing open source code bases, as much of the knowledge is arcane, and there is unfortunately no good textbook on the subject. Obviously browser engines facilitate extremely sophisticated layout, but because of their complexity and the constraints of HTML and CSS compatibility, they may be hard going. The APIs of <a href="https://learn.microsoft.com/en-us/windows/win32/directwrite/direct-write-portal">DirectWrite</a> and <a href="https://developer.apple.com/documentation/coretext">Core Text</a> are also worthy of study, but unfortunately their implementations remain closed.</p>
<p>An interesting codebase is <a href="https://android.googlesource.com/platform/frameworks/minikin/">Minikin</a>. It powers text layout in the Android framework. I wrote the original version starting back around 2013, but it has since been maintained by the very capable Android text team.</p>
<p>A bit of additional context. The design of Minikin was constrained by compatibility with the existing Android text API (mostly exposed through Java, though these days there is a nontrivial <a href="https://developer.android.com/ndk/reference/group/font">NDK surface</a> as well). There is a higher level layer, responsible for rich text representation, editing, and other things, while Minikin is the lower level that powers that and does shaping, itemization, hit testing, and so on. (See <a href="https://raphlinus.github.io/text/2020/10/26/text-layout.html">Text layout is a loose hierarchy of segmentation</a> if these terms are unfamiliar). For the most part, the interface between the higher and lower levels is <em>runs</em> of text. Line breaking is an interesting case where the responsibility for crossing levels is shared across the layer boundary. For the most part, the higher level iterates its own rich text representation and hands runs to Minikin.</p>
<p>A special feature of Minikin is its optimized line breaking algorithm, strongly inspired by Knuth-Plass. This was motivated largely by the need to handle small screens better, but has some tweaks to make it even better for mobile. The heuristics try not to place a word by itself on the last line, and in general try to balance line lengths. Line breaking generally follows ICU rules, but does special case email addresses and URLs, as those are very common on mobile and the ICU rules work poorly.</p>
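<p>To make the flavor of this concrete, here is a minimal sketch of an optimized breaker in the Knuth-Plass spirit: a dynamic program that minimizes total squared slack over candidate break points. This is not Minikin’s actual code (which also handles penalties, hyphenation, and the heuristics described above); the word-width inputs and function name are hypothetical stand-ins.</p>

```javascript
// Minimal dynamic-programming line breaker in the Knuth-Plass spirit.
// words: array of word widths; spaceW: width of one space; lineW: max line width.
// Returns the index of the first word of each line.
function breakLines(words, spaceW, lineW) {
  const n = words.length;
  // best[i] = minimal total demerits for laying out words[i..n-1]
  const best = new Array(n + 1).fill(Infinity);
  const next = new Array(n + 1).fill(n);
  best[n] = 0;
  for (let i = n - 1; i >= 0; i--) {
    let w = 0;
    for (let j = i; j < n; j++) {
      w += words[j] + (j > i ? spaceW : 0);
      if (w > lineW) break;
      const slack = lineW - w;
      // The last line incurs no penalty for trailing slack.
      const demerit = (j == n - 1 ? 0 : slack * slack) + best[j + 1];
      if (demerit < best[i]) {
        best[i] = demerit;
        next[i] = j + 1;
      }
    }
  }
  // Walk the optimal chain of breaks from the start.
  const breaks = [];
  for (let i = 0; i < n; i = next[i]) {
    breaks.push(i);
  }
  return breaks;
}
```

<p>Unlike a greedy breaker, which always packs each line as full as possible, this global minimization can trade a little slack on an earlier line for much better balance overall.</p>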
<p>There’s also fun stuff in there for grapheme boundaries, deciding between color and text emoji presentation forms, and calculating exactly what should be deleted on backspace (the logic for that is surprisingly complicated and as far as I know has not been written down anywhere).</p>
<p>Much of the internal structure of Minikin is dedicated to itemization, which is done largely on the basis of the cmap coverage of the fonts in the fallback chain. Doing that query font-by-font for every character would be expensive, so there are fancy bitmap acceleration structures. A good, general, cross-platform way to do itemization and fallback is a hard problem, but I think this solution works well for Android’s specific needs.</p>
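<p>As an illustrative sketch, under the assumption of a simple per-code-point coverage test (Minikin’s real cmap bitmaps, acceleration structures, and fallback rules are considerably more involved), coverage-based itemization looks roughly like this; the <code>itemize</code> name and the <code>{name, coverage}</code> font shape are invented for the example.</p>

```javascript
// Sketch of coverage-based itemization: for each code point, pick the first
// font in the fallback chain whose coverage contains it, starting a new run
// whenever the chosen font changes.
// fonts: array of { name, coverage }, where coverage is a Set of code points
// (a stand-in for a real cmap coverage bitmap).
function itemize(text, fonts) {
  const runs = [];
  let run = null;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    // Fall back to the first font if nothing covers the code point.
    const font = fonts.find(f => f.coverage.has(cp)) || fonts[0];
    if (run === null || run.font !== font.name) {
      run = { font: font.name, text: "" };
      runs.push(run);
    }
    run.text += ch;
  }
  return runs;
}
```

<p>The point of the bitmap acceleration structures is to make that inner <code>coverage.has</code> query cheap across a long fallback chain, rather than consulting each font’s cmap table directly.</p>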
<p>While there’s unfortunately no excellent documentation on Minikin’s internals, there are some resources, and part of the purpose of this blog is to point people to them. I’ve just gotten permission to publish the <a href="/assets/Project_Minikin.pdf">Project Minikin slide deck</a> from 2013 (PDF), which explains the motivation and some of the early design ideas. There’s also an <a href="https://lwn.net/Articles/662569/">lwn article</a> which goes into some detail, largely based on my 2015 ATypI talk (<a href="https://www.youtube.com/watch?v=L8LD0BM-Vjk">video</a>, <a href="https://docs.google.com/presentation/d/1-b1loWe23QNk0ydrmEBG31iVABYWSpE28JmdXDHs73E/edit">slides</a>).</p>
<p>Flutter’s <a href="https://github.com/flutter/flutter/issues/35994">LibTxt</a> was originally based on Minikin, but no doubt has diverged considerably since then, as they have fairly different requirements and of course are not bound by compatibility with Android.</p>
<p>But if you have deep interest in this, I recommend studying the code, as that’s the ultimate source of truth. It’s extremely fortunate that Android’s open source development model gives us access to it!</p>
<p>If you have questions, let me know (an issue on this repo is a good way, or you can ask on Mastodon), and I’ll do my best to answer.</p>There’s a lot of new interest in open source text layout projects, including cosmic-text, and the parley work being done by Chad Brokaw. I’m hopeful we’ll get a good Rust solution soon.Parallel curves of cubic Béziers2022-09-09T17:45:42+00:002022-09-09T17:45:42+00:00https://raphlinus.github.io/curves/2022/09/09/parallel-beziers<!-- I should figure out a cleaner way to do this include, rather than cutting and pasting. Ah well.-->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$', '$']]
}
});
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
<style>
svg {
touch-action: pinch-zoom;
overflow: visible;
}
svg .handle {
pointer-events: all;
}
svg .handle:hover {
r: 6;
}
svg .quad {
stroke-width: 1.5px;
stroke: #222;
}
svg .hull {
stroke: #a6c;
}
svg .approx_handle {
stroke: #444;
}
svg .polyline {
}
svg text {
font-family: sans-serif;
}
svg .button {
fill: #aad;
stroke: #44f;
}
svg .button:hover {
fill: #bbf;
}
svg text {
pointer-events: none;
}
svg #grid line {
stroke: #e4e4e4;
}
svg .band {
fill: #fda;
opacity: 0.3;
}
img {
margin: auto;
display: block;
}
input#d {
width: 300px;
}
input#tol {
width: 4em;
}
input#alg {
width: 4em;
}
.controls {
display: grid;
grid-template-columns: repeat(3, max-content);
column-gap: 20px;
margin-bottom: 15px;
}
</style>
<svg id="s" width="700" height="500">
<g id="grid"></g>
</svg>
<div class="controls">
<div>Distance</div>
<div>Accuracy</div>
<div>Method</div>
<div><input type="range" min="1" max="100" value="40" id="d" /></div>
<div><input type="button" id="tol" value="1" /></div>
<div><input type="button" id="alg" value="Fit" /></div>
</div>
<script>
// Copyright 2022 Google LLC
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// https://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
const svgNS = "http://www.w3.org/2000/svg";
class Point {
constructor(x, y) {
this.x = x;
this.y = y;
}
lerp(p2, t) {
return new Point(this.x + (p2.x - this.x) * t, this.y + (p2.y - this.y) * t);
}
dist(p2) {
return Math.hypot(p2.x - this.x, p2.y - this.y);
}
hypot2() {
return this.x * this.x + this.y * this.y;
}
hypot() {
return Math.sqrt(this.hypot2());
}
dot(other) {
return this.x * other.x + this.y * other.y;
}
cross(other) {
return this.x * other.y - this.y * other.x;
}
plus(other) {
return new Point(this.x + other.x, this.y + other.y);
}
minus(other) {
return new Point(this.x - other.x, this.y - other.y);
}
atan2() {
return Math.atan2(this.y, this.x);
}
}
class Affine {
constructor(c) {
this.c = c;
}
apply_pt(p) {
const c = this.c;
const x = c[0] * p.x + c[2] * p.y + c[4];
const y = c[1] * p.x + c[3] * p.y + c[5];
return new Point(x, y);
}
apply_cubic(cu) {
const c = this.c;
const new_c = new Float64Array(8);
for (let i = 0; i < 8; i += 2) {
new_c[i] = c[0] * cu.c[i] + c[2] * cu.c[i + 1] + c[4];
new_c[i + 1] = c[1] * cu.c[i] + c[3] * cu.c[i + 1] + c[5];
}
return new CubicBez(new_c);
}
static rotate(th) {
const c = new Float64Array(6);
c[0] = Math.cos(th);
c[1] = Math.sin(th);
c[2] = -c[1];
c[3] = c[0];
return new Affine(c);
}
}
// Compute an approximation to int (1 + 4x^2) ^ -0.25 dx
// This isn't especially good but will do.
function approx_myint(x) {
const d = 0.67;
return x / (1 - d + Math.pow(Math.pow(d, 4) + 0.25 * x * x, 0.25));
}
// Approximate the inverse of the function above.
// This is better.
function approx_inv_myint(x) {
const b = 0.39;
return x * (1 - b + Math.sqrt(b * b + 0.25 * x * x));
}
class QuadBez {
constructor(x0, y0, x1, y1, x2, y2) {
this.x0 = x0;
this.y0 = y0;
this.x1 = x1;
this.y1 = y1;
this.x2 = x2;
this.y2 = y2;
}
to_svg_path() {
return `M${this.x0} ${this.y0} Q${this.x1} ${this.y1} ${this.x2} ${this.y2}`
}
eval(t) {
const mt = 1 - t;
const x = this.x0 * mt * mt + 2 * this.x1 * t * mt + this.x2 * t * t;
const y = this.y0 * mt * mt + 2 * this.y1 * t * mt + this.y2 * t * t;
return new Point(x, y);
}
eval_deriv(t) {
const mt = 1 - t;
const x = 2 * (mt * (this.x1 - this.x0) + t * (this.x2 - this.x1));
const y = 2 * (mt * (this.y1 - this.y0) + t * (this.y2 - this.y1));
return new Point(x, y);
}
weightsum(c0, c1, c2) {
const x = c0 * this.x0 + c1 * this.x1 + c2 * this.x2;
const y = c0 * this.y0 + c1 * this.y1 + c2 * this.y2;
return new Point(x, y);
}
subsegment(t0, t1) {
const p0 = this.eval(t0);
const p2 = this.eval(t1);
const dt = t1 - t0;
const p1x = p0.x + (this.x1 - this.x0 + t0 * (this.x2 - 2 * this.x1 + this.x0)) * dt;
const p1y = p0.y + (this.y1 - this.y0 + t0 * (this.y2 - 2 * this.y1 + this.y0)) * dt;
return new QuadBez(p0.x, p0.y, p1x, p1y, p2.x, p2.y);
}
}
const GAUSS_LEGENDRE_COEFFS_8 = [
0.3626837833783620, -0.1834346424956498,
0.3626837833783620, 0.1834346424956498,
0.3137066458778873, -0.5255324099163290,
0.3137066458778873, 0.5255324099163290,
0.2223810344533745, -0.7966664774136267,
0.2223810344533745, 0.7966664774136267,
0.1012285362903763, -0.9602898564975363,
0.1012285362903763, 0.9602898564975363,
];
const GAUSS_LEGENDRE_COEFFS_8_HALF = [
0.3626837833783620, 0.1834346424956498,
0.3137066458778873, 0.5255324099163290,
0.2223810344533745, 0.7966664774136267,
0.1012285362903763, 0.9602898564975363,
];
const GAUSS_LEGENDRE_COEFFS_16_HALF = [
0.1894506104550685, 0.0950125098376374,
0.1826034150449236, 0.2816035507792589,
0.1691565193950025, 0.4580167776572274,
0.1495959888165767, 0.6178762444026438,
0.1246289712555339, 0.7554044083550030,
0.0951585116824928, 0.8656312023878318,
0.0622535239386479, 0.9445750230732326,
0.0271524594117541, 0.9894009349916499,
];
const GAUSS_LEGENDRE_COEFFS_24_HALF = [
0.1279381953467522, 0.0640568928626056,
0.1258374563468283, 0.1911188674736163,
0.1216704729278034, 0.3150426796961634,
0.1155056680537256, 0.4337935076260451,
0.1074442701159656, 0.5454214713888396,
0.0976186521041139, 0.6480936519369755,
0.0861901615319533, 0.7401241915785544,
0.0733464814110803, 0.8200019859739029,
0.0592985849154368, 0.8864155270044011,
0.0442774388174198, 0.9382745520027328,
0.0285313886289337, 0.9747285559713095,
0.0123412297999872, 0.9951872199970213,
];
const GAUSS_LEGENDRE_COEFFS_32 = [
0.0965400885147278, -0.0483076656877383,
0.0965400885147278, 0.0483076656877383,
0.0956387200792749, -0.1444719615827965,
0.0956387200792749, 0.1444719615827965,
0.0938443990808046, -0.2392873622521371,
0.0938443990808046, 0.2392873622521371,
0.0911738786957639, -0.3318686022821277,
0.0911738786957639, 0.3318686022821277,
0.0876520930044038, -0.4213512761306353,
0.0876520930044038, 0.4213512761306353,
0.0833119242269467, -0.5068999089322294,
0.0833119242269467, 0.5068999089322294,
0.0781938957870703, -0.5877157572407623,
0.0781938957870703, 0.5877157572407623,
0.0723457941088485, -0.6630442669302152,
0.0723457941088485, 0.6630442669302152,
0.0658222227763618, -0.7321821187402897,
0.0658222227763618, 0.7321821187402897,
0.0586840934785355, -0.7944837959679424,
0.0586840934785355, 0.7944837959679424,
0.0509980592623762, -0.8493676137325700,
0.0509980592623762, 0.8493676137325700,
0.0428358980222267, -0.8963211557660521,
0.0428358980222267, 0.8963211557660521,
0.0342738629130214, -0.9349060759377397,
0.0342738629130214, 0.9349060759377397,
0.0253920653092621, -0.9647622555875064,
0.0253920653092621, 0.9647622555875064,
0.0162743947309057, -0.9856115115452684,
0.0162743947309057, 0.9856115115452684,
0.0070186100094701, -0.9972638618494816,
0.0070186100094701, 0.9972638618494816,
];
const GAUSS_LEGENDRE_COEFFS_32_HALF = [
0.0965400885147278, 0.0483076656877383,
0.0956387200792749, 0.1444719615827965,
0.0938443990808046, 0.2392873622521371,
0.0911738786957639, 0.3318686022821277,
0.0876520930044038, 0.4213512761306353,
0.0833119242269467, 0.5068999089322294,
0.0781938957870703, 0.5877157572407623,
0.0723457941088485, 0.6630442669302152,
0.0658222227763618, 0.7321821187402897,
0.0586840934785355, 0.7944837959679424,
0.0509980592623762, 0.8493676137325700,
0.0428358980222267, 0.8963211557660521,
0.0342738629130214, 0.9349060759377397,
0.0253920653092621, 0.9647622555875064,
0.0162743947309057, 0.9856115115452684,
0.0070186100094701, 0.9972638618494816,
];
function tri_sign(x0, y0, x1, y1) {
return x1 * (y0 - y1) - y1 * (x0 - x1);
}
// Return distance squared
function line_nearest_origin(x0, y0, x1, y1) {
const dx = x1 - x0;
const dy = y1 - y0;
let dotp = -dx * x0 - dy * y0;
let d_sq = dx * dx + dy * dy;
if (dotp <= 0) {
return x0 * x0 + y0 * y0;
} else if (dotp >= d_sq) {
return x1 * x1 + y1 * y1;
} else {
const t = dotp / d_sq;
const x = x0 + t * (x1 - x0);
const y = y0 + t * (y1 - y0);
return x * x + y * y;
}
}
class CubicBez {
/// Argument is array of coordinate values [x0, y0, x1, y1, x2, y2, x3, y3].
constructor(coords) {
this.c = coords;
}
static from_pts(p0, p1, p2, p3) {
const c = new Float64Array(8);
c[0] = p0.x;
c[1] = p0.y;
c[2] = p1.x;
c[3] = p1.y;
c[4] = p2.x;
c[5] = p2.y;
c[6] = p3.x;
c[7] = p3.y;
return new CubicBez(c);
}
p0() {
return new Point(this.c[0], this.c[1]);
}
p1() {
return new Point(this.c[2], this.c[3]);
}
p2() {
return new Point(this.c[4], this.c[5]);
}
p3() {
return new Point(this.c[6], this.c[7]);
}
to_svg_path() {
const c = this.c;
return `M${c[0]} ${c[1]}C${c[2]} ${c[3]} ${c[4]} ${c[5]} ${c[6]} ${c[7]}`
}
weightsum(c0, c1, c2, c3) {
const x = c0 * this.c[0] + c1 * this.c[2] + c2 * this.c[4] + c3 * this.c[6];
const y = c0 * this.c[1] + c1 * this.c[3] + c2 * this.c[5] + c3 * this.c[7];
return new Point(x, y);
}
eval(t) {
const mt = 1 - t;
const c0 = mt * mt * mt;
const c1 = 3 * mt * mt * t;
const c2 = 3 * mt * t * t;
const c3 = t * t * t;
return this.weightsum(c0, c1, c2, c3);
}
eval_deriv(t) {
const mt = 1 - t;
const c0 = -3 * mt * mt;
const c3 = 3 * t * t;
const c1 = -6 * t * mt - c0;
const c2 = 6 * t * mt - c3;
return this.weightsum(c0, c1, c2, c3);
}
// quadratic bezier with matching endpoints and minimum max vector error
midpoint_quadbez() {
const p1 = this.weightsum(-0.25, 0.75, 0.75, -0.25);
return new QuadBez(this.c[0], this.c[1], p1.x, p1.y, this.c[6], this.c[7]);
}
subsegment(t0, t1) {
let c = new Float64Array(8);
const p0 = this.eval(t0);
const p3 = this.eval(t1);
c[0] = p0.x;
c[1] = p0.y;
const scale = (t1 - t0) / 3;
const d1 = this.eval_deriv(t0);
c[2] = p0.x + scale * d1.x;
c[3] = p0.y + scale * d1.y;
const d2 = this.eval_deriv(t1);
c[4] = p3.x - scale * d2.x;
c[5] = p3.y - scale * d2.y;
c[6] = p3.x;
c[7] = p3.y;
return new CubicBez(c);
}
area() {
const c = this.c;
return (c[0] * (6 * c[3] + 3 * c[5] + c[7])
+ 3 * (c[2] * (-2 * c[1] + c[5] + c[7]) - c[4] * (c[1] + c[3] - 2 * c[7]))
- c[6] * (c[1] + 3 * c[3] + 6 * c[5]))
* 0.05;
}
chord() {
return new Line([this.c[0], this.c[1], this.c[6], this.c[7]]);
}
deriv() {
const c = this.c;
return new QuadBez(
3 * (c[2] - c[0]), 3 * (c[3] - c[1]),
3 * (c[4] - c[2]), 3 * (c[5] - c[3]),
3 * (c[6] - c[4]), 3 * (c[7] - c[5])
);
}
// A pretty good algorithm; kurbo does more sophisticated error analysis.
arclen(accuracy) {
return this.arclen_rec(accuracy, 0);
}
arclen_rec(accuracy, depth) {
const c = this.c;
const d03x = c[6] - c[0];
const d03y = c[7] - c[1];
const d01x = c[2] - c[0];
const d01y = c[3] - c[1];
const d12x = c[4] - c[2];
const d12y = c[5] - c[3];
const d23x = c[6] - c[4];
const d23y = c[7] - c[5];
const lp_lc = Math.hypot(d01x, d01y) + Math.hypot(d12x, d12y)
+ Math.hypot(d23x, d23y) - Math.hypot(d03x, d03y);
const dd1x = d12x - d01x;
const dd1y = d12y - d01y;
const dd2x = d23x - d12x;
const dd2y = d23y - d12y;
const dmx = 0.25 * (d01x + d23x) + 0.5 * d12x;
const dmy = 0.25 * (d01y + d23y) + 0.5 * d12y;
const dm1x = 0.5 * (dd2x + dd1x);
const dm1y = 0.5 * (dd2y + dd1y);
const dm2x = 0.25 * (dd2x - dd1x);
const dm2y = 0.25 * (dd2y - dd1y);
const co_e = GAUSS_LEGENDRE_COEFFS_8;
let est = 0;
for (let i = 0; i < co_e.length; i += 2) {
const xi = co_e[i + 1];
const dx = dmx + dm1x * xi + dm2x * (xi * xi);
const dy = dmy + dm1y * xi + dm2y * (xi * xi);
const ddx = dm1x + dm2x * (2 * xi);
const ddy = dm1y + dm2y * (2 * xi);
est += co_e[i] * (ddx * ddx + ddy * ddy) / (dx * dx + dy * dy);
}
const est3 = est * est * est;
const est_gauss8_err = Math.min(est3 * 2.5e-6, 3e-2) * lp_lc;
let co = null;
if (est_gauss8_err < accuracy) {
co = GAUSS_LEGENDRE_COEFFS_8_HALF;
} else if (Math.min(est3 * est3 * 1.5e-11, 9e-3) * lp_lc < accuracy) {
co = GAUSS_LEGENDRE_COEFFS_16_HALF;
} else if (Math.min(est3 * est3 * est3 * 3.5e-16, 3.5e-3) * lp_lc < accuracy
|| depth >= 20)
{
co = GAUSS_LEGENDRE_COEFFS_24_HALF;
} else {
const c0 = this.subsegment(0, 0.5);
const c1 = this.subsegment(0.5, 1);
return c0.arclen_rec(accuracy * 0.5, depth + 1)
+ c1.arclen_rec(accuracy * 0.5, depth + 1);
}
let sum = 0;
for (let i = 0; i < co.length; i += 2) {
const xi = co[i + 1];
const wi = co[i];
const dx = dmx + dm2x * (xi * xi);
const dy = dmy + dm2y * (xi * xi);
const dp = Math.hypot(dx + dm1x * xi, dy + dm1y * xi);
const dm = Math.hypot(dx - dm1x * xi, dy - dm1y * xi);
sum += wi * (dp + dm);
}
return 1.5 * sum;
}
inv_arclen(s, accuracy) {
if (s <= 0) {
return 0;
}
const total_arclen = this.arclen(accuracy);
if (s >= total_arclen) {
return 1;
}
// For simplicity, measure arclen from 0 rather than stateful delta.
const f = t => this.subsegment(0, t).arclen(accuracy) - s;
const epsilon = accuracy / total_arclen;
return solve_itp(f, 0, 1, epsilon, 1, 2, -s, total_arclen - s);
}
find_offset_cusps(d) {
const q = this.deriv();
// x'' cross x' is a quadratic polynomial in t
const d0x = q.x0;
const d0y = q.y0;
const d1x = 2 * (q.x1 - q.x0);
const d1y = 2 * (q.y1 - q.y0);
const d2x = q.x0 - 2 * q.x1 + q.x2;
const d2y = q.y0 - 2 * q.y1 + q.y2;
const c0 = d1x * d0y - d1y * d0x;
const c1 = 2 * (d2x * d0y - d2y * d0x);
const c2 = d2x * d1y - d2y * d1x;
const cusps = new CuspAccumulator(d, q, c0, c1, c2);
this.find_offset_cusps_rec(d, cusps, 0, 1, c0, c1, c2);
return cusps.reap();
}
find_offset_cusps_rec(d, cusps, t0, t1, c0, c1, c2) {
cusps.report(t0);
const dt = t1 - t0;
const q = this.subsegment(t0, t1).deriv();
// compute interval for ds/dt, using convex hull of hodograph
const d1 = tri_sign(q.x0, q.y0, q.x1, q.y1);
const d2 = tri_sign(q.x1, q.y1, q.x2, q.y2);
const d3 = tri_sign(q.x2, q.y2, q.x0, q.y0);
const z = !((d1 < 0 || d2 < 0 || d3 < 0) && (d1 > 0 || d2 > 0 || d3 > 0));
const ds0 = q.x0 * q.x0 + q.y0 * q.y0;
const ds1 = q.x1 * q.x1 + q.y1 * q.y1;
const ds2 = q.x2 * q.x2 + q.y2 * q.y2;
const max_ds = Math.sqrt(Math.max(ds0, ds1, ds2)) / dt;
const m1 = line_nearest_origin(q.x0, q.y0, q.x1, q.y1);
const m2 = line_nearest_origin(q.x1, q.y1, q.x2, q.y2);
const m3 = line_nearest_origin(q.x2, q.y2, q.x0, q.y0);
const min_ds = z ? 0 : Math.sqrt(Math.min(m1, m2, m3)) / dt;
//console.log('ds interval', min_ds, max_ds, 'iv', t0, t1);
let cmin = Math.min(c0, c0 + c1 + c2);
let cmax = Math.max(c0, c0 + c1 + c2);
const t_crit = -0.5 * c1 / c2;
if (t_crit > 0 && t_crit < 1) {
const c_at_t = (c2 * t_crit + c1) * t_crit + c0;
cmin = Math.min(cmin, c_at_t);
cmax = Math.max(cmax, c_at_t);
}
const min3 = min_ds * min_ds * min_ds;
const max3 = max_ds * max_ds * max_ds;
// TODO: signs are wrong, want min/max of c * d
// But this is a suitable starting place for clipping.
if (cmin * d > -min3 || cmax * d < -max3) {
//return;
}
const rmax = solve_quadratic(c0 * d + max3, c1 * d, c2 * d);
const rmin = solve_quadratic(c0 * d + min3, c1 * d, c2 * d);
let ts;
// TODO: length = 1 cases. Also maybe reduce case explosion?
if (rmax.length == 2 && rmin.length == 2) {
if (c2 > 0) {
ts = [rmin[0], rmax[0], rmax[1], rmin[1]];
} else {
ts = [rmax[0], rmin[0], rmin[1], rmax[1]];
}
} else if (rmin.length == 2) {
if (c2 > 0) {
ts = rmin;
} else {
ts = [t0, rmin[0], rmin[1], t1];
}
} else if (rmax.length == 2) {
if (c2 > 0) {
ts = [t0, rmax[0], rmax[1], t1];
} else {
ts = rmax;
}
} else {
const c_at_t0 = (c2 * t0 + c1) * t0 + c0;
if (c_at_t0 * d < -min3 && c_at_t0 * d > -max3) {
ts = [t0, t1];
} else {
ts = [];
}
}
for (let i = 0; i < ts.length; i += 2) {
const new_t0 = Math.max(t0, ts[i]);
const new_t1 = Math.min(t1, ts[i + 1]);
if (new_t1 > new_t0) {
if (new_t1 - new_t0 < 1e-9) {
cusps.report(new_t0);
cusps.report(new_t1);
} else if (new_t1 - new_t0 > 0.5 * dt) {
const tm = 0.5 * (new_t0 + new_t1);
this.find_offset_cusps_rec(d, cusps, new_t0, tm, c0, c1, c2);
this.find_offset_cusps_rec(d, cusps, tm, new_t1, c0, c1, c2);
} else {
this.find_offset_cusps_rec(d, cusps, new_t0, new_t1, c0, c1, c2);
}
//console.log('iv', new_t0, new_t1);
}
}
cusps.report(t1);
//console.log(rmax);
//console.log(rmin);
//console.log('ts:', ts);
}
/*
// This is a brute-force solution; a more robust one is started above.
// Output is a partition of (0..1) into ranges, with signs.
find_offset_cusps(d) {
const result = [];
const n = 100;
const q = this.deriv();
// x'' cross x' is a quadratic polynomial in t
const d0x = q.x0;
const d0y = q.y0;
const d1x = 2 * (q.x1 - q.x0);
const d1y = 2 * (q.y1 - q.y0);
const d2x = q.x0 - 2 * q.x1 + q.x2;
const d2y = q.y0 - 2 * q.y1 + q.y2;
const c0 = d1x * d0y - d1y * d0x;
const c1 = 2 * (d2x * d0y - d2y * d0x);
const c2 = d2x * d1y - d2y * d1x;
let ya;
let last_t;
let t0 = 0;
for (let i = 0; i <= n; i++) {
const t = i / n;
const ds2 = q.eval(t).hypot2();
const k = (((c2 * t + c1) * t) + c0) / (ds2 * Math.sqrt(ds2));
const yb = k * d + 1;
if (i != 0) {
if (ya >= 0 != yb >= 0) {
let tx = (yb * last_t - ya * t) / (yb - ya);
const iv = {'t0': t0, 't1': tx, 'sign': Math.sign(ya)};
result.push(iv);
t0 = tx;
}
}
ya = yb;
last_t = t;
}
const last_iv = {'t0': t0, 't1': 1, 'sign': Math.sign(ya)};
result.push(last_iv);
return result;
}
*/
// Find intersections of ray from point p with tangent d
intersect_ray(p, d) {
const c = this.c
const px0 = c[0];
const px1 = 3 * c[2] - 3 * c[0];
const px2 = 3 * c[4] - 6 * c[2] + 3 * c[0];
const px3 = c[6] - 3 * c[4] + 3 * c[2] - c[0];
const py0 = c[1];
const py1 = 3 * c[3] - 3 * c[1];
const py2 = 3 * c[5] - 6 * c[3] + 3 * c[1];
const py3 = c[7] - 3 * c[5] + 3 * c[3] - c[1];
const c0 = d.y * (px0 - p.x) - d.x * (py0 - p.y);
const c1 = d.y * px1 - d.x * py1;
const c2 = d.y * px2 - d.x * py2;
const c3 = d.y * px3 - d.x * py3;
return solve_cubic(c0, c1, c2, c3).filter(t => t > 0 && t < 1);
}
}
class CuspAccumulator {
constructor(d, q, c0, c1, c2) {
this.d = d;
this.q = q;
this.c0 = c0;
this.c1 = c1;
this.c2 = c2;
this.t0 = 0;
this.last_t = 0;
this.last_y = this.eval(0);
this.result = [];
}
eval(t) {
const ds2 = this.q.eval(t).hypot2();
const k = (((this.c2 * t + this.c1) * t) + this.c0) / (ds2 * Math.sqrt(ds2));
return k * this.d + 1;
}
report(t) {
const yb = this.eval(t);
const ya = this.last_y;
if (ya >= 0 != yb >= 0) {
// More wired: use ITP
let tx = (yb * this.last_t - ya * t) / (yb - ya);
const iv = {'t0': this.t0, 't1': tx, 'sign': Math.sign(ya)};
this.result.push(iv);
this.t0 = tx;
}
this.last_t = t;
this.last_y = yb;
}
reap() {
const last_iv = {'t0': this.t0, 't1': 1, 'sign': Math.sign(this.last_y)};
this.result.push(last_iv);
return this.result;
}
}
class Line {
/// Argument is array of coordinate values [x0, y0, x1, y1].
constructor(coords) {
this.c = coords;
}
area() {
return (this.c[0] * this.c[3] - this.c[1] * this.c[2]) * 0.5;
}
}
function copysign(x, y) {
const a = Math.abs(x);
return y < 0 ? -a : a;
}
function solve_quadratic(c0, c1, c2) {
const sc0 = c0 / c2;
const sc1 = c1 / c2;
if (!(isFinite(sc0) && isFinite(sc1))) {
const root = -c0 / c1;
if (isFinite(root)) {
return [root];
} else if (c0 == 0 && c1 == 0) {
return [0];
} else {
return [];
}
}
const arg = sc1 * sc1 - 4 * sc0;
let root1 = 0;
if (isFinite(arg)) {
if (arg < 0) {
return [];
} else if (arg == 0) {
return [-0.5 * sc1];
}
root1 = -.5 * (sc1 + copysign(Math.sqrt(arg), sc1));
} else {
root1 = -sc1;
}
const root2 = sc0 / root1;
if (isFinite(root2)) {
if (root2 > root1) {
return [root1, root2];
} else {
return [root2, root1];
}
}
return [root1];
}
// See kurbo common.rs
function solve_cubic(in_c0, in_c1, in_c2, in_c3) {
const c2 = in_c2 / (3 * in_c3);
const c1 = in_c1 / (3 * in_c3);
const c0 = in_c0 / in_c3;
if (!(isFinite(c0) && isFinite(c1) && isFinite(c2))) {
return solve_quadratic(in_c0, in_c1, in_c2);
}
const d0 = -c2 * c2 + c1;
const d1 = -c1 * c2 + c0;
const d2 = c2 * c0 - c1 * c1;
const d = 4 * d0 * d2 - d1 * d1;
const de = -2 * c2 * d0 + d1;
if (d < 0) {
const sq = Math.sqrt(-0.25 * d);
const r = -0.5 * de;
const t1 = Math.cbrt(r + sq) + Math.cbrt(r - sq);
return [t1 - c2];
} else if (d == 0) {
const t1 = copysign(Math.sqrt(-d0), de);
return [t1 - c2, -2 * t1 - c2];
} else {
const th = Math.atan2(Math.sqrt(d), -de) / 3;
const r0 = Math.cos(th);
const ss3 = Math.sin(th) * Math.sqrt(3);
const r1 = 0.5 * (-r0 + ss3);
const r2 = 0.5 * (-r0 - ss3);
const t = 2 * Math.sqrt(-d0);
return [t * r0 - c2, t * r1 - c2, t * r2 - c2];
}
}
// Factor a quartic polynomial into two quadratics. Based on Orellana and De Michele
// and very similar to the version in kurbo.
function solve_quartic(c0, c1, c2, c3, c4) {
// This doesn't special-case c0 = 0.
if (c4 == 0) {
return solve_cubic(c0, c1, c2, c3);
}
const a = c3 / c4;
const b = c2 / c4;
const c = c1 / c4;
const d = c0 / c4;
let result = solve_quartic_inner(a, b, c, d, false);
if (result !== null) {
return result;
}
const K_Q = 7.16e76;
for (let i = 0; i < 2; i++) {
result = solve_quartic_inner(a / K_Q, b / (K_Q * K_Q), c / (K_Q * K_Q * K_Q),
d / (K_Q * K_Q * K_Q * K_Q), i != 0);
if (result !== null) {
for (let j = 0; j < result.length; j++) {
result[j] *= K_Q;
}
return result;
}
}
// Really bad overflow happened.
return [];
}
function eps_rel(raw, a) {
return a == 0 ? Math.abs(raw) : Math.abs((raw - a) / a);
}
function solve_quartic_inner(a, b, c, d, rescale) {
let result = factor_quartic_inner(a, b, c, d, rescale);
if (result === null) {
return null;
}
if (result.length == 4) {
let roots = [];
for (let i = 0; i < 2; i++) {
const a = result[i * 2];
const b = result[i * 2 + 1];
roots = roots.concat(solve_quadratic(b, a, 1));
}
return roots;
}
// Factorization found no real quadratic factors: no real roots.
return [];
}
function factor_quartic_inner(a, b, c, d, rescale) {
function calc_eps_q(a1, b1, a2, b2) {
const eps_a = eps_rel(a1 + a2, a);
const eps_b = eps_rel(b1 + a1 * a2 + b2, b);
const eps_c = eps_rel(b1 * a2 + a1 * b2, c);
return eps_a + eps_b + eps_c;
}
function calc_eps_t(a1, b1, a2, b2) {
return calc_eps_q(a1, b1, a2, b2) + eps_rel(b1 * b2, d);
}
const disc = 9 * a * a - 24 * b;
const s = disc >= 0 ? -2 * b / (3 * a + copysign(Math.sqrt(disc), a)) : -0.25 * a;
const a_prime = a + 4 * s;
const b_prime = b + 3 * s * (a + 2 * s);
const c_prime = c + s * (2 * b + s * (3 * a + 4 * s));
const d_prime = d + s * (c + s * (b + s * (a + s)));
let g_prime = 0;
let h_prime = 0;
const K_C = 3.49e102;
if (rescale) {
const a_prime_s = a_prime / K_C;
const b_prime_s = b_prime / K_C;
const c_prime_s = c_prime / K_C;
const d_prime_s = d_prime / K_C;
g_prime = a_prime_s * c_prime_s - (4 / K_C) * d_prime_s - (1. / 3) * b_prime_s * b_prime_s;
h_prime = (a_prime_s * c_prime_s + (8 / K_C) * d_prime_s - (2. / 9) * b_prime_s * b_prime_s)
* (1. / 3) * b_prime_s
- c_prime_s * (c_prime_s / K_C)
- a_prime_s * a_prime_s * d_prime_s;
} else {
g_prime = a_prime * c_prime - 4 * d_prime - (1. / 3) * b_prime * b_prime;
h_prime = (a_prime * c_prime + 8 * d_prime - (2. / 9) * b_prime * b_prime) * (1. / 3) * b_prime
- c_prime * c_prime
- a_prime * a_prime * d_prime;
}
if (!(isFinite(g_prime) && isFinite(h_prime))) {
return null;
}
let phi = depressed_cubic_dominant(g_prime, h_prime);
if (rescale) {
phi *= K_C;
}
const l_1 = a * 0.5;
const l_3 = (1. / 6) * b + 0.5 * phi;
const delt_2 = c - a * l_3;
const d_2_cand_1 = (2. / 3) * b - phi - l_1 * l_1;
const l_2_cand_1 = 0.5 * delt_2 / d_2_cand_1;
const l_2_cand_2 = 2 * (d - l_3 * l_3) / delt_2;
const d_2_cand_2 = 0.5 * delt_2 / l_2_cand_2;
let d_2_best = 0;
let l_2_best = 0;
let eps_l_best = 0;
for (let i = 0; i < 3; i++) {
const d_2 = i == 1 ? d_2_cand_2 : d_2_cand_1;
const l_2 = i == 0 ? l_2_cand_1 : l_2_cand_2;
const eps_0 = eps_rel(d_2 + l_1 * l_1 + 2 * l_3, b);
const eps_1 = eps_rel(2 * (d_2 * l_2 + l_1 * l_3), c);
const eps_2 = eps_rel(d_2 * l_2 * l_2 + l_3 * l_3, d);
const eps_l = eps_0 + eps_1 + eps_2;
if (i == 0 || eps_l < eps_l_best) {
d_2_best = d_2;
l_2_best = l_2;
eps_l_best = eps_l;
}
}
const d_2 = d_2_best;
const l_2 = l_2_best;
let alpha_1 = 0;
let beta_1 = 0;
let alpha_2 = 0;
let beta_2 = 0;
if (d_2 < 0.0) {
const sq = Math.sqrt(-d_2);
alpha_1 = l_1 + sq;
beta_1 = l_3 + sq * l_2;
alpha_2 = l_1 - sq;
beta_2 = l_3 - sq * l_2;
if (Math.abs(beta_2) < Math.abs(beta_1)) {
beta_2 = d / beta_1;
} else if (Math.abs(beta_2) > Math.abs(beta_1)) {
beta_1 = d / beta_2;
}
if (Math.abs(alpha_1) != Math.abs(alpha_2)) {
let a1_cands = null;
let a2_cands = null;
if (Math.abs(alpha_1) < Math.abs(alpha_2)) {
const a1_cand_1 = (c - beta_1 * alpha_2) / beta_2;
const a1_cand_2 = (b - beta_2 - beta_1) / alpha_2;
const a1_cand_3 = a - alpha_2;
a1_cands = [a1_cand_3, a1_cand_1, a1_cand_2];
a2_cands = [alpha_2, alpha_2, alpha_2];
} else {
const a2_cand_1 = (c - alpha_1 * beta_2) / beta_1;
const a2_cand_2 = (b - beta_2 - beta_1) / alpha_1;
const a2_cand_3 = a - alpha_1;
a1_cands = [alpha_1, alpha_1, alpha_1];
a2_cands = [a2_cand_3, a2_cand_1, a2_cand_2];
}
let eps_q_best = 0;
for (let i = 0; i < 3; i++) {
const a1 = a1_cands[i];
const a2 = a2_cands[i];
if (isFinite(a1) && isFinite(a2)) {
const eps_q = calc_eps_q(a1, beta_1, a2, beta_2);
if (i == 0 || eps_q < eps_q_best) {
alpha_1 = a1;
alpha_2 = a2;
eps_q_best = eps_q;
}
}
}
}
} else if (d_2 == 0) {
const d_3 = d - l_3 * l_3;
alpha_1 = l_1;
beta_1 = l_3 + Math.sqrt(-d_3);
alpha_2 = l_1;
beta_2 = l_3 - Math.sqrt(-d_3);
if (Math.abs(beta_1) > Math.abs(beta_2)) {
beta_2 = d / beta_1;
} else if (Math.abs(beta_2) > Math.abs(beta_1)) {
beta_1 = d / beta_2;
}
} else {
// No real solutions
return [];
}
let eps_t = calc_eps_t(alpha_1, beta_1, alpha_2, beta_2);
for (let i = 0; i < 8; i++) {
if (eps_t == 0) {
break;
}
const f_0 = beta_1 * beta_2 - d;
const f_1 = beta_1 * alpha_2 + alpha_1 * beta_2 - c;
const f_2 = beta_1 + alpha_1 * alpha_2 + beta_2 - b;
const f_3 = alpha_1 + alpha_2 - a;
const c_1 = alpha_1 - alpha_2;
const det_j = beta_1 * beta_1 - beta_1 * (alpha_2 * c_1 + 2 * beta_2)
+ beta_2 * (alpha_1 * c_1 + beta_2);
if (det_j == 0) {
break;
}
const inv = 1 / det_j;
const c_2 = beta_2 - beta_1;
const c_3 = beta_1 * alpha_2 - alpha_1 * beta_2;
const dz_0 = c_1 * f_0 + c_2 * f_1 + c_3 * f_2 - (beta_1 * c_2 + alpha_1 * c_3) * f_3;
const dz_1 = (alpha_1 * c_1 + c_2) * f_0
- beta_1 * (c_1 * f_1 + c_2 * f_2 + c_3 * f_3);
const dz_2 = -c_1 * f_0 - c_2 * f_1 - c_3 * f_2 + (alpha_2 * c_3 + beta_2 * c_2) * f_3;
const dz_3 = -(alpha_2 * c_1 + c_2) * f_0
+ beta_2 * (c_1 * f_1 + c_2 * f_2 + c_3 * f_3);
const a1 = alpha_1 - inv * dz_0;
const b1 = beta_1 - inv * dz_1;
const a2 = alpha_2 - inv * dz_2;
const b2 = beta_2 - inv * dz_3;
const new_eps_t = calc_eps_t(a1, b1, a2, b2);
if (new_eps_t < eps_t) {
alpha_1 = a1;
beta_1 = b1;
alpha_2 = a2;
beta_2 = b2;
eps_t = new_eps_t;
} else {
break;
}
}
return [alpha_1, beta_1, alpha_2, beta_2];
}
function depressed_cubic_dominant(g, h) {
const q = (-1. / 3) * g;
const r = 0.5 * h;
let phi_0;
let k = null;
if (Math.abs(q) >= 1e102 || Math.abs(r) >= 1e164) {
if (Math.abs(q) < Math.abs(r)) {
k = 1 - q * (q / r) * (q / r);
} else {
k = Math.sign(q) * ((r / q) * (r / q) / q - 1);
}
}
if (k !== null && r == 0) {
if (g > 0) {
phi_0 = 0;
} else {
phi_0 = Math.sqrt(-g);
}
} else if (k !== null ? k < 0 : r * r < q * q * q) {
const t = k !== null ? r / q / Math.sqrt(q) : r / Math.sqrt(q * q * q);
phi_0 = -2 * Math.sqrt(q) * copysign(Math.cos(Math.acos(Math.abs(t)) * (1. / 3)), t);
} else {
let a;
if (k !== null) {
if (Math.abs(q) < Math.abs(r)) {
a = -r * (1 + Math.sqrt(k));
} else {
a = -r - copysign(Math.sqrt(Math.abs(q)) * q * Math.sqrt(k), r);
}
} else {
a = Math.cbrt(-r - copysign(Math.sqrt(r * r - q * q * q), r));
}
const b = a == 0 ? 0 : q / a;
phi_0 = a + b;
}
let x = phi_0;
let f = (x * x + g) * x + h;
const EPS_M = 2.22045e-16;
if (Math.abs(f) < EPS_M * Math.max(Math.abs(x * x * x), Math.abs(g * x), Math.abs(h))) {
return x;
}
for (let i = 0; i < 8; i++) {
const delt_f = 3 * x * x + g;
if (delt_f == 0) {
break;
}
const new_x = x - f / delt_f;
const new_f = (new_x * new_x + g) * new_x + h;
if (new_f == 0) {
return new_x;
}
if (Math.abs(new_f) >= Math.abs(f)) {
break;
}
x = new_x;
f = new_f;
}
return x;
}
// For testing.
function vieta(x1, x2, x3, x4) {
const a = -(x1 + x2 + x3 + x4);
const b = x1 * (x2 + x3) + x2 * (x3 + x4) + x4 * (x1 + x3);
const c = -x1 * x2 * (x3 + x4) - x3 * x4 * (x1 + x2);
const d = x1 * x2 * x3 * x4;
const roots = solve_quartic(d, c, b, a, 1);
return roots;
}
// See common.rs in kurbo
function solve_itp(f, a, b, epsilon, n0, k1, ya, yb) {
const n1_2 = Math.max(Math.ceil(Math.log2((b - a) / epsilon)) - 1, 0);
const nmax = n0 + n1_2;
let scaled_epsilon = epsilon * Math.exp(nmax * Math.LN2);
while (b - a > 2 * epsilon) {
const x1_2 = 0.5 * (a + b);
const r = scaled_epsilon - 0.5 * (b - a);
const xf = (yb * a - ya * b) / (yb - ya);
const sigma = x1_2 - xf;
const delta = k1 * (b - a) * (b - a);
const xt = delta <= Math.abs(x1_2 - xf) ? xf + copysign(delta, sigma) : x1_2;
const xitp = Math.abs(xt - x1_2) <= r ? xt : x1_2 - copysign(r, sigma);
const yitp = f(xitp);
if (yitp > 0) {
b = xitp;
yb = yitp;
} else if (yitp < 0) {
a = xitp;
ya = yitp;
} else {
return xitp;
}
scaled_epsilon *= 0.5;
}
return 0.5 * (a + b);
}
function ray_intersect(p0, d0, p1, d1) {
const det = d0.x * d1.y - d0.y * d1.x;
const t = (d0.x * (p0.y - p1.y) - d0.y * (p0.x - p1.x)) / det;
return new Point(p1.x + d1.x * t, p1.y + d1.y * t);
}
class CubicOffset {
constructor(c, d) {
this.c = c;
this.q = c.deriv();
this.d = d;
}
eval_offset(t) {
const dp = this.q.eval(t);
const s = this.d / dp.hypot();
return new Point(-s * dp.y, s * dp.x);
}
eval(t) {
return this.c.eval(t).plus(this.eval_offset(t));
}
eval_deriv(t) {
const dp = this.q.eval(t);
const ddp = this.q.eval_deriv(t);
const h = dp.hypot2();
const turn = ddp.cross(dp) * this.d / (h * Math.sqrt(h));
const s = 1 + turn;
return new Point(s * dp.x, s * dp.y);
}
// Compute area and x moment
calc() {
let arclen = 0;
let area = 0;
let moment_x = 0;
const co = GAUSS_LEGENDRE_COEFFS_32;
for (let i = 0; i < co.length; i += 2) {
const t = 0.5 * (1 + co[i + 1]);
const wi = co[i];
const dp = this.eval_deriv(t);
const p = this.eval(t);
const d_area = wi * dp.x * p.y;
arclen += wi * dp.hypot();
area += d_area;
moment_x += p.x * d_area;
}
return {'arclen': 0.5 * arclen, 'area': 0.5 * area, 'mx': 0.5 * moment_x };
}
sample_pts(n) {
const result = [];
let arclen = 0;
// Probably overkill, but keep it simple
const co = GAUSS_LEGENDRE_COEFFS_32;
const dt = 1 / (n + 1);
for (let i = 0; i < n; i++) {
for (let j = 0; j < co.length; j += 2) {
const t = dt * (i + 0.5 + 0.5 * co[j + 1]);
arclen += co[j] * this.eval_deriv(t).hypot();
}
const t = dt * (i + 1);
const d = this.eval_offset(t);
const p = this.c.eval(t).plus(d);
result.push({'arclen': arclen * 0.5 * dt, 'p': p, 'd': d});
}
return result;
}
rotate_to_x() {
const p0 = this.c.p0().plus(this.eval_offset(0));
const p1 = this.c.p3().plus(this.eval_offset(1));
const th = p1.minus(p0).atan2();
const a = Affine.rotate(-th);
const ct = CubicBez.from_pts(
a.apply_pt(this.c.p0().minus(p0)),
a.apply_pt(this.c.p1().minus(p0)),
a.apply_pt(this.c.p2().minus(p0)),
a.apply_pt(this.c.p3().minus(p0))
);
const co = new CubicOffset(ct, this.d);
return {'c': co, 'th': th, 'p0': p0};
}
// Error evaluation logic from Tiller and Hanson.
est_cubic_err(cu, samples, tolerance) {
let err = 0;
let tol2 = tolerance * tolerance;
for (let sample of samples) {
let best_err = null;
// Project sample point onto approximate curve along normal.
let ts = cu.intersect_ray(sample.p, sample.d);
if (ts.length == 0) {
// In production, if no rays intersect we probably want
// to reject this candidate altogether. But we sample the
// endpoints so you can get a plausible number.
ts = [0, 1];
}
for (let t of ts) {
const p_proj = cu.eval(t);
const this_err = sample.p.minus(p_proj).hypot2();
if (best_err === null || this_err < best_err) {
best_err = this_err;
}
}
err = Math.max(err, best_err);
if (err > tol2) {
break;
}
}
return Math.sqrt(err);
}
cubic_approx(tolerance, sign) {
const r = this.rotate_to_x();
const end_x = r.c.c.c[6] + r.c.eval_offset(1).x;
const metrics = r.c.calc();
const arclen = metrics.arclen;
const th0 = Math.atan2(sign * r.c.q.y0, sign * r.c.q.x0);
const th1 = -Math.atan2(sign * r.c.q.y2, sign * r.c.q.x2);
const ex2 = end_x * end_x;
const ex3 = ex2 * end_x;
const cands = cubic_fit(th0, th1, metrics.area / ex2, metrics.mx / ex3);
const c = new Float64Array(6);
const cx = end_x * Math.cos(r.th);
const sx = end_x * Math.sin(r.th);
c[0] = cx;
c[1] = sx;
c[2] = -sx;
c[3] = cx;
c[4] = r.p0.x;
c[5] = r.p0.y;
const a = new Affine(c);
const samples = this.sample_pts(10);
let best_c = null;
let best_err;
let errs = [];
for (let raw_cand of cands) {
const cand = a.apply_cubic(raw_cand);
const err = this.est_cubic_err(cand, samples, tolerance);
errs.push(err);
if (best_c === null || err < best_err) {
best_err = err;
best_c = cand;
}
}
//console.log(errs);
if (best_c === null) {
return null;
}
return {'c': best_c, 'err': best_err};
}
cubic_approx_other(conf, sign) {
let c;
if (conf.method == 'T-H') {
c = this.tiller_hanson();
} else if (conf.method == 'Shape') {
c = this.shape_control();
}
if (c === null) {
return null;
}
const samples = this.sample_pts(10);
const err = this.est_cubic_err(c, samples, conf.tolerance);
return {'c': c, 'err': err};
}
cubic_approx_seq(conf, sign) {
let approx;
if (conf.method == 'Fit') {
approx = this.cubic_approx(conf.tolerance, sign);
} else {
approx = this.cubic_approx_other(conf, sign);
}
if (approx !== null && approx.err <= conf.tolerance) {
return [approx.c];
} else {
const co0 = this.subsegment(0, 0.5);
const co1 = this.subsegment(0.5, 1);
const seq0 = co0.cubic_approx_seq(conf, sign);
const seq1 = co1.cubic_approx_seq(conf, sign);
return seq0.concat(seq1);
}
}
subsegment(t0, t1) {
const cu = this.c.subsegment(t0, t1);
return new CubicOffset(cu, this.d);
}
tiller_hanson() {
const q = this.c.deriv();
const d0 = this.eval_offset(0);
const d1 = this.eval_offset(1);
const p0 = this.c.p0().plus(d0);
const p3 = this.c.p3().plus(d1);
const c_p1 = this.c.p1();
const c_p2 = this.c.p2();
const d12 = c_p2.minus(c_p1);
const s = this.d / d12.hypot();
const pm = new Point(c_p1.x - s * d12.y, c_p1.y + s * d12.x);
const pm2 = new Point(c_p2.x - s * d12.y, c_p2.y + s * d12.x);
const p1 = ray_intersect(p0, q.eval(0), pm, d12);
const p2 = ray_intersect(p3, q.eval(1), pm, d12);
return CubicBez.from_pts(p0, p1, p2, p3);
}
shape_control() {
const c = this.c.c;
const q = this.c.deriv();
const p0 = this.c.p0().plus(this.eval_offset(0));
const p3 = this.c.p3().plus(this.eval_offset(1));
const p = this.eval(0.5);
const a11 = c[2] - c[0];
const a12 = c[4] - c[6];
const a21 = c[3] - c[1];
const a22 = c[5] - c[7];
const b1 = (8. / 3) * (p.x - 0.5 * (p0.x + p3.x));
const b2 = (8. / 3) * (p.y - 0.5 * (p0.y + p3.y));
const det = a11 * a22 - a12 * a21;
if (det == 0) {
return null;
}
const a = (b1 * a22 - a12 * b2) / det;
const b = (a11 * b2 - b1 * a21) / det;
const p1 = new Point(p0.x + a * a11, p0.y + a * a21);
const p2 = new Point(p3.x + b * a12, p3.y + b * a22);
return CubicBez.from_pts(p0, p1, p2, p3);
}
}
function cubic_seq_to_svg(cu_seq) {
const c0 = cu_seq[0].c;
let str = `M${c0[0]} ${c0[1]}`;
for (const cu of cu_seq) {
const ci = cu.c;
str += `C${ci[2]} ${ci[3]} ${ci[4]} ${ci[5]} ${ci[6]} ${ci[7]}`;
}
return str;
}
function cubic_seq_to_svg_handles(cu_seq) {
let str = '';
for (const cu of cu_seq) {
const ci = cu.c;
str += `M${ci[0]} ${ci[1]}L${ci[2]} ${ci[3]}M${ci[4]} ${ci[5]}L${ci[6]} ${ci[7]}`;
}
return str;
}
/// Returns an array of candidate cubics matching given metrics.
function cubic_fit(th0, th1, area, mx) {
//console.log(th0, th1, area, mx);
const c0 = Math.cos(th0);
const s0 = Math.sin(th0);
const c1 = Math.cos(th1);
const s1 = Math.sin(th1);
const a4 = -9
* c0
* (((2 * s1 * c1 * c0 + s0 * (2 * c1 * c1 - 1)) * c0 - 2 * s1 * c1) * c0
- c1 * c1 * s0);
const a3 = 12
* ((((c1 * (30 * area * c1 - s1) - 15 * area) * c0 + 2 * s0
- c1 * s0 * (c1 + 30 * area * s1))
* c0
+ c1 * (s1 - 15 * area * c1))
* c0
- s0 * c1 * c1);
const a2 = 12
* ((((70 * mx + 15 * area) * s1 * s1 + c1 * (9 * s1 - 70 * c1 * mx - 5 * c1 * area))
* c0
- 5 * s0 * s1 * (3 * s1 - 4 * c1 * (7 * mx + area)))
* c0
- c1 * (9 * s1 - 70 * c1 * mx - 5 * c1 * area));
const a1 = 16
* (((12 * s0 - 5 * c0 * (42 * mx - 17 * area)) * s1
- 70 * c1 * (3 * mx - area) * s0
- 75 * c0 * c1 * area * area)
* s1
- 75 * c1 * c1 * area * area * s0);
const a0 = 80 * s1 * (42 * s1 * mx - 25 * area * (s1 - c1 * area));
//console.log(a0, a1, a2, a3, a4);
let roots;
const EPS = 1e-12;
if (Math.abs(a4) > EPS) {
const a = a3 / a4;
const b = a2 / a4;
const c = a1 / a4;
const d = a0 / a4;
const quads = factor_quartic_inner(a, b, c, d, false);
/*
const solved = solve_quartic(a0, a1, a2, a3, a4);
for (let x of solved) {
const y = (((a4 * x + a3) * x + a2) * x + a1) * x + a0;
console.log(x, y);
}
*/
roots = [];
for (let i = 0; i < quads.length; i += 2) {
const c1 = quads[i];
const c0 = quads[i + 1];
const q_roots = solve_quadratic(c0, c1, 1);
if (q_roots.length > 0) {
roots = roots.concat(q_roots);
} else {
// Real part of pair of complex roots
roots.push(-0.5 * c1);
}
}
} else {
// Question: do we ever care about complex roots in these cases?
if (Math.abs(a3) > EPS) {
roots = solve_cubic(a0, a1, a2, a3);
} else {
roots = solve_quadratic(a0, a1, a2);
}
}
const s01 = s0 * c1 + s1 * c0;
//console.log(roots);
const cubics = [];
for (let d0 of roots) {
let d1 = (2 * d0 * s0 - area * (20 / 3.)) / (d0 * s01 - 2 * s1);
if (d0 < 0) {
d0 = 0;
d1 = s0 / s01;
} else if (d1 < 0) {
d0 = s1 / s01;
d1 = 0;
}
if (d0 >= 0 && d1 >= 0) {
const c = new Float64Array(8);
c[2] = d0 * c0;
c[3] = d0 * s0;
c[4] = 1 - d1 * c1;
c[5] = d1 * s1;
c[6] = 1;
cubics.push(new CubicBez(c));
}
}
return cubics;
}
// One manipulable cubic bezier
class CubicUi {
constructor(ui, pts) {
this.ui = ui;
this.pts = pts;
this.curve = ui.make_stroke();
this.curve.classList.add("quad");
this.hull = ui.make_stroke();
this.hull.classList.add("hull");
this.handles = [];
for (let pt of pts) {
this.handles.push(ui.make_handle(pt));
}
}
onPointerDown(e) {
const pt = this.ui.getCoords(e);
const x = pt.x;
const y = pt.y;
for (let i = 0; i < this.pts.length; i++) {
if (Math.hypot(x - this.pts[i].x, y - this.pts[i].y) < 10) {
this.current_obj = i;
return true;
}
}
return false;
}
onPointerMove(e) {
const i = this.current_obj;
const pt = this.ui.getCoords(e);
this.pts[i] = pt;
this.handles[i].setAttribute("cx", pt.x);
this.handles[i].setAttribute("cy", pt.y);
}
getCubic() {
const p0 = this.pts[0];
const p1 = this.pts[1];
const p2 = this.pts[2];
const p3 = this.pts[3];
let c = new Float64Array(8);
c[0] = p0.x;
c[1] = p0.y;
c[2] = p1.x;
c[3] = p1.y;
c[4] = p2.x;
c[5] = p2.y;
c[6] = p3.x;
c[7] = p3.y;
return new CubicBez(c);
}
update() {
const cb = this.getCubic();
const pts = this.pts;
this.curve.setAttribute("d", cb.to_svg_path());
const h = `M${pts[0].x} ${pts[0].y}L${pts[1].x} ${pts[1].y}M${pts[2].x} ${pts[2].y}L${pts[3].x} ${pts[3].y}`;
this.hull.setAttribute("d", h);
}
}
class OffsetUi {
constructor(id) {
const n_cubics = 2;
this.root = document.getElementById(id);
this.root.addEventListener("pointerdown", e => {
this.root.setPointerCapture(e.pointerId);
this.onPointerDown(e);
e.preventDefault();
e.stopPropagation();
});
this.root.addEventListener("pointermove", e => {
this.onPointerMove(e);
e.preventDefault();
e.stopPropagation();
});
this.root.addEventListener("pointerup", e => {
this.root.releasePointerCapture(e.pointerId);
this.onPointerUp(e);
e.preventDefault();
e.stopPropagation();
});
document.getElementById('d').addEventListener('input', e => this.update());
document.getElementById('tol').addEventListener('click', e => this.click_tol());
document.getElementById('alg').addEventListener('click', e => this.click_alg());
window.addEventListener("keydown", e => this.onKeyDown(e));
const pts_foo = [new Point(67, 237), new Point(374, 471), new Point(321, 189), new Point(633, 65)];
this.cubic_foo = new CubicUi(this, pts_foo);
this.xs = [200, 600];
this.quad = this.make_stroke();
this.quad.classList.add("quad");
this.approx_offset = this.make_stroke();
this.approx_handles = this.make_stroke();
this.approx_handles.classList.add("approx_handle");
this.n_label = this.make_text(500, 55);
this.type_label = this.make_text(90, 55);
this.type_label.setAttribute("text-anchor", "middle");
this.thresh_label = this.make_text(210, 55);
this.pips = [];
this.method = 'Fit';
this.grid = 20;
this.tolerance = 1;
this.renderGrid(true);
this.update();
this.current_obj = null;
}
getCoords(e) {
const rect = this.root.getBoundingClientRect();
const x = e.clientX - rect.left;
const y = e.clientY - rect.top;
return new Point(x, y);
}
onPointerDown(e) {
const pt = this.getCoords(e);
const x = pt.x;
const y = pt.y;
if (this.cubic_foo.onPointerDown(e)) {
this.current_obj = 'cubic';
return;
}
}
onPointerMove(e) {
// Maybe use object oriented dispatch?
if (this.current_obj == 'cubic') {
this.cubic_foo.onPointerMove(e);
this.update();
}
const pt = this.getCoords(e);
}
onPointerUp(e) {
this.current_obj = null;
}
onKeyDown(e) {
if (e.key == 's') {
this.method = "sederberg";
this.update();
} else if (e.key == 'r') {
this.method = "recursive";
this.update();
} else if (e.key == 'a') {
this.method = "analytic";
this.update();
} else if (e.key == 'w') {
this.method = "wang";
this.update();
}
}
renderGrid(visible) {
let grid = document.getElementById("grid");
//this.ui.removeAllChildren(grid);
if (!visible) return;
let w = 700;
let h = 500;
for (let i = 0; i < w; i += this.grid) {
let line = document.createElementNS(svgNS, "line");
line.setAttribute("x1", i);
line.setAttribute("y1", 0);
line.setAttribute("x2", i);
line.setAttribute("y2", h);
grid.appendChild(line);
}
for (let i = 0; i < h; i += this.grid) {
let line = document.createElementNS(svgNS, "line");
line.setAttribute("x1", 0);
line.setAttribute("y1", i);
line.setAttribute("x2", w);
line.setAttribute("y2", i);
grid.appendChild(line);
}
}
make_handle(p) {
const circle = this.plot(p.x, p.y, "blue", 4);
circle.classList.add("handle");
return circle;
}
make_stroke() {
const path = document.createElementNS(svgNS, "path");
path.setAttribute("fill", "none");
path.setAttribute("stroke", "blue");
this.root.appendChild(path);
return path;
}
make_clip_path(id) {
const clip_path = document.createElementNS(svgNS, "clipPath");
clip_path.setAttribute("id", id);
const path = document.createElementNS(svgNS, "path");
this.root.appendChild(clip_path);
clip_path.appendChild(path);
return path;
}
make_text(x, y) {
const text = document.createElementNS(svgNS, "text");
text.setAttribute("x", x);
text.setAttribute("y", y);
this.root.appendChild(text);
return text;
}
plot(x, y, color = "black", r = 2) {
let circle = document.createElementNS(svgNS, "circle");
circle.setAttribute("cx", x);
circle.setAttribute("cy", y);
circle.setAttribute("r", r);
circle.setAttribute("fill", color);
this.root.appendChild(circle);
return circle;
}
click_tol() {
const vals = [1, 0.1, 0.01, 0.001, 1e9, 10];
let tol = 1;
for (let i = 0; i < vals.length - 1; i++) {
if (this.tolerance == vals[i]) {
tol = vals[i + 1];
break;
}
}
this.tolerance = tol;
document.getElementById('tol').value = tol == 1e9 ? '\u221e' : `${tol}`;
this.update();
}
click_alg() {
let alg = 'Fit';
if (this.method == 'Fit') {
alg = 'T-H';
} else if (this.method == 'T-H') {
alg = 'Shape';
}
this.method = alg;
document.getElementById('alg').value = alg;
this.update();
}
update() {
for (let pip of this.pips) {
pip.remove();
}
this.pips = [];
const cb = this.cubic_foo.getCubic();
const conf = {
'd': document.getElementById('d').value,
'tolerance': this.tolerance,
'method': this.method,
};
const cusps = cb.find_offset_cusps(conf.d);
this.cubic_foo.update();
const c_off = new CubicOffset(cb, conf.d);
//console.log(c_off.sample_pts(10));
/*
const approx = c_off.cubic_approx();
const c = approx.c;
this.approx_offset.setAttribute('d', approx.c.to_svg_path());
this.type_label.textContent = `${approx.err}`;
const z = c.c;
const h = `M${z[0]} ${z[1]}L${z[2]} ${z[3]}M${z[6]} ${z[7]}L${z[4]} ${z[5]}`;
this.approx_handles.setAttribute('d', h);
*/
let seq = [];
for (let cusp of cusps) {
const co_seg = c_off.subsegment(cusp.t0, cusp.t1);
seq = seq.concat(co_seg.cubic_approx_seq(conf, cusp.sign));
}
this.approx_offset.setAttribute('d', cubic_seq_to_svg(seq));
this.approx_handles.setAttribute('d', cubic_seq_to_svg_handles(seq));
this.type_label.textContent = `subdivisions: ${seq.length}`;
}
}
new OffsetUi("s");
</script>
<p>The problem of <a href="https://en.wikipedia.org/wiki/Parallel_curve">parallel</a> or offset curves has remained challenging for a long time. Parallel curves have applications in 2D graphics (for drawing strokes and also adding weight to fonts), and also robotic path planning and manufacturing, among others. The exact offset curve of a cubic Bézier can be described (it is an analytic curve of degree 10) but it is not tractable to work with. Thus, in practice the approach is almost always to compute an approximation to the true parallel curve. A single cubic Bézier might not be a good enough approximation to the parallel curve of the source cubic Bézier, so in those cases it is subdivided into multiple Bézier segments.</p>
<p>A number of algorithms have been published, of varying quality. Many popular algorithms aren’t very accurate, yielding either visually incorrect results or excessive subdivision, depending on how carefully the error metric has been implemented. This blog post gives a practical implementation of a nearly optimal result. Essentially, it tries to find <em>the</em> cubic Bézier that’s closest to the desired curve. To this end, we take a curve-fitting approach and apply an array of numerical techniques to make it work. The result is a visibly more accurate curve even when only one Bézier is used, and a minimal number of subdivisions when a tighter tolerance is applied. In fact, we claim $O(n^6)$ scaling: if a curve is divided in half, the error of the approximation will decrease by a factor of 64. I suggested a previous approach, <a href="https://raphlinus.github.io/curves/2021/02/19/parallel-curves.html">Cleaner parallel curves with Euler spirals</a>, with $O(n^4)$ scaling, in other words only a 16-fold reduction of error.</p>
<p>Though there are quite a number of published algorithms, the need for a really good solution remains strong. Some really good reading is the <a href="https://github.com/paperjs/paper.js/issues/371">Paper.js issue</a> on adding an offset function. After much discussion and prototyping, there is still no consensus on the best approach, and the feature has not landed in Paper.js despite obvious demand. There’s also some interesting discussion of stroking in an <a href="https://github.com/googlefonts/colr-gradients-spec/issues/276">issue in the COLRv1 spec repo</a>.</p>
<h2 id="outline-of-approach">Outline of approach</h2>
<p>The fundamental concept is <em>curve fitting,</em> or finding the parameters for a cubic Bézier that most closely approximate the desired curve. We also employ a sequence of numerical techniques in support of that basic concept:</p>
<ul>
<li>Finding the cusps and subdividing the curve at the cusp points.</li>
<li>Computing area and moment of the target curve
<ul>
<li>Green’s theorem to convert double integral into a single integral</li>
<li>Gauss-Legendre quadrature for efficient numerical integration</li>
</ul>
</li>
<li>Quartic root finding to solve for cubic Béziers with desired area and moment</li>
<li>Measure error to choose best candidate and decide whether to subdivide
<ul>
<li>Cubic Bézier/ray intersection</li>
</ul>
</li>
</ul>
<p>Each of these numeric techniques has its own subtleties.</p>
<h2 id="cusp-finding">Cusp finding</h2>
<p>One of the challenges of parallel curves in general is <em>cusps.</em> These happen when the curvature of the source curve is equal to one over the offset distance. Cubic Béziers have fairly complex curvature profiles, so there can be a number of cusps - it’s easy to find examples with four, and it wouldn’t be surprising to me if there were more. By contrast, Euler spirals have simple curvature profiles, and the location of the cusp is extremely simple to determine.</p>
<p>The general equation for curvature of a parametric curve is as follows:</p>
\[\kappa = \frac{\mathbf{x}''(t) \times \mathbf{x}'(t)}{|\mathbf{x}'(t)|^3}\]
<p>The cusp happens when $\kappa d + 1 = 0$. With a bit of rewriting, we get</p>
\[(\mathbf{x}''(t) \times \mathbf{x}'(t))d + |\mathbf{x}'(t)|^3 = 0\]
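<p>As a concrete (and deliberately naive) illustration, this condition can be evaluated directly for a cubic Bézier given as four control points. The helper names below are hypothetical, and the dense sampling scan is only a sketch: it can miss closely paired cusps, which is exactly why a more careful approach is warranted.</p>

```javascript
// Hypothetical helper: evaluate the cusp function
//   f(t) = (x''(t) x x'(t)) * d + |x'(t)|^3
// for a cubic Bezier given as four {x, y} control points.
// A cusp of the offset at distance d corresponds to f(t) = 0.
function cuspFn(pts, d, t) {
  const [p0, p1, p2, p3] = pts;
  const mt = 1 - t;
  // First derivative of a cubic Bezier (a quadratic in t).
  const dx = 3 * ((p1.x - p0.x) * mt * mt + 2 * (p2.x - p1.x) * mt * t + (p3.x - p2.x) * t * t);
  const dy = 3 * ((p1.y - p0.y) * mt * mt + 2 * (p2.y - p1.y) * mt * t + (p3.y - p2.y) * t * t);
  // Second derivative (linear in t).
  const ddx = 6 * ((p2.x - 2 * p1.x + p0.x) * mt + (p3.x - 2 * p2.x + p1.x) * t);
  const ddy = 6 * ((p2.y - 2 * p1.y + p0.y) * mt + (p3.y - 2 * p2.y + p1.y) * t);
  const cross = ddx * dy - ddy * dx;
  return cross * d + Math.pow(Math.hypot(dx, dy), 3);
}

// Naive cusp scan: report subintervals where f changes sign. This
// dense sampling can miss a pair of cusps between adjacent samples,
// which is why the interval-arithmetic approach is more robust.
function cuspCandidates(pts, d, n = 256) {
  const out = [];
  let prev = cuspFn(pts, d, 0);
  for (let i = 1; i <= n; i++) {
    const t = i / n;
    const cur = cuspFn(pts, d, t);
    if (prev * cur < 0) out.push([(i - 1) / n, t]);
    prev = cur;
  }
  return out;
}
```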
<p>As with many such numerical root-finding approaches, missing a cusp is a risk. The approach <em>currently</em> used in the code in this blog post is a form of interval arithmetic: over the (t0..t1) interval, a minimum and maximum value of $|\mathbf{x}'|$ is computed, while the cross product is quadratic in t. Solving that partitions the interval into ranges where the curvature is definitely above or below the threshold for a cusp, and a (hopefully) smaller interval where it’s possible.</p>
<p>This algorithm is robust, but convergence is not super-fast - it often hits the case where it has to subdivide in half, so convergence is similar to a bisection approach for root-finding. I’m exploring another approach of computing bounding parabolas, and that seems to have cubic convergence, but is a bit more complicated and fiddly.</p>
<p>In cases where you <em>know</em> you have one simple cusp, a simple and generic root-finding method like ITP (about which more below) would be effective. But that leaves the problem of detecting when that’s the case. Robust detection of possible cusps generally also gives the locations of the cusps when iterated.</p>
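<p>For reference, here is ITP as a self-contained sketch. It is essentially the <code class="language-plaintext highlighter-rouge">solve_itp</code> helper in the code above, together with the copysign shim it relies on (JavaScript has no built-in copysign). It assumes a bracket with f(a) &lt; 0 &lt; f(b).</p>

```javascript
// JavaScript lacks Math.copysign; this shim matches the usual semantics.
function copysign(x, y) {
  return y < 0 || Object.is(y, -0) ? -Math.abs(x) : Math.abs(x);
}

// ITP root finding on [a, b], assuming f(a) < 0 < f(b), with ya = f(a)
// and yb = f(b) passed in. Adapted from the solve_itp helper above.
function solveItp(f, a, b, epsilon, n0, k1, ya, yb) {
  const n1_2 = Math.max(Math.ceil(Math.log2((b - a) / epsilon)) - 1, 0);
  const nmax = n0 + n1_2;
  let scaledEps = epsilon * Math.exp(nmax * Math.LN2);
  while (b - a > 2 * epsilon) {
    const x1_2 = 0.5 * (a + b);
    const r = scaledEps - 0.5 * (b - a);
    const xf = (yb * a - ya * b) / (yb - ya); // regula falsi estimate
    const sigma = x1_2 - xf;
    const delta = k1 * (b - a) * (b - a);
    // Project the estimate toward the midpoint to keep bisection's guarantee.
    const xt = delta <= Math.abs(x1_2 - xf) ? xf + copysign(delta, sigma) : x1_2;
    const xitp = Math.abs(xt - x1_2) <= r ? xt : x1_2 - copysign(r, sigma);
    const yitp = f(xitp);
    if (yitp > 0) {
      b = xitp;
      yb = yitp;
    } else if (yitp < 0) {
      a = xitp;
      ya = yitp;
    } else {
      return xitp;
    }
    scaledEps *= 0.5;
  }
  return 0.5 * (a + b);
}

// Example: sqrt(2) as the root of x^2 - 2 on [1, 2].
const root = solveItp(x => x * x - 2, 1, 2, 1e-12, 1, 0.2, -1, 2);
```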
<h2 id="computing-area-and-moment-of-the-target-curve">Computing area and moment of the target curve</h2>
<p>The primary input to the curve fitting algorithm is a set of parameters for the curve. Not control points of a Bézier, but other measurements of the curve. The position of the endpoints and the tangents can be determined directly, which, just counting parameters, leaves two free. Those are the area and x-moment. These are generally described as integrals. For an arbitrary parametric curve (a family which easily includes offsets of Béziers), Green’s theorem is a powerful and efficient technique for approximating these integrals.</p>
<p>For area, the specific instance of Green’s theorem we’re looking for is this. Let the curve be defined as x(t) and y(t), where t goes from 0 to 1. Let D be the region enclosed by the curve. If the curve is closed, then we have this relation:</p>
\[\iint_D dx \,dy = \int_0^1 y(t)\, x'(t)\, dt\]
<p>I won’t go into the details here, but all this still works even when the curve is open (one way to square up the accounting is to add the return path of the chord, from the end point back to the start), and when the area contains regions of both positive and negative signs, which can be the case for S-shaped curves. The x moment is also very similar and just involves an additional $x$ term:</p>
\[\iint_D x \, dx \,dy = \int_0^1 x(t)\, y(t)\, x'(t)\, dt\]
<p>Especially given that the function being integrated is (mostly) smooth, the best way to compute the approximate integral is <a href="https://en.wikipedia.org/wiki/Gauss%E2%80%93Legendre_quadrature">Gauss-Legendre quadrature</a>, which has an extremely simple implementation: it’s just the dot product between a vector of weights and a vector of the function sampled at certain points, where the weights and points are carefully chosen to minimize error; in particular, an $n$-point rule gives zero error when the function being integrated is a polynomial of degree up to $2n - 1$. The JavaScript code on this page just uses a single quadrature of order 32, but a more sophisticated approach (as is used for arc length computation) would be to first estimate the error and then choose a number of samples based on that.</p>
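<p>To make the “dot product” description concrete, here is a minimal sketch using the 3-point Gauss-Legendre rule, in the same interleaved weight/node layout as the 32-point table the code above uses. A 3-point rule is exact for polynomials up to degree 5, so it integrates this example with no error beyond rounding.</p>

```javascript
// 3-point Gauss-Legendre rule on [-1, 1], interleaved as
// (weight, node) pairs, the same layout as GAUSS_LEGENDRE_COEFFS_32.
const GL3 = [5 / 9, -Math.sqrt(0.6), 8 / 9, 0, 5 / 9, Math.sqrt(0.6)];

// Integrate f over [0, 1]: remap nodes and scale the result by 1/2.
function quad01(f) {
  let sum = 0;
  for (let i = 0; i < GL3.length; i += 2) {
    const t = 0.5 * (1 + GL3[i + 1]);
    sum += GL3[i] * f(t);
  }
  return 0.5 * sum;
}

// Signed area via Green's theorem: the integral of y(t) x'(t) dt.
// Here x = t, y = t^2 (a parabola), so x' = 1, the integrand is t^2,
// and the area is exactly 1/3.
const area = quad01(t => t * t);
```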
<p>Note that the area and moments of a cubic Bézier curve can be efficiently computed analytically and don’t need an approximate numerical technique. Adding in the offset term is numerically similar to an arc length computation, bringing it out of the range where analytical techniques are effective, but fortunately similar numerical techniques as for computing arc length are effective.</p>
<h3 id="refinement-of-curve-fitting-approach">Refinement of curve fitting approach</h3>
<p>The basic approach to curve fitting was described in <a href="https://raphlinus.github.io/curves/2021/03/11/bezier-fitting.html">Fitting cubic Bézier curves</a>. Those ideas are good, but there were some rough edges to be filled in and other refinements.</p>
<p>To recap, the goal is to find the closest Bézier, in the space of all cubic Béziers, to the desired curve (in this case the parallel curve of a source Bézier, but the curve fitting approach is general). That’s a large space to search, but we can immediately nail down some of the parameters. The endpoints should definitely be fixed, and we’ll also set the tangent angles at the endpoints to match the desired curve.</p>
<p>One loose end was the solving technique. My prototype code used numerical methods, but I’ve now settled on root finding of the quartic equation. A major reason for that is that I’ve found that the quartic solver in the <a href="https://cristiano-de-michele.netlify.app/publication/orellana-2020/">Orellana and De Michele</a> paper works well - it is fast, robust, and stable. The JavaScript code on this page uses a fairly direct implementation of that (which I expect may be useful for other applications - all the code on this blog is licensed under Apache 2, so feel free to adapt it within the terms of that license).</p>
<p>Another loose end was the treatment of “near misses.” Those happen when the function comes close to zero but doesn’t quite cross it. In terms of roots of a polynomial, those are a conjugate pair of complex roots, and I take the real part of that as a candidate. It would certainly be possible to express this logic by having the quartic solver output complex roots as well as real ones, but I found an effective shortcut: the algorithm actually factors the original quartic equation into two quadratics, one of which always has real roots and the other some of the time, and finding the real part of the roots of a quadratic is trivial (it’s just $-b/(2a)$).</p>
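<p>A minimal sketch of that shortcut, with hypothetical names: given one monic quadratic factor $x^2 + c_1 x + c_0$ of the quartic, report the real roots when the discriminant allows, and otherwise the shared real part of the conjugate pair.</p>

```javascript
// Given a monic quadratic x^2 + c1*x + c0 (one factor of the quartic),
// return its real roots, or, for a complex conjugate pair, the shared
// real part. With a = 1, the real part -b/(2a) is just -c1/2.
function rootsOrRealPart(c1, c0) {
  const disc = c1 * c1 - 4 * c0;
  if (disc >= 0) {
    const sq = Math.sqrt(disc);
    return [0.5 * (-c1 + sq), 0.5 * (-c1 - sq)];
  }
  return [-0.5 * c1]; // "near miss" candidate
}
```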
<p>Recently, Cem Yuksel has proposed a variation of Newton-style <a href="http://www.cemyuksel.com/research/polynomials/">polynomial solving</a>. It’s likely this could be used, but there were a few reasons I went with the more analytic approach. For one, I want multiple roots and this works best when only one is desired. Second, it’s hard to bound a priori the interval to search for roots. Third, it’s not easy to get the complex roots (if you did want to do this, the best route is probably deflation). Lastly, the accuracy numbers don’t seem as good (the Orellana and De Michele paper presents the results of very careful testing), and in empirical testing I have found that accuracy in root finding is a real problem that can affect the quality of the final results. A Rust implementation of the Orellana and De Michele technique clocks in at 390ns on an M1 Max, which certainly makes it competitive with the fastest techniques out there.</p>
<p>The last loose end was the treatment of near-zero slightly negative arm lengths. These are roots of the polynomial but are not acceptable candidate curves, as the tangent would end up pointing the wrong way. My original thought was to clamp the relevant length to zero (on the basis that it is an acceptable curve that is “nearby” the numerical solution), but that also doesn’t give ideal results. In particular, if you set one length to zero and set the other one based on exact signed area, the tangent at the zero-length side might be wrong (enough to be visually objectionable). After some experimentation, I’ve decided to set the other control point to be the intersection of the tangents, which gets tangents right but possibly results in an error in area, depending on the exact parameters. The general approach is to throw these as candidates into the mix, and let the error measurement sort it out.</p>
<h3 id="error-measurement">Error measurement</h3>
<p>A significant amount of total time spent in the algorithm is measuring the distance between the exact curve and the cubic approximation, both to decide when to subdivide and also to choose between multiple candidates from the Bézier fitting. I implemented the technique from Tiller and Hanson and found it to work well. They sample the exact curve at a sequence of points, then for each of those points project that point onto the approximation along the normal. That is equivalent to computing the intersection of a ray and a cubic Bézier. The maximum distance between the projected and true point is the error. This is a fairly good approximation to the <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance">Fréchet distance</a> but significantly cheaper to compute.</p>
<p>Computing the intersection of a ray and a cubic Bézier is equivalent to finding the root of a cubic polynomial, a challenging numerical problem in its own right. In the course of working on this, I found that the cubic solver in kurbo would sometimes report inaccurate results (especially when the coefficient on the $x^3$ term was near-zero, which can easily happen when cubic Bézier segments are near raised quadratics), and so implemented a <a href="https://github.com/linebender/kurbo/pull/224">better cubic solver</a> based on a blog post on <a href="https://momentsingraphics.de/CubicRoots.html">cubics by momentsingraphics</a>. That’s still not perfect, and there is more work to be done to arrive at a gold-plated cubic solver. The Yuksel <a href="http://www.cemyuksel.com/research/polynomials/">polynomial solving</a> approach might be a good fit for this, especially as you only care about results for t strictly within the (0..1) range. It might also be worth pointing out that the fma instruction used in the Rust implementation is not available in JavaScript, so the accuracy of the solver here won’t be quite as good.</p>
<p>The error metric is a critical component of a complete offset curve algorithm. It accounts for a good part of the total CPU time, and also must be accurate. If it underestimates true error, it risks letting inaccurate results slip through. If it overestimates error, it creates excessive subdivision. Incidentally, I suspect that the error measurement technique in the Elber, Lee and Kim paper (cited below) may be flawed; it seems like it may overestimate error in the case where the two curves being compared differ in parametrization, which will happen commonly with offset problems, particularly near cusps. The Tiller-Hanson technique is largely insensitive to parametrization (though perhaps more care should be taken to ensure that the sample points are actually evenly spaced).</p>
<h3 id="subdivision">Subdivision</h3>
<p>Right now the subdivision approach is quite simple: if none of the candidate cubic Béziers meets the error bound, the curve is subdivided at t = 0.5 and each half is fit. The scaling is $O(n^6)$, so in general each subdivision reduces the error by a factor of 64.</p>
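<p>The recursion is simple enough to sketch in a few lines. Here <code>fit_single</code> is a stub whose error model just mimics the $O(n^6)$ scaling; the real fitting and error measurement machinery would slot in behind the same interface.</p>

```rust
// Skeleton of "subdivide in half": try to fit the whole span; if the best
// candidate misses the tolerance, split at the parameter midpoint and recurse.

struct Segment { t0: f64, t1: f64 }

/// Error of the best single-cubic fit over (t0..t1). Stub: the error of a
/// span is modeled as its length^6, matching the O(n^6) scaling.
fn fit_single(t0: f64, t1: f64) -> f64 {
    (t1 - t0).powi(6)
}

fn fit_recursive(t0: f64, t1: f64, tol: f64, out: &mut Vec<Segment>) {
    let err = fit_single(t0, t1);
    if err <= tol {
        out.push(Segment { t0, t1 });
    } else {
        let tm = 0.5 * (t0 + t1);
        fit_recursive(t0, tm, tol, out);
        fit_recursive(tm, t1, tol, out);
    }
}
```

<p>With this error model, a tolerance of 0.02 needs two segments (halving drops the error to 0.015625), and 0.01 needs four.</p>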
<p>If generation of an absolute minimum number of output segments is the goal, then a smarter approach to choosing subdivisions would be in order. For absolutely optimal results, in general what you want to do is figure out the minimum number of subdivisions, then adjust the subdivision points so the error of all segments is equal. This technique is described in section 9.6.3 of my <a href="https://levien.com/phd/thesis.pdf">thesis</a>. In the limit, it can be expected to reduce the number of subdivisions by a factor of 1.5 compared with “subdivide in half,” but that is not a significant improvement when most curves can be rendered with one or two cubic segments.</p>
<h2 id="evaluation">Evaluation</h2>
<p>Somebody evaluating this work for use in production would care about several factors: accuracy of result, robustness, and performance. The interactive demo on this page speaks for itself: the results are accurate, the performance is quite good for interactive use, and it is robust (though I make no claims it handles all adversarial inputs correctly; that always tends to require extra work).</p>
<p>In terms of accuracy of result, this work is a dramatic advance over anything in the literature. I’ve implemented and compared it against two other techniques that are widely cited as reasonable approaches to this problem: <a href="https://math.stackexchange.com/questions/465782/control-points-of-offset-bezier-curve">Tiller-Hanson</a> and the “shape control” approach of Yang and Huang. For generating a single segment, it can be considerably more accurate than either.</p>
<p><img src="/assets/parallel-compare.png" width="870" alt="comparison against other approaches" /></p>
<p>In addition to the accuracy for generating a single line segment, it is interesting to compare the scaling as the number of subdivisions increases, or as the error tolerance decreases. These tend to follow a power law. For this technique, it is $O(n^6)$, meaning that subdividing a curve in half reduces the error by a factor of 64. For the shape control approach, it is $O(n^5)$, and for Tiller-Hanson it is $O(n^2)$. That last is a surprisingly poor result, suggesting that it is only a constant factor better than subdividing the curves into lines.</p>
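<p>A back-of-envelope consequence of these power laws: if a single segment has error $e_0$ and the method scales as $O(n^k)$, then $n$ segments give error roughly $e_0/n^k$, so hitting tolerance $\tau$ needs about $(e_0/\tau)^{1/k}$ segments. The numbers below are illustrative arithmetic only, not measurements.</p>

```rust
/// Approximate number of segments needed to reach `tol`, given the error
/// `e0` of a single-segment fit and power-law exponent `k`.
fn segments_needed(e0: f64, tol: f64, k: f64) -> u64 {
    (e0 / tol).powf(1.0 / k).ceil() as u64
}
```

<p>For example, going from $e_0 = 10^{-3}$ to a tolerance of $10^{-8}$ takes about 7 segments at $O(n^6)$ but about 317 at $O(n^2)$, which is the practical meaning of the gap in the chart below.</p>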
<p><img src="/assets/parallel-scaling.svg" width="683" alt="chart showing scaling behavior" /></p>
<p>The shape control technique has good scaling, but stability issues when the tangents are nearly parallel. That can happen for an S-shaped curve, and also for a U with nearly 180 degrees of arc.</p>
<p>The Tiller-Hanson technique is geometrically intuitive; it offsets each edge of the control polygon by the offset amount, as illustrated in the diagram below. It doesn’t have the stability issues with nearly-parallel tangents and can produce better results for those “S” curves, but the scaling is much worse.</p>
<p><img src="/assets/parallel-tiller-hanson.png" width="560" height="250" alt="diagram showing Tiller-Hanson technique" /></p>
<p>Regarding performance, I have preliminary numbers from the JavaScript implementation, about 12µs per curve segment generated on an M1 Max running Chrome. I am quite happy with this result, and of course expect the Rust implementation to be even faster when it’s done. There are also significant downstream performance improvements from generating highly accurate results; every cubic segment you generate has some cost to process and render, so the fewer of those, the better.</p>
<p>I haven’t implemented all the techniques in the Elber, Lee and Kim paper, but it is possible to draw some tentative conclusions from the literature. I expect the Klass technique (and its numerical refinement by Sakai and Suenaga) to have good scaling but relatively poor accuracy for a single segment. The Klass technique is also documented to have poor numerical stability, thanks in part to its reliance on Newton solving techniques. The Hoschek and related (least-squares) approaches will likely produce good results but are quite slow (the Yang and Huang paper reports an eye-popping 49s for calculating a simple case with .001 tolerance, of course on older hardware).</p>
<p>The Euler spiral technique in my previous blog post will in general produce considerably more subdivision (with $O(n^4)$ scaling), but perhaps it would be premature to write it off completely. Once the curve is in piecewise Euler spiral form, a result within the given error bounds can be computed directly, with no need to explicitly evaluate an error metric. In addition, the cusps are located robustly with trivial calculation. That said, getting a curve <em>into</em> piecewise Euler spiral form is still challenging, and my prototype code uses a rather expensive error metric to achieve that.</p>
<h2 id="discussion">Discussion</h2>
<p>This post presents a significantly better solution to the parallel curve problem than the current state of the art. It is accurate, robust, and fast. It should be suitable to implement in interactive vector graphics applications, font compilation pipelines, and other contexts.</p>
<p>While parallel curve is an important application, the curve fitting technique is quite general. It can be adapted to generalized strokes, for example where the stroke width is variable, path simplification, distortions and other transforms, conversion from other curve representations, accurate plotting of functions, and I’m sure there are other applications. Basically the main thing that’s required is the ability to evaluate area and moment of the source curve, and the ability to evaluate the distance to that curve (which can be done readily enough by sampling a series of points with their arc lengths).</p>
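<p>The area requirement, at least, is straightforward to sketch: the signed area of a parametric segment is a line integral by Green’s theorem, $A = \tfrac{1}{2}\int (x\,y' - y\,x')\,dt$, which can be evaluated by numerical quadrature given the curve and its derivative. Composite Simpson’s rule below stands in for whatever quadrature a real implementation would use; moments can be computed the same way with a different integrand.</p>

```rust
/// Composite Simpson's rule over [0, 1]; n must be even.
fn simpson(f: impl Fn(f64) -> f64, n: usize) -> f64 {
    let h = 1.0 / n as f64;
    let mut sum = f(0.0) + f(1.0);
    for i in 1..n {
        let w = if i % 2 == 1 { 4.0 } else { 2.0 };
        sum += w * f(i as f64 * h);
    }
    sum * h / 3.0
}

/// Signed area contribution of a parametric segment via Green's theorem:
/// A = 1/2 ∫ (x y' - y x') dt.
fn signed_area(
    x: impl Fn(f64) -> f64, y: impl Fn(f64) -> f64,
    dx: impl Fn(f64) -> f64, dy: impl Fn(f64) -> f64,
) -> f64 {
    simpson(|t| 0.5 * (x(t) * dy(t) - y(t) * dx(t)), 64)
}
```

<p>A quick sanity check: a unit circle traversed once gives area $\pi$.</p>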
<p>This work also provides a bit of insight into the nature of cubic Bézier curves. The $O(n^6)$ scaling provides quantitative support to the idea that cubic Bézier curves are extremely expressive; with skillful placement of the control points, they can extremely accurately approximate a wide variety of curves. Parallel curves are challenging for a variety of reasons, including cusps and sudden curvature variations. That said, placing the control points well does require skill, as geometrically intuitive but unoptimized approaches (such as Tiller-Hanson) perform poorly.</p>
<p>There’s clearly more work that could be done to make the evaluation more rigorous, including more optimization of the code. I believe this result would make a good paper, but my bandwidth for writing papers is limited right now. I would be more than open to collaboration, and invite interested people to get in touch.</p>
<p>Thanks to Linus Romer for helpful discussion and refinement of the polynomial equations regarding quartic solving of the core curve fitting algorithm.</p>
<p>Discuss on <a href="https://news.ycombinator.com/item?id=32784491">Hacker News</a>.</p>
<h2 id="references">References</h2>
<p>Here is a bibliography of some relevant academic papers on the topic.</p>
<ul>
<li><a href="https://www.sciencedirect.com/science/article/abs/pii/0010448583900192">An offset spline approximation for plane cubic splines</a>, Klass, 1983</li>
<li><a href="https://ieeexplore.ieee.org/iel5/38/4055906/04055919">Offsets of Two-Dimensional Profiles</a>, Tiller and Hanson, 1984</li>
<li><a href="https://www.sciencedirect.com/science/article/abs/pii/0167839687900021">High Accuracy Geometric Hermite Interpolation</a>, de Boor, Höllig, Sabin, 1987 (<a href="https://minds.wisconsin.edu/bitstream/handle/1793/58822/TR692.pdf">PDF cache</a>)</li>
<li><a href="https://www.sciencedirect.com/science/article/abs/pii/0010448588900061">Optimal approximate conversion of spline curves and spline approximation of offset curves</a>, Hoschek and Wissel, 1988 (<a href="http://www.norbert-wissel.de/Diplom.pdf">PDF cache</a>)</li>
<li><a href="https://link.springer.com/chapter/10.1007/978-4-431-68456-5_17">A New Shape Control and Classification for Cubic Bézier Curves</a>, Yang and Huang, 1993 (<a href="https://github.com/paperjs/paper.js/files/752955/A.New.Shape.Control.and.Classification.for.Cubic.Bezier.Curves.pdf">PDF cache</a>)</li>
<li><a href="https://ieeexplore.ieee.org/document/586019/">Comparing offset curve approximation methods</a>, Elber, Lee, Kim, 1997 (<a href="http://3map.snu.ac.kr/mskim/ftp/comparing.pdf">PDF cache</a>)</li>
<li><a href="https://www.tandfonline.com/doi/abs/10.1080/00207160108805122">Cubic spline approximation of offset curves of planar cubic splines</a>, Sakai and Suenaga, 2001 (<a href="https://www.kurims.kyoto-u.ac.jp/~kyodo/kokyuroku/contents/pdf/1198-30.pdf">PDF cache</a>)</li>
<li><a href="https://dl.acm.org/doi/10.1145/3386241">Boosting Efficiency in Solving Quartic Equations with No Compromise in Accuracy</a>, Orellana and De Michele, 2020 (<a href="https://cristiano-de-michele.netlify.app/publication/orellana-2020/">PDF cache</a>)</li>
<li><a href="https://dl.acm.org/doi/10.1145/3543865">High-Performance Polynomial Root Finding for Graphics</a>, Yuksel, 2022 (<a href="http://www.cemyuksel.com/research/polynomials/polynomial_roots_hpg2022.pdf">PDF cache</a>)</li>
</ul>Advice for the next dozen Rust GUIs2022-07-15T17:53:42+00:002022-07-15T17:53:42+00:00https://raphlinus.github.io/rust/gui/2022/07/15/next-dozen-guis<p>A few times a week, someone asks on the <a href="https://discord.com/channels/273534239310479360/441714251359322144">#gui-and-ui channel</a> on the Rust Discord, “what is the best UI toolkit for my application?” Unfortunately there is still no clear answer to this question. Generally the top contenders are egui, Iced, and Druid, with <a href="https://slint-ui.com/">Slint</a> looking promising as well, but web-based approaches such as <a href="https://tauri.app/">Tauri</a> are also gaining some momentum, and of course there’s always the temptation to just build a new one. And every couple of months or so, a post appears with a new GUI toolkit.</p>
<p>This post is something of a sequel to <a href="https://raphlinus.github.io/rust/druid/2019/10/31/rust-2020.html">Rust 2020: GUI and community</a>. I hope to offer a clear-eyed survey of the current state of affairs, and suggestions for how to improve it. It also includes some lessons so far from Druid.</p>
<p>The motivations for building GUI in Rust remain strong. While Electron continues to gain momentum, especially for desktop use cases, there is considerable desire for a less resource-hungry alternative. That said, a consensus has not emerged on what that alternative should look like. In addition, unfortunately there is fragmentation at the level of infrastructure as well.</p>
<p>Fragmentation is not entirely a bad thing. To some extent, specialization can be a good thing, resulting in solutions more adapted to the problem at hand, rather than a one-size-fits-all approach. More importantly, the diversity of UI approaches is a rich ground for experimentation and exploration, as there are many aspects to GUI in Rust where we still don’t know the best approach.</p>
<h2 id="a-large-tradeoff-space">A large tradeoff space</h2>
<p>One of the great truths to understand about GUI is that there are few obviously correct ways to do things, and many, many tradeoffs. At present, this tradeoff space is very <em>sensitive,</em> in that small differences in requirements and priorities may end up with considerably different implementation choices. I believe this is a main factor driving the fragmentation of Rust GUI. Many of the tradeoffs have to do with the extent to use platform capabilities for various things (of which text layout and compositing are perhaps the most interesting), as opposed to a cross-platform implementation of those capabilities, layered on top of platform abstractions.</p>
<h3 id="a-small-rant-about-native">A small rant about native</h3>
<p>You’ll see the word “native” used quite a bit in discussions about UI, but I think that’s more confusing than illuminating, for a number of reasons.</p>
<p>In the days of Windows XP, “native” on Windows would have had quite a specific and precise meaning: it would be the use of <a href="https://docs.microsoft.com/en-us/windows/win32/controls/window-controls">user32 controls</a>, which for the most part created an HWND per control and used GDI+ for drawing. Thanks to the strong compatibility guarantees of Windows, such code will still run, but it will look ancient and out of place compared to building on other technology stacks, also considered “native.” In the Windows 7 era, that would be <a href="https://devblogs.microsoft.com/oldnewthing/20050211-00/?p=36473">windowless controls</a> drawn with Direct2D (this was the technology used for Microsoft Office, using the internal DirectUI toolkit, which was not available to third party developers, though there were other windowless toolkits such as WPF). Starting with Windows 8 and the (relatively unsuccessful) <a href="https://en.wikipedia.org/wiki/Metro_(design_language)">Metro design language</a>, many of the elements of the UI came together in the compositor rather than being directly drawn by the app. As of Windows 10, Microsoft started pushing <a href="https://en.wikipedia.org/wiki/Universal_Windows_Platform">Universal Windows Platform</a> as the preferred “native” solution, but that also didn’t catch on. As of Windows 11, there is a new <a href="https://docs.microsoft.com/en-us/windows/apps/design/signature-experiences/design-principles">Fluent design language</a>, natively supported only in WinUI 3 (which in turn is an evolution of UWP). “Native” on Windows can refer to all of these technology stacks.</p>
<p>On Windows in particular, there’s also quite a bit of confusion about UI functionality that’s directly provided by the platform (as is the case for user32 controls and much of UWP, including the Windows.UI namespace), vs lower level capabilities (Direct2D, DirectWrite, and DirectComposition, for example) provided to a library which mostly lives in userspace. The new WinUI (and Windows App SDK more generally) is something of a hybrid, with this distinction largely hidden from the developer. An advantage of this hybrid approach is that OS version is abstracted to a large extent; for example, even though UWP proper is Windows 10 and up only, it is possible to use the <a href="https://platform.uno/">Uno platform</a> to deploy WinUI apps on other systems, though it is a very good question to what extent that kind of deployment can be considered “native.”</p>
<p>On macOS the situation is not quite as chaotic and fragmented, but there is an analogous evolution from Cocoa (AppKit) to SwiftUI; going forward, it’s likely that new capabilities and evolutions of the design language will be provided for the latter but not the former. There was a similar deprecation of <a href="https://en.wikipedia.org/wiki/Carbon_(API)">Carbon</a> in favor of Cocoa, many years ago. (One of the main philosophical differences is that Windows maintains backwards compatibility over many generations, while Apple actually deprecates older technology.)</p>
<p>The idea of abstracting and wrapping platform native GUI has been around a long time. Classic implementations include wxWidgets and Java AWT, while a more modern spin is React Native. It is very difficult to provide high quality experiences with this approach, largely because of the impedance mismatch between platforms, and few successful applications have been shipped this way.</p>
<p>The meaning of “native” varies from platform to platform. On macOS, it would be extremely jarring and strange for an app not to use the system menubar, for example (though it would be acceptable for a game). On Windows, there’s more diversity in the way apps draw menus, and of course Linux is fragmented by its nature.</p>
<p>Instead of trying to decide whether a GUI toolkit is native or not, I recommend asking a set of more precise questions:</p>
<ul>
<li>What is the binary size for a simple application?</li>
<li>What is the startup time for a simple application?</li>
<li>Does text look like the platform native text?</li>
<li>To what extent does the app support preferences set at the system level?</li>
<li>What subset of expected text control functionality is provided?
<ul>
<li>Complex text rendering including BiDi</li>
<li>Support for keyboard layouts including “dead key” for alphabetic scripts</li>
<li>Keyboard shortcuts according to platform human interface guidelines</li>
<li>Input Method Editor</li>
<li>Color emoji rendering and use of platform emoji picker</li>
<li>Copy-paste (clipboard)</li>
<li>Drag and drop</li>
<li>Spelling and grammar correction</li>
<li>Accessibility (including reading the text aloud)</li>
</ul>
</li>
</ul>
<p>If a toolkit does well on all these axes, then I don’t think it much matters whether it’s built with “native” technology; that’s basically internal details. That said, it’s also very hard to hit all these points if there is a huge stack of non-native abstractions in the way.</p>
<h2 id="on-winit">On winit</h2>
<p>All GUIs (and all games) need a way to create windows, and wire up interactions with that window - primarily drawing pixels into the window and dealing with user inputs such as mouse and keyboard, but potentially a much larger range. The implementation is platform specific and involves many messy details. There is one very popular crate for this function – <a href="https://github.com/rust-windowing/winit">winit</a> – but I don’t think there’s a consensus, at least not yet, so there are quite a few other alternatives, including the <a href="https://github.com/tauri-apps/tao">tao</a> fork of winit used by Tauri, druid-shell, baseview (which is primarily used in audio applications because it supports the VST plug-in case), and handrolled approaches such as the one used by <a href="https://github.com/makepad/makepad">makepad</a>.</p>
<p>I would describe the tension this way (perhaps not everyone will agree with me). The <em>stated</em> scope of winit is to create a window and leave the details of what happens inside that window to the application. For some interactions (especially GPU rendering, which is well developed in Rust space), that split works well, but for other interactions it is not nearly as clean. In practice, I think winit has evolved to become quite satisfactory for game use cases, but less so for GUI. Big chunks of functionality, such as access to native menus, are missing (the main motivation behind the tao fork, and the reason <a href="https://github.com/iced-rs/iced/pull/1047">iced doesn’t support system menus</a>), and keyboard support is <a href="https://github.com/rust-windowing/winit/issues/753">persistently below</a> what’s needed for high quality text input in a GUI.</p>
<p>I think resolving some of this fragmentation is possible and would help move the broader ecosystem forward. For the time being, the Druid family of projects will continue developing druid-shell, but is open to collaboration with winit. One way to frame this is that the extra capabilities of druid-shell serve as a set of requirements, as well as guidance and experience how to implement them well.</p>
<p>In the meantime, which windowing library to adopt is a tough choice, and I wouldn’t be surprised to see yet another one pop up if the application has specialized requirements. Consider the situation in C++ world: for games, both <a href="https://www.glfw.org/">GLFW</a> and <a href="https://www.libsdl.org/">SDL</a> are good choices, but both primarily for games. Pretty much every serious UI toolkit has its own platform abstraction layer; while it would be possible to use something like SDL for more general purpose GUI, it wouldn’t be a great fit.</p>
<p>Advice: think through the alternatives when considering a windowing library. After adopting one, learn from the others to see how yours might be improved. Plan on dedicating some time and energy into improving this important bit of infrastructure.</p>
<h3 id="tradeoff-use-of-system-compositor">Tradeoff: use of system compositor</h3>
<p>One of the more difficult, and I think underappreciated tradeoffs is the extent to which the GUI toolkit relies on the system compositor. See <a href="https://raphlinus.github.io/ui/graphics/2020/09/13/compositor-is-evil.html">The compositor is evil</a> for more background on this, but here I’ll revisit the issue from the point of view of a GUI toolkit trying to navigate the existing space.</p>
<p>All modern GUI platforms have a compositor which composites the (possibly alpha-transparent) windows from the various applications running on the system. As of Windows 8, Wayland, and Mac OS 10.5, the platform exposes richer access to the compositor, so the application can provide a tree of composition surfaces, and update attributes such as position and opacity (generally providing an additional animation API so the compositor can animate transitions without the application having to provide new values every frame).</p>
<p>If the GUI can be decomposed into the schema supported by the compositor, there are significant advantages. For one, it is decidedly the most power-efficient method to accomplish effects such as scrolling. The system pays the cost of the compositor anyway (in most cases), so any effects that can be done by the compositor (including scrolling) are “for free.”</p>
<p>As an illustration of how a Rust UI app may depend on the compositor, see the <a href="https://github.com/robmikh/minesweeper-rs">Minesweeper using windows-rs</a> demo. Essentially all presentation is done using the compositor rather than a drawing API (this is why the numbers are drawn using dots rather than fonts). This sample application depends on the <a href="https://docs.microsoft.com/en-us/uwp/api/windows.ui?view=winrt-22621">Windows.UI</a> namespace to be provided by the operating system, so will only run on Windows 10 (build 1803 or later).</p>
<p>All that said, there are significant <em>disadvantages</em> to the compositor as well. One is cross-platform support and compatibility. There is currently no good cross-platform abstraction for the compositor (<a href="https://github.com/pcwalton/planeshift">planeshift</a> was an attempt, but is abandoned). Further, older systems (Windows 7 and X11) cannot rely on the compositor, so there has to be a compatibility path, generally with degraded behavior.</p>
<p>There are other more subtle drawbacks. One is a lowest-common-denominator approach, emphasizing visual effects supported by the compositor, especially cross-platform. As just one example, translation and alpha fading is well-supported, but scaling of bitmap surfaces comes with visual degradation, compared with re-rendering the vector original. There’s also the issue of additional RAM usage for all the intermediate texture layers.</p>
<p>Perhaps the biggest motivation to use the compositor extensively is stitching together diverse visual sources, particularly video, 3D, and various UI embeddings including web and “native” controls. If you want a video playback window to scroll seamlessly and other UI elements to blend with it, there is essentially no other game in town. These embeddings were declared as out of scope for Druid, but people request them often.</p>
<p>Building a proper cross-platform infrastructure for the compositor is a huge and somewhat thankless task. The surface area of these interfaces is large, I’m sure there are lots of annoying differences between the major platforms, and no doubt there will need to be a significant amount of compatibility engineering to work well on older platforms. Browsers have invested in this work (in the case of Safari without the encumbrance of needing to be cross-platform), and this is actually one good reason to use Web technology stacks.</p>
<p>Advice: new UI toolkits should figure out their relationship to the system compositor. If the goal is to provide a truly smooth, native integration of content such as video playback, then they must invest in the underlying mechanisms, much as browsers have.</p>
<h3 id="tradeoff-platform-text-rendering">Tradeoff: platform text rendering</h3>
<p>One of the most difficult aspects of building UI from scratch is getting text right. There are a lot of details to text rendering, but there’s also the question of matching the system appearance and being able to access the system font fallback chain. The latter is especially important for non-Latin scripts, but also emoji. Unfortunately, operating systems generally don’t have good mechanisms for enumerating or efficiently querying the system fonts. Either you use the built-in text layout capabilities (which means having to build a lowest common denominator abstraction on top of them), or you end up replicating all the work, and finding heuristics and hacks to access the system fonts without running into either correctness or efficiency problems.</p>
<p>There’s so much work involved in making a fully functional text input box that it is something of a litmus test for how far along a UI toolkit has gotten. Rendering is only part of it, but there’s also IME (including the emoji picker), copy-paste (potentially including rich text), access to system text services such as spelling correction, and one of the larger and richer subsurfaces for integrating accessibility.</p>
<p>Again, browsers have invested a massive amount of work into getting this right, and it’s no one simple trick. Druid, by comparison, <em>does</em> use the system text layout capabilities, but we’re seeing the drawbacks (it tends to be slow, and hammering out all the inconsistencies between platforms is annoying to say the least), so as we go forward we’ll probably do more of that ourselves.</p>
<p>Over the longer term, I’d love to have Rust ecosystem infrastructure crates for handling text well, but it’s an uphill battle. Just how to design the interface abstraction boundaries is a hard problem, and it’s likely that even if a good crate was published, there’d be resistance to adoption because it wouldn’t be trivial to integrate. There are thorny issues such as rich text representation, and how the text layout crate integrates with 2D drawing.</p>
<p>Advice: figure out a strategy to get text right. It’s not feasible to do that in the short term, but longer term it is a requirement for “real” UI. Potentially this is an area for the UI toolkits to join forces as well.</p>
<h2 id="on-architecture">On architecture</h2>
<p>One constant I’ve found is that the developer-facing architecture of a UI toolkit needs to evolve. We don’t have a One True architecture yet, and in particular designs made in other languages don’t adapt well to Rust.</p>
<p>Druid itself has had three major architectures: an initial attempt at applying ECS patterns to UI, the current architecture relying heavily on lenses, and the <a href="https://raphlinus.github.io/rust/gui/2022/05/07/ui-architecture.html">Xilem</a> proposal for a future architecture. In between were two explorations that didn’t pan out. Crochet was an attempt to provide an immediate mode API to applications on top of a retained mode implementation, and lasagna was an attempt to decouple the reactive architecture from the underlying widget tree implementation.</p>
<p>There are a number of triggers that might motivate large scale architectural changes in GUI toolkits. Among them, support for multi-window, accessibility, virtualized scrolling (and efficient large containers in general), async.</p>
<h3 id="the-crochet-experiment">The crochet experiment</h3>
<p>Now is a good time to review an architectural experiment that ultimately we decided not to pursue. The <a href="https://raphlinus.github.io/rust/druid/2020/09/25/principled-reactive-ui.html">crochet prototype</a> was an attempt to emulate immediate mode GUI on top of a retained mode widget tree. The theory was that immediate mode is easier for programmers, while retained mode has implementation advantages including making it easier to do rich layouts. There were other goals, including facilitating language bindings (for languages such as Python) and also better async integration. Language bindings were a pain point in the existing Druid architecture.</p>
<p>Ultimately I think it would be possible to build UI with this architecture, but there were a number of pain points, so I don’t believe it would be a good experience overall. One of the inherent problems of an immediate mode API is what I call “state tearing.” Because updates to app state (from processing events returned by calls to functions representing widgets) are interleaved with rendering, the rendering of any given frame may contain a mix of old and new state. For some applications, when continuously updating at a high frame rate, an extra frame of latency may not be a serious issue, but I consider it a flaw. I had some ideas for how to address this, but it basically involves running the app logic twice.</p>
<p>There were other ergonomic paper cuts. Because Rust lacks named and optional parameters in function calls, it is painful to add optional modifiers to existing widgets. Architectures based on simple value types for views (as is the case in the greater React family, including Xilem) can just use variations on the fluent pattern: method calls on those views to either wrap them or set optional parameters.</p>
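<p>The fluent pattern looks like this in practice. The <code>Button</code> view here is purely illustrative, not the Druid or Xilem API: each modifier consumes the view value and returns it, so optional parameters become chained method calls.</p>

```rust
// Illustrative view value type with fluent modifiers, standing in for
// the kind of simple value types used by React-family architectures.
struct Button {
    label: String,
    enabled: bool,
    padding: f64,
}

impl Button {
    fn new(label: impl Into<String>) -> Self {
        Button { label: label.into(), enabled: true, padding: 0.0 }
    }

    // Each modifier consumes and returns the view, so calls chain.
    fn disabled(mut self) -> Self {
        self.enabled = false;
        self
    }

    fn padding(mut self, amount: f64) -> Self {
        self.padding = amount;
        self
    }
}
```

<p>Usage then reads like <code>Button::new("OK").padding(8.0).disabled()</code>, which is about as close as Rust gets to optional named parameters without macros.</p>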
<p>Another annoyingly tricky problem was ensuring that begin-container and end-container calls were properly nested. We experimented with a bunch of different ways to try to enforce this nesting at compile time, but none were fully satisfying.</p>
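<p>For contrast, a closure-based wrapper can at least guarantee matched calls at runtime, though it doesn’t give the compile-time enforcement we were after. A sketch with a hypothetical context type (not the actual crochet API):</p>

```rust
// Sketch: runtime enforcement of begin/end pairing via a closure wrapper.
// `Cx` is a hypothetical immediate-mode context, not the actual crochet API.
struct Cx {
    depth: usize,
}

impl Cx {
    fn begin_container(&mut self) {
        self.depth += 1;
    }

    fn end_container(&mut self) {
        self.depth -= 1;
    }

    /// The closure receives the context back, so end_container can't be
    /// forgotten or mismatched; nesting follows Rust's block structure.
    fn container<R>(&mut self, f: impl FnOnce(&mut Cx) -> R) -> R {
        self.begin_container();
        let result = f(self);
        self.end_container();
        result
    }
}
```

<p>The closure style trades the enforcement problem for yet more mutable-context threading, which runs into the single-thread issue discussed next.</p>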
<p>A final problem with emulating immediate mode is that the architecture tends to thread a mutable context parameter through the application logic. This is not especially ergonomic (it adds “noise”), and, perhaps more seriously, it effectively forces the app logic to run on a single thread.</p>
<p>Advice: of course try to figure out a good architecture, but also plan for it to evolve.</p>
<h2 id="accessibility">Accessibility</h2>
<p>It’s fair to say that a UI toolkit cannot be considered production-ready unless it supports accessibility features such as screen readers for visually impaired users. Unfortunately, accessibility support has often taken a back seat, but fortunately the situation looks like it might improve. In particular, the <a href="https://github.com/AccessKit/accesskit">AccessKit</a> project hopes to provide common accessibility infrastructure to UI toolkits in the Rust ecosystem.</p>
<p>Doing accessibility <em>well</em> is of course tricky. It requires architectural support from the toolkit. Further, platform accessibility capabilities often make architectural assumptions about the way apps are built. In general, they expect a retained mode widget tree; this is a significant impedance mismatch with immediate mode GUI, and generally requires stable widget IDs and a diffing approach to create the accessibility tree. For the accessibility part (as opposed to the GPU-drawn part) of the UI, it’s fair to say pure immediate mode cannot be used, only a hybrid approach which resembles emulation of an immediate mode API on top of a more traditional retained structure.</p>
<p>Also, to provide a high quality accessibility experience, the toolkit needs to export fine-grained control of accessibility features to the app developer. Hopefully, generic form-like UI can be handled automatically, but for things like custom widgets, the developer needs to build parts of the accessibility tree directly. There are also tricky interactions with features such as virtualized scrolling.</p>
<p>Accessibility is one of the great strengths of the Web technology stack. A lot of thought went into defining a cross-platform abstraction which could actually be implemented, and a lot of users depend on this every day. AccessKit borrows liberally from the Web approach, including the implementation in Chromium.</p>
<p>Advice: start thinking about accessibility early, and try to build prototypes to get a good understanding of what’s required for a high quality experience.</p>
<h2 id="what-of-druid">What of Druid?</h2>
<p>I admit I had hopes that Druid would become the popular choice for Rust GUI, though I’ve never explicitly had that as a goal. In any case, that hasn’t happened, and now is a time for serious thought about the future of the project.</p>
<p>We now have clearer focus that Druid is primarily a research project, with a special focus on high performance, but also solving the problems of “real UI.” The research goals are longer term; it is <em>not</em> a project oriented to shipping a usable toolkit soon. Thus, we are making some changes along these lines. We hope to get <a href="https://github.com/linebender/piet-gpu">piet-gpu</a> to a “minimum viable” 0.1 release soon, at which point we will switch drawing over to that, as opposed to the current strategy of wrapping platform drawing capabilities (which often means that drawing is on the CPU). We will also change the reactive architecture to Xilem.</p>
<p>Assuming we do a good job solving these problems, over time Druid might evolve into a toolkit usable for production applications. In the meantime, we don’t want to create unrealistic expectations. The primary audience for Druid is people learning how to build UI in Rust. This post isn’t the appropriate place for a full roadmap and vision document, but I expect to be writing more about that in time.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I don’t want to make too many predictions, but I am confident in asserting that there will be a dozen new UI projects in Rust in the next year or two. Most of them will be toys, though it is entirely possible that one or more of them will be in support of a product and will attract enough resources to build something approaching a “real” toolkit. I do expect fragmentation of infrastructure to continue, as there are legitimate reasons to choose different approaches, or emphasize different priorities. It’s possible we never get to a “one size fits all” solution for especially thorny problems such as window creation, input (including keyboard input and IME), text layout, and accessibility.</p>
<p>Meanwhile we will be pushing forward with Druid. It won’t be for everyone, but I am hopeful it will move the state of Rust UI forward. I’m also hopeful that the various projects will continue to learn from each other and build common ground on infrastructure where that makes sense.</p>
<p>And I remain very hopeful about the potential for GUI in Rust. It seems likely to me that it will be the language the next major GUI toolkit is written in, as no other language offers the combination of performance, safety, and high level expressiveness. All of the issues in this post are problems to be solved rather than reasons why Rust isn’t a good choice for building UI.</p>
<p>Discuss on <a href="https://news.ycombinator.com/item?id=32112846">Hacker News</a> and <a href="https://old.reddit.com/r/rust/comments/vzz4mt/advice_for_the_next_dozen_rust_guis/">/r/rust</a>.</p>
<h1 id="xilem-an-architecture-for-ui-in-rust">Xilem: an architecture for UI in Rust</h1>
<p><em>2022-05-07 · <a href="https://raphlinus.github.io/rust/gui/2022/05/07/ui-architecture">raphlinus.github.io/rust/gui/2022/05/07/ui-architecture</a></em></p>
<p>Rust is an appealing language for building user interfaces for a variety of reasons, especially the promise of delivering both performance and safety. However, finding a good <em>architecture</em> is challenging. Architectures that work well in other languages generally don’t adapt well to Rust, mostly because they rely on shared mutable state, and that is not idiomatic Rust, to put it mildly. It is sometimes asserted for this reason that Rust is a poor fit for UI. I have long believed that it is possible to find an architecture for UI well suited to implementation in Rust, but my previous attempts (including the current <a href="https://github.com/linebender/druid">Druid</a> architecture) have all been flawed. I have studied a range of other Rust UI projects and don’t feel that any of those have a suitable architecture either.</p>
<p>This post presents a new architecture, which is a synthesis of existing work and a few new ideas. The goals include expression of modern reactive, declarative UI, in components which easily compose, and a high performance implementation. UI code written in this architecture will look very intuitive to those familiar with state of the art toolkits such as SwiftUI, Flutter, and React, while at the same time being idiomatic Rust.</p>
<p>The name “Xilem” is derived from <a href="https://en.wikipedia.org/wiki/Xylem">xylem</a>, a type of transport tissue in vascular plants, including trees. The word is spelled with an “i” in several languages including Romanian and Malay, and is a reference to <a href="https://xi-editor.io/">xi-editor</a>, a starting place for explorations into UI in Rust (now on hold).</p>
<p>Like most modern UI architectures, Xilem is based on a <em>view tree</em> which is a simple declarative description of the UI. For incremental update, successive versions of the view tree are <em>diffed,</em> and the results are applied to a widget tree which is more of a traditional retained-mode UI. Xilem also contains at heart an incremental computation engine with precise change propagation, specialized for UI use.</p>
<p>The most innovative aspect of Xilem is event dispatching based on an <em>id path,</em> at each stage providing mutable access to app state. A distinctive feature is Adapt nodes (an evolution of the lensing concept in Druid) which facilitate composition of components. By routing events <em>through</em> Adapt nodes, subcomponents have access to a different mutable state reference than the parent.</p>
<h2 id="a-quick-tour-of-existing-architectures">A quick tour of existing architectures</h2>
<p>This architecture is designed to address limitations and problems with the existing state of the art, both the current <a href="https://github.com/linebender/druid">Druid</a> architecture and other attempts. It’s a little hard to understand some of the motivation without some knowledge of those architectures. That said, a full survey of reactive UI architectures would be quite a long work; this section can only touch the highlights.</p>
<p>The existing Druid architecture has some nice features, but we consistently see people struggle with common themes.</p>
<ul>
<li>There is a big difference between creating static widget hierarchies and dynamically updating them.</li>
<li>The app data must have a <a href="https://docs.rs/druid/latest/druid/trait.Data.html">Data</a> bound, which implies cloning and equality testing. Interior mutability is effectively forbidden.</li>
<li>The “lens” mechanism is confusing and it is not easy to implement complex binding patterns.</li>
<li>We never figured out how to integrate async in a compelling way.</li>
<li>There is an environment mechanism but it is not efficient and doesn’t support fine-grained change propagation.</li>
</ul>
<p>Another common architecture is immediate mode GUI, both in a relatively pure form and in a modified form. It is popular in Rust because it doesn’t require shared mutable state. It also benefits from overall system simplicity. However, the model is oversimplified in a number of ways, and it is difficult to do sophisticated layout and other patterns that are easier in retained widget systems. There are also numerous papercuts related to sometimes rendering stale state. (I experimented with a retained widget backend emulating an immediate mode API in the “crochet” architecture experiment and concluded that the result was not compelling). The popular <a href="https://github.com/emilk/egui">egui</a> crate is solidly an implementation of immediate mode, and <a href="https://github.com/makepad/makepad">makepad</a> is also based on it, though it differs in some important ways.</p>
<p>A particularly common architecture for UI in Rust is <a href="https://guide.elm-lang.org/architecture/">The Elm Architecture</a>, which also does not require shared mutable state. Rather, to support interactions from the UI, gestures and other related UI actions create <em>messages</em> which are then sent to an <code class="language-plaintext highlighter-rouge">update</code> method which takes central app state. <a href="https://iced.rs/">Iced</a>, <a href="https://github.com/antoyo/relm">relm</a>, and <a href="https://github.com/vizia/vizia">Vizia</a> all use some form of this architecture. Generally it works well, but the need to create an explicit message type and dispatch on it is verbose, and the Elm architecture does not support cleanly factored components as well as some other architectures. The <a href="https://guide.elm-lang.org/webapps/structure.html">Elm documentation</a> specifically warns against components, saying, “actively trying to make components is a recipe for disaster in Elm.”</p>
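<p>The message-and-update pattern can be sketched in a few lines of plain Rust. The names below (<code class="language-plaintext highlighter-rouge">Msg</code>, <code class="language-plaintext highlighter-rouge">Model</code>, <code class="language-plaintext highlighter-rouge">update</code>) are conventional for this architecture, not taken from Iced or any other specific crate:</p>

```rust
// Minimal Elm-style counter in plain Rust (conventional names, not any
// specific crate's API). Every UI action must appear in the message enum.
#[derive(Clone, Copy)]
enum Msg {
    Increment,
    Reset,
}

struct Model {
    count: u32,
}

// Central update function: all state changes are dispatched here, so there
// is no shared mutable state captured in widget callbacks.
fn update(model: &mut Model, msg: Msg) {
    match msg {
        Msg::Increment => model.count += 1,
        Msg::Reset => model.count = 0,
    }
}
```

<p>Even for a single button, the action must be named in the enum and matched in <code class="language-plaintext highlighter-rouge">update</code>; this is the verbosity referred to above, compared with a single callback that mutates state directly.</p>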
<p>Lastly, there are a number of serious attempts to port React patterns to Rust, of which I find <a href="https://dioxuslabs.com/">Dioxus</a> most promising. These rely on interior mutability and other patterns that I think adapt poorly to Rust, but definitely represent a credible alternative to the ideas presented here. I think we will have to build some things and see how well they work out.</p>
<h2 id="synchronized-trees">Synchronized trees</h2>
<p>The Xilem architecture is based around generating trees and keeping them synchronized. In that way it is a refinement of the ideas described in my previous blog post, <a href="https://raphlinus.github.io/ui/druid/2019/11/22/reactive-ui.html">Towards a unified theory of reactive UI</a>.</p>
<p>In each “cycle,” the app produces a view tree, from which rendering is derived. This tree has a fairly short lifetime; each time the UI is updated, a new tree is generated. From this, a widget tree is built (or rebuilt). The view tree is retained only long enough to assist in event dispatching and then be diffed against the next version, at which point it is dropped. The widget tree, by contrast, persists across cycles. In addition to these two trees, there is a third tree containing <em>view state,</em> which also persists across cycles. (The view state serves a function very similar to React hooks.)</p>
<p>Of existing UI architectures, the view tree most strongly resembles that of SwiftUI - nodes in the view tree are plain value objects. They also contain callbacks, for example specifying the action to be taken on clicking a button. Like SwiftUI, but somewhat unusually for UI in more dynamic languages, the view tree is statically typed, but with a type-erased escape hatch (Swift’s AnyView) for instances where strict static typing is too restrictive.</p>
<p>The Rust expression of these trees is instances of the <code class="language-plaintext highlighter-rouge">View</code> trait, which has two associated types, one for view state and one for the associated widget. The state and widgets are <em>also</em> statically typed. The design relies <em>heavily</em> on the type inference mechanisms of the Rust compiler. In addition to inferring the type of the view tree, it also uses associated types to deduce the type of the associated state tree and widget tree, which are known at compile time. In almost every other comparable system (SwiftUI being the notable exception), these are determined at runtime with a fair amount of allocation, downcasting, and dynamic dispatch.</p>
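<p>A much-simplified sketch of such a trait may make the shape concrete. The associated types and method signatures below are illustrative only; the prototype’s real <code class="language-plaintext highlighter-rouge">View</code> trait also carries ids, contexts, and message types:</p>

```rust
// Heavily simplified sketch of a statically typed view trait. The real
// prototype's signatures differ (ids, contexts, and events are elided here).
trait View<T> {
    /// Persistent per-view state, surviving across cycles (cf. React hooks).
    type State;
    /// The retained widget this view builds and later mutates.
    type Element;

    fn build(&self) -> (Self::State, Self::Element);
    fn rebuild(&self, prev: &Self, state: &mut Self::State, element: &mut Self::Element);
}

// A plain string as a view: its "widget" is just the retained text, and
// rebuild only touches the widget when the text actually changed.
impl<T> View<T> for String {
    type State = ();
    type Element = String;

    fn build(&self) -> ((), String) {
        ((), self.clone())
    }

    fn rebuild(&self, prev: &Self, _state: &mut (), element: &mut String) {
        if self != prev {
            *element = self.clone();
        }
    }
}
```

<p>Because <code class="language-plaintext highlighter-rouge">State</code> and <code class="language-plaintext highlighter-rouge">Element</code> are associated types, the full shape of the state and widget trees falls out of type inference at compile time, which is what avoids the boxing and downcasting mentioned above.</p>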
<h2 id="a-worked-example">A worked example</h2>
<p>We’ll use the classic counter as a running example. It’s very simple but will give insight into how things work under the hood. For people who want to follow along with the code, check the idiopath directory of the <a href="https://github.com/linebender/druid/pull/2183">idiopath branch</a> (in the Druid repo); running <code class="language-plaintext highlighter-rouge">cargo doc --open</code> there will reveal a bunch of Rustdoc.</p>
<p>Here’s the application.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">app</span><span class="p">(</span><span class="n">count</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="nb">u32</span><span class="p">)</span> <span class="k">-></span> <span class="k">impl</span> <span class="n">View</span><span class="o"><</span><span class="nb">u32</span><span class="o">></span> <span class="p">{</span>
<span class="nf">v_stack</span><span class="p">((</span>
<span class="nd">format!</span><span class="p">(</span><span class="s">"Count: {}"</span><span class="p">,</span> <span class="n">count</span><span class="p">),</span>
<span class="nf">button</span><span class="p">(</span><span class="s">"Increment"</span><span class="p">,</span> <span class="p">|</span><span class="n">count</span><span class="p">|</span> <span class="o">*</span><span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">),</span>
<span class="p">))</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This was carefully designed to be clean and simple. A few notes about this code, then we’ll get in to what happens downstream to actually build and run the UI.</p>
<p>This function is run whenever there are significant changes (more on that later). It takes the current app state (in this case a single number, but in general app state can be anything), and returns a view tree. The exact type of the view tree is not specified; rather, it uses the <a href="https://doc.bccnsoft.com/docs/rust-1.36.0-docs-html/edition-guide/rust-2018/trait-system/impl-trait-for-returning-complex-types-with-ease.html">impl Trait</a> feature to simply assert that it’s something that implements the View trait (parameterized on the type of the app state). The full type happens to be:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">VStack</span><span class="o"><</span><span class="nb">u32</span><span class="p">,</span> <span class="p">(),</span> <span class="p">(</span><span class="nb">String</span><span class="p">,</span> <span class="n">Button</span><span class="o"><</span><span class="nb">u32</span><span class="p">,</span> <span class="p">(),</span> <span class="p">{</span><span class="n">anonymous</span> <span class="n">function</span> <span class="n">of</span> <span class="k">type</span> <span class="k">for</span> <span class="o"><</span><span class="nv">'a</span><span class="o">></span> <span class="nf">FnMut</span><span class="p">(</span><span class="o">&</span><span class="nv">'a</span> <span class="k">mut</span> <span class="nb">u32</span><span class="p">)</span> <span class="k">-></span> <span class="p">()}</span><span class="o">></span><span class="p">)</span><span class="o">></span>
</code></pre></div></div>
<p>For such a simple example, this is not too bad (other than the callback), but would get annoying quickly.</p>
<p>Another observation is that three nodes of this tree implement the View trait: VStack and its two children String and Button. Yes, that’s an ordinary Rust string, and it implements the View trait. So will colors and shapes (following SwiftUI; implementation is planned in the prototype but not complete).</p>
<p>The view tree returned by the app logic can be visualized as a block diagram:</p>
<p><img src="/assets/xilem_view.svg" alt="Box diagram showing view hierarchy" /></p>
<p>The type of the associated widget is <code class="language-plaintext highlighter-rouge">widget::VStack</code>. Purely as a pragmatic implementation detail, we’ve chosen to type-erase the children of containers. Among other things, this avoids excessive monomorphization. If we wanted fully-typed widget trees, the architecture would support that, and indeed an earlier prototype chose that approach.</p>
<h3 id="identity-and-id-paths">Identity and id paths</h3>
<p>A specific detail when building the widget tree is assigning a <em>stable identity</em> to each view. These concepts are explained pretty well in the <a href="https://developer.apple.com/videos/play/wwdc2021/10022/">Demystify SwiftUI</a> talk. As in SwiftUI, stable identity can be based on <em>structure</em> (views in the same place in the view tree get to keep their identity across runs), or an <em>explicit key.</em> To illustrate the latter, assume a list container, and that two of the elements in the list are swapped. That might play an animation of the visual representations of those two elements changing places.</p>
<p>The idea of assigning stable identities is quite standard in declarative UI (it’s also present in basically all non-toy immediate mode GUI implementations), but Xilem adds a distinctive twist, the use of <em>id path</em> rather than a single id. The id path of a widget is the sequence of all ids on the path from the root to that widget in the widget tree. Thus, the id path of the button in the above is <code class="language-plaintext highlighter-rouge">[1, 3]</code>, while the label is <code class="language-plaintext highlighter-rouge">[1, 2]</code> and the stack is just <code class="language-plaintext highlighter-rouge">[1]</code>.</p>
<p>To continue our running example, given the view tree, the build step produces an associated view state tree as well as a widget tree. Here, we’ve annotated the view tree with ids, and also shown how ids and id paths are stored in the newly built structures. Most importantly, the view state for the VStack contains the ids of the child nodes, and the Button widget contains its id path, which is <code class="language-plaintext highlighter-rouge">[1, 3]</code>.</p>
<p><img src="/assets/xilem_expanded.svg" alt="Box diagram showing view hierarchy" /></p>
<p>As it turns out, neither text labels nor buttons require any special view state other than what’s retained in the widget, so those view states are just (), the unit type. Of course, other view nodes may require more associated state, in which case the type would be something other than the unit type.</p>
<details>
<summary>Advanced topic: does id identify view or widget?</summary>
In writing this explanation, I realized there is some ambiguity whether the id identifies a node in the view tree, or one in the widget tree. In many cases, there is a 1:1 correspondence between the two, so the distinction is not important. In addition, it's certainly possible to construct a widget tree so that the id of each node is provided during the method call that constructs that widget, and if so, it's more than reasonable to use the Xilem ids as generated during the reconciliation process. However, that's not required, and the identity of widgets could be from a disjoint space from Xilem ids assigned to views.
Further, it's possible to have a view node that doesn't directly correspond to a widget. That happens with holders of async futures, environment getters, and potentially other nodes that have reason to receive events. In those cases, ids are associated with views, not widgets. It's important to be clear, though, that the *lifetime* of an id is persistent across multiple cycles, while the lifetime of nodes in the view tree is shorter.
Thus, the most accurate description of Xilem ids is this: they are *persistent* identifiers that correspond to either structural or explicit identity of nodes in the view tree.
</details>
<h3 id="event-propagation">Event propagation</h3>
<p>Now let’s click that button. Raw platform events such as mouse movement and keystrokes are handled by the UI infrastructure, and don’t necessarily result in events propagated to the app logic. For example, the mouse can hover over the button (changing the appearance to a hover state), or the window can be resized, and that will all be handled without involving the app. But clicking the button <em>does</em> generate an event to be dispatched to the app.</p>
<p>Obviously the goal will be to run that callback and increment the count, but the details of how that happens are subtly different than most declarative UI systems. Probably the “standard” way to do this would be to attach the callback to the button, and have it capture a reference to the chunk of state it mutates. Again, in most declarative systems but not Xilem, setting the new state would be done using some variant of the <a href="https://en.wikipedia.org/wiki/Observer_pattern">observer pattern</a>, for example some kind of <code class="language-plaintext highlighter-rouge">setState</code> or other “setter” function to not only update the value but also notify downstream dependencies that it had changed, in this case re-rendering the label.</p>
<p>This standard approach works poorly in Rust, though it can be done (see in particular the <a href="https://dioxuslabs.com/">Dioxus</a> system for an example of one of the most literal transliterations of React patterns, including observer-based state updating, into Rust). The problem is that it requires <em>shared mutable</em> access to that state, which is clunky at best in Rust (it requires interior mutability). In addition, because Rust doesn’t have built-in syntax for getters and setters, invoking the notification mechanism also requires some kind of explicit call (though perhaps macros or other techniques can be used to hide it or make it less prominent).</p>
<details>
<summary>Advanced topic: comparison with Elm</summary>
The observer pattern is not the *only* way event propagation works in declarative UI. Another very important and influential pattern is [The Elm Architecture], which, being based on a pure functional language, also does not require shared mutable state. Thus, it is also used successfully as the basis of several Rust UI toolkits, notably Iced.
In Elm, app state is centralized (this is also a fairly popular pattern in React, using state management packages such as Redux), and events are given to the app through an `update` call. Dispatching is a three-stage process. First, the user defines a *message* type enumerating the various actions that are (globally) possible to trigger through the UI. Second, the UI element *maps* the event type into this user-defined type, identifying which action is desired. Third, the `update` method dispatches the event, delegating if needed to a child handler. Some people like the explicitness of this approach, but it is unquestionably more verbose than a single callback that manipulates state directly, as in React or SwiftUI.
</details>
<p>So what does Xilem do instead? The view tree is also parameterized on the <em>app state,</em> which can be any type. This idea is an evolution of Druid’s existing architecture, which also offers mutable access to app state to UI callbacks, but removes some of the limitations. In particular, Druid requires app state to be clonable and diffable, a stumbling block for many new users.</p>
<p>When an event is generated, it is annotated with the path of the UI element that originated it. In the case of the button, <code class="language-plaintext highlighter-rouge">[1, 3]</code>, and this is sent to the app logic for dispatching. In the case of a button click, there is no <em>payload,</em> but for a slider it would be the numeric slider value, or for a text input it would be the string.</p>
<p>Event dispatching starts at the root of the view tree, and calls to the <code class="language-plaintext highlighter-rouge">event</code> method on the <code class="language-plaintext highlighter-rouge">View</code> trait also contain a mutable borrow of the app state. In this example, the root view node is VStack. It examines the id path of the event, consults its associated view state, and decides that the event should be dispatched to the second child, as the id of that child (3) matches the corresponding id in the id path of the event. It recursively calls <code class="language-plaintext highlighter-rouge">event</code> on that child, which is the button. That is a <em>leaf node,</em> meaning there are no further ids in the id path, and the button handles that event by calling its callback, passing in the mutable borrow of app state that was propagated through the <code class="language-plaintext highlighter-rouge">event</code> traversal. That callback in turn just increments the count value.</p>
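<p>The routing step at each container can be sketched as: split the next id off the remaining path and find the child whose id matches. This is a hypothetical helper for illustration; the real traversal also threads the mutable app state and a context through the recursion:</p>

```rust
// Sketch of per-container id-path routing. A container matches the next id
// in the event's path against its stored child ids and recurses into that
// child; an empty remainder means the event has reached its leaf.
type Id = u64;

/// Returns the index of the matching child and the remainder of the path,
/// or None if the id is stale (e.g. the child no longer exists).
fn route<'a>(child_ids: &[Id], id_path: &'a [Id]) -> Option<(usize, &'a [Id])> {
    let (next, rest) = id_path.split_first()?;
    let idx = child_ids.iter().position(|id| id == next)?;
    Some((idx, rest))
}
```

<p>In the counter example, the VStack’s view state holds child ids <code class="language-plaintext highlighter-rouge">[2, 3]</code>, so an event tagged with the button’s id resolves to the second child with nothing left on the path, and the button then runs its callback.</p>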
<p>Note that the UI framework retains the view tree for a “fairly short” time - long enough to do any event dispatching that’s needed, and also for diffing (see below), but not longer than that.</p>
<h3 id="re-rendering">Re-rendering</h3>
<p>After clicking the button and running the callback, the app state consists of the number 1, formerly 0. The app logic function is run, producing a new view tree, and this time the string value is “Count: 1” rather than “Count: 0”. The challenge is then to update the widget tree with the new data.</p>
<p>As is completely standard in declarative UI, it is done by diffing the old view tree against the new one, in this case calling the <code class="language-plaintext highlighter-rouge">rebuild</code> method on the <code class="language-plaintext highlighter-rouge">View</code> trait. This method compares the data, updates the associated widget if there are any changes, and also traverses into children. The view tree is retained <em>just</em> long enough to do event propagation and to be diffed against the next iteration of the view tree, at which point it is dropped. At any point in time, there are at most two copies of the view tree in existence.</p>
<p>In the simplest case, the app builds the full view tree, and that is diffed in full against the previous version. However, as UI scales, this would be inefficient, so there are <em>other</em> mechanisms to do finer grained change propagation, as described below.</p>
<h2 id="components">Components</h2>
<p>The above is the basic architecture, enough to get started. Now we will go into some more advanced techniques.</p>
<p>It would be very limiting to have a single “app state” type throughout the application, and require all callbacks to express their state mutations in terms of that global type. So we won’t do that.</p>
<p>The main tool for stitching together components is the <code class="language-plaintext highlighter-rouge">Adapt</code> view node. This node is so named because it adapts between one app state type and another, using a closure that takes mutable access to the parent state, and calls into a child (through a “thunk”, which is just a callback for continuing the propagation to child nodes) with a mutable reference to the child state. We can then define a “component” in the Xilem world as a body of code that outputs a view tree with a different app state type than its parent component.</p>
<p>In the simple case where the child component operates independently of the parent, the adapt node is a couple lines of code. It is also an attachment point for richer interactions - the closure can manipulate the parent state in any way it likes. Event handling callbacks of the child component are also allowed to return an arbitrary type (unit by default), for propagation of data upward in the tree, including to parent components.</p>
<p>In Elm terminology, the Adapt node is similar to <a href="https://package.elm-lang.org/packages/elm/html/latest/Html#map">Html map</a>, though it manipulates mutable references to state, as opposed to being a pure functional mapping between message types. It is also quite similar to the “lens” concept from the existing Druid architecture, and has some resemblance to <a href="https://developer.apple.com/documentation/swiftui/binding">Binding</a> in SwiftUI as well.</p>
<h2 id="finer-grained-change-propagation-memoizing">Finer grained change propagation: memoizing</h2>
<p>Going back to the counter, every time the app logic is called, it allocates a string for the label, even if it’s the same value as before. That’s not too bad if it’s the only thing going on, but as the UI scales it is potentially wasted work.</p>
<p>Ron Minsky has <a href="https://signalsandthreads.com/building-a-ui-framework/#1523">stated</a> “hidden inside of every UI framework is some kind of incrementalization framework.” Xilem unapologetically contains at its core a lightweight change propagation engine, similar in scope to the attribute graph of SwiftUI, but highly specialized to the needs of UI, and in particular with a lightweight approach to <em>downward</em> propagation of dependencies, what in React would be stated as the flow of props into components.</p>
<p>In this particular case, that incremental change propagation is best represented as a <em>memoization</em> node, yet another implementation of the View trait. A memoization node takes a data value (which supports both <code class="language-plaintext highlighter-rouge">Clone</code> and equality testing) and a closure which accepts that same data type. On rebuild, it compares the data value with the previous version, and only runs the closure if it has changed. The signature of this node is very similar to <a href="https://guide.elm-lang.org/optimization/lazy.html">Html.Lazy</a> in Elm.</p>
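<p>In the counter example, such a node would let the label string be rebuilt only when the count actually changes. The core mechanism is just a stored previous value plus an equality test; the following is a sketch of that core, not the prototype’s actual API:</p>

```rust
// Core of a memoization node: rerun the closure only when the input changed.
// The real Memoize view plugs into build/rebuild; names here are illustrative.
struct Memoize<D> {
    prev: Option<D>,
}

impl<D: Clone + PartialEq> Memoize<D> {
    fn new() -> Self {
        Memoize { prev: None }
    }

    /// Runs `f` and returns Some(output) only if `data` differs from the
    /// previous call; otherwise skips the (possibly allocating) closure.
    fn update<R>(&mut self, data: &D, f: impl FnOnce(&D) -> R) -> Option<R> {
        if self.prev.as_ref() == Some(data) {
            None
        } else {
            self.prev = Some(data.clone());
            Some(f(data))
        }
    }
}
```

<p>Here a repeated count produces no new string allocation; only a changed count pays for the <code class="language-plaintext highlighter-rouge">format!</code>.</p>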
<p>Comparing a number is extremely cheap (especially because all this happens with static typing, so no boxing or downcasting is needed), but the cost of equality comparison is a valid concern for larger, aggregate data structures. Here, immutable data structures (adapted from the existing Druid architecture) can work very well.</p>
<p>Let’s say there’s a parent object that contains all the app state, including a sizable child component. The type would look something like this:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Clone)]</span>
<span class="k">struct</span> <span class="n">Parent</span> <span class="p">{</span>
    <span class="n">stuff</span><span class="p">:</span> <span class="n">Stuff</span><span class="p">,</span>
    <span class="n">child</span><span class="p">:</span> <span class="nb">Arc</span><span class="o"><</span><span class="n">Child</span><span class="o">></span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And at the top of the tree we can use a memoize node with type <code class="language-plaintext highlighter-rouge">Arc<Parent></code>, where the equality comparison is <a href="https://doc.rust-lang.org/std/sync/struct.Arc.html#method.ptr_eq">pointer equality</a> on the <code class="language-plaintext highlighter-rouge">Arc</code> rather than a deep traversal into the structure (as might be the case with a derived <code class="language-plaintext highlighter-rouge">PartialEq</code> impl). The child component attaches with both a Memoize and an Adapt node.</p>
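<p>The cheap comparison relies on <code class="language-plaintext highlighter-rouge">Arc</code>’s copy-on-write behavior; a small standalone illustration of the relevant standard library semantics:</p>

```rust
use std::sync::Arc;

fn main() {
    let a = Arc::new(vec![1, 2, 3]);
    let b = a.clone();
    // Cloning an Arc is shallow: both handles point at the same allocation.
    assert!(Arc::ptr_eq(&a, &b));

    let mut c = b.clone();
    // make_mut copies on write when the Arc is shared...
    Arc::make_mut(&mut c).push(4);
    // ...so c now points at a new allocation, while a and b are untouched.
    assert!(!Arc::ptr_eq(&a, &c));
    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(*c, vec![1, 2, 3, 4]);
    assert_eq!(*a, vec![1, 2, 3]);
}
```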
<p>The details of the Adapt node are interesting. Here’s a simple approach:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">adapt</span><span class="p">(</span>
    <span class="p">|</span><span class="n">data</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="nb">Arc</span><span class="o"><</span><span class="n">Parent</span><span class="o">></span><span class="p">,</span> <span class="n">thunk</span><span class="p">|</span> <span class="n">thunk</span><span class="nf">.call</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">make_mut</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="py">.child</span><span class="p">),</span>
    <span class="nf">child_view</span><span class="p">(</span><span class="o">...</span><span class="p">)</span> <span class="c">// which has Arc<Child> as its app data type</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Whenever events propagate into the child, <code class="language-plaintext highlighter-rouge">make_mut</code> creates a copy of the parent struct, which will then not be pointer-equal to the version stored in the memoize node. If such events are relatively rare, or if they nearly always end up mutating the child state, then this approach is reasonable. However, it is possible to be even finer grain:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">adapt</span><span class="p">(</span>
    <span class="p">|</span><span class="n">data</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="nb">Arc</span><span class="o"><</span><span class="n">Parent</span><span class="o">></span><span class="p">,</span> <span class="n">thunk</span><span class="p">|</span> <span class="p">{</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">child</span> <span class="o">=</span> <span class="n">data</span><span class="py">.child</span><span class="nf">.clone</span><span class="p">();</span>
        <span class="n">thunk</span><span class="nf">.call</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">child</span><span class="p">);</span>
        <span class="k">if</span> <span class="o">!</span><span class="nn">Arc</span><span class="p">::</span><span class="nf">ptr_eq</span><span class="p">(</span><span class="o">&</span><span class="n">data</span><span class="py">.child</span><span class="p">,</span> <span class="o">&</span><span class="n">child</span><span class="p">)</span> <span class="p">{</span>
            <span class="nn">Arc</span><span class="p">::</span><span class="nf">make_mut</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="py">.child</span> <span class="o">=</span> <span class="n">child</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">},</span>
    <span class="nf">child_view</span><span class="p">(</span><span class="o">...</span><span class="p">)</span> <span class="c">// which has Arc<Child> as its app data type</span>
<span class="p">)</span>
</code></pre></div></div>
</code></pre></div></div>
<p>This logic propagates the change up from the child object to the parent object in the app state <em>only if</em> the child state has actually changed.</p>
<p>The above illustrates how to make the pattern work for structure fields (and is very similar to the “lensing” technique in the existing Druid architecture), but similar ideas will work for collections. Basically you need immutable data structures that support pointer equality and cheap, sparse diffing. I talk about that in some detail in my talk <a href="https://www.youtube.com/watch?v=DSuX-LIAU-I">A Journey Through Incremental Computation</a> (<a href="https://docs.google.com/presentation/d/1opLymkreSTFfxygjzSLYI_uH7j1YFfE6DLl8RfCiw7E/edit">slides</a>), with a focus on text layout and a list component. Also note that the <a href="https://docs.rs/druid-derive/latest/druid_derive/">druid_derive</a> crate automates generation of these lenses for Druid, and no doubt a similar approach would work for adapt/memoize in Xilem. For now, though, I’m seeing how far we can get just using vanilla Rust and not relying on macros. I think all this is a fruitful direction for future work.</p>
<p>Also to note: while immutable data structures work <em>well</em> in the Xilem architecture, they are not absolutely required. The <code class="language-plaintext highlighter-rouge">View</code> trait itself can be implemented by anyone, as long as the <code class="language-plaintext highlighter-rouge">build</code>, <code class="language-plaintext highlighter-rouge">rebuild</code>, and <code class="language-plaintext highlighter-rouge">event</code> methods have correct implementations; change propagation is especially the domain of the <code class="language-plaintext highlighter-rouge">rebuild</code> method.</p>
<h2 id="type-erasure">Type erasure</h2>
<p>This section is optional but contains some interesting bits on advanced use of the Rust type system.</p>
<p>Having the view tree (and associated view state and widget tree) be fully statically typed has some advantages, but the types can become quite large, and there are cases where it is important to <em>erase</em> the type, providing functionality very similar to <a href="https://developer.apple.com/documentation/swiftui/anyview">AnyView</a> in SwiftUI.</p>
<p>For a simple trait, the standard approach would be <code class="language-plaintext highlighter-rouge">Box<dyn Trait></code>, which boxes up the trait implementation and uses dynamic dispatch. However, this approach will not work with Xilem’s View trait, because that trait is not <a href="https://huonw.github.io/blog/2015/01/object-safety/">object-safe</a>. There are actually two separate problems - first, the trait has associated types, and second, the <code class="language-plaintext highlighter-rouge">rebuild</code> method takes the previous view tree for diffing purposes as a <code class="language-plaintext highlighter-rouge">Self</code> parameter; though the contents of the view trees might differ, the type remains constant.</p>
<p>Fortunately, there is a pattern (due to David Tolnay) for <a href="https://github.com/dtolnay/erased-serde#how-it-works">type erasure</a> that works around these obstacles. You can look at the code for details, but the gist of it is a separate trait (<code class="language-plaintext highlighter-rouge">AnyView</code>) with <code class="language-plaintext highlighter-rouge">Any</code> in place of the associated type, and a blanket implementation (<code class="language-plaintext highlighter-rouge">impl<T> AnyView for T where T: View</code>) that does the needed downcasting.</p>
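<p>A minimal sketch of that pattern (illustrative names and a pared-down trait; the real Xilem trait has more methods and type parameters):</p>

```rust
use std::any::Any;

// Statically typed trait with an associated type (not object-safe as-is).
trait View {
    type State;
    fn build(&self) -> Self::State;
}

// Object-safe companion: the associated type is replaced by Box<dyn Any>.
trait AnyView {
    fn dyn_build(&self) -> Box<dyn Any>;
}

// Blanket impl erases any View whose state type is 'static.
impl<V: View> AnyView for V
where
    V::State: 'static,
{
    fn dyn_build(&self) -> Box<dyn Any> {
        Box::new(self.build())
    }
}

struct Label(String);

impl View for Label {
    type State = usize; // e.g. some widget handle
    fn build(&self) -> usize {
        self.0.len()
    }
}

fn main() {
    let erased: Box<dyn AnyView> = Box::new(Label("hello".into()));
    let state = erased.dyn_build();
    // Downcast back to the concrete associated type.
    assert_eq!(*state.downcast_ref::<usize>().unwrap(), 5);
    println!("ok");
}
```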
<p>The details of type erasure in Xilem took a fair amount of iteration to get right (special thanks to Olivier Faure and Manmeet Singh for earlier prototypes). An <a href="https://github.com/linebender/druid/pull/1669">earlier iteration</a> of this architecture used the Any/downcast pattern everywhere, and I also see that pattern for associated state in <a href="https://github.com/audulus/rui">rui</a> and <a href="https://docs.rs/iced_pure/latest/iced_pure/">iced_pure</a>, even though the main view object is statically typed.</p>
<p>In the SwiftUI community, <code class="language-plaintext highlighter-rouge">AnyView</code> is <a href="https://www.swiftbysundell.com/articles/avoiding-anyview-in-swiftui/">frowned on</a>, but it is still useful to have it. While <code class="language-plaintext highlighter-rouge">impl Trait</code> is a powerful tool to avoid having to write out explicit types, it doesn’t work in all cases (specifically as the return type for trait methods), though there is <a href="https://github.com/rust-lang/rust/issues/63063">work to fix that</a>. There is an additional motivation for type erasure, namely language bindings.</p>
<h3 id="language-bindings-for-dynamic-languages">Language bindings for dynamic languages</h3>
<p>As part of this exploration, I wanted to see if Python bindings were viable. This goal is potentially quite challenging, as Xilem is fundamentally a (very) strongly typed architecture, and Python is archetypally a loosely typed language. One of the limitations of the existing Druid architecture I wanted to overcome is that there was no satisfying way to create Python bindings. As another negative data point, no dynamic language bindings for SwiftUI have emerged in the approximately 3 years since its introduction.</p>
<p>Yet I was able to create a fairly nice looking proof of concept for Xilem Python bindings. Obviously these bindings rely heavily on type erasure. The essence of the integration is <code class="language-plaintext highlighter-rouge">impl View for PyObject</code>, where the main instance is a Python wrapper around <code class="language-plaintext highlighter-rouge">Box<dyn AnyView></code> as stated above. In addition, <code class="language-plaintext highlighter-rouge">PyObject</code> serves as the type for both app state and messages in the Python world; an <code class="language-plaintext highlighter-rouge">Adapt</code> node serves to interface between these and more native Rust types. Lastly, to make it all work we need <code class="language-plaintext highlighter-rouge">impl ViewSequence for PyTuple</code> so that Python tuples can serve as the children of containers like VStack, for building view hierarchies.</p>
<p>I should emphasize, this is a <a href="https://github.com/linebender/druid/pull/2185">proof of concept</a>. To do a polished set of language bindings is a fairly major undertaking, with care needed to bridge the impedance mismatch, and especially to provide useful error messages when things go wrong. Even so, it seems promising, and, if nothing else, serves to demonstrate the flexibility of the architecture.</p>
<h2 id="async">Async</h2>
<p>The interaction between async and UI is an extremely deep topic and likely warrants a blog post of its own. Even so, I wanted to explore it in the Xilem prototype. Initial prototyping indicates that it can work, and that the integration can be fine grained.</p>
<p>Async and change propagation for UI have some common features, and the Xilem approach has parallels to <a href="https://rust-lang.github.io/async-book/08_ecosystem/00_chapter.html">Rust’s async ecosystem</a>. In particular, the id path in Xilem is roughly analogous to the “waker” abstraction in Rust async - they both identify the “target” of the notification change.</p>
<p>In fact, in the prototype integration, the waker provided to the Future trait is a thin wrapper around an id path, as well as a callback to notify the platform that it should wake the UI thread if it is sleeping. Somewhat unusually for Rust async, each <code class="language-plaintext highlighter-rouge">View</code> node holding a future calls <code class="language-plaintext highlighter-rouge">poll</code> on it directly; in some respects, a future-holding view is like a tiny executor of its own. A UI built with Xilem does not provide its own reactor, but rather relies on existing work such as <a href="https://tokio.rs/">tokio</a> (which was used for the prototype).</p>
<p>We refer the interested reader to <a href="https://github.com/linebender/druid/pull/2184">the prototype code</a> for more details. Clearly this is an area that deserves to be explored much more deeply.</p>
<h2 id="environment">Environment</h2>
<p>A standard feature of declarative UI is a dynamically scoped <em>environment</em> or <em>context</em> in which child nodes can retrieve information, usually in some form of key/value format, from ancestors somewhere higher up in the tree. Most Rust UI architectures have this (including existing Druid), but you should ask: is there fine-grained change propagation, or does it have to rebuild all children when <em>any</em> environment key changes?</p>
<p>We have a design for this, not fully implemented yet. The <code class="language-plaintext highlighter-rouge">Cx</code> structure threaded through the reconciliation process gains an environment, which is a map from key to (subscribers, value) tuples. The subscribers field is a set of id paths (of child nodes). There are then two <code class="language-plaintext highlighter-rouge">View</code> nodes, one for setting an environment value, which is then available to descendants, and one for retrieving that value. Let’s look at the <code class="language-plaintext highlighter-rouge">build</code> and <code class="language-plaintext highlighter-rouge">rebuild</code> methods for both of those nodes, the setter first.</p>
<p>The associated state of the setter is the value and the subscriber set. On <code class="language-plaintext highlighter-rouge">build</code> it creates that state, with the subscriber set empty. On <code class="language-plaintext highlighter-rouge">rebuild</code> it compares the new value with that stored in the state (these values must be equality-comparable), and, if they differ, sends a notification to each of the subscribers in the subscriber set, using the id path to dispatch those notifications. In both cases, it recurses to the child node, then pops the key from the environment stack in <code class="language-plaintext highlighter-rouge">Cx</code> before it returns (here, “pop” means either removing the key from the environment map, or setting the value to what it was before traversing into the setter node, depending on that previous value).</p>
<p>The getter, in the <code class="language-plaintext highlighter-rouge">build</code> case, fetches the value and subscriber set from the environment map, and <em>adds</em> itself to that subscriber set. In either the <code class="language-plaintext highlighter-rouge">build</code> or <code class="language-plaintext highlighter-rouge">rebuild</code> case, it retrieves the value and calls the closure for its child view tree, passing in the value retrieved from the environment. Note that the “subscriber set” concept is an adaptation of the classic observer pattern to the Xilem architecture.</p>
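<p>The subscriber bookkeeping can be sketched in plain Rust (a hypothetical, single-threaded structure with string values for brevity, not the actual design):</p>

```rust
use std::collections::{HashMap, HashSet};

// Sketch of the environment described above: each key maps to a
// (subscribers, value) pair; the setter notifies subscribers on change.
type IdPath = Vec<u64>;

#[derive(Default)]
struct Env {
    map: HashMap<&'static str, (HashSet<IdPath>, String)>,
}

impl Env {
    // Getter side: fetch the value and subscribe the reading node.
    fn get(&mut self, key: &'static str, id_path: IdPath) -> Option<String> {
        let (subs, value) = self.map.get_mut(key)?;
        subs.insert(id_path);
        Some(value.clone())
    }

    // Setter side: on rebuild, compare and collect ids to notify if changed.
    fn set(&mut self, key: &'static str, new: String) -> Vec<IdPath> {
        let entry = self.map.entry(key).or_default();
        if entry.1 != new {
            entry.1 = new;
            entry.0.iter().cloned().collect()
        } else {
            Vec::new()
        }
    }
}

fn main() {
    let mut env = Env::default();
    env.set("theme", "light".into());
    assert_eq!(env.get("theme", vec![1, 2]), Some("light".into()));
    // Unchanged value: no notifications go out.
    assert!(env.set("theme", "light".into()).is_empty());
    // Changed value: the subscriber at id path [1, 2] is notified.
    assert_eq!(env.set("theme", "dark".into()), vec![vec![1, 2]]);
    println!("ok");
}
```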
<p>One reason this hasn’t been prototyped is that the implementation details are fairly different depending on whether reconciliation can be multithreaded (a note on that below). It’s possible to do but requires an immutable map data structure to avoid high cloning cost, as well as interior mutability for the subscription set.</p>
<p>We have a similar prototype of a <code class="language-plaintext highlighter-rouge">useState</code> mechanism, but not enough data on its usefulness to say for sure whether it carries its weight. Such a mechanism does not seem to be needed in the Elm architecture.</p>
<h2 id="other-topics">Other topics</h2>
<p>So far, I haven’t deeply explored styling and theming. These operations also potentially ride on an incremental change propagation system, especially because dynamic changes to the style or theme may propagate in nontrivial ways to affect the final appearance.</p>
<p>Another topic I’m <em>very</em> interested to explore more fully is accessibility. I <em>expect</em> that the retained widget tree will adapt nicely to accessibility work such as Matt Campbell’s <a href="https://github.com/AccessKit/accesskit">AccessKit</a>, but of course you never know for sure until you actually try it.</p>
<p>An especially difficult challenge in UI toolkits is sparse scrolling, where there is the illusion of a very large number of child widgets in the scroll area, but in reality only a small subset of the widgets outside the visible viewport are materialized. I am hopeful that the tight coupling between view and associated widget, as well as a lazy callback-driven creation of widgets, will help with this, but again, we won’t know for sure until it’s built.</p>
<p>Another very advanced topic is the ability to exploit parallelism (multiple threads) to reduce latency of the UI. The existing Druid architecture threads a mutable context to almost all widget methods, basically precluding any useful parallelism. In the Xilem architecture, creation of the View tree itself can easily be multithreaded, and I <em>think</em> it’s also possible to do multithreaded reconciliation. The key to that is to make the <code class="language-plaintext highlighter-rouge">Cx</code> object passed to the <code class="language-plaintext highlighter-rouge">build</code> and <code class="language-plaintext highlighter-rouge">rebuild</code> methods <code class="language-plaintext highlighter-rouge">Clone</code>, which I think is possible. Again, actually realizing performance gains from this approach is a significant challenge.</p>
<h2 id="prospects">Prospects</h2>
<p>The work presented in this blog post is conceptual, almost academic, though it is forged from attempts to build real-world UI in Rust. It comes to you at an early stage; we haven’t <em>yet</em> built up real UI around the new architecture. Part of the motivation for doing this writeup is so we can gather feedback on whether it will actually deliver on its promise.</p>
<p>One way to test that would be to try it in other domains. There are quite a few projects that implement reactive UI ideas over a TUI, and it would also be interesting to try the Xilem architecture on top of Web infrastructure, generating DOM nodes in place of the associated widget tree.</p>
<p>I’d like to thank a large number of people, though of course the mistakes in this post are my own. The Xilem architecture takes a lot of inspiration from Olivier’s <a href="https://github.com/PoignardAzur/panoramix">Panoramix</a> and Manmeet’s <a href="https://github.com/Maan2003/olma">olma</a> explorations, as well as Taylor Holliday’s <a href="https://github.com/audulus/rui">rui</a>. Jan Pochyla provided useful feedback on early versions, and conversations with the entire Druid crew on <a href="https://xi.zulipchat.com/">xi.zulipchat.com</a> were also informative. Ben Saunders provided valuable insight regarding Rust’s async ecosystem.</p>
<p>Discuss on <a href="https://news.ycombinator.com/item?id=31297550">Hacker News</a> and <a href="https://www.reddit.com/r/rust/comments/ukk1p4/xilem_an_architecture_for_ui_in_rust/">/r/rust</a>.</p>
<p>Rust is an appealing language for building user interfaces for a variety of reasons, especially the promise of delivering both performance and safety. However, finding a good architecture is challenging. Architectures that work well in other languages generally don’t adapt well to Rust, mostly because they rely on shared mutable state and that is not idiomatic Rust, to put it mildly. It is sometimes asserted for this reason that Rust is a poor fit for UI. I have long believed that it is possible to find an architecture for UI well suited to implementation in Rust, but my previous attempts (including the current Druid architecture) have all been flawed. I have studied a range of other Rust UI projects and don’t feel that any of those have suitable architecture either.</p>
<h1 id="piet-gpu-progress-clipping"><a href="https://raphlinus.github.io/rust/graphics/gpu/2022/02/24/piet-gpu-clipping">piet-gpu progress: clipping</a> (2022-02-24)</h1>
<p>Recently piet-gpu has taken a big step towards realizing its <a href="https://github.com/linebender/piet-gpu/blob/master/doc/vision.md">vision</a>, moving the computation of path clipping from partially being done on the CPU to entirely being done on the GPU.</p>
<p>This post explains how that works. It’s been quite a journey - I actually started a draft of this more than a year ago. Since then, I’ve had to come up with a fundamentally new parallel algorithm. It’s finally done. I think clipping is very much at the heart of what makes piet-gpu different from other 2D rendering engines.</p>
<h2 id="what-is-clipping">What is clipping?</h2>
<p>The basic idea of clipping is simple, but it has a profound impact on the overall imaging model. For one, it induces a <em>tree structure,</em> while drawing without clipping can be seen as a linear sequence of layers. There are different ways of implementing pure vector clipping, but the one I’ve gone with generalizes to <em>blending,</em> which is next.</p>
<p>Clipping with a vector path affects the drawing of all <em>child objects</em> of the clip node. Points inside the path are drawn, and points outside the path are not.</p>
<p><img src="/assets/clip_demo.svg" alt="Example of clipping" /></p>
<p>For pure vector path clipping, the effective clip applied to each (leaf) drawing node is the intersection of all clip paths up the tree. Thus, one viable implementation is to do this path intersection in vector space. However, boolean path operations are tricky to implement, and only CPU implementations are known; moving them to the GPU would be difficult, to say the least.</p>
<p>Rather, we treat (antialiased) clipping as a <em>blend</em> operation – in fact, it is an instance of the <a href="https://www.w3.org/TR/compositing-1/#porterduffcompositingoperators_srcin">Porter-Duff “source in”</a> composition operator. Conceptually, the children of a clip are rendered into a temporary, initially clear buffer, the clip mask is rendered into an alpha channel, then the temporary buffer is composited on the background, the alpha channel multiplied by the mask. This approach can be done fully on the GPU, and the blending operations generalize.</p>
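<p>The per-pixel math is simple; here is a sketch with premultiplied alpha (an illustration of the compositing operators, not piet-gpu’s shader code):</p>

```rust
// Premultiplied-alpha RGBA pixel.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rgba { r: f32, g: f32, b: f32, a: f32 }

// Porter-Duff "source in": keep the source only where the mask has coverage.
fn source_in(src: Rgba, mask_alpha: f32) -> Rgba {
    Rgba { r: src.r * mask_alpha, g: src.g * mask_alpha,
           b: src.b * mask_alpha, a: src.a * mask_alpha }
}

// Porter-Duff "over" for premultiplied alpha.
fn over(src: Rgba, dst: Rgba) -> Rgba {
    let k = 1.0 - src.a;
    Rgba { r: src.r + dst.r * k, g: src.g + dst.g * k,
           b: src.b + dst.b * k, a: src.a + dst.a * k }
}

fn main() {
    let children = Rgba { r: 1.0, g: 0.0, b: 0.0, a: 1.0 }; // opaque red layer
    let background = Rgba { r: 0.0, g: 0.0, b: 1.0, a: 1.0 };
    // Fully outside the clip mask: the background shows through untouched.
    assert_eq!(over(source_in(children, 0.0), background), background);
    // Fully inside the clip mask: the children replace the background.
    assert_eq!(over(source_in(children, 1.0), background), children);
    println!("ok");
}
```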
<p>Clips can nest arbitrarily; a clip node can be the child of another clip node. Thus, a full 2D scene is effectively a <em>tree,</em> which profoundly affects the way drawing is done.</p>
<h2 id="bounding-boxes">Bounding boxes</h2>
<p>A common technique in 2D rendering is to assign a <em>bounding box</em> to each drawing operation. The bounding box is an enclosing rectangle (hopefully tight), so that drawing operations affect only pixels inside the rectangle. For rendering any subregion of the target surface, any objects whose bounding box does not intersect that bounding box may be ignored. Piet-gpu uses bounding boxes extensively to organize parallel drawing, first by 256x256 pixel bins, then 16x16 pixel tiles. Bounding boxes exploit the fact that most 2D drawing is <em>sparse,</em> in that most draw objects don’t touch most tiles.</p>
<p>Bounding boxes of course interact with clipping. Each clip node in the tree has a bounding box, computed from the corresponding clip path. Then, the bounding boxes for all descendants in the tree are intersected with that bounding box. Another way of phrasing this is that the clipped bounding box for a draw object is the object’s own bounding box intersected with the bounding boxes of all clip paths in the path from that node to the root of the tree.</p>
<p>Computing these bounding boxes is less work than rendering the objects, but is not free. In particular, we’d like to do that work on the GPU rather than the CPU.</p>
<h2 id="per-tile-optimization">Per-tile optimization</h2>
<p>Things get even more interesting with per-tile optimization. The piet-gpu rendering pipeline is organized as a series of stages, finally writing a per-tile command list (sometimes called “tape”) for each 16x16 tile in the target. The last stage is “fine rasterization,” which plays these commands for all pixels in the tile. Within a tile there is no control flow; all commands are evaluated for all pixels.</p>
<p>In the general case, clip is rendered as follows. The fine rasterizer maintains a bunch of per-pixel state, notably a current pixel and a blend stack. The current pixel is initially clear (alpha = 0) or a background color, and ordinary draw objects are composited onto it (generally using <a href="https://www.w3.org/TR/compositing-1/#porterduffcompositingoperators_srcover">Porter-Duff “over”</a>). A <code class="language-plaintext highlighter-rouge">BeginClip</code> operation pushes the current pixel on a <em>blend stack.</em> The children of the clip are rendered, compositing into the current pixel. Then, at a matching <code class="language-plaintext highlighter-rouge">EndClip</code>, the clip mask is rendered into an alpha mask, that’s composited with the current pixel using “source in” (basically multiplying the alpha with the mask), and that result is composited (Porter-Duff “over”) with the top of the blend stack (which is popped), becoming the new current pixel.</p>
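<p>The general-case machinery can be sketched for a single pixel as a tiny interpreter (a hypothetical command set with a single color channel and replace-style fills for brevity, not the actual piet-gpu tape format):</p>

```rust
// Sketch of the per-pixel blend stack described above, for one pixel.
enum Cmd {
    Fill(f32),                   // draw: replaces the current pixel
    BeginClip,                   // push current pixel, start a clear layer
    EndClip { mask_alpha: f32 }, // apply mask, composite onto popped pixel
}

fn run(cmds: &[Cmd]) -> f32 {
    let mut pixel = 0.0f32; // single-channel "color" for brevity
    let mut stack: Vec<f32> = Vec::new();
    for cmd in cmds {
        match cmd {
            Cmd::Fill(c) => pixel = *c,
            Cmd::BeginClip => {
                stack.push(pixel);
                pixel = 0.0;
            }
            Cmd::EndClip { mask_alpha } => {
                let below = stack.pop().unwrap();
                // "source in" with the mask, then "over" the popped pixel;
                // with replace-style fills this reduces to a lerp.
                pixel = pixel * mask_alpha + below * (1.0 - mask_alpha);
            }
        }
    }
    pixel
}

fn main() {
    let cmds = [
        Cmd::Fill(0.25),
        Cmd::BeginClip,
        Cmd::Fill(1.0),
        Cmd::EndClip { mask_alpha: 0.5 },
    ];
    // Half-covered white layer over a 0.25 background.
    assert_eq!(run(&cmds), 0.625);
    println!("ok");
}
```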
<p>Much of the time, we don’t need the general case, however. Piet-gpu rendering works on 16x16 pixel tiles. Earlier stages in the pipeline compute what should happen within a tile, and the final stage (fine rasterization) performs a sequence of drawing operations for all pixels in the tile. This gives us the opportunity to do some optimization.</p>
<p>In particular, zooming into a single tile, a clip path may be in one of 3 states: zero coverage, partial coverage, or full coverage. Partial coverage only happens when the clip path intersects the tile. Full coverage is for tiles entirely within the clip path, and zero coverage is for tiles outside the clip path.</p>
<p><img src="/assets/clip_tiles.svg" alt="Diagram showing zero, partial, and full path coverage by tile" /></p>
<p>The mask computation and compositing only needs to happen for the partial coverage tiles (shown as gray in the above figure). The others can be rendered much more efficiently. Zero coverage tiles suppress the rendering of child nodes, basically from the <code class="language-plaintext highlighter-rouge">BeginClip</code> to the corresponding <code class="language-plaintext highlighter-rouge">EndClip</code>. And full coverage tiles are basically a no-op; the mask need not be rendered, and child nodes are rendered as if there were no clip in effect.</p>
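<p>One way to represent this classification (a hypothetical sketch, not piet-gpu’s actual encoding; <code class="language-plaintext highlighter-rouge">backdrop</code> here stands for the winding number carried in from the tile’s edge under a nonzero fill rule):</p>

```rust
// Hypothetical per-tile clip classification.
#[derive(Debug, PartialEq)]
enum Coverage {
    Zero,    // suppress children from BeginClip to EndClip
    Partial, // render the mask and composite
    Full,    // no-op: render children as if unclipped
}

// `has_segments`: does any segment of the clip path cross this tile?
// `backdrop`: winding number at the tile's edge (nonzero fill rule).
fn classify(has_segments: bool, backdrop: i32) -> Coverage {
    if has_segments {
        Coverage::Partial
    } else if backdrop != 0 {
        Coverage::Full
    } else {
        Coverage::Zero
    }
}

fn main() {
    assert_eq!(classify(true, 0), Coverage::Partial);
    assert_eq!(classify(false, 1), Coverage::Full);
    assert_eq!(classify(false, 0), Coverage::Zero);
    println!("ok");
}
```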
<h2 id="gpu-computation-of-clip-bounding-boxes">GPU computation of clip bounding boxes</h2>
<p>A previous iteration had accounting of bounding boxes done by CPU. A major goal of piet-gpu is to move as much computation as possible to the GPU.</p>
<p>The fundamental task is assigning to each draw object a bounding box rectangle that incorporates the intersection of all enclosing clips. A linearized representation of the scene will have a <code class="language-plaintext highlighter-rouge">BeginClip</code> element and an <code class="language-plaintext highlighter-rouge">EndClip</code> element. The <code class="language-plaintext highlighter-rouge">BeginClip</code> will also reference a path, and that path has an associated bounding box.</p>
<p>Here’s what that calculation looks like, as a sequential algorithm:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stack</span> <span class="o">=</span> <span class="p">[</span><span class="n">viewport_bbox</span><span class="p">]</span>
<span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="n">scene</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">element</span> <span class="ow">is</span> <span class="n">BeginClip</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
        <span class="n">stack</span><span class="p">.</span><span class="n">push</span><span class="p">(</span><span class="n">intersect</span><span class="p">(</span><span class="n">stack</span><span class="p">.</span><span class="n">last</span><span class="p">(),</span> <span class="n">path</span><span class="p">.</span><span class="n">bbox</span><span class="p">()))</span>
        <span class="n">element</span><span class="p">.</span><span class="n">effective_bbox</span> <span class="o">=</span> <span class="n">stack</span><span class="p">.</span><span class="n">last</span><span class="p">()</span>
    <span class="k">elif</span> <span class="n">element</span> <span class="ow">is</span> <span class="n">EndClip</span><span class="p">:</span>
        <span class="n">element</span><span class="p">.</span><span class="n">effective_bbox</span> <span class="o">=</span> <span class="n">stack</span><span class="p">.</span><span class="n">last</span><span class="p">()</span>
        <span class="n">stack</span><span class="p">.</span><span class="n">pop</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">element</span><span class="p">.</span><span class="n">effective_bbox</span> <span class="o">=</span> <span class="n">intersect</span><span class="p">(</span><span class="n">stack</span><span class="p">.</span><span class="n">last</span><span class="p">(),</span> <span class="n">element</span><span class="p">.</span><span class="n">bbox</span><span class="p">())</span>
</code></pre></div></div>
<p>As a sequential algorithm, this is very straightforward, almost trivial. There’s a stack of bounding boxes, and the size of that stack is bounded by the maximum nesting depth. The cost of processing each element is O(1) as well. The only problem is, you really, really don’t want to run a sequential algorithm on a GPU.</p>
<p>That raises a question: is there a way to make a parallel version of this algorithm, one that runs efficiently on actual GPU hardware? That question has been a major obsession of mine for at least a year. And I am pleased to say, the answer is yes. I’ve done it.</p>
<p>The core of my solution is what I call the stack monoid, which is a variant on the well-studied parentheses matching problem. I blogged about an earlier version in <a href="https://raphlinus.github.io/gpu/2021/05/13/stack-monoid-revisited.html">stack monoid revisited</a>. Since then, I’ve made an improved version, with almost 5x peak performance and better portability as well. I’m not going to go into the details in this blog post, rather I will just say the solution is available by magic, and focus on the application to the 2D rendering problem.</p>
<p><img src="/assets/stack_monoid_parent_tree.svg" alt="Diagram showing parent relationships in a tree" /></p>
<p>Basically, we use the result of parentheses matching for two things. First, each <code class="language-plaintext highlighter-rouge">EndClip</code> is able to access the same path and bounding box data as the corresponding <code class="language-plaintext highlighter-rouge">BeginClip</code>. In particular, that lets us do the per-tile optimization in coarse rasterization efficiently, as that shader doesn’t need to maintain significant state. Second, it computes the intersection of all clip bounding boxes on the path to the root. Rectangle intersection is, thankfully, a monoid, so this running intersection can be computed with the same parallel scan machinery.</p>
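<p>The monoid structure is easy to check directly: there is an identity element (the “everything” rectangle) and the operation is associative, which is what licenses reorganizing the per-element work into a parallel scan. A small sketch:</p>

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect { x0: f32, y0: f32, x1: f32, y1: f32 }

// Identity for intersection: the rectangle covering everything.
const IDENTITY: Rect = Rect {
    x0: f32::NEG_INFINITY, y0: f32::NEG_INFINITY,
    x1: f32::INFINITY, y1: f32::INFINITY,
};

fn intersect(a: Rect, b: Rect) -> Rect {
    Rect {
        x0: a.x0.max(b.x0), y0: a.y0.max(b.y0),
        x1: a.x1.min(b.x1), y1: a.y1.min(b.y1),
    }
}

fn main() {
    let a = Rect { x0: 0.0, y0: 0.0, x1: 10.0, y1: 10.0 };
    let b = Rect { x0: 5.0, y0: 5.0, x1: 20.0, y1: 20.0 };
    let c = Rect { x0: 0.0, y0: 0.0, x1: 8.0, y1: 8.0 };
    // Identity law.
    assert_eq!(intersect(a, IDENTITY), a);
    // Associativity law.
    assert_eq!(intersect(intersect(a, b), c), intersect(a, intersect(b, c)));
    println!("{:?}", intersect(a, b));
}
```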
<h3 id="stream-compaction">Stream compaction</h3>
<p>This section is a detail that can be skipped, but may be of interest to people writing fancier GPU algorithms.</p>
<p>The <a href="https://raphlinus.github.io/rust/graphics/gpu/2020/06/12/sort-middle.html">original piet-gpu design</a> used an “array of structures” approach to scene description, in particular a single array with fixed size elements, each of which was a tagged union of various drawing element types, including path segments. Processing this array basically requires a large switch statement to deal with the variants in the union. I had contemplated doing a stack monoid over this array, but was very worried about the performance cost of computing the stack monoid for every element in this array. I now have a <em>very</em> fast stack monoid implementation, but even so have reworked the architecture so this cannot be a problem.</p>
<p>The new architecture (described in some detail in the <a href="https://github.com/linebender/piet-gpu/issues/119">new element processing pipeline</a> issue) is more of a “structure of arrays” approach, which is extremely popular in the graphics and game world due to its performance advantage. Every major datatype gets its own stream. Further, much of the logic for each type gets moved into its own shader dispatch, which works in bulk on only that type of object, with no big switch statement. To stitch these together, we use a bunch of indices into these streams, which are computed using prefix sums of the counts.</p>
<p><img src="/assets/clip_aos_vs_soa.svg" alt="Diagram of &quot;array of structs&quot; vs &quot;struct of arrays&quot; approach" /></p>
<p>Specifically, the draw object stage does a stream compaction and writes a <em>clip stream,</em> which is an array of just the <code class="language-plaintext highlighter-rouge">BeginClip</code> and <code class="language-plaintext highlighter-rouge">EndClip</code> objects. A clip index is an index into this stream. At the same time, it assigns a clip index to each draw object. A sequence of draw objects enclosed by the same clip all have the same clip index. In the above diagram, the “clips” array represents the clip stream written by the draw object stage. The arrows associating the different parts of the scene together are also computed in the draw object stage using prefix sum.</p>
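<p>To make the index bookkeeping concrete, here is a sequential sketch of this compaction in plain Rust. On the GPU, the loop becomes a prefix sum over per-element flags followed by scattered writes; the <code>DrawTag</code> type and <code>compact_clips</code> function are illustrative names, not piet-gpu’s actual ones:</p>

```rust
// Stream compaction of clip elements, plus per-object clip indices,
// written sequentially for clarity.
#[derive(Clone, Copy)]
enum DrawTag { BeginClip, EndClip, Draw }

/// Returns (clip_stream, clip_ix):
/// - clip_stream: indices into `tags` of the clip elements only.
/// - clip_ix: for every draw object, the clip-stream slot of the most
///   recent clip element (usize::MAX if there is none yet). On the GPU
///   this is a prefix sum of "is clip element" flags, minus one.
fn compact_clips(tags: &[DrawTag]) -> (Vec<usize>, Vec<usize>) {
    let mut clip_stream = Vec::new();
    let mut clip_ix = Vec::with_capacity(tags.len());
    for (i, tag) in tags.iter().enumerate() {
        if matches!(tag, DrawTag::BeginClip | DrawTag::EndClip) {
            clip_stream.push(i);
        }
        clip_ix.push(clip_stream.len().wrapping_sub(1));
    }
    (clip_stream, clip_ix)
}
```

<p>Note how a run of draw objects enclosed by the same clip all end up with the same clip index, as described above; the clip stage resolves what that index means using the parenthesis-matching results.</p>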
<p>The clip stage then does parenthesis matching and bbox intersection of the clips in the clip stream. When it’s done, it assigns a bounding box to each object in the clip stream, intersecting the bounding boxes of the clip paths that have already been computed by the path processing stage, to produce clip bounding boxes. It also sets the path in <code class="language-plaintext highlighter-rouge">EndClip</code> to refer to the same path as the corresponding <code class="language-plaintext highlighter-rouge">BeginClip</code>.</p>
<p>Thus, the work of the clip stage is proportional to the number of clips in the scene, not to the total number of objects. It would take an enormous number of clips for this work to show up to any significant degree in profiles. We used similar stream compaction techniques to move to a more compact <a href="https://github.com/linebender/piet-gpu/blob/master/doc/pathseg.md">path encoding</a>, and I plan to apply it to other parts of the pipeline as well.</p>
<h2 id="clipping-and-scrolling">Clipping and scrolling</h2>
<p>One application of clipping is to define the viewport of a scrolled view in a UI. This can be represented in piet-gpu as a clip node with a transform node as a direct child, then the scrolled contents as children of the transform node. The translation associated with the transform node controls the scroll position (this architecture could do scaling as well).</p>
<p>A design goal of piet-gpu is that most of these contents can be encoded <em>once</em> and retained, so a new scene with a different scroll position can be reassembled with very little CPU-side work. On the GPU side, there would be a fairly small amount of work to compute clip bounding boxes, which would be able to cull objects quite early in the pipeline.</p>
<p>Obviously this approach will work for moderate scrolling, where it is practical to have all the resources resident on the GPU. For huge scrolled windows, some virtualization is needed, with resources swapped in and out as they scroll into and out of view. Even so, this is an appealing direction to explore, as smooth scrolling is still a challenge for UI toolkits.</p>
<h2 id="related-work">Related work</h2>
<p>This work is perhaps most similar to <a href="https://w3.impa.br/~diego/projects/GanEtAl14/">Massively Parallel Vector Graphics</a>. We both represent the scene as a flattened tree, and allow arbitrary nesting depth. However, their tree algorithm is simpler: for a nesting depth of n, they do n scans, each addressing one level of nesting. This work uses a new algorithm that handles arbitrary nesting depth with no slowdown. (GPU tree algorithms with a work factor proportional to the depth of the tree are not unusual.)</p>
<p>In a more traditional GPU renderer, the general way to do blends is to allocate a temporary texture, render into that, and then composite by drawing a quad into the render target, sampling from the intermediate texture. GPUs are very highly optimized for this sort of work, with hardware support for texture sampling and “raster ops” for compositing, but even so it requires traffic to main memory. I believe it’s faster not to have to do the work at all.</p>
<p>The techniques here are similar to those in Matt Keeter’s <a href="https://www.mattkeeter.com/research/mpr/">Massively Parallel Rendering of Complex Closed-Form Implicit Surfaces</a>. Clipping is basically the same as intersection in constructive geometry (whether 2D or 3D), and that paper uses similar techniques to optimize a tape, taking advantage of algebraic simplifications on a per-region basis. Those techniques are more general, while this work is more specialized to 2D rendering tasks.</p>
<p>It’s fairly dated by now, but Adam Langley’s blog post on <a href="http://neugierig.org/software/chromium/notes/2010/07/clipping.html">clipping in Chromium</a> makes for interesting reading. The main problem being discussed is “conflation artifacts,” which are not fully addressed by the alpha-channel compositing approach to clipping, but even so it remains the standard technique, largely because it’s more or less mandated by the W3C <a href="https://www.w3.org/TR/compositing-1/">compositing and blending</a> spec and the <a href="https://html.spec.whatwg.org/multipage/canvas.html#drawing-model">HTML canvas drawing model</a>.</p>
<p>Do you maintain a 2D renderer with interesting clip support? Are there good writeups somewhere I’ve missed? Let me know, and I’ll be happy to add links.</p>
<h2 id="next-steps">Next steps</h2>
<p>There are some things missing from this blog post, notably performance numbers. Right now, my focus in piet-gpu is to get the architecture right. It feels like that’s converging, and many of the hard problems are being solved.</p>
<p>Now that the infrastructure for clipping is in place, blending should be relatively straightforward. Most of what needs to happen is additional blending logic in the fine rasterization, plus of course plumbing the relevant metadata through the pipeline. That, plus radial and sweep gradients, are the two major pieces needed to support <a href="https://github.com/googlefonts/colr-gradients-spec">COLRv1 emoji</a>, the next big milestone.</p>