Fast Rounded Rectangle Shadows

This demonstrates a shader I came up with for constant-time single-step rounded rectangle drop shadows on the GPU. I became interested in this problem after reading that browsers still render box shadows on the CPU and upload them to the GPU. Rectangular drop shadows are used all over the place in UI design and are an important feature to accelerate, especially given the rise of animation in interface design.

You can pause the animation and toggle between rounded and square corners for comparison:

Fast Rectangle Blur

There's a well-known shortcut for rendering drop shadows of axis-aligned rectangles. It turns out that a 2D drop shadow can be defined as the multiplication of two perpendicular 1D blurred boxes. Since there's a closed-form solution to the convolution of a 1D gaussian with a 1D box (it's just the piecewise integral of a gaussian), this becomes a constant-time single-step rendering algorithm. It looks like this:

// License: CC0 (

// This approximates the error function, needed for the gaussian integral
vec4 erf(vec4 x) {
  vec4 s = sign(x), a = abs(x);
  x = 1.0 + (0.278393 + (0.230389 + 0.078108 * (a * a)) * a) * a;
  x *= x;
  return s - s / (x * x);

// Return the mask for the shadow of a box from lower to upper
float boxShadow(vec2 lower, vec2 upper, vec2 point, float sigma) {
  vec4 query = vec4(point - lower, upper - point);
  vec4 integral = 0.5 + 0.5 * erf(query * (sqrt(0.5) / sigma));
  return (integral.z - integral.x) * (integral.w - integral.y);

The integral of the gaussian f(x) = exp(-x^2 / (2 sigma^2)) / (sigma sqrt(2 pi)) is exactly F(x) = (1 + erf(x / (sigma sqrt(2)))) / 2, where erf() is the error function.

Fast Rounded Rectangle Blur

Unfortunately there's no closed-form solution for a rounded rectangle drop shadow. After a lot of experimentation, I've found the best approach is to use the closed-form solution along the first dimension and sampling along the second dimension. If the sample locations are chosen intelligently, only a small fixed number of samples are needed for a good approximation. Here's a straightforward, unoptimized implementation:

// License: CC0 (

// A standard gaussian function, used for weighting samples
float gaussian(float x, float sigma) {
  const float pi = 3.141592653589793;
  return exp(-(x * x) / (2.0 * sigma * sigma)) / (sqrt(2.0 * pi) * sigma);

// This approximates the error function, needed for the gaussian integral
vec2 erf(vec2 x) {
  vec2 s = sign(x), a = abs(x);
  x = 1.0 + (0.278393 + (0.230389 + 0.078108 * (a * a)) * a) * a;
  x *= x;
  return s - s / (x * x);

// Return the blurred mask along the x dimension
float roundedBoxShadowX(float x, float y, float sigma, float corner, vec2 halfSize) {
  float delta = min(halfSize.y - corner - abs(y), 0.0);
  float curved = halfSize.x - corner + sqrt(max(0.0, corner * corner - delta * delta));
  vec2 integral = 0.5 + 0.5 * erf((x + vec2(-curved, curved)) * (sqrt(0.5) / sigma));
  return integral.y - integral.x;

// Return the mask for the shadow of a box from lower to upper
float roundedBoxShadow(vec2 lower, vec2 upper, vec2 point, float sigma, float corner) {
  // Center everything to make the math easier
  vec2 center = (lower + upper) * 0.5;
  vec2 halfSize = (upper - lower) * 0.5;
  point -= center;

  // The signal is only non-zero in a limited range, so don't waste samples
  float low = point.y - halfSize.y;
  float high = point.y + halfSize.y;
  float start = clamp(-3.0 * sigma, low, high);
  float end = clamp(3.0 * sigma, low, high);

  // Accumulate samples (we can get away with surprisingly few samples)
  float step = (end - start) / 4.0;
  float y = start + step * 0.5;
  float value = 0.0;
  for (int i = 0; i < 4; i++) {
    value += roundedBoxShadowX(point.x, point.y - y, sigma, corner, halfSize) * gaussian(y, sigma) * step;
    y += step;

  return value;

Instruction count can be reduced further using forward differencing and other tricks but I've left that stuff out since it obfuscates the algorithm.

Other Approaches

Drop shadows are defined as the convolution of a circular gaussian blob with the mask of the object receiving the shadow. 2D convolution is prohibitively expensive because each pixel requires O(r2) samples, where r is the radius of the gaussian. Since convolution by a gaussian blob is separable, these blurs are usually done using two successive 1D convolutions, one along x and one along y, which brings the complexity down to O(r). This is much better than naive convolution but is still pretty expensive. GPU blurs can use tricks such as double-sampling using linear interpolation and downsampling before blurring but any texture sampling technique will suffer from the overhead of extra memory, extra fill-rate, and extra render passes.

Drop shadows can also be rendered on the CPU using repeated application of a moving average box-blur to approximate a gaussian blur. This works out to O(1) cost per pixel assuming the blur radius is much less than the size of the image. Every additional box blur pass improves the approximation and in practice stopping after three passes is fine. This is a legitimate approach but the data upload to the GPU ends up becoming a significant bottleneck.