Radiance: A Raymarching Benchmark

01 — Origin

From Weekend Project to GPU Benchmark

This benchmark started as a weekend project to recreate Breakout — the 1976 Atari game — with modern graphics. The twist: instead of traditional rendering, every pixel is calculated through pure mathematics. No shortcuts.

Something unexpected happened: the "simple" game brought an RTX 5090 to its knees. A geometrically trivial game from 1976, rendered with modern techniques, genuinely stresses cutting-edge hardware.

Scene: bricks, ball, paddle. Working set: ~32KB.

The entire scene fits in your GPU's L1 cache. The challenge isn't moving data — it's the sheer amount of math required when you stop making approximations.

This benchmark evolved from a 32KB WebGPU implementation into a DirectX 12 compute shader benchmark. The original plan was straightforward: take 80 bricks, a paddle, and a ball, then render them with physically accurate lighting, real-time soft shadows, and full raymarched geometry. No texture maps, no pre-baked illumination, no shortcuts — just pure mathematics calculating every pixel from first principles.

The computational payload is remarkably small:

Data	Size
Brick data	1,280 bytes
Debris data	30,720 bytes
Constants	512 bytes
Total working set	~32KB

This fits comfortably in L1 cache on modern GPUs. This benchmark tests computational throughput and execution efficiency, not VRAM memory bandwidth. The GPU spends time computing distance fields rather than waiting for memory fetches.
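The working-set figures can be checked with a few lines of arithmetic. This is only a sketch: the per-element sizes (one 16-byte float4 per brick, three float4 values per debris particle) are assumptions inferred from the quoted totals, not taken from the benchmark source.

```python
# Working-set arithmetic for the figures quoted above.
# ASSUMPTION: one float4 (16 bytes) per brick and three float4s
# (position, velocity, color) per debris particle.
FLOAT4_BYTES = 16

brick_bytes = 80 * FLOAT4_BYTES           # 80 bricks
debris_bytes = 640 * 3 * FLOAT4_BYTES     # 640 particles
constant_bytes = 512                      # constants block, as quoted

total_bytes = brick_bytes + debris_bytes + constant_bytes
print(brick_bytes, debris_bytes, constant_bytes, total_bytes)
print(f"{total_bytes / 1024:.2f} KiB")
```

Under these assumptions the total is 32,512 bytes, just under 32 KiB.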

Note: For applications that stream gigabytes of textures or process massive datasets, memory bandwidth becomes critical. AI inference benchmarks or professional rendering applications with large texture sets are better suited for testing memory subsystem performance.

02 — Controls

Mission Control Panel

All benchmark settings are accessible through the Mission Control panel window. You can also use keyboard shortcuts, but the panel provides the easiest way to configure and run the benchmark.

Quick Start

Click "4-Run Benchmark" to execute the standard automated test. The benchmark handles everything — warmup, data collection, and report generation.

Resolution Presets

Native, 4K, 2.5K, 1080p, 720p, or 480p. Higher resolutions stress the GPU more; 480p shows what future hardware could achieve.

Debris Presets

MAX (8 x total bricks in level), 1/2, 1/4, or 1/8. More debris = more computation per pixel. Disable for playable framerates on current hardware.

Autoplay

Enable AI paddle control for hands-free benchmarking. The AI intercepts most balls automatically.

Visual Adjustments

Sliders for brightness, ball glow, and ray steps (1–1024). Level selection buttons let you test all 6 included levels. These settings affect visual quality and computational load.

The Mission Control panel provides GUI access to all benchmark parameters. Every keyboard shortcut has an equivalent control here.

Benchmark Execution

"4-Run Benchmark" executes the full protocol: warmup run (discarded), then 4 data collection runs at 60 seconds each with 10Hz telemetry. Generates HTML report on completion.

Resolution Presets

Native, 4K (3840×2160), 2.5K (2560×1440), 1080p, 720p, 480p. Thread count scales with pixel count, so quadratically with linear resolution: 4K dispatches 8.3M threads versus 0.3M at 480p.

Debris Presets

MAX (8 x total bricks in level), 1/2, 1/4, 1/8. Each particle adds distance calculations to every raymarching step. The debris loop is the primary performance scaling factor.

Autoplay Toggle

Enables deterministic AI paddle control. Algorithm: find lowest descending ball, move paddle toward ball.x at 20% of position difference per frame. CPU overhead <0.1%.

Autoplay Algorithm

The autoplay AI is intentionally simplistic but effective for benchmark reproducibility:

Every frame:
1. Find the lowest active ball with negative y-velocity (descending)
2. Calculate that ball's horizontal position (ball.x)
3. Move paddle toward that position at 20% of the difference

This creates smooth, natural-looking movement that intercepts most balls. The AI doesn't look ahead, so fast-moving balls at steep angles can sometimes escape. The algorithm runs on the CPU with negligible overhead.
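The three steps above can be sketched on the CPU as follows. This is a minimal illustration, not the benchmark's own code: the `Ball` type, the `autoplay_step` name, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Ball:
    # Hypothetical ball state; field names are illustrative only.
    x: float
    y: float
    vy: float
    active: bool = True


FOLLOW_RATE = 0.2  # move 20% of the remaining gap each frame


def autoplay_step(paddle_x: float, balls: Sequence[Ball]) -> float:
    """Return the paddle x-position for the next frame."""
    # 1. Keep only active balls that are descending (negative y-velocity).
    descending = [b for b in balls if b.active and b.vy < 0.0]
    if not descending:
        return paddle_x  # nothing to chase; hold position
    # 2. Pick the lowest of those balls.
    target = min(descending, key=lambda b: b.y)
    # 3. Close 20% of the horizontal gap.
    return paddle_x + FOLLOW_RATE * (target.x - paddle_x)


balls = [Ball(10.0, 5.0, -1.0), Ball(-4.0, 2.0, -1.0), Ball(3.0, 1.0, 1.0)]
print(autoplay_step(0.0, balls))
```

Because the paddle only closes a fraction of the gap each frame, it eases toward the ball instead of snapping to it, which is also why a fast ball at a steep angle can get past it.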

Visual Parameter Sliders

Brightness: Global light intensity multiplier (affects shadow visibility)
Ball Glow: Emissive intensity of ball light sources (affects colored lighting on surfaces)
Ray Steps: Maximum raymarching iterations per primary ray (1–1024). Default 128 balances quality/performance; 1024 for maximum precision
Level Selection: 6 levels with different brick arrangements, each providing varied complexity curves during destruction

03 — Architecture

How Raymarching Works

Traditional games draw triangles. This benchmark doesn't use triangles at all — it calculates everything mathematically.

For each pixel on your screen, the GPU shoots a virtual "ray" into the scene. Instead of checking if that ray hits a triangle, the benchmark uses mathematical equations to determine the distance to every object. It "marches" the ray forward step by step until it finds a surface.

Sphere Tracing (Raymarching) Visualization

This technique is called sphere tracing. At each step, the math tells us "the nearest surface is at least X units away," so we can safely jump forward by that distance. Repeat until we hit something or escape into empty space.
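The marching loop described above can be sketched in a few lines. This is an illustrative Python port under stated assumptions: the scene is a single sphere of radius 0.7 at the origin, and the thresholds (0.001 hit distance, 100.0 escape distance, 128 steps) mirror the values quoted elsewhere in this document. The function names are hypothetical.

```python
import math


def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=0.7):
    """Distance from p to a sphere surface (negative inside)."""
    return math.dist(p, center) - radius


def sphere_trace(origin, direction, sdf, max_steps=128,
                 hit_eps=0.001, max_dist=100.0):
    """March a ray; return (hit, distance travelled, steps used)."""
    t = 0.0
    steps = 0
    for _ in range(max_steps):
        steps += 1
        p = tuple(o + d * t for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < hit_eps:       # close enough: treat as a surface hit
            return True, t, steps
        t += dist                # safe jump: nothing is nearer than dist
        if t > max_dist:         # escaped into empty space
            break
    return False, t, steps


print(sphere_trace((0.0, 0.0, 48.0), (0.0, 0.0, -1.0), sphere_sdf))
print(sphere_trace((0.0, 0.0, 48.0), (0.0, 1.0, 0.0), sphere_sdf))
```

A ray aimed straight at the sphere lands in very few steps because each jump covers the full safe distance. Rays that graze surfaces take many small steps, which is where the cost comes from.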

Signed Distance Fields Explained

A signed distance field (SDF) is a mathematical function that takes a point in 3D space and returns the distance to the nearest surface. For a sphere centered at position C with radius R:

distance = length(point - center) - radius

If the result is negative, you're inside the sphere. If positive, you're outside, and the magnitude tells you exactly how far away the surface is. The power of SDFs comes from combining them — the scene SDF evaluates every object and returns the minimum distance.
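As a small sketch of the sign convention and the minimum-distance combination, the snippet below evaluates two spheres. The sphere positions and the function names are made up for illustration.

```python
import math


def sd_sphere(p, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(p, center) - radius


def scene_sdf(p, spheres):
    """Union of shapes: the scene distance is the minimum over all objects."""
    return min(sd_sphere(p, c, r) for c, r in spheres)


# Two illustrative spheres: (center, radius)
spheres = [((0.0, 0.0, 0.0), 1.0), ((5.0, 0.0, 0.0), 2.0)]

print(sd_sphere((0.0, 0.0, 0.0), *spheres[0]))   # at the center: inside
print(scene_sdf((2.5, 0.0, 0.0), spheres))       # between the two spheres
```

At the point (2.5, 0, 0) the first sphere is 1.5 units away and the second 0.5 units away, so the scene reports 0.5: the nearest surface wins.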

Sphere Tracing Algorithm

Traditional vs. Raymarching Pipeline

Traditional Rasterization

1. Vertex Shader transforms coordinates
2. Rasterizer determines pixel coverage
3. Pixel Shader applies textures/lighting
4. Output Merger composites to framebuffer

Raymarching (This Benchmark)

1. Compute Shader dispatches 1 thread/pixel
2. Each thread marches ray through scene SDF
3. On hit: trace shadow rays, calculate lighting
4. Write directly to UAV texture

At 4K resolution (3840×2160), this dispatches 8,294,400 threads organized in 8×8 tiles. Each thread calculates one pixel by marching rays up to 128 steps (default) or 1,024 at maximum quality.
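The thread counts follow directly from the 8×8 tile size. The sketch below reproduces the arithmetic on the host side. The 640×480 size for the 480p preset and the 1366×768 example are assumptions chosen for illustration.

```python
TILE = 8  # threads per group along each axis: [numthreads(8, 8, 1)]


def dispatch_size(width, height, tile=TILE):
    """Return (groups_x, groups_y, total threads launched)."""
    groups_x = (width + tile - 1) // tile   # round up to whole tiles
    groups_y = (height + tile - 1) // tile
    return groups_x, groups_y, groups_x * groups_y * tile * tile


resolutions = {
    "4K": (3840, 2160),
    "1080p": (1920, 1080),
    "480p": (640, 480),        # ASSUMPTION: 480p preset is 640x480
    "1366x768": (1366, 768),   # hypothetical non-multiple-of-8 width
}
for name, (w, h) in resolutions.items():
    gx, gy, threads = dispatch_size(w, h)
    print(f"{name}: {gx}x{gy} groups, {threads:,} threads, {w * h:,} pixels")
```

When a dimension is not a multiple of 8, a few more threads launch than there are pixels. A bounds check at the top of the kernel makes those extra threads exit immediately.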

04 — Stress Points

What This Benchmark Tests

This benchmark pushes your GPU's math processing power to the limit. Unlike most games that spend time moving textures around, Radiance keeps your GPU busy with pure calculations.

Compute Power

Sustained floating-point math: square roots, dot products, and distance calculations — thousands per pixel.

Decision Making

Adjacent pixels can take completely different paths through the code, testing your GPU's ability to handle varied workloads.

Important: This benchmark generates significant heat. Ensure your cooling is adequate and that all power cables (especially 12VHPWR) are properly seated.

Primary Stress Points

FP32 Compute Units

Sustained floating-point arithmetic: square root operations (vector length), dot products (lighting), absolute values and component-wise operations (box SDF), min/max operations (combining SDFs). Compute units stay saturated.

Branch Prediction / Divergence

Adjacent pixels can take wildly different execution paths. Pixel (0,0) might hit a brick after 5 march steps. Pixel (0,1) might march the full 128-step default budget and find nothing. Yet these threads execute in the same SIMD group (warp/wavefront), so the whole group waits for its slowest thread. Architectures with better divergence handling show measurable advantages.

Instruction Cache

Procedural shader code with multiple material types and lighting paths creates a larger instruction footprint than simple benchmarks with tiny loops. The I-cache must hold the entire shader and prefetch efficiently as execution branches.

Register File

Each thread maintains ray position (3 floats), direction (3 floats), accumulated color (3 floats), scene evaluation results (4 floats), and temporaries. This creates register pressure limiting simultaneous in-flight threads. GPUs with larger register files maintain higher occupancy.

Per-Pixel Computational Cost

A single pixel with full complexity might trace:

Stage	Scene Evaluations
Primary ray steps	100
Global shadow steps	80
Ball shadow steps	4×60 = 240
Total	420

Each evaluation loops through 80 bricks and up to 640 particles. Multiply by 8.3 million pixels at 4K.

What This Benchmark Does NOT Test

RT Cores / Ray Accelerators (Idle)

Fixed-function triangle-ray intersection units. SDFs are mathematical functions, not triangles — RT Cores cannot evaluate arbitrary equations.

Tensor Cores / AI Accelerators (Idle)

Matrix multiplication accelerators for FP16/INT8. This benchmark performs scalar floating-point arithmetic on individual values and small vectors. No matrix operations.

Traditional Graphics Pipeline (Idle)

Vertex shaders, hull/domain/geometry shaders, rasterizer, and output merger all sit idle. The benchmark provides no information about their performance.

Historical Parallel: RT Cores echo the fixed-function T&L hardware that programmable shaders displaced in the early 2000s. The GeForce 256's hardware T&L was faster for specific operations, but programmable shaders enabled techniques fixed-function hardware couldn't support. RT Cores are likewise specialized hardware for common operations, and they cannot handle workloads outside their designed scope.

05 — The Code

Shader Architecture

The entire rendering engine fits in approximately 400 lines of HLSL across two compute kernels. Complete source is provided with the benchmark.

Signed Distance Functions

The benchmark defines surfaces using mathematical equations. For a sphere, it's the distance from any point to the center, minus the radius. For a box, it measures how far the point lies beyond the box's half-extents on each axis and combines those per-axis values.

sdBox() — Calculates distance to a rectangular solid (bricks, paddle, walls)

length(p - center) - radius — Distance to a sphere (balls)

The scene SDF evaluates all objects and returns the minimum — "the nearest surface is this far away."

HLSL Source

float sdBox(float3 p, float3 b) {
    float3 q = abs(p) - b;
    return length(max(q, 0.0)) 
         + min(max(q.x, max(q.y, q.z)), 0.0);
}

// Ball distance (inline)
float dBall = length(p - ballPos.xyz) - 0.7;

// Scene combines all objects
float4 mapScene(float3 p) {
    float res = mapOpaque(p, true, false, ...);
    // Check each ball
    [unroll]
    for (uint i = 0; i < MAX_BALLS; ++i) {
        if (state.ballPos[i].z > 0.0) {
            float dBall = length(p - state.ballPos[i].xyz) - 0.7;
            if (dBall < res) {
                res = dBall;
                closestMat = 2.0; // Ball material
            }
        }
    }
    return float4(res, closestMat, auxY, auxZ);
}

Line-by-Line Explanation

// Box SDF using standard formula
// q = signed distance on each axis
// For points outside: Euclidean distance to the nearest face, edge, or corner
// For points inside: negative distance to the nearest face


// Sphere is simple: distance to center minus radius
// Negative = inside, Positive = outside

// Scene SDF returns minimum distance across all objects
// This is the fundamental SDF combination operator

// [unroll] hint: compiler unrolls loop for performance
// MAX_BALLS = 12 (compile-time constant)

// ballPos[i].z > 0 means ball is active
// (w component often used for active flag)

// Ball radius is 0.7 units

// Track which material is closest for shading
// Material 2 = emissive ball surface


// Return: (distance, material, aux1, aux2)
// aux values carry material-specific data
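To make the box formula concrete, here is a Python port of sdBox() evaluated at three points around a unit cube. It is a sketch for checking the math by hand. The test points are illustrative and not taken from the benchmark.

```python
import math


def sd_box(p, b):
    """Signed distance from point p to an origin-centered box with half-extents b."""
    # q: how far p lies beyond the box faces on each axis (negative = within)
    q = [abs(pi) - bi for pi, bi in zip(p, b)]
    # Outside part: length of the positive components of q
    outside = math.sqrt(sum(max(qi, 0.0) ** 2 for qi in q))
    # Inside part: largest (least negative) component, only when fully inside
    inside = min(max(q), 0.0)
    return outside + inside


half = (1.0, 1.0, 1.0)
print(sd_box((3.0, 0.0, 0.0), half))   # beyond one face
print(sd_box((2.0, 2.0, 1.0), half))   # beyond an edge
print(sd_box((0.0, 0.0, 0.0), half))   # at the center
```

The first point is 2 units from a face, the second is sqrt(2) from an edge, and the center is 1 unit inside, so it returns -1.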

Raymarching Loop

Each pixel fires a ray from the camera. The ray steps forward repeatedly, asking "how far to the nearest surface?" each time. When that distance drops below 0.001 units, we've hit something.

128 steps — Default maximum iterations per ray

1,024 steps — Maximum quality setting

More steps = more accurate surfaces, but worst-case computation grows in proportion to the step limit.

HLSL Source

[numthreads(8, 8, 1)]
void renderCS(uint3 id : SV_DispatchThreadID) {
    uint2 dims;
    outputTex.GetDimensions(dims.x, dims.y);
    if (id.x >= dims.x || id.y >= dims.y) return;

    float2 uv = (float2(id.xy) - float2(dims) * 0.5) 
                / (float)dims.y;
    
    float3 ro = float3(0.0, 0.0, 48.0);  // Camera
    float3 rd = normalize(float3(uv.x, -uv.y, -1.2));

    float t = 0.0;
    bool hit = false;
    int limit = (int)state.maxRaySteps;
    
    [loop]
    for (int i = 0; i < limit; ++i) {
        float4 res = mapScene(ro + rd * t);
        if (res.x < 0.001) { 
            hit = true; 
            mData = res; 
            break; 
        }
        t += res.x;  // Step by safe distance
        if (t > 100.0) break;  // Escaped
    }
    // ... lighting calculations follow
}

Line-by-Line Explanation

// 8×8 = 64 threads per group (vendor-neutral)
// Divides evenly by 16, 32, 64 for all architectures

// Get output texture dimensions dynamically
// Bounds check: threads outside texture do nothing


// uv.y spans [-0.5, 0.5]; uv.x spans ±0.5 × aspect ratio
// Aspect ratio preserved (divide by height only)

// Camera positioned at z=48, looking toward origin
// -1.2 z-component sets ~45° vertical field of view (~80° diagonal at 16:9)

// t = total distance traveled along ray
// limit = 128 default, 1024 max quality

// [loop] hint: don't unroll (variable iteration count)

// mapScene returns (distance, material, aux, aux)
// 0.001 = hit threshold (surface found)
// Store material data for shading


// Sphere tracing: step by SDF distance (safe)
// 100.0 = max ray distance (escaped scene)

// If hit: calculate normals, shadows, lighting
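The ray setup in renderCS can be reproduced on the CPU to see what field of view the -1.2 z-component implies. This is a sketch: the helper names are hypothetical, and the 16:9 aspect ratio is an assumption used only for the horizontal and diagonal figures.

```python
import math

FOCAL = 1.2  # magnitude of the z-component in the ray direction


def camera_ray(px, py, width, height):
    """Normalized ray direction for pixel (px, py), mirroring the shader math."""
    u = (px - width * 0.5) / height    # divide by height to keep aspect ratio
    v = (py - height * 0.5) / height
    d = (u, -v, -FOCAL)
    n = math.sqrt(sum(c * c for c in d))
    return tuple(c / n for c in d)


def fov_degrees(half_extent):
    """Full field-of-view angle for a given half-extent on the image plane."""
    return 2.0 * math.degrees(math.atan(half_extent / FOCAL))


aspect = 16 / 9  # ASSUMPTION: widescreen output
vertical = fov_degrees(0.5)
horizontal = fov_degrees(0.5 * aspect)
diagonal = fov_degrees(0.5 * math.hypot(1.0, aspect))
print(f"vertical {vertical:.1f}, horizontal {horizontal:.1f}, diagonal {diagonal:.1f}")
print(camera_ray(960, 540, 1920, 1080))  # center pixel looks straight down -z
```

Under these assumptions the vertical field of view is about 45 degrees and the diagonal is about 80 degrees.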

Soft Shadow Calculation

When a ray hits a surface, the shader traces additional rays toward each light source. These "shadow rays" determine how much light reaches that point.

1 global light ray — Main scene illumination

Up to 4 ball light rays — Each glowing ball casts colored light

Objects that almost block the light create partial shadows, giving the soft, realistic penumbras you see in the benchmark.

HLSL Source

float2 softShadowSteps(
    float3 ro, float3 rd, 
    float k, float maxDist, 
    float noise, 
    bool includePaddle, 
    bool isBallShadow
) {
    float res = 1.0;
    float t = 0.05 + (noise * 0.02);
    float steps = 0.0;
    
    [loop]
    for (uint i = 0; i < 160; ++i) {
        steps += 1.0;
        float h = mapOpaque(ro + rd * t, ...);
        res = min(res, k * h / t);
        t += clamp(h, 0.02, 0.5);
        if (res < 0.001 || t > maxDist) 
            break;
    }
    return float2(clamp(res, 0.0, 1.0), steps);
}

Line-by-Line Explanation

// Returns (shadow factor, step count)
// ro = shadow ray origin (surface + normal offset)
// rd = direction toward light source
// k = penumbra softness (8.0 = soft shadows)
// maxDist = distance to light
// noise = per-pixel jitter to reduce banding
// includePaddle = whether paddle casts shadows
// isBallShadow = special handling near balls

// res = 1.0 means fully lit
// Start slightly offset to avoid self-intersection
// Track steps for performance analysis

// Up to 160 steps per shadow ray

// Evaluate scene SDF at current position
// Soft shadow formula: smaller h/t = darker shadow
// Near-misses create penumbra effect
// Clamp step size for stability
// res < 0.001 = fully shadowed, early exit
// t > maxDist = reached light, done

// Return shadow factor [0,1] and diagnostic steps
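The penumbra behavior is easier to see with numbers. Below is a Python sketch of the same loop against a single spherical occluder. The occluder, the ray directions, and the omission of the noise jitter and mapOpaque flags are simplifications made for illustration.

```python
import math


def occluder(p, center=(0.0, 5.0, 0.0), radius=1.0):
    """Illustrative scene: one sphere between the surface and the light."""
    return math.dist(p, center) - radius


def soft_shadow(ro, rd, sdf, k=8.0, max_dist=20.0, max_steps=160):
    """Return (shadow factor in [0, 1], steps used); 1.0 means fully lit."""
    res = 1.0
    t = 0.05                       # small offset avoids self-intersection
    steps = 0
    for _ in range(max_steps):
        steps += 1
        p = tuple(o + d * t for o, d in zip(ro, rd))
        h = sdf(p)
        res = min(res, k * h / t)  # near-misses darken the result
        t += min(max(h, 0.02), 0.5)
        if res < 0.001 or t > max_dist:
            break
    return min(max(res, 0.0), 1.0), steps


origin = (0.0, 0.0, 0.0)
n = math.hypot(0.3, 1.0)
print(soft_shadow(origin, (0.0, 1.0, 0.0), occluder))          # blocked
print(soft_shadow(origin, (0.3 / n, 1.0 / n, 0.0), occluder))  # near miss
print(soft_shadow(origin, (1.0, 0.0, 0.0), occluder))          # clear path
```

The blocked ray returns a factor near 0, the clear ray returns 1, and the near miss lands in between. That in-between value is the penumbra.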

Physics Kernel (Debris Simulation)

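The physics kernel simulates debris particles using explicit time integration (semi-implicit Euler: velocity is updated first, then position). Professional finite element analysis codes like LS-DYNA belong to the same explicit family, using central-difference schemes.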

640 particles max — 80 bricks × 8 debris per brick

Each particle has position, velocity, rotation, and color. Collisions with walls, floor, and paddle are handled with restitution coefficients.

HLSL Source

[numthreads(64, 1, 1)]
void updatePhysics(uint3 id : SV_DispatchThreadID) {
    uint idx = id.x;
    if (idx >= (uint)state.debrisCount) return;

    Particle p = debris[idx];
    if (p.pos.w < 0.5) return;  // Inactive
    
    // Explicit Euler integration
    p.vel.y -= 0.015;  // Gravity
    p.pos.x += p.vel.x;
    p.pos.y += p.vel.y;
    p.pos.z += p.vel.z;
    p.color.a += p.vel.w;  // Rotation
    p.vel *= 0.98;  // Drag
    
    // Wall collisions (restitution 0.6)
    if (p.pos.x > 19.5) { 
        p.pos.x = 19.5; 
        p.vel.x *= -0.6; 
    }
    // ... similar for other walls
    
    // Floor collision (restitution 0.4)
    if (p.pos.y < -14.5) {
        p.pos.y = -14.5;
        p.vel.y = -p.vel.y * 0.4;
        p.vel.x *= 0.8;  // Friction
    }
    
    debris[idx] = p;
}

Line-by-Line Explanation

// 64 threads: matches all vendor SIMD widths
// One thread per particle

// Early exit if beyond active particle count

// Read particle from structured buffer
// pos.w = active flag (1.0 = active)

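// Explicit time integration (semi-implicit Euler:
// velocity is updated first, then position)
// Same explicit family as the central-difference
// schemes in LS-DYNA, Abaqus/Explicit
// new_pos = old_pos + new_velocity * dt
// (dt = 1.0 implicit in coefficients)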

// vel.w = angular velocity for visual spin
// Linear drag coefficient

// Walls at x = ±19.5
// Coefficient of restitution: 0.6
// Particle keeps 60% of its normal speed on bounce
// (about 36% of that kinetic energy)


// Floor at y = -14.5
// More inelastic than walls (0.4)
// Additional friction on horizontal velocity
// Simulates rough surface contact

// Write back to UAV buffer
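A 2D Python sketch of one physics step shows the order of operations and the effect of the restitution coefficients. The constants come from the HLSL above. The `Particle` fields, the `step` name, and the mirrored left wall are assumptions, since the source only shows the right wall.

```python
from dataclasses import dataclass

GRAVITY = 0.015
DRAG = 0.98
WALL_X = 19.5
FLOOR_Y = -14.5
WALL_RESTITUTION = 0.6
FLOOR_RESTITUTION = 0.4
FLOOR_FRICTION = 0.8


@dataclass
class Particle:
    # Hypothetical 2D particle; the shader stores float4 pos/vel/color.
    x: float
    y: float
    vx: float
    vy: float


def step(p: Particle) -> Particle:
    """Advance one frame (dt = 1): velocity first, then position."""
    vy = p.vy - GRAVITY
    x = p.x + p.vx
    y = p.y + vy
    vx = p.vx * DRAG
    vy = vy * DRAG
    if x > WALL_X:                 # right wall, as in the source
        x, vx = WALL_X, vx * -WALL_RESTITUTION
    elif x < -WALL_X:              # ASSUMPTION: left wall mirrors the right
        x, vx = -WALL_X, vx * -WALL_RESTITUTION
    if y < FLOOR_Y:                # floor bounce with friction
        y = FLOOR_Y
        vy = -vy * FLOOR_RESTITUTION
        vx *= FLOOR_FRICTION
    return Particle(x, y, vx, vy)


print(step(Particle(0.0, 0.0, 1.0, 0.0)))
print(step(Particle(0.0, -14.49, 0.5, -0.1)))
print(f"energy kept after a wall bounce: {WALL_RESTITUTION ** 2:.2f}")
```

Kinetic energy scales with speed squared, so a restitution of 0.6 keeps 36% of the bounce energy.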

06 — Industry Relevance

Correlation With Unreal Engine 5 Lumen

Lumen's software ray tracing mode uses signed distance fields generated from triangle meshes. The global illumination system shoots rays through these distance fields using compute shaders — the same computational pattern this benchmark uses.

Ray Budget Comparison

Parameter	UE5 Lumen (60 FPS target)	Radiance (Full Quality)
Primary rays per pixel	1–2	1
Steps per ray	20–40 max	128–1,024
Shadow rays per pixel	0–1	5 (1 global + 4 balls)
Steps per shadow ray	~20	160
Temporal reprojection	8–16 frames	None
Effective ray budget multiplier	1× (baseline)	~50–100×

Lumen's Two Rendering Paths

Software Lumen (Compute Shader)

Pure compute shader raymarching against SDFs — exactly what this benchmark does. Runs on any GPU without RT Cores.

This benchmark directly tests this code path.

Hardware Lumen (RT Core Accelerated)

Traces rays against triangle geometry held in BVH acceleration structures, using RT Cores for traversal and intersection tests instead of marching distance fields.

Faster when available, but tests a different computational pathway.

07 — Performance

Performance Expectations

Here's the reality: even an RTX 5090 cannot run this benchmark at high resolutions with all features enabled. This isn't a bug — it's the computational load that this benchmark presents.

Playable

Debris OFF, 1080p to 2.5K

With physics debris disabled, modern flagship GPUs achieve playable framerates at high resolutions. The image quality is striking — zero aliasing, soft shadows, pure mathematical precision. Frame rates actually improve as you destroy bricks (fewer objects to calculate).

Playable

Debris ON, 480p

If you want to see debris, you'll need to run at 480p. With 640 debris particles, current hardware is still pushed to its limits. This configuration shows what future GPUs will render in real-time at higher resolutions — a glimpse of graphics quality when hardware catches up.

Demanding

Debris ON, 1080p

The standard benchmark configuration. Performance starts strong but degrades as debris accumulates. Expect 30 FPS initially, dropping to 3 FPS at peak complexity. This is the intended stress test.

Why does debris matter so much? Each of the 640 particles must be checked during every raymarching step, for every pixel, for every ray. A single pixel might perform 420 scene evaluations × 640 particles = 268,800 distance calculations. Multiply by 8.3 million pixels at 4K.

The benchmark's computational cost scales dramatically with configuration. Understanding these scaling factors helps interpret results correctly.

Ray Budget Analysis

Configuration	Pixels	Scene Evals/Pixel	Total Evals/Frame
480p, Debris OFF	0.3M	~180 (primary + shadows)	~54M
480p, Debris ON (640)	0.3M	~420 × 640 loop iterations	~80B
4K, Debris OFF	8.3M	~180	~1.5B
4K, Debris ON (640)	8.3M	~420 × 640 loop iterations	~2.2T

The debris loop is the dominant factor. Each additional particle adds distance calculations to every raymarching step. This creates a multiplicative relationship: doubling debris count roughly doubles per-pixel cost, regardless of resolution.
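The table values can be reproduced with straightforward multiplication. The sketch below uses the per-pixel figures quoted above (about 180 evaluations with debris off, about 420 evaluations times 640 particles with debris on) and assumes the 480p preset is 640×480.

```python
PIXELS = {
    "480p": 640 * 480,     # ASSUMPTION: 480p preset is 640x480
    "4K": 3840 * 2160,
}
EVALS_DEBRIS_OFF = 180     # primary + shadow steps, per the table
EVALS_DEBRIS_ON = 420      # scene evaluations per pixel at full complexity
PARTICLES = 640            # debris loop iterations per evaluation


def evals_per_frame(resolution: str, debris_on: bool) -> int:
    """Approximate evaluations (or debris loop iterations) per frame."""
    per_pixel = EVALS_DEBRIS_ON * PARTICLES if debris_on else EVALS_DEBRIS_OFF
    return PIXELS[resolution] * per_pixel


for res in PIXELS:
    for debris_on in (False, True):
        label = "ON" if debris_on else "OFF"
        print(f"{res}, debris {label}: {evals_per_frame(res, debris_on):,}")
```

The exact products are roughly 55 million, 83 billion, 1.5 billion, and 2.2 trillion, which match the rounded figures in the table.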

Performance Envelope

Playable

Debris Disabled, 1080p to 2.5K

RTX 5090 achieves 30–60+ FPS depending on scene complexity and resolution. Frame rate increases as bricks are destroyed (fewer SDF evaluations). This demonstrates the "future vision": what games could look like when rendered with mathematical precision instead of triangle approximations. Multiple light sources can be used in this configuration.

Playable

Debris Enabled, 480p

Even at minimal resolution, the debris evaluation loop dominates. This configuration exists to stress-test compute throughput independent of resolution scaling. Results here correlate with scientific computing workloads that are similarly compute-bound.

Demanding

Debris Enabled, 1080p

The default benchmark configuration. Initial performance is acceptable but degrades as debris accumulates. The performance curve itself is diagnostic — thermal throttling shows as progressive degradation across runs; healthy systems maintain consistent per-complexity performance.

DLSS / FSR / Frame Generation

Note on AI upscaling: Tensor Cores remain available during benchmark execution — we don't use them for rendering. This means DLSS, FSR, or frame generation could theoretically be applied. We have no way to detect frame generation or assess whether the generated frames are "good enough" for benchmarking purposes, although bright lights on a dark background should be challenging.

What This Demonstrates

The performance gap between "debris off" and "debris on" reveals something important about the future of graphics. With debris disabled, the benchmark shows what pixel-perfect rendering looks like — zero aliasing, physically accurate shadows, mathematical precision that triangle-based rendering cannot match. The visual quality is immediately distinguishable from traditional rendering.

With debris enabled, the benchmark shows what that quality costs. Current hardware cannot sustain it at high resolutions. But hardware improves. What requires 480p today will run at 4K eventually. The benchmark captures this trajectory.

08 — Running

Running the Benchmark

Quick Start

1. Download and extract Radiance
2. Run Radiance.exe
3. Click "4-Run Benchmark" in the Mission Control panel (Simple or Full)
4. Wait ~5 minutes for results

Default Configuration

Resolution: 1920×1080
Ray Steps: 128
Debris Particles: 640
Benchmark Duration: 4 runs × 60 seconds each

Benchmark Protocol

The four-run protocol executes:

Run 0: Warmup (not recorded)
Runs 1–4: 60 s each at 10 Hz telemetry

Each run uses different random seeds, creating variation in ball trajectory and brick destruction order. This serves multiple purposes:

Thermal Throttling Detection: If performance degrades progressively across runs, thermal issues are likely.
Pattern Breaking: Additional balls and a fixed runtime keep the game from settling into a repeating trajectory loop.
Non-Determinism: Each run produces slightly different complexity curves as bricks are destroyed in different orders.

Preventing Driver Optimization

The benchmark employs several defenses against driver pattern recognition:

Non-Deterministic Physics: Random number generator ensures different initial conditions each execution.
Variable Complexity: Brick destruction order depends on ball trajectory, creating unique scene progression each time.
Multiple Runs: Four independent runs with different random seeds create completely different computational sequences.
Transparent Source: Any optimization benefiting this benchmark would also benefit real raymarching applications.

Architecture Fairness

The 8×8 thread groups (64 threads) are a multiple of 16, 32, and 64, so every SIMD group is fully populated across:

NVIDIA: 32-wide warps (64 ÷ 32 = 2 warps)
AMD: 64-wide wavefronts (64 ÷ 64 = 1 wavefront), or 32-wide wavefronts on RDNA (64 ÷ 32 = 2 wavefronts)
Intel Arc: Variable widths (64 ÷ 16 = 4 groups, 64 ÷ 32 = 2 groups)
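The divisibility argument can be checked mechanically. The sketch below also includes a hypothetical 8×6 group for contrast, which is not a configuration the benchmark uses.

```python
SIMD_WIDTHS = (16, 32, 64)


def simd_groups(group_threads: int, width: int):
    """Return (SIMD groups needed, idle lanes in the last group)."""
    groups = -(-group_threads // width)        # ceiling division
    idle = groups * width - group_threads
    return groups, idle


for label, threads in (("8x8", 8 * 8), ("8x6 (hypothetical)", 8 * 6)):
    for width in SIMD_WIDTHS:
        groups, idle = simd_groups(threads, width)
        print(f"{label}: width {width} -> {groups} group(s), {idle} idle lanes")
```

With 64 threads no lane is left idle at any of the three widths. The 48-thread example wastes 16 lanes on both 32-wide and 64-wide hardware.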

09 — FAQ

Frequently Asked Questions

Why is my expensive GPU struggling with a simple game?

Because we removed all the shortcuts. Normal games use pre-calculated lighting, simplified physics, and texture tricks. This benchmark calculates everything mathematically from scratch — every pixel, every frame. It's like the difference between driving to work and calculating your route using orbital mechanics.

Should I enable debris particles?

For standard benchmarking, yes — debris is enabled by default (80 bricks x 8 particles per brick = 640 particles). Disabling debris makes the benchmark much easier on your GPU but removes a significant portion of the computational load.

My GPU is running hot. Is this normal?

Yes. This benchmark pushes sustained computational load. If your GPU is thermal throttling, the benchmark will show it — frame rates will drop during later runs. Ensure your cooling is adequate and your 12VHPWR cables are properly seated.

What's the difference between this and 3DMark?

3DMark tests the traditional graphics pipeline — triangles, textures, rasterization. This benchmark tests pure compute shader performance through raymarching. They measure different capabilities. A GPU could score well on 3DMark but struggle here, or vice versa.

Will my RT Cores / ray tracing hardware help?

No. RT Cores accelerate triangle-ray intersection using specialized hardware. This benchmark uses mathematical equations (signed distance fields), not triangles. RT Cores sit completely idle during execution — they simply can't process this type of workload.

Why does this matter for real games?

Unreal Engine 5's Lumen global illumination uses the same raymarching technique when RT Cores aren't available. Good performance here suggests good performance in UE5 games with software-based lighting. It's testing techniques that shipping games already use.

Why can't this benchmark use RT Cores?

RT Cores are fixed-function units for BVH traversal and ray-triangle intersection (the job the Möller–Trumbore algorithm does in software). They're optimized for querying BVH acceleration structures built over triangle meshes. Signed distance fields are mathematical functions: you cannot load length(p - center) - radius into an RT Core. The hardware has no mechanism to evaluate arbitrary functions; it only knows triangles and bounding boxes.

Could you convert SDFs to proxy geometry for RT Core acceleration?

Yes. UE5 Lumen's hardware path makes a comparable trade by tracing triangle geometry in a BVH instead of marching distance fields. But it would fundamentally change what the benchmark measures. You'd need to convert SDFs to meshes (preprocessing), build BVH structures (per-frame for moving debris), then validate hits against original SDFs (hybrid approach). For 640 moving particles requiring constant BVH rebuilds, the overhead may negate RT Core benefits. More importantly, we'd be testing triangle-ray intersection performance rather than compute shader efficiency.

Why 8×8 thread groups instead of 16×16 or 32×32?

64 threads is a multiple of 16, 32, and 64, ensuring full SIMD utilization across NVIDIA (32-wide warps), AMD (64-wide wavefronts), and Intel Arc (variable widths). Larger groups such as 16×16 are also multiples of these widths and could improve occupancy on some architectures, but they raise per-group register and scheduling demands in ways that differ between vendors, which could skew results.

How does this correlate with FEA / simulation workloads?

The physics kernel implements explicit time integration, the same family of algorithms as the central-difference schemes in LS-DYNA, Abaqus/Explicit, and other finite element analysis codes used for automotive crash simulation and structural dynamics. The computational pattern (read state, apply forces, integrate, check contacts, write state) is structurally similar. Performance here provides a rough proxy for explicit dynamics simulations, though professional codes typically require FP64 double precision rather than FP32.

What about molecular dynamics simulations?

Molecular dynamics tools used to study protein folding use explicit integration with force field evaluation, broadly similar to this benchmark's physics kernel. Each atom's position updates based on electrostatic and van der Waals forces from neighbors. The per-step pattern (read state, calculate forces, update state) is comparable, although the debris particles here do not interact with each other. Performance here provides only a rough proxy for molecular dynamics, without requiring domain expertise to configure a biological simulation.

Does Intel Arc's dual-issue capability help here?

Potentially. The physics kernel mixes integer operations (array indexing: debris[idx]) with floating-point (physics calculations: p.vel.y -= 0.015). Intel Arc's architecture can execute integer and float operations from the same thread in a single cycle. This specific workload might let that feature shine, though we haven't verified it empirically.

How does divergence affect different architectures?

Adjacent pixels can take wildly different execution paths: one hitting after 5 steps, another marching the full 128-step default budget. These threads share a SIMD group. NVIDIA handles this through warp-level predication; AMD's wider wavefronts may see more divergence impact. Intel's variable SIMD width provides flexibility. Architectures with better divergence handling show measurable advantages.

What are the benchmark's known limitations?

Workload characteristics: FP32 compute-focused, not representative of current game workloads. Specific to raymarching techniques, not path tracing or photon mapping. Physics simulation is simplified compared to professional FEA codes.

Platform coverage: Currently Windows DirectX 12 only. WebGPU version exists but lacks automated benchmark features.

Does not test: Memory bandwidth to VRAM (small working set stays cache-resident), RT Core performance, Tensor Core performance, traditional rasterization pipeline.

10 — The Future

The Next 25 Years

In 1981, 640K of system memory was thought to be plenty. More than forty years later, that number is a punchline. Today's smartphones carry more memory than entire data centers of that era.

Thinking that 2025 GPU hardware is "good enough" repeats the same mistake.

This benchmark exists because a simple paddle-and-ball game — 80 bricks, rendered with mathematical precision — can bring a flagship GPU to its knees. Not because the hardware is weak. The RTX 5090 is a marvel of engineering, containing billions of transistors executing trillions of operations per second. It's extraordinarily powerful by any historical measure.

It's just not powerful enough.

The visual quality you see in Radiance at 480p — the perfect anti-aliasing, the soft penumbral shadows, the mathematically precise surfaces — is what all games could look like. Not as a stylized choice, but as a technical baseline. Zero shimmering. Zero jagged edges. Zero texture filtering artifacts. Every pixel calculated from first principles.

We're not there yet. But we will be.

Developers today make thousands of compromises. Baked lighting instead of real-time global illumination. Triangle meshes instead of mathematical surfaces. Temporal reprojection to amortize costs across frames. Screen-space approximations. Level-of-detail hacks. These aren't artistic choices — they're engineering necessities, workarounds for hardware that can't yet deliver the uncompromised vision.

Every generation, some of those compromises become unnecessary. What required tricks in 2015 runs natively in 2025. The techniques that seem impossibly expensive today will be baseline expectations in 2035.

The Sky Is the Limit

We call them "GPUs" (Graphics Processing Units) by legacy, but that name is a historical artifact. These are massively parallel math processors. They train neural networks, simulate protein folding, mine cryptocurrencies, render films, and occasionally play games. The "graphics" in GPU describes their origin, not their destiny.

The same silicon that struggles with this benchmark today will run real-time path tracing tomorrow. Then real-time fluid dynamics. Then simulations we haven't imagined yet. The trajectory is clear; the timeline is not.

This benchmark captures a moment in that trajectory. It shows what we're reaching toward — and measures how far we have to go. The gap between "playable with compromises" and "playable without compromises" is the opportunity space for the next generation of hardware.

Tim Sweeney saw this future in 1999. The industry has spent twenty-five years building toward it. We're not at the finish line — we're at the starting blocks of something much bigger. The next twenty-five years will be extraordinary.
