> DXR allows each material to register a unique hit shader, so Modern GPUs need to dynamic dispatch based on the result of ray hits.
That's not how it works in practice, even with hardware-accelerated raytracers (like Intel Arc).
AMD systems push the hit/miss onto various buffers and pass them around.
Intel systems push the entire call stack and shuffle it around.
Let's say your 256-thread group has 30% "metallic hits", 15% "diffuse hits", and the remaining 55% are misses. You cannot "build up" a new group with just one thread group (!!!!).
To run things efficiently, you'll need ~4 thread groups (aka: 1024 rays) before you can run a full 256-thread group again for metallic hits, and you'll need ~2 thread groups (aka: 512 rays) before you get a full 256-thread group again for misses. And finally you'll need ~7 thread groups (aka: 1792 rays to pass through) before you have the 256 diffuse hits needed to fill up a SIMD unit.
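For anyone who wants to check the arithmetic, here is a rough sketch of where those numbers come from. The helper name `groups_needed` is made up for illustration, and it assumes every incoming thread group has exactly the 30/15/55 mix from the example:

```python
import math

GROUP_SIZE = 256  # threads per thread group, as in the example above


def groups_needed(fraction, group_size=GROUP_SIZE):
    """Whole thread groups of rays to trace before `fraction` of them fills one group."""
    rays = math.ceil(group_size / fraction)  # rays needed to collect group_size results
    return math.ceil(rays / group_size)      # rounded up to whole thread groups


# Assumed mix, taken from the example: 30% metallic, 15% diffuse, 55% miss.
mix = {"metallic": 0.30, "diffuse": 0.15, "miss": 0.55}
needed = {kind: groups_needed(frac) for kind, frac in mix.items()}

for kind, groups in needed.items():
    print(f"{kind}: ~{groups} thread groups ({groups * GROUP_SIZE} rays)")
```

The rarer the shader, the longer its results have to sit in a buffer before a full group can launch, which is why the diffuse case is the worst of the three.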
In all cases, you need to dynamically grow a buffer and "build up" enough parallelism before running the recursive hit (or miss) handlers. The devil is in the details.
Intel has very dedicated and specialized accelerators that move the stacks around (!!!!) so that all groups remain fully utilized. I believe AMD's implementation is "just" an append buffer followed by a consume buffer, simple enough really. Probably inside of shared memory, but who knows what the full implementation details are. (The AMD systems have documented ISAs, so we know what instructions are available. AMD's "raytracers" are BVH tree traversal accelerators but don't seem to have stack manipulation or stack movement like Intel's raytracing implementation.)
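To make the "append buffer, then consume buffer" idea concrete, here is a toy CPU-side model. It is only a sketch of the general pattern and not AMD's or Intel's actual implementation; `RayBinner`, `append`, `consume_full_groups`, and `classify` are all hypothetical names, and `classify` fakes a roughly 30/15/55 mix per thread group:

```python
from collections import defaultdict

GROUP_SIZE = 256  # threads per thread group, as in the example above


class RayBinner:
    """Toy model: append ray results into per-shader bins, consume only full groups."""

    def __init__(self, group_size=GROUP_SIZE):
        self.group_size = group_size
        self.bins = defaultdict(list)  # shader name -> pending ray ids

    def append(self, shader, ray_id):
        self.bins[shader].append(ray_id)

    def consume_full_groups(self):
        """Yield (shader, ray_ids) for each full group; leftovers stay buffered."""
        for shader, rays in self.bins.items():
            while len(rays) >= self.group_size:
                batch = rays[:self.group_size]
                del rays[:self.group_size]
                yield shader, batch


def classify(lane):
    """Hypothetical stand-in for the trace result: roughly 30/15/55 per group."""
    slot = lane % 20
    if slot < 6:
        return "metallic"
    if slot < 9:
        return "diffuse"
    return "miss"


binner = RayBinner()
first_dispatch = {}  # shader -> how many thread groups were traced before its first full batch
batch_sizes = []

for group in range(1, 8):
    for lane in range(GROUP_SIZE):
        binner.append(classify(lane), (group, lane))
    for shader, batch in list(binner.consume_full_groups()):
        batch_sizes.append(len(batch))
        first_dispatch.setdefault(shader, group)

print(first_dispatch)
```

With this fake mix, the first full batch of misses shows up after 2 thread groups, metallic hits after 4, and diffuse hits after 7. Every batch handed to a handler is a full 256 rays, at the cost of results waiting in the bins in the meantime.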