Do command lists reduce overhead of context switches due to using DirectCompute?

Apr 8, 2013 at 7:02 AM
Edited Apr 8, 2013 at 7:05 AM
I like the RenderEffect class, it's really handy to use. However, I'm worried that it puts compute shaders in the same place as the other shader types. When you configure a pipeline with a render effect that contains a compute shader (for instance in the particle simulation), you bind a compute shader, then go back to rendering, then use a compute shader again, and so on, and the overhead only multiplies when you run multiple particle simulations at the same time.

So the question is: what is your experience with this issue? With profiling I couldn't find much performance difference between switching the GPU "mode" back and forth and ordering the work to minimize the number of switches. Could this be because of the use of command lists?

Sorry for asking such questions, but I can't find detailed explanations on this subject anywhere. Is there any source that explains what happens behind the scenes in DirectX, something that specifically covers the overhead of using DirectCompute? I'm trying to integrate compute shaders into my own framework as well, and I'm afraid that if I don't order the calls to reduce the switches, I will suffer huge performance losses later on. I'm not even sure whether it actually affects the GPU itself, or whether they are referring to driver and/or API overhead.
Apr 8, 2013 at 11:07 AM
That's a great question. In general, the GPU handles both setups in a very similar way: the compute shader and the rendering shaders all use the same processing resources, so overall there is really no difference there. However, both AMD and NVIDIA gave the advice, back when D3D11 had just come out, that we should avoid switching from one style of usage to the other, because there is some switchover time involved. My guess is that this has something to do with stalling the pipeline and changing the number of threads in a given execution, since the pixel shader and the compute shader would naturally run at different group sizes.

Command lists are simply a way of packaging up a group of commands in an attempt to minimize the number of calls into the immediate context. Some people say not to use them, and some people say they work well. Ultimately I find that they provide some performance increase in certain situations, but it depends on lots of variables.

So to answer your question, I'm not sure :) I haven't seen anything recent from the IHVs to indicate that the latest generation of GPUs even still requires the extra time to switch over from compute to rendering, so it may not be an issue anymore. It's also possible that the effect is only visible under very heavy switching, like going back and forth 5-6 times in a single frame. Unfortunately there is no hard advice available about this at the moment.

My advice would be to design your framework such that you can move things around easily and adjust for these types of issues. If you are properly prepared, then you can rearrange the execution when it needs to be rearranged. But all of this depends on drivers, GPUs, CPUs, memory, etc., so there is really no way of knowing how things will work until you try them out. If your architecture lets you move operations around, then you could even dynamically adjust the order at runtime based on the frame rates you are seeing...

Sorry I can't answer your question directly, but I hope that helps.
Apr 8, 2013 at 9:12 PM
Thanks. I just have to ask one more thing: when an entity is rendered, the effect in the material sets its data into the pipeline, but in the pre-render step before that, the material queues a task (or many). So if 5 entities have the same material, would it queue 5 render tasks? I got a little lost here and I'm not sure how to formulate the question exactly; I'm just a little lost on how the in-between stuff works between the material, the entity, and the view/task.
Apr 9, 2013 at 2:23 AM
Normally if a material has a Task included in it (they used to be called render views), then it won't be shared among multiple entities. You can think of these tasks as simply producing the resources that are needed by that object in order to render itself. In the case of a reflective object, that task is generating an environment map. In the case of the GPU based particle system, it is doing the simulation step.

The system seems complicated, but most uses of it are actually pretty minimal. The MirrorMirror sample is probably the most complex, since it has multiple reflective objects near each other, but otherwise you can think of a task as just encapsulating one rendering pass.

You might also find this discussion thread helpful. Please don't be afraid to ask more detailed questions though - if it isn't clear to you, then it is likely not clear to others either, and you are helping to improve the content out there!
Apr 9, 2013 at 2:27 AM
Oh I see. I thought entities shared materials and that the material was set before rendering each batch of entities. It's clearer now.