Question on compute shaders

Jan 10, 2014 at 3:14 PM
How do you determine what arguments you pass to Dispatch? I'm looking at the Particle Storm sample and you pass "(m_iParticleCount/512, 1, 1)". Does that mean only m_iParticleCount/512 GPU cores are active while processing the particles? So for 1024 particles, only 2 threads? Isn't it better to have a thread for each particle for maximum possible speed? Or do I have it completely wrong xD. I see that you have a book - does it cover this stuff (specifically how to determine thread groups)? It's basically the part that's most alien to me - compute shaders and indirect dispatching.

It gets even more confusing when I look it up on the net. This, for instance:

http://recreationstudios.blogspot.com/2010/04/simple-compute-shader-example.html

He uses [numthreads(32, 32, 1)], so that's 1024 shader cores being utilized? But then he says "then there are 1024x1024x1 separate threads being run". I don't get it - Z is 1, right? So it's just one 2D thread group of 32x32 threads. How did he figure it's 1024x1024?
Coordinator
Jan 10, 2014 at 3:34 PM
I think you are confusing the thread group dimensions and the dispatch dimensions. In your compute shader, you define your thread group size -> (512, 1, 1) in the particle storm demo. This is the number of threads that will be created for every item in the dispatch call. So for particle storm, if there are 1024 particles, then the dispatch call would be like you showed -> Dispatch(1024/512, 1, 1) or Dispatch(2, 1, 1).

This means that there will be 2*1*1 thread groups spawned and fed into the GPU. With 2 thread groups * 512*1*1 threads per group, we end up with 1024 threads executing on the GPU, which is what you were originally thinking should be the case :) In your other link, he is using a thread group size of (32, 32, 1) and a dispatch size of (32, 32, 1), which is where he is coming up with 1024x1024x1 threads -> (32*32, 32*32, 1*1).
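To make the arithmetic concrete, here is a rough sketch of how the two sets of numbers fit together (the function wrapper and variable names are just for illustration, not taken from the demo):

#include <d3d11.h>

// HLSL side (compile-time, inside the compute shader):
//   [numthreads(512, 1, 1)]
//   void CSMAIN( uint3 id : SV_DispatchThreadID ) { /* id.x indexes one particle */ }
//
// C++ side: the Dispatch call decides how many of those groups get launched.
void DispatchParticles( ID3D11DeviceContext* pContext, UINT particleCount )
{
    const UINT threadsPerGroup = 512;  // must match the [numthreads(...)] declaration
    UINT groupCount = ( particleCount + threadsPerGroup - 1 ) / threadsPerGroup; // round up

    // Total threads = dispatch counts * thread group size, per axis:
    //   Dispatch(2,1,1) with numthreads(512,1,1)   -> 1024 threads for 1024 particles.
    //   Dispatch(32,32,1) with numthreads(32,32,1) -> 1024 x 1024 x 1 threads (the blog post).
    pContext->Dispatch( groupCount, 1, 1 );
}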

Regarding the book, there is a whole chapter dedicated to compute shaders, and of course some of the samples utilize them as well so you can see example usage. There is light coverage of DispatchIndirect (as opposed to regular Dispatch), but they are essentially the same thing with the exception that the arguments to the call come from a buffer resource instead of being supplied directly in the function call.
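As a rough illustration (not code from the book or the engine), the only difference on the API side is where those three group counts come from:

#include <d3d11.h>

// Sketch: do the same work as Dispatch(2, 1, 1), but with the counts read from a buffer.
// The names and the hard-coded argument values here are illustrative only.
void DispatchFromBuffer( ID3D11Device* pDevice, ID3D11DeviceContext* pContext )
{
    // The argument buffer holds three UINTs: ThreadGroupCountX, Y, Z.
    UINT args[3] = { 2, 1, 1 };

    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof( args );
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS;  // required for indirect arguments

    D3D11_SUBRESOURCE_DATA init = { args, 0, 0 };
    ID3D11Buffer* pArgs = nullptr;
    pDevice->CreateBuffer( &desc, &init, &pArgs );

    // Equivalent to Dispatch(2, 1, 1), except the counts could just as well have been
    // written into pArgs by the GPU (e.g. via CopyStructureCount) instead of the CPU.
    pContext->DispatchIndirect( pArgs, 0 );

    pArgs->Release();
}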

You can always ask questions here as well, and the source code is what is used for all of the samples in the book.
Jan 10, 2014 at 3:45 PM
Edited Jan 10, 2014 at 3:46 PM
ok just a quick question - if you spawn fewer threads than the particle count, you would just run out of thread IDs to index the buffer elements, right? But what if you spawn too many and, using thread IDs to access buffer elements, you access a value past the buffer's size? Do you get some GPU exception? Or just a display driver crash?

edit: oh, you're using append/consume. I guess these methods are thread-safe? So calling consume from one thread makes sure this thread owns the consumed element, right?
Coordinator
Jan 10, 2014 at 3:53 PM
The append/consume methods are guaranteed to not supply the same element to more than one thread when you consume, and vice versa for appending. In general, I have seen that you can cause some strange behavior if you consume too many elements, or try to append too many, but that was a while ago when the drivers for DX11 were still not very good.

I believe the expected behavior for consuming an element that is out of range is to return all zeros, but that is just from memory. You should be able to try it out and see what it does with a straightforward test case. It can often be helpful to use the 'CopyStructureCount' method (which is shown in the ParticleStorm demo) to get access to the current number of elements within the buffer.

Sorry I don't have a more concrete answer for you though...
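For what it's worth, a minimal sketch of reading that counter back with CopyStructureCount might look something like this (the staging buffer and function name are just for illustration):

#include <d3d11.h>

// Sketch: copy the hidden element counter of an append/consume UAV into a small
// staging buffer and read it back on the CPU. Names are illustrative only.
UINT ReadStructureCount( ID3D11Device* pDevice, ID3D11DeviceContext* pContext,
                         ID3D11UnorderedAccessView* pAppendConsumeUAV )
{
    // A 4-byte staging buffer that the CPU is allowed to map for reading.
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof( UINT );
    desc.Usage = D3D11_USAGE_STAGING;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;

    ID3D11Buffer* pStaging = nullptr;
    pDevice->CreateBuffer( &desc, nullptr, &pStaging );

    // Copy the UAV's hidden counter (how many elements have been appended / remain).
    pContext->CopyStructureCount( pStaging, 0, pAppendConsumeUAV );

    // Note: mapping a staging resource stalls until the GPU has finished writing it.
    UINT count = 0;
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if ( SUCCEEDED( pContext->Map( pStaging, 0, D3D11_MAP_READ, 0, &mapped ) ) )
    {
        count = *reinterpret_cast<UINT*>( mapped.pData );
        pContext->Unmap( pStaging, 0 );
    }

    pStaging->Release();
    return count;
}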
Jan 10, 2014 at 8:02 PM
Edited Jan 10, 2014 at 8:02 PM
ok just one more question - the UAV, which is basically like an SRV but can be both input and output (right?) - after the GPU writes to it, does it just get copied back to system memory? Like if you pass from one shader to another and so on (chaining them), does it go back to RAM every time a shader is done? Can you tell it not to come back to RAM and stay on the GPU, since the particles are completely controlled by the shader and don't need to go back to system memory?
Coordinator
Jan 13, 2014 at 3:42 PM
Edited Jan 13, 2014 at 3:43 PM
blackenedarmor wrote:
ok just one more question - the UAV, which is basically like an SRV but can be both input and output (right?)
That is correct - it is the exact same concept as the shader resource view, except that it is read/write instead of read-only. This has implications for where the resource can be bound to the pipeline.
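To make that concrete, here is a rough sketch of creating both view types on one structured buffer (names are illustrative, and it assumes the buffer was created with both D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_UNORDERED_ACCESS) - the SRV is the read-only input binding, while the UAV is read/write and in plain D3D11 can only be bound at the compute and pixel shader stages:

#include <d3d11.h>

// Sketch: two views onto the same structured buffer.
void CreateParticleViews( ID3D11Device* pDevice, ID3D11Buffer* pParticleBuffer,
                          UINT particleCount,
                          ID3D11ShaderResourceView** ppSRV,
                          ID3D11UnorderedAccessView** ppUAV )
{
    // Read-only view: usable as an input at any shader stage.
    D3D11_SHADER_RESOURCE_VIEW_DESC srv = {};
    srv.Format = DXGI_FORMAT_UNKNOWN;               // structured buffers use UNKNOWN
    srv.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
    srv.Buffer.FirstElement = 0;
    srv.Buffer.NumElements = particleCount;
    pDevice->CreateShaderResourceView( pParticleBuffer, &srv, ppSRV );

    // Read/write view: bindable to the compute shader (and the pixel shader stage).
    D3D11_UNORDERED_ACCESS_VIEW_DESC uav = {};
    uav.Format = DXGI_FORMAT_UNKNOWN;
    uav.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    uav.Buffer.FirstElement = 0;
    uav.Buffer.NumElements = particleCount;
    uav.Buffer.Flags = 0;                           // or D3D11_BUFFER_UAV_FLAG_APPEND for append/consume
    pDevice->CreateUnorderedAccessView( pParticleBuffer, &uav, ppUAV );
}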

blackenedarmor wrote:
... after the GPU writes to it, does it just get copied back to system memory? Like if you pass from one shader to another and so on (chaining them), does it go back to RAM every time a shader is done? Can you tell it not to come back to RAM and stay on the GPU, since the particles are completely controlled by the shader and don't need to go back to system memory?
Actually all of the D3D11 resources reside in GPU memory (this discounts the fact that GPU memory is able to be virtualized in modern Windows versions...). So any operations that modify a resource are already trying to keep the memory as close to the GPU as possible. The only time the resource is copied to system memory is when you map it, and even that can only be done on certain resources created with special usage flags.
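So chaining compute passes really is just a matter of binding the same UAV for the next Dispatch - nothing moves to system memory in between. A rough sketch (the shader and view names are illustrative only):

#include <d3d11.h>

// Sketch: two compute passes over the same particle buffer. The data stays in the
// GPU-resident resource between the dispatches; only an explicit copy to a staging
// resource plus Map() would bring it back to CPU-visible memory.
void SimulateThenPostProcess( ID3D11DeviceContext* pContext,
                              ID3D11ComputeShader* pSimulateCS,
                              ID3D11ComputeShader* pPostProcessCS,
                              ID3D11UnorderedAccessView* pParticleUAV,
                              UINT groupCount )
{
    UINT keepCount = (UINT)-1;  // -1 preserves the UAV's hidden counter, if it has one

    // Pass 1: update the particles in place.
    pContext->CSSetShader( pSimulateCS, nullptr, 0 );
    pContext->CSSetUnorderedAccessViews( 0, 1, &pParticleUAV, &keepCount );
    pContext->Dispatch( groupCount, 1, 1 );

    // Pass 2: reads pass 1's output directly through the still-bound UAV.
    pContext->CSSetShader( pPostProcessCS, nullptr, 0 );
    pContext->Dispatch( groupCount, 1, 1 );
}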
Jan 14, 2014 at 12:49 AM
Does that mean all created shaders are always in GPU memory? I happen to create a lot of permutations, and I guess that's bad since it uses a lot of memory. I should try the shader linkage API, but I'm not sure if it has any performance penalties of its own. The documentation simply says what it does, but doesn't say whether it incurs a large overhead or not. I mean, the way they explain it, it's like virtual functions, and we know those are less efficient than normal ones.
Coordinator
Jan 15, 2014 at 1:48 AM
It is a GPU resource, so it should reside in GPU memory. The object that you work with in your C++ code is actually just a COM pointer to the shader object, and the driver is the one that actually manages the communication from your COM pointer to the actual resource on the GPU.

The shader linkage API isn't designed to make the code faster or anything like that - instead it is designed to reduce the number of permutations that are needed in an application/game. I honestly haven't used it, and I also don't know of anyone else that has either...

But like everything else in graphics programming, it would be interesting for you to try it out and profile to see if it has a big influence in one way or the other!