How to minimize the draw calls per frame while rendering multiple meshes?

May 12, 2014 at 4:17 AM
Edited May 12, 2014 at 10:18 AM
Hi jzink,

I have a large B3D scene with 120 meshes, each of which includes a vertex buffer and material information (diffuse, specular, and ambient factors). To render this model with Glyph3, I created about 120 entities. This results in about 120 DrawIndexed calls per frame, and the frame rate drops to 30 fps. Unfortunately, I can't add physics to the application at this speed. I have two questions.
  1. Is this speed normal in your engine? When I posted my problem on gamedev, someone said that 120 batches per frame is not the reason for the 30 fps. I'm using a 1.6 GHz Core i5 CPU and a GeForce 740M GPU.
  2. If this speed is normal in Glyph3, is there any way to work around it? I am thinking about reducing the number of DrawIndexed calls to one per frame by grouping the vertices of all meshes into a single vertex buffer. Each vertex would contain a materialID and a transformID, and these IDs would serve as indices into structured buffers storing the material and transformation information.
I hope to see your opinion about it. Thanks in advance!
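The merging idea in question 2 can be sketched on the CPU side roughly as follows. This is a minimal illustration, not Glyph3 code: `PackedVertex`, `SourceMesh`, `MergedGeometry`, and `MergeMeshes` are hypothetical names, and the vertex layout is an assumption. It concatenates the meshes, rebases each mesh's indices, and stamps the two IDs into every vertex.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical vertex layout: position plus the two IDs proposed in the post.
struct PackedVertex {
    float    position[3];
    uint32_t materialID;   // index into a material structured buffer
    uint32_t transformID;  // index into a transform structured buffer
};

// Hypothetical CPU-side mesh description (not a Glyph3 type).
struct SourceMesh {
    std::vector<PackedVertex> vertices;  // IDs are overwritten during the merge
    std::vector<uint32_t>     indices;   // relative to this mesh's own vertices
    uint32_t materialID;
    uint32_t transformID;
};

struct MergedGeometry {
    std::vector<PackedVertex> vertices;
    std::vector<uint32_t>     indices;
};

// Concatenate all meshes into one vertex/index pair. Each mesh's indices are
// rebased by the number of vertices emitted before it.
inline MergedGeometry MergeMeshes(const std::vector<SourceMesh>& meshes) {
    MergedGeometry out;
    for (const SourceMesh& mesh : meshes) {
        const uint32_t baseVertex = static_cast<uint32_t>(out.vertices.size());
        for (PackedVertex v : mesh.vertices) {  // copy, then stamp the IDs
            v.materialID  = mesh.materialID;
            v.transformID = mesh.transformID;
            out.vertices.push_back(v);
        }
        for (uint32_t index : mesh.indices) {
            assert(index < mesh.vertices.size());  // index must stay inside its mesh
            out.indices.push_back(baseVertex + index);
        }
    }
    return out;
}
```

One caveat with this approach: a diffuse texture that differs per mesh cannot be selected by an ID in a structured buffer alone, so a single draw call would probably also need a texture array or an atlas.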

Update: At first, I thought most of the running time was spent in DrawIndexed, but the Sleepy profiler (in inclusive statistics mode) shows that the call to PipelineManagerDX11::ClearPipelineResources takes the most time, as shown in the picture below.

Image

After digging into ClearPipelineResources, I found that it calls TStateArrayMonitor::SetState(), which takes the most time in the exclusive mode of the Sleepy profiler.

Image1

This seems to contradict your article, where you explained that pipeline state monitoring can reduce the time spent calling the DirectX API.
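For readers unfamiliar with the technique, state monitoring can be sketched as a cache that compares the desired state with the last applied one and only reports a change when they differ. The class below is a hypothetical single-slot illustration, not the Glyph3 TStateArrayMonitor implementation; `CachedState`, `Flush`, and `ApiCalls` are made-up names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-slot state monitor (an assumption for illustration only).
template <typename T>
class CachedState {
public:
    explicit CachedState(T initial) : m_applied(initial), m_desired(initial) {}

    // Record what the next draw call wants; no API call happens here.
    void SetState(T value) { m_desired = value; }

    // Returns true when the caller would have to issue the real API call.
    bool Flush() {
        if (m_desired == m_applied) return false;  // redundant set is skipped
        m_applied = m_desired;
        ++m_apiCalls;
        return true;
    }

    std::size_t ApiCalls() const { return m_apiCalls; }

private:
    T           m_applied;
    T           m_desired;
    std::size_t m_apiCalls = 0;
};
```

Note that the comparison itself still costs CPU time for every slot, even when no API call is issued, which is consistent with a profiler attributing time to SetState in an unoptimized build.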
May 16, 2014 at 3:15 PM
Edited May 16, 2014 at 3:27 PM
Where are you, jzink?
I have tried my best for several days, and the frame rate is still only about 115 fps. You can see more about my question on gamedev.stackexchange:
http://gamedev.stackexchange.com/questions/74886/draw-call-optimization-for-multiple-meshes-in-directx11
Coordinator
May 17, 2014 at 1:15 PM
Sorry, I have been offline for the past week... I will take a closer look at this tonight and give you feedback about what is happening.
May 17, 2014 at 3:11 PM
Edited May 17, 2014 at 3:14 PM
It's great to have you back.

Here is some more explanation of the source code.

The only things that change between DrawIndexed calls are the material parameters, which are shown in the following shader code:
cbuffer Material
{
    float4 gMatDiffuse;
    float4 gMatSpecular;
    float4 gMatAmbient;
    float4 gMatEmission;
    float  gMatShininess;
    Texture2D gTexDiffuse;
};
All other pipeline states remain the same between calls.

At first I designed the program in the usual way: I created about 120 entities, each with its own material, and the frame rate was only about 30 fps.

I debugged it and found that most of the CPU time is spent in this call:
//--------------------------------------------------------------------------------
void Entity3D::Render( PipelineManagerDX11* pPipelineManager, IParameterManager* pParamManager, VIEWTYPE view)
{ 
      pPipelineManager->ClearPipelineResources();
}
So instead of 120 entities, I created only one, which in turn owns 120 pipeline executors. This entity splits the pipeline configuration into two stages, as in the following code:
void MultiMeshEntity3D::Render(PipelineManagerDX11* pPipelineManager, IParameterManager* pParamManager, VIEWTYPE view)
{
    // Test if the entity contains any geometry, and it has a material
    if (!m_bHidden && (m_pMeshes.size() > 0) && (m_sParams.Material != NULL))
    {
        // Only render if the material indicates that you should
        if (m_sParams.Material->Params[view].bRender)
        {
            //m_sParams.Material->SetRenderParams(pParamManager, view);

            // Set the entity render parameters
            this->SetRenderParams(pParamManager);

            // Configure the pipeline with the render effect supplied by the material.
            pPipelineManager->ClearPipelineResources();
                        
            // I added this function to the RenderEffect11 class to configure only the OutputMergerStage
            // and RasterizerStage, because these two stages do not change between DrawIndexed calls.
            m_sParams.Material->Params[view].pEffect->ConfigurePipelieStandardStates(pPipelineManager, pParamManager);

            // Let the geometry execute its drawing operation.  This includes 
            // configuring the input to the pipeline, plus calling an appropriate
            // draw call.
            for (int i = 0; i < m_pMeshes.size(); i++)
            {
                // MeshInfo holds material information such as diffuse, specular, and the texture resource.
                MeshInfo* pMesh = m_pMeshes[i];

                // Update the mesh material information
                m_pMeshParamsWriter->UpdateMeshParameters(pMesh, pParamManager);
                m_sParams.Material->Params[view].pEffect->BindShaderParameters(pPipelineManager, pParamManager);
                pPipelineManager->ApplyPipelineResources();

                // pExecutor is a DrawIndexedExecutor<Vertex>
                pMesh->pExecutor->Execute(pPipelineManager, pParamManager);
            }
        }
    }
}
The material information is updated as follows:
//-------------------------------------------------------------------------------------------------------
void MeshParamsWriter::UpdateMeshParameters(MeshInfo* pMesh, IParameterManager* pParamaterManager)
{
    pMatDiffuseWriter->SetValue(pMesh->diffuse);
    pMatSpecularWriter->SetValue(pMesh->specular);
    pMatAmbientWriter->SetValue(pMesh->ambient);
    pMatEmissionWriter->SetValue(pMesh->emission);
    pMatShininessWriter->SetValue(pMesh->shininess);
    pTexDiffuseWriter->SetValue(pMesh->pTextureRes);

    pParameters->SetRenderParams(pParamaterManager);
}
Unfortunately, the frame rate is only about 110 FPS, which is still too slow for me.

Could you show me what the problem is here?

Please tell me if you need any other information.
Coordinator
May 17, 2014 at 9:04 PM
Are you certain that the actual GPU work is not sufficient to make the system run at 110 FPS? Can you take a frame grab with either PIX or the Graphics Debugger and post it to see how the timing is working?

This would show which states are being set, and perhaps give some insight into what is happening...
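When comparing the numbers in this thread, it can help to convert frame rates into time budgets. The helpers below are a hypothetical sketch (the names `FrameTimeMs` and `PerDrawBudgetUs` are made up); they only do the arithmetic and say nothing about whether the CPU or the GPU is spending that time.

```cpp
#include <cassert>
#include <cmath>

// Milliseconds available per frame at a given frame rate.
inline double FrameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Average microseconds of frame time per draw call, assuming the whole frame
// is divided evenly across the draw calls (an upper bound on per-draw cost).
inline double PerDrawBudgetUs(double fps, int drawCalls) {
    assert(drawCalls > 0);
    return FrameTimeMs(fps) * 1000.0 / drawCalls;
}
```

For example, 30 FPS corresponds to about 33.3 ms per frame, and 110 FPS with 120 draw calls leaves roughly 76 microseconds of total frame time per draw on average.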
May 17, 2014 at 11:55 PM
Edited May 18, 2014 at 1:06 AM
Your pipeline monitoring seems to work fine: when I debug, it sets all pipeline states only the first time, and afterwards it only sets the states that changed, such as shader resource views. I think the bottleneck lies in this code section:
            // Configure the pipeline with the render effect supplied by the material.
            pPipelineManager->ClearPipelineResources();
            m_sParams.Material->Params[view].pEffect->ConfigurePipeline( pPipelineManager, pParamManager );
            pPipelineManager->ApplyPipelineResources();
Here is the Graphics Debugger log from my program.
1 View Draw: ViewPerspective
3 ID3D11DeviceContext::ClearRenderTargetView(obj:10,{0,0.5,0.7,0})
4 ID3D11DeviceContext::ClearDepthStencilView(obj:13,D3D11_CLEAR_DEPTH,1,0)
22 ID3D11DeviceContext::DrawIndexed(5898,0,0)
29 ID3D11DeviceContext::DrawIndexed(20664,0,0)
36 ID3D11DeviceContext::DrawIndexed(3000,0,0)
43 ID3D11DeviceContext::DrawIndexed(5340,0,0)
50 ID3D11DeviceContext::DrawIndexed(3540,0,0)
57 ID3D11DeviceContext::DrawIndexed(2760,0,0)
64 ID3D11DeviceContext::DrawIndexed(3444,0,0)
71 ID3D11DeviceContext::DrawIndexed(1680,0,0)
78 ID3D11DeviceContext::DrawIndexed(1680,0,0)
85 ID3D11DeviceContext::DrawIndexed(72,0,0)
92 ID3D11DeviceContext::DrawIndexed(1920,0,0)
99 ID3D11DeviceContext::DrawIndexed(1332,0,0)
106 ID3D11DeviceContext::DrawIndexed(618,0,0)
113 ID3D11DeviceContext::DrawIndexed(720,0,0)
120 ID3D11DeviceContext::DrawIndexed(6,0,0)
127 ID3D11DeviceContext::DrawIndexed(174,0,0)
134 ID3D11DeviceContext::DrawIndexed(360,0,0)
141 ID3D11DeviceContext::DrawIndexed(78,0,0)
148 ID3D11DeviceContext::DrawIndexed(1368,0,0)
155 ID3D11DeviceContext::DrawIndexed(324,0,0)
162 ID3D11DeviceContext::DrawIndexed(900,0,0)
169 ID3D11DeviceContext::DrawIndexed(216,0,0)
176 ID3D11DeviceContext::DrawIndexed(900,0,0)
183 ID3D11DeviceContext::DrawIndexed(216,0,0)
190 ID3D11DeviceContext::DrawIndexed(900,0,0)
197 ID3D11DeviceContext::DrawIndexed(216,0,0)
204 ID3D11DeviceContext::DrawIndexed(504,0,0)
211 ID3D11DeviceContext::DrawIndexed(6,0,0)
218 ID3D11DeviceContext::DrawIndexed(174,0,0)
225 ID3D11DeviceContext::DrawIndexed(360,0,0)
232 ID3D11DeviceContext::DrawIndexed(78,0,0)
239 ID3D11DeviceContext::DrawIndexed(108,0,0)
246 ID3D11DeviceContext::DrawIndexed(24,0,0)
253 ID3D11DeviceContext::DrawIndexed(900,0,0)
260 ID3D11DeviceContext::DrawIndexed(216,0,0)
267 ID3D11DeviceContext::DrawIndexed(108,0,0)
274 ID3D11DeviceContext::DrawIndexed(24,0,0)
281 ID3D11DeviceContext::DrawIndexed(144,0,0)
288 ID3D11DeviceContext::DrawIndexed(504,0,0)
295 ID3D11DeviceContext::DrawIndexed(6,0,0)
302 ID3D11DeviceContext::DrawIndexed(174,0,0)
309 ID3D11DeviceContext::DrawIndexed(360,0,0)
316 ID3D11DeviceContext::DrawIndexed(78,0,0)
323 ID3D11DeviceContext::DrawIndexed(372,0,0)
330 ID3D11DeviceContext::DrawIndexed(30,0,0)
337 ID3D11DeviceContext::DrawIndexed(360,0,0)
344 ID3D11DeviceContext::DrawIndexed(78,0,0)
351 ID3D11DeviceContext::DrawIndexed(144,0,0)
358 ID3D11DeviceContext::DrawIndexed(108,0,0)
365 ID3D11DeviceContext::DrawIndexed(24,0,0)
372 ID3D11DeviceContext::DrawIndexed(540,0,0)
379 ID3D11DeviceContext::DrawIndexed(540,0,0)
386 ID3D11DeviceContext::DrawIndexed(5322,0,0)
393 ID3D11DeviceContext::DrawIndexed(540,0,0)
400 ID3D11DeviceContext::DrawIndexed(792,0,0)
407 ID3D11DeviceContext::DrawIndexed(1800,0,0)
414 ID3D11DeviceContext::DrawIndexed(540,0,0)
421 ID3D11DeviceContext::DrawIndexed(540,0,0)
428 ID3D11DeviceContext::DrawIndexed(36,0,0)
434 ID3D11DeviceContext::DrawIndexed(36,0,0)
441 ID3D11DeviceContext::DrawIndexed(36,0,0)
448 ID3D11DeviceContext::DrawIndexed(450,0,0)
455 ID3D11DeviceContext::DrawIndexed(36,0,0)
461 ID3D11DeviceContext::DrawIndexed(36,0,0)
468 ID3D11DeviceContext::DrawIndexed(36,0,0)
475 ID3D11DeviceContext::DrawIndexed(36,0,0)
482 ID3D11DeviceContext::DrawIndexed(396,0,0)
488 ID3D11DeviceContext::DrawIndexed(396,0,0)
494 ID3D11DeviceContext::DrawIndexed(396,0,0)
500 ID3D11DeviceContext::DrawIndexed(396,0,0)
506 ID3D11DeviceContext::DrawIndexed(396,0,0)
512 ID3D11DeviceContext::DrawIndexed(396,0,0)
518 ID3D11DeviceContext::DrawIndexed(396,0,0)
524 ID3D11DeviceContext::DrawIndexed(396,0,0)
530 ID3D11DeviceContext::DrawIndexed(396,0,0)
536 ID3D11DeviceContext::DrawIndexed(396,0,0)
542 ID3D11DeviceContext::DrawIndexed(396,0,0)
548 ID3D11DeviceContext::DrawIndexed(396,0,0)
554 ID3D11DeviceContext::DrawIndexed(396,0,0)
561 ID3D11DeviceContext::DrawIndexed(1512,0,0)
567 ID3D11DeviceContext::DrawIndexed(1512,0,0)
574 ID3D11DeviceContext::DrawIndexed(78,0,0)
581 ID3D11DeviceContext::DrawIndexed(540,0,0)
588 ID3D11DeviceContext::DrawIndexed(36,0,0)
595 ID3D11DeviceContext::DrawIndexed(78,0,0)
602 ID3D11DeviceContext::DrawIndexed(360,0,0)
609 ID3D11DeviceContext::DrawIndexed(24,0,0)
616 ID3D11DeviceContext::DrawIndexed(180,0,0)
623 ID3D11DeviceContext::DrawIndexed(12,0,0)
630 ID3D11DeviceContext::DrawIndexed(624,0,0)
637 ID3D11DeviceContext::DrawIndexed(180,0,0)
644 ID3D11DeviceContext::DrawIndexed(12,0,0)
651 ID3D11DeviceContext::DrawIndexed(624,0,0)
658 ID3D11DeviceContext::DrawIndexed(180,0,0)
665 ID3D11DeviceContext::DrawIndexed(12,0,0)
672 ID3D11DeviceContext::DrawIndexed(624,0,0)
679 ID3D11DeviceContext::DrawIndexed(180,0,0)
686 ID3D11DeviceContext::DrawIndexed(12,0,0)
693 ID3D11DeviceContext::DrawIndexed(624,0,0)
700 ID3D11DeviceContext::DrawIndexed(180,0,0)
707 ID3D11DeviceContext::DrawIndexed(12,0,0)
714 ID3D11DeviceContext::DrawIndexed(864,0,0)
721 ID3D11DeviceContext::DrawIndexed(648,0,0)
727 ID3D11DeviceContext::DrawIndexed(648,0,0)
733 ID3D11DeviceContext::DrawIndexed(648,0,0)
740 ID3D11DeviceContext::DrawIndexed(216,0,0)
747 ID3D11DeviceContext::DrawIndexed(180,0,0)
754 ID3D11DeviceContext::DrawIndexed(12,0,0)
761 ID3D11DeviceContext::DrawIndexed(180,0,0)
768 ID3D11DeviceContext::DrawIndexed(12,0,0)
775 ID3D11DeviceContext::DrawIndexed(180,0,0)
782 ID3D11DeviceContext::DrawIndexed(12,0,0)
789 ID3D11DeviceContext::DrawIndexed(624,0,0)
796 ID3D11DeviceContext::DrawIndexed(180,0,0)
803 ID3D11DeviceContext::DrawIndexed(12,0,0)
810 ID3D11DeviceContext::DrawIndexed(624,0,0)
811 View Draw: ViewTextOverlay
832 IDXGISwapChain::Present(0,0)
And this is the detailed information for the final DrawIndexed call. All other calls are the same.
810 ID3D11DeviceContext::DrawIndexed(624,0,0)
Input Layout : obj:426
    807 ID3D11DeviceContext::IASetInputLayout(obj:426)
Index Buffer : obj:310
    809 ID3D11DeviceContext::IASetIndexBuffer(obj:310,DXGI_FORMAT_R32_UINT,0)
Vertex Buffer(s) : obj:309
    808 ID3D11DeviceContext::IASetVertexBuffers(0,1,{obj:309},{36},{0})
Primitive Topology : D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST
    Set by previous frame : D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST 810 DrawIndexed was not the first event to use this state from the previous frame. This state was already used by 22 DrawIndexed.
VS Constant Buffer(s) : obj:68
    10 ID3D11DeviceContext::VSSetConstantBuffers(0,1,{obj:68}) 810 DrawIndexed was not the first event to use 10 VSSetConstantBuffers. This state was already used by 22 DrawIndexed.
VS Shader : obj:67
    9 ID3D11DeviceContext::VSSetShader(obj:67,nullptr,0) 810 DrawIndexed was not the first event to use 9 VSSetShader. This state was already used by 22 DrawIndexed.
Rasterizer State : obj:5
    15 ID3D11DeviceContext::RSSetState(obj:5) 810 DrawIndexed was not the first event to use 15 RSSetState. This state was already used by 22 DrawIndexed.
Viewport(s) : {{D3D11_VIEWPORT: TopLeftX=0, TopLeftY=0, Width=800, Height=600, MinDepth=0, MaxDepth=1}}
    16 ID3D11DeviceContext::RSSetViewports(1,{{D3D11_VIEWPORT: TopLeftX=0, TopLeftY=0, Width=800, Height=600, MinDepth=0, MaxDepth=1}}) 810 DrawIndexed was not the first event to use 16 RSSetViewports. This state was already used by 22 DrawIndexed.
PS Constant Buffer(s) : obj:70
    12 ID3D11DeviceContext::PSSetConstantBuffers(0,1,{obj:70}) 810 DrawIndexed was not the first event to use 12 PSSetConstantBuffers. This state was already used by 22 DrawIndexed.
PS Shader Resource(s) : obj:40
    806 ID3D11DeviceContext::PSSetShaderResources(0,1,{obj:40})
PS Sampler(s) : obj:21
    13 ID3D11DeviceContext::PSSetSamplers(0,1,{obj:21}) 810 DrawIndexed was not the first event to use 13 PSSetSamplers. This state was already used by 22 DrawIndexed.
PS Shader : obj:69
    11 ID3D11DeviceContext::PSSetShader(obj:69,nullptr,0) 810 DrawIndexed was not the first event to use 11 PSSetShader. This state was already used by 22 DrawIndexed.
Render Target View(s) : obj:10
    2 ID3D11DeviceContext::OMSetRenderTargets(8,{obj:10,nullptr,nullptr,nullptr,nullptr,nullptr,nullptr,nullptr},obj:13) 810 DrawIndexed was not the first event to use 2 OMSetRenderTargets. This state was already used by 22 DrawIndexed.
Depth-Stencil View : obj:13
    2 ID3D11DeviceContext::OMSetRenderTargets(8,{obj:10,nullptr,nullptr,nullptr,nullptr,nullptr,nullptr,nullptr},obj:13) 810 DrawIndexed was not the first event to use 2 OMSetRenderTargets. This state was already used by 22 DrawIndexed.
Blend State : obj:7
    17 ID3D11DeviceContext::OMSetBlendState(obj:7,{1,1,1,1},0xffffffff) 810 DrawIndexed was not the first event to use 17 OMSetBlendState. This state was already used by 22 DrawIndexed.
Depth-Stencil State : obj:6
    18 ID3D11DeviceContext::OMSetDepthStencilState(obj:6,0) 810 DrawIndexed was not the first event to use 18 OMSetDepthStencilState. This state was already used by 22 DrawIndexed.
Coordinator
May 18, 2014 at 9:09 AM
Interesting... Ok, so that means that your bottleneck is most likely what you originally reported, and that we are clearing the CPU states too frequently. I'll try this with the mirror mirror sample (which has similar usage pattern) and see if I can find a reason that there are more ClearPipelineResources calls than needed. Also, I'll check if there is a way to optimize the number of resources that get cleared in this case or if there is a way to simplify the operation somehow. I'll report back later :)
May 18, 2014 at 12:38 PM
I read the results of your pipeline monitoring experiment on gamedev and saw that your mirror mirror sample runs at 160 FPS. This contrasts with the 15 FPS I get when the mirror mirror sample runs on my computer (I changed the configuration to multi-threaded). My setup is Windows 8, a 1.6 GHz Core i5, and a GeForce 740M. Is there any difference here?
Coordinator
May 18, 2014 at 5:28 PM
Are you running in a release build, and running outside of the development environment? The development machine I took those numbers on has a quad-core AMD CPU and an AMD 5700-level graphics card. I think you should get better results than I do...

Also the usual questions apply - do you have the latest drivers for your graphics card installed?
Coordinator
May 18, 2014 at 5:39 PM
I was thinking about this more, and if you only have a single entity, which is issuing all of the draw calls, then you should only be calling ClearPipelineResources once, right? If that is your biggest CPU usage, and it is only called once per frame, then you should be achieving very high framerates.

I just tried out the MirrorMirror sample in release mode on Win32 build config, and it ran at ~220 fps. This sample uses many entities, meaning that the same ClearPipelineResources method should be called many more times than in your case....

There must be something else going on that is driving the performance issue. Is there anything modified in the engine that you are using, or is it a fresh download of the current commit?
Coordinator
May 18, 2014 at 6:24 PM
Actually it should also be possible to move the ClearPipelineResources out of the Entity3D::Render method, and instead perform that call at the higher level ViewPerspective::ExecuteTask method. This should minimize the number of times it is called, and will actually not hurt the performance of the rest of the state setting. Since each entity will set its needed parameters, then it doesn't matter which other states are still set... so we can safely skip the clear call without any problems.

Can you double check your profiling results in this way? And also ensure that you are profiling a release build as well!
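The effect of hoisting the clear out of the per-entity render can be illustrated with a toy model. Everything here is a hypothetical stand-in (`FakePipeline`, `FakeEntity`, `ExecuteView`), not the actual PipelineManagerDX11, Entity3D, or ViewPerspective code; it only counts how often the clear would run.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the pipeline manager; it only counts calls.
struct FakePipeline {
    int clearCalls = 0;
    int drawCalls  = 0;
    void ClearPipelineResources() { ++clearCalls; }
    void Draw() { ++drawCalls; }
};

// Hypothetical stand-in for an entity that optionally clears before drawing.
struct FakeEntity {
    void Render(FakePipeline& pipeline, bool clearPerEntity) const {
        if (clearPerEntity) pipeline.ClearPipelineResources();
        pipeline.Draw();
    }
};

// Hypothetical view-level task: with hoistClear set, the clear happens once
// per view instead of once per entity.
inline void ExecuteView(FakePipeline& pipeline,
                        const std::vector<FakeEntity>& entities,
                        bool hoistClear) {
    if (hoistClear) pipeline.ClearPipelineResources();
    for (const FakeEntity& entity : entities) {
        entity.Render(pipeline, !hoistClear);
    }
}
```

With 120 entities, the per-entity variant clears 120 times per view while the hoisted variant clears once; the number of draw calls is unchanged.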
May 19, 2014 at 12:29 AM
Edited May 19, 2014 at 12:30 AM
I switched to release mode, and with a fresh version of Hieroglyph the mirror mirror sample runs at ~130 fps. That is still lower than yours. Do you think this is due to my 1.6 GHz Core i5 CPU?

When I move the call to ClearPipelineResources into ViewPerspective::ExecuteTask, the frame rate increases by 20 fps.

Finally, the best news is that when I build my application in release mode, it runs at 430 fps. This problem had obsessed me for about a week, and now it is a heavy weight off my mind.

I just want to ask you a final question. Why is there a big difference in speed between release and debug mode in your engine?

P.S. Thanks for spending time helping me with this. Sorry for my bad English.
May 19, 2014 at 12:36 AM
Edited May 19, 2014 at 12:37 AM
jzink wrote:
I was thinking about this more, and if you only have a single entity, which is issuing all of the draw calls, then you should only be calling ClearPipelineResources once, right? If that is your biggest CPU usage, and it is only called once per frame, then you should be achieving very high framerates.
I have two versions: one with multiple entities, which runs at 50 fps, and another with only one entity, which runs at 70 fps.
Coordinator
May 19, 2014 at 7:11 AM
The main reason that there is a big difference is that when you have very simple rendering techniques, the GPU can finish much earlier than the CPU for a given frame rendering. That means that the CPU is the primary bottleneck, so anything that you can do to speed it up will make it much faster. In release mode, the compiler and linker optimize for speed and remove all of the debugging information so that it runs much faster.

I wouldn't expect such a big difference (i.e. from ~70 in debug up to ~400 in release), but it indicates that your CPU is the bottleneck in debug mode. Is your processor a quad core or a dual core? If it is only dual core, you may want to run in single-threaded mode, or possibly change the NUM_THREADS constant to 2, which may help!
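As a rough illustration of matching the worker count to the machine, the sketch below clamps a requested thread count to what the hardware reports. `ChooseWorkerThreads` is a hypothetical helper, not part of Hieroglyph; per the post above, NUM_THREADS in the engine is a constant that you would edit by hand.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: never use more workers than the hardware offers,
// and always use at least one.
inline unsigned ChooseWorkerThreads(unsigned requested, unsigned hardwareThreads) {
    if (hardwareThreads == 0) return 1;  // hardware_concurrency() may return 0 when unknown
    return std::max(1u, std::min(requested, hardwareThreads));
}

// Convenience overload that queries the current machine.
inline unsigned ChooseWorkerThreads(unsigned requested) {
    return ChooseWorkerThreads(requested, std::thread::hardware_concurrency());
}
```

On a machine that reports two hardware threads, a request for four workers would be reduced to two, which matches the suggestion of setting NUM_THREADS to 2.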