Performance regression

Dec 17, 2011 at 10:16 PM

After downloading the newest revision 72383, I decided to run some tests and see whether there are any performance gains from the included set of changes that reduced the number of API calls even further. To my surprise, this revision actually causes a huge performance regression compared to the earlier revision 72208. I tested two samples: DeferredRendering and LightPrepass. Here are my results (I know they aren't very precise, but the regression is so big that I don't think more accurate numbers are needed here):

Revision 72208:

DeferredRendering: FPS fluctuates between 2000 and 2600, mostly around 2200.

LightPrepass: 1100-1500, avg. 1200.


Revision 72383:

DeferredRendering: 1400-1800, avg. 1600.

LightPrepass: 400-450, avg. 430.


As you can see, there is a huge difference in performance; the LightPrepass case in particular looks strange. All samples were built in release mode and executed in the same environment (Win7 x64 SP1, GeForce GTX 460 with 290.36 drivers), and they were of course compiled with the same software stack. Also, I haven't changed any of the samples' code or runtime settings (options such as the number of lights). It seems that the OutputMergerStageDX11 modernisation introduced a performance regression.

Greetings.


Dec 18, 2011 at 12:20 AM

That's a performance regression for sure, but it's not that huge.

For the DeferredRendering demo, the latest version only takes an additional ~0.17 milliseconds per frame.
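That figure falls straight out of the averages quoted above; a quick standalone check, nothing more:

#include <cstdio>

int main()
{
	const double before = 1000.0 / 2200.0;	// avg frame time at ~2200 FPS, in ms
	const double after  = 1000.0 / 1600.0;	// avg frame time at ~1600 FPS, in ms

	std::printf( "delta = %.3f ms\n", after - before );	// prints "delta = 0.170 ms"
}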

Coordinator
Dec 18, 2011 at 1:00 PM

Thanks for the heads-up.  I will take a look at this later tonight when I get home, but my first guess is that the additional CPU time spent comparing the states causes more overhead than it prevents on CPU-bound machines (IIRC, Sephirothusi has a kickin' GPU...).  I'll report what I can figure out later today!

- Jason

Coordinator
Dec 18, 2011 at 8:58 PM

Very interesting...  I have reproduced the results that Sephirothusi reported (although with proportionally lower numbers due to my GPU).  I also checked the number of API calls in each version of the Light Pre-pass and the Deferred Rendering samples, and the latest version does show a relatively large decrease in API calls in both.

After checking the differences between the two revisions mentioned above, I managed to track the performance difference down to the OutputMergerStageStateDX11::CompareRenderTargets(...) method. If I fake out this method to always return something greater than 1, then performance returns to the higher values.  Here are the contents of that function:

int OutputMergerStageStateDX11::CompareRenderTargets( OutputMergerStageStateDX11& desired )
{
	int count = 0;

	// Scan from the highest slot downward; the first mismatch found is the
	// highest differing slot, so i+1 is the number of slots that need rebinding.
	for ( int i = D3D11_SIMULTANEOUS_RENDER_TARGET_COUNT-1; i >= 0; i-- )
	{
		if ( RenderTargetViews[i] != desired.RenderTargetViews[i] )
		{
			count = i+1;
			break;
		}
	}

	return( count );	// 0 means the two states match
}
//--------------------------------------------------------------------------------
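For reference, the "fake out" experiment amounts to replacing the body with an unconditional return, something like this (a diagnostic stub only, not committed code):

int OutputMergerStageStateDX11::CompareRenderTargets( OutputMergerStageStateDX11& desired )
{
	// Diagnostic stub: pretend every slot differs, so the caller always
	// re-issues the bind.  With this in place, performance goes back up.
	return( D3D11_SIMULTANEOUS_RENDER_TARGET_COUNT );
}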

This method simply compares an array of integers for the current state against the desired state - I have no idea why it is cutting performance down so much...  Even if I hard-code the for-loop to only check the first array entry, I still get the performance hit.  My assumption is that checking these arrays causes some memory coherency change, but another method that compares the unordered access view states is also called - and it doesn't affect performance!  I really don't understand why there is such a big difference here - if anyone sees anything suspicious, please chime in and let me know what you think it could be!

On a separate note, checking these states is actually not all that necessary, since the only time the ApplyRenderTargets() method is called is right after someone sets a new render target...  In that case, I could just skip the state checking and push the API call anyway - but I still want to know why this check makes such a big performance difference.
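In other words, something along these lines - note that everything here except ApplyRenderTargets() and OMSetRenderTargets() is a placeholder name, since I'm only sketching the idea:

void OutputMergerStageDX11::ApplyRenderTargets( ID3D11DeviceContext* pContext )
{
	// Sketch: skip the CompareRenderTargets() test and always push the bind,
	// since this method only runs right after a render target was changed.
	// m_uiTargetCount, m_ppRenderTargetViews, and m_pDepthStencilView stand in
	// for whatever the desired state actually stores.
	pContext->OMSetRenderTargets( m_uiTargetCount, m_ppRenderTargetViews, m_pDepthStencilView );
}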

Any help or ideas out there?

Dec 18, 2011 at 10:10 PM
Edited Dec 18, 2011 at 10:39 PM

I have changed the code to this:


int OutputMergerStageStateDX11::CompareRenderTargets( OutputMergerStageStateDX11& desired )
{
	int i = 0;

	// Scan upward and stop at the first differing slot.
	for ( i = 0; i <= D3D11_SIMULTANEOUS_RENDER_TARGET_COUNT-1; i++ )
	{
		if ( RenderTargetViews[i] != desired.RenderTargetViews[i] )
		{
			break;
		}
	}

	// Note: this can never return 0 - if the states are identical, the loop
	// runs to completion and returns D3D11_SIMULTANEOUS_RENDER_TARGET_COUNT+1,
	// so the caller always re-applies the render targets.
	return( i+1 );
}


And now the performance regression is gone (or at least it seems to be)!

I don't really know how to explain this though... I'm afraid my change alters the behaviour of the code in an unwanted way and breaks something...
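If the worry is the changed return value, a forward loop that keeps the original contract (return 0 when the states match, otherwise the highest differing slot plus one) would look something like this - an untested sketch:

int OutputMergerStageStateDX11::CompareRenderTargets( OutputMergerStageStateDX11& desired )
{
	int count = 0;

	// Walk upward but do not break early: remember the highest differing slot.
	for ( int i = 0; i < D3D11_SIMULTANEOUS_RENDER_TARGET_COUNT; i++ )
	{
		if ( RenderTargetViews[i] != desired.RenderTargetViews[i] )
		{
			count = i+1;
		}
	}

	return( count );	// 0 means the states are identical
}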

Greetings.


EDIT:

I am pretty sure that my change is doing something wrong, because whenever I make a version of this function whose loop iterates not from 0 up to D3D11_SIMULTANEOUS_RENDER_TARGET_COUNT-1 but from D3D11_SIMULTANEOUS_RENDER_TARGET_COUNT-1 down to 0, like in the original function, the regression appears again. Sorry for the trouble.

Greetings.

Coordinator
Dec 19, 2011 at 9:05 PM

I committed your change as you listed it above.  Even though it searches in the opposite order, it still detects changes in the buffers and triggers the update - so it is still functional as it is.  It is still a mystery why this is happening, but at least it is not regressing.  I'm still looking for a tip about what might cause such a performance hit...

Thanks for the code Sephirothusi!

- Jason

Dec 19, 2011 at 10:04 PM

Maybe some kind of CPU cache thrashing?
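If someone wants to test that hypothesis in isolation, a rough standalone microbenchmark could compare forward and backward scans of two small arrays - the sizes and iteration counts below are arbitrary, and results will vary by compiler and CPU:

#include <chrono>
#include <cstdio>

static const int N = 8;		// D3D11_SIMULTANEOUS_RENDER_TARGET_COUNT
static int a[N], b[N];		// zero-initialised, so the scans never break early

static int compareForward()
{
	for ( int i = 0; i < N; i++ )
		if ( a[i] != b[i] ) return( i+1 );
	return( 0 );
}

static int compareBackward()
{
	for ( int i = N-1; i >= 0; i-- )
		if ( a[i] != b[i] ) return( i+1 );
	return( 0 );
}

int main()
{
	using clock = std::chrono::steady_clock;
	volatile int sink = 0;	// keeps the compiler from removing the loops

	auto t0 = clock::now();
	for ( int n = 0; n < 100000000; n++ ) sink += compareForward();
	auto t1 = clock::now();
	for ( int n = 0; n < 100000000; n++ ) sink += compareBackward();
	auto t2 = clock::now();

	std::printf( "forward : %lld ms\n", (long long)
		std::chrono::duration_cast<std::chrono::milliseconds>( t1 - t0 ).count() );
	std::printf( "backward: %lld ms\n", (long long)
		std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count() );

	return( sink & 1 );
}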