The short version of this part is "Everything I know is wrong." Here's the long version...
I was concerned about performance, and I was getting my usual weird behavior whenever I timed anything in OpenGL. In particular, splitting a single large buffer of 75*75*75 cubes into 75 separate buffers made the scene take much longer to draw. I was worried that there was some overhead to the draw call that I would not be able to reduce. Still, it looked promising. The hacked version of instancing under OpenGL 2.1 seemed to work, and the times weren't completely ridiculous.
Cubes were still a problem though, since they are not all identical. I want to drop the hidden faces of cubes (I need to when they are transparent), and each face has a different texture pattern. Last week, I had decided the only way to do this was to instance cube faces, not whole cubes. But I kept at it anyway, trying to build a shader that would draw whole cubes.
I finally got something to work which wasn't too slow, so I tried it out on my other machines. Frustratingly, the ratio of instanced time to plain-buffer time varied on each platform. Overall, it looked like using instancing to cut the storage requirement would cost me 25% to 50% of my performance. So I stopped working on it.
Then Florian Bösch left a comment that explained my performance issue with multiple buffers. He said that the GPU can do the z-buffer test before running the fragment shader, provided the shader never sets the fragment depth (gl_FragDepth). This means that when a pixel is buried beneath existing graphics, the fragment shader is never called for it.
So even though I was drawing the same collection of cubes in both cases, the drawing order matters. I was skeptical, since I wasn't sorting the cubes -- just generating them with nested loops, in x, y, then z. But when I looked at the code more carefully, I realized he was right. My 75-separate-buffers case was actually drawing them from the bottom up (increasing y), while the single-buffer case was drawing them from left to right (increasing x). This does result in a different number of obscured pixels, and so a different draw time.
I went ahead and fixed this, and then got the same time for the single-buffer and multiple-buffer cases. After checking that the other instancing versions drew cubes in the same order (it didn't change their results), I figured I was done. A couple of days later, the truth hit me...
All along, I've been assuming that fragment shaders were the important things. After all, a 10 by 10 textured rectangle has only 4 vertexes, but 100 pixels. So I thought "who cares about vertexes? All the time is in the fragments." But if the GPU is never calling the fragment shaders for buried pixels, that's not true.
In the extreme case, if I drew my graphics from front to back, only one fragment shader per screen pixel would ever be called. All the rest of the pixels for the back cubes would be buried and their shaders skipped. So the time to render a frame would be dominated by the vertex shaders, not the fragment shaders.
To test this, I turned the eye away from the graphics in my little instancing test. Since it doesn't do anything smart like check the view frustum, it was still drawing the same 75*75*75 cubes, even though all of them were clipped. And sure enough, the frame time only dropped by a couple of milliseconds. Most of the time is in the vertexes.
I then took McrView and added a "discard" to the start of all the fragment shaders, to cut that time out. And again, the total frame time only dropped by 2-3 milliseconds out of 25ms on my most complex scenes.
This changes a lot of my assumptions. On the instancing issue, I had hoped that even though the vertex shaders were slower with my instancing hack, it wouldn't matter. That hope is gone... It really is going to cost me a lot of time to save that display memory.
On the other hand, I had resisted doing anything fancy to combine adjacent faces in the output. I had thought that since one big rectangle textures as many pixels as two smaller ones, it wouldn't make much difference. Now I'm thinking that it might make a big difference. So I need to investigate that.
I also spent last night reading another blog called 0 FPS. I had started reading it in January 2012, when a post of his referred to this blog and I saw the reference in my log. Then his blog went dead, with one post in the middle of April and nothing more until the end of June. By then I had dropped it off my Google home page.
Recently a reader mentioned it, and I went back to check it again. There are several newer posts and they are mostly relevant to things I've been struggling with this year. So I have a bit of reading and thinking to do. In particular, he talks about generating meshes from Minecraft-like voxels. This is exactly what I need if I'm going to reduce the number of vertexes I send to the display.
He also talks about marching cubes, dual contouring (which I want to implement for SeaOfMemes), and simplifying isosurfaces. That gave me an idea.
The other thing I did this week was finally plug my ObjectStore class from Part 53 into Crafty. To get some sample data, I grabbed pieces of McrView and made an exporter to convert my copy of the TwentyMine server into my format.
The copy of the TwentyMine database I have covers an area about 4000 by 6000 meters, so I'm also interested in seeing what distant scenery I can generate from it. To that end, I had the converter save the heights and top block for all positions on the map. It saves that at multiple resolutions. I want to see how this looks as a heightmap in the style of the block world I showed you last week. I don't have that coded yet though.
More next week.