|Home | Downloads | Screenshots | Forums | Source code | RSS|
May 24th 2017, by StapleButter
In the previous post, I said I wanted to run the 3D renderer on a separate thread. Well, we're going to see how all that works in detail.
Ever since 3D rendering was added into melonDS, it's been one of the bottlenecks whenever games use it. On the other hand, 2D rendering, while not being very well optimized, doesn't make for a big performance hit.
2D rendering isn't very expensive or difficult though -- the 2D renderers are oldschool tile engines, essentially drawing raster graphics onscreen at the specified coordinates, optionally with a bunch of fancy effects, but nothing too complex. In comparison, the 3D renderer is a full-fledged 3D GPU. It basically turns a bunch of polygons defined in 3D space, into a 2D representation that is then passed to the main 2D renderer and composited mostly like a regular 2D layer.
Transformations by various matrices, culling, clipping, viewport transform, rasterization with perspective-correct interpolation... it's a bunch of work.
The approach originally taken was to render a whole 3D frame upon scanline 215. I'm not sure whether rendering should start upon scanline 214 or 215 (GBAtek says it starts "48 scanlines in advance"), but melonDS starts at 215.
Which basically meant that the emulator had to wait until the whole 3D frame was rendered before doing anything else.
However, in emulation, you can't just throw everything on separate threads. Considering a component, whether you can put it on a separate thread depends on how tightly it is synchronized to other components.
This excellent article from byuu explains well how all this works.
In the case of our DS 3D renderer, threading is feasible because tight synchronization isn't required. Once the renderer starts rendering, you can't alter its state, the polygon and vertex lists are double-buffered and all the other important registers are latched. The only thing you can do is change VRAM mapping, but I have yet to come across a game that pulls stunts like swapping texture banks mid-frame.
So you can put the rendering process on a separate thread, and just tell it when to start rendering, and wait until it's done before telling it to start a new frame.
That's not the approach I tried, though. Rendering starts upon scanline 215, but the 2D renderer needs 3D graphics as soon as scanline 0, which only leaves 48 scanlines worth of time (out of 263) for the 3D thread to do its job. We can do better.
The 2D renderer is scanline-based. So, upon scanline 0, it doesn't need a whole 3D frame, but only the first scanline of it. And similarly for further scanlines. This gives the 3D thread 192 scanlines worth of time, which is a lot better.
It required more work though, as the 3D renderer had to be modified to work per-scanline. But nothing insurmontable. After a while tracking threading bugs, this finally worked, and gave a nice speed boost. Except it caused glitches in Worms 2.
The glitches were because of the "wait for the frame to be finished before starting a new frame" bit. I first did it upon scanline 215, right before signalling the 3D thread to start rendering. Well, normally the renderer would be done by scanline 192 anyway.
But Worms 2 takes advantage of the fact that 3D rendering starts 48 scanlines in advance, which also means it finishes 48 scanlines in advance. So the game unmaps texture banks and starts loading new textures as soon as scanline 146.
Which would cause problems if the 3D thread was still running at that time. When I forced the main thread to wait for the 3D thread to finish upon scanline 144, the glitches went away.
In the end, the threaded 3D renderer gives a nice speed boost. On my laptop, Super Mario 64 DS runs fullspeed now, and Mario Kart DS is close to fullspeed too. And I haven't seen more glitches, so all should be fine.
It still needs a bunch of polish. I want to make it optional, and the 3D thread needs proper start/stop control.
Also noting that there will be no performance gain if you have a single-core CPU (the whole threading apparatus may even reduce performance).
And, of course, there are other targets to optimize too.
Love and waffles~
|4 comments have been posted.|