Threads!
In the previous post, I said I wanted to run the 3D renderer on a separate thread. Well, we're going to see how all that works in detail.


Ever since 3D rendering was added into melonDS, it's been one of the bottlenecks whenever games use it. On the other hand, 2D rendering, while not being very well optimized, doesn't make for a big performance hit.

2D rendering isn't very expensive or difficult though -- the 2D renderers are oldschool tile engines, essentially drawing raster graphics onscreen at the specified coordinates, optionally with a bunch of fancy effects, but nothing too complex. In comparison, the 3D renderer is a full-fledged 3D GPU. It basically turns a bunch of polygons defined in 3D space, into a 2D representation that is then passed to the main 2D renderer and composited mostly like a regular 2D layer.

Transformations by various matrices, culling, clipping, viewport transform, rasterization with perspective-correct interpolation... it's a bunch of work.


The approach originally taken was to render a whole 3D frame upon scanline 215. I'm not sure whether rendering should start upon scanline 214 or 215 (GBAtek says it starts "48 scanlines in advance"), but melonDS starts at 215.

Which basically meant that the emulator had to wait until the whole 3D frame was rendered before doing anything else.


However, in emulation, you can't just throw everything on separate threads. Considering a component, whether you can put it on a separate thread depends on how tightly it is synchronized to other components.

This excellent article from byuu explains well how all this works.

In the case of our DS 3D renderer, threading is feasible because tight synchronization isn't required. Once the renderer starts rendering, you can't alter its state, the polygon and vertex lists are double-buffered and all the other important registers are latched. The only thing you can do is change VRAM mapping, but I have yet to come across a game that pulls stunts like swapping texture banks mid-frame.

So you can put the rendering process on a separate thread, and just tell it when to start rendering, and wait until it's done before telling it to start a new frame.

That's not the approach I tried, though. Rendering starts upon scanline 215, but the 2D renderer needs 3D graphics as soon as scanline 0, which only leaves 48 scanlines worth of time (out of 263) for the 3D thread to do its job. We can do better.

The 2D renderer is scanline-based. So, upon scanline 0, it doesn't need a whole 3D frame, but only the first scanline of it. And similarly for further scanlines. This gives the 3D thread 192 scanlines worth of time, which is a lot better.

It required more work though, as the 3D renderer had to be modified to work per-scanline. But nothing insurmontable. After a while tracking threading bugs, this finally worked, and gave a nice speed boost. Except it caused glitches in Worms 2.

The glitches were because of the "wait for the frame to be finished before starting a new frame" bit. I first did it upon scanline 215, right before signalling the 3D thread to start rendering. Well, normally the renderer would be done by scanline 192 anyway.

But Worms 2 takes advantage of the fact that 3D rendering starts 48 scanlines in advance, which also means it finishes 48 scanlines in advance. So the game unmaps texture banks and starts loading new textures as soon as scanline 146.

Which would cause problems if the 3D thread was still running at that time. When I forced the main thread to wait for the 3D thread to finish upon scanline 144, the glitches went away.


In the end, the threaded 3D renderer gives a nice speed boost. On my laptop, Super Mario 64 DS runs fullspeed now, and Mario Kart DS is close to fullspeed too. And I haven't seen more glitches, so all should be fine.

It still needs a bunch of polish. I want to make it optional, and the 3D thread needs proper start/stop control.

Also noting that there will be no performance gain if you have a single-core CPU (the whole threading apparatus may even reduce performance).

And, of course, there are other targets to optimize too.



Love and waffles~
Slicing the melons!
So here are the two main goals for melonDS 0.3: threading the 3D renderer, and starting work on wifi connectivity.


The first goal basically aims at running the 3D renderer in parallel with the rest of the emulator. On the hardware, the renderer's state can't be altered while it is rendering, so the timing doesn't have to be precise, and we can use it to our advantage. As the current 3D renderer is a bottleneck, threading it should give a nice speed boost for multi-core CPUs (which are quite the norm nowadays).


The second goal doesn't mean wifi will work, but hey, we need to start somewhere.

Well actually wifi emulation has already been upped a notch compared to 0.2. The wifi RAM and associated registers are functional, as well as most of the timers. What remains to be done is functionality for sending/receiving packets. Power management and specific multiplayer features also need proper investigation. Then we have things like the RF and BB chips, which will likely never be fully emulated since they control very low-level aspects of wifi, like "how much energy should it take to consider we're receiving data".

So where does this get us? I have tested Pictochat and NSMB multiplayer, and in both cases, the host sets up a beacon and attempts to send it regularly, which is a good sign.

(The beacon is a packet regularly sent by wifi access points to advertise their presence. Since the DS doesn't support ad-hoc communication, multiplayer games use a similar scheme to communicate, typically with the first player acting as a host and other players being clients.)

Anyway, don't get too hyped over this, there's nothing too new here. DeSmuME and NO$GBA both get atleast this far if not further.


In the meantime, I've been implementing another obscure feature of the DS: writable VCount.

Old consoles typically have a register that reflects which scanline is being drawn onscreen. There are various names it can be called (LY on the GameBoy, VCOUNT on the GBA/DS), but it's essentially the same thing.

The DS has the particularity that the VCount register can be written to. This feature is typically used to synchronize consoles during multiplay, but there are other possible uses -- ZXDS uses it to bring the refresh rate to 50Hz.

GBAtek doesn't provide a whole lot of details about this and how it works, besides a recommended VCount value range of 202-212. The I/O map listing also hints that VCount is only writable from the ARM7, but that seems to be a copy-paste error, I observed that VCount is writable from both CPUs.

So I set out and investigated it, which gave fun results, as always with the DS.

I immediately tried writing to VCount during active display (scanlines 0-191). And, surprise, it works. The GPU then continues drawing from the updated VCount, but the screens keep going from their old position as the LCD controllers can't skip forward or backward.

It's also worth noting that VCount writes are delayed until the next scanline, which has the advantage that it makes timings reliable.

The effects of VCount writes only last for one frame, things go back to normal after the end of the frame (unless you write to VCount every frame).

Implementing this feature in melonDS was also the perfect occasion to make the framerate limiter suck less. It had to be upgraded anyway in order to support adjusting the framerate cap based on how long the frame lasted. So for example, if you ran ZXDS, melonDS would run at 50 FPS instead of 60. The new framerate limiter is also more accurate, which makes for nearly perfect audio output.

(ZXDS wouldn't run currently due to the lack of DLDI support, but you get the idea)
More fun fixes
So as I said in the last post, I've been fixing the firmware boot issues. Basically, when booting from the firmware, there was a high chance that the game would crash upon boot. But the bug and its effects were random. Sometimes it would simply hang on a white screen, sometimes it would jump to an invalid address... and sometimes it would work fine.


So we're going to see how we try to fix a random bug.


First thing to do is finding a way to reproduce the bug consistenly. So, hoping my bug would be affected by the time taken before clicking the health/safety warning screen, I set up a quick hack to automatically click the screen after a fixed number of frames. That didn't cut it, so I made the RTC return a fixed time. Which finally allowed me to reproduce the bug reliably.

I landed on a variation where the ARM9 jumped to an invalid address, which made it easy to find where it was doing that. I then started backtracking, which eventually led me to find out that some data were being copied from the wrong place, but it appeared the bug had more consequences than anticipated, making the backtracking long and tedious.

So instead, I dumped the RAM at the time the game booted, and compared it with a RAM dump from before a direct boot. It appeared that a chunk of code got accidentally erased. The range that got erased was within the cart's secure area, so I checked the code that handled secure area reads, but the bug wasn't there.

So I tracked where that code region was getting erased. The bug appeared, and it was another stupid bug. It turned out to be a DMA from the ARM7 accidentally doing that.

The bug was that during cart transfers, melonDS tried to start DMA for both CPUs. The cart interface can only be enabled for one CPU at a time, and thus DMA should only be checked for that CPU.

The bug resulted from a combination of factors:

1. The ARM7 BIOS loads the cart secure area, using DMA. When it's done, it leaves the DMA enabled(!). So when the ARM9 went to load something else from the cart, it accidentally triggered the ARM7-side DMA.

2. The secure area is made of 4 blocks, which the BIOS loads in a random order. So, if the last secure area block loaded happened to be the last one, the bug wouldn't overwrite the secure area and would only write after it -- which isn't a big deal because the rest of the game binary is loaded later. So in that rare case, the game would boot fine.


Was a fun ride, though.
Fixing Pokémon White
This post will get into the bugfixing process for a particular bug. Bugfixing in emulation may look like black magic, but it's not so different from general bugfixing-- it boils down to understanding the bug and why it occurs. Of course, emulation makes it harder when it involves blobs of machine code for which you have no source code, but nothing insurmontable.


Anyway, the issue with Pokémon White (and probably others) was that the game wouldn't boot unless it was launched from the firmware. Not really convenient, especially as at the time of writing this, firmware boot in melonDS is unstable.

I first suspected the RAM setup done prior to a direct boot (NDS::SetupDirectBoot). There are several variables stored in memory by the firmware, which games can use for various purposes. For example, the firmware stores the cartridge chip ID there, games then check the cartridge chip ID against the stored value, and throw an exception if it has changed (which typically means that the cart was ejected).

However, some testing revealed that there was nothing the direct boot setup was missing that could have broken Pokémon, atleast regarding the RAM.


So I had to dig deeper into the issue. It turned out that during initialization, the ARM9 got interrupted by an IRQ, but for some reason, it never returned to its work after processing the IRQ.

DS games often use multiple threads, so it isn't uncommon to switch to a different task after an IRQ. But that wasn't the case here, it wasn't even picking a thread to return to. It got stuck inside the IRQ handler.

The IRQ occured upon receiving data from the IPC FIFO. In particular, the ARM7 sent the word 0x0008000C. The ARM9-side handler was coded to panic and enter an infinite loop upon receiving this word.

More investigation of the FIFO traffic showed that the ARM9 first sent the word 0x0000004D, which is part of the initialization procedure. To which the ARM7 replied with 0x0008000C. But it appeared that the ARM7-side FIFO handler was coded to do that. For a while, that stumped me. I couldn't understand how it was supposed to work.


I then logged FIFO traffic when booting the game from the firmware, whenever it successfully booted, to see where the exchange differed. The ARM9 sent 0x0000004D, to which the ARM7 replied 0x0000004D. But it appeared that in that case, the ARM7 was using a different handler.

I checked the ARM7 code responsible for setting up the FIFO handler. It first set up the 'good' handler, then called a function, then set up the 'bad' handler. Looking at the function inbetween, I noticed that it checked register 0x04000300, which is POSTFLG, and is set to 1 by the firmware.

Hey, what if...

So I quickly checked the direct boot setup, and it wasn't initializing the POSTFLG registers. And, surprise, addressing that fixed the issue, letting Pokémon White boot directly.


Well, not bad. Next up is fixing the issues that plague firmware boot, I guess.
melonDS 0.2, there it is
You can check it out on the downloads page.





I'll let you find out what the novelties are ;)
Breaking the silence
You may have noticed that there's one thing melonDS 0.1 lacks sorely: sound. It's an integral part of the gaming experience.

It's also one of the big items in the TODO list, so let's do this.


Getting basic sound playback going wasn't too difficult. The DS sound hardware is pretty simple: 16 channels that can play sound data encoded in PCM8, PCM16 or IMA-ADPCM. Channels 8 to 13 also support rectangular waves (PSG), and channels 14 and 15 can produce white noise.

I first worked on PSG as it's pretty simple, there would be less things that could go wrong. And indeed, the issues I got were mostly dumb little things like forgetting to clear intermediate buffers. Once that was working, I completed it with support for the remaining sound formats.


And there it was, sound in melonDS. That thing finally begins resembling a DS emulator :)

The sound core in melonDS is synchronous. The major downside is that it sounds like shit if emulation isn't running fullspeed. But it's been stated that the goal of melonDS is to do things right. There are other reasons why it has to be synchronous, too; the main one is sound capture.


The DS has two sound capture units, which can record audio output and write it to memory. Those are used in some games to apply a surround effect to the audio output, or to do reverb. The idea is to send the mixer output to the capture units instead of outputting directly, then use channels 1 and 3 to output the captured audio data after it's been altered by software.

Setups using sound capture expect that capture buffers will be filled at a fixed interval. This breaks apart if your sound core is asynchronous, because there is no guarantee that it will produce sound at a regular rate. What if you end up producing more sound than the game's buffer can hold? In these situations, the best way is to ignore those effects entirely. Same reason why the old HLE audio implementation of Dolphin didn't support reverb effects.


Anyway, sound output is still in the works, but it's fairly promising so far.


Oh by the way, have I mentioned that there are other DS emulators being worked on? Check out medusa!
melonDS 0.1 is out!


It's here, finally! Without further ado, check it out on the downloads page!

You can also read the release notes on the board for more information.
The aging cart
So I got the UI far enough to be able to run things again. Of course, I tried the aging cart.

It's a nice help when it comes to making emulators accurate. It has a nice big set of tests. The code is also well-structured and not too hard to understand. There is a table at 0x021F2FD0 that lists all the tests, pointing to each test's name and the corresponding function.

melonDS failed the DMA priority test, like DeSmuME.

The failure was actually not related to DMA priority, which is handled correctly in melonDS. The DMA priority test runs two DMA transfers, a long one that starts immediately and a short one that starts upon HBlank but has higher priority. Each DMA fills a buffer with values pulled from a shared timer. It then checks the continuity of the stored values to find out whether and when the long DMA was interrupted.

This pointed to something I wanted to do since a while: rework timer emulation in melonDS as it was a bit complex and grossly inaccurate. After doing so, melonDS passed the DMA priority test, but it also fixed a bunch of issues, like FMVs playing at shitty speeds.

After implementing some more obscure DMA types, I was able to pass more tests. For example, DMA type 3 is triggered at the start of each scanline, but it runs on scanlines 2 to 193 included, and, unlike HBlank DMA, it always stops on scanline 194. It's not clear what purpose this DMA type would serve -- maybe it was intended for some external device acquiring video from the screens.

Then came DMA type 4, which is used for feeding the display FIFO. The current implementation in melonDS is a gross hack, but it is pretty much impossible to use the display FIFO without DMA due to its tight timing, and emulating it properly would be resource-intensive. The display FIFO is another obscure feature -- I don't know of any retail game that uses it, but I have yet to be surprised.

At this point, melonDS gets to the capture control test, but fails it. Again, the issue is unrelated to the screen capture logic. The test renders 3D graphics, and checks correctness by checksumming the captured image. Which basically requires pixel-perfect 3D graphics.

The aging cart has been a fun ride so far, and it's still far from being finished :)
melonDS 0.1: soon a thing!
And yet, it may very well take a while.

As stated in the TODO list, the main remaining thing to do, besides fixes to timers, is building an actual UI for melonDS. The current one was something I had thrown together quickly early in development so I could see graphics. melonDS was barely beginning to run things, it supported VRAM display so it could run ARMWrestler, then got some extra graphics support, enough to render the DS firmware interface.

Things have changed a lot since then. melonDS now has an almost-complete 2D renderer with more accurate colors than the other emulators out there, and even a 3D renderer that is more than satisfying for a first release. It got support for a bunch of things that don't sound very exciting but are required to get games working. All in all, I believe we have a fairly solid emulator base.

But the interface is still lame. It still loads a ROM from a hardcoded filename. It's still a barebones video output and a console spewing nonsense as the game runs. It still has hardcoded (and retarded) key mappings.

The current interface is made of Windows-specific code, which is why I didn't build upon it; I want something cross-platform.

I'm still having trouble deciding what I will work with for this UI. I'm considering libui and SDL, if I can get them to cooperate. I want something lightweight, melonDS is going to stay pretty simple. As for SDL, I'm going to need it (or an equivalent library) for things like joystick input or audio output.

I initially hoped to be able to stay away from Visual Studio as far as Windows is concerned, but to my regret, alternate solutions are more or less of a headache to get working. So I'm likely going to ditch the CodeBlocks project and use CMake.

Which reminds me why I have been postponing the UI stuff: I would rather work on interesting emulation stuff than go through this trouble for a cross-platform UI. I recently discovered that there is an aging cart (Nintendo-internal test ROM) for the DS, and of course, I want to run it in melonDS to see how good it does. None of the emulators I know of pass the whole test, which means that either the test ROM is tricky or the emulators aren't as good as we think. Well, in its current state, melonDS wouldn't pass it either, but there's room for improvement.

No screenshots for this post, I'm keeping some surprise for the release ;)
VRAM? Lots of fun!
First of all, a little status update. From the previous list, DMA timings are covered, they're properly emulated except for cart DMA. I have also been fixing several GXFIFO related bugs, which lets most games run fine. Some badly programmed games (SM64DS or Rayman RR2, using immediate DMA instead of proper GXFIFO DMA) still overflow the FIFO a little, and since I haven't yet figured out why, I added a little hack to cover that for now.

Improving the 3D renderer, the other big TODO item, mostly means implementing textures. And also that weird focus I developed on making the renderer pixel-perfect.

Pixel perfection can be postponed. Most games would still be playable fine without it.

Textures mean supporting texture memory, which is a special case of VRAM mapping. I first want to revamp VRAM emulation to better match the hardware. There are notes about VRAM mapping in the melonDS code, and they have been there since a while now, waiting for a proper implementation.

Thing is, VRAM mapping on the DS isn't linear like with most old consoles. You get 9 VRAM banks of different sizes (from 16K to 128K), each can be mapped to different addresses for different purposes. GBAtek documents the mapping modes for each, but it's missing details like how the banks are mirrored and what happens if two or more VRAM banks overlap (which does happen in some games). So, I wanted to test that.

I first assumed the mapping was 1:1, as most DS emulators out there handle it. Which would mean that if two banks overlapped, it would either map the last one, or apply a priority order. So the first test was to map banks A and B to different addresses, write 0x1111 into A and 0x2222 into B, then map them at the same address and read, then do it again but map in a different order. The results would tell whether A has priority over B, or B over A, or whether it just maps the last bank.

The result wasn't quite what I expected. The tests didn't read 0x1111 or 0x2222, they both read 0x3333.

VRAM mapping isn't a 1:1 table. It basically just tells each bank which address range it should respond to. When two banks share an address range, they respond at the same time: writes go to both banks, and reads read from both banks too, ORing the values together, as shown above.

VRAM mirroring is also funky, each bank is mirrored in its own way, and the mirroring scheme doesn't depend on the bank's size. But for special cases like extended palettes or texture memory, mirroring doesn't matter.

I have yet to finish covering the VRAM mapping oddities without completely killing performance, and make the 2D GPU code suck less. Once that's done, we can get to the fun part of implementing textures in the 3D renderer.