melonDS 0.7.1 -- here it is!
As title says.

We're not showing screenshots because they wouldn't be a good medium for conveying the changes in this release.


The biggest change here is that the core timings were entirely renovated to try to get closer to the hardware.

First of all, after several days of gruelling testing and guesswork, we were finally able to understand the GX timings, and emulate them properly for the first time. But, as we weren't gonna stop there, we also renovated the timings for DMA and memory accesses, so that both of them are closer to their hardware counterparts. DMA and ARM7 should be pretty close to perfection now, ARM9 less so but it's still more realistic.

We have also been fixing the emulator's main loop, so that the ARM9, ARM7 and system clocks shouldn't desync anymore.

All of this, with a few added optimizations, fixes a whole bunch of issues, from things flickering to audio crackling to games outright going haywire (hi RaymanDS). Your games are now running better than ever!

... or not. That's also the point of the 0.7.1 release: I want to hear about any issues caused by the timing renovation, so we can get them fixed for the epic 0.8.

We already have one such issue: all of this is causing sprite flickering in Pokémon Platinum. Quick attempts at fixing it went nowhere, so we will have to investigate it properly.


There's also a number of misc fixes. For example, the 3D glitches that showed up in Mario Kart DS were fixed, but there's already a lengthy post about this.

There's a small fix to 2D windows, nothing really noteworthy, just fixes a game that was setting up backwards windows.

The input system no longer requires a dpad to be mapped to directional keys for joystick axes to work.

The code that looks for local files (melonDS.ini, BIOS, firmware...) was modified to explicitly check the directory melonDS is in, if that is not the same as the working directory. So in these cases it will no longer fail to find its files. Also, if melonDS.ini is absent, the preferred directory for creating it is the directory melonDS is in.

And finally, we got rid of the old savemem type autodetect code. Considering its complexity and the amount of failures, it was a trainwreck. Instead, melonDS will now pull that information from the provided ROM database (romlist.bin).

Edit- If your game is still failing to save: you might have an old failed save file lying around; try deleting it. When a save file is present, melonDS determines the savemem type from the file size, so if a file is present with a wrong size, you will have to delete it.
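For illustration, here's a minimal sketch of what that file-size check could look like. The size-to-type table is an assumption based on typical DS save memory sizes, not necessarily the exact table melonDS uses:

#include <cstdint>

// hypothetical savemem types and sizes, for illustration only
enum SaveType
{
    Save_Unknown,
    Save_EEPROM_512B, Save_EEPROM_8K, Save_EEPROM_64K, Save_EEPROM_128K,
    Save_Flash_256K, Save_Flash_512K, Save_Flash_1M, Save_Flash_8M
};

SaveType SaveTypeFromFileSize(uint64_t size)
{
    switch (size)
    {
    case 512:          return Save_EEPROM_512B;
    case 8*1024:       return Save_EEPROM_8K;
    case 64*1024:      return Save_EEPROM_64K;
    case 128*1024:     return Save_EEPROM_128K;
    case 256*1024:     return Save_Flash_256K;
    case 512*1024:     return Save_Flash_512K;
    case 1024*1024:    return Save_Flash_1M;
    case 8192*1024:    return Save_Flash_8M;
    default:           return Save_Unknown;   // unknown size: fall back to romlist.bin
    }
}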


Enjoy!


Windows 64-bit
Linux 64-bit

melonDS Patreon
Update
I had a few bugs in my implementation. After fixing those, the timings I got were closer to what they should have been. Too fast, but not way too fast.

Anyway, I had issues with my test app too. One of the memory buffers it accesses was in main RAM instead of DTCM; fixing that oversight gave me more realistic timings. In reality, the Millionaire issue was what I first thought-- on hardware, the FMV decoder completes in 146 scanlines, and I was being too slow.

So, obviously, fixing the timings made it fast enough that the glitch was entirely gone. But, aside from that, we were back to square one: running too fast and the issues it causes everywhere. Rayman DS is a good test for that, and it was shitting itself big time, so... yeah.

If we raise the cached memory timings to 2 cycles, as some sort of average between cache hits and cache misses, we get closer to the real thing. Still a bit too fast, but this time Rayman DS is running normally.


However...


The new timing logic, coupled with PU emulation, is a pretty noticeable speed hit.

We can keep the PU logic for a potential future 'homebrew developer' build, that would also have several features to warn against typical DS pitfalls. For example, some form of cache emulation and/or warning about things like DMA without flushing/invalidating caches, 8-bit accesses to VRAM, etc...

If there is demand for such a build, that is.

As far as regular gamers are concerned, we can go for a faster compromise that would run most of the shit (really most of it) well. It wouldn't run things like GBA emulators that make extensive use of the PU, but... eh.

All the commercial games around use the same PU settings. The same mostly goes for libnds homebrew: its settings are a little different (for example, putting the DTCM at 0x0B000000 instead of the typical 0x027E0000), but nothing terrible there. All the differences we can observe are essentially in where the DTCM goes.

Since the DTCM is already handled separately (in CP15.cpp), we can handle its timings separately, and use a coarse table for the rest of the memory regions, like we did before. All while retaining enough of the new timing model that melonDS's timing characteristics would be much closer to the real thing.
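To illustrate the idea, here's a minimal sketch of what such a lookup could look like. The table granularity and cycle counts are placeholders, not measured values, and the names are made up:

#include <cstdint>

uint8_t  MemTimings[256];     // coarse table: one entry per 16MB region of the address space
uint32_t DTCMBase, DTCMSize;  // kept up to date by the CP15 code

int ARM9AccessCycles(uint32_t addr)
{
    // DTCM is checked separately, since its base address is configurable
    if ((addr - DTCMBase) < DTCMSize)
        return 1;                      // TCM access: no bus involved, single cycle

    return MemTimings[addr >> 24];     // everything else: coarse region lookup
}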

As explained before, it's near impossible to perfectly emulate ARM9 timings due to the sheer complexity of the architecture and the way it's implemented in the DS, so if we can get 'close enough', might as well do it without wasting too much performance over it.
Timing renovation, ep 4: world of shit
Took a while, but I have finally implemented the new memory timings, at least CPU-side. DMA is not done yet, but will be quick to do.

Note that these timings, at least ARM9-side, would still be faster than the real thing, mostly due to not emulating cache misses and a lot of other little things. But they should be a lot closer.


And, how does that fare?

So much for that.

For now it's a total fucking trainwreck. Everything is running too slow and there are quite a few issues with it. For example, Rayman DS is slow as shit.

And even worse, the Millionaire FMV bug.

Since it still happened, I extracted the FMV decoder code to run it within a test app, so it runs under the same conditions: code in ITCM, data it accesses in main RAM, cache enabled over similar regions, etc.

What's the result? On hardware, the decoder takes ~724 scanlines to run. Which means it actually spans several frames, and, since it's set to start upon VCount=20, by the time it would be done, VCount would be 218 (a frame is 263 scanlines, so (20 + 724) mod 263 = 218). That's past scanline 214, so under these conditions we would be getting glitches on hardware too. And yet, the thing runs fine on hardware.


I don't fucking understand it.


I fail to see how all that is possible.


Unless libnds is changing some setting that radically affects timings.
Timing renovation, ep 3: the new GX timings
So, this is it. GX timings are covered.

After days of feeding specific command lists to the geometry engine, measuring how long they take to run, and trying to figure out the logic behind it, we finally did it. And implemented it in melonDS.

I have yet to write a post going into that in detail, but it will go on the board.

I believe it was definitely worth it.


Looking at history, it is apparent that first-gen DS emulators ran into issues caused by display lists taking too long to execute, despite following the timings given by GBAtek. Measuring their timings, we can guess that they went for the easier solution and lowered their per-command timings. Nothing wrong with that, if it gets games running, and gets them running at better speeds on lower-end platforms, but that ain't accurate.

In a less "accuracy horseman" note, we'll quote byuu's article on accuracy, again:

If you do not get the timing perfect, you will end up playing a perpetual game of whack-a-mole. Fixing one game to break two others, fixing them to break yet two more. Fixing them to break the initial game once again. You need only look at the changelogs over the past fifteen years to verify this.


You might not end up needing absolute perfection there, but history has shown that, if you don't have the basic logic down, hacking around timing issues can only get you so far.

A prime example may well be Burnout Legends, which JMC47 mentioned in his blog post The next generation of DS emulators. The game seems to have built-in frameskipping or slowdown compensation, but it's not working correctly on emulators, resulting in random slowdowns or speedups.

I haven't looked into this game in detail, though, so I don't know for sure what it's doing.

But a possibility would be that the game knows ahead of time how long its 3D scene will take to render. This is possible if you have measured your individual display lists by running them on hardware, and given them metadata indicating their time cost. From there, if you know which objects are going to be shown, you know how long the scene will take to render. You can then use that knowledge to keep your game logic in sync even if you're dropping frames, so the game doesn't appear to run slower.

But of course, if your emulator's GX timings don't match the hardware's, this falls apart. And if a game is doing that kind of thing, no amount of hodgepodging GX timings will fix it if you haven't gotten the logic right.

Anyway, I tested Burnout Legends with the new GX timings, but I could hardly judge whether it was correct as the framerate dips below 60FPS on my PC. So I asked JMC47, who gave it a try and said that the game is now running as it should.


Coming back to our problem games, those that overflow the GXFIFO: Super Mario 64 DS, and also Rayman Raving Rabbids 2. SM64DS was the worst offender, with more than 10000 stall cycles per frame, against a ~1800 cycle average for Rayman RR2.

None of the other games I tested overflowed the FIFO. They're generally somewhat well-programmed, and use the adequate DMA mode which avoids overflows.

After the timing renovation, these two games still overflow the FIFO, but much less. The stall cycles were nearly halved in SM64DS, and the reduction in Rayman RR2 is tenfold.

I guess this is how they work. What was less nice was that Rayman RR2 got regular music streaming issues from the stalls. So, quickly, we fire up a new hardware test, and find out that GXFIFO stalls don't halt the ARM7. Once this is addressed, the audio issues are gone.


So, in the end, this first part of the timing renovation turns out rather well.

Now it's time to get to the more daunting part: the general timings. Mostly memory access timings. I have these down already, so it's mostly a matter of implementing them in a way that doesn't slow things down.

And, regarding what I said above, I hope that I don't run into that kind of problem. Implementing the ARM9 caches is possible, but has a performance cost. So, if I hack around it instead, the game code would likely just end up running too fast. If this causes problems, we will have to hack around them, or propose ARM9 cache emulation as an option.

We'll see what we can do there. Apparently, the current timings are not that bad, but some things are running too slow, as Who Wants To Be A Millionaire has shown. The other issue is that, well, the current code for timings is a gross hack.


Love and waffles!
Happy birthday, albeit late
It turns out that melonDS is two years and 20 days old, counting from the first serious commit. So:



Too many candles there? Nah! That's just the melonDS company being generous.


So, the usual bit of retrospective. We're not going to repeat the whole history, you can read that on the one-year post.

Anyway, reading the DeSmuME commit log shows that they're trying to play catch-up.

Ironically enough, they finally took their wifi system out from behind its EXPERIMENTAL_WIFI wall, and exposed the settings to their users. They even went as far as merging the old 'adhoc'/'infrastructure' modes (basically selecting between nifi/local and internet/WFC), akin to melonDS where the setting just doesn't exist: it's always running both modes at the same time.

A few notes on that, for those who are interested in DeSmuME altWFC:

* You need a recent DeSmuME build, such as those produced by their buildbot. Not the 0.9.11 release from 2015.
* It functions the same way as melonDS, you need libpcap and an Ethernet connection.
* They haven't fixed the underlying issues in their wifi stack, or maybe even their emulator core, yet. For example, Mario Kart DS races don't work, while they do on melonDS. (I do not know what causes the issue in DeSmuME though, I haven't faced it at all in melonDS, so I can't help there)

However it is nice that they're finally reconsidering their stance on what was largely my work, and giving it some spotlight.

It's also nice in that melonDS is achieving one of its goals there, not regarding wifi but in a more general sense. One of the reasons behind this project was the state of the DS emulation scene at the time. melonDS showed a clear intention of disrupting what was an increasingly stale scene, injecting some fresh blood into it and hopefully encouraging further development by an effect of emulation (heh).

Well, seems we're getting there!


So, I want to take a while to thank all of you comrades who have helped make this possible, be it by providing amazing documentation, testing games, reporting bugs, suggesting fancy things, making videos, donating on Patreon, or whatever else has ever helped me. melonDS is not a one-man show, we are in this together. Thank you for your support! And the whole DS scene can thank us all too, as everybody benefits from this.


As for 2018, it wasn't full of a lot of awesome melonDS development, mostly because the year has largely been a shitshow. But the pace is picking up again, so, may 2019 be a better year.


Here is a quick roadmap for further melonDS releases.

0.8 will be when we have a decent OpenGL renderer. So we can pull awesome shit like upscaling or Switch/Android/whatever ports that don't run at crapoed-snail speeds. Might also give a speed boost on desktop. The current renderer is a bottleneck when it's used for serious 3D scenes. But hey, it's nearly pixel-perfect and makes for an authentic DS graphics experience! :P



If you enjoy licking the bars in that level, you get an accurate render of that quirky texturing. Who would want upscaling or other graphics enhancements when you have THIS?

:P

On a more serious note, one that also happens to be GPU-related: the more immediate roadmap, for what would be version 0.71 or so.

0.71? 0.7.1? 0.7b? 0.7.zog? What would be the best there? Can't immediately go to 0.8, I promised it'd be the OpenGL renderer release.

Anyway.

This intermediate release will be that of the timing renovation and primped-up wifi connectivity. Which will, you guessed it, need a bunch of testing to make sure it's going okay and we don't end up shipping a 0.8 whose awesomeness is completely undermined by timing bugs.

The timing renovation includes GPU timings too. For now, these are only the geometry engine timings, which dictate how long display lists take to run. On the other side, rendering engine timings would basically be the time taken to draw polygons, and therefore emulating glitches arising from having too much shit onscreen. This is a whole other can of worms we aren't opening for now.

I talked a bit about the geometry engine timings in the last post, but actually, it's a bit more complicated than what I stated.

For example, vertex timings. Each time I came up with a model for those and how they interacted with other commands, I always found test cases that contradicted this model. The answer is that vertex timings aren't uniform. Each vertex which completes a polygon will take longer to execute and impose rules on when the next vertex commands can run. Vertex commands themselves can take 4 or 6 cycles (add one cycle for a full 16-bit vertex) depending on which command follows; there is parallel execution going on. However, the time required for building a polygon is 28 cycles for a triangle and 37 cycles for a quad. And, for example, after completing a triangle, the next vertex can only start 10 cycles later, the one after that needs to be at least 9 cycles later, and so on.
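To give an idea of how such a model could be expressed in code, here's a minimal sketch built only from the numbers above. The structure and names are made up, and the vertex spacing is simplified to the first rule (10 cycles after a triangle):

#include <cstdint>

struct GXTiming
{
    uint64_t Now;                // geometry engine timestamp, in cycles
    uint64_t PolygonBusyUntil;   // when the polygon-building stage frees up
    uint64_t NextVertexAllowed;  // earliest start time for the next vertex command
};

void RunVertex(GXTiming& gx, bool fullVtx16, bool completesTri, bool completesQuad)
{
    // respect the spacing rule imposed by the previous polygon completion
    if (gx.Now < gx.NextVertexAllowed)
        gx.Now = gx.NextVertexAllowed;

    // a vertex command takes 4 or 6 cycles depending on the following command
    // (simplified to 4 here), plus one cycle for a full 16-bit vertex
    gx.Now += 4 + (fullVtx16 ? 1 : 0);

    if (completesTri || completesQuad)
    {
        // building a polygon takes 28 cycles for a triangle, 37 for a quad
        gx.PolygonBusyUntil = gx.Now + (completesTri ? 28 : 37);

        // after completing a triangle, the next vertex may only start 10 cycles
        // later (the one after that at least 9; that second rule isn't modelled)
        gx.NextVertexAllowed = gx.Now + 10;
    }
}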

So, yeah. I have yet to test all the commands I can test to determine how they behave relative to vertex commands. For example, for some weird reason, command LIGHT_VECTOR needs to wait until the polygon-building process has completed, which can slow things down a lot if you're changing light directions per polygon. LIGHT_VECTOR is a bit of a snowflake though, most commands seem to parallelize fine.

Also, as a phun side note, the lag spikes I was getting in SM64DS were caused by a bug in how I counted cycles during GXFIFO stalls. Now it seems to be happy with the GXFIFO stalls even without touching the GPU timings. Maybe it's meant to work this way after all? That might be why they made it run at 30 FPS.

Oh well.

The other bit of this release would be, basically, making the whole online wifi shit more user-friendly. Like, actually finishing that feature. Making it work under Linux and with npcap and all that. Making it work with a wifi connection, or even Bluetooth, IR, radio, smoke signals, or whatever you like, which will require adding in a small DHCP and NAT.


The melonDS adventure isn't over! May it continue forever. Thank you all!
Fun with custom WFC servers
PeeJay Bonobo and his friends have been having some fun with melonDS and custom WFC servers. For example:



So, what do we learn from this?

• This is some pretty cool shit!
• Graphically, melonDS is hardly distinguishable from the real thing
• The wifi stack is also fairly robust! There are no stability issues. Although this is less demanding than local multiplayer (nifi).
• They managed to get this experimental, undocumented feature working.

So yeah, that last point.

Local multiplayer was celebrated and widely advertised, despite suffering from data loss every now and then. Internet connectivity, on the other hand, was implemented later but never mentioned in any changelog. There was only this thread mentioning it.

Reason is that this feature is unfinished. Most of it is actually coded, and it works, but it's nowhere near where it ought to be in terms of user-friendliness. It was more or less a quick hack, and was hardcoded to use the second network adapter (whichever that is), because that was what worked for me at the time. It also only works if your computer is connected via ethernet.

The first part just requires building up the UI for selecting a network adapter, as well as some extra code for naming them, at least under Windows: winpcap provides a 'name' and a 'description' for each adapter; the former is some GUID-like identifier string and the latter seems to always be 'Microsoft', so, not too user-friendly. I will have to dig into the Windows API to look for a better method. Haven't checked under Linux, but you probably get names like your typical eth0, which would be good enough.
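For reference, enumerating the adapters through libpcap/winpcap is simple enough; this is just standard pcap_findalldevs() usage, not melonDS's actual code:

#include <cstdio>
#include <pcap.h>

int main()
{
    pcap_if_t* alldevs;
    char errbuf[PCAP_ERRBUF_SIZE];

    if (pcap_findalldevs(&alldevs, errbuf) == -1)
    {
        fprintf(stderr, "pcap_findalldevs failed: %s\n", errbuf);
        return 1;
    }

    // under Windows, 'description' is the string you'd want to show in a UI
    for (pcap_if_t* dev = alldevs; dev; dev = dev->next)
        printf("%s -- %s\n", dev->name,
               dev->description ? dev->description : "(no description)");

    pcap_freealldevs(alldevs);
    return 0;
}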

Second issue is due to the way this works.

Packets sent by the emulated DS will be forwarded as-is over the network, bearing the DS's MAC address. In return, packets sent to the DS will be readable from the host using promiscuous mode, which is widely supported. Basically, on the network, the emulated DS is seen as an entirely separate device from its host. It will typically contact the DHCP and get its own IP address and all.

This is the simplest way to make it work. The main downside is that, well, it doesn't work over wifi. Reason is simple enough: over wifi, traffic goes through an access point, and all devices talking to that access point have to be authenticated and associated with it. This will be the case for the host, but the guest (emulated DS) is seen as an entirely separate device, and has never associated to the AP. Thus, any guest traffic will be rejected.

Two solutions for that:

1. Associating melonDS to the AP.

Amusingly, this would basically be an emulated DS talking to an emulated AP (melonAP) talking to an actual AP.

But there is a giant pile of complication arising from this.

First, we have to somehow figure out that the connection is a wifi connection. Then, figure out the AP's MAC address. Associate melonDS to it, either somehow grabbing the password from the system or asking the user for it. How fun.

Then there is the whole issue that this requires a wifi adapter that supports monitor mode and injection, so we are able to send proper auth/assoc frames. If we have that much control over it, might as well just drop melonAP and directly forward melonDS's wifi traffic to the host wifi adapter, that would be a whole lot simpler. (although we might want to come up with something to take care of WPA2/etc transparently, so you don't need an insecure network)

In the end, this is so unwieldy that requiring an ethernet connection is a better alternative.

2. Doing our own DHCP and NAT

The idea is to create a small subnet between the host and the guest. A small DHCP server will hand out a fixed IP address to each. The host will basically act as a bridge between the actual network and the guest.

The advantage would be that this method would be agnostic to the kind of connection used. All the traffic would be explicitly going through the host, instead of pretending to be a separate device.

Of course, we still need to figure out the host's network adapter details, such as its MAC address and its current IP address. libpcap doesn't have all the functionality needed there so we'll likely need some platform-specific code (which we'll need anyway to get proper interface names).

Then it'll need to do some analysis on packets going through it, to change their MAC/IP addresses as needed.
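As a rough illustration of what that rewriting involves for an outgoing Ethernet+IPv4 frame (the header offsets are the standard layouts; everything else is a sketch, and TCP/UDP checksums would need the same treatment since they cover the IP addresses through a pseudo-header):

#include <cstdint>
#include <cstring>

// standard IPv4 header checksum: one's complement sum of 16-bit words
static uint16_t IPChecksum(const uint8_t* hdr, int len)
{
    uint32_t sum = 0;
    for (int i = 0; i < len; i += 2)
        sum += (hdr[i] << 8) | hdr[i+1];
    while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);
    return ~sum & 0xFFFF;
}

void RewriteOutgoing(uint8_t* frame, const uint8_t* hostMAC, const uint8_t* hostIP)
{
    memcpy(&frame[6], hostMAC, 6);      // Ethernet source MAC -> host's MAC

    uint8_t* ip = &frame[14];           // IPv4 header follows the Ethernet header
    memcpy(&ip[12], hostIP, 4);         // source IP -> host's IP

    ip[10] = ip[11] = 0;                // clear the old checksum, then recompute it
    uint16_t sum = IPChecksum(ip, (ip[0] & 0x0F) * 4);
    ip[10] = sum >> 8;
    ip[11] = sum & 0xFF;
}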

Nothing insurmountable.

Of course, we could keep the old direct mode as an option, more straightforward and less likely to break but requiring an ethernet connection.



That is, when we're done with the timing renovation.

How's that going btw? Still testing and measuring the DS.

I have general timings more or less covered up now. But, while I was at it, I had to get rid of this old hack:

void CmdFIFOWrite(CmdFIFOEntry& entry)
{
   if (CmdFIFO->IsEmpty() && !CmdPIPE->IsFull())
   {
       CmdPIPE->Write(entry);
   }
   else
   {
       if (CmdFIFO->IsFull())
       {
           //printf("!!! GX FIFO FULL\n");
           //return;

           // temp. hack
           // SM64DS seems to overflow the FIFO occasionally
           // either leftover bugs in our implementation, or the game accidentally doing that
           // TODO: investigate.
           // TODO: implement this behavior properly (freezes the bus until the FIFO isn't full anymore)

           while (CmdFIFO->IsFull())
               ExecuteCommand();
       }

       CmdFIFO->Write(entry);
   }
   [...]
}


This is when we have a complete GX command entry that we can store in the FIFO. Note the case where the FIFO is full.

What happens then? It stalls the bus until there is room in the FIFO. According to GBAtek, this goes as far as stalling the ARM7.

What does the code above do? It runs GX commands until the FIFO isn't full anymore, but this doesn't stall the bus; it basically acts like the FIFO is infinite. It probably has weird implications on timings, given that ExecuteCommand() still counts cycles.

This dates back to the original GXFIFO implementation (it's nearly two years old, wow). After some fixing, I eventually had it working well enough that all the games I tested worked without shitting themselves. Except, of course, Super Mario 64 DS, which still kept overflowing the FIFO. I just figured that the game was, well, not very well programmed (it uses immediate-mode DMA instead of the adequate GXFIFO mode, for instance). So I hacked it and postponed the whole thing, instead focusing on, you know, building the emulator core and making it be more than a curiosity.

But now we're at the stage where our core is mostly good and we have to make the emulator be awesome in all other ways.

And, especially, the timing renovation. Attempting to fix a whole bunch of bugs that are timing issues.

So, of course, when I went there, I ran into SM64DS again. Such is destiny. But I went on anyway, and implemented GXFIFO bus stalls.
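For illustration, the stalling path could look something like this. It's a sketch, not the actual melonDS code: only the full-FIFO case is shown, ExecuteCommand() is assumed to return the cycles the command took, and StallBus() is a made-up helper that charges those cycles to whoever is writing (CPU or DMA):

void CmdFIFOWriteStalling(CmdFIFOEntry& entry)
{
    // if the FIFO is full, the write doesn't complete: the bus is stalled
    // until enough commands have executed to free up a slot
    while (CmdFIFO->IsFull())
    {
        uint32_t cycles = ExecuteCommand();
        StallBus(cycles);
    }

    CmdFIFO->Write(entry);
}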

And of course, SM64DS wasn't too happy with it. The stalls caused visible lag spikes. Those don't occur on hardware, so clearly we're doing something wrong there.

Logging shows that the game is constantly blasting the GXFIFO with 118-word bursts, regardless of FIFO levels. It never checks GXSTAT. It just keeps transferring chunks over and over again until all is transferred. It doesn't care what the GPU has to say about this.

So I inspect the code that is doing the transfers. This is not a bug, the code is meant to work this way.

Side note: the 118-word chunk looks like an attempt at replicating the GXFIFO DMA mode, where it transfers 112-word chunks whenever the FIFO level is below 128. I don't see the point, even more so when the implementation is flawed in that it never actually checks FIFO levels and just keeps firing at the FIFO.

So the immediate implication of this is that melonDS is likely taking too long to run the display lists.

Which is where I do a few hardware tests, checking whether writing to the GXFIFO has higher waitstates or whatever, but nope...

However it turns out that some commands execute faster than we thought.

For example, TEXCOORD, NORMAL, VTX_16. Individually, they take 1, 9 and 9 cycles respectively, as GBAtek says. All is good.

If you combine the three, it should thus take 19 cycles. But on hardware, it takes... 9 cycles.

This is showing us that there is parallel execution going on. Which explains why our display lists are taking too long.

So, here it is. I'm way down the rabbit hole, working out how this works, which commands support parallel execution, with which commands and when.
Bugfix0ring streak
So many things to do and so few coders. Poor melonDS company :P

Regardless, a bugfixing streak started happening. So, while I'm busy brainstorming about the current issue (which I'll talk about), have a post with juicy technical details about the process. It's oh so fun, you'll see.



First bug to go down is this weird 3D glitch that shows up in Mario Kart DS and probably others. The issue post shows what it's about: random glitch lines showing up in 3D graphics.

First attempts are some tests to determine the nature of the glitchy pixels. Disabling polygon edges, disabling translucent polygons, disabling antialiasing, whatever, you name it.

Eventually, we find out that those pixels are edge pixels.

Then I suspected a depth-test-related issue. The stencil test can be excluded: shadow polygons are always translucent, and if you've read the previous post about the GPU innards, you know that translucent pixels don't set edge flags.

Enter depth buffer debugging. AKA Mario Kart LSD edition.

The purpose of this isn't to have an epic trip racing crapo AI players in a trippy setting, but to get a visual reading of the depth buffer. Not very readable to the naked eye, but the 24-bit Z values are mapped to 24-bit RGB, via a gross hack that bypasses the whole rendering pipeline (which is 18-bit).

You can see a glitched line in the middle of this frame, and use any good image editing software to read the raw values of the pixels. We find out that the Z values on the glitch line are greater than those of the surrounding pixels, which means the line should not be punching through the surrounding polygons (those polygons should be in front of it).

What the fuck?

Attempting to log these bogus depth test cases, we find out that interpolation is somehow generating Z values that are out of range, of course in a way that manages to slip past Depth Buffer Viewer 5000000. Sometimes they're so out-of-bounds that they end up negative, which causes the glitch lines.

(should probably not be using signed ints there, as Z values are always positive, anyway. but then that'd have hidden the glitch, probably)

Tracking those, we find that some polygons with long, nearly-horizontal edges cause the rasterizer to accidentally draw outside of the polygon bounds. In turn, the perspective-correct interpolation factor for these pixels is also out of bounds, which, in some cases, screws up interpolation of vertex attributes, making things overflow and shit themselves.

Talk about a butterfly effect.

(and remind me to talk about that 'perspective-correct interpolation factor' thing someday, but for now, that's how the DS GPU does things. no W reciprocals.)

But, finally, the bug is down.



Next one on the hunt is this one: enemies invisible in Bionicle Heroes. SilasLaspada usefully noted that the console keeps complaining about 3D-related matrix stack overflows/underflows.

What's this about? I mentioned it in the GPU post, the GPU has a set of matrix stacks that are basically a hardware implementation of glPushMatrix() & co. You get the standard push/pop operations, but also store/restore operations which allow accessing arbitrary slots in the stack.

There are four matrix stacks: projection, position, vector, and texture. The projection matrix stack is limited to one slot. I'm pretty sure the texture matrix stack is too, but this one is hazy. Finally, the position and vector matrix stacks have 32 slots, which makes sense. These two are interlinked, the position matrix is used to transform models and the vector matrix is used for lighting calculations. The idea is to avoid having to normalize normals and light directions after having transformed them, by instead supplying a vector matrix which is (normally) an unscaled version of the position matrix. For this purpose, you get a matrix mode in which all matrix manipulation commands apply to both position and vector matrices, except, of course, the scale command.

Anyway, it's not too uncommon for games to accidentally overflow/underflow a matrix stack by pushing or popping too much, as a result of a bug in the game code. I saw it happen in NSMB's intro when the castle gets struck by lightning, for example, without any visible consequence. So I dismissed such cases as benign game bugs.

Until, well, this one bug. Missing shit onscreen, constant complaints about overflow/underflow, too fishy to be a coincidence.

A bit of logging shows that, at some point, the game proceeds to push the position matrix 36 times, without ever popping. And it seems to do a lot of stupid shit generally, like pretending the projection matrix stack has two slots, etc... Terrible game code? Emulator bug? One way to find out, I guess.

So how does this work? What happens if you overflow or underflow a matrix stack?

GXSTAT has a flag signalling that, so I guess you... raise that flag, and cancel the operation?

Right?

Hahaha.

Half of that is true, because of course. This is the DS we're dealing with.

It raises the over/underflow flag, and... it fucking goes on and performs the operation anyway, using only the relevant bits of the stack pointer. For example, if the stack pointer is 31 and you push the matrix, it's stored in slot 31, pointer is incremented, error flag is raised (it starts raising at 31). Now if you push again, the pointer is 32, whose lower 5 bits are 0, so your matrix is stored in slot 0. Yes.

For the position/vector stacks, the stack pointer is 6-bit, so if you push or pop enough, it will eventually wrap around and stop raising the over/underflow flag.

For the projection and texture stacks, abusive pushing and popping will keep accessing the same slot, since there's only one.
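In code, the push behavior described above boils down to something like this (a sketch for the position matrix stack; the names are made up):

#include <cstdint>

struct Matrix { int32_t m[16]; };

Matrix  PosMatrixStack[32];
uint8_t PosStackPtr = 0;            // 6-bit stack pointer
bool    GXStatStackError = false;   // over/underflow flag in GXSTAT

void MatrixStackPush(const Matrix& current)
{
    // the error flag starts being raised at pointer 31...
    if (PosStackPtr >= 31)
        GXStatStackError = true;

    // ...but the push happens anyway, with only the lower 5 bits selecting the slot
    PosMatrixStack[PosStackPtr & 0x1F] = current;
    PosStackPtr = (PosStackPtr + 1) & 0x3F;   // 6-bit pointer wraps around
}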

Anyway, revising melonDS according to all this seemed to fix the bug, so I guess this is shoddy programming on the game's part. Another bug bites the dust.



Next one is FMVs flickering in Who Wants To Be A Millionaire.

In particular, the bottom screen is partly flickering. Quick test: this only happens when using the threaded 3D renderer, which means that the game is changing VRAM mappings while the renderer is running. As melonDS has safeguards to ensure the threaded renderer doesn't run outside of the real hardware's 3D rendering period, that means the game is malfunctioning.

Basically, it's using the 3D renderer to draw FMV to both screens. Which is a bit silly during the intro as the bottom screen is a fixed image. I haven't seen other FMVs in that game though, so... yeah.

So, we log shit, as usual. We find out that, regularly, the game unmaps texture VRAM, updates the contents via DMA, then maps it back to the GPU. This normally starts upon VBlank, so the texture memory is updated before the GPU starts rendering. Oh but sometimes it gets delayed by something and starts at scanline 243, which is way too late (3D rendering starts at scanline 214).

The non-threaded renderer wouldn't care, it runs 'instantly' upon scanline 214, so it would only be lagging one frame behind, which nobody would notice. The threaded renderer, however, cares.

We find out that the VBlank handler is being held back because something keeps IRQs disabled. This 'something' turns out to be the FMV decoder. It's running in ITCM and keeps IRQs disabled so it can complete ASAP, which makes sense. However, as far as melonDS is concerned, the decoder is taking too much time to do its thing and ends up delaying the VBlank handler.

There is no doubt that this is a big fat timing issue. We have a few of those already, namely RaymanDS which does weird things given bad timings. But also, less severely, 3D graphics sporadically glitching or disappearing for one frame, FMVs playing slightly too slow...

Y'know what this means: time to renovate the timings in melonDS to make them suck less. We can't get it perfect without severe performance loss, given the complexity of the DS architecture (two CPUs, shared memory, ARM9 caches...), but we can get close. Which wouldn't matter too much anyway: timing on the real thing tends to be rather nondeterministic. With the ARM9, there are many little things factoring in: two-opcodes-at-once THUMB prefetch, parallel code and data accesses under certain circumstances, sometimes skipping internal cycles... and also the fact that the ARM9 runs at 66MHz, but the system clock is 33MHz, so the ARM9 may have to add one extra cycle to align to the system clock when using the bus. The ARM9 is on the borderline where CPUs start getting too complex for accurate emulation of their timings.

Anyway, we're not there yet, but brainstorming the renovation.

For example, complete emulation of the ARM9 caches will not be possible without a performance hit, but I guess we can emulate the MPU and thus properly determine which regions are cached instead of hardcoding it.
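A minimal sketch of what that could look like, assuming the ARM946E-S protection unit register layout (enable bit, 5-bit size field, aligned base address, one data-cacheability bit per region); treat the details as my reading of the manual rather than melonDS code:

#include <cstdint>

uint32_t PURegion[8];        // CP15 c6,cN: enable, size and base of each region
uint8_t  PUDataCacheable;    // CP15 c2: one data-cacheability bit per region

bool IsDataCacheable(uint32_t addr)
{
    // higher-numbered regions take priority when regions overlap
    for (int i = 7; i >= 0; i--)
    {
        uint32_t reg = PURegion[i];
        if (!(reg & 1)) continue;                       // region disabled

        uint32_t sizeShift = ((reg >> 1) & 0x1F) + 1;   // region size = 2^(size+1) bytes
        uint64_t size = 1ULL << sizeShift;
        uint32_t base = reg & ~(uint32_t)(size - 1);    // base is aligned to the size

        if (addr - base < size)
            return (PUDataCacheable >> i) & 1;
    }
    return false;   // no region matched (such an access would fault on hardware)
}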

We'll see how that goes.
The DS GPU and its fun quirks
Oh hey, another 'technical shito' post. Kind of a preamble to what will be the flagship feature of 0.8.

(would be nice to add actual tags/categories to this blog, btw. heh)



Anyway, I want to talk about how the DS GPU works, how it's different from modern GPUs, and why I don't think that using Vulkan over OpenGL would be of any benefit.

I don't know Vulkan a lot, so don't quote me on that, but from what I get, Vulkan stands out by working on a lower level than OpenGL, letting you manage the GPU memory and similar things. This may be good for emulating more modern consoles, where sometimes proprietary graphics APIs are used that allow levels of control that aren't possible with OpenGL.

For example, the blargSNES hardware renderer -- one of the tricks it pulls is that during some of the operations, the same depth/stencil buffer is used with different color buffers. This isn't possible with OpenGL.

Also, there's less cruft between the application and the GPU, meaning better performance, provided you're doing things right. While OpenGL drivers are full of optimizations for common use cases and even for specific games, with Vulkan it's all up to the application to be, you know, well programmed.

So basically, with more power comes more responsibility.



I am no 3D API expert, so, back to what we know well: the DS GPU.

There are already a few posts about specific parts of it (the fancypants quads, the viewport shito, a fun rasterizer quirk, the wonderful antialiasing implementation), but here we're going to cover the whole thing, with all the juicy details. Or at least all those that we know of. Heh.

The GPU in itself is a fairly old and obsolete piece of hardware. It is limited to 2048 polygons and/or 6144 vertices per frame. The resolution is 256x192. Even at 4x that resolution, performance would not be a concern. The DS can output a maximum of 122880 polygons per second under optimal conditions, which is puny compared to modern GPUs.

Now, we get into the details of the GPU operation. It looks fairly standard on the top, but deep down the operation is different from modern GPUs, which can make it difficult to emulate some features properly.

The GPU is split in two parts: the geometry engine and the rendering engine. The geometry engine processes incoming vertices, builds polygons and transforms them so they can be passed to the rendering engine, which (you guess) draws them to the screen.



The geometry engine

Fairly standard geometry pipeline.

A detail worth noting is that all the arithmetic is done using fixed-point integers, as the DS doesn't support floating-point numbers.

The geometry engine is emulated entirely in software (GPU3D.cpp), so it's not too relevant to what we use to render graphics, but I'm going to detail it anyway.

1. Transform and lighting. Incoming vertices and texture coordinates are transformed using sets of 4x4 matrices. Lighting is optionally applied to vertex colors. Pretty standard stuff there, the only nonstandard part is how texture coordinates work (1.0 = one texel on the DS). We can also note the whole matrix stack system, which is more or less a hardware implementation of glPushMatrix() et al.

2. Polygon setup. Transformed vertices are assembled into polygons, which can be triangles, quads, triangle strips or quad strips. Quads are natively handled and not converted to triangles, which is a bit of a problem as modern GPUs only support triangles. However it seems that someone came up with a solution which I will have to test out.

3. Culling. Polygons can be eliminated based on whether they're facing the screen and which culling mode is selected. Pretty standard too. I have yet to determine how this works for quads, though.

4. Clipping. Polygons that are outside of the view volume are eliminated. Polygons that are partly outside are clipped; this step doesn't create new polygons but can add vertices to the existing ones. Basically, each of the 6 clipping planes can end up adding one vertex to the polygon, which means we can end up with as many as 10 vertices. The part about the rendering engine will explain how this is dealt with.

5. Viewport transform. X/Y coordinates are transformed to screen coordinates. Z coordinates are transformed to fit within the depth buffer's 24-bit range.

An interesting bit would be how W coordinates are handled: those are 'normalized' to fit within a 16-bit range. For this, each polygon's W coordinates are considered, and if any is greater than 0xFFFF, they are all shifted right by 4 until they fit within 16 bits. Conversely, if they're all less than 0x1000, they're shifted left until they get there, presumably to get better ranges and thus better precision for interpolation. (A rough sketch of this step follows at the end of this section.)

6. Sorting. Polygons are sorted so that translucent polygons are drawn last. Then they are sorted by their Y coordinates (yes), which is mandatory for opaque polygons and optional for translucent ones.

This is also why it's limited to 2048 polygons: it has to store them somewhere to perform the sorting. There are two internal memory banks dedicated to storing polygons and vertices. There is even a register telling you how many polygons and vertices are stored.
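And here is the rough sketch of the W adjustment from step 5, as promised. The right-shift step of 4 is what's described above; the left-shift step size is an assumption, and the names are made up:

#include <cstdint>

void NormalizeW(uint32_t* w, int nverts)
{
    // if any W doesn't fit within 16 bits, shift them all right by 4 until they do
    int shift = 0;
    for (int i = 0; i < nverts; i++)
        while ((w[i] >> shift) > 0xFFFF) shift += 4;

    if (shift > 0)
    {
        for (int i = 0; i < nverts; i++) w[i] >>= shift;
        return;
    }

    // conversely, if they're all below 0x1000, shift them left (assumed: also by
    // steps of 4) until one of them reaches that range, for better precision
    while (true)
    {
        uint32_t maxw = 0;
        for (int i = 0; i < nverts; i++)
            if (w[i] > maxw) maxw = w[i];

        if (maxw == 0 || maxw >= 0x1000) break;   // done (guards against all-zero W)

        for (int i = 0; i < nverts; i++) w[i] <<= 4;
    }
}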



The rendering engine

This is where the fun begins!

Once all the polygons are all set up and sorted, the rendering engine begins its work.

First fun thing is how it fills polygons. It's nothing like modern GPUs, which seem to fill by tiles and use algorithms optimized for triangles. I don't know how all of them work (heh), but I have seen a bit of how the 3DS GPU does it, and it's tile-based.

Anyway, on the DS, it's a scanline renderer. It has to work this way so that it can render in parallel with the oldschool 2D tile engines, which operate on a per-scanline basis. There is a small buffer of 48 scanlines which can serve to even out overload on some scanlines.

The rasterizer is a scanline-based convex polygon renderer. It can handle arbitrary amounts of vertices. It can render things wrong if you give it polygons that aren't convex or that have crossed edges, for example:



A butterfly polygon. All fine and dandy.

What happens if we rotate it, though?



Oops.

What went wrong there? Let's outline the original polygon to get an idea:



The renderer is only able to fill one span per scanline. It determines the left and right edges starting at the topmost vertices, and follows those edges until it meets new vertices.

In the picture above, it starts from the topmost vertex, the top left one, and keeps filling until it meets the end of the left edge (the bottom left vertex). It is unaware that the edges are crossing.

At this point, it looks for the next vertex on the left edge; interestingly, it knows not to pick vertices that are higher than the current one, and it also knows that the left and right edges have been swapped. Thus, it continues filling until the end of the polygon.

I'd also picture some examples of non-convex polygons, but we're drifting away here; you can guess how those go.

Instead, you probably wonder how Gouraud shading and texturing can work with arbitrary amounts of vertices. There are barycentric algorithms that work for interpolating things over a triangle, but... yeah.

The DS renderer, once again, does its own thing. More fancy pics to describe it.



The polygon vertices are points 1, 2, 3 and 4. The numbers don't match what the actual winding order would be, but you get the point.

For the current scanline, the renderer determines the vertices that directly surround the edges (as said above, it starts with the topmost vertices, then walks the edges until they end). In this case, those vertices are 1 and 2 for the left edge, 3 and 4 for the right edge.

The edge slopes are used to determine the limits of the span, which are 5 and 6. The vertex attributes at these points are interpolated, based on the vertical positions within the edges (or horizontal positions for edges whose slopes are X-major).

With that, for each pixel within the span (for example, point 7), the attributes at that pixel are interpolated from the attributes previously calculated at points 5 and 6, based on the X position within the span.

The factors used here are all 50% to make things easier to deal with, but you get the point.

I'm not getting into how attributes are interpolated in detail, this would be beyond the scope of this post. Although this would be interesting to write about too. It is basically perspective-correct interpolation, but there are some fun shortcuts and quirks.
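Still, to make the two-step structure concrete, here's a minimal sketch with plain linear interpolation standing in for the real scheme; all names are made up, and the left/right edge positions are assumed to be precomputed by the edge-walking code:

// attributes we might interpolate: vertex color and texture coordinates
struct Attrs { int r, g, b, s, t; };

// plain linear interpolation between a and b, at position num/den
static Attrs Lerp(const Attrs& a, const Attrs& b, int num, int den)
{
    auto mix = [&](int x, int y) { return x + (y - x) * num / den; };
    return { mix(a.r, b.r), mix(a.g, b.g), mix(a.b, b.b), mix(a.s, b.s), mix(a.t, b.t) };
}

// one scanline: interpolate along each edge to get the span endpoints
// (points 5 and 6 in the figure), then across the span for each pixel (point 7)
void DrawSpan(const Attrs& v1, const Attrs& v2,    // vertices surrounding the left edge
              const Attrs& v3, const Attrs& v4,    // vertices surrounding the right edge
              int leftPos, int leftLen,            // position within the left edge
              int rightPos, int rightLen,          // position within the right edge
              int x0, int x1)                      // span limits on this scanline
{
    if (x1 <= x0 || leftLen <= 0 || rightLen <= 0) return;

    Attrs left  = Lerp(v1, v2, leftPos, leftLen);
    Attrs right = Lerp(v3, v4, rightPos, rightLen);

    for (int x = x0; x < x1; x++)
    {
        Attrs pix = Lerp(left, right, x - x0, x1 - x0);
        (void)pix;   // texture sampling, depth test and buffer writes would go here
    }
}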

Now you know how the DS fills polygons.


How about the actual fill rules now? This is a bunch of fun too!


Yeah. Well, first of all, there are different fill rules for opaque and translucent polygons. But the killer detail is that the rules are applied per pixel. Translucent polygons can have opaque pixels, and those will follow the same rules as opaque polygons. You guess how this goes -- emulating this kind of shit on a modern GPU requires separate rendering passes.

Then, various polygon attributes can affect rendering in various fun ways. Additionally to the pretty standard color and depth buffers, the renderer also has an attribute buffer, which keeps track of all sorts of fun things. Namely: polygon IDs (separately for opaque and translucent pixels), whether a pixel was translucent, whether it should receive fog, whether it is from a front-facing or back-facing polygon (yes), and whether it is on the edge of a polygon. And maybe more.

Emulating this sort of thing is not going to be trivial. Your average GPU has a stencil buffer that is limited to 8 bits, which is not nearly enough to emulate all the things the attribute buffer can store. So we will need to think of clever workarounds for this.

Let's see:

* depth buffer update: it is mandatory for opaque pixels, optional for translucent ones.

* polygon IDs: polygons are assigned 6-bit IDs, that can serve a few purposes. Opaque polygon IDs are used for edge marking. Translucent polygon IDs can be used to control where those are drawn: a translucent pixel will not be drawn if its polygon ID is the same as the existing translucent polygon ID in the attribute buffer. Both polygon IDs also serve to control where shadows are drawn, in a similar fashion, so for example you can have a shadow that covers the floor but not your character.

(side note: shadows are a simple stencil buffer implementation, nothing too terrible there)

Worth noting that when drawing translucent pixels, the existing opaque polygon ID is preserved, as well as edge flags from the last opaque polygon.

* fog flag: determines whether the fog pass should be applied to this pixel. How it is updated depends on whether the incoming pixel is opaque or translucent.

* front-facing flag: this one is a bastard. If you happen to remember:


(oh, what a lazy ass, she reused old screenshots for this)

Sands of Destruction, living up to its name. The screens in that game are quirk alley. Not only do they do things like tweaking their Y coordinates to influence Y-sorting, the screen shown in that screenshot is probably the worst.

It relies on a depth test edge case: the 'less than' compare function accepts equal values when you're drawing a front-facing polygon over opaque back-facing polygon pixels. Yes. As the polygons' Z values are all zero, the screen will be missing elements if you don't emulate this quirk.

I guess the intended effect was that you'd want your object's front side to always be visible over the back side, even if it's so flat that the Z values are all the same. The DS renderer feels like a hardware version of those DOS-era renderers in that it's full of hacks and quirks like that.

Anyway, this will be tricky to emulate with a GPU. And there are other similar depth test edge cases that have to be tested and documented, too.

* edge flags: the renderer keeps track of where polygon edges are. These serve for the final passes, namely, edge marking and antialiasing. There are also special edge filling rules for opaque polygons when antialiasing is disabled. The following graph, made from an old image I found, describes these rules:



Side note: wireframe polygons are rendered by only filling the edges! How smart.


A fun note on depth buffering, too:

On the DS, you get two possible modes for depth buffering: Z-buffering and W-buffering. Seems quite typical. Of course that's without getting into the details.

* Z-buffering uses Z coordinates, transformed to fit within the 24-bit range of the depth buffer. The Z coordinates are interpolated linearly across polygons (with a few oddities, but that shouldn't matter). Nothing too atypical there.

* W-buffering uses W coordinates as-is. Typically, modern GPUs seem to use 1/W instead, but, y'know, the DS uses fixed-point arithmetic, so using reciprocals wouldn't be a very good thing. Anyway, in this mode, the W coordinates are interpolated with perspective correction.


And, finally, the final rendering passes, in order:

* edge marking: the pixels that have edge flags set are given a color, taken from a table and determined based on the opaque polygon ID.

This will color polygon edges. Note that if a translucent polygon was drawn over an opaque polygon, the opaque polygon's edges will still be colored.

Side effect of how clipping works: the borders where polygons intersect with the screen borders will also be colored. You can notice it for example in Picross 3D screenshots.

* fog: it is applied per-pixel, based on depth values which are used to index a fog density table. You guessed it: it is applied where fog flags in the attribute buffer are set. (A rough sketch of this pass is given a bit further down.)

* antialiasing: it is applied to the (opaque) polygon edges. Pixel coverage values are calculated during polygon rendering, based on edge slopes. During the final pass, those pixels are blended with the pixels below, via a clever mechanism that was described in a previous post.

Antialiasing doesn't need to (and cannot) be emulated this way on a GPU, so this doesn't matter here.

Except for the fact that if edge marking and antialiasing are to be applied to the same pixels, they only get edge marking, but with 50% opacity.
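And, as promised, a rough sketch of the fog pass. The depth-to-index mapping is deliberately hand-waved (the real thing derives the table index from the depth value through a configurable offset and shift), and the 18-bit color packing and density range are assumptions:

#include <cstdint>

uint8_t  FogDensity[32];                  // fog density table
uint32_t FogColorR, FogColorG, FogColorB; // fog color, 6 bits per channel

void ApplyFog(uint32_t* colorBuf, const uint32_t* depthBuf, const uint8_t* attrBuf, int npixels)
{
    for (int i = 0; i < npixels; i++)
    {
        if (!(attrBuf[i] & 0x01)) continue;       // fog flag not set for this pixel: skip

        int idx = depthBuf[i] >> 19;              // crude 24-bit depth -> table index (0..31)
        int density = FogDensity[idx];            // assumed range: 0 (no fog) .. 128 (full fog)

        uint32_t r = (colorBuf[i] >>  0) & 0x3F;  // assumed RGB666 packing
        uint32_t g = (colorBuf[i] >>  6) & 0x3F;
        uint32_t b = (colorBuf[i] >> 12) & 0x3F;

        r = (r * (128 - density) + FogColorR * density) / 128;
        g = (g * (128 - density) + FogColorG * density) / 128;
        b = (b * (128 - density) + FogColorB * density) / 128;

        colorBuf[i] = (colorBuf[i] & ~0x3FFFFu) | (b << 12) | (g << 6) | r;
    }
}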


I guess we have more or less described the rendering process. We didn't get into texture blending (combining vertex and texture colors), but this can be emulated with a fragment shader. Same goes for edge marking and fog, provided we can find ways around the whole attribute buffer thing.


But regardless, my point there is that OpenGL or Vulkan (or Direct3D or Glide or whatever) will not make a difference here. Our modern GPUs have more than enough power to handle the raw polygons. The issue is all the rasterization details and quirks. It's not even about pixel perfection here: just take a look at DeSmuME's issue tracker, for example, to see the issues they're facing with OpenGL rendering, the same issues which we will have to tackle somehow.


As a side note, going with OpenGL would have the advantage of allowing it to be ported to, say, the Switch (as Github user Hydr8gon started making a Switch port of our emulator).


Well... wish me luck.


(fun reminders of the blargSNES times, where I initially thought that hardware rendering was impossible... and you know how that went)
Quick bugfix update for Windows
If you downloaded melonDS before now, and it crashes when you try to launch it, try redownloading it.

I pushed a quick bugfix update for the Windows build. Fixed the code that was responsible for loading INI/BIOS/firmware from AppData. So now, it shouldn't crash randomly anymore, and it should actually work as intended.

The bug went unnoticed because a) it strikes on a fresh melonDS install, and b) it's a heisenbug! It tried to access memory that was uninitialized and potentially freed, depending on a lot of things internal to the OS. You guess how it goes.
melonDS 0.7 -- Granting popular wishes
Or at least starting to do so. There isn't a lot in this release, but hey, we have to start somewhere.

At least, The Spark is back, somewhat. So I guess we can take melonDS somewhere.


Not a lot of novelty visually speaking, so there will only be one screenshot:



Nothing changed! :D

Except, if you look closely, the bottom border of the blue platform thing.

Couldn't resist.

Fixed a small bug regarding shadows and antialiasing that caused that.


What else? Miscellaneous fixes. melonDS shouldn't crash randomly when closing it anymore. And other things.


Oh, and savestates. Which were one of the popular demands, and which are finally brought to you.

Here is the outline of our implementation:

• 8 savestate slots, which are per-game. Savestate files are stored in the same place as save files.
• 'Undo state load' will revert to the state from right before loading a savestate. For example, if you loaded it accidentally.
• In savestate settings: 'Separate savefiles' allows saving to a separate save file after loading a savestate. The separate save file is bound to that savestate slot.
• Shift-Fx to save state, Fx to load state. x = 1 to 8 for regular slots, 9 for using an arbitrary file. F12 to undo the latest state load.


Have fun!


Windows 64-bit
Linux 64-bit

melonDS Patreon