melonDS aims at providing fast and accurate Nintendo DS emulation. While it is still a work in progress, it has a pretty solid set of features:

• Nearly complete core (CPU, video, audio, ...)
• JIT recompiler for fast emulation
• OpenGL renderer, 3D upscaling
• RTC, microphone, lid close/open
• Joystick support
• Savestates
• Various display position/sizing/rotation modes
• (WIP) Wifi: local multiplayer, online connectivity
• (WIP) DSi emulation
• DLDI
• (WIP) GBA slot add-ons
• and more are planned!







Download melonDS

If you're running into trouble: Howto/FAQ
A tour through melonDS's JIT recompiler Part 1
I have already talked about the JIT recompiler on this blog, but that was mostly blabla. Now we go into the nitty-gritty details of how everything works! Maybe this will help other people working on JIT recompilers, as there doesn't seem to be much written on the topic; I learned most of what I know from reading other people's source code and talking to them (which I still encourage!). Also, the JIT isn't my only work on melonDS, so I have some other topics to talk about later as well.

The heart of almost every emulator is the CPU emulation. The Nintendo DS has two ARM cores: the ARM7 inherited from the GBA, and an ARM9 which is the main processor. Fortunately, the main differences between the two, for our purposes, are a few extra instructions and the memory integrated into the ARM9 (DTCM, ITCM and cache; the latter deserves its own article, btw). Otherwise it's just a faster processor.

The most straightforward way to emulate a processor is an interpreter, i.e. replicating its function step by step. First the current instruction is fetched, then it's decoded to determine which handler is the appropriate one to execute it, and that handler is invoked. Then the program counter is increased and the cycle starts again.
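To make the fetch/decode/execute cycle concrete, here is a minimal sketch in Python. The three-instruction ISA, the handler names and the CPU state layout are all made up for illustration; none of this is melonDS code.

```python
# Toy fetch/decode/execute loop. The ISA below is hypothetical, not ARM.
def op_movi(cpu, rd, imm):
    cpu["regs"][rd] = imm

def op_add(cpu, rd, rs):
    cpu["regs"][rd] = (cpu["regs"][rd] + cpu["regs"][rs]) & 0xFFFFFFFF

def op_halt(cpu, *_):
    cpu["halted"] = True

HANDLERS = {"movi": op_movi, "add": op_add, "halt": op_halt}

def run_interpreter(program):
    cpu = {"regs": [0] * 4, "pc": 0, "halted": False}
    while not cpu["halted"]:
        opcode, *operands = program[cpu["pc"]]   # fetch
        handler = HANDLERS[opcode]               # decode
        handler(cpu, *operands)                  # execute
        cpu["pc"] += 1                           # advance the program counter
    return cpu

program = [("movi", 0, 5), ("movi", 1, 7), ("add", 0, 1), ("halt",)]
cpu = run_interpreter(program)
```

Note how the dictionary lookup and the indirect call happen again for every single instruction, even when the same code runs thousands of times.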

This approach has the advantage that it's relatively easy to implement while allowing for very accurate emulation, of course only if you take everything into account (instruction behaviour, timing, …), but it has the major disadvantage of being pretty slow: for every emulated instruction, quite a lot of native instructions have to be executed.

One way to improve this is what Dolphin and Mupen call a "cached interpreter". The idea is to take a few instructions at a time (a block) when they're first executed and save the decoding for them. The next time this block is executed, we just need to follow this list of saved handlers. Viewing multiple instructions at once has other advantages as well: for example, we can analyse the block and detect idle loops to skip them.
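A rough sketch of the cached interpreter idea, again with a made-up toy ISA. The block length is simply passed in here, and the `decode_count` counter only exists to show that decoding happens once per block.

```python
# Cached interpreter sketch: decode a block once, then replay the saved handlers.
# The toy ISA and all names here are hypothetical.
def op_movi(regs, rd, imm):
    regs[rd] = imm

def op_add(regs, rd, rs):
    regs[rd] = (regs[rd] + regs[rs]) & 0xFFFFFFFF

HANDLERS = {"movi": op_movi, "add": op_add}
block_cache = {}   # start pc -> list of (handler, operands)
decode_count = 0   # how many instructions were decoded in total

def decode_block(program, pc, length):
    global decode_count
    decoded = []
    for opcode, *operands in program[pc:pc + length]:
        decode_count += 1
        decoded.append((HANDLERS[opcode], operands))
    return decoded

def run_block(program, regs, pc, length):
    block = block_cache.get(pc)
    if block is None:                      # first execution: decode and save
        block = block_cache[pc] = decode_block(program, pc, length)
    for handler, operands in block:        # later executions: just replay
        handler(regs, *operands)
    return pc + len(block)

program = [("movi", 1, 3), ("add", 0, 1)]
regs = [0, 0]
for _ in range(4):
    run_block(program, regs, 0, 2)
```

The block runs four times, but each of its two instructions is only decoded once.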

But even the cached interpreter is still comparatively inefficient. What if we could generate a function at runtime which does the equivalent job of a block of emulated instructions and save it, so that the next time this block has to be executed we only need to call this function? With this method we could completely bypass branching out to the handlers which implement the respective instructions, because essentially everything is inlined. Other optimisations become possible too: we can keep emulated registers in native registers, or completely eliminate the computation of values which aren't used, and that's merely the beginning. That's where the speed of JIT recompilers comes from.
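The same idea can be mimicked in Python by generating source text for a block and turning it into a function at runtime. This is only an analogy under loud assumptions: the ISA is made up, and a real JIT like the one in melonDS emits native machine code, not Python source.

```python
# "Recompiler" sketch: turn a block of toy instructions into one Python function.
# The ISA is hypothetical; a real JIT emits native machine code instead.
def compile_block(block):
    lines = ["def jit_block(regs):"]
    for opcode, *ops in block:
        if opcode == "movi":
            lines.append(f"    regs[{ops[0]}] = {ops[1]}")
        elif opcode == "add":
            lines.append(
                f"    regs[{ops[0]}] = (regs[{ops[0]}] + regs[{ops[1]}]) & 0xFFFFFFFF"
            )
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    lines.append("    return regs")
    namespace = {}
    exec("\n".join(lines), namespace)      # build the function at runtime
    return namespace["jit_block"]

block = [("movi", 0, 5), ("movi", 1, 7), ("add", 0, 1)]
jit_block = compile_block(block)           # compile once
regs = [0, 0]
jit_block(regs)                            # afterwards, one call runs the whole block
```

The generated function contains no handler dispatch at all: the operands are baked into the code, which is the "everything is inlined" part.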

Before we can start recompiling instructions, we first need to clear up how blocks of instructions work. There are two main questions here:
  • where does a block begin and where does it end?
  • how are blocks saved/looked up?
Note that most of this applies to cached interpreters as well.

First, we say a block can only be entered via the first instruction and left via the last one. This makes code generation significantly easier for us, and also makes the generated code more efficient. So it's not possible to jump into a block halfway in; instead we would create another block which starts at that point. This has one problem: with the interpreter we can leave, or continue execution at another point, after every instruction, e.g. when an interrupt occurred or the timeslot of the CPU is over, while a JIT block has to be executed until the end. For this reason the maximum block size is adjustable in desmume (and some games require setting it below a certain value). This is the case for melonDS as well, though we have some more hacks and haven't heard of a game breaking at too high block sizes yet ;). The last thing to consider is that we can't just take the next n instructions from the first one and compile them into a block. We need to keep in mind that branch instructions can take the pc to any other place, including somewhere inside this block, and can also split the execution into two paths if they're conditional. While all of this could be handled to generate even more efficient code (we do this to some degree, more on that later), for now we leave it out. So after a branch instruction we end a block.
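Put together, the two rules (end the block after a branch, and never exceed the maximum block size) could look roughly like this. The opcode names and the default limit of 32 are assumptions for the sketch, not values taken from melonDS.

```python
# Sketch of block formation: stop after a branch or at the size limit.
# Opcode names and MAX_BLOCK_SIZE are assumptions for illustration.
MAX_BLOCK_SIZE = 32
BRANCH_OPCODES = {"b", "bl", "bx"}

def find_block(program, start_pc, max_size=MAX_BLOCK_SIZE):
    block = []
    pc = start_pc
    while pc < len(program) and len(block) < max_size:
        instr = program[pc]
        block.append(instr)
        pc += 1
        if instr[0] in BRANCH_OPCODES:     # a branch always ends the block
            break
    return block

program = [("mov",), ("add",), ("b", 0), ("sub",), ("mov",), ("bx",)]
first = find_block(program, 0)                # ends at the "b"
second = find_block(program, 3)               # a new block starts after the branch
capped = find_block(program, 0, max_size=2)   # cut short by the size limit
```

A smaller `max_size` means interrupts and timeslot changes get noticed sooner, at the cost of more, shorter blocks.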

The pivot of the second question is the block cache. melonDS's block cache has gone through a few iterations, but originally I just copied desmume's, which is the one I'm going to describe here; we'll get fancier in the future. The way the generated code is stored might sound crude: it's simply a large buffer (32 MB) which we fill from bottom to top, and once it's full we reset everything. That works surprisingly well, as it fits the code of most games, and we still do it like this. Now we need to associate the entry point of a block inside that buffer with the pc in the emulated system where that block starts. Since big parts of the address space are unused, it would be unwise to have a big buffer with a pointer for every possible address (that would also take 32 GB on a 64-bit system: 2^32 addresses times 8 bytes per pointer). A hash table would be an option, but lookup can be relatively slow with those. Instead we add one layer of indirection. There is a first array of pointers which divides the address space into regions of 16 KB or so. Each of those pointers points into another array belonging to one of the memory banks which exist, and the entries of that array in turn point to the entry point of each JIT block function. We also only need to store a pointer for every second address, as ARM (4 byte) and Thumb (2 byte) instructions are always aligned to their respective sizes.
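Here is a small sketch of such a two-level lookup. The exact region size, the 64 KB example bank, the addresses and all function names are assumptions; strings stand in for the pointers to generated code.

```python
# Two-level block lookup sketch. Region size, the bank layout and the addresses
# are assumptions; strings stand in for pointers to generated code.
REGION_SHIFT = 14                          # 16 KB regions
REGION_MASK = (1 << REGION_SHIFT) - 1
NAIVE_TABLE_BYTES = (1 << 32) * 8          # one 8-byte pointer per address

# First level: one slot per region, None where nothing executable is mapped.
first_level = [None] * (1 << (32 - REGION_SHIFT))

def map_bank(base, size):
    """Point every region of a memory bank into one shared slot table."""
    slots = [None] * (size >> 1)           # one slot per 2 bytes (Thumb alignment)
    for offset in range(0, size, 1 << REGION_SHIFT):
        first_level[(base + offset) >> REGION_SHIFT] = (slots, offset >> 1)
    return slots

def register_block(pc, entry_point):
    slots, region_start = first_level[pc >> REGION_SHIFT]
    slots[region_start + ((pc & REGION_MASK) >> 1)] = entry_point

def lookup_block(pc):
    region = first_level[pc >> REGION_SHIFT]
    if region is None:                     # unmapped part of the address space
        return None
    slots, region_start = region
    return slots[region_start + ((pc & REGION_MASK) >> 1)]

map_bank(0x02000000, 64 * 1024)            # hypothetical 64 KB bank
register_block(0x02000100, "block_a")
register_block(0x02004002, "thumb_block")  # second region, 2-byte aligned
```

A lookup is two array indexing operations and no hashing, and unused parts of the address space only cost one empty first-level slot per region.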

... read more
melonDS - now also for macOS!


Yep.
If you want to test it, scroll down to the bottom of the post. First, I'll explain what needed to be changed for it to work.

This originally started as a little challenge. "It shouldn't be that hard," I thought. However, it wasn't as easy as I would have hoped, but I got there in the end.

- The JIT recompiler

Thanks Generic (aka RSDuck) for helping me out a lot here and guiding me!

Fastmem

Fastmem mapped its memory using "memfd_create()" on Linux, which doesn't exist on macOS. Instead, on macOS "shm_open()" is used to create the fastmem memory.
macOS also doesn't have "->gregs" in "uc_mcontext", and no "REG_RIP" either. This had to be changed to "->__ss.__rip" instead.
Then, it would crash with a "bus error" on attempting to load. This happened because macOS raises a "bus error" (SIGBUS) instead of a "segmentation fault" (SIGSEGV), so the signal handler wasn't catching it.
Note: fastmem was disabled because it caused all sorts of errors while trying to boot firmware or run games. If anyone manages to fix it, send a pull request!

The JIT itself

The JIT would build, but at link time it would complain about "ARM_Dispatch" and "ARM_Ret" being undefined. Apparently in the Mach-O format (used on macOS), global function names defined in assembly are required to be prefixed with an underscore.
Then it would crash upon booting firmware or trying to load a game. This was caused by the line here which tried to reprotect some memory to make it executable. On macOS, new memory is now mmap'ed instead.

... read more
A lil' message to would-be translators
Since there's been a bunch of comments from people offering to translate the emulator's UI, I figured I would state this.

I am wary about internationalizing software before the end of the dev cycle. That being said, is there really an 'end of dev cycle' for an emulator project? I think it'd be a good idea to make melonDS accessible to languages that aren't English. I have a couple concerns about this though:

- I'd like translators to stick around. If they can be around to fix up their respective translations before each release, that will be great. I just really want to avoid having translations become incomplete and/or obsolete because their author is long gone.

- I want to ensure the translators are good at English and understand the terminology used in melonDS's UI. Just basic quality assurance, no Google Translate crap.

- What about this website? It's a whole different can of worms. The interface could be translated, but having to translate each and every blog post would be a massive pain in the ass.

There are also a bunch of technical concerns, but, overall, maybe we can try and pull this off for melonDS 1.0, or even earlier?

If you're in, check out this thread.

Thank you!
Changes to the website
This blog now uses the same user accounts as the forums. So if you have an account there, you can now use it to post comments here as well.

Of course, comments are also still open to guests.

There are more updates planned to this site, so, let me know asap if anything breaks.
The DSi camera adventure
I mentioned the DSi cameras in my previous post, and that's what I was working on lately. Mostly trying to get the cameras going on the DSi itself, so I could test the transfer hardware.

Well, it's not been that easy.

I had started work in the dsi_camera branch, but so far it was a large trainwreck. I couldn't really understand how camera transfers work and how everything interacts together. My attempt at a guessed implementation was getting nowhere, which meant it was time for some hardware research.

So I started work on a DSi camera test homebrew. I first went and implemented the initialization procedure found in GBAtek, only to be rewarded with a hang when trying to activate a camera. I tried many things, taking the init procedure from some open-source Aptina MT9V113 driver (the model of camera the DSi uses), reverse-engineering the DSi camera app to use its exact init procedure, all to no avail.

I felt stuck there. I even tried looking for existing examples using the DSi cameras, found this one by Epicpkmn11, but at the time it seemed to have the same issue I was having.

I eventually went out and asked for help in several places. A side effect is that I'm now found in some Discord servers. I also posted a thread at nesdev, knowing nocash hangs around there. The documentation in GBAtek implied he did get the cameras working, so I figured he'd be able to help. And he did, thanks there.

I first looked at the code he provided, checking for any meaningful differences in the init procedure, but it looked like I had all the essential stuff right. I was stumped.

It eventually occurred to me that maybe I should try initializing both cameras simultaneously, like Nintendo does, rather than only initializing one camera. You know how it is, when you're desperate, anything can look like a valid solution. Anyway, that didn't cut it, but it revealed something interesting when I tried to read some registers from both cameras. Some reads were getting corrupted. So I knew something was up with the I2C code.

Looking at nocash's I2C code, I was able to spot and fix the issue. Turns out that during an I2C read, you don't raise an ack when reading the last byte. This fixed the corruption I was observing, and finally allowed the camera to activate successfully. At the same time, Epicpkmn11 happened to be in the same Discord server I was in, so they could fix their code too (turns out it did have the same issue as mine).
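For reference, the ACK rule for a multi-byte I2C read can be sketched like this. `FakeI2CBus` is a hypothetical stand-in for the real bus code (start/stop conditions and addressing are left out), and the byte values are arbitrary rather than real camera register contents.

```python
# I2C multi-byte read sketch: ACK every byte except the last one.
# FakeI2CBus is a hypothetical stand-in for the real bit-level bus code.
class FakeI2CBus:
    def __init__(self, data):
        self.data = list(data)
        self.acks = []                     # ACK flag sent after each byte read

    def read_byte(self, ack):
        self.acks.append(ack)
        return self.data.pop(0)

def i2c_read(bus, count):
    result = []
    for i in range(count):
        is_last = (i == count - 1)
        # The master NACKs the final byte to tell the device to stop sending.
        result.append(bus.read_byte(ack=not is_last))
    return bytes(result)

bus = FakeI2CBus([0x22, 0x80])             # arbitrary example bytes
value = i2c_read(bus, 2)
```

If the master ACKs the last byte too, the device may keep driving the data line for a next byte, which is one way reads can end up corrupted.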

... read more
Sorry for the silence lately
I know that since the 0.9 release, there hasn't been a lot of progress. On my side, things have been pretty rough, especially regarding depression. Under these circumstances, it's good to take a break.

Anyway, what can we attempt doing, at this point?

Besides dealing with the pull requests and issue reports?

One thing I was working on lately was DSi camera support, but I didn't get too far. I'm going to need hardware tests to figure out how the camera hardware works. Considering the lengthy initialization procedure for those, it's not quite something I look forward to. So I'll post more about this when I get further into it.

I have ideas for the OpenGL renderer, namely, a better method for rendering quads. It would need more work for an implementation though, but might be worth it.

But, one of my main concerns is about wifi, especially local multiplayer.

At this point, melonDS is mainly known as 'the wifi emulator'. It's a bit sad that, 3 years after we got it working, we're still telling people to disable their framerate limiter and pray. We can probably do better.

It's not like we haven't tried, though. You might have seen that branch named 'betterer_wifi' in the repo. I was hoping to run the wifi with more stable timing, but it was a trainwreck: it performed even worse than our current method.

The main issue with local multiplayer is that it requires tight synchronization to function. You might remember how finicky it was back in the old days, you would start lagging and disconnecting as soon as your friend was more than 10m away from you. Long story short, the protocol works by having the host repeatedly poll its clients, multiple times per frame, and each client is given a narrow window to respond (the time given is barely greater than what it takes to transfer the response frame).

... read more
melonDS 0.9 is out!
It's been forever, but, finally, here it is. melonDS 0.9.

And it's big.



So, what are the highlights of this release?


- JIT recompiler

Brought to you by Generic (aka RSDuck), the new JIT recompiler enables melonDS to run much faster, and quite often reach full speed even when emulating DSi titles!

There are a few settings you can try out to get the most out of this JIT. While it has been heavily tested and worked on, it's still imperfect.


- DSi emulation

This is the other flagship feature of this release: melonDS now emulates the DSi!

... read more
Getting there...


I'll let you guess ;)
Messin' around with the GL renderer
First of all, lil' status update. Things are nearly ready for the 0.9 release, we are mostly busy ironing out small issues here and there to ensure everything is good. Nobody would want something big like that 0.9 release to end up being a total flop.

Also, if you ever wondered why progress is slow: an emulator project is like a tree in that once you're done with the trunk, it branches off in a billion different directions. At this point, melonDS is too big to be a one-man show. There are a few people working on it now, but there's only so much we can do with our time and motivation.

Anyway, while Generic is busy polishing the JIT, I figured I'd go around and try fixing some of the issues on the issue tracker.

This one, Deformed floor textures in the Celestial Tower (Pokemon Black/White) (OpenGL only), is an interesting problem. The base issue is that, if you happen to remember, the DS can draw quads natively, while modern GPUs can't. Actually, it's even worse: clipping on the DS cuts through polygons but doesn't create more polygons, which means that you can end up with a maximum of 10 vertices per polygon. For example, a triangle that sticks out of the view volume can become a quad, a pentagon, or more.

The base issue here is faulty rendering of these polygons that have more than 3 vertices. The DS employs a scanline-based convex polygon renderer, so it doesn't care how many vertices your polygon has. Software renderers used in DS emulators use similar filling algorithms, so no problem there either. However, when you use OpenGL, it's another deal entirely -- modern OpenGL-compatible GPUs are very good at drawing triangles, but... that's about it.

So, what do you do when you need to render a quad? Easy, split it into two triangles!
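In code, the naive split is a simple triangle fan, which works for any convex polygon and not just quads. This is a generic sketch with made-up coordinates, not the melonDS renderer, and it says nothing about how vertex attributes interpolate across the split.

```python
# Fan triangulation sketch for a convex polygon given as a list of vertices.
def triangulate_fan(vertices):
    if len(vertices) < 3:
        raise ValueError("a polygon needs at least 3 vertices")
    v0 = vertices[0]
    # Every triangle shares the first vertex: (v0, v1, v2), (v0, v2, v3), ...
    return [(v0, vertices[i], vertices[i + 1]) for i in range(1, len(vertices) - 1)]

quad = [(0, 0), (4, 0), (4, 3), (0, 3)]
triangles = triangulate_fan(quad)
```

A polygon with n vertices yields n - 2 triangles, so the quad becomes two triangles and a 10-vertex clipped polygon would become eight.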

Suppose the quad below.



All fine and dandy. Now we split this, like this:

... read more
maxmod fixes
Sorry for the silence lately -- these times are getting pretty busy. melonDS wise, we're trying to perfect the JIT and the few other things we want for the 0.9 release. Real life wise, on my side, I'm starting the procedure to get my gender marker updated, which is going to be the last big thing for my transition.

Anyway, just yesterday, asie told me that maxmod's interpolated mode was broken in melonDS, which piqued my curiosity.

What's maxmod, you ask? It's a real fancy audio library for the GBA and DS. I don't know much about the GBA side of it, but on the DS, it supports three modes: hardware mode, interpolated mode, and extended mode. Hardware mode is fairly straightforward. Interpolated mode resamples audio on the fly, applying interpolation to make up for the lack of hardware interpolation. Extended mode does its own mixing, adding support for more channels than the hardware can offer.

In our case, interpolated mode was broken, outputting pretty much short high-pitched beeps and nothing else. I was curious to see if maxmod used any of the fancy audio capture/output modes that melonDS doesn't support (because no commercial game uses them, and my policy is to avoid implementing things until I have test cases). melonDS is set to report if any such features are being used, but in this case, it didn't report anything out of the ordinary. So this meant I'd have to dig further.

Quick regression testing showed that interpolated mode worked okay-ish on melonDS versions prior to 0.6. Well, it didn't sound as it should, but it was at least reasonably close, instead of just being high-pitched beeps.

So apparently it was completely broken when sound FIFOs were implemented. The FIFO logic was fine, but the addition of that feature worsened the consequences of a bug that had always been there.

I logged what was going on during playback, to try and figure out where it failed. It appeared that the data being fed to the audio channels was fine, but, for whatever reason, the channels themselves failed to actually pull the data from memory. More logging revealed some strange things, like how certain things started at zero when they shouldn't. Notably, the channel timers started at zero, when they're supposed to start at the SOUNDxTMR reload value. But also, the FIFO level started at zero, causing it to immediately go negative and break the FIFO filling logic.

For a while, I scratched my head at all that, until it finally clicked.

In interpolated mode, maxmod will first disable the mixer, then sequentially initialize the channel registers, then enable the mixer, letting all channels start in perfect sync.

... read more