A tour through melonDS's JIT recompiler Part 1
I've already talked about the JIT recompiler on this blog before, but that was mostly blabla. Now we go into the nitty gritty details of how everything works! Maybe this will help other people working on JIT recompilers, as there doesn't seem to be much written on this topic; I learned a lot of it myself from reading other people's source code and from talking to them (which I still encourage!). Also, the JIT isn't my only work on melonDS, so I have some other topics to talk about later as well.

The heart of almost every emulator is the CPU emulation. In the case of the Nintendo DS there are two ARM cores: the ARM7 inherited from the GBA, and an ARM9 core which is the main processor. Fortunately the main differences between the two, for our purposes, are a few extra instructions and the memory integrated into the ARM9 (DTCM, ITCM and cache; the latter deserves its own article btw). Otherwise it's also just a faster processor.

The most straightforward way to emulate a processor is an interpreter, i.e. replicating its operation step by step: first the current instruction is fetched, then it is decoded to determine the appropriate handler, which is then invoked. Finally the program counter is advanced and the cycle starts again.
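To make this concrete, here is a minimal sketch of such an interpreter loop. The types and function names are made up for illustration and heavily simplified (branches and Thumb mode are ignored); melonDS's real interpreter is structured differently.

    #include <cstdint>

    // Illustrative types only, not melonDS's actual ones.
    struct CPU { uint32_t R[16]; };                // R[15] is the program counter
    using InstrHandler = int (*)(CPU&, uint32_t);  // executes one instruction, returns cycles

    uint32_t FetchInstruction(CPU& cpu);           // hypothetical: read 32 bits at R[15]
    InstrHandler Decode(uint32_t instr);           // hypothetical: pick the right handler

    void RunInterpreter(CPU& cpu, int cyclesLeft)
    {
        while (cyclesLeft > 0)
        {
            uint32_t instr = FetchInstruction(cpu);  // fetch
            InstrHandler handler = Decode(instr);    // decode
            cyclesLeft -= handler(cpu, instr);       // execute (a branch would modify R[15])
            cpu.R[15] += 4;                          // advance to the next ARM instruction
        }
    }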

This approach has the advantage that it's relatively easy to implement while allowing for very accurate emulation (of course only if you take everything into account: instruction behaviour, timing, …), but it has the major disadvantage of being pretty slow. For every emulated instruction quite a lot of native instructions have to be executed.

One way to improve this is what Dolphin and Mupen call a "cached interpreter". The idea is to take a few instructions at a time (a block) when they're first executed and save the decoding work for them. The next time this block is executed we just need to walk the list of saved handlers. Looking at multiple instructions at once has other advantages as well: for example, we can analyse the block and detect idle loops so we can skip them.
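Sticking with the made-up types from the interpreter sketch above, the cached interpreter idea looks roughly like this (again just a sketch, not Dolphin's or Mupen's actual code):

    #include <cstdint>
    #include <vector>

    // One pre-decoded instruction: the handler plus the raw instruction word.
    struct CachedInstr { InstrHandler handler; uint32_t instr; };
    // A block is simply the list of pre-decoded instructions.
    struct CachedBlock { std::vector<CachedInstr> instrs; };

    int RunCachedBlock(CPU& cpu, const CachedBlock& block)
    {
        int cycles = 0;
        for (const CachedInstr& ci : block.instrs)
            cycles += ci.handler(cpu, ci.instr);  // no fetch/decode needed any more
        return cycles;
    }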

But even the cached interpreter is still comparatively inefficient. What if we could instead generate a function at runtime which does the equivalent job of a block of emulated instructions and save it, so that the next time this block has to be executed we only need to call that function? With this method we completely bypass branching out to the handlers which implement the respective instructions, because essentially everything is inlined. Other optimisations become possible as well: we can keep emulated registers in native registers, or completely eliminate the computation of values which are never used, and that's merely the beginning. That's where the speed of JIT recompilers comes from.
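To illustrate what "generating a function" means here, take a tiny guest block of two ARM instructions. Written as C++ (the JIT of course emits native machine code into an executable buffer, not C++), the generated function would conceptually look like the sketch below; the block name, address and cycle count are made up.

    // Guest block at 0x02000100 (address and cycles purely illustrative):
    //   ADD R0, R0, R1
    //   MOV R2, R0, LSL #1
    int Block_02000100(CPU& cpu)
    {
        cpu.R[0] = cpu.R[0] + cpu.R[1];  // ADD R0, R0, R1
        cpu.R[2] = cpu.R[0] << 1;        // MOV R2, R0, LSL #1
        return 2;                        // cycles spent by this block (simplified)
    }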

Before we can start recompiling instructions we first need to clear a few things up about blocks of instructions. There are two main questions here:
  • where does a block begin and where does it end?
  • how are blocks saved/looked up?
Note that most of this applies to cached interpreters as well.

First we say a block can only be entered via its first instruction and left via its last one. This makes the code generation significantly easier for us, and it also makes the generated code more efficient. So it's not possible to jump into a block halfway through; instead we would create another block which starts at that point. This has one problem: with the interpreter we can stop or redirect execution after every single instruction, e.g. when an interrupt occurs or the CPU's timeslot is over, while a JIT block has to run to its end. For this reason the maximum block size is adjustable in desmume (and some games require setting it below a certain value), which is the case for melonDS as well, though thanks to a few additional hacks we haven't heard of a game breaking at too high block sizes yet ;).

The last thing to consider is that we can't just take the next n instructions from the first one and compile them into a block. We need to keep in mind that branch instructions can send the PC to all sorts of other places, including somewhere inside this block, and can also split execution into two paths if they're conditional. While all of this could be handled to generate even more efficient code (we do this to some degree, more on that later), for now we leave it out. So after a branch instruction we end the block.
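A sketch of how block boundaries could be picked under these rules; ReadInstructionAt and IsBranch are hypothetical helpers, ARM mode is assumed, and this is not melonDS's actual code.

    #include <cstdint>
    #include <vector>

    uint32_t ReadInstructionAt(uint32_t addr);  // hypothetical guest memory read
    bool IsBranch(uint32_t instr);              // hypothetical: does this instruction branch?

    std::vector<uint32_t> CollectBlock(uint32_t startPC, int maxBlockSize)
    {
        std::vector<uint32_t> block;
        uint32_t pc = startPC;
        while ((int)block.size() < maxBlockSize)  // respect the maximum block size
        {
            uint32_t instr = ReadInstructionAt(pc);
            block.push_back(instr);
            pc += 4;
            if (IsBranch(instr))                  // a branch ends the block
                break;
        }
        return block;
    }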

The answer to the second question revolves around the block cache. melonDS's block cache has gone through a few iterations; originally I just copied desmume's, which is the one I'm going to describe here (we get fancier in the future). The way the generated code is stored might sound crude: it's simply a large buffer (32 MB) which we fill from bottom to top, and once it's full we reset everything. That works surprisingly well, as it fits the code of most games, and we still do it like this.

Now we need to associate the entry point of a block inside that buffer with the PC in the emulated system where that block starts. Since big parts of the address space are unused, it would be unwise to have a big buffer with a pointer for every possible address (that would also take 32 GB on a 64-bit system). A hash table would be an option, but lookup can be relatively slow with those. Instead we add one layer of indirection. There is a first array of pointers which divides the address space into regions of 16 KB or so. Each of those pointers points into another array (one for each memory bank which exists), which in turn holds the entry point of each JIT block function. We also only need to store a pointer for every second address, as ARM (4 byte) and Thumb (2 byte) instructions are always aligned to their respective sizes.
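A simplified sketch of that two-level lookup, reusing the CPU type from the earlier sketches; the region size and types are illustrative and the real block cache differs in the details.

    #include <cstdint>

    using JitBlockEntry = int (*)(CPU&);  // compiled block entry point, returns cycles spent

    const int kRegionShift = 14;          // 16 KB regions

    // First level: one slot per 16 KB region of the 32-bit address space.
    // Second level: per-region array with one slot per 2 bytes
    // (the minimum instruction alignment, i.e. Thumb).
    JitBlockEntry** regionTables[1u << (32 - kRegionShift)];

    JitBlockEntry LookUpBlock(uint32_t pc)
    {
        JitBlockEntry* table = regionTables[pc >> kRegionShift];
        if (!table)
            return nullptr;  // no compiled code can live in this region
        return table[(pc & ((1u << kRegionShift) - 1)) >> 1];
    }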

Now, instead of the usual interpreter loop described before, we look up whether there's a JIT block at the current execution address. If there is, we execute it; otherwise we compile a new one starting at that address and insert it into the block cache for that address.
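Building on the lookup sketch above, the main loop then looks roughly like this; CompileBlock and InsertIntoCache are again made-up names standing in for the real thing.

    JitBlockEntry CompileBlock(uint32_t pc);                 // hypothetical: emit code for a new block
    void InsertIntoCache(uint32_t pc, JitBlockEntry entry);  // hypothetical: store it in the block cache

    void RunJit(CPU& cpu, int cyclesLeft)
    {
        while (cyclesLeft > 0)
        {
            JitBlockEntry block = LookUpBlock(cpu.R[15]);
            if (!block)
            {
                block = CompileBlock(cpu.R[15]);
                InsertIntoCache(cpu.R[15], block);
            }
            cyclesLeft -= block(cpu);  // run the generated code for this block
        }
    }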

That concludes the first part of this series. Next time we'll look into the recompilation process itself!
Rayyan says:
Dec 5th 2020
Nice to see technical insights such as this!
Arnie says:
Dec 6th 2020
This was a riveting and insightful read. Good one
Nixel says:
Dec 7th 2020
As a less-experienced programmer, I don't quite understand what all of this means. I got the gist of it though, and trying to understand complicated things is always fun to me.
Niklink says:
Dec 7th 2020
I don't understand WHY JIT is faster than regular interpretation. Why is 'generating a function to execute a block of instructions' a better approach than a regular interpreter? Are we somehow skipping the need to decode and translate the instructions? Or is interpretation done on the whole block in one swoop somehow? How is this even possible to do, and why is it faster if it's being done at runtime still? In short, what's the actual difference in the technique?
Generic aka RSDuck says:
Dec 7th 2020
it is of course only faster because we only do the translation once and save the result. If we didn't save the result, this approach would probably even be slower. I should have made that clearer.

EDIT: done
Anon says:
Dec 8th 2020
Wait, so is the interpreter running during compile time? I'm sorry if I'm being dense, I'm very new to all of this. So would a higher block size mean more CPU utilization considering that it's processing more instructions? I remember back in the day all of the desmume tutorials would recommend setting the highest possible block size for maximum performance, so I'm a little confused.
Generic aka RSDuck says:
Dec 8th 2020
no, the interpreter runs while a game is running.

The block size only affects how many instructions will be grouped and compiled into one JIT block. Usually you want block sizes as large as possible for optimal performance, because processors work fastest if the path of execution they're taking is very straightforward. With the interpreter the processor has to branch out and back for every single emulated instruction, while with the JIT it only has to do this once per block of instructions. We can also generate more efficient code if we have more context.

But as I outlined before, if we make the maximum block size too high it will be less accurate, because there are other pieces of hardware which have to run as well. So the CPU needs to be stopped regularly to let them run.
sards says:
Dec 10th 2020
Thanks for the write up. I have a few more technical questions.

1. How do you handle timing/cycle counting? Do you just count up the total CPU cycles of a block during the JIT process and then use that for synchronization when executing blocks?

2. How do you handle the persistence of the ARM CPU state? I imagine that in host memory somewhere, you store the ARM CPU state (registers, flags, etc.) So when you execute a JITed block, do you have to transfer the ARM CPU state from memory to X86 registers every time, and then store the values in the X86 registers back to the in-memory state at the end of every block? Or do you make some effort to keep some state in the X86 registers across blocks?

3. How do you handle guest memory access? The typical interpreter has Read() and Write() functions for memory access that check the requested memory address to see if it is RAM, ROM, unmapped, etc. and then direct the read/write to the appropriate host memory buffer. In the JITed blocks, do you call out to this type of function for every guest memory access, or do you inline it somehow?
Generic aka RSDuck says:
Dec 10th 2020
1. yes, basically, though there's one additional twist. We'll get to it in the next entry.

2. yes, though we're pretty smart about this and only load registers which are going to be read and only write back registers which were written to. Preserving state across blocks would require some kind of special dispatcher which wouldn't touch those registers (because by the standard calling convention you can't ask for the preservation of registers after a return), or you would need to dispatch the next block from within the current block (block linking, something I tried which made performance worse). And you would also need to either hardcode certain register mappings (which is a risk I wouldn't take on x64, where register pressure is quite high), or compile special variants of blocks for certain register mappings, but you can see how this turns into a mess.

Though that is about inter-block jumps. Jumping within a block (which is the more common case anyway) with preservation of registers is something I consider possible and probably useful, but it requires a very sophisticated code generator to get right. While it's probably possible to shoehorn it into melonDS's simple code generator (which translates on a per-instruction basis, without some kind of IR which would make things like this a lot easier), the timing slots on a DS are relatively short, so you would need to plaster the resulting code with checks and exits, which would make everything slower again (similar to my block linking experiment). I want to do this for another project, an emulator for a newer system, where this problem doesn't apply.

3. We have had fastmem for a while now, i.e. mapping virtual memory so that it replicates the memory layout of the guest system, which is very fast but of course doesn't cover I/O registers and certain memory regions (fastmem is definitely worth its own article). Otherwise, if fastmem is disabled or a load/store accesses memory not mapped via fastmem, we just use a C function, which turned out to be the fastest thing. I tried some other things like an inlined pointer table, but that was slower.
sards says:
Dec 10th 2020
Thanks for the detailed reply!
modwizcode says:
Feb 22nd 2021
Out of curiosity, have you looked into how qemu's JIT (which it calls TCG) handles this? I've been working with QEMU a lot recently; while I've mostly kept out of the instruction side of things, a couple of strategies they use seem to allow handling more things online. They also (supposedly) schedule instructions in a block such that IO is always performed at the end of a block (although I should look at how this works, since theoretically that's not deterministic).

QEMU does some merging and optimization of virtualized guest memory, but it seems to lack any fancier memory strategies like Dolphin has.

Also, one more question: any reason you didn't start from something like Dolphin's architecture? You talk about it in the post and I know it to be well built, so I'm curious if it just wasn't really relevant? Dolphin's code almost makes me like C++ ;p
Generic aka RSDuck says:
Feb 23rd 2021
I hadn't looked at qemu until now, when you mentioned it. It seems to use an IR-based approach, which I definitely want to try in the future for a different project, though I don't know what kind of block cache they use. It's probably a lot simpler than melonDS's, for the reasons outlined in the last paragraph.

Regarding the memory optimisations, we do have a few; I want to go into further detail in a future post. Though I can't imagine either how this memory access reordering is supposed to work.

When I started working on the JIT for melonDS I had just finished the aarch64 JIT for desmume (which was my first endeavour in this area), and as such it was also a big influence for me. Though as I started doing more advanced optimisations which went beyond what desmume was doing, I looked a lot into the code of Dolphin, PPSSPP and Mupen and also copied a lot from them structurally. The main reason I couldn't copy the block cache from any of them is the way memory is structured on the DS compared to most other systems. E.g. on GC there's only a single 24 MB chunk of memory accessible to the main CPU. We can only dream of conditions like that on the DS: we have two processors with different memory layouts and a bunch of remappable areas and mirrors; it's honestly a mess.