|Home | Downloads | Screenshots | Forums | Source code | RSS | Donate|
|Register | Log in|
|< melonDS - now also for macOS!Back in business >|
A tour through melonDS's JIT recompiler Part 1
Dec 5th 2020, by Generic aka RSDuck
I already talked about the JIT recompiler on this blog before, but that was mostly blabla. Now we go into the nitty gritty details of how everything works! Maybe this will help other people working on JIT recompilers as there seems to be not so much written on this, so I learned a lot about this from reading other people's source code and talking to them (which I still encourage!). Also the JIT isn't my only work on melonDS, so I have some other topics to talk about later as well.
The heart of almost every emulator is the CPU emulation. In the case of the Nintendo DS it has two ARM cores, the ARM7 inherited from the GBA and an ARM9 core which is the main processor. Fortunately the main difference between these two for us are a few extra instructions and memory integrated in it (DTCM, ITCM and cache, the latter deserves it's own article btw). Otherwise it's also just a faster processor.
The most straightforward way to emulate a processor is an interpreter, i.e. replicating it's function step by step. So first the current instruction is fetched, then it's decoded to determine which handler is the appropriate one to execute it, which then is invoked. Then the program counter is increased and the cycle starts again.
This approach has the advantage that it's relatively easy to implement while allowing for very accurate emulation, of course only if you take everything into account (instruction behaviour, timing, …), but has the major disadvantage that it's pretty slow. For every emulated instruction quite a lot native instrutions have to be executed.
One way to improve this is what Dolphin and Mupen call a "cached interpreter". The idea is to take a few instructions at a time (a block) when they're first executed and save the decoding for them. Next time this block is executed we just need to follow this list of saved handlers. Viewing multiple instructions at once has other advantages as well, like e.g. we can analyse it and detect idle loops to skip them.
But even the cached interpreter is still comparatively inefficent. But what if we can generate a function at runtime which does the equivalent job of a block of emulated instructions and save it, so next time this block of instructions has to be executed we only need to call this function? With this method we could completely bypass branching out to the handlers which implement the respective instructions, because essentially everything is inlined. Other optimisations become possible, like we can keep emulated registers in native registers or we can completely eliminate the computation of values which aren't used and that's merely the beginning. That's where the speed of JIT recompilers comes from.
Before we can start recompiling instructions we first need to clear up on blocks of instructions. There are two main questions here:
First we say a block can only be entered via the first instruction and left via the last one. This makes the code generation significantly more easier for us, but also the generated code more efficient. So it's not possible to jump into a block half way in, instead we would create another block which would start at that point. This has one problem: with the interpreter we can leave or execute at another point after every instruction, e.g. when an interrupt occured or the timeslot of the cpu is over, while a JIT block has to be executed until the end. For this reason the maximum block size is adjustable in desmume (and some games require setting it below a certain value) which is the case for melonDS as well, though we have some more hacks haven't heared of a game breaking at too high block sizes yet ;). The last thing to consider is that we can't just take the next n instructions from the first one and compile them into a block. We need to keep in mind that branch instructions can bring the pc to any other places, including somewhere inside this block and can also split the execution into two paths if they're conditional. While this all could be handled to generate even more efficient code (we do this to some degree, more on that later), for now we leave this out. So after a branch instruction we end a block.
The pivot of the second question is the block cache. melonDS's block cache has gone through a few iterations, though originally I just copied desmume's which is the one I'm going to describe here, we get fancier in the future. The way the generated code is stored might sound crude but it's simply a large buffer (32 MB) which we fill from bottom to top, once it's full we reset everything. That works surprisingly well, as it fits the code of most games and we still do it like this. Now we need to associate the entry point of a block inside that buffer with the pc in the emulated system where that block starts. Since big parts of the address space are unused it would be unwise to have a big buffer with a pointer for every possible address (that would also take 32 GB on an 64-bit system). A hash table would be an option but lookup can be relatively slow with those. Instead we add one layer of indirection. There is a first array of pointers which divides the address space into 16 KB or so regions. Each of those pointers point into other arrays for all the memory banks which exist which then point to the entry point of each JIT block function. We also only need to store a pointer for every second address, as ARM (4 byte) and Thumb (2 byte) instructions are always aligned to their respective sizes.
Now instead of the usual interpreter loop described before we instead lookup if there's a JIT block at the current execution address. If yes we execute it, otherwise we compile a new one starting at that address and insert it into the block cache for that address.
That concludes the first part of this series. Next time we'll look into the recompilation process itself!
|12 comments have been posted.|
|< melonDS - now also for macOS!Back in business >|