Actual status update on timings Apr 2nd 2026, by Arisotura |
As promised, timings work is underway. I'm mostly done collecting numbers, minus stuff like CP15 operations and any special cases that might crop up, and I have started fixing up ARM7 timings.

Thing is, timings work isn't always easy. Collecting timing numbers is one thing; understanding the logic behind them is another thing entirely. Geometry engine timings, which I worked on years ago, are a prime example: when you submit a display list to the geometry engine, the total execution time isn't simply the sum of each command's execution time. The reason is that the geometry engine has a pipeline, so it can execute certain commands in parallel. It took me a lot of testing and number collecting to understand the logic behind it. If the geometry engine, a simple 3D command processor, is already like that, you can imagine what a full-fledged CPU is going to be like. This is why CPU timings are an area of melonDS that has been more or less neglected for so long.

On the flip side, the DS isn't like older consoles such as the GameBoy or the NES, where games might rely on very precise timings to get the best out of the system's limited power, and might crash if a certain event occurs 2 cycles too late. On the DS, from what I've seen, it's more about timings at the macro level, i.e. how long a bigger operation takes. For example, consider a DMA transfer copying a screen-sized bitmap, that is, 24576 words. A difference of 1 cycle in the initial DMA setup time doesn't matter, but a difference of 1 cycle per word transferred quickly adds up. Like here.

So, what are ARM7 timings like? The ARM7 has a 3-stage pipeline, which means it can execute an instruction while decoding the next one and fetching the one after. So, for a lot of instructions, the execution time is capped by the access time of the memory region we're executing from. Some instructions may need more cycles to execute. Load and store instructions are also particular, since they access memory.
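To illustrate the point about per-word error dwarfing setup error, here is a minimal sketch of the arithmetic. The function name and the cycle figures are made up for illustration; this is not melonDS code.

```cpp
#include <cstdint>

// Hypothetical model: total cycle count of a DMA transfer, given a
// one-time setup cost and a per-word transfer cost.
uint64_t dmaCycles(uint64_t setupCycles, uint64_t cyclesPerWord, uint64_t numWords)
{
    return setupCycles + cyclesPerWord * numWords;
}
```

For a 24576-word screen-sized transfer, getting the setup time wrong by 1 cycle shifts the total by 1 cycle, while getting the per-word time wrong by 1 cycle shifts it by 24576 cycles.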
Different memory regions have different access times, depending on their data bus width and the type of memory access. Most of the available memory regions are integrated into the DS SoC, so they can be accessed in one cycle. VRAM has a 16-bit bus, so 32-bit accesses are broken into two 16-bit accesses and take two cycles.

External memory, however, has its own rules. There is a notion of nonsequential and sequential accesses: the memory being accessed may require a setup time for an initial access, but may be able to chain subsequent accesses faster. The ARM7 makes use of this when fetching code: as long as a given instruction doesn't require extra cycles, the next one can be fetched with a sequential memory access, resulting in faster execution. Instructions like LDM and STM, which can load or store several words in a row, also make use of this.

Rules for the GBA slot ROM region are simple: the nonsequential and sequential access times are configured in EXMEMCNT. The data bus is 16-bit, so 32-bit accesses are broken down into two 16-bit accesses, where the second one is always sequential. The wifi card uses a similar interface, so the rules are the same.

Main RAM, however, is more complicated. To understand how the timings there work, you have to know how those RAM chips operate. I mentioned my little Wii U gamepad project before, and it's relevant here: it involves an FPGA-based SPI flash emulator, which uses SDRAM to store firmware code. I first tried using spispy, but since it wasn't up to the task, I ended up making my own, including a simple SDRAM controller. The DS and DSi use FCRAM, which is a bit different: from my understanding, you address FCRAM by sending the full address in one go, instead of having to send separate row and column addresses. Besides that, FCRAM seems to work in a fairly similar way. For example, SDRAM supports burst modes: you send the address once and access several data units in a row (in my case, 16-bit units).
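The GBA slot rules above lend themselves to a small sketch. Function names are hypothetical and the nonseq/seq values would come from the EXMEMCNT configuration; this is an illustration, not melonDS's implementation.

```cpp
#include <cstdint>

// The GBA slot ROM region has a 16-bit data bus: a 32-bit access is split
// into two 16-bit halves, and the second half is always sequential.
uint32_t gbaRom32BitAccess(uint32_t nonseq, uint32_t seq)
{
    return nonseq + seq; // first half nonsequential, second half sequential
}

// A run of n back-to-back 16-bit fetches (e.g. straight-line code fetch):
// the first access is nonsequential, the rest can chain sequentially.
uint32_t gbaRomFetchRun(uint32_t nonseq, uint32_t seq, uint32_t n)
{
    return n == 0 ? 0 : nonseq + (n - 1) * seq;
}
```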
Since burst accesses are typically constrained to a given memory block (which appears to be 32 bytes for the DS), a smart memory controller needs to know when to prepare the next burst ahead of time. There is also a delay when terminating a burst. Furthermore, the DS seems to have a hard limit on how long a main RAM burst can last; presumably, since main RAM is shared, this serves to avoid having one side hog it for too long. So basically, when main RAM is involved, access times aren't just a matter of "how long does this particular access take": the context around it has to be taken into consideration to some extent.

There is also, of course, the issue of main RAM contention: when both the ARM9 and the ARM7 try to access main RAM at the same time, the memory controller will prioritize one of the two and delay the other, based on the EXMEMCNT priority setting. In practice, this setting seems to always be set to give priority to the ARM7.

The issue here is apparent if you consider how melonDS works. Execution is split into bursts. The scheduler determines how long a burst should last: the length is based on the time until the next scheduled event, capped at 64 cycles. The ARM9 is run for this many cycles, then the ARM7 is run until it has caught up, then scheduled events are executed if needed; rinse and repeat. This is less accurate than running everything in tight lockstep, but it's also way more efficient. In practice, DS games don't rely on extremely tight synchronization between the two CPUs, so doing things this way is fine. However, it makes it impossible to model things such as main RAM contention: there is simply no way to tell whether the two CPUs are going to be accessing main RAM at the same time. It could be estimated based on heuristics, but that would be a big fat hack. In practice, I don't yet know if anything relies on main RAM contention, besides the infamous DSi menu loader.
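The burst-based scheduling described above can be sketched roughly as follows. Names and structure are illustrative assumptions, not melonDS's actual scheduler.

```cpp
#include <algorithm>
#include <cstdint>

// Bursts never exceed 64 cycles, per the scheme described above.
constexpr uint64_t kMaxBurst = 64;

// How long the next burst should be: run until the next scheduled event,
// but never longer than the cap.
uint64_t burstLength(uint64_t now, uint64_t nextEventTime)
{
    return std::min(nextEventTime - now, kMaxBurst);
}

// The main loop would then look something like:
//
//   while (running) {
//       uint64_t len = burstLength(timestamp, scheduler.NextEvent());
//       RunARM9(len);                    // advance ARM9 by len cycles
//       RunARM7UntilCaughtUp(timestamp); // ARM7 catches up to ARM9's time
//       scheduler.ProcessDueEvents(timestamp);
//   }
```

Because the two CPUs only synchronize at burst boundaries, there is no cycle-exact moment at which a simultaneous main RAM access by both could be detected, which is exactly why contention can't be modeled in this scheme.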
That's a bit of a special case, since the ARM9 runs with caches off, and the two CPUs are actively competing for main RAM access. In most games, the ARM7 runs its code from internal WRAM, and the ARM9 has caches; however, caches aren't perfect, and the sound hardware may be accessing main RAM too. Given this, I don't know how much main RAM contention matters in games.

Then, since we're talking about the ARM9, there's this too. Timings there are even more complicated. First, the ARM9 runs at 67 MHz (or 134 MHz on the DSi), but the bus still runs at 33 MHz. This means that when the ARM9 needs to access any memory outside of its own caches and TCM, additional delays are incurred: in hardware terms, crossing clock domains requires additional buffering. At least, from what I've observed, setting the ARM9 clock to 67 MHz on the DSi yields timings identical to the DS, so there's that. However, setting it to 134 MHz results in several differences, so I have to figure out the logic behind that.

Then, the ARM9 is also a more complex beast in itself. The caches can greatly help performance, considering how slow main RAM is, but they also add complexity to the overall timings. For example, upon a cache miss, an entire cache line needs to be loaded; however, the CPU may continue running while this happens, as long as it doesn't try to use the bus. That's what they call cache streaming. There's also a write buffer, which has its own mechanics too.

The ARM9 also has a longer pipeline, with 5 stages. This means some instructions may need more time to properly write their results to the CPU registers. Jakly said that there are special paths where a given instruction may send its result to the next instruction early if needed, but this doesn't cover every possible case. For example, take the QADD instruction: a signed 32-bit addition with a sticky overflow flag.
This instruction takes 1 cycle to execute, but needs one extra cycle to actually write its result back to the destination register. So if the next instruction tries to use that register, it has to wait one extra cycle. I don't know all the details there; Jakly is more knowledgeable than I am, so I'll probably be relying on her to help figure out all the logic. The challenge is to implement it in a way that doesn't result in overbearing complexity and remains optimizable to some extent.

As I said, at this point I'm fixing up ARM7 timings. According to my tests, melonDS wasn't that far off, but the ARM7 still needs some fixes. I'll also need a lot more testing in real-world situations, not just specific instruction sequences repeated many times.

We'll also have to see what to do with the JIT. Surely, a lot of the overall timing model can be precomputed when compiling code blocks, so the JIT would have a decent advantage there. But I'm not sure about the more dynamic things, like the ARM9 caches. I'll have to see with Generic. But not before I've finished making all this work in the interpreter.

All in all, fun shit.
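As a closing aside, the QADD-style writeback delay could be modeled with a per-register "ready time". This is a hypothetical sketch of one way to track such interlocks, not how melonDS does it.

```cpp
#include <cstdint>

// Each register records the cycle at which its value becomes available;
// an instruction reading a register before that time must stall.
struct InterlockTracker
{
    uint64_t regReady[16] = {};

    // Record that 'reg' is written at 'execTime' but only readable
    // 'extraCycles' later (e.g. 1 extra cycle for QADD's late writeback).
    void writeReg(int reg, uint64_t execTime, uint64_t extraCycles)
    {
        regReady[reg] = execTime + extraCycles;
    }

    // How many cycles an instruction issued at 'issueTime' must stall
    // before it can read 'reg'.
    uint64_t stallFor(int reg, uint64_t issueTime) const
    {
        return regReady[reg] > issueTime ? regReady[reg] - issueTime : 0;
    }
};
```

The forwarding paths mentioned above would shrink or zero these penalties for the cases they cover, which is where the real complexity lies.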
6 comments have been posted.
curlyq says: Apr 2nd 2026

just popping in to say thank you for all the work that you do!

Mithrandir says: Apr 3rd 2026

I can only say you are doing a fantastic job, all the people who enjoy your emulator can be pleased to you for your incredible job. Thank you very much

Swedish New Moderate says: Apr 4th 2026

Thank you so much for everything 🤗

Cobalt66 says: Apr 4th 2026

This is like Greek to me, but I am so glad you exist and that you work so hard on such a great service that brings countless people joy for free.

evieamity says: Apr 4th 2026

I can't thank you enough for all you've been doing. My friend and I have been having so much fun using it to share our Pokémon Black 2 and White 2 gameplay with each other. It's so nice to see that things are progressing. I love this emulator so much~! ♥

marion iggers says: Apr 9th 2026

i made a full pokemon marathon with your emulator thanks