The timings saga, ep 2
At this point, this might as well be its own little saga... albeit not as exciting as the local multiplayer one, at least on the surface. What is exciting, though, is the possibility of dealing with timing issues once and for all.

Maybe not. We can't get it 100% perfect without cycle accuracy. But we can get as close as possible, and then see how much leeway we have. It should be easier to figure this out with a more correct model. When your timing model is fundamentally inaccurate, you can tweak the numbers, but that's a game of whack-a-mole: it fixes some bugs and causes other bugs. This is what the "Enable Advanced Bus-level Timings" option in DeSmuME does. The cache timing constants in melonDS, if they could be modified by the end user, would yield similar results.

Fun fact, melonDS's current timing model is flawed. Due to a few bugs, it doesn't even work as intended. But even if it did, it would still be inaccurate.

DesperateProgrammer's cache PR, aka PR #1955, is a pretty solid base for cache emulation. I think it's a good sign that it alone has shown promising results, despite being based on the same fundamentally flawed timing model.

Anyway, enough talking. Let's see where we're at today.


So far, I'm mostly done with ARM7 timings. I've been collecting timing numbers, figuring out the logic behind those with Jakly's help (thanks there!), and implementing it. In my timing tests, melonDS yields the same numbers as hardware, so that's nice. It feels pretty satisfying when you figure out the logic and everything clicks into place.

I still need to double-check everything, and do actual real-world timing tests, to make sure I have everything right.

However, as far as CPU timings are concerned, the ARM7 is only the appetizer.

The ARM7's timing model is fairly simple. The ARM7 doesn't have internal memory and only has one bus, and everything runs at the same frequency, so most timings are fairly straightforward.

Things get more complex when main RAM is involved.

To understand how this works, it's helpful to consider how the RAM itself works. There are at least two possible RAM chips for the DS, as well as different, bigger RAM chips found in devkits and such. Since those are likely to have different limitations, the DS's memory controller needs to adhere to the lowest common denominator in order to ensure consistent operation.

GBAtek states that the access time for main RAM is 8 cycles, but the reality is more complicated. There is a setup time and a termination time. The former means that when accessing the RAM, there will be some latency before it begins to return data (or to accept data, when writing). At this point, it's possible to access multiple consecutive addresses in a burst. When ending a burst, there is a minimum delay before you can access the RAM again: that's the termination time.

The interesting part is that after a main RAM access, the ARM7 is free to do other things during the burst termination time, as long as it doesn't try to access main RAM again - in that case, it would have to wait.
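To make the setup/termination idea a bit more concrete, here's a rough sketch in Python. This is not melonDS code: the class name and all three cycle counts are made-up placeholders, not measured hardware values. It only shows the shape of the logic, which is that a burst pays the setup time once, and the termination time only costs anything if the next main RAM access comes too soon.

```python
from dataclasses import dataclass

# Placeholder cycle counts, NOT measured hardware values.
SETUP_CYCLES = 5        # latency before the first word of a burst
WORD_CYCLES = 1         # cost of each word once the burst is running
TERMINATION_CYCLES = 3  # minimum gap after a burst ends


@dataclass
class MainRAM:
    busy_until: int = 0  # timestamp at which the RAM accepts a new burst

    def access(self, now: int, words: int) -> int:
        """Return the timestamp at which the last word of the burst is done."""
        start = max(now, self.busy_until)  # wait out termination if needed
        done = start + SETUP_CYCLES + words * WORD_CYCLES
        self.busy_until = done + TERMINATION_CYCLES
        return done


# Two back-to-back accesses: the second one has to wait for termination.
ram_a = MainRAM()
first = ram_a.access(0, 1)
back_to_back = ram_a.access(first, 1)

# Same two accesses, but with 3 cycles of unrelated work in between.
ram_b = MainRAM()
first_b = ram_b.access(0, 1)
with_other_work = ram_b.access(first_b + 3, 1)
```

With these placeholder numbers, both sequences finish at the same timestamp: the unrelated work fits entirely inside the termination window, so it's effectively free.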

There are also other limitations and edge cases. For example, there is a limit on how long a burst can last. In practice, though, this particular limitation only matters for DMA. On the ARM7, consecutive data fetches (i.e. LDM) are limited to 16 words, which isn't enough to hit the burst limit. Code fetches can hit it under very specific circumstances which are highly unlikely to occur in real-world situations. And it doesn't apply to the ARM9 at all.

That's about it, though. The other memory regions are pretty straightforward timings-wise.

There is only one thing that's really missing from ARM7-side timings, besides the holy grail of main RAM contention, and that is audio DMA. The audio hardware accesses memory periodically, either to read sample data or to write captured data. Those memory accesses will incur stalls on the ARM7, and probably on DMA too.

Audio memory accesses are done in bursts of 4 words, through the FIFO system. melonDS emulates this part already, but ignores the timing aspect of it. Emulating this bit correctly would require deeper research into how the audio hardware operates and when it accesses memory.

Of course, I don't know how truly important it is, but it's worth keeping in mind. There are also several timing-related issues on the ARM7 side in DSi mode: the faster SPI clock isn't supported yet, timings for SD transfers are a big fat guess, and I2C transfers and AES operations are instant.


So now, let's get to the main course: the ARM9.

The ARM9 is much more important, since that's where all the game logic runs. It's also more complex.

I've been banging my head against the numbers I've collected from my timing tests, trying to figure out the logic behind them. Jakly has been very helpful here, too.

At least, we've figured out the logic behind memory read instructions. It appears that the ARM9 can try to initiate the next code fetch early, but this doesn't always work as intended - sometimes resulting in weird extra latency, because every bus access must go through buffering.

The ARM9 also introduces us to interlocks.

For example, if you run an LDR instruction (32-bit memory read) on the ARM7, the CPU will retrieve the data from memory, then spend one more cycle storing that data into the desired register.

The ARM9 works the same way, except it will attempt to save time by starting the next instruction before LDR is actually done storing its result. But what if the next instruction tries to use the same register that LDR is supposed to write to? That's an interlock - this next instruction will need to wait for LDR to actually finish.
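Here's a toy illustration of that difference, as a Python sketch. It is heavily simplified and not how melonDS does it: every instruction costs one placeholder cycle, the memory access itself is ignored, and the instruction format and the `run` function are invented for the example. The only thing it models is the writeback cycle and the interlock.

```python
def run(instructions, pipelined_load: bool) -> int:
    """Count cycles for a list of (op, dest, sources) tuples.

    pipelined_load=False: the load's writeback always costs one extra cycle.
    pipelined_load=True:  the writeback overlaps the next instruction, unless
                          that instruction reads the loaded register.
    """
    cycles = 0
    pending = None  # register whose loaded value isn't written back yet
    for op, dest, sources in instructions:
        if pending is not None and pending in sources:
            cycles += 1  # interlock: wait for the load to finish writing
        pending = None
        cycles += 1  # placeholder base cost of the instruction
        if op == "ldr":
            if pipelined_load:
                pending = dest  # writeback happens "in the background"
            else:
                cycles += 1     # writeback is always paid up front
    return cycles


dependent = [("ldr", "r0", ("r1",)), ("add", "r2", ("r0", "r3"))]
independent = [("ldr", "r0", ("r1",)), ("add", "r2", ("r4", "r3"))]

arm7_like = (run(dependent, False), run(independent, False))
arm9_like = (run(dependent, True), run(independent, True))
```

In this toy model, the pipelined version only wins when the next instruction doesn't touch the loaded register. When it does, the interlock eats the cycle that was saved.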

This is something that will need a great deal of testing. Supposedly, the ARM9 has "fast paths" meant to avoid interlocks in certain situations, so we need to figure out where this applies.

Also, fun observation: sometimes the ARM9 is just as hacky as the rest of the console!

Take LDRD, for example. This is a new instruction that has been added to the ARM9. It functions as a 64-bit memory read: it reads two words from memory to a pair of registers.

I have stared at LDRD's timing characteristics long and hard, and it is evident to me that LDRD is basically LDR with a second memory read duct-taped onto the end. This also means that the second memory read is nonsequential, which makes it inefficient.
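To show why that second nonsequential read hurts, here's a bit of back-of-the-envelope arithmetic in Python. The two costs are arbitrary placeholders (real costs depend on the memory region and aren't given here), and `read_cost` is just a helper made up for the example.

```python
# Placeholder costs in cycles, NOT measured hardware values.
NONSEQ = 4  # first access to a new address
SEQ = 1     # follow-up access to the next consecutive address


def read_cost(kinds):
    """Sum the cost of a list of accesses: 'N' = nonsequential, 'S' = sequential."""
    return sum(NONSEQ if kind == "N" else SEQ for kind in kinds)


ldrd_as_observed = read_cost(["N", "N"])  # LDR plus a second, separate read
ldrd_as_burst = read_cost(["N", "S"])     # what a true 2-word burst would cost
```

The bigger the gap between nonsequential and sequential costs in a given region, the more LDRD loses compared to a proper 2-word burst.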

By comparison, its counterpart STRD (64-bit memory write) seems more sane.

In general, all the memory write instructions seem sane - they don't have to worry about interlocks. I think the rest isn't too difficult - the instructions that don't touch memory are pretty straightforward. I haven't yet looked into the CP15 (system coprocessor) operations, though.

The issue with the ARM9 is going to be everything pertaining to memory. The caches. The separate buses. All that. For example, the ARM9 can continue running during a DMA transfer as long as it doesn't try to use the external bus. melonDS's very crude implementation of bus stalls doesn't support that.
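As a rough sketch of what "continue running during a DMA transfer" means for timings, here's a Python toy model. Everything in it is hypothetical: each CPU step costs one cycle, steps are simply labelled as using internal memory or the external bus, and the DMA is assumed to hold the external bus from time 0 until `dma_end`. The `stall_whole_cpu` flag is only there for comparison with a blanket stall; it isn't meant to represent melonDS's actual code.

```python
def run_cpu(steps, dma_end: int, stall_whole_cpu: bool) -> int:
    """Return the CPU timestamp after running all steps.

    steps: list of "internal" or "external" (which bus each step needs).
    dma_end: timestamp at which the DMA releases the external bus.
    """
    # Blanket stall: the CPU doesn't run at all until the DMA is over.
    now = dma_end if stall_whole_cpu else 0
    for kind in steps:
        if kind == "external" and now < dma_end:
            now = dma_end  # external bus is busy: wait for the DMA to finish
        now += 1  # placeholder cost of one step
    return now


mostly_internal = ["internal"] * 10 + ["external"]

blanket_stall = run_cpu(mostly_internal, dma_end=8, stall_whole_cpu=True)
bus_aware = run_cpu(mostly_internal, dma_end=8, stall_whole_cpu=False)
```

In this example the CPU spends the whole DMA transfer doing internal work, so by the time it needs the external bus, the bus is already free and the DMA costs it nothing.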


This also brings me to the JIT.

At this point, it's a balancing act. There is a feeling that the JIT hinders further accuracy improvements. Understandable... the JIT is fundamentally inaccurate. To get a speed benefit, the JIT needs to operate on large enough code blocks, which doesn't mesh well with the way melonDS's scheduler works.

On the other hand, it seems pretty evident to me that removing the JIT would be massively unpopular. I'm against it.

I need to get more familiar with the JIT's inner workings. At this point, I know the basic idea of how a JIT works, but I've never actually written one, which is a bit ironic for someone in my position.

What interests me is to see how far we can go with timings in the JIT. I imagine that a lot of the overall CPU timing model could be handled at compile time. On the other hand, I don't think emulating the ARM9 caches in the JIT would work terribly well, performance-wise. This makes me wonder how much performance there would be to gain from a cached interpreter approach.
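For those unfamiliar with the term, here's the general idea of a cached interpreter as a minimal Python sketch. It's a generic illustration on a made-up two-instruction toy ISA, not a design for melonDS: the instruction format, the `STATIC_COST` table and the class are all invented for the example. The point is that each block gets decoded once, and whatever part of the timing is known ahead of time gets summed up at decode time instead of on every run.

```python
# Toy instruction set: (mnemonic, dest, immediate). Entirely hypothetical.
STATIC_COST = {"mov": 1, "add": 1}  # placeholder per-instruction base costs


def decode(instr):
    """Turn one toy instruction into a callable that mutates the state dict."""
    op, dest, imm = instr
    if op == "mov":
        def handler(state):
            state[dest] = imm
    elif op == "add":
        def handler(state):
            state[dest] = state[dest] + imm
    else:
        raise ValueError(f"unknown op: {op}")
    return handler


class CachedInterpreter:
    def __init__(self, memory):
        self.memory = memory  # block address -> list of toy instructions
        self.cache = {}       # block address -> (handlers, static cycle count)
        self.decodes = 0      # how many times a block had to be decoded

    def block(self, addr):
        if addr not in self.cache:
            self.decodes += 1
            instrs = self.memory[addr]
            handlers = [decode(i) for i in instrs]
            static_cycles = sum(STATIC_COST[i[0]] for i in instrs)
            self.cache[addr] = (handlers, static_cycles)
        return self.cache[addr]

    def run_block(self, addr, state):
        handlers, static_cycles = self.block(addr)
        for handler in handlers:
            handler(state)
        state["cycles"] += static_cycles  # timing known at decode time


interp = CachedInterpreter({0: [("mov", "r0", 5), ("add", "r0", 3)]})
state = {"cycles": 0}
interp.run_block(0, state)
interp.run_block(0, state)  # second run reuses the decoded block
```

Anything that depends on runtime state, like cache hits and misses or bus contention, can't be folded into that static count and would still have to be computed while the block runs.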

Food for thought.
9wo says:
Apr 10th 2026
hi can you add retro achivements in the pc version if its okay with you
Klauserus says:
Apr 10th 2026
Hey, I just wanted to say that I really appreciate the work you're doing here. It's incredibly impressive to see the level of detail and effort you're putting into improving the timing accuracy.

I hope you stay motivated and, just as importantly, take care of your health along the way. Looking forward to seeing how this evolves!
chase says:
Apr 10th 2026
I assume unpopular means it's related to speed? It's 2026 my pc can handle it, prioritize accuracy :)
Slayer_ says:
Apr 11th 2026
I agree with the other user. We have other emulators that can give speed for who cares going over the 100% speed.
Given your effort into the emulator, I'd prioritize accuracy over everything else, for an emulator that would behave just like a console.
Stay well, thanks for your work.
poudink says:
Apr 11th 2026
I'm a little worried by the amount of people who seem to be seriously suggesting removing the JIT. It's an optional feature and it gives a sizable performance boost at little cost to compatibility. Please don't remove it.
Anon says:
Apr 11th 2026
Been writing there under different names, I really loved the post although I'm a total profane about ultra lowlevel emulation and emulation in general, but personally this whole JIT and timings post gives me an idea, why not repurposing the JIT to only cache operations unrelated with timings and use the waiting time to interpret timing related instructions? Just throwing ideas at a wall hoping something I say can be useful
Arisotura says:
Apr 12th 2026
JIT is not going anywhere

I do really want to experiment with a cached interpreter design tho -- pondering how it could help optimize stuff like timing calculations. the tricky part is that timing calculations aren't strictly per-instruction, even moreso with the ARM9 as multiple things can run in parallel