Pride month and DMA timings
Jun 7th 2021, by Arisotura
Well, happy Pride month, or whatever, dunno how to put it. As a trans girl of some flavor, I just want to reiterate support to the LGBTI+ community. I'm also not the only LGBTI+ member of the melonDS team, btw.
Now, let's talk about something more technical (so this post is not just a 'political statement' :P ).
You might have noticed the timing17 branch. So what, another timing branch. Arisotura just loves these. Or something.
It's going to be some general timing renovation, depending on how far my motivation will take me. I started the work with DMA timings, figuring it would just be a matter of taking into account that sequential timings only work when the address is incrementing linearly... well, it's hairier than that.
This is based on tests done at like 04:00, and this is the DS, so take this with a rock of salt.
Most memory regions in the DS have such timing characteristics that sequential accesses make no difference, at least from the standpoint of DMA. The ARM9 and its 3-cycle penalty when using the bus are another story.
However, there are a couple regions that have different rules.
Main RAM (0x02000000)
Seeing how slow the DS's main RAM is (8 cycles for a 16-bit access), it makes sense for it to support some form of burst access: when accessing a bunch of consecutive memory addresses, the first one will get the full waitstate, then consecutive ones will be faster.
However, how this interacts with DMA is... weird. I did some testing on the ARM9 side; it's probably similar on the ARM7 side, but I will have to test there to confirm it.
Main RAM reads can be parallelized to some extent with writes to other memory regions. In practice, this seems to shave off one cycle from the nominal nonsequential timing.
Main RAM writes can be done sequentially in maximum bursts of 120 halfwords in 16-bit mode, or 80 words in 32-bit mode. You guessed it: the first write in the burst gets the nonsequential timing, the rest get the sequential timing.
Reads are a bit weirder. In 32-bit mode, we get a maximum burst length of 118 words. In 16-bit mode, however, there seems to be a hardware glitch: we get two bursts of 119 halfwords each, then one burst of two halfwords, then two bursts of 119 halfwords, and so on, in a repeating pattern. It's like something (the DMA controller?) is enforcing a maximum burst length of 240 halfwords which was miscalculated.
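To make the burst rules above a bit more concrete, here is a rough sketch of how a transfer could be chopped into bursts. This is not melonDS code: MainRamAccess, SplitIntoBursts and TransferCycles are made-up names, the nonseq/seq timings are plain parameters (only the burst lengths come from the tests described above), and it assumes the 119/119/2 pattern starts fresh at the beginning of each transfer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names, for illustration only.
enum class MainRamAccess { Write16, Write32, Read16, Read32 };

// Splits a linearly incrementing main RAM transfer of `units` accesses
// (halfwords in 16-bit mode, words in 32-bit mode) into bursts.
// Each burst is one nonsequential access followed by sequential ones.
std::vector<uint32_t> SplitIntoBursts(MainRamAccess kind, uint32_t units)
{
    // 16-bit reads: observed repeating pattern of 119, 119, 2 halfwords.
    static const uint32_t kRead16Pattern[3] = {119, 119, 2};

    std::vector<uint32_t> bursts;
    uint32_t step = 0;
    while (units > 0)
    {
        uint32_t maxLen = 0;
        switch (kind)
        {
        case MainRamAccess::Write16: maxLen = 120; break;
        case MainRamAccess::Write32: maxLen = 80; break;
        case MainRamAccess::Read32:  maxLen = 118; break;
        case MainRamAccess::Read16:  maxLen = kRead16Pattern[step % 3]; break;
        }
        assert(maxLen > 0); // guards against an invalid enum value

        uint32_t len = units < maxLen ? units : maxLen;
        bursts.push_back(len);
        units -= len;
        step++;
    }
    return bursts;
}

// Total cycles for the bursts, given assumed per-access timings.
// nonseq and seq are inputs here, not measured values.
uint32_t TransferCycles(const std::vector<uint32_t>& bursts,
                        uint32_t nonseq, uint32_t seq)
{
    uint32_t total = 0;
    for (uint32_t len : bursts)
        total += nonseq + (len - 1) * seq;
    return total;
}
```

For example, SplitIntoBursts(MainRamAccess::Read16, 480) gives 119, 119, 2, 119, 119, 2. The one-cycle read parallelization mentioned earlier would just be folded into whatever nonseq value gets passed in.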
The whole sequential burst thing assumes two things: a) that you are DMAing from main RAM to another memory region, or vice versa, and b) that the main RAM address is incrementing linearly.
Regarding a), DMAing from main RAM to main RAM will force each access to be nonsequential with no parallelization possible, resulting in abysmal performance. It is even faster to DMA from main RAM to another memory and then back to main RAM in two separate transfers.
Regarding b), burst access only works with a linearly incrementing address. Setting the DMA to use a fixed or decrementing address results in all-nonsequential accesses. For whatever reason, it seems that in this case, reads to the last halfword of each 32-byte block are one cycle faster.
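Conditions a) and b) boil down to a small check, roughly like the sketch below. Again, these are hypothetical helpers and not actual melonDS code. The quirk function assumes the 32-byte blocks are 32-byte aligned and that the quirk applies to 16-bit reads; nonseq is a parameter rather than a measured value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class AddrMode { Increment, Decrement, Fixed };

// True when the main RAM burst rules can apply: exactly one side of the
// transfer is main RAM, and the main RAM address increments linearly.
bool MainRamBurstPossible(bool srcIsMainRam, bool dstIsMainRam,
                          AddrMode mainRamMode)
{
    if (srcIsMainRam == dstIsMainRam)
        return false; // main RAM to main RAM, or main RAM not involved
    return mainRamMode == AddrMode::Increment;
}

// Cost of one 16-bit main RAM read with a fixed or decrementing address
// (all accesses nonsequential). Reads of the last halfword of a 32-byte
// block come out one cycle faster.
uint32_t NonBurstRead16Cycles(uint32_t addr, uint32_t nonseq)
{
    assert(nonseq > 0);
    bool lastHalfwordOfBlock = (addr & 0x1F) == 0x1E;
    return lastHalfwordOfBlock ? nonseq - 1 : nonseq;
}
```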
Got it all? Now figure out an elegant way to implement all this into melonDS. I'm not too concerned about performance, DMA transfers aren't a bottleneck, I just want the code not to be a huge mess.
The GBA slot
Things are simpler there. The timings for the GBA ROM region are as configured in EXMEMCNT. The DMA controller can do burst accesses with few restrictions: the first halfword is nonsequential, as well as the last halfword of each 0x20000-byte block, the rest are all sequential. This even works with a fixed or decrementing address, which hints to me that maybe these modes don't work correctly in this region.
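The GBA ROM rule is simple enough to sketch in a few lines. GbaRomDmaCycles is a made-up helper, nonseq and seq stand for whatever EXMEMCNT selects (they are parameters here), and the sketch assumes an incrementing address and 0x20000-aligned blocks.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: cycles for a 16-bit DMA from the GBA ROM region.
// The first halfword and the last halfword of each 0x20000-byte block
// are nonsequential; everything else is sequential.
uint32_t GbaRomDmaCycles(uint32_t startAddr, uint32_t halfwords,
                         uint32_t nonseq, uint32_t seq)
{
    assert((startAddr & 1) == 0); // halfword-aligned start

    uint32_t total = 0;
    uint32_t addr = startAddr;
    for (uint32_t i = 0; i < halfwords; i++, addr += 2)
    {
        bool lastOfBlock = (addr & 0x1FFFF) == 0x1FFFE;
        total += (i == 0 || lastOfBlock) ? nonseq : seq;
    }
    return total;
}
```

A loop is obviously not the fastest way to do it, but it keeps the rule readable, and a closed-form version can be checked against it.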
There are no weird parallelizing shenanigans or whatever, either.
The GBA RAM region does not support sequential accesses. And, seeing the results I get on hardware, it might just not support DMA at all, or only under certain specific circumstances.
The wifi regions
The wifi I/O ports and RAM are mapped across two mirrored regions, at 0x04800000 (WS0) and 0x04808000 (WS1). Each region can be configured to have different waitstates, via WIFIWAITCNT. The point seems to be that the I/O and RAM regions each have different preferential timings.
This one needs a bit more testing, but the interface seems very similar to that of the GBA ROM region, even down to the way WIFIWAITCNT works, so there's probably nothing too fancy there.
WIFIWAITCNT (ARM7 - 0x04000206)
Bit 0-1: WS0 nonsequential timing (0-3 = 10, 8, 6, 18 cycles)
Bit 2: WS0 sequential timing (0-1 = 6, 4 cycles)
Bit 3-4: WS1 nonsequential timing (0-3 = 10, 8, 6, 18 cycles)
Bit 5: WS1 sequential timing (0-1 = 10, 4 cycles)
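Decoding that register could look something like this. WifiWaitstates and DecodeWifiWaitCnt are made-up names; the cycle values are taken straight from the bit layout above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical struct holding the decoded timings, in cycles.
struct WifiWaitstates
{
    uint32_t ws0Nonseq;
    uint32_t ws0Seq;
    uint32_t ws1Nonseq;
    uint32_t ws1Seq;
};

// Decodes a WIFIWAITCNT value according to the bit layout above.
WifiWaitstates DecodeWifiWaitCnt(uint16_t value)
{
    static const uint32_t kNonseq[4] = {10, 8, 6, 18};
    static const uint32_t kWs0Seq[2] = {6, 4};
    static const uint32_t kWs1Seq[2] = {10, 4};

    WifiWaitstates ws;
    ws.ws0Nonseq = kNonseq[value & 0x3];         // bits 0-1
    ws.ws0Seq    = kWs0Seq[(value >> 2) & 0x1];  // bit 2
    ws.ws1Nonseq = kNonseq[(value >> 3) & 0x3];  // bits 3-4
    ws.ws1Seq    = kWs1Seq[(value >> 5) & 0x1];  // bit 5
    return ws;
}
```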
Testing will need to be done on the DSi too. Most of the timing characteristics should be the same, but there are a couple things to take into account. Main RAM might have slightly different timings in some of the edge cases. There are the new WRAM regions. There are settings that can expand some busses from 16-bit to 32-bit (like for VRAM), which likely affects timings. There is the NDMA controller.
Why go through all that trouble?
Do the DMA timing details matter a lot? Yes and no.
Let's say you have a 4096-word DMA transfer, and you determine its duration to be 8192 cycles (assuming 2 cycles per word). If the actual hardware timing is like 8 cycles off, it doesn't matter a lot.
However, let's say the actual hardware timing is off by one cycle every 16 words. That's a total error of 256 cycles, which has more of an impact on things. Worse, in this hypothetical scenario, such an error would grow with larger DMA transfers too.
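The arithmetic for that hypothetical scenario, spelled out (AccumulatedError is just an illustrative name):

```cpp
#include <cassert>
#include <cstdint>

// Error in cycles when the model is off by one cycle every
// `wordsPerErrorCycle` words. Grows linearly with transfer length.
constexpr uint32_t AccumulatedError(uint32_t words, uint32_t wordsPerErrorCycle)
{
    return words / wordsPerErrorCycle;
}

// 4096 words at 2 cycles per word, off by one cycle every 16 words.
static_assert(4096 * 2 == 8192, "nominal transfer duration");
static_assert(AccumulatedError(4096, 16) == 256, "accumulated error");
```

Double the transfer length and the error doubles too, whereas a fixed 8-cycle setup error stays at 8 cycles.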
So, while I'm not spending a lot of effort on things like DMA setup delays, I'm putting all the effort into understanding how the overall timing of a DMA transfer relates to the transfer length, because that is the important part.
As I said in the previous post, I really want to deal with the timing issues once and for all.
This is where emu coders tend to be like "oh no, you have to emulate the ARM9 caches". This sort of thing is a bit of an emulation holy-grail. The last big timing improvement possible before needing a cycle-accurate emulator (which, with the DS, well, good luck).
But, in reality, emulating the ARM9 caches is not going to be a magical fix if our underlying timing model is inaccurate. At best, it will just create a different inaccurate timing profile, and we would be telling users to switch between the two and hope one of them will work.
We might end up having to emulate the ARM9 caches. That would suck big time. But we're not going to try to find out before our timing model is accurate enough. And, at that point, that accuracy along with the current kCodeCacheTiming/kDataCacheTiming approximation might be enough. Who knows.
Well, this went ass-far. And we haven't even started doing CPU timings. That's gonna be a fun ride for sure.