Main - Development - TIMING NOTES


StapleButter
Posted on 08-08-18 10:54 AM (rev. 8 of 11-05-18 08:01 PM) Link | #638
DMA TIMING


measurements are to be taken with a rock of NaCl. overhead w/ setting up timer/DMA/etc seems variable. timing is unreliable despite disabling cache and IRQ.


oh well.


memory / cycles32 / cycles16

mainRAM -> mainRAM / 18 / 16
mainRAM -> VRAM / 3 / 2
mainRAM -> VRAM unmapped(?) / 2 / 2
VRAM -> mainRAM / 4 / 2
pal -> mainRAM / 4 / 2
OAM -> mainRAM / 3 / 2
mainRAM -> OAM / 2 / 2
VRAM -> VRAM / 3 / 2
BIOS -> mainRAM / 3 / 2
mainRAM -> BIOS / 2 / 2 (does it detect that can't be written to??)
NULL -> mainRAM / 3 / 2 (this DMA does run)
0E000000 -> mainRAM / 3 / 2
0F000000 -> mainRAM / 3 / 2
mainRAM -> NULL / 2 / 2 (does run)
mainRAM -> 0E000000 / 2 / 2
mainRAM -> 0F000000 / 2 / 2
NULL -> NULL / 2 / 2 (runs)

results aren't very precise tho.

I'm hungry.

the pal/VRAM/OAM numbers don't account for the video controllers possibly reading from that memory too. the only one supposed to be running is the sub one, which might still access pal/OAM.


organizing shit a bit


mainRAM -> X

00000000: 2/2
01000000: 2/2
02380000: 18/16
03000000: 2/2 (shared WRAM probably not mapped, has to be checked)
03800000: 2/2
04000000: DS turned off. guess DMAing large chunks of shit there is not so much a good idea.
04200000: 2/2
04800000: 2/2
05000000: 3/2
06000000: 2/2 (unmapped VRAM)
06800000: 3/2 (mapped VRAM)
07000000: 2/2
08000000 thru 0F000000: 2/2
FFFF0000: 2/2
FFFF1000: 2/2

01000000(null) -> X

01000000: 2/2
02380000: 3/2
05000000: 3/2
06800000: 3/2
07000000: 2/2
FFFF0000: 2/2

04000000 -> X

01000000: 2/2
02380000: 3/2
05000000: 3/2
06800000: 3/2
07000000: 2/2
FFFF0000: 2/2
basically same as null.

05000000 -> X

01000000: 3/2
02380000: 4/2
05000000: 5/3
06800000: 4/2
07000000: 3/2
FFFF0000: 3/2

06800000 -> X

01000000: 3/2
02380000: 4/2
05000000: 4/2
06800000: 5/3
07000000: 3/2
FFFF0000: 3/2

07000000 -> X

01000000: 2/2
02380000: 3/2
05000000: 3/2
06800000: 3/2
07000000: 3/3
FFFF0000: 2/2

FFFF0000 -> X

01000000: 2/2
02380000: 3/2
05000000: 3/2
06800000: 3/2
07000000: 2/2
FFFF0000: 2/2 (knows it can't write????)


NOTE ON SAME-REGION DMA PENALTY

06800000->06800000: 5/3
06800000->06810000: 5/3 (within same bank)
06800000->06820000: 4/2 (different banks)
06800000->06840000: 3/2 (unmapped)

penalty applies when the same memory bank is accessed for reading and writing, in general.

side note on VRAM: overlapping banks don't add more waitstates.



sooo. timing rules, barring mainRAM which is a bit special.

16bit

always 2. except when doing sameregion transfer, then it's 3.

32bit

hairy, w/ different bus sizes.

16bit bus: VRAM, palette. mainRAM seems to be on a different bus.
32bit bus: OAM, BIOS, WRAM...

16->16: 4, 5 when sameregion
16->32: 3
32->16: 3
32->32: 2, 3 when sameregion

mainRAM->X (mainRAM counted as the 16 source):

16->16: 3
16->32: 2

it has different read timing??


cycle breakdown

16bit: 1 read + 1 write + 1 sameregion-penalty

32bit: 1 read + 1 read-from-16bit-bus-penalty + 1 write + 1 write-to-16bit-bus-penalty + 1 sameregion-penalty

mainRAM:

"In some cases DMA main memory read cycles are reportedly performed simultaneously with DMA write cycles to other memory."

I guess.

16bit: 1 read + 1 write (no optimization then, I guess)

32bit: merge(1 read + 1 write) + 1 read-from-16bit-bus-penalty + 1 write-to-16bit-bus-penalty. I guess.

the stinky case of mainRAM->mainRAM.

18/16. what a motherfucking trainwreck.

16bit: 1 read + 1 write + 1 sameregion penalty + 13 shito.

32bit: 1 read + 1 read-from-16bit-bus-penalty + 1 write + 1 write-to-16bit-bus-penalty + 1 sameregion-penalty + 13 shito.

13 shito = 7 for reading, 6 for writing???? NS penalty. 8 for reading, 5 for writing??? need checking against non-DMA timings.
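
For reference, the cycle breakdown above can be written as a small Python sketch. The region labels and the `dma_cycles` name are made up for illustration; the numbers are only the ones measured in this post, and mainRAM is treated as a 16bit-bus destination because that is what the X -> mainRAM rows suggest. The caller decides what counts as "same region" (same bank read and written), since cases like FFFF0000 -> FFFF0000 measured without the penalty.

```python
# Hypothetical region labels; anything not listed here is treated as
# 32bit bus or unmapped space (OAM, BIOS, WRAM, NULL...).
BUS16 = {"vram", "palette", "mainram"}

def dma_cycles(src, dst, width, same_region=False):
    """Cycles per transferred unit for ARM9 DMA; width is 16 or 32."""
    if src == "mainram" and dst == "mainram":
        # measured special case: normal breakdown + 13 extra cycles
        return 18 if width == 32 else 16
    if width == 16:
        cycles = 1 + 1                      # 1 read + 1 write
    elif src == "mainram":
        cycles = 1 + 1                      # merged read/write + read-from-16bit-bus penalty
        cycles += 1 if dst in BUS16 else 0  # write-to-16bit-bus penalty
    else:
        cycles = 1 + 1                      # 1 read + 1 write
        cycles += 1 if src in BUS16 else 0  # read-from-16bit-bus penalty
        cycles += 1 if dst in BUS16 else 0  # write-to-16bit-bus penalty
    if same_region:
        cycles += 1                         # same bank is read and written
    return cycles

print(dma_cycles("vram", "mainram", 32), dma_cycles("vram", "mainram", 16))  # 4 2
print(dma_cycles("vram", "vram", 32, same_region=True))                      # 5
```

This reproduces the measured table for the listed regions; it says nothing about GBA slot or other waitstate-controlled memory.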

StapleButter
Posted on 11-06-18 06:45 AM (rev. 9 of 11-08-18 08:03 AM) Link | #745
working out NS timings


LDR repeated 0x10000 times. cache disabled.

overhead=8 consistently.

02000000: 1196658 (consistent) -> 18 cycles. 9 code, 9 data.
05000000: 759174-765830 -> 11 cycles. 9 code, 5 data, 3 cycle gain (parallel-ish)
06800000: 729405-737764 -> same.
07000000: 663869 (consistent) -> 10 cycles. 9 code, 4 data, 3 cycle gain.
FFFF0000: 663869 (consistent) -> same.


LDR repeated 0x1000 times. cache disabled. code in ITCM.

02000000: 36866-36982 -> 9 cycles. 0.5 code, 9 data, gain can only be as much as 0.5.
05000000: 23254-23271 -> 5 cycles. same shit.
06800000: 20482 consistently
07000000: 16386-24719 -> 4 cycles or 6 cycles. weird. 4 cycles data.
FFFF0000: same


STR repeated 0x10000 times. cache disabled.

overhead=8 consistently.

02380000: 1196658 or 1205017
05000000: 756161-774193
06800000: 729405-737764
07000000: 663869-672228
FFFF0000: 663869-672228 (same numbers as above)


STR repeated 0x1000 times. cache disabled. code in ITCM.

02380000: 36864-36978
05000000: 23257-23271
06800000: 20482 consistently
07000000: 16386-24719 (one or the other?? weird. either 4 or 6?? alignment of 66MHz cycles to bus shito??)
FFFF0000: same


pretty similar timings for read and write.
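
A quick sketch of how the raw counts above map to the quoted per-access figures. It assumes the count is simply divided by the repeat count and floored; the post doesn't say how the overhead of 8 is applied, so it is not subtracted here. `cycles_per_access` is a made-up helper name.

```python
def cycles_per_access(timer_count, repeats):
    """Floor of raw timer count / number of repeated LDR/STR instructions."""
    return timer_count // repeats

# LDR x 0x10000, code in mainRAM
print(cycles_per_access(1196658, 0x10000))  # 18
print(cycles_per_access(759174, 0x10000))   # 11
print(cycles_per_access(663869, 0x10000))   # 10

# LDR x 0x1000, code in ITCM
print(cycles_per_access(36866, 0x1000))     # 9
print(cycles_per_access(20482, 0x1000))     # 5
print(cycles_per_access(16386, 0x1000))     # 4
print(cycles_per_access(24719, 0x1000))     # 6
```

Flooring matches every figure quoted in the post, including the 4-or-6 split at 07000000.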



ARM7 ----


running from WRAM (normal shit)

00000000 -> 3 (1 code fetch + 1 data fetch + 1 internal??)
01000000 -> 3
02000000 -> 9 (1 code fetch + 1 data fetch + 1 16bit-penalty + 1 internal + 5 penalty)
03000000 -> 3
03800000 -> 3
04000000 -> 3
04800000 -> 14 (1 code + 1 data + 1 16bit-penalty + 1 internal + 10. I guess)
04808000 -> 14
06000000 -> 4 (1 code + 1 data + 1 16bit-penalty + 1 internal)
08000000 -> 18 (1 code + 1 data + 1 16bit-penalty + 1 internal + 14 penalty)
0F000000 -> 3
FFFF0000 -> 3

running from VRAM

00000000 -> 4 (1 code fetch + 1 16bit-penalty + 1 data fetch + 1 internal??)
01000000 -> 4
02000000 -> 9 (1 code fetch + 1 16bit-penalty + 1 data fetch + 1 16bit-penalty + 1 internal + 5 penalty ???? doesn't fit)
03000000 -> 4
03800000 -> 4
04000000 -> 4
06000000 -> 5 (1 code + 1 16bit-penalty + 1 data + 1 16bit-penalty + 1 internal)
08000000 -> 19 (1 code + 1 16bit-penalty + 1 data + 1 16bit-penalty + 1 internal + 14 penalty)

running from mainRAM

00000000 -> 9
01000000 -> 9
02000000 -> 18
03000000 -> 9
03800000 -> 9
04000000 -> 9
06000000 -> 9 (1 code + 1 16bit-penalty + 7 penalty + 1 data + 1 16bit-penalty + 1 internal - 3 gain)
08000000 -> 23 (22 when writing) (1 code + 1 16bit-penalty + 7 penalty + 1 data + 1 16bit-penalty + 1 internal + 14 penalty - 3 gain)

STR seems to get 1c penalty when accessing same memory as code

main RAM is always 9c. as if it was somehow able to do parallel accesses, when the other fetch is in another memory region. with a max gain of 3c, like the ARM9. this also eats up internal cycles.

so, seems the penalty is 7c, like on the ARM9.


LDRH TIMINGS

code on WRAM/VRAM/mainRAM
timings for 32bitbus/mainRAM/wifi0/wifi1/VRAM/GBA

WRAM: 3 / 8 / 8 / 20 / 3 / 12
VRAM: 4 / 8 / 9 / 21 / 4 / 13
mainRAM: 9 / 17 / 13 / 25 / 9 / 17


STRH TIMINGS

code on WRAM/VRAM/mainRAM
timings for 32bitbus/mainRAM/wifi0/wifi1/VRAM/GBA

WRAM: 2 / 8 / 7 / 19 / 2 / 11
VRAM: 3 / 8 / 8 / 20 / 4 / 12 (noting penalty for storing to same region as code)
mainRAM: 9 / 17 / 12 / 24 / 9 / 16

same effect observed with code in mainRAM. internal/data cycles seem to get merged with code cycles, for a max gain of 3c.

like, GBA:
from WRAM: 1 code cycle, 10 data cycles.
from mainRAM: 9 code cycles, 10 data cycles, 3 gain.

noting we still get the internal cycle if data>code. internal cycle is lumped with data.

wifi timing is 8/20 (5/17), compared to 14/24 (10/20) in 32bit mode. odd.


the timings are nice and clean, nothing like the ARM9.

also, crap, forgot about the LDR internal cycle for the ARM9 part. then again the ARM9 does some weird parallel shito. oh well. also its internal cycles are weird. fuck the ARM9.

wifi timings are weird:

WIFIWAITCNT
002A (2,5): 14,14 (10,10)
003A (2,7): 14,24 (10,20)
003B (3,7): 26,24 (22,20)

timings barring code cycles, 32/16:

WS0:
0: 16/10
1: 14/8
2: 12/6
3: 24/18
4: 14/10
5: 12/8
6: 10/6
7: 22/18

WS1:
0: 20/10
1: 18/8
2: 16/6
3: 28/18
4: 14/10
5: 12/8
6: 10/6
7: 22/18

weird as fuck. actually kind of similar to the EXMEMCNT settings for GBA shito.

16bit timings are always 10/8/6/18. same as EXMEMCNT.

bit0-1 set the base timing. bit2 sets the 2nd access timing for 32bit mode, which is 6 (bit2=0) or 4 (bit2=1) for WS0, and 10 or 4 for WS1. weird.
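
The bit layout described above can be sketched as a lookup that regenerates the WS0/WS1 tables. The names (`BASE_16BIT`, `SECOND_ACCESS`, `wifi_data_cycles`) are made up for illustration, and the values are only the ones measured here (data cycles, barring code cycles).

```python
BASE_16BIT = (10, 8, 6, 18)                       # indexed by bits 0-1 of the setting
SECOND_ACCESS = {"WS0": (6, 4), "WS1": (10, 4)}   # indexed by bit 2 of the setting

def wifi_data_cycles(ws, setting):
    """Return (32bit, 16bit) data cycles for a 3-bit waitstate setting."""
    base = BASE_16BIT[setting & 3]
    second = SECOND_ACCESS[ws][(setting >> 2) & 1]
    # a 32bit access is the 16bit access plus a second access
    return base + second, base

for ws in ("WS0", "WS1"):
    print(ws, [wifi_data_cycles(ws, s) for s in range(8)])
```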

StapleButter
Posted on 11-07-18 11:58 AM (rev. 9 of 11-09-18 11:00 AM) Link | #747
soooo, summary of timings

for now, barring shit like GBA slot


general rules

* 1 cycle baseline for all accesses
* 1 cycle penalty when using a 16bit bus for a 32bit access


ARM9

* nonseq penalty of 3 cycles when using the bus (even when accessing unmapped areas)
* extra nonseq penalty of 4 cycles when accessing mainRAM (total penalty 7 cycles)
* code/data accesses in parallel if in different memory regions. somewhat. weird. gains 3 cycles max.
* code fetches forced nonseq 32bit
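
A rough Python sketch of the general + ARM9 rules, checked only against the LDR measurements with code in mainRAM. Function names are made up. The gain is modelled as a flat 3 cycles whenever code and data are in different regions, which is an assumption: 3 is the stated maximum and the only value seen in those measurements. ITCM code fetches are not covered.

```python
MAX_GAIN = 3  # assumed: applied in full when code and data regions differ

def arm9_bus_access(bus16=False, mainram=False):
    """Cycles for one nonseq 32bit access over the bus."""
    cycles = 1                 # baseline
    cycles += 3                # nonseq penalty when using the bus
    if bus16 or mainram:
        cycles += 1            # 32bit access over a 16bit bus
    if mainram:
        cycles += 4            # extra nonseq penalty for mainRAM
    return cycles

def arm9_ldr(code_cycles, data_cycles, same_region):
    """Code fetch + data access, minus the parallel-access gain."""
    gain = 0 if same_region else MAX_GAIN
    return code_cycles + data_cycles - gain

code = arm9_bus_access(mainram=True)                                         # 9
print(arm9_ldr(code, arm9_bus_access(mainram=True), same_region=True))     # 18
print(arm9_ldr(code, arm9_bus_access(bus16=True), same_region=False))      # 11
print(arm9_ldr(code, arm9_bus_access(), same_region=False))                # 10
```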


ARM7

* nonseq penalty of 7 cycles when accessing mainRAM
* mainRAM accesses can be parallelized to some extent. they can happen alongside internal cycles and accesses to any other memory region, for a max gain of 3 cycles. same as ARM9.
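
One possible reading of the ARM7 rules as code, as a sketch only. The assumptions: code fetches are 32bit, loads have 1 internal cycle and stores have none, and when either access hits mainRAM the internal cycle and any non-mainRAM access can overlap with it, capped at 3 cycles. Region labels and function names are made up. Wifi/GBA waitstates and the 1c penalty for STR to the same region as code are left out (so STRH with code and data both in VRAM comes out as 3 here, not the measured 4).

```python
MAX_GAIN = 3

def arm7_access(region, width):
    """region is 'bus32', 'vram' (16bit bus) or 'mainram'."""
    cycles = 1                                   # baseline
    if width == 32 and region in ("vram", "mainram"):
        cycles += 1                              # 32bit access over a 16bit bus
    if region == "mainram":
        cycles += 7                              # nonseq penalty
    return cycles

def arm7_instr(code_region, data_region, width, load):
    code = arm7_access(code_region, 32)
    data = arm7_access(data_region, width)
    internal = 1 if load else 0
    total = code + data + internal
    if "mainram" in (code_region, data_region):
        # cycles that may run alongside the mainRAM access
        overlap = internal
        if code_region != "mainram":
            overlap += code
        if data_region != "mainram":
            overlap += data
        total -= min(MAX_GAIN, overlap)
    return total

print(arm7_instr("bus32", "mainram", 32, load=True))    # 9
print(arm7_instr("mainram", "mainram", 32, load=True))  # 18
print(arm7_instr("mainram", "mainram", 16, load=True))  # 17
```

With these assumptions the sketch matches the LDR, LDRH and STRH tables for the 32bitbus, mainRAM and VRAM columns, apart from the same-region STR case noted above.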


DMA

* in 32bit mode, transferring from mainRAM to another memory region is 1 cycle faster
* 1 cycle penalty if source and destination are the same memory region
* if source and destination are mainRAM, all accesses are forced nonseq, resulting in trainwreck timings of 18 cycles/unit in 32bit mode and 16 cycles/unit in 16bit mode.
* seems that the maximum length for a sequential burst is 120 units? needs more checking


note on 'memory regions', esp VRAM

* different VRAM banks are considered different regions!
* VRAM address space with no bank mapped is the same as empty space (no 16bit bus penalty for a 32bit access)
* overlapping banks don't add penalties or affect timings
* shared WRAM is one bank


rules for parallel cycles

* ARM9: code cycles vs data cycles. max gain 3c.
* ARM7: when accessing mainRAM. max gain 3c.
* DMA: when reading from VRAM in 32bit mode. max gain 1c.

StapleButter
Posted on 11-08-18 09:36 AM (rev. 4 of 11-08-18 03:09 PM) Link | #748
ARM7 DMA

32/16

wifiwaitcnt: 2/7

mainRAM->mainRAM: 18/16
WRAM->mainRAM: 3/2
IO->mainRAM: 3/2
wifi0->mainRAM: 14/7
wifi1->mainRAM: 10/5
VRAM->mainRAM: 4/2
mainRAM->VRAM: 3/2
WRAM->VRAM: 3/2
WRAM->VRAM: 3/2
wifi0->VRAM: 14/7
wifi1->VRAM: 10/5
VRAM->VRAM: 5/3 (same-region penalty)


WIFI DMA TIMINGS since it's weird too

setting: WS0 32/16, WS1 32/16
0-3: 14/7, 22/11
4-7: 10/5, 10/5

so, all are sequential and not just 1/2?

weird.

timings of wifi0->wifi0, wifi1->wifi1

setting: WS0 32/16, WS1 32/16
0-3: 24/12, 40/20 (!!) -> seq cycles 6, 10
4-7: 16/8, 16/8 -> seq cycles 4

ok now I guess it makes sense again?

in 32bit mode we just do two accesses and thus double the 16bit timing. no 16bit-bus-penalty, no nonseq shito. no sameregion penalty either.
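
The wifi DMA numbers above fit a simple model, sketched here under assumptions: each wifi access costs the seq cycles of its waitstate region (6 or 4 for WS0, 10 or 4 for WS1, picked by bit2 of the setting), the other end costs 1 cycle, and 32bit just doubles the 16bit figure. Names are made up, and only wifi-sourced transfers were measured, so the sketch refuses anything else.

```python
WIFI_SEQ = {"wifi0": (6, 4), "wifi1": (10, 4)}   # seq cycles for settings 0-3 / 4-7

def unit_cycles(region, settings):
    """Cycles for one 16bit access; settings maps wifi region -> 3-bit setting."""
    if region in WIFI_SEQ:
        return WIFI_SEQ[region][(settings[region] >> 2) & 1]
    return 1   # assumed for the non-wifi end (mainRAM, VRAM as destination)

def wifi_dma_cycles(src, dst, width, settings):
    if src not in WIFI_SEQ:
        raise ValueError("only wifi-sourced transfers are covered by the measurements")
    per16 = unit_cycles(src, settings) + unit_cycles(dst, settings)
    return per16 * 2 if width == 32 else per16

settings = {"wifi0": 2, "wifi1": 7}   # the 2/7 setup used for the ARM7 DMA table
print(wifi_dma_cycles("wifi0", "mainram", 32, settings))  # 14
print(wifi_dma_cycles("wifi1", "vram", 16, settings))     # 5
```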

