TIMING NOTES


Arisotura
Posted on 08-08-18 02:54 PM (rev. 8 of 11-06-18 01:01 AM) Link | #638
DMA TIMING


measurements are to be taken with a rock of NaCl. overhead w/ setting up timer/DMA/etc seems variable. timing is unreliable despite disabling cache and IRQ.


oh well.


memory / cycles32 / cycles16

mainRAM -> mainRAM / 18 / 16
mainRAM -> VRAM / 3 / 2
mainRAM -> VRAM unmapped(?) / 2 / 2
VRAM -> mainRAM / 4 / 2
pal -> mainRAM / 4 / 2
OAM -> mainRAM / 3 / 2
mainRAM -> OAM / 2 / 2
VRAM -> VRAM / 3 / 2
BIOS -> mainRAM / 3 / 2
mainRAM -> BIOS / 2 / 2 (does it detect that it can't be written to??)
NULL -> mainRAM / 3 / 2 (this DMA does run)
0E000000 -> mainRAM / 3 / 2
0F000000 -> mainRAM / 3 / 2
mainRAM -> NULL / 2 / 2 (does run)
mainRAM -> 0E000000 / 2 / 2
mainRAM -> 0F000000 / 2 / 2
NULL -> NULL / 2 / 2 (runs)

results aren't very precise tho.

I'm hungry.

the pal/VRAM/OAM numbers don't account for the video controllers possibly also reading from those memories. the only one supposed to be running is the sub one, which might still access pal/OAM.


organizing shit a bit


mainRAM -> X

00000000: 2/2
01000000: 2/2
02380000: 18/16
03000000: 2/2 (shared WRAM probably not mapped, has to be checked)
03800000: 2/2
04000000: DS turned off. guess DMAing large chunks of shit there is not so much a good idea.
04200000: 2/2
04800000: 2/2
05000000: 3/2
06000000: 2/2 (unmapped VRAM)
06800000: 3/2 (mapped VRAM)
07000000: 2/2
08000000 thru 0F000000: 2/2
FFFF0000: 2/2
FFFF1000: 2/2

01000000(null) -> X

01000000: 2/2
02380000: 3/2
05000000: 3/2
06800000: 3/2
07000000: 2/2
FFFF0000: 2/2

04000000 -> X

01000000: 2/2
02380000: 3/2
05000000: 3/2
06800000: 3/2
07000000: 2/2
FFFF0000: 2/2
basically same as null.

05000000 -> X

01000000: 3/2
02380000: 4/2
05000000: 5/3
06800000: 4/2
07000000: 3/2
FFFF0000: 3/2

06800000 -> X

01000000: 3/2
02380000: 4/2
05000000: 4/2
06800000: 5/3
07000000: 3/2
FFFF0000: 3/2

07000000 -> X

01000000: 2/2
02380000: 3/2
05000000: 3/2
06800000: 3/2
07000000: 3/3
FFFF0000: 2/2

FFFF0000 -> X

01000000: 2/2
02380000: 3/2
05000000: 3/2
06800000: 3/2
07000000: 2/2
FFFF0000: 2/2 (knows it can't write????)


NOTE ON SAME-REGION DMA PENALTY

06800000->06800000: 5/3
06800000->06810000: 5/3 (within same bank)
06800000->06820000: 4/2 (different banks)
06800000->06840000: 3/2 (unmapped)

penalty applies when the same memory bank is accessed for reading and writing, in general.

side note on VRAM: overlapping banks don't add more waitstates.



sooo. timing rules, barring mainRAM which is a bit special.

16bit

always 2. except when doing sameregion transfer, then it's 3.

32bit

hairy, w/ different bus sizes.

16: VRAM, palette. mainRAM seems to be on a different bus.
32: OAM, BIOS, WRAM...

16->16: 4, 5 when sameregion
16->32: 3
32->16: 3
32->32: 2, 3 when sameregion

mainRAM->X:

16->16: 3
16->32: 2

it has different read timing??


cycle breakdown

16bit: 1 read + 1 write + 1 sameregion-penalty

32bit: 1 read + 1 read-from-16bit-bus-penalty + 1 write + 1 write-to-16bit-bus-penalty + 1 sameregion-penalty
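
the breakdown above can be written down as a small cost function. this is only a sketch of the rules as stated here, not emulator code: `dma_unit_cycles` and its parameters are made-up names, and mainRAM is left out on purpose since it follows different rules.

```python
def dma_unit_cycles(src_bus_bits, dst_bus_bits, same_region, transfer_bits):
    """Cycles per transferred unit, per the breakdown above (mainRAM excluded).

    src_bus_bits / dst_bus_bits: 16 (VRAM, palette) or 32 (OAM, BIOS, WRAM...).
    same_region: True when source and destination are in the same bank.
    transfer_bits: 16 or 32 (the DMA unit size).
    """
    cycles = 1 + 1  # 1 read + 1 write
    if transfer_bits == 32:
        # a 32bit unit going over a 16bit bus costs one extra cycle per side
        if src_bus_bits == 16:
            cycles += 1
        if dst_bus_bits == 16:
            cycles += 1
    if same_region:
        cycles += 1  # sameregion penalty
    return cycles


# print "32bit/16bit" per case, for comparison with the measured tables
for src, dst, same in [(16, 16, False), (16, 16, True), (16, 32, False),
                       (32, 16, False), (32, 32, False), (32, 32, True)]:
    c32 = dma_unit_cycles(src, dst, same, 32)
    c16 = dma_unit_cycles(src, dst, same, 16)
    print(f"{src}->{dst} same={same}: {c32}/{c16}")
```

this reproduces e.g. 05000000->05000000 (5/3), 05000000->06800000 (4/2) and 07000000->07000000 (3/3) from the tables.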

mainRAM:

"In some cases DMA main memory read cycles are reportedly performed simultaneously with DMA write cycles to other memory."

I guess.

16bit: 1 read + 1 write (no optimization then, I guess)

32bit: merge(1 read + 1 write) + 1 read-from-16bit-bus-penalty + 1 write-to-16bit-bus-penalty. I guess.

the stinky case of mainRAM->mainRAM.

18/16. what a motherfucking trainwreck.

16bit: 1 read + 1 write + 1 sameregion penalty + 13 shito.

32bit: 1 read + 1 read-from-16bit-bus-penalty + 1 write + 1 write-to-16bit-bus-penalty + 1 sameregion-penalty + 13 shito.

13 shito = 7 for reading, 6 for writing???? NS penalty. 8 for reading, 5 for writing??? need checking against non-DMA timings.
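
same idea for DMA reading from mainRAM, as a rough sketch. names are hypothetical, mainRAM is treated as sitting on a 16bit bus (that is what the breakdown above implies, not something confirmed), and the 13 extra cycles for mainRAM->mainRAM are just taken as a lump since their split is still unknown.

```python
MAINRAM_TO_MAINRAM_EXTRA = 13  # unexplained extra cycles, split still unknown


def mainram_src_dma_cycles(dst, transfer_bits):
    """Cycles per unit for a DMA whose source is mainRAM.

    dst: "bus16" (VRAM, palette), "bus32" (OAM, WRAM...) or "mainram".
    transfer_bits: 16 or 32.
    """
    if dst == "mainram":
        # no merging here: read + write + sameregion penalty + the extra 13
        cycles = 1 + 1 + 1 + MAINRAM_TO_MAINRAM_EXTRA
        if transfer_bits == 32:
            cycles += 2  # 16bit-bus penalty on both the read and the write
        return cycles
    if transfer_bits == 16:
        return 1 + 1  # 1 read + 1 write, no merging observed
    cycles = 1  # merged read+write counts as a single cycle
    cycles += 1  # read-from-16bit-bus penalty (mainRAM side)
    if dst == "bus16":
        cycles += 1  # write-to-16bit-bus penalty
    return cycles


for dst in ("bus16", "bus32", "mainram"):
    print(dst, f"{mainram_src_dma_cycles(dst, 32)}/{mainram_src_dma_cycles(dst, 16)}")
```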

____________________
Kuribo64

Arisotura
Posted on 11-06-18 11:45 AM (rev. 12 of 12-03-18 03:05 AM) Link | #745
working out NS timings


LDR repeated 0x10000 times. cache disabled.

overhead=8 consistently.

02000000: 1196658 (consistent) -> 18 cycles. 9 code, 9 data.
05000000: 759174-765830 -> 11 cycles. 9 code, 5 data, 3 cycle gain (parallel-ish)
06800000: 729405-737764 -> same.
07000000: 663869 (consistent) -> 10 cycles. 9 code, 4 data, 3 cycle gain.
FFFF0000: 663869 (consistent) -> same.
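
for reference, a sketch of how the raw counts map to the per-LDR figures. assumptions: the overhead of 8 gets subtracted first, and the result is truncated rather than rounded (truncation is what matches the numbers quoted above, e.g. palette comes out at 11 even though the raw ratio is about 11.6). `cycles_per_ldr` is a made-up helper name.

```python
ITERATIONS = 0x10000  # LDR repeated 0x10000 times
OVERHEAD = 8          # constant measurement overhead


def cycles_per_ldr(raw_count, iterations=ITERATIONS, overhead=OVERHEAD):
    """Whole cycles per LDR from a raw measurement (truncating, see assumptions)."""
    return (raw_count - overhead) // iterations


def modelled_cycles(code, data, gain):
    """Code cycles + data cycles, minus whatever overlaps."""
    return code + data - gain


measurements = {
    "02000000": (1196658, modelled_cycles(9, 9, 0)),
    "05000000": (759174, modelled_cycles(9, 5, 3)),
    "06800000": (729405, modelled_cycles(9, 5, 3)),
    "07000000": (663869, modelled_cycles(9, 4, 3)),
}
for addr, (raw, model) in measurements.items():
    print(addr, cycles_per_ldr(raw), model)
```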


repeated NOP:
mainRAM: 9
ITCM: 0.5


LDR repeated 0x1000 times. cache disabled. code in ITCM.

02000000: 36866-36982 -> 9 cycles. 0.5 code, 9 data, gain can only be as much as 0.5.
05000000: 23254-23271 -> 5 cycles. same shit.
06800000: 20482 consistently
07000000: 16386-24719 -> 4 cycles or 6 cycles. weird. 4 cycles data.
FFFF0000: same


STR repeated 0x10000 times. cache disabled.

overhead=8 consistently.

02380000: 1196658 or 1205017
05000000: 756161-774193
06800000: 729405-737764
07000000: 663869-672228
FFFF0000: 663869-672228 (same numbers as above)


STR repeated 0x1000 times. cache disabled. code in ITCM.

02380000: 36864-36978
05000000: 23257-23271
06800000: 20482 consistently
07000000: 16386-24719 (one or the other?? weird. either 4 or 6?? alignment of 66MHz cycles to bus shito??)
FFFF0000: same


pretty similar timings for read and write.



ARM7 ----


running from WRAM (normal shit)

00000000 -> 3 (1 code fetch + 1 data fetch + 1 internal??)
01000000 -> 3
02000000 -> 9 (1 code fetch + 1 data fetch + 1 16bit-penalty + 1 internal + 5 penalty)
03000000 -> 3
03800000 -> 3
04000000 -> 3
04800000 -> 14 (1 code + 1 data + 1 16bit-penalty + 1 internal + 10. I guess)
04808000 -> 14
06000000 -> 4 (1 code + 1 data + 1 16bit-penalty + 1 internal)
08000000 -> 18 (1 code + 1 data + 1 16bit-penalty + 1 internal + 14 penalty)
0F000000 -> 3
FFFF0000 -> 3

running from VRAM

00000000 -> 4 (1 code fetch + 1 16bit-penalty + 1 data fetch + 1 internal??)
01000000 -> 4
02000000 -> 9 (1 code fetch + 1 16bit-penalty + 1 data fetch + 1 16bit-penalty + 1 internal + 5 penalty ???? doesn't fit)
03000000 -> 4
03800000 -> 4
04000000 -> 4
06000000 -> 5 (1 code + 1 16bit-penalty + 1 data + 1 16bit-penalty + 1 internal)
08000000 -> 19 (1 code + 1 16bit-penalty + 1 data + 1 16bit-penalty + 1 internal + 14 penalty)

running from mainRAM

00000000 -> 9
01000000 -> 9
02000000 -> 18
03000000 -> 9
03800000 -> 9
04000000 -> 9
06000000 -> 9 (1 code + 1 16bit-penalty + 7 penalty + 1 data + 1 16bit-penalty + 1 internal - 3 gain)
08000000 -> 23 (22 when writing) (1 code + 1 16bit-penalty + 7 penalty + 1 data + 1 16bit-penalty + 1 internal + 14 penalty - 3 gain)

STR seems to get 1c penalty when accessing same memory as code

main RAM is always 9c. as if it was somehow able to do parallel accesses, when the other fetch is in another memory region. with a max gain of 3c, like the ARM9. this also eats up internal cycles.

so, seems the penalty is 7c, like on the ARM9.



timings for 32bitbus/mainRAM/wifi0/wifi1/VRAM/GBA

wifi0 = 2 (6/6)
wifi1 = 7 (18/4)

LDR unaligned: no change

LDMIA

code in mainRAM:
1r: 9 / 18 / 19 / 29 / 9 / 23
2r: 9 / 20 / 31 / 37 / 11 / 35

max gain: 3c (2c on memory timings, LDMIA has 1I)


NOP

code in mainRAM: 2c (sequential code fetch)



LDRH TIMINGS

code on WRAM/VRAM/mainRAM
timings for 32bitbus/mainRAM/wifi0/wifi1/VRAM/GBA

WRAM: 3 / 8 / 8 / 20 / 3 / 12
VRAM: 4 / 8 / 9 / 21 / 4 / 13
mainRAM: 9 / 17 / 13 / 25 / 9 / 17


STRH TIMINGS

code on WRAM/VRAM/mainRAM
timings for 32bitbus/mainRAM/wifi0/wifi1/VRAM/GBA

WRAM: 2 / 8 / 7 / 19 / 2 / 11
VRAM: 3 / 8 / 8 / 20 / 4 / 12 (noting penalty for storing to same region as code)
mainRAM: 9 / 17 / 12 / 24 / 9 / 16

same effect observed with code in mainRAM. internal/data cycles seem to get merged with code cycles, for a max gain of 3c.

like, GBA:
from WRAM: 1 code cycle, 10 data cycles.
from mainRAM: 9 code cycles, 10 data cycles, 3 gain.

noting we still get the internal cycle if data>code. internal cycle is lumped with data.
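
the merge effect can be sketched as a conversion from the WRAM timing to the mainRAM-code timing. this is a guess at a model that happens to fit the LDRH/STRH rows above, not a confirmed rule: the function name is made up, the gain is assumed to be capped at both 3c and the number of non-code cycles, and it only covers data that is NOT in mainRAM.

```python
MAINRAM_CODE_CYCLES = 9  # code fetch cost when running from mainRAM
MAX_GAIN = 3             # observed maximum overlap


def from_mainram(cycles_from_wram):
    """Estimate the timing with code in mainRAM from the timing with code in WRAM."""
    rest = cycles_from_wram - 1   # drop the 1 code cycle paid when running from WRAM
    gain = min(MAX_GAIN, rest)    # data/internal cycles overlapping the code fetch
    return MAINRAM_CODE_CYCLES + rest - gain


# columns: 32bitbus / wifi0 / wifi1 / VRAM / GBA (mainRAM column left out)
strh_wram = [2, 7, 19, 2, 11]
ldrh_wram = [3, 8, 20, 3, 12]
print([from_mainram(c) for c in strh_wram])
print([from_mainram(c) for c in ldrh_wram])
```

the two printed rows can be compared against the mainRAM rows of the STRH and LDRH tables.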

wifi timing is 8/20 (5/17), compared to 14/24 (10/20) in 32bit mode. odd.


the timings are nice and clean, nothing like the ARM9.

also, crap, forgot about the LDR internal cycle for the ARM9 part. then again the ARM9 does some weird parallel shito. oh well. also its internal cycles are weird. fuck the ARM9.

wifi timings are weird:

WIFIWAITCNT
002A (2,5): 14,14 (10,10)
003A (2,7): 14,24 (10,20)
003B (3,7): 26,24 (22,20)

timings barring code cycles, 32/16:

WS0:
0: 16/10
1: 14/8
2: 12/6
3: 24/18
4: 14/10
5: 12/8
6: 10/6
7: 22/18

WS1:
0: 20/10
1: 18/8
2: 16/6
3: 28/18
4: 14/10
5: 12/8
6: 10/6
7: 22/18

weird as fuck. actually kind of similar to the EXMEMCNT settings for GBA shito.

16bit timings are always 10/8/6/18. same as EXMEMCNT.

bit0-1 set the base timing. bit2 sets the 2nd access timing for 32bit mode. which is: 6/4 for WS0, 10/4 for WS1. weird.
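
the two tables above follow from that rule, which can be checked with a small sketch. function names are made up; the split of WIFIWAITCNT into two 3-bit fields is inferred from the (2,5)/(2,7)/(3,7) annotations above.

```python
BASE_16BIT = (10, 8, 6, 18)               # picked by bits 0-1 of the setting
SECOND_ACCESS = {0: (6, 4), 1: (10, 4)}   # WS0 / WS1: (bit2 clear, bit2 set)


def split_wifiwaitcnt(value):
    """Split a WIFIWAITCNT value into its (WS0, WS1) 3-bit settings."""
    return value & 7, (value >> 3) & 7


def wifi_cycles(ws, setting):
    """Return (32bit, 16bit) cycles, barring code cycles, for WS0 (ws=0) or WS1 (ws=1)."""
    base = BASE_16BIT[setting & 3]
    second = SECOND_ACCESS[ws][(setting >> 2) & 1]
    return base + second, base


for ws in (0, 1):
    print(f"WS{ws}:", [wifi_cycles(ws, s) for s in range(8)])
```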


Arisotura
Posted on 11-07-18 04:58 PM (rev. 14 of 12-03-18 02:21 AM) Link | #747
soooo, summary of timings

for now, barring shit like GBA slot


general rules

* 1 cycle baseline for all accesses
* 1 cycle penalty when using a 16bit bus for a 32bit access (really two accesses)


ARM9

* nonseq penalty of 3 cycles when using the bus (even when accessing unmapped areas)
* extra nonseq penalty of 4 cycles when accessing mainRAM (total penalty 7 cycles)
* code/data accesses in parallel if in different memory regions. somewhat. weird. gains 3 cycles max.
* code fetches forced nonseq 32bit
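
a quick sketch of how the ARM9 rules add up for a 32bit LDR. hypothetical helper names; mainRAM is assumed to count as a 16bit bus, and the parallel gain is assumed to apply in full whenever code and data are in different regions (all measured cases hit the 3c maximum).

```python
def arm9_access_cycles(bus_bits, is_mainram=False):
    """Cycles for one nonseq 32bit access on the bus."""
    cycles = 1                # baseline
    if bus_bits == 16:
        cycles += 1           # 32bit access over a 16bit bus
    cycles += 3               # nonseq penalty
    if is_mainram:
        cycles += 4           # extra mainRAM penalty (7 total)
    return cycles


def arm9_ldr_cycles(code_cycles, data_cycles, same_region):
    """Code fetch + data access, overlapping by up to 3c across regions."""
    gain = 0 if same_region else min(3, data_cycles)
    return code_cycles + data_cycles - gain


code = arm9_access_cycles(16, is_mainram=True)   # code running from mainRAM
print(arm9_ldr_cycles(code, arm9_access_cycles(16, is_mainram=True), True))  # mainRAM data
print(arm9_ldr_cycles(code, arm9_access_cycles(16), False))                  # palette/VRAM data
print(arm9_ldr_cycles(code, arm9_access_cycles(32), False))                  # OAM data
```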


ARM7

* nonseq penalty of 7 cycles when accessing mainRAM
* mainRAM accesses can be parallelized to some extent. they can happen alongside internal cycles and accesses to any other memory region. max gain: 3c for code in mainRAM, 5c for data in mainRAM and writing. weird.
* separate bus for mainRAM?
* data accesses cause simultaneous code accesses to be nonseq. applies everywhere. matters a lot when running code from mainRAM.
* writing data to same region as code has 1c extra penalty, except for mainRAM and wifi/gba.


DMA

* in 32bit mode, transferring from mainRAM to another memory region is 1 cycle faster
* 1 cycle penalty if source and destination are the same memory region
* if source and destination are mainRAM, all accesses are forced nonseq, resulting in trainwreck timings of 18 cycles/unit in 32bit mode and 16 cycles/unit in 16bit mode.
* seems that the maximum length for a sequential burst is 120 units? needs more checking


note on 'memory regions', esp VRAM

* different VRAM banks are considered different regions!
* VRAM address space with no bank mapped is the same as empty space (no 16bit bus penalty for a 32bit access)
* overlapping banks don't add penalties or affect timings
* shared WRAM is one bank


rules for parallel cycles

* ARM9: code cycles vs data cycles. max gain 3c.
* ARM7: when accessing mainRAM. max gain 3c/5c.
* DMA: when reading from mainRAM in 32bit mode. max gain 1c.


Arisotura
Posted on 11-08-18 02:36 PM (rev. 4 of 11-08-18 08:09 PM) Link | #748
ARM7 DMA

32/16

wifiwaitcnt: 2/7

mainRAM->mainRAM: 18/16
WRAM->mainRAM: 3/2
IO->mainRAM: 3/2
wifi0->mainRAM: 14/7
wifi1->mainRAM: 10/5
VRAM->mainRAM: 4/2
mainRAM->VRAM: 3/2
WRAM->VRAM: 3/2
wifi0->VRAM: 14/7
wifi1->VRAM: 10/5
VRAM->VRAM: 5/3 (same-region penalty)


WIFI DMA TIMINGS since it's weird too

setting: WS0 32/16, WS1 32/16
0-3: 14/7, 22/11
4-7: 10/5, 10/5

so, all are sequential and not just 1/2?

weird.

timings of wifi0->wifi0, wifi1->wifi1

setting: WS0 32/16, WS1 32/16
0-3: 24/12, 40/20 (!!) -> seq cycles 6, 10
4-7: 16/8, 16/8 -> seq cycles 4

ok now I guess it makes sense again?

in 32bit mode we just do two accesses and thus double the 16bit timing. no 16bit-bus-penalty, no nonseq shito. no sameregion penalty either.
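
putting the wifi DMA numbers into one sketch. assumptions: the seq cycle counts are 6 (WS0) / 10 (WS1) with bit2 clear and 4 with bit2 set, as derived above, and the non-wifi side of a transfer costs 1 cycle per halfword (that part is inferred from 7 = 6+1 and 5 = 4+1, not measured separately). names are made up.

```python
def wifi_seq_cycles(ws, setting):
    """Sequential cycles per 16bit wifi access for WS0 (ws=0) or WS1 (ws=1)."""
    if setting & 4:
        return 4
    return 6 if ws == 0 else 10


def wifi_dma_cycles(ws, setting, wifi_to_wifi, transfer_bits):
    """Cycles per DMA unit when the source is wifi."""
    seq = wifi_seq_cycles(ws, setting)
    # wifi on both sides: two wifi accesses; otherwise 1 cycle for the other side
    per_halfword = 2 * seq if wifi_to_wifi else seq + 1
    # 32bit mode is just two 16bit transfers back to back
    return per_halfword * (2 if transfer_bits == 32 else 1)


for ws in (0, 1):
    for setting in (0, 4):
        out = wifi_dma_cycles(ws, setting, False, 32), wifi_dma_cycles(ws, setting, False, 16)
        same = wifi_dma_cycles(ws, setting, True, 32), wifi_dma_cycles(ws, setting, True, 16)
        print(f"WS{ws} setting {setting}: wifi->other {out}, wifi->wifi {same}")
```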


Arisotura
Posted on 01-02-19 12:59 PM (rev. 3 of 01-05-19 05:00 AM) Link | #837
DMA timing:

mainRAM timings are different when the address is not set to increment. mainRAM accesses under these conditions become N (nonseq) but there's some weird parallelism going on.

observed timings:

pal->mainRAM: (32bit/16bit)
* increment: 4/2
* fixed/decr: 10/8

OAM->mainRAM:
* increment: 3/2
* fixed/decr: 9/8

mainRAM->pal:
* increment: 3/2
* fixed/decr: 11/9 11/8? (1237/2392/293268 1241/2394/295316)

mainRAM->OAM:
* increment: 2/2
* fixed/decr: 10/9? 9/8?

this only affects mainRAM. other memory regions have identical N/S timings.

also: 3c penalty for N fetches only affects the ARM9 and not DMA. however DMA still gets the slowass N penalty for mainRAM. (this is going to be shitty to emulate)


Arisotura
Posted on 06-03-21 12:06 AM (rev. 2 of 06-03-21 04:21 PM) Link | #3805
the FUN details of DMA timings


mainRAM

* sequential burst only works with incrementing address (otherwise all accesses are nonseq) and not when src and dst are both in mainRAM
* sequential burst has a max length of 118 units (checkme in 16bit mode to be sure?)
* parallel access to some extent
* when reading, last halfword of each 32 byte block is faster


GBA ROM

* seems to be a fairly simple controller
* first access is nonseq
* further accesses are sequential, with no burst size limit, and this also works with fixed/decrementing addresses
* last halfword of each 0x20000 byte block is nonseq
* access still sequential even if src and dst are both GBA ROM
* 32bit accesses are split into two 16bit accesses
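
a sketch of how these GBA ROM rules could classify each halfword access of an incrementing DMA as nonseq (N) or sequential (S). it reads the bullets literally, so the last halfword of each 0x20000 block is the one marked N; whether the hardware really slows that access or the one right after the boundary is not settled here. the function name is hypothetical.

```python
BLOCK_SIZE = 0x20000  # block size from the notes above


def gba_rom_access_kinds(start_addr, units, transfer_bits=16):
    """Return a list of 'N'/'S', one entry per 16bit access, for incrementing addresses."""
    halfwords = units * (transfer_bits // 16)  # 32bit units become two 16bit accesses
    kinds = []
    for i in range(halfwords):
        addr = start_addr + 2 * i
        last_in_block = (addr % BLOCK_SIZE) == BLOCK_SIZE - 2
        kinds.append("N" if i == 0 or last_in_block else "S")
    return kinds


print(gba_rom_access_kinds(0x0801FFFA, 4))       # crosses a block boundary
print(gba_rom_access_kinds(0x08000000, 2, 32))   # two 32bit units
```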


GBA RAM

* doesn't support sequential access
* timing is as specified in EXMEMCNT, same in both 16bit and 32bit modes


WIFI

* seems even simpler than GBA ROM
* first access is nonseq
* further accesses are sequential, with no burst size limit, and this also works with fixed/decrementing addresses
* no 'last halfword of X block is nonseq' effect? needs further testing.


Arisotura
Posted on 06-03-21 04:00 PM (rev. 2 of 06-03-21 04:30 PM) Link | #3809
the different classes of memory region, depending on what their timing model is

* main RAM
* GBA ROM
* GBA RAM
* wifi 0
* wifi 1
* regular 16bit bus: VRAM, palette
* regular 32bit bus: WRAM, BIOS, OAM, I/O
* ARM9: cache/TCM


so, all the interaction cases to test for (in both directions), regarding DMA:

* mainRAM -> 16bit
* mainRAM -> 32bit
* mainRAM -> GBAROM
* mainRAM -> GBARAM
* mainRAM -> wifi0
* mainRAM -> wifi1

* GBAROM -> 16bit
* GBAROM -> 32bit
* GBAROM -> GBARAM
* GBAROM -> wifi0
* GBAROM -> wifi1

* GBARAM -> 16bit
* GBARAM -> 32bit
* GBARAM -> wifi0
* GBARAM -> wifi1

* wifi0 -> 16bit
* wifi0 -> 32bit
* wifi0 -> wifi1

* wifi1 -> 16bit
* wifi1 -> 32bit

* 16bit -> 32bit


