Posts by Arisotura - melonDS board

that sounds plenty, considering mine is 2.4GHz and it manages to run some shit at 60FPS

but that also depends what you're trying to run

also, it doesn't use the GPU, except for drawing the final screens to the window

____________________
Kuribo64

mhh

take a screenshot of the directory melonDS 0.7 is in? with all the files

____________________
Kuribo64

~~that's 0.6 you've got there, not 0.7~~

other than that, all the files seem to be here and good

ninja'd

anyway might have to do with not finding melonDS.ini, there was a related bug but weird shit

check the archive you downloaded -- I had updated it to include a stock melonDS.ini so this wouldn't happen

____________________
Kuribo64

well there were multiple issues in some code meant to look for config files in AppData

* got a string buffer with the appdata path base, resized it, but didn't complete it, leaving the end uninitialized
* CoTaskMemRealloc() can move the buffer if needed, which was not accounted for, so possibly trying to access freed memory

ie. bad bad bad

and likely why it crashed at random

____________________
Kuribo64

we have an IRC channel. irc.badnik.net, #melonDS

Discord is bad and can go fuck itself.

____________________
Kuribo64

it's a software renderer.

there might be some obscure emulator somewhere that uses DirectX? not sure. they seem to all use either OpenGL or software renderers.

____________________
Kuribo64

AppData/Roaming/melonDS

you can put melonDS.ini and BIOS/firmware there if you wish. you don't have to.

____________________
Kuribo64

you can use good ol' dsbf_dump.nds in DS mode.

____________________
Kuribo64

noting. if I can make it work with SDL.

not sure how it'd work for keyboard input and if that would be desirable tho.

____________________
Kuribo64

working out NS timings

LDR repeated 0x10000 times. cache disabled.

overhead=8 consistently.

02000000: 1196658 (consistent) -> 18 cycles. 9 code, 9 data.
05000000: 759174-765830 -> 11 cycles. 9 code, 5 data, 3 cycle gain (parallel-ish)
06800000: 729405-737764 -> same.
07000000: 663869 (consistent) -> 10 cycles. 9 code, 4 data, 3 cycle gain.
FFFF0000: 663869 (consistent) -> same.

repeated NOP:
mainRAM: 9
ITCM: 0.5

LDR repeated 0x1000 times. cache disabled. code in ITCM.

02000000: 36866-36982 -> 9 cycles. 0.5 code, 9 data, gain can only be as much as 0.5.
05000000: 23254-23271 -> 5 cycles. same shit.
06800000: 20482 consistently
07000000: 16386-24719 -> 4 cycles or 6 cycles. weird. 4 cycles data.
FFFF0000: same
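
The per-access figures above look like the raw counter value divided by the repeat count, with the remainder dropped. A minimal Python sketch of that arithmetic, assuming the raw numbers are already in bus cycles (the helper name `cycles_per_access` is made up for illustration):

```python
def cycles_per_access(raw_count, repeats):
    """Whole cycles per repeated instruction; the leftover overhead is dropped."""
    return raw_count // repeats

# LDR repeated 0x10000 times (code presumably in main RAM, given the 9 code cycles)
main_ram_runs = {
    "02000000": 1196658,
    "05000000": 759174,
    "06800000": 729405,
    "07000000": 663869,
}

# LDR repeated 0x1000 times, code in ITCM
itcm_runs = {
    "02000000": 36866,
    "05000000": 23254,
    "06800000": 20482,
    "07000000": 16386,
}

for addr, raw in main_ram_runs.items():
    print(addr, cycles_per_access(raw, 0x10000))
for addr, raw in itcm_runs.items():
    print(addr, cycles_per_access(raw, 0x1000))
```

The high end of the 07000000 ITCM range (24719) lands on 6 instead of 4, which is the "4 cycles or 6 cycles" oddity noted above.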

STR repeated 0x10000 times. cache disabled.

overhead=8 consistently.

02380000: 1196658 or 1205017
05000000: 756161-774193
06800000: 729405-737764
07000000: 663869-672228
FFFF0000: 663869-672228 (same numbers as above)

STR repeated 0x1000 times. cache disabled. code in ITCM.

02380000: 36864-36978
05000000: 23257-23271
06800000: 20482 consistently
07000000: 16386-24719 (one or the other?? weird. either 4 or 6?? alignment of 66MHz cycles to bus shito??)
FFFF0000: same

pretty similar timings for read and write.

ARM7 ----

running from WRAM (normal shit)

00000000 -> 3 (1 code fetch + 1 data fetch + 1 internal??)
01000000 -> 3
02000000 -> 9 (1 code fetch + 1 data fetch + 1 16bit-penalty + 1 internal + 5 penalty)
03000000 -> 3
03800000 -> 3
04000000 -> 3
04800000 -> 14 (1 code + 1 data + 1 16bit-penalty + 1 internal + 10. I guess)
04808000 -> 14
06000000 -> 4 (1 code + 1 data + 1 16bit-penalty + 1 internal)
08000000 -> 18 (1 code + 1 data + 1 16bit-penalty + 1 internal + 14 penalty)
0F000000 -> 3
FFFF0000 -> 3
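
The breakdowns in parentheses can be added up mechanically. A rough Python sketch, assuming the per-region penalties are exactly the ones guessed above (the region names and the `REGIONS` table are only for illustration, not anything from melonDS):

```python
# (uses a 16-bit bus, extra penalty cycles), read off the breakdowns above
REGIONS = {
    "32bit bus": (False, 0),   # BIOS, WRAM, IO, unmapped
    "mainRAM":   (True, 5),
    "wifi":      (True, 10),   # the "+ 10. I guess" case
    "VRAM":      (True, 0),
    "GBA":       (True, 14),
}

def arm7_ldr_from_wram(region):
    """Cycles for an ARM7 LDR with code running from WRAM."""
    is_16bit, penalty = REGIONS[region]
    code = 1                      # code fetch from WRAM
    data = 1                      # baseline data access
    bus16 = 1 if is_16bit else 0  # 32-bit access over a 16-bit bus
    internal = 1                  # LDR internal cycle
    return code + data + bus16 + internal + penalty

for name in REGIONS:
    print(name, arm7_ldr_from_wram(name))
```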

running from VRAM

00000000 -> 4 (1 code fetch + 1 16bit-penalty + 1 data fetch + 1 internal??)
01000000 -> 4
02000000 -> 9 (1 code fetch + 1 16bit-penalty + 1 data fetch + 1 16bit-penalty + 1 internal + 5 penalty ???? doesn't fit)
03000000 -> 4
03800000 -> 4
04000000 -> 4
06000000 -> 5 (1 code + 1 16bit-penalty + 1 data + 1 16bit-penalty + 1 internal)
08000000 -> 19 (1 code + 1 16bit-penalty + 1 data + 1 16bit-penalty + 1 internal + 14 penalty)

running from mainRAM

00000000 -> 9
01000000 -> 9
02000000 -> 18
03000000 -> 9
03800000 -> 9
04000000 -> 9
06000000 -> 9 (1 code + 1 16bit-penalty + 7 penalty + 1 data + 1 16bit-penalty + 1 internal - 3 gain)
08000000 -> 23 (22 when writing) (1 code + 1 16bit-penalty + 7 penalty + 1 data + 1 16bit-penalty + 1 internal + 14 penalty - 3 gain)

STR seems to get 1c penalty when accessing same memory as code

main RAM is always 9c. as if it was somehow able to do parallel accesses, when the other fetch is in another memory region. with a max gain of 3c, like the ARM9. this also eats up internal cycles.

so, seems the penalty is 7c, like on the ARM9.

timings for 32bitbus/mainRAM/wifi0/wifi1/VRAM/GBA

wifi0 = 2 (6/6)
wifi1 = 7 (18/4)

LDR unaligned: no change

LDMIA

code in mainRAM:
1r: 9 / 18 / 19 / 29 / 9 / 23
2r: 9 / 20 / 31 / 37 / 11 / 35

max gain: 3c (2c on memory timings, LDMIA has 1I)

NOP

code in mainRAM: 2c (sequential code fetch)

LDRH TIMINGS

code on WRAM/VRAM/mainRAM
timings for 32bitbus/mainRAM/wifi0/wifi1/VRAM/GBA

WRAM: 3 / 8 / 8 / 20 / 3 / 12
VRAM: 4 / 8 / 9 / 21 / 4 / 13
mainRAM: 9 / 17 / 13 / 25 / 9 / 17

STRH TIMINGS

code on WRAM/VRAM/mainRAM
timings for 32bitbus/mainRAM/wifi0/wifi1/VRAM/GBA

WRAM: 2 / 8 / 7 / 19 / 2 / 11
VRAM: 3 / 8 / 8 / 20 / 4 / 12 (noting penalty for storing to same region as code)
mainRAM: 9 / 17 / 12 / 24 / 9 / 16

same effect observed with code in mainRAM. internal/data cycles seem to get merged with code cycles, for a max gain of 3c.

like, GBA:
from WRAM: 1 code cycle, 10 data cycles.
from mainRAM: 9 code cycles, 10 data cycles, 3 gain.

noting we still get the internal cycle if data>code. internal cycle is lumped with data.
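
A quick check of that idea in Python: take the WRAM row, strip the 1-cycle code fetch, add the 9-cycle mainRAM code fetch, and subtract a gain capped at 3c. This is only a sketch of the rule as described here; the mainRAM data column is left out because code and data share a region there and the measured value (17) doesn't follow the formula.

```python
MAX_GAIN = 3        # max parallel gain observed
MAINRAM_CODE = 9    # code fetch cost when running from mainRAM

def from_mainram(wram_timing):
    """Predict the mainRAM-code timing from the WRAM-code timing."""
    data = wram_timing - 1              # data + internal cycles, without the WRAM code fetch
    return MAINRAM_CODE + data - min(MAX_GAIN, data)

columns = ["32bitbus", "mainRAM", "wifi0", "wifi1", "VRAM", "GBA"]
ldrh_wram = [3, 8, 8, 20, 3, 12]
strh_wram = [2, 8, 7, 19, 2, 11]

for name, ldrh, strh in zip(columns, ldrh_wram, strh_wram):
    if name == "mainRAM":
        continue  # same region as code: no parallel access, skipped
    print(name, from_mainram(ldrh), from_mainram(strh))
```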

wifi timing is 8/20 (5/17), compared to 14/24 (10/20) in 32bit mode. odd.

the timings are nice and clean, nothing like the ARM9.

also, crap, forgot about the LDR internal cycle for the ARM9 part. then again the ARM9 does some weird parallel shito. oh well. also its internal cycles are weird. fuck the ARM9.

wifi timings are weird:

WIFIWAITCNT
002A (2,5): 14,14 (10,10)
003A (2,7): 14,24 (10,20)
003B (3,7): 26,24 (22,20)

timings barring code cycles, 32/16:

WS0:
0: 16/10
1: 14/8
2: 12/6
3: 24/18
4: 14/10
5: 12/8
6: 10/6
7: 22/18

WS1:
0: 20/10
1: 18/8
2: 16/6
3: 28/18
4: 14/10
5: 12/8
6: 10/6
7: 22/18

weird as fuck. actually kind of similar to the EXMEMCNT settings for GBA shito.

16bit timings are always 10/8/6/18. same as EXMEMCNT.

bit0-1 set the base timing. bit2 sets the 2nd access timing for 32bit mode. which is: 6/4 for WS0, 10/4 for WS1. weird.
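
That rule reproduces both tables. A small Python sketch, assuming `setting` is the 3-bit value for one wait state and the timings exclude code cycles as above (function and table names are made up):

```python
BASE_16BIT = [10, 8, 6, 18]               # selected by bits 0-1
SECOND_ACCESS = {0: (6, 4), 1: (10, 4)}   # per wait state: bit2 clear, bit2 set

def wifi_timing(ws, setting):
    """Return (32bit, 16bit) data timing for wait state ws (0 or 1)."""
    base = BASE_16BIT[setting & 3]
    second = SECOND_ACCESS[ws][(setting >> 2) & 1]
    return base + second, base

for ws in (0, 1):
    for setting in range(8):
        t32, t16 = wifi_timing(ws, setting)
        print(f"WS{ws} {setting}: {t32}/{t16}")
```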

____________________
Kuribo64

soooo, summary of timings

for now, barring shit like GBA slot

general rules

* 1 cycle baseline for all accesses
* 1 cycle penalty when using a 16bit bus for a 32bit access (really two accesses)

ARM9

* nonseq penalty of 3 cycles when using the bus (even when accessing unmapped areas)
* extra nonseq penalty of 4 cycles when accessing mainRAM (total penalty 7 cycles)
* code/data accesses in parallel if in different memory regions. somewhat. weird. gains 3 cycles max.
* code fetches forced nonseq 32bit

ARM7

* nonseq penalty of 7 cycles when accessing mainRAM
* mainRAM accesses can be parallelized to some extent. they can happen alongside internal cycles and accesses to any other memory region. max gain: 3c for code in mainRAM, 5c for data in mainRAM and writing. weird.
* separate bus for mainRAM?
* data accesses cause simultaneous code accesses to be nonseq. applies everywhere. matters a lot when running code from mainRAM.
* writing data to same region as code has 1c extra penalty, except for mainRAM and wifi/gba.

DMA

* in 32bit mode, transferring from mainRAM to another memory region is 1 cycle faster
* 1 cycle penalty if source and destination are the same memory region
* if source and destination are mainRAM, all accesses are forced nonseq, resulting in trainwreck timings of 18 cycles/unit in 32bit mode and 16 cycles/unit in 16bit mode.
* seems that the maximum length for a sequential burst is 120 units? needs more checking

note on 'memory regions', esp VRAM

* different VRAM banks are considered different regions!
* VRAM address space with no bank mapped is the same as empty space (no 16bit bus penalty for a 32bit access)
* overlapping banks don't add penalties or affect timings
* shared WRAM is one bank

rules for parallel cycles

* ARM9: code cycles vs data cycles. max gain 3c.
* ARM7: when accessing mainRAM. max gain 3c/5c.
* DMA: when reading from mainRAM in 32bit mode. max gain 1c.

____________________
Kuribo64

ARM7 DMA

32/16

wifiwaitcnt: 2/7

mainRAM->mainRAM: 18/16
WRAM->mainRAM: 3/2
IO->mainRAM: 3/2
wifi0->mainRAM: 14/7
wifi1->mainRAM: 10/5
VRAM->mainRAM: 4/2
mainRAM->VRAM: 3/2
WRAM->VRAM: 3/2
wifi0->VRAM: 14/7
wifi1->VRAM: 10/5
VRAM->VRAM: 5/3 (same-region penalty)
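
The non-wifi rows fall out of the DMA rules from the summary. A rough Python model, assuming mainRAM and VRAM sit on 16-bit buses and WRAM/IO on 32-bit ones; the mainRAM->mainRAM case is just hardcoded from the measurement, and wifi is left out:

```python
BUS16 = {"mainRAM", "VRAM"}   # assumed 16-bit regions; WRAM and IO treated as 32-bit

def dma_cycles_per_unit(src, dst, bits):
    """Cycles per transferred unit for ARM7 DMA (bits is 32 or 16)."""
    if src == "mainRAM" and dst == "mainRAM":
        return 18 if bits == 32 else 16      # forced nonseq, taken from the measurement

    def access(region):
        # 1 cycle baseline, +1 for a 32-bit access over a 16-bit bus
        return 2 if (bits == 32 and region in BUS16) else 1

    total = access(src) + access(dst)
    if src == dst:
        total += 1                           # same-region penalty
    if bits == 32 and src == "mainRAM":
        total -= 1                           # reading from mainRAM is 1 cycle faster
    return total

for src, dst in [("WRAM", "mainRAM"), ("VRAM", "mainRAM"),
                 ("mainRAM", "VRAM"), ("VRAM", "VRAM")]:
    print(f"{src}->{dst}: {dma_cycles_per_unit(src, dst, 32)}/{dma_cycles_per_unit(src, dst, 16)}")
```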

WIFI DMA TIMINGS since it's weird too

setting: WS0 32/16, WS1 32/16
0-3: 14/7, 22/11
4-7: 10/5, 10/5

so, all are sequential and not just 1/2?

weird.

timings of wifi0->wifi0, wifi1->wifi1

setting: WS0 32/16, WS1 32/16
0-3: 24/12, 40/20 (!!) -> seq cycles 6, 10
4-7: 16/8, 16/8 -> seq cycles 4

ok now I guess it makes sense again?

in 32bit mode we just do two accesses and thus double the 16bit timing. no 16bit-bus-penalty, no nonseq shito. no sameregion penalty either.
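
As a sanity check, the wifi->wifi numbers follow from that directly: one read plus one write at the sequential timing per 16-bit unit, doubled in 32bit mode. A Python sketch, with the sequential cycle counts taken from the table above (helper names are made up):

```python
def wifi_seq_cycles(ws, setting):
    """Sequential access cycles for wait state ws (0 or 1) and a 3-bit setting."""
    if setting >= 4:
        return 4
    return 6 if ws == 0 else 10

def wifi_to_wifi_dma(ws, setting, bits):
    """Cycles per unit for a DMA within the same wifi region."""
    per_16bit_unit = 2 * wifi_seq_cycles(ws, setting)   # one read + one write
    return per_16bit_unit * (2 if bits == 32 else 1)    # 32bit = two 16bit transfers

for ws in (0, 1):
    for setting in (0, 4):
        print(f"WS{ws} setting {setting}: "
              f"{wifi_to_wifi_dma(ws, setting, 32)}/{wifi_to_wifi_dma(ws, setting, 16)}")
```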

____________________
Kuribo64

this is only a wild theory but maybe the RNG uses the system time, which advances slower than it might expect if you fast-forward

____________________
Kuribo64

might just be that some OpenGL drivers for Android are total shit. dunno tho.

Android is crap too tbh.

but, like...

dunno

maybe this is not running in performance mode or whatever.

I don't know shit about Android.

____________________
Kuribo64

well yeah debug builds are slow as shit

that's why the CodeBlocks project has that DebugFast config btw

it gets all the optimizations so it's fast, but it gets shit like the debug console

though it's not well suited to actual debugging or profiling

____________________
Kuribo64

that's a great cake there

also well, it's going to be a bit tricky if you don't have a paypal

____________________
Kuribo64

hey hey hey comrades

how 'bout joining the melonDS IRC so we can chat about all that

irc.badnik.net #melonds

____________________
Kuribo64

re: semaphores

in the case of the 3D thread, there is a semaphore that is incremented each time the renderer finishes a scanline, and decremented each time a scanline is read to be composited in the final video output.

there's another semaphore that tells the renderer when it can start, and one that the renderer uses to signal when it has completed a frame.
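
To give an idea of the shape of it, here is a toy version in Python using `threading.Semaphore`. This is not the melonDS code, just a sketch of the same three-semaphore scheme; the names, the 192-line frame, and the string "scanlines" are all placeholders.

```python
import threading

SCANLINES = 192   # placeholder frame height

scanlines_ready = threading.Semaphore(0)   # +1 per rendered line, -1 per line read
start_frame = threading.Semaphore(0)       # tells the renderer it can start
frame_done = threading.Semaphore(0)        # renderer signals a completed frame

framebuffer = [None] * SCANLINES
composited = []

def renderer():
    start_frame.acquire()                  # wait for the go signal
    for y in range(SCANLINES):
        framebuffer[y] = f"line {y}"       # stand-in for actual rendering
        scanlines_ready.release()          # one more scanline is available
    frame_done.release()

def compositor():
    start_frame.release()                  # kick off the renderer
    for y in range(SCANLINES):
        scanlines_ready.acquire()          # blocks until line y has been rendered
        composited.append(framebuffer[y])
    frame_done.acquire()                   # wait for the frame to be fully done

thread = threading.Thread(target=renderer)
thread.start()
compositor()
thread.join()
print(len(composited), "scanlines composited")
```

The counting semaphore is what lets the compositor read lines while the renderer is still working on later ones, instead of waiting for the whole frame.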

____________________
Kuribo64

Actually GBAtek isn't too far off, as far as individual timings are concerned.

Matrix command timings

The timings for commands 0x11-0x1C depend on the current matrix mode.

Mode 0:

Command	Cycles	Remarks
0x10 - MTX_MODE	1
0x11 - MTX_PUSH	17
0x12 - MTX_POP	36
0x13 - MTX_STORE	17
0x14 - MTX_RESTORE	36
0x15 - MTX_IDENTITY	19
0x16 - MTX_LOAD_4x4	34
0x17 - MTX_LOAD_4x3	30
0x18 - MTX_MULT_4x4	35
0x19 - MTX_MULT_4x3	35
0x1A - MTX_MULT_3x3	35
0x1B - MTX_SCALE	35
0x1C - MTX_TRANS	35

Mode 1:

Timings are identical to mode 0.

Mode 2:

Timings are identical to mode 0, except that the MULT and TRANS commands (0x18-0x1A, 0x1C) take 30 more cycles. MTX_SCALE is not affected.

Command	Cycles	Remarks
0x10 - MTX_MODE	1
0x11 - MTX_PUSH	17
0x12 - MTX_POP	36
0x13 - MTX_STORE	17
0x14 - MTX_RESTORE	36
0x15 - MTX_IDENTITY	19
0x16 - MTX_LOAD_4x4	34
0x17 - MTX_LOAD_4x3	30
0x18 - MTX_MULT_4x4	65
0x19 - MTX_MULT_4x3	65
0x1A - MTX_MULT_3x3	65
0x1B - MTX_SCALE	35
0x1C - MTX_TRANS	65

Mode 3:

This mode has completely different timings. Probably because the texture matrices are smaller internally, or because it doesn't have to update the clip matrix, or both. The latter would explain the huge timing difference for command 0x15.

Command	Cycles	Remarks
0x10 - MTX_MODE	1
0x11 - MTX_PUSH	17
0x12 - MTX_POP	18
0x13 - MTX_STORE	17
0x14 - MTX_RESTORE	18
0x15 - MTX_IDENTITY	1
0x16 - MTX_LOAD_4x4	26
0x17 - MTX_LOAD_4x3	19
0x18 - MTX_MULT_4x4	33
0x19 - MTX_MULT_4x3	33
0x1A - MTX_MULT_3x3	33
0x1B - MTX_SCALE	33
0x1C - MTX_TRANS	33
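
For an emulator, the three tables above collapse into two lookup tables plus one special case. A minimal Python sketch, with made-up names, that only encodes the numbers listed here:

```python
# cycles for modes 0 and 1 (mode 2 adds 30 to MULT/TRANS)
MODE0 = {0x10: 1, 0x11: 17, 0x12: 36, 0x13: 17, 0x14: 36, 0x15: 19,
         0x16: 34, 0x17: 30, 0x18: 35, 0x19: 35, 0x1A: 35, 0x1B: 35, 0x1C: 35}

# mode 3 (texture matrix) has its own timings
MODE3 = {0x10: 1, 0x11: 17, 0x12: 18, 0x13: 17, 0x14: 18, 0x15: 1,
         0x16: 26, 0x17: 19, 0x18: 33, 0x19: 33, 0x1A: 33, 0x1B: 33, 0x1C: 33}

MODE2_SLOW = {0x18, 0x19, 0x1A, 0x1C}   # MTX_MULT_* and MTX_TRANS

def matrix_cmd_cycles(cmd, mode):
    """Cycle count for matrix command cmd (0x10-0x1C) in matrix mode 0-3."""
    if mode == 3:
        return MODE3[cmd]
    cycles = MODE0[cmd]
    if mode == 2 and cmd in MODE2_SLOW:
        cycles += 30
    return cycles

print(matrix_cmd_cycles(0x18, 0), matrix_cmd_cycles(0x18, 2), matrix_cmd_cycles(0x18, 3))
```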

Other commands

Command	Cycles	Remarks
0x20 - COLOR	1
0x21 - NORMAL	9-12 / 2-5	9/9/10/11/12 for 0/1/2/3/4 lights enabled. also, see: normal parallel execution
0x22 - TEXCOORD	1
0x23 - VTX_16	5 / 7 / 9	see: vertex parallel execution. one extra cycle because one more parameter.
0x24 - VTX_10	4 / 6 / 8	see: vertex parallel execution
0x25 - VTX_XY	4 / 6 / 8	see: vertex parallel execution
0x26 - VTX_XZ	4 / 6 / 8	see: vertex parallel execution
0x27 - VTX_YZ	4 / 6 / 8	see: vertex parallel execution
0x28 - VTX_DIFF	4 / 6 / 8	see: vertex parallel execution
0x29 - POLYGON_ATTR	1
0x2A - TEXIMAGE_PARAM	1
0x2B - PLTT_BASE	1
0x30 - DIF_AMB	4
0x31 - SPE_EMI	4
0x32 - LIGHT_VECTOR	6	+ pipeline stall
0x33 - LIGHT_COLOR	2
0x34 - SHININESS	32	(could not be measured accurately)
0x40 - BEGIN_VTXS	1	+ pipeline stall
0x41 - END_VTXS	1
0x50 - SWAP_BUFFERS	392	wait till VBlank + 325 cycles (measured: 319/325/331)
0x60 - VIEWPORT	1
0x70 - BOX_TEST	257	+ pipeline stall
0x71 - POS_TEST	7
0x72 - VEC_TEST	5

All other commands (nop/invalid) take one cycle.

Vertex parallel execution

Vertex commands are able to execute in parallel with most other commands.

Timings are expressed from the moment the vertex command starts. VTX_16 is preceded by one cycle because it takes two parameters, and starts upon the second cycle.

Commands 0x20, 0x30, 0x31, 0x72 can run 6 cycles after a vertex command.

Commands 0x29, 0x2A, 0x2B, 0x33, 0x34, 0x41, 0x60, 0x71 run 8 cycles after a vertex command (they cannot run in parallel).

Commands 0x32, 0x40, 0x70 stall the pipeline (see below for what this implies).

All other commands are able to run 4 cycles after a vertex command.

Further commands also abide by these rules, at least until the end of the vertex command. For example: vertex/texcoord/color: texcoord runs 4 cycles after the vertex, color is delayed by one cycle (starts 6 cycles after).
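
These rules can be written down as a small scheduling sketch in Python. It is a simplified model: it ignores pipeline stalls (it just refuses those commands) and assumes commands run back to back, each taking the cycle count passed in. All names are made up.

```python
DELAY_6 = {0x20, 0x30, 0x31, 0x72}
DELAY_8 = {0x29, 0x2A, 0x2B, 0x33, 0x34, 0x41, 0x60, 0x71}
STALL = {0x32, 0x40, 0x70}

def earliest_start_after_vertex(cmd):
    """Cycles after the vertex starts at which cmd may start (None = pipeline stall)."""
    if cmd in STALL:
        return None
    if cmd in DELAY_8:
        return 8
    if cmd in DELAY_6:
        return 6
    return 4

def schedule_after_vertex(cmds):
    """cmds: list of (command, cycles). Returns start times relative to the vertex."""
    t = 0
    starts = []
    for cmd, cycles in cmds:
        earliest = earliest_start_after_vertex(cmd)
        if earliest is None:
            raise ValueError("stalling commands are not modelled here")
        t = max(t, earliest)      # wait for the vertex to allow this command
        starts.append(t)
        t += cycles               # next command can't start before this one ends
    return starts

# vertex / texcoord (0x22) / color (0x20), both one-cycle commands
print(schedule_after_vertex([(0x22, 1), (0x20, 1)]))
```

This gives texcoord at 4 and color at 6, matching the example above.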

Normal parallel execution

Normals are able to run in parallel with vertices coming right after.

The vertex can run 2/2/3/4/5 cycles after the normal starts, for 0/1/2/3/4 lights enabled respectively.

Under these circumstances, further commands don't get delayed until the normal has finished. (maybe some commands do! I haven't tested them all)

This explains why "texcoord/normal/vertex" runs faster than "normal/texcoord/vertex".

Polygon pipeline

Each vertex which completes a polygon places restrictions on when further vertices can run.

The process lasts 27 cycles for a triangle and 36 cycles for a quad. This duration is divided into 9-cycle slots in which vertices have to fit. The first slot is obviously occupied by the vertex that is executing (and building a polygon). Exceptions for strips: for triangle strips, all 3 slots are occupied; for quad strips, the first 2 slots are occupied.

EXCEPT: the process only lasts one slot if the polygon is rejected by culling/clipping

When a vertex starts within one slot, the slot is occupied, and the next vertex is delayed until the next slot.

Vertices running outside of the polygon-building process are free of any restrictions, and can run 4 cycles after the start of a previous vertex.
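
One way to read the slot rules, as a Python sketch: after a vertex completes a polygon, the next vertex has to wait for the first free 9-cycle slot. This is an interpretation of the description above, not verified behaviour, and the primitive names are placeholders.

```python
SLOT = 9   # cycles per slot

# slots already occupied when a vertex completes a polygon
OCCUPIED = {
    "triangles": 1,        # only the completing vertex's own slot
    "quads": 1,
    "triangle_strip": 3,   # all 3 slots
    "quad_strip": 2,       # first 2 slots
}

def next_vertex_delay(primitive, rejected=False):
    """Cycles after a polygon-completing vertex before the next vertex can start."""
    if rejected:
        return SLOT        # culled/clipped: the process lasts a single slot
    return OCCUPIED[primitive] * SLOT

for prim in OCCUPIED:
    print(prim, next_vertex_delay(prim))
```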

Pipeline stalls

Commands 0x32, 0x40 and 0x70 stall the pipeline. That is, if they happen during the polygon-building process described above, they are delayed until the end of the process.

Commands 0x32 and 0x70 get an extra delay when a pipeline stall happens: 8 and 10 cycles respectively.

If 0x32 happens outside of the polygon-building process, it can run 6 cycles after a vertex.

____________________
Kuribo64

