Main - Posts by Mighy Max |
Mighy Max |
Member Normal user Level: 7 Posts: 1/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
Addendum to NWRAM:
Priorities between the sets of WRAM from highest to lowest. They are the same for reading and writing:
Within a set of WRAM, the parts can be set to lie on top of each other. The lowest part has the highest read priority. Write priority is special: writing to overlapping parts writes to all of them at once! Result of a test case on HW: [image]
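The read and write rules for overlapping parts can be illustrated with a small host-side model. This is only a sketch under assumptions: the type `nwram_part_t`, the functions `nwram_read` and `nwram_write`, and the set size of four parts are made up for illustration and come from neither melonDS nor the test case. It also assumes that "lowest part" means the part with the lowest index.

```c
#include <stdbool.h>
#include <stdint.h>

#define PARTS_PER_SET 4   /* hypothetical: number of parts in the modelled set */

/* One part of a WRAM set, reduced to a single byte of storage for brevity. */
typedef struct {
    bool    mapped;   /* true if the part is overlaid on the probed address */
    uint8_t value;    /* content of the part at that address */
} nwram_part_t;

/* Read: the mapped part with the lowest index wins (assumed meaning of
 * "lowest part"). Returns the index that answered, or -1 if none is mapped. */
int nwram_read(const nwram_part_t parts[PARTS_PER_SET], uint8_t *out)
{
    for (int i = 0; i < PARTS_PER_SET; i++) {
        if (parts[i].mapped) {
            *out = parts[i].value;
            return i;
        }
    }
    return -1;
}

/* Write: every part mapped at the address receives the value.
 * Returns the number of parts that were written. */
int nwram_write(nwram_part_t parts[PARTS_PER_SET], uint8_t value)
{
    int written = 0;
    for (int i = 0; i < PARTS_PER_SET; i++) {
        if (parts[i].mapped) {
            parts[i].value = value;
            written++;
        }
    }
    return written;
}
```

With parts 1 and 2 overlaid, a read is answered by part 1 only, while a write changes both parts.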
Mighy Max |
Member Normal user Level: 7 Posts: 2/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
Posted by Arisotura
Yes, that is the first part of the check. I write an individual initial value to each part and overlay them all. Afterwards I read the value back. The part containing that value is then moved away and the read is repeated, until all parts are done. See the line "WRAM Bank Read Priorities", which notes which part is moved away in which order. I also checked this for the different sets. The line "WRAM Write Priority" shows which set really got written to when the windows overlap: the write, like the read, only applies to the set with the highest priority.
Yes. But I do not yet have a safe method to test this behavior, and I doubt this is something that is used intentionally... I thought about using it to overlay the Wifi RAM at 04800000h, but you can't specify a window start that high; only the window end can be in the 04... range. I will make a repository for the NWRAM test case. At first it was only a way to verify that I understood the accesses found while reversing the stage 2 loader. Edit: Testcase source is here: https://github.com/DesperateProgrammer/DSiTestCases
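The probing procedure from this post (give every part its own marker, overlay them all, read, move the answering part away, repeat) can be sketched as follows. This is not the code from the repository. `model_read` is a hypothetical stand-in for the hardware that assumes the lowest-numbered mapped part answers a read, and `NUM_PARTS` and `probe_read_order` are illustrative names.

```c
#include <stdint.h>

#define NUM_PARTS 4  /* hypothetical number of overlaid parts */

/* Stand-in for the hardware: bit i of `mapped` is set while part i is
 * overlaid on the probe address. A read returns the marker of the
 * lowest-numbered mapped part (assumed priority rule), or 0 if none. */
static uint8_t model_read(uint8_t mapped, const uint8_t marker[NUM_PARTS])
{
    for (int i = 0; i < NUM_PARTS; i++)
        if (mapped & (1u << i))
            return marker[i];
    return 0;
}

/* Probe the read priority: read, note which part answered, move that part
 * away, repeat. order[k] receives the part seen at step k.
 * Returns the number of parts found. Markers must be unique and non-zero. */
int probe_read_order(const uint8_t marker[NUM_PARTS], int order[NUM_PARTS])
{
    uint8_t mapped = (uint8_t)((1u << NUM_PARTS) - 1); /* all parts overlaid */
    int found = 0;

    while (mapped != 0) {
        uint8_t seen = model_read(mapped, marker);
        int part = -1;
        for (int i = 0; i < NUM_PARTS; i++) {
            if ((mapped & (1u << i)) && marker[i] == seen) {
                part = i;
                break;
            }
        }
        if (part < 0)
            break;                          /* value matches no mapped part */
        order[found++] = part;
        mapped &= (uint8_t)~(1u << part);   /* "move the part away" */
    }
    return found;
}
```

On real hardware `model_read` would be replaced by an actual read through the window, and clearing a bit in `mapped` by reprogramming the bank register of that part.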
Mighy Max |
Member Normal user Level: 7 Posts: 3/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
Posted by PoroCYon
I got curious today and wanted to check whether we can find out, by a write or a read passing through, if a HW register is actually present at a given address. Unfortunately this is not the case. The area 0x04...... is completely prioritized by the HW registers. No NWRAM write passes through, nor does any read shine through.

But this makes the emulation simpler for melonDS, as NWRAM does not need to be implemented in this region.

Supporting it would require the bus client for the NWRAM to do a subtraction instead of a mask (for a single region you can just mask with a fixed 0x03000000), but if you allow 03 and 04 you need to subtract 1 to create such a mask, or create a mask for each region and OR them. Both ways slow down the circuit and increase its gate count for no apparent reason.

But that opens some other questions I would like to check:
- Is bit 28 really writeable?
- I did not check whether the info on GBATEK is correct here.
- If this bit is writeable, does it have any function other than the one in GBATEK? Does it change the priority, the fall-through, the timing?
Any other suggestions or ideas what this bit could control instead of an increased mapping region?

For completeness and cross-checking, the simple test I made is below. It shows 0 counts in both output lines. The image size of set A was set to 64k by the previous tests.

    /*****************************************************************************
     * Test HW Registers via overlay.
     * Check if the NWRAM below HW registers show through, if there is no
     * physical register present
     *****************************************************************************/
    int count = 0;
    /* fill the 64k of set A with a marker while it is mapped below the HW regs */
    WRAMSetWindow(0, 0x03FF0000, 0x04000000) ;
    memset((void *)0x03FF0000, 0xAA, 0x10000) ;
    /* extend the window end into the HW register area */
    WRAMSetWindow(0, 0x03FF0000, 0x04FF0000) ;
    for (int i=0;i<0x0ff0000;i++)
    {
      uint8_t val = ((volatile uint8_t *)0x04000000)[i] ;
      if (val == 0xAA)
      {
        printf("| %08x ", i) ;
        count++;
        if (count > 20)
          break ;
      }
    }
    printf("Read HW Regs Fallthrough Check: %i\n", count) ;

    count = 0;
    memset((void *)0x03FF0000, 0xAA, 0x10000) ;
    for (int i=0;i<0x0ff0000;i++)
    {
      /* write the register value back and see if it lands in the NWRAM */
      ((volatile uint8_t *)0x04000000)[i] |= 0;
      if (((volatile uint8_t *)0x03FF0000)[i & 0xFFFF] == ((volatile uint8_t *)0x04000000)[i])
      {
        printf("| %08x:%02x ", i, ((volatile uint8_t *)0x03FF0000)[i & 0xFFFF]) ;
        count++;
        if (count > 20)
          break ;
      }
    }
    printf("Write HW Regs Fallthrough Check: %i\n", count) ;
    /* restore the original window */
    WRAMSetWindow(0, 0x03FF0000, 0x04000000) ;
Mighy Max |
Member Normal user Level: 7 Posts: 4/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
Posted by Arisotura
Yes, it kind of makes sense, but I am not 100% sure about that. I thought of another test for it and will do it when I get time. If bit 28 extends the region, it must blend in 0x03ff0000 to 0x03ffffff when the start index is ff and the end is 1ff. Otherwise it should fall through.
:edit: It dawned on me just now. It is part of the region. It is required so that the last 32kB can be mapped and a zero-length window can still be created. This means the end index is cropped internally at 0x100.
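The cropping can be written down as a small address calculation for set A. This is a sketch under assumptions: it takes 64 kB per index step (which is what start index ff mapping to 0x03ff0000 implies), treats the end address as exclusive, and uses made-up names (`set_a_window_start`, `set_a_window_end`) that do not come from any existing code.

```c
#include <stdint.h>

#define NWRAM_BASE      0x03000000u
#define SET_A_UNIT      0x10000u  /* assumed: 64 kB per index step for set A */
#define SET_A_END_LIMIT 0x100u    /* end index is treated as at most 0x100 */

/* Address of the first byte of the window. */
uint32_t set_a_window_start(uint32_t start_index)
{
    return NWRAM_BASE + start_index * SET_A_UNIT;
}

/* Address one past the last byte of the window, after the internal crop. */
uint32_t set_a_window_end(uint32_t end_index)
{
    if (end_index > SET_A_END_LIMIT)
        end_index = SET_A_END_LIMIT;
    return NWRAM_BASE + end_index * SET_A_UNIT;
}
```

With start index ff, an end index of ff gives a zero-length window, while 100 and 1ff both end at 0x04000000 and therefore map the last 64 kB.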
Mighy Max |
Member Normal user Level: 7 Posts: 5/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
Tested more today, until the battery died just a few minutes ago:
The end indices in the windows are indeed completely writeable and do not reflect the cut happening at 0x04000000. With bit 28 set, the last 32/64kB is accessible (this was actually already tested by the code posted above, but I did not realize it). So it really is part of the window region, although there is no observed difference if the end index is any greater than 100(A)/200(B&C).
The bits 0xE00FC00F in the window for set A and 0xE007C007 in the windows for sets B and C cannot be changed. The bits 0x72 in the set A banks and 0x60 in the set B and C banks cannot be set. This is not yet reflected in the code and will be implemented (at the write; the performance impact is minimal).
An addendum on SCFG_EXT7/9 bit 25: although GBATEK states that this bit enables/disables NWRAM, the NWRAM-related HW registers seem to be still accessible and changeable with the same masks. So I need to revert the fall-through for the NWRAM HW registers when the NWRAM bit in SCFG_EXT is disabled, and get a better understanding of this bit's effect on the NWRAM. Did someone already play with it?
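Applying the fixed bits at write time could look roughly like this. It is a sketch, not the melonDS implementation: the function and macro names are hypothetical, and it assumes that the unchangeable window bits simply keep their current value and that the bank bits which cannot be set always read back as zero.

```c
#include <stdint.h>

/* Bits that could not be changed in the tests described in this post. */
#define WINDOW_FIXED_A   0xE00FC00Fu  /* window register, set A */
#define WINDOW_FIXED_BC  0xE007C007u  /* window registers, sets B and C */
#define BANK_FIXED_A     0x72u        /* bank registers, set A */
#define BANK_FIXED_BC    0x60u        /* bank registers, sets B and C */

/* New content of a window register: fixed bits keep their current value
 * (assumption), all other bits take the written value. */
uint32_t window_write(uint32_t current, uint32_t value, uint32_t fixed_mask)
{
    return (current & fixed_mask) | (value & ~fixed_mask);
}

/* New content of a bank register: bits that cannot be set stay zero
 * (assumption). */
uint8_t bank_write(uint8_t value, uint8_t fixed_mask)
{
    return (uint8_t)(value & ~fixed_mask);
}
```

Writing all ones to a cleared set A window register would then read back as 0x1FF03FF0.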
Mighy Max |
Member Normal user Level: 7 Posts: 6/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
SCFG_EXT7/9. Bit 25: R/W on Arm7. RO on Arm9.
If cleared, the NWRAM is not blended in on the Arm7 or Arm9 system bus. However, the HW Registers at 0x04004040..0x04004063 are still working. Test: https://github.com/DesperateProgrammer/DSiTestCases/tree/master/NWRAM |
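For an emulator, the observed behaviour could be captured roughly as below. This is a sketch only: the function and macro names are made up for illustration and are not melonDS code. The register range is the one given in this post.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCFG_EXT_NWRAM_BIT (1u << 25)   /* bit 25 of SCFG_EXT7/9 */
#define NWRAM_REGS_FIRST   0x04004040u
#define NWRAM_REGS_LAST    0x04004063u

/* NWRAM is only blended into the system bus while bit 25 is set. */
bool nwram_mapped_on_bus(uint32_t scfg_ext)
{
    return (scfg_ext & SCFG_EXT_NWRAM_BIT) != 0;
}

/* The NWRAM registers keep working regardless of bit 25. */
bool nwram_register_accessible(uint32_t addr, uint32_t scfg_ext)
{
    (void)scfg_ext;   /* deliberately ignored, see the test result above */
    return addr >= NWRAM_REGS_FIRST && addr <= NWRAM_REGS_LAST;
}
```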
Mighy Max |
Member Normal user Level: 7 Posts: 7/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
Heya,
I am currently working on a gdb server to be built into melonDS, in a separate branch. It is basically an RSPServer class that can be attached to an ARM core, and right now it is instantiated twice, once for each core. It has a bit of a performance hit (I'd say about -10%), so I intend it to be usable only in a specially configured build. It already connects to gdb and can at least read-access both cores. But from what I found, there is no non-hacky way to control or read the emulation state, or is there? There is an EmuThread in main.cpp that controls the emulation/execution state, but it is not exported in any way. Is there any idea as to how an API for getting and controlling the emulation state should look? Or how that EmuThread should be disentangled from main? A small peek: [image]
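For readers unfamiliar with what such a server has to speak: GDB's Remote Serial Protocol frames every packet as `$`, the payload, `#`, and a two-digit hex checksum (the sum of the payload bytes modulo 256). The sketch below shows only that framing. It is not taken from the branch mentioned above, the names `rsp_checksum` and `rsp_frame` are hypothetical, and escaping of special characters in the payload is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sum of all payload bytes, modulo 256. */
uint8_t rsp_checksum(const char *payload)
{
    uint8_t sum = 0;
    for (const char *p = payload; *p != '\0'; p++)
        sum = (uint8_t)(sum + (uint8_t)*p);
    return sum;
}

/* Writes "$<payload>#<xx>" into out.
 * Returns the packet length, or -1 if the buffer is too small. */
int rsp_frame(const char *payload, char *out, size_t out_size)
{
    size_t len = strlen(payload);
    if (out_size < len + 5)   /* '$' + payload + '#' + 2 hex digits + NUL */
        return -1;
    return snprintf(out, out_size, "$%s#%02x", payload,
                    (unsigned)rsp_checksum(payload));
}
```

Framing the reply `OK` this way yields `$OK#9a`.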
Mighy Max |
Member Normal user Level: 7 Posts: 8/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
This really seems like a bug in the boot code.
The following is my attempt at reversing that part, which I did a while ago. The part that gives you the problem lies between IPC sync values 5 and 0. There is no sync/barrier between the memory access and the SCFG write. The memory regions used are passed in from the actual app/stage2, while the boot relict code seems to be SDK code, as it is also present in (some?) other apps.

Imho there are two options here: get the timings correct, or patch the loader. Both have serious issues. The timings will vary often until melonDS is mature, so this will break repeatedly during development. Patching the loader does not reflect the actual console and its code, so the console and melonDS would move a step apart, and detecting the boot relict code is tricky, as it might be used in multiple apps and might appear in different versions.

    void bootClearMemory(uint32_t *target, uint32_t length) ;
    void bootClearMemoryListEntries(uint32_t *memoryList) ;
    void bootSimpleCopy(uint32_t *source, uint32_t *target, uint32_t count) ;
    void bootReverseCopy(uint32_t *source, uint32_t *target, uint32_t count) ;

    typedef void (*ARM9BOOTFUNC)(void) ;

    void bootBootRelictCode(SBOOTCONFIG *cfg, uint32_t *memoryList, uint32_t memoryListLength, uint32_t stackPtr)
    {
      bootReverseCopy((uint32_t *)cfg, (uint32_t *)(stackPtr - 0x110), 0xe0) ;
      SBOOTCONFIG *cfgCopy = (SBOOTCONFIG *)(stackPtr - 0x110) ;

      /* sync with arm7 */
      WaitForIPCSyncValue(4) ;
      SetIPCSyncValue(4) ;

      /* copy list of memory to be cleared to the new stack */
      bootReverseCopy(memoryList, (uint32_t *)(stackPtr + 0x110), memoryListLength) ;
      bootClearMemoryListEntries(memoryList) ;

      /* wait for arm7 to complete memory clear up */
      WaitForIPCSyncValue(5) ;

      /* Set up new WRAM Settings */
      if (cfgCopy->wramFlag != 0)
      {
        bootSimpleCopy(cfgCopy->wramSettings, (uint32_t *)0x04004040, sizeof(cfgCopy->wramSettings)) ;
        *(volatile uint8_t *)0x4000247 = *(uint8_t *)(cfgCopy->wramCnt >> 24) & 3 ;
      }

      /* tell arm7 we set up the wram */
      SetIPCSyncValue(5) ;

      ARM9BOOTFUNC arm9boot = *(ARM9BOOTFUNC *)(cfgCopy->entry) ;
      uint32_t targetScfgExt = cfgCopy->sfcg_ext ;
      bool setSCFG = !((cfgCopy->unknown_04 | cfgCopy->unknown_08) & 0x80000000) ;

      // TODO (or not) the memory list is actually 4 consecutive lists, each terminated with 0
      // only the first is used.
      // the second list is used for copying data blocks from start to end
      // the third list is used for copying data blocks from end to start
      // and the last list is to clear again
      // Since the memory list is hardcoded in the calling method, we know that
      // the last 3 lists are never used so we skip them

      // clear the bootloader stack
      bootClearMemory((uint32_t *)0x02fe0000, 0x3fc0) ;
      // Clear the memory of that list too and the cfgCopy
      bootClearMemory((uint32_t *)(stackPtr - (0x110 + memoryListLength)), 0x110 + memoryListLength) ;

      if (setSCFG)
      {
        *(volatile uint32_t *)0x04004008 = targetScfgExt ;
      }

      /* sync with arm7 */
      WaitForIPCSyncValue(0) ;
      SetIPCSyncValue(0) ;

      /* start new code or die! */
      if (arm9boot != 0)
        arm9boot() ;
      while (true) ;
    }

    void bootSimpleCopy(uint32_t *source, uint32_t *target, uint32_t count)
    {
      uint32_t *end = target + count / 4;
      while (target < end)
      {
        *(target++) = *(source++) ;
      }
    }

    void bootReverseCopy(uint32_t *source, uint32_t *target, uint32_t count)
    {
      uint32_t *curTarget = target + count / 4 ;
      uint32_t *curSource = source + count / 4;
      while (target < curTarget)
      {
        *(--curTarget) = *(--curSource) ;
      }
    }

    void bootClearMemory(uint32_t *target, uint32_t length)
    {
      while (length >= 4)
      {
        *(target++) = 0 ;
        length -= 4 ;
      }
    }

    void bootClearMemoryListEntries(uint32_t *memoryList)
    {
      while (true)
      {
        if (!*memoryList)
          break;
        bootClearMemory((uint32_t *)(memoryList[0]), memoryList[1]) ;
        memoryList += 2 ;
      }
    }

Below is the code for the arm7 at the same time, between IPC 5 and 0 (out of Ghidra):

    037b84b8 05 00 a0 e3    mov     r0,#0x5
    037b84bc 54 00 00 eb    bl      SetIPCSyncValue_1           undefined SetIPCSyncValue_1()
    037b84c0 58 00 00 eb    bl      WaitForIPCSyncValue_1       undefined WaitForIPCSyncValue_1()
    037b84c4 00 00 58 e3    cmp     r8,#0x0
    037b84c8 40 00 8b 12    addne   r0,r11,#0x40
    037b84cc 54 10 87 12    addne   r1,r7,#0x54
    037b84d0 10 20 a0 13    movne   r2,#0x10
    037b84d4 1d 00 00 1b    blne    bootSimpleCopy              undefined bootSimpleCopy()
    037b84d8 40 0f 9b e8    ldmia   r11,{r6 r8 r9 r10 r11}
    037b84dc 0a a0 89 e1    orr     r10,r9,r10
                        LAB_037b84e0                            XREF[1]: 037b84ec(j)
    037b84e0 45 00 00 eb    bl      bootGetMemoryListEntry      undefined bootGetMemoryListEntry()
    037b84e4 01 00 00 0a    beq     LAB_037b84f0
    037b84e8 18 00 00 eb    bl      bootSimpleCopy              undefined bootSimpleCopy()
    037b84ec fb ff ff ea    b       LAB_037b84e0
                        LAB_037b84f0                            XREF[2]: 037b84e4(j), 037b84fc(j)
    037b84f0 41 00 00 eb    bl      bootGetMemoryListEntry      undefined bootGetMemoryListEntry()
    037b84f4 01 00 00 0a    beq     LAB_037b8500
    037b84f8 1a 00 00 eb    bl      bootReverseCopy             undefined bootReverseCopy()
    037b84fc fb ff ff ea    b       LAB_037b84f0
                        LAB_037b8500                            XREF[1]: 037b84f4(j)
    037b8500 35 00 00 eb    bl      FUN_037b85dc                undefined FUN_037b85dc()
    037b8504 05 10 4d e0    sub     r1,sp,r5
    037b8508 05 20 a0 e1    cpy     r2,r5
    037b850c 1d 00 00 eb    bl      bootClearMemory             undefined bootClearMemory()
    037b8510 00 00 a0 e3    mov     r0,#0x0
    037b8514 3e 00 00 eb    bl      SetIPCSyncValue_1           undefined SetIPCSyncValue_1()
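The handshake used in both listings relies on two helpers whose bodies are not shown. A plausible shape for them, based on the documented IPCSYNC register layout (bits 0-3: value sent by the other CPU, bits 8-11: value sent to the other CPU), is sketched below. The names `ipc_set_sync`, `ipc_wait_sync` and `ipc_sync_input` are hypothetical, and the register is passed in as a pointer so that the code can also be exercised on a host. On the console the pointer would be `(volatile uint16_t *)0x04000180`.

```c
#include <stdint.h>

/* Value most recently sent by the other CPU (bits 0-3). */
uint8_t ipc_sync_input(uint16_t reg)
{
    return (uint8_t)(reg & 0x000Fu);
}

/* Send a 4-bit value to the other CPU (bits 8-11), keeping all other bits. */
void ipc_set_sync(volatile uint16_t *ipcsync, uint8_t value)
{
    uint16_t reg = *ipcsync;
    reg = (uint16_t)((reg & ~0x0F00u) | ((uint16_t)(value & 0x0Fu) << 8));
    *ipcsync = reg;
}

/* Busy-wait until the other CPU has sent the expected value. */
void ipc_wait_sync(volatile uint16_t *ipcsync, uint8_t value)
{
    while (ipc_sync_input(*ipcsync) != value) {
        /* spin */
    }
}
```

Note that such a handshake only orders the points where both CPUs meet. Accesses made between two sync points, like the memory clear and the SCFG write discussed above, are not ordered against each other by it.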
Mighy Max |
Member Normal user Level: 7 Posts: 9/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
Posted by StrikerX3
How sure are you about this? I have been busy for some days now measuring memory and instruction timings on the (Phat) DS, my DSi and the various emulators, out of curiosity and because I think this is still a bit poorly documented/implemented (yeah, it's a performance hit without much, if any, benefit).

The thing here is: all my measurements point to a 2kB data cache on the DS and only a 1kB data cache on the DSi. I checked multiple times whether my measurement is off or some hardware setting (e.g. cache lockdown) is causing this, but to no avail. On all emulators (as expected) no cache size can be determined and all runs need the same time. The reference clock is the 33MHz timer on the ARM9. The result does not match the public info about a 4kB data cache.

What I do to measure the instruction and data cache sizes:

Enable instruction cache, disable cache lockdowns
for n in [8..16]:
    Do 11 runs of the following measurement and take the median time of these measurements:
        run a loop of (2^n)-3 'mov r8, r8' instructions, together with 3 instructions for the looping
    If the median divided by the number of instructions exceeds double that of the previous run:
        Report the previous run length as the instruction cache size

For the data cache the rise of the execution time is not that sharp, because the instruction cache is disabled, which results in a larger prefetch time for each instruction. I do this to eliminate the effect of instruction caching on this measurement.

Disable instruction cache, disable cache lockdowns
Enable data cache, disable cache lockdowns
for n in [8..16]:
    Do 11 runs of the following measurement and take the median time of these measurements:
        run a loop of (2^n)-3 'ldr r2, [pc]' instructions, together with 3 instructions for the looping
    If the median divided by the number of instructions exceeds 120% of the previous run:
        Report the previous run length as the data cache size

I can provide a simple .NDS testing this behaviour, if someone wants to double check.
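The evaluation part of this procedure (median of 11 runs, then looking for the first jump in the time per instruction) can be sketched on the host as follows. This is not the code of the test app. The names `median_of_runs` and `detect_cache_size` are hypothetical, and the conversion from run length to bytes assumes 4-byte ARM instructions.

```c
#include <stdint.h>
#include <stdlib.h>

#define RUNS 11   /* measurements per loop length, as described above */

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Median of the RUNS timer readings. Sorts the array in place. */
uint32_t median_of_runs(uint32_t samples[RUNS])
{
    qsort(samples, RUNS, sizeof samples[0], cmp_u32);
    return samples[RUNS / 2];
}

/* per_instr[i] is the median time divided by the instruction count for a
 * loop of 2^(first_n + i) instructions. `factor` is 2.0 for the
 * instruction cache test and 1.2 for the data cache test.
 * Returns the detected size in bytes (4 bytes per instruction, assuming
 * ARM mode), or 0 if no jump was found. */
uint32_t detect_cache_size(const double per_instr[], int count,
                           int first_n, double factor)
{
    for (int i = 1; i < count; i++) {
        if (per_instr[i] > per_instr[i - 1] * factor)
            return (1u << (first_n + i - 1)) * 4u;  /* previous run length */
    }
    return 0;
}
```

If the time per instruction first jumps at n = 12, the previous run had 2^11 instructions, which this sketch reports as 8192 bytes.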
Mighy Max |
Member Normal user Level: 7 Posts: 10/10 EXP: 1118 Next: 330 Since: 07-08-21 Last post: 329 days ago Last view: 329 days ago |
Hello again,
I cleaned up and published a little app to measure cache and memory timings. The issue with the cache did not resolve. I'm quite certain that the 4kB data cache is not present (or at least not effective) on the DS and DSi. The measurements are consistent on the HW between runs and variants of the test, while they differ between the generations (DS vs DSi).

You can find the source code to measure and verify yourself at https://github.com/DesperateProgrammer/DSMemoryCycleCounter

I hope the data can improve the accuracy of the emulators in the future. I am aware that it has not much benefit at the moment, as implementing the mechanics of a cache would impose a significant performance impact on the emulation. Maybe the timings of the non-cached memory regions can be improved and compatibility increases. The app might provide a benchmark for this.

If you happen to find an error in the calculation/measurement of cycles, please let me know. There are still a lot of things I want to include in the measurements, such as the impact of DMA, verification of bus priority, verification of cache write-back sizes and much more.

Without further ado, the memory timings and cache sizes:

DSi Caches
==========
                     Reported  Measured
ICACHE Size              8192      8192
DCACHE Size              4096      1024   ! Conflict
ICACHE Line Size           32        32
DCACHE Line Size           32        32

DS Caches
==========
                     Reported  Measured
ICACHE Size              8192      8192
DCACHE Size              4096      2048   ! Conflict
ICACHE Line Size           32        32
DCACHE Line Size           32        32

DSi Memory Timings
==================
In cpu cycles @ 66MHz
            N16   S16   N32   S32
Main RAM     16    16    18     4
ITCM          2     2     2     2
DTCM          1     1     1     1
WRAM          8     8     8     2
VRAM          8     8     8     2
GBA ROM      26    26    38    24   EXMEMCNT[4..2] = 000
GBA ROM      22    22    34    24   EXMEMCNT[4..2] = 001
GBA ROM      18    18    30    24   EXMEMCNT[4..2] = 010
GBA ROM      42    42    52    24   EXMEMCNT[4..2] = 011
GBA ROM      26    26    34     8   EXMEMCNT[4..2] = 100
GBA ROM      22    22    26     8   EXMEMCNT[4..2] = 101
GBA ROM      18    18    26     8   EXMEMCNT[4..2] = 110
GBA ROM      42    42    50     8   EXMEMCNT[4..2] = 111
GBA RAM      26    26    26    20   EXMEMCNT[1..0] = 00
GBA RAM      22    22    22    16   EXMEMCNT[1..0] = 01
GBA RAM      18    18    18    12   EXMEMCNT[1..0] = 10
GBA RAM      42    42    42    36   EXMEMCNT[1..0] = 11

In cpu cycles @ 133MHz
            N16   S16   N32   S32
Main RAM     32    32    36     8
ITCM          2     2     2     2
DTCM          1     1     1     1
WRAM         12    12    12     4
VRAM         12    12    12     4
GBA ROM      48    48    72    48   EXMEMCNT[4..2] = 000
GBA ROM      40    40    64    48   EXMEMCNT[4..2] = 001
GBA ROM      32    32    56    48   EXMEMCNT[4..2] = 010
GBA ROM      80    80   104    48   EXMEMCNT[4..2] = 011
GBA ROM      48    48    64    32   EXMEMCNT[4..2] = 100
GBA ROM      40    40    56    32   EXMEMCNT[4..2] = 101
GBA ROM      32    32    48    32   EXMEMCNT[4..2] = 110
GBA ROM      80    80    96    32   EXMEMCNT[4..2] = 111
GBA RAM      48    48    48    40   EXMEMCNT[1..0] = 00
GBA RAM      40    40    40    32   EXMEMCNT[1..0] = 01
GBA RAM      32    32    32    24   EXMEMCNT[1..0] = 10
GBA RAM      80    80    80    72   EXMEMCNT[1..0] = 11

In bus cycles @ 66MHz
            N16   S16   N32   S32
Main RAM      8     8     9     2
ITCM          1     1     1     1
DTCM        0.5   0.5   0.5   0.5
WRAM          4     4     4     1
VRAM          4     4     4     1
GBA ROM      13    13    19    12   EXMEMCNT[4..2] = 000
GBA ROM      11    11    17    12   EXMEMCNT[4..2] = 001
GBA ROM       9     9    15    12   EXMEMCNT[4..2] = 010
GBA ROM      21    21    27    12   EXMEMCNT[4..2] = 011
GBA ROM      13    13    17     8   EXMEMCNT[4..2] = 100
GBA ROM      11    11    15     8   EXMEMCNT[4..2] = 101
GBA ROM       9     9    13     8   EXMEMCNT[4..2] = 110
GBA ROM      21    21    25     8   EXMEMCNT[4..2] = 111
GBA RAM      13    13    13    10   EXMEMCNT[1..0] = 00
GBA RAM      11    11    11     8   EXMEMCNT[1..0] = 01
GBA RAM       9     9     9     6   EXMEMCNT[1..0] = 10
GBA RAM      21    21    21    18   EXMEMCNT[1..0] = 11

In bus cycles @ 133MHz
            N16   S16   N32   S32
Main RAM      8     8     9     2
ITCM        0.5   0.5   0.5   0.5
DTCM       0.25  0.25  0.25  0.25
WRAM          3     3     3     1
VRAM          3     3     3     1
GBA ROM      12    12    18    12   EXMEMCNT[4..2] = 000
GBA ROM      10    10    16    12   EXMEMCNT[4..2] = 001
GBA ROM       8     8    14    12   EXMEMCNT[4..2] = 010
GBA ROM      20    20    26    12   EXMEMCNT[4..2] = 011
GBA ROM      12    12    16     8   EXMEMCNT[4..2] = 100
GBA ROM      10    10    14     8   EXMEMCNT[4..2] = 101
GBA ROM       8     8    12     8   EXMEMCNT[4..2] = 110
GBA ROM      20    20    24     8   EXMEMCNT[4..2] = 111
GBA RAM      12    12    12    10   EXMEMCNT[1..0] = 00
GBA RAM      10    10    10     8   EXMEMCNT[1..0] = 01
GBA RAM       8     8     8     6   EXMEMCNT[1..0] = 10
GBA RAM      20    20    20    18   EXMEMCNT[1..0] = 11

DS Memory Timings
=================
In cpu cycles:
            N16   S16   N32   S32
Main RAM     16    16    18     4
ITCM          2     2     2     2
DTCM          1     1     1     1
VRAM          8     8     8     2
GBA ROM/RAM *problems measuring*

In bus cycles:
            N16   S16   N32   S32
Main RAM      8     8     9     2
ITCM          1     1     1     1
DTCM        0.5   0.5   0.5   0.5
VRAM          4     4     4     1
GBA ROM/RAM *problems measuring*
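The relation between the cpu cycle and bus cycle tables, as the Main RAM, TCM, WRAM and VRAM rows suggest, is a division by the clock ratio: 2 at 66MHz and 4 at 133MHz, relative to the 33MHz bus. A minimal sketch of that conversion follows. The function names are hypothetical, and the nominal clock values are used only to derive the ratio.

```c
#include <stdint.h>

#define BUS_CLOCK_MHZ 33u   /* nominal bus clock */

/* CPU cycles per bus cycle for a nominal CPU clock (66 -> 2, 133 -> 4). */
uint32_t cpu_cycles_per_bus_cycle(uint32_t cpu_clock_mhz)
{
    return cpu_clock_mhz / BUS_CLOCK_MHZ;
}

/* Convert a duration in CPU cycles into bus cycles. */
double cpu_to_bus_cycles(uint32_t cpu_cycles, uint32_t cpu_clock_mhz)
{
    return (double)cpu_cycles / cpu_cycles_per_bus_cycle(cpu_clock_mhz);
}
```

For example, the Main RAM N32 access takes 18 cpu cycles at 66MHz and 36 at 133MHz, which is 9 bus cycles in both cases.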