GBAtek addendum/errata
Arisotura |
I'm letting the DSi discharge while running a test program that keeps track of changes in the battery register.

So far:
when charging: 8F
full -> empty: 0F 0B 07 03 (light red) 01 (light red, blinking)

So bit0 = not-critical bit?
bit1 = 'power good' bit, in the sense of the old DS
bit2-3: level from 0 to 3

----

DSi BPTWL Battery Level register and DS Powerman Battery Status:
0F, 0B, 07 (full, three bars, two bars) = 0 (okay)
03, 01 (one red bar, one flashing red bar) = 1 (low)
Rayyan |
GBATEK seems to have info on this:

0x20 1 Battery flags. When zero the battery is at critical level, arm7 does a shutdown. Bit7 is set when the battery is charging. Battery levels in the low 4-bits: battery icon bars full 0xF, 3 bars 0xB, 2 bars 0x7, one solid red bar 0x3, and one blinking red bar 0x1. When plugging in or removing recharge cord, this value increases/decreases between the real battery level and 0xF, thus the battery level while bit7 is set is useless.
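To make the bit layout discussed in these two posts concrete, here is a minimal C sketch of a decoder. The struct and function names are hypothetical, and the reading of bit0/bit1 follows the tentative interpretation above (bit0 = not critical, bit1 = 'power good'), so treat it as an illustration rather than a confirmed register description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of the BPTWL battery register (0x20). */
typedef struct {
    bool charging;  /* bit7: level bits are unreliable while set */
    bool critical;  /* bit0 clear: ARM7 is expected to shut down */
    bool low;       /* level bits zero: maps to DS Powerman status 1 (low) */
    bool blinking;  /* bit1 ('power good') clear but not yet critical */
    int  bars;      /* 0..4 battery icon bars */
} BatteryState;

BatteryState decode_battery(uint8_t reg)
{
    BatteryState s;
    unsigned level = (reg >> 2) & 3u;      /* bit2-3: level 0..3 */
    bool power_good = (reg & 0x02u) != 0;  /* bit1 */

    s.charging = (reg & 0x80u) != 0;
    s.critical = (reg & 0x01u) == 0;       /* assumption: bit0 = not-critical */
    s.low      = (level == 0);
    s.blinking = !s.critical && !power_good;
    s.bars     = s.critical ? 0 : (int)(1u + level);
    return s;
}

/* Checks against the values observed in the posts above. */
void decode_battery_examples(void)
{
    assert(decode_battery(0x8F).charging);
    assert(decode_battery(0x0F).bars == 4 && !decode_battery(0x0F).low);
    assert(decode_battery(0x0B).bars == 3);
    assert(decode_battery(0x07).bars == 2);
    assert(decode_battery(0x03).bars == 1 && decode_battery(0x03).low);
    assert(decode_battery(0x01).blinking);
    assert(decode_battery(0x00).critical);
}
```

Since the level is reported as useless while bit7 is set, a caller would probably want to ignore `bars` whenever `charging` is true.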
Generic aka RSDuck |
DSlite specific registers, provided by Gericom:

#pragma once
#include <nds/ndstypes.h> //for vu16 (assumes libnds; any volatile 16-bit type works)

//As far as I know these registers should only exist on a DSLite

//On boot, the firmware writes 0xFFFF to REG_REGCNT, locking both
//reading and writing of the registers below, and making them
//impossible to be used (since REG_REGCNT is write-once)

//In order to prevent this flashme can be used. When the direct-boot
//keycombo A+B+START+SELECT is held while booting the lock write
//will not happen, and as such the registers remain usable

//Lockout register for nitro2 features, WRITE-ONLY and WRITE-ONCE!!
#define REG_REGCNT (*(vu16*)0x04001080)

#define REGCNT_WE0 (1 << 0) //disables writing to REG_DISPCNT2
#define REGCNT_WE1 (1 << 1) //disables writing to REG_DISPSW
#define REGCNT_WE2 (1 << 2) //disables writing to REG_CLK11M

#define REGCNT_RE0 (1 << 8) //disables reading from REG_DISPCNT2
#define REGCNT_RE1 (1 << 9) //disables reading from REG_DISPSW
#define REGCNT_RE2 (1 << 10) //disables reading from REG_CLK11M

//Selects dual or single screen mode
#define REG_DISPCNT2 (*(vu16*)0x04001090)

#define DISPCNT2_MOD_DUAL_SCREEN 0 //default mode with 2 screens
#define DISPCNT2_MOD_SINGLE_SCREEN 1 //disables the top screen and enables some special features

//Configures single screen mode
//Note that main and sub here refer to the main and sub screens as configurable in this register
//and NOT the main and sub engines
#define REG_DISPSW (*(vu16*)0x040010A0)

//Selects the display mode
#define DISPSW_WIN_SHIFT 0
#define DISPSW_WIN_MASK (3 << DISPSW_WIN_SHIFT)
#define DISPSW_WIN(x) ((x) << DISPSW_WIN_SHIFT)

#define DISPSW_WIN_MAIN_ONLY 0 //displays only the main screen
#define DISPSW_WIN_MAIN_FULL_SUB 1 //blends the main screen with the sub screen
#define DISPSW_WIN_MAIN_HALF_SUB_BOTTOM_LEFT 2 //displays the sub screen at 128x96 in the bottom-left corner with optional blending
#define DISPSW_WIN_MAIN_HALF_SUB_BOTTOM_RIGHT 3 //displays the sub screen at 128x96 in the bottom-right corner with optional blending

#define DISPSW_A_SHIFT 4
#define DISPSW_A_MASK (3 << DISPSW_A_SHIFT)
#define DISPSW_A(x) ((x) << DISPSW_A_SHIFT)

//Blending for DISPSW_WIN_MAIN_FULL_SUB mode
#define DISPSW_A_FULL_7_1 0 //main 7/8, sub 1/8
#define DISPSW_A_FULL_6_2 1 //main 6/8, sub 2/8
#define DISPSW_A_FULL_5_3 2 //main 5/8, sub 3/8
#define DISPSW_A_FULL_4_4 3 //main 4/8, sub 4/8

//Blending for DISPSW_WIN_MAIN_HALF_SUB modes
#define DISPSW_A_HALF_3_1 0 //main 3/4, sub 1/4
#define DISPSW_A_HALF_2_2 1 //main 2/4, sub 2/4
#define DISPSW_A_HALF_1_3 2 //main 1/4, sub 3/4
#define DISPSW_A_HALF_0_4 3 //main 0/4, sub 4/4

//select the screen to output on tv when tv-out is on
#define DISPSW_M0_SHIFT 8
#define DISPSW_M0_MASK (1 << DISPSW_M0_SHIFT)
#define DISPSW_M0(x) ((x) << DISPSW_M0_SHIFT)

#define DISPSW_M0_TV_SUB 0 //output the sub screen to tv-out
#define DISPSW_M0_TV_MAIN 1 //output the main screen to tv-out

//select which screen's picture is main and which is sub
#define DISPSW_M1_SHIFT 9
#define DISPSW_M1_MASK (1 << DISPSW_M1_SHIFT)
#define DISPSW_M1(x) ((x) << DISPSW_M1_SHIFT)

#define DISPSW_M1_MAIN_BOTTOM_SUB_TOP 0 //main = bottom screen, sub = top screen
#define DISPSW_M1_MAIN_TOP_SUB_BOTTOM 1 //main = top screen, sub = bottom screen

//tv-out enable/disable
#define DISPSW_TVOUT_SHIFT 14
#define DISPSW_TVOUT_MASK (1 << DISPSW_TVOUT_SHIFT)
#define DISPSW_TVOUT(x) ((x) << DISPSW_TVOUT_SHIFT)

#define DISPSW_TVOUT_DISABLED 0 //disables tv-out, top screen will be white
#define DISPSW_TVOUT_ENABLED 1 //enables tv-out, top screen signals are used to output 10 bit digital NTSC

//key input enable/disable
//in single screen mode three of the top screen signals are used as button inputs to configure the output
//the hardware will change the register values of WIN, A and M01 when the buttons are pressed
//this feature can be disabled by setting this bit
#define DISPSW_KEYLOCK_SHIFT 15
#define DISPSW_KEYLOCK_MASK (1 << DISPSW_KEYLOCK_SHIFT)
#define DISPSW_KEYLOCK(x) ((x) << DISPSW_KEYLOCK_SHIFT)

#define DISPSW_KEYLOCK_DISABLED 0 //allow configuring the display mode using the buttons
#define DISPSW_KEYLOCK_ENABLED 1 //the buttons will do nothing

//Enables/disables outputting an 11MHz clock on the CLK11M pin of the SOC
//I believe this signal is not actually connected to anything
#define REG_CLK11M (*(vu16*)0x040010B0)

#define CLK11M_CK11_LOW 0
#define CLK11M_CK11_ACTIVE 1
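As an illustration of how the DISPSW fields combine, here is a small host-side C sketch that packs the fields into a 16-bit value. The shift constants are copied from the header above; `dispsw_pack` is a hypothetical helper, not part of Gericom's header, and it only computes a value. Writing that value to REG_DISPSW would still require the registers to be unlocked as described in the header comments.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions copied from the header above. */
#define DISPSW_WIN_SHIFT     0
#define DISPSW_A_SHIFT       4
#define DISPSW_M0_SHIFT      8
#define DISPSW_M1_SHIFT      9
#define DISPSW_TVOUT_SHIFT   14
#define DISPSW_KEYLOCK_SHIFT 15

/* Hypothetical helper: packs the individual fields into one DISPSW value. */
uint16_t dispsw_pack(unsigned win, unsigned a, unsigned m0, unsigned m1,
                     unsigned tvout, unsigned keylock)
{
    /* WIN and A are 2-bit fields, the rest are single bits. */
    assert(win <= 3 && a <= 3);
    assert(m0 <= 1 && m1 <= 1 && tvout <= 1 && keylock <= 1);

    return (uint16_t)((win     << DISPSW_WIN_SHIFT)   |
                      (a       << DISPSW_A_SHIFT)     |
                      (m0      << DISPSW_M0_SHIFT)    |
                      (m1      << DISPSW_M1_SHIFT)    |
                      (tvout   << DISPSW_TVOUT_SHIFT) |
                      (keylock << DISPSW_KEYLOCK_SHIFT));
}
```

For example, sub screen in the bottom-right corner (WIN=3) with 2/4 blending (A=1), main screen to TV (M0=1), main = top screen (M1=1), tv-out enabled and the buttons locked packs to 0xC313.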
StrikerX3 |
Recently I've been implementing the ARM9 cache on my NDS emulator and made several tests on real hardware (a 3DS) to figure out how exactly it works, because the ARM manuals are as unhelpful as ever. My emulator fully implements the instruction and data caches (minus the lockdown feature), actually storing and retrieving data from a separate block of memory and having the cache do line fetches and flushes, and partially implements the write buffer -- the FIFO exists, but is drained as soon as anything goes into it. All ROMs I tested seem to run normally, with around 5-40% performance loss depending on the title compared to no cache emulation. The larger losses happen on titles that already ran crazy fast (500+ fps), so they're still very much playable in real time.
Shoutouts to Gericom, Generic, asie and AntonioND on the gbadev Discord; they gave great feedback, suggestions, insights and coding help on this research. So, here are my findings on the ARM9 cache.

NDS ARM9 cache inner workings

The basics
Write buffer
Replacement strategies

NOTE: research is ongoing on this subject. This is a summary of the behavior I observed from collected data. Data was collected from my 3DS with a test program that:
Known facts
Here are a few hypotheses on how the replacement counter(s) might be implemented. Hypothesis 1: one shared counter for both strategies
Hypothesis 2: separate counters for each strategy
Here's the messy code of the test ROM I used if you want to play around. Can be compiled with devkitPro using the Makefile (dkp), or BlocksDS with the regular Makefile + Makefile.include. main.cpp must be in a folder called source. The replacement strategy tests mess with the PU table and assume region 3 is the GBA cart area or the DSi switchable IWRAM; I haven't checked if the PU regions are the same when compiled with dkP, so it might not work. |
PoroCYon |
I found some more misc stuff, so:
Second NTR cartridge registers

GBATEK is right that there's a second cartridge slot at 0x040021Ax (with its own AUXSPICNT/DAT, ROMCTRL, etc, and data-in at 0x04102010), but this has some consequences that are not always listed:
SCFG
CAM_MCNT (A9)
ConsoleID format

This one seems to be a die/wafer/lot ID number (which you can also see in eg. MSP430 TLV data), with format (from LSB to MSB):

WIFIWAITCNT

bit 7: MCLK disable?
PoroCYon |
Nintendo DSi XL testpoint names

The testpoints on the regular DSi mobo (CPU-TWL-01) are named, the ones on the DSi XL (CPU-UTL-01) aren't. So I went over them all with a continuity tester to figure it out. Took me ~12h spread across 4 days.
Arisotura |
entry 0x67 in firmware user settings: it's the RTC clock adjust value

on DSi it is stored at offset 0x88 in HWINFO_N.dat

also 2FFFDE8/2FFFDEC aren't specifically the RTC date/time, these addresses are used for RTC IO in general
AntonioND |
MMIO[82C0h/82C2h+(0..1)*80h] - BTDMP Receive/Transmit FIFO Status (R)
...
4 FIFO Empty (0=No, 1=Empty, 0x16bit words)

This is incorrect. The bit is 0 when the FIFO is empty, 1 when it isn't empty. I would need to verify it more, but I've seen that with a couple of quick tests.
CasualPokePlayer |
075h 1 Extended Language (0..5=Same as Entry 064h, plus 6=Chinese)
       (for language 6, entry 064h defaults to english; for compatibility)
       (for language 0..5, both entries 064h and 075h have same value)
076h 2 Bitmask for Supported Languages (Bit0..6)
       (007Eh for iQue DS, ie. with chinese, but without japanese)
       (0042h for iQue DSi, chinese (and english, but only for NDS mode))
       (003Eh for DSi/EUR, ie. without chinese, and without japanese)

Language 7 is Korean. Similar to Chinese, this is only represented in the extended language field in firmware userdata; the older language field will just be set to 1 (English). This also means the supported language bitmask is one bit larger (bit 7 for Korean). Presumably DSi specific settings follow the same rules, although I don't have a Korean NAND to verify that.

01Dh 1 Console type
       FFh=Nintendo DS
       20h=Nintendo DS-lite
       57h=Nintendo DSi (also iQueDSi)
       43h=iQueDS
       63h=iQueDS-lite
       The entry was unused (FFh) in older NDS, ie. replace FFh by 00h)
       Bit0 seems to be DSi/iQue related
       Bit1 seems to be DSi/iQue related
       Bit2 seems to be DSi related
       Bit3 zero
       Bit4 seems to be DSi related
       Bit5 seems to be DS-Lite related
       Bit6 indicates presence of "extended" user settings (DSi/iQue)
       Bit7 zero

0x35 is the console type for Korean DS Lite, which also has extended user settings. So gbatek's description of bit 6 seems to be wrong? Bit 0 seems to be more indicative of having extended user settings.
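The rules in this post can be sketched in C as follows. All function names are hypothetical, and the bit 0 test for extended settings is only the tentative observation above (with FFh treated as 00h, as GBATEK suggests for older NDS), not a confirmed rule. Falling back to English for any value above 5 is likewise an assumption beyond the two documented cases (6 and 7).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { LANG_ENGLISH = 1, LANG_CHINESE = 6, LANG_KOREAN = 7 };

/* Value expected in the legacy language field (entry 064h) for a given
 * extended language (entry 075h): 0..5 are mirrored, Chinese and Korean
 * fall back to English. */
uint8_t legacy_language(uint8_t extended_language)
{
    return extended_language <= 5 ? extended_language : (uint8_t)LANG_ENGLISH;
}

/* Tests one bit of the supported-languages bitmask (entry 076h). */
bool language_supported(uint16_t mask, unsigned language)
{
    assert(language < 16);
    return ((mask >> language) & 1u) != 0;
}

/* Tentative: bit 0 of the console type (entry 01Dh) seems to indicate
 * extended user settings; the unused value FFh is treated as 00h. */
bool has_extended_settings(uint8_t console_type)
{
    if (console_type == 0xFF)
        return false;
    return (console_type & 0x01u) != 0;
}
```

With this reading, 0x35 (Korean DS Lite), 0x43, 0x63 and 0x57 all report extended settings, while 0x20 (DS-lite) and FFh (original DS) do not.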
Mighy Max |
Posted by StrikerX3

How sure are you about this?

I have spent the last few days measuring memory and instruction timings on the (Phat-)DS, my DSi and the various emulators, out of curiosity and because I think this is still poorly documented/implemented (yeah, it's a performance hit without much if any benefit).

The thing here is: all my measurements point to a 2kB data cache on the DS and only a 1kB data cache on the DSi. I checked multiple times whether my measurement is off, or whether some hardware setting (i.e. cache lockdown) is causing this, but to no avail. On all emulators (as expected) no cache size can be determined and all runs need the same time. The reference clock is the 33MHz timer on the arm9.

The result does not match the public info about a 4kB data cache.

What I do to measure the instruction and data cache sizes:

Enabled Instruction cache, disable cache lockdowns
for n in [8..16]:
  Do 11 runs of the following measurement and take the median time of these measurements:
    run a loop of (2^n)-3 'mov r8, r8' instructions, together with 3 instructions for the looping
  If the median divided by the amount of instructions exceeds the double of the previous run
    Report the previous run length as instruction cache size

For the data cache the rise of the execution time is not that sharp, because the instruction cache is disabled, which results in a larger prefetch time for each instruction. I do this to eliminate the effect of instruction caching on this measurement.

Disabled Instruction cache, disable cache lockdowns
Enabled Data cache, disable cache lockdowns
for n in [8..16]:
  Do 11 runs of the following measurement and take the median time of these measurements:
    run a loop of (2^n)-3 'ldr r2, [pc]' instructions, together with 3 instructions for the looping
  If the median divided by the amount of instructions exceeds 120% of the previous run
    Report the previous run length as data cache size

I can provide a simple .NDS testing this behaviour, if someone wants to double check.
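The evaluation step of the procedure above (median of the runs, then looking for a jump in time per instruction) can be sketched on the host side like this. It is only my reading of the description: the function names are hypothetical, the timing samples would come from the hardware timer, and the result is a run length in instructions (multiplying by 4 bytes per ARM instruction to get a size in bytes is an assumption about what "run length" means).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Median of the timing samples (sorts the array in place). */
uint32_t median_u32(uint32_t *samples, size_t count)
{
    assert(count > 0);
    qsort(samples, count, sizeof samples[0], cmp_u32);
    return samples[count / 2];
}

/* per_instr[i] is the median time divided by the instruction count for a
 * loop of 2^(first_n + i) instructions. Returns the run length (in
 * instructions) of the last run before the time per instruction exceeds
 * `ratio` times the previous one, or 0 if no such jump is found.
 * ratio = 2.0 for the instruction cache test, 1.2 for the data cache test. */
uint32_t detect_run_length(const double *per_instr, size_t count,
                           unsigned first_n, double ratio)
{
    for (size_t i = 1; i < count; i++) {
        if (per_instr[i] > per_instr[i - 1] * ratio)
            return (uint32_t)1 << (first_n + (unsigned)i - 1);
    }
    return 0;
}
```

For instance, if the time per instruction stays flat up to n = 11 and jumps at n = 12, the function returns 2048 instructions, i.e. 8192 bytes of ARM code.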
Mighy Max |
Hello again,
I cleaned up and published a little app to measure cache and memory timings. The issue with the cache did not resolve. I'm quite certain that the 4kB of data cache are not present (or at least not effective) on the DS and DSi. The measurements are consistent on the HW between runs and variants of the test, while they are different for the generations (DS vs DSi).

You can find the source code to measure and verify yourself at https://github.com/DesperateProgrammer/DSMemoryCycleCounter

I hope the data can improve the accuracy of the emulators in the future. I am aware that it has not much benefit at the moment, as implementing the mechanics of a cache would impose a significant performance impact in the emulation. Maybe the timings of the non-cached memory regions can be improved and compatibility increases. The app might provide a benchmark for this.

If you happen to find an error in the calculation/measurement of cycles please let me know. There are still a lot of things I want to include in the measurements, such as impact of DMA, verification of BUS priority, verification of cache-writeback sizes and much more.

Without further ado, the memory timings and cache sizes:

DSi Caches
==========
                   Reported  Measured
ICACHE Size        8192      8192
DCACHE Size        4096      1024      ! Conflict
ICACHE Line Size   32        32
DCACHE Line Size   32        32

DS Caches
==========
                   Reported  Measured
ICACHE Size        8192      8192
DCACHE Size        4096      2048      ! Conflict
ICACHE Line Size   32        32
DCACHE Line Size   32        32

DSi Memory Timings
==================
In cpu cycles @ 66MHz
           N16   S16   N32   S32
Main RAM   16    16    18    4
ITCM       2     2     2     2
DTCM       1     1     1     1
WRAM       8     8     8     2
VRAM       8     8     8     2
GBA ROM    26    26    38    24    EXMEMCNT[4..2] = 000
GBA ROM    22    22    34    24    EXMEMCNT[4..2] = 001
GBA ROM    18    18    30    24    EXMEMCNT[4..2] = 010
GBA ROM    42    42    52    24    EXMEMCNT[4..2] = 011
GBA ROM    26    26    34    8     EXMEMCNT[4..2] = 100
GBA ROM    22    22    26    8     EXMEMCNT[4..2] = 101
GBA ROM    18    18    26    8     EXMEMCNT[4..2] = 110
GBA ROM    42    42    50    8     EXMEMCNT[4..2] = 111
GBA RAM    26    26    26    20    EXMEMCNT[1..0] = 00
GBA RAM    22    22    22    16    EXMEMCNT[1..0] = 01
GBA RAM    18    18    18    12    EXMEMCNT[1..0] = 10
GBA RAM    42    42    42    36    EXMEMCNT[1..0] = 11

In cpu cycles @ 133MHz
           N16   S16   N32   S32
Main RAM   32    32    36    8
ITCM       2     2     2     2
DTCM       1     1     1     1
WRAM       12    12    12    4
VRAM       12    12    12    4
GBA ROM    48    48    72    48    EXMEMCNT[4..2] = 000
GBA ROM    40    40    64    48    EXMEMCNT[4..2] = 001
GBA ROM    32    32    56    48    EXMEMCNT[4..2] = 010
GBA ROM    80    80    104   48    EXMEMCNT[4..2] = 011
GBA ROM    48    48    64    32    EXMEMCNT[4..2] = 100
GBA ROM    40    40    56    32    EXMEMCNT[4..2] = 101
GBA ROM    32    32    48    32    EXMEMCNT[4..2] = 110
GBA ROM    80    80    96    32    EXMEMCNT[4..2] = 111
GBA RAM    48    48    48    40    EXMEMCNT[1..0] = 00
GBA RAM    40    40    40    32    EXMEMCNT[1..0] = 01
GBA RAM    32    32    32    24    EXMEMCNT[1..0] = 10
GBA RAM    80    80    80    72    EXMEMCNT[1..0] = 11

In bus cycles @ 66MHz
           N16   S16   N32   S32
Main RAM   8     8     9     2
ITCM       1     1     1     1
DTCM       0.5   0.5   0.5   0.5
WRAM       4     4     4     1
VRAM       4     4     4     1
GBA ROM    13    13    19    12    EXMEMCNT[4..2] = 000
GBA ROM    11    11    17    12    EXMEMCNT[4..2] = 001
GBA ROM    9     9     15    12    EXMEMCNT[4..2] = 010
GBA ROM    21    21    27    12    EXMEMCNT[4..2] = 011
GBA ROM    13    13    17    8     EXMEMCNT[4..2] = 100
GBA ROM    11    11    15    8     EXMEMCNT[4..2] = 101
GBA ROM    9     9     13    8     EXMEMCNT[4..2] = 110
GBA ROM    21    21    25    8     EXMEMCNT[4..2] = 111
GBA RAM    13    13    13    10    EXMEMCNT[1..0] = 00
GBA RAM    11    11    11    8     EXMEMCNT[1..0] = 01
GBA RAM    9     9     9     6     EXMEMCNT[1..0] = 10
GBA RAM    21    21    21    18    EXMEMCNT[1..0] = 11

In bus cycles @ 133MHz
           N16   S16   N32   S32
Main RAM   8     8     9     2
ITCM       0.5   0.5   0.5   0.5
DTCM       .25   .25   .25   .25
WRAM       3     3     3     1
VRAM       3     3     3     1
GBA ROM    12    12    18    12    EXMEMCNT[4..2] = 000
GBA ROM    10    10    16    12    EXMEMCNT[4..2] = 001
GBA ROM    8     8     14    12    EXMEMCNT[4..2] = 010
GBA ROM    20    20    26    12    EXMEMCNT[4..2] = 011
GBA ROM    12    12    16    8     EXMEMCNT[4..2] = 100
GBA ROM    10    10    14    8     EXMEMCNT[4..2] = 101
GBA ROM    8     8     12    8     EXMEMCNT[4..2] = 110
GBA ROM    20    20    24    8     EXMEMCNT[4..2] = 111
GBA RAM    12    12    12    10    EXMEMCNT[1..0] = 00
GBA RAM    10    10    10    8     EXMEMCNT[1..0] = 01
GBA RAM    8     8     8     6     EXMEMCNT[1..0] = 10
GBA RAM    20    20    20    18    EXMEMCNT[1..0] = 11

DS Memory Timings
=================
In cpu cycles:
           N16   S16   N32   S32
Main RAM   16    16    18    4
ITCM       2     2     2     2
DTCM       1     1     1     1
VRAM       8     8     8     2
GBA ROM/RAM *problems measuring*

In bus cycles:
           N16   S16   N32   S32
Main RAM   8     8     9     2
ITCM       1     1     1     1
DTCM       0.5   0.5   0.5   0.5
VRAM       4     4     4     1
GBA ROM/RAM *problems measuring*
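The relation between the cpu-cycle and bus-cycle tables above appears to be a plain division: by 2 at 66MHz and by 4 at 133MHz, which implies a bus clock of roughly 33MHz. A tiny C sketch of that conversion, with hypothetical names, in case someone wants to cross-check rows:

```c
#include <assert.h>

/* Number of ARM9 cycles per bus cycle, as implied by the tables above
 * (assumption: the bus runs at about 33MHz in both modes). */
typedef enum {
    ARM9_66MHZ  = 2,
    ARM9_133MHZ = 4
} Arm9Clock;

/* Converts a timing in ARM9 cycles to bus cycles. */
double cpu_to_bus_cycles(unsigned cpu_cycles, Arm9Clock clock)
{
    return (double)cpu_cycles / (double)clock;
}
```

Most rows convert cleanly this way (e.g. Main RAM N32: 18 cpu cycles at 66MHz and 36 at 133MHz both give 9 bus cycles); rows that don't are worth re-measuring.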
Jakly |
Disclaimer: this research is not 100% complete and all info in it may be subject to change.
DS Raster Timings and relevant info:

Testing done on a US new3DSxl.
One cycle in the context of the 3D GPU is 2 ARM7 cycles.

The basic steps in rasterization are as follows:
1a. Begin rasterizing two scanlines (a scanline pair).
1b-1. If possible, perform the finishing pass on two scanlines. (simultaneously with raster pass)
1b-2. Push finished scanlines to an intermediary buffer (48 scanlines big).
2. Check if the intermediary buffer has room for two scanlines, if not, wait for the 2d gpu to read enough scanlines.
3. Repeat the process until all 192 scanlines are done.

Note: the 3d gpu repeats this process as fast as it can, it doesn't wait for hblank or anything to start a new loop like the 2d gpu does. It does have to wait for the scanline buffer to be freed up by the 2d gpu, which limits how fast it can render scanlines.

Scanlines are rendered in two passes:
First pass: Scanlines are rendered simultaneously in pairs like so: (1 + 2) -> (3 + 4) -> (5 + 6)...
Second pass: Scanlines are finished simultaneously in pairs as well, just offset by -one, eg. (1) -> (2 + 3) -> (4 + 5)......(190 + 191) -> (192)
After they are finished they are then pushed to the scanline buffer. This process always takes a fixed amount of cycles (approx. 500, i think?)

Scanline Buffer:
There is a 48 scanline buffer where scanlines are stored that the 2d gpu then reads from.

Internal Scanline Buffer:
Seems likely there's two of them? Probably at least 3 scanlines big each?
Most likely structured as such:
sl0 - sl2 - sl4 - sl0
sl1 - sl3 - sl5 - sl1
I suspect this behavior because…

Screen border edge marking bug:
The gpu seems to do edge marking in such a way that left/right checks on the edge of the screen seem to result in viewing the second previous/second next scanline, largely matching up with the pairs the second pass uses.
Though depending on the exact timing behavior of the prev/next scanline it can wind up seeing the clear plane (not using the bitmap depth though, it always uses the clear depth reg's value?)
(exact timing details on this aren't provided due to me being slightly confused by them still…)

Scanline Timeout:
If the 2d gpu is about to need a scanline and there isn't a ready-to-go unread scanline in the buffer it'll force the 3d gpu to "abort" rasterization of the current scanline pair, and begin the process of edgemarking+etc. the scanline in time for the 2d gpu to need it.
DANGER: the first two scanlines can NOT be aborted. (dont ask why, I cannot even begin to theorize what went wrong here. We are in hard-mode research territory.)
This causes. A LOT of bugs.

Scanline Repeat:
You can delay the first scanline pair so long that the 2d gpu starts reading from the buffer without there actually being a new scanline ready yet. This results in scanlines from the last successfully rendered quarter of the screen being repeated.
Since the first pair has absolutely 0 timeout, you can delay it so long that subsequent scanlines also get delayed. Which allows you to break stuff even more.

Partially Updated Scanline reads:
It can also read scanlines while they're being submitted. This seems to result in it reading two pixels once every two cycles? But it also sometimes behaves slightly weird? Idk what's going on here tbh.

Rasterizer Collapse:
Delay scanline writes long enough and the gpu will eventually completely $#!% itself.
The image's stability will break down, lots of flashing colors, garbage being displayed.
Seemingly also results in the RDLines_Count and the Color Buffer Underflow Flag getting confused?
Display capture alters this behavior for some reason?
It can recover from this state if you reduce the number of polygons rendered (though sometimes it struggles for a bit).
All very strange.
Theories: My best guess is that scanline rendering gets so delayed that it winds up with scanlines queued up (due to full buffer) and no way to empty it out since scanlines aren't being read. Therefore causing the previous frame to linger through vblank, resulting in *serious* issues as it has the memory it was working with ripped out from under it and replaced with a new set via a buffer swap.
Basically i think understanding it fully requires ludicrously low level knowledge of how the gpu works.

Bottom Scanline Pair Underflow Behavior:
For some reason this scanline pair behaves oddly with regards to the top scanline pair bug. In such a manner that I can only say I don't understand it. Like underflowing it can lock in some scanline bugs? Prevent others? It's weirdddddddd.

RDLines_Count Register:
This register is a 6 bit unsigned integer.
It tracks the lowest number of unread scanlines in the intermediary buffer after each read from it.
It effectively works by updating an internal tracker whenever a scanline is read from the buffer. (though it seems to update this value 39/40 cycles (alternating every scanline for some reason?) earlier than I'd expect?)
If the value of the tracker is greater than the number of unread scanlines after each read it is updated to the new value.
This effectively results in the register's cap of 46. Because the buffer is 48 scanlines large, the gpu cannot render new scanlines if there isn't space for them to fit. And it always updates the tracker *after* reading. Thus you always end up with 2 scanlines missing.
The externally readable value appears to be updated to the internal tracker's value on VBlank (needs verification?)
The internal tracker's value is reset every frame.
On first boot (verified by installing a test rom as DSiware on 3DS) and when the 3d gpu's rasterizer is turned on (not off for some reason?) the value of this register is set to 63 (0x3F).
It starts updating to a proper value once the rasterizer is "fully enabled" (which seems to require pushing two swap buffers commands after powering the rasterizer on.)

3DDispCnt Color Buffer Underflow Bit:
Bit 12 of the 3DDispCnt register.
The bit is set if any of the scanlines in the frame were timed out. The bit appears to be set the moment this happens?
Once set, the bit can be cleared by writing a 1 to it.

Misc Timing Info:
As mentioned earlier, one cycle in the context of the 3D GPU will refer to 2 ARM7 cycles.

First Polygon Delays: The first polygon in a scanline has an extra delay that no successive polygons seem to possess. This delay seems to usually be a minimum of 4 cycles (but not always?) and increases slightly under certain factors. Seems to be based on slope? Translucency blending? Maybe some other stuff I'm not aware of? I don't really understand it yet.

Polygon Delay: 12 cycles.
Each polygon takes at least 12 cycles to begin for each scanline it's in.

Polygon Delay (Empty): 4 cycles.
The bottom most scanline of polygons, which is normally not rendered, does, in fact, have timing characteristics.

Min Spare To Start Rasterizing: 2
There must be 2 spare cycles after the initial polygon delay for the polygon to be rendered.

Free Pixels: 4
The first 4 pixels of a polygon appear to be free? No idea why.

Cycles Per Pixel: 1
Each pixel costs 1 cycle.
Pixels that are behind another incur a cycle cost.
Pixels not rendered due to edge fill still incur a cycle cost.

Second Pass + Scanline Pushing Length: 500(?) cycles.
Tbh im not 100% sure about this number.

Delay Between Scanline Reads: 809 cycles. (?)
Probably correct?

Time To Read Scanline: 256 cycles.
Pixels seem to be read at a rate of 2 pixels (simultaneously?) every 2 cycles? It's odd. Also sometimes not true.(?)

Texture Lookup: Not done simultaneously.
Seems like the two scanlines in a pair alternate between which one looks up textures every half-cycle (1 ARM7 cycle).
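The RDLines_Count behaviour described above can be modelled roughly as below. This is a sketch of my reading of the notes, not verified hardware behaviour: the names are hypothetical, the reset value of the internal tracker (46) is an assumption since the post only says it is reset every frame, and the 39/40 cycle early update is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define RDLINES_CAP      46u    /* observed cap: 48-line buffer minus 2 */
#define RDLINES_POWER_ON 0x3Fu  /* value seen after boot / rasterizer power-on */

typedef struct {
    uint8_t tracker;  /* internal minimum, updated after each scanline read */
    uint8_t visible;  /* externally readable 6-bit value */
} RdLinesCount;

void rdlines_power_on(RdLinesCount *r)
{
    r->tracker = RDLINES_CAP;
    r->visible = RDLINES_POWER_ON;
}

/* Assumption: the per-frame reset puts the tracker back at the cap. */
void rdlines_frame_start(RdLinesCount *r)
{
    r->tracker = RDLINES_CAP;
}

/* Called after the 2d gpu read a scanline; keeps the lowest count seen. */
void rdlines_after_read(RdLinesCount *r, unsigned unread_after_read)
{
    if (unread_after_read < r->tracker)
        r->tracker = (uint8_t)unread_after_read;
}

/* The readable value appears to be latched on VBlank. */
void rdlines_vblank(RdLinesCount *r)
{
    r->visible = r->tracker & 0x3Fu;
}
```

With this model a frame whose buffer never drops below 30 unread scanlines reads back 30 after VBlank, and a frame that never stalls reads back 46.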
Jakly |
Revised Rasterizer:
SCFG EXT 9: 0x04004008 - Bit 2

Compressed Texture Bugfix:
For compressed texture modes 1 and 3 they forgot to add 1 when converting the color space from 5 bit to 6 bit like with other textures. This results in the texture appearing 1/64th darker than it should.
The revised rasterizer partially alleviates this issue by fixing the bug for the uninterpolated colors of these modes. They didn't fix it for the interpolated colors though…

Stencil buffer clearing (shadow/mask polygons) "Bugfix":
Normally clearing the stencil buffer follows simple logic:
1. Set a (per stencil buffer (there's two stencil buffers!)) flag when shadow polygons are rendered.
2. When a shadow mask is rendered check for the aforementioned flag, if it is set, clear the stencil buffer and the flag.

With the revised rasterizer it follows much more complex logic:
1. When executing a "swap buffers/flush" command or submitting a shadow polygon set a "shadow sent" flag. (this part happens even with the revised rasterizer disabled).
2. If manual translucency sorting, the revised rasterizer bit, and the "shadow sent" flag are set, when next submitting a translucent shadow mask, assign it a "clear stencil buffer" flag.
3a. When going through all the polygons in sorted order, while searching for one on the current scanline, if it comes across a translucent shadow polygon, set a "shadow encountered" flag.
4a. When going through all the polygons in sorted order, while searching for one on the current scanline, if it comes across a polygon with the "clear stencil buffer" flag set, and the "shadow encountered" flag is also set, clear the stencil buffer.
3b. When rendering an opaque shadow polygon, set a "shadow rendered" flag.
4b. When rendering an opaque polygon with the "clear stencil buffer" flag set, and the "shadow rendered" flag is set, clear the stencil buffer and the "shadow rendered" flag.

Or to summarize: only use shadow/shadow masks that are translucent and only use them with manual translucency sorting enabled, otherwise they will actively be more broken than without the revision bit set.

Edit - May 8 2024: Fix some details about how the stencil buffer works under the revised rasterizer.
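To illustrate the compressed texture bug, here is a minimal C sketch of the 5 bit to 6 bit conversion with and without the missing +1. The function names are hypothetical, and leaving a value of 0 at 0 in the regular conversion is an assumption (it is how the conversion is commonly modelled in emulators), not something stated in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Regular 5 bit -> 6 bit expansion: double the value and add 1.
 * Assumption: zero stays zero. */
uint8_t expand5to6(uint8_t c5)
{
    assert(c5 <= 31);
    return c5 ? (uint8_t)(c5 * 2 + 1) : 0;
}

/* Expansion as described for compressed texture modes 1 and 3 without the
 * revised rasterizer: the +1 is missing, so nonzero colors end up one step
 * (1/64th) darker. */
uint8_t expand5to6_missing_add(uint8_t c5)
{
    assert(c5 <= 31);
    return (uint8_t)(c5 * 2);
}
```

Full white is the easiest case to spot: 31 expands to 63 normally but only to 62 with the bug.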
Jakly |
Some Notes on NDS ARM946E-S Pipelining and Timings:
Disclaimers: this is all tested on a new3DSxl; some details may vary slightly based on the DS revision, or even the memory chip used. We will, as a community, need to check these findings on other models to verify whether they apply there as well. This info is also very incomplete and results in my model are still inaccurate, but I figured I should share my current findings to aid other interested folks in understanding and researching this subject. I have also done my testing with the base DS CPU clock, and have not tested with the DSi clock speed. There may be some behavioral differences with it.

Note: All timings mentioned are in ARM9 clock cycles.

Basics:
The pipeline is structured as such: Fetch -> Decode -> Execute -> Memory -> Writeback

Fetch (F): where code is fetched.
Decode (D): where the type and registers used by the previously fetched instruction are determined. Execute stage interlocks stall here?
Execute (X): where the bulk of the instruction is executed. Memory stage interlocks stall here?
Memory (M): an instruction-variable stage. Most instructions will simply "buffer" here for a cycle, though some (mostly multiply, qadd, load/store instructions) will actually do things of note here. This is where the bulk of interlocks originate from.
Writeback (W): where instructions write back their outputs to the register bank. Some instructions (non-word ldrs, and misaligned ldr/swp) will do some extra processing here; this is a secondary source of interlocks.

Forwarding: A technique used to allow instructions to use a result from a prior instruction without needing to wait for writeback to occur. Can be ignored in most cases since it's designed to be largely invisible to the program, though there is at least one bug with it. (I'm not explaining the bug here right now, but it's in easily googleable ARM946E-S errata documents.)

How the Pipeline Pipelines:
...Yeah, that's the subtitle I'm going with. The general rule is that it processes the stages in this order:
1. Memory goes first; it is (slightly unsure about this next bit actually, since interlocks are still weird in my model) followed immediately by the Writeback stage.
2. Overlapping the Memory stage by (5 probably?) cycles, the Fetch stage begins.
3. Next, with a 1 cycle overlap with the Fetch AND Memory stage (whichever ends last), the Execute stage begins. NOTE: The order of the Fetch and Execute stages does make a difference, due to main RAM contention behaviors. (At some point in here the Decode stage occurs?, but it perfectly overlaps with either the Execute or Fetch stage, and only lasts one cycle, so it doesn't matter...?)
4. Then, with a 0 cycle overlap, the next Memory stage begins.

Memory Stage Overlaps:
This is where the fun begins. The Fetch stage will usually try to begin 5 cycles early, up to a max of the length of the last access of the burst (and bus aligned, ofc). This is complicated by the fixed 5 cycle penalty on non-sequential (ns) accesses. It seems to occur at the end of the first access in an ldm? and at the end of the last access in an stm. This has the consequence that, say, an ldm to main RAM can only begin 3 cycles early, since the last access is only 3 cycles, but a main RAM stm can begin the full 5 cycles early, since the last access has the fixed delay as part of it.
NOTE: Yes, main RAM accesses have the 5 cycle penalty applied. The actual main RAM timings should be 3n bus cycles for stores and 5n bus cycles for loads (for 16 bit accesses at least); this is supported by the timings for ARM7 and ARM9 DMA being that.

Interlocks:
This is where the PAIN begins. That aside, the gist is pretty simple. When a result from a prior instruction is needed and is unable to be forwarded, due to the result not having been fetched or calculated yet, the instruction will be forced to wait until the result is available to begin executing (or memory-ing? strs can interlock in the Memory stage...). This results in a 1-6(!!!) cycle delay, as it must wait until the prior Memory (or Writeback in some cases) stage is done to begin executing (effectively 0 cycles early, or 1 cycle late in the case of Writeback interlocks).
NOTE: interlocks delay the beginning of the Fetch stage and not just the Execute stage.

Bus Cycle Alignment:
On the NDS, the ARM946E-S runs at 2x the bus clock. But every single non-TCM/Cache access must go through the bus to be performed. What this means is that the CPU will *frequently* have to wait an extra cycle for a fetch to align with the bus clock so it can begin.

Misc:
Some very important notes that didn't really belong anywhere else?

Fetch Timing Note: Due to the nature of the bus cycle, (most) fetch timings can actually be thought of as 1 cycle shorter than normally assumed. (All timings assume 32 bit accesses.)
Main RAM: N = 17, S = 3
VRAM(NTR): N = 9, S = 3
WRAM: N = 7, S = 1
TCM/Cache: N = 1, S = 1 (not reduced)
NOTE: due to the aforementioned bus cycle alignment, an ldm/stm will take an extra cycle between each fetch if the access requires it, thus making the fact that sequential accesses technically take a cycle less than most would assume largely moot.

Instruction Timings:
The following instructions appear to be 1X 2M rather than 2X 1M like documentation seemingly claims?
MUL/MLA/MRS/MRC/SMLALxy: 1 Execute + 2 Memory

Edit - Aug 24, 2024: Update the pipelining order to clarify that Fetch occurs before Execute
Edit - Nov 13, 2024: Correct several misc. details |
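A rough sketch of how the "Bus Cycle Alignment" and "Fetch Timing Note" parts of the post above could fit together, in Python. The N/S values are the reduced timings listed in the post; everything else is an assumption made for illustration: the names `TIMINGS`, `align_to_bus` and `burst_end` are hypothetical, bus clock edges are assumed to fall on even ARM9 cycles, and the sketch ignores Memory stage overlaps, interlocks and main RAM contention entirely.

```python
# Reduced fetch timings from the post, in ARM9 cycles, for 32-bit accesses.
# Each entry is (N, S): non-sequential and sequential access length.
TIMINGS = {
    "main_ram": (17, 3),
    "vram": (9, 3),
    "wram": (7, 1),
    "tcm_cache": (1, 1),  # not reduced; TCM/cache accesses never use the bus
}

# The ARM9 runs at 2x the bus clock on the NDS.
ARM9_CYCLES_PER_BUS_CYCLE = 2


def align_to_bus(cycle):
    """Round an ARM9 cycle count up to the next bus clock edge.

    Assumption: bus edges fall on even ARM9 cycles.
    """
    remainder = cycle % ARM9_CYCLES_PER_BUS_CYCLE
    if remainder == 0:
        return cycle
    return cycle + ARM9_CYCLES_PER_BUS_CYCLE - remainder


def burst_end(start, region, words):
    """Return the ARM9 cycle at which a burst of `words` accesses ends.

    Every bus access is re-aligned before it begins, which is why the
    "1 cycle shorter" sequential timings end up largely moot in a burst.
    """
    n_cycles, s_cycles = TIMINGS[region]
    uses_bus = region != "tcm_cache"
    cycle = start
    for index in range(words):
        if uses_bus:
            cycle = align_to_bus(cycle)
        cycle += n_cycles if index == 0 else s_cycles
    return cycle
```

Under these assumptions, `burst_end(0, "main_ram", 3)` gives 25: the first access ends at cycle 17, the second waits one cycle for the bus and ends at 21, and the third waits again and ends at 25. So each access effectively costs one cycle more than its listed length.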
RakiSama |
|
asie |
| ||
Newcomer Normal user Level: 2 Posts: 1/1 EXP: 25 Next: 21 Since: 03-19-23 Last post: 22 days ago Last view: 2 days ago |
DLDI driver addendum:
BUG: DLDI patchers derived from libnds 1.x, used by some (I'm not sure how many) DS homebrew programs for on-device patching, only patch addresses in the range of (ALL start, ALL end). This means that if addresses exist in a FIX section (like GOT) which point to zero-initialized BSS memory, these will not be patched.

Other programs behave differently:
- Patchers derived from dlditool, including MoonShell, patch addresses in the range of (ALL start, ALL start + (1 << log2(driver size))); as ALL start is always the base address, this seems fine.
- Patchers derived from Xenon's code, including dldipatch, patch addresses in the range of (ALL start, BSS end).

This also means that, for maximum compatibility, the BSS start/end region addresses should always be valid, even if FIX_BSS is not used.

In addition, as of libnds 2.0.0/calico, the "dldi area should be located at a 40h-byte aligned address in ROM image" assumption no longer holds true. For two binaries I checked, it appeared to be at location 0x4CF0 instead, but one should not rely on it. |
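To make the difference between the three patcher families above concrete, here is a minimal Python sketch of the address ranges they consider when relocating pointers. The function names, the patcher labels and the example addresses are all hypothetical, and treating the ranges as half-open (start inclusive, end exclusive) is an assumption; the post only gives the ranges themselves.

```python
def patch_range(patcher, all_start, all_end, bss_end, size_log2):
    """Return the (low, high) address range a patcher family relocates.

    `size_log2` is log2 of the driver size, so the dlditool-style range
    spans the full power-of-two driver area starting at ALL start.
    """
    if patcher == "libnds1":
        return (all_start, all_end)
    if patcher == "dlditool":
        return (all_start, all_start + (1 << size_log2))
    if patcher == "dldipatch":
        return (all_start, bss_end)
    raise ValueError("unknown patcher family: " + patcher)


def is_patched(address, patcher, all_start, all_end, bss_end, size_log2):
    """Check whether a pointer value falls inside the patched range.

    Assumption: the range is half-open, i.e. low <= address < high.
    """
    low, high = patch_range(patcher, all_start, all_end, bss_end, size_log2)
    return low <= address < high


# Illustrative layout only: ALL covers 0x1000 bytes, BSS ends 0x200 bytes
# after ALL end, and the driver area is 8 KiB (size_log2 = 13).
BASE = 0xBF800000
LAYOUT = dict(all_start=BASE, all_end=BASE + 0x1000,
              bss_end=BASE + 0x1200, size_log2=13)
BSS_POINTER = BASE + 0x1100  # e.g. a GOT entry pointing into BSS
```

With this layout, `is_patched(BSS_POINTER, "libnds1", **LAYOUT)` is False while the "dlditool" and "dldipatch" variants return True, which is the case the BUG note describes: a pointer into zero-initialized BSS is left unrelocated by libnds 1.x-derived patchers.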
Jakly |
| ||
Newcomer Normal user Level: 5 Posts: 8/8 EXP: 367 Next: 162 Since: 03-22-24 Last post: 15 days ago Last view: 10 days ago |
Cache Streaming:
When disabled, filling a cache line forces the CPU to wait until the entire cache line is fetched to continue execution.
When enabled, filling a cache line only forces the CPU to wait until the needed word is fetched to continue execution. The rest of the cache line continues fetching in the background asynchronously.
Note: both of the above will always end on a non-bus aligned cycle, following the same logic as mentioned in my previous post.

Sequential accesses chain this and result in each word being streamed in successively.
Note: All code fetches count as sequential unless following a pipeline flush (branches/jumps).

If a sequential access begins after the next word is fully streamed in (can occur with code fetches), then you are forced to wait until the entire cache line finishes streaming in, with an extra 2 cycle penalty.

If a non-sequential access on the same bus as the stream occurs, you are forced to wait until the entire cache line finishes streaming in, plus one cycle. This applies even if the access is to a TCM or fully cached address. (Have not tested if it applies to aborted accesses?)

If a non-TCM/aborted/cache-hit access on the opposite bus begins, then it is forced to wait until 6 (5 in practice? and I think the TWL clock would be 11, but that's untested) cycles before the stream would complete to begin. And I'll be honest, I'm not entirely sure why... It is consistent with overlap behavior with the Memory stage and code fetches, though.

I believe an instruction cache prefetch "cp15 0, C7, C13, 1" does not use streaming, but I haven't tested that thoroughly.

Tangentially related: I believe that cache streaming is handled before the write buffer starts writing back dirty cache lines in the case of the dcache. This also has not been tested thoroughly. |
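The basic enabled/disabled difference described above can be sketched in Python as follows. This is only an illustration under several assumptions: the function names are hypothetical, the fill is assumed to start at word 0 of the 8-word (32-byte) cache line, the per-word costs are passed in by the caller, and bus alignment plus the sequential/non-sequential penalties from the post are left out.

```python
WORDS_PER_LINE = 8  # a 32-byte cache line holds eight 32-bit words


def word_arrival_times(first_cycles, next_cycles):
    """Return the cycle at which each word of the line has arrived.

    Assumption: the fill starts at word 0, the first word costs
    `first_cycles` and every following word costs `next_cycles`.
    """
    times = []
    cycle = 0
    for index in range(WORDS_PER_LINE):
        cycle += first_cycles if index == 0 else next_cycles
        times.append(cycle)
    return times


def miss_stall(word_index, streaming, first_cycles, next_cycles):
    """Return how long the CPU waits on a miss for `word_index`.

    With streaming enabled the CPU resumes once the needed word is in;
    with streaming disabled it waits for the whole line.
    """
    times = word_arrival_times(first_cycles, next_cycles)
    if streaming:
        return times[word_index]
    return times[-1]
```

Plugging in the main RAM values from the earlier post (17 and 3) purely as placeholders, a miss on word 2 stalls for 23 cycles with streaming enabled and 38 cycles with it disabled; a miss on the last word of the line costs 38 cycles either way.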