GBAtek addendum/errata - melonDS board



Arisotura

Posted on 11-22-21 07:52 PM (rev. 6 of 11-28-21 11:57 AM by Rayyan)


I'm letting the DSi discharge while running a test program that keeps track of changes in the battery register

so far:

when charging: 8F

full -> empty:
0F
0B
07
03 (light red)
01 (light red, blinking)

so

bit0 = not-critical bit?
bit1 = 'power good' bit, in the sense of the old DS
bit2-3: level from 0 to 3

----

DSi BPTWL Battery Level register and DS Powerman Battery Status

0F, 0B, 07 (full, three bars, two bars) = 0 (okay)
03, 01 (one red bar, one flashing red bar) = 1 (low)
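A minimal C sketch of the decoding above, in case it helps when wiring this into an emulator. The struct and function names are made up for illustration, the meanings of bit0 and bit1 are still guesses from the discharge test, and the Powerman mapping assumes that "low" simply means level 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of the BPTWL battery level register. */
typedef struct {
    bool charging;      /* bit 7: set while charging (8F in the test) */
    unsigned level;     /* bits 2-3: level from 0 to 3 */
    bool power_good;    /* bit 1: guessed meaning */
    bool not_critical;  /* bit 0: guessed meaning */
} BptwlBattery;

static inline BptwlBattery bptwl_decode_battery(uint8_t reg)
{
    BptwlBattery b;
    b.charging     = (reg & 0x80) != 0;
    b.level        = (reg >> 2) & 3u;
    b.power_good   = (reg & 0x02) != 0;
    b.not_critical = (reg & 0x01) != 0;
    return b;
}

/* DS Powerman battery status: 0 = okay, 1 = low.
   Assumption: "low" corresponds to level 0 (values 03 and 01). */
static inline unsigned ds_powerman_battery_status(uint8_t reg)
{
    return bptwl_decode_battery(reg).level == 0 ? 1u : 0u;
}
```

With this, 0F, 0B and 07 map to 0 (okay) and 03 and 01 map to 1 (low), matching the table above.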

____________________
Kuribo64

Rayyan

Posted on 11-22-21 09:54 PM


GBATEK seems to have info on this:

  0x20    1       Battery flags. When zero the battery is at critical level,
                    arm7 does a shutdown. Bit7 is set when the battery is
                    charging. Battery levels in the low 4-bits: battery icon
                    bars full 0xF, 3 bars 0xB, 2 bars 0x7, one solid red bar
                    0x3, and one blinking red bar 0x1. When plugging in or
                    removing recharge cord, this value increases/decreases
                    between the real battery level and 0xF, thus the battery
                    level while bit7 is set is useless.

____________________

How to write an emulator
1. throw code to be emulated somewhere
2. make memory system that allows accessing that code
3. emulate CPU
4. have fun implementing all the other hardware
-- Arisotura, Tuesday 5th January 2021, 22:00:17

Generic aka RSDuck

Posted on 01-16-23 03:35 PM


DSlite specific registers, provided by Gericom:

#pragma once

//As far as I know these registers should only exist on a DSLite

//On boot, the firmware writes 0xFFFF to REG_REGCNT, locking both
//reading and writing of the registers below, and making them
//impossible to be used (since REG_REGCNT is write-once)

//In order to prevent this flashme can be used. When the direct-boot
//keycombo A+B+START+SELECT is held while booting the lock write
//will not happen, and as such the registers remain usable

//Lockout register for nitro2 features, WRITE-ONLY and WRITE-ONCE!!
#define REG_REGCNT          (*(vu16*)0x04001080)

#define REGCNT_WE0          (1 <<  0) //disables writing to REG_DISPCNT2
#define REGCNT_WE1          (1 <<  1) //disables writing to REG_DISPSW
#define REGCNT_WE2          (1 <<  2) //disables writing to REG_CLK11M

#define REGCNT_RE0          (1 <<  8) //disables reading from REG_DISPCNT2
#define REGCNT_RE1          (1 <<  9) //disables reading from REG_DISPSW
#define REGCNT_RE2          (1 << 10) //disables reading from REG_CLK11M

//Selects dual or single screen mode
#define REG_DISPCNT2        (*(vu16*)0x04001090)

#define DISPCNT2_MOD_DUAL_SCREEN    0 //default mode with 2 screens
#define DISPCNT2_MOD_SINGLE_SCREEN  1 //disables the top screen and enables some special features

//Configures single screen mode
//Note that main and sub here refer to the main and sub screens as configurable in this register
//and NOT the main and sub engines
#define REG_DISPSW          (*(vu16*)0x040010A0)

//Selects the display mode
#define DISPSW_WIN_SHIFT    0
#define DISPSW_WIN_MASK     (3 << DISPSW_WIN_SHIFT)
#define DISPSW_WIN(x)       ((x) << DISPSW_WIN_SHIFT)

#define DISPSW_WIN_MAIN_ONLY                    0 //displays only the main screen
#define DISPSW_WIN_MAIN_FULL_SUB                1 //blends the main screen with the sub screen
#define DISPSW_WIN_MAIN_HALF_SUB_BOTTOM_LEFT    2 //displays the sub screen at 128x96 in the bottom-left corner with optional blending
#define DISPSW_WIN_MAIN_HALF_SUB_BOTTOM_RIGHT   3 //displays the sub screen at 128x96 in the bottom-right corner with optional blending

#define DISPSW_A_SHIFT      4
#define DISPSW_A_MASK       (3 << DISPSW_A_SHIFT)
#define DISPSW_A(x)         ((x) << DISPSW_A_SHIFT)

//Blending for DISPSW_WIN_MAIN_FULL_SUB mode
#define DISPSW_A_FULL_7_1   0 //main 7/8, sub 1/8
#define DISPSW_A_FULL_6_2   1 //main 6/8, sub 2/8
#define DISPSW_A_FULL_5_3   2 //main 5/8, sub 3/8
#define DISPSW_A_FULL_4_4   3 //main 4/8, sub 4/8

//Blending for DISPSW_WIN_MAIN_HALF_SUB modes
#define DISPSW_A_HALF_3_1   0 //main 3/4, sub 1/4
#define DISPSW_A_HALF_2_2   1 //main 2/4, sub 2/4
#define DISPSW_A_HALF_1_3   2 //main 1/4, sub 3/4
#define DISPSW_A_HALF_0_4   3 //main 0/4, sub 4/4

//select the screen to output on tv when tv-out is on
#define DISPSW_M0_SHIFT     8
#define DISPSW_M0_MASK      (1 << DISPSW_M0_SHIFT)
#define DISPSW_M0(x)        ((x) << DISPSW_M0_SHIFT)

#define DISPSW_M0_TV_SUB    0 //output the sub screen to tv-out
#define DISPSW_M0_TV_MAIN   1 //output the main screen to tv-out

//select which screen's picture is main and which is sub
#define DISPSW_M1_SHIFT     9
#define DISPSW_M1_MASK      (1 << DISPSW_M1_SHIFT)
#define DISPSW_M1(x)        ((x) << DISPSW_M1_SHIFT)

#define DISPSW_M1_MAIN_BOTTOM_SUB_TOP    0 //main = bottom screen, sub = top screen
#define DISPSW_M1_MAIN_TOP_SUB_BOTTOM    1 //main = top screen, sub = bottom screen

//tv-out enable/disable
#define DISPSW_TVOUT_SHIFT      14
#define DISPSW_TVOUT_MASK       (1 << DISPSW_TVOUT_SHIFT)
#define DISPSW_TVOUT(x)         ((x) << DISPSW_TVOUT_SHIFT)

#define DISPSW_TVOUT_DISABLED   0   //disables tv-out, top screen will be white
#define DISPSW_TVOUT_ENABLED    1   //enables tv-out, top screen signals are used to output 10 bit digital NTSC

//key input enable/disable
//in single screen mode three of the top screen signals are used as button inputs to configure the output
//the hardware will change the register values of WIN, A and M01 when the buttons are pressed
//this feature can be disabled by setting this bit
#define DISPSW_KEYLOCK_SHIFT    15
#define DISPSW_KEYLOCK_MASK     (1 << DISPSW_KEYLOCK_SHIFT)
#define DISPSW_KEYLOCK(x)       ((x) << DISPSW_KEYLOCK_SHIFT)

#define DISPSW_KEYLOCK_DISABLED 0   //allow configuring the display mode using the buttons
#define DISPSW_KEYLOCK_ENABLED  1   //the buttons will do nothing

//Enables/disables outputting an 11MHz clock on the CLK11M pin of the SOC
//I believe this signal is not actually connected to anything
#define REG_CLK11M          (*(vu16*)0x040010B0)

#define CLK11M_CK11_LOW     0
#define CLK11M_CK11_ACTIVE  1
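
A small usage sketch for the field macros, assuming they combine with plain ORs like any other bitfield helpers. It only builds a register value and never touches hardware. A few macros are repeated so the snippet stands alone, and dispsw_pip_bottom_right is a hypothetical helper name.

```c
#include <stdint.h>

/* Field helpers repeated from the header so the snippet stands alone. */
#define DISPSW_WIN(x)      ((x) << 0)
#define DISPSW_A(x)        ((x) << 4)
#define DISPSW_M1(x)       ((x) << 9)
#define DISPSW_KEYLOCK(x)  ((x) << 15)

#define DISPSW_WIN_MAIN_HALF_SUB_BOTTOM_RIGHT  3
#define DISPSW_A_HALF_2_2                      1
#define DISPSW_M1_MAIN_TOP_SUB_BOTTOM          1
#define DISPSW_KEYLOCK_ENABLED                 1

/* Hypothetical helper: sub screen as a 128x96 inset in the bottom-right
   corner, blended 2/4 with main, main = top screen, buttons locked. */
static inline uint16_t dispsw_pip_bottom_right(void)
{
    return (uint16_t)(DISPSW_WIN(DISPSW_WIN_MAIN_HALF_SUB_BOTTOM_RIGHT) |
                      DISPSW_A(DISPSW_A_HALF_2_2) |
                      DISPSW_M1(DISPSW_M1_MAIN_TOP_SUB_BOTTOM) |
                      DISPSW_KEYLOCK(DISPSW_KEYLOCK_ENABLED));
}
```

On hardware the value would presumably be written to REG_DISPSW after switching REG_DISPCNT2 to DISPCNT2_MOD_SINGLE_SCREEN, and only on a console where the REG_REGCNT lock write did not happen.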

____________________
Take me to your heart / never let me go!

"clearly you need to mow more lawns and buy a better pc" - Hydr8gon

StrikerX3

Posted on 04-24-23 02:11 PM (rev. 6 of 04-24-23 03:10 PM)


Recently I've been implementing the ARM9 cache on my NDS emulator and made several tests on real hardware (a 3DS) to figure out how exactly it works, because the ARM manuals are as unhelpful as ever. My emulator fully implements the instruction and data caches (minus the lockdown feature), actually storing and retrieving data from a separate block of memory and having the cache do line fetches and flushes, and partially implements the write buffer -- the FIFO exists, but is drained as soon as anything goes into it. All ROMs I tested seem to run normally, with around 5-40% performance loss depending on the title compared to no cache emulation. The larger losses happen on titles that already ran crazy fast (500+ fps), so they're still very much playable in real time.

Shoutouts to Gericom, Generic, asie and AntonioND on the gbadev Discord, they gave great feedback, suggestions, insights and coding help on this research.

So, here are my findings on the ARM9 cache.

NDS ARM9 cache inner workings

The basics

The NDS ARM9 has an 8 KB instruction cache and a 4 KB data cache. Both are 4-way set associative, with a line size of 8 words (32 bytes).
Lines have a valid bit and two dirty bits, one for each half of the line. If only one of the dirty bits is set, only that half of the line is flushed. If both are set, the whole line is flushed at once. This distinction is important for timing, as a half-line flush does 1N+3S accesses and a full line does 1N+7S, i.e. it's not two half-line flushes.
The instruction cache does not have dirty bits; they always read as 0. Trying to force them to 1 with the cache debug registers does not have any effect.
The TAG write debug operations use the debug index register (p15, 3, r#, c15, c0, 0). The TAG value contains the same fields, but they are ignored for these writes.
The cache always honors the cachability bits when scanning for cached data. If there is cached data from a region that used to be cachable but became uncachable after a PU region change, that data is not retrieved, even though it is still valid and maybe even dirty. However, when such lines are flushed, the dirty contents are written back to memory.
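
For reference, the geometry above works out to 32 sets for the 4 KB data cache and 64 sets for the 8 KB instruction cache. A small sketch of the address split, assuming the usual set-associative layout (set index taken from the address bits just above the line offset); the function names are made up for illustration.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* 8 words */
#define CACHE_WAYS       4u   /* 4-way set associative */

/* Number of sets for a cache of the given total size in bytes. */
static inline uint32_t cache_num_sets(uint32_t cache_bytes)
{
    return cache_bytes / (CACHE_LINE_BYTES * CACHE_WAYS);
}

/* Set that an address maps to. */
static inline uint32_t cache_set_index(uint32_t addr, uint32_t cache_bytes)
{
    return (addr / CACHE_LINE_BYTES) % cache_num_sets(cache_bytes);
}

/* Remaining upper address bits, compared against the stored TAG. */
static inline uint32_t cache_tag(uint32_t addr, uint32_t cache_bytes)
{
    return addr / (CACHE_LINE_BYTES * cache_num_sets(cache_bytes));
}
```

So for the data cache the set index is address bits 5-9, and for the instruction cache it is bits 5-10.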

Write buffer

The write buffer is always enabled on the NDS ARM9; you cannot disable it through the CP15 control register (c1), since bit 3 always reads as 1.
It is a 16-entry FIFO, where each entry can be either an address+size or data.
There are two internal registers to track address and transfer size.
While draining, when the write buffer finds an address entry, it updates the internal address counter and the transfer size.
When it finds data, it will write to external memory using the internal address and size and automatically increment the address by that size.
For bytes and halfwords, it takes the LSBs of the value and discards the rest.
The write buffer drains in parallel with other instructions, while the CPU is busy-waiting on coprocessor accesses, waiting for IRQ, etc. as long as the data bus is not in use by the Memory pipeline stage or by external components. It can drain while the CPU is fetching instructions from a different region.
I haven't checked what happens if the write buffer overflows. I'm assuming that, depending on the operation, the CPU stalls until the queue drains enough to make room for more data.
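
The draining rules above can be sketched as a small FIFO model. This is only an illustration of the described behavior, not the code from my emulator: all names are hypothetical, memory is a flat little-endian byte array, and a full FIFO just rejects the push (the real overflow behavior is unverified, as noted).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WB_CAPACITY 16

typedef struct {
    bool is_address;  /* true: address+size entry, false: data entry */
    uint32_t value;   /* the address or the data word */
    uint8_t size;     /* 1, 2 or 4 bytes; only used by address entries */
} WbEntry;

typedef struct {
    WbEntry fifo[WB_CAPACITY];
    unsigned head, count;
    uint32_t cur_addr;  /* internal address register */
    uint8_t cur_size;   /* internal transfer size register */
} WriteBuffer;

/* Returns false when the FIFO is full (the CPU would presumably stall). */
static inline bool wb_push(WriteBuffer *wb, WbEntry e)
{
    if (wb->count == WB_CAPACITY)
        return false;
    wb->fifo[(wb->head + wb->count) % WB_CAPACITY] = e;
    wb->count++;
    return true;
}

/* Drains one entry into a flat little-endian memory image. */
static inline void wb_drain_one(WriteBuffer *wb, uint8_t *mem, size_t mem_size)
{
    if (wb->count == 0)
        return;
    WbEntry e = wb->fifo[wb->head];
    wb->head = (wb->head + 1) % WB_CAPACITY;
    wb->count--;
    if (e.is_address) {
        /* Address entry: latch address and transfer size. */
        wb->cur_addr = e.value;
        wb->cur_size = e.size;
        return;
    }
    /* Data entry: write the low cur_size bytes, then advance the address. */
    for (unsigned i = 0; i < wb->cur_size && i < 4; i++) {
        if ((size_t)wb->cur_addr + i < mem_size)
            mem[wb->cur_addr + i] = (uint8_t)(e.value >> (8 * i));
    }
    wb->cur_addr += wb->cur_size;
}
```

Pushing one address entry followed by several data entries models a burst: each data entry lands at the next address without needing its own address entry.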

Replacement strategies

NOTE: research is ongoing on this subject. This is a summary of the behavior I observed from collected data.

Data was collected from my 3DS with a test program that:

Statically allocates a large volatile page-aligned array with a page-aligned size
Sets up PU region 7 to cover the entire array, with data caching enabled and write buffering disabled
Disables data caching and buffering for all other regions
Disables interrupts
Flushes the entire data cache
Sets up VRAM banks A-D to LCD mode to use as data output with predictable access timings
Runs the test loop in ITCM which, on every iteration:
1. Optionally manipulates CP15 state: flips the RR bit, forces a cache flush, toggles cache lockdown, etc.
2. Reads from a fixed point in the array
3. Writes the value to a separate, uncached, volatile variable
4. Uses the CP15 cache debug registers to determine which of the lines was allocated by checking the valid bit of all 4 segments of the set without using conditionals or skipping lines
5. Writes the index of the affected line to VRAM using a 32-bit write, tightly packing 16 values in one word
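
The packing in step 5 can be sketched like this: each way index is 2 bits, so 16 of them fit in one 32-bit word. The helper names are hypothetical, and placing slot 0 in the lowest two bits is an assumption made for the example.

```c
#include <stdint.h>

/* Stores a 2-bit way index into slot 0..15 of a 32-bit word.
   Assumption: slot 0 occupies the lowest two bits. */
static inline uint32_t pack_way(uint32_t word, unsigned slot, unsigned way)
{
    unsigned shift = (slot & 15u) * 2u;
    return (word & ~(3u << shift)) | ((uint32_t)(way & 3u) << shift);
}

/* Reads back the way index stored in the given slot. */
static inline unsigned unpack_way(uint32_t word, unsigned slot)
{
    return (word >> ((slot & 15u) * 2u)) & 3u;
}
```

Once 16 slots are filled, the word goes out to VRAM with a single 32-bit write.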

Known facts

The instruction and data caches have independent counters.
The counter is incremented whenever a cache line fill happens.
The cache does not try to look for a free line out of the 4 in the set. It simply selects a line using the counter and flushes an existing line if it is dirty. This was evidenced by a simple program that reads data alternating between two addresses that line up to two different sets on the cache. While using the round-robin replacement strategy, the first set had lines 1 and 3 filled in and replaced, while the second set had lines 2 and 4 used.
The only other action that seems to influence the counter is the cache lockdown procedure, which forces the round-robin counter to start from the first free index (i.e. the index set on the lockdown register).
- Flipping the RR bit on CP15 control register doesn't change the counter.
- The counter is not incremented on CPU cycles, as evidenced by increasing or decreasing code complexity on a tight, fixed-cycle-count loop with interrupts disabled.
The random replacement strategy uses an LFSR of at least 12 bits with a period of 0x7FF based on the observed polynomials, or possibly 14 bits due to the round-robin counter interferences.
- The LSB of the random output sequence 100% matches the output of an LFSR, though there is no single 12-bit polynomial that matches every sequence generated by the ARM CPU. I've seen 7 different polynomials so far. (I haven't tested a 14-bit LFSR yet... it might have a single common polynomial, hopefully.)
- The MSB of the random output is still a mystery, though it does repeat after 0x7FF cycles and has an even distribution of 0s and 1s. Here are a couple of facts on it:
  - It's not generated from the second bit (or any other bit) of the 12-bit LFSR.
  - It's not generated from a second LFSR with a different polynomial. I've tested all possible 16-bit LFSRs with this.
  - It's not generated from every other bit of the LFSR output (i.e. advance once for the LSB, once more for the MSB, or vice-versa).
  - It's not a XOR of any combination of bits from the 12-bit LFSR and the round-robin counter.
  (Testing a 14-bit LFSR might change things, I hope.)
The random counter influences and is influenced by the round-robin counter, implying that they're either one shared counter or the counter increment circuit merges the two counters somehow.
The cache lockdown register influences the random sequence counter, possibly due to affecting the round-robin counter. The output is affected as follows:
- When the lockdown index is 0 (lockdown disabled), the random generator outputs 00s, 01s, 10s and 11s evenly distributed.
- When the lockdown index is 1, the random generator seems to replace 00s with 10s, as evidenced by the uneven distribution biased towards that number. The LSB is still evenly distributed between 0s and 1s as expected from the output of an LFSR.
- When the lockdown index is 2, the MSB bit is forced to 1 and there is an even distribution of 10s and 11s.
- When the lockdown index is 3, the output is always 11 (obviously).
- In all cases, the round-robin counter is always set to the lockdown register index when it is written. Further increments advance it by 1, wrapping back to the lockdown index on overflow.
Some interesting observations made during the tests:
(Remember that setting the cache lockdown register also sets the round-robin counter to the specified index, so setting it on every iteration effectively forces the counter to have a fixed value)
- Forcing the cache lockdown to index 0 on every iteration causes the random number generator to output 01s instead of 11s. The other outputs are unaffected.
- Forcing the cache lockdown to indexes 1 or 2 on every iteration causes the random number generator to output 10s instead of 00s. The other outputs are unaffected.
- Flipping the CP15 control register RR every 32 or 256 accesses causes the round-robin sequence to start at different points, once again evidencing that the LFSR influences the round-robin counter, or that this counter is extracted from a portion of the LFSR register.
- Flipping the CP15 control register RR for one iteration after 0x7FF random iterations produces a predictable incrementing pattern on the round-robin counter (00, 01, 10, 11, repeat). The random sequence is much less obvious -- it keeps using the same polynomial, but jumps around the sequence unpredictably, and sometimes flips all output bits.
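
The round-robin side of this is simple enough to model directly. A minimal sketch that assumes only the observed behavior: one counter per cache (not per set), advanced on every line fill, reset by lockdown writes, wrapping back to the lockdown index. The random strategy and its interaction with this counter are not modeled, and all names are made up.

```c
/* Hypothetical round-robin replacement state for one cache. */
typedef struct {
    unsigned counter;   /* way to replace on the next line fill, 0..3 */
    unsigned lockdown;  /* lockdown index, 0..3 (0 = lockdown disabled) */
} RoundRobin;

/* Writing the lockdown register also resets the counter to that index. */
static inline void rr_write_lockdown(RoundRobin *rr, unsigned index)
{
    rr->lockdown = index & 3u;
    rr->counter  = index & 3u;
}

/* Returns the way replaced by this line fill, then advances the counter,
   wrapping back to the lockdown index on overflow. */
static inline unsigned rr_line_fill(RoundRobin *rr)
{
    unsigned way = rr->counter;
    rr->counter = (rr->counter == 3u) ? rr->lockdown : rr->counter + 1u;
    return way;
}
```

Since the counter is shared by all sets, alternating fills between two sets gives ways 0 and 2 to the first set and ways 1 and 3 to the second, which matches the "lines 1 and 3" versus "lines 2 and 4" observation above.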

Here are a few hypotheses on how the replacement counter(s) might be implemented.

Hypothesis 1: one shared counter for both strategies

There is a single counter, at least 14 bits long (2 bits for the round-robin counter and at least 12 bits for the LFSR), used for both strategies.
Some part of this counter is used as the round-robin counter.
The entire counter is used with the LFSR circuit to generate the random output bits.
Evidence for:
- Both algorithms affect both counters, as evidenced by the RR flipping and lockdown forcing tests.
- The round-robin counter predictably increments by 1 after 0x7FF random iterations.
- The circuit design is much simpler if both counters share the same register. The round-robin output simply extracts two bits from the register, and the LFSR applies to the whole register; its output is the LSB, and some additional circuitry provides the MSB, very likely tapping into the round-robin bits.
Evidence against:
- None so far, though I have only tested 12-bit LFSRs and have yet to test a 14-bit one.

Hypothesis 2: separate counters for each strategy

There is a 12-bit-minimum LFSR and a 2-bit round-robin counter.
The random circuit uses the LFSR to produce the LSB and somehow combines the two counters to produce the MSB, modifying the round-robin counter in the process.
Evidence for:
- Forcing the cache lockdown register on every iteration also alters the random sequence's MSB, indicating that this bit is derived from the round-robin counter.
- The round-robin counter increments predictably after 0x7FF random iterations.
Evidence against:
- The circuit would have to be more complex to support two separate counters used as input for two algorithms. LFSRs in particular tend to be very simple and compact, so the added complexity seems unjustified.
- Both algorithms affect both counters; they don't seem to be independent.
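
Both hypotheses lean on the 0x7FF period, so here is a sketch of how an LFSR period can be checked by brute force. The polynomial (x^11 + x^9 + 1, the textbook PRBS11) and the 11-bit width are illustrative choices that happen to have the same period. They are NOT a claim about the polynomial or register width the ARM9 actually uses.

```c
#include <stdint.h>

/* One step of an 11-bit Fibonacci LFSR with polynomial x^11 + x^9 + 1.
   Illustrative only; not the polynomial observed on hardware. */
static inline uint16_t lfsr11_step(uint16_t state)
{
    uint16_t bit = (uint16_t)((state ^ (state >> 2)) & 1u);
    return (uint16_t)(((state >> 1) | (bit << 10)) & 0x7FFu);
}

/* Counts the steps needed for the state to return to the seed. */
static inline uint32_t lfsr11_period(uint16_t seed)
{
    uint16_t start = (uint16_t)(seed & 0x7FFu);
    uint16_t s = start;
    uint32_t n = 0;
    do {
        s = lfsr11_step(s);
        n++;
    } while (s != start);
    return n;
}
```

The same loop works for any candidate polynomial: swap the tap positions in lfsr11_step and compare the measured period and output bits against the captured sequence.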

Here's the messy code of the test ROM I used if you want to play around. Can be compiled with devkitPro using the Makefile (dkp), or BlocksDS with the regular Makefile + Makefile.include. main.cpp must be in a folder called source. The replacement strategy tests mess with the PU table and assume region 3 is the GBA cart area or the DSi switchable IWRAM; I haven't checked if the PU regions are the same when compiled with dkP, so it might not work.

Arisotura

Posted on 04-24-23 02:19 PM


that's super interesting!


PoroCYon

Posted on 07-31-23 11:05 PM (rev. 3 of 08-16-23 09:31 PM)


I found some more misc stuff, so:

Second NTR cartridge registers

GBATEK is right that there's a second cartridge slot at 0x040021Ax (with its own AUXSPICNT/DAT, ROMCTRL, etc, and data-in at 0x04102010), but this has some consequences that are not always listed:

EXMEMCNT.bit10 controls which ARM core has access rights to the above IO registers
IRQ flags:
- bit 14: NTR cart A detect
- bit 15: NTR cart B detect
- bit 26: NTR cart B transfer complete
- bit 27: NTR cart B IREQ
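
As C defines, the IRQ bits above would look something like this. The macro and function names are made up for illustration; only the bit positions come from the list.

```c
#include <stdint.h>

/* Hypothetical names; bit positions as listed above. */
#define IRQ_NTR_CART_A_DETECT     (1u << 14)
#define IRQ_NTR_CART_B_DETECT     (1u << 15)
#define IRQ_NTR_CART_B_XFER_DONE  (1u << 26)
#define IRQ_NTR_CART_B_IREQ       (1u << 27)

#define IRQ_NTR_CART_B_ALL \
    (IRQ_NTR_CART_B_DETECT | IRQ_NTR_CART_B_XFER_DONE | IRQ_NTR_CART_B_IREQ)

/* Nonzero if any second-slot (cart B) interrupt flag is set. */
static inline int irq_is_cart_b(uint32_t irq_flags)
{
    return (irq_flags & IRQ_NTR_CART_B_ALL) != 0;
}
```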

SCFG

0x04004002: SCFG_ROMWE (ARM7) (no clue what this does, always read as 0(?). clearing bit 0 of this reg is the very very first thing the ARM7 bootROM does.)
SCFG_CLK7:
- bit 1: ?
- bit 2: AES clock control
SCFG_EXT7:
- bit 21: seems to control all things I2S I think? (i.e. SNDEXCNT+MIC)
- bit 11 and 28 are something but idk what

CAM_MCNT (A9)

bit 0: standby/sleep mode?
bit 1: reset (active-low)
bit 2: sync reset? (cf. VSYNC/HSYNC in camera connector on PCB)
bit 3: stop RCLK?
bit 4: 1.8V core voltage rail enable
bit 5: 1.8V IO voltage rail enable
bit 6: 2.8V voltage rail enable
bit 7: ready status bit

ConsoleID format

This one seems to be a die/wafer/lot ID number (which you can also see in eg. MSP430 TLV data), with format (from LSB to MSB):

die X/Y position (12 bit)
die Y/X position (12 bit) [basically the other coordinate]
wafer ID (8 bit)
lot ID (20 bit)
??? ID (9 bit) [fab ID? something else??]
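
A sketch of pulling those fields out of a 64-bit ConsoleID, assuming the fields are packed back to back starting at bit 0 in the order listed (12+12+8+20+9 = 61 bits, so the top 3 bits are left over). Struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical field view of the ConsoleID, LSB first. */
typedef struct {
    unsigned die_pos_a;   /* bits 0-11: die X/Y position */
    unsigned die_pos_b;   /* bits 12-23: the other coordinate */
    unsigned wafer_id;    /* bits 24-31 */
    uint32_t lot_id;      /* bits 32-51 */
    unsigned unknown_id;  /* bits 52-60: fab ID? something else? */
} ConsoleIdFields;

static inline ConsoleIdFields console_id_decode(uint64_t id)
{
    ConsoleIdFields f;
    f.die_pos_a  = (unsigned)(id & 0xFFFu);
    f.die_pos_b  = (unsigned)((id >> 12) & 0xFFFu);
    f.wafer_id   = (unsigned)((id >> 24) & 0xFFu);
    f.lot_id     = (uint32_t)((id >> 32) & 0xFFFFFu);
    f.unknown_id = (unsigned)((id >> 52) & 0x1FFu);
    return f;
}
```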

WIFIWAITCNT

bit 7: MCLK disable?

PoroCYon

Posted on 08-13-23 06:23 PM


Nintendo DSi XL testpoint names

The testpoints on the regular DSi mobo (CPU-TWL-01) are named, the ones on the DSi XL (CPU-UTL-01) aren't. So I went over them all with a continuity tester to figure it out. Took me ~12h spread across 4 days.

Arisotura

Posted on 10-28-23 12:22 PM (rev. 2 of 10-28-23 01:18 PM)


entry 0x67 in firmware user settings

it's the RTC clock adjust value

on DSi it is stored at offset 0x88 in HWINFO_N.dat

also 2FFFDE8/2FFFDEC aren't specifically the RTC date/time, these addresses are used for RTC IO in general


AntonioND

Posted on 11-29-23 02:20 AM


 MMIO[82C0h/82C2h+(0..1)*80h] - BTDMP Receive/Transmit FIFO Status (R)
...
  4      FIFO Empty (0=No, 1=Empty, 0x16bit words)

This is incorrect. The bit is 0 when the FIFO is empty, 1 when it isn't empty. I would need to verify it more, but I've seen that with a couple of quick tests.

CasualPokePlayer

Posted on 12-26-23 09:32 PM


  075h  1   Extended Language (0..5=Same as Entry 064h, plus 6=Chinese)
            (for language 6, entry 064h defaults to english; for compatibility)
            (for language 0..5, both entries 064h and 075h have same value)
  076h  2   Bitmask for Supported Languages (Bit0..6)
            (007Eh for iQue DS, ie. with chinese, but without japanese)
            (0042h for iQue DSi, chinese (and english, but only for NDS mode))
            (003Eh for DSi/EUR, ie. without chinese, and without japanese)

Language 7 is Korean. As with Chinese, this is only represented in the extended language field in firmware userdata; the older language field will just be set to 1 (English). This also means the supported language bitmask is one bit larger (bit 7 for Korean). Presumably DSi specific settings follow the same rules, although I don't have a Korean NAND to verify that.
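
In code, the relationship between the two language entries would be roughly the following. This is a sketch with made-up names, assuming entry 076h uses bit N for language N as the quoted description suggests.

```c
/* Languages 0..5 fit in entry 064h; 6 = Chinese and 7 = Korean only
   exist in the extended language entry 075h. */
enum { LANG_ENGLISH = 1, LANG_CHINESE = 6, LANG_KOREAN = 7 };

/* Value the older language field (entry 064h) ends up with. */
static inline unsigned legacy_language(unsigned extended_language)
{
    return extended_language <= 5 ? extended_language : LANG_ENGLISH;
}

/* Checks the supported-languages bitmask (entry 076h), bit N = language N. */
static inline int language_supported(unsigned mask, unsigned language)
{
    return language <= 7 && ((mask >> language) & 1u);
}
```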

  01Dh 1    Console type
              FFh=Nintendo DS
              20h=Nintendo DS-lite
              57h=Nintendo DSi (also iQueDSi)
              43h=iQueDS
              63h=iQueDS-lite
            The entry was unused (FFh) in older NDS, ie. replace FFh by 00h)
              Bit0   seems to be DSi/iQue related
              Bit1   seems to be DSi/iQue related
              Bit2   seems to be DSi related
              Bit3   zero
              Bit4   seems to be DSi related
              Bit5   seems to be DS-Lite related
              Bit6   indicates presence of "extended" user settings (DSi/iQue)
              Bit7   zero

0x35 is the console type for Korean DS Lite, which also has extended user settings. So gbatek's description of bit 6 seems to be wrong? Bit 0 seems to be more indicative of having extended user settings.
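
A quick way to sanity-check the bit 0 versus bit 6 claim against the known console types. The table below is just the values from the quote plus 0x35, with FFh replaced by 00h as GBATEK suggests. Which consoles have extended settings is taken from the descriptions above, and the helper names are made up.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t type;      /* console type byte (entry 01Dh) */
    int has_extended;  /* 1 if the console has extended user settings */
} ConsoleTypeInfo;

static const ConsoleTypeInfo known_types[] = {
    { 0x00, 0 },  /* Nintendo DS (FFh replaced by 00h) */
    { 0x20, 0 },  /* Nintendo DS Lite */
    { 0x57, 1 },  /* Nintendo DSi (also iQue DSi) */
    { 0x43, 1 },  /* iQue DS */
    { 0x63, 1 },  /* iQue DS Lite */
    { 0x35, 1 },  /* Korean DS Lite */
};

/* Counts the known types for which the given bit disagrees with
   the presence of extended user settings. */
static inline unsigned count_mismatches(unsigned bit)
{
    unsigned mismatches = 0;
    for (size_t i = 0; i < sizeof known_types / sizeof known_types[0]; i++) {
        int bit_value = (known_types[i].type >> bit) & 1;
        if (bit_value != known_types[i].has_extended)
            mismatches++;
    }
    return mismatches;
}
```

With this table, bit 0 agrees for every entry, while bit 6 disagrees only for 0x35.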

Mighy Max

Posted on 01-05-24 09:27 AM


Posted by StrikerX3

The basics

The NDS ARM9 has an 8 KB instruction cache and a 4 KB data cache. Both are 4-way set associative, with a line size of 8 words (32 bytes).

How sure are you about this?

I have spent several days now measuring memory and instruction timings on the (Phat-)DS, my DSi and the various emulators,
out of curiosity and because I think that this is still a bit poorly documented/implemented (yeah, it's a performance hit without much, if any, benefit).

The thing here is:
All my measurements point to 2kB data cache on the DS and only 1kB data cache on the DSi.
I checked multiple times whether my measurement is off, or whether some hardware setting (e.g. cache lockdown) is causing this, but to no avail.
On all emulators (as expected) no cache size can be determined and all runs take the same time. The reference clock is the 33MHz timer on the ARM9.

The result does not match the public info about 4kB data cache.

What I do to measure the instruction and data cache sizes:

  Enable instruction cache, disable cache lockdowns
  for n in [8..16]:
    Do 11 runs of the following measurement and take the median time of these measurements:
      run a loop of (2^n)-3 'mov r8, r8' instructions, together with 3 instructions for the looping
    If the median divided by the number of instructions exceeds double that of the previous run
      Report the previous run length as instruction cache size

For the data cache the rise in execution time is not as sharp, because the instruction cache is disabled, which results in a larger fetch time for each instruction.
I do this to eliminate the effect of instruction caching on this measurement.

  Disable instruction cache, disable cache lockdowns
  Enable data cache, disable cache lockdowns
  for n in [8..16]:
    Do 11 runs of the following measurement and take the median time of these measurements:
      run a loop of (2^n)-3 'ldr r2, [pc]' instructions, together with 3 instructions for the looping
    If the median divided by the number of instructions exceeds 120% of the previous run
      Report the previous run length as data cache size
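
The evaluation part of both procedures (median of the runs, then looking for the jump in per-instruction time) can be sketched on the host side like this. This is not the code from the test app: names are hypothetical, the timing values are assumed to be collected elsewhere, and the threshold is passed as a fraction num/den (2/1 for the instruction cache test, 6/5 for the data cache test).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
static inline uint32_t median_u32(uint32_t *samples, size_t n)
{
    qsort(samples, n, sizeof *samples, cmp_u32);
    return samples[n / 2];
}

/* median_ticks[i] is the median time of a loop of 2^(first_n + i)
   instructions. Returns the instruction count of the last loop before the
   per-instruction time grows by more than num/den, or 0 if it never does. */
static inline uint32_t detect_cache_limit(const uint32_t *median_ticks,
                                          unsigned first_n, unsigned count,
                                          unsigned num, unsigned den)
{
    for (unsigned i = 1; i < count; i++) {
        uint64_t prev_instr = 1ull << (first_n + i - 1);
        uint64_t cur_instr  = 1ull << (first_n + i);
        /* cur/cur_instr > (num/den) * prev/prev_instr, cross-multiplied
           to stay in integer arithmetic. */
        if ((uint64_t)median_ticks[i] * prev_instr * den >
            (uint64_t)median_ticks[i - 1] * cur_instr * num)
            return (uint32_t)prev_instr;
    }
    return 0;
}
```

The returned value is an instruction count; multiply by 4 to get the loop size in bytes.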

I can provide a simple .NDS testing this behaviour, if someone wants to double check.

Mighy Max

Posted on 01-17-24 08:52 AM


Hello again,

I cleaned up and published a little app to measure cache and memory timings.

The issue with the cache did not resolve. I'm quite certain that the 4kB data cache is not present (or at least not effective) on the DS and DSi.
The measurements are consistent on the HW between runs and variants of the test, while they are different for the generations (DS vs DSi)

You can find the source code to measure and verify yourself at https://github.com/DesperateProgrammer/DSMemoryCycleCounter

I hope the data can improve the accuracy of the emulators in the future.
I am aware that it has not much benefit at the moment, as implementing the mechanics of a cache would impose a significant performance impact in the emulation.
Maybe the timings of the non-cached memory regions can be improved and compatibility increases. The app might provide a benchmark for this.

If you happen to find an error in the calculation/measurement of cycles please let me know.

There are still a lot of things i want to include into the measurements such as impact of DMA, verification of BUS priority, verification of cache-writeback sizes and much more.

Without further ado, the memory timings and cache sizes:

DSi Caches
==========

                    Reported      Measured
ICACHE Size             8192          8192
DCACHE Size             4096          1024    ! Conflict
ICACHE Line Size          32            32
DCACHE Line Size          32            32


DS Caches
==========

                    Reported      Measured
ICACHE Size             8192          8192
DCACHE Size             4096          2048    ! Conflict
ICACHE Line Size          32            32
DCACHE Line Size          32            32



DSi Memory Timings
==================

In cpu cycles @ 66MHz
                N16   S16   N32   S32
Main RAM         16    16    18     4
ITCM              2     2     2     2
DTCM              1     1     1     1
WRAM              8     8     8     2
VRAM              8     8     8     2
GBA ROM          26    26    38    24     EXMEMCNT[4..2] = 000
GBA ROM          22    22    34    24     EXMEMCNT[4..2] = 001
GBA ROM          18    18    30    24     EXMEMCNT[4..2] = 010
GBA ROM          42    42    52    24     EXMEMCNT[4..2] = 011
GBA ROM          26    26    34     8     EXMEMCNT[4..2] = 100
GBA ROM          22    22    26     8     EXMEMCNT[4..2] = 101
GBA ROM          18    18    26     8     EXMEMCNT[4..2] = 110
GBA ROM          42    42    50     8     EXMEMCNT[4..2] = 111
GBA RAM          26    26    26    20     EXMEMCNT[1..0] = 00
GBA RAM          22    22    22    16     EXMEMCNT[1..0] = 01
GBA RAM          18    18    18    12     EXMEMCNT[1..0] = 10
GBA RAM          42    42    42    36     EXMEMCNT[1..0] = 11

In cpu cycles @ 133MHz

                N16   S16   N32   S32
Main RAM         32    32    36     8
ITCM              2     2     2     2
DTCM              1     1     1     1
WRAM             12    12    12     4
VRAM             12    12    12     4
GBA ROM          48    48    72    48     EXMEMCNT[4..2] = 000
GBA ROM          40    40    64    48     EXMEMCNT[4..2] = 001
GBA ROM          32    32    56    48     EXMEMCNT[4..2] = 010
GBA ROM          80    80   104    48     EXMEMCNT[4..2] = 011
GBA ROM          48    48    64    32     EXMEMCNT[4..2] = 100
GBA ROM          40    40    56    32     EXMEMCNT[4..2] = 101
GBA ROM          32    32    48    32     EXMEMCNT[4..2] = 110
GBA ROM          80    80    96    32     EXMEMCNT[4..2] = 111
GBA RAM          48    48    48    40     EXMEMCNT[1..0] = 00
GBA RAM          40    40    40    32     EXMEMCNT[1..0] = 01
GBA RAM          32    32    32    24     EXMEMCNT[1..0] = 10
GBA RAM          80    80    80    72     EXMEMCNT[1..0] = 11

In bus cycles @ 66MHz

                N16   S16   N32   S32
Main RAM          8     8     9     2
ITCM              1     1     1     1
DTCM            0.5   0.5   0.5   0.5
WRAM              4     4     4     1
VRAM              4     4     4     1
GBA ROM          13    13    19    12     EXMEMCNT[4..2] = 000
GBA ROM          11    11    17    12     EXMEMCNT[4..2] = 001
GBA ROM           9     9    15    12     EXMEMCNT[4..2] = 010
GBA ROM          21    21    27    12     EXMEMCNT[4..2] = 011
GBA ROM          13    13    17     8     EXMEMCNT[4..2] = 100
GBA ROM          11    11    15     8     EXMEMCNT[4..2] = 101
GBA ROM           9     9    13     8     EXMEMCNT[4..2] = 110
GBA ROM          21    21    25     8     EXMEMCNT[4..2] = 111
GBA RAM          13    13    13    10     EXMEMCNT[1..0] = 00
GBA RAM          11    11    11     8     EXMEMCNT[1..0] = 01
GBA RAM           9     9     9     6     EXMEMCNT[1..0] = 10
GBA RAM          21    21    21    18     EXMEMCNT[1..0] = 11

In bus cycles @ 133MHz

                N16   S16   N32   S32
Main RAM          8     8     9     2
ITCM            0.5   0.5   0.5   0.5
DTCM            .25   .25   .25   .25
WRAM              3     3     3     1
VRAM              3     3     3     1
GBA ROM          12    12    18    12     EXMEMCNT[4..2] = 000
GBA ROM          10    10    16    12     EXMEMCNT[4..2] = 001
GBA ROM           8     8    14    12     EXMEMCNT[4..2] = 010
GBA ROM          20    20    26    12     EXMEMCNT[4..2] = 011
GBA ROM          12    12    16     8     EXMEMCNT[4..2] = 100
GBA ROM          10    10    14     8     EXMEMCNT[4..2] = 101
GBA ROM           8     8    12     8     EXMEMCNT[4..2] = 110
GBA ROM          20    20    24     8     EXMEMCNT[4..2] = 111
GBA RAM          12    12    12    10     EXMEMCNT[1..0] = 00
GBA RAM          10    10    10     8     EXMEMCNT[1..0] = 01
GBA RAM           8     8     8     6     EXMEMCNT[1..0] = 10
GBA RAM          20    20    20    18     EXMEMCNT[1..0] = 11
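
A minimal Python sketch of how the 133MHz GBA slot rows above could be looked up from an EXMEMCNT value. The table and function names are made up for illustration, and it assumes the fourth GBA ROM row belongs to EXMEMCNT[4..2] = 011:

```python
# (N16, S16, N32, S32) in bus cycles @ 133MHz, copied from the table above.
# Assumption: the fourth GBA ROM row is the one for EXMEMCNT[4..2] = 011.
GBA_ROM_133 = [
    (12, 12, 18, 12),  # 000
    (10, 10, 16, 12),  # 001
    (8, 8, 14, 12),    # 010
    (20, 20, 26, 12),  # 011 (assumed)
    (12, 12, 16, 8),   # 100
    (10, 10, 14, 8),   # 101
    (8, 8, 12, 8),     # 110
    (20, 20, 24, 8),   # 111
]
GBA_RAM_133 = [
    (12, 12, 12, 10),  # 00
    (10, 10, 10, 8),   # 01
    (8, 8, 8, 6),      # 10
    (20, 20, 20, 18),  # 11
]

def gba_timings_133(exmemcnt):
    """Return (rom_timings, ram_timings) for a given EXMEMCNT value."""
    rom_sel = (exmemcnt >> 2) & 0b111  # EXMEMCNT[4..2]
    ram_sel = exmemcnt & 0b11          # EXMEMCNT[1..0]
    return GBA_ROM_133[rom_sel], GBA_RAM_133[ram_sel]
```

For example, gba_timings_133(0b10111) selects the ROM row for 101 and the RAM row for 11.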

DS Memory Timings
=================

In cpu cycles:
                N16   S16   N32   S32
Main RAM         16    16    18     4
ITCM              2     2     2     2
DTCM              1     1     1     1
VRAM              8     8     8     2
GBA ROM/RAM      *problems measuring*


In bus cycles:
                N16   S16   N32   S32
Main RAM          8     8     9     2
ITCM              1     1     1     1
DTCM            0.5   0.5   0.5   0.5
VRAM              4     4     4     1
GBA ROM/RAM      *problems measuring*
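
In the two DS tables above, every cpu-cycle figure is exactly twice the matching bus-cycle figure. A small Python sketch of that conversion, using only the numbers listed above; the names are made up for illustration, and the fixed factor of 2 is an assumption read off the tables rather than something measured separately:

```python
# (N16, S16, N32, S32) in bus cycles, copied from the DS table above.
DS_BUS_CYCLES = {
    "Main RAM": (8, 8, 9, 2),
    "ITCM": (1, 1, 1, 1),
    "DTCM": (0.5, 0.5, 0.5, 0.5),
    "VRAM": (4, 4, 4, 1),
}

# Assumption: one bus cycle equals two cpu cycles, which is the ratio
# between the two tables.
CPU_PER_BUS = 2

def to_cpu_cycles(region):
    """Convert one row of the bus-cycle table to cpu cycles."""
    return tuple(c * CPU_PER_BUS for c in DS_BUS_CYCLES[region])
```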

Jakly

Posted on 04-23-24 10:51 PM

Disclaimer: this research is not 100% complete and all info in it may be subject to change.

DS Raster Timings and relevant info:
Testing was done on a US New 3DS XL.

One cycle in the context of the 3D GPU is 2 ARM7 cycles.

The basic steps in rasterization are as follows:
1a. Begin rasterizing two scanlines (a scanline pair).
1b-1. If possible, perform the finishing pass on two scanlines (simultaneously with the raster pass).
1b-2. Push the finished scanlines to an intermediary buffer (48 scanlines big).
2. Check whether the intermediary buffer has room for two scanlines; if not, wait for the 2D GPU to read enough scanlines.
3. Repeat the process until all 192 scanlines are done.

Note: the 3D GPU repeats this process as fast as it can; unlike the 2D GPU, it doesn’t wait for hblank or anything else to start a new loop.
It does have to wait for the 2D GPU to free up space in the scanline buffer, which limits how fast it can render scanlines.
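
A rough Python sketch of the loop described above. It has no real timing in it: the 2D GPU is modelled as reading exactly one scanline each time the 3D GPU has to wait, which is purely an assumption to keep the example small. All names are made up.

```python
BUFFER_LINES = 48   # intermediary buffer size
TOTAL_LINES = 192   # scanlines per frame

def simulate_frame():
    """Push all scanline pairs and count how often the 3D GPU had to wait."""
    unread = 0   # unread scanlines currently in the buffer
    pushed = 0   # scanlines pushed so far
    waits = 0    # times the 3D GPU stalled on a full buffer
    while pushed < TOTAL_LINES:
        # Step 2: there must be room for two scanlines.
        while BUFFER_LINES - unread < 2:
            unread -= 1   # assumption: the 2D GPU reads one line per wait
            waits += 1
        # Steps 1a/1b: rasterize, finish and push one scanline pair.
        unread += 2
        pushed += 2
    return pushed, waits
```

In this toy model the first 24 pairs fill the buffer without stalling, and every pair after that needs two reads to make room.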

Scanlines are rendered in two passes:
First pass: Scanlines are rendered simultaneously in pairs, like so:
(1 + 2) -> (3 + 4) -> (5 + 6)...

Second pass: Scanlines are finished simultaneously in pairs as well, just offset by one, e.g.
(1) -> (2 + 3) -> (4 + 5)......(190 + 191) -> (192)
After they are finished, they are pushed to the scanline buffer.
This process always takes a fixed number of cycles (approx. 500, I think?)
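
The two groupings can be written out as a short Python sketch (function names are made up; scanlines are numbered from 1 as in the text, and the total is assumed to be even):

```python
def first_pass_pairs(total=192):
    """(1, 2), (3, 4), ... : scanlines rasterized together."""
    return [(n, n + 1) for n in range(1, total, 2)]

def second_pass_groups(total=192):
    """(1,), (2, 3), (4, 5), ..., (total,) : scanlines finished together."""
    groups = [(1,)]
    groups += [(n, n + 1) for n in range(2, total, 2)]
    groups.append((total,))
    return groups
```

With 192 scanlines this gives 96 first-pass pairs and 97 second-pass groups, since the first and last scanlines are finished on their own.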

Scanline Buffer: There is a 48-scanline buffer where finished scanlines are stored for the 2D GPU to read from.

Internal Scanline Buffer:
Seems likely there are two of them?
Probably at least 3 scanlines big each?
Most likely structured like this:

sl0 - sl2 - sl4 - sl0
sl1 - sl3 - sl5 - sl1

I suspect this behavior because…

Screen border edge marking bug:
At the left/right edges of the screen, the GPU’s edge marking checks seem to look at the second previous/second next scanline instead, largely matching the pairs the second pass uses.
Depending on the exact timing of the prev/next scanline, though, it can wind up seeing the clear plane instead (it doesn’t use the clear bitmap’s depth; it always seems to use the clear depth register’s value?)
(Exact timing details aren’t provided here, since I’m still slightly confused by them...)

Scanline Timeout:
If the 2D GPU is about to need a scanline and there isn’t a ready, unread scanline in the buffer, it forces the 3D GPU to “abort” rasterization of the current scanline pair and begin edge marking etc. on the scanline so that it is done in time for the 2D GPU to read it.
DANGER: the first two scanlines can NOT be aborted. (Don’t ask why; I cannot even begin to theorize what went wrong here. We are in hard-mode research territory.)
This causes A LOT of bugs.

Scanline Repeat:
You can delay the first scanline pair for so long that the 2D GPU starts reading from the buffer before a new scanline is actually ready. This results in scanlines from the last successfully rendered quarter of the screen being repeated. Since the first pair has no timeout at all, you can delay it for so long that subsequent scanlines also get delayed, which lets you break things even more.

Partially Updated Scanline Reads:
The 2D GPU can also read scanlines while they’re being submitted. This seems to result in it reading two pixels once every two cycles? But it also sometimes behaves slightly weirdly, and I don’t know what’s going on there.

Rasterizer Collapse:
Delay scanline writes long enough and the GPU will eventually break down completely.
The image becomes unstable: lots of flashing colors and garbage being displayed.
This seemingly also results in RDLines_Count and the Color Buffer Underflow Flag getting confused?
Display capture alters this behavior for some reason?
The GPU can recover from this state if you reduce the number of polygons rendered (though sometimes it struggles for a bit).
All very strange.
Theories:
My best guess is that scanline rendering gets so delayed that it winds up with scanlines queued up (due to a full buffer) and no way to empty the buffer, since scanlines aren’t being read.
This causes the previous frame to linger through vblank, resulting in *serious* issues, as the memory it was working with gets ripped out from under it and replaced with a new set by a buffer swap.
Basically, I think fully understanding it requires ludicrously low-level knowledge of how the GPU works.

Bottom Scanline Pair Underflow Behavior:
For some reason this scanline pair behaves oddly with regard to the top scanline pair bug, in a way that I can only say I don’t understand. Underflowing it can lock in some scanline bugs? Prevent others? It’s weird.

RDLines_Count Register:
This register is a 6-bit unsigned integer.
It tracks the lowest number of unread scanlines in the intermediary buffer after each read from it.
It effectively works by updating an internal tracker whenever a scanline is read from the buffer
(though it seems to update this value 39/40 cycles earlier than I’d expect, alternating every scanline for some reason?).
If the tracker’s value is greater than the number of unread scanlines after a read, it is updated to the new value.
This effectively results in the register’s cap of 46: the buffer is 48 scanlines large, the GPU cannot render new scanlines if there isn’t space for them, and the tracker is always updated *after* reading. Thus you always end up with 2 scanlines missing.
The externally readable value appears to be updated to the internal tracker’s value on VBlank (needs verification?)
The internal tracker’s value is reset every frame.
On first boot (verified by installing a test ROM as DSiWare on 3DS) and when the 3D GPU’s rasterizer is turned on (not off, for some reason?), the value of this register is set to 63 (0x3F).
It starts updating to a proper value once the rasterizer is “fully enabled” (which seems to require pushing two swap buffers commands after powering the rasterizer on).
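
A minimal Python sketch of the tracker behavior described above. The class and method names are made up. The text doesn’t say what value the tracker resets to or exactly when, so resetting it to 0x3F at VBlank is an assumption, and the “fully enabled” condition and the 39/40 cycle offset are left out entirely.

```python
RESET_VALUE = 0x3F  # assumption: tracker resets to the 6-bit maximum

class RDLinesCount:
    def __init__(self):
        self.tracker = RESET_VALUE  # internal running minimum
        self.readable = 0x3F        # value seen after the rasterizer is turned on

    def on_scanline_read(self, unread_after_read):
        """Called after the 2D GPU reads one scanline from the buffer."""
        if self.tracker > unread_after_read:
            self.tracker = unread_after_read & 0x3F

    def on_vblank(self):
        """Latch the tracker into the readable value, then reset it."""
        self.readable = self.tracker
        self.tracker = RESET_VALUE  # assumption: the per-frame reset happens here
```

Feeding it 46, 40 and 44 unread scanlines over one frame leaves 40 in the readable value after VBlank.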

3DDispCnt Color Buffer Underflow Bit:
Bit 12 of the 3DDispCnt register.
The bit is set if any of the scanlines in the frame were timed out.
The bit appears to be set the moment this happens?
Once set, the bit can be cleared by writing a 1 to it.
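
The write-1-to-clear behavior can be sketched in Python like this. The names are made up, and all other bits are treated as plain read/write for simplicity, which is an assumption and not how the whole register necessarily works:

```python
UNDERFLOW_BIT = 1 << 12  # Color Buffer Underflow bit

class Disp3DCnt:
    def __init__(self):
        self.value = 0

    def on_scanline_timeout(self):
        """Hardware side: a scanline was timed out, so the bit is set."""
        self.value |= UNDERFLOW_BIT

    def write(self, data):
        """CPU side: writing 1 to bit 12 clears it, writing 0 leaves it alone."""
        flag = self.value & UNDERFLOW_BIT
        if data & UNDERFLOW_BIT:
            flag = 0
        # Assumption: every other bit simply takes the written value.
        self.value = (data & 0xFFFF & ~UNDERFLOW_BIT) | flag
```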

Misc Timing Info:

As mentioned earlier, one cycle in the context of the 3D GPU refers to 2 ARM7 cycles.

First Polygon Delays:
The first polygon in a scanline has an extra delay that no successive polygons seem to have.
This delay usually seems to be a minimum of 4 cycles (but not always?) and increases slightly depending on certain factors.
Seems to be based on slope?
Translucency blending?
Maybe some other stuff I’m not aware of?
I don’t really understand it yet.

Polygon delay: 12 cycles.
Each polygon takes at least 12 cycles to begin for each scanline it’s in.

Polygon Delay (Empty): 4 cycles.
The bottommost scanline of a polygon, which is normally not rendered, does in fact have timing characteristics.

Min Spare To Start Rasterizing: 2
There must be 2 spare cycles after the initial polygon delay for the polygon to be rendered.

Free Pixels: 4
The first 4 pixels of a polygon appear to be free? No idea why.

Cycles Per Pixel: 1
Each pixel costs 1 cycle.
Pixels that are behind another pixel still incur a cycle cost.
Pixels not rendered due to edge fill still incur a cycle cost.
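
Putting the numbers so far together, a very rough Python sketch of the cost of one polygon on one scanline. How the individual figures combine is my assumption, the first polygon delay is ignored since it isn’t understood yet, and all names are made up:

```python
POLYGON_DELAY = 12        # cycles before a polygon begins on a scanline
POLYGON_DELAY_EMPTY = 4   # cost of the unrendered bottommost scanline
MIN_SPARE = 2             # spare cycles needed after the polygon delay
FREE_PIXELS = 4           # the first pixels of a polygon appear to be free
CYCLES_PER_PIXEL = 1

def polygon_scanline_cost(width_px, empty=False):
    """Cycles one polygon costs on one scanline (hidden pixels count too)."""
    if empty:
        return POLYGON_DELAY_EMPTY
    paid_pixels = max(0, width_px - FREE_PIXELS)
    return POLYGON_DELAY + paid_pixels * CYCLES_PER_PIXEL

def can_rasterize(cycles_left):
    """True if there are enough cycles left to start rendering the polygon."""
    return cycles_left - POLYGON_DELAY >= MIN_SPARE
```

Under these assumptions a 10 pixel wide span costs 18 cycles, and a polygon needs at least 14 cycles left in the scanline to be rendered.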

Second Pass + Scanline Pushing Length: 500(?) Cycles.
To be honest, I’m not 100% sure about this number.

Delay Between Scanline Reads: 809 cycles. (?)
Probably correct?

Time To Read Scanline: 256 cycles.
Pixels seem to be read at a rate of 2 pixels (simultaneously?) every 2 cycles?
It’s odd. Also sometimes not true.(?)

Texture Lookup:
Not done simultaneously.
It seems like the two scanlines in a pair alternate every half-cycle (1 ARM7 cycle) between which one looks up textures.
