GBAtek addendum/errata


Arisotura
Posted on 04-11-17 04:16 PM (rev. 27 of 07-23-19 09:34 PM) Link | #87
GBAtek is an amazing piece of documentation, but it can be improved upon :)


This is a general pile of findings. I claim no ownership on those, they come from several individuals.



ROM transfers

* transfer time is 8 cycles for a command, 4 cycles per response word (basically 1 cycle per byte) (see ROMCTRL bit27 for cycle duration)
* plus start delay and 0x200-block delay at the start of each 0x200 block
* bit28 allows skipping incoming bytes automatically during delays
* DELAYS DO NOT APPLY WHEN THE WR BIT IS SET
* DRQ bit (bit23) is set once a response word has been transferred
* reading from 0x04100010 clears the DRQ bit, and:
** if the transfer is finished: clears the busy bit and triggers IRQ if specified
** if there are more words to transfer: begins transferring the next word
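
The timing rules above can be turned into a rough estimate. This is only a sketch: the function name and parameters are made up for illustration, and it assumes that the block delay applies to every 0x200-byte block (including the first) on top of the start delay, which the notes above do not spell out.

```c
#include <stdint.h>

/* Estimate the length of a cartridge ROM transfer, in transfer cycles
 * (the duration of one cycle is selected by ROMCTRL bit27).
 * response_bytes: size of the response, 0 if the command has none.
 * start_delay, block_delay: delays, expressed in the same cycle unit.
 * wr_bit: nonzero if the WR bit is set, in which case delays do not apply.
 * Assumption: the block delay is paid once per 0x200-byte block, including
 * the first block, in addition to the start delay. */
uint32_t rom_transfer_cycles(uint32_t response_bytes, uint32_t start_delay,
                             uint32_t block_delay, int wr_bit)
{
    uint32_t words  = (response_bytes + 3) / 4;         /* 4 cycles per word */
    uint32_t blocks = (response_bytes + 0x1FF) / 0x200; /* 0x200-byte blocks */
    uint32_t cycles = 8 + 4 * words;                    /* 8 cycles for the command */

    if (!wr_bit)
        cycles += start_delay + blocks * block_delay;
    return cycles;
}
```

For a single 0x200-byte block with no delays this gives 8 + 4*128 = 520 cycles.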


2D

* The main memory display FIFO is a simple circular buffer that holds 16 pixels. The video controller reads from it regardless of whether you fill it. It doesn't get 'empty' or 'full'.
* Writing to the upper halfword of 0x04000068 increments the FIFO write pointer by two (writes to the lower halfword leave it unchanged). The write pointer simply wraps to 0 when reaching the end of the buffer. It is also reset upon VBlank.
* 8-bit writes to 0x04000068 don't work well. TODO: figure out what's happening. eventually.
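
A minimal model of the write side of that FIFO, for 32-bit writes only. The struct and function names are hypothetical, and where exactly each halfword lands in the buffer is an assumption; only the pointer behaviour (advance by two, wrap at 16, reset on VBlank) comes from the notes above.

```c
#include <stdint.h>

#define DISP_FIFO_PIXELS 16

struct disp_fifo {
    uint16_t pixel[DISP_FIFO_PIXELS];
    unsigned write_ptr;
};

/* 32-bit write to 0x04000068: stores two pixels, then the write pointer
 * advances by two and wraps to 0 at the end of the buffer.
 * Assumption: the lower halfword goes to write_ptr, the upper to write_ptr+1. */
void disp_fifo_write32(struct disp_fifo *f, uint32_t value)
{
    f->pixel[f->write_ptr] = (uint16_t)(value & 0xFFFF);
    f->pixel[(f->write_ptr + 1) % DISP_FIFO_PIXELS] = (uint16_t)(value >> 16);
    f->write_ptr = (f->write_ptr + 2) % DISP_FIFO_PIXELS;
}

/* The write pointer is reset upon VBlank. */
void disp_fifo_vblank(struct disp_fifo *f)
{
    f->write_ptr = 0;
}
```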

* Colors are converted early from 5-bit to 6-bit, as such: 6bit = 5bit*2
* Color special effects (brightness, blending) are applied to the 6-bit color components
* In some cases, the MSb of color values is used as LSb for the green component. TODO: find out where this applies. Confirmed to apply to the standard BG palette.
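
As a sketch, the 5-bit to 6-bit expansion could look like this. The names are hypothetical, and since it is not yet known everywhere the green LSb rule applies, the caller decides via a flag.

```c
#include <stdint.h>

struct rgb6 { uint8_t r, g, b; };

/* Expand a 15-bit BGR color (plus bit15) to 6-bit components: 6bit = 5bit*2.
 * green_lsb_from_msb: nonzero where bit15 of the color is used as the LSb
 * of the green component (confirmed for the standard BG palette). */
struct rgb6 expand_color(uint16_t c, int green_lsb_from_msb)
{
    struct rgb6 out;

    out.r = (uint8_t)(((c >>  0) & 0x1F) * 2);
    out.g = (uint8_t)(((c >>  5) & 0x1F) * 2);
    out.b = (uint8_t)(((c >> 10) & 0x1F) * 2);
    if (green_lsb_from_msb)
        out.g |= (uint8_t)(c >> 15);
    return out;
}
```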

* Bitmap sprite blending follows the same rules as non-bitmap semitransparent sprites, with EVA=alpha+1 and EVB=16-EVA. Except: bitmap sprites with alpha=0 are always hidden.

* 3D layer blending follows rules similar to those of semitransparent sprites (only requires second target bits set in BLDCNT, overrides BLDCNT color effect selection and window 'enable color effect' setting where it applies).
* 3D layer blending uses 5-bit alpha values (from the 3D graphics), such as: EVA=alpha+1 and EVB=32-EVA.
* When the 3D alpha is less than 16, the final color components are incremented by one. (seems to be some hardware glitch??)
* 3D layer pixels with alpha=0 are always hidden (not rendered). They're preserved when capturing the 3D output alone, though.
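
A sketch of the 3D layer blend for one 6-bit color component. The coefficients and the +1 quirk are taken from the points above; dividing by 32 and clamping to 63 are assumptions, and the function name is made up.

```c
#include <stdint.h>

/* Blend one 6-bit component of a 3D layer pixel (top) over the pixel
 * below it (bottom). alpha is the 5-bit alpha from the 3D graphics;
 * pixels with alpha=0 are hidden and should not reach this function.
 * Assumptions: the weighted sum is divided by 32 and clamped to 63. */
uint8_t blend_3d_component(uint8_t top, uint8_t bottom, uint8_t alpha)
{
    unsigned eva = alpha + 1u;
    unsigned evb = 32u - eva;
    unsigned out = (top * eva + bottom * evb) / 32u;

    if (alpha < 16)   /* the apparent hardware glitch */
        out += 1;
    if (out > 63)
        out = 63;
    return (uint8_t)out;
}
```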

* BG mode 6 works on both GPUs. On the sub GPU, it only gets 128K of VRAM, so it will repeat the same bitmap 4 times.
* BG mode 7 renders (text-mode) BG0, BG1, and sprites. No BG2/BG3.

* large BG sizes 2-3 are the same as corresponding sizes for regular bitmap BGs (512x256, 512x512)


3D

* Shadow polygons can use textures. In that case, decal blending is applied.
* The stencil buffer can hold two scanlines. It's cleared only when the current scanline contains shadow mask polygons, before rendering a group of shadow mask polygons.
* Stencil buffer bits are set only where the shadow mask is drawn but fails the depth test.
* Visible shadow polygons (polyID>0) are only drawn where stencil buffer bits are set and where the destination pixel's polygonID is different from that of the shadow, regardless of whether that pixel was translucent.

* Drawing visible shadow polygons supposedly resets the stencil bits. TODO: check. I guess not.

* Toon highlight mode uses the following formula: (GBAtek is wrong)
v=vertex t=texture s=tooncolor=toontable[Rv]
R = ((Rt+1)*(Rv+1)-1)/64+Rs ;truncated to MAX=63
G = ((Gt+1)*(Rv+1)-1)/64+Gs ;truncated to MAX=63
B = ((Bt+1)*(Rv+1)-1)/64+Bs ;truncated to MAX=63
A = ((At+1)*(Av+1)-1)/64
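
The same formula as C, as a sketch. The struct and function names are hypothetical; the toon color is passed in by the caller (already looked up from the toon table using the vertex red), and all inputs are assumed to be in the 0..63 range the formula implies.

```c
#include <stdint.h>

struct rgba6 { uint8_t r, g, b, a; };

static uint8_t clamp63(unsigned v)
{
    return (uint8_t)(v > 63 ? 63 : v);
}

/* Toon highlight mode. tex = texture color, vtx_r/vtx_a = vertex red and
 * alpha, toon = toon table entry selected by the vertex red.
 * Note that all three color channels use the vertex red (Rv). */
struct rgba6 toon_highlight(struct rgba6 tex, uint8_t vtx_r, uint8_t vtx_a,
                            struct rgba6 toon)
{
    struct rgba6 out;

    out.r = clamp63(((tex.r + 1u) * (vtx_r + 1u) - 1u) / 64u + toon.r);
    out.g = clamp63(((tex.g + 1u) * (vtx_r + 1u) - 1u) / 64u + toon.g);
    out.b = clamp63(((tex.b + 1u) * (vtx_r + 1u) - 1u) / 64u + toon.b);
    out.a = (uint8_t)(((tex.a + 1u) * (vtx_a + 1u) - 1u) / 64u);
    return out;
}
```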

* Translucent pixels are only drawn where the destination pixel has a different polygonID OR where the destination pixel was opaque.

* for each separate polygon, W values are 'normalized', they're collectively shifted left or right by 4 until they all fit within 16 bits (if they fit within 12 bits or less, they can be shifted left to use the 16-bit range better)
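
One way to read the W normalization, as a sketch. The function name is hypothetical, and the exact stop conditions (shift right while the largest W exceeds 16 bits, otherwise shift left while it still fits in 12 bits) are an interpretation of the note above, not confirmed behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Normalize the W values of one polygon in place, in steps of 4 bits.
 * Returns the net shift applied (positive = left, negative = right). */
int normalize_w(uint32_t *w, size_t count)
{
    uint32_t max = 0;
    int right = 0, left = 0;
    size_t i;

    for (i = 0; i < count; i++)
        if (w[i] > max)
            max = w[i];
    if (max == 0)
        return 0;               /* nothing to normalize */

    /* Shift right until the largest value fits within 16 bits. */
    while ((max >> right) > 0xFFFFu)
        right += 4;

    /* If everything fits within 12 bits, shift left to use more of the
     * 16-bit range. */
    if (right == 0)
        while ((max << left) < 0x1000u)
            left += 4;

    for (i = 0; i < count; i++)
        w[i] = (w[i] >> right) << left;
    return left - right;
}
```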

* conversion for Z values:
** Z-buffering: zbuf = (((Z * 0x4000) / W) + 0x3FFF) * 0x200 (using original W)
** W-buffering: zbuf = W (but it appears to use normalized W)

* conversion for clear depth:
** clearZ = (val * 0x200) + 0x1FF
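
The two conversions as C, as a sketch. Function names are hypothetical. The handling of W=0 and the rounding of the division for negative Z (C truncates toward zero) are assumptions, as is masking the clear depth to the 15 bits GBAtek documents.

```c
#include <stdint.h>

/* Z-buffering: zbuf = (((Z * 0x4000) / W) + 0x3FFF) * 0x200, using the
 * original (not normalized) W. A 64-bit intermediate avoids overflow.
 * Assumption: W=0 is treated as if Z/W were 0. */
int32_t zbuf_from_z(int32_t z, int32_t w)
{
    int64_t q;

    if (w == 0)
        return 0x3FFF * 0x200;
    q = ((int64_t)z * 0x4000) / w;
    return (int32_t)((q + 0x3FFF) * 0x200);
}

/* Clear depth: clearZ = (val * 0x200) + 0x1FF. */
uint32_t clear_depth_to_zbuf(uint16_t val)
{
    return (uint32_t)(val & 0x7FFF) * 0x200 + 0x1FF;
}
```

With the maximum clear depth of 0x7FFF this yields 0xFFFFFF, the largest 24-bit depth value.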

* There are special depth-test rules for polygon borders. TODO: work it out.
** it seems to only apply to wireframe polygons
** when Z values are equal, left edges have priority over right edges, and top edges have priority over bottom edges (TODO: check wireframe vs normal)
** 'less or equal' depth test has no margin
** (dunno about other orders but they should use the regular rules. Y-sorting gets in the way)

* Cases where 'less than' depth test becomes 'less or equal':
** wireframe polygon borders as mentioned above
** apparently, polygon borders in some other cases too
** when rendering frontfacing polygon pixels over existing opaque backfacing polygon pixels

* in W-buffering mode, 'equal' depth test mode has a margin of 0xFF in either direction. That is, for example, incoming Z range of 0x100-0x2FE is considered equal to an existing Z-buffer value of 0x1FF.
* in Z-buffering mode, margins are +-0x200.
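
The 'equal' test with its margins could be sketched like this. The function name is hypothetical; the W-buffering bounds follow the 0x100-0x2FE example above, while treating the Z-buffering margin of 0x200 as inclusive is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* 'Equal' depth test: the incoming depth passes if it lies within a
 * margin of the existing Z-buffer value.
 * W-buffering: margin of 0xFF in either direction.
 * Z-buffering: margin of 0x200 (assumed inclusive). */
bool depth_equal(int32_t incoming, int32_t existing, bool w_buffering)
{
    int32_t margin = w_buffering ? 0xFF : 0x200;
    int32_t diff = incoming - existing;

    if (diff < 0)
        diff = -diff;
    return diff <= margin;
}
```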

* PUSH/POP/STORE/RESTORE to the modelview matrix always apply to the vector matrix too, even in matrix mode 1.
* "NORMAL/VEC_TEST require matrix mode 2" <- wrong. They work the same regardless of the matrix mode.

* edge marking
Posted by GBAtek
Technically, when rendering a polygon, it's edges (ie. the wire-frame region) are flagged as possible-edges (but it's still rendered normally, without using the edge-color). Once when all opaque polygons (*) have been rendered, the edge color is applied to these flagged pixels, under following conditions: At least one of the four surrounding pixels (up, down, left, right) must have different polygon_id than the edge, and, the edge depth must be LESS than the depth of that surrounding pixel (ie. no edges are rendered if the depth is GREATER or EQUAL, even if the polygon_id differs). At the screen borders, edges seem to be rendered in respect to the rear-plane's polygon_id entry (see Port 4000350h).

-> polygon ID rule for screen edges confirmed
-> at screen edges, the aforementioned depth test uses CLEAR_DEPTH (when testing against a pixel that would be offscreen), even when using a clear bitmap
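
Put together, the edge marking condition might be sketched as below. The types and names are hypothetical; for neighbours that would be offscreen, the caller is expected to pass the rear-plane polygon ID and CLEAR_DEPTH, as confirmed above.

```c
#include <stdbool.h>
#include <stdint.h>

struct edge_px {
    uint8_t  poly_id;
    uint32_t depth;
};

/* Decide whether a pixel flagged as a possible edge gets the edge color.
 * neighbours: the up, down, left and right pixels. For a neighbour that
 * would be offscreen, fill in the rear-plane polygon ID and CLEAR_DEPTH. */
bool edge_is_marked(struct edge_px edge, const struct edge_px neighbours[4])
{
    int i;

    for (i = 0; i < 4; i++) {
        if (neighbours[i].poly_id != edge.poly_id &&
            edge.depth < neighbours[i].depth)
            return true;
    }
    return false;
}
```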

* antialiasing
** seems to be calculated from edge slopes
** topmost two pixels are retained, antialiasing blends them together including alpha (except color isn't blended when the pixel below has alpha=0)
** during rendering, if an incoming pixel fails the depth test with the topmost pixel, it is checked against the pixel below


DMA

* ARM9 DMA start mode 3 is similar to the GBA's 'video capture' DMA, although GBAtek doesn't make it obvious. It is triggered at the start of each scanline from 2 to 193. The enable bit is automatically cleared on scanline 194.
* TODO: find when 'main memory display' DMA starts. Probably 8 pixels (48 cycles) in advance from the actual display. -- DMA starts ~32 cycles after the start of the scanline. Actual display starts ~48 cycles after the start of the scanline.


Sound

* repeat mode 3 behaves same as mode 1 (loops)
* TODO: check to see what can be changed while a channel is playing. Format can be changed and that's a whole fucking pile of things to check.

____________________
Kuribo64

TechnoNightz
Posted on 07-12-18 08:43 AM Link | #625
I'd actually like to see an open-source documentation (perhaps a melonDS wiki) on this. As you say GBATEK is great but yeah.

Also you did a nice job explaining things, you've always been clear ^^

PoroCYon
Posted on 12-01-19 12:59 PM (rev. 5 of 11-16-20 05:38 PM) Link | #1406
Some that need to be confirmed:

DSP

  • DSP_PSTS bits 10..12 (REP0..REP2) are active-high (as in, 1=was written by DSP), while GBAtek says they're active-low

  • DSP_PCFG bits 12..15 have an undocumented transfer mode (7: ARM9 bus loopback): transfers to/from the ARM9 bus, cf. DSP-internal DMA transfer mode 7. This mode requires some additional setup: you first need to set the following DSP-internal DMA registers to the following values (using transfer mode 1):

    [0x81BE] = 0 // select channel 0
    [0x81C6] = 0xABCD // destination address, high 16-bit
    [0x80E2] = 0 | (0<<1) | (1<<4) // configure AHBM (DSP->ARM9 DMA) // example value works for 16-bit transfers (see GBATek/Teakra docs for details)
    [0x80E4] = (1<<9) | (1<<8) // resp. mandatory bit, direction (0=read, 1=write) (see GBATek/Teakra docs for details)
    [0x80E6] = 1 // enable channel 0

    Then perform a transfer as follows (the example writes 0x1337 to 0xABCDEF98):

    DSP_PADR = 0xEF98 // destination address, low 16-bit
    DSP_PDATA = 0x1337 // write this register for a write; read from it instead for a read

Chagall
Posted on 09-15-20 06:38 PM Link | #2336
Rumble pak detection needs bit 6 set in addition to bit 1 reset (GBAtek only mentions bit 1).
Found empirically: https://github.com/Arisotura/melonDS/pull/719#discussion_r474976353
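
As a sketch, the combined check is a single mask-and-compare. The function name is made up, and `probe` stands for whatever value the detection routine reads; the post does not say which, so treat it as a placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rumble pak detection: bit 6 must be set and bit 1 must be clear.
 * probe: the value read during detection (placeholder, see above). */
bool looks_like_rumble_pak(uint16_t probe)
{
    return (probe & 0x0042) == 0x0040;
}
```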

Generic aka RSDuck
Posted on 11-15-20 12:02 PM (rev. 20 of 07-11-21 02:24 PM) Link | #2756
there is a power saving mode for the wifi in which it stops receiving and transmitting data.

W_MODE_WEP:
bit0-3 is set to 1: bit1 of W_POWERUNK is set
bit0-3 is set to 2: bit0-1 of W_POWERUNK are set (i.e. its value becomes 3)
if bit0-3 is set to either of those values and bit1 of W_POWER_TX is set, the corresponding bit(s) of W_POWERUNK cannot be changed manually (I haven't checked if setting W_MODE_WEP while W_POWER_TX bit1 is set makes a difference).

W_MODE_RST:
Only bit0 is relevant for power saving.
Changing bit0 from 0 to 1:
if bit1 of W_POWERUNK is set, bit8 of W_POWERSTATE is set. A short while later bit9 changes to 0, and if bit0 of W_POWERUNK is set at this point in time, bit8 is set to 0

Changing bit0 from 1 to 0:
returns back to power saving mode immediately.

W_POWERSTATE:
bit8 seems to indicate some other state related to power saving mode
bit9 seems to indicate normal power saving mode
while at least one of bit8 and bit9 is 1, power saving mode is enabled

Once that 2048 us power up period (initialised e.g. by setting bit0 of W_MODE_RST from 0 to 1) is over: when bit0-1 is 2 (the exact purpose of bit0 is still unknown) and bit8 is set, power saving mode is left and bit1 is set to 0 (thus the register value will be 0).
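
For emulator purposes, the W_POWERSTATE observations so far could be captured roughly like this. It is only a sketch of the behaviour described above, with made-up function names, and it leaves out everything that is still unknown (such as the purpose of bit0).

```c
#include <stdbool.h>
#include <stdint.h>

/* Power saving mode is enabled while at least one of bit8/bit9 is set. */
bool wifi_power_saving(uint16_t w_powerstate)
{
    return (w_powerstate & 0x0300) != 0;
}

/* Value of W_POWERSTATE once the 2048 us power-up period is over:
 * if bit0-1 is 2 and bit8 is set, the register becomes 0 (power saving
 * mode is left); otherwise it is returned unchanged. */
uint16_t wifi_powerstate_after_powerup(uint16_t w_powerstate)
{
    if ((w_powerstate & 0x0003) == 2 && (w_powerstate & 0x0100))
        return 0;
    return w_powerstate;
}
```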

W_POWERFORCE:
setting bit15 and bit0 (=0x8001) switches to power saving mode (it also switches to regular power saving mode from bit8 power saving mode)
setting bit15 and not bit0 (=0x8000) initiates waking up from power saving mode (like always first W_POWERSTATE bit8 is set then the register is cleared). Neither W_POWER_UNK nor bit2 of W_POWERSTATE are relevant for this.
having bit15 set to 0 and bit1 (=0x0001) of W_POWERSTATE initiates waking up from power saving mode (independently of bit1 of W_POWER_UNK), so bit8 of W_POWERSTATE is set. If bit0-1 of W_POWERSTATE is 2 once that process is done, W_POWERSTATE becomes 0 (as already described in its section); otherwise W_POWERSTATE stays 0x100. This doesn't work while bit0 of W_MODE_RST=0.

Note that when switching power saving mode off via W_POWERFORCE it acts similarly to setting W_MODE_RST bit0 from 0 to 1, but the value of W_POWER_UNK doesn't matter.

Leaving power saving mode takes about 2048 microseconds (measured with the wifi timer register).

Both methods of leaving power saving mode will first trigger IRQ11.

After entering power saving mode: RFStatus: 9 RFPins: 0x4
After leaving power saving mode: RFStatus: 1 RFPins: 0x84

There's probably still a lot more to unpack. What is disabled and what is not is currently an educated guess, and the consequences of writing 0x8000 to W_POWERFORCE mentioned in GBAtek are still to be explored. Also, GBAtek says on bit1 of W_POWERSTATE: "Note: That queue stuff seems to work only if W_POWER_US=0 and W_MODE_RST=1.". I haven't tried bringing W_POWER_US into this yet.

What does power saving mode do exactly?

An existing program which utilises wifi I tried continued to work fine, except that nothing was being received anymore. W_CONTENTFREE and W_US_COUNT seemed to still count down. My guess is that only the antenna hardware is affected.

In regards to transmission: what happens seems to be that upon requesting to send data just nothing happens. So W_TXBUSY is never flagged nor is W_TX_SEQNO incremented. A request made in power saving mode will be discarded, i.e. not sent once power saving mode is off.

Not implementing power saving mode can lead to freezes in the Pokemon games. See here: http://melonds.kuribo64.net/board/thread.php?pid=2314#2314

____________________
Take me to your heart / never let me go!

"clearly you need to mow more lawns and buy a better pc" - Hydr8gon

PoroCYon
Posted on 11-16-20 05:30 PM (rev. 9 of 03-01-21 04:21 PM) Link | #2762

IR cartridges

IR cartridges seem to work as follows, but I'd like to have someone else to verify this (seems to work with HGSS, BW and B2W2 carts, idk about others):

Everything automatically happens at 115200 baud, 8n1.

There seem to be three main SPI commands that are sent to what normally would be the savegame SPI bus, there's a fourth command to perform actual savegame operations. All transfers happen at 1 MHz (serial AUXSPI mode) unless indicated otherwise.

The cartridge needs to be powered on, but nothing more besides this. No header reading or KEY1/KEY2 init, and so on. (I rebooted the cart with SCFG_MC and started doing SPI commands immediately afterwards, seems to work fine).

This seems to be relevant for pretty much all NTR-031 carts, so Pokémon HGSS, BW, B2W2, "Walk With Me" and similar games, ...

The commands:
  • 0x00: savegame escape byte: as long as chip select is held, the bytes that follow will be treated like a regular savegame transfer. These can also happen at any clock speed, but the 0x00 byte itself needs to be transferred at 1 MHz. (This bit was already known.)
  • 0x01: receive data from IR: one command byte (0x01) is written, after which data bytes are read by the DS. The first byte read indicates the amount of bytes that will follow. It doesn't seem to be able to receive more than 255 bytes afaics. Bytes written to perform the reads are unused as far as I know, but usually set to 0 (HGSS does this, at least).
    • When there are zero bytes to read, you still have to deselect the SPI chip, or the next transfer will fail. Disable the 'chip select hold' bit in AUXSPICNT and send a zero byte.
  • 0x02: send data over IR: this one has no length prefix, chip select is used to determine when the transfer ends, as usual.
  • 0x03: write byte to in-cart IR MCU RAM: send high addr byte and low addr byte, then send a data byte. Writes the data byte to the specified address in the in-cartridge IR MCU. Discovered by nocash.
  • 0x04: read byte from in-cart IR MCU RAM: send high addr byte and low addr byte, then read a data byte. Can be used to dump the ROM inside the in-cart IR MCU. Discovered by nocash.
  • 0x05: write word to in-cart IR MCU RAM: send high addr byte and low addr byte, then send two data bytes. Discovered by nocash.
  • 0x06: read word from in-cart IR MCU RAM: send high addr byte and low addr byte, then read two data bytes. Discovered by nocash.
  • 0x07: mystery command, purpose unclear. Discovered by nocash.
  • 0x08: not too sure about this, but probably a status thing. A command byte (0x08) is sent, and a status byte(?) is received from the cart. HGSS seems to always send two of these one after another, carts seem to return 0x00 on the first one and 0xaa for the second, unless other IR devices are sending actively, then both bytes are 0xaa. Allegedly, "Walk With Me" and similar games don't have this command.
HGSS, while trying to connect to a Pokéwalker, seems to first do a cmd 0x01, which returns 0 bytes, then 0x08 twice, after which it repeatedly issues other 0x01 commands until the return data of one of these indicates a Pokéwalker presence (the Pokéwalker sends out a fixed byte value as 'beacon' thing, the game will receive a 1-byte packet containing that beacon). 0x08 is never used again after being used twice in the beginning as far as I can see (but I might be wrong).
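
A sketch of the receive command (0x01) as described above, including the zero-length case. The SPI callback type and all names are hypothetical; on real hardware the callback would drive AUXSPICNT/AUXSPIDATA at 1 MHz, while here a small fake device stands in so the logic can be tried on a PC.

```c
#include <stddef.h>
#include <stdint.h>

enum { IR_CMD_RECEIVE = 0x01 };

/* Hypothetical SPI hook: shifts one byte out and returns the byte shifted
 * in. hold_cs != 0 keeps chip select asserted after this byte. */
typedef uint8_t (*spi_xfer_fn)(void *ctx, uint8_t out, int hold_cs);

/* Issue command 0x01 and read the packet. Returns the number of bytes
 * stored in buf (at most cap). */
size_t ir_receive(spi_xfer_fn xfer, void *ctx, uint8_t *buf, size_t cap)
{
    size_t len, i;

    xfer(ctx, IR_CMD_RECEIVE, 1);
    len = xfer(ctx, 0x00, 1);       /* first byte read = number of bytes */
    if (len == 0) {
        xfer(ctx, 0x00, 0);         /* still deselect: zero byte, no hold */
        return 0;
    }
    for (i = 0; i < len; i++) {
        uint8_t b = xfer(ctx, 0x00, i + 1 < len);  /* drop CS on last byte */
        if (i < cap)
            buf[i] = b;
    }
    return len < cap ? len : cap;
}

/* Fake device for host-side testing only (not part of any real API). */
struct fake_ir {
    const uint8_t *reply;   /* length byte followed by the payload */
    size_t reply_len;
    size_t pos;
    int cs_held;
};

uint8_t fake_ir_xfer(void *ctx, uint8_t out, int hold_cs)
{
    struct fake_ir *d = (struct fake_ir *)ctx;
    uint8_t in = 0xFF;      /* junk while the command byte is being sent */

    (void)out;
    if (d->pos > 0)
        in = (d->pos - 1 < d->reply_len) ? d->reply[d->pos - 1] : 0;
    d->pos++;
    d->cs_held = hold_cs;
    if (!hold_cs)
        d->pos = 0;         /* deselect ends the transfer */
    return in;
}
```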

Allegedly, the chip in carts responsible for IR is another H8/38606F (or 38602F?) (connected to the SPI bus on one side, and to some IR leds or so on the other).

[UPDATE: 2021-03-01: added info on cmd 0x03..0x07, and non-Pokémon games. info from nocash, not me.]

____________________
TiTAN Forever

PoroCYon
Posted on 02-01-21 12:10 AM (rev. 9 of 02-01-21 02:35 PM) Link | #3240
Recently I've done some timing tests with the DSi's NDMA units. These are all my findings:

NOTE: all testing has been done as main RAM->main RAM (functional), or as TIMER0_DATA->main RAM (priority/timing). CPU timing comparison was done with TIMER0_DATA->DTCM.

I did not yet test the interaction between the DSP's DMA capabilities, NDMA, and the ARM9.

Function


  • GCNT bit 0 does nothing. 3dbrew made it sound like it could be used to enable/disable NDMA globally, but it doesn't seem to do anything?
  • inc/dec/fix modes, FDATA work as expected.
  • Both the physical and logical block counts are just word sizes (or log2 thereof in the case of the physical block count). The logical block count (WCNT) is just the total amount of words transferred, it's not used as a multiple of the physical block count, it's only a word count. It also doesn't have to be a multiple or be aligned or anything like that.
  • When accessing peripherals/devices, WCNT signifies the number of words that should be transferred by one peripheral event (when using the corresponding startup mode), and before the next peripheral event. When the device uses a FIFO, this register should contain the size of the FIFO, in words.
  • The timer (BCNT) doesn't implement any sort of timeout/deadline/... for the transfer to finish or it'll be cancelled, it's only for inserting delays. (Not sure why I tested this, I might've just been confused by the name.) It is meant to insert delays, see below for more details on that.
  • TCNT RESET mode does indeed reset the src or dest address after each logical block, not after each physical block.

Access ordering and priority


  • NDMA has priority over old DMA. This is especially visible when there are multiple ODMA and NDMA channels waiting to be resumed as soon as one NDMA channel is suspended through BCNT: first the NDMA channels will be picked (first to last), and only then the ODMA channels. ODMA transfers cannot be suspended during their entire lifetime, however, not even by GCNT round-robin mode (unlike NDMA). That is, an ODMA transfer acts like a single NDMA physical block transfer.
  • Letting NDMA channels suspend using BCNT seems to restart the next available one. GCNT setting seems to change nothing in this situation
  • GCNT does indeed seem to be made to let the CPU have some time to do stuff while DMA is running.
  • Scheduling between NDMA channels (BCNT) and between NDMA and the CPU (GCNT) seems to happen according to the following rules:
    • Under NO circumstances is a physical block transfer interrupted, paused, aborted, or anything. It always completes once it has started, regardless of whatever may happen (except maybe a full power down of the entire SoC).
    • NDMA startup seems to have a few cycles delay, enough for the CPU to access the bus once or twice (in my case, copy TIMER0_DATA to somewhere in main RAM). Sadly, I don't have an exact number. (However, this seems to contradict the earlier observations wrt. ODMA vs NDMA startup priority behavior?) At this point, the "GCNT timer" starts.
    • Once a physical block has been transferred, the channel suspends for the time specified in BCNT (or, if it is 0, immediately continues with the next block). If the next channel is enabled, this one will start transfers. If no others are enabled, the CPU is now allowed to master the bus. This is true even if GCNT is set to highest-priority!
    • Once the "GCNT timer" expires, first, the hardware waits for the current physical block to finish transferring (if there is any currently happening), once that's finished, the CPU is allowed to run (for the same amount of cycles as specified in GCNT? I think it is). If all channels are either inactive or suspended due to BCNT, the CPU is resumed immediately.
    • In other words, only BCNT controls the priority and switching between NDMA channels, only GCNT controls switching between NDMA globally and the CPU. GCNT knows nothing about scheduling between DMA channels themselves, BCNT however, does seem to be at least slightly aware that it can relinquish control over to the CPU when all NDMA channels are either suspended or inactive. GCNT switches bus master every period, while BCNT causes a pause after every physical block transfer, it doesn't do a suspend/resume after every timer period.
  • As every channel is at least suspended for the amount of time specified in BCNT, channels can be fired in at least a round-robin fashion. However, when the BCNT period is smaller than the time it takes for a physical block to be transferred, the lowest pending channel is always resumed first, (regardless of GCNT setting)(?).

Timing stuff


  • NDMA seems to copy data as fast as ODMA (often pretty much exactly the same speed). However, NDMA seems to have slightly less variance on the timing in some tests, but more in others?.
  • Physical block sizing/moving to the next one doesn't seem to cause any delays (as in, there's no delay between finishing one physical block and starting the next one), or at least not as far as I've noticed. This is also true for writing to main RAM, so a new physical block is still considered sequential access.
  • Main RAM writes are slow, lmao (at least when comparing main RAM->VRAM and VRAM->main RAM copies, VRAM->main seems to need 40% extra cycles (60k vs 100k timer ticks) for a copy of bank A).
  • Pausing transfers using BCNT between physical blocks does seem to insert delays, however, with a VRAM->main RAM copy, these seem to be disproportionate to the actual delay inserted. Therefore, this probably does cause new nonsequential accesses on the start of the next block. HOWEVER, timing variability seems to go down DRASTICALLY with higher BCNT values!
    • `BCNT=0 P16 32kwords vram2main` -> `N=100897 sigma=4005`
    • `BCNT=1 P16 32kwords vram2main` -> `N=110634 sigma=1635`: higher than expected
    • `BCNT=8 P16 32kwords vram2main` -> `N=118801 sigma= 597`: lower! than expected
    • `BCNT=16 P16 32kwords vram2main` -> `N=135170 sigma= 255`: lower! than expected
    • `BCNT=0 P16 32kwords main2vram` -> `N= 67846 sigma=3128`
    • `BCNT=1 P16 32kwords main2vram` -> `N= 83990 sigma= 920`: higher than expected
    • `BCNT=8 P16 32kwords main2vram` -> `N= 94221 sigma= 784`: lower! than expected
    • `BCNT=16 P16 32kwords main2vram` -> `N=110595 sigma= 463`: lower! than expected
    • `BCNT=0 P16 32kwords vrmA2vrmB` -> `N= 65623 sigma=3112`
    • `BCNT=1 P16 32kwords vrmA2vrmB` -> `N= 69651 sigma= 816`: ok
    • `BCNT=8 P16 32kwords vrmA2vrmB` -> `N= 83974 sigma= 226`: ok
    • `BCNT=16 P16 32kwords vrmA2vrmB` -> `N=100352 sigma= 447`: ok, WTF stddev?
    Sample size of 256 for all cases. `P16` means a physical block size of 16 words. Units of `N` and `σ` are timer 0 ticks (clockdiv set to 1, so 33 MHz).
  • GCNT round-robin mode timing does seem to add delays, linear(?) to the amount of cycles specified in GCNT (so exponential to the actual value in that register).
    • `GCNT=1<<0 P1 32kwords vrmA2vrmB` -> `N=65630 sigma= 3543`
    • `GCNT=1<<1 P1 32kwords vrmA2vrmB` -> `N=65625 sigma= 3216`
    • `GCNT=1<<4 P1 32kwords vrmA2vrmB` -> `N=65623 sigma= 3117`
    • `GCNT=1<<8 P1 32kwords vrmA2vrmB` -> `N=65680 sigma= 3202`
    • `GCNT=1<<a P1 32kwords vrmA2vrmB` -> `N=66017 sigma= 4176`
    • `GCNT=1<<c P1 32kwords vrmA2vrmB` -> `N=67530 sigma= 5084`: +3.125% cf. 0
    • `GCNT=1<<e P1 32kwords vrmA2vrmB` -> `N=73650 sigma= 9418`: +12.5%
    • `GCNT=1<<f P1 32kwords vrmA2vrmB` -> `N=81810 sigma=16886`: +25%
  • FDATA/FILL mode doesn't incur an access penalty, unlike ODMA. It's fast. Clearing an entire VRAM bank takes 32k cycles, probably fewer when you enable the 32-bit VRAM bus enhancement.
  • Just like other types of accesses/DMA, copies with the source and destination within the same region (eg. both in main RAM, both in the same VRAM or (N)WRAM bank) are really slow (intra-main RAM can be up to 6 times slower than a VRAM-to-main RAM copy, intra-VRAM A takes 1.5x the time of a VRAM A -> VRAM B copy). The physical block size, again, has no effect on whether an access is deemed sequential or nonsequential.
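
To compare the measured tick counts in the GCNT round-robin list with the percentages noted next to them, a small helper is enough. This is just arithmetic on the numbers listed above; the function name is made up, and taking the GCNT=1<<0 row (N=65630) as the baseline is an assumption.

```c
#include <stdint.h>

/* Slowdown of a measured tick count relative to a baseline, in tenths of
 * a percent, rounded down. Returns 0 if ticks is not above the baseline. */
uint32_t overhead_permille(uint32_t ticks, uint32_t baseline)
{
    if (baseline == 0 || ticks <= baseline)
        return 0;
    return (uint32_t)(((uint64_t)(ticks - baseline) * 1000u) / baseline);
}
```

Against that baseline, the measured values for 1<<c, 1<<e and 1<<f come out at roughly 2.8%, 12.2% and 24.6%, slightly below the +3.125%, +12.5% and +25% annotations.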



Rayyan
Posted on 04-09-21 10:39 AM (rev. 2 of 04-09-21 10:42 AM) Link | #3572

Makercodes:

00 = Homebrew
01 = Nintendo
41 = Ubisoft Entertainment

(will add to this later)

____________________

How to write an emulator
1. throw code to be emulated somewhere
2. make memory system that allows accessing that code
3. emulate CPU
4. have fun implementing all the other hardware
-- Arisotura, Tuesday 5th January 2021, 22:00:17



Mighy Max
Posted on 07-08-21 04:04 AM Link | #3974
Addendum to NWRAM:

Priorities between the sets of WRAM from highest to lowest.
They are the same for reading and writing:

  • NWRAM Set A
  • NWRAM Set B
  • NWRAM Set C
  • Static Arm7 WRAM
  • Shared WRAM

Within a set of WRAM, the parts can be set to lay on top of each other.
The lowest part has the highest read priority.

Write Priority is special. Writing to overlapping parts, writes to them all at once!
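
The read/write behaviour within one set can be modelled for a single address like this. It is a sketch with hypothetical names; "lowest part" is taken to mean the lowest part index, and the part count is just a placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

#define NWRAM_PARTS 4   /* placeholder part count */

/* One byte-sized location as seen through a set's window. */
struct nwram_loc {
    bool    mapped[NWRAM_PARTS];  /* part i is overlaid on this address */
    uint8_t data[NWRAM_PARTS];    /* backing byte of each part */
};

/* Read: the lowest mapped part wins. Returns false if nothing is mapped. */
bool nwram_read(const struct nwram_loc *l, uint8_t *out)
{
    int i;

    for (i = 0; i < NWRAM_PARTS; i++) {
        if (l->mapped[i]) {
            *out = l->data[i];
            return true;
        }
    }
    return false;
}

/* Write: every overlapping part receives the value at once. */
bool nwram_write(struct nwram_loc *l, uint8_t value)
{
    bool hit = false;
    int i;

    for (i = 0; i < NWRAM_PARTS; i++) {
        if (l->mapped[i]) {
            l->data[i] = value;
            hit = true;
        }
    }
    return hit;
}
```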

Result of a testcase on HW:
[thumbnail]

Arisotura
Posted on 07-08-21 04:08 AM (rev. 2 of 07-08-21 04:08 AM) Link | #3975
Posted by Mighy Max
Write Priority is special. Writing to overlapping parts, writes to them all at once!

interesting!

have you checked that reading from overlapping parts doesn't OR them together? like it does for VRAM.

also, the NWRAM regions could theoretically overlap I/O space, but in practice I doubt that works


Mighy Max
Posted on 07-09-21 03:35 AM (rev. 2 of 07-09-21 04:15 AM) Link | #3977
Posted by Arisotura
interesting!

have you checked that reading from overlapping parts doesn't OR them together? like it does for VRAM.


Yes, that is the first part of the check. I write an initial individual value to all parts and overlay them. Afterwards I read the value. The part containing that value is then moved away and the read repeats, until all parts are done.
See the line WRAM Bank Read Priorities, which notes which line is moved in which order.

Also checked for the different sets. The WRAM Write Priority line shows which set really got written to, when the windows overlap: the write and read only applies to the highest priority.

also, the NWRAM regions could theoretically overlap I/O space, but in practice I doubt that works



Yes. But I do not yet have a safe method to test this behavior. And I doubt this is something that is used intentionally .... I thought about using it to blend over the Wifi RAM at 04800000h but you can't specify a window start that high, only the window end can be in the 04... range.

Will make a repository of the NWRAM Testcase. It was first only a way to verify I understood the access found in my reversing of the stage 2 loader.



Edit: Testcase source is here: https://github.com/DesperateProgrammer/DSiTestCases



Arisotura
Posted on 07-09-21 05:04 AM Link | #3978
re: leaking NWRAM over I/O

I think the way memory mapping on the DS works makes that impossible anyway


PoroCYon
Posted on 07-09-21 08:20 AM Link | #3979
In the past I once tried overlapping NWRAM and the IO space (by the request of, either profi200 or normmatt iirc), and reads always returned IO stuff. maybe a write would go to both, didn't test that, but I think it did end up at at least the IO registers, iirc. (Testing wasn't very thorough, though, we mainly wanted to know if we could exploit possible IO r/w redirection stuff, and the result is 'no'.)


Mighy Max
Posted on 07-10-21 02:52 PM (rev. 4 of 07-10-21 03:05 PM) Link | #3989
Posted by PoroCYon
maybe a write would go to both, didn't test that, but I think it did end up at at least the IO registers, iirc. (Testing wasn't very thorough, though, we mainly wanted to know if we could exploit possible IO r/w redirection stuff, and the result is 'no'.)


I got curious today and wanted to check, if we can find out by either write or read through happening, if a HW register is actually present at a given address.
Unfortunately this is not the case. The area 0x04.... is completely prioritized by the HW Registers. No NWRAM write passes nor any read operation shines through.

But this makes the emulation simpler for melonDS, as NWRAM does not need to be implemented on this region.

Besides this test, there is another line of reasoning for why Bit28 is unlikely to extend the range:
It would require the Bus Client for the NWRAM to do a sub instead of a mask (you can just mask with a fixed 0x03000000), but if you allow 03 and 04 you need to sub 1 to create such a mask, or create a mask for each region and OR them. Both ways slow down the circuit and increase its gate count for no apparent reason.
:edit: rethought this and it already has to do a sub

But that opens some other questions i would like to check:
- is Bit28 really writeable? - did not check if the info on GBATEK is correct here
- if this bit is writeable, does it have any other function than that in GBATEK? Does it change the priority, the fallthrough, timing?

Any other suggestions or ideas what this bit could control instead of an increased mapping region?

For completeness and crosschecking, the simple test I made is below. It shows 0 counts in both output lines.
Image Size of set A was set to 64k from the previous tests.
/*****************************************************************************
 * Test HW Registers via overlay.
 * Check if the NWRAM below HW registers shows through, if there is no
 * physical register present
 *****************************************************************************/

int count = 0;

/* Fill set A with a marker while it is mapped just below the I/O region. */
WRAMSetWindow(0, 0x03FF0000, 0x04000000);
memset((void *)0x03FF0000, 0xAA, 0x10000);

/* Extend the window end over the I/O region and look for the marker. */
WRAMSetWindow(0, 0x03FF0000, 0x04FF0000);
for (int i = 0; i < 0x0ff0000; i++)
{
    uint8_t val = ((volatile uint8_t *)0x04000000)[i];
    if (val == 0xAA)
    {
        printf("| %08x ", i);
        count++;
        if (count > 20)
            break;
    }
}
printf("Read HW Regs Fallthrough Check: %i\n", count);

/* Write every I/O byte back to itself and check whether NWRAM followed. */
count = 0;
memset((void *)0x03FF0000, 0xAA, 0x10000);
for (int i = 0; i < 0x0ff0000; i++)
{
    ((volatile uint8_t *)0x04000000)[i] |= 0;
    if (((volatile uint8_t *)0x03FF0000)[i & 0xFFFF] == ((volatile uint8_t *)0x04000000)[i])
    {
        printf("| %08x:%02x ", i, ((volatile uint8_t *)0x03FF0000)[i & 0xFFFF]);
        count++;
        if (count > 20)
            break;
    }
}
printf("Write HW Regs Fallthrough Check: %i\n", count);

/* Restore the original window. */
WRAMSetWindow(0, 0x03FF0000, 0x04000000);

Arisotura
Posted on 07-10-21 02:53 PM Link | #3990
that makes sense, at a coarse level memory mapping is probably implemented similarly to how melonDS does it (the big switch statement deciding which area the read/write goes to)


Mighy Max
Posted on 07-10-21 03:08 PM (rev. 2 of 07-10-21 03:31 PM) Link | #3991
Posted by Arisotura
that makes sense, at a coarse level memory mapping is probably implemented similarly to how melonDS does it (the big switch statement deciding which area the read/write goes to)


Yes, it kind of makes sense, but i am not 100% sure on that.

I thought of another test for that. I will do it when I get time. If bit28 extends the region, it must blend in 0x03ff0000 to 0x03ffffff when the start index is ff and the end is 1ff. Otherwise it should fall through.


:edit: It dawned on me just now. It is part of the region. It's required so the last 32kB can be mapped and a zero length window still can be created.
This means the end index is cropped internally at 0x100

Mighy Max
Posted on 07-12-21 06:49 AM Link | #4013
Tested more today, until the battery died just a few mins ago:

The end indices in the windows are indeed completely writeable and do not reflect the cut happening at 0x04000000. With bit 28 set, the last 32/64kB is accessible (this was actually already tested, but not noticed, in the code posted above). So it really is part of the window region, although there is no observed difference if the end index is any greater than 0x100 (A) / 0x200 (B & C).
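
To make the cropping concrete, here is a rough sketch of how an emulator might turn a window register into an address range. The bit positions (start at bit 4 and end at bit 20 in 64kB units for set A, start at bit 3 and end at bit 19 in 32kB units for sets B and C) follow GBATEK's description of the window registers. The type and function names are made up for illustration, the image size field is ignored, and treating "end below start" as an empty window is an assumption.

```c
#include <stdint.h>

/* Hypothetical helper type: half-open address range [start, end). */
typedef struct {
    uint32_t start;
    uint32_t end;
} NwramWindow;

/* Decode a window register. set_a != 0 selects the set A layout,
 * otherwise the set B/C layout is used. */
static NwramWindow nwram_decode_window(uint32_t reg, int set_a)
{
    uint32_t unit_shift = set_a ? 16 : 15;   /* 64kB or 32kB units */
    uint32_t start_idx  = set_a ? (reg >> 4) & 0xFF   : (reg >> 3) & 0x1FF;
    uint32_t end_idx    = set_a ? (reg >> 20) & 0x1FF : (reg >> 19) & 0x3FF;
    uint32_t max_idx    = set_a ? 0x100 : 0x200;  /* both equal 0x04000000 */
    NwramWindow w;

    /* Larger end indices are writeable but behave like the maximum. */
    if (end_idx > max_idx)
        end_idx = max_idx;

    w.start = 0x03000000u + (start_idx << unit_shift);
    w.end   = 0x03000000u + (end_idx << unit_shift);
    if (w.end < w.start)
        w.end = w.start;  /* assumption: treated as an empty window */
    return w;
}
```

With this, a set A register holding start index 0xff and end index 0x1ff decodes to the same range as one holding end index 0x100, which is the behaviour observed above.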

The bits 0xE00FC00F in the window for set A and 0xE007C007 in the windows for set B and C can not be changed.
The bits 0x72 in set A banks and 0x60 in set B and C banks can not be set.
This is not yet reflected in code and will be implemented (at the write, the performance impact is minimal)
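
For anyone wanting to mirror this in an emulator, the masking at write time could look roughly like the sketch below. The mask values are the ones listed above; the macro and function names are hypothetical, and keeping the old value of the fixed bits (rather than forcing them to zero) is an assumption.

```c
#include <stdint.h>

/* Bits that could not be changed in the tests above. */
#define NWRAM_WINDOW_FIXED_A   0xE00FC00Fu
#define NWRAM_WINDOW_FIXED_BC  0xE007C007u
#define NWRAM_BANK_FIXED_A     0x72u
#define NWRAM_BANK_FIXED_BC    0x60u

/* Merge a 32-bit write into a register, leaving the fixed bits as they were. */
static uint32_t masked_write32(uint32_t old, uint32_t value, uint32_t fixed)
{
    return (old & fixed) | (value & ~fixed);
}

/* Same idea for the 8-bit bank registers. */
static uint8_t masked_write8(uint8_t old, uint8_t value, uint8_t fixed)
{
    return (uint8_t)((old & fixed) | (value & (uint8_t)~fixed));
}
```

Writing 0xFFFFFFFF to a zeroed set A window register would then leave 0x1FF03FF0, i.e. only the start, size and end fields.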

Some addendum to SCFG_EXT7/9 bit 25: Although GBATEK states this enables/disables NWRAM, the NWRAM-related HW registers still seem to be accessible and changeable with the same masks. So I need to revert the fall-through for the NWRAM HW registers when the NWRAM bit in SCFG_EXT is disabled, and get a better understanding of that bit's effect on the NWRAM. Has anyone already played with it?

Mighy Max
Posted on 07-13-21 05:15 AM Link | #4025
SCFG_EXT7/9. Bit 25: R/W on Arm7. RO on Arm9.
If cleared, the NWRAM is not blended in on the Arm7 or Arm9 system bus. However, the HW Registers at 0x04004040..0x04004063 are still working.
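
In emulator terms, this splits into two separate checks. A minimal sketch, with made-up function names, that only models bit 25 and nothing else in SCFG_EXT:

```c
#include <stdbool.h>
#include <stdint.h>

#define SCFG_EXT_NWRAM_BIT (1u << 25)

/* The NWRAM control registers keep working regardless of bit 25. */
static bool is_nwram_register(uint32_t addr)
{
    return addr >= 0x04004040u && addr <= 0x04004063u;
}

/* Whether the NWRAM windows are blended into the system bus at all. */
static bool nwram_visible_on_bus(uint32_t scfg_ext)
{
    return (scfg_ext & SCFG_EXT_NWRAM_BIT) != 0;
}
```

So a bus access would consult nwram_visible_on_bus() before looking at the windows, while register accesses would never be gated by it.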

Test: https://github.com/DesperateProgrammer/DSiTestCases/tree/master/NWRAM

PoroCYon
Posted on 08-19-21 07:05 PM (rev. 7 of 08-21-21 08:11 PM) Link | #4268

Aptina MT9V113 internal MCU stuff


The Aptina cameras have an internal 68HC11-based MCU. Parts of its address space can be accessed through the XDMA registers (0x98c, 0x990, reachable over I2C)
There's a "physical" address range (0x0000..0x1fff) and a "virtual"/"logical" one (0x2000..0x3fff). The former can be used to access "system" and "user RAM" (resp. 0x0000..0x03ff and 0x0400..0x7ff), as well as "Special Function Registers" (SFRs), basically just MMIO regs of the HC11. The latter allows access to the variable spaces of several "firmwares"/"services" running on the MCU, used for autofocus, autowhitebalance, etc. Each camera has a separate HC11.

The above is already kinda known, but now for the new stuff. (Keep this section of GBATEK at hand while reading this, as I guess only a handful of people ever looked at the cameras to begin with.)

(NOTE: as the HC11 is an 8/16-bit MCU, pointers etc. are 16-bit (its address space is 16 bits wide). Also, it's a big-endian architecture, keep that in mind.)

SFRs

  • 0x1040: watchdog reset: write 0 to this address to calm down the watchdog timer.
  • 0x1048: "high-precision timer": 32-bit timer value that probably increases by 1 every 16 MHz tick.
  • 0x1050: "pagetable" pointer: pointer to a list of 16 addresses used to determine to which physical addresses the virtual ones will resolve.
  • 0x1060..0x1066: "ring bus access" or so, not too sure what this is, but it looks like it gives you some sort of DMA access. Sadly, I don't know which registers in this range are address ones, and which ones are used for data.
  • 0x1070: GPIO (already documented on GBATEK, didn't touch it myself.)
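
As a small illustration of the timer entry: if the value at 0x1048 really ticks at 16 MHz, two samples can be turned into elapsed microseconds as sketched below. The helper names are hypothetical, and both the big-endian byte order (from the note above) and the tick rate are assumptions rather than confirmed facts.

```c
#include <stdint.h>

/* Assemble a big-endian 32-bit value from four bytes read via XDMA. */
static uint32_t be32(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Elapsed microseconds between two timer samples, assuming 16 MHz ticks.
 * Unsigned subtraction copes with a single wraparound of the counter. */
static uint32_t timer_elapsed_us(uint32_t before, uint32_t after)
{
    return (after - before) / 16u;
}
```

At that rate the 32-bit counter would wrap roughly every 268 seconds, so samples need to be taken closer together than that.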

Address translation


As alluded to in the previous part, SFR 0x1050 is used to set the mapping. Its value is 0x0100, which means we can access and modify the "pagetable mapping" over I2C. (I'm using quotes here because it's far from anything like a real MMU.)

At 0x0100, a list of 16 pointers can be found: 0x0140 (MONITOR), 0x0000 (SEQ), 0x005d (AF), 0x0165 (AWB), 0x01d4 (FD), 0, 0, 0x0282 (MODE), 0, 0, 0, 0x0220 (HG), and then more zeros.

This matches what you'll find when dumping the 0x2000..0x2fff area. While system RAM has space for 16 more pointers (starting at 0x0120) for addresses 0x3000..0x3fff, in practice these are mirrors of the 0x2*** range.
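
The translation can be sketched as follows, working on a dump of system RAM. Note the big assumption: each of the 16 entries is taken to cover 0x100 bytes of logical space (0x1000 bytes divided by 16 entries). All names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGETABLE_BASE   0x0100u  /* value found in SFR 0x1050 */
#define LOGICAL_BASE     0x2000u
#define PAGE_SIZE_GUESS  0x0100u  /* assumption: 0x1000 bytes / 16 entries */

/* Read a big-endian 16-bit value from a system RAM dump. */
static uint16_t be16(const uint8_t *ram, size_t off)
{
    return (uint16_t)((ram[off] << 8) | ram[off + 1]);
}

/* Translate a logical address (0x2000..0x3fff) to a physical one.
 * sysram must cover at least the page table (0x0100..0x011f). */
static uint16_t logical_to_physical(const uint8_t *sysram, uint16_t logical)
{
    uint16_t rel, index, offset;

    assert(logical >= 0x2000u && logical <= 0x3fffu);
    /* 0x3*** behaves as a mirror of 0x2***, so drop that bit. */
    rel    = (uint16_t)((logical - LOGICAL_BASE) & 0x0FFFu);
    index  = (uint16_t)(rel / PAGE_SIZE_GUESS);
    offset = (uint16_t)(rel % PAGE_SIZE_GUESS);
    return (uint16_t)(be16(sysram, PAGETABLE_BASE + 2u * index) + offset);
}
```

With the table contents listed above, logical address 0x2310 would land at 0x0165 + 0x10 = 0x0175, inside the AWB variables.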

Setting a pagetable entry to an address of 0x2000 or higher seems to return 0 values; maybe there's a carveout, or maybe there's just nothing behind those addresses. Additionally, it does seem to mirror the upper half of memory (0x8000 and up) to the lower half, which is *quite* suspicious, as a datasheet says there's 32 kilobytes of firmware ROM and the exception vectors of the HC11 are at 0xfffe etc. (6502-style). So I'm betting the firmware ROM is in the upper half of MCU memory.

System RAM layout


With the above, we can start building a map of the system RAM space. As one datasheet (links at the end of this post) alludes to the stack of the HC11 firmware also being in system RAM and being 128 bytes in size, it's not too hard to guess where it is.
  • 0x000..0x047: SEQ
  • 0x05d..0x0c0: AF
  • 0x100..0x11f: page table
  • 0x120..0x13f: unused shadow pagetable? idk
  • 0x140..0x168: MONITOR
  • 0x165..0x1d2: AWB
  • 0x1d4..0x1f4: FD
  • 0x220..0x286: HG
  • 0x282..0x2ea: MODE (yes this overlaps with HG)
  • 0x300..0x37f: ??? (maybe stack but I doubt it)
  • 0x380..0x3ff: stack
The 0x380 region seems to be filled with 0xdeadbeef (big-endian), and usage grows from high addresses to low ones. Additionally, it's one of the RAM regions that seems to get reset regularly (see below). This gives me relatively high confidence that it is the stack of the HC11.
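
For tooling purposes, the map above can be written down as a table. The sketch below uses the ranges exactly as listed; the struct and function names are hypothetical, and the labels for the two uncertain regions are just placeholders.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint16_t first;  /* inclusive */
    uint16_t last;   /* inclusive */
} SysRamRegion;

static const SysRamRegion sysram_map[] = {
    { "SEQ",        0x000, 0x047 },
    { "AF",         0x05d, 0x0c0 },
    { "page table", 0x100, 0x11f },
    { "shadow?",    0x120, 0x13f },
    { "MONITOR",    0x140, 0x168 },
    { "AWB",        0x165, 0x1d2 },
    { "FD",         0x1d4, 0x1f4 },
    { "HG",         0x220, 0x286 },
    { "MODE",       0x282, 0x2ea },
    { "unknown",    0x300, 0x37f },
    { "stack",      0x380, 0x3ff },
};
#define SYSRAM_REGIONS (sizeof sysram_map / sizeof sysram_map[0])

/* Name of the first listed region containing addr, or NULL if none does. */
static const char *sysram_region_at(uint16_t addr)
{
    for (size_t i = 0; i < SYSRAM_REGIONS; i++)
        if (addr >= sysram_map[i].first && addr <= sysram_map[i].last)
            return sysram_map[i].name;
    return NULL;
}

/* Count pairs of regions whose ranges overlap. */
static size_t sysram_count_overlaps(void)
{
    size_t n = 0;
    for (size_t i = 0; i < SYSRAM_REGIONS; i++)
        for (size_t j = i + 1; j < SYSRAM_REGIONS; j++)
            if (sysram_map[i].first <= sysram_map[j].last &&
                sysram_map[j].first <= sysram_map[i].last)
                n++;
    return n;
}
```

Going by the ranges as listed, this reports two overlapping pairs: HG/MODE (already noted) and MONITOR/AWB (0x165..0x168), so the exact end addresses of those variable spaces are probably worth double-checking.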

User RAM (0x0400 and on) doesn't seem to be writable, sadly. Or maybe I'm missing some kind of magic switch.

Running code on the HC11


A datasheet alludes to using the MONITOR variables to run code: arg1 is the pointer of the code to run, arg2 an optional argument (where is this put in? the accumulators?), and then set cmd to 0x01 to start the code.

Sadly, this did not work for me: it resets the MCU, while doing not much else. Maybe it needs some kind of CRC (which the datasheet also alludes to), or maybe the feature is just locked away. (EDIT: maybe the reset was added to keep XDMA accesses from messing with internal state. But does using the virtual addressing mode bypass it? Could chaining through another peripheral (the "ring bus access" maybe?) be used as a bypass? I haven't tested.)

Then I tried putting some shellcode in low system RAM and filling the stack with bad addresses (that is at the same time a NOP sled). Sadly this didn't work either, as the MCU also got reset before it got to execute my code. Maybe I'm triggering some kind of (bad) crash, or maybe there's a safety mechanism that automatically resets everything. (I hope it's not the latter.)

(MCU resets seem to clear/reset the pagetable, stack (0x0380 and up), and some of the "logical variable" spaces in system RAM. 0x300..0x37f seems to be preserved, as well as some other small places in low system RAM.)

Maybe this is a fun challenge for someone else here? :P

P.S. I used Pk11's dsi-camera proof-of-concept as a base to mess with the cameras. Not really publishing my code because it's mostly just a spaghetti of I2C accesses and FIFO ugliness. I used vasm as HC11 assembler, though be aware that it gets some opcodes wrong, check with A09 (which doesn't seem to be able to process "org" directives) and this and this opcode listing I found. For more info on the HC11, see this (original DIP CPU info) and this (more modern microcontroller impl, only gave it a cursory glance, idk how useful it is outside of the instruction set info).

Datasheets

Also a neat link: https://files.niemo.de/aptina_pdfs/

____________________
TiTAN Forever

