melonDS RSS: The latest news on melonDS.

OpenGL renderer: requirements and details -- by Arisotura Fri, 17 May 2019 14:30:48 +0000

OpenGL 3.1 as a minimum.

I think that given the reports I got in this thread, this puts us at a sweet spot. Certain features (like accurate shadow support) may require higher OpenGL versions, but the 3.1 baseline allows us to support a large part of user platforms/setups, including Macs.

Technically, OpenGL 3.1 is the lowest we can go with the current renderer setup. Any lower would require partially rewriting it to get around the limitations.

There is also more chance of eventually porting this to OpenGL ES, which might open the way for Android or PSVita ports that don't run at crapoed-snail speeds.

I'm sorry for the toaster-rocking folks out there who are stuck on lower OpenGL versions. We are still keeping the software renderer and the old Direct2D-based window drawing code, so you aren't completely left out. Besides, there are still other emulators (DeSmuME, NO$GBA) supporting older OpenGL versions.


We will be using OpenGL to draw the DS framebuffer to the window. This mode will be an option when using the software renderer, and forcibly enabled when using the OpenGL renderer (they work faster together).

This may eventually allow for features like fancypants postprocessing shaders. Not for melonDS 0.8, but eventually, later on.

Upscaling will also be restricted to the OpenGL renderer. It makes little sense for the software renderer, which is built around the DS's native resolution, and its performance would likely be subpar if it tried.
The DS GPU interpolation -- by Arisotura Wed, 15 May 2019 16:27:59 +0000

As explained in our previous technical post, vertex attributes are interpolated twice: once along the polygon edges, and once along the scanline being drawn.

The details of how those are interpolated are where the meat is. A typical 3D rasterizer performs perspective-correct interpolation, which is meant to take perspective distortion into account when drawing a polygon that has been projected to 2D screen space. Some early rasterizers use different interpolation types; for example, the PlayStation does affine texture mapping.

In the case of the DS, it does perspective-correct texture mapping and shading.

The basics

This is the canonical formula for perspective-correct interpolation over a given span:

Ax = ((A0 * (xmax - x) / W0) + (A1 * x / W1)) / (((xmax - x) / W0) + (x / W1))

where:

- x is the position within the span, units don't matter (as long as they're consistent with xmax)
- xmax is the length of the span
- A0 and A1 are the attributes at each end of the span
- W0 and W1 are the associated W coordinates
- Ax is the interpolated attribute at position x

Rasterizers implement this formula in one way or another. For example, it is typical for software rasterizers to start by dividing all vertex attributes by W, so those can be interpolated linearly, then divide them by the interpolated reciprocal of W to recover the correct values for each pixel. I don't know much about how typical GPUs implement this.
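As a concrete illustration, here's a minimal C++ sketch of that software-rasterizer approach (my own illustration, not melonDS code): A/W and 1/W both interpolate linearly in screen space, so we interpolate those and recover A with one division per pixel.

```cpp
#include <cassert>

// Typical software-rasterizer trick: interpolate A/W and 1/W linearly,
// then divide at the end to recover the perspective-correct attribute.
float PerspInterp(float a0, float w0, float a1, float w1, float x, float xmax)
{
    float t = x / xmax;
    float aOverW = ((a0 / w0) * (1.0f - t)) + ((a1 / w1) * t); // A/W, linear
    float recipW = ((1.0f / w0) * (1.0f - t)) + ((1.0f / w1) * t); // 1/W, linear
    return aOverW / recipW; // recover A
}
```

With equal W values at both ends, this degenerates to plain linear interpolation, as you'd expect.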

However, all the research I did back then for melonDS's renderer gave me a lot of insight into this.

Regarding x and xmax, the DS GPU keeps things simple. When interpolating vertically (ie across polygon edges), those are either Y or X positions, depending on the slope of the edge. For X-major edges, X positions are a lot more precise than Y positions, so this makes sense. When interpolating horizontally (ie over a scanline span), those are X positions. As said above, units for x and xmax don't matter, so the pixel coordinates are used as-is.

The interpolation calculation itself is where this gets interesting. I eventually figured out how it works after observing interpolation of W values over large spans: those always changed by increments of 1/256th of the difference between the W values.

For example, given W values 0x1000 to 0x2000 over a span of 256 pixels, the canonical formula had it begin like: 0x1000, 0x1008, 0x1010, 0x1018, 0x1020... On the DS, it was 0x1000, 0x1000, 0x1010, 0x1010, 0x1020... This was also the case with other ranges, like 0x1001-0x2001, 0x1007-0x2007, etc, further straying away from the canonical formula.

This had me scratching my head for a while until I finally figured it out: the DS GPU precalculates an interpolation factor with limited precision, then uses that to interpolate vertex attributes linearly.

So, how does that work? Well, the canonical formula can be rearranged into this simpler, less division-y version:

Ax = ((A0 * (xmax - x) * W1) + (A1 * x * W0)) / (((xmax - x) * W1) + (x * W0))

The DS simply sets A0 to zero and A1 to 1. The precision is 8 bits of fractional part along scanline spans (hence the effect observed above), and 9 bits along polygon edges.

Thus, we have:

factor = (x * W0) / (((xmax - x) * W1) + (x * W0))

With this factor, we can interpolate vertex attributes linearly quite quickly, resulting in a pretty good (but imperfect) approximation of perspective-correct interpolation.
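Here's a small C++ sketch of my reading of this scheme (a reconstruction from the observations above, using the 8 fractional bits of scanline spans; not melonDS source):

```cpp
#include <cassert>
#include <cstdint>

// DS-style scheme as described above: precompute a limited-precision
// perspective factor (8 fractional bits), then interpolate linearly with it.
uint32_t DSFactor(uint32_t x, uint32_t xmax, uint32_t w0, uint32_t w1)
{
    uint32_t num = (x * w0) << 8;
    uint32_t den = ((xmax - x) * w1) + (x * w0);
    return num / den;
}

uint32_t DSInterp(uint32_t a0, uint32_t a1, uint32_t factor)
{
    return a0 + (((a1 - a0) * factor) >> 8); // assumes a1 >= a0
}
```

Interpolating W itself from 0x1000 to 0x2000 over a 256-pixel span with this reproduces the stepping observed above: 0x1000, 0x1000, 0x1010, 0x1010, 0x1020...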

There are some interesting quirks about all this, though.

Divider details

The DS uses an unsigned 32-bit divider to perform the division, which induces some extra quirks.

Those quirks become evident when using tiny W values. For example, if you draw a polygon with W values going from 0x1000 to 0xF000 on the left side, and 0x0001 to 0x000F on the right side, you will get distortion on the right side.

In the formula above:
- W0 and W1 are 16-bit (those are 'normalized' during polygon setup: they're shifted left or right by 4 until they all fit the 16-bit range as much as possible)
- x fits within 8 bits
- (xmax-x) may take up 9 bits, if the viewport transform has overflowed

The denominator takes up, at most, 26 bits. This is okay.

The numerator takes 24 bits. Along scanline spans, we add 8 bits of fractional part, bringing it to 32 bits. However, along polygon edges, we add 9 bits, which brings it to 33 bits. Oops.

For this reason, when interpolating along polygon edges, there's some weird adjustment made to W values, so those fit in 15 bits. This is best described by the following code:

if ((w0 & 0x1) && !(w1 & 0x1))
{
    w0_numerator = w0 - 1;
    w0_denominator = w0 + 1;
    w1_denominator = w1;
}
else
{
    w0_numerator = w0 & 0xFFFE;
    w0_denominator = w0 & 0xFFFE;
    w1_denominator = w1 & 0xFFFE;
}

This is notably responsible for the dents in SM64DS's level select buttons.

Alternate linear-interpolation path

Due to the limited fixed-point precision, the method described above isn't suitable for 2D polygons: those could end up with misplaced textures.

This is where the DS has another trick in store: for those cases, it directly does linear interpolation, completely bypassing the perspective correction math.

Technically, it checks the W values at the ends of the span. If those are equal, and have low-order bits cleared (bit0-6 along scanline spans, bit1-6 along polygon edges), the alternate linear-interpolation path is used. I'm not sure why low-order bits have to be cleared, maybe that is to guard against precision errors?
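In code, that check would look something like this (my reading of the behavior above, not melonDS source):

```cpp
#include <cassert>
#include <cstdint>

// Take the linear path when both endpoint W values are equal and have the
// relevant low-order bits clear.
bool UseLinearPath(uint16_t w0, uint16_t w1, bool alongEdge)
{
    // bit1-6 checked along polygon edges, bit0-6 along scanline spans
    uint16_t mask = alongEdge ? 0x007E : 0x007F;
    return (w0 == w1) && ((w0 & mask) == 0);
}
```

Note that an odd pair of W values can still qualify along polygon edges (bit0 isn't checked there), but not along scanline spans.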

Interpolator note

The interpolators in the DS GPU, similarly to the divider, only work with unsigned numbers.

The basic operation would be:

val = a + (((b-a) * x) / xmax);

What if a is greater than b? (b-a) would end up negative. To avoid this, it does the operation in reverse:

if (a < b) val = a + (((b-a) * x) / xmax);
else if (a > b) val = b + (((a-b) * (xmax-x)) / xmax);
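Wrapped up as a runnable function (a sketch of the behavior described above, including the trivial a == b case):

```cpp
#include <cassert>
#include <cstdint>

// Unsigned-only linear interpolation: when the attribute decreases over the
// span, interpolate the mirrored span from the other end so the difference
// never goes negative.
uint32_t LerpUnsigned(uint32_t a, uint32_t b, uint32_t x, uint32_t xmax)
{
    if (a < b) return a + (((b - a) * x) / xmax);
    if (a > b) return b + (((a - b) * (xmax - x)) / xmax);
    return a;
}
```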

Side note

We're coding epic shit for melonDS 0.8. Not telling much more though, that'll be a surprise :)
OpenGL progress -- by Arisotura Thu, 02 May 2019 12:32:41 +0000

So this is where we are today. Our GL renderer is a lot more capable, even though it's still unfinished.

Here's a little TODO list:

* shadows
* handling DISPCNT
* texture blending modes (decal/toon/highlight)
* edgemarking, fog
* clear bitmap
* higher res rendering
* OpenGL init under Linux/etc
* switch between GL and software renderer

I first want to clean things up so that I can release a beta version, though. The idea would be to see how the renderer fares 'in the real world', so the actual proper 0.8 release doesn't end up being a major disappointment.

The current renderer requires OpenGL 4.3. Which might exclude old hardware and, well, Macs.

At the end of the development process, the requirements will be adjusted. The current renderer will likely have its requirements lowered as far as it can go without losing accuracy.

Depending on demand, there might be alternate GL renderers, supporting older GL versions but sacrificing some accuracy (they will render certain cases wrong and that will not be fixed).

See ya in a while!
Doing the time warp -- by Arisotura Wed, 10 Apr 2019 23:36:19 +0000

This looks a lot like another screenshot, from two years ago:

So why am I posting this now? Well, the answer is simple: we're going back in time and preparing a new melonDS release that is roughly equivalent to 0.1.


Joke aside, there are some key differences between those screenshots:

* newer one has proper clipping at screen edges
* both lack Z-buffering, but newer one is different, likely because the older one didn't have Y-sorting
* newer one is using OpenGL

So, yeah, that's the long-awaited OpenGL renderer. Like the good ol' software renderer in its time, it's taking its baby steps, and barely beginning to render something, but I'm working on it, so in a while it will become awesome :P

This renderer will aim for reasonable accuracy. As a result, it will require a pretty recent OpenGL version, and compatible hardware. It's set to OpenGL 4.3 currently, but I will adjust the minimum requirement once the renderer is finished.

If needed, I can provide alternate versions of the renderer for lower-end hardware supporting older OpenGL versions, but they will be less accurate. While the software renderer is the 'gold standard', the current OpenGL renderer is a sort of 'minimum standard' to get most games to render correctly. Any lower-spec renderer may render certain games wrong and that will be unlikely to get fixed (or it would be fixed but at the cost of killing performance).


Speaking of the software renderer, I also felt like doing a bit more research towards the holy grail: pixel perfection. I'm not done yet, but I finally have those pesky polygon edge slope functions down. Someday we will get those aging cart tests to pass ;)

But that will be for later. I may also write a post about all the juicy low-level hardware details.


But, back to OpenGL. Might as well explain why the planning phase for this renderer took so long. Although you can guess that it's 50% the DS GPU being a pile of quirks and 50% me being a lazy fuck.

The first experiments were made with a compute shader based rasterizer. That way, I could get it perfect, while supporting graphical enhancements. I ended up ditching this solution because the performance wasn't good.

So, back to more standard rendering methods, aka pushing triangles. We won't get to rasterize quads correctly that way, but in most cases, the difference shouldn't matter.

First thing to do is to devise an efficient way to push triangles. This requires straying away from standard rendering methods, especially in how we do texturing and all.

On the DS, a game can choose to change the current texture at any time. Polygon attributes can only be changed before a BEGIN_VTXS command, but that doesn't make it any better. Polygons are sorted by their Y coordinates before rendering, which can completely change their ordering. Basically, there is no guarantee that polygons will be grouped by polygon/texture attributes, and the ordering after Y-sorting must be preserved or you might break things like UIs that rely on it.

This is shitty for our purposes though. While changing polygon/texture attributes is mostly free on the DS, the same can't be said of OpenGL (or any desktop graphics API for that matter). You would end up with one draw call per polygon, which isn't really a good thing.

Another thing worth considering is that our window for 3D rendering is not a full frame (16.667ms). On the DS, 3D rendering starts at scanline 215 (or 214?). Rendering any sooner would be a bad idea as the game might still be updating texture VRAM. But, we need 3D graphics as soon as scanline 0 of the next frame, which leaves us only 48 scanlines worth of time to do the rendering.

The software renderer is able to work around this limitation by using threading and per-scanline rendering (pretty much like the real thing, except that one seems to render two scanlines at once), which extends the rendering time frame to 192 scanlines.

OpenGL does not render per-scanline, though. So we can forget about this. However, a possibility would be splitting the frame in four 256x48 chunks. I will study that possibility if performance is an issue -- would have to see how far the extended rendering timeframe can outweigh the extra draw calls. Maybe propose the two rendering methods as options.

Back to pushing triangles, for now. I devised a way to pass polygon/texture attributes to the fragment shader, and render all the polygons in one draw call. Nice and dandy, but we're not out of trouble. This will imply passing the raw DS VRAM to the fragment shader and having it handle all the details of texturing, akin to the TextureLookup() function in the software renderer. No idea about the performance implications of this.

Also, we will have to think of something for shadow polygons, I don't think we can use the regular stencil buffer with this.

Well. I hope this renderer will be compatible with OpenGL ES, with all the tricks it may be pulling, but... we'll see.
melonDS 0.7.4 -- by Arisotura Tue, 26 Mar 2019 03:10:06 +0000

This release is hardly going to be revolutionary, but I had to get it out before starting work on the hardware renderer.

The highlight of this release is the upgrade to the online wifi feature. Two main points there:

1. Under direct mode, you can finally choose which network adapter will be used by libpcap. The wifi settings dialog also shows their MAC and IP addresses to help you ensure you're picking the right adapter.

2. Indirect mode, which, as mentioned before, works without libpcap and on any connection. However, it's in alpha stages and it's a gross hack in general, so you can feel free to try it out, but don't expect miracles.

As usual, direct mode requires libpcap and an ethernet connection, and that you run melonDS with admin/superuser privileges.

Other than that, there are a few little fixes to the SDL audio code. melonDS now only emits sound when running something, so it shouldn't emit obnoxious noises on startup anymore. Also, it no longer crashes if you use WAV microphone input with a file that has multiple channels.

And other little things.

Now, full steam ahead towards 0.8! Or not quite. I also need to finish the redesign for this site, among other things.


melonDS 0.7.4, Windows 64-bit
melonDS 0.7.4, Linux 64-bit

melonDS Patreon if you're feeling generous
Immediate plans -- by Arisotura Sun, 17 Mar 2019 19:32:46 +0000

I also wanted to fix one of the issues with local multiplayer: when more than two players are involved, clients receive replies sent by other clients, which they shouldn't receive, and this likely contributes to it shitting itself big time.

But alas, this will be more complicated than anticipated. This is also the main reason why local multiplayer pretty much stagnated after melonDS gained its "wifi emulator" reputation back in 2017: emulating local MP is a large pile of issues that are all interconnected. Wifi emulation in melonDS is more or less a pile of hacks, and it works, but it's full of issues. Some stem from incomplete understanding of the DS wifi hardware (after 15 years. welp), but most of them are timing issues.

(which are also why I'm pretty much pessimistic about ever connecting melonDS to a real DS)

Local multiplayer 'a la Nintendo' works on a host-client scheme detailed here. Long story short, the host polls the clients at a regular, small interval (for example, every 4ms for Pictochat). Dealing with the sheer amount of packets transferred is in itself a challenge.

But that's not all. When the host sends its packet, each client is given a window within which it should send its response. Miss your window and it's considered a failure.

This works well with actual DSes because they're all running at the same speed, so the timings are reliable.

With melonDS, it's another story. We get lag inherent to the platform on which we're running: the network stack, thread scheduling, etc... Running multiple DSes in one melonDS instance might help alleviate these lag sources, but it wouldn't be a perfect solution either (it would likely be running the DSes on separate threads).

We also learned by experimentation that the framerate limiter is a problem, and connections worked better when disabling it. As silly as that sounds, it makes sense when looking closer. When the framerate limiter is disabled (or when running below 60FPS), the melonDS instances run as fast as possible, and may end up running at roughly the same speed, consistently, which makes for a better connection (less chance that MP replies miss their window). However, when it's enabled, your melonDS instances may each be capable of running at some speed above 60FPS, but they will spend several milliseconds per frame doing nothing in order to bring the framerate back down to 60FPS.

Which, as you can guess, is bad bad bad for MP communications. The wifi system is driven by the emulator's scheduler, so it will end up running faster than it should, squishing MP reply windows and making it way more likely that replies get dropped.

And indeed, disabling the framerate limiter greatly reduces the amount of MP replies missing their window, even if the framerates are barely above 60FPS. The framerate limiter might be a bit zealous there.

However, the emulator instances might run at different framerates and possibly desync. And that's not too convenient if they end up running at absurdly high framerates.


To get anywhere with this local multiplayer shito, we'll need a redesign. No amount of hacky solutions will get us anywhere with this pile of hacks.

First part is how the wifi system is driven.

melonDS runs ~560190 cycles per frame, which at 60FPS works out to 33611400Hz under ideal circumstances, close enough to the DS clock frequency.

Wifi is updated every microsecond. The handler is called every 33 cycles, which means that one emulated microsecond actually lasts ~0.9818 microseconds. I'll spare you the nerdy calculations about how much that represents in offset, because so far that hasn't prevented it from working.
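For the curious anyway, a quick sanity check of those numbers (my arithmetic, not melonDS code):

```cpp
#include <cassert>

// ~560190 cycles per frame at 60FPS gives the clock rate quoted above,
// and a 33-cycle handler period falls just short of a real microsecond.
constexpr long CyclesPerFrame = 560190;
constexpr long CyclesPerSecond = CyclesPerFrame * 60; // = 33611400 Hz

double EmulatedMicrosecond()
{
    // length of one emulated microsecond, in real microseconds (~0.9818)
    return 33.0 * 1000000.0 / CyclesPerSecond;
}
```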

But on the DS, the wifi system is driven by a 22MHz clock, independently from the system clock. So driving melonDS's wifi system independently from the scheduler would not only be accurate, but also isolate it from the core's variable execution speed. However, two main issues arise from this:

1. We need to keep the core somewhat in sync with the wifi system, or the game would eventually shit itself. How to synchronize and when to do so? That's the question.

2. The current wifi system needs to be updated per microsecond. If we put it on a separate thread, how to take care of this without pegging the host CPU and killing performance? We could design a scheduler similar to the core one. However the wifi system has a readable microsecond counter, and we need to take care of that somehow. Without per-microsecond updates, we can only approximate it and hope it will be good enough.

To further prove my point:


We'll think of all that later. Past 0.7.4, we're going to focus on the hardware renderer.
Getting somewhere, finally -- by Arisotura Sat, 09 Mar 2019 01:10:25 +0000

Finally, ClIRC is cooperating!

This took a bit of hackery to get around the limitations we face when working with plain old BSD sockets. Namely, how we're handling TCP acks for now.

Regardless, it's finally working without exploding! Well, I don't know about things like altWFC. But ClIRC is working about as well as when using direct mode.

We'll be polishing this and testing it with things like altWFC. You can expect a release real soon.

(also, a side effect of using sockets is that closing melonDS terminates connections correctly, which is nice for testing)
Change in plans -- by Arisotura Tue, 05 Mar 2019 14:15:23 +0000
So I figured I would just rewrite it to use regular sockets instead of libpcap. There will be a lot fewer issues getting things to work this way, and this has the significant advantage that it doesn't require libpcap, and likely can work without requiring admin privileges.

The downside is that we have less control this way, as all the low-level TCP/IP details are handled by the host OS, so some things might break, but I'm confident that this will work just fine for the most typical uses, which involve standard TCP connections.

Anyway, I have DHCP, ARP and DNS covered. I started work on TCP, it's now able to initiate a connection to a server. Now I need to get the actual data exchange working.

I'm also unsure whether this can work for any use case where the DS is used as a server, but for now we don't have to care about that. And there's always direct mode.
Updatez0rz -- by Arisotura Wed, 27 Feb 2019 18:57:04 +0000

1. Coming-out, sorta.

I am, well... nonbinary, likely some flavor of girl.

I'm a bit sick of the constant assumption that internet users are male by default, or that 'there are no girls on the internet'. Other genders, trans or not, are here too.

If you don't know, you can either ask or use gender-neutral pronouns; it's still better than perpetuating the 'masculine by default' norm.

2. Network shito

The issue I mentioned, with the DS receiving packets addressed to the host, also happens the other way around: the host receives traffic addressed to the DS. This might be the cause for the lag I observed.

Still thinking of a workaround to this. May need to read about network routing and all.

That's all folks!
Indirect mode progress -- by Arisotura Sun, 24 Feb 2019 12:36:11 +0000
The idea is that outgoing network traffic is altered to make it look like it's coming from the host machine, so that it will get past wifi access points without trouble. Conversely, incoming traffic is addressed to the host machine, so I first took the easy path of redirecting all incoming traffic to the emulated DS regardless.

Bad idea tho.

This caused all connections on the host machine to die constantly. The reason is that the sgIP stack on the DS was receiving TCP traffic it did not initiate (traffic destined for the host) and interfering with those connections, likely trying to kill what it perceived as bogus connections.

To work around this, we have to keep track of TCP sockets. By examining the flags in the TCP header, we know when a connection begins, so we can redirect incoming traffic only if the source IP and source/destination ports match an existing socket.

Basically, NAT.
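A sketch of what that tracking could look like (hypothetical structure and names, not the actual melonDS implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <tuple>

// Track TCP connections the DS initiates, and only forward incoming
// traffic that matches one of them -- basically a tiny NAT table.
struct SocketTable
{
    // key: (remote IP, remote port, DS-side port)
    std::set<std::tuple<uint32_t, uint16_t, uint16_t>> sockets;

    // Called on outgoing TCP packets: a SYN from the DS opens a connection,
    // a FIN or RST closes it.
    void OnOutgoing(uint32_t dstIP, uint16_t dstPort, uint16_t srcPort, uint8_t tcpFlags)
    {
        const uint8_t FIN = 0x01, SYN = 0x02, RST = 0x04;
        if (tcpFlags & SYN)
            sockets.insert({dstIP, dstPort, srcPort});
        else if (tcpFlags & (FIN | RST))
            sockets.erase({dstIP, dstPort, srcPort});
    }

    // Called on incoming packets: forward to the DS only if it matches.
    bool ShouldForward(uint32_t srcIP, uint16_t srcPort, uint16_t dstPort) const
    {
        return sockets.count({srcIP, srcPort, dstPort}) != 0;
    }
};
```

A real implementation would also need to track connection teardown more carefully (FIN handshakes rather than a single flag), plus a similar but simpler table for UDP.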

This gave promising results with ClIRC, but for some reason there's quite some lag on messages sent from ClIRC, and I don't remember it being that laggy before.

The current implementation will be limited to TCP and UDP, but for uses like AltWFC, this should be enough. For more advanced shito, there's always direct mode, which I think can be made to work on a wifi connection with some hackery.