melonDS RSS http://melonds.kuribo64.net The latest news on melonDS. Antialiasing -- by StapleButter http://melonds.kuribo64.net/comments.php?id=32 Mon, 28 Aug 2017 17:41:43 +0000
I also found out that my edge slope functions weren't perfect. I tried, with moderate success, to make them more accurate, but they're still not perfect. So for now, I need to let it cool down. I decided I would make antialiasing 'good enough', then start working on the UI, considering that there are other areas of the GPU that aren't perfect either, like polygon clipping.

So here's a couple screenshots:



I picked cases where the result is quite visible. Antialiasing generally makes things look better, but whether it is that visible depends on what's being rendered.


Antialiasing may look like one of those minor, unimportant details. But I consider it important to emulate, beyond just making things look nicer: the way it works on the DS is very different from common antialiasing techniques. If you're into DS homebrew, you can't just turn on antialiasing to magically make things look better.

To begin with, antialiasing is applied to polygon edges based on their slopes, in order to make the edges look smoother. There is no antialiasing applied inside polygons or at intersections between two polygons.

Antialiasing also doesn't apply to translucent polygons. But actually, it's more complex than that -- the polygon fill rules depend on whether the individual pixel being rendered is translucent. So if a polygon has opaque and translucent pixels at its edges, the opaque parts will be antialiased and the translucent parts won't be.

The effect is shown in the following dasShiny screenshot. Note that dasShiny only has partial antialiasing support: it is only applied to Y-major edges.


More importantly, antialiasing interferes with wireframe polygons, lines and points. Wireframes are drawn as if they were filled polygons, i.e. only the outer side gets antialiased. Lines are similarly antialiased on one side only. Points are said to disappear entirely.

Edge marking is also affected, in that marked edges are made translucent when antialiasing is enabled. No idea whether this was intended, but it needs to be taken into account.


The real gold is how the hardware is designed to ensure antialiased edges are always rendered properly. The design is quite atypical, and inspired by how the 2D GPU does blending.

If antialiased edges were immediately blended against the pixels underneath, it could cause visible glitches if another polygon is later rendered behind an antialiased polygon. So instead, per-pixel edge coverages are put aside for later: antialiasing is a separate rendering pass. It is the final pass, applied after edge marking and fog.

That final pass needs to know what the colors under the polygon edges are. Thus, the GPU keeps track of the last two colors rendered. When rendering an opaque polygon, the existing color is pushed down, and the new polygon color is written at the top. This is the same design that is used by the 2D GPU for its blending.

But if you render a polygon behind another, that polygon's pixels can be inserted behind the existing polygon, effectively replacing the existing bottom-most pixels.

The limitation is that this only works for the topmost edge pixels. If you draw two identical polygons A and B at the same position, with A on top, A's edges are blended against B, and B isn't antialiased.
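If you want to picture it, here's a minimal sketch of the scheme (structure and names are mine, not the actual melonDS code):

#include <cstdint>

// Sketch of the two-layer pixel scheme described above. Each
// framebuffer entry keeps the two most recent colors, plus the edge
// coverage of the topmost pixel (field names are made up).
struct Pixel
{
    uint32_t topColor;  // most recent opaque color
    uint32_t botColor;  // the color right under it
    uint8_t  coverage;  // edge coverage of the top pixel, 0..31
};

// Opaque pixel passing the depth test: push the old color down.
void DrawOpaque(Pixel& p, uint32_t color, uint8_t cov)
{
    p.botColor = p.topColor;
    p.topColor = color;
    p.coverage = cov;
}

// Opaque pixel landing behind the topmost one: only the bottom-most
// color is replaced, so the edge still has something to blend against.
void DrawOpaqueBehind(Pixel& p, uint32_t color)
{
    p.botColor = color;
}

// Final antialiasing pass, after edge marking and fog: blend the top
// color against the bottom one using the stored coverage.
uint32_t FinalAAPass(const Pixel& p)
{
    uint32_t r = (((p.topColor >> 16) & 0xFF) * (p.coverage + 1)
               + ((p.botColor >> 16) & 0xFF) * (31 - p.coverage)) >> 5;
    uint32_t g = (((p.topColor >> 8) & 0xFF) * (p.coverage + 1)
               + ((p.botColor >> 8) & 0xFF) * (31 - p.coverage)) >> 5;
    uint32_t b = ((p.topColor & 0xFF) * (p.coverage + 1)
               + (p.botColor & 0xFF) * (31 - p.coverage)) >> 5;
    return (r << 16) | (g << 8) | b;
}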

Things are funkier when translucency is involved. As far as I understand, translucent polygons don't push the existing topmost pixel down, but they blend with both topmost and bottom-most pixels. Translucent pixels that are behind the topmost pixel can still be blended with the bottom-most pixel if they come in front of it.

melonDS doesn't emulate this bit yet; I will have to rework the renderer code at some point to make things nicer to work with.

Similarly, fog is applied to both topmost and bottom-most pixels, so that antialiased edges are still fogged properly. This part is emulated but at the expense of some duplicate code.


I'm not going to cover in detail how edge pixel coverages are calculated, because I'm still not quite sure. I haven't even gotten the edge functions perfect yet. This GPU isn't done being weird, heh.


But enough GPU work for now. There are still many other areas that need work. And I really want to put out a release with a nice UI.


As a side note: much of the research was initially done by Cydrak, and dasShiny was the first to attempt antialiasing support, albeit incomplete (not counting DeSmuME's option to use OpenGL AA).]]>
http://melonds.kuribo64.net/comments.php?id=32
Sneak peek -- by StapleButter http://melonds.kuribo64.net/comments.php?id=31 Fri, 18 Aug 2017 23:16:13 +0000


I'll let you guess ;)


Work is also being put into figuring out the exact GPU algorithms. What has been done so far is a double bonus: not only is the renderer more accurate, it's also faster as a result of doing fewer divisions per pixel. The GPU takes shortcuts, so figuring them out allows for this kind of double-bonus optimization.

For example:



The buttons have dents in their borders, which are also visible on hardware. Those are caused by interpolation quirks. Artificially perfect interpolation would make the buttons look "perfect".

It's a tiny detail that likely doesn't matter to a whole lot of gamers, but as I said, I like figuring out the logic behind what I'm observing :P

Besides, it can't be bad to have an accurate renderer. You never know when a homebrew dev or ROM hacker may run into GPU quirks, and they may find out about those more easily if they can reproduce the quirks in an emulator.

Or you have those gamers who want the original experience, too :P


Regardless, the next thing that will be worked on after this is... the UI. This will be a surprise :)]]>
http://melonds.kuribo64.net/comments.php?id=31
Slowdown -- by StapleButter http://melonds.kuribo64.net/comments.php?id=30 Mon, 07 Aug 2017 23:03:04 +0000

First thing is that real life is striking back. My current job is ending at the end of this month, and I need to finish their project (fresh new website) and put it live. Most of it is done already, but it takes time to check everything and ensure it's alright and looks nice and all. So this occupies my mind more.


I'm also busy with parts of the melonDS adventure that don't produce a whole lot of visual results. For one, I'm investigating interpolation, and once again we're into weird land.

Z interpolation, for example: the depth buffer is 24-bit, but in Z-buffering mode, interpolation suffers from precision loss at various stages. The output precision actually depends on the value range being interpolated over (the greater the difference, the bigger the precision loss).

W-buffering doesn't have these issues, as it uses untransformed W values and the regular perspective-correct interpolation path (while Z-buffering requires special linear interpolation, as transformed Z values are already linear in screen space).

Regular perspective-correct interpolation is weird too. It seems to apply a correction to W values based on their first bit. I don't quite see what was intended there, if it was intended at all -- after all, it could just be a hardware glitch. Regardless, that causes slight differences that can have visible effects. It's generally a texture being one pixel off, so one could say it doesn't really matter, but I like figuring out the logic behind what I'm observing.
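To make the distinction between the two paths concrete, here's a rough sketch (heavily simplified -- the hardware's precision handling is messier, as described above):

#include <cstdint>

// Z-buffering: transformed Z is already linear in screen space, so it
// is interpolated linearly across the span (sketch, simplified).
int32_t InterpZ(int32_t z0, int32_t z1, int32_t x, int32_t xmax)
{
    return (int32_t)(z0 + ((int64_t)(z1 - z0) * x) / xmax);
}

// W-buffering / vertex attributes: perspective-correct interpolation,
// using the untransformed W values of the two endpoints.
int32_t InterpPersp(int32_t a0, int32_t a1, int32_t w0, int32_t w1,
                    int32_t x, int32_t xmax)
{
    // blend factor: how much of the span belongs to endpoint 1
    int64_t num = (int64_t)x * w0;
    int64_t den = (int64_t)(xmax - x) * w1 + (int64_t)x * w0;
    if (den == 0) return a0;
    return (int32_t)(a0 + (int64_t)(a1 - a0) * num / den);
}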


I also want to finally tackle the UI. melonDS has become a fairly solid emulator, but the current UI doesn't really reflect that.

I still haven't picked something to go with, though. I think I'll pick something lightweight, like byuu's hiro, and modify it to interoperate nicely with SDL. I'm not sure how well that can work across platforms, but ideally I would create an SDL window and add a menubar to it, rather than having two separate windows.


I'm also thinking about a buildbot type thing. People wouldn't have to wait for proper releases, but on the other hand, I fear it reduces or kills my incentive to do proper releases.]]>
http://melonds.kuribo64.net/comments.php?id=30
Why there is no 32-bit build of melonDS -- by StapleButter http://melonds.kuribo64.net/comments.php?id=29 Tue, 25 Jul 2017 00:04:44 +0000
That being said, melonDS can currently run on 32-bit platforms. It may be less performant, as the 3D renderer does a lot of 64-bit math, but it is still possible.

But if I ever decide to implement a JIT, for example, there will be no 32-bit version of it.


If you're stuck on a 32-bit OS for hardware reasons, your computer will not be fast enough to run melonDS at playable speeds.

melonDS will be optimized, and it will run faster, but it will also tend towards more accuracy, so I can't tell how fast it will be in the end. But I highly doubt it will run well on a PC from 2004. Maybe it will, if a JIT is made, but that's not a high-priority task.

If you are stuck on such hardware, NO$GBA is a better choice for you. Or NeonDS if you don't mind lacking sound. Or hell, the commonly mentioned method of running DraStic in an Android emulator -- those who bash DeSmuME at every turn claim it's fast.


Truth is, emulating the DS is not a walk in the park. People tend to assume it should be easy to emulate fast because the main CPU is clocked at a measly 66MHz. Let's see:

There are two CPUs: an ARM9 and an ARM7, clocked at 66MHz and 33MHz respectively. This means you need to keep them more or less in sync. Every time you stop emulating one CPU to go emulate the other (or any other hardware), performance takes a hit; but synchronizing too loosely can cause games to break. So you need to find the right compromise.
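In sketch form, the compromise looks something like this (a made-up main loop, not melonDS's actual one; the slice length is the knob you turn):

#include <cstdint>

// Hypothetical sketch of loose two-CPU synchronization. Each CPU runs
// up to a shared target in small slices; the slice length trades
// accuracy for speed (numbers and structure are made up).
struct CPU
{
    int64_t cycles = 0;
    void RunUntil(int64_t target) { /* execute instructions */ cycles = target; }
};

void RunFrame(CPU& arm9, CPU& arm7, int64_t frameCycles)
{
    const int64_t kSlice = 64;  // ARM9 cycles per slice (made up)
    for (int64_t target = kSlice; target <= frameCycles; target += kSlice)
    {
        arm9.RunUntil(target);      // 66MHz
        arm7.RunUntil(target / 2);  // 33MHz: half the cycles
        // ...then catch up timers, DMA, video, etc.
    }
}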

The ARM7 generally handles tasks like audio playback, low-level wifi access, and accessing some other peripherals (power management, firmware FLASH...). All commercial games use the same ARM7 program, because Nintendo never provided another one or allowed game devs to write their own. This means that, in theory, the ARM7 could be emulated in HLE. In practice, this has never been attempted, unless DraStic happens to do it. It's also worth noting that it would be incompatible with homebrew, which doesn't use Nintendo's ARM7 program.

While it is possible to get ARM7 timings reasonably accurate without too much effort, the ARM9 is another deal. It is clocked at 66MHz, but the bus speed is 33MHz, so the CPU needs to adjust to that when accessing external memory. Accesses to main RAM are slow, more so than on the ARM7, due to what appears to be bad hardware design. The ARM9 has caches which can attenuate the problem (provided the program is built right), but when the caches are used, timings can vary dramatically between cache hits and cache misses.

So emulating ARM9 timings is a choice between two unappealing options: 1) emulating the whole memory protection unit and caches, or 2) staying grossly inaccurate. I went with the second option with melonDS, but I'm considering attempting MPU/cache emulation to see how much it would affect performance.
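To give an idea, option 1 means every single ARM9 memory access would first go through something like this (a toy direct-mapped cache sketch; sizes and penalties are made up):

#include <cstdint>

// Toy sketch of why MPU/cache emulation is costly: every access
// becomes a tag lookup before anything else happens.
const int kLines = 256, kLineShift = 5;  // 8KB, 32-byte lines (made up)
uint32_t tags[kLines];

uint32_t ARM9_AccessCycles(uint32_t addr)
{
    uint32_t line = (addr >> kLineShift) & (kLines - 1);
    if (tags[line] == (addr >> kLineShift))
        return 1;                        // cache hit: fast
    tags[line] = addr >> kLineShift;     // miss: fetch the line
    return 8;                            // made-up external bus penalty
}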

Note that Rayman DS is an example of a timing-sensitive game: when timings are bad enough, it starts doing weird things. The effects get worse as the timings get worse, and can range from text letters occasionally jumping out of place to camera tracking shots going haywire. I have observed polygons jumping out in melonDS, so its timings aren't good enough for this game.

And then you have all sorts of hardware attached to the CPUs: timers, DMA channels, and the video hardware -- oldschool 2D tile engines and a primitive, quirky custom 3D GPU. The 2D GPUs need to be emulated at least once per scanline, as games can do scanline effects by changing registers midframe (less common than on the SNES, for example, but still a thing). The 3D GPU is another deal: geometry is submitted to it by feeding commands to a FIFO. You need to take care of running commands often enough to avoid the FIFO getting full, especially as games often use DMA to fill it.
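A sketch of that FIFO constraint (structure and names are made up; the half-empty check is the condition that retriggers the DMA games use to feed it):

#include <cstdint>
#include <queue>

struct GXCommand { uint8_t op; uint32_t param; };

int ExecuteCommand(const GXCommand& cmd);  // returns its cycle cost
void TriggerGXFifoDMA();                   // refill via DMA

std::queue<GXCommand> gxFifo;
const size_t kFifoSize = 256;

// Run geometry commands regularly so the FIFO doesn't fill up.
void RunGXFifo(int cycles)
{
    while (cycles > 0 && !gxFifo.empty())
    {
        GXCommand cmd = gxFifo.front();
        gxFifo.pop();
        cycles -= ExecuteCommand(cmd);
    }
    if (gxFifo.size() < kFifoSize / 2)
        TriggerGXFifoDMA();  // games rely on this to keep feeding it
}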

Oh, and by the way, the 2D GPUs are actually pretty complex, supporting a variety of BG modes and sizes, multiple ways to access graphics for sprites, and so on. The 3D GPU isn't any better; I think we have already established that it's a pile of quirks.

So yeah, with the sheer amount of hardware that must be emulated, the DS isn't a piece of cake.]]>
http://melonds.kuribo64.net/comments.php?id=29
melonDS 0.4 -- It's here, finally! -- by StapleButter http://melonds.kuribo64.net/comments.php?id=28 Sun, 16 Jul 2017 01:01:15 +0000

melonDS 0.4 was long awaited, and finally, it's here!

So here's a quick rundown of the changes since 0.3. I'm keeping the best for the end.


The infamous boxtest bug that plagued several games has finally been fixed. The bug generally resulted in missing graphics.

The boxtest feature of the DS lets you determine whether a given box is within the view volume. Several games use it to avoid submitting geometry that would end up completely offscreen. A prime example would be Nanostray, which uses it for everything, including 2D menu elements that are always visible.

Technically, you send XYZ coordinates and sizes to the GPU, which calculates the box vertices from that. The box faces are then transformed and clipped like regular geometry, and the test returns true if any geometry makes it through the process (which would mean that it appears onscreen). This also means that the result will be false if the view volume is entirely contained within the box.

I had no idea where the bug was, as melonDS did all that correctly, and some tests with the libnds BoxTest sample revealed nothing. It turned out that the issue lay within the initial calculation of the box coordinates. When melonDS calculated, say, "X + width", it did so with 32-bit precision. However, the hardware calculates it with 16-bit precision, so if the result overflows, it just gets truncated. And, surprise, games tend to rely on that behavior. Getting it wrong means you end up testing a different box, and getting different results. Hence the bug.
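In code terms, the difference boils down to this (names are mine):

#include <cstdint>

// The hardware adds the box size to the base coordinate with 16-bit
// precision, so an overflowing result is simply truncated -- and games
// rely on that.
int16_t BoxCoord_Hardware(int16_t base, int16_t size)
{
    return (int16_t)(uint16_t)(base + size);  // wraps on overflow
}

// What melonDS effectively did before the fix:
int32_t BoxCoord_OldMelonDS(int16_t base, int16_t size)
{
    return (int32_t)base + size;  // 32-bit math: no truncation, wrong box
}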


There have been various other improvements to the 3D renderer. Things have been revised to be closer to the hardware.

As a result, the Pokémon games look nicer, they don't have those random black dots/lines all over the place anymore. The horrendous Z-fighting observed in Black/White is also gone.

Other games that suffered from random dots/lines around things should be fixed too.

The same goes for things like wrong layering in UI screens rendered with 3D polygons.


The 2D renderer got its share of fixes too, mainly related to bitmap BGs, but also a silly copypaste bug where rotscaled bitmap sprites didn't show up.

But there have also been improvements to an obscure feature: the main memory display FIFO. If you happen to remember:
The display FIFO is another obscure feature -- I don't know of any retail game that uses it, but I have yet to be surprised.

And, when debugging rendering issues in Splinter Cell, I was surprised -- that game does use the display FIFO, for its thermal/night vision modes.

The issue in that game was unrelated to that feature (bad ordering of rendering and capture caused the flickering), but it was a good occasion to improve display FIFO emulation. It at least allowed me to finally axe the last DMA hack, as the associated DMA now works like on hardware. The FIFO is also sampled into a scanline-sized buffer with the same level of accuracy.

This doesn't matter when the FIFO is fed via DMA, but it enables some creative use of the feature -- for example, you can write 16 pixels to the FIFO to render a vertical stripe pattern on the whole screen. Or you will get bad results should you try to use it the wrong way. All of which is similar to what happens on hardware.
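For instance, the stripe trick would look something like this from the DS side (a sketch; DISPFIFO stands in for the display FIFO register, and it assumes the FIFO simply repeats when nothing refills it, as described above):

#include <cstdint>

// Write 16 pixels to the display FIFO and let them repeat across the
// screen. Each 32-bit write holds two RGB555 pixels.
void WriteStripePattern(volatile uint32_t* DISPFIFO)
{
    for (int i = 0; i < 8; i++)   // 8 writes = 16 pixels
        *DISPFIFO = 0x7FFF0000;   // a black pixel, then a white pixel
    // with nothing refilling the FIFO, these 16 pixels get sampled over
    // and over: vertical stripes across the whole screen
}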


Pushing for accuracy, the last few big things that magically happened instantly now have their proper delays emulated, namely SPI transfers (including backup memory SPI) and hardware division/sqrt.


Firmware write was implemented, meaning you can change the firmware settings from the firmware itself (to run the firmware with no game: System -> Run). A backup of your firmware image will be made, should anything go wrong.


And, last but not least: working wifi multiplayer. It was already spoiled, but regardless, it's the first time DS emulation gets this far. And it doesn't require an unofficial build! It's there and ready to use.

It's not perfect though. But it's a start. Pictochat, NSMB and Pokémon are known to be working, but you might encounter spurious disconnects (or, more likely, lag). Mario Kart and Rayman RR2 refuse to work.

I added a setting ("Wifi: bind socket to any address") which you can play with to try making wifi work better. It can make things better or worse. Try it out if needed. Leaving it unchecked should be optimal, but I was told that it doesn't work under Linux.


With all this, no work was done on the UI, despite what I had stated. My apologies for this. But the UI will be worked on, I promise!


You can find melonDS 0.4 on the downloads page.

Patreon for melonDS if you're feeling generous.]]>
http://melonds.kuribo64.net/comments.php?id=28
Nightmare in viewport street -- by StapleButter http://melonds.kuribo64.net/comments.php?id=27 Wed, 05 Jul 2017 10:28:18 +0000
It was reported two months ago in issue #18: missing character models in Homie Rollerz's character select screen. Actually, if you look closely, you can see they were there, but they got compressed to one scanline at the top -- which was caused by a bug in the viewport transform.

In 3D graphics terms, the viewport defines how normalized device coordinates of polygons are transformed to screen coordinates, which can be used to render the polygons.

Most games specify a standard fullscreen viewport, but some games pull tricks. Homie Rollerz is one of them: its character select screen uses a 'bad' viewport. But, unlike high-level graphics APIs, the DS has no concept of a bad viewport. You can input whatever viewport coordinates you want; it doesn't reject or correct them.

So how does it behave? Well, if you've been following this, you surely know that the DS looks normal and friendly on the surface, but if you look deeper, you find out that everything is weird and quirky. Viewports are no exception.

That's why the bug stayed unfixed for so long. GBAtek doesn't document these cases, so it's a matter of running hardware tests, observing results, and doing it again and again until you figure out the logic.


For example, here's a test: the viewport is set to range from 64,0 to 128,192. Nothing special there.

shitty triangle

Now, we change the viewport to range from 192,0 to 128,192, which results in a width of -64 (which, by the way, OpenGL would reject). One would expect such a viewport to mirror the graphics, like this:

shitty triangle but mirrored

It looks correct, but that's not what the hardware outputs. In reality, it looks more like this:

shitty triangle stretched

The GPU doesn't support negative screen coordinates or viewport sizes, so they wrap around to large positive values. The range for X coordinates is 0-511, so for example trying to specify a viewport width of -1 results in a 511 pixel wide viewport. Such values are likely to cause the viewport transform calculations to overflow, causing coordinates to wrap around in a similar fashion. This is what happens in the example above -- the green vertex actually goes above 511.

Y coordinates work in the same way but with a range of 0-255. However, one has to take into account the fact that they're reversed -- the Y axis for 3D graphics goes upwards, but the 3D scene is rendered from top to bottom. The viewport is also specified upside-down. Different ways of reversing the screen coordinates give slightly different results, so it took me a while to figure out the proper way. Y coordinates are reversed before the viewport transform, by inverting their sign. Viewport Y coordinates are reversed too, as such:

finalY0 = 191 - Y1;
finalY1 = 191 - Y0;


Past that, the viewport transform is the same regardless of the viewport coordinates and vertices. The only special case is that polygons with any Y coordinate greater than 192 aren't rendered (though they are still stored in vertex/polygon RAM). This is probably because the hardware can't figure out whether the coordinate should have been negative (as said before, it doesn't support negative screen coordinates), and thus can't guess what the correct result should be, so it takes the easy way out. This is a bit odd, though, considering it has no problem with X coordinates going beyond 256.
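Putting these findings together, the transform behaves roughly like this (a sketch distilled from the tests above; not guaranteed to be bit-exact):

#include <cstdint>

// No rejection of 'bad' viewports: sizes just wrap as unsigned values,
// X wraps to 9 bits, Y is sign-flipped then wrapped to 8 bits.
// Assumes w != 0.
void ViewportTransform(int32_t x, int32_t y, int32_t w,
                       uint32_t vpX0, uint32_t vpY0,
                       uint32_t vpX1, uint32_t vpY1,
                       uint32_t& screenX, uint32_t& screenY)
{
    // viewport Y coordinates are reversed, as shown above
    uint32_t y0 = 191 - vpY1;
    uint32_t y1 = 191 - vpY0;

    // 'bad' sizes aren't rejected: they wrap to huge positive values
    uint32_t width  = vpX1 - vpX0;
    uint32_t height = y1 - y0;

    // vertex Y is sign-flipped before the transform; everything is
    // allowed to overflow, then wraps
    screenX = (uint32_t)(((int64_t)(x + w) * width) / (2 * (int64_t)w) + vpX0) & 0x1FF;
    screenY = (uint32_t)(((int64_t)(-y + w) * height) / (2 * (int64_t)w) + y0) & 0xFF;
}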


With that, I'm not sure how coordinates should be scaled should 3D upscaling be implemented. But, I have a few ideas. We'll see.


But regardless, after revising viewport handling according to these findings, the Homie Rollerz character select screen Just Works™, and I haven't observed any regression, so it's all good.


Also, release 0.4 is coming very soon.]]>
http://melonds.kuribo64.net/comments.php?id=27
To make it clear -- by StapleButter http://melonds.kuribo64.net/comments.php?id=26 Wed, 21 Jun 2017 11:35:54 +0000 There is no Patreon version of melonDS. There will never be such a version.

Donating doesn't entitle you to anything. Not donating doesn't make you miss out on anything. You get the same package in all cases.


(this post used to be directed towards a particular news post, but it has since been corrected, so I will leave this here as a general note)


While it's nice that there are emu sites spreading the news, it is really better when they do some basic research and fact checking instead of posting mere assumptions that only cause confusion.


Thank you.]]>
http://melonds.kuribo64.net/comments.php?id=26
Opening to the outer world -- by StapleButter http://melonds.kuribo64.net/comments.php?id=25 Tue, 20 Jun 2017 09:33:54 +0000

The first melonDS release, 0.1, already included some wifi code, but it was a very minimalistic stub. The point was merely to allow games to get past wifi initialization successfully. And it even failed at that due to a bug.

Chocolate waffles to you if you can locate the bug, by the way ;)

But well, at that stage, the focus wasn't much on wifi.


It was eventually fixed in 0.2, and some functionality was added, but it still didn't do much at all. Games finally got past wifi initialization, but that was about it.

It wasn't until 0.3 that some serious work was done. With the emulator core getting more robust, I could try going for the wifi quest again. Not that 0.3 went far at all -- it merely allowed players to see each other, but it wasn't possible to actually connect. But it was something: the infrastructure for sending and receiving packets was in place and working, as well as a good chunk of the wifi hardware functionality.


You may already know how it went back in the DeSmuME days. As far as local multiplayer was concerned, I kept hitting a wall. Couldn't get it working, no matter how hard I tried. WFC is a separate issue.

It didn't help motivation knowing that my work was doomed to stay locked behind a permanent EXPERIMENTAL_WIFI wall, requiring a custom wifi-enabled build, and that the DeSmuME team's public attitude was to sweep wifi under the carpet and pretend it doesn't exist. But the main issue was the lack of documentation as far as local multiplayer is concerned.


The DS wifi hardware isn't a simple, standard transceiver. It is a custom, proprietary wifi system made by Nintendo for their purposes. It has special features to assist local multiplay communication at a fast rate.

GBAtek documents the more regular wifi features well enough to get things like WFC working, but the specific multiplay features weren't well known until now. Many packet captures have been made, and several multiplay protocols like Pictochat and download play have been reverse-engineered, but details on hardware operation were lacking. The hardware operates in a very specific way during multiplay, and there's some real clever design there.

As the DS doesn't support ad-hoc communication, multiplay follows a host-client scheme. Clients can't directly communicate between themselves, everything goes through the host. Who is the host depends on the game -- it is generally the person who starts the multiplayer game.

The host sends beacon frames at a regular interval to advertise the game to potential clients, pretty much like how access points send beacons to advertise their presence. The DS wifi hardware can automatically send beacons at a set interval, but there's nothing too special there.

When a client connects, the regular 802.11 association procedure is followed. The client sends an auth frame, to which the host replies with an auth frame, then the client sends an association request, and the host replies with an association response which gives the client its ID.

At this point, multiplay communication begins, and this is where the real meat is.

The host sends data frames using a special TX slot (which GBAtek names CMD). It doesn't simply send the data; the hardware actually assists the whole exchange.

The host data frame has a bitmask telling which clients should reply. Each client is given an ID from 1 to 15.

So, after the data frame is sent, the hardware waits for those clients to reply. It uses said bitmask, as well as a register telling how long a client reply should last, to determine how long the wait should last. It can report when clients fail to reply within their window (assuming clients reply in order).

The client doesn't have much control over the process. It has a register that tells where its reply frame is in memory, and a register that holds its client ID, and that's about it. When receiving a data frame, the hardware checks the frame's client bitmask against the client ID register to determine whether it should reply. The data frame also contains the client reply time. With this data, the client can determine when it should send its reply, and does so automatically. This happens even if no reply frame address is set -- in that case, it sends a default empty reply.

When all this is done, the host sends an automatic acknowledgement frame to the connected clients. This frame is different from the standard 802.11 ack in that it is received like a regular frame and processed by the clients. The standard ack isn't exposed to software.

If any replies are missing, the host will attempt to repeat the whole process (maybe altering the client bitmask so the clients that replied successfully aren't polled again -- I haven't checked), provided there is enough time left: the CMDCOUNT timer defines a window during which the transfer is possible. When sending packets, the hardware needs to wait until no one else is sending; crosstalk isn't a good idea. Regular TX slots seem to keep waiting until the channel is free, but the CMD slot can abort if it has waited for too long.
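Put together, one host-side transfer looks roughly like this (a structural sketch only -- all of this is done by the hardware itself, and the helper functions are stand-ins I made up):

#include <cstdint>

void SendCmdFrame(uint16_t clientMask, uint16_t replyTime);
bool WaitForReply(int clientID, uint16_t replyTime);
void MarkClientFailed(int clientID);   // reported via status bits
void SendCmdAck(uint16_t clientMask);  // the special ack frame

void HostCmdTransfer(uint16_t clientMask, uint16_t replyTime)
{
    SendCmdFrame(clientMask, replyTime);   // data frame with client bitmask

    for (int id = 1; id <= 15; id++)       // clients reply in ID order
    {
        if (!(clientMask & (1 << id))) continue;
        if (!WaitForReply(id, replyTime))  // each client has its own window
            MarkClientFailed(id);
    }

    SendCmdAck(clientMask);  // received and processed like a regular frame

    // if replies are missing and the CMDCOUNT window isn't exhausted,
    // the whole exchange is retried
}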

All in all, a neat piece of hardware. You can guess that missing details on all this doesn't help with getting multiplay working, especially as there are various status bits that get set here and there during the process, some of which are crucial.


So I set to work. I managed to replicate multiplay communication between two DSes with homebrew, and used it to work out the details. This, coupled with reverse-engineering of the Pictochat code, finally allowed me to get it working in melonDS.



It isn't perfect, but this is definitely progress, as neither DeSmuME nor NO$GBA has gotten this far. As far as Pictochat is concerned, it appears that host->client transfers work fine, but client->host transfers sometimes get partially corrupted or lost entirely (which causes the client to get stuck, unable to send).

NSMB MvsL also works, and seems rather stable, even though it can disconnect on you. There is some slowdown which appears to be caused by data loss.

There are also issues stemming from the communication medium used, namely, plain old BSD sockets. With UDP, it's easy and cross-platform, but there are issues. Binding to 127.0.0.1 gave the best results for me with Pictochat under Windows, but I got feedback that it doesn't work under Linux. Binding to all possible addresses works under both Windows and Linux, but seemed worse for Pictochat while being better for some other games. So we need to work on this too. BSD sockets have the advantage of allowing play on separate computers over LAN (in theory -- this hasn't been tested), but we may need faster IPC methods. Multiplayer sends packets at an amazing rate (Pictochat sends one every ~4ms), so the communication medium has to keep up with this.
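For reference, the setup in question boils down to something like this (a simplified POSIX sketch, not melonDS's exact code; bindAny corresponds to the binding choice discussed above):

#include <cstdint>
#include <cstring>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

// Simplified BSD-sockets setup (POSIX flavor; Windows uses winsock,
// but the calls are the same).
int OpenMPSocket(bool bindAny, uint16_t port)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    // several emulator instances on one machine must share the port
    int one = 1;
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = bindAny ? htonl(INADDR_ANY)
                                   : htonl(INADDR_LOOPBACK);
    bind(sock, (sockaddr*)&addr, sizeof(addr));
    return sock;
}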


Stay tuned for more exciting reports!

melonDS Patreon]]>
http://melonds.kuribo64.net/comments.php?id=25
So there it is, melonDS 0.3 -- by StapleButter http://melonds.kuribo64.net/comments.php?id=24 Sun, 04 Jun 2017 17:32:40 +0000

So what's new in this version?

A bunch of bugfixes. This version has better compatibility than the previous one.

This includes graphical glitches ranging from UI elements folding on themselves to motion blur filters becoming acid trips, or your TV decoder breaking.

But also, more evil bugs. As stated in previous posts, booting your game from the firmware is no longer a gamble; it should be stable all the time now. An amusing side note on that bug: it has existed since melonDS 0.1, but in that version, the RTC returned a hardcoded time, so the bug would always behave the same (some games worked, others not). melonDS 0.2 started using the system time, which is what introduced the randomness.

The 3D renderer got one-upped too. Now it runs on a separate thread, which gives a pretty nice speed boost on multicore CPUs. This is optional, so if it causes issues or slows things down, you can disable it and the renderer will work mostly like before.

When I had to implement the less interesting aspects of this (controlling the 3D renderer thread), I procrastinated and implemented 3D features instead. Like fog or edge marking. You can see them demonstrated in the screenshots above.

Then I went back and finished the threading. I'm not a big fan of threaded code, but it seems to be completely stable.

However, resetting or loading a new game is still not completely stable; it has a chance of freezing. Oh, and the UI still sucks. I plan to finally get at the UI shit for 0.4, and I want to ditch wxWidgets, so I don't really feel like pouring a lot of time into the current UI.

There are more things that went into this release, and I'll let you find out!


Also, here's the melonDS Patreon for those who feel generous.


Have fun!]]>
http://melonds.kuribo64.net/comments.php?id=24
Threads! -- by StapleButter http://melonds.kuribo64.net/comments.php?id=23 Wed, 24 May 2017 02:01:34 +0000

Ever since 3D rendering was added into melonDS, it's been one of the bottlenecks whenever games use it. On the other hand, 2D rendering, while not being very well optimized, doesn't make for a big performance hit.

2D rendering isn't very expensive or difficult though -- the 2D renderers are oldschool tile engines, essentially drawing raster graphics onscreen at the specified coordinates, optionally with a bunch of fancy effects, but nothing too complex. In comparison, the 3D renderer is a full-fledged 3D GPU. It basically turns a bunch of polygons defined in 3D space, into a 2D representation that is then passed to the main 2D renderer and composited mostly like a regular 2D layer.

Transformations by various matrices, culling, clipping, viewport transform, rasterization with perspective-correct interpolation... it's a bunch of work.


The approach originally taken was to render a whole 3D frame upon scanline 215. I'm not sure whether rendering should start upon scanline 214 or 215 (GBAtek says it starts "48 scanlines in advance"), but melonDS starts at 215.

Which basically meant that the emulator had to wait until the whole 3D frame was rendered before doing anything else.


However, in emulation, you can't just throw everything on separate threads. For a given component, whether you can put it on a separate thread depends on how tightly it is synchronized with the other components.

This excellent article from byuu explains well how all this works.

In the case of our DS 3D renderer, threading is feasible because tight synchronization isn't required. Once the renderer starts rendering, you can't alter its state: the polygon and vertex lists are double-buffered, and all the other important registers are latched. The only thing you can do is change the VRAM mapping, but I have yet to come across a game that pulls stunts like swapping texture banks mid-frame.

So you can put the rendering process on a separate thread, and just tell it when to start rendering, and wait until it's done before telling it to start a new frame.

That's not the approach I tried, though. Rendering starts upon scanline 215, but the 2D renderer needs 3D graphics as soon as scanline 0, which only leaves 48 scanlines worth of time (out of 263) for the 3D thread to do its job. We can do better.

The 2D renderer is scanline-based. So, upon scanline 0, it doesn't need a whole 3D frame, but only the first scanline of it. And similarly for further scanlines. This gives the 3D thread 192 scanlines worth of time, which is a lot better.

It required more work though, as the 3D renderer had to be modified to work per-scanline. But nothing insurmountable. After a while spent tracking threading bugs, this finally worked, and gave a nice speed boost. Except it caused glitches in Worms 2.

The glitches were because of the "wait for the frame to be finished before starting a new frame" bit. I first did it upon scanline 215, right before signalling the 3D thread to start rendering. Well, normally the renderer would be done by scanline 192 anyway.

But Worms 2 takes advantage of the fact that 3D rendering starts 48 scanlines in advance, which also means it finishes 48 scanlines in advance. So the game unmaps texture banks and starts loading new textures as soon as scanline 146.

Which would cause problems if the 3D thread was still running at that time. When I forced the main thread to wait for the 3D thread to finish upon scanline 144, the glitches went away.
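In sketch form, the per-scanline handoff looks something like this (names and structure are mine; the real code also has proper start/stop control and the scanline-144 wait mentioned above):

#include <mutex>
#include <condition_variable>

// The 3D thread publishes scanlines as it finishes them; the 2D
// renderer waits on individual scanlines instead of the whole frame.
std::mutex mtx;
std::condition_variable cv;
int linesDone = 192;      // previous frame starts out finished
bool startFrame = false;

void RenderScanline(int line);  // the actual rasterizer work

void Render3DThreadFunc()
{
    for (;;)
    {
        {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, []{ return startFrame; });
            startFrame = false;
            linesDone = 0;
        }
        for (int line = 0; line < 192; line++)
        {
            RenderScanline(line);
            { std::lock_guard<std::mutex> lk(mtx); linesDone = line + 1; }
            cv.notify_all();
        }
    }
}

// main thread, upon scanline 215: kick off the next 3D frame
void Start3DFrame()
{
    { std::lock_guard<std::mutex> lk(mtx); startFrame = true; }
    cv.notify_all();
}

// main thread: wait for one 3D scanline (2D compositing), or for the
// whole frame (linesDone == 192) by scanline 144, for Worms 2's sake
void WaitFor3DScanline(int line)
{
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [&]{ return linesDone > line; });
}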


In the end, the threaded 3D renderer gives a nice speed boost. On my laptop, Super Mario 64 DS runs fullspeed now, and Mario Kart DS is close to fullspeed too. And I haven't seen more glitches, so all should be fine.

It still needs a bunch of polish. I want to make it optional, and the 3D thread needs proper start/stop control.

Also note that there will be no performance gain if you have a single-core CPU (the whole threading apparatus may even reduce performance).

And, of course, there are other targets to optimize too.



Love and waffles~]]>
http://melonds.kuribo64.net/comments.php?id=23