https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
Has this been used at least once? How applicable can this be, considering all the CPU emulation difficulties that slow down the cores? Does the task we'll be giving to the GPU have to be of a threaded nature, so the GPU can parallelize it, or will something straightforward like a traditional interpreter core also work? I'm probably wording this poorly, just curious if anyone has already pondered this idea.
Shaders.
What difficulties?
When this question comes up, people are essentially saying that parts of the emulated system should be separated and distributed to several CPU cores. However, a CPU core works best when it can run independently of the others, because then it doesn't have to wait for data to arrive or wait for synchronization (the other core must finish its current task, check for waiting queries, and send its "ready!" signal). Typical latency numbers show that synchronization (mutexes) is relatively slow compared to cache access. If I can fit most of the system state into the L1 cache and much of the data/instruction stream into the L2 cache, I'm going to be more than 4 times faster working with L2 cache data than working with another core.
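To make that handshake cost concrete, here's a minimal sketch (hypothetical, not taken from any real emulator) of the pattern I'm describing: the thread emulating the CPU has to block on a mutex/condition variable until the thread emulating some other chip signals "ready!".

[code]
// Hypothetical sketch of the cross-thread handshake described above.
// The "CPU" thread cannot advance until the "coprocessor" thread signals
// readiness; every such wait costs far more than an L1/L2 cache hit.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false;

void coprocessor_thread() {
    // Finish the current task, then send the "ready!" signal.
    {
        std::lock_guard<std::mutex> lock(m);
        ready = true;
    }
    cv.notify_one();
}

void cpu_thread() {
    // Block until the other emulated component has caught up.
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });
    std::printf("coprocessor result received, CPU core continues\n");
}

int main() {
    std::thread copro(coprocessor_thread);
    std::thread cpu(cpu_thread);
    copro.join();
    cpu.join();
}
[/code]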
GPUs take the "separate the cores" approach even further: their SMs/CUs (streaming multiprocessors / compute units) are slower than a CPU core and have less cache associated with them. This doesn't help for a console like the SNES, where every part of the system can influence every other part at a rate of ~21 MHz (every ~46.56 ns).
One has to understand that a GPU doesn't run code in the same way as a multi-core CPU. GPUs have "general-purpose" cores in the sense that they support almost everything a CPU does (at least in terms of mathematical calculations applied to data), but not in the sense of "every core can execute code independently of the others".
Each core in a multi-core CPU can run an independent routine, or the same routine completely out-of-sync. The cores are pretty much completely independent of each other and can do whatever they want. (Which is actually one of the reasons why it's so difficult to have them handle the same shared data: They could interfere with each other when reading/writing that shared data.)
In a GPU, however, each core (at least the cores in a group) runs the exact same routine, completely in sync. It's like a bunch of cores traversing the same set of opcodes at the same speed, every one executing the exact same instruction as every other one, in parallel. (This execution model is labeled "SIMT", or "single instruction, multiple threads".) So it's essentially one program, which multiple cores execute perfectly in unison. (The reason this makes any kind of sense is that the values the program operates on can be different for each core, i.e. each core can get different input values, execute the same instructions on them, and thus produce different output values.)
Anyway, what this means is that to get any advantage from GPGPU, your task needs to be one single routine that can be run on hundreds/thousands of cores in parallel (using different input values for each). You can't just run different tasks on different cores.
(Well, technically you can because the cores in a GPU are divided into groups, each group being able to run an independent routine. However, to get any sort of speed advantage from this, the routine ought to be parallelizable to dozens of cores at least. But even then, there are limitations on how these different routines can interact with each other.)
Fragment shaders are a good example of this (and it's exactly how they work in the rendering process). A fragment shader is one program that gets run in parallel for hundreds (even thousands) of pixels, with different input values (e.g. texture coordinates).
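Here's a minimal CUDA sketch in the same spirit (a hypothetical example, not taken from any emulator): one routine launched across a huge number of threads, where every thread runs the exact same instructions but on its own element of the data (think of the elements as pixels and the factor as a brightness adjustment).

[code]
// Hypothetical CUDA example of "one program, many threads, different inputs".
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_kernel(const float* in, float* out, float factor, int n) {
    // Each thread computes its own index and works on its own element;
    // the instruction stream is identical for every thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    // One routine, ~a million threads, each with different input values.
    scale_kernel<<<(n + 255) / 256, 256>>>(in, out, 0.5f, n);
    cudaDeviceSynchronize();

    std::printf("out[10] = %f\n", out[10]);
    cudaFree(in);
    cudaFree(out);
}
[/code]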
(The SIMT mode of execution is also the reason why conditionals are a bit inefficient in GPGPU programming. If you have an "if(condition) slow_code(); else another_slow_code();", then when the threads reach that condition, the ones where the condition is false go idle and wait for the remaining threads to finish executing the if branch. After that, the formerly idle threads execute the else branch while the others go idle. The overall cost of that conditional is the same as executing both branches consecutively, regardless of how many threads go to one branch and how many to the other, as long as at least one thread takes each branch.)
This is true at the level of one indivisible group of threads (which, incidentally, is called a "warp"). Typically that's 32 threads on NVIDIA hardware (AMD's equivalent, the "wavefront", is 32 or 64 threads depending on the architecture).
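For illustration, here's a small hypothetical CUDA sketch of that divergence cost: even and odd lanes of the same warp take different branches, so the warp ends up executing both slow paths one after the other.

[code]
// Hypothetical example of warp divergence: within one warp, the two
// branches are serialized, so the warp pays for both slow paths.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float slow_code(float x) {
    for (int k = 0; k < 1000; ++k) x = x * 1.0001f + 0.5f;
    return x;
}

__device__ float another_slow_code(float x) {
    for (int k = 0; k < 1000; ++k) x = x * 0.9999f - 0.5f;
    return x;
}

__global__ void divergent_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Even and odd threads of the same warp diverge here; the hardware
    // runs slow_code() for half the lanes while the others idle, then
    // another_slow_code() for the other half.
    if ((i & 1) == 0)
        data[i] = slow_code(data[i]);
    else
        data[i] = another_slow_code(data[i]);
}

int main() {
    const int n = 1 << 16;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    divergent_kernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    std::printf("data[0] = %f, data[1] = %f\n", data[0], data[1]);
    cudaFree(data);
}
[/code]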
However, GPUs can juggle separate pieces of code around in other ways that don't have to obey the SIMT restrictions. Most notably, they can park a warp of threads while it's waiting for a memory access and run other calculations on their many ALUs while the slow-by-comparison memory controller catches up. (Modern CPUs can do this sort of thing too, but to a much lesser extent, because unlike GPU programs, CPUs don't normally have the benefit of explicit instructions telling them what bulk memory transfers will be needed in the future.) There's no immediately obvious way to use this sort of feature in a standard emulator design, but it wouldn't surprise me if there were some clever way to do it.