Senior Moderator
Joined: 8/4/2005
Posts: 5777
Location: Away
So here's a horrible idea. Bear with me for a bit. Right now, emulator botting is fundamentally a single-threaded endeavor, and CPUs haven't made drastic progress in that department lately (the best consumer CPU you'll find today probably won't even be 1.5 times faster on a typical ST task than my 5.5-year-old i3-8350K from 2017 overclocked to 5.1 GHz). So if we want a bot to handle a complex enough task, we either have to run multiple instances of the emulator and make each instance search through a dedicated chunk of input permutations, or, in some very specific cases, isolate the game logic entirely and turn it into parallelized code that runs completely separately from the emulator. Both are very labor-intensive (especially the latter, which also requires reverse-engineering and knowledge of a high-performance programming language), neither has a generalized solution, and the former still doesn't run very fast, either.
The fact that emulator code is mostly not parallelizable has been a serious problem for botting, and the main argument against porting it to CUDA or another compute-shader platform to take advantage of GPUs' massive parallelism. However, that argument has historically been built around the speed of a single emulator instance, which would most likely be slower on a GPU, and it's a valid argument if we want to prioritize that. But do we? What if we use the GPU to run multiple cores, each within its own waterboxed instance, all in parallel? Sure, each one would be slower, but on the other hand, we can run as many as we can fit into VRAM without much (if any) overhead, unlike running them on a >8-core CPU.
Assuming a single core instance for an 8/16-bit platform takes roughly 200 MB on average (I just pulled that number out of my ass, don't judge me), and we're using a graphics card with 12 GB of VRAM (a GTX 1080 Ti, RTX 2060-12, RTX 3060, or RX 6700 XT, most of which can be found for a couple hundred USD on the aftermarket), we can fit up to ~60 instances of our core at the same time, with a single interface for managing their inputs. So even if a single core runs about twice as slow as it would on a modern 5 GHz CPU core (out of my ass again), that's still a whopping 30x net speedup for the purposes of botting in particular. And a graphics card with 16, 20, or 24 GB would result in a proportionately larger speedup still, which would make it a really damn good generalized solution in the longer term as VRAM sizes keep increasing at a faster rate than the number of high-performance CPU cores.
And then there's the possibility of running 3-4 GPUs on the same machine, and you can see how well it scales in principle. We're looking at overall speedups of at least two orders of magnitude in the near term: up to 500x could already be achievable with today's technology if my napkin math here is anywhere close to realistic. And from there, it's only a relatively minor step to a folding@home-style parallel computing network run by other TASVideos members on their own GPUs, each able to pick the games to which they'd like to dedicate their compute resources.
I've avoided the obvious elephant in the room, which is whether that's feasible to implement at all, to which my answer is: honestly, I don't know. But if it is, it's something worth considering, as the optimization problems we're encountering become progressively more complex with both old and new games, so we'll be relying on bots more, not less, over time.
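For anyone who wants to poke at the napkin math, here is a minimal Python sketch of it. Every input is one of the guesses from the post (200 MB per instance, half the speed of a CPU core), VRAM is counted in decimal megabytes so that 12 GB / 200 MB lands on the ~60 figure, and the function names are made up for illustration.

```python
# Napkin math only: every constant below is a guess taken from the post,
# not a measurement.

INSTANCE_MB = 200       # assumed VRAM footprint of one 8/16-bit core instance
RELATIVE_SPEED = 0.5    # assumed speed of one GPU instance vs. one 5 GHz CPU core


def max_instances(vram_mb: int, instance_mb: int = INSTANCE_MB) -> int:
    """How many whole instances fit into the given amount of VRAM."""
    return vram_mb // instance_mb


def net_speedup(instances: int, relative_speed: float = RELATIVE_SPEED) -> float:
    """Aggregate throughput relative to a single CPU-hosted instance."""
    return instances * relative_speed


if __name__ == "__main__":
    for vram_gb in (12, 16, 20, 24):
        n = max_instances(vram_gb * 1000)  # decimal MB, to match the ~60 estimate
        print(f"{vram_gb} GB: {n} instances, ~{net_speedup(n):.0f}x net speedup")
```

Multiplying the result by the number of cards gives the multi-GPU figure, under the further assumption that instances on different cards don't slow each other down.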
Warp wrote:
Edit: I think I understand now: It's my avatar, isn't it? It makes me look angry.
Post subject: GPUs and Emulation don't mix
eien86
He/Him
Judge, Skilled player (1981)
Joined: 3/21/2021
Posts: 275
Location: Switzerland
Fully parallelizable emulators exist, and I've been using them for a while with a high level of parallelism (128 cores in my Threadripper machine). See: [https://tasvideos.org/Forum/Topics/24058] Take a look at my movies to see what can be done with a fully parallelized bot: [https://tasvideos.org/Movies-Author11245] Some emus are non-parallelizable, yes, but only because they employ global variables. This is an easily fixable problem, and I did indeed solve it for a bunch of platforms in JaffarPlus. Beyond that, GPUs are nothing but vector processors, ideal for workloads like linear algebra where the same instruction is applied to many data elements. Console emulation is literally the worst possible workload for a GPU, because at any given moment every instance of the game will generally be emulating a different console opcode.
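A toy model of why that hurts, as a minimal Python sketch. It assumes NVIDIA-style lockstep execution in groups (warps) of 32 lanes, and it simplifies heavily: each distinct opcode among the lanes is charged one full pass while the other lanes sit idle. Real hardware is more nuanced (the cost depends on how long each branch is, and lanes reconverge afterwards), so treat the numbers as an illustration of the trend, not a prediction. All names are hypothetical.

```python
import random

WARP_SIZE = 32  # lanes executing in lockstep (the warp size on NVIDIA GPUs)


def passes_per_step(opcodes: list[int]) -> int:
    """Toy model: one serialized pass per distinct opcode among the lanes."""
    return len(set(opcodes))


def lane_utilization(opcodes: list[int]) -> float:
    """Fraction of lane-slots doing useful work under the toy model."""
    return 1 / passes_per_step(opcodes)


if __name__ == "__main__":
    # Linear-algebra-like workload: every lane runs the same instruction.
    uniform = [0xA9] * WARP_SIZE
    # Emulation-like workload: each game instance sits at its own opcode.
    rng = random.Random(0)
    divergent = [rng.randrange(256) for _ in range(WARP_SIZE)]

    for name, ops in (("uniform", uniform), ("divergent", divergent)):
        print(f"{name}: {passes_per_step(ops)} passes, "
              f"{lane_utilization(ops):.1%} lane utilization")
```

In the worst case of this model, 32 instances at 32 different opcodes take 32 passes, which is no better than running them one after another on a single, much slower core.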
Senior Moderator
Joined: 8/4/2005
Posts: 5777
Location: Away
Oh yes, I am familiar with your work on PoP. Didn't know you'd made that into a (more) generalized solution since, that's very cool. I realize it's a horrible workload for a GPU, which, I'm sure, is why nobody has ever seriously considered it, but I'd be interested to know where the limits of this approach are, because GPUs allow for so much better (and cheaper) scaling that they can make up for the lack of speed through sheer volume. Because, say, if this is just 1/10 the speed of a consumer CPU, we're still more than good to go. If it's somewhere between 1/10 and 1/20, we're still good to go at a big enough scale. Between 1/20 and 1/50 could be problematic, at least in the near term. Slower than 1/50 is where I'd say it's completely infeasible.
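To put rough numbers on those tiers, here is a minimal Python sketch that reuses the guessed ~60 instances per 12 GB card from the first post. The instance count and the function name are assumptions for illustration; nothing here is measured.

```python
INSTANCES_PER_CARD = 60  # guessed figure for a 12 GB card from the first post


def net_throughput(instances: int, slowdown: float) -> float:
    """Aggregate speed vs. one CPU core when each instance runs at 1/slowdown."""
    return instances / slowdown


if __name__ == "__main__":
    for slowdown in (10, 20, 50):
        total = net_throughput(INSTANCES_PER_CARD, slowdown)
        print(f"1/{slowdown} speed per instance: ~{total:.1f}x one CPU core")
```

Under these assumptions a single card works out to about 6x, 3x, and 1.2x of one CPU core at 1/10, 1/20, and 1/50 speed respectively. The break-even slowdown equals the instance count, so anything slower than 1/60 per instance would lose to a single core outright.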