Here's how to pipe the framebuffer to ffmpeg without intermediate images, though this is in C++:
https://github.com/clementgallet/libTAS/blob/master/src/library/encoding/AVEncoder.cpp
https://github.com/clementgallet/libTAS/blob/master/src/library/encoding/NutMuxer.cpp
With a pipe like that set up, you don't carry any overhead that exists only to keep the magic working. And you still have full control over ffmpeg, since you still pass your own command.
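If you'd rather not port the C++, the same idea can be roughed out in Python. This is only a sketch under assumptions: it feeds headerless RGB24 frames to ffmpeg's rawvideo demuxer over stdin instead of writing a NUT stream like libTAS does, and `build_ffmpeg_command`, `encode_frames`, and the 3-bytes-per-pixel frame layout are hypothetical choices of mine, not anything taken from libTAS or your project.

```python
import subprocess
from typing import Iterable, List, Optional


def build_ffmpeg_command(width: int, height: int, fps: int, output: str,
                         user_args: Optional[List[str]] = None) -> List[str]:
    """Build an ffmpeg argv that reads raw RGB24 frames from stdin."""
    cmd = [
        "ffmpeg", "-y",
        "-f", "rawvideo",                     # headerless input, described by hand
        "-pixel_format", "rgb24",             # assumed framebuffer layout
        "-video_size", f"{width}x{height}",
        "-framerate", str(fps),
        "-i", "-",                            # "-" means read from stdin
    ]
    cmd += list(user_args or [])              # your own encoder options go here
    cmd.append(output)
    return cmd


def encode_frames(frames: Iterable[bytes], width: int, height: int, fps: int,
                  output: str, user_args: Optional[List[str]] = None) -> int:
    """Stream frames straight into ffmpeg; nothing is written to disk first."""
    frame_size = width * height * 3           # 3 bytes per pixel for rgb24
    proc = subprocess.Popen(
        build_ffmpeg_command(width, height, fps, output, user_args),
        stdin=subprocess.PIPE,
    )
    try:
        for frame in frames:
            if len(frame) != frame_size:
                raise ValueError("frame does not match the declared size")
            proc.stdin.write(frame)
    finally:
        proc.stdin.close()                    # EOF tells ffmpeg to finish the file
    return proc.wait()                        # ffmpeg's exit code
```

Something like `encode_frames(my_frames, 320, 224, 60, "out.mkv", ["-c:v", "libx264", "-crf", "18"])` would then be the whole encode step (the resolution, frame rate, and options there are just placeholders), and the part you pass as `user_args` is still entirely your own ffmpeg command.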
If it's just a matter of keeping only a few images and then removing them, all you need is an efficient ffmpeg command. Here's what YouTube does to your bitrate (might be outdated, but it's more or less true):
https://vadosnaprimer.github.io/feos.vs.youtube.2160p.html
Here's how size and speed relate when using x264:
http://tasvideos.org/Feos/VideoTests.html
Do you render at 1x scale? It's quite possible that rendering at 2x scale would let you preserve much higher quality overall, while also letting you raise the ratefactor significantly to reduce the size (at native pixel size, color detail gets heavily damaged by the chroma subsampling that yuv420 does). It's worth using a short test clip to see what YouTube does to your video, be it 4K or 8K.
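To show why 2x helps with yuv420, here's a toy model in Python. It's a simplification, not what x264 or YouTube actually run: it treats 4:2:0 as "average every 2x2 block of a chroma plane and replicate it back", and the checkerboard plane and function names are made up for the illustration.

```python
def upscale_nearest(plane, factor):
    """Repeat every sample `factor` times in both directions (nearest neighbor)."""
    return [
        [value for value in row for _ in range(factor)]
        for row in plane
        for _ in range(factor)
    ]


def simulate_420(chroma):
    """Toy model of 4:2:0: average each 2x2 block, then replicate it back.

    Assumes even width and height. Real encoders and decoders use their own
    filters, so this only illustrates the loss of chroma resolution.
    """
    height, width = len(chroma), len(chroma[0])
    out = [[0] * width for _ in range(height)]
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            total = (chroma[y][x] + chroma[y][x + 1]
                     + chroma[y + 1][x] + chroma[y + 1][x + 1])
            average = total // 4
            for dy in (0, 1):
                for dx in (0, 1):
                    out[y + dy][x + dx] = average
    return out


# Hypothetical 4x4 chroma plane with single-pixel color detail (a checkerboard).
checker = [[200 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]

native = simulate_420(checker)            # 1x: neighbors get blended together
doubled = upscale_nearest(checker, 2)     # 2x: each pixel becomes its own 2x2 block
doubled_after = simulate_420(doubled)

print(native[0])                 # [100, 100, 100, 100]
print(doubled_after == doubled)  # True
```

At 1x the single-pixel color pattern collapses into a flat average, while at 2x every source pixel fills exactly one 2x2 block and survives untouched. With ffmpeg, a nearest-neighbor 2x upscale can be done with something like `-vf scale=iw*2:ih*2:flags=neighbor`, and the ratefactor is the `-crf` value for x264.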
BTW, since Tails is so critically important in this run, wouldn't it be better to not make him transparent at all? I guess he's half-transparent when off-screen, but this is one of the biggest features here, so making it more obvious would help. I doubt that not knowing when he's off-screen will make anyone angry.