Hi everybody,
I study Machine Learning and for one of my course I intend to work on designing an algorithm aimed at learning to play a game by statistical means.
The method used is called Reinforcement Learning (RL). In this field of machine learning, we try to develop algorithms that compute strategies revolving around improvement over a high number of simulation (die and retry style, or pavlovian learning style, but computed).
I've been casually watching the TAS community for a while and this project is the best occasion to do some hacking myself, so I've downloaded FCEUX (both linux version and wine) and began to test the lua scripting tool on Gradius, helped by the datacrystal RAM mapping.
At first glance I was amazed by the flexibility and the robustness of the engine. However after some digging I've found a couple of inconsistencies, hence my questions:
Version: FCEUX 2.2.2Turbo mode in Linux: it's not implemented ? maximum mode works but turbo doesn't.
emu.frameadvance and turbo mode: when turbo/maximum mode is on, what is the behaviour of emu.frameadvance() ? I designed a random pathing AI that speedfire and take random direction inputs at each frame, and it doesn't behave well in turbo mode.. does it means that computations are skipped in turbo mode ?
Side note: if you're interested by the project - which is super exciting since machine learning could lead to amazing TAS techniques - feel free to ask more questions on the subjects). It will happen in a short window of time for me but I'd be interested to continue it afterward.
Sincerely.
Joined: 4/17/2010
Posts: 11486
Location: Lake Chargoggagoggmanchauggagoggchaubunagungamaugg
What exactly is wrong when you turbo?
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
What my script does: select random directions at each frames.
What I see in normal mode: Vic Viper taking small step in random directions, back and forth.
What I see in turbo mode: Vic Viper crashing into the sides of the screen.
Answer to myself: sorry, the problem is not here.
My script is just bad, I just realised put math.randomseed(os.script()) in a loop. Tired days.
I'm not familiar with Lua though. Should I delete the thread ? -.-" I will post again in the TAS technique section.
Joined: 4/17/2010
Posts: 11486
Location: Lake Chargoggagoggmanchauggagoggchaubunagungamaugg
No need to delete of course, you will likely still have questions. Not sure if it should be moved either.
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
I did a bit of ML lately as well and I seriously really don't think using lua could lead to any great implementation which can be easily maintained or been improved. It's true that MarI/O did quite a bit of buzz, but the result is actually quite simple, since the fitness function only check for horizontal scroll.
So, if you don't go the lua-way, maybe could be to directly hack fceux in C... or:
An another possibility could be to add a socket capability which can remotely call the lua API from an external language. If you are interested, I started a simple project that call the bizhawk lua API from an external app using java with UDP. Here you go:
-Server(luascript+luasocket+timer)
-Client(java... but you could re-implement it to any language you want).
Of course, it would be much better to embed the udp-server directly in the emulator rather than lua.... also huh, the project I made is kind of unfinished, but you could definitely build on something like that.
And I guess it go without saying, but the benefit of this approach is that you can make a real distributed system accross different computer(and on different platform), so your processing can easily scale. So far, I thought this could help for a genetic algorithm performing several prediction from a large dataset.
In any case, let me known if you like the idea or if this sound too confusing, maybe I'll just try to eleborate.
edit: fixed github branch url
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
edit: the script given below is much more impressive if you allow Vic Viper to shoot :p (it goes to the final boss of the level 1 in a few interations)
edit2: I've fixed it so it now learn correctly the pacifist run. You can play with the parameter so it can be faster. I guess if you run it for long enough it would end the game... (but it woud be smarter to just restart it at each level with a savestate of course).
Hi, thank for the answers
I've never used Lua but I've heard it's good for ML because of the library Torch for manipulating arrays, with implementation for math operations, optimization and neural networks, and it has a good CUDA support for heavy GPU computation (too bad my computer is a potato).
There have been a lot more on Mario already, check those 3 papers:
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6156425&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6156425
(PM me for full version it just describes this http://www.marioai.org/)http://cs229.stanford.edu/proj2012/LiaoYiYang-RLtoPlayMario.pdf
This is one learning approach that does a lot more than checking for horizontal scrolling (ennemies, position of ennemies, etc) and learn to be responsive in all kind of situation, not only pre-loaded ones.
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
I haven't gone through deeply into this paper but apparently, deep networks with reinforcement learning have shown amazing results on Atari games.
This is interesting, however that's not the way I want to explore (and I'm not that good of a programmer myself). I think one good computer with good GPU and several days of computations at most could be enough for most tasks. I have more specific ML-related ideas for now.
I post now a first script for Gradius I've worked on those past days. Let me give a few comment before you read/try it.
First of all I'm new to Lua and my code might be bad-looking. I wanted to use the library Torch, but the main implementation of arrays (the "Tensors") doesn't support sparse matrix so it was a waste of time because my potato has 2GB of RAM. I used the native implementation of arrays in Lua instead, which is good for sparse but not very convenient and you have to do low(ish)-level stuff.
It's a naïve implementation of q-learning. The only input it takes is the position of Vic Viper and if it's alive or not. The default strategy is to give no input, and then Vic tries blindly different paths if the previous one led to death. It's a very raw and blind die-and-retry but as you can see, it would probably be enough to do pacifist runs of many shoot'em up with reasonable computation time.
The code looks long, but the only lines that really matter (where you can tune parameters for better learning) are:
Language: lua
-- update visit_counter
if visit_counter[state_action] then
visit_counter[state_action]=
visit_counter[state_action]+1;
else
visit_counter[state_action]=1;
end
a=visit_counter[state_action];
and
Language: lua
-- Define a reward
R = 0;
if not isAlive then
R = -math.huge;
end
and
Language: lua
-- Update the Q-Value
q_value[state_action]=(1-1/a)*q_value[state_action]
+(1/a)*(R+1*max_q_value);
also you can tune the input frame rate:
Language: lua
-- press those inputs for 12 frames
-- (approximately 300 apm)
for frame=1,12,1 do
joypad.set(1,inputs);
emu.frameadvance();
end
As @BadPotato remarks, this script only checks for horizontal scrolling and the position of Vic Viper so it's a bit disappointing. but it can be improved by adding more states (cf the Mario paper linked above): number/type/position/speed/hitbox of ennemies, of bonuses, information on the sprites where you can go or where you would crash, etc. Then Vic Viper could be trained to react to any random environment and not only the frame/position data. But that would demand a more exhaustive mapping of the RAM.
To a better extent, I wonder if it's possible to take into account all of the RAM as a state. It's probably too heavy for a fast computation (2000*8bits) and too noisy for a good learning (the bits encoding the music or the sprites are useless) but maybe there's a way to pre-process it to remove the useless stuff and shrink the size.
(mentionning this thread would be appreciated if you reuse the code)
Language: lua
-- Launch ROM and this script and watch.
-- One file is created in the folder of the script
-- that records the best inputs over all runs.
-- Vitesse de l'émulation ("turbo", "normal" or "maximum")
-- What works depend on the OS
emu.speedmode("turbo");
-- Useful Inputs:
pressStart={
up=false, down=false,
left=false, right=false,
A=false, B=false,
start=true, select=false
};
releasePad={
up=false, down=false,
left=false, right=false,
A=false, B=false,
start=false, select=false
};
-- Parameters for the environment
frame_max = 200000;
x_max = 256;
y_max = 256;
action_max = 5;
best_frame_alive=0;
-- Best policy at current iteration
policy = {};
-- Counting how many time you visit each state
visit_counter={};
-- Initialize q-value
q_value={};
while true do
-- Reset the engine
emu.softreset()
-- Press start two times for starting a 1P game
emu.frameadvance();
emu.frameadvance();
emu.frameadvance();
emu.frameadvance();
joypad.set(1, pressStart);
emu.frameadvance();
emu.frameadvance();
joypad.set(1, pressStart);
-- Wait for the game to begin
-- i.e that Vic Viper appears
while not isAlive do
emu.frameadvance();
isAlive=memory.readbyte(0x0100)==1;
end
-- Initialize input recording
input_history={};
-- Number of framed passed so far
frame_ni = emu.framecount();
frame_alive = 0;
-- Initial state:
vic_x_i=memory.readbyte(0x07A0);
vic_y_i=memory.readbyte(0x07C0);
state_i=frame_alive+frame_max*(vic_x_i+x_max*vic_y_i);
-- While Vic Viper is still alive
while isAlive do
-- choose an action to perform
-- according the best policy computed so far
action=0;
if policy[state_i] then
action = policy[state_i];
end
-- Record input
input_history[frame_alive]=action;
-- compute index of (frame_alive,vic_x_i,vic_y_i,action)
state_action=frame_alive+frame_max*(vic_x_i+x_max*(vic_y_i+y_max*action));
-- update visit_counter
if visit_counter[state_action] then
visit_counter[state_action]=
visit_counter[state_action]+1;
else
visit_counter[state_action]=1;
end
a=visit_counter[state_action];
-- define inputs to be pressed this frame
if action==0 then
inputs=releasePad;
else
go_up= (action==1)or(action==2);
go_left=(action==1)or(action==3);
press_A=false;
press_B=false;
-- format
inputs={
up= go_up, down= not go_up,
left=go_left, right=not go_left,
A=press_A, B=press_B,
start=false, select=false
};
end
-- press those inputs for 12 frames
-- (approximately 300 apm)
for frame=1,12,1 do
joypad.set(1,inputs);
emu.frameadvance();
end
-- Get new state
new_frame=emu.framecount()-frame_ni;
vic_x=memory.readbyte(0x07A0);
vic_y=memory.readbyte(0x07C0);
state=new_frame+frame_max*(vic_x+x_max*vic_y);
-- Check if alive
isAlive=memory.readbyte(0x0100)==1;
-- Define a reward
R = 0;
if not isAlive then
R = -math.huge;
end
-- max_a Q(X_t+1,a)
max_q_value = -math.huge;
for new_action=0,(action_max-1),1 do
new_state_action=new_frame+
frame_max*(vic_x+
x_max*(vic_y+
y_max*new_action));
if q_value[new_state_action] then
new_q_value=q_value[new_state_action]
if new_q_value > max_q_value then
max_q_value=new_q_value;
end
else
max_q_value = 0;
end
end
-- Initialise Q-value if needed
-- for all actions possible for the state
for possible_action=0,(action_max-1),1 do
possible_state_action=frame_alive+
frame_max*(vic_x_i+
x_max*(vic_y_i+
y_max*possible_action));
if not q_value[possible_state_action] then
q_value[possible_state_action]=0;
end
end
-- Update the Q-Value
q_value[state_action]=
(1-1/a)*q_value[state_action]
+(1/a)*(R+1*max_q_value);
-- Update the state for the next loop
vic_x_i=vic_x;
vic_y_i=vic_y;
frame_alive = new_frame;
state_i=frame_alive+frame_max*(vic_x_i+x_max*vic_y_i);
end
-- compute argmax(qvalue) over the actions
qvalue_max={};
qvalue_argmax={};
for key,value in pairs(q_value) do
action=math.floor(key/(frame_max*x_max*y_max));
vic_y=math.floor(key/(frame_max*x_max))-y_max*action;
vic_x=math.floor(key/(frame_max))-x_max*(vic_y+y_max*action);
frame_alive=key-frame_max*(vic_x+x_max*(vic_y+y_max*action));
state=frame_alive+frame_max*(vic_x+x_max*vic_y);
if not qvalue_max[state] then
qvalue_max[state]=value;
qvalue_argmax[state]=action;
elseif value > qvalue_max[state] then
qvalue_max[state]=value;
qvalue_argmax[state]=action;
end
end
-- update the best policy only if it improves the q-value
for key,value in pairs(qvalue_max) do
previous_action=0;
if policy[key] then
previous_action=policy[key];
end
previous_state_action=key
+previous_action*frame_max*x_max*y_max;
if q_value[previous_state_action]~=value then
policy[key]=qvalue_argmax[key];
end
end
-- save the inputs if it's the best run
if frame_alive > best_frame_alive then
best_time=io.open('best_time.value',"w+");
best_time:write(frame_alive);
best_time:write('\n');
for key,value in pairs(input_history) do
best_time:write(value);
best_time:write('\n');
end
best_time:close();
best_frame_alive=frame_alive;
end
end
This is so much fun. In less that one hour Vic Viper learn to pass the first level including the two bosses, without shooting a missile. The default action is no input, and when it's not possible you can see it dance with the missiles/ennemies, with the accuracy you want (choose 1input/frame for frame-perfect dodges, but it will be long to learn).
Is there a way to disable the video to have faster emulation ?
Thanks for all the resources. Keep it up.
Out of of the box, I believe FCEUX is a bit limited with these kind of rendering option.
Yeah, any static data such as music or sprite, wouldn't be relevant 99.9% of the time. Maybe you can tune up some function to filter those... so again you don't get out of hand with too much processing.
That being said, if you can take a screenshot, I guess this could be reusable data for ML. It's true that with a resolution of 256x240 pixel for the NES, it can be hard to settle down with a particuliar kind of algorithm you can use without getting out of hand. But I still think it could be relevant data.
As for the script using the q-value, well I'll need to check up more stuff about how Q-learning works, but I just wanted to point out there's a 0 on the numerator division when you update the Q-value : (1-1/a)*q_value[state_action]+...
Also do you think math.random() could help for the choice of input?
In any case, your script seem to work just fine for gradius... yet for this particular game, I'm a bit unsure how well it perform in comparison to simply bruteforcing or backtracking on the long run.
Yes, the screenshots can be interesting inputs too. But what exactly is encoded in the ram ? is there the information of every object loaded in the screen ? i thought that if the sprites are in the RAM, the information of the screenshot is already there..?
The "a" is always at least equal to 1 because of how it is defined here:
Language: lua
-- update visit_counter
if visit_counter[state_action] then
visit_counter[state_action]=
visit_counter[state_action]+1;
else
visit_counter[state_action]=1;
end
a=visit_counter[state_action];
On q-value in general, there should be litterature available on the internet, or i'll look for some good papers when some more time.
Well, if by backtracking you mean "play some sequence of inputs, and when it's bad, go to the frame it was still ok and from there try different inputs"... this first script is just an elegant implementation of that, it does no more.
The q-value can be a lot more than that (cf. second paper on mario in last post), this was just the more obvious thing to try to begin with. It's because the parameters taken for describing a state of the game is the horizontal scrolling and the position of the ship. Moreover, there isn't any random exploration of states at all (as we could add with math.random), it's really just deterministic backtracking (more or less if you play with the parameters). I don't think it's a good idea to use random exploration of states on this simple example, but it is for more sophisticated models in the future.
Btw I've opened a git repository for the last script:
https://gitlab.com/FranckC/nes_learning/
learn_gradius.lua is the one from last post (polished and with some remaining bugs corrected)
play_performance.lua can be used to play the result of learning at normal speed.
also:
finish_gradius.lua is the same thing, but implemented with savestates
(unfortunately, it doesn't finish gradius, as it was unable to pass alone the wall from the end of lvl 2, even with shots activated :(.. this is a challenge now :D )
and play_run.lua to play the result of previous script.
Thanks to the kind mod that edited my message to format it correctly. It was late in the night and I surrendered to the tags, with the intent to fix it in the morning...
Just a little up to let you know that I've adapted the same script to dodonpachi, with fba-rr just for fun. It's available in the repository. Launch the rom, pause it, load a savestate to a running instance of a level with the ship alive, start the script "finish_dodon.lua", and unpause to watch it learn. It works with the red ship, I haven't tried the others.
There's another script to play the video when it's done, with the same method than above. (play_run_dodon.lua).
I'm watching it now and it passed successfully lvl 2, I think it can do it! Not 100% sure though because I haven't used the variables for the momentum and the speed of the ship, just the position.
So, yeah, conceptually it brings nothing new... I hope next updates will introduce more intelligent algorithm.
edit: blue fighter does finish loop 1 of the game with fire on. it would finish the loop2 too but it crashes, idk why.
Up: I'm back to Gradius, trying to learn to play the game with a deep neural network applied to Q-learning, and taking the RAM of the game as the state space. It looks very promising!
This idea is taken from this paper.
You can find a lua script here. It runs with fceux/gradius, and you could easily tune it to work with other games or emulators that support lua.
What you need to run it:
- Lua5.1, Torch7 , and net-toolkit. Torch7 is not available on windows.
- Fceux and the rom for Gradius
- To expect results in a reasonable time, you need a powerful cpu (at least 4 cores) dedicated to the task. I'd recommend to monitor the temperature during the learning.
It has been tested with archlinux and linux mint. On linux mint you might run into errors because of relative paths, in this case just replace it with absolute paths.
What it does: it trains a deep network to predict a score for each possible action for the player, such that the best possible action has the highest score. (the score is the so-called q-value). The training includes taking random inputs. It reads the RAM to know the state of the game (but not the screenshots).
The script saves periodically the network to the disk (with net-toolkit) so you can stop the learning process at any time and resume it later, and it saves the scores of the consecutive runs so you can plot the progress.
I haven't had time to observe a major improvement in the learning because my computer is a potato, and up to several thousands runs could be expected before the learning looks tangible. So far I wouldn't bet on a spectacular result before more tuning but it's worth a try.
You can try to tune the parameters, or the architecture of the neural network, or the state/action spaces, the reward, etc, for better results. The script is commented in this regard.
I probably won't work much more on this project, however i'll let you know how it looks after several days of processing.
So, any result?
I guess it all come on how you do training. So I'm not sure if random input alone can build a good enough model to apply new data. Even if you probably have a set of rule to reward/punish to handle randomness, maybe adding a human playthrough into your training set can help.