due date: Tue Apr 11 by 9:00 PM
email to: mcb419@gmail.com
subject: hw12
email contents: 1) jsbin.com link to your project code 2) answers to all the questions at the bottom of this page, in the body of the email
This week we will use a reinforcement learning algorithm, called Q-learning, to find an action selection policy for an agent foraging for pellets. The formulation for the Q-learning algorithm can be found in the lecture slides. The foraging task takes place in a grid world, as specified below.
Pellets: 15 green (good, reward = +1), 15 blue (bad, reward = -1)
Walls: shown in gray; impenetrable
Sensors: the bot has 3 sensors (left, front, right); each sensor has 4 possible values indicating what is at that grid location
Sensor values: 0 = nothing, 1 = wall, 2 = good pellet, 3 = bad pellet
Actions: the bot has 3 actions; 0 = move forward, 1 = turn left 90°, 2 = turn right 90°
States: correspond to possible combinations of sensor values; since there are 3 sensors and each sensor has 4 possible values, there are 4*4*4 = 64 possible states.
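One natural way to turn the three sensor readings into a single state index in 0..63 is to treat them as base-4 digits. This is only a sketch of one possible encoding; the project skeleton may use a different scheme, and the function name here is hypothetical:

```javascript
// Hypothetical state encoding: treat the (left, front, right) sensor
// values (each 0-3) as base-4 digits, giving a state index 0..63.
function stateIndex(left, front, right) {
  return left * 16 + front * 4 + right;
}

// e.g. nothing on the left, a wall ahead, a good pellet to the right:
// stateIndex(0, 1, 2) -> 6
```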
Q[#states][#actions]: an array for storing learned action values; indexed by state and action; (click the "reset Q" button to randomize)
First, select randAction and click the "run series" button. Look in the results table and you should see a value near -20. This is because there is a small cost of -0.01 on each time step, which corresponds to a cost of -20 over 2000 time steps; for random actions the bot is equally likely to run into green and blue pellets so the pellet rewards cancel out.
Now, near the top of the JavaScript code, modify the "handCoded" controller. This does not involve any reinforcement learning; code it as a set of "condition-action" statements that specify what action the bot should take in different sensor states (e.g., green pellet ahead, move forward; green pellet to the left, turn left; etc.). You'll probably want to introduce some randomness into certain choices. As you develop your code, test its performance using the "run series" button; aim for a performance of at least 50. The best controllers in last year's class scored around 100.
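A condition-action controller of this kind might look like the sketch below. The function signature and variable names are assumptions for illustration (the project skeleton may pass sensor values differently); the priorities and random-turn probabilities are just one starting point to tune:

```javascript
// Illustrative hand-coded controller sketch (not the skeleton's exact API).
// Sensor values: 0 = nothing, 1 = wall, 2 = good pellet, 3 = bad pellet.
// Actions: 0 = move forward, 1 = turn left, 2 = turn right.
function handCoded(left, front, right) {
  if (front === 2) return 0;             // good pellet ahead: move forward
  if (left === 2) return 1;              // good pellet to the left: turn left
  if (right === 2) return 2;             // good pellet to the right: turn right
  if (front === 1 || front === 3) {      // wall or bad pellet ahead:
    return Math.random() < 0.5 ? 1 : 2;  // ...turn to a random side
  }
  // nothing interesting: mostly go forward, occasionally turn to explore
  if (Math.random() < 0.8) return 0;
  return Math.random() < 0.5 ? 1 : 2;
}
```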
Before you can use reinforcement learning, you have to implement methods that return the best action (bestAction) and the maximum Q value (maxQ) for a given state, based on the values in the Q array, and you need to implement the code that updates the Q values (updateQ). These methods can be found near the top of the JavaScript file. After you've correctly implemented these functions, select "training" and then click "run series" to do a set of training trials. Look at the results in the table; you should see a value greater than 30. Click "run series" again to continue training. Stop training when the values in the table are no longer increasing. Finally, select the "testing" controller and click "run series".
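For reference, here is one way these three helpers could look under the standard Q-learning update rule, Q[s][a] += alpha * (r + gamma * maxQ(s') - Q[s][a]). The argument lists, and the alpha and gamma values, are assumptions for this sketch; match them to the signatures and parameters in the project skeleton rather than copying this verbatim:

```javascript
// Sketch assuming Q is a 64 x 3 array of numbers (states x actions).
var alpha = 0.1;  // learning rate (assumed value)
var gamma = 0.9;  // discount factor (assumed value)

// Index of the highest-valued action for this state.
function bestAction(Q, state) {
  var best = 0;
  for (var a = 1; a < Q[state].length; a++) {
    if (Q[state][a] > Q[state][best]) best = a;
  }
  return best;
}

// Highest Q value available in this state.
function maxQ(Q, state) {
  return Q[state][bestAction(Q, state)];
}

// Standard Q-learning update toward reward + discounted future value.
function updateQ(Q, state, action, reward, nextState) {
  Q[state][action] +=
    alpha * (reward + gamma * maxQ(Q, nextState) - Q[state][action]);
}
```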
Controller | Fitness mean (std dev)
---|---
randAction |
handCoded |
testing |