Week 12 SOLUTION: Reinforcement Learning in Grid World

  due date: Tue Apr 11 by 9:00 PM
  email to: mcb419@gmail.com
  subject: hw12
  email contents:
    1) jsbin.com link to your project code
    2) answer all the questions at the bottom of this page in the email
  

Introduction

This week we will use a reinforcement learning algorithm called Q-learning to find an action-selection policy for an agent foraging for pellets. The formulation of the Q-learning algorithm can be found in the lecture slides. The foraging task takes place in a grid world, as specified below.

Pellets: 15 green (good, reward = +1), 15 blue (bad, reward = -1)
Walls: shown in gray; impenetrable
Sensors: the bot has 3 sensors (left, front, right); each sensor has 4 possible values indicating what is at that grid location
Sensor values: 0 = nothing, 1 = wall, 2 = good pellet, 3 = bad pellet
Actions: the bot has 3 actions; 0 = move forward, 1 = turn left 90°, 2 = turn right 90°
States: correspond to possible combinations of sensor values; since there are 3 sensors and each sensor has 4 possible values, there are 4*4*4 = 64 possible states.
Q[#states][#actions]: an array for storing learned action values, indexed by state and action (click the "reset Q" button to randomize); see the short encoding sketch after this list.
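
A minimal sketch of the state encoding and Q-array setup; the names getState, nStates, nActions, and Q are illustrative rather than taken from the starter code, but the encoding formula matches the one used in the answers below:

  // encode the 3 sensor readings (each 0-3) into a single state index 0-63
  function getState(sensors) {
    return 16 * sensors[2].val + 4 * sensors[1].val + 1 * sensors[0].val;
  }

  // Q[state][action], filled with small random values (as "reset Q" does)
  var nStates = 64, nActions = 3;
  var Q = [];
  for (var s = 0; s < nStates; s++) {
    Q[s] = [];
    for (var a = 0; a < nActions; a++) {
      Q[s][a] = 0.01 * Math.random();
    }
  }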
 

Instructions

First, select randAction and click the "run series" button. Look in the results table and you should see a value near -20. This is because there is a small cost of -0.01 on each time step, which adds up to -20 over the 2000 time steps; with random actions the bot is equally likely to run into green and blue pellets, so the pellet rewards cancel out on average.
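
For reference, the random baseline can be as simple as the following sketch (action codes as listed above; the starter code's actual randAction controller may differ in details):

  // randAction: pick one of the 3 actions uniformly at random
  function randAction() {
    return Math.floor(3 * Math.random()); // 0 = forward, 1 = left, 2 = right
  }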

Now, near the top of the JavaScript code, modify the "handCoded" controller. This does not involve any reinforcement learning; you should code it as a set of "condition-action" statements that specify what action the bot should take for different sensor states (e.g., green pellet ahead, move forward; green pellet to the left, turn left; etc.). You'll probably want to introduce some randomness into certain choices. As you develop your code, test the performance using the "run series" button; you should aim for a performance of at least 50. The best controllers in last year's class scored around 100.
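
One possible shape for such a controller, as a hedged sketch (the left/front/right sensor indexing follows the encoding above; the thresholds and random choices are illustrative, not a prescribed solution):

  // handCoded: condition-action rules
  // sensor values: 0 = nothing, 1 = wall, 2 = good pellet, 3 = bad pellet
  function handCoded(sensors) {
    var left = sensors[0].val, front = sensors[1].val, right = sensors[2].val;
    if (front === 2) return 0;            // good pellet ahead: move forward
    if (left === 2) return 1;             // good pellet to the left: turn left
    if (right === 2) return 2;            // good pellet to the right: turn right
    if (front === 1 || front === 3) {     // wall or bad pellet ahead: turn away
      return Math.random() < 0.5 ? 1 : 2; // randomize to avoid getting stuck
    }
    // open space: mostly go forward, occasionally turn at random
    return Math.random() < 0.8 ? 0 : (Math.random() < 0.5 ? 1 : 2);
  }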

Before you can use reinforcement learning, you must implement the methods that return the best action (bestAction) and the maximum Q value (maxQ) for a given state, based on the values in the Q array, as well as the code that updates the Q values (updateQ). These methods can be found near the top of the JavaScript file. After you've correctly implemented these functions, select "training" and then click "run series" to do a set of training trials. Look at the results in the table; you should see a value greater than 30. Click "run series" again to continue training. Stop training when the values in the table are no longer increasing. Finally, select the "testing" controller and click "run series".
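
A minimal sketch of these three pieces, assuming the standard one-step Q-learning update; the learning rate alpha, discount gamma, and their values are illustrative assumptions, not taken from the starter code:

  var alpha = 0.1;  // learning rate (assumed value)
  var gamma = 0.9;  // discount factor (assumed value)

  // index of the highest-valued action in state s
  function bestAction(s) {
    var best = 0;
    for (var a = 1; a < Q[s].length; a++) {
      if (Q[s][a] > Q[s][best]) best = a;
    }
    return best;
  }

  // maximum Q value attainable in state s
  function maxQ(s) {
    return Q[s][bestAction(s)];
  }

  // one-step Q-learning update after taking action a in state s,
  // receiving reward r, and arriving in state sNext
  function updateQ(s, a, r, sNext) {
    Q[s][a] += alpha * (r + gamma * maxQ(sNext) - Q[s][a]);
  }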

Questions:

(provide answers in the body of your email)
  1. After training, use the javascript console to examine the three values for learner.Q[0]. What sensor-value configuration does this state correspond to? What action is optimal in this state?
    state = 16*sensors[2].val + 4*sensors[1].val + 1*sensors[0].val
    Q[0] corresponds to state=0, with all 3 sensors registering 0 (open space).
    Forward is slightly better than left or right in this state Q[0] ~ [0.26, 0.23, 0.22]
  2. Examine learner.Q[2]. What sensor configuration does this correspond to? What action is optimal in this state?
    Q[2] corresponds to sensor 0 registering a good pellet (i.e., a green pellet to the left).
    Left is the best action. Q[2] ~ [0.28, 1.18, 0.20]
  3. Examine learner.Q[8]. What sensor configuration does this correspond to? What action is optimal in this state?
    Q[8] corresponds to sensor 1 registering a good pellet (i.e., a green pellet directly ahead).
    Forward is the best action. Q[8] ~ [1.35, 1.02, 1.01]
  4. Examine learner.Q[12]. What sensor configuration does this correspond to? What action is optimal in this state?
    Q[12] corresponds to sensor 1 registering a bad pellet (i.e., a blue pellet directly ahead).
    Left was chosen as the best action; Forward is the worst choice. Q[12] ~ [-0.67, 0.24, 0.17]
  5. Examine learner.Q[17]. What sensor configuration does this correspond to? Has this state ever been visited? Why or why not?
    Q[17] corresponds to sensors 0 and 2 registering a wall (i.e., a wall on both the left and right of the bot).
    This can never happen in the current grid configuration. Q[17] ~ [0.0, 0.0, 0.0]
    (not identically zeros, but small random initialization values)
  6. How did your "testing" performance compare to your "handCoded" performance? If they are similar, why? If they are different, why?
    Testing performance was around 63, while handCoded performance was around 75.
    During testing the bot still selects non-optimal actions with probability epsilon = 0.1, which accounts for some of the difference.
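
For context, a minimal epsilon-greedy selection sketch (epsilon = 0.1 as noted above; it reuses the bestAction helper sketched earlier, and the starter code's actual mechanism may differ):

  var epsilon = 0.1; // exploration probability

  // epsilon-greedy: usually exploit the learned Q values, occasionally explore
  function selectAction(s) {
    if (Math.random() < epsilon) {
      return Math.floor(3 * Math.random()); // random exploratory action
    }
    return bestAction(s);                   // greedy action from the Q array
  }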

Results Table

Controller    Fitness: mean (std dev)