TDDC17 Artificial Intelligence

TDDC17-Lab5

Aim

In this lab, you will learn how to control a rocket in a simulated continuous environment with reinforcement learning.

Preparation

Read chapter 17.1 about Markov Decision Processes (MDPs) and chapters 21.1-3 about reinforcement learning in the course book. You only need a very basic understanding of MDPs, but chapters 21.1-3 are important, especially 21.3, which describes the Q-learning algorithm.

Part I: Tutorial

The first part of the lab is a tutorial that will introduce you to the environment that is used in this lab and how you can access the sensors and actuators of the simulated rocket to be able to control it.

The lab environment consists of a simple physics engine which can contain objects that are built from point masses and springs. Different types of actuators can be mounted on these objects such as wheels and rocket engines. The environment can also contain solid objects that can be used for simulating obstacles and/or ground.

The setup for this lab does not contain any obstacles. The rocket that you are supposed to control is shown in figure 1. It has three rocket engines that can be controlled independently. Each engine is either turned on or off; nothing in between is allowed.

The rocket has a set of sensors that can be queried. The sensors are all real-valued and their values can be read by calling getValue() on the corresponding DoubleFeature reference. See the code skeleton for part I as an example.
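
For example, a tick() method that does nothing but print one sensor reading could look roughly like the sketch below. The field name vx is an assumption here; use whatever DoubleFeature reference the provided skeleton actually declares.

 // A minimal sketch, assuming the skeleton declares a DoubleFeature
 // field called vx for the vertical velocity sensor
 public void tick() {
   double verticalVelocity = vx.getValue(); // current sensor reading
   System.out.println("vx = " + verticalVelocity);
 }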


Figure 1: The rocket

Task for part I:

  1. Make sure that you are using the latest version of the Java development kit by typing:
    • module add prog/jdk/1.6
  2. Copy the file TutorialController.java from ~TDDC17/www-pub/info/labs/rl/ to your own account.
  3. Open it and implement the tick() method so that it receives readings from the vertical velocity sensor "vx" and prints them out on the standard output.
  4. Compile the file by typing:
    • javac -classpath .:/home/TDDC17/www-pub/info/labs/rl/rllab.jar TutorialController.java
  5. Run the simulation application by typing:
    • ~TDDC17/www-pub/sw/scripts/rl_tutorial
  6. Try to fly the rocket manually with the "W", "A" and "D" keys that control the rocket engines and check that your printouts are working. The graphics are unfortunately rather slow if you use the thin clients.

Part II: Hardcoded Controller

In the second part of the lab, you are supposed to use the sensor values from the rocket to implement your own controller that makes the rocket hover.

The different rocket engines are turned on or off by calling the setBursting(boolean) method on the corresponding RocketEngine references. See the code skeleton for part II.

When you design your controller, beware of the risk of rapidly oscillating control actions, since they can destroy the rocket and, with it, its sensor readings.
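
As a rough illustration only (not a required design), a hover controller can fire an engine when the rocket sinks too fast and cut it when it rises too fast, leaving a deadband in between so the engine is not toggled every tick. The field names verticalVelocity and middleEngine, the thresholds and the sign convention are all assumptions; use the sensor and RocketEngine references declared in the part II skeleton.

 // Sketch of a hover rule with a deadband to limit oscillation
 // (field names, thresholds and sign convention are assumptions)
 public void tick() {
   double vVel = verticalVelocity.getValue();
   if (vVel < -0.2) {
     middleEngine.setBursting(true);   // sinking too fast: fire the engine
   } else if (vVel > 0.2) {
     middleEngine.setBursting(false);  // rising too fast: cut the engine
   }
   // inside the deadband the previous engine state is kept
 }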

Task for part II:

  1. Copy the file HardcodedController.java from ~TDDC17/www-pub/info/labs/rl/ to your own account.
  2. Open it and implement the tick() method so that it reads values from the sensors and uses them to control the rocket. The rocket is supposed to hover.
  3. Compile the file by typing:
    • javac -classpath .:/home/TDDC17/www-pub/info/labs/rl/rllab.jar HardcodedController.java
  4. Test your controller by typing:
    • ~TDDC17/www-pub/sw/scripts/rl_hardcoded
  5. Demonstrate your controller to the lab assistant and describe your design.

Part III: State Space and Reward Design

In this part of the lab, you will use some of the rocket's sensors to create two different state spaces and reward functions that will later be used by the reinforcement learning algorithm. The first state space only discretizes the angle of the rocket. It will be used for testing the Q-learning implementation together with a reward function that only depends on the angle. The second state space and reward function are supposed to make it possible for the rocket to hover, so more sensor variables than just the angle are necessary.

The sensor variables can be discretized in many ways and it is up to you how this should be done in your implementation. A function for uniform discretization within certain bounds is provided in the code skeleton for part III. You do not have to write your own discretization methods for this lab.
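
Just to illustrate what such a helper does (you should use the one in the skeleton, and its actual signature may differ), a uniform discretization could look like this:

 // Illustration of uniform discretization into nrBuckets buckets
 // between min and max; the skeleton already provides a method like this
 public static int discretize(double value, double min, double max, int nrBuckets) {
   if (value <= min) return 0;
   if (value >= max) return nrBuckets - 1;
   return (int) ((value - min) / (max - min) * nrBuckets);
 }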

Since the rocket is supposed to learn how to hover, it is important that both the state space and the reward function are designed appropriately. Calculate the size of the state space times the number of possible actions (which gives the size of the Q-function) and check that it is not larger than approximately 400. If it is much larger, the Q-learning algorithm will take too long to converge.

There are certainly tradeoffs involved in the state space design, but some approaches have previously been used successfully. For example, the angle seems to be the most important sensor variable, so as many discrete values as possible should be assigned to it during the discretization. The horizontal and vertical velocities will then only be given a small number of discrete values, but that is not really a problem in this lab.

Try also to make the reward function as simple as possible. A sufficiently good reward function can be written in one line of code, and a state space that is completely defined by the uniform discretization methods is also sufficient for the purpose of this lab. Many students have tried to write complicated reward functions and their own discretization functions, and most of them had to redo everything from scratch because of the debugging mess that followed when something did not work properly. Keep it as simple as possible and you will be alright.
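
As a concrete illustration of how little is needed, a full state and reward pair along the following lines is usually enough. The bucket counts, bounds and parameter names are assumptions (the skeleton's methods may take an array of sensor values instead), and discretize is the uniform discretization helper sketched above. With 10 angle buckets and 3 buckets each for the two velocities, the state space has 10 * 3 * 3 = 90 states, which keeps the Q-function well below the 400-entry guideline for a handful of actions.

 // Sketch of a simple state encoding and a one-line reward
 // (bucket counts, bounds and names are assumptions, not the required design)
 public static String getStateFull(double angle, double horizontalVel, double verticalVel) {
   int a = discretize(angle, -Math.PI / 2.0, Math.PI / 2.0, 10);
   int h = discretize(horizontalVel, -10.0, 10.0, 3);
   int v = discretize(verticalVel, -10.0, 10.0, 3);
   return "a" + a + "_h" + h + "_v" + v; // 10 * 3 * 3 = 90 states
 }

 public static double getRewardFull(double angle, double horizontalVel, double verticalVel) {
   // higher reward the closer the rocket is to upright and motionless
   return -(Math.abs(angle) + 0.1 * Math.abs(horizontalVel) + 0.1 * Math.abs(verticalVel));
 }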

Task for part III:

  1. Copy the file RewardAndState.java from ~TDDC17/www-pub/info/labs/rl/ to your own account.
  2. Open it and implement the getStateSimple() (state for the angle controller), getStateFull() (state for the hover controller), getRewardSimple() (reward for the angle controller) and getRewardFull() (reward for the hover controller) methods. Feel free to change the method declarations if you don't like them; RewardAndState.java is just a suggestion.
  3. Test the state and reward functions on your hardcoded controller and see how it performs. You can call the static methods in RewardAndState.java like this: RewardAndState.getStateSimple(array) and so on. Remember that you can always disconnect the controller by pressing "P". Test your own performance, if you like, by controlling the rocket manually. Connect the controller again with "O".

Part IV: Q-learning Implementation

Now you should have two state spaces and two reward functions available. It is time to implement the Q-learning algorithm and see how it performs in the two different settings.

The Q-learning algorithm is described in chapter 21.3 in the course book.

You may implement the Q-function in any way you want, but a quick and recommended solution is to represent it with a map from strings to doubles, HashMap<String, Double> map = new HashMap<String, Double>(), where the key string is the string representation of the state with the action name or number appended. Use map.put(qString, value) to set a Q-value and map.get(qString) to read it. More information about HashMap can be found in the Java API documentation.
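
In practice it is convenient to wrap the map in two small helpers so that unseen state-action pairs default to a Q-value of zero. The names below are just a suggestion (remember to import java.util.HashMap):

 // Q-table helpers around the HashMap (names are a suggestion only)
 // requires: import java.util.HashMap;
 HashMap<String, Double> qTable = new HashMap<String, Double>();

 double getQ(String state, String action) {
   Double q = qTable.get(state + "_" + action);
   return (q == null) ? 0.0 : q;  // unseen pairs default to zero
 }

 void setQ(String state, String action, double value) {
   qTable.put(state + "_" + action, value);
 }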

There is a slight complication that will arise when you try to implement the algorithm. Most of the time when tick() is called, no state change will occur, due to the small number of states allowed by the discretization. This means that if Q-learning is implemented without some adjustments, it will take an enormous amount of time to converge.

A first simple trick is to keep executing the current action until a state change occurs. But this may cause the rocket to wait forever, for example for a no-op action that never changes the state. A way to get around this is to specify a maximum number of primitive steps that an action is allowed to execute before it is considered to have made a transition to the same state as the previous one.

When a maximum number of steps is added to the algorithm, the rewards that are fed to the Q-learning update have to be changed. The reward that Q-learning receives for updating the Q-value should intuitively depend on some form of cumulative or average reward received during the primitive steps. In this lab you can simply use the average of the rewards over the primitive steps as input to the Q-learning algorithm.

Another extension that you might want to add is a minimum number of steps that have to be executed before any new action can be selected. The reason for this extension is that the rocket may break due to internal oscillations if it is controlled improperly.
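
Putting these extensions together, the bookkeeping inside tick() might be organized roughly as follows. The field names, the MIN_STEPS/MAX_STEPS constants, the sensorValues array and the helpers updateQ and selectAction are assumptions; a sketch of the latter two is given at the end of part IV.

 // Sketch of the macro-step bookkeeping inside tick()
 // (field names and helpers are assumptions, not the skeleton's API)
 stepsInAction++;
 accumulatedReward += currentReward;

 String newState = RewardAndState.getStateFull(sensorValues); // or getStateSimple(...)
 boolean stateChanged = !newState.equals(previousState);

 if (stepsInAction >= MIN_STEPS && (stateChanged || stepsInAction >= MAX_STEPS)) {
   // feed Q-learning the average reward over the primitive steps
   double avgReward = accumulatedReward / stepsInAction;
   updateQ(previousState, previousAction, avgReward, newState);
   previousAction = selectAction(newState); // e.g. epsilon-greedy, see part IV
   previousState = newState;
   stepsInAction = 0;
   accumulatedReward = 0.0;
 }
 // otherwise the previous action simply keeps executing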

Q-learning converges rather slowly compared to methods that either use function approximation or learn an internal model of the environment. If you want to see whether your implementation converges, you can use the class TestPairs to keep track of how the reward develops over time. The following code example shows how the TestPairs class can be used:


 // Import IO
 import java.io.*;

 // Member variables in QLearningController
 TestPairs pairs = new TestPairs();
 double sumReward = 0.0;
 int nrTicks = 0;
 int nrWrites = 0;


 // Writes a string to a file, overwriting any previous content
 public void writeToFile(String filename, String content) {
   try {
     FileOutputStream fos = new FileOutputStream(filename);
     fos.write(content.getBytes());
     fos.close(); // close the stream so the data is flushed to disk
   } catch (Exception e) {
     e.printStackTrace();
   }
 }

 ...
  
 // Inside the tick() method (currentReward is assumed to hold the
 // reward received during this tick)
 int nrTicksBeforeStat = 10000; // write statistics every 10000 ticks
 if (nrTicks >= nrTicksBeforeStat) {
   // store the average reward over the last window together with the tick count
   TestPair p = new TestPair(nrTicksBeforeStat * nrWrites, (sumReward / nrTicksBeforeStat));
   pairs.addPair(p);
   try {
     writeToFile("output.m", pairs.getMatlabString("steps", "result"));
   } catch (Exception e) {
     e.printStackTrace();
   }
   sumReward = currentReward; // start a new window with the current reward
   nrTicks = 1;               // the current tick is already counted in sumReward
   nrWrites++;
 } else {
   nrTicks++;
   sumReward += currentReward;
 }

Every 10000th tick, the average reward over the last 10000 ticks is added to the series and the result is written to the Matlab file "output.m". The file can be inspected in its raw form or plotted in Matlab with the commands:

  • output;
  • plot(steps, result);

Ask your lab assistant if you have any questions about how to start and use Matlab.

The drawing of the rocket seems to take a lot of time when the simulation is executed on the thin clients. It is possible to disable the drawing by pressing "M" and in that way shorten the waiting time. You can also increase the simulation speed somewhat, when drawing is disabled, by pressing "V". The simulation speed is decreased again with "B". The current speed is shown in the terminal window where you started the application.

If you run into problems (and you probably will) and wonder whether the bug is in the reward function, the state space definition or the implementation of the Q-learning algorithm, use the state space for the angle controller, which should make the Q-learning algorithm converge quickly (use the Matlab file printouts to check the improvement).

Task for part IV:

  1. Copy the file QLearningController.java from ~TDDC17/www-pub/info/labs/rl/ to your own account.
  2. Open it and implement the extended Q-learning algorithm in the tick() method for both the angle and hover controllers. Use the reward and state extraction methods in your RewardAndState.java. Also use the so-called epsilon-greedy exploration function, which simply selects a random action with probability epsilon (epsilon is commonly set between 0.01 and 0.1) and the "best" action with probability (1 - epsilon). Ignore the exploration function suggested in the book because it does not guarantee convergence of the Q-learning algorithm. Set gamma (the discount factor) to 0.95 and alpha (the learning rate) to 1 / (1 + N[s, a]). Set the maximum number of primitive steps between 40 and 80 and the minimum to 5. A sketch of the action selection and update step is shown after this list.
  3. Compile the file by typing:
    • javac -classpath .:/home/TDDC17/www-pub/info/labs/rl/rllab.jar QLearningController.java
  4. Test your controller by typing:
    • ~TDDC17/www-pub/sw/scripts/rl_qlearning
  5. Demonstrate your solution to the lab assistant and hand in the code for your QLearningController.java. You must show that your implementation and state/reward design lead to a reasonably good solution, either by direct inspection of the rocket behavior or by an output from the TestPairs.getMatlabString() method.
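
For reference, the action selection and update step described in item 2 could be sketched as below. getQ/setQ are the HashMap helpers suggested earlier in part IV, while actions, random and the visit-count helpers getN/setN (for N[s, a]) are assumptions you would add yourself.

 // Epsilon-greedy selection and Q-learning update (a sketch; helper
 // names, the actions array and the visit-count map are assumptions)
 static final double EPSILON = 0.05;  // exploration probability
 static final double GAMMA = 0.95;    // discount factor

 String selectAction(String state) {
   if (random.nextDouble() < EPSILON) {
     return actions[random.nextInt(actions.length)];  // explore
   }
   String best = actions[0];
   for (String a : actions) {
     if (getQ(state, a) > getQ(state, best)) best = a; // exploit
   }
   return best;
 }

 void updateQ(String s, String a, double avgReward, String sNext) {
   int n = getN(s, a) + 1;           // N[s, a], kept in a second HashMap
   setN(s, a, n);
   double alpha = 1.0 / (1.0 + n);   // learning rate 1 / (1 + N[s, a])
   double maxNext = getQ(sNext, actions[0]);
   for (String aNext : actions) {
     maxNext = Math.max(maxNext, getQ(sNext, aNext));
   }
   setQ(s, a, getQ(s, a) + alpha * (avgReward + GAMMA * maxNext - getQ(s, a)));
 }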

Acknowledgements

This lab has been developed by Per Nyblom.


