Using the Pytux Kart package in Python, the group's goal was to explore different methods of training the kart to complete a set of provided tracks. Part of the group experimented with a traditional convolutional neural network using varying network architectures, while the other half of the team experimented with reinforcement learning techniques, such as Q-learning, to train the kart to navigate the tracks effectively. I was responsible for the reinforcement learning approaches.
- Here it is shown how our CNN overfits: it predicts that the road is to the right of the screen below the bridge, when in fact the road continues straight ahead
- Shows the difference between two networks with identical configurations and training times, but with one trained on twice as many images as the other
- Our general conclusion was to train the model enough to make the training and testing error low, but to stop before overtraining causes the model to overfit
We utilized an RL controller created by a team in prior years. Our main changes included tuning the epsilon parameter and observing the exploration vs. exploitation trade-off, changing the rewards for different actions, and implementing linear approximation.
- To start, we set epsilon very high at 0.99, so that there is a 99% chance that the controller takes a random action in the first iteration of Q-learning, strongly encouraging it to explore the tracks
- From there, with ϵ(0) = 0.99, we update ϵ(n) = (0.99)^(n+1)
- This way the probability with which the controller takes a random action, instead of leveraging the maximum Q-value, decays exponentially
- For testing purposes, we also allow it to decay linearly by simply subtracting 0.01 every iteration, for 100 iterations (both schedules appear in a sketch after this list)
- The given system gave a reward of +500 if the current aimpoint was more centered than the previous one and -200 otherwise, as well as a small +15 reward for keeping the velocity above 13
- We did not see how this was beneficial, given that it does not reward the controller's actual decision making
- Instead, we give a reward of -1 for every frame in which the track is not completed, and +1e6 when a track is completed
- We also implemented a version of the Q-learning controller that uses a feature vector to linearly approximate Q-values, with one weight per feature
- The features we consider: how much the aimpoint changed, being rescued, the difference between the two aimpoints, being on a straightaway, being on a turn, speed, the steering angle of aimpoint 1, and the steering angle of aimpoint 2 (a sketch of this approximator also appears after the list)
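As a rough sketch of the epsilon schedules and new reward described above (the helper names here are illustrative, not the actual functions of the inherited controller):

```python
import random

EPSILON_START = 0.99          # epsilon(0) = 0.99
TRACK_COMPLETE_REWARD = 1e6   # large bonus for finishing the track
STEP_PENALTY = -1             # -1 for every frame the track is not completed


def epsilon_exponential(n):
    """Exponential decay: epsilon(n) = 0.99^(n+1)."""
    return EPSILON_START ** (n + 1)


def epsilon_linear(n):
    """Linear decay used for testing: subtract 0.01 per iteration, for 100 iterations."""
    return max(0.0, EPSILON_START - 0.01 * n)


def choose_action(q_values, state, actions, epsilon):
    """Epsilon-greedy selection: a random action with probability epsilon,
    otherwise the action with the maximum Q-value for this state."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))


def reward(track_completed):
    """Sparse reward: -1 per frame until the track is completed."""
    return TRACK_COMPLETE_REWARD if track_completed else STEP_PENALTY
```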
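The linear-approximation controller can be sketched in the same spirit; the feature vector below just mirrors the list of features above, and names such as `update_weights` are placeholders rather than the controller's real API:

```python
import numpy as np

# One weight per feature: aimpoint change, rescued flag, difference between the
# two aimpoints, on-straightaway flag, on-turn flag, speed, steering angle of
# aimpoint 1, steering angle of aimpoint 2.
NUM_FEATURES = 8
weights = np.zeros(NUM_FEATURES)


def q_value(features):
    """Q(s, a) is approximated as the dot product of the weight vector with
    the feature vector extracted for (state, action)."""
    return float(np.dot(weights, features))


def update_weights(features, reward, next_max_q, current_q, alpha=0.1, gamma=0.9):
    """Approximate Q-learning update:
    w <- w + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)) * f(s, a)."""
    global weights
    td_error = reward + gamma * next_max_q - current_q
    weights = weights + alpha * td_error * np.asarray(features)
```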
The goal of double Q-learning is to combat the overestimation problem of single Q-learning. Overestimation occurs because Q-learning estimates expected returns with noisy incremental averages and then takes the maximum over those estimates: the expectation of the maximum of the running averages tends to be higher than the maximum of the true expected values, which is what we are actually trying to model.
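A quick simulation (separate from our controller, with made-up numbers) illustrates this bias: even when every action has the same true value, the max over noisy running averages is biased upward.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_VALUE = 0.0    # every action has the same true Q-value
NUM_ACTIONS = 5
NUM_SAMPLES = 10    # samples averaged to form each action's estimate
NUM_TRIALS = 100_000

# For each trial, estimate each action's value by averaging noisy samples,
# then take the max over actions, as the Q-learning target does.
estimates = rng.normal(TRUE_VALUE, 1.0,
                       size=(NUM_TRIALS, NUM_ACTIONS, NUM_SAMPLES)).mean(axis=2)
max_of_estimates = estimates.max(axis=1)

print("max of true values:      ", TRUE_VALUE)               # 0.0
print("mean of max of estimates:", max_of_estimates.mean())  # noticeably > 0
```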
The solution to this overestimation problem is the double Q-learning algorithm, which maintains two Q-value functions that update each other. This is mathematically sound because Van Hasselt proved that the expectation of the second Q-value function, evaluated at the action the first one prefers, is less than or equal to the maximum of the first Q-values. Hence, when the first Q-values are updated, the target no longer uses their own maximum, since the second Q-values supply the estimate instead. This can lead the algorithm to reach good performance much more quickly than traditional Q-learning.
To implement the double Q-learning algorithm, we started with the provided RL controller from previous years. The first step was creating a second Q-value table to store the second set of values. The next step was changing the function that was called to update the Q-values: on each update it randomly chose, with equal probability of ½, which of the two Q-value tables to update. The update equations also had to be altered so that each table was updated using values from the other Q-table.
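In the general form given by Van Hasselt (2010), with learning rate α and discount γ, the two updates are (our code followed this form, up to implementation details):

- With probability ½, update table A: Q_A(s, a) ← Q_A(s, a) + α [ r + γ Q_B(s′, a*) − Q_A(s, a) ], where a* = argmax over a′ of Q_A(s′, a′)
- Otherwise, update table B: Q_B(s, a) ← Q_B(s, a) + α [ r + γ Q_A(s′, b*) − Q_B(s, a) ], where b* = argmax over a′ of Q_B(s′, a′)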
This ensures that each table is updated with the other's Q-values, combating the overestimation described earlier. The getQValue function was changed to return a tuple containing a Q-value from the first (A) table and one from the second (B) table. For both the computeValuefromQValues and computeActionfromQValues functions, each value and action was calculated using the A-table values as well as the B-table values; the A and B results were put into a tuple and one of them was returned at random. Another numpy file also had to be created to store the B Q-values.
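A minimal sketch of this two-table scheme, using plain dictionaries in place of the controller's NumPy tables; the class and method names here (other than getQValue) are illustrative rather than the controller's exact API:

```python
import random


class DoubleQTable:
    """Two Q-tables (A and B) that update each other, as in double Q-learning."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.q_a = {}        # A-table: (state, action) -> Q-value
        self.q_b = {}        # B-table: (state, action) -> Q-value
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma

    def getQValue(self, state, action):
        """Return a tuple with the Q-value from the A-table and from the B-table."""
        key = (state, action)
        return self.q_a.get(key, 0.0), self.q_b.get(key, 0.0)

    def update(self, state, action, reward, next_state):
        """With probability 1/2 update the A-table using the B-table's estimate,
        otherwise update the B-table using the A-table's estimate."""
        if random.random() < 0.5:
            own, other = self.q_a, self.q_b
        else:
            own, other = self.q_b, self.q_a

        # Action that the table being updated currently prefers in next_state.
        best_next = max(self.actions,
                        key=lambda a: own.get((next_state, a), 0.0))
        target = reward + self.gamma * other.get((next_state, best_next), 0.0)

        key = (state, action)
        own[key] = own.get(key, 0.0) + self.alpha * (target - own.get(key, 0.0))
```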
We expected to see faster times across all tracks as a result of the double Q-learning algorithm, but our results ended up being slower. Data from the double Q-learning runs is not worth displaying here because it did not lead to any improvements in speed, and most of the tracks ended up maxing out the allotted time. Despite not producing meaningful results for us, the literature shows that the double Q-learning algorithm can bring about drastic improvements when implemented correctly. At this point, it is unclear where the error lies in our implementation of the algorithm.