Q-Learning on MountainCar-v0 Environment
Author: Gleb Tcivie
In this project, we are training a Q-Learning agent to solve the `MountainCar-v0` environment from OpenAI's Gym.
The objective of the MountainCar environment is to get an underpowered car to the top of a hill. The car is on a one-dimensional track, and the position and velocity of the car are observable at each time step.
Training Process
We implement a Q-Learning algorithm with an ε-greedy policy for action selection. We use a simple table to represent the Q-values of state-action pairs. To handle the continuous state space of the environment, we discretize the states by splitting the position and velocity into bins.
The agent's goal is to maximize the total reward it receives in an episode. The reward at each time step is -1, and an episode ends when the car reaches the goal (position 0.5) or after 200 time steps.
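To make this setup concrete, here is a minimal sketch of the discretised Q-table, assuming the classic OpenAI Gym API; the bin count shown is illustrative, since the actual value is one of the hyperparameters tuned by the sweep described below.

```python
import gym
import numpy as np

# Illustrative bin count; the actual value is one of the swept hyperparameters.
BIN_SIZE = 20

env = gym.make("MountainCar-v0")

# Width of one bin in each observation dimension (position, velocity).
bins = (env.observation_space.high - env.observation_space.low) / BIN_SIZE

# Q-table indexed by [position_bin, velocity_bin, action]; since every step
# yields a reward of -1, initialising with small negative values is a common choice.
q_table = np.random.uniform(low=-2, high=0,
                            size=(BIN_SIZE, BIN_SIZE, env.action_space.n))
```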
Hyperparameter Tuning
We use Weights & Biases Sweeps for hyperparameter tuning. We explore different values of the learning rate, discount factor, and the number of discretized states. The agent's performance is measured by the average reward over 100 episodes.
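A sweep configuration along these lines might look like the sketch below; the parameter names, ranges, and search method are illustrative assumptions rather than the exact values used in the notebook.

```python
# Hypothetical sweep configuration; names, ranges, and the search method
# are assumptions, not the exact values used in the notebook.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "average_reward", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.5},
        "discount": {"min": 0.9, "max": 0.999},
        "bin_size": {"values": [10, 20, 30, 40]},
    },
}
```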
Analysis and Visualization
The training metrics (the per-episode reward, the average reward over the last 100 episodes, the episode length, and the current ε value) are plotted on a Weights & Biases dashboard, illustrating how the agent's performance evolves as it learns from its interactions with the environment.
Results
After training for a specified number of episodes, the agent is able to consistently reach the goal within the 200 time step limit. The hyperparameters found by the sweep lead to more efficient learning compared to a manually chosen baseline.
Because the notebook was slow to load, I moved the Weights & Biases results to a link format - you can see the results here.
Code and Running
Configure Sweeps & Login to Weights and Biases
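A minimal sketch of this cell, assuming the sweep_config dictionary shown earlier and an illustrative project name:

```python
import wandb

wandb.login()  # prompts for (or reads) the W&B API key

# Register the sweep; the project name is illustrative.
sweep_id = wandb.sweep(sweep_config, project="mountaincar-qlearning")
```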
Number of episodes to iterate
The number of episodes is set to 25,000 (out of a selectable maximum of 100,000).
This function takes in a continuous state and returns a discrete state
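A minimal sketch of such a discretisation helper; the exact signature and implementation in the notebook may differ.

```python
import numpy as np

def get_discrete_state(state, env, bin_size):
    """Map a continuous (position, velocity) observation to a tuple of bin indices."""
    bins = (env.observation_space.high - env.observation_space.low) / bin_size
    discrete_state = (state - env.observation_space.low) / bins
    return tuple(discrete_state.astype(int))
```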
Defining the main training loop
The main loop of the Q-learning algorithm iterates over episodes. Inside this loop, we reset the environment, choose actions with the ε-greedy policy, step the environment, update the Q-table, and log metrics, using the helper functions described below (a combined sketch of these helpers follows their descriptions).
Creating helper functions
reset_environment(env, bin_size) - This function resets the environment to its initial state at the beginning of each episode. It also converts the initial state into a discrete format, as our Q-table is based on discrete states and actions. It takes the environment and bin size as input and returns the initial observation and the discrete state.
choose_action(discrete_state, q_table, epsilon, env) - This function implements the ε-greedy policy for action selection. With a probability of ε, it selects a random action, and with a probability of (1-ε), it selects the action with the highest Q-value in the current state. It takes the current discrete state, the Q-table, the epsilon value, and the environment as input and returns the chosen action.
update_q_table(q_table, discrete_state, action, reward, new_discrete_state, LEARNING_RATE, DISCOUNT) - This function updates the Q-value for the current state-action pair based on the Q-Learning update rule. It takes the Q-table, the current discrete state, the chosen action, the reward obtained, the new discrete state, learning rate, and discount factor as input. It doesn't return anything as the Q-table is updated in-place.
log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration, epsilon) - This function logs various metrics of interest during the training process. These metrics include the average reward over the last 100 episodes, the total reward in the current episode, the number of steps in the current episode, the current ε value, and the minimum and maximum rewards obtained so far. It takes the current run, lists to store total, maximum and minimum rewards per episode, reward for the current episode, number of steps in the current episode, duration of the current episode, and the current ε value as input. The metrics are logged to the current Weights & Biases run for visualizing the training progress.
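Taken together, these descriptions translate roughly into the sketches below. update_q_table applies the standard Q-Learning update Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') − Q(s, a)); the logging keys and the exact signatures are assumptions.

```python
import numpy as np

def reset_environment(env, bin_size):
    """Reset the environment and return the initial observation and its discrete state."""
    observation = env.reset()  # classic Gym API: reset() returns the observation
    return observation, get_discrete_state(observation, env, bin_size)

def choose_action(discrete_state, q_table, epsilon, env):
    """ε-greedy selection: explore with probability ε, otherwise exploit."""
    if np.random.random() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(q_table[discrete_state]))

def update_q_table(q_table, discrete_state, action, reward,
                   new_discrete_state, LEARNING_RATE, DISCOUNT):
    """In-place Q-Learning update for the current state-action pair."""
    max_future_q = np.max(q_table[new_discrete_state])
    current_q = q_table[discrete_state + (action,)]
    q_table[discrete_state + (action,)] = current_q + LEARNING_RATE * (
        reward + DISCOUNT * max_future_q - current_q)

def log_metrics(run, reward_list, max_reward_list, min_reward_list,
                episode_reward, episode_steps, duration, epsilon):
    """Log the episode's metrics to the current Weights & Biases run."""
    reward_list.append(episode_reward)
    max_reward_list.append(max(reward_list))
    min_reward_list.append(min(reward_list))
    run.log({  # metric names are assumptions
        "average_reward_100": float(np.mean(reward_list[-100:])),
        "episode_reward": episode_reward,
        "episode_steps": episode_steps,
        "episode_duration": duration,
        "epsilon": epsilon,
        "max_reward": max_reward_list[-1],
        "min_reward": min_reward_list[-1],
    })
```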
The main loop
run_episodes(run, env, q_table, bin_size, epsilon, LEARNING_RATE, epsilon_decay_value, DISCOUNT, END_EPSILON_DECAYING, START_EPSILON_DECAYING) - This function contains the main training loop. In each episode, it resets the environment, selects actions according to the ε-greedy policy, takes the actions in the environment, updates the Q-table, and logs the training metrics. It takes the current run, the environment, the Q-table, the bin size for discretizing states, the initial epsilon value, the learning rate, the epsilon decay value, the discount factor, and the start and end episodes for epsilon decay as input. The Q-table gets updated in-place during the training process, and the training metrics are logged to the current Weights & Biases run.
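A rough sketch of this loop, assuming the helpers above and the classic Gym step API; the episode count is passed here as an extra argument for illustration.

```python
import time

def run_episodes(run, env, q_table, bin_size, epsilon, LEARNING_RATE,
                 epsilon_decay_value, DISCOUNT,
                 END_EPSILON_DECAYING, START_EPSILON_DECAYING,
                 episodes=25_000):
    reward_list, max_reward_list, min_reward_list = [], [], []
    for episode in range(episodes):
        start_time = time.time()
        observation, discrete_state = reset_environment(env, bin_size)
        episode_reward, episode_steps, done = 0.0, 0, False

        while not done:
            action = choose_action(discrete_state, q_table, epsilon, env)
            new_state, reward, done, _ = env.step(action)  # classic Gym API
            new_discrete_state = get_discrete_state(new_state, env, bin_size)
            update_q_table(q_table, discrete_state, action, reward,
                           new_discrete_state, LEARNING_RATE, DISCOUNT)
            discrete_state = new_discrete_state
            episode_reward += reward
            episode_steps += 1

        # Decay epsilon linearly within the configured episode window.
        if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
            epsilon -= epsilon_decay_value

        log_metrics(run, reward_list, max_reward_list, min_reward_list,
                    episode_reward, episode_steps,
                    time.time() - start_time, epsilon)
```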
train() is a wrapper around run_episodes() that adapts it to the Weights & Biases sweeps functionality.
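A hedged sketch of this wrapper, assuming the swept hyperparameters are read from the run's config; the config keys follow the sweep_config sketch above and are assumptions, as is the epsilon decay schedule.

```python
EPISODES = 25_000  # as set in the episode-count cell above

def train():
    with wandb.init() as run:
        config = run.config  # populated by the sweep agent
        env = gym.make("MountainCar-v0")
        bin_size = config.bin_size
        q_table = np.random.uniform(low=-2, high=0,
                                    size=(bin_size, bin_size, env.action_space.n))

        # Decay epsilon linearly over the first half of training (an assumption).
        epsilon = 1.0
        START_EPSILON_DECAYING = 1
        END_EPSILON_DECAYING = EPISODES // 2
        epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

        run_episodes(run, env, q_table, bin_size, epsilon,
                     config.learning_rate, epsilon_decay_value, config.discount,
                     END_EPSILON_DECAYING, START_EPSILON_DECAYING,
                     episodes=EPISODES)
        env.close()
```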
Run the sweeps
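Launching the sweep agent would then look roughly like this; the run count is illustrative.

```python
# Each agent run calls train() once with a new hyperparameter combination.
wandb.agent(sweep_id, function=train, count=20)
```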