Q-Learning on MountainCar-v0 Environment
Author: Gleb Tcivie
In this project, we are training a Q-Learning agent to solve the `MountainCar-v0` environment from OpenAI's Gym.
The objective of the MountainCar environment is to get an underpowered car to the top of a hill. The car is on a one-dimensional track, and the position and velocity of the car are observable at each time step.
Training Process
We implement a Q-Learning algorithm with an ε-greedy policy for action selection. We use a simple table to represent the Q-values of state-action pairs. To handle the continuous state space of the environment, we discretize the states by splitting the position and velocity into bins.
The agent's goal is to maximize the total reward it receives in an episode. The reward at each time step is -1, and an episode ends when the car reaches the goal (position 0.5) or after 200 time steps.
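To make this setup concrete, here is a minimal sketch of the discretised Q-table, assuming the classic OpenAI Gym API; the bin count shown is illustrative, since the actual value is one of the hyperparameters tuned by the sweep described below.

```python
import gym
import numpy as np

# Illustrative bin count; the actual value is one of the swept hyperparameters.
BIN_SIZE = 20

env = gym.make("MountainCar-v0")

# Width of one bin in each observation dimension (position, velocity).
bins = (env.observation_space.high - env.observation_space.low) / BIN_SIZE

# Q-table indexed by [position_bin, velocity_bin, action]; since every step
# yields a reward of -1, initialising with small negative values is a common choice.
q_table = np.random.uniform(low=-2, high=0,
                            size=(BIN_SIZE, BIN_SIZE, env.action_space.n))
```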
Hyperparameter Tuning
We use Weights & Biases Sweeps for hyperparameter tuning. We explore different values of the learning rate, discount factor, and the number of discretized states. The agent's performance is measured by the average reward over 100 episodes.
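A sweep configuration along these lines might look like the sketch below; the parameter names, ranges, and search method are illustrative assumptions rather than the exact values used in the notebook.

```python
# Hypothetical sweep configuration; names, ranges, and the search method
# are assumptions, not the exact values used in the notebook.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "average_reward", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.5},
        "discount": {"min": 0.9, "max": 0.999},
        "bin_size": {"values": [10, 20, 30, 40]},
    },
}
```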
Analysis and Visualization
The training metrics (the per-episode reward, the average reward over the last 100 episodes, the episode length, and the current ε value) are plotted on a Weights & Biases dashboard, illustrating how the agent's performance evolves as it learns from its interactions with the environment.
Results
After training for a specified number of episodes, the agent is able to consistently reach the goal within the 200 time step limit. The hyperparameters found by the sweep lead to more efficient learning compared to a manually chosen baseline.
Because the notebook was slow to load, I moved the Weights & Biases results to a link format - you can see the results here.
Code and Running
Configure Sweeps & Login to Weights and Biases
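A minimal sketch of this cell, assuming the sweep_config dictionary shown earlier and an illustrative project name:

```python
import wandb

wandb.login()  # prompts for (or reads) the W&B API key

# Register the sweep; the project name is illustrative.
sweep_id = wandb.sweep(sweep_config, project="mountaincar-qlearning")
```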
Number of episodes to iterate
The number of episodes is set to 25,000 (out of a selectable maximum of 100,000).
This function takes in a continuous state and returns a discrete state
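A minimal sketch of such a discretisation helper; the exact signature and implementation in the notebook may differ.

```python
import numpy as np

def get_discrete_state(state, env, bin_size):
    """Map a continuous (position, velocity) observation to a tuple of bin indices."""
    bins = (env.observation_space.high - env.observation_space.low) / bin_size
    discrete_state = (state - env.observation_space.low) / bins
    return tuple(discrete_state.astype(int))
```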
Defining the main training loop
The main loop of the Q-learning algorithm iterates over episodes. Inside this loop, we reset the environment, choose actions with the ε-greedy policy, step the environment, update the Q-table, and log metrics, using the helper functions described below (a combined sketch of these helpers follows their descriptions).
Creating helper functions
reset_environment(env, bin_size) - This function resets the environment to its initial state at the beginning of each episode. It also converts the initial state into a discrete format, as our Q-table is based on discrete states and actions. It takes the environment and bin size as input and returns the initial observation and the discrete state.
choose_action(discrete_state, q_table, epsilon, env) - This function implements the ε-greedy policy for action selection. With a probability of ε, it selects a random action, and with a probability of (1-ε), it selects the action with the highest Q-value in the current state. It takes the current discrete state, the Q-table, the epsilon value, and the environment as input and returns the chosen action.
update_q_table(q_table, discrete_state, action, reward, new_discrete_state, LEARNING_RATE, DISCOUNT) - This function updates the Q-value for the current state-action pair based on the Q-Learning update rule. It takes the Q-table, the current discrete state, the chosen action, the reward obtained, the new discrete state, learning rate, and discount factor as input. It doesn't return anything as the Q-table is updated in-place.
log_metrics(run, reward_list, max_reward_list, min_reward_list, episode_reward, episode_steps, duration, epsilon) - This function logs various metrics of interest during the training process. These metrics include the average reward over the last 100 episodes, the total reward in the current episode, the number of steps in the current episode, the current ε value, and the minimum and maximum rewards obtained so far. It takes the current run, lists to store total, maximum and minimum rewards per episode, reward for the current episode, number of steps in the current episode, duration of the current episode, and the current ε value as input. The metrics are logged to the current Weights & Biases run for visualizing the training progress.
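Taken together, these descriptions translate roughly into the sketches below. update_q_table applies the standard Q-Learning update Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') − Q(s, a)); the logging keys and the exact signatures are assumptions.

```python
import numpy as np

def reset_environment(env, bin_size):
    """Reset the environment and return the initial observation and its discrete state."""
    observation = env.reset()  # classic Gym API: reset() returns the observation
    return observation, get_discrete_state(observation, env, bin_size)

def choose_action(discrete_state, q_table, epsilon, env):
    """ε-greedy selection: explore with probability ε, otherwise exploit."""
    if np.random.random() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(q_table[discrete_state]))

def update_q_table(q_table, discrete_state, action, reward,
                   new_discrete_state, LEARNING_RATE, DISCOUNT):
    """In-place Q-Learning update for the current state-action pair."""
    max_future_q = np.max(q_table[new_discrete_state])
    current_q = q_table[discrete_state + (action,)]
    q_table[discrete_state + (action,)] = current_q + LEARNING_RATE * (
        reward + DISCOUNT * max_future_q - current_q)

def log_metrics(run, reward_list, max_reward_list, min_reward_list,
                episode_reward, episode_steps, duration, epsilon):
    """Log the episode's metrics to the current Weights & Biases run."""
    reward_list.append(episode_reward)
    max_reward_list.append(max(reward_list))
    min_reward_list.append(min(reward_list))
    run.log({  # metric names are assumptions
        "average_reward_100": float(np.mean(reward_list[-100:])),
        "episode_reward": episode_reward,
        "episode_steps": episode_steps,
        "episode_duration": duration,
        "epsilon": epsilon,
        "max_reward": max_reward_list[-1],
        "min_reward": min_reward_list[-1],
    })
```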
The main loop
run_episodes(run, env, q_table, bin_size, epsilon, LEARNING_RATE, epsilon_decay_value, DISCOUNT, END_EPSILON_DECAYING, START_EPSILON_DECAYING) - This function contains the main training loop. In each episode, it resets the environment, selects actions according to the ε-greedy policy, takes the actions in the environment, updates the Q-table, and logs the training metrics. It takes the current run, the environment, the Q-table, the bin size for discretizing states, the initial epsilon value, the learning rate, the epsilon decay value, the discount factor, and the start and end episodes for epsilon decay as input. The Q-table gets updated in-place during the training process, and the training metrics are logged to the current Weights & Biases run.
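A rough sketch of this loop, assuming the helpers above and the classic Gym step API; the episode count is passed here as an extra argument for illustration.

```python
import time

def run_episodes(run, env, q_table, bin_size, epsilon, LEARNING_RATE,
                 epsilon_decay_value, DISCOUNT,
                 END_EPSILON_DECAYING, START_EPSILON_DECAYING,
                 episodes=25_000):
    reward_list, max_reward_list, min_reward_list = [], [], []
    for episode in range(episodes):
        start_time = time.time()
        observation, discrete_state = reset_environment(env, bin_size)
        episode_reward, episode_steps, done = 0.0, 0, False

        while not done:
            action = choose_action(discrete_state, q_table, epsilon, env)
            new_state, reward, done, _ = env.step(action)  # classic Gym API
            new_discrete_state = get_discrete_state(new_state, env, bin_size)
            update_q_table(q_table, discrete_state, action, reward,
                           new_discrete_state, LEARNING_RATE, DISCOUNT)
            discrete_state = new_discrete_state
            episode_reward += reward
            episode_steps += 1

        # Decay epsilon linearly within the configured episode window.
        if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
            epsilon -= epsilon_decay_value

        log_metrics(run, reward_list, max_reward_list, min_reward_list,
                    episode_reward, episode_steps,
                    time.time() - start_time, epsilon)
```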
train() is a wrapper around run_episodes() that adapts it to the Weights & Biases sweeps functionality.
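A hedged sketch of this wrapper, assuming the swept hyperparameters are read from the run's config; the config keys follow the sweep_config sketch above and are assumptions, as is the epsilon decay schedule.

```python
EPISODES = 25_000  # as set in the episode-count cell above

def train():
    with wandb.init() as run:
        config = run.config  # populated by the sweep agent
        env = gym.make("MountainCar-v0")
        bin_size = config.bin_size
        q_table = np.random.uniform(low=-2, high=0,
                                    size=(bin_size, bin_size, env.action_space.n))

        # Decay epsilon linearly over the first half of training (an assumption).
        epsilon = 1.0
        START_EPSILON_DECAYING = 1
        END_EPSILON_DECAYING = EPISODES // 2
        epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

        run_episodes(run, env, q_table, bin_size, epsilon,
                     config.learning_rate, epsilon_decay_value, config.discount,
                     END_EPSILON_DECAYING, START_EPSILON_DECAYING,
                     episodes=EPISODES)
        env.close()
```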
Run the sweeps
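Launching the sweep agent would then look roughly like this; the run count is illustrative.

```python
# Each agent run calls train() once with a new hyperparameter combination.
wandb.agent(sweep_id, function=train, count=20)
```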