Machine Data and Learning - Assignment 2 Part 3
Team 2
Suyash Vardhan Mathur - 2019114006
Tushar Choudhary - 2019111019
Overview
For the given problem, we have solved the given Markov Decision Process using Linear Programming and found an optimal policy from the solution. Upon executing the code with python3 part_3.py, a file named part_3_output.json is created in the outputs folder, which stores the following parameters:
Matrix A
For an MDP, A is a 2D matrix whose number of rows equals the total number of possible states, and whose number of columns equals the total number of possible (state, action) pairs.
In our case, the total number of states is obtained by counting all possible combinations of position, material count, arrow count, MM's state and MM's health. Taking all possible values, we get 5 x 3 x 4 x 2 x 5 = 600 states. The total number of (state, action) pairs is obtained by iterating through all possible states and counting the number of actions IJ can take from each of them; this count turns out to be 1936. Hence A is a matrix of dimension 600 x 1936.
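A minimal sketch of this enumeration, assuming the value ranges implied by the counts above; the valid_actions helper is hypothetical and would encode the assignment's movement/attack rules, which are not reproduced here:

```python
from itertools import product

# Value ranges follow the counts above: 5 positions, 3 material counts,
# 4 arrow counts, 2 MM states, 5 MM health levels.
POSITIONS = range(5)
MATERIALS = range(3)
ARROWS    = range(4)
MM_STATES = range(2)
MM_HEALTH = range(5)

# Every combination of the five components is one state: 5*3*4*2*5 = 600.
states = list(product(POSITIONS, MATERIALS, ARROWS, MM_STATES, MM_HEALTH))
assert len(states) == 600

def valid_actions(state):
    """Hypothetical helper: would return the actions IJ can take from
    `state` according to the assignment's rules (not reproduced here)."""
    raise NotImplementedError

# The number of (state, action) pairs is the sum over all states of the
# number of valid actions; for this assignment it comes out to 1936.
# num_pairs = sum(len(valid_actions(s)) for s in states)
```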
In this matrix, each row represents the effect that every (state, action) pair has on the corresponding state. If the state corresponding to the row has no direct relation to a (state, action) pair, i.e. it is neither the source state of the pair nor a possible successor state, then the entry is 0. Otherwise, we add all outflow probabilities and subtract all inflow probabilities. So, for any row (with its corresponding state) and any column (with its corresponding (state, action) pair), the transition probability of that (state, action) pair is added if the action causes an outgoing edge from the row's state, and subtracted if it causes an incoming edge into it.
This matrix was created using np.zeros with shape (number_of_states, number_of_state_action_pairs); we then iterated through each possible action of every state and added the outgoing and subtracted the incoming probabilities in the respective (state, (state, action)) element, as in the sketch below.
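A minimal sketch of this construction, assuming hypothetical helpers valid_actions(state) and transitions(state, action) (the latter yielding (next_state, probability) pairs); these names are ours, not from the original code:

```python
import numpy as np

def build_A(states, valid_actions, transitions):
    """Sketch of the LP constraint matrix construction described above.

    `valid_actions(s)` lists the actions available in state `s`, and
    `transitions(s, a)` yields (next_state, probability) pairs; both
    helpers are assumed here, not part of the original code.
    """
    state_index = {s: i for i, s in enumerate(states)}
    pairs = [(s, a) for s in states for a in valid_actions(s)]

    A = np.zeros((len(states), len(pairs)))
    for col, (s, a) in enumerate(pairs):
        for s_next, prob in transitions(s, a):
            A[state_index[s], col] += prob        # outgoing edge: added
            A[state_index[s_next], col] -= prob   # incoming edge: subtracted
    return A
```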
Procedure to find Policies
The optimal policy for an MDP is the one that maximizes the expected reward, so the expected reward is the objective function that we maximize in our LP problem.
Using the Bellman Equation, we know that:
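Concretely, the resulting linear program (written here in our own notation, where r is the reward vector over (state, action) pairs and α is the initial state distribution; this is the standard formulation consistent with the matrix A and the vector x described above) is:

$$\max_{x} \; r^{T} x \qquad \text{subject to} \qquad A\,x = \alpha, \quad x \ge 0$$

Each component of x can be interpreted as the expected (discounted) number of times the corresponding (state, action) pair is visited, and the optimal action in a state is the one with the largest such value among that state's pairs.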
We used the cvxpy solver to find the values of the array x such that the constraints of the MDP are satisfied, as sketched below.
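A minimal sketch of this LP solve with cvxpy, assuming the matrix A, the reward vector r, and the initial distribution alpha have already been built (the variable and function names here are ours):

```python
import cvxpy as cp

def solve_lp(A, r, alpha):
    """Maximise expected reward subject to the flow-conservation
    constraints encoded in A.

    A     : (num_states, num_pairs) constraint matrix
    r     : (num_pairs,) reward for each (state, action) pair
    alpha : (num_states,) initial state distribution
    """
    x = cp.Variable(A.shape[1])               # one entry per (state, action) pair
    objective = cp.Maximize(r @ x)
    constraints = [A @ x == alpha, x >= 0]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return x.value                             # optimal occupancy measures
```

The policy is then read off by picking, for each state, the action whose (state, action) entry in x is largest.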