Outlook | Temp (F) | Humidity (%) | Windy? | Class |
sunny | 75 | 70 | true | Play |
sunny | 80 | 90 | true | Don't Play |
sunny | 85 | 85 | false | Don't Play |
sunny | 72 | 95 | false | Don't Play |
sunny | 69 | 70 | false | Play |
overcast | 72 | 90 | true | Play |
overcast | 83 | 78 | false | Play |
overcast | 64 | 65 | true | Play |
overcast | 81 | 75 | false | Play |
rain | 71 | 80 | true | Don't Play |
rain | 65 | 70 | true | Don't Play |
rain | 75 | 80 | false | Play |
rain | 68 | 80 | false | Play |
rain | 70 | 96 | false | Play |
(a) Optimal Parameter Estimation (10 pts.) Given the data set above, give the conditional probability tables for a Bayes Net in which the following variables are conditionally dependent (CD): Temp is CD on Outlook, Humidity is CD on Outlook, Windy? is CD on Humidity, and Class is CD on Temp and Windy?. For this problem, discretize Temp into < 70 and ≥ 70, Humidity into < 75 and ≥ 75, and Outlook into (sunny) and (overcast or rain), so that all variables in this BN are binary.
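To make the counting concrete, here is a minimal Python sketch of maximum-likelihood CPT estimation. The records and discretization thresholds come from the problem above; the helper names (`discretize`, `cpt`) are illustrative, not part of the assignment.

```python
# Minimal sketch: maximum-likelihood CPT estimation by counting.
from collections import Counter

records = [
    ("sunny", 75, 70, True, "Play"), ("sunny", 80, 90, True, "Don't Play"),
    ("sunny", 85, 85, False, "Don't Play"), ("sunny", 72, 95, False, "Don't Play"),
    ("sunny", 69, 70, False, "Play"), ("overcast", 72, 90, True, "Play"),
    ("overcast", 83, 78, False, "Play"), ("overcast", 64, 65, True, "Play"),
    ("overcast", 81, 75, False, "Play"), ("rain", 71, 80, True, "Don't Play"),
    ("rain", 65, 70, True, "Don't Play"), ("rain", 75, 80, False, "Play"),
    ("rain", 68, 80, False, "Play"), ("rain", 70, 96, False, "Play"),
]

def discretize(outlook, temp, humidity, windy, cls):
    """Map one record onto the binary variables required by the problem."""
    return {"Outlook": "sunny" if outlook == "sunny" else "overcast/rain",
            "Temp": "<70" if temp < 70 else ">=70",
            "Humidity": "<75" if humidity < 75 else ">=75",
            "Windy": windy, "Class": cls}

data = [discretize(*r) for r in records]

def cpt(child, parents):
    """P(child | parents) by maximum likelihood: joint count / parent count."""
    joint, marginal = Counter(), Counter()
    for d in data:
        pa = tuple(d[p] for p in parents)
        joint[(pa, d[child])] += 1
        marginal[pa] += 1
    return {key: joint[key] / marginal[key[0]] for key in joint}

# The four CPTs named in the problem statement:
for child, parents in [("Temp", ["Outlook"]), ("Humidity", ["Outlook"]),
                       ("Windy", ["Humidity"]), ("Class", ["Temp", "Windy"])]:
    print(child, "given", parents, "->", cpt(child, parents))
```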
(b) Structure learning (15 pts.) Given the Bayes Net you just constructed, show one refinement operation you could make to its structure (add, remove, or reverse a CD link) and show the change in the likelihood of the data under your new model. This will require you to update the affected CPTs and compute the likelihood of the data under both the old and the new model.
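Continuing the sketch above (it reuses `data`, `cpt`, and `Counter`), the likelihood comparison can be checked mechanically. The structure dictionaries and the particular refinement chosen below are illustrative assumptions, not the required answer.

```python
# Continuing the sketch above: log-likelihood of the data under a
# structure, so old and new models can be compared. A structure maps
# each variable to its parent list; an empty list marks a root node.
import math

def root_table(child):
    """P(child) for a parentless node, by maximum likelihood."""
    counts = Counter(d[child] for d in data)
    return {v: c / len(data) for v, c in counts.items()}

def log_likelihood(structure):
    tables = {c: cpt(c, ps) if ps else root_table(c)
              for c, ps in structure.items()}
    ll = 0.0
    for d in data:
        for child, parents in structure.items():
            if parents:
                pa = tuple(d[p] for p in parents)
                ll += math.log(tables[child][(pa, d[child])])
            else:
                ll += math.log(tables[child][d[child]])
    return ll

old = {"Outlook": [], "Temp": ["Outlook"], "Humidity": ["Outlook"],
       "Windy": ["Humidity"], "Class": ["Temp", "Windy"]}
new = dict(old, Windy=[])  # one possible refinement: remove Humidity -> Windy
print(log_likelihood(old), log_likelihood(new))
```

Note that with maximum-likelihood CPTs, removing a link can never increase the likelihood of the training data, and adding one can never decrease it, which is worth keeping in mind when you interpret the change.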
The diagram below shows a gridworld domain in which the agent starts at the upper left location. The upper and lower rows are both "one-way streets," since only the actions shown by arrows are available. Actions that attempt to move the agent into a wall (the outer borders, or the thick black wall between all but the leftmost cell of the top and bottom rows) leave the agent in the same state it was in with probability 1, and have reward -2. If the agent tries to move right from the upper right or lower right location, it is teleported, with probability 1, to the far left end of the corresponding row, with the reward marked there (-10 and +20, respectively). All other actions have their expected effect (move up, down, left, or right) with probability .9, and leave the agent in the same state it was in with probability .1. These actions all have reward -1, except for the transitions marked in the upper left and lower left cells. (Note that a marked transition gives the indicated reward only if the action succeeds in moving the agent in that direction.)
(a) MDP (10 pts) Give the MDP for this domain, restricted to state transitions starting from each of the states in the top row, by filling in a state-action-state transition table (showing only transitions with non-zero probability). Refer to each state by its row and column indices, so the upper left state is [1,1] and the lower right state is [2,4].
To get you started, here are the first few lines of the table:
State s | Action a | New state s' | p(s'|s,a) | r(s,a,s') |
[1,1] | Up | [1,1] | 1.0 | -2 |
[1,1] | Right | [1,1] | 0.1 | -1 |
[1,1] | Right | [1,2] | 0.9 | +20 |
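The remaining rows are the exercise, but a compact way to sanity-check a completed table is to encode it as a mapping from (state, action) to outcome lists and assert that each action's outgoing probabilities sum to 1. The encoding below is one illustrative choice, seeded with only the starter rows above.

```python
# One possible encoding of the transition table; only the starter rows
# from the problem are entered here, the rest are the exercise.
T = {
    ((1, 1), "Up"):    [((1, 1), 1.0, -2)],
    ((1, 1), "Right"): [((1, 1), 0.1, -1), ((1, 2), 0.9, +20)],
}

def check(table):
    """Sanity check: outgoing probabilities for each (s, a) sum to 1."""
    for (s, a), outcomes in table.items():
        total = sum(p for _, p, _ in outcomes)
        assert abs(total - 1.0) < 1e-9, (s, a, total)

check(T)
```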
(b) Value function (20 pts)
Suppose the agent follows a randomized policy π (where each available action in any given state has equal probability) and uses a discount factor of γ = .85. Given the partial value function (Vπ; Uπ in Russell & Norvig's terminology) shown below, fill in the missing Vπ values. Show and explain your work.
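Each missing value is pinned down by the Bellman expectation equation for a fixed policy, which for the uniform-random π here reads

$$ V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V^\pi(s') \,\bigr], \qquad \gamma = 0.85. $$

Writing this equation once for each state whose value is missing yields a small linear system in the unknown values.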
(c) Policy (15 pts)
Given the value function Vπ computed in (b), what new policy π' would policy iteration produce at the next iteration? Show your answer as a diagram (arrows on the grid) or as a state-action table.
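For reference, the improvement step is the one-step greedy lookahead π'(s) = argmax_a Σ_{s'} p(s'|s,a) [r(s,a,s') + γ Vπ(s')]. A minimal Python sketch follows, assuming the transition encoding `T` from part (a) and a dictionary `V` of the values from part (b); both names are illustrative.

```python
# Minimal sketch of the greedy improvement step of policy iteration.
# `T` maps (state, action) -> [(next_state, prob, reward), ...] and
# `V` maps states to their V^pi values; both are assumed inputs here.
GAMMA = 0.85

def q_value(T, V, s, a):
    """One-step lookahead: Q(s, a) = sum over s' of p * (r + gamma * V[s'])."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[(s, a)])

def improved_policy(T, V, actions):
    """pi'(s) = argmax over a of Q(s, a), restricted to the actions available in s."""
    return {s: max(acts, key=lambda a: q_value(T, V, s, a))
            for s, acts in actions.items()}
```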