courses:b4m36smu

In the following Markov decision process (MDP), there are four states $s_1, s_2, s_3, s_4$. States $s_3$ and $s_4$ are terminal, reward for entering $s_3$ is $+100$ and $-100$ in $s_4$. In states $s_1$ and $s_2$, there is action FW, which leads to the next state ($s_2$ and $s_3$, respectively) with 0.9 probability. With 0.1 probability the action FW leads to state $s_4$. In $s_2$, action BW leads to state $s_1$ with 1.0 probability. In states $s_1$ and $s_2$, there is action exit that leads to $s_4$.

Let $\gamma \in [0, 1]$ be a discount factor (treat it as a parameter, do not assume any particular value), $V$ the state utility, and $V^\pi$ the value of a state under a policy $\pi$.

(2 points) Calculate value $V(s_1)$. Treat discount factor $\gamma$ as a parameter of $V(s_1)$.
(2 points) Let $\gamma' > \gamma$. Will $V(s_1)$ be higher in a different scenario where the discount factor is $\gamma'$? Justify your answer.
(1 point) Consider a passive temporal difference based reinforcement learning agent that in the first epoch of the learning encounters sequence of states and actions $s_1$, FW, $s_2$, BW, $s_1$, FW, $s_2$, FW, $s_3$. Initially, the values of states $s_1$ and $s_2$ are 0, the value of $s_3$ is 100, and the value of $s_4$ is -100. Which estimates of $V^\pi$ will change after the first epoch of the learning?
(2 points) Assume the scenario from the previous point. Consider two learning rate values $\alpha_1$ and $\alpha_2$ such that $\alpha_1 < \alpha_2$. Compare $V(s_2)$ in those two cases and justify your answer. Do the same for the TD-based estimates of $V^\pi(s_2)$.
(3 points) Unlike the passive TD-agent, SARSA and Q-learning are able to learn actively under some explore/exploit policy $\pi$. Assume that an adversarial has given those two algorithms a policy that selects an action completely randomly no matter what the $Q$-values estimates are. Is it possible that SARSA or Q-learning agent still learns the true $Q$-values, i.e., $Q(s, a)$? Is it possible that one of those algorithms will play optimally even with the replaced policy? Justify your answers. You may use the running example with action exit to simplify your explanation.

1. (2 points) Let $\pi$ be some policy. Consider the temporal-difference policy-evaluation algorithm, which uses the following update rule: $$V^\pi(s_t) := V^\pi(s_t) + \alpha_t \cdot (r_{i,t} + \gamma \cdot V^\pi(s_{t+1}) - V^\pi(s_t)).$$ Which two conditions should the numbers $\alpha_t$ satisfy to guarantee convergence to the true value of the state-value function $V^\pi$ (i.e. to the true value of the state-value function w.r.t. the policy $\pi$)?

Solution

Consider the Bayesian network structure below:

(2 points) Assuming that the random variables $X_3, X_6$ are three-valued and the rest are all binary, how many parameters are needed in order to fully specify the joint probability distribution?
(3 points) Let $\mathcal{R}$ denote a subset of the network's random variables. Find the largest $\mathcal{R}$ (excluding evidence) such that

a) $X_1 \perp\!\!\!\perp \mathcal{R} \mid X_8$

b) $\{X_1, X_8\} \perp\!\!\!\perp \mathcal{R} \mid X_4$

c) $X_5 \perp\!\!\!\perp \mathcal{R} \mid \emptyset$

In this question on Naive Bayes modelling, assume the following training corpus of 5 simplified movie “reviews”:

fun, movie, love, stupid $\rightarrow$ positive
stupid, movie, weird, boring $\rightarrow$ negative
love, weird, movie, fun $\rightarrow$ positive
boring, love, fun, movie $\rightarrow$ positive
stupid, love, movie, boring $\rightarrow$ negative

Answer to the following questions:

(1 points) What is the link between the Naive Bayes model and language models?

(2 points) Write down its formula and design the structure of a Naive Bayes model from the 5 labeled movie reviews.

(3 points) Write down all the necessary values and probabilities. Keep it in fractions.

(3 points) Calculate classification of a new test review: weird, stupid, love, movie

(1 points) Assume classification of much longer reviews. What problem could occur and how to solve it?

(1 point) What is a word "embedding" and what are the two main approaches to obtain it?

(1 points) What is a "one-hot" vector? How does it relate to the embedding and "symbolic" representation of a word?

(1 point) How does word2vec relate to the neural models?

(1 point) What are some advantages and interesting properties of word embeddings?

(1 points) What is TF-IDF and what is it good for?

(1 points) How exactly is TF-IDF calculated?

(2 points) Using a diagram, describe the main idea(s) behind the *Transformer* model. What is the core component?

(1 point) What is the main difference of Transformers from RNNs, and what are the advantages?

(1 point) Does a Transformer model take the input word order into account? If not, why? If so, how?

Consider the generalization algorithm for learning conjunctions working in the mistake-bound learning model.

(2 points) Prove that the algorithm's mistake bound is $n + 1$ where $n$ is the dimension of the example tuples.
(1 point) How can we modify the algorithm to learn $k$-CNFs where $k \in \mathbb{N}$ is a constant?
(4 points) Consider the modified algorithm for $k = 2$ and assume the examples:

a) $x_1 = (1, 1)$ with label $y_1 = 1$

b) $x_2 = (0, 1)$ with label $y_2 = 0$

c) $x_3 = (1, 0)$ with label $y_3 = 1$

d) $x_4 = (0, 0)$ with label $y_4 = 1$

Write down the initial hypothesis $h_0$ and the hypotheses $h_1, \dots h_4$ following each of the examples.

Assuming the target concept is a 2-CNF, can we claim that the final hypothesis describes the concept?

Solution 1.

Solution 2.

Solution 3.

Let $C, C'$ be two concept classes such that $C' \subseteq C$. Decide the correctness of each of the statements:

If true, justify why. If false, provide a counter-example.

If $C$ is PAC-learnable then so is $C'$.

If $C$ is efficiently PAC-learnable then so is $C'$.

If $C$ is efficiently properly PAC-learnable then so is $C'$

Consider an active reinforcement learning algorithm. You are not told whether it is an instance of SARSA or Q-learning. The implementation met all convergence criteria. All plots shown below are related to the same state-action pair Q-value, i.e., $\hat{Q}(s,a)$. The action a is suboptimal in state s. The used explore-exploit policy was the $\epsilon$-greedy policy, i.e., with probability $\epsilon$ a random action is selected; otherwise, the agent behaves greedily. Now, consider four different situations of learning Q-values over 100000 episodes. The plots show how estimates of $Q(s,a)$ change with the number of episodes, i.e., point (50,000, 8.7) in the plot means that after 50,000 episodes, $\hat{Q}(s,a)$ was 8.7.

Match Figures a-d to Figures 1-4. Explain your decision.
The $\epsilon$ was set as a function of the number of visits of state s. Relate the following four functions to Figures 1-4. Explain your choice.

$$ \epsilon(n_{s})=\frac{8}{7+n_{s}} \quad \epsilon(n_{s})=\frac{3}{2+n_{s}} \quad \epsilon(n_{s})=\frac{100}{99+n_{s}} \quad \epsilon(n_{s})=\frac{1000}{999+n_{s}}$$

Decide whether the following statement is true or false:

If a policy is greedy with respect to its own value function $V^{\pi}$, then this policy is an optimal policy. Briefly explain your answer; there is no need for a formal proof, one or two-sentence answer is enough.

Solution

Provide the update rule for the Q-learning and SARSA algorithms. Then, answer the questions below.

What is the difference between the on-policy and off-policy algorithms?
Decide whether the Q-learning and SARSA algorithms belong to the on-policy or off-policy category.
Which of these algorithms will converge to an optimal solution even in the presence of an adversarial man-in-the-middle between the agent and the environment, who occasionally alters the agent's chosen action? Assume that all other convergence criteria (GLIE, etc.) are met.

Consider the network below. Calculate the probability $P(A=1|B=0,E=1)$.

In this question, assume the following corpus of 5 sentences:

I love this test
I test this love
this test I love
this love I test
I love this

For simplicity, denote the start and end of each sentence with the same special symbol <s>. Treat it like any other word.

(4 points) Create a Markov (bi-gram) probabilistic language model from the given corpus. It is sufficient to fill in only the values necessary to solve the sub-task 3.
(2 points) Apply Laplace (add-one) smoothing to the model and write down all the necessary calculations.
(2 points) Using the smoothed model, calculate probability of the sentence “I test this”, including its start and end.
(1 points) Assume a sentence “I love this dog”. How would you alter the model to assign some probability here?
(1 points) Assume comparing probabilities of much longer sentences. What problem could occur and how to solve it?

(3 point) Describe the Skip-Gram with Negative Sampling (SGNS) variant of the word2vec algorithm.

(1 point) What is the training objective, learnable parameters, and hyperparameters?

(1 point) What is (unsupervised) topic modeling in NLP?

(1 point) What are the inputs and outputs of such a topic model?

(2 point) Using a diagram, describe a (high-level) structure of an LSTM cell.

(2 point) What are neural "gates"? What is the common idea and how is it implemented?

Consider the following decision tree:

(1 point) Express the tree as a 3-DNF.

(1 point) Express the tree as a 3-CNF.

(2 points) Can we use (modify) the generalization algorithm to learn k-decision trees in the mistake bound learning model? If we can, explain how. Will we also learn efficiently? If we cannot, explain why. Does it also mean that we cannot learn k-decision trees in the PAC learning model?

Prove that the VC dimension of monotone conjunctions on n variables is n.

Consider the deterministic world below (part $(a)$). Allowable moves are shown by arrows, and the numbers indicate the reward for performing each action. If there is no number, the reward is zero. Given the $Q$ values in $(b)$, show the changes in the $Q$ estimates when the agent takes the path shown by the dotted line (the agent starts in the lower left cell) when $\gamma = 0.5$ and $\alpha = 0.5$. Show all of your work including intermediate steps.

Solution

Consider a version-space agent whose version space contains all non-contradictory conjunctions on 3 propositional variables.

(3 points) How large will the initial (i.e., before seeing any observation) version space be? Explain your reasoning.

(2 points) What is the maximal possible size of it after the first update, assuming the first class decision was wrong? Explain your reasoning.

Describe the difference between First-Visit and Every-Visit Monte Carlo Policy Evaluation. Which of these estimators is biased, and which is unbiased? Are they consistent?

Provide the formal statistical definitions of a biased/unbiased estimator and a consistent estimator.

Define the Multi-Armed Bandit problem. What are the key mathematical and conceptual differences between a MABP setting and a standard MDP?

What is UCB in MABP? Equations are not needed.

What is the learning rate $\alpha$ discount factor $\gamma$, the parameter $\epsilon$? Which of these parameters is domain independent? Which of them can have negative values? Which parameters are not used by passive RL agents (TD)?

Explain GLIE and Robbins-Monro.

What is DQN and Double Q-Learning?

How is bayesian sampling and Thompson sampling used for MABP?

Explain how to construct a term-context matrix. If you are given a corpus and told to use a context window of $\pm 1$ (one word to the left, one word to the right), exactly what values are you counting to fill in the cells of the matrix?

The matrix is constructed to reflect the meaning of the word based on the company they keep. The matrix has dimensions $|V| \times |V|$, $|V|$ is the size of the vocabulary.

The rows are the target words $w_i$, the columns are the context words $c_j$.

We construct it by scanning through the document and looking left and right by the number of context-window words. So $\pm 1$ in this case. We add one to the column-row combination of words (it reflects raw counts).

For example, for the sentence $\text{To be or not to be}$, the matrix is as following: