Symbolické strojové učení
Sample exam 1
Question 1. (10 points)
In the following Markov decision process (MDP), there are four states $s_1, s_2, s_3, s_4$. States $s_3$ and $s_4$ are terminal, reward for entering $s_3$ is $+100$ and $-100$ in $s_4$. In states $s_1$ and $s_2$, there is action FW, which leads to the next state ($s_2$ and $s_3$, respectively) with 0.9 probability. With 0.1 probability the action FW leads to state $s_4$. In $s_2$, action BW leads to state $s_1$ with 1.0 probability. In states $s_1$ and $s_2$, there is action exit that leads to $s_4$.
Let $\gamma \in [0, 1]$ be a discount factor (treat it as a parameter, do not assume any particular value), $V$ the state utility, and $V^\pi$ the value of a state under a policy $\pi$.
(2 points) Calculate value $V(s_1)$. Treat discount factor $\gamma$ as a parameter of $V(s_1)$.
(2 points) Let $\gamma' > \gamma$. Will $V(s_1)$ be higher in a different scenario where the discount factor is $\gamma'$? Justify your answer.
(1 point) Consider a passive temporal difference based reinforcement learning agent that in the first epoch of the learning encounters sequence of states and actions $s_1$, FW, $s_2$, BW, $s_1$, FW, $s_2$, FW, $s_3$. Initially, the values of states $s_1$ and $s_2$ are 0, the value of $s_3$ is 100, and the value of $s_4$ is -100. Which estimates of $V^\pi$ will change after the first epoch of the learning?
(2 points) Assume the scenario from the previous point. Consider two learning rate values $\alpha_1$ and $\alpha_2$ such that $\alpha_1 < \alpha_2$. Compare $V(s_2)$ in those two cases and justify your answer. Do the same for the TD-based estimates of $V^\pi(s_2)$.
(3 points) Unlike the passive TD-agent, SARSA and Q-learning are able to learn actively under some explore/exploit policy $\pi$. Assume that an adversarial has given those two algorithms a policy that selects an action completely randomly no matter what the $Q$-values estimates are. Is it possible that SARSA or Q-learning agent still learns the true $Q$-values, i.e., $Q(s, a)$? Is it possible that one of those algorithms will play optimally even with the replaced policy? Justify your answers. You may use the running example with action exit to simplify your explanation.
Solution 1.
$V(s_3) = 100 \quad V(s_4) = -100 \quad R(s_1) = R(s_2) = 0$
$V(s) = \text{max}_a\left[R(s) + \gamma \sum P(s'|s,a)V(s')\right]$
For $V(s_2)$ assuming optimal play:
FW: $0 + \gamma \cdot (0.9 \cdot V(s_3) + 0.1 \cdot V(s_4)) = \gamma \cdot (90 - 10) = 80 \gamma$
BW: $0 + \gamma \cdot (1.0 \cdot V(s_1)) = \gamma V(s_1)$
exit: $0 + \gamma \cdot (1.0 \cdot V(s_4)) = - 100 \gamma$
$V(s_2) = \max\left\{80\gamma, \gamma V(s_1)\right\}$
Then $V(s_1)$:
If FW is optimal in $s_2$ then:
$V(s_2) = 80 \gamma \quad V(s_1) = \max \left\{-100\gamma, 72\gamma^2 - 10\gamma \right\}$
$72\gamma^2 - 10\gamma \geq - 10 \gamma \geq -100 \gamma$ as $\gamma \in \left[0 , 1\right]$
Thus $V(s_1) = 72\gamma^2 - 10\gamma$
Value of FW in $s_2$ is $80\gamma$, value of BW in $s_2$ is $\gamma V(s_1) = 72 \gamma ^ 3 - 10 \gamma ^2$, then $80\gamma \geq 72 \gamma ^ 3 - 10 \gamma ^2$ for all $\gamma \in \left[0 , 1\right]$. The assumption holds.
If BW is optimal in $s_2$ then:
$V(s_2) = \gamma V(s_1) \quad V(s_1) = \max \left\{-100\gamma, 0.9 \gamma ^2 V(s_1) - 10 \gamma\right\}$
$V(s_1) = 0.9 \gamma ^2 V(s_1) - 10 \gamma \dots \rightarrow V(s_1) = \frac{-10 \gamma}{1 - 0.9 \gamma ^2}$
The denominator is positive, and the numerator negative, thus $V(s_1)$ will be negative. This is smaller than the value of $80\gamma$, so the agent would not take the action BW.
Solution 2.
Not necessairly, for small values, the quadratic formula in 1) will yield smaller values in larger $\gamma$.
If $\gamma = 0$ and $\gamma ' = 0.1$ then $V(s_1,\gamma) = 0 > V(s_1,\gamma') = 72 \cdot 0.1^2 - 1 = -0.28$
Solution 3.
A passive TD agent updates using the rule:
$V^\pi (s_t) \leftarrow V^\pi (s_t) + \alpha (r_t + \gamma V^\pi (s_{t+1}) - V^\pi (s_t))$
as initial values for $s_1$ and $s_2$ are 0 then:
$V^\pi (s_1) \leftarrow 0 + \alpha (0 + \gamma 0 - 0)$, does not change
$V^\pi (s_2) \leftarrow 0 + \alpha (0 + \gamma 0 - 0)$, does not change
$V^\pi (s_1) \leftarrow 0 + \alpha (0 + \gamma 0 - 0)$, does not change
$V^\pi (s_2) \leftarrow 0 + \alpha (0 + \gamma 100 - 0)$, $V(s_2)$ is updated to $(100 \alpha \gamma)$
Solution 4.
The true values are the same independent of the learning rate (based on the Bellman equations).
The TD-based estimate will be larger, for the larger $\alpha_2$, as the target $100\gamma$ is positive, multiplying by a larger positive number yields larger positive value.
Solution 5.
Q-learning is an off-policy algoritm, it evaluates the optimal policy regardless of the actions executed by the running policy (using the update rule $\text{max}_a(Q(s_{t+1}, a)$). The random policy gurantees that all state-action pair will be visited infinitely often, Q-learning trhen satisfies the conditions needed to converge to the optimal Q-values.
SARSA is an on-policy method, it evaluates the actions actually taken by the policy. It will learn the values for the random policy, not the optimal Q-values, penalizing the states $s_1$ and $s_2$, since the random policy selects the exit action frequently.
We cannot play optimally, Q-learning will learn the optimal Q-values, but cannot execute optimal play, as it is forced to play completely randomly.
Question 2. (2 points)
1. (2 points) Let $\pi$ be some policy. Consider the temporal-difference policy-evaluation algorithm, which uses the following update rule:
$$V^\pi(s_t) := V^\pi(s_t) + \alpha_t \cdot (r_{i,t} + \gamma \cdot V^\pi(s_{t+1}) - V^\pi(s_t)).$$
Which two conditions should the numbers $\alpha_t$ satisfy to guarantee convergence to the true value of the state-value function $V^\pi$ (i.e. to the true value of the state-value function w.r.t. the policy $\pi$)?
Solution
They must satisfy the Robbins-Monro conditions:
Question 3. (5 points)
Consider the Bayesian network structure below:
(2 points) Assuming that the random variables $X_3, X_6$ are three-valued and the rest are all binary, how many parameters are needed in order to fully specify the joint probability distribution?
(3 points) Let $\mathcal{R}$ denote a subset of the network's random variables. Find the largest $\mathcal{R}$ (excluding evidence) such that
a) $X_1 \perp\!\!\!\perp \mathcal{R} \mid X_8$
b) $\{X_1, X_8\} \perp\!\!\!\perp \mathcal{R} \mid X_4$
c) $X_5 \perp\!\!\!\perp \mathcal{R} \mid \emptyset$
Question 4. (10 points)
In this question on Naive Bayes modelling, assume the following training corpus of 5 simplified movie “reviews”:
fun, movie, love, stupid $\rightarrow$ positive
stupid, movie, weird, boring $\rightarrow$ negative
love, weird, movie, fun $\rightarrow$ positive
boring, love, fun, movie $\rightarrow$ positive
stupid, love, movie, boring $\rightarrow$ negative
Answer to the following questions:
(1 points) What is the link between the Naive Bayes model and language models?
The Naive Bayes model calculates the probability of a class $c$ given document $d$ using the relationship
$P(c|d) \propto P(d|c) P(c)$
$P(d|c)$ represents the prob. of generating the exact sequence of words in the document, assuming class $c$. In Naive Bayes every word is conditionally independend of the others, it calcualtes this prob. by multiplying the individual word probs $\prod P(w_i|c)$
This is same as the unigram language model for the class $c$, which is defined as $P(w_1 w_2 \dots w_n)=\prod_i P(w_i)$.
(2 points) Write down its formula and design the structure of a Naive Bayes model from the 5 labeled movie reviews.
The formula for the Naive Bayes is:
$c_{NB} = \text{argmax}_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i|c_j)$
or in log space (to avoid multiplying small numbers and prevent underflows):
$c_{NB} = \text{argmax}_{c_j \in C} \left[ \text{log} P(c_j) + \sum_{i \in \text{positions}} \text{log} P(x_i|c_j) \right]$
The decision rule is to choose the class that maximizes the posterior probability.
(3 points) Write down all the necessary values and probabilities. Keep it in fractions.
$V = \left[ \text{fun}, \text{movie}, \text{love}, \text{stupid}, \text{weird}, \text{boring} \right] \quad |V| = 6$
$\mathcal{P} := \text{positive} \quad \mathcal{N} := \text{negative} \quad P(\mathcal{P}) = \frac{3}{5} \quad P(\mathcal{N}) = \frac{2}{5}$
$P(\text{fun}|\mathcal{P}) = \frac{3}{12} = \frac{1}{4}$
$P(\text{movie}|\mathcal{P}) = \frac{3}{12} = \frac{1}{4}$
$P(\text{love}|\mathcal{P}) = \frac{3}{12} = \frac{1}{4}$
$P(\text{stupid}|\mathcal{P}) = \frac{1}{12}$
$P(\text{weird}|\mathcal{P}) = \frac{1}{12}$
$P(\text{boring}|\mathcal{P}) = \frac{1}{12}$
$P(\text{fun}|\mathcal{N}) = 0$
$P(\text{movie}|\mathcal{N}) = \frac{2}{8} = \frac{1}{4}$
$P(\text{love}|\mathcal{N}) = \frac{1}{8}$
$P(\text{stupid}|\mathcal{N}) = \frac{2}{8} = \frac{1}{4}$
$P(\text{weird}|\mathcal{N}) = \frac{1}{8}$
$P(\text{boring}|\mathcal{N}) = \frac{2}{8} = \frac{1}{4}$
(3 points) Calculate classification of a new test review: weird, stupid, love, movie
To classify the test review: weird, stupid, love, movie we calculate score for each class:
$S(\mathcal{P}) = P(\mathcal{P}) \prod_i P(w_i | \mathcal{P}) = \frac{3}{5} \cdot \frac{1}{12} \cdot \frac{1}{12} \cdot \frac{1}{4} \cdot \frac{1}{4} = \frac{3}{11520} = \frac{1}{3840}$
$S(\mathcal{N}) = P(\mathcal{N}) \prod_i P(w_i | \mathcal{N}) = \frac{2}{5} \cdot \frac{1}{8} \cdot \frac{1}{4} \cdot \frac{1}{8} \cdot \frac{1}{4} = \frac{2}{5120} = \frac{1}{2560}$
$S(\mathcal{N}) \gt S(\mathcal{P})$, Naive Bayes will classify this class as a negative review.
(1 points) Assume classification of much longer reviews. What problem could occur and how to solve it?
With longer reviews, we multiply numbers that are close to zero, which could lead to underflows. We then calculate the probabilities in log space. Logarithm is an increasing function, so the classification will lead the same classification results, as the raw probabilities.
We can also stumble upon a completely unseen word, which will have a probability of zero, making the whole sentence have the probability of zero. Solution to this problem is smoothing.
The problem with longer reviews is also that the Naive Bayes (or unigram model) does not take the context of the word, that can span the whole sentence in normal language, like:
It only calculates the absolute counts in each class, without positional awareness. But this is not directly the point of the question.
Question 5. (4 points)
(1 point) What is a word "embedding" and what are the two main approaches to obtain it?
A word embedding is a representation of a word using a (dense) vector, in lower dimensional space. In this space words with similar meaning have similar representations (like vectors in N-D space with same direction for similar words).
We can obtain those embedding by using a predictive / neural model for predicting a word from the context or context from word (eg. word2vec). Or by using matrix factorization models (LSA, SVD, GloVe).
(1 points) What is a "one-hot" vector? How does it relate to the embedding and "symbolic" representation of a word?
A one-hot vector is a sparse vector with lenght $|V|$, where all indices are set to $0$ apart from the index that corresponds to the word index in the vocabilary, which gets a $1$. This representation is purely symbolic, as it treats every word as an independent symbol with no context.
(1 point) How does word2vec relate to the neural models?
Word2vec utilizes a shallow neural network (we train using SGD, and logistic regression), to learn the word embeddings. By learning to predict a word on the surrounding context (continuous bag of word), or predicting surrounding context (skip-gram) the network is able to learn the semantic relationships. The final embeddings are the weights extracted from the learned parameters.
(1 point) What are some advantages and interesting properties of word embeddings?
We reduce the dimenionality of the data, into some reasonable space (instead of requiring tons of sparse dimensions represented by one-hot encoded data).
We can relate the data, such as car and PC are closer than car and apple.
There are certain interesting and convenient geometrical relations:
$\overrightarrow{king} - \overrightarrow{man} + \overrightarrow{woman}$ is close to $\overrightarrow{queen}$
$\overrightarrow{Paris} - \overrightarrow{France} + \overrightarrow{Italy}$ is close to $\overrightarrow{Rome}$
Question 6. (2 points)
(1 points) What is TF-IDF and what is it good for?
TF-IDF mean Term frequency inverese document frequency. We calculate how much the term appears in the single document (normalized by the number of terms in the document. While the inverse document frequency is how opten the select term appears accross all the other documents (binary for each document it appears in).
Using TF-IDF we are able to separate words, which add no meaning to the individual documents (like the, an, a) from terms which appear in the document and are significant in it (like the, appears in each document many times, but Romeo appears many times only in the Romeo & Juliet, so it must be significant in this specific document).
(1 points) How exactly is TF-IDF calculated?
given $N = |\mathcal{D}|$, $\mathcal{D}$ is the document corpus
$\text{TD-IDF}(t, d, \mathcal{D}) = \text{TF}(t,d) \cdot \text{log}(\frac{N}{d\in \mathcal{D}: t\in d})$
where TF is simply the count of the word in each document
$\text{TF}(t,d)=count(t,d)$
or calculated logarithmically and smoothed:
$\text{TF}(t,d)=\text{log}(count(t,d)+1)$
Question 7. (4 points)
(2 points) Using a diagram, describe the main idea(s) behind the *Transformer* model. What is the core component?
We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text, implemented using sine and cosine functions to encode positions), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words.
The core component is the self-attention block (or the multi-head version). It helps the transformer learn the relationships between word pairs.
(1 point) What is the main difference of Transformers from RNNs, and what are the advantages?
For RNNs, they process the text strictly sequentially, one word at a time, which makes them easier to forget the beginning of the sentence. The self-attention layer makes it possible fo the other words in the sentence to “look at” the other words simultaneously.
The distant word relationships are often important, but learning them leads to problems with vanishing gradients.
(1 point) Does a Transformer model take the input word order into account? If not, why? If so, how?
Yes it does, by using the positional information in the word embeddings and the self-attention layer. As the self-attention layer processes everything in parallel, the positional information is required so the model does not loose the information.
Question 8. (7 points)
Consider the generalization algorithm for learning conjunctions working in the mistake-bound learning model.
(2 points) Prove that the algorithm's mistake bound is $n + 1$ where $n$ is the dimension of the example tuples.
(1 point) How can we modify the algorithm to learn $k$-CNFs where $k \in \mathbb{N}$ is a constant?
(4 points) Consider the modified algorithm for $k = 2$ and assume the examples:
a) $x_1 = (1, 1)$ with label $y_1 = 1$
b) $x_2 = (0, 1)$ with label $y_2 = 0$
c) $x_3 = (1, 0)$ with label $y_3 = 1$
d) $x_4 = (0, 0)$ with label $y_4 = 1$
Write down the initial hypothesis $h_0$ and the hypotheses $h_1, \dots h_4$ following each of the examples.
Assuming the target concept is a 2-CNF, can we claim that the final hypothesis describes the concept?
Solution 1.
The initial hypothesis $h$ containing $2n$ literals is tautologically false, on the first positive example (which we will predict as negative) we delete $n$ literals from $h$. Resulting in $|h| = n$
If a literal is in $c$, we never delete it from $h$, $c \subseteq h$ (the literal will never be false on a positive example, so it is never deleted)
In any subsequent mistakes, we remove at least one literal (at least $n$)
The max number of mistakes (before we learn the concept or end up with an empty hypothesis) is then $n + 1 \leq \text{poly}(n)$
Solution 2.
For every possible k-CNF we add a new clause ($z_i$ for example), which we will then learn using the standard generalization algorithm for monotone conjunctions ($h_0 = \land_{i}z_i$).
For $n$ variables, there are $n'=\sum_{i=1}^k \binom{n}{i} 2^i \leq \text{poly}(n)$ (for clause size, choosing $i$ variables, and positive or negative). We can use the gen. algorithm.
After we learn the $z_i$'s we can then get back the original clauses.
Solution 3.
We create the new clauses:
$z_1 = v_1$
$z_2 = \neg v_1$
$z_3 = v_2$
$z_4 = \neg v_2$
$z_5 = v_1 \lor v_2$
$z_6 = v_1 \lor \neg v_2$
$z_7 = \neg v_1 \lor v_2$
$z_8 = \neg v_1 \lor \neg v_2$
$h_0 = z_1 \land z_2 \land z_3 \land z_4 \land z_5 \land z_6 \land z_7 \land z_8$
$x_1$:
$v_1 = 1$, $v_2 = 1$, $y_1 = 1$
we remove $z_2 = \neg v_1 = 0$ and $z_4 = \neg v_2 = 0$ and $z_8 = \neg v_1 \lor \neg v_2 = 0$
$h_1 = z_1 \land z_3 \land z_5 \land z_6 \land z_7$
$x_2$:
$v_1 = 0$, $v_2 = 1$, $y_2 = 0$
$z_1 = v_1 = 0$, the whole conjunction is false, which is the correct prediction, we do not remove anything
$h_2 = z_1 \land z_3 \land z_5 \land z_6 \land z_7$
$x_3$:
$v_1 = 1$, $v_2 = 0$, $y_3 = 1$
we remove $z_3 = v_2 = 0$ and $z_7 = \neg v_1 \lor v_2 = 0$
$h_3 = z_1 \land z_5 \land z_6$
$x_4$:
$v_1 = 0$, $v_2 = 0$, $y_4 = 1$
we remove $z_1 = v_1 = 0$ and $z_5 = v_1 \lor v_2 = 0$
$h_4 = z_6$
We learned the hypothesis $h = z_6 = v_1 \lor \neg v_2$ which classifies $x_1, x_3, x_4$ as true and $x_2$ as false.
The final hypothesis describes the target concept, as it correctly described the concept on all instances from the entire instance space $X$.
Question 9. (6 points)
Let $C, C'$ be two concept classes such that $C' \subseteq C$. Decide the correctness of each of the statements:
If true, justify why. If false, provide a counter-example.
If $C$ is PAC-learnable then so is $C'$.
Yes
The concept class $C$ is PAC-learnable, iff $\text{VC}(C) \leq poly(n)$.
If $\text{VC}(C)$ is polynomial, it could be shattered by a polynomial amount of points. That means that any subset of instances can be shattered by at most $\text{VC}(C)$ points.
So $\text{VC}(C') \leq \text{VC}(C) \leq \text{poly}(n)$.
That means $C'$ is also PAC-learnable.
If $C$ is efficiently PAC-learnable then so is $C'$.
Yes
For an algorithm to be efficiently PAC-learnable it must run in $\text{poly}( \frac{1}{ \epsilon }, \frac{1}{ \delta }, n)$.
If we use the same algorithm that learns $C$ to also learn $C'$, we could take $C' \subseteq C$ and use the original algorithm to learn $C'$ in the same time. It does not guarantee proper learning (typically $H \supseteq C$).
If $C$ is efficiently properly PAC-learnable then so is $C'$
No
Let $C$ be the class of 3-CNF and $C'$ the class of 3-term DNF.
Any k-term DNF can be multiplied out and rewritten as an equivalent k-CNF. Meaning $C'\subseteq C$.
$C$ is efficiently properly PAC-learnable, in polynomial time via the generalization algorithm
For the algorithm to properly learn $C'$ it would need to output a hypothesis strictly in the form of 3-term DNF ($H = C'$), which is NP-hard as it reduces to graph 3-coloring problem.
$C'$ cannot be efficiently probably PAC-learnable, which disproves the statement.
Sample exam 2
Question 1. (4 points)
Consider an active reinforcement learning algorithm. You are not told whether it is an instance of SARSA or Q-learning. The implementation met all convergence criteria. All plots shown below are related to the same state-action pair Q-value, i.e., $\hat{Q}(s,a)$. The action a is suboptimal in state s. The used explore-exploit policy was the $\epsilon$-greedy policy, i.e., with probability $\epsilon$ a random action is selected; otherwise, the agent behaves greedily. Now, consider four different situations of learning Q-values over 100000 episodes. The plots show how estimates of $Q(s,a)$ change with the number of episodes, i.e., point (50,000, 8.7) in the plot means that after 50,000 episodes, $\hat{Q}(s,a)$ was 8.7.
Match Figures a-d to Figures 1-4. Explain your decision.
The $\epsilon$ was set as a function of the number of visits of state s. Relate the following four functions to Figures 1-4. Explain your choice.
$$ \epsilon(n_{s})=\frac{8}{7+n_{s}} \quad \epsilon(n_{s})=\frac{3}{2+n_{s}} \quad \epsilon(n_{s})=\frac{100}{99+n_{s}} \quad \epsilon(n_{s})=\frac{1000}{999+n_{s}}$$
Question 2. (2 points)
Decide whether the following statement is true or false:
If a policy is greedy with respect to its own value function $V^{\pi}$, then this policy is an optimal policy. Briefly explain your answer; there is no need for a formal proof, one or two-sentence answer is enough.
Solution
True
If a policy is greedy with respect to its own value function, it means that applying a policy improvement step would result in no changes to the policy. Policy iteration algorithm converges on the optimal policy, when we stop updating we found the optimal policy.
So the policy is optimal.
Question 3. (5 points)
Provide the update rule for the Q-learning and SARSA algorithms. Then, answer the questions below.
What is the difference between the on-policy and off-policy algorithms?
Decide whether the Q-learning and SARSA algorithms belong to the on-policy or off-policy category.
Which of these algorithms will converge to an optimal solution even in the presence of an adversarial man-in-the-middle between the agent and the environment, who occasionally alters the agent's chosen action? Assume that all other convergence criteria (GLIE, etc.) are met.
Solution
SARSA update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_{t}, a_{t}))$
Q-Learning update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha(r_t + \gamma \text{max}_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t))$
Solution 1.
On-policy algorithms require the sample to come from a specific policy that the algorithm is trying to learn.
Off-policy algorithms do not have such reqirements, they can learn the optimal policy using samples coming from a different policy.
Solution 2.
SARSA is on-policy, Q-Learning is off-policy.
Solution 3.
Q-Learning will converge to the optimal solution, as it is always taking the best possible next action; making the process independent on which actual action was executed by the environment.
Question 4. (5 points)
Consider the network below. Calculate the probability $P(A=1|B=0,E=1)$.
Question 5. (10 points)
In this question, assume the following corpus of 5 sentences:
I love this test
I test this love
this test I love
this love I test
I love this
For simplicity, denote the start and end of each sentence with the same special symbol <s>. Treat it like any other word.
(4 points) Create a Markov (bi-gram) probabilistic language model from the given corpus. It is sufficient to fill in only the values necessary to solve the sub-task 3.
(2 points) Apply Laplace (add-one) smoothing to the model and write down all the necessary calculations.
(2 points) Using the smoothed model, calculate probability of the sentence “I test this”, including its start and end.
(1 points) Assume a sentence “I love this dog”. How would you alter the model to assign some probability here?
(1 points) Assume comparing probabilities of much longer sentences. What problem could occur and how to solve it?
Solution 1.
The vocabulary is $V = \left[\text{<s>}, \text{I}, \text{love}, \text{this}, \text{test}\right] \quad |V| = 5$
$c(\text{<s>}) = 10$
$c(\text{I}) = 5$
$c(\text{love}) = 5$
$c(\text{this}) = 5$
$c(\text{test}) = 4$
We need to evaluate the sentece <s>I test this<s>, so we need the probabilities for the bigrams. The counts for the bigrams is as follows:
$c(\text{<s>}, \text{I}) = 3$
$c(\text{I}, \text{test}) = 2$
$c(\text{test}, \text{this}) = 1$
$c(\text{this}, \text{<s>}) = 1$
Using MLE: $P(w_i|w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
$P(\text{I}|\text{<s>}) = \frac{3}{10}$
$P(\text{test}|\text{I}) = \frac{2}{5}$
$P(\text{this}|\text{test}) = \frac{1}{4}$
$P(\text{<s>}|\text{this}) = \frac{1}{5}$
Solution 2.
We add one as we pretend we saw each combination one more time.
$P_{Add-1}(w_i|w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + |V|}$
Then:
$P(\text{I}|\text{<s>}) = \frac{3 + 1}{10 + 5} = \frac{4}{15}$
$P(\text{test}|\text{I}) = \frac{2 + 1}{5 + 5} = \frac{3}{10}$
$P(\text{this}|\text{test}) = \frac{1 + 1}{4 + 5} = \frac{2}{9}$
$P(\text{<s>}|\text{this}) = \frac{1 + 1}{5 + 5} = \frac{1}{5}$
Solution 3.
The probability of the sentence is just the product of the prababilities of bigrams.
$P(\text{<s>I test this <s>}) = P(\text{I}|\text{<s>}) \cdot P(\text{test}|\text{I}) \cdot P(\text{this}|\text{test}) \cdot P(\text{<s>}|\text{this}) = \frac{4\cdot 3\cdot 2 \cdot 1}{15 \cdot 10 \cdot 9 \cdot 5} = \frac{24}{6750} = \frac{4}{1125} \approx 0.0035$
Solution 4.
We can introduce a new token $\text{<UNK>}$, during training we would replace words which we have not seen with this token.
We could also expand the vocabulary to six words and with the Laplace smoothing add one to this unseen word.
Solution 5.
We could face underflow problems. The solution is to compute everything in log space and to just add the probabilities.
The problem also is that the model may have problem with the meaning of the sentence, since the word may have a relationship with a word on the other end of the sentence, which we do not take into account.
Question 6. (4 points)
(3 point) Describe the Skip-Gram with Negative Sampling (SGNS) variant of the word2vec algorithm.
Word2vec is self supervised prediction based model, it learns the dense vector embeddings by learning a binary classifier using logistic regression.
It takes positive examples by scanning the text and pairing a target word with a context word within a specific context window.
For every positive example it randomly sample k noise words from the lexicon based on their frequency to create negative sample.
The classifier is trained by calculating a similarity between them by calculating the dot product of the word embeddings and passing it through the sigmoid function (to get a probability).
It is trained using SGD and after training it discards the classifier and the learned weights are the final word embeddings.
(1 point) What is the training objective, learnable parameters, and hyperparameters?
Objective is to maximize similarity between target and context words while minimizing the similarity between the target and the negative examples. (using cross entropy loss)
Learnable parameters are the embedding weights (W for target words and C for context/noise words, in the final representation they are often added together).
Hyperparameters are the context window size, number of negative samples per positive sample and the dimensionality of the initial random embedding vectors.
Question 7. (2 points)
(1 point) What is (unsupervised) topic modeling in NLP?
For models that are designed to discover hidden topics or themes in a collection of documents. Works by identifying clusters of terms that tend to occur together. Unsupervised means, that it does not require labeled data, it relies on probabilistic generative processes with unobserved variables to find out the topics.
(1 point) What are the inputs and outputs of such a topic model?
Input is a set of documents, the corpus and a number of topics to learn.
The output are the learned topics, the topic distribution in each document (which parts of the document are a which topic) and the topic distribution for each word (which topic is responsible for which word).
Question 8. (4 points)
Question 9. (4 points)
Question 10. (9 points)
Prove that the VC dimension of monotone conjunctions on n variables is n.
$Y \leq X$: $Y$ shattered iff $ \quad \forall \text{labelling} \quad \exists \text{hypothesis}$
$VC(\mathcal{H}) = \text{max}(Y| Y \text{is shattered})$
To prove that VC dimension of monotone conjunctions of $n$ variables is $n$, we need to show that:
it is $\geq n$
it is $\lt n + 1$
For $VC(\mathcal{H}) \geq n$:
$x_1 = 0 \quad 1 \quad 1 \quad \dots \quad 1$
$x_2 = 1 \quad 0 \quad 1 \quad \dots \quad 1$
$x_n = 1 \quad 1 \quad 1 \quad \dots \quad 0$
$\mathcal{I} \subseteq \left[1, 2, \dots n\right]$
we then take the conjunction of all that are not in $\mathcal{I}$.
If $\mathcal{I} = \left[1, 2\right] \rightarrow h = p_3 \land p_4 \land \dots \land p_n$
If $\mathcal{I} = 0 \rightarrow h = \land_{i}p_i$
If $\mathcal{I} = \left[1, 2, \dots n\right] \rightarrow h = \top$
$\displaystyle\bigwedge_{\forall i \notin \mathcal{I}} (p_i) \in \mathcal{H}$
This means we can shatter at least $n$ points.
For $VC(\mathcal{H}) \lt n + 1$:
To shatter an instance of monotone conjunctions on $n$ variables, we would need $2^{n+1}$ distinct labelings. But we only have $|\mathcal{H}| = 2^n$ total hypothesis available.
$VC(\mathcal{H}) \leq \text{lg}(|\mathcal{H}|) = n$
We thus get:
$VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$
Other questions
Question 5. (10 points)
Consider the deterministic world below (part $(a)$). Allowable moves are shown by arrows, and the numbers indicate the
reward for performing each action. If there is no number, the reward is zero. Given the $Q$ values in $(b)$, show the changes
in the $Q$ estimates when the agent takes the path shown by the dotted line (the agent starts in the lower left cell) when
$\gamma = 0.5$ and $\alpha = 0.5$. Show all of your work including intermediate steps.
Solution
$Q(q_1) = q_1 + \frac{1}{2} \cdot (0 + \frac{1}{2} \cdot \displaystyle\max_{a \in A} \left\{q_3, q_4\right\} - q_1) = 4 + \frac{1}{2} \cdot (0 + 4 - 4) = 4$
$Q(q_4) = q_4 + \frac{1}{2} \cdot (0 + \frac{1}{2} \cdot \displaystyle\max_{a \in A} \left\{q_7, q_8\right\} - q_4) = 8 + \frac{1}{2} \cdot (0 + 8 - 8) = 8$
$Q(q_7) = q_7 + \frac{1}{2} \cdot (0 + \frac{1}{2} \cdot \displaystyle\max_{a \in A} \left\{q_5, q_6\right\} - Q_7) = 4 + \frac{1}{2} \cdot (0 + 5 - 4) = 4.5$
$Q(q_6) = q_6 + \frac{1}{2} \cdot (0 + \frac{1}{2} \cdot q_9 - q_6) = 10 + \frac{1}{2} \cdot (0 + 10 - 10) = 10$
$Q(q_9) = q_9 + \frac{1}{2} \cdot (20 + 0 - q_9) = 20 + \frac{1}{2} \cdot (20 + 0 - 20) = 20$
The only change occurs in $Q(q_7)$, where the value changes from $4$ to $4.5$.
Question 3. (5 points)
Other RL
Describe the difference between First-Visit and Every-Visit Monte Carlo Policy Evaluation. Which of these estimators is biased, and which is unbiased? Are they consistent?
Both are methods for estimating the value function of a state by averaging the return from sampled episodes.
First visit only counts the first time the state is visited in an episode. It calculates return from the first occurence and updates the average, if state is visited later in the episode it is ignored.
Every visit counts every single time the state was visited in the episode, calculating separate returns and including them all in the running average.
First visit is unbiased, every visit is biased. Both are consistent.
Provide the formal statistical definitions of a biased/unbiased estimator and a consistent estimator.
The bias of an estimator $\hat{\theta}$ (for parameter $\theta$) is defined as the difference between the expected value of the estimator and the true parameter: $BIAS_\theta(\hat{\theta}) = \mathbb{E}_\theta[\hat{\theta}(X)] - \theta$.
The estimator is unbiased, if the bias is zero.
A sequence of estimators $\hat{\theta}_N(X_N)$ for sample size $N$ is considered consistent, if for every possible true parameter $\theta$ and some $\epsilon \gt 0$, the estimation approaches the true parameters with $N$ approaching infinity.
$\lim_{N \to \infty} P[|\hat{\theta}_N(X_N) - \theta| < \epsilon] = 1$
Define the Multi-Armed Bandit problem. What are the key mathematical and conceptual differences between a MABP setting and a standard MDP?
A multi armed banded is a degenerate MDP with a single state.
It has:
set $A$ containing $m$ actions - pulling an arm
reward distributions $P\left[R_t=r|A_t=a\right]$ at time $t$ given action at time $t$
at each step the agent takes an action and receives a reward sampled from the distribution for that action
the goal is to maximize the expected sum of rewards, which is equivalent to minimizing total regret (accumulated opportunity loss)
the rewards are immediate, there is no impact on the future state
the policy is irrelevant, there is no future sequence of states to navigate
What is UCB in MABP? Equations are not needed.
Upper confidence bound is an algorithm used to balance exploration and exploitation in MABP. It tracks an upper bound for each action, updating those estimates in time. It than takes the action with the highest upper bound and plays it.
What is the learning rate $\alpha$ discount factor $\gamma$, the parameter $\epsilon$? Which of these parameters is domain independent? Which of them can have negative values? Which parameters are not used by passive RL agents (TD)?
$\alpha$ is the step-size parameter, it is used in the update rules to determine how much the new values should rewrite the old values.
$\gamma$ is the discount factor, it controls how much future rewards matter compared to the immediate rewards. $\gamma \in \left[0, 1\right]$, where $0$ means only immediate rewards, $1$ means all the future rewards have the same value as the immediate ones.
$\epsilon$ is the eploration parameter used in the $\epsilon$-greedy policy, where the agent takes a random action with probability $\epsilon$.
All of the domain parameters are domain independent, they do not depend on the domain, algorithm etc.
None of them can reach negative values, $\gamma \in \left[0, 1\right]$, $\epsilon \in \left[0, 1\right]$, $\alpha \gt 0$.
$\epsilon$ is not used by the passive RL agents, as their only goal is the policy evaluation, simply observing the environment.
Explain GLIE and Robbins-Monro.
If and algorithm has the GLIE property it satisfies these conditions:
If a state $s\in S$ is visited inf. often, then each action in that state is chosen inf. often.
In the limit, the policy is greedy with respect to the learned Q-function (with prob 1), this means that our policy always selects the max action based on our learned Q-values.
If MC algorithm satisfies GLIE, it converges to the optimal state-action value function.
Robbins-Monro:
SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality.
What is DQN and Double Q-Learning?
Deep Q-Learning is a method that uses a NN to approximate the state-action Q-function in environments with large state spaces (games with 2d pixels etc.). Standard Q-Learning can easily diverge while learning due to correlated trainig samples or non stationary targets. DQN uses techniques like replay buffers and fixed target networks to solve this.
Double Q-Learning is designed to fix maximization bias. While standard Q-Learning uses the same Q-function to select the best next action and to evaluate the value, DQL has two separate Q-function estimates: one for picking the max action and the secondf to evaluate the action values.
Double DQN is the combination of both.
How is bayesian sampling and Thompson sampling used for MABP?
In MABP we are trying to estimate the hidden parameters of each arms distribution (like the success probability $\theta$ of a Bernoulli coin flip).
We start with a prior distribution for each arm and every time we pull it and observe the reward, we use Bayes law to update the posterior distribution $P(\theta|X,\alpha)$ for the pulled arm.
Thompson sampling is a specific algorithm that uses the bayesian framework to solve the exploration/exploitation tradeoff using probability matching.
it randomly samples a set of parameters from the current posterior distribution of each arm
it calculates the expected $Q(a)$ based on the parameters
it selects the arm that yielded the maximum sampled value
it observes the real reward and updates the posterior
It explores even arms that have a wide uncertainty, as it samples from a distribution and those arms occasionaly yield very high rewards.
Other NLP
Explain how to construct a term-context matrix. If you are given a corpus and told to use a context window of $\pm 1$ (one word to the left, one word to the right), exactly what values are you counting to fill in the cells of the matrix?
The matrix is constructed to reflect the meaning of the word based on the company they keep. The matrix has dimensions $|V| \times |V|$, $|V|$ is the size of the vocabulary.
The rows are the target words $w_i$, the columns are the context words $c_j$.
We construct it by scanning through the document and looking left and right by the number of context-window words. So $\pm 1$ in this case. We add one to the column-row combination of words (it reflects raw counts).
For example, for the sentence $\text{To be or not to be}$, the matrix is as following:
| Word \ Context | to | be | or | not |
| to | 0 | 2 | 0 | 1 |
| be | 2 | 0 | 1 | 0 |
| or | 0 | 1 | 0 | 1 |
| not | 1 | 0 | 1 | 0 |
Define Pointwise Mutual Information. What is the mathematical formula for PMI between two words ($w_1$ and $w_2$)?
PMI is used to determine whether two event co-occur more frequenly than if completely independent.
$PMI(X, Y) = lg \frac{P(x,y)}{P(x)P(y)}$
for words:
$PMI(w_1, w_2) = lg \frac{P(w_1,w_2)}{P(w_1)P(w_2)}$
How is PPMI calculated from standard PMI, and why do we use it in NLP?
As PMI typically ranges from $-\infty \dots + \infty$, $\gt 0$ for occuring more than by change, $\lt 0$ less then by chance, $=0$ exacly as by chance.
We only care about emphasizing the positive case.
$PPMI = PMI \iff PMI > 0, \text{else}\ 0$
PPMI emphasizes words which have a meaningful relationship, thats why it is used in NLP.
Define Perplexity in the context of probabilistic language models. Give the formula for calculating it over a sequence of words. What does a lower perplexity indicate about a language model?
Perplexity is the inverse probability of a sentence, normalized by the number of words. It measures how the model is “surprised”.
$PP(W) = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$
Chain rule:
$PP(W) = \sqrt[N]{\displaystyle\prod_{i=1}^N \frac{1}{P(w_i|w_1 w_2 \dots w_{i-1})}}$
For bigrams:
$PP(W) = \sqrt[N]{\displaystyle\prod_{i=1}^N \frac{1}{P(w_i| w_{i-1})}}$
Lower perplexity implies a better language model.
Draw a diagram of truncated SVD, explain.
Any $w \times c$ matrix $X$ equals the product of three matrices: $X = W\cdot S\cdot C$. $S$ si a singular matrix, $m$ is the rank of matrix $X$ and $m \leq \min(w, c)$
It helps to reduce the dimensionality, each of the matrices is much smaller than the original data.
Truncated SVD helps reduce the dimensionality even further, keeping only the top $l$ signular values. The result is the least squares approximation of the original X.
Explain LDA, Dirichlet dist.
Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together.
The model extracts these topics and infers the topic distribution $\theta$ for each document and the word distribution $\beta$ for each topic.
The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$
Dirichlet distribution is used as it represent a distribution over positive probability vectors that sum to one. It is a prior to the multinomial distribution, allows for nice Bayes updates.
RNN diagram, explain.
Recurrent neural networks are a class of sequential neural models that contain cycles. The process one item at a time and maintain dependency on earlier inputs and output (they can process prior context with no fixed length limit).
At each time step they calculate the current state and output based on the matrices $U, V, W$, feeding the calculations forward.
There are also bi-directional RNNs.
Naive Bayes, bag of words. Which assumptions does Naive Bayes use?
Bag of words only records the number of times each word has ocurred, not considering the order they appeared in.
Naive Bayes is a classifier, based on the bayes rule. It tries to find the MAP estimate ($P(c|d)$ of class $c$ given document $d$.
Naive Bayes does this by multiplying the prior prob of class $c$ by the likelihood of the given words ($P(x_1, x_2, \cdots x_n | c$) and ignoring the normalizing denominator.
It assumes the bag of words and conditional independence. It acts as a collection of unigram models, each class represents its own unigram model.
Uni-gram and Bi-gram models learned on the same data, which will have a lower perplexity?
Bi-gram will have a lower perplexity, since the uni-grams only evaluate the absolute probability of each word appearing, not taking into account the previous word.
Uni-grams are worse at this task, so they will have a higher perplexity.
Other COLT
Provide the general definition of the VC Dimension ($VC(\mathcal{H})$) for a hypothesis class $\mathcal{H}$. What does it mean for a set of instances to be "shattered"?
A set of instances $X' \subseteq X$ is considered shattered by a hypothesis $\mathcal{H}$ (or a concept class $\mathcal{C}$) if for every possible subset $X'' \subseteq X'$ there is a hypothesis $h \in \mathcal{H}$ that perfectly isolates it (or there is a concept $C$ in $\mathcal{C}$ such that $C \cap X' = X''$).
Finite $X'$ is shattered by $\mathcal{C}$ if it can be split by concepts in all $2^{|X'|}$ possible ways.
VC-dimension is the size of the largest subset $X$ shattered by $\mathcal{C}$:
$VC(\mathcal{C}) = \max \left\{|X'|: \mathcal{C}\ \text{shatters}\ X', X' \subseteq X \right\}$
What is the exact VC dimension for the hypothesis class of open intervals on real numbers? Provide the number and a brief justification.
The VC dimension is $\geq 2$:
(i assume the hypothesis is positive inside of the open interval, and negative outside of it, with points $x_1 \lt x_2 \lt x_3 \lt \dots \lt x_n $)
For any labeling of two points we can find a hypothesis to shatter them:
$(+, +)$ - interval with both the points inside the interval
$(+, -)$ - interval with the left point inside
$(-, +)$ - interval with the right point inside
$(-, -)$ - interval with both the points outside
So the VC-dimension is at least $2$.
For three points, we are unable to shatter a combination of points $(+, -, +)$, thus the VC-dimension is $\lt 3$
We proved that the VC-dimension is 2.
Adapt Winnow / Halving algorithm, define $\mathcal{H}$, define mistake bound learning
M-B-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$.
$\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing.
WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake:
WINNOW natively learns monotone conjunctions and disjunctions, we adapt it to learn non-monotone by introducing new variables.
Halving algorithm takes the version space consisting of all $h \in \mathcal{H}$, predicts the lable by taking majority vote among the remaining $h$, if it makes the mistake it removes all the $h$ that guessed incorrectly.
At least half of the hypotheses are eliminated each mistake, the max mistake bound is $\text{lg} |\mathcal{H}|$
To adapt it to any $C$ we simply use it with any hypothesis $\mathcal{H}$ where $\mathcal{H} \supseteq C$ and $\text{lg} |\mathcal{H}| \leq \text{poly}(n)$
What is the PAC learning model? What does it mean for something to be PAC-learnable?
Probably approximately correct learning is an offline learning model, that has i.i.d. assumptions, bounded error and failure.
Concept class $\mathcal{C}$ is PAC-learnable if an algorithm exists that can learn it while:
receiving the number of examples bounded by $m \leq \text{poly}(\frac{1}{\epsilon}, \frac{1}{\delta}, n)$
outputting the hypothesis that is approximately correct: $err(h) \leq \epsilon$
succeeding in doing so with probability of at least $1 - \delta$
If it also runs in $\text{poly}(\frac{1}{\epsilon}, \frac{1}{\delta}, n)$ time it is efficiently PAC-learnable.
Decide whether the following statements are True or False and justify your answer based on the relationship between learning models:
If a concept class is efficiently learnable in the Mistake Bound model, it is also efficiently learnable in the PAC model.
Yes.
Mistake bound learnability implies PAC-learnability.
M-B learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake.
If a concept class is efficiently learnable in the PAC model, it is also efficiently learnable in the Mistake Bound model.
No.
PAC-learnability does not guarantee M-B-learnability.
If a concept class is PAC-learnable, then $VC(\mathcal{C}) \leq \text{poly}(n)$. A concept class can have a finite VC-dimension, even when the number of concepts $|\mathcal{C}|$ is infinite. This can lead to unbounded number of mistakes, that it may no learn in M-B.
What are the k-decision lists? Are they properly and efficiently PAC-learnable?
k-DL is a decision model that consists of a structured list of k-conjunctions (each has up to k literals), each is paired with a specific class indicator followed by a final default class indicator.
When evaluating, it is assigned a class of the top-most conjunctions for which it evaluates to true.
They are both properly and efficiently PAC-learnable.
The size of the class is bounded by $\text{ln}|DL| \leq \text{poly}(n)$, it is PAC-learnable.
We can efficiently find a consistent k-DL using the covering algorithm, which guarantees the proper and efficient PAC-learnability.
What is the ERM? What is the inconsistent learning?
Inconsistent learning occurs when it is impossible for a learner to find a hypothesis that perfectly matches all the labels in the training set (eg. due to noise or it being impossible).
ERM is a principle used to handle this problem; Under ERM the learner output the hypothesis $h_{erm}$ that minimizes the empirical risk, which is the training error $\hat{erm}(h)$, which is the proportion of the training examples that are missclassified by $h$.