Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
courses:b4m36smu [2026/05/24 16:49] jpelccourses:b4m36smu [2026/05/27 09:37] (current) – [Other COLT] jpelc
Line 3: Line 3:
 {{courses:smu:smu_2023_model_exam.pdf| Sample Exam (2023)}} {{courses:smu:smu_2023_model_exam.pdf| Sample Exam (2023)}}
  
-======= Sample test =======+{{courses:smu:smu_skripta_thirtyfourth_draft.pdf| Skripta}} 
 +======= Sample exam 1 =======
  
 ===== Question 1. (10 points) ===== ===== Question 1. (10 points) =====
Line 51: Line 52:
   * exit: $0 + \gamma \cdot (1.0 \cdot V(s_4)) = - 100 \gamma$   * exit: $0 + \gamma \cdot (1.0 \cdot V(s_4)) = - 100 \gamma$
  
-optimal in $s_2$ is FW, $V(s_2) = 80\gamma$+$V(s_2) = \max\left\{80\gamma, \gamma V(s_1)\right\}$
  
 Then $V(s_1)$: Then $V(s_1)$:
-  * FW: $0 + \gamma \cdot (0.9 V(s_2) + 0.1 V(s_4)) = \gamma \cdot (0.9 \cdot 80 \gamma -10= 72\gamma^2 - 10\gamma$+  * FW: $0 + \gamma \cdot (0.9 V(s_2) + 0.1 V(s_4)) = \gamma (0.9 V(s_2) - 10)$
   * exit: $0 + \gamma \cdot (1.0V(s_4)) = -100\gamma$   * exit: $0 + \gamma \cdot (1.0V(s_4)) = -100\gamma$
  
-$72\gamma^2 - 10\gamma \geq -100\gamma$ for any $\gamma \geq 0$; the optimal action in $s_1is FW.+If FW is optimal in $s_2then:
  
-$V(s_1) = 72\gamma^2 - 10\gamma$+$V(s_2) = 80 \gamma \quad V(s_1) = \max \left\{-100\gamma, 72\gamma^2 - 10\gamma \right\}$ 
 + 
 +$72\gamma^2 - 10\gamma \geq - 10 \gamma \geq -100 \gamma$ as $\gamma \in \left[0 , 1\right]$ 
 + 
 +Thus $V(s_1) = 72\gamma^2 - 10\gamma$ 
 + 
 +Value of FW in $s_2$ is $80\gamma$, value of BW in $s_2$ is $\gamma V(s_1) = 72 \gamma ^ 3 - 10 \gamma ^2$, then $80\gamma \geq 72 \gamma ^ 3 - 10 \gamma ^2$ for all $\gamma \in \left[0 , 1\right]$. The assumption holds. 
 + 
 +If BW is optimal in $s_2$ then: 
 + 
 +$V(s_2) = \gamma V(s_1) \quad V(s_1) = \max \left\{-100\gamma, 0.9 \gamma ^2 V(s_1) - 10 \gamma\right\}$ 
 + 
 +$V(s_1) = 0.9 \gamma ^2 V(s_1) - 10 \gamma \dots \rightarrow V(s_1) = \frac{-10 \gamma}{1 - 0.9 \gamma ^2}$ 
 + 
 +The denominator is positive, and the numerator negative, thus $V(s_1)$ will be negative. This is smaller than the value of $80\gamma$, so the agent would not take the action BW.
  
 ++++ ++++
Line 217: Line 232:
 ++++ (1 points) Assume classification of much longer reviews. What problem could occur and how to solve it?| ++++ (1 points) Assume classification of much longer reviews. What problem could occur and how to solve it?|
 With longer reviews, we multiply numbers that are close to zero, which could lead to underflows. We then calculate the probabilities in log space. Logarithm is an increasing function, so the classification will lead the same classification results, as the raw probabilities. With longer reviews, we multiply numbers that are close to zero, which could lead to underflows. We then calculate the probabilities in log space. Logarithm is an increasing function, so the classification will lead the same classification results, as the raw probabilities.
 +
 +We can also stumble upon a completely unseen word, which will have a probability of zero, making the whole sentence have the probability of zero. Solution to this problem is smoothing.
  
 The problem with longer reviews is also that the Naive Bayes (or unigram model) does not take the context of the word, that can span the whole sentence in normal language, like: The problem with longer reviews is also that the Naive Bayes (or unigram model) does not take the context of the word, that can span the whole sentence in normal language, like:
Line 348: Line 365:
  
 % Layer Box % Layer Box
-\draw[thick, dashed] (-3.5, 2.5) rectangle (3.5, 7.5); +\draw[thick, dashed] (-4.0, 2.5) rectangle (3.5, 10.5); 
-\node at (4.55) [font=\sffamily\bfseries\Large] {$\times N$ Layers};+\node at (5.07) [font=\sffamily\bfseries\Large] {$\times N$ Layers};
  
 \end{tikzpicture} \end{tikzpicture}
Line 355: Line 372:
 </tikzjax> </tikzjax>
  
-We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words.+We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text, implemented using sine and cosine functions to encode positions), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words.
  
 The core component is the self-attention block (or the multi-head version). It helps the transformer learn the relationships between word pairs. The core component is the self-attention block (or the multi-head version). It helps the transformer learn the relationships between word pairs.
Line 490: Line 507:
  
  
-====== Other questions ======+====== Sample exam 2 ======
  
 ===== Question 1. (4 points) ===== ===== Question 1. (4 points) =====
Line 657: Line 674:
 \begin{tikzpicture}[ \begin{tikzpicture}[
     >=Stealth,     >=Stealth,
-    % Added label styling 
     label_style/.style={font=\sffamily\scriptsize, text=gray},     label_style/.style={font=\sffamily\scriptsize, text=gray},
     box/.style={draw, rectangle, rounded corners, minimum width=1.2cm, minimum height=0.8cm, fill=white, font=\sffamily},     box/.style={draw, rectangle, rounded corners, minimum width=1.2cm, minimum height=0.8cm, fill=white, font=\sffamily},
Line 683: Line 699:
 \node (h_next) at (12, 1) {$h_t$}; \node (h_next) at (12, 1) {$h_t$};
  
-\draw[thick] (h_prev) -- (9.5, 1);+% Fixed: Stop the continuous line at the output gate (8.5) 
 +\draw[thick] (h_prev) -- (8.5, 1);
 \draw[->, thick] (x_curr) -- (1, 1); \draw[->, thick] (x_curr) -- (1, 1);
  
Line 694: Line 711:
  
 \node[box] (g) at (6.5, 2.5) {$\tanh$}; \node[box] (g) at (6.5, 2.5) {$\tanh$};
-% Candidate gate doesn't have a label in your request, but usually it's "cand" 
 \node[label_style, below=of g] {candidate}; \node[label_style, below=of g] {candidate};
  
Line 725: Line 741:
 % Final Hidden State routing % Final Hidden State routing
 \draw[->, thick] (mult_o) -- (9.5, 1) -- (h_next); \draw[->, thick] (mult_o) -- (9.5, 1) -- (h_next);
-\node[dot] at (9.5, 1) {}; 
  
 % Label % Label
Line 821: Line 836:
 For $VC(\mathcal{H}) \geq n$: For $VC(\mathcal{H}) \geq n$:
  
-  $x_1 = 0 \quad 1 \quad 1 \quad \dots \quad 1$ +  $x_1 = 0 \quad 1 \quad 1 \quad \dots \quad 1$ 
-  $x_2 = 1 \quad 0 \quad 1 \quad \dots \quad 1$ +  $x_2 = 1 \quad 0 \quad 1 \quad \dots \quad 1$ 
-  $x_n = 1 \quad 1 \quad 1 \quad \dots \quad 0$+  $x_n = 1 \quad 1 \quad 1 \quad \dots \quad 0$
  
 $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$ $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$
Line 847: Line 862:
 $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$ $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$
 ++++ ++++
 +
 +====== Other questions ======
 +
  
 ===== Question 5. (10 points) ===== ===== Question 5. (10 points) =====
Line 1031: Line 1049:
  
 SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality. SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality.
 +++++
 +
 +++++ What is DQN and Double Q-Learning? |
 +Deep Q-Learning is a method that uses a NN to approximate the state-action Q-function in environments with large state spaces (games with 2d pixels etc.). Standard Q-Learning can easily diverge while learning due to correlated trainig samples or non stationary targets. DQN uses techniques like replay buffers and fixed target networks to solve this.
 +
 +Double Q-Learning is designed to fix maximization bias. While standard Q-Learning uses the same Q-function to select the best next action and to evaluate the value, DQL has two separate Q-function estimates: one for picking the max action and the secondf to evaluate the action values.
 +
 +Double DQN is the combination of both.
 +++++
 +
 +++++ How is bayesian sampling and Thompson sampling used for MABP? |
 +In MABP we are trying to estimate the hidden parameters of each arms distribution (like the success probability $\theta$ of a Bernoulli coin flip).
 +
 +We start with a prior distribution for each arm and every time we pull it and observe the reward, we use Bayes law to update the posterior distribution $P(\theta|X,\alpha)$ for the pulled arm.
 +
 +Thompson sampling is a specific algorithm that uses the bayesian framework to solve the exploration/exploitation tradeoff using probability matching.
 +
 +  * it randomly samples a set of parameters from the current posterior distribution of each arm
 +  * it calculates the expected $Q(a)$ based on the parameters
 +  * it selects the arm that yielded the maximum sampled value
 +  * it observes the real reward and updates the posterior
 +
 +It explores even arms that have a wide uncertainty, as it samples from a distribution and those arms occasionaly yield very high rewards.
 ++++ ++++
  
Line 1149: Line 1190:
 Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together. Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together.
  
-The model extracts these topics and infers the topic distribution $\thata$ for each document and the word distribution $\beta$ for each topic.+The model extracts these topics and infers the topic distribution $\theta$ for each document and the word distribution $\beta$ for each topic.
  
 The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$ The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$
Line 1229: Line 1270:
  
 ++++ Uni-gram and Bi-gram models learned on the same data, which will have a lower perplexity? | ++++ Uni-gram and Bi-gram models learned on the same data, which will have a lower perplexity? |
-Bi-gram will have a lower perplexity, since the uni-gram only evaluate the absolute probability of each word appearing, not taking into account the previous word.+Bi-gram will have a lower perplexity, since the uni-grams only evaluate the absolute probability of each word appearing, not taking into account the previous word.
  
-Uni-grams are better at this task, so they will have a lower perplexity.+Uni-grams are worse at this task, so they will have a higher perplexity.
 ++++ ++++
  
Line 1238: Line 1279:
  
 ++++ Provide the general definition of the VC Dimension ($VC(\mathcal{H})$) for a hypothesis class $\mathcal{H}$. What does it mean for a set of instances to be "shattered"? | ++++ Provide the general definition of the VC Dimension ($VC(\mathcal{H})$) for a hypothesis class $\mathcal{H}$. What does it mean for a set of instances to be "shattered"? |
 +A set of instances $X' \subseteq X$ is considered shattered by a hypothesis $\mathcal{H}$ (or a concept class $\mathcal{C}$) if for every possible subset $X'' \subseteq X'$ there is a hypothesis $h \in \mathcal{H}$ that perfectly isolates it (or there is a concept $C$ in $\mathcal{C}$ such that $C \cap X' = X''$).
 +
 +Finite $X'$ is shattered by $\mathcal{C}$ if it can be split by concepts in all $2^{|X'|}$ possible ways.
 +
 +VC-dimension is the size of the largest subset $X$ shattered by $\mathcal{C}$:
 +
 +$VC(\mathcal{C}) = \max \left\{|X'|: \mathcal{C}\ \text{shatters}\ X', X' \subseteq X \right\}$
 ++++ ++++
  
 ++++ What is the exact VC dimension for the hypothesis class of open intervals on real numbers? Provide the number and a brief justification. | ++++ What is the exact VC dimension for the hypothesis class of open intervals on real numbers? Provide the number and a brief justification. |
 +The VC dimension is $\geq 2$:
 +
 +(i assume the hypothesis is positive inside of the open interval, and negative outside of it, with points $x_1 \lt x_2 \lt x_3 \lt \dots \lt x_n $)
 +
 +For any labeling of two points we can find a hypothesis to shatter them:
 +  * $(+, +)$ - interval with both the points inside the interval
 +  * $(+, -)$ - interval with the left point inside
 +  * $(-, +)$ - interval with the right point inside
 +  * $(-, -)$ - interval with both the points outside
 +
 +So the VC-dimension is at least $2$.
 +
 +For three points, we are unable to shatter a combination of points $(+, -, +)$, thus the VC-dimension is $\lt 3$
 +
 +We proved that the VC-dimension is 2.
 ++++ ++++
  
 ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, define mistake bound learning | ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, define mistake bound learning |
 +M-B-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$.
 +
 +$\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing.
 +
 +WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake:
 +
 +  * false negative (if $x$ is True and we predicted False) - double all $h_i$ where $x_i = 1$
 +  * false positive - nullify all $h_i$ where $x_i$ = 1
 +
 +WINNOW natively learns monotone conjunctions and disjunctions, we adapt it to learn non-monotone by introducing new variables.
 +
 +Halving algorithm takes the version space consisting of all $h \in \mathcal{H}$, predicts the lable by taking majority vote among the remaining $h$, if it makes the mistake it removes all the $h$ that guessed incorrectly.
 +
 +At least half of the hypotheses are eliminated each mistake, the max mistake bound is $\text{lg} |\mathcal{H}|$
 +
 +To adapt it to any $C$ we simply use it with any hypothesis $\mathcal{H}$ where $\mathcal{H} \supseteq C$ and $\text{lg} |\mathcal{H}| \leq \text{poly}(n)$
 ++++ ++++
  
 ++++ What is the PAC learning model? What does it mean for something to be PAC-learnable? | ++++ What is the PAC learning model? What does it mean for something to be PAC-learnable? |
 +Probably approximately correct learning is an offline learning model, that has i.i.d. assumptions, bounded error and failure.
 +
 +Concept class $\mathcal{C}$ is PAC-learnable if an algorithm exists that can learn it while:
 +  * receiving the number of examples bounded by $m \leq \text{poly}(\frac{1}{\epsilon}, \frac{1}{\delta}, n)$
 +  * outputting the hypothesis that is approximately correct: $err(h) \leq \epsilon$
 +  * succeeding in doing so with probability of at least $1 - \delta$
 +
 +If it also runs in $\text{poly}(\frac{1}{\epsilon}, \frac{1}{\delta}, n)$ time it is efficiently PAC-learnable.
 ++++ ++++
  
Line 1252: Line 1339:
  
 ++++ If a concept class is efficiently learnable in the Mistake Bound model, it is also efficiently learnable in the PAC model. | ++++ If a concept class is efficiently learnable in the Mistake Bound model, it is also efficiently learnable in the PAC model. |
 +Yes.
 +
 +Mistake bound learnability implies PAC-learnability.
 +
 +M-B learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake.
 ++++ ++++
  
 ++++ If a concept class is efficiently learnable in the PAC model, it is also efficiently learnable in the Mistake Bound model.| ++++ If a concept class is efficiently learnable in the PAC model, it is also efficiently learnable in the Mistake Bound model.|
 +No.
 +
 +PAC-learnability does not guarantee M-B-learnability.
 +
 +If a concept class is PAC-learnable, then $VC(\mathcal{C}) \leq \text{poly}(n)$. A concept class can have a finite VC-dimension, even when the number of concepts $|\mathcal{C}|$ is infinite. This can lead to unbounded number of mistakes, that it may no learn in M-B.
 ++++ ++++
 +
 +++++ What are the k-decision lists? Are they properly and efficiently PAC-learnable? |
 +k-DL is a decision model that consists of a structured list of k-conjunctions (each has up to k literals), each is paired with a specific class indicator followed by a final default class indicator.
 +
 +When evaluating, it is assigned a class of the top-most conjunctions for which it evaluates to true.
 +
 +They are both properly and efficiently PAC-learnable.
 +
 +The size of the class is bounded by $\text{ln}|DL| \leq \text{poly}(n)$, it is PAC-learnable.
 +
 +We can efficiently find a consistent k-DL using the covering algorithm, which guarantees the proper and efficient PAC-learnability.
 +++++
 +
 +++++ What is the ERM? What is the inconsistent learning? |
 +Inconsistent learning occurs when it is impossible for a learner to find a hypothesis that perfectly matches all the labels in the training set (eg. due to noise or it being impossible).
 +
 +ERM is a principle used to handle this problem; Under ERM the learner output the hypothesis $h_{erm}$ that minimizes the empirical risk, which is the training error $\hat{erm}(h)$, which is the proportion of the training examples that are missclassified by $h$.
 +++++
 +
  
Navigation

Playground

QR Code
QR Code courses:b4m36smu (generated for current page)