Differences

This shows you the differences between two versions of the page.

--- courses:b4m36smu [2026/05/24 19:08] – [Other NLP] jpelc
+++ courses:b4m36smu [2026/05/27 09:37] (current) – [Other COLT] jpelc
@@ Line 3: / Line 3: @@
 {{courses:smu:smu_2023_model_exam.pdf| Sample Exam (2023)}}
-======= Sample test =======
+{{courses:smu:smu_skripta_thirtyfourth_draft.pdf| Skripta}}
+======= Sample exam 1 =======
 ===== Question 1. (10 points) =====
@@ Line 231: / Line 232: @@
 ++++ (1 points) Assume classification of much longer reviews. What problem could occur and how to solve it?|
 With longer reviews, we multiply numbers that are close to zero, which could lead to underflows. We then calculate the probabilities in log space. Logarithm is an increasing function, so the classification will lead the same classification results, as the raw probabilities.
+We can also stumble upon a completely unseen word, which will have a probability of zero, making the whole sentence have the probability of zero. Solution to this problem is smoothing.
 The problem with longer reviews is also that the Naive Bayes (or unigram model) does not take the context of the word, that can span the whole sentence in normal language, like:
@@ Line 362: / Line 365: @@
 % Layer Box
-\draw[thick, dashed] (-3.5, 2.5) rectangle (3.5, 7.5);
+\draw[thick, dashed] (-4.0, 2.5) rectangle (3.5, 10.5);
-\node at (4.5, 5) [font=\sffamily\bfseries\Large] {$\times N$ Layers};
+\node at (5.0, 7) [font=\sffamily\bfseries\Large] {$\times N$ Layers};
 \end{tikzpicture}
@@ Line 369: / Line 372: @@
 </tikzjax>
-We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words.
+We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text, implemented using sine and cosine functions to encode positions), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words.
 The core component is the self-attention block (or the multi-head version). It helps the transformer learn the relationships between word pairs.
@@ Line 504: / Line 507: @@
-====== Other questions ======
+====== Sample exam 2 ======
 ===== Question 1. (4 points) =====
@@ Line 671: / Line 674: @@
 \begin{tikzpicture}[
     >=Stealth,
-    % Added label styling
     label_style/.style={font=\sffamily\scriptsize, text=gray},
     box/.style={draw, rectangle, rounded corners, minimum width=1.2cm, minimum height=0.8cm, fill=white, font=\sffamily},
@@ Line 697: / Line 699: @@
 \node (h_next) at (12, 1) {$h_t$};
-\draw[thick] (h_prev) -- (9.5, 1);
+% Fixed: Stop the continuous line at the output gate (8.5)
+\draw[thick] (h_prev) -- (8.5, 1);
 \draw[->, thick] (x_curr) -- (1, 1);
@@ Line 708: / Line 711: @@
 \node[box] (g) at (6.5, 2.5) {$\tanh$};
-% Candidate gate doesn't have a label in your request, but usually it's "cand"
 \node[label_style, below=of g] {candidate};
@@ Line 739: / Line 741: @@
 % Final Hidden State routing
 \draw[->, thick] (mult_o) -- (9.5, 1) -- (h_next);
-\node[dot] at (9.5, 1) {};
 % Label
@@ Line 835: / Line 836: @@
 For $VC(\mathcal{H}) \geq n$:
-  - $x_1 = 0 \quad 1 \quad 1 \quad \dots \quad 1$
+  * $x_1 = 0 \quad 1 \quad 1 \quad \dots \quad 1$
-  - $x_2 = 1 \quad 0 \quad 1 \quad \dots \quad 1$
+  * $x_2 = 1 \quad 0 \quad 1 \quad \dots \quad 1$
-  - $x_n = 1 \quad 1 \quad 1 \quad \dots \quad 0$
+  * $x_n = 1 \quad 1 \quad 1 \quad \dots \quad 0$
 $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$
@@ Line 861: / Line 862: @@
 $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$
 ++++
+====== Other questions ======
 ===== Question 5. (10 points) =====
@@ Line 1045: / Line 1049: @@
 SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality.
+++++
+++++ What is DQN and Double Q-Learning? |
+Deep Q-Learning is a method that uses a NN to approximate the state-action Q-function in environments with large state spaces (games with 2d pixels etc.). Standard Q-Learning can easily diverge while learning due to correlated trainig samples or non stationary targets. DQN uses techniques like replay buffers and fixed target networks to solve this.
+Double Q-Learning is designed to fix maximization bias. While standard Q-Learning uses the same Q-function to select the best next action and to evaluate the value, DQL has two separate Q-function estimates: one for picking the max action and the secondf to evaluate the action values.
+Double DQN is the combination of both.
+++++
+++++ How is bayesian sampling and Thompson sampling used for MABP? |
+In MABP we are trying to estimate the hidden parameters of each arms distribution (like the success probability $\theta$ of a Bernoulli coin flip).
+We start with a prior distribution for each arm and every time we pull it and observe the reward, we use Bayes law to update the posterior distribution $P(\theta|X,\alpha)$ for the pulled arm.
+Thompson sampling is a specific algorithm that uses the bayesian framework to solve the exploration/exploitation tradeoff using probability matching.
+  * it randomly samples a set of parameters from the current posterior distribution of each arm
+  * it calculates the expected $Q(a)$ based on the parameters
+  * it selects the arm that yielded the maximum sampled value
+  * it observes the real reward and updates the posterior
+It explores even arms that have a wide uncertainty, as it samples from a distribution and those arms occasionaly yield very high rewards.
 ++++
@@ Line 1163: / Line 1190: @@
 Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together.
-The model extracts these topics and infers the topic distribution $\thata$ for each document and the word distribution $\beta$ for each topic.
+The model extracts these topics and infers the topic distribution $\theta$ for each document and the word distribution $\beta$ for each topic.
 The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$
@@ Line 1280: / Line 1307: @@
 ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, define mistake bound learning |
-MB-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$.
+M-B-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$.
 $\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing.
@@ Line 1286: / Line 1313: @@
 WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake:
-  * false positive (if $x$ is True and we predicted False) - double all $h_i$ where $x_i = 1$
+  * false negative (if $x$ is True and we predicted False) - double all $h_i$ where $x_i = 1$
-  * false negative - nullify all $h_i$ where $x_i$ = 1
+  * false positive - nullify all $h_i$ where $x_i$ = 1
 WINNOW natively learns monotone conjunctions and disjunctions, we adapt it to learn non-monotone by introducing new variables.
@@ Line 1316: / Line 1343: @@
 Mistake bound learnability implies PAC-learnability.
-MB learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake.
+M-B learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake.
 ++++
@@ Line 1322: / Line 1349: @@
 No.
-PAC-learnability does not guarantee MB-learnability.
+PAC-learnability does not guarantee M-B-learnability.
-If a concept class is PAC-learnable, then $VC(\mathcal{C}) \leq \text{poly}(n)$. A concept class can have a finite VC-dimension, even when the number of concepts $|\mathcal{C}| is infinite$. This can lead to unbounded number of mistakes, that it may no learn in MB.
+If a concept class is PAC-learnable, then $VC(\mathcal{C}) \leq \text{poly}(n)$. A concept class can have a finite VC-dimension, even when the number of concepts $|\mathcal{C}|$ is infinite. This can lead to unbounded number of mistakes, that it may no learn in M-B.
 ++++
+++++ What are the k-decision lists? Are they properly and efficiently PAC-learnable? |
+k-DL is a decision model that consists of a structured list of k-conjunctions (each has up to k literals), each is paired with a specific class indicator followed by a final default class indicator.
+When evaluating, it is assigned a class of the top-most conjunctions for which it evaluates to true.
+They are both properly and efficiently PAC-learnable.
+The size of the class is bounded by $\text{ln}|DL| \leq \text{poly}(n)$, it is PAC-learnable.
+We can efficiently find a consistent k-DL using the covering algorithm, which guarantees the proper and efficient PAC-learnability.
+++++
+++++ What is the ERM? What is the inconsistent learning? |
+Inconsistent learning occurs when it is impossible for a learner to find a hypothesis that perfectly matches all the labels in the training set (eg. due to noise or it being impossible).
+ERM is a principle used to handle this problem; Under ERM the learner output the hypothesis $h_{erm}$ that minimizes the empirical risk, which is the training error $\hat{erm}(h)$, which is the proportion of the training examples that are missclassified by $h$.
+++++

Trace:

Differences

Search

Navigation

Print/export

Tools

QR Code