Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
courses:b4m36smu [2026/05/24 19:11] – [Question 4. (10 points)] jpelccourses:b4m36smu [2026/05/27 09:37] (current) – [Other COLT] jpelc
Line 3: Line 3:
 {{courses:smu:smu_2023_model_exam.pdf| Sample Exam (2023)}} {{courses:smu:smu_2023_model_exam.pdf| Sample Exam (2023)}}
  
-======= Sample test =======+{{courses:smu:smu_skripta_thirtyfourth_draft.pdf| Skripta}} 
 +======= Sample exam 1 =======
  
 ===== Question 1. (10 points) ===== ===== Question 1. (10 points) =====
Line 364: Line 365:
  
 % Layer Box % Layer Box
-\draw[thick, dashed] (-3.5, 2.5) rectangle (3.5, 7.5); +\draw[thick, dashed] (-4.0, 2.5) rectangle (3.5, 10.5); 
-\node at (4.55) [font=\sffamily\bfseries\Large] {$\times N$ Layers};+\node at (5.07) [font=\sffamily\bfseries\Large] {$\times N$ Layers};
  
 \end{tikzpicture} \end{tikzpicture}
Line 371: Line 372:
 </tikzjax> </tikzjax>
  
-We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words.+We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text, implemented using sine and cosine functions to encode positions), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words.
  
 The core component is the self-attention block (or the multi-head version). It helps the transformer learn the relationships between word pairs. The core component is the self-attention block (or the multi-head version). It helps the transformer learn the relationships between word pairs.
Line 506: Line 507:
  
  
-====== Other questions ======+====== Sample exam 2 ======
  
 ===== Question 1. (4 points) ===== ===== Question 1. (4 points) =====
Line 673: Line 674:
 \begin{tikzpicture}[ \begin{tikzpicture}[
     >=Stealth,     >=Stealth,
-    % Added label styling 
     label_style/.style={font=\sffamily\scriptsize, text=gray},     label_style/.style={font=\sffamily\scriptsize, text=gray},
     box/.style={draw, rectangle, rounded corners, minimum width=1.2cm, minimum height=0.8cm, fill=white, font=\sffamily},     box/.style={draw, rectangle, rounded corners, minimum width=1.2cm, minimum height=0.8cm, fill=white, font=\sffamily},
Line 699: Line 699:
 \node (h_next) at (12, 1) {$h_t$}; \node (h_next) at (12, 1) {$h_t$};
  
-\draw[thick] (h_prev) -- (9.5, 1);+% Fixed: Stop the continuous line at the output gate (8.5) 
 +\draw[thick] (h_prev) -- (8.5, 1);
 \draw[->, thick] (x_curr) -- (1, 1); \draw[->, thick] (x_curr) -- (1, 1);
  
Line 710: Line 711:
  
 \node[box] (g) at (6.5, 2.5) {$\tanh$}; \node[box] (g) at (6.5, 2.5) {$\tanh$};
-% Candidate gate doesn't have a label in your request, but usually it's "cand" 
 \node[label_style, below=of g] {candidate}; \node[label_style, below=of g] {candidate};
  
Line 741: Line 741:
 % Final Hidden State routing % Final Hidden State routing
 \draw[->, thick] (mult_o) -- (9.5, 1) -- (h_next); \draw[->, thick] (mult_o) -- (9.5, 1) -- (h_next);
-\node[dot] at (9.5, 1) {}; 
  
 % Label % Label
Line 837: Line 836:
 For $VC(\mathcal{H}) \geq n$: For $VC(\mathcal{H}) \geq n$:
  
-  $x_1 = 0 \quad 1 \quad 1 \quad \dots \quad 1$ +  $x_1 = 0 \quad 1 \quad 1 \quad \dots \quad 1$ 
-  $x_2 = 1 \quad 0 \quad 1 \quad \dots \quad 1$ +  $x_2 = 1 \quad 0 \quad 1 \quad \dots \quad 1$ 
-  $x_n = 1 \quad 1 \quad 1 \quad \dots \quad 0$+  $x_n = 1 \quad 1 \quad 1 \quad \dots \quad 0$
  
 $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$ $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$
Line 863: Line 862:
 $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$ $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$
 ++++ ++++
 +
 +====== Other questions ======
 +
  
 ===== Question 5. (10 points) ===== ===== Question 5. (10 points) =====
Line 1047: Line 1049:
  
 SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality. SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality.
 +++++
 +
 +++++ What is DQN and Double Q-Learning? |
 +Deep Q-Learning is a method that uses a NN to approximate the state-action Q-function in environments with large state spaces (games with 2d pixels etc.). Standard Q-Learning can easily diverge while learning due to correlated trainig samples or non stationary targets. DQN uses techniques like replay buffers and fixed target networks to solve this.
 +
 +Double Q-Learning is designed to fix maximization bias. While standard Q-Learning uses the same Q-function to select the best next action and to evaluate the value, DQL has two separate Q-function estimates: one for picking the max action and the secondf to evaluate the action values.
 +
 +Double DQN is the combination of both.
 +++++
 +
 +++++ How is bayesian sampling and Thompson sampling used for MABP? |
 +In MABP we are trying to estimate the hidden parameters of each arms distribution (like the success probability $\theta$ of a Bernoulli coin flip).
 +
 +We start with a prior distribution for each arm and every time we pull it and observe the reward, we use Bayes law to update the posterior distribution $P(\theta|X,\alpha)$ for the pulled arm.
 +
 +Thompson sampling is a specific algorithm that uses the bayesian framework to solve the exploration/exploitation tradeoff using probability matching.
 +
 +  * it randomly samples a set of parameters from the current posterior distribution of each arm
 +  * it calculates the expected $Q(a)$ based on the parameters
 +  * it selects the arm that yielded the maximum sampled value
 +  * it observes the real reward and updates the posterior
 +
 +It explores even arms that have a wide uncertainty, as it samples from a distribution and those arms occasionaly yield very high rewards.
 ++++ ++++
  
Line 1165: Line 1190:
 Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together. Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together.
  
-The model extracts these topics and infers the topic distribution $\thata$ for each document and the word distribution $\beta$ for each topic.+The model extracts these topics and infers the topic distribution $\theta$ for each document and the word distribution $\beta$ for each topic.
  
 The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$ The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$
Line 1282: Line 1307:
  
 ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, define mistake bound learning | ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, define mistake bound learning |
-MB-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$.+M-B-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$.
  
 $\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing. $\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing.
Line 1288: Line 1313:
 WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake: WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake:
  
-  * false positive (if $x$ is True and we predicted False) - double all $h_i$ where $x_i = 1$ +  * false negative (if $x$ is True and we predicted False) - double all $h_i$ where $x_i = 1$ 
-  * false negative - nullify all $h_i$ where $x_i$ = 1+  * false positive - nullify all $h_i$ where $x_i$ = 1
  
 WINNOW natively learns monotone conjunctions and disjunctions, we adapt it to learn non-monotone by introducing new variables. WINNOW natively learns monotone conjunctions and disjunctions, we adapt it to learn non-monotone by introducing new variables.
Line 1318: Line 1343:
 Mistake bound learnability implies PAC-learnability. Mistake bound learnability implies PAC-learnability.
  
-MB learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake.+M-B learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake.
 ++++ ++++
  
Line 1324: Line 1349:
 No. No.
  
-PAC-learnability does not guarantee MB-learnability.+PAC-learnability does not guarantee M-B-learnability.
  
-If a concept class is PAC-learnable, then $VC(\mathcal{C}) \leq \text{poly}(n)$. A concept class can have a finite VC-dimension, even when the number of concepts $|\mathcal{C}| is infinite$. This can lead to unbounded number of mistakes, that it may no learn in MB.+If a concept class is PAC-learnable, then $VC(\mathcal{C}) \leq \text{poly}(n)$. A concept class can have a finite VC-dimension, even when the number of concepts $|\mathcal{C}|is infinite. This can lead to unbounded number of mistakes, that it may no learn in M-B.
 ++++ ++++
 +
 +++++ What are the k-decision lists? Are they properly and efficiently PAC-learnable? |
 +k-DL is a decision model that consists of a structured list of k-conjunctions (each has up to k literals), each is paired with a specific class indicator followed by a final default class indicator.
 +
 +When evaluating, it is assigned a class of the top-most conjunctions for which it evaluates to true.
 +
 +They are both properly and efficiently PAC-learnable.
 +
 +The size of the class is bounded by $\text{ln}|DL| \leq \text{poly}(n)$, it is PAC-learnable.
 +
 +We can efficiently find a consistent k-DL using the covering algorithm, which guarantees the proper and efficient PAC-learnability.
 +++++
 +
 +++++ What is the ERM? What is the inconsistent learning? |
 +Inconsistent learning occurs when it is impossible for a learner to find a hypothesis that perfectly matches all the labels in the training set (eg. due to noise or it being impossible).
 +
 +ERM is a principle used to handle this problem; Under ERM the learner output the hypothesis $h_{erm}$ that minimizes the empirical risk, which is the training error $\hat{erm}(h)$, which is the proportion of the training examples that are missclassified by $h$.
 +++++
 +
  
Navigation

Playground

QR Code
QR Code courses:b4m36smu (generated for current page)