Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| courses:b4m36smu [2026/05/24 18:45] – [Other COLT] jpelc | courses:b4m36smu [2026/05/27 09:37] (current) – [Other COLT] jpelc | ||
|---|---|---|---|
| Line 3: | Line 3: | ||
| {{courses: | {{courses: | ||
| - | ======= Sample | + | {{courses: |
| + | ======= Sample | ||
| ===== Question 1. (10 points) ===== | ===== Question 1. (10 points) ===== | ||
| Line 231: | Line 232: | ||
| ++++ (1 points) Assume classification of much longer reviews. What problem could occur and how to solve it?| | ++++ (1 points) Assume classification of much longer reviews. What problem could occur and how to solve it?| | ||
| With longer reviews, we multiply numbers that are close to zero, which could lead to underflows. We then calculate the probabilities in log space. Logarithm is an increasing function, so the classification will lead the same classification results, as the raw probabilities. | With longer reviews, we multiply numbers that are close to zero, which could lead to underflows. We then calculate the probabilities in log space. Logarithm is an increasing function, so the classification will lead the same classification results, as the raw probabilities. | ||
| + | |||
| + | We can also stumble upon a completely unseen word, which will have a probability of zero, making the whole sentence have the probability of zero. Solution to this problem is smoothing. | ||
| The problem with longer reviews is also that the Naive Bayes (or unigram model) does not take the context of the word, that can span the whole sentence in normal language, like: | The problem with longer reviews is also that the Naive Bayes (or unigram model) does not take the context of the word, that can span the whole sentence in normal language, like: | ||
| Line 362: | Line 365: | ||
| % Layer Box | % Layer Box | ||
| - | \draw[thick, | + | \draw[thick, |
| - | \node at (4.5, 5) [font=\sffamily\bfseries\Large] {$\times N$ Layers}; | + | \node at (5.0, 7) [font=\sffamily\bfseries\Large] {$\times N$ Layers}; |
| \end{tikzpicture} | \end{tikzpicture} | ||
| Line 369: | Line 372: | ||
| </ | </ | ||
| - | We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words. | + | We first tokenize the text and turn the words into embeddings (with positional encodings - information about their location in the text, implemented using sine and cosine functions to encode positions), we feed it through the multi-head self-attention layer (which catches multiple relationships between pair of words) and we normalize, this is then repeated N times until we feed forward through a standard NN to process the learned patterns. We predict the probabilities of the output words. |
| The core component is the self-attention block (or the multi-head version). It helps the transformer learn the relationships between word pairs. | The core component is the self-attention block (or the multi-head version). It helps the transformer learn the relationships between word pairs. | ||
| Line 504: | Line 507: | ||
| - | ====== | + | ====== |
| ===== Question 1. (4 points) ===== | ===== Question 1. (4 points) ===== | ||
| Line 671: | Line 674: | ||
| \begin{tikzpicture}[ | \begin{tikzpicture}[ | ||
| > | > | ||
| - | % Added label styling | ||
| label_style/ | label_style/ | ||
| box/ | box/ | ||
| Line 697: | Line 699: | ||
| \node (h_next) at (12, 1) {$h_t$}; | \node (h_next) at (12, 1) {$h_t$}; | ||
| - | \draw[thick] (h_prev) -- (9.5, 1); | + | % Fixed: Stop the continuous line at the output gate (8.5) |
| + | \draw[thick] (h_prev) -- (8.5, 1); | ||
| \draw[->, | \draw[->, | ||
| Line 708: | Line 711: | ||
| \node[box] (g) at (6.5, 2.5) {$\tanh$}; | \node[box] (g) at (6.5, 2.5) {$\tanh$}; | ||
| - | % Candidate gate doesn' | ||
| \node[label_style, | \node[label_style, | ||
| Line 739: | Line 741: | ||
| % Final Hidden State routing | % Final Hidden State routing | ||
| \draw[->, | \draw[->, | ||
| - | \node[dot] at (9.5, 1) {}; | ||
| % Label | % Label | ||
| Line 835: | Line 836: | ||
| For $VC(\mathcal{H}) \geq n$: | For $VC(\mathcal{H}) \geq n$: | ||
| - | | + | |
| - | | + | |
| - | | + | |
| $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$ | $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$ | ||
| Line 861: | Line 862: | ||
| $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$ | $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$ | ||
| ++++ | ++++ | ||
| + | |||
| + | ====== Other questions ====== | ||
| + | |||
| ===== Question 5. (10 points) ===== | ===== Question 5. (10 points) ===== | ||
| Line 1045: | Line 1049: | ||
| SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality. | SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality. | ||
| + | ++++ | ||
| + | |||
| + | ++++ What is DQN and Double Q-Learning? | | ||
| + | Deep Q-Learning is a method that uses a NN to approximate the state-action Q-function in environments with large state spaces (games with 2d pixels etc.). Standard Q-Learning can easily diverge while learning due to correlated trainig samples or non stationary targets. DQN uses techniques like replay buffers and fixed target networks to solve this. | ||
| + | |||
| + | Double Q-Learning is designed to fix maximization bias. While standard Q-Learning uses the same Q-function to select the best next action and to evaluate the value, DQL has two separate Q-function estimates: one for picking the max action and the secondf to evaluate the action values. | ||
| + | |||
| + | Double DQN is the combination of both. | ||
| + | ++++ | ||
| + | |||
| + | ++++ How is bayesian sampling and Thompson sampling used for MABP? | | ||
| + | In MABP we are trying to estimate the hidden parameters of each arms distribution (like the success probability $\theta$ of a Bernoulli coin flip). | ||
| + | |||
| + | We start with a prior distribution for each arm and every time we pull it and observe the reward, we use Bayes law to update the posterior distribution $P(\theta|X, | ||
| + | |||
| + | Thompson sampling is a specific algorithm that uses the bayesian framework to solve the exploration/ | ||
| + | |||
| + | * it randomly samples a set of parameters from the current posterior distribution of each arm | ||
| + | * it calculates the expected $Q(a)$ based on the parameters | ||
| + | * it selects the arm that yielded the maximum sampled value | ||
| + | * it observes the real reward and updates the posterior | ||
| + | |||
| + | It explores even arms that have a wide uncertainty, | ||
| ++++ | ++++ | ||
| Line 1163: | Line 1190: | ||
| Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together. | Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together. | ||
| - | The model extracts these topics and infers the topic distribution $\thata$ for each document and the word distribution $\beta$ for each topic. | + | The model extracts these topics and infers the topic distribution $\theta$ for each document and the word distribution $\beta$ for each topic. |
| The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$ | The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$ | ||
| Line 1243: | Line 1270: | ||
| ++++ Uni-gram and Bi-gram models learned on the same data, which will have a lower perplexity? | | ++++ Uni-gram and Bi-gram models learned on the same data, which will have a lower perplexity? | | ||
| - | Bi-gram will have a lower perplexity, since the uni-gram only evaluate the absolute probability of each word appearing, not taking into account the previous word. | + | Bi-gram will have a lower perplexity, since the uni-grams only evaluate the absolute probability of each word appearing, not taking into account the previous word. |
| - | Uni-grams are better | + | Uni-grams are worse at this task, so they will have a higher |
| ++++ | ++++ | ||
| Line 1280: | Line 1307: | ||
| ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, | ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, | ||
| - | MB-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$. | + | M-B-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$. |
| $\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing. | $\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing. | ||
| Line 1286: | Line 1313: | ||
| WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake: | WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake: | ||
| - | * false positive | + | * false negative |
| - | * false negative | + | * false positive |
| WINNOW natively learns monotone conjunctions and disjunctions, | WINNOW natively learns monotone conjunctions and disjunctions, | ||
| Line 1316: | Line 1343: | ||
| Mistake bound learnability implies PAC-learnability. | Mistake bound learnability implies PAC-learnability. | ||
| - | MB learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake. | + | M-B learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake. |
| ++++ | ++++ | ||
| Line 1322: | Line 1349: | ||
| No. | No. | ||
| - | PAC-learnability does not guarantee | + | PAC-learnability does not guarantee |
| - | If a concept class is PAC-learnable, | + | If a concept class is PAC-learnable, |
| ++++ | ++++ | ||
| + | |||
| + | ++++ What are the k-decision lists? Are they properly and efficiently PAC-learnable? | ||
| + | k-DL is a decision model that consists of a structured list of k-conjunctions (each has up to k literals), each is paired with a specific class indicator followed by a final default class indicator. | ||
| + | |||
| + | When evaluating, it is assigned a class of the top-most conjunctions for which it evaluates to true. | ||
| + | |||
| + | They are both properly and efficiently PAC-learnable. | ||
| + | |||
| + | The size of the class is bounded by $\text{ln}|DL| \leq \text{poly}(n)$, | ||
| + | |||
| + | We can efficiently find a consistent k-DL using the covering algorithm, which guarantees the proper and efficient PAC-learnability. | ||
| + | ++++ | ||
| + | |||
| + | ++++ What is the ERM? What is the inconsistent learning? | | ||
| + | Inconsistent learning occurs when it is impossible for a learner to find a hypothesis that perfectly matches all the labels in the training set (eg. due to noise or it being impossible). | ||
| + | |||
| + | ERM is a principle used to handle this problem; Under ERM the learner output the hypothesis $h_{erm}$ that minimizes the empirical risk, which is the training error $\hat{erm}(h)$, | ||
| + | ++++ | ||
| + | |||