Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| courses:b4m36smu [2026/05/24 19:12] – [Question 7. (4 points)] jpelc | courses:b4m36smu [2026/05/27 09:37] (current) – [Other COLT] jpelc | ||
|---|---|---|---|
| Line 3: | Line 3: | ||
| {{courses: | {{courses: | ||
| - | ======= Sample | + | {{courses: |
| + | ======= Sample | ||
| ===== Question 1. (10 points) ===== | ===== Question 1. (10 points) ===== | ||
| Line 364: | Line 365: | ||
| % Layer Box | % Layer Box | ||
| - | \draw[thick, | + | \draw[thick, |
| - | \node at (4.5, 5) [font=\sffamily\bfseries\Large] {$\times N$ Layers}; | + | \node at (5.0, 7) [font=\sffamily\bfseries\Large] {$\times N$ Layers}; |
| \end{tikzpicture} | \end{tikzpicture} | ||
| Line 506: | Line 507: | ||
| - | ====== | + | ====== |
| ===== Question 1. (4 points) ===== | ===== Question 1. (4 points) ===== | ||
| Line 673: | Line 674: | ||
| \begin{tikzpicture}[ | \begin{tikzpicture}[ | ||
| > | > | ||
| - | % Added label styling | ||
| label_style/ | label_style/ | ||
| box/ | box/ | ||
| Line 699: | Line 699: | ||
| \node (h_next) at (12, 1) {$h_t$}; | \node (h_next) at (12, 1) {$h_t$}; | ||
| - | \draw[thick] (h_prev) -- (9.5, 1); | + | % Fixed: Stop the continuous line at the output gate (8.5) |
| + | \draw[thick] (h_prev) -- (8.5, 1); | ||
| \draw[->, | \draw[->, | ||
| Line 710: | Line 711: | ||
| \node[box] (g) at (6.5, 2.5) {$\tanh$}; | \node[box] (g) at (6.5, 2.5) {$\tanh$}; | ||
| - | % Candidate gate doesn' | ||
| \node[label_style, | \node[label_style, | ||
| Line 741: | Line 741: | ||
| % Final Hidden State routing | % Final Hidden State routing | ||
| \draw[->, | \draw[->, | ||
| - | \node[dot] at (9.5, 1) {}; | ||
| % Label | % Label | ||
| Line 837: | Line 836: | ||
| For $VC(\mathcal{H}) \geq n$: | For $VC(\mathcal{H}) \geq n$: | ||
| - | | + | |
| - | | + | |
| - | | + | |
| $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$ | $\mathcal{I} \subseteq \left[1, 2, \dots n\right]$ | ||
| Line 863: | Line 862: | ||
| $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$ | $VC(\mathcal{H}) \leq n \land VC(\mathcal{H}) \geq n \implies VC(\mathcal{H}) = n$ | ||
| ++++ | ++++ | ||
| + | |||
| + | ====== Other questions ====== | ||
| + | |||
| ===== Question 5. (10 points) ===== | ===== Question 5. (10 points) ===== | ||
| Line 1047: | Line 1049: | ||
| SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality. | SARSA converges to the optimal state-value function if step-sizes satisfy Robbins-Monro and sequence of policies satify GLIE. Q-Learning only needs Robbins-Monro for optimality. | ||
| + | ++++ | ||
| + | |||
| + | ++++ What is DQN and Double Q-Learning? | | ||
| + | Deep Q-Learning is a method that uses a NN to approximate the state-action Q-function in environments with large state spaces (games with 2d pixels etc.). Standard Q-Learning can easily diverge while learning due to correlated trainig samples or non stationary targets. DQN uses techniques like replay buffers and fixed target networks to solve this. | ||
| + | |||
| + | Double Q-Learning is designed to fix maximization bias. While standard Q-Learning uses the same Q-function to select the best next action and to evaluate the value, DQL has two separate Q-function estimates: one for picking the max action and the secondf to evaluate the action values. | ||
| + | |||
| + | Double DQN is the combination of both. | ||
| + | ++++ | ||
| + | |||
| + | ++++ How is bayesian sampling and Thompson sampling used for MABP? | | ||
| + | In MABP we are trying to estimate the hidden parameters of each arms distribution (like the success probability $\theta$ of a Bernoulli coin flip). | ||
| + | |||
| + | We start with a prior distribution for each arm and every time we pull it and observe the reward, we use Bayes law to update the posterior distribution $P(\theta|X, | ||
| + | |||
| + | Thompson sampling is a specific algorithm that uses the bayesian framework to solve the exploration/ | ||
| + | |||
| + | * it randomly samples a set of parameters from the current posterior distribution of each arm | ||
| + | * it calculates the expected $Q(a)$ based on the parameters | ||
| + | * it selects the arm that yielded the maximum sampled value | ||
| + | * it observes the real reward and updates the posterior | ||
| + | |||
| + | It explores even arms that have a wide uncertainty, | ||
| ++++ | ++++ | ||
| Line 1165: | Line 1190: | ||
| Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together. | Latent Dirichlet allocation tries to discover hidden topics within a set of documents, by finding clusters of terms that tend to occur together. | ||
| - | The model extracts these topics and infers the topic distribution $\thata$ for each document and the word distribution $\beta$ for each topic. | + | The model extracts these topics and infers the topic distribution $\theta$ for each document and the word distribution $\beta$ for each topic. |
| The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$ | The document-topic distribution $\theta$ is modeled as a random variable $\theta \sim \text{Dirichlet}(\alpha)$ | ||
| Line 1282: | Line 1307: | ||
| ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, | ++++ Adapt Winnow / Halving algorithm, define $\mathcal{H}$, | ||
| - | MB-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$. | + | M-B-learning (or online learning) is bounded by the number of mistakes which it can do while learning, that is $\leq \text{poly}(n)$. |
| $\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing. | $\mathcal{H}$ is the hypothesis class, with the individual hypotheses being functions $X\rightarrow \left\{0, 1\right\}$ The hypothesis class is the complete set of all hypotheses that a specific learner is capable of expressing. | ||
| Line 1288: | Line 1313: | ||
| WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake: | WINNOW algorithm is used to learn linearly separable concept (like perceptron). The initial $h$ is all ones, if it makes a mistake: | ||
| - | * false positive | + | * false negative |
| - | * false negative | + | * false positive |
| WINNOW natively learns monotone conjunctions and disjunctions, | WINNOW natively learns monotone conjunctions and disjunctions, | ||
| Line 1318: | Line 1343: | ||
| Mistake bound learnability implies PAC-learnability. | Mistake bound learnability implies PAC-learnability. | ||
| - | MB learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake. | + | M-B learner with $M \lt \text{poly}(n)$ can be transformed into PAC learner by running it on a batch of training examples, stopping when a hypothesis successfully survives $\frac{1}{\epsilon}\text{ln}(\frac{M}{\delta})$ consecutive examples without making a mistake. |
| ++++ | ++++ | ||
| Line 1324: | Line 1349: | ||
| No. | No. | ||
| - | PAC-learnability does not guarantee | + | PAC-learnability does not guarantee |
| - | If a concept class is PAC-learnable, | + | If a concept class is PAC-learnable, |
| ++++ | ++++ | ||
| + | |||
| + | ++++ What are the k-decision lists? Are they properly and efficiently PAC-learnable? | ||
| + | k-DL is a decision model that consists of a structured list of k-conjunctions (each has up to k literals), each is paired with a specific class indicator followed by a final default class indicator. | ||
| + | |||
| + | When evaluating, it is assigned a class of the top-most conjunctions for which it evaluates to true. | ||
| + | |||
| + | They are both properly and efficiently PAC-learnable. | ||
| + | |||
| + | The size of the class is bounded by $\text{ln}|DL| \leq \text{poly}(n)$, | ||
| + | |||
| + | We can efficiently find a consistent k-DL using the covering algorithm, which guarantees the proper and efficient PAC-learnability. | ||
| + | ++++ | ||
| + | |||
| + | ++++ What is the ERM? What is the inconsistent learning? | | ||
| + | Inconsistent learning occurs when it is impossible for a learner to find a hypothesis that perfectly matches all the labels in the training set (eg. due to noise or it being impossible). | ||
| + | |||
| + | ERM is a principle used to handle this problem; Under ERM the learner output the hypothesis $h_{erm}$ that minimizes the empirical risk, which is the training error $\hat{erm}(h)$, | ||
| + | ++++ | ||
| + | |||