These are notes for Stanford’s CS236, taught by Stefano Ermon in Fall 2023.
_Course description:_ Generative models are widely used in many subfields of AI and Machine Learning. Recent advances in parameterizing these models using deep neural networks, combined with progress in stochastic optimization methods, have enabled scalable modeling of complex, high-dimensional data including images, text, and speech. In this course, we will study the probabilistic foundations and learning algorithms for deep generative models, including variational autoencoders, generative adversarial networks, autoregressive models, normalizing flow models, energy-based models, and score-based models. The course will also discuss application areas that have benefitted from deep generative models, including computer vision, speech and natural language processing, graph mining, reinforcement learning, reliable machine learning, and inverse problem solving.
## Lecture 1: Introduction
The goal of this class is to build the foundations: to understand how the methods used in industry and in academic papers actually work, and to get up to speed with the fundamental concepts needed to build a generative model.
Richard Feynman: What I cannot create, I do not understand.
Generative Modelling: What I understand, I can create.
(The only way to do a good job at generating text that is meaningful is to have a certain level of understanding.)
Statistical Generative Models: learned from data (and also prior knowledge: physics, materials, ...)
Priors are always necessary, but there is a spectrum: graphics rely more on prior knowledge, generative models rely more on data.
On a high level, a statistical generative model is a _probability distribution_ $p(x)$:
- Data: samples (e.g: images of bedrooms)
- Prior knowledge: parametric form, loss function, optimization algorithm, etc.
It is generative because sampling from $p(x)$ generates new images.
Building a simulator for the data generating process:
control signals (e.g., captions, black-and-white images) $\rightarrow$ data simulator $\rightarrow$ new data points.
potential data points $\rightarrow$ data simulator $\rightarrow$ probability values.
Some examples:
![[Pasted image 20241112005305.png]]
![[Pasted image 20241112005511.png]]
![[Pasted image 20241112005903.png]]
It's really an exciting time to be working in this area.
Roadmap and Key Challenges:
- Representation: how do we model the joint distribution of many random variables?
- Need compact representation
- Learning: what is the right way to compare probability distributions?
- Inference: how do we invert the generation process?
- Unsupervised learning: recover high-level descriptions from raw data.
Syllabus:
- Fully observed likelihood-based models
- Autoregressive
- Flow-based models
- Latent variable models
- Variational learning
- Inference amortization
- Variational autoencoder
- Implicit generative models
- Two sample tests, embeddings, F-divergences
- Generative Adversarial Networks
- Energy Based Models
- Score-based Diffusion Generative Models
- Learn about algorithms, theory and applications
## Lecture 2: Background
### What is a generative model?
![[Pasted image 20241112233808.png]]
First, how do we represent $p(x)$? (It is not trivial to represent a probability distribution over a high-dimensional space.)
![[Pasted image 20241112234936.png]]
![[Pasted image 20241112235434.png]]
_Two important rules:_
_Chain rule:_ Let $S_1, \ldots, S_n$ be events with $p\left(S_i\right)>0$.
$
p\left(S_1 \cap S_2 \cap \cdots \cap S_n\right)=p\left(S_1\right) p\left(S_2 \mid S_1\right) \cdots p\left(S_n \mid S_1 \cap \ldots \cap S_{n-1}\right)
$
_Bayes' rule:_ Let $S_1, S_2$ be events with $p\left(S_1\right)>0$ and $p\left(S_2\right)>0$.
$
p\left(S_1 \mid S_2\right)=\frac{p\left(S_1 \cap S_2\right)}{p\left(S_2\right)}=\frac{p\left(S_2 \mid S_1\right) p\left(S_1\right)}{p\left(S_2\right)}
$
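A tiny worked example with illustrative numbers (my own, just to show the mechanics): if $p(S_1)=0.3$, $p(S_2 \mid S_1)=0.5$, and $p(S_2)=0.25$, then
$
p\left(S_1 \mid S_2\right)=\frac{p\left(S_2 \mid S_1\right) p\left(S_1\right)}{p\left(S_2\right)}=\frac{0.5 \times 0.3}{0.25}=0.6
$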
Structure through conditional independence
- Using Chain Rule
$
p\left(x_1, \ldots, x_n\right)=p\left(x_1\right) p\left(x_2 \mid x_1\right) p\left(x_3 \mid x_1, x_2\right) \cdots p\left(x_n \mid x_1, \cdots, x_{n-1}\right)
$
- How many parameters? $1+2+\cdots+2^{n-1}=2^n-1$
- $p\left(x_1\right)$ requires 1 parameter
- $p\left(x_2 \mid x_1=0\right)$ requires 1 parameter, $p\left(x_2 \mid x_1=1\right)$ requires 1 parameter Total 2 parameters.
- $2^n-1$ is still exponential; the chain rule alone does not buy us anything.
- Now suppose $X_{i+1} \perp X_1, \ldots, X_{i-1} \mid X_i$, then
$
\begin{aligned}
p\left(x_1, \ldots, x_n\right) & =p\left(x_1\right) p\left(x_2 \mid x_1\right) p\left(x_3 \mid x_1, x_2\right) \cdots p\left(x_n \mid x_1, \ldots, x_{n-1}\right) \\
& =p\left(x_1\right) p\left(x_2 \mid x_1\right) p\left(x_3 \mid x_2\right) \cdots p\left(x_n \mid x_{n-1}\right)
\end{aligned}
$
- How many parameters? $2n - 1$.
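A quick sanity check of these two counts, as a minimal Python sketch (the variable names are mine, not from the lecture):

```python
# Parameter counts for a joint distribution over n binary variables.
n = 10

# Full chain rule: p(x_i | x_1, ..., x_{i-1}) needs 2^(i-1) parameters.
full_chain = sum(2 ** (i - 1) for i in range(1, n + 1))

# Markov assumption X_{i+1} ⟂ X_1, ..., X_{i-1} | X_i:
# p(x_1) needs 1 parameter, each p(x_i | x_{i-1}) needs 2.
markov = 1 + 2 * (n - 1)

print(full_chain, 2 ** n - 1)  # 1023 1023
print(markov, 2 * n - 1)       # 19 19
```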
Bayes Network: General Idea
- Use conditional parameterization (instead of joint parameterization)
- For each random variable $X_i$, specify $p\left(x_i \mid \mathbf{x}_{\mathbf{A}_{\mathbf{i}}}\right)$ for set $\mathbf{X}_{\mathbf{A}_{\mathbf{i}}}$ of random variables
- Then get joint parametrization as
$
p\left(x_1, \ldots, x_n\right)=\prod_i p\left(x_i \mid \mathbf{x}_{\mathbf{A}_{\mathbf{i}}}\right)
$
- Need to guarantee it is a legal probability distribution. It has to correspond to a chain rule factorization, with factors simplified due to assumed conditional independencies
- A Bayesian network is specified by a directed acyclic graph (DAG) $G=(V, E)$ with:
- One node $i \in V$ for each random variable $X_i$
- One conditional probability distribution (CPD) per node, $p\left(x_i \mid \mathbf{x}_{\mathrm{Pa}(i)}\right)$, specifying the variable's probability conditioned on its parents' values
- Graph $G=(V, E)$ is called the structure of the Bayesian Network
- Defines a joint distribution:
$
p\left(x_1, \ldots x_n\right)=\prod_{i \in V} p\left(x_i \mid \mathbf{x}_{\mathrm{Pa}(i)}\right)
$
- Claim: $p\left(x_1, \ldots x_n\right)$ is a valid probability distribution because of the ordering implied by the DAG
- Economical representation: exponential in $|\mathrm{Pa}(i)|$, not $|V|$
![[Pasted image 20241113010121.png]]
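A minimal sketch of how such a factorization is used in practice; the tiny chain-structured DAG $x_1 \rightarrow x_2 \rightarrow x_3$ and its CPD numbers below are hypothetical, not from the slides.

```python
# CPDs for a toy Bayesian network over binary variables: x1 -> x2 -> x3.
p_x1 = {0: 0.4, 1: 0.6}                        # p(x1)
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1},          # p(x2 | x1 = 0)
                 1: {0: 0.2, 1: 0.8}}          # p(x2 | x1 = 1)
p_x3_given_x2 = {0: {0: 0.7, 1: 0.3},          # p(x3 | x2 = 0)
                 1: {0: 0.5, 1: 0.5}}          # p(x3 | x2 = 1)

def joint(x1, x2, x3):
    """p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Sanity check: the 2^3 assignments sum to 1, so this is a valid distribution.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0
```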
Discriminative versus generative models
Using the chain rule, $p(Y, \mathbf{X}) = p(\mathbf{X} \mid Y)\,p(Y) = p(Y \mid \mathbf{X})\,p(\mathbf{X})$
However, suppose all we need for prediction is $p(Y \mid \mathbf{X})$
In the left (generative) model, we need to specify/learn both $p(Y)$ and $p(\mathbf{X} \mid Y)$, then compute $p(Y \mid \mathbf{X})$ via Bayes' rule
In the right (discriminative) model, it suffices to estimate just the conditional distribution $p(Y \mid \mathbf{X})$
- We never need to model/learn/use $p(\mathbf{X})$ !
- Called a discriminative model because it is only useful for discriminating $Y$ 's label when given $\mathbf{X}$
#### Discriminative versus generative model
Since $\mathbf{X}$ is a random vector, the chain rule gives (depending on the ordering we choose):
- $p(Y, \mathbf{X})=p(Y) p\left(X_1 \mid Y\right) p\left(X_2 \mid Y, X_1\right) \cdots p\left(X_n \mid Y, X_1, \cdots, X_{n-1}\right)$
- $p(Y, \mathbf{X})=p\left(X_1\right) p\left(X_2 \mid X_1\right) p\left(X_3 \mid X_1, X_2\right) \cdots p\left(Y \mid X_1, \cdots, X_{n-1}, X_n\right)$
We must make the following choices:
(1) In the generative model, $p(Y)$ is simple, but how do we parameterize $p\left(X_i \mid \mathbf{X}_{p a(i)}, Y\right)$?
(2) In the discriminative model, how do we parameterize $p(Y \mid \mathbf{X})$ ? Here we assume we don't care about modeling $p(\mathbf{X})$ because $\mathbf{X}$ is always given to us in a classification problem
Naive Bayes
For the generative model, assume that $X_i \perp \mathbf{X}_{-i} \mid Y$ (naive Bayes)
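As a minimal sketch of what the naive Bayes assumption buys us (the class prior and per-feature conditionals below are illustrative numbers, not from the slides):

```python
import numpy as np

# Naive Bayes with 3 binary features: p(Y, x) = p(Y) * prod_i p(x_i | Y).
p_y = np.array([0.7, 0.3])                   # p(Y=0), p(Y=1)
p_xi_given_y = np.array([[0.1, 0.2, 0.9],    # p(X_i = 1 | Y = 0), i = 1..3
                         [0.8, 0.6, 0.3]])   # p(X_i = 1 | Y = 1)

def p_y1_given_x(x):
    # Per-class likelihood prod_i p(x_i | Y), then Bayes' rule to normalize.
    likelihood = np.prod(np.where(x, p_xi_given_y, 1 - p_xi_given_y), axis=1)
    joint = p_y * likelihood
    return joint[1] / joint.sum()

print(p_y1_given_x(np.array([1, 0, 0])))     # p(Y = 1 | x) for one input
```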
(1) For the discriminative model, assume that
$
p(Y=1 \mid \mathbf{x} ; \boldsymbol{\alpha})=f(\mathbf{x}, \boldsymbol{\alpha})
$
(2) Not represented as a table anymore. It is a parameterized function of x (regression)
- Has to be between 0 and 1
- Depend in some simple but reasonable way on $x_1, \cdots, x_n$
- Completely specified by a vector $\boldsymbol{\alpha}$ of $n+1$ parameters (compact representation)
Linear dependence: let $z(\boldsymbol{\alpha}, \mathbf{x})=\alpha_0+\sum_{i=1}^n \alpha_i x_i$. Then, $p(Y=1 \mid \mathbf{x} ; \boldsymbol{\alpha})=\sigma(z(\boldsymbol{\alpha}, \mathbf{x}))$, where $\sigma(z)=1 /\left(1+e^{-z}\right)$ is called the logistic function
(1) Decision boundary $p(Y=1 \mid \mathbf{x} ; \boldsymbol{\alpha})>0.5$ is linear in $\mathbf{x}$
(2) Equal probability contours are straight lines
(3) Probability rate of change has very specific form (third plot)
![[Pasted image 20241116172123.png]]
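A minimal sketch of the logistic parameterization (the weights below are illustrative, chosen so the example lands exactly on the decision boundary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# p(Y=1 | x; alpha) = sigma(alpha_0 + sum_i alpha_i x_i)
alpha0 = -1.0                       # illustrative bias
alpha = np.array([2.0, -0.5])       # one weight per feature

def p_y1_given_x(x):
    return sigmoid(alpha0 + alpha @ x)

x = np.array([1.0, 2.0])
print(p_y1_given_x(x))              # sigma(-1 + 2 - 1) = sigma(0) = 0.5
# The decision boundary p(Y=1 | x) = 0.5 is the hyperplane alpha_0 + alpha . x = 0,
# which is why it is linear in x.
```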
#### Neural Models
(1) In discriminative models, we assume that
$
p(Y=1 \mid \mathbf{x} ; \boldsymbol{\alpha})=f(\mathbf{x}, \boldsymbol{\alpha})
$
(2) Linear dependence:
- let $z(\boldsymbol{\alpha}, \mathbf{x})=\alpha_0+\sum_{i=1}^n \alpha_i x_i$.
- $p(Y=1 \mid \mathbf{x} ; \boldsymbol{\alpha})=\sigma(z(\boldsymbol{\alpha}, \mathbf{x}))$, where $\sigma(z)=1 /\left(1+e^{-z}\right)$ is the logistic function
- Dependence might be too simple
(3) Non-linear dependence: let $\mathbf{h}(A, \mathbf{b}, \mathbf{x})=f(A \mathbf{x}+\mathbf{b})$ be a non-linear transformation of the inputs (features).
$
p_{\text {Neural }}(Y=1 \mid \mathbf{x} ; \boldsymbol{\alpha}, A, \mathbf{b})=\sigma\left(\alpha_0+\sum_{i=1}^h \alpha_i h_i\right)
$
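A minimal sketch of this two-layer parameterization (the sizes, the random initialization, and the choice of ReLU for $f$ are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, h = 4, 3                                   # input and hidden sizes (illustrative)
rng = np.random.default_rng(0)
A, b = rng.normal(size=(h, n)), np.zeros(h)   # feature map h(A, b, x) = f(Ax + b)
alpha0, alpha = 0.0, rng.normal(size=h)       # output-layer parameters

def p_neural_y1_given_x(x):
    hidden = np.maximum(0.0, A @ x + b)       # f = ReLU, one possible non-linearity
    return sigmoid(alpha0 + alpha @ hidden)

print(p_neural_y1_given_x(np.ones(n)))
```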
#### Bayesian networks vs neural models
- Using Chain Rule
$
p\left(x_1, x_2, x_3, x_4\right)=p\left(x_1\right) p\left(x_2 \mid x_1\right) p\left(x_3 \mid x_1, x_2\right) p\left(x_4 \mid x_1, x_2, x_3\right)
$
Fully General
- Bayes Net
$
p\left(x_1, x_2, x_3, x_4\right) \approx p\left(x_1\right) p\left(x_2 \mid x_1\right) p\left(x_3 \mid x_2\right) p\left(x_4 \mid x_3\right)
$
Assumes conditional independencies (for example, here each conditional keeps only the previous variable as a parent)
- Neural Models
$
p\left(x_1, x_2, x_3, x_4\right) \approx p\left(x_1\right) p\left(x_2 \mid x_1\right) p_{\text {Neural }}\left(x_3 \mid x_1, x_2\right) p_{\text {Neural }}\left(x_4 \mid x_1, x_2, x_3\right)
$
Assumes a specific functional form for the conditionals.
## Lecture 3: Autoregressive Models
![[Pasted image 20241116195217.png]]
![[Pasted image 20241116200530.png]]
![[Pasted image 20241116202322.png]]
![[Pasted image 20241116203649.png]]
The idea: reduce the number of parameters needed by extracting the same features from $x_1, x_2, \ldots$ and reusing them throughout the sequence of classification problems.
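A minimal sketch of this weight sharing in the style of NADE (the sizes and random parameters are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 5, 8                                    # number of variables, hidden size
rng = np.random.default_rng(0)
W = rng.normal(size=(d, n))                    # columns of W are reused by every conditional
c = np.zeros(d)
V, b = rng.normal(size=(n, d)), np.zeros(n)    # per-variable output weights

def conditionals(x):
    """Return p(x_i = 1 | x_{<i}) for i = 1..n, sharing the same W throughout."""
    probs = []
    for i in range(n):
        h_i = sigmoid(W[:, :i] @ x[:i] + c)    # features of the prefix x_{<i}
        probs.append(sigmoid(V[i] @ h_i + b[i]))
    return np.array(probs)

x = rng.integers(0, 2, size=n).astype(float)   # an arbitrary binary input
print(conditionals(x))
```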
![[Pasted image 20241116210549.png]]
![[Pasted image 20241116211111.png]]
![[Pasted image 20241116212320.png]]
![[Pasted image 20241116214404.png]]
![[Pasted image 20241116215449.png]]
![[Pasted image 20241116221734.png]]
![[Pasted image 20241116223746.png]]
![[Pasted image 20241116224048.png]]
_Introduction to PyTorch_: [slides](https://docs.google.com/presentation/d/1fyPRy-tFxdvswz2p4HyH6TGkOQCVXAPS8gxzaTJH55s/edit#slide=id.p)
## Lecture 4 : Maximum Likelihood Learning
![[Pasted image 20241201212504.png]]
![[Pasted image 20241201212520.png]]
PixelCNN: [paper](https://arxiv.org/abs/1606.05328)
PixelDefend: [paper](https://arxiv.org/pdf/1710.10766)
Goal of learning
- The goal of learning is to return a model $P_\theta$ that precisely captures the distribution $P_{\text {data }}$ from which our data was sampled
- This is in general not achievable because of
- limited data only provides a rough approximation of the true underlying distribution
- computational reasons
- Example. Suppose we represent each image with a vector $X$ of 784 binary variables (black vs. white pixel). How many possible states (= possible images) in the model? $2^{784} \approx 10^{236}$. Even $10^7$ training examples provide extremely sparse coverage!
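A quick check of that figure:
$
\log_{10} 2^{784} = 784 \log_{10} 2 \approx 784 \times 0.301 \approx 236,
$
so there are roughly $10^{236}$ possible images, while a dataset provides at most around $10^7$ of them.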
What is best?
This depends on what we want to do
- Density estimation: we are interested in the full distribution (so later we can compute whatever conditional probabilities we want)
- Specific prediction tasks: we are using the distribution to make a prediction
- Is this email spam or not?
- Structured prediction: Predict next frame in a video, or caption given an image
KL divergence
- How should we measure distance between distributions?
- The Kullback-Leibler divergence (KL-divergence) between two distributions $p$ and $q$ is defined as
$
D(p \| q)=\sum_{\mathbf{x}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})}
$
- Notice that KL-divergence is asymmetric, i.e., $D(p \| q) \neq D(q \| p)$
- Measures the expected number of extra bits required to describe samples from $p(\mathbf{x})$ using a compression code based on $q$ instead of $p$
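A quick numerical illustration of the definition and its asymmetry (the two distributions are made up; natural log gives nats, use base 2 for bits):

```python
import numpy as np

def kl(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(kl(p, q), kl(q, p))   # the two directions generally differ
```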
## Lecture 5: VAE
Latent Variable Models: Motivation
Lots of variability in images $\mathbf{x}$ is due to gender, eye color, hair color, pose, etc. However, unless images are annotated, these factors of variation are not explicitly available (they are latent).
Idea: explicitly model these factors using latent variables $\mathbf{z}$
Mixture of Gaussians. Bayes net: $\mathbf{z} \rightarrow \mathbf{x}$.
- $\mathbf{z} \sim \operatorname{Categorical}(1, \cdots, K)$
- $p(\mathbf{x} \mid \mathbf{z}=k)=\mathcal{N}\left(\mu_k, \Sigma_k\right)$
![[Pasted image 20250309231851.png]]
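A minimal sketch of ancestral sampling from this mixture (the mixture weights, means, and standard deviations are illustrative):

```python
import numpy as np

K = 3
rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])                          # p(z = k)
mu = np.array([[-2.0, 0.0], [0.0, 2.0], [3.0, -1.0]])   # component means
sigma = np.array([0.5, 1.0, 0.7])                       # per-component (isotropic) std dev

def sample():
    z = rng.choice(K, p=pi)              # sample the latent component z
    x = rng.normal(mu[z], sigma[z])      # sample x | z = k from N(mu_k, sigma_k^2 I)
    return z, x

print([sample() for _ in range(3)])
```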
Latent Variable Models
- Allow us to define complex models $p(x)$ in terms of simpler building blocks
- Natural for unsupervised learning tasks (clustering, unsupervised representation learning, etc.)
![[Pasted image 20250309232834.png]]