Compositional Generalization in Diffusion Models
TLDR
- Paper 1 (SIM / Swing-by Dynamics) abstracts compositional generalization into a "wrapper" problem: learning an identity mapping on a structured Gaussian mixture. It ignores system internals (e.g., diffusion sampling, U-Nets) and develops a theory of the learning dynamics of the resulting simplified problem.
- Paper 2 (CPC/LCS in Conditional Diffusion) instead analyzes diffusion models directly, showing that a local inductive bias (sparse conditional dependencies in the score) is the mechanism that enables compositional / length generalization.
Paper 1 — Swing-by Dynamics in Concept Learning and Compositional Generalization (SIM) (Yang et al., ICLR 2025)
PDF: 2410.08309v3
1. Motivation: a theoretical wrapper for “concept space” diffusion results
Prior work evaluates a diffusion model by:
- mapping conditioning concepts to a vector,
- using a classifier to map generated images back to concept accuracy vectors,
- so a “good generator + classifier system” behaves like an identity mapping in concept space.
This paper argues:
The salient part is the structured organization of concept space, not the diffusion internals.
So they introduce a simplified task: Structured Identity Mapping (SIM).
2. SIM dataset: Gaussian mixture with structured centroids
Let \(d\) be dimension and \(s\le d\) be number of concept classes.
Training data consists of \(s\) Gaussian clusters aligned with coordinate axes: \(x_k^{(p)} \sim \mathcal{N}\!\left(\mu_p \mathbf{1}_p,\; \mathrm{diag}(\sigma)^2\right), \quad p\in [s],\; k\in[n].\)
Interpretation:
- \(\mu_p\) = concept signal strength (cluster mean distance from origin),
- \(\sigma_p\) = concept diversity (variance along that axis).
The learning problem is identity mapping with MSE: \(L(\theta) = \frac{1}{2sn}\sum_{p=1}^s\sum_{k=1}^n \|f(\theta; x_k^{(p)}) - x_k^{(p)}\|^2.\)
Evaluation uses an OOD “composition” point: \(x_b = \sum_{p=1}^s \mu_p \mathbf{1}_p.\)
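The SIM setup above is easy to reproduce. Below is a minimal NumPy sketch (all dimensions and \(\mu_p,\sigma_p\) values are hypothetical) that samples the \(s\) axis-aligned clusters, builds the OOD composition point \(x_b\), and checks the empirical diagonal covariance against the \(a_p = \sigma_p^2 + \mu_p^2/s\) formula from the next section; setting \(\sigma_p = 0\) for \(p > s\) is what makes \(a_p = 0\) off the concept axes.

```python
import numpy as np

rng = np.random.default_rng(0)

d, s, n = 8, 3, 1000                    # ambient dim, concept count, samples per cluster
mu = np.array([3.0, 2.0, 1.0])          # concept signal strengths (hypothetical)
sigma = np.zeros(d)
sigma[:s] = np.array([0.5, 0.4, 0.3])   # concept diversities; zero off the concept axes

# Cluster p: Gaussian centred at mu_p * e_p with diagonal covariance diag(sigma)^2.
clusters = []
for p in range(s):
    mean = np.zeros(d)
    mean[p] = mu[p]
    clusters.append(mean + sigma * rng.standard_normal((n, d)))
X = np.concatenate(clusters)            # full training set, shape (s*n, d)

# OOD composition point: every concept "on" at once (never seen in training).
x_b = np.zeros(d)
x_b[:s] = mu

# Empirical diagonal covariance vs. a_p = sigma_p^2 + mu_p^2 / s on concept axes.
a_emp = np.mean(X**2, axis=0)
a_theory = sigma**2
a_theory[:s] += mu**2 / s
print(np.round(a_emp[:s], 2), np.round(a_theory[:s], 2))
```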
3. Linearization: loss in terms of covariance
Assume the model is linear in input: \(f(\theta; x)=W_\theta x.\)
Then (via trace trick) the loss becomes: \(L(\theta) = \frac{1}{2}\|(W_\theta - I)A^{1/2}\|_F^2,\) where the (population) covariance is diagonal: \(A=\mathbb{E}[xx^\top] = \mathrm{diag}(a), \quad a_p= \begin{cases} \sigma_p^2 + \frac{\mu_p^2}{s}, & p\le s,\\ 0, & p>s. \end{cases}\)
Key: learning rates along coordinates are controlled by \(a_p\), hence by \(\mu_p\) and \(\sigma_p\).
4. One-layer linear model: closed-form dynamics
For \(f(W;x)=Wx\) under gradient flow, they derive: \(f(W(t),z)_k = \mathbf{1}_{k\le s}\bigl(1-e^{-a_k t}\bigr)z_k + \sum_{i=1}^s e^{-a_i t} w_{k,i}(0) z_i.\)
Interpretation:
- A “growth” term drives the correct identity mapping,
- A “noise” term decays with time and initialization scale.
Consequences:
- Concepts with larger \(a_k\) converge faster.
- Since \(a_k\) increases with \(\mu_k\) and \(\sigma_k\), generalization order is governed jointly by signal strength and diversity.
This reproduces empirical “concept order” phenomena.
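With zero initialization the noise term drops out of the closed-form solution, leaving \(f(W(t), z)_k = (1 - e^{-a_k t})\,z_k\), so the relative error on axis \(k\) is exactly \(e^{-a_k t}\). A few lines (hypothetical \(\mu, \sigma\) values) make the resulting concept-learning order concrete:

```python
import numpy as np

# Per-coordinate eigenvalues a_k = sigma_k^2 + mu_k^2 / s (hypothetical values).
mu = np.array([3.0, 2.0, 1.0])
sigma = np.array([0.5, 0.4, 0.3])
s = len(mu)
a = sigma**2 + mu**2 / s

# Closed-form gradient-flow solution with zero init: f(W(t), z)_k = (1 - exp(-a_k t)) z_k,
# so the relative error along axis k at time t is exp(-a_k t).
t = 2.0
rel_err = np.exp(-a * t)
print(np.round(rel_err, 3))
# Larger a_k (stronger signal and/or more diversity) -> faster convergence on that axis,
# which is exactly the "concept order" phenomenon.
```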
Limitation:
- coordinates evolve independently → no non-monotonic OOD behavior.
5. Deep linear model: symmetric 2-layer network and Swing-by dynamics
They analyze: \(f(U;x)=UU^\top x, \quad W(t)=UU^\top.\)
They obtain an evolution equation for Jacobian entries \(w_{i,j}\) decomposed into:
- growth term,
- suppression term,
- noise term,
which yields multi-stage dynamics:
- initial growth of many Jacobian entries,
- one major diagonal entry grows first,
- it suppresses associated off-diagonal “minor” entries,
- next major entry grows, and so on.
This staged Jacobian evolution produces an OOD trajectory that:
- initially moves toward the OOD composition point,
- then detours back toward training cluster(s),
- then later returns to OOD performance.
They call this Swing-by dynamics and connect it to a double-descent-like test loss curve (for OOD).
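The deep-linear case has no simple closed form, but gradient descent on \(L(U) = \frac{1}{2}\|(UU^\top - I)A^{1/2}\|_F^2\) from a small random initialization is a few lines. The sketch below (hypothetical \(a_p\), \(\mu_p\), learning rate, and step count) tracks the OOD error \(\|W(t)x_b - x_b\|\); plotting `ood_err` over training is where the non-monotonic swing-by detour can appear, depending on seed and hyperparameters, while the error eventually converges to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Diagonal data covariance A = diag(a) from the SIM setup (hypothetical values).
a = np.array([3.25, 1.49, 0.42])
A = np.diag(a)
d = len(a)

# OOD composition point: all concept means active at once.
mu = np.array([3.0, 2.0, 1.0])
x_b = mu.copy()

U = 0.01 * rng.standard_normal((d, d))   # small init, as in the deep-linear analysis
lr, steps = 0.05, 5000
ood_err = []
for _ in range(steps):
    W = U @ U.T
    # Gradient of 0.5 * ||(W - I) A^{1/2}||_F^2 with respect to U.
    grad = ((W - np.eye(d)) @ A + A @ (W - np.eye(d))) @ U
    U -= lr * grad
    ood_err.append(np.linalg.norm(W @ x_b - x_b))

print(ood_err[0], ood_err[-1])
```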
6. Empirical bridge back to diffusion
Even though SIM is abstract, they verify in text-conditioned diffusion models that:
- OOD concept accuracy can be non-monotonic during training,
- matching the “Swing-by” mechanism predicted by theory.
7. Takeaway
This paper treats compositional generalization as a wrapper identity mapping problem:
- it ignores internal generative machinery,
- and explains sequential concept learning + non-monotonic OOD curves as consequences of optimization dynamics on structured data.
Paper 2 — Local Mechanisms of Compositional Generalization in Conditional Diffusion (Bradley et al., 2025)
PDF: 2509.16447v2
1. Motivation: why length generalization is hard
The paper studies length generalization in conditional diffusion: train on scenes with a small number of objects, then test with more conditions (e.g., more specified locations) than seen during training.
Key observation:
- Whether length generalization succeeds depends on whether the model learns a compositional mechanism (one object per condition) or a shortcut mechanism (condition triggers a typical scene, not additive per-condition behavior).
2. Setup: location-conditioned CLEVR experiments
They use CLEVR with location conditioning:
- Experiment 1 (success): conditioner labels all object locations.
- Experiment 2 (failure): conditioner labels only one randomly chosen object location (even if the image has 2–3 objects).
- Experiment 3 (fix): enforce an architecture that induces local conditional score structure, restoring length generalization.
Crucial point: Even when training does not include multi-location conditioning, length generalization can still happen if the right inductive bias forces the model to represent the conditional distribution compositionally.
3. Definitions: Score functions and locality
A diffusion model learns the conditional score: \(s_t(x \mid c) \;=\; \nabla_x \log p_t(x \mid c).\)
3.1 Local Conditional Scores (LCS)
They define Local Conditional Scores (LCS) as a sparsity condition on dependencies:
At each pixel \(i\), the score depends only on:
- a subset of pixels \(N_i\) (often a local neighborhood), and
- a subset of conditions \(L_i \subseteq J\) (often nearby conditions for location-conditioning).
Informally:
The score at pixel \(i\) does not need the entire image nor all conditioners—only a sparse subset.
This generalizes “local scores” to conditional settings.
4. Conditional Projective Composition (CPC)
They define a strong form of compositionality of the conditional distribution.
Let \(J\) index the set of conditions \(\{c_j\}_{j \in J}\). Let \(\{M_j\}_{j\in J}\) be disjoint pixel subsets, and let \(M_J^c\) denote the remaining pixels.
A CPC distribution factorizes as: \(p(x \mid c_J) \;=\; p(x_{M_J^c} \mid \varnothing) \prod_{j \in J} p(x_{M_j} \mid c_j).\)
Meaning:
- Each condition \(c_j\) affects only its own region \(M_j\).
- Different regions are conditionally independent.
- A background region \(M_J^c\) may be unconditional.
This structure implies length generalization naturally: adding a new condition adds a new independent factor.
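The "new condition = new independent factor" point can be seen in a toy sampler. In this sketch each condition owns its own region and contributes its own factor (a hypothetical per-region Gaussian model on a 1-D "image"), so sampling with more conditions than ever seen at "train time" requires no change to the code:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy CPC sampler: condition c_j sets the level of its own region M_j; regions and the
# unconditional background are sampled independently (hypothetical per-region model).
def sample_cpc(conditions, region_size=4, noise=0.1):
    # Works for ANY number of conditions: each one just adds one more factor.
    regions = [c + noise * rng.standard_normal(region_size) for c in conditions]
    background = noise * rng.standard_normal(region_size)
    return np.concatenate(regions + [background])

# Suppose train-time scenes had 2 conditions; "length generalization" = sampling with 4.
x = sample_cpc([1.0, 2.0, 3.0, 4.0])
print(x.shape)  # prints (20,)
```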
5. Lemma 1: CPC ⇒ LCS (exact)
If \(p(x \mid c_J)\) satisfies CPC, then its score is LCS.
Sketch: Take logs: \(\log p(x \mid c_J) = \log p(x_{M_J^c}\mid \varnothing) + \sum_{j\in J}\log p(x_{M_j}\mid c_j).\)
Differentiate w.r.t. pixel \(i\):
- If \(i \in M_j\), then: \(\nabla_{x_i}\log p(x\mid c_J) = \nabla_{x_i}\log p(x_{M_j}\mid c_j),\) so dependence is only on \(x_{M_j}\) and condition \(c_j\).
- If \(i \in M_J^c\), then the score depends only on the unconditional background term.
Thus compositional distributions have local conditional score structure.
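The lemma's sparsity pattern is easy to check numerically on a toy CPC distribution. Here each factor is a unit Gaussian (a hypothetical choice; any per-region density behaves the same way), so the score at pixel \(i \in M_j\) is simply \(c_j - x_i\); changing some other condition \(c_{j'}\) leaves the score on \(M_j\) and the background untouched:

```python
import numpy as np

# Toy CPC distribution: 6 "pixels", two conditions, each owning a 2-pixel region,
# plus a 2-pixel unconditional background.
M = {0: [0, 1], 1: [2, 3]}       # condition j -> its pixel region M_j
bg = [4, 5]

def score(x, c):
    # Gradient of log p(x|c_J) = sum_j log N(x_{M_j}; c_j, I) + log N(x_bg; 0, I).
    g = np.zeros_like(x)
    for j, pix in M.items():
        g[pix] = c[j] - x[pix]   # local: depends only on x_{M_j} and c_j
    g[bg] = -x[bg]
    return g

x = np.array([0.3, -1.2, 0.7, 0.1, 2.0, -0.5])
s1 = score(x, c={0: 1.0, 1: -2.0})
s2 = score(x, c={0: 1.0, 1: 5.0})   # change only condition 1
# Score on M_0 and the background is unchanged -> the LCS sparsity pattern.
print(s1[[0, 1, 4, 5]] == s2[[0, 1, 4, 5]])
```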
6. Relaxation: approximate CPC ⇒ approximate LCS, and “more compositional at high noise”
Real models are not perfectly CPC. The paper relaxes the lemma:
- If \(p(x\mid c)\) is approximately CPC, then the score is approximately LCS.
- The approximation becomes better at higher noise \(t\) (intuitively, noise washes out detailed interactions, leaving large-scale compositional structure).
This supports a diffusion-time decomposition:
- High noise: conditional dependencies dominate; global structure (object count/layout) is established.
- Low noise: pixel dependencies dominate; local unconditional denoising fills in details.
This explains why local conditional mechanisms can “set” the compositional structure early.
7. Feature-space extension: CPC/LCS after an invertible transform
Pixel-space compositionality often fails for prompts like “watercolor cat sushi” (style and content interact everywhere).
They propose: Let \(z = A(x)\) be an invertible feature transform. If the pushforward distribution \(A_\#p(z\mid c)\) is CPC, then the feature-space score is LCS.
This motivates “disentanglement” as CPC/LCS structure in a learned feature space.
7.1 Orthogonality heuristic for disentanglement
Define: \(\mu_i := \mathbb{E}_{z\sim A_\#p(\cdot\mid c_i)}[z], \quad \mu_b := \mathbb{E}_{z\sim A_\#p(\cdot\mid \varnothing)}[z], \quad d_i := \mu_i - \mu_b.\)
A necessary (not sufficient) condition for CPC is pairwise orthogonality: \(d_i^\top d_j = 0 \quad \forall i\neq j.\)
Practically they compute cosine similarity: \(\frac{d_i^\top d_j}{\|d_i\|\|d_j\|},\) where low off-diagonal similarity suggests feature-space disentanglement.
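The check itself is a one-liner once the concept directions are in hand. This sketch builds idealized, exactly-orthogonal directions \(d_i\) (hypothetical feature dimension and concept count) and computes the cosine-similarity matrix; in practice the \(d_i\) would come from empirical feature means of a real model, and one inspects how close the off-diagonal entries are to zero:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: feature dim 64, k = 4 concepts.
dim, k = 64, 4
# Idealized CPC case: concept directions d_i = mu_i - mu_b chosen exactly orthogonal
# via a QR factorization (D has orthonormal columns).
D, _ = np.linalg.qr(rng.standard_normal((dim, k)))
d = D.T                                  # rows are the directions d_i

# Pairwise cosine similarity; low off-diagonal values suggest disentanglement.
norms = np.linalg.norm(d, axis=1, keepdims=True)
cos = (d / norms) @ (d / norms).T
off_diag = cos - np.diag(np.diag(cos))
print(np.round(np.abs(off_diag).max(), 6))
```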
8. Causal evidence: enforcing LCS restores generalization
Experiment 3 performs a direct intervention:
- keep training distribution like Experiment 2,
- enforce architectural locality producing LCS-like score dependencies,
- observe restored length generalization.
Conclusion:
The local conditional score structure is not merely correlated with generalization; it is a causal mechanism.
9. Takeaway
Compositional generalization in conditional diffusion hinges on an inductive bias:
- representing conditional effects in a sparse / local way (LCS),
- which corresponds to a compositional factorization of the conditional distribution (CPC).
Synthesis: How the two papers complement each other
- Paper 1: compositional phenomena can arise even in a stripped-down identity task: optimization + data geometry create staged learning and Swing-by dynamics.
- Paper 2: compositionality depends on a mechanistic inductive bias in diffusion: local conditional score structure.
Together:
- Paper 1 explains when and in what order concept directions emerge under training dynamics.
- Paper 2 explains whether a diffusion system will actually realize an additive compositional mechanism, via locality/sparsity constraints in the conditional score.