Understanding and Enforcing Precise Control in Generative Models via Graph‑Based Attention

Soni, Achint

dc.contributor.author	Soni, Achint
dc.date.accessioned	2025-05-23T19:08:02Z
dc.date.available	2025-05-23T19:08:02Z
dc.date.issued	2025-05-23
dc.date.submitted	2025-05-22
dc.description.abstract	Generative models have significantly advanced in recent years, enabling unprecedented capabilities for data generation, manipulation, and editing. However, their practical applicability depends heavily on their ability to disentangle the underlying factors of variation, allowing precise and controllable modifications. This thesis explores disentanglement from two complementary perspectives: latent-space disentanglement in Variational Autoencoders (VAEs) and spatial disentanglement in diffusion-based text-guided image editing. In the first part of the thesis, we investigate the mechanisms behind disentanglement in VAEs. By proposing a local non-linear approximation of the VAE decoder, we provide a rigorous theoretical analysis that reveals orthogonality of the decoder's Jacobian as a fundamental condition for disentanglement. To support this finding, we introduce a quantitative measure termed the Orthogonality Deviation Score (OD-Score) and empirically demonstrate across multiple benchmark datasets (dSprites, 3D Faces, 3D Shapes, and MPI3D) that increased orthogonality directly corresponds to improved disentanglement as measured by established metrics such as Mutual Information Gap (MIG) and MIG-Sup. In the second part, we address the challenge of spatial disentanglement in text-guided image editing using diffusion models. Traditional diffusion-based methods rely primarily on cross-attention maps derived from textual prompts to determine regions for editing, often resulting in unintended alterations and compromised spatial coherence. To overcome this, we introduce LOCATEdit, a novel approach that refines attention maps using a graph-based regularization framework. LOCATEdit constructs a Cross and Self-Attention (CASA) graph, leveraging patch relationships derived from self-attention to promote spatial consistency and to constrain edits precisely within designated areas. 
Extensive evaluations on the PIE-Bench dataset illustrate that LOCATEdit achieves superior performance in localized editing tasks, substantially outperforming existing baselines in both semantic alignment and background preservation. Together, these contributions offer a unified understanding of disentanglement in generative modeling, bridging theoretical insights from latent-space analysis with practical advancements in spatially coherent, text-guided image editing. Ultimately, this thesis provides a principled foundation for developing interpretable, reliable, and highly controllable generative systems.
dc.identifier.uri	https://hdl.handle.net/10012/21779
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.title	Understanding and Enforcing Precise Control in Generative Models via Graph‑Based Attention
dc.type	Master Thesis
uws-etd.degree	Master of Mathematics
uws-etd.degree.department	David R. Cheriton School of Computer Science
uws-etd.degree.discipline	Computer Science
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0
uws.contributor.advisor	Rambhatla, Sirisha
uws.contributor.advisor	Clarke, Charles
uws.contributor.affiliation1	Faculty of Mathematics
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en
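
The abstract names orthogonality of the decoder's Jacobian as a condition for disentanglement, but this record does not give the formula for the OD-Score. The following is a minimal sketch of one plausible way to quantify how far a Jacobian's columns are from mutually orthogonal. It is not the thesis's definition: the function names, the finite-difference Jacobian, the mean-absolute-cosine measure, and the toy decoders are all assumptions introduced here for illustration.

```python
import numpy as np


def numerical_jacobian(decoder, z, eps=1e-5):
    """Central finite-difference Jacobian of `decoder` at `z`.

    Returns an array of shape (output_dim, latent_dim).
    """
    z = np.asarray(z, dtype=float)
    columns = []
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        columns.append((decoder(z + step) - decoder(z - step)) / (2 * eps))
    return np.stack(columns, axis=1)


def orthogonality_deviation(jacobian):
    """Mean absolute cosine between distinct Jacobian columns.

    Hypothetical measure (NOT the thesis's OD-Score): 0 means the
    columns are mutually orthogonal; larger values mean more overlap.
    """
    norms = np.linalg.norm(jacobian, axis=0, keepdims=True)
    unit = jacobian / np.maximum(norms, 1e-12)  # guard against zero columns
    gram = unit.T @ unit                        # pairwise cosines
    d = gram.shape[0]
    off_diagonal = gram[~np.eye(d, dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))


# Toy decoders (assumed for illustration): each latent drives its own
# output coordinate in the first, while latents overlap in the second.
W_ORTH = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
W_MIXED = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])


def toy_decoder_orth(z):
    return np.tanh(W_ORTH @ z)


def toy_decoder_mixed(z):
    return np.tanh(W_MIXED @ z)


z0 = np.array([0.3, -0.2])
score_orth = orthogonality_deviation(numerical_jacobian(toy_decoder_orth, z0))
score_mixed = orthogonality_deviation(numerical_jacobian(toy_decoder_mixed, z0))
```

Under these assumptions, `score_orth` is essentially zero because each latent dimension moves a separate output coordinate, while `score_mixed` is clearly positive because both latents move the first output coordinate. For a real VAE decoder, the Jacobian would more likely come from automatic differentiation than from finite differences.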

Files

Original bundle

Name: Soni_Achint.pdf
Size: 33.38 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission

Collections

Theses
Computer Science