Understanding and Enforcing Precise Control in Generative Models via Graph‑Based Attention

dc.contributor.author: Soni, Achint
dc.date.accessioned: 2025-05-23T19:08:02Z
dc.date.available: 2025-05-23T19:08:02Z
dc.date.issued: 2025-05-23
dc.date.submitted: 2025-05-22
dc.description.abstract: Generative models have significantly advanced in recent years, enabling unprecedented capabilities for data generation, manipulation, and editing. However, their practical applicability depends heavily on their ability to disentangle the underlying factors of variation, allowing precise and controllable modifications. This thesis explores disentanglement from two complementary perspectives: latent-space disentanglement in Variational Autoencoders (VAEs) and spatial disentanglement in diffusion-based text-guided image editing. In the first part of the thesis, we investigate the mechanisms behind disentanglement in VAEs. By proposing a local non-linear approximation of the VAE decoder, we provide a rigorous theoretical analysis that reveals orthogonality of the decoder's Jacobian as a fundamental condition for disentanglement. To support this finding, we introduce a quantitative measure termed the Orthogonality Deviation Score (OD-Score) and empirically demonstrate across multiple benchmark datasets (dSprites, 3D Faces, 3D Shapes, and MPI3D) that increased orthogonality directly corresponds to improved disentanglement as measured by established metrics such as Mutual Information Gap (MIG) and MIG-Sup. In the second part, we address the challenge of spatial disentanglement in text-guided image editing using diffusion models. Traditional diffusion-based methods rely primarily on cross-attention maps derived from textual prompts to determine regions for editing, often resulting in unintended alterations and compromised spatial coherence. To overcome this, we introduce LOCATEdit, a novel approach that refines attention maps using a graph-based regularization framework. LOCATEdit constructs a Cross and Self-Attention (CASA) graph, leveraging patch relationships derived from self-attention to promote spatial consistency and to constrain edits precisely within designated areas. Extensive evaluations on the PIE-Bench dataset illustrate that LOCATEdit achieves superior performance in localized editing tasks, substantially outperforming existing baselines in both semantic alignment and background preservation. Together, these contributions offer a unified understanding of disentanglement in generative modeling, bridging theoretical insights from latent-space analysis with practical advancements in spatially coherent, text-guided image editing. Ultimately, this thesis provides a principled foundation for developing interpretable, reliable, and highly controllable generative systems.
dc.identifier.uri: https://hdl.handle.net/10012/21779
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo [en]
dc.title: Understanding and Enforcing Precise Control in Generative Models via Graph‑Based Attention
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.embargo.terms: 0
uws.contributor.advisor: Rambhatla, Sirisha
uws.contributor.advisor: Clarke, Charles
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]
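
Illustrative note on the abstract's first claim (orthogonality of the decoder's Jacobian as a condition for disentanglement). This record does not give the formula for the OD-Score, so the sketch below is not the thesis's definition. It assumes one plausible proxy: the mean absolute cosine between distinct columns of a finite-difference Jacobian, where 0 means the latent directions act orthogonally on the output. The two toy decoders and all function names are hypothetical.

```python
import math

def jacobian_columns(decoder, z, eps=1e-5):
    """Central finite-difference Jacobian of decoder at z.

    Returns one column (a list of output derivatives) per latent dimension.
    """
    cols = []
    for i in range(len(z)):
        z_hi, z_lo = list(z), list(z)
        z_hi[i] += eps
        z_lo[i] -= eps
        f_hi, f_lo = decoder(z_hi), decoder(z_lo)
        cols.append([(a - b) / (2 * eps) for a, b in zip(f_hi, f_lo)])
    return cols

def orthogonality_deviation(cols):
    """Assumed proxy, NOT the thesis's OD-Score: mean |cosine| over column pairs.

    Assumes at least two columns and that no column is all zeros.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    total, pairs = 0.0, 0
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            norm = math.sqrt(dot(cols[i], cols[i]) * dot(cols[j], cols[j]))
            total += abs(dot(cols[i], cols[j])) / norm
            pairs += 1
    return total / pairs

def separated_decoder(z):
    # Hypothetical toy decoder: each latent drives its own output coordinates.
    return [math.sin(z[0]), 2.0 * z[0], math.tanh(z[1])]

def mixed_decoder(z):
    # Hypothetical toy decoder: both latents move nearly the same outputs.
    return [z[0] + z[1], z[0] + 0.9 * z[1], 0.1 * z[1]]

z = [0.3, -0.2]
print(orthogonality_deviation(jacobian_columns(separated_decoder, z)))
print(orthogonality_deviation(jacobian_columns(mixed_decoder, z)))
```

Under this proxy the first decoder scores near zero and the second scores close to one, which matches the direction of the relationship the abstract reports between orthogonality and disentanglement metrics such as MIG.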

Files

Original bundle

Name: Soni_Achint.pdf
Size: 33.38 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission