Towards Robust Control in Visual Generation and Manipulation

dc.contributor.advisor: Chen, Wenhu
dc.contributor.author: Ku, Max
dc.date.accessioned: 2024-09-16T16:18:04Z
dc.date.available: 2024-09-16T16:18:04Z
dc.date.issued: 2024-09-16
dc.date.submitted: 2024-09-06
dc.description.abstract: The rapid development of generative models has ushered in a new era in AI, particularly in conditional image synthesis. Since the rise of diffusion models, state-of-the-art models can generate images with high fidelity and diversity. This thesis works toward controllable generation and manipulation in the image and video domains through three studies: ImagenHub, which identifies the controllability of current state-of-the-art image synthesis models; VIEScore, which produces explainable metrics for image synthesis tasks; and AnyV2V, which performs precise video editing. The first part of this thesis addresses evaluation in the image domain. ImagenHub tackles the challenge of comparing current research to identify the best-performing methods, and it standardizes human-centered evaluation in image synthesis research. Complementarily, VIEScore is a new explainable metric that mimics human-like evaluation across all conditional image synthesis tasks using multimodal LLMs, tackling the scalability limitations of ImagenHub. The second part focuses on the video domain and introduces AnyV2V, the first framework to treat video editing as an image editing problem. It leverages the editing power of off-the-shelf image editing models and the generalization power of image-to-video models to perform precise video editing. This paradigm is training-free and supports video edits across a wide range of applications. Most importantly, we observed improved performance when plugging in stronger image-to-video models, highlighting AnyV2V's capacity for adaptive evolution. These studies form the basis of this thesis, driving toward robust control in visual generation and manipulation. Through a thorough analysis of ImagenHub and VIEScore, this research not only identifies the current capabilities and limitations of image synthesis models but also sets the stage for future advancements in evaluating them. With AnyV2V, we align the image editing and video editing problems through image-to-video models, laying the groundwork for making video editing more controllable and robust.
dc.identifier.uri: https://hdl.handle.net/10012/20996
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.relation.uri: https://github.com/TIGER-AI-Lab/ImagenHub
dc.relation.uri: https://github.com/TIGER-AI-Lab/VIEScore
dc.relation.uri: https://github.com/TIGER-AI-Lab/AnyV2V
dc.title: Towards Robust Control in Visual Generation and Manipulation
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Chen, Wenhu
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name:
KU_Max.pdf
Size:
23.21 MB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
6.4 KB
Description:
Item-specific license agreed upon to submission