Towards Robust Control in Visual Generation and Manipulation

dc.contributor.advisor: Chen, Wenhu
dc.contributor.author: Ku, Max
dc.date.accessioned: 2024-09-16T16:18:04Z
dc.date.available: 2024-09-16T16:18:04Z
dc.date.issued: 2024-09-16
dc.date.submitted: 2024-09-06
dc.description.abstract: The rapid development of generative models has ushered in a new era in AI, particularly in conditional image synthesis. Since the rise of diffusion models, state-of-the-art models can generate images with high fidelity and diversity. This thesis works toward controllable generation and manipulation in the image and video domains through three studies: ImagenHub, which identifies the controllability of current state-of-the-art image synthesis models; VIEScore, which produces explainable metrics for image synthesis tasks; and AnyV2V, which performs precise video editing. The first part of this thesis addresses evaluation in the image domain. ImagenHub tackles the challenge of comparing current research to identify the best-performing methods, and it standardizes human-centered evaluation in image synthesis research. Complementarily, VIEScore is a new explainable metric that mimics human-like evaluation across all conditional image synthesis tasks using multimodal LLMs, tackling the scalability limitations of ImagenHub. The second part focuses on the video domain and introduces AnyV2V, the first framework to treat video editing as an image editing problem. It leverages the editing power of off-the-shelf image editing models and the generalization power of image-to-video models to perform precise video editing. This paradigm is training-free and supports video edits across a wide range of applications. Most importantly, we observed improved performance when plugging in stronger image-to-video models, highlighting AnyV2V's capacity for adaptive evolution. These studies form the basis of this thesis, driving toward robust control in visual generation and manipulation. Through a thorough analysis of ImagenHub and VIEScore, this research not only identifies the current capabilities and limitations of image synthesis models but also sets the stage for future advancements in evaluating them. With AnyV2V, we align the image editing and video editing problems through image-to-video models, laying the groundwork for making video editing more controllable and robust.
dc.identifier.uri: https://hdl.handle.net/10012/20996
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.relation.uri: https://github.com/TIGER-AI-Lab/ImagenHub
dc.relation.uri: https://github.com/TIGER-AI-Lab/VIEScore
dc.relation.uri: https://github.com/TIGER-AI-Lab/AnyV2V
dc.title: Towards Robust Control in Visual Generation and Manipulation
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Chen, Wenhu
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name:
KU_Max.pdf
Size:
23.21 MB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
6.4 KB
Description:
Item-specific license agreed upon to submission