Perception-Aligned Representation Learning for 3D Visual Content

Tang, Sheyang
2026-05-28, 2026-05-28, 2026-05-28, 2026-05-20
https://hdl.handle.net/10012/23429

3D visual content is ubiquitous across applications such as digital humans, product visualization, film and gaming, and AR/VR. While advances in 3D acquisition, modeling, and rendering have greatly improved technical fidelity, the success of a 3D experience is ultimately judged by human observers. However, most existing 3D representations are designed for geometric accuracy or rendering efficiency, treating perceptual objectives as external post-rendering signals rather than properties modeled within the representation itself. This thesis studies perception-aligned representation learning for 3D visual content, aiming to learn 3D representations whose features and structure are shaped by human-centric signals so that perceptual goals become measurable and optimizable in downstream applications. We explore this idea across three settings and representation types, covering the evaluation, generation, and presentation stages of the 3D visual content pipeline: quality assessment of colored 3D meshes, controllable 3D generation with Implicit Neural Representations (INRs), and aesthetic camera viewpoint suggestion using 3D Gaussian Splatting (3DGS).
The first part focuses on perceptual quality evaluation of colored 3D meshes. Human judgments of mesh quality depend not only on geometric distortions and texture degradations individually, but also on their interactions, which are largely overlooked by existing Mesh Quality Assessment (MQA) methods. To address this, we propose HybridMQA, which learns perception-aligned quality representations by integrating topology-aware geometric learning with appearance cues from rendered images to model geometry–texture interplay. Across diverse datasets and distortion types, HybridMQA achieves superior assessment accuracy and provides interpretable localization of perceptually meaningful regions.
The second part studies controllable 3D content generation using INRs, where generating new content amounts to generating neural network parameters. Prior methods commonly use a single "flat" latent representation, which ignores the hierarchical structure of INRs and leads to entangled semantics and limited control during generation. We propose a representation learning framework that aligns hierarchical semantics with the layer-wise structure of INRs through layer-wise representations and cross-layer dependency modeling. Experiments demonstrate improved generation quality and interpretable control compared to prior generative INR baselines across 3D content and additional modalities.
The third part investigates aesthetic camera viewpoint suggestion for presenting 3D content, aiming to efficiently identify viewpoints with high aesthetic appeal. Existing approaches either provide limited adjustments for an anchor view without 3D understanding or rely on dense scene captures and costly exploration. We introduce the 3D aesthetic field, which distills 2D aesthetic knowledge into a feed-forward 3DGS representation for spatially grounded aesthetic reasoning given sparse input views. Combined with an efficient search pipeline, the proposed method identifies aesthetically appealing viewpoints with improved framing and composition quality in both quantitative and qualitative evaluations.
Overall, this thesis shows that perception-aligned representation learning enables direct measurement and optimization of perceptual objectives across multiple stages of the 3D visual content pipeline. Through perception-aware quality assessment, controllable generation, and aesthetic viewpoint suggestion, the proposed methods demonstrate how such alignment can improve the evaluation, generation, and presentation of 3D visual content.

en
computer vision, machine learning, representation learning
Doctoral Thesis