OneWorld: Taming Scene Generation with
3D Unified Representation Autoencoder

arXiv 2026
Sensen Gao1*, Zhaoqing Wang2*, Qihang Cao3, Dongdong Yu2, Changhu Wang2,
Tongliang Liu4,1†, Mingming Gong5,1†, Jiawang Bian6†
* Co-first authors    † Corresponding authors
1 MBZUAI   2 AISphere   3 SJTU   4 University of Sydney   5 University of Melbourne   6 NTU
OneWorld Teaser

(a) OneWorld generates 3DGS from a single view and renders novel views. (b) Architecture comparison: OneWorld generates directly in a unified 3D representation space, without compression or separate generation. (c) Performance comparison on WorldScore and DL3DV.

Abstract

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE): it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train–inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods.

Key Contributions

3D Unified Representation Autoencoder (3D-URAE)

Built upon pretrained 3D foundation models, 3D-URAE injects appearance and distills semantics to jointly encode geometry, appearance, and semantics into a unified 3D latent space for diffusion-based generation.
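The fusion step can be pictured as concatenating per-token geometry, appearance, and semantic features and projecting them into one latent space. The sketch below is a minimal illustration of that idea only; the function name `unified_latent`, the feature dimensions, and the single linear projection are all assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def unified_latent(geometry, appearance, semantics, w_proj):
    """Fuse per-token geometry, appearance, and semantic features into
    one unified 3D latent token set (illustrative sketch of 3D-URAE).

    geometry:   (N, Dg) tokens from a pretrained 3D foundation model
    appearance: (N, Da) injected appearance features
    semantics:  (N, Ds) distilled semantic features
    w_proj:     (Dg+Da+Ds, D) learned projection into the unified space
    Returns (N, D) unified latent tokens.
    """
    # Concatenate the three modalities per token, then project.
    fused = np.concatenate([geometry, appearance, semantics], axis=1)
    return fused @ w_proj
```

In practice the projection would be learned jointly with the autoencoder; the point of the sketch is only that all three modalities live in a single token-aligned latent space that the diffusion model can operate on directly.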

Cross-View Correspondence (CVC) Loss

A token-level correspondence-preserving loss for conditional diffusion training, explicitly enforcing structural alignment between target and conditioning views to improve cross-view consistency.
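One simple way to realize such a token-level loss is to pull matched token pairs together under cosine similarity. The snippet below is a hedged sketch under that assumption; the function name `cvc_loss`, the `matches` index format, and the choice of cosine distance are illustrative, not the paper's exact formulation.

```python
import numpy as np

def cvc_loss(target_tokens, cond_tokens, matches):
    """Token-level cross-view correspondence loss (illustrative sketch).

    target_tokens: (N, D) latent tokens of the target view
    cond_tokens:   (M, D) latent tokens of the conditioning view
    matches:       (K, 2) index pairs (i, j): target token i corresponds
                   to conditioning token j (e.g. from pixel matches)
    Returns the mean (1 - cosine similarity) over matched pairs.
    """
    t = target_tokens[matches[:, 0]]  # (K, D) matched target tokens
    c = cond_tokens[matches[:, 1]]    # (K, D) matched condition tokens
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    cos = np.sum(t * c, axis=1)       # per-pair cosine similarity
    return float(np.mean(1.0 - cos))
```

The loss is zero when corresponding tokens align exactly and grows as matched tokens drift apart, which is the behavior the structural-alignment objective needs.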

Manifold-Drift Forcing (MDF)

Identifies sampling drift from train–inference mismatch and trains the 3DGS decoder on mixed diffusion-sampled and ground-truth latents to shape a robust 3D manifold for stable sampling.
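The mixing itself can be sketched as a per-sample choice between a ground-truth latent and a diffusion-sampled (drifted) one. The function name `mdf_batch` and the Bernoulli mixing ratio below are assumptions for illustration; the paper's actual schedule may differ.

```python
import numpy as np

def mdf_batch(gt_latents, sampled_latents, drift_ratio=0.5, seed=0):
    """Manifold-Drift Forcing (illustrative sketch).

    For each sample in the batch, with probability `drift_ratio` replace
    the ground-truth latent with a diffusion-sampled (drifted) latent,
    so the 3DGS decoder is trained on both and learns a manifold that
    is robust to sampling drift at inference time.

    gt_latents, sampled_latents: (B, D) arrays
    Returns (mixed_latents, drift_mask).
    """
    rng = np.random.default_rng(seed)
    use_drift = rng.random(len(gt_latents)) < drift_ratio  # (B,) mask
    mixed = np.where(use_drift[:, None], sampled_latents, gt_latents)
    return mixed, use_drift
```

With `drift_ratio=0` this degenerates to ordinary autoencoder training on clean latents; raising it exposes the decoder to progressively more of the sampler's drifted distribution.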

Method Overview

OneWorld Method Overview

(a) Unified 3D representation space with appearance injection and semantic distillation. (b) Cross-view correspondence preservation during DiT training. (c) Manifold-drift forcing: mixing ground-truth 3D features with sampled features for robust 3D decoding.

Results

Novel View Synthesis (1-view conditioned)

Method            RealEstate10K              DL3DV-10K
                  PSNR↑   SSIM↑   LPIPS↓     PSNR↑   SSIM↑   LPIPS↓
LVSM              18.24   0.657   0.331      14.62   0.478   0.571
Gen3C             17.85   0.641   0.345      14.31   0.465   0.589
GF                18.92   0.678   0.298      15.18   0.512   0.521
Aether            19.45   0.695   0.275      15.67   0.535   0.498
FlashWorld        20.13   0.710   0.258      16.24   0.561   0.462
Gen3R             20.48   0.718   0.249      16.51   0.572   0.445
OneWorld (Ours)   21.57   0.735   0.231      17.19   0.589   0.418

Qualitative Comparison

Qualitative comparison on novel view generation

3DGS Visualization and Novel View Rendering

3DGS visualization

3D Scene Comparison

3D scene comparison

Comparison of 3D scenes: Gen3R uses point clouds, while FlashWorld and OneWorld use 3DGS.

Ablation Study

Effect of CVC and MDF

Variant       PSNR↑   SSIM↑   LPIPS↓
w/o CVC       19.10   0.682   0.284
w/o MDF       20.59   0.714   0.256
Full Model    21.57   0.735   0.231

Qualitative Ablation

Ablation visualization

Visualization of ablation study: full model vs. variants without CVC and without MDF.

Appearance Injection and Semantic Distillation

Ablation on appearance injection and semantic distillation

Effects of appearance injection (improved visual quality) and semantic distillation (structured feature space).

BibTeX

@misc{gao2026OneWorld,
      title={OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder},
      author={Sensen Gao and Zhaoqing Wang and Qihang Cao and Dongdong Yu and Changhu Wang and Tongliang Liu and Mingming Gong and Jiawang Bian},
      year={2026},
      eprint={2603.16099},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.16099},
}