arXiv 2026 · cs.CV

OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

OneWorld runs diffusion directly within a coherent 3D representation space, producing high-quality 3D scenes with superior cross-view consistency.

Sensen Gao1*, Zhaoqing Wang2*, Qihang Cao3, Dongdong Yu2, Changhu Wang2,
Tongliang Liu4,1†, Mingming Gong5,1†, Jiawang Bian6†
* Co-first authors   † Corresponding authors
1MBZUAI 2AISphere 3SJTU 4University of Sydney 5University of Melbourne 6NTU
OneWorld teaser: generation results, architecture comparison, performance comparison

(a) OneWorld generates 3DGS from a single view and renders novel views. (b) Architecture comparison: OneWorld generates directly in a unified 3D representation space, without compression or separate generation. (c) Performance comparison on WorldScore and DL3DV.

Abstract

Diffusion in a Coherent 3D Space

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train–inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods.

What's New

Key Contributions

3D-URAE

Unified Representation Autoencoder

Built upon pretrained 3D foundation models, 3D-URAE injects appearance and distills semantics to jointly encode geometry, appearance, and semantics into a unified 3D latent space for diffusion-based generation.

CVC Loss

Cross-View Correspondence

A token-level correspondence-preserving loss for conditional diffusion training, explicitly enforcing structural alignment between target and conditioning views to improve cross-view consistency.
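One way to picture a token-level correspondence loss is the minimal sketch below. It is not the paper's formulation: the function name `cvc_loss`, the cosine-distance penalty, and the `matches` index array (pairs of target and conditioning token indices assumed to observe the same 3D point) are all illustrative assumptions.

```python
import numpy as np


def cvc_loss(target_tokens, cond_tokens, matches, eps=1e-8):
    """Mean cosine distance between matched tokens (hypothetical sketch).

    target_tokens: (N_t, D) predicted tokens for the target view.
    cond_tokens:   (N_c, D) tokens of the conditioning view.
    matches:       (M, 2) integer pairs (target_idx, cond_idx); how these
                   correspondences are obtained is outside this sketch.
    """
    t = target_tokens[matches[:, 0]]
    c = cond_tokens[matches[:, 1]]
    # Normalize each token so the dot product is a cosine similarity.
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    c = c / (np.linalg.norm(c, axis=1, keepdims=True) + eps)
    cosine = np.sum(t * c, axis=1)
    # 0 when matched tokens point the same way, 2 when they are opposite.
    return float(np.mean(1.0 - cosine))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(16, 8))
    cond = rng.normal(size=(16, 8))
    pairs = np.stack([np.arange(16), np.arange(16)], axis=1)
    print("loss on random tokens:", cvc_loss(target, cond, pairs))
```

In a training loop, a term like this would presumably be added to the diffusion objective with a weighting factor; the weighting and the choice of distance are not specified in the text above.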

MDF

Manifold-Drift Forcing

Identifies sampling drift from train–inference mismatch and trains the 3DGS decoder on mixed diffusion-sampled and ground-truth latents to shape a robust 3D manifold for stable sampling.
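The mixing step can be sketched as follows. This is a hedged illustration, not the authors' procedure: the helper `mix_latents`, the per-sample Bernoulli selection, and the `drift_prob` parameter are assumptions, since the text only states that diffusion-sampled and ground-truth latents are mixed when training the 3DGS decoder.

```python
import numpy as np


def mix_latents(gt_latents, sampled_latents, drift_prob, rng):
    """Build a decoder training batch from ground-truth and drifted latents.

    gt_latents:      (B, ...) latents from the encoder (ground truth).
    sampled_latents: (B, ...) latents produced by diffusion sampling.
    drift_prob:      probability of using the sampled latent per example
                     (a hypothetical knob, not taken from the paper).
    Returns the mixed batch and the boolean selection mask.
    """
    batch = gt_latents.shape[0]
    use_drifted = rng.random(batch) < drift_prob
    # Reshape the mask so it broadcasts over all non-batch dimensions.
    mask = use_drifted.reshape((batch,) + (1,) * (gt_latents.ndim - 1))
    mixed = np.where(mask, sampled_latents, gt_latents)
    return mixed, use_drifted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = np.zeros((4, 3, 2))
    sampled = np.ones((4, 3, 2))
    mixed, chosen = mix_latents(gt, sampled, drift_prob=0.5, rng=rng)
    print("examples using drifted latents:", int(chosen.sum()))
```

Selecting whole examples keeps each latent internally consistent; an interpolation between the two latents would be another plausible reading of "mixing".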

How It Works

Method Overview

OneWorld method overview: 3D-URAE, cross-view correspondence, manifold-drift forcing

(a) Unified 3D representation space with appearance injection and semantic distillation. (b) Cross-view correspondence preservation during DiT training. (c) Manifold-drift forcing: mixing ground-truth 3D features with sampled features for robust 3D decoding.

Experiments

Results

OneWorld surpasses recent state-of-the-art 2D-based generators on novel view synthesis on both RealEstate10K and DL3DV-10K, leading on PSNR, SSIM, and LPIPS.

Novel View Synthesis (1-view conditioned)

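| Method | RealEstate10K PSNR ↑ | RealEstate10K SSIM ↑ | RealEstate10K LPIPS ↓ | DL3DV-10K PSNR ↑ | DL3DV-10K SSIM ↑ | DL3DV-10K LPIPS ↓ |
|---|---|---|---|---|---|---|
| LVSM | 18.24 | 0.657 | 0.331 | 14.62 | 0.478 | 0.571 |
| Gen3C | 17.85 | 0.641 | 0.345 | 14.31 | 0.465 | 0.589 |
| GF | 18.92 | 0.678 | 0.298 | 15.18 | 0.512 | 0.521 |
| Aether | 19.45 | 0.695 | 0.275 | 15.67 | 0.535 | 0.498 |
| FlashWorld | 20.13 | 0.710 | 0.258 | 16.24 | 0.561 | 0.462 |
| Gen3R | 20.48 | 0.718 | 0.249 | 16.51 | 0.572 | 0.445 |
| OneWorld (Ours) | 21.57 | 0.735 | 0.231 | 17.19 | 0.589 | 0.418 |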
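To make the margins in the table concrete, the short script below computes the gap between OneWorld and Gen3R, the strongest baseline on every metric. The numbers are copied from the table; the dictionary layout and the helper name `margins` are illustrative choices only.

```python
# Values copied from the 1-view novel view synthesis table above,
# ordered as (PSNR, SSIM, LPIPS).
RESULTS = {
    "Gen3R": {
        "RealEstate10K": (20.48, 0.718, 0.249),
        "DL3DV-10K": (16.51, 0.572, 0.445),
    },
    "OneWorld": {
        "RealEstate10K": (21.57, 0.735, 0.231),
        "DL3DV-10K": (17.19, 0.589, 0.418),
    },
}
METRICS = ("PSNR", "SSIM", "LPIPS")


def margins(ours, baseline):
    """Return ours minus baseline for each dataset and metric."""
    gaps = {}
    for dataset, ours_values in ours.items():
        base_values = baseline[dataset]
        gaps[dataset] = {
            name: round(o - b, 3)
            for name, o, b in zip(METRICS, ours_values, base_values)
        }
    return gaps


GAPS = margins(RESULTS["OneWorld"], RESULTS["Gen3R"])

if __name__ == "__main__":
    for dataset, gap in GAPS.items():
        print(dataset, gap)
```

Positive PSNR and SSIM gaps and negative LPIPS gaps favor OneWorld, since LPIPS is a lower-is-better metric.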

Qualitative Results

Qualitative comparison on novel view generation
3DGS visualization and novel view rendering
3D scene comparison

Analysis

Ablation Study

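Both Cross-View Correspondence and Manifold-Drift Forcing contribute measurable gains: on RealEstate10K, removing CVC lowers PSNR from 21.57 to 19.10 (a drop of 2.47), and removing MDF lowers it to 20.59 (a drop of 0.98).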

Effect of CVC and MDF (RealEstate10K)

| Variant | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| w/o CVC | 19.10 | 0.682 | 0.284 |
| w/o MDF | 20.59 | 0.714 | 0.256 |
| Full Model | 21.57 | 0.735 | 0.231 |

Qualitative Ablation

Ablation: full model vs without CVC and without MDF
Ablation on appearance injection and semantic distillation

Reference

BibTeX

@misc{gao2026OneWorld,
  title={OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder},
  author={Sensen Gao and Zhaoqing Wang and Qihang Cao and Dongdong Yu and Changhu Wang and Tongliang Liu and Mingming Gong and Jiawang Bian},
  year={2026},
  eprint={2603.16099},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.16099},
}