arXiv · coming soon · cs.CV
PixWorld

Unifying 3D Scene Generation and Reconstruction in Pixel Space

One end-to-end pixel-space diffusion model for both 3D reconstruction and generation — supervising a pixel-aligned 3D Gaussian field directly through differentiable rendering, with no VAE or RAE.

Sensen Gao¹*, Zhaoqing Wang²*, Qihang Cao¹, Dongdong Yu², Changhu Wang², Jia-Wang Bian¹†
* Co-first authors   † Corresponding author
¹Nanyang Technological University   ²AISphere
~0.6 s
Per scene · 4-step distilled (480P)
1041×
Faster than FantasyWorld · ~5× vs FlashWorld
71.04
WorldScore Avg

Demo

PixWorld in Action

3D reconstruction, text→3D and image→3D generation from a single unified model — rendered as explorable 3D Gaussian scenes.

Reconstruction · Text→3D · Image→3D · RGB + Depth · ~0.6 s / scene · 4 steps
PixWorld teaser: unified pixel-space diffusion for 3D reconstruction and generation

PixWorld unifies 3D scene reconstruction and generation in a single model. Unlike prior approaches that compute losses in the latent space of a VAE or RAE, PixWorld applies a flow-matching objective directly in pixel space over multi-view renderings, enabling end-to-end optimization of the underlying 3D representation — avoiding latent information loss and the cost of pretraining an autoencoder.

Abstract

One Model, in Pixel Space

3D reconstruction and generation are commonly tackled by separate paradigms: pixel-based regression for reconstruction, and latent diffusion for generation. Recent works attempt to unify them in latent space, but with notable drawbacks: the diffusion objective is defined on latent features rather than the underlying 3D representation, and both branches suffer from information loss introduced by latent encoding, while requiring a pretrained Variational Autoencoder (VAE) or Representation Autoencoder (RAE). In this paper, we reformulate these two tasks under a unified pixel-space diffusion paradigm and introduce PixWorld, a single model that jointly addresses 3D reconstruction and generation. By supervising diffusion directly on rendered images, PixWorld removes these limitations and aligns optimization with 3D scene fidelity. Beyond photometric and perceptual supervision that operates at the 2D image level and lacks 3D geometric awareness, we further introduce a geometry perception loss that aligns rendered views with their ground truth in the geometry-aware feature space of a pretrained 3D foundation model, providing 3D structural supervision. PixWorld consistently improves over prior latent-space generation methods and matches strong reconstruction methods, demonstrating the effectiveness of a unified pixel-space approach.

What's New

Key Contributions

Pixel-Space Diffusion

End-to-End, No VAE/RAE

An end-to-end pixel-space diffusion framework that supervises a pixel-aligned 3D Gaussian representation directly through multi-view differentiable rendering — eliminating the information loss and training cost of a latent autoencoder and aligning the diffusion signal with 3D scene fidelity.

Unification

Generation + Reconstruction

Multi-view inputs are partitioned into clean and noisy subsets: clean views drive reconstruction, while noisy ones are generated conditioned on the clean ones — both producing the same 3D Gaussian representation in a single forward pass.
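The clean/noisy partition can be pictured with a small sketch. It assumes a linear noising path x_t = (1 − t)·x + t·ε with one noise level per view, where clean views keep t = 0. The helper names `noise_views` and `task_mode` and the flat-list view format are illustrative only and are not taken from the PixWorld code.

```python
import random

def noise_views(views, clean_mask, t, rng):
    """Build model inputs: clean views untouched, the rest noised at level t.

    views: list of views, each a flat list of pixel values in [0, 1].
    clean_mask: list of bools, True where the view is a clean conditioning view.
    t: noise level in (0, 1] for the noisy views (assumed linear path).
    """
    inputs, timesteps = [], []
    for view, is_clean in zip(views, clean_mask):
        if is_clean:
            inputs.append(list(view))
            timesteps.append(0.0)
        else:
            noise = [rng.gauss(0.0, 1.0) for _ in view]
            inputs.append([(1.0 - t) * x + t * n for x, n in zip(view, noise)])
            timesteps.append(t)
    return inputs, timesteps

def task_mode(clean_mask):
    """All-clean input means reconstruction; any noisy view means generation."""
    return "reconstruction" if all(clean_mask) else "generation"

rng = random.Random(0)
views = [[0.2, 0.4, 0.6], [0.1, 0.3, 0.5], [0.9, 0.8, 0.7]]

# Reconstruction: every view is clean, so nothing is noised.
recon_inputs, recon_t = noise_views(views, [True, True, True], t=1.0, rng=rng)

# Image-to-3D style generation: one clean view conditions two noisy ones.
gen_inputs, gen_t = noise_views(views, [True, False, False], t=1.0, rng=rng)
```

Under this assumption the same network call covers both tasks; only the mask and the per-view noise levels change.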

Geometry Perception Loss

3D Structural Supervision

A geometry perception loss aligns rendered views with ground truth in the geometry-aware feature space of a frozen 3D foundation model (e.g. π³ / VGGT), providing 3D structural supervision beyond 2D photometric and perceptual losses.
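A rough sketch of what a feature-space loss of this kind looks like, under loud assumptions: the distance is taken to be a mean squared error (the page does not state the metric), and `toy_geometry_features` is a hypothetical stand-in for the frozen 3D foundation model so that the example runs without any weights.

```python
def toy_geometry_features(image):
    """Hypothetical stand-in for a frozen 3D foundation model encoder.

    Uses horizontal finite differences so the example needs no weights;
    a real setup would call the frozen network here instead.
    """
    return [[row[i + 1] - row[i] for i in range(len(row) - 1)] for row in image]

def geometry_perception_loss(rendered_views, gt_views, encoder=toy_geometry_features):
    """Mean squared feature distance between rendered and ground-truth views."""
    total, count = 0.0, 0
    for rendered, gt in zip(rendered_views, gt_views):
        feats_r, feats_g = encoder(rendered), encoder(gt)
        for row_r, row_g in zip(feats_r, feats_g):
            for a, b in zip(row_r, row_g):
                total += (a - b) ** 2
                count += 1
    return total / count

# One 2x3 "view" per list; values are illustrative.
gt = [[[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]]
shifted = [[[0.1, 0.6, 1.1], [0.1, 0.6, 1.1]]]  # same structure, uniformly brighter
flat = [[[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]]     # structure removed

loss_same = geometry_perception_loss(gt, gt)
loss_shifted = geometry_perception_loss(shifted, gt)
loss_flat = geometry_perception_loss(flat, gt)
```

With this toy encoder the loss ignores a uniform brightness shift but penalizes a rendering whose structure has collapsed. That is a property of the stand-in, not a claim about π³ or VGGT features; because the encoder is frozen, gradients would flow only into the rendered views.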

How It Works

Method Overview

Pixel-space vs latent-space reconstruction and generation

Pixel space vs. latent space. Prior latent methods compute losses in a VAE/RAE latent; PixWorld supervises diffusion directly in pixel space over rendered views, optimizing the 3D representation end-to-end for both reconstruction and generation.

PixWorld method overview: two-stream DiT, pixel-space flow matching, geometry perception loss

(a) A unified DiT-based framework takes noisy and clean multi-view inputs, with optional text, and jointly predicts depth and 3DGS through shared transformer blocks. (b) A pixel-space flow-matching loss is imposed on rendered multi-view images to directly optimize the 3D representation. (c) A geometry perception loss enforces structural consistency by aligning rendered views with ground truth through a frozen 3D foundation model.
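Step (b) can be illustrated with a minimal sketch. It assumes a linear flow-matching path x_t = (1 − t)·x + t·ε with target velocity ε − x, and it assumes the rendered image is treated as the model's clean-image estimate from which a velocity is derived. The page does not spell out the parametrization, so all function names and this choice are illustrative.

```python
import random

def interpolate(x, noise, t):
    """Linear flow-matching path: x_t = (1 - t) * x + t * noise."""
    return [(1.0 - t) * a + t * n for a, n in zip(x, noise)]

def velocity_from_render(x_t, rendered, t):
    """Velocity implied by a rendered clean-image estimate (assumed x-prediction)."""
    return [(a - r) / t for a, r in zip(x_t, rendered)]

def flow_matching_loss(x, rendered, noise, t):
    """Mean squared error between implied and target velocity, in pixel space."""
    x_t = interpolate(x, noise, t)
    target = [n - a for a, n in zip(x, noise)]  # d x_t / d t on the linear path
    predicted = velocity_from_render(x_t, rendered, t)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(x)

rng = random.Random(0)
x = [0.2, 0.5, 0.8, 0.1]                  # ground-truth pixels of one view
noise = [rng.gauss(0.0, 1.0) for _ in x]
rendered = [0.25, 0.45, 0.8, 0.2]         # stand-in for a rendering of the predicted 3DGS

loss = flow_matching_loss(x, rendered, noise, t=0.5)
perfect = flow_matching_loss(x, x, noise, t=0.5)
```

Under these assumptions the velocity error simplifies to (x − rendering) / t, so the loss is a 1/t²-weighted pixel error on the rendering itself. This is one way to see how a diffusion objective in pixel space can pass gradients through a differentiable renderer to the 3D representation.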

Method Explainer Video

Pixel-space flow matching · Clean / noisy partition · Two-stream DiT · Geometry perception loss

Experiments

Results

A single PixWorld model performs both 3D reconstruction and generation, evaluated on RealEstate10K, DL3DV-10K and WorldScore.

3D Reconstruction — Novel View Synthesis

                  RealEstate10K                                   DL3DV-10K
                  4 views                 8 views                 4 views                 8 views
Method            PSNR↑   SSIM↑   LPIPS↓  PSNR↑   SSIM↑   LPIPS↓  PSNR↑   SSIM↑   LPIPS↓  PSNR↑   SSIM↑   LPIPS↓
MVSplat           22.58   0.762   0.264   21.64   0.719   0.301   17.11   0.501   0.410   15.75   0.432   0.491
DepthSplat        25.16   0.832   0.194   27.77   0.872   0.154   20.38   0.719   0.320   19.26   0.692   0.360
AnySplat          20.07   0.731   0.286   20.52   0.752   0.262   20.11   0.671   0.318   20.02   0.664   0.327
YoNoSplat         25.83   0.841   0.143   28.35   0.889   0.107   23.05   0.711   0.228   21.92   0.678   0.262
PixWorld (Ours)   26.21   0.844   0.138   28.58   0.892   0.101   23.18   0.714   0.226   22.46   0.681   0.257

Best in cyan, second best underlined. Gen3R is a unified gen+recon model but only supports point-cloud reconstruction, so it is not directly comparable on NVS and is omitted here.
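For readers less familiar with the PSNR column, the standard definition is 10·log10(MAX² / MSE). The sketch below uses flat lists of pixel values in [0, 1]; the function name and data layout are illustrative, and the benchmark's exact evaluation protocol (resolution, per-image averaging) is not reproduced here.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
value = psnr([0.5, 0.5], [0.4, 0.6])
identical = psnr([0.3, 0.7], [0.3, 0.7])
```

Because the scale is logarithmic, each additional 10 dB corresponds to a tenfold reduction in mean squared error.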

PixWorld camera trajectories under reconstruction and generation settings

One model, both tasks. When all input views are clean, PixWorld reconstructs; when clean and noisy views are mixed, it generates. Blue and red frustums denote clean input views and generated views, respectively.

3D Scene Generation

                  Novel View Synthesis    Generation Quality                   Camera Control
Method            PSNR↑   SSIM↑   LPIPS↓  I2V Subj.↑  I2V BG↑  I.Q.↑  Aes.Q.↑  AUC@30↑  AUC@15↑  AUC@5↑
RealEstate10K
LVSM              17.82   0.603   0.336   0.971       0.970    0.593  0.506    0.710    0.592    0.372
GF                15.63   0.553   0.454   0.931       0.941    0.504  0.475    0.596    0.478    0.290
Gen3C             17.26   0.624   0.391   0.951       0.956    0.561  0.524    0.648    0.514    0.334
FlashWorld        16.51   0.626   0.403   0.958       0.960    0.615  0.550    0.843    0.758    0.546
Gen3R             17.59   0.631   0.382   0.974       0.971    0.552  0.536    0.633    0.433    0.147
PixWorld (Ours)   18.88   0.702   0.325   0.979       0.978    0.623  0.556    0.869    0.798    0.614
DL3DV-10K
LVSM              14.91   0.433   0.530   0.931       0.933    0.494  0.466    0.552    0.372    0.134
GF                12.69   0.356   0.591   0.898       0.910    0.474  0.435    0.491    0.338    0.113
Gen3C             15.58   0.514   0.479   0.927       0.933    0.532  0.496    0.552    0.377    0.128
FlashWorld        15.42   0.473   0.461   0.942       0.950    0.619  0.558    0.769    0.674    0.420
Gen3R             15.75   0.503   0.495   0.944       0.942    0.547  0.530    0.593    0.398    0.117
PixWorld (Ours)   16.50   0.527   0.449   0.952       0.956    0.631  0.567    0.793    0.706    0.485

Single-image generation, averaged over First-Frame and Bidirectional trajectories.

                  Novel View Synthesis    Generation Quality                   Camera Control
Method            PSNR↑   SSIM↑   LPIPS↓  I2V Subj.↑  I2V BG↑  I.Q.↑  Aes.Q.↑  AUC@30↑  AUC@15↑  AUC@5↑
RealEstate10K
LVSM              23.61   0.819   0.215   0.970       0.964    0.607  0.516    0.861    0.788    0.611
GF                18.27   0.647   0.353   0.925       0.939    0.507  0.464    0.630    0.473    0.223
Gen3C             20.12   0.714   0.300   0.948       0.947    0.567  0.518    0.698    0.538    0.255
FlashWorld        21.48   0.770   0.257   0.964       0.962    0.619  0.547    0.877    0.811    0.637
Gen3R             21.33   0.724   0.283   0.970       0.972    0.550  0.540    0.728    0.576    0.258
PixWorld (Ours)   23.54   0.815   0.210   0.974       0.974    0.628  0.561    0.880    0.817    0.649
DL3DV-10K
LVSM              19.18   0.589   0.343   0.915       0.917    0.533  0.502    0.740    0.609    0.374
GF                15.38   0.459   0.470   0.897       0.912    0.479  0.445    0.563    0.379    0.147
Gen3C             17.62   0.542   0.412   0.927       0.934    0.536  0.502    0.627    0.433    0.176
FlashWorld        18.27   0.562   0.359   0.938       0.948    0.600  0.558    0.802    0.714    0.514
Gen3R             18.05   0.558   0.392   0.942       0.944    0.535  0.530    0.726    0.560    0.245
PixWorld (Ours)   19.37   0.594   0.340   0.950       0.956    0.607  0.565    0.821    0.734    0.534

Two-view generation, averaged over Interpolation and Extrapolation configurations.

Method            Camera Control  Object Control  Content Align.  3D Consist.  Photo. Consist.  Style Consist.  Subj. Quality  Average
Wan-2.1           23.53           40.32           45.44           78.74        78.36            77.18           59.38          57.56
WonderJourney     84.60           37.10           35.54           80.60        79.03            62.82           66.56          63.75
LucidDreamer      88.93           41.18           75.00           90.37        90.20            48.10           58.99          70.40
FlashWorld        84.43           50.28           56.54           85.87        86.72            79.36           52.75          70.85
PixWorld (Ours)   91.08           46.25           55.27           91.39        93.84            67.11           52.36          71.04

WorldScore official static split (2000 scenes). PixWorld achieves the highest average among the compared methods (71.04), leading on camera control, 3D consistency and photometric consistency.
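The Average column is consistent with an unweighted mean of the seven sub-scores. The sketch below recomputes it from the numbers in the table above; the variable names are illustrative, and the equal weighting is inferred from the table rather than stated on this page.

```python
# Sub-scores in table order: Camera Control, Object Control, Content Align.,
# 3D Consist., Photo. Consist., Style Consist., Subj. Quality.
SCORES = {
    "Wan-2.1":         ([23.53, 40.32, 45.44, 78.74, 78.36, 77.18, 59.38], 57.56),
    "WonderJourney":   ([84.60, 37.10, 35.54, 80.60, 79.03, 62.82, 66.56], 63.75),
    "LucidDreamer":    ([88.93, 41.18, 75.00, 90.37, 90.20, 48.10, 58.99], 70.40),
    "FlashWorld":      ([84.43, 50.28, 56.54, 85.87, 86.72, 79.36, 52.75], 70.85),
    "PixWorld (Ours)": ([91.08, 46.25, 55.27, 91.39, 93.84, 67.11, 52.36], 71.04),
}

def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

recomputed = {name: mean(subs) for name, (subs, _) in SCORES.items()}
reported = {name: avg for name, (_, avg) in SCORES.items()}
best = max(reported, key=reported.get)

for name in SCORES:
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {reported[name]:.2f}")
```

The recomputed means agree with the reported averages to within rounding, which also shows how narrow the gap to FlashWorld is on the aggregate score.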

Qualitative Gallery

Qualitative comparison of generated novel views against baselines

Comparison with baselines. The large view on top is the input; the two smaller views below are novel views generated by each method.

Inference speed comparison: PixWorld distilled vs diffusion baselines

Inference speed. After distillation, the 4-step PixWorld generates a scene in ~0.6 s, up to ~1000× faster than diffusion-based world generators. Per-sample speedup of distilled PixWorld over each baseline: FantasyWorld 1041×, Gen3C 445×, Gen3R 148×, FlashWorld 5×.
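To put the ratios in absolute terms, the sketch below multiplies each quoted speedup by the ~0.6 s PixWorld figure. These are back-of-envelope values implied by the numbers above, not measured baseline timings, and they inherit the rounding of the ~0.6 s figure.

```python
PIXWORLD_SECONDS = 0.6  # approximate per-scene time quoted above
SPEEDUP = {"FantasyWorld": 1041, "Gen3C": 445, "Gen3R": 148, "FlashWorld": 5}

def implied_seconds(speedup, pixworld_seconds=PIXWORLD_SECONDS):
    """Per-scene time implied for a baseline by a quoted speedup factor."""
    return speedup * pixworld_seconds

implied = {name: implied_seconds(factor) for name, factor in SPEEDUP.items()}

for name, seconds in implied.items():
    print(f"{name}: roughly {seconds:.0f} s per scene ({seconds / 60:.1f} min)")
```

On these assumptions the slowest baseline would need on the order of ten minutes per scene, while FlashWorld stays within a few seconds.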

Analysis

Ablation Study

The geometry perception loss supplies the 3D structural signal that 2D objectives cannot — sharply improving fidelity and pose accuracy.

Effect of the Geometry Perception Loss (RealEstate10K, 1-view)

Variant                  PSNR↑   SSIM↑   LPIPS↓  I2V Subj.↑  I2V BG↑  I.Q.↑  Aes.Q.↑  AUC@30↑  AUC@15↑  AUC@5↑
Full model               19.12   0.717   0.310   0.972       0.975    0.619  0.561    0.886    0.813    0.642
w/o Geometry Perception  17.99   0.612   0.332   0.973       0.974    0.613  0.541    0.847    0.763    0.562

Removing the geometry perception loss drops PSNR by 1.13 dB, SSIM by 0.105 and AUC@5 by 0.080 (~12.5% relative), while 2D VBench-style scores barely move — confirming the loss targets 3D structure, not 2D appearance.
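The deltas quoted above follow directly from the two table rows. A short check, with illustrative variable names:

```python
FULL = {"PSNR": 19.12, "SSIM": 0.717, "LPIPS": 0.310, "I2V Subj.": 0.972,
        "I2V BG": 0.975, "I.Q.": 0.619, "Aes.Q.": 0.561,
        "AUC@30": 0.886, "AUC@15": 0.813, "AUC@5": 0.642}
ABLATED = {"PSNR": 17.99, "SSIM": 0.612, "LPIPS": 0.332, "I2V Subj.": 0.973,
           "I2V BG": 0.974, "I.Q.": 0.613, "Aes.Q.": 0.541,
           "AUC@30": 0.847, "AUC@15": 0.763, "AUC@5": 0.562}

def deltas(reference, variant):
    """Absolute change (variant minus reference) for every metric."""
    return {key: variant[key] - reference[key] for key in reference}

def relative_drop(reference, variant, key):
    """Fractional decrease of one metric relative to the reference value."""
    return (reference[key] - variant[key]) / reference[key]

change = deltas(FULL, ABLATED)
auc5_drop = relative_drop(FULL, ABLATED, "AUC@5")
largest_2d_shift = max(abs(change[k]) for k in ("I2V Subj.", "I2V BG", "I.Q.", "Aes.Q."))

print(f"PSNR {change['PSNR']:+.2f} dB, SSIM {change['SSIM']:+.3f}, "
      f"AUC@5 {change['AUC@5']:+.3f} ({auc5_drop:.1%} relative)")
```

Among the four 2D quality scores the largest shift is 0.020 (Aes.Q.), whereas SSIM moves by 0.105 and AUC@5 by 0.080.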

Qualitative ablation of the geometry perception loss

Qualitative ablation. Without the geometry perception loss, frames stay individually plausible but cross-view geometry drifts; the full model preserves consistent 3D structure.

Open Source

Release Plan

We are progressively open-sourcing PixWorld. The fast distilled model and inference code come first.

Within two weeks (early July 2026): release the PixWorld-480P-4steps distilled weights and inference code.

Reference

BibTeX

@misc{gao2026pixworld,
  title={PixWorld: Unifying 3D Scene Generation and Reconstruction in Pixel Space},
  author={Sensen Gao and Zhaoqing Wang and Qihang Cao and Dongdong Yu and Changhu Wang and Jia-Wang Bian},
  year={2026},
  eprint={2026.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://github.com/SensenGao/PixWorld},
}

arXiv ID will be added once the preprint is public.