A UNet that enhances the spatial understanding of the Stable Diffusion 2.1 text-to-image diffusion model. It delivers significant improvements when generating images with specific spatial relationships between objects, and is intended for prompts that state explicit spatial relationships (e.g., "a photo of A to the right of B").
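The UNet can be swapped into the standard `diffusers` pipeline in place of the stock SD 2.1 UNet. A minimal loading sketch; the repo id `CoMPaSS/compass-sd21-unet` is a hypothetical placeholder for this model's actual id:

```python
import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

# Load the CoMPaSS-enhanced UNet (repo id is a hypothetical placeholder).
unet = UNet2DConditionModel.from_pretrained(
    "CoMPaSS/compass-sd21-unet",
    torch_dtype=torch.float16,
)

# Drop it into the stock Stable Diffusion 2.1 pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=unet,
    torch_dtype=torch.float16,
).to("cuda")

# Prompts with explicit spatial relations are the intended use case.
image = pipe("a photo of a dog to the right of a suitcase").images[0]
image.save("dog_right_of_suitcase.png")
```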
## Training Details

### Training Data

- Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
- ~28,000 curated object pairs from COCO
- Enforces criteria for (illustrated by the sketch after this list):
  - Visual significance
  - Semantic distinction
  - Spatial clarity
  - Object relationships
  - Visual balance
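A purely illustrative sketch of what SCOP-style pair filtering could look like. The real engine's predicates and thresholds are not published in this card, so every field, check, and cutoff below is an assumption, shown only to make the five criteria concrete:

```python
from dataclasses import dataclass

@dataclass
class ObjectPair:
    area_a: float          # normalized box area of object A
    area_b: float          # normalized box area of object B
    category_a: str        # COCO category of object A
    category_b: str        # COCO category of object B
    iou: float             # box overlap between the two objects
    horizontal_gap: float  # signed normalized horizontal offset A -> B

def keep_pair(p: ObjectPair) -> bool:
    """Apply the five criteria as hypothetical predicate checks."""
    visually_significant = min(p.area_a, p.area_b) > 0.02    # both clearly visible
    semantically_distinct = p.category_a != p.category_b     # different categories
    spatially_clear = abs(p.horizontal_gap) > 0.10           # unambiguous relation
    clearly_related = p.iou < 0.30                           # not heavily occluded
    visually_balanced = (
        max(p.area_a, p.area_b) / min(p.area_a, p.area_b) < 5.0
    )                                                        # comparable sizes
    return all([visually_significant, semantically_distinct,
                spatially_clear, clearly_related, visually_balanced])
```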
### Training Process

- Trained for 80,000 steps
- Effective batch size: 4
- Learning rate: 5e-6
- Optimizer: AdamW with β₁ = 0.9, β₂ = 0.999
- Weight decay: 1e-2

(A configuration sketch follows this list.)
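A minimal sketch of the optimizer configuration listed above, assuming a standard PyTorch fine-tuning setup; only the hyperparameter values come from this card, and the tiny stand-in module exists just to keep the snippet self-contained:

```python
import torch
import torch.nn as nn

# Stand-in for the SD 2.1 UNet being fine-tuned; in practice the real model
# would be loaded via diffusers as shown in the usage example above.
unet = nn.Linear(8, 8)

optimizer = torch.optim.AdamW(
    unet.parameters(),
    lr=5e-6,             # learning rate from this card
    betas=(0.9, 0.999),  # β₁, β₂
    weight_decay=1e-2,   # weight decay
)
```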
## Evaluation Results

| Metric | Stable Diffusion 1.4 | +CoMPaSS |
|---|---|---|
| VISOR uncond (⬆️) | 30.25% | 62.06% |
| T2I-CompBench Spatial (⬆️) | 0.13 | 0.32 |
| GenEval Position (⬆️) | 0.07 | 0.51 |
| FID (⬇️) | 21.65 | 16.96 |
| CMMD (⬇️) | 0.6472 | 0.4083 |
## Citation

If you use this model in your research, please cite:

```bibtex
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}
```