Panoramic Affordance Prediction

The First Exploration into Panoramic Affordance Prediction:
1) PAP-12K, the first large-scale benchmark dataset.
2) PAP, a training-free pipeline achieving SOTA accuracy.

1HKUST(GZ), 2HKUST, 3SJTU, 4MBZUAI, 5UniTrento, 6Knowin
*Equal contribution     †Corresponding author
12K+
Affordance Prediction Tasks
11904×5952
Ultra-High-Res Native Panorama
71.56 gIoU
State-of-the-Art Accuracy
PAP Teaser
  • Why: Existing affordance prediction methods focus on pinhole cameras, whose narrow FoV yields fragmented context. Panoramic cameras, which capture global spatial relationships and support holistic scene understanding, offer a natural solution to this bottleneck.
  • What: We built PAP-12K, a real-world 12K panoramic benchmark dataset with dense affordance task annotations and panoramic-specific challenges.
  • How: Our pipeline PAP combines Recursive Visual Routing, Adaptive Gaze, and Cascaded Grounding to address panoramic challenges, achieving robust zero-shot performance.

PAP-12K Dataset

PAP-12K Statistics
Real-world only: native UHR 360° (11904×5952) captures from various real-world scenes.
Rich and high-quality labels: reasoning-oriented QA aligned with pixel-level masks.
Panorama-specific challenges: distortion, scale variation, and boundary discontinuity.
PAP-12K Challenges

Dataset Preview

The preview shows one sample scene per scene type from PAP-12K; click the object buttons to inspect their segmentation masks and affordance questions.


PAP Framework

PAP Pipeline

PAP is a training-free coarse-to-fine pipeline inspired by human foveal vision.

1) Recursive Visual Routing

Prompt-guided zoom-in search localizes candidate actionable regions efficiently in ultra-high-res panoramas.
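The zoom-in search can be sketched as a recursive tiling of the panorama, where a relevance scorer (standing in for the prompt-queried VLM; the function names and thresholds below are illustrative assumptions, not PAP's actual interfaces) decides which sub-regions to descend into:

```python
# Minimal sketch of prompt-guided recursive zoom-in routing (hypothetical
# interfaces; the real PAP routing uses a VLM to score candidate regions).

def recursive_route(region, score_fn, min_size=1024, depth=0, max_depth=4):
    """Recursively split a region, descending only into promising sub-regions.

    region:   (x, y, w, h) in panorama pixel coordinates.
    score_fn: callable mapping a region to a relevance score in [0, 1]
              (stands in for a VLM queried with the affordance prompt).
    """
    x, y, w, h = region
    if depth >= max_depth or max(w, h) <= min_size:
        return [region]  # fine enough: hand off to grounding
    # Split into a 2x2 grid of sub-regions.
    subs = [(x + i * w // 2, y + j * h // 2, w // 2, h // 2)
            for i in (0, 1) for j in (0, 1)]
    # Route only into sub-regions the scorer deems relevant.
    keep = [s for s in subs if score_fn(s) > 0.5]
    if not keep:  # nothing relevant below this level: stop here
        return [region]
    out = []
    for s in keep:
        out.extend(recursive_route(s, score_fn, min_size, depth + 1, max_depth))
    return out

# Toy scorer: pretend the target lies near (9000, 3000) in an 11904x5952 ERP.
def toy_score(region):
    x, y, w, h = region
    cx, cy = x + w / 2, y + h / 2
    return 1.0 if abs(cx - 9000) < w and abs(cy - 3000) < h else 0.0

regions = recursive_route((0, 0, 11904, 5952), toy_score)
```

Because only relevant branches are expanded, the search touches a small fraction of the 11904×5952 pixels instead of scanning the full panorama at native resolution.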

2) Adaptive Gaze

Local spherical regions are reprojected to perspective views to remove ERP distortion and boundary artifacts.

3) Cascaded Grounding

Open-vocabulary detection + segmentation extract precise instance-level affordance masks on rectified patches.
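The cascade can be sketched as: an open-vocabulary detector proposes boxes for the affordance phrase, and each surviving box prompts a segmenter for an instance mask. The function names and thresholds below are hypothetical stand-ins, not PAP's actual model APIs:

```python
# Hypothetical sketch of a detect-then-segment cascade on a rectified patch.
# `open_vocab_detect` and `promptable_segment` stand in for off-the-shelf
# open-vocabulary detector / promptable segmenter models.

def cascaded_grounding(patch, phrase, open_vocab_detect, promptable_segment,
                       box_thresh=0.3):
    """Boxes from an open-vocabulary detector prompt a segmenter for masks."""
    boxes = [b for b in open_vocab_detect(patch, phrase)
             if b["score"] >= box_thresh]
    # Each surviving box becomes a spatial prompt for an instance-level mask.
    return [promptable_segment(patch, box=b["xyxy"]) for b in boxes]

# Dummy stand-ins so the sketch runs end to end.
def fake_detect(patch, phrase):
    return [{"xyxy": (4, 4, 20, 20), "score": 0.8},
            {"xyxy": (0, 0, 2, 2), "score": 0.1}]  # filtered by the threshold

def fake_segment(patch, box):
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(32)]
            for y in range(32)]

masks = cascaded_grounding(None, "a handle you could pull",
                           fake_detect, fake_segment)
```

Running the cascade on rectified patches rather than the raw ERP means both models see ordinary perspective imagery, which is what they were trained on.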

Key Results

Quantitative Snapshot

Method           gIoU↑   cIoU↑   P50     P50-95
OV-Seg           29.48   17.85   32.00   18.80
LISA             15.21   16.34   13.66    8.30
VisionReasoner   49.33   44.64   51.06   38.06
AffordanceVLM     9.66   13.11    8.96    5.41
Affordance-R1    51.80   50.32   55.47   40.70
A4-Agent         62.55   49.97   67.09   54.28
PAP (Ours)       71.56   62.30   75.49   64.97
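Under the common segmentation convention (an assumption here, not a statement about PAP's evaluation code), gIoU is the mean of per-sample IoUs and cIoU is cumulative intersection over cumulative union across the dataset:

```python
import numpy as np

# Sketch of the two IoU metrics under their common segmentation convention
# (gIoU: mean of per-sample IoUs; cIoU: cumulative intersection / cumulative
# union over the dataset) -- an assumed convention, not PAP's exact code.

def giou_ciou(preds, gts):
    ious, inter_sum, union_sum = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 1.0)
        inter_sum += inter
        union_sum += union
    return float(np.mean(ious)), inter_sum / union_sum

# Toy example with two 4x4 binary masks.
g = np.zeros((4, 4), bool); g[:2, :2] = True   # 4-pixel ground truth
p = np.zeros((4, 4), bool); p[:2, :3] = True   # 6-pixel prediction
giou, ciou = giou_ciou([p, g], [g, g])         # giou = 5/6, ciou = 0.8
```

The two disagree whenever mask sizes vary: cIoU weights large masks more heavily, while gIoU treats every sample equally.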

Qualitative Results


Qualitative visualization of our method. PAP demonstrates superior reasoning and grounding ability in 360° panoramas, effectively interpreting complex queries to accurately localize actionable regions despite geometric distortions and massive background clutter.

BibTeX

@article{zhang2026pap,
  title={Panoramic Affordance Prediction},
  author={Zhang, Zixin and Liao, Chenfei and Zhang, Hongfei and Chen, Harold Haodong and Chen, Kanghao and Wen, Zichen and Guo, Litao and Ren, Bin and Zheng, Xu and Li, Yinchuan and Hu, Xuming and Sebe, Nicu and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2603.15558},
  year={2026}
}