Panoramic Affordance Prediction

The First Exploration into Panoramic Affordance Prediction:
1) PAP-12K, the first large-scale benchmark dataset.
2) PAP, a training-free pipeline achieving SOTA accuracy.

1HKUST(GZ), 2HKUST, 3SJTU, 4MBZUAI, 5UniTrento, 6Knowin
*Equal contribution     †Corresponding author
12K+
Affordance Prediction Tasks
11904×5952
Ultra-High-Res Native Panorama
71.56 gIoU
State-of-the-Art Accuracy
PAP Teaser
  • Why: Existing affordance prediction methods focus on pinhole cameras, whose narrow FoV yields fragmented context. Panoramic cameras, which capture global spatial relationships and support holistic scene understanding, offer a natural solution to this bottleneck.
  • What: We built PAP-12K, a real-world 12K panoramic benchmark dataset with dense affordance task annotations and panoramic-specific challenges.
  • How: Our pipeline PAP combines Recursive Visual Routing, Adaptive Gaze, and Cascaded Grounding to address panoramic challenges, achieving robust zero-shot performance.

PAP-12K Dataset

PAP-12K Statistics
Real-world only: native UHR 360° (11904×5952) captures from various real-world scenes.
Rich and high-quality labels: reasoning-oriented QA aligned with pixel-level masks.
Panorama-specific challenges: distortion, scale variation, and boundary discontinuity.
PAP-12K Challenges

Dataset Preview

The preview shows one sample scene per scene type from PAP-12K; click the object buttons to inspect their segmentation masks and affordance questions.


PAP Framework

PAP Pipeline

PAP is a training-free coarse-to-fine pipeline inspired by human foveal vision.

1) Recursive Visual Routing

Prompt-guided zoom-in search localizes candidate actionable regions efficiently in ultra-high-res panoramas.
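The zoom-in search can be sketched as a recursive tiling of the panorama, where a relevance scorer (standing in for the prompt-queried VLM; the function names and thresholds below are illustrative assumptions, not PAP's actual interfaces) decides which sub-regions to descend into:

```python
# Minimal sketch of prompt-guided recursive zoom-in routing (hypothetical
# interfaces; the real PAP routing uses a VLM to score candidate regions).

def recursive_route(region, score_fn, min_size=1024, depth=0, max_depth=4):
    """Recursively split a region, descending only into promising sub-regions.

    region:   (x, y, w, h) in panorama pixel coordinates.
    score_fn: callable mapping a region to a relevance score in [0, 1]
              (stands in for a VLM queried with the affordance prompt).
    """
    x, y, w, h = region
    if depth >= max_depth or max(w, h) <= min_size:
        return [region]  # fine enough: hand off to grounding
    # Split into a 2x2 grid of sub-regions.
    subs = [(x + i * w // 2, y + j * h // 2, w // 2, h // 2)
            for i in (0, 1) for j in (0, 1)]
    # Route only into sub-regions the scorer deems relevant.
    keep = [s for s in subs if score_fn(s) > 0.5]
    if not keep:  # nothing relevant below this level: stop here
        return [region]
    out = []
    for s in keep:
        out.extend(recursive_route(s, score_fn, min_size, depth + 1, max_depth))
    return out

# Toy scorer: pretend the target lies near (9000, 3000) in an 11904x5952 ERP.
def toy_score(region):
    x, y, w, h = region
    cx, cy = x + w / 2, y + h / 2
    return 1.0 if abs(cx - 9000) < w and abs(cy - 3000) < h else 0.0

regions = recursive_route((0, 0, 11904, 5952), toy_score)
```

Because only relevant branches are expanded, the search touches a small fraction of the 11904×5952 pixels instead of scanning the full panorama at native resolution.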

2) Adaptive Gaze

Local spherical regions are reprojected to perspective views to remove ERP distortion and boundary artifacts.

3) Cascaded Grounding

Open-vocabulary detection + segmentation extract precise instance-level affordance masks on rectified patches.
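The cascade can be sketched as: an open-vocabulary detector proposes boxes for the affordance phrase, and each surviving box prompts a segmenter for an instance mask. The function names and thresholds below are hypothetical stand-ins, not PAP's actual model APIs:

```python
# Hypothetical sketch of a detect-then-segment cascade on a rectified patch.
# `open_vocab_detect` and `promptable_segment` stand in for off-the-shelf
# open-vocabulary detector / promptable segmenter models.

def cascaded_grounding(patch, phrase, open_vocab_detect, promptable_segment,
                       box_thresh=0.3):
    """Boxes from an open-vocabulary detector prompt a segmenter for masks."""
    boxes = [b for b in open_vocab_detect(patch, phrase)
             if b["score"] >= box_thresh]
    # Each surviving box becomes a spatial prompt for an instance-level mask.
    return [promptable_segment(patch, box=b["xyxy"]) for b in boxes]

# Dummy stand-ins so the sketch runs end to end.
def fake_detect(patch, phrase):
    return [{"xyxy": (4, 4, 20, 20), "score": 0.8},
            {"xyxy": (0, 0, 2, 2), "score": 0.1}]  # filtered by the threshold

def fake_segment(patch, box):
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(32)]
            for y in range(32)]

masks = cascaded_grounding(None, "a handle you could pull",
                           fake_detect, fake_segment)
```

Running the cascade on rectified patches rather than the raw ERP means both models see ordinary perspective imagery, which is what they were trained on.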

Key Results

Quantitative Snapshot

Method           gIoU↑   cIoU↑   P50     P50-95
OV-Seg           29.48   17.85   32.00   18.80
LISA             15.21   16.34   13.66    8.30
VisionReasoner   49.33   44.64   51.06   38.06
AffordanceVLM     9.66   13.11    8.96    5.41
Affordance-R1    51.80   50.32   55.47   40.70
A4-Agent         62.55   49.97   67.09   54.28
PAP (Ours)       71.56   62.30   75.49   64.97
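Under the common segmentation convention (an assumption here, not a statement about PAP's evaluation code), gIoU is the mean of per-sample IoUs and cIoU is cumulative intersection over cumulative union across the dataset:

```python
import numpy as np

# Sketch of the two IoU metrics under their common segmentation convention
# (gIoU: mean of per-sample IoUs; cIoU: cumulative intersection / cumulative
# union over the dataset) -- an assumed convention, not PAP's exact code.

def giou_ciou(preds, gts):
    ious, inter_sum, union_sum = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 1.0)
        inter_sum += inter
        union_sum += union
    return float(np.mean(ious)), inter_sum / union_sum

# Toy example with two 4x4 binary masks.
g = np.zeros((4, 4), bool); g[:2, :2] = True   # 4-pixel ground truth
p = np.zeros((4, 4), bool); p[:2, :3] = True   # 6-pixel prediction
giou, ciou = giou_ciou([p, g], [g, g])         # giou = 5/6, ciou = 0.8
```

The two disagree whenever mask sizes vary: cIoU weights large masks more heavily, while gIoU treats every sample equally.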

Qualitative Results


Qualitative visualization of our method. PAP demonstrates superior reasoning and grounding ability in 360° panoramas, effectively interpreting complex queries to accurately localize actionable regions despite geometric distortions and massive background clutter.

BibTeX

@article{zhang2026pap,
  title={Panoramic Affordance Prediction},
  author={Zhang, Zixin and Liao, Chenfei and Zhang, Hongfei and Chen, Harold Haodong and Chen, Kanghao and Wen, Zichen and Guo, Litao and Ren, Bin and Zheng, Xu and Li, Yinchuan and Hu, Xuming and Sebe, Nicu and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2603.15558},
  year={2026}
}