Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding.
To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12K, 11904×5952) panoramic images with over 12,000 carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images.
PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation when confronted with the unique challenges of panoramic vision. In contrast, our framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the potential of panoramic perception for robust embodied intelligence.
We introduce PAP-12K, the first large-scale benchmark dataset dedicated to panoramic affordance prediction. Unlike synthetic or web-crawled datasets, all images in PAP-12K were natively captured in real-world environments using professional 360° cameras. It features three key characteristics:
1) 100% Real-World & Ultra-High Resolution: Captured at a massive 11904×5952 resolution, preserving authentic geometric distortions, real-world lighting conditions, and natural object scales.
2) Rich QA Pairs and Masks: Over 12,000 carefully annotated question-answer pairs that require complex reasoning, each associated with a precise pixel-level segmentation mask.
3) Panoramic-Specific Challenges: Designed to explicitly incorporate the inherent challenges of 360° imagery and Equirectangular Projection (ERP), including Geometric Distortion, Extreme Scale Variations, and Boundary Discontinuity.
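The geometric distortion listed in 3) has a simple closed form: in ERP, the image row at latitude `lat` represents a circle of circumference 2π·cos(lat) on the sphere yet is stored at the full image width, so pixels are stretched horizontally by 1/cos(lat), diverging toward the poles. A minimal NumPy sketch of this stretch profile (the helper name is ours; the height 5952 matches a PAP-12K image):

```python
import numpy as np

def erp_row_stretch(height: int) -> np.ndarray:
    """Horizontal stretch factor of each ERP row relative to the equator.

    A row at latitude `lat` covers a sphere circle of circumference
    2*pi*cos(lat) but is stored at full image width, so its pixels are
    stretched by 1/cos(lat).
    """
    # Latitude of each row centre, from +pi/2 (top) to -pi/2 (bottom).
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    return 1.0 / np.maximum(np.cos(lat), 1e-6)

stretch = erp_row_stretch(5952)             # row count of a PAP-12K image
print(round(float(stretch[5952 // 2]), 2))  # equator row → 1.0 (no stretch)
print(float(stretch[300]) > 5)              # near-pole row → True (heavy stretch)
```

This divergence is why uniform sliding-window processing of an ERP image wastes effort near the poles and why perspective-trained detectors misfire there.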
To tackle the unique challenges in 360° scenes, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system. Humans do not process a 360° environment with uniform high acuity; rather, we use our peripheral vision to generally locate regions of interest, direct our gaze to obtain a clear view, and finally perform detailed parsing. Mirroring this mechanism, our framework operates in three primary stages:
1) Recursive Visual Routing via Grid Prompting: To handle extreme scale variations and resolution burdens, we progressively guide Vision-Language Models (VLMs) using numerical grid prompting to dynamically "zoom in" and efficiently locate the general area of target tools based on intent.
2) Adaptive Gaze: Once the target region is coarsely localized, the adaptive gaze re-projects the corresponding spherical region onto a tailored perspective plane, effectively acting as a domain adapter that eliminates the geometric distortion and boundary discontinuities inherent to ERP images.
3) Cascaded Affordance Grounding: Finally, with a distortion-free local patch secured, we deploy robust 2D visual foundation models (an Open-Vocabulary Detector and the Segment Anything Model) strictly within the rectified local patch to extract precise, instance-level affordance masks.
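At its core, the Adaptive Gaze step is a gnomonic (tangent-plane) re-projection of a viewing direction out of the ERP image. The sketch below is not the authors' implementation, just a minimal nearest-neighbour version of that re-projection; the function name, default FoV, and output size are our illustrative choices:

```python
import numpy as np

def erp_to_perspective(erp: np.ndarray, yaw: float, pitch: float,
                       fov_deg: float = 90.0, out_size: int = 512) -> np.ndarray:
    """Re-project the view direction (yaw, pitch, in radians) of an ERP
    image onto a tangent perspective plane (nearest-neighbour sampling)."""
    H, W = erp.shape[:2]
    f = (out_size / 2) / np.tan(np.radians(fov_deg) / 2)  # focal length, px
    # Pixel grid on the tangent plane, centred on the optical axis.
    u, v = np.meshgrid(np.arange(out_size) - out_size / 2,
                       np.arange(out_size) - out_size / 2)
    # Rays in camera coordinates (x right, y down, z forward), normalised.
    d = np.stack([u, v, np.full_like(u, f, dtype=float)], axis=-1)
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    # Rotate rays by pitch (about x) then yaw (about y).
    cp, sp, cy, sy = np.cos(pitch), np.sin(pitch), np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    d = d @ (Ry @ Rx).T
    # Ray direction -> spherical angles -> ERP pixel coordinates.
    lon = np.arctan2(d[..., 0], d[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))    # [-pi/2, pi/2]
    x = ((lon / np.pi + 1) / 2 * W).astype(int) % W
    y = np.clip(((lat / (np.pi / 2) + 1) / 2 * H).astype(int), 0, H - 1)
    return erp[y, x]
```

The wrap-around in `x` (modulo `W`) is what resolves the ERP boundary discontinuity: a target straddling the left and right image edges becomes contiguous in the re-projected patch, so the downstream detector and segmenter never see the seam.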
We evaluate our method against state-of-the-art vision-language models and affordance reasoning frameworks on the newly proposed PAP-12K dataset. Our training-free pipeline effectively overcomes the extreme resolution and distortion challenges of panoramic images, achieving superior zero-shot performance.
| Method | gIoU↑ | cIoU↑ | P50↑ | P50-95↑ | Inference Time↓ |
|---|---|---|---|---|---|
| OV-Seg | 29.48 | 17.85 | 32.00 | 18.80 | ~8s |
| LISA | 15.21 | 16.34 | 13.66 | 8.30 | ~7s |
| VisionReasoner | 49.33 | 44.64 | 51.06 | 38.06 | ~12s |
| AffordanceVLM | 9.66 | 13.11 | 8.96 | 5.41 | ~7.8s |
| Affordance-R1 | 51.80 | 50.32 | 55.47 | 40.70 | ~10.4s |
| A4-Agent | 62.55 | 49.97 | 67.09 | 54.28 | ~11.8s |
| PAP (Ours) | 71.56 | 62.30 | 75.49 | 64.97 | ~10s |
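For reference, gIoU, cIoU, and the precision columns are conventionally defined as in the referring-segmentation literature: mean per-sample IoU, dataset-cumulative IoU, and the fraction of samples whose IoU exceeds a threshold. The sketch below assumes those standard definitions (the paper's P50-95 averaging may differ in detail):

```python
import numpy as np

def seg_metrics(preds, gts):
    """gIoU / cIoU / P50 / P50-95 over paired lists of binary masks,
    using the common referring-segmentation definitions (a sketch,
    not the benchmark's official evaluation code)."""
    ious, inter_sum, union_sum = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 1.0)
        inter_sum += inter
        union_sum += union
    ious = np.array(ious)
    return {
        "gIoU": float(ious.mean()),            # mean per-sample IoU
        "cIoU": float(inter_sum / union_sum),  # dataset-cumulative IoU
        "P50": float((ious > 0.5).mean()),     # fraction with IoU > 0.5
        # P50-95: precision averaged over thresholds 0.50, 0.55, ..., 0.95
        "P50-95": float(np.mean([(ious > t).mean()
                                 for t in np.arange(0.5, 1.0, 0.05)])),
    }
```

Because cIoU pools intersections and unions over the whole dataset, it is dominated by large masks, whereas gIoU weights every sample equally; reporting both, as the table does, exposes methods that only succeed on large, easy targets.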
Qualitative visualization of our method. PAP demonstrates superior reasoning and grounding ability in 360° panoramas, effectively interpreting complex queries to accurately localize actionable regions despite geometric distortions and massive background clutter.