Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on supervised training on annotated datasets, which leads to poor generalization to novel objects and unseen environments.
In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is.
By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
Affordance prediction fundamentally requires two complementary capabilities: high-level reasoning (interpreting instructions and identifying relevant parts) and low-level grounding (precisely localizing these parts).
Prevailing end-to-end models tightly couple these processes into a single monolithic pipeline. This design introduces several issues, including a trade-off between reasoning and grounding, limited generalization to novel objects, and reduced flexibility. We question whether entangling these capabilities is the right path. Instead, we propose A4-Agent, which decouples reasoning and grounding, allowing for the coordination of specialized foundation models without task-specific training.
We decompose the task into a three-stage pipeline, with each stage managed by a specialized expert leveraging powerful foundation models (a sketch of this orchestration follows the list):
1) Dreamer: Drawing inspiration from human cognitive processes, the Dreamer initiates an imagination phase. It employs generative models to synthesize visual scenarios depicting how an interaction would look.
2) Thinker: The Thinker utilizes leading Vision-Language Models (VLMs) to interpret task instructions. Integrating visual observations with the imagined scenarios, it generates structured textual descriptions that specify what to interact with.
3) Spotter: The Spotter orchestrates robust vision foundation models to execute precise spatial localization, pinpointing exactly where the interaction area is within the visual input.
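The following is a minimal sketch of how this decoupled pipeline can be orchestrated at test time. The interface names (`imagine`, `describe_part`, `locate`) and the use of Python protocols are illustrative assumptions, not the paper's actual API; the concrete backbones plugged into each role (generative editor, VLM, detector + segmenter) are likewise left abstract here.

```python
from dataclasses import dataclass
from typing import Any, Protocol

import numpy as np


class Dreamer(Protocol):
    """Generative model that imagines HOW the instructed interaction would look."""
    def imagine(self, image: Any, instruction: str) -> Any: ...


class Thinker(Protocol):
    """Vision-language model that decides WHAT object part to interact with."""
    def describe_part(self, image: Any, imagined: Any, instruction: str) -> str: ...


class Spotter(Protocol):
    """Vision foundation models (e.g., open-vocabulary detector plus promptable
    segmenter) that localize WHERE the interaction area is."""
    def locate(self, image: Any, part_description: str) -> np.ndarray: ...


@dataclass
class AffordancePrediction:
    part_description: str   # textual answer: what to interact with
    mask: np.ndarray        # binary mask: where to interact


def predict_affordance(image: Any, instruction: str,
                       dreamer: Dreamer, thinker: Thinker,
                       spotter: Spotter) -> AffordancePrediction:
    """Run the decoupled Dream -> Think -> Spot pipeline, with no task-specific training."""
    imagined = dreamer.imagine(image, instruction)               # stage 1: how it would look
    part = thinker.describe_part(image, imagined, instruction)   # stage 2: what part to use
    mask = spotter.locate(image, part)                           # stage 3: where it is
    return AffordancePrediction(part_description=part, mask=mask)
```

Because each role is an independent, swappable component, any of the three experts can be upgraded to a stronger foundation model without retraining the others.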
We evaluate A4-Agent on three benchmarks: ReasonAff, RAGNet, and UMD. Crucially, our framework is completely zero-shot—it has never been trained or fine-tuned on any of these datasets.
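The tables below report gIoU, cIoU, and precision at IoU thresholds. As a reference, here is a small sketch of how these metrics are commonly computed in referring/affordance segmentation work; the exact evaluation protocol used by each benchmark may differ slightly, so treat this as an assumption rather than the official scoring code.

```python
import numpy as np


def evaluate_masks(preds, gts):
    """Compute gIoU, cIoU, P@50, and P@50:95 over paired binary masks.

    preds, gts: lists of boolean numpy arrays; each pair shares the same shape.
    gIoU = mean per-sample IoU, cIoU = cumulative IoU over the whole dataset,
    P@k = fraction of samples whose IoU exceeds threshold k.
    """
    ious, inter_sum, union_sum = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 0.0)
        inter_sum += inter
        union_sum += union

    ious = np.array(ious)
    giou = ious.mean()                      # mean of per-sample IoUs
    ciou = inter_sum / max(union_sum, 1)    # cumulative intersection / cumulative union
    p50 = (ious > 0.5).mean()               # precision at IoU 0.5
    thresholds = np.arange(0.5, 1.0, 0.05)  # 0.50, 0.55, ..., 0.95
    p50_95 = np.mean([(ious > t).mean() for t in thresholds])
    return {"gIoU": giou, "cIoU": ciou, "P@50": p50, "P@50:95": p50_95}
```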
| Model | gIoU↑ | cIoU↑ | P50-95↑ | P50↑ |
|---|---|---|---|---|
| VLPart | 4.21 | 3.88 | 0.85 | 1.31 |
| OVSeg | 16.52 | 10.59 | 4.12 | 9.89 |
| SAN | 10.21 | 13.45 | 3.17 | 7.18 |
| LISA-7B | 38.17 | 40.58 | 19.69 | 33.62 |
| SAM4MLLM | 45.51 | 33.64 | 22.79 | 43.48 |
| AffordanceLLM | 48.49 | 38.61 | 20.19 | 42.11 |
| InternVL3-8B | 31.79 | 24.68 | 21.93 | 35.41 |
| Qwen2.5VL-7B | 25.18 | 20.54 | 15.82 | 26.00 |
| AffordanceVLM | 30.50 | 25.54 | 18.31 | 30.29 |
| Seg-Zero | 59.26 | 48.03 | 45.87 | 61.33 |
| Vision Reasoner | 63.04 | 52.70 | 47.23 | 67.33 |
| Affordance-R1 | 67.41 | 62.72 | 55.22 | 74.50 |
| A4-Agent (Ours) | 70.52 | 64.62 | 55.22 | 75.24 |
| Model | Zero-shot | 3DOI gIoU↑ | 3DOI cIoU↑ | HANDAL-easy gIoU↑ | HANDAL-easy cIoU↑ | HANDAL-hard gIoU↑ | HANDAL-hard cIoU↑ |
|---|---|---|---|---|---|---|---|
| G-DINO | ✓ | 4.1 | 3.9 | 3.6 | 3.0 | 3.4 | 3.1 |
| LISA | ✓ | 12.3 | 8.1 | 15.5 | 11.9 | 12.3 | 8.1 |
| GLaMM | ✓ | 4.4 | 2.9 | 4.7 | 3.5 | 5.0 | 3.5 |
| Vision-Reasoner | ✓ | 39.6 | 30.3 | 29.6 | 19.8 | 27.7 | 16.7 |
| Affordance-R1 | ✓ | 39.0 | 33.4 | 43.1 | 38.7 | 40.7 | 37.9 |
| AffordanceVLM | ✗ | 38.1 | 39.4 | 58.3 | 58.1 | 58.2 | 57.8 |
| A4-Agent (Ours) | ✓ | 63.9 | 58.3 | 61.1 | 61.7 | 61.0 | 59.6 |
| Model | gIoU↑ | cIoU↑ | P50↑ | P50-95↑ |
|---|---|---|---|---|
| LISA-7B | 41.90 | 41.23 | 39.65 | 19.33 |
| SAM4MLLM | 12.40 | 8.41 | 4.12 | 0.05 |
| AffordanceLLM | 43.11 | 38.97 | 41.56 | 22.36 |
| Qwen2.5VL-7B | 33.21 | 29.83 | 25.17 | 10.45 |
| InternVL3-7B | 30.46 | 28.73 | 18.67 | 9.94 |
| AffordanceVLM | 25.41 | 17.96 | 9.37 | 25.10 |
| Seg-Zero | 44.26 | 39.30 | 39.93 | 16.53 |
| Vision Reasoner | 44.00 | 39.71 | 39.04 | 16.10 |
| Affordance-R1 | 49.85 | 42.24 | 53.35 | 34.08 |
| A4-Agent (Ours) | 65.38 | 59.81 | 77.31 | 43.78 |
A4-Agent demonstrates superior reasoning ability on the ReasonAff dataset, which requires deep understanding of implicit contextual instructions. Our zero-shot framework effectively interprets complex queries and accurately localizes the actionable regions, outperforming supervised baselines.
Qualitative comparison on the RAGNet dataset. Our zero-shot method effectively reasons over task instructions to identify the correct regions and localizes them with precise masks that closely match the ground truth, outperforming baseline methods, including AffordanceVLM, which was trained on this dataset.
We further evaluate A4-Agent on open-world images to assess its robustness. The results show that A4-Agent generalizes well to novel objects and unseen real-world environments beyond the benchmark distributions.
```bibtex
@article{zhang2025a4agent,
  title={A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning},
  author={Zhang, Zixin and Chen, Kanghao and Wang, Hanqing and Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and Guo, Litao and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2512.14442},
  year={2025}
}
```