A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

1HKUST(GZ), 2HKUST, 3SJTU, 4Knowin
*Equal contribution     †Corresponding author
[Figure: A4-Agent teaser]

Overview of A4-Agent, an affordance-centric vision-language agent that predicts actionable regions from complex task instructions. By integrating image generation, object detection, segmentation, and a vision-language model, the framework imagines plausible interactions to accurately localize the action-specific part. A4-Agent is fully zero-shot and achieves state-of-the-art performance on all evaluated benchmarks.

Abstract

Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training on annotated datasets, which leads to poor generalization to novel objects and unseen environments.

In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decomposes affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is.

By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.

Motivation

[Figure: Motivation of A4-Agent]

Affordance prediction fundamentally requires two complementary capabilities: high-level reasoning (interpreting instructions and identifying relevant parts) and low-level grounding (precisely localizing these parts).

Prevailing end-to-end models tightly couple these processes into a single monolithic pipeline. This design introduces several issues, including a trade-off between reasoning and grounding, limited generalization to novel objects, and reduced flexibility. We question whether entangling these capabilities is the right path. Instead, we propose A4-Agent, which decouples reasoning and grounding, allowing for the coordination of specialized foundation models without task-specific training.

Method

[Figure: The three-stage A4-Agent pipeline]

We decompose the task into a three-stage pipeline, with each stage managed by a specialized expert leveraging powerful foundation models:
1) Dreamer: Drawing inspiration from human cognitive processes, the Dreamer initiates an imagination phase. It employs generative models to synthesize visual scenarios depicting how an interaction would look.
2) Thinker: The Thinker utilizes leading Vision-Language Models (VLMs) to interpret task instructions. Integrating visual observations with the imagined scenarios, it generates structured textual descriptions that specify what to interact with.
3) Spotter: The Spotter orchestrates robust vision foundation models to execute precise spatial localization, pinpointing exactly where the interaction area is within the visual input.
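
The sketch below illustrates how such a decoupled pipeline could be wired together in Python. It is a minimal, illustrative sketch, not the released implementation: the class names (Dreamer, Thinker, Spotter, A4Agent), the wrapper interfaces (generate, query, detect, segment), and the prompt formats are assumptions standing in for the actual foundation models and prompts used in the paper.

```python
# Minimal, illustrative sketch of the three-stage A4-Agent pipeline.
# All wrapper interfaces below (generate, query, detect, segment) are
# hypothetical placeholders, not the paper's actual implementation.

from dataclasses import dataclass
from typing import Any

import numpy as np


@dataclass
class PartDescription:
    """Structured output of the Thinker: what to interact with."""
    object_name: str  # e.g. "mug"
    part_name: str    # e.g. "handle"
    rationale: str    # short textual justification


class Dreamer:
    """Stage 1: imagine how the interaction would look."""

    def __init__(self, image_generator: Any):
        self.image_generator = image_generator  # e.g. a diffusion-model wrapper

    def imagine(self, image: np.ndarray, instruction: str) -> np.ndarray:
        prompt = f"A person performing the task: {instruction}"
        # Condition the generative model on the observed scene and the task.
        return self.image_generator.generate(image=image, prompt=prompt)


class Thinker:
    """Stage 2: decide what object part to interact with."""

    def __init__(self, vlm: Any):
        self.vlm = vlm  # e.g. a large vision-language model wrapper

    def reason(self, image: np.ndarray, imagined: np.ndarray,
               instruction: str) -> PartDescription:
        answer = self.vlm.query(
            images=[image, imagined],
            question=(f"Task: {instruction}. Which object and which part of it "
                      "should be acted on? Answer as: object / part / reason."),
        )
        # A real system would parse the VLM output more robustly.
        obj, part, reason = (s.strip() for s in answer.split("/", 2))
        return PartDescription(obj, part, reason)


class Spotter:
    """Stage 3: locate where the interaction area is."""

    def __init__(self, detector: Any, segmenter: Any):
        self.detector = detector    # e.g. an open-vocabulary detector
        self.segmenter = segmenter  # e.g. a promptable segmentation model

    def locate(self, image: np.ndarray, part: PartDescription) -> np.ndarray:
        box = self.detector.detect(image, text=f"{part.part_name} of the {part.object_name}")
        return self.segmenter.segment(image, box=box)  # binary affordance mask


class A4Agent:
    """Coordinates the three experts at test time; no task-specific training."""

    def __init__(self, dreamer: Dreamer, thinker: Thinker, spotter: Spotter):
        self.dreamer, self.thinker, self.spotter = dreamer, thinker, spotter

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        imagined = self.dreamer.imagine(image, instruction)       # how it would look
        part = self.thinker.reason(image, imagined, instruction)  # what to act on
        return self.spotter.locate(image, part)                   # where it is
```

Because each stage communicates with the others only through images and a small structured description, any component can be swapped for a stronger generator, VLM, detector, or segmenter without retraining the rest of the pipeline.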

Experiments

Quantitative Results

We evaluate A4-Agent on three benchmarks: ReasonAff, RAGNet, and UMD. Crucially, our framework is completely zero-shot—it has never been trained or fine-tuned on any of these datasets.
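
The tables report gIoU, cIoU, and precision at IoU thresholds (P50 and P50-95). The sketch below shows how such metrics are commonly computed for binary masks; it assumes the standard referring-segmentation definitions (gIoU as mean per-sample IoU, cIoU as total intersection over total union, P@t as the fraction of samples with IoU above t, and P50-95 as the average of P@t over thresholds 0.50 to 0.95), which is our reading of the metric names rather than the paper's own evaluation code.

```python
import numpy as np


def segmentation_metrics(pred_masks, gt_masks, thresholds=np.linspace(0.50, 0.95, 10)):
    """gIoU, cIoU, and precision@threshold for lists of HxW boolean masks.

    Assumed definitions (standard in referring segmentation):
      gIoU = mean of per-sample IoU
      cIoU = sum of intersections / sum of unions over the whole dataset
      P@t  = fraction of samples whose IoU exceeds t
    """
    ious, inter_sum, union_sum = [], 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)  # empty-vs-empty counts as a hit
        inter_sum += inter
        union_sum += union

    ious = np.asarray(ious)
    giou = float(ious.mean())
    ciou = float(inter_sum / union_sum) if union_sum > 0 else 1.0
    precision = {float(t): float((ious > t).mean()) for t in thresholds}
    return giou, ciou, precision


# Hypothetical usage: P50 is precision at the 0.5 threshold, and P50-95 is the
# mean precision over thresholds 0.50, 0.55, ..., 0.95 (an assumption here).
# giou, ciou, prec = segmentation_metrics(pred_masks, gt_masks)
# p50, p50_95 = prec[0.5], sum(prec.values()) / len(prec)
```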

Results on ReasonAff Dataset

| Model | gIoU↑ | cIoU↑ | P50-95 | P50 |
|---|---|---|---|---|
| VLPart | 4.21 | 3.88 | 0.85 | 1.31 |
| OVSeg | 16.52 | 10.59 | 4.12 | 9.89 |
| SAN | 10.21 | 13.45 | 3.17 | 7.18 |
| LISA-7B | 38.17 | 40.58 | 19.69 | 33.62 |
| SAM4MLLM | 45.51 | 33.64 | 22.79 | 43.48 |
| AffordanceLLM | 48.49 | 38.61 | 20.19 | 42.11 |
| InternVL3-8B | 31.79 | 24.68 | 21.93 | 35.41 |
| Qwen2.5VL-7B | 25.18 | 20.54 | 15.82 | 26.00 |
| AffordanceVLM | 30.50 | 25.54 | 18.31 | 30.29 |
| Seg-Zero | 59.26 | 48.03 | 45.87 | 61.33 |
| Vision Reasoner | 63.04 | 52.70 | 47.23 | 67.33 |
| Affordance-R1 | 67.41 | 62.72 | 55.22 | 74.50 |
| A4-Agent (Ours) | 70.52 | 64.62 | 55.22 | 75.24 |

Results on RAGNet Dataset

| Model | 3DOI gIoU↑ | 3DOI cIoU↑ | HANDAL-easy gIoU↑ | HANDAL-easy cIoU↑ | HANDAL-hard gIoU↑ | HANDAL-hard cIoU↑ |
|---|---|---|---|---|---|---|
| G-DINO | 4.1 | 3.9 | 3.6 | 3.0 | 3.4 | 3.1 |
| LISA | 12.3 | 8.1 | 15.5 | 11.9 | 12.3 | 8.1 |
| GLaMM | 4.4 | 2.9 | 4.7 | 3.5 | 5.0 | 3.5 |
| Vision-Reasoner | 39.6 | 30.3 | 29.6 | 19.8 | 27.7 | 16.7 |
| Affordance-R1 | 39.0 | 33.4 | 43.1 | 38.7 | 40.7 | 37.9 |
| AffordanceVLM | 38.1 | 39.4 | 58.3 | 58.1 | 58.2 | 57.8 |
| A4-Agent (Ours) | 63.9 | 58.3 | 61.1 | 61.7 | 61.0 | 59.6 |

Results on UMD Dataset

| Model | gIoU↑ | cIoU↑ | P50 | P50-95 |
|---|---|---|---|---|
| LISA-7B | 41.90 | 41.23 | 39.65 | 19.33 |
| SAM4MLLM | 12.40 | 8.41 | 4.12 | 0.05 |
| AffordanceLLM | 43.11 | 38.97 | 41.56 | 22.36 |
| Qwen2.5VL-7B | 33.21 | 29.83 | 25.17 | 10.45 |
| InternVL3-7B | 30.46 | 28.73 | 18.67 | 9.94 |
| AffordanceVLM | 25.41 | 17.96 | 9.37 | 25.10 |
| Seg-Zero | 44.26 | 39.30 | 39.93 | 16.53 |
| Vision Reasoner | 44.00 | 39.71 | 39.04 | 16.10 |
| Affordance-R1 | 49.85 | 42.24 | 53.35 | 34.08 |
| A4-Agent (Ours) | 65.38 | 59.81 | 77.31 | 43.78 |

Qualitative Results on ReasonAff

[Figure: Qualitative results on ReasonAff]

A4-Agent demonstrates superior reasoning ability on the ReasonAff dataset, which requires deep understanding of implicit contextual instructions. Our zero-shot framework effectively interprets complex queries and accurately localizes the actionable regions, outperforming supervised baselines.


Qualitative Results on RAGNet

[Figure: Qualitative results on RAGNet]

Qualitative comparison on the RAGNet dataset. Our zero-shot method effectively reasons over task instructions to identify the correct regions and localizes them with precise masks that closely match the ground truth, outperforming baseline methods including AffordanceVLM, which was trained on this dataset.


Open-World Generalization

[Figure: Open-world generalization examples]

We further evaluate A4-Agent on open-world images to assess its robustness. The results show that A4-Agent generalizes well to:

  • Novel Objects: Identifying actionable regions on objects not seen in standard datasets.
  • Complex Scenes: Selecting the correct tool part in cluttered environments.
  • Deep Reasoning: Logically deducing appropriate tools for a given task (e.g., using a rock as a substitute for a hammer).
This confirms the potential of our training-free, agentic approach for real-world applications.

BibTeX

@article{zhang2025a4agent,
  title={A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning}, 
  author={Zhang, Zixin and Chen, Kanghao and Wang, Hanqing and Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and Guo, Litao and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2512.14442},
  year={2025}
}