A unified framework for open-vocabulary affordance grounding across both 3D and 2D domains, powered by foundation models and the large-scale Affo-150K dataset.
Affordance grounding, the task of localizing the object regions that support an interaction described in natural language, is a critical capability for intelligent agents that must understand and act in their environments. The task remains challenging due to the need for fine-grained localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. We introduce Affogato, a unified framework for open-vocabulary affordance grounding in both 3D and 2D. Our approach leverages supervision from foundation models to automatically generate affordance annotations at scale, enabling training without reliance on exhaustive manual labeling. As part of this pipeline, we construct Affo-150K, a large-scale, automatically generated dataset of 150K 3D object instances with free-form affordance descriptions and corresponding 3D affordance heatmaps. Within Affogato, we design simple yet effective models, Gato-3D and Gato-2D, by combining pre-trained part-aware vision encoders with text-conditional heatmap decoders. Our models achieve state-of-the-art performance on existing 3D and 2D benchmarks, and pretraining on Affo-150K further enhances their open-vocabulary capabilities.
Our data annotation pipeline automatically generates high-quality affordance annotations by combining three foundation models. Given multi-view renderings of a 3D object, Gemma3 generates natural language affordance queries, Molmo points to the interaction regions, and MobileSAM produces precise segmentation masks. The multi-view mask logits are then aggregated on the 3D object surface to obtain affordance heatmaps.
Gemma3 analyzes multi-view images to produce natural language affordance queries describing how a human might interact with the object.
Molmo grounds each affordance query to spatial locations, predicting pixel coordinates for the most likely interaction point in each view.
MobileSAM converts predicted points into 2D segmentation masks, which are then lifted and aggregated onto the 3D surface via multi-view voting.
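The lifting step above can be sketched roughly as follows. This is a minimal illustration, not the released pipeline code: the function name, array shapes, and sigmoid squashing are assumptions, and the per-view logits stand in for MobileSAM mask outputs sampled at each 3D point's projected pixel.

```python
import numpy as np

def aggregate_view_logits(view_logits, visibility):
    """Aggregate per-view mask logits onto 3D surface points (illustrative).

    view_logits: (V, N) array of mask logits, one per view and surface point,
                 standing in for MobileSAM outputs at projected pixels
    visibility:  (V, N) boolean array, True if point n is visible in view v
    Returns a per-point affordance heatmap in [0, 1].
    """
    masked = np.where(visibility, view_logits, 0.0)
    counts = np.maximum(visibility.sum(axis=0), 1)  # avoid divide-by-zero
    mean_logit = masked.sum(axis=0) / counts        # multi-view vote
    return 1.0 / (1.0 + np.exp(-mean_logit))        # squash to a heatmap

# Toy example: 3 views, 4 surface points
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
vis = rng.random((3, 4)) > 0.3
heatmap = aggregate_view_logits(logits, vis)
print(heatmap.shape)  # (4,)
```

Averaging visible-view logits before the sigmoid means a point strongly masked in several views gets a high score, while a point masked in only one noisy view is damped; points visible in no view fall back to a neutral 0.5.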
Affo-150K is the largest and most diverse dataset for affordance grounding to date, comprising 150K 3D object instances from Objaverse across four categories: Daily-Used, Furnitures, Transportations, and Electronics. Each object includes approximately five affordance query-heatmap pairs, totaling 750K annotations. Unlike existing datasets constrained by predefined taxonomies, Affo-150K achieves truly open-vocabulary coverage, spanning more than 450 object classes and more than 350 affordance types.
| Dataset | Coverage ↑ | Diversity ↑ |
|---|---|---|
| LASO (2024) | 0.6384 | 0.6578 |
| Affo-150K | 0.7532 | 2.6638 |
We present a minimalistic architecture for affordance grounding, dubbed Gato. Building on a shared architectural concept, we create two models: Gato-3D for 3D point clouds and Gato-2D for 2D images. Each model consists of a modality-specific visual encoder, a text encoder (CLIP), and a text-conditioned heatmap decoder. The affordance heatmap is predicted as the cross-modal similarity between the visual representation and the conditioned text embedding.
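The similarity-based prediction head can be sketched as below. The shapes, the `predict_heatmap` name, and the temperature-scaled sigmoid are illustrative assumptions; the source only states that the heatmap is the cross-modal similarity of visual and text features.

```python
import numpy as np

def predict_heatmap(point_feats, text_emb, temperature=0.07):
    """Affordance heatmap as cosine similarity between per-point (or
    per-pixel) visual features and the conditioned text embedding.

    point_feats: (N, D) decoder output, one feature per point/pixel
    text_emb:    (D,)   text embedding after conditioning
    Returns an (N,) heatmap in [0, 1].
    """
    pf = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    te = text_emb / np.linalg.norm(text_emb)
    sim = pf @ te                                    # cosine similarity
    return 1.0 / (1.0 + np.exp(-sim / temperature))  # squash to [0, 1]
```

Normalizing both sides makes the score a pure direction match, so regions whose part-aware features align with the query ("lift the handle") light up regardless of feature magnitude.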
Gato-3D achieves state-of-the-art performance on the LASO benchmark for both seen and unseen settings. When pretrained on Affo-150K (Gato-3D*), the model shows significant improvements, particularly on unseen object categories — demonstrating the value of large-scale pretraining for generalization.
| Setting | Method | aIoU ↑ | AUC ↑ | SIM ↑ | MAE ↓ |
|---|---|---|---|---|---|
| Seen | Ref. Trans. | 13.7 | 79.8 | 0.497 | 0.124 |
| | 3D-SPS | 11.4 | 76.2 | 0.433 | 0.138 |
| | ReLA | 15.2 | 78.9 | 0.532 | 0.118 |
| | IAGNet | 17.8 | 82.3 | 0.561 | 0.109 |
| | OpenAD | 14.2 | 85.1 | 0.533 | 0.103 |
| | PointRefer | 20.8 | 87.3 | 0.629 | 0.093 |
| | Gato-3D | 20.4 | 86.0 | 0.633 | 0.102 |
| | Gato-3D* | 21.9 | 85.9 | 0.637 | 0.116 |
| Unseen | Ref. Trans. | 10.2 | 69.1 | 0.432 | 0.145 |
| | 3D-SPS | 7.9 | 68.8 | 0.402 | 0.158 |
| | ReLA | 10.7 | 69.7 | 0.429 | 0.144 |
| | IAGNet | 12.9 | 77.8 | 0.443 | 0.129 |
| | OpenAD | 14.6 | 80.7 | 0.518 | 0.109 |
| | PointRefer | 14.6 | 80.2 | 0.507 | 0.119 |
| | Gato-3D | 18.7 | 80.0 | 0.600 | 0.101 |
| | Gato-3D* | 20.8 | 82.9 | 0.614 | 0.122 |
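For reference, the SIM and MAE columns compare predicted and ground-truth heatmaps. The benchmark defines the exact evaluation protocol, so the following is only a generic sketch of the standard definitions: SIM as histogram intersection of normalized heatmaps, MAE as mean absolute error on raw values.

```python
import numpy as np

def sim_score(pred, gt):
    """Histogram intersection: sum of element-wise minima after
    normalizing each heatmap to sum to 1. Higher is better (max 1)."""
    p = pred / pred.sum()
    q = gt / gt.sum()
    return float(np.minimum(p, q).sum())

def mae(pred, gt):
    """Mean absolute error between raw heatmaps. Lower is better."""
    return float(np.abs(pred - gt).mean())

pred = np.array([0.2, 0.8, 0.5, 0.1])
gt = np.array([0.1, 0.9, 0.6, 0.0])
print(sim_score(pred, gt), mae(pred, gt))
```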
Gato-2D demonstrates exceptional zero-shot generalization capability on the AGD20K benchmark. Despite being a lightweight combination of DINOv2 and CLIP, our model significantly outperforms heavily parameterized LLM-based approaches. When pretrained on Affo-150K and finetuned with full supervision (Gato-2D*), the model achieves the best performance across all metrics.
| Setting | Method | KLD ↓ | SIM ↑ | NSS ↑ |
|---|---|---|---|---|
| Zero-shot (Seen) | Molmo+SAM2 | 1.804 | 0.261 | 0.729 |
| | LISA-7B | 1.627 | 0.296 | 0.819 |
| | M²SA-7B | 1.772 | 0.258 | 0.620 |
| | Gato-2D | 1.426 | 0.402 | 0.985 |
| Zero-shot (Unseen) | Molmo+SAM2 | 1.953 | 0.226 | 0.718 |
| | LISA-7B | 1.830 | 0.256 | 0.765 |
| | M²SA-7B | 1.925 | 0.227 | 0.657 |
| | Gato-2D | 1.571 | 0.376 | 1.016 |
| Full Supervision | AffordanceLLM | 1.463 | 0.377 | 1.070 |
| | Gato-2D | 1.034 | 0.503 | 1.550 |
| | Gato-2D* | 0.974 | 0.519 | 1.645 |
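KLD and NSS in this table are standard saliency-style measures. The sketch below follows their common definitions and is not the benchmark's official evaluation code; epsilon handling and the exact ground-truth format (continuous heatmap vs. binary interaction points for NSS) vary by benchmark.

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    """KL divergence from ground truth to prediction after normalizing
    each heatmap to a distribution. Lower is better."""
    p = pred / pred.sum() + eps
    q = gt / gt.sum() + eps
    return float((q * np.log(q / p)).sum())

def nss(pred, fixations):
    """Normalized Scanpath Saliency: z-score the predicted heatmap and
    average it at binary ground-truth locations. Higher is better."""
    z = (pred - pred.mean()) / pred.std()
    return float(z[fixations > 0].mean())
```

KLD rewards matching the whole distribution of the ground-truth heatmap, while NSS only asks whether the prediction is relatively high exactly where the interaction happens, which is why the two can disagree in the table above.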
We conducted a systematic quality assessment on 5K randomly sampled affordance query-heatmap pairs. The evaluation yielded an 84.8% pass rate, confirming the robustness of our automatic annotation pipeline, and the analysis also revealed common failure modes across the three pipeline stages.
```bibtex
@article{lee2025affogato,
  title={Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale},
  author={Lee, Junha and Park, Eunha and Park, Chunghyun and Kang, Dahyun and Cho, Minsu},
  journal={arXiv preprint arXiv:2506.12009},
  year={2025}
}
```