AffoGato: Learning Open-Vocabulary Affordance Grounding with Foundation Models

A unified framework for open-vocabulary affordance grounding across both 3D and 2D domains, powered by foundation models and the large-scale Affo-150K dataset.

1Pohang University of Science and Technology (POSTECH) 2RLWRLD
{junha.lee, dmsgk724, p0125ch, dahyun.kang, mscho}@postech.ac.kr

Overview of AffoGato. Our framework consists of three stages: large-scale dataset generation, pretraining, and transferring to target datasets. We leverage foundation models — Gemma, Molmo, and MobileSAM — to construct Affo-150K, a large-scale synthetic dataset comprising 150K 3D objects with open-vocabulary queries and spatially localized heatmap annotations. We pretrain Gato-3D and Gato-2D on the Affo-150K train split, then finetune on various target datasets.

Abstract

Affordance grounding — localizing object regions based on natural language descriptions of interactions — is a critical challenge for enabling intelligent agents to understand and interact with their environments. However, this task remains challenging due to the need for fine-grained localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. We introduce AffoGato, a unified framework for open-vocabulary affordance grounding across both 3D and 2D. Our approach leverages supervision from foundation models to automatically generate scalable affordance annotations, enabling training without reliance on exhaustive manual labeling. As part of this pipeline, we construct Affo-150K, a large-scale automatically generated dataset of 150K 3D object instances with free-form affordance descriptions and corresponding 3D affordance heatmaps. Within AffoGato, we design simple yet effective models, Gato-3D and Gato-2D, by combining pre-trained part-aware vision encoders with text-conditional heatmap decoders. Our models achieve state-of-the-art performance across existing 3D and 2D benchmarks, with pretraining on Affo-150K further enhancing their open-vocabulary capabilities.

Data Annotation Pipeline

Our data annotation pipeline automatically generates high-quality affordance annotations by combining three foundation models. Given multi-view renderings of a 3D object, Gemma3 generates natural language affordance queries, Molmo points to the interaction regions, and MobileSAM produces precise segmentation masks. The multi-view mask logits are then aggregated on the 3D object surface to obtain affordance heatmaps.

Overview of our data annotation pipeline. Given multi-view renderings of an object, Gemma3 generates affordance queries, Molmo points to the queried affordance region, and MobileSAM decodes each point into a mask logit map. The multi-view mask logits are aggregated on the 3D object surface to obtain an affordance heatmap.
1. Query Generation

Gemma3 analyzes multi-view images to produce natural language affordance queries describing how a human might interact with the object.
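This stage can be sketched as a thin wrapper around a vision-language model: send the multi-view renderings with a prompt asking for interaction descriptions, then parse one query per line. The `vlm` callable and the prompt wording below are illustrative assumptions, not the released pipeline.

```python
def generate_affordance_queries(vlm, views, n_queries=5):
    """Ask a vision-language model (e.g. Gemma3) for free-form affordance queries.

    vlm       : callable (images, prompt) -> str; stands in for the actual model API.
    views     : list of multi-view renderings of one object.
    n_queries : number of queries to request (Affo-150K averages ~5 per object).
    """
    # Hypothetical prompt; the actual wording used for Affo-150K may differ.
    prompt = (
        f"These images show one object from multiple views. "
        f"List {n_queries} short natural-language descriptions of how a person "
        f"could interact with it (e.g. 'press to turn on'), one per line."
    )
    reply = vlm(views, prompt)
    # Parse one query per non-empty line, stripping common bullet characters.
    return [line.strip("-* ").strip() for line in reply.splitlines() if line.strip()]
```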

2. Interaction Point Prediction

Molmo grounds each affordance query to spatial locations, predicting pixel coordinates for the most likely interaction point in each view.

3. Heatmap Generation & Aggregation

MobileSAM converts predicted points into 2D segmentation masks, which are then lifted and aggregated onto the 3D surface via multi-view voting.
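The lifting step above amounts to a per-point vote: each 3D surface point collects the 2D mask logit from every view in which it is visible, and the averaged logit is squashed into a heatmap value. The function signature, projection callable, and visibility mask below are illustrative assumptions about the pipeline, not the released code.

```python
import numpy as np

def lift_mask_logits(points, views):
    """Aggregate per-view 2D mask logits onto 3D points by multi-view voting.

    points : (N, 3) array of 3D surface points.
    views  : list of dicts with keys
             'proj'    - callable mapping (N, 3) points to (N, 2) pixel coords,
             'logits'  - (H, W) mask-logit map from MobileSAM,
             'visible' - (N,) boolean visibility mask for this view.
    Returns an (N,) affordance heatmap in [0, 1]; unseen points get 0.
    """
    n = len(points)
    logit_sum = np.zeros(n)
    vote_count = np.zeros(n)
    for v in views:
        uv = np.round(v['proj'](points)).astype(int)   # project points to pixels
        h, w = v['logits'].shape
        in_bounds = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        valid = v['visible'] & in_bounds
        logit_sum[valid] += v['logits'][uv[valid, 1], uv[valid, 0]]
        vote_count[valid] += 1
    mean_logit = logit_sum / np.maximum(vote_count, 1)  # average logit per point
    heat = 1.0 / (1.0 + np.exp(-mean_logit))            # sigmoid to [0, 1]
    return np.where(vote_count > 0, heat, 0.0)          # zero out unobserved points
```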

The Affo-150K Dataset

Affo-150K is the largest and most diverse affordance grounding dataset to date, comprising 150K 3D object instances from Objaverse across four categories: Daily-Used, Furnitures, Transportations, and Electronics. Each object carries roughly five affordance query-heatmap pairs, for 750K annotations in total. Unlike existing datasets constrained by predefined taxonomies, Affo-150K achieves open-vocabulary coverage spanning more than 450 object classes and more than 350 affordance types.

(a) Object classes: word cloud of object categories in Affo-150K
(b) Affordance classes: word cloud of affordance types in Affo-150K
(c) Diverse annotations: e.g., "press to turn on", "aim", and "hold" on a single object
(d) Heatmap quality:

    Dataset        Cov. ↑    Div. ↑
    LASO (2024)    0.6384    0.6578
    Affo-150K      0.7532    2.6638

Dataset Comparison

Comparison of 3D affordance grounding datasets.
Comparison of 2D affordance grounding datasets.

Gato Model Architecture

We present a minimalistic architecture for affordance grounding, dubbed Gato. Building on a shared architectural concept, we create two models: Gato-3D for 3D point clouds and Gato-2D for 2D images. Each model consists of a modality-specific visual encoder, a text encoder (CLIP), and a text-conditioned heatmap decoder. The affordance heatmap is predicted as the cross-modal similarity of the vision representation and the conditioned text embedding.

Gato-3D and Gato-2D architectures. Each model consists of a 3D or 2D visual encoder, a text encoder, and a text-conditioned heatmap decoder. The affordance heatmap is predicted as the cross-modal similarity of the vision representation and the conditioned text embedding.
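The decoding step described above can be sketched as a scaled cosine similarity between per-point (or per-pixel) vision features and the conditioned text embedding. This is a minimal sketch under assumed shapes and an assumed temperature value; the actual decoder details may differ.

```python
import numpy as np

def predict_heatmap(vision_feats, text_emb, temperature=0.07):
    """Affordance heatmap as cross-modal similarity (illustrative sketch).

    vision_feats : (N, D) per-point/per-pixel features from the part-aware encoder.
    text_emb     : (D,) conditioned text embedding from the decoder.
    temperature  : assumed CLIP-style sharpening factor.
    Returns (N,) scores in [0, 1].
    """
    vf = vision_feats / np.linalg.norm(vision_feats, axis=1, keepdims=True)
    te = text_emb / np.linalg.norm(text_emb)
    sim = vf @ te                                     # cosine similarity in [-1, 1]
    return 1.0 / (1.0 + np.exp(-sim / temperature))   # sharpen and squash to [0, 1]
```

Features aligned with the text embedding score near 1, orthogonal features near 0.5, and opposing features near 0, which is what makes the output usable as a dense heatmap.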

Results

3D Affordance Grounding

Gato-3D achieves state-of-the-art performance on the LASO benchmark for both seen and unseen settings. When pretrained on Affo-150K (Gato-3D*), the model shows significant improvements, particularly on unseen object categories — demonstrating the value of large-scale pretraining for generalization.

Open-vocabulary 3D affordance grounding on the LASO test split. * denotes models pretrained on Affo-150K.
Method          aIoU ↑   AUC ↑   SIM ↑   MAE ↓
Seen
  Ref. Trans.    13.7     79.8    0.497   0.124
  3D-SPS         11.4     76.2    0.433   0.138
  ReLA           15.2     78.9    0.532   0.118
  IAGNet         17.8     82.3    0.561   0.109
  OpenAD         14.2     85.1    0.533   0.103
  PointRefer     20.8     87.3    0.629   0.093
  Gato-3D        20.4     86.0    0.633   0.102
  Gato-3D*       21.9     85.9    0.637   0.116
Unseen
  Ref. Trans.    10.2     69.1    0.432   0.145
  3D-SPS          7.9     68.8    0.402   0.158
  ReLA           10.7     69.7    0.429   0.144
  IAGNet         12.9     77.8    0.443   0.129
  OpenAD         14.6     80.7    0.518   0.109
  PointRefer     14.6     80.2    0.507   0.119
  Gato-3D        18.7     80.0    0.600   0.101
  Gato-3D*       20.8     82.9    0.614   0.122
Qualitative comparison on LASO. Gato-3D produces more accurate and focused affordance heatmaps compared to OpenAD and PointRefer, closely matching the ground truth.

2D Affordance Grounding

Gato-2D demonstrates exceptional zero-shot generalization capability on the AGD20K benchmark. Despite being a lightweight combination of DINOv2 and CLIP, our model significantly outperforms heavily parameterized LLM-based approaches. When pretrained on Affo-150K and finetuned with full supervision (Gato-2D*), the model achieves the best performance across all metrics.

Open-vocabulary 2D affordance grounding on the AGD20K test split. * denotes Affo-150K pretraining + AGD20K full supervision.
Method            KLD ↓   SIM ↑   NSS ↑
Zero-shot (Seen)
  Molmo+SAM2       1.804   0.261   0.729
  LISA-7B          1.627   0.296   0.819
  M²SA-7B          1.772   0.258   0.620
  Gato-2D          1.426   0.402   0.985
Zero-shot (Unseen)
  Molmo+SAM2       1.953   0.226   0.718
  LISA-7B          1.830   0.256   0.765
  M²SA-7B          1.925   0.227   0.657
  Gato-2D          1.571   0.376   1.016
Full supervision
  AffordanceLLM    1.463   0.377   1.070
  Gato-2D          1.034   0.503   1.550
  Gato-2D*         0.974   0.519   1.645
Qualitative comparison on AGD20K. Gato-2D (pretrained on Affo-150K, finetuned on AGD20K) produces more precise and well-localized affordance heatmaps compared to existing methods.
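The KLD, SIM, and NSS columns above follow the standard saliency-evaluation definitions: KL divergence between distribution-normalized heatmaps, histogram intersection of the same, and the mean standardized prediction at ground-truth fixation pixels. A sketch using those standard formulas (the epsilon values are arbitrary choices for numerical stability):

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    """KL divergence of ground truth from prediction, both normalized to sum 1 (lower is better)."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return float(np.sum(q * np.log(q / (p + eps) + eps)))

def similarity(pred, gt, eps=1e-12):
    """Histogram intersection of normalized heatmaps (higher is better, max 1)."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return float(np.minimum(p, q).sum())

def nss(pred, fixations, eps=1e-12):
    """Mean z-scored prediction at binary fixation locations (higher is better)."""
    z = (pred - pred.mean()) / (pred.std() + eps)
    return float(z[fixations.astype(bool)].mean())
```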

Annotation Quality

We conducted a systematic quality assessment on 5K randomly sampled affordance query-heatmap pairs. The evaluation demonstrated an 84.8% pass rate, confirming the robustness of our automatic annotation pipeline. The analysis also reveals common failure modes across the three pipeline stages.

Annotation quality evaluation results. Our automatic pipeline achieves 84.8% accuracy. Failure modes include object misidentification (Stage 1), incorrect affordance queries (Stage 1), wrong interaction points (Stage 2), SAM edge bias (Stage 3), and limited camera angles.

Citation

@article{lee2025affogato,
  title={Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale},
  author={Lee, Junha and Park, Eunha and Park, Chunghyun and Kang, Dahyun and Cho, Minsu},
  journal={arXiv preprint arXiv:2506.12009},
  year={2025}
}