Affogato: Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

Affogato benchmark overview: large-scale dataset generation, pretraining on Affogato-750K, and transferring to target datasets

Overview of our framework. Our framework consists of three stages: large-scale dataset generation, pretraining, and transferring to target datasets. In the dataset generation stage, we leverage three foundation models (Gemma3, Molmo, and MobileSAM) to construct Affogato-750K, a large-scale synthetic dataset comprising 750K open-vocabulary affordance annotations across 150K 3D objects. We pretrain Espresso-3D and Espresso-2D on the Affogato-750K train split, then finetune the pretrained models on various target datasets.

Abstract

Affordance grounding aims to localize where to interact with an object, a fundamental capability for embodied agents. Yet progress is bottlenecked by data: manual annotation is prohibitively expensive and confines existing datasets to a narrow set of predefined object and affordance categories. We introduce Affogato, a framework for open-vocabulary affordance grounding centered on Affogato-750K, a large-scale dataset of 750K 3D affordance heatmaps paired with natural language queries. We build the dataset with a fully automated pipeline that orchestrates foundation models to generate these annotations at scale without human labeling. The resulting dataset covers a significantly more diverse set of object and affordance categories than any existing affordance dataset. For reliable evaluation, we further provide 5K human-verified test pairs. We also present Espresso-3D and Espresso-2D, simple yet effective models with a unified architecture across both modalities. Pretraining on Affogato-750K improves both Espresso and prior methods and yields the largest gains on unseen object and affordance categories, showing that it provides broadly transferable supervision across architectures.

Data Annotation Pipeline

Our data annotation pipeline automatically generates high-quality affordance annotations by chaining three foundation models. Given multi-view renderings of a 3D object, Gemma3 generates open-vocabulary natural language affordance queries via chain-of-thought prompting, Molmo points to the interaction region for each query, and MobileSAM converts each point into a 2D segmentation mask. The multi-view masks are then lifted and aggregated on the 3D object surface via cross-view voting to obtain a final 3D affordance heatmap.

1. Open-Vocabulary Query Generation

Gemma3 analyzes multi-view images and, via chain-of-thought prompting, produces five open-vocabulary affordance queries per object that describe how a human might interact with it.

2. Interaction Point Prediction

Molmo grounds each affordance query to a spatial location, predicting the most likely interaction-point coordinates in every view.

3. Heatmap Generation & Aggregation

MobileSAM converts predicted points into 2D segmentation masks, which are projected onto the 3D surface and fused via multi-view voting to suppress per-view errors.
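
The multi-view voting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each view already comes with a binary mask, a precomputed projection of every 3D point to a pixel, and a visibility flag per point. The function name `aggregate_views` and its parameters are hypothetical, and the score is simply the fraction of observing views whose mask covers the point.

```python
import numpy as np

def aggregate_views(masks, pixel_coords, visible, min_votes=1):
    """Fuse per-view 2D masks into a per-point 3D heatmap by voting.

    masks:        list of (H, W) boolean arrays, one per view.
    pixel_coords: list of (N, 2) integer arrays holding the (row, col)
                  pixel that each 3D point projects to in that view.
    visible:      list of (N,) boolean arrays, True where the point is
                  unoccluded and inside the image in that view.
    min_votes:    minimum number of observing views (>= 1) required
                  before a point receives a non-zero score.
    """
    n_points = pixel_coords[0].shape[0]
    votes = np.zeros(n_points, dtype=np.float64)  # views whose mask hits the point
    seen = np.zeros(n_points, dtype=np.float64)   # views that observe the point
    for mask, pix, vis in zip(masks, pixel_coords, visible):
        idx = np.flatnonzero(vis)
        rows, cols = pix[idx, 0], pix[idx, 1]
        votes[idx] += mask[rows, cols]
        seen[idx] += 1.0
    # Fraction of observing views that voted for the point; points seen by
    # fewer than min_votes views keep a score of zero.
    return np.divide(votes, seen, out=np.zeros_like(votes), where=seen >= min_votes)
```

Normalizing by the number of observing views, rather than by the total number of views, keeps points that are visible from only a few cameras from being penalized. A mask error in a single view is diluted by the views that disagree with it.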

The Affogato-750K Dataset

Affogato-750K is the largest and most diverse dataset for affordance grounding to date, comprising 750K open-vocabulary affordance annotations across 150K 3D object instances from Objaverse. The objects span four categories: Daily-Used (121,799), Transportations (11,609), Furnitures (8,759), and Electronics (7,937). Each object is paired with five affordance query–heatmap pairs, which yields roughly 750K annotations in total. Unlike existing datasets constrained by predefined taxonomies, Affogato-750K achieves open-vocabulary coverage with >450 object classes and >350 affordance types. A 5K human-verified test split is constructed via manual refinement of borderline cases, ensuring rigorous and reliable evaluation.
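
As a quick consistency check on the figures above, the per-category counts can be summed and multiplied by the five queries per object. The variable names are illustrative only; the numbers are taken directly from the paragraph.

```python
# Per-category object counts as quoted in the text.
category_counts = {
    "Daily-Used": 121_799,
    "Transportations": 11_609,
    "Furnitures": 8_759,
    "Electronics": 7_937,
}
QUERIES_PER_OBJECT = 5  # five query-heatmap pairs per object

total_objects = sum(category_counts.values())
total_annotations = total_objects * QUERIES_PER_OBJECT
# Share of each category among all objects.
shares = {name: count / total_objects for name, count in category_counts.items()}

print(total_objects)       # 150104
print(total_annotations)   # 750520
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The totals round to the quoted 150K objects and 750K annotations. Daily-Used objects make up about 81% of the dataset.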

750K affordance annotations

150K 3D object instances

>450 object classes

>350 affordance types

Word cloud of object categories in Affogato-750K — (a) Object classes

Characteristics of Affogato-750K. The dataset exhibits high diversity in both object and affordance categories.

Dataset Comparison

Affogato-750K substantially surpasses existing 3D and 2D affordance datasets in both diversity and scale. The comparison covers the number of affordance types, object classes, and samples, with separate panels for 3D datasets and 2D datasets.

Espresso Model Architecture

We present Espresso, a unified baseline designed to cleanly validate the contribution of Affogato-750K. Building on a shared architecture, we instantiate two models: Espresso-3D for 3D point clouds and Espresso-2D for 2D images. Each consists of a modality-specific visual encoder, a text encoder, and a text-conditioned heatmap decoder. The core of our design is a heatmap decoder that replaces learnable queries with text embeddings, naturally supporting open-vocabulary affordance grounding without predefined categories. The affordance heatmap is predicted as the cross-modal similarity between the vision representation and the conditioned text embedding.
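
The final prediction step, a heatmap read out as cross-modal similarity, can be illustrated with a small NumPy sketch. It is not the Espresso decoder itself. The use of cosine similarity, the sigmoid squashing, the `temperature` parameter, and the function name `similarity_heatmap` are all assumptions made for illustration. The source only states that the heatmap is the similarity between the vision representation and the conditioned text embedding.

```python
import numpy as np

def similarity_heatmap(visual_feats, text_embed, temperature=1.0):
    """Score each visual token against one text embedding.

    visual_feats: (N, D) array, one feature per 3D point or image patch.
    text_embed:   (D,) array, the (conditioned) embedding of the query.
    Returns an (N,) array of scores in (0, 1).
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_embed / (np.linalg.norm(text_embed) + 1e-8)
    logits = (v @ t) / temperature
    # Squash to (0, 1) so the output can be read as a heatmap (assumed choice).
    return 1.0 / (1.0 + np.exp(-logits))
```

Because the query enters only as an embedding vector, any text the encoder can embed produces a heatmap, which is what removes the need for a fixed category list. For a 2D image, the `(N,)` output would be reshaped back to the patch grid.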

Espresso-3D and Espresso-2D architectures with vision encoder, text encoder, and conditional heatmap decoder

Results

3D Affordance Grounding (LASO)

On the LASO benchmark, pretraining on Affogato-750K improves unseen aIoU, AUC, and SIM for all three pretrained methods (OpenAD, PointRefer, and Espresso-3D), regardless of architecture. Affogato-750K-pretrained PointRefer achieves the largest unseen gain (+4.0%p aIoU), while Espresso-3D improves its aIoU on both the seen (+1.5%p) and unseen (+2.1%p) settings, indicating that Affogato-750K is a strong, model-agnostic source of open-vocabulary supervision.

Method	Seen aIoU ↑	Seen AUC ↑	Seen SIM ↑	Seen MAE ↓	Unseen aIoU ↑	Unseen AUC ↑	Unseen SIM ↑	Unseen MAE ↓
Ref. Trans.	13.7	79.8	49.7	0.124	10.2	69.1	43.2	0.145
3D-SPS	11.4	76.2	43.3	0.138	7.9	68.8	40.2	0.158
ReLA	15.2	78.9	53.2	0.118	10.7	69.7	42.9	0.144
IAGNet	17.8	82.3	56.1	0.109	12.9	77.8	44.3	0.129
OpenAD	14.2	85.1	53.3	0.103	14.6	80.7	51.8	0.109
+ Affogato-750K pretrain	16.1	86.8	53.9	0.100	15.5	81.8	53.4	0.103
PointRefer	20.8	87.3	62.9	0.093	14.6	80.2	50.7	0.119
+ Affogato-750K pretrain	20.2	86.0	60.0	0.098	18.6	81.4	56.1	0.103
Espresso-3D	20.4	86.0	63.3	0.102	18.7	80.0	60.0	0.101
+ Affogato-750K pretrain	21.9	85.9	63.7	0.116	20.8	82.9	61.4	0.122
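
For readers unfamiliar with the aIoU and MAE columns, the sketch below shows one common way to compute them for a per-point heatmap. It is an assumption about the evaluation protocol, not something stated here. It uses the threshold-sweep form of aIoU found in prior 3D affordance work, and the 20 thresholds and the 0.5 ground-truth cutoff are assumed defaults. The table appears to report aIoU scaled by 100.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth heatmaps."""
    return float(np.mean(np.abs(pred - gt)))

def aiou(pred, gt, num_thresholds=20, gt_threshold=0.5):
    """Average IoU over a sweep of binarization thresholds on the prediction.

    pred, gt: arrays of per-point scores in [0, 1] with the same shape.
    The threshold count and gt_threshold are assumed defaults.
    """
    gt_mask = gt >= gt_threshold
    ious = []
    for thr in np.linspace(0.0, 1.0, num_thresholds):
        pred_mask = pred >= thr
        union = np.logical_or(pred_mask, gt_mask).sum()
        if union == 0:
            continue  # both masks empty at this threshold; skip it
        inter = np.logical_and(pred_mask, gt_mask).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Sweeping the threshold avoids committing to a single cutoff for the predicted heatmap, so a prediction scores well only if it ranks the true region above the rest across the whole range.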

[Figure: 3D affordance grounding results from OpenAD, PointRefer, Espresso-3D, and ground truth on LASO.] **Qualitative comparison on LASO.** Espresso-3D produces more accurate and focused affordance heatmaps than OpenAD and PointRefer, closely matching the ground truth.

2D Affordance Grounding (AGD20K)

On AGD20K, Espresso-2D with Affogato-750K pretraining outperforms all baselines under zero-shot evaluation, reaching 40.2 / 37.6 SIM on the seen / unseen splits versus 22–30 SIM for competing methods. Under supervised learning, Affogato-750K pretraining provides consistent improvements over training solely on AGD20K — Espresso-2D gains +1.6 SIM and −0.060 KLD under full supervision, and existing methods such as AffordanceNet similarly benefit.

(a) Zero-shot on the AGD20K **unseen** split (no AGD20K training); Espresso-2D is pretrained on Affogato-750K.
Method	KLD ↓	SIM ↑	NSS ↑
Molmo+SAM2	1.953	22.6	0.718
LISA-7B	1.830	25.6	0.765
M²SA-7B	1.925	22.7	0.657
AffordanceNet	1.896	22.3	0.736
Espresso-2D	1.571	37.6	1.016

(b) Fully-supervised on AGD20K. “+ Affogato-750K pretrain” denotes pretraining on Affogato-750K, then finetuning.
Method	KLD ↓	SIM ↑	NSS ↑
LOCATE-Sup	1.907	23.6	0.641
LOCATE-Sup-OWL	1.927	23.4	0.624
AffordanceLLM	1.463	37.7	1.070
AffordanceNet	1.627	37.0	1.002
+ Affogato-750K pretrain	1.622	38.0	1.022
Espresso-2D	1.034	50.3	1.550
+ Affogato-750K pretrain	0.974	51.9	1.645
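
The three metrics in these tables (KLD, SIM, and NSS) are standard saliency-style measures. The sketch below gives their usual textbook forms for a pair of 2D heatmaps. The exact normalization and the `fixation_threshold` used by the AGD20K evaluation code are not specified here, so those details are assumptions. The tables appear to report SIM scaled by 100.

```python
import numpy as np

EPS = 1e-12

def _to_distribution(heatmap):
    """Normalize a non-negative heatmap so that it sums to 1."""
    h = np.asarray(heatmap, dtype=np.float64)
    return h / (h.sum() + EPS)

def kld(pred, gt):
    """KL divergence from the prediction to the ground truth (lower is better)."""
    p, q = _to_distribution(pred), _to_distribution(gt)
    return float(np.sum(q * np.log(EPS + q / (p + EPS))))

def sim(pred, gt):
    """Histogram intersection of the two distributions (higher is better)."""
    p, q = _to_distribution(pred), _to_distribution(gt)
    return float(np.minimum(p, q).sum())

def nss(pred, gt, fixation_threshold=0.1):
    """Mean z-scored prediction over ground-truth locations (higher is better).

    Ground-truth locations are pixels whose max-normalized value exceeds
    fixation_threshold; this cutoff is an assumed default.
    """
    p = np.asarray(pred, dtype=np.float64)
    g = np.asarray(gt, dtype=np.float64)
    z = (p - p.mean()) / (p.std() + EPS)
    fixations = g / (g.max() + EPS) > fixation_threshold
    return float(z[fixations].mean()) if fixations.any() else 0.0
```

KLD and SIM compare the full shape of the two distributions, while NSS only asks whether the prediction is high where the ground truth is. That difference is why a method can rank differently across the three columns.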

[Figure: 2D affordance grounding results from Cross-view-AG, LOCATE, WSAG-PLSP, OOAL, Espresso-2D, and ground truth.] **Qualitative comparison on AGD20K.** Espresso-2D (pretrained on Affogato-750K, finetuned on AGD20K) produces more precise and better-localized affordance heatmaps than existing methods.

Annotation Quality

To construct a reliable benchmark, we conducted a human evaluation on 5K randomly sampled affordance query–heatmap pairs. The evaluation shows strong agreement between the automatically generated annotations and human judgment, achieving an 84.8% success rate. This human validation process is also used to construct the test split: pairs with incorrect queries are removed, and pairs with incorrect interaction points are manually corrected, yielding a reliable, high-quality evaluation benchmark.

Human validation results. Our automatic pipeline achieves an 84.8% success rate. The remaining failures fall into four modes: object misidentification and incorrect affordance queries (Stage 1), wrong interaction points (Stage 2), SAM edge bias merging functionally distinct parts (Stage 3), and limited camera angles.

Zero-shot Transfer to Robot Manipulation

As a harder test of real-world transfer, we evaluate Espresso-2D zero-shot on the Open X-Embodiment dataset, whose robot-manipulation scenes are visually distant from any Objaverse render and were never seen during pretraining. Despite this large domain gap, Espresso-2D still localizes the queried affordance accurately, demonstrating that Affogato-750K's synthetic-only supervision transfers to real robot inputs.

Zero-shot Espresso-2D affordance grounding for "Grasp the banana" on an Open X-Embodiment robot scene — **Zero-shot Espresso-2D on Open X-Embodiment robot scenes.** Affogato-750K-pretrained Espresso-2D accurately grounds open-vocabulary affordance queries across diverse real robot-manipulation scenes, despite never seeing real photos during pretraining.

Zero-shot Espresso-2D affordance grounding for "Grasp the knife" on an Open X-Embodiment robot scene

Citation

@article{lee2025affogato,
  title={Affogato: Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale},
  author={Lee, Junha and Park, Eunha and Park, Chunghyun and Kang, Dahyun and Cho, Minsu},
  journal={arXiv preprint arXiv:2506.12009},
  year={2025}
}
