Affordance grounding, the task of localizing object regions from natural-language interaction descriptions, is fundamental for embodied agents, yet it remains bottlenecked by the scarcity of large-scale training data. Manual affordance annotation is prohibitively expensive, which confines existing datasets to narrow object and affordance categories and prevents open-vocabulary generalization. We introduce Affogato, a benchmark for open-vocabulary affordance grounding. At its core is Affogato-750K, a large-scale dataset of 750K open-vocabulary 3D affordance annotations, constructed through an automated pipeline that leverages foundation models for affordance query generation, interaction-point prediction, and mask generation, followed by multi-view aggregation. It covers a significantly more diverse set of categories than any existing dataset, and 5K human-verified test pairs ensure reliable evaluation. We also present Espresso-3D and Espresso-2D, simple yet effective baselines that share the same architecture across 3D and 2D. Pretraining on Affogato-750K improves both Espresso and existing methods, yielding substantial gains, especially on unseen object and affordance categories, and demonstrating the dataset's effectiveness as a scalable source of open-vocabulary affordance supervision.
Our data annotation pipeline automatically generates high-quality affordance annotations by chaining three foundation models. Given multi-view renderings of a 3D object, Gemma3 generates open-vocabulary natural language affordance queries via chain-of-thought prompting, Molmo points to the interaction region for each query, and MobileSAM converts each point into a 2D segmentation mask. The multi-view masks are then lifted and aggregated on the 3D object surface via cross-view voting to obtain a final 3D affordance heatmap.
Gemma3 analyzes multi-view images and, via chain-of-thought prompting, produces five open-vocabulary affordance queries per object that describe how a human might interact with it.
Molmo grounds each affordance query to a spatial location, predicting the most likely interaction-point coordinates in every view.
MobileSAM converts predicted points into 2D segmentation masks, which are projected onto the 3D surface and fused via multi-view voting to suppress per-view errors.
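The multi-view voting step can be illustrated with a minimal sketch. It assumes that each 3D surface point has already been projected into every rendered view and that a per-view visibility flag is available; the function name, array layout, and the "fraction of visible views that vote yes" rule are illustrative assumptions, not the exact aggregation used to build Affogato-750K.

```python
import numpy as np

def aggregate_multiview_masks(pixel_coords, visible, masks):
    """Fuse per-view 2D masks into a per-point 3D heatmap by voting.

    pixel_coords: (V, N, 2) int array, (row, col) of each 3D point in each view.
    visible:      (V, N) bool array, True where the point is unoccluded in that view.
    masks:        (V, H, W) bool array, per-view 2D affordance masks.
    Returns an (N,) float array in [0, 1]: the fraction of views that see the
    point and also mark it as part of the affordance region.
    """
    num_views, num_points, _ = pixel_coords.shape
    votes = np.zeros(num_points, dtype=np.float64)
    seen = np.zeros(num_points, dtype=np.float64)
    for v in range(num_views):
        rows = pixel_coords[v, :, 0]
        cols = pixel_coords[v, :, 1]
        # A point receives a vote only if it is visible and falls inside the mask.
        hit = masks[v, rows, cols] & visible[v]
        votes += hit
        seen += visible[v]
    # Points that are never visible keep a score of 0.
    return np.divide(votes, seen, out=np.zeros_like(votes), where=seen > 0)
```

Because a point's score is averaged over the views that actually see it, a spurious mask in a single view is diluted by the other views, which is the intuition behind using voting to suppress per-view errors.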
Affogato-750K is the largest and most diverse dataset for affordance grounding to date, comprising 750K open-vocabulary affordance annotations across 150K 3D object instances from Objaverse. The objects span four categories — Daily-Used (121,799), Transportations (11,609), Furnitures (8,759), and Electronics (7,937) — each paired with five affordance query–heatmap pairs. Unlike existing datasets constrained by predefined taxonomies, Affogato-750K achieves truly open-vocabulary coverage with >450 object classes and >350 affordance types. A 5K human-verified test split is constructed via manual refinement of borderline cases, ensuring rigorous and reliable evaluation.
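As a quick consistency check on the figures above, the per-category object counts can be summed and multiplied by the five queries per object. The snippet uses only the numbers quoted in the text; the variable names are illustrative.

```python
# Per-category object counts as quoted in the text.
objects_per_category = {
    "Daily-Used": 121_799,
    "Transportations": 11_609,
    "Furnitures": 8_759,
    "Electronics": 7_937,
}
QUERIES_PER_OBJECT = 5  # five affordance query-heatmap pairs per object

total_objects = sum(objects_per_category.values())      # 150104, i.e. ~150K
total_annotations = total_objects * QUERIES_PER_OBJECT  # 750520, i.e. ~750K

# Share of each category among all objects.
category_share = {
    name: count / total_objects for name, count in objects_per_category.items()
}
```

The totals match the rounded 150K and 750K figures, and the shares show that Daily-Used objects account for roughly 81% of the dataset.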
Characteristics of Affogato-750K. The dataset exhibits high diversity in both object and affordance categories.
Affogato-750K substantially surpasses existing 3D and 2D affordance datasets in both diversity and scale, as measured by the number of affordance types, object classes, and samples.
We present Espresso, a unified baseline designed to cleanly validate the contribution of Affogato-750K. Building on a shared architecture, we instantiate two models: Espresso-3D for 3D point clouds and Espresso-2D for 2D images. Each consists of a modality-specific visual encoder, a text encoder, and a text-conditioned heatmap decoder. The core of our design is a heatmap decoder that replaces learnable queries with text embeddings, naturally supporting open-vocabulary affordance grounding without predefined categories. The affordance heatmap is predicted as the cross-modal similarity between the vision representation and the conditioned text embedding.
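To make the last sentence concrete, here is a minimal sketch of predicting a heatmap as the similarity between per-element visual features and a single text embedding. The use of cosine similarity, the temperature, and the sigmoid are assumptions made for illustration; the source only states that the heatmap is a cross-modal similarity, and the actual Espresso decoder may differ.

```python
import numpy as np

def similarity_heatmap(visual_feats, text_embed, temperature=0.1):
    """Score each visual element against a text query.

    visual_feats: (N, D) features, one per 3D point or image pixel.
    text_embed:   (D,) embedding of the affordance query.
    temperature:  assumed scaling factor (not from the source).
    Returns an (N,) array of scores in (0, 1).
    """
    # L2-normalize both modalities so the dot product is a cosine similarity.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_embed / (np.linalg.norm(text_embed) + 1e-8)
    logits = (v @ t) / temperature
    # Squash to (0, 1) so the output can be read as a heatmap.
    return 1.0 / (1.0 + np.exp(-logits))
```

Since the query enters only through its embedding, any natural-language description can be scored without a fixed list of affordance classes, which is what makes this formulation open-vocabulary.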
On the LASO benchmark, pretraining on Affogato-750K improves unseen aIoU, AUC, and SIM for all three methods we pretrain (OpenAD, PointRefer, and Espresso-3D), regardless of architecture. Affogato-750K-pretrained PointRefer achieves the largest unseen gain (+4.0%p aIoU), while Espresso-3D improves aIoU on both the seen (+1.5%p) and unseen (+2.1%p) settings, confirming Affogato-750K as a strong, model-agnostic source of open-vocabulary supervision.
| Method | Seen aIoU ↑ | Seen AUC ↑ | Seen SIM ↑ | Seen MAE ↓ | Unseen aIoU ↑ | Unseen AUC ↑ | Unseen SIM ↑ | Unseen MAE ↓ |
|---|---|---|---|---|---|---|---|---|
| Ref. Trans. | 13.7 | 79.8 | 49.7 | 0.124 | 10.2 | 69.1 | 43.2 | 0.145 |
| 3D-SPS | 11.4 | 76.2 | 43.3 | 0.138 | 7.9 | 68.8 | 40.2 | 0.158 |
| ReLA | 15.2 | 78.9 | 53.2 | 0.118 | 10.7 | 69.7 | 42.9 | 0.144 |
| IAGNet | 17.8 | 82.3 | 56.1 | 0.109 | 12.9 | 77.8 | 44.3 | 0.129 |
| OpenAD | 14.2 | 85.1 | 53.3 | 0.103 | 14.6 | 80.7 | 51.8 | 0.109 |
| + Affogato-750K pretrain | 16.1 | 86.8 | 53.9 | 0.100 | 15.5 | 81.8 | 53.4 | 0.103 |
| PointRefer | 20.8 | 87.3 | 62.9 | 0.093 | 14.6 | 80.2 | 50.7 | 0.119 |
| + Affogato-750K pretrain | 20.2 | 86.0 | 60.0 | 0.098 | 18.6 | 81.4 | 56.1 | 0.103 |
| Espresso-3D | 20.4 | 86.0 | 63.3 | 0.102 | 18.7 | 80.0 | 60.0 | 0.101 |
| + Affogato-750K pretrain | 21.9 | 85.9 | 63.7 | 0.116 | 20.8 | 82.9 | 61.4 | 0.122 |
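The aIoU gains quoted in the text can be reproduced directly from the table. The snippet below copies the seen and unseen aIoU values for the three pretrained methods and computes the change in percentage points; the dictionary layout is illustrative.

```python
# (without pretraining, with Affogato-750K pretraining) aIoU values from the table.
aiou = {
    "OpenAD":      {"seen": (14.2, 16.1), "unseen": (14.6, 15.5)},
    "PointRefer":  {"seen": (20.8, 20.2), "unseen": (14.6, 18.6)},
    "Espresso-3D": {"seen": (20.4, 21.9), "unseen": (18.7, 20.8)},
}

# Gain in percentage points, rounded to the table's precision.
gains = {
    method: {
        split: round(after - before, 1)
        for split, (before, after) in splits.items()
    }
    for method, splits in aiou.items()
}
# gains["PointRefer"]["unseen"]  -> 4.0
# gains["Espresso-3D"]["seen"]   -> 1.5
# gains["Espresso-3D"]["unseen"] -> 2.1
```

The same computation shows that unseen aIoU rises for every pretrained method, while PointRefer's seen aIoU drops slightly (by 0.6 points) after pretraining.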
On AGD20K, Espresso-2D with Affogato-750K pretraining outperforms all baselines under zero-shot evaluation, reaching 40.2 / 37.6 SIM on the seen / unseen splits versus 22–30 SIM for competing methods. Under supervised learning, Affogato-750K pretraining provides consistent improvements over training solely on AGD20K — Espresso-2D gains +1.6 SIM and −0.060 KLD under full supervision, and existing methods such as AffordanceNet similarly benefit.
Zero-shot evaluation on AGD20K:

| Method | KLD ↓ | SIM ↑ | NSS ↑ |
|---|---|---|---|
| Molmo+SAM2 | 1.953 | 22.6 | 0.718 |
| LISA-7B | 1.830 | 25.6 | 0.765 |
| M²SA-7B | 1.925 | 22.7 | 0.657 |
| AffordanceNet | 1.896 | 22.3 | 0.736 |
| Espresso-2D | 1.571 | 37.6 | 1.016 |

Supervised learning on AGD20K:

| Method | KLD ↓ | SIM ↑ | NSS ↑ |
|---|---|---|---|
| LOCATE-Sup | 1.907 | 23.6 | 0.641 |
| LOCATE-Sup-OWL | 1.927 | 23.4 | 0.624 |
| AffordanceLLM | 1.463 | 37.7 | 1.070 |
| AffordanceNet | 1.627 | 37.0 | 1.002 |
| + Affogato-750K pretrain | 1.622 | 38.0 | 1.022 |
| Espresso-2D | 1.034 | 50.3 | 1.550 |
| + Affogato-750K pretrain | 0.974 | 51.9 | 1.645 |
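For readers unfamiliar with the three metrics in the tables above, the following sketch implements their commonly used saliency-style definitions. It is an assumption that the benchmark's evaluation code matches these formulas exactly; in particular, the NSS threshold and the epsilon are illustrative, and the tables appear to report SIM scaled by 100.

```python
import numpy as np

EPS = 1e-12

def _to_distribution(heatmap):
    """Normalize a non-negative heatmap so that it sums to 1."""
    heatmap = np.asarray(heatmap, dtype=np.float64)
    return heatmap / (heatmap.sum() + EPS)

def kld(pred, gt):
    """KL divergence from the ground-truth to the predicted distribution (lower is better)."""
    p, g = _to_distribution(pred), _to_distribution(gt)
    return float(np.sum(g * np.log(EPS + g / (p + EPS))))

def sim(pred, gt):
    """Histogram intersection of the two distributions, in [0, 1] (higher is better)."""
    p, g = _to_distribution(pred), _to_distribution(gt)
    return float(np.minimum(p, g).sum())

def nss(pred, gt, threshold=0.1):
    """Mean standardized prediction at ground-truth locations (higher is better)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    z = (pred - pred.mean()) / (pred.std() + EPS)
    fixations = gt > threshold  # assumed rule for picking ground-truth locations
    if not fixations.any():
        return 0.0
    return float(z[fixations].mean())
```

Under these definitions a perfect prediction gives a KLD near 0 and a SIM near 1, which is consistent with the arrows in the table headers.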
To construct a reliable benchmark, we conducted a human evaluation on 5K randomly sampled affordance query–heatmap pairs. The evaluation shows strong agreement between the automatically generated annotations and human judgment, achieving an 84.8% success rate. This human validation process is also used to construct the test split: pairs with incorrect queries are removed, and pairs with incorrect interaction points are manually corrected, yielding a reliable, high-quality evaluation benchmark.
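The test-split construction described above can be sketched as a simple triage over reviewed pairs. The record fields and the verdict labels (`"ok"`, `"bad_query"`, `"bad_point"`) are hypothetical names introduced for illustration; the source specifies only that pairs with incorrect queries are removed and pairs with incorrect interaction points are corrected by hand.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class ReviewedPair:
    query: str
    point: Tuple[float, float]
    verdict: str  # hypothetical labels: "ok", "bad_query", "bad_point"
    corrected_point: Optional[Tuple[float, float]] = None

def success_rate(pairs: List[ReviewedPair]) -> float:
    """Fraction of automatically generated pairs judged correct as-is."""
    return sum(p.verdict == "ok" for p in pairs) / len(pairs)

def build_test_split(pairs: List[ReviewedPair]) -> List[ReviewedPair]:
    """Drop pairs with wrong queries; substitute the human-corrected point otherwise."""
    kept = []
    for pair in pairs:
        if pair.verdict == "bad_query":
            continue  # incorrect queries are removed entirely
        if pair.verdict == "bad_point":
            if pair.corrected_point is None:
                raise ValueError(f"missing correction for query: {pair.query!r}")
            pair = replace(pair, point=pair.corrected_point, verdict="ok")
        kept.append(pair)
    return kept
```

Note that the success rate is computed on the pairs before any correction, whereas the resulting split contains only pairs that a human has either accepted or fixed.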
As a harder test of real-world transfer, we evaluate Espresso-2D zero-shot on the Open X-Embodiment dataset, whose robot-manipulation scenes are visually distant from any Objaverse render and were never seen during pretraining. Despite this large domain gap, Espresso-2D still localizes the queried affordance accurately, demonstrating that Affogato-750K's synthetic-only supervision transfers to real robot inputs.
“Grasp the banana”
“Grasp the knife”
“Open the box drawer”
“Open the oven”
@article{lee2025affogato,
title={Affogato: Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale},
author={Lee, Junha and Park, Eunha and Park, Chunghyun and Kang, Dahyun and Cho, Minsu},
journal={arXiv preprint arXiv:2506.12009},
year={2025}
}