{junha.lee, dmsgk724, p0125ch, dahyun.kang, mscho}@postech.ac.kr
Affogato benchmark overview: large-scale dataset generation, pretraining on Affogato-750K, and transferring to target datasets

Overview of our framework. Our framework consists of three stages: large-scale dataset generation, pretraining, and transferring to target datasets. In the dataset generation stage, we leverage foundation models — Gemma, Molmo, and MobileSAM — to construct Affogato-750K, a large-scale synthetic dataset comprising 750K open-vocabulary affordance annotations across 150K 3D objects. We pretrain Espresso-3D and Espresso-2D on the Affogato-750K train split, then finetune our pretrained models on various target datasets.

Abstract

Affordance grounding, localizing object regions given natural language interaction descriptions, is fundamental for embodied agents yet remains bottlenecked by the scarcity of large-scale training data. Manual affordance annotation is prohibitively expensive, confining existing datasets to narrow object and affordance categories and preventing open-vocabulary generalization. We introduce Affogato, a benchmark for open-vocabulary affordance grounding. At its core is Affogato-750K, a large-scale dataset of 750K open-vocabulary 3D affordance annotations, constructed through an automated pipeline that leverages foundation models for affordance query generation, interaction-point prediction, and mask generation, followed by multi-view aggregation. It covers significantly more diverse categories than any existing dataset, with 5K human-verified test pairs ensuring evaluation reliability. We further present Espresso-3D and Espresso-2D, simple yet effective baselines sharing the same architecture across 3D and 2D. Pretraining on Affogato-750K further boosts both Espresso and existing methods, yielding substantial gains especially on unseen object and affordance categories, demonstrating the dataset's effectiveness as a scalable source of open-vocabulary affordance supervision.

Data Annotation Pipeline

Our data annotation pipeline automatically generates high-quality affordance annotations by chaining three foundation models. Given multi-view renderings of a 3D object, Gemma3 generates open-vocabulary natural language affordance queries via chain-of-thought prompting, Molmo points to the interaction region for each query, and MobileSAM converts each point into a 2D segmentation mask. The multi-view masks are then lifted and aggregated on the 3D object surface via cross-view voting to obtain a final 3D affordance heatmap.

Data annotation pipeline: Gemma generates queries, Molmo points, MobileSAM segments, then 2D to 3D lifting and voting
1. Open-Vocabulary Query Generation

Gemma3 analyzes multi-view images and, via chain-of-thought prompting, produces five open-vocabulary affordance queries per object that describe how a human might interact with it.

2. Interaction Point Prediction

Molmo grounds each affordance query to a spatial location, predicting the most likely interaction-point coordinates in every view.

3. Heatmap Generation & Aggregation

MobileSAM converts predicted points into 2D segmentation masks, which are projected onto the 3D surface and fused via multi-view voting to suppress per-view errors.
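
The multi-view voting in step 3 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each view supplies a per-point visibility flag and a per-point mask hit (the hypothetical inputs `visible` and `in_mask`), and it scores every 3D point by the fraction of views observing it whose 2D mask covers it.

```python
from typing import List, Sequence


def aggregate_views(
    visible: Sequence[Sequence[bool]],
    in_mask: Sequence[Sequence[bool]],
) -> List[float]:
    """Fuse per-view mask hits into a per-point heatmap by voting.

    visible[v][p] -- point p projects to an unoccluded pixel in view v
    in_mask[v][p] -- that pixel lies inside the 2D mask of view v
    Both inputs are hypothetical; the real pipeline's interfaces may differ.
    """
    num_views = len(visible)
    num_points = len(visible[0])
    heatmap = []
    for p in range(num_points):
        seen = sum(1 for v in range(num_views) if visible[v][p])
        votes = sum(
            1 for v in range(num_views) if visible[v][p] and in_mask[v][p]
        )
        # Points that no view observes get 0 instead of a division by zero.
        heatmap.append(votes / seen if seen else 0.0)
    return heatmap


# Three views, four points. Only one view marks point 2, so voting damps it.
visible = [
    [True, True, True, False],
    [True, True, True, False],
    [True, False, True, False],
]
in_mask = [
    [True, False, False, False],
    [True, False, True, False],
    [True, False, False, False],
]
heatmap = aggregate_views(visible, in_mask)
```

In this toy case the point covered by every view's mask keeps full weight, while the point marked by a single view drops to one third, which is the sense in which voting suppresses per-view errors.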

The Affogato-750K Dataset

Affogato-750K is the largest and most diverse dataset for affordance grounding to date, comprising 750K open-vocabulary affordance annotations across 150K 3D object instances from Objaverse. The objects span four categories: Daily-Used (121,799), Transportations (11,609), Furnitures (8,759), and Electronics (7,937). Every object is paired with five affordance query–heatmap pairs. Unlike existing datasets constrained by predefined taxonomies, Affogato-750K is not tied to a fixed label set and covers >450 object classes and >350 affordance types. A 5K human-verified test split, constructed with manual refinement of borderline cases, supports rigorous and reliable evaluation.
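
As a quick consistency check on the figures above, the per-category counts can be summed and multiplied by the five queries per object. The snippet uses only numbers stated in the text; the variable names are illustrative.

```python
# Per-category object counts as stated in the dataset description.
category_counts = {
    "Daily-Used": 121_799,
    "Transportations": 11_609,
    "Furnitures": 8_759,
    "Electronics": 7_937,
}
QUERIES_PER_OBJECT = 5  # five query-heatmap pairs per object

total_objects = sum(category_counts.values())
total_annotations = total_objects * QUERIES_PER_OBJECT
shares = {name: count / total_objects for name, count in category_counts.items()}

print(total_objects, total_annotations)  # 150104 750520
```

The totals match the rounded 150K objects and 750K annotations, and the shares show that Daily-Used objects make up roughly 81% of the dataset.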

750K affordance annotations
150K 3D object instances
>450 object classes
>350 affordance types
Word cloud of object categories in Affogato-750K
(a) Object classes
Word cloud of affordance types in Affogato-750K
(b) Affordance classes

Characteristics of Affogato-750K. The dataset exhibits high diversity in both object and affordance categories.

Dataset Comparison

Affogato-750K substantially surpasses existing 3D and 2D affordance datasets in both diversity and scale. The comparison charts (one for 3D datasets, one for 2D datasets) report the number of affordance types, object classes, and samples for each dataset.

Espresso Model Architecture

We present Espresso, a unified baseline designed to cleanly validate the contribution of Affogato-750K. Building on a shared architecture, we instantiate two models: Espresso-3D for 3D point clouds and Espresso-2D for 2D images. Each consists of a modality-specific visual encoder, a text encoder, and a text-conditioned heatmap decoder. The core of our design is a heatmap decoder that replaces learnable queries with text embeddings, naturally supporting open-vocabulary affordance grounding without predefined categories. The affordance heatmap is predicted as the cross-modal similarity between the vision representation and the conditioned text embedding.
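
The final step, predicting the heatmap as a cross-modal similarity, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Espresso decoder itself: the text says only that the heatmap is a similarity between vision features and the conditioned text embedding, so the cosine similarity, the sigmoid, and the `temperature` parameter here are illustrative choices.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_heatmap(
    visual_features: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
    temperature: float = 0.1,  # assumed scaling, not taken from the paper
) -> List[float]:
    """Score each point or pixel feature against a single text embedding."""
    return [
        1.0 / (1.0 + math.exp(-cosine(f, text_embedding) / temperature))
        for f in visual_features
    ]


# Three toy features: aligned with, orthogonal to, and opposite to the query.
features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query_embedding = [1.0, 0.0]
heat = similarity_heatmap(features, query_embedding)
```

Because the query enters only as an embedding, any free-form affordance description can be scored without a predefined category list, which is the property the design relies on.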

Espresso-3D and Espresso-2D architectures with vision encoder, text encoder, and conditional heatmap decoder

Results

3D Affordance Grounding (LASO)

On the LASO benchmark, pretraining on Affogato-750K improves unseen-split aIoU, AUC, and SIM for every method it is applied to (OpenAD, PointRefer, and Espresso-3D), regardless of architecture. Affogato-750K-pretrained PointRefer achieves the largest unseen gain (+4.0 aIoU points), while Espresso-3D improves in aIoU on both the seen (+1.5 points) and unseen (+2.1 points) settings. MAE is mixed: it drops on the unseen split for OpenAD and PointRefer but rises for Espresso-3D. Overall, the results support Affogato-750K as a strong, model-agnostic source of open-vocabulary supervision.

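| Method | Seen aIoU ↑ | Seen AUC ↑ | Seen SIM ↑ | Seen MAE ↓ | Unseen aIoU ↑ | Unseen AUC ↑ | Unseen SIM ↑ | Unseen MAE ↓ |
|---|---|---|---|---|---|---|---|---|
| Ref. Trans. | 13.7 | 79.8 | 49.7 | 0.124 | 10.2 | 69.1 | 43.2 | 0.145 |
| 3D-SPS | 11.4 | 76.2 | 43.3 | 0.138 | 7.9 | 68.8 | 40.2 | 0.158 |
| ReLA | 15.2 | 78.9 | 53.2 | 0.118 | 10.7 | 69.7 | 42.9 | 0.144 |
| IAGNet | 17.8 | 82.3 | 56.1 | 0.109 | 12.9 | 77.8 | 44.3 | 0.129 |
| OpenAD | 14.2 | 85.1 | 53.3 | 0.103 | 14.6 | 80.7 | 51.8 | 0.109 |
| + Affogato-750K pretrain | 16.1 | 86.8 | 53.9 | 0.100 | 15.5 | 81.8 | 53.4 | 0.103 |
| PointRefer | 20.8 | 87.3 | 62.9 | 0.093 | 14.6 | 80.2 | 50.7 | 0.119 |
| + Affogato-750K pretrain | 20.2 | 86.0 | 60.0 | 0.098 | 18.6 | 81.4 | 56.1 | 0.103 |
| Espresso-3D | 20.4 | 86.0 | 63.3 | 0.102 | 18.7 | 80.0 | 60.0 | 0.101 |
| + Affogato-750K pretrain | 21.9 | 85.9 | 63.7 | 0.116 | 20.8 | 82.9 | 61.4 | 0.122 |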
Qualitative comparison of 3D affordance grounding between OpenAD, PointRefer, Espresso-3D, and ground truth on LASO
Qualitative comparison on LASO. Espresso-3D produces more accurate and focused affordance heatmaps compared to OpenAD and PointRefer, closely matching the ground truth.
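
For readers unfamiliar with the heatmap metrics in the table, the sketch below shows one common way to compute aIoU, SIM, and MAE on per-point heatmaps. The exact evaluation protocol of the benchmark is not given here, so the threshold grid, the ground-truth binarization at 0.5, and the function names are assumptions. AUC is omitted.

```python
from typing import Sequence


def mae(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute error between two per-point heatmaps in [0, 1]."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def sim(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Histogram intersection after scaling each map to sum to 1."""
    p_sum, g_sum = sum(pred), sum(gt)
    if p_sum == 0 or g_sum == 0:
        return 0.0
    return sum(min(p / p_sum, g / g_sum) for p, g in zip(pred, gt))


def aiou(
    pred: Sequence[float],
    gt: Sequence[float],
    gt_threshold: float = 0.5,  # assumed ground-truth binarization
    steps: int = 9,             # thresholds 0.1, 0.2, ..., 0.9
) -> float:
    """IoU averaged over a grid of prediction thresholds."""
    gt_bin = [g > gt_threshold for g in gt]
    total = 0.0
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        pred_bin = [p > t for p in pred]
        inter = sum(a and b for a, b in zip(pred_bin, gt_bin))
        union = sum(a or b for a, b in zip(pred_bin, gt_bin))
        total += inter / union if union else 0.0
    return total / steps


# Toy example: four points, the first two belong to the affordance region.
pred = [0.95, 0.75, 0.15, 0.0]
gt = [1.0, 1.0, 0.0, 0.0]
scores = {"aIoU": aiou(pred, gt), "SIM": sim(pred, gt), "MAE": mae(pred, gt)}
```

Higher is better for aIoU and SIM and lower is better for MAE, matching the arrows in the table header.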

2D Affordance Grounding (AGD20K)

On AGD20K, Espresso-2D with Affogato-750K pretraining outperforms all baselines under zero-shot evaluation, reaching 40.2 / 37.6 SIM on the seen / unseen splits versus 22–30 SIM for competing methods. Under supervised learning, Affogato-750K pretraining improves on training solely on AGD20K: Espresso-2D gains +1.6 SIM and lowers KLD by 0.060 under full supervision, and AffordanceNet also benefits (+1.0 SIM, KLD 1.627 to 1.622).

(a) Zero-shot on the AGD20K unseen split (no AGD20K training); Espresso-2D is pretrained on Affogato-750K.
| Method | KLD ↓ | SIM ↑ | NSS ↑ |
|---|---|---|---|
| Molmo+SAM2 | 1.953 | 22.6 | 0.718 |
| LISA-7B | 1.830 | 25.6 | 0.765 |
| M²SA-7B | 1.925 | 22.7 | 0.657 |
| AffordanceNet | 1.896 | 22.3 | 0.736 |
| Espresso-2D | 1.571 | 37.6 | 1.016 |
(b) Fully-supervised on AGD20K. “+ Affogato-750K pretrain” denotes pretraining on Affogato-750K, then finetuning.
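| Method | KLD ↓ | SIM ↑ | NSS ↑ |
|---|---|---|---|
| LOCATE-Sup | 1.907 | 23.6 | 0.641 |
| LOCATE-Sup-OWL | 1.927 | 23.4 | 0.624 |
| AffordanceLLM | 1.463 | 37.7 | 1.070 |
| AffordanceNet | 1.627 | 37.0 | 1.002 |
| + Affogato-750K pretrain | 1.622 | 38.0 | 1.022 |
| Espresso-2D | 1.034 | 50.3 | 1.550 |
| + Affogato-750K pretrain | 0.974 | 51.9 | 1.645 |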
Qualitative comparison of 2D affordance grounding between Cross-view-AG, LOCATE, WSAG-PLSP, OOAL, Espresso-2D, and ground truth
Qualitative comparison on AGD20K. Espresso-2D (pretrained on Affogato-750K, finetuned on AGD20K) produces more precise and well-localized affordance heatmaps compared to existing methods.
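
The KLD and NSS columns follow metrics commonly used for saliency-style heatmaps. The sketch below gives one standard formulation on flattened maps. The epsilon, the fixation threshold, and the function names are assumptions, and the benchmark's own evaluation code may differ in details such as normalization.

```python
import math
from typing import Sequence

EPS = 1e-12  # assumed smoothing constant


def kld(pred: Sequence[float], gt: Sequence[float]) -> float:
    """KL divergence from the prediction to the ground truth.

    Both maps are scaled to sum to 1 first; they must not be all zeros.
    """
    p_sum, g_sum = sum(pred), sum(gt)
    total = 0.0
    for p, g in zip(pred, gt):
        p_n, g_n = p / p_sum, g / g_sum
        total += g_n * math.log(EPS + g_n / (p_n + EPS))
    return total


def nss(
    pred: Sequence[float],
    gt: Sequence[float],
    fixation_threshold: float = 0.5,  # assumed binarization of the ground truth
) -> float:
    """Mean standardized prediction value at ground-truth positive locations."""
    mean = sum(pred) / len(pred)
    std = math.sqrt(sum((p - mean) ** 2 for p in pred) / len(pred))
    if std == 0:
        return 0.0
    hits = [(p - mean) / std for p, g in zip(pred, gt) if g > fixation_threshold]
    return sum(hits) / len(hits) if hits else 0.0


# Toy flattened maps: the ground truth puts all mass on the first location.
pred = [0.7, 0.2, 0.1, 0.0]
gt = [1.0, 0.0, 0.0, 0.0]
kld_value = kld(pred, gt)
nss_value = nss(pred, gt)
```

A perfect prediction drives KLD toward zero, while NSS grows as the prediction concentrates on the annotated region, which is why the table marks KLD with a down arrow and NSS with an up arrow.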

Annotation Quality

To construct a reliable benchmark, we conducted a human evaluation on 5K randomly sampled affordance query–heatmap pairs. The evaluation shows strong agreement between the automatically generated annotations and human judgment, achieving an 84.8% success rate. This human validation process is also used to construct the test split: pairs with incorrect queries are removed, and pairs with incorrect interaction points are manually corrected, yielding a reliable, high-quality evaluation benchmark.

Human validation results. Our automatic pipeline achieves an 84.8% success rate. The observed failure modes are object misidentification and incorrect affordance queries (Stage 1), wrong interaction points (Stage 2), SAM edge bias merging functionally distinct parts (Stage 3), and limited camera angles.
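
The test-split rule described above (drop pairs with incorrect queries, keep pairs with incorrect points only after manual correction) can be sketched as a small filter. The record layout, the verdict labels, and the function names are hypothetical and do not describe the authors' tooling.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class ReviewedPair:
    query: str
    point: Point
    verdict: str  # "ok", "bad_query", or "bad_point" (hypothetical labels)
    corrected_point: Optional[Point] = None


def success_rate(pairs: Sequence[ReviewedPair]) -> float:
    """Fraction of automatically generated pairs the reviewer accepted as-is."""
    return sum(p.verdict == "ok" for p in pairs) / len(pairs)


def build_test_split(pairs: Sequence[ReviewedPair]) -> List[ReviewedPair]:
    """Remove bad queries; substitute the manual fix for bad points."""
    kept = []
    for pair in pairs:
        if pair.verdict == "bad_query":
            continue  # incorrect queries are removed
        if pair.verdict == "bad_point":
            if pair.corrected_point is None:
                continue  # no manual correction available, so drop the pair
            pair = replace(
                pair, point=pair.corrected_point, verdict="ok", corrected_point=None
            )
        kept.append(pair)
    return kept


pairs = [
    ReviewedPair("Grasp the handle", (0.4, 0.6), "ok"),
    ReviewedPair("Press the button", (0.2, 0.3), "ok"),
    ReviewedPair("Open the lid", (0.5, 0.5), "bad_query"),
    ReviewedPair("Twist the cap", (0.1, 0.9), "bad_point", (0.5, 0.1)),
]
rate = success_rate(pairs)
split = build_test_split(pairs)
```

In this toy review, half of the pairs pass unchanged, one is discarded, and one enters the split with its corrected point.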

Zero-shot Transfer to Robot Manipulation

As a harder test of real-world transfer, we evaluate Espresso-2D zero-shot on the Open X-Embodiment dataset, whose robot-manipulation scenes are visually distant from any Objaverse render and were never seen during pretraining. Despite this large domain gap, Espresso-2D still localizes the queried affordance accurately, demonstrating that Affogato-750K's synthetic-only supervision transfers to real robot inputs.

Zero-shot Espresso-2D affordance grounding on an Open X-Embodiment robot scene: “Grasp the banana”
Zero-shot Espresso-2D affordance grounding on an Open X-Embodiment robot scene: “Grasp the knife”
Zero-shot Espresso-2D affordance grounding on an Open X-Embodiment robot scene: “Open the box drawer”
Zero-shot Espresso-2D affordance grounding on an Open X-Embodiment robot scene: “Open the oven”
Zero-shot Espresso-2D on Open X-Embodiment robot scenes. Affogato-750K-pretrained Espresso-2D accurately grounds open-vocabulary affordance queries across diverse real robot-manipulation scenes, despite never seeing real photos during pretraining.

Citation

@article{lee2025affogato,
  title={Affogato: Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale},
  author={Lee, Junha and Park, Eunha and Park, Chunghyun and Kang, Dahyun and Cho, Minsu},
  journal={arXiv preprint arXiv:2506.12009},
  year={2025}
}