DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning

¹POSTECH, ²RLWRLD
 
DextER teaser: Embodied reasoning for language-driven dexterous grasp generation

DextER introduces contact-based embodied reasoning for language-driven dexterous grasp generation. Given a 3D object and instruction, DextER autoregressively predicts which finger links contact where on the object surface before generating the final grasp.

Abstract

Language-driven dexterous grasp generation requires models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions.

We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration.

On DexGYS[1], DextER achieves a 67.14% success rate, outperforming the previous state of the art by 3.83 percentage points and improving intention alignment by 96.4%. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.

Key Contributions

Embodied Contact Reasoning: We propose contact-centric reasoning as an effective embodied thinking process for language-driven dexterous manipulation, achieving state-of-the-art performance by explicitly modeling the physical interaction structure between multi-fingered hands and objects.

Physics-based Contact Annotations: We annotate the DexGYS[1] and Dexonomy[2] datasets with physics-based contact labels computed in the MuJoCo simulator, and generate natural language grasp descriptions with a vision-language model, enabling large-scale training of contact-aware models (a minimal contact-extraction sketch follows these contributions).

Steerable Grasp Generation: Our autoregressive generation framework enables steerable grasp generation, where users can guide the model by specifying partial contact constraints (e.g., which fingers should contact the object, or precise contact positions), and the model completes the remaining sequence while respecting these constraints.
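
The contact-annotation step mentioned above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example of reading per-link hand-object contacts from a MuJoCo simulation state; the scene file name, body-name prefix, and filtering logic are assumptions for illustration and do not reproduce the exact annotation pipeline used in the paper.

```python
import mujoco
import numpy as np

# "hand_object_scene.xml", the "hand_" link prefix, and the "object" body name
# are hypothetical placeholders for illustration.
model = mujoco.MjModel.from_xml_path("hand_object_scene.xml")
data = mujoco.MjData(model)

# Assume data.qpos has already been set to the grasp configuration to annotate.
mujoco.mj_forward(model, data)  # runs collision detection for the current state

contacts = {}  # finger-link name -> list of 3D contact positions
for i in range(data.ncon):
    con = data.contact[i]
    bodies = []
    for geom_id in (con.geom1, con.geom2):
        body_id = model.geom_bodyid[geom_id]
        bodies.append(mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_BODY, body_id))
    # Keep only contacts that pair exactly one hand link with the object body.
    hand_links = [b for b in bodies if b is not None and b.startswith("hand_")]
    if len(hand_links) == 1 and "object" in bodies:
        contacts.setdefault(hand_links[0], []).append(np.array(con.pos))

for link, points in contacts.items():
    print(link, np.mean(points, axis=0))  # e.g., mean contact position per link
```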

Method

DextER model architecture

DextER processes 3D point clouds and language instructions to predict dexterous grasping actions for a multi-fingered robotic hand. The input point cloud and the textual grasp description are encoded into tokens using a pretrained point cloud encoder (PartField[3]) and a text tokenizer. The LLM backbone (Qwen2.5 0.5B) fuses the point cloud embeddings with the text prompt and autoregressively generates discretized contact and action tokens, which are then de-tokenized into contact positions, hand joint configurations, and grasp poses.
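
As a rough illustration of this input fusion, the sketch below loads a Qwen2.5 0.5B backbone, extends its vocabulary with discrete contact/grasp tokens, and prepends projected point-cloud features to the embedded instruction. The projection layer, feature dimension, and vocabulary sizes are assumptions, not the released DextER code.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Add discrete contact / grasp tokens to the vocabulary (bin count is an assumption).
num_bins = 256
new_tokens = [f"<contact_{i}>" for i in range(num_bins)] + \
             [f"<grasp_{i}>" for i in range(num_bins)]
tokenizer.add_tokens(new_tokens)
llm.resize_token_embeddings(len(tokenizer))

hidden = llm.config.hidden_size
point_proj = nn.Linear(768, hidden)  # 768 = assumed point-feature dimension

def fuse_inputs(point_feats, instruction):
    """Concatenate projected point-cloud tokens with embedded instruction tokens."""
    text_ids = tokenizer(instruction, return_tensors="pt").input_ids
    text_emb = llm.get_input_embeddings()(text_ids)       # (1, T, H)
    point_emb = point_proj(point_feats).unsqueeze(0)      # (1, P, H)
    return torch.cat([point_emb, text_emb], dim=1)        # prefix fed to the LLM
```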

The model first produces embodied contact tokens that specify which finger links contact the object and their 3D positions on the object surface, followed by grasp tokens that encode the complete hand configuration. All tokens are generated autoregressively within a unified next-token prediction framework, enabling interpretable contact reasoning as an intermediate step.
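
A minimal sketch of this two-stage autoregressive decoding is shown below, assuming a fixed number of contact tokens followed by grasp tokens; the token counts and greedy decoding are illustrative choices, not the paper's exact settings.

```python
import torch

@torch.no_grad()
def generate_grasp(llm, inputs_embeds, num_contact_tokens=20, num_grasp_tokens=30):
    """Decode contact tokens first, then grasp tokens, via next-token prediction."""
    generated = []
    embeds = inputs_embeds
    for _ in range(num_contact_tokens + num_grasp_tokens):
        logits = llm(inputs_embeds=embeds).logits[:, -1, :]  # next-token distribution
        next_id = logits.argmax(dim=-1)                      # greedy; sampling also works
        generated.append(next_id)
        next_emb = llm.get_input_embeddings()(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_emb], dim=1)
    contact_ids = generated[:num_contact_tokens]   # which links touch, and where
    grasp_ids = generated[num_contact_tokens:]     # discretized hand pose / joint angles
    return contact_ids, grasp_ids
```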

Results on DexGYS

Quantitative Results

| Method | P-FID ↓ | CD ↓ | Con. ↓ | Success (%) ↑ | Q1 | Pen. ↓ | δt | δr | δq |
|---|---|---|---|---|---|---|---|---|---|
| GraspCVAE[5] | 29.02 | 3.14 | 0.96 | 29.12 | 0.54 | 0.55 | 0.18 | 1.76 | 0.18 |
| GraspTTA[6] | 33.15 | 12.19 | 1.11 | 43.46 | 0.71 | 0.19 | 2.11 | 6.15 | 3.87 |
| SceneDiffusers[7] | 7.93 | 1.68 | 0.45 | 62.24 | 0.83 | 0.25 | 0.35 | 3.46 | 0.39 |
| DGTR[4] | 15.77 | 2.90 | 0.78 | 51.91 | 0.78 | 0.16 | 2.05 | 14.01 | 4.30 |
| DexGYSNet[1] | 5.60 | 1.20 | 0.36 | 63.31 | 0.83 | 0.22 | 6.12 | 55.68 | 6.12 |
| DextER (w/o ER) | 0.30 | 1.95 | 0.40 | 62.37 | 0.66 | 0.44 | 8.78 | 77.13 | 13.77 |
| DextER | 0.20 | 1.46 | 0.34 | 67.14 | 0.89 | 0.37 | 8.84 | 77.98 | 13.63 |

Language-conditioned grasp generation on DexGYS validation set. DextER outperforms all baselines with substantial improvements in intention alignment (P-FID, CD, Con.), physical quality (Success, Q1, Pen.), and diversity (δt, δr, δq).
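
For reference, a minimal sketch of a Chamfer-distance computation is given below, assuming CD here denotes a bidirectional Chamfer distance between the generated and ground-truth hand point clouds; the benchmark's exact evaluation protocol follows DexGYS[1].

```python
import torch

def chamfer_distance(pred, gt):
    """pred: (N, 3) generated hand points, gt: (M, 3) ground-truth hand points."""
    dists = torch.cdist(pred, gt)  # (N, M) pairwise Euclidean distances
    return dists.min(dim=1).values.mean() + dists.min(dim=0).values.mean()
```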

Qualitative Results

Qualitative results on DexGYS benchmark

Given object point clouds and natural language instructions, DextER generates embodied contact predictions (shown as colored spheres on object surfaces) followed by grasp configurations. The model successfully captures task-specific contact patterns and produces physically plausible grasps that align with language instructions across diverse objects and manipulation intents.

More Qualitative Results

Additional qualitative results on DexGYS

Additional qualitative comparisons on the DexGYS validation set. DextER consistently generates grasps that better align with the given language instructions compared to DexGYSNet, demonstrating superior intention alignment across a variety of objects and grasp descriptions.

Zero-shot Generalization on Dexonomy

We evaluate DextER's zero-shot generalization on the Dexonomy dataset, which organizes grasps into a taxonomy of 31 types (e.g., power grasp, precision pinch). We design data splits that test generalization to novel objects, to unseen grasp types, and to the challenging setting where both are novel. DextER outperforms the baseline methods across all three settings.

Qualitative results on Dexonomy benchmark

Zero-shot generalization results on Dexonomy. DextER produces grasps that more closely match the ground truth configurations compared to DexGYSNet, especially for fine-grained grasp instructions involving specific finger placements.

Steerable Grasp Generation

Beyond standard language-conditioned generation, DextER enables steerable grasp generation, giving users fine-grained control over the grasp configuration. By providing a partially filled contact sequence as context (e.g., specifying one to five finger links and their contact positions), users can guide the model to complete the remaining sequence while respecting these constraints. Providing more steering context yields grasps that align more closely with the constraints, as reflected in substantially improved intention alignment metrics.
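
A minimal sketch of this steering mechanism is shown below: user-specified contact tokens are embedded and appended to the prompt as a fixed prefix, and the model decodes only the remaining tokens. The token names (matching the vocabulary extension sketched in the Method section) and the decoding budget are hypothetical.

```python
import torch

@torch.no_grad()
def steerable_generate(llm, tokenizer, inputs_embeds, user_contact_tokens, num_remaining=40):
    """Complete a grasp sequence while honoring user-specified contact tokens."""
    # Embed the user-specified contacts (e.g., ["<contact_12>", "<contact_87>"])
    # so they act as an already-decoded prefix.
    ids = tokenizer.convert_tokens_to_ids(user_contact_tokens)
    prefix = llm.get_input_embeddings()(torch.tensor([ids]))
    embeds = torch.cat([inputs_embeds, prefix], dim=1)

    generated = list(ids)  # the constraint tokens are kept verbatim in the output
    for _ in range(num_remaining):
        logits = llm(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)
        generated.append(int(next_id))
        next_emb = llm.get_input_embeddings()(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_emb], dim=1)
    return generated  # de-tokenize into contacts + grasp as in standard generation
```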

Steerable grasp generation results

Steerable grasp generation with varying levels of partial contact specification (Steer1–5 = specifying 1–5 finger link contacts). As more contact constraints are provided, the generated grasps progressively converge toward the ground truth, demonstrating fine-grained user control over grasp synthesis.

References

  1. Yi-Lin Wei, Jian-Jian Jiang, Chengyi Xing, Xian-Tuo Tan, Xiao-Ming Wu, Hao Li, Mark Cutkosky, and Wei-Shi Zheng. Grasp as You Say: Language-guided Dexterous Grasp Generation. NeurIPS, 2024.
  2. Jiayi Chen, Yubin Ke, Lin Peng, and He Wang. Dexonomy: Synthesizing All Dexterous Grasp Types in a Grasp Taxonomy. RSS, 2025.
  3. Minghua Liu, Mikaela Angelina Uy, Donglai Xiang, Hao Su, Sanja Fidler, Nicholas Sharp, and Jun Gao. PartField: Learning 3D Feature Fields for Part Segmentation and Beyond. ICCV, 2025.
  4. Guo-Hao Xu, Yi-Lin Wei, Dian Zheng, Xiao-Ming Wu, and Wei-Shi Zheng. DGTR: Dexterous Grasp Transformer. CVPR, 2024.
  5. Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning Structured Output Representation using Deep Conditional Generative Models. NeurIPS, 2015.
  6. Hanwen Jiang, Shaowei Liu, Jiashun Wang, and Xiaolong Wang. Hand-Object Contact Consistency Reasoning for Human Grasps Generation. ICCV, 2021.
  7. Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, and Song-Chun Zhu. Diffusion-based Generation, Optimization, and Planning in 3D Scenes. CVPR, 2023.

BibTeX

@article{lee2026dexter,
    title={DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning},
    author={Lee, Junha and Park, Eunha and Cho, Minsu},
    journal={arXiv preprint arXiv:2601.16046},
    year={2026}
}