DAGDiff: Guiding Dual-Arm Grasp Diffusion to Stable and Collision-Free Grasps

Video Explanation

Abstract

Reliable dual-arm grasping is essential for manipulating large and complex objects but remains a challenging problem due to stability, collision, and generalization requirements. Prior methods typically decompose the task into two independent grasp proposals, relying on region priors or heuristics that limit generalization and provide no principled guarantee of stability. We propose DAGDiff, an end-to-end framework that denoises grasp pairs directly in the \(SE(3) \times SE(3)\) space. Our key insight is that stability and collision constraints can be enforced more effectively by guiding the diffusion process with classifier signals, rather than relying on explicit region detection or object priors. To this end, DAGDiff integrates geometry-, stability-, and collision-aware guidance terms that steer the generative process toward grasps that are physically valid and force-closure compliant. We comprehensively evaluate DAGDiff through analytical force-closure checks, collision analysis, and large-scale physics-based simulations, showing consistent improvements over previous work on these metrics. Finally, we demonstrate that our framework generates dual-arm grasps directly from real-world point clouds of previously unseen objects, which are executed on a heterogeneous dual-arm setup where two manipulators reliably grasp and lift them.

Model Architecture

Overview of the proposed method: (a) Given an object point cloud \(P\), our network encodes geometric features into dense feature maps. Next, randomly initialized dual-arm grasps \(H\) are used to transform a fixed query cloud into query points, followed by feature sampling through bilinear interpolation. Conditioned on the noise step \(t\), these features are passed through \(F_{\theta}\), which predicts the SDF of the query points and a feature vector. This vector is then used by three output heads that predict energy \((E_\alpha)\), force-closure probability \((C_{\beta}^{\text{fc}})\), and collision probability \((C_{\gamma}^{\text{col}})\), jointly guiding the diffusion process. (b) At inference, denoising proceeds from random initialization \((t = 250)\) to refined grasps \((t = 0)\). The energy head drives the generative dynamics, while the force-closure and collision heads bias the generation until stable, collision-free dual-arm grasps emerge.
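To make the head structure concrete, below is a minimal PyTorch sketch of the three output heads operating on the shared feature vector produced by \(F_{\theta}\). The module name, layer sizes, and MLP structure are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of the three guidance heads described above.
# GuidanceHeads, feat_dim, and the MLP sizes are illustrative assumptions.
import torch
import torch.nn as nn

class GuidanceHeads(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Energy head E_alpha: scalar energy that drives the generative dynamics.
        self.energy = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Force-closure head C_beta^fc: probability that the grasp pair is force-closure.
        self.force_closure = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Collision head C_gamma^col: probability that the grasp pair is in collision.
        self.collision = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feat: torch.Tensor):
        e = self.energy(feat).squeeze(-1)                            # E_alpha
        p_fc = torch.sigmoid(self.force_closure(feat)).squeeze(-1)   # C_beta^fc
        p_col = torch.sigmoid(self.collision(feat)).squeeze(-1)      # C_gamma^col
        return e, p_fc, p_col
```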

\(SE(3) \times SE(3) \longleftrightarrow \mathbb{R}^{12}\)

Denoising in the dual-arm grasp space: Dual-arm grasp poses are represented as pairs of rigid-body transformations in \(SE(3) \times SE(3)\), which are mapped into a \(12\text{D}\) Euclidean space for diffusion and back. Each \(SE(3)\) element is first projected into its \(6\text{D}\) Lie algebra representation via the logarithmic map \((\operatorname{Logmap}_{2})\), and the two vectors are concatenated to form a vector in \(\mathbb{R}^{12}\).

The diffusion process is then carried out in this Euclidean space. To obtain valid grasp poses, the exponential map \((\operatorname{Expmap}_{2})\) maps vectors in \(\mathbb{R}^{12}\) back to \(SE(3) \times SE(3)\). This bidirectional mapping enables diffusion while ensuring grasps remain consistent with rigid-body motion.
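This mapping can be sketched with standard matrix log/exp on homogeneous transforms. The NumPy/SciPy snippet below is a minimal illustration under the assumption that \(\operatorname{Logmap}_{2}\) and \(\operatorname{Expmap}_{2}\) act per arm and concatenate; it is not the paper's exact implementation.

```python
# A minimal sketch of the SE(3) x SE(3) <-> R^12 mapping via matrix log/exp.
import numpy as np
from scipy.linalg import expm, logm

def se3_log(T: np.ndarray) -> np.ndarray:
    """Map a 4x4 transform in SE(3) to its 6D Lie-algebra coordinates (omega, v)."""
    X = np.real(logm(T))                             # 4x4 element of se(3)
    omega = np.array([X[2, 1], X[0, 2], X[1, 0]])    # rotation part from the skew block
    v = X[:3, 3]                                     # translation part
    return np.concatenate([omega, v])

def se3_exp(xi: np.ndarray) -> np.ndarray:
    """Map 6D coordinates (omega, v) back to a 4x4 transform in SE(3)."""
    omega, v = xi[:3], xi[3:]
    X = np.zeros((4, 4))
    X[:3, :3] = np.array([[0.0, -omega[2], omega[1]],
                          [omega[2], 0.0, -omega[0]],
                          [-omega[1], omega[0], 0.0]])
    X[:3, 3] = v
    return expm(X)

def logmap2(T1: np.ndarray, T2: np.ndarray) -> np.ndarray:
    """SE(3) x SE(3) -> R^12: concatenate the per-arm log maps."""
    return np.concatenate([se3_log(T1), se3_log(T2)])

def expmap2(x: np.ndarray):
    """R^12 -> SE(3) x SE(3): split and exponentiate per arm."""
    return se3_exp(x[:6]), se3_exp(x[6:])
```

A round trip such as expmap2(logmap2(T1, T2)) recovers the original pair up to numerical precision, which is the consistency the bidirectional mapping relies on.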

Denoising using Classifier Guidance

Noisy grasp pairs → stable grasp pairs
Overview of the denoising process: The clip above shows the joint denoising process step by step. As time progresses, the energy \((E_\alpha)\) gradually decreases, indicating that the grasps move toward the object rather than floating in free space. At the same time, the force-closure probability \((C_{\beta}^{\text{fc}})\) steadily increases, showing how the grasps become more stable and reliable over time. Finally, in the later stages of denoising, colliding grasps are refined for a small number of iterations using the collision classifier \((C_{\gamma}^{\text{col}})\), yielding dual-arm grasps that are both force-closure stable and collision-free.
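For intuition, here is a hedged sketch of one guided denoising update on the \(\mathbb{R}^{12}\) grasp-pair vector. The model interface, the weights w_fc and w_col, and the Langevin-style step are assumptions standing in for the learned \(E_\alpha\), \(C_{\beta}^{\text{fc}}\), and \(C_{\gamma}^{\text{col}}\) heads; it is not the paper's exact sampler.

```python
# A hedged sketch of one classifier-guided denoising step in R^12.
# model(x, t) is an assumed interface returning (energy, p_fc, p_col) per sample.
import torch

def guided_step(x, t, model, w_fc=1.0, w_col=1.0, step_size=1e-3, noise_scale=1.0):
    """One annealed-Langevin-style update on a batch of 12D grasp-pair vectors x."""
    x = x.detach().requires_grad_(True)
    e, p_fc, p_col = model(x, t)
    # Descend the energy, ascend the log-probability of force closure,
    # and ascend the log-probability of being collision-free.
    objective = e - w_fc * torch.log(p_fc + 1e-8) - w_col * torch.log(1 - p_col + 1e-8)
    grad = torch.autograd.grad(objective.sum(), x)[0]
    with torch.no_grad():
        x = x - step_size * grad + noise_scale * (2 * step_size) ** 0.5 * torch.randn_like(x)
    return x
```

Consistent with the clip, the collision weight w_col could be enabled only for the final few denoising steps, so collisions are resolved once the grasps are already near the object.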

Real-World Results

Unseen object categories

(a) Bucket

(b) Tray

(c) Drone

(d) Frypan

(e) Saucepan

Quantitative Results

Comparison on our evaluation set (FCE and GSR: higher is better; GCR: lower is better).

1. Comparison with Baselines

Method               FCE (%)   GSR (%)   GCR (%)
DAGDiff (ours)         60.14     72.50     15.10
CGDF                   35.14     56.25     30.55
VCGS                   16.85     23.36     74.73
UniDiffGrasp           10.10     31.68     59.90
RoboBrainGrasp-KP       9.80     27.85     66.30
RoboBrainGrasp-BB       7.12     27.81     70.26

2. Zero-Shot on Dual-Afford Objects

Method          FCE (%)   GSR (%)   GCR (%)
Ours-DA           56.45     68.80     18.59
Dual-Afford††         –     54.33         –

Evaluated on Dual-Afford objects in a zero-shot setting.
†† Values reported directly from the Dual-Afford paper.

3. Real-World Dual-Arm Grasp Results

Object     Tray   Bucket   Saucepan   Frypan   Drone
Success    6/10   8/10     7/10       6/10     5/10
Quantitative Results: DAGDiff performs strongly across all evaluation settings. It outperforms prior methods in Force-Closure Evaluation (FCE) and Grasp Success Rate (GSR) while maintaining the lowest Grasp Collision Rate (GCR), indicating more physically valid and robust dual-arm grasps. In zero-shot transfer to Dual-Afford objects, DAGDiff continues to show strong generalization without task-specific retraining. Finally, real-world experiments on unseen objects such as trays, buckets, and pans demonstrate consistent success, confirming that DAGDiff's classifier-guided diffusion produces grasps that are stable, collision-free, and transferable beyond simulation. Real-world failures stem mostly from noisy point-cloud estimation, so the generated grasps are not always perfect.

BibTeX

Will be updated