Abstract
Reliable dual-arm grasping is essential for manipulating large and complex objects but remains a
challenging problem due to stability, collision, and generalization requirements. Prior methods
typically decompose the task into two independent grasp proposals, relying on region priors or
heuristics that limit generalization and provide no principled guarantee of stability. We
propose DAGDiff, an end-to-end diffusion framework that directly denoises grasp pairs in
\(SE(3) \times SE(3)\). Our key insight is that stability and collision avoidance can be enforced more
effectively by guiding the diffusion process with classifier signals, rather than relying on
explicit region detection or object priors. To this end, DAGDiff integrates geometry-,
stability-, and collision-aware guidance terms that steer the generative process toward grasps
that are physically valid and force-closure compliant. We comprehensively evaluate DAGDiff
through analytical force-closure checks, collision analysis, and large-scale physics-based
simulations, showing consistent improvements over previous work on these metrics. Finally, we
demonstrate that our framework generates dual-arm grasps directly from real-world point clouds
of previously unseen objects; the grasps are executed on a heterogeneous dual-arm setup in which
two manipulators reliably grasp and lift the objects.
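
As a rough illustration of what guiding the diffusion process with classifier signals can look like, the standard classifier-guidance form adds weighted classifier gradients to the learned score. The notation here is assumed for illustration and is not taken from the paper: \(g_t\) is the noisy grasp pair at step \(t\), \(s_\theta\) is the learned score, \(p_k\) are hypothetical geometry, stability, and collision classifiers with desired outcome \(y_k = 1\), and \(\lambda_k\) are guidance weights.

\[
\tilde{s}(g_t, t) \;=\; s_\theta(g_t, t) \;+\; \sum_{k \in \{\mathrm{geo},\, \mathrm{stab},\, \mathrm{col}\}} \lambda_k \, \nabla_{g_t} \log p_k(y_k = 1 \mid g_t, t)
\]

Under this sketch, each denoising step follows \(\tilde{s}\) instead of \(s_\theta\), so samples are pushed toward grasp pairs that the classifiers rate as valid. For poses in \(SE(3) \times SE(3)\), the gradient would typically be taken with respect to a local (Lie-algebra) parameterization; the exact formulation used by DAGDiff may differ.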