Definition

Grasp planning is the problem of determining a gripper pose (position and orientation) and finger configuration that will result in a stable grasp on a target object. Given sensory input — typically a point cloud or depth image from an RGB-D camera — a grasp planner outputs one or more candidate 6-DOF grasp poses ranked by predicted quality. The robot then executes the highest-ranked feasible grasp, accounting for kinematic reachability and collision avoidance.
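This input/output contract can be sketched in a few lines. The `GraspCandidate` fields and the `is_feasible` callback below are illustrative placeholders, not any particular library's API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspCandidate:
    pose: np.ndarray   # 4x4 homogeneous transform of the gripper frame
    width: float       # jaw opening in meters (parallel-jaw gripper)
    score: float       # predicted grasp quality in [0, 1]

def select_grasp(candidates, is_feasible):
    """Return the highest-scoring candidate that passes a feasibility
    check (kinematic reachability + collision avoidance), or None."""
    for g in sorted(candidates, key=lambda g: g.score, reverse=True):
        if is_feasible(g):
            return g
    return None
```

In a real system `is_feasible` would call an inverse-kinematics solver and a collision checker; here it is just a predicate over candidates.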

Grasping is the most fundamental manipulation skill: nearly every pick-and-place, assembly, or logistics task begins with a grasp. Despite decades of research, robust grasping of arbitrary novel objects in cluttered scenes remains a significant challenge. Objects vary in shape, size, weight, surface texture, and fragility. A grasp that works perfectly for a rigid box will fail on a soft banana or a thin pen. The space of possible gripper poses is a vast continuum, and the physical stability of a grasp depends on friction, contact geometry, and applied force — all of which are difficult to measure or predict accurately.

Modern grasp planners have achieved 90–95% success rates on structured bin-picking benchmarks, but performance degrades with transparent objects, deformable items, and extreme clutter. The integration of grasp planning with learned manipulation policies represents the current frontier, where the grasp planner selects where to grasp and a policy handles the approach, contact, and post-grasp manipulation.

Analytical vs Data-Driven Approaches

Analytical (model-based) grasp planning requires a 3D model (CAD or reconstructed mesh) of the target object. The planner evaluates candidate grasps using physical criteria: force closure (can the fingers resist arbitrary external wrenches?) and form closure (is the object geometrically constrained?). These methods provide provable guarantees when the object model and friction parameters are accurate, but they fail when the object is unknown, partially occluded, or deformable.

Data-driven grasp planning uses neural networks trained on large datasets of grasps (simulated or real) to predict grasp quality directly from sensory input. The network takes a point cloud or depth image and outputs a set of grasp poses with confidence scores. No object model is required, and the network generalizes to novel objects through learned geometric features. This approach dominates modern research and industry applications.

Hybrid methods combine both: a neural network generates coarse grasp candidates, and an analytical filter refines them using physics-based quality metrics. This provides the generalization of data-driven methods with the reliability guarantees of analytical approaches.
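A minimal sketch of such a hybrid filter follows. The candidate dictionaries and the `analytic_quality` callable (e.g., an epsilon-quality evaluator) are assumed interfaces for illustration; the weighting scheme is one simple choice among many:

```python
def hybrid_rank(candidates, analytic_quality, min_quality=1e-3, w=0.5):
    """Filter neural-network grasp proposals with a physics-based metric,
    then re-rank by a weighted blend of learned and analytic scores."""
    scored = []
    for g in candidates:
        q = analytic_quality(g)      # physics-based quality, e.g. epsilon
        if q < min_quality:          # reject physically weak grasps outright
            continue
        scored.append((w * g["score"] + (1.0 - w) * q, g))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [g for _, g in scored]
```

The filter step supplies the reliability guarantee (no candidate below the analytic threshold survives), while the blended score preserves the network's learned ranking among the survivors.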

Key Algorithms

  • Dex-Net / GQ-CNN (Mahler et al., 2017) — Grasp Quality CNN trained on millions of simulated grasps. Takes a depth image and outputs a grasp quality score for each candidate grasp. The Dex-Net 2.0 dataset contains 6.7 million point clouds with analytically computed grasp quality labels. Designed for parallel-jaw grippers.
  • GraspNet-1Billion (Fang et al., 2020) — A large-scale benchmark and model for 6-DOF grasp detection from point clouds. Processes a full scene point cloud and outputs hundreds of grasp proposals in a single forward pass. Trained on 1 billion grasp annotations across 88 objects.
  • Contact-GraspNet (Sundermeyer et al., 2021) — Predicts grasps by first estimating contact points on the object surface, then computing gripper poses from those contacts. Handles cluttered scenes well because it reasons about local surface geometry rather than global object shape.
  • AnyGrasp (Fang et al., 2023) — Successor to GraspNet with improved speed (50+ FPS) and accuracy. Supports both parallel-jaw and suction grippers. Available as a commercial SDK for industrial deployment.

Input Modalities

Point cloud: The standard input for modern grasp planners. Generated from one or more depth cameras (Intel RealSense, Zivid, Photoneo). Provides 3D geometric information but no color. Most algorithms operate on the raw point cloud or a voxelized representation.
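Voxelization is often just a downsampling step before inference. A minimal numpy-only version (keeping one centroid per occupied voxel, with an assumed 5 mm default voxel size) might look like:

```python
import numpy as np

def voxel_downsample(points, voxel=0.005):
    """Collapse an (N, 3) point cloud to one centroid per occupied voxel,
    a common preprocessing step before grasp inference."""
    keys = np.floor(points / voxel).astype(np.int64)
    # group points by voxel index and average each group
    _, inv, counts = np.unique(keys, axis=0, return_inverse=True,
                               return_counts=True)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inv, points)
    return sums / counts[:, None]
```

Production pipelines typically use Open3D or similar libraries for this, but the operation itself is this simple.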

RGB-D (depth + color): Adds color information that helps distinguish objects with similar geometry (e.g., red vs. blue cups). Some algorithms use RGB for segmentation and depth for grasp pose computation.

CAD model (known objects): When the object is known, its CAD model can be registered to the observed point cloud via pose estimation. This enables analytically optimal grasp computation. Used in manufacturing where part geometries are known in advance.

Tactile (post-contact): Force-torque and tactile sensors provide feedback after initial contact. This enables grasp adjustment: if the object is slipping, the gripper can increase force or reposition. Closed-loop grasping with tactile feedback achieves higher success rates than open-loop approaches.
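One common adjustment strategy is to monitor the tangential-to-normal force ratio as a proxy for incipient slip and squeeze harder when it nears the friction limit. The sketch below assumes hypothetical `read_shear` / `set_force` sensor and gripper callbacks; the gains and thresholds are placeholders:

```python
def hold_with_slip_control(read_shear, set_force, f_init=5.0, f_max=40.0,
                           slip_ratio=0.9, gain=1.3, steps=50):
    """Increase grip force whenever the tangential/normal force ratio
    approaches the friction limit (a simple incipient-slip heuristic)."""
    f = f_init
    for _ in range(steps):
        set_force(f)
        tangential, normal = read_shear()
        if normal > 0 and tangential / normal > slip_ratio:
            f = min(f * gain, f_max)   # slip detected: squeeze harder
    return f
```

Real controllers run this loop at hundreds of hertz and may also reposition the grasp rather than only increasing force.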

Grasp Quality Metrics

Grasp success rate: The percentage of attempted grasps that successfully lift and hold the object. The primary evaluation metric, measured over hundreds of trials on diverse object sets. State-of-the-art systems achieve 90–95% on structured benchmarks, 70–85% on novel objects in clutter.

Epsilon quality (force closure margin): The magnitude of the smallest external wrench that can destabilize the grasp, equal to the radius of the largest origin-centered ball that fits inside the convex hull of the contact wrenches (the Ferrari-Canny metric). Higher epsilon means the grasp is more robust to disturbances. Computed analytically from contact points and friction coefficients.
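A compact version of this computation (the Ferrari-Canny metric) discretizes each friction cone into a few edge forces, stacks the resulting 6D wrenches, and measures the distance from the origin to the nearest facet of their convex hull. This sketch assumes point contacts with Coulomb friction and inward-pointing contact normals:

```python
import numpy as np
from scipy.spatial import ConvexHull

def epsilon_quality(positions, normals, mu=0.5, n_edges=8):
    """Ferrari-Canny epsilon metric: radius of the largest origin-centered
    ball inside the convex hull of the contact wrenches. Positive epsilon
    implies force closure; larger means more robust."""
    wrenches = []
    for p, n in zip(positions, normals):
        n = n / np.linalg.norm(n)
        # build a tangent basis for the friction cone at this contact
        a = np.array([1.0, 0.0, 0.0])
        if abs(n @ a) > 0.9:
            a = np.array([0.0, 1.0, 0.0])
        t1 = np.cross(n, a); t1 /= np.linalg.norm(t1)
        t2 = np.cross(n, t1)
        for k in range(n_edges):          # discretize the cone into edges
            ang = 2 * np.pi * k / n_edges
            f = n + mu * (np.cos(ang) * t1 + np.sin(ang) * t2)
            wrenches.append(np.hstack([f, np.cross(p, f)]))
    hull = ConvexHull(np.array(wrenches))
    # facet equations satisfy normal . x + offset <= 0 inside the hull,
    # so -offset is the signed distance from the origin to each facet
    return float(np.min(-hull.equations[:, -1]))
```

Note that two point contacts with friction cannot resist torque about the line joining them, so their wrench set is degenerate in 6D; the hull computation needs at least three non-collinear contacts (or a soft-contact model).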

Clearance: The minimum distance between the gripper and nearby obstacles during the approach. Low clearance increases collision risk. Good planners penalize grasps that require threading the gripper through narrow gaps.
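With the gripper body sampled as points at the grasp pose, clearance reduces to a nearest-neighbor query against the obstacle cloud. A minimal scipy-based sketch:

```python
import numpy as np
from scipy.spatial import cKDTree

def grasp_clearance(gripper_points, obstacle_cloud):
    """Minimum distance from sampled gripper-body points (expressed in the
    scene frame at the candidate grasp pose) to the obstacle point cloud."""
    tree = cKDTree(obstacle_cloud)
    dists, _ = tree.query(gripper_points)
    return float(dists.min())
```

A planner would typically compute this for every candidate and penalize or reject grasps below a clearance threshold; sweeping the gripper points along the approach direction extends the same check to the whole approach motion.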

Computation time: Time from sensor input to grasp output. Real-time systems (<100 ms) are needed for dynamic environments. Batch processing is acceptable for static bin-picking.

Practical Requirements

Sensors: A depth camera or structured-light scanner is essential. Intel RealSense D435/D455 ($200–$400) is the research standard. Industrial applications use Zivid, Photoneo, or Ensenso for higher accuracy and noise resistance. Camera placement matters: top-down for bin-picking, eye-in-hand for complex geometries.

Compute: Data-driven grasp planners run on GPU. GraspNet and Contact-GraspNet require an RTX 2070 or better and produce results in 100–500 ms. AnyGrasp achieves 50+ FPS on modern GPUs. Analytical planners run on CPU but are slower for complex scenes.

Integration: Grasp planners must be integrated with a motion planner (MoveIt, OMPL) that computes collision-free approach trajectories. The complete pipeline is: perceive (depth camera) → plan grasp (GraspNet) → plan motion (MoveIt) → execute.
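The pipeline's control flow, including the common fallback of retrying the next-best grasp when motion planning fails, can be sketched with stage callbacks. The four callables stand in for the camera driver, the grasp planner, the motion planner, and the robot controller; none of the names are a real API:

```python
def pick_pipeline(capture_depth, plan_grasps, plan_motion, execute,
                  min_score=0.5):
    """perceive -> plan grasp -> plan motion -> execute, falling back to
    the next-best grasp whenever motion planning fails."""
    cloud = capture_depth()                 # depth camera observation
    for grasp in plan_grasps(cloud):        # candidates sorted by score
        if grasp["score"] < min_score:
            break                           # remaining grasps too weak
        traj = plan_motion(grasp)           # e.g. a MoveIt planning call
        if traj is not None:
            return execute(traj)
    return None                             # no feasible grasp found
```

This retry loop matters in practice: the top-scoring grasp is often kinematically unreachable or collides during the approach, and falling through to lower-ranked candidates recovers many of those failures.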

6-DOF Grasp Pose Estimation

Modern grasp planners estimate full 6-DOF grasp poses: position (x, y, z) and orientation (roll, pitch, yaw) of the gripper relative to the object or scene. This is significantly more challenging than planar (top-down) grasping because it must reason about approach direction, wrist rotation, and finger alignment in 3D.

The 6-DOF grasp pose is typically parameterized as a 4x4 homogeneous transformation matrix specifying the gripper frame at the moment of contact. For parallel-jaw grippers, the pose defines the center point between the fingers, the approach direction (along which the gripper moves to contact), and the closing axis (along which the fingers move). For suction grippers, only the contact point and approach normal are needed (5-DOF, since rotation about the suction axis does not matter).
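Constructing the transform from this parameterization is a few lines of linear algebra. The axis convention below (approach along gripper z, closing axis along x) is one common choice for illustration; conventions differ between grasp-planning libraries:

```python
import numpy as np

def grasp_pose(center, approach, closing):
    """4x4 gripper pose from the point between the fingertips, the
    approach direction (mapped to gripper z), and the finger closing
    axis (mapped to gripper x, re-orthogonalized against z)."""
    z = approach / np.linalg.norm(approach)
    x = closing - (closing @ z) * z       # project out any z component
    x /= np.linalg.norm(x)
    y = np.cross(z, x)                    # complete the right-handed frame
    T = np.eye(4)
    T[:3, 0], T[:3, 1], T[:3, 2] = x, y, z
    T[:3, 3] = center
    return T
```

The re-orthogonalization step matters because predicted approach and closing directions from a network are rarely exactly perpendicular.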

Key technical challenges in 6-DOF estimation: (1) the search space is enormous (SE(3) for each candidate grasp); (2) self-collision between the gripper and the scene must be checked for each candidate; (3) the approach trajectory must be kinematically feasible for the robot arm, not just geometrically valid at the grasp point. Integrated grasp-and-motion planning addresses the last challenge by jointly optimizing the grasp pose and the approach trajectory.

Integration with Learned Policies

Grasp planning and learned manipulation policies are increasingly combined rather than used in isolation:

Grasp + place pipeline: A grasp planner selects the initial grasp pose, and a learned policy handles the post-grasp manipulation (placing, inserting, stacking). This division of labor leverages the grasp planner's geometric reasoning and the policy's ability to handle complex post-grasp dynamics.

End-to-end learned grasping: Policies like ACT and Diffusion Policy learn the entire grasp sequence (approach, grasp, lift) from demonstrations without an explicit grasp planner. This works well for specific tasks but requires more training data and does not generalize to novel objects as well as dedicated grasp planners.

VLA-guided grasping: Vision-Language-Action models can select grasp strategies based on language instructions ("pick up the mug by the handle") and learned physical intuition, combining semantic understanding with geometric reasoning.

At SVRC, we benchmark different grasping approaches (analytical, learned, hybrid) on standardized object sets at our facilities, helping teams select the right approach for their specific application.

See Also

Key Papers

  • Mahler, J. et al. (2017). "Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics." RSS 2017. Pioneered large-scale simulation for grasp planning, training on 6.7 million synthetic grasps.
  • Fang, H. et al. (2020). "GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping." CVPR 2020. Established the standard benchmark for 6-DOF grasp detection from point clouds.
  • Sundermeyer, M. et al. (2021). "Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes." ICRA 2021. Introduced contact-point-based grasp prediction that excels in cluttered multi-object scenes.

Related Terms

  • End-Effector — The gripper whose geometry defines the grasp planning problem
  • Force-Torque Sensing — Contact feedback for closed-loop grasp adjustment
  • Policy Learning — Learned policies that incorporate grasp planning as a component
  • Sim-to-Real Transfer — Many grasp planners are trained in simulation and deployed on real hardware
  • Impedance Control — Compliant control during the grasp approach and object lifting

Deploy Grasp Planning at SVRC

Silicon Valley Robotics Center provides depth camera integration, grasp planning pipeline setup (GraspNet, Contact-GraspNet, AnyGrasp), and motion planning configuration for pick-and-place applications. Our team can benchmark different grasp planners on your specific object set and recommend the best approach for your deployment.
