Researchers from the Shibaura Institute of Technology in Japan and FPT University in Vietnam have developed a deep learning framework aimed at improving the accuracy and robustness of hand-held object pose estimation for robotics and augmented reality applications. The team, led by Associate Professor Phan Xuan Tan, introduces a vote-based model that integrates 2D and 3D data from RGB-D images to address common challenges in pose estimation, including occlusions caused by hand-object interactions and the difficulty of fusing multimodal data.
Conventional pose estimation models often lose accuracy when the object is partially obscured by the hand or when non-rigid transformations alter the object's perceived shape. Moreover, many existing methods rely on separate processing pipelines for the RGB and depth modalities, which can lead to misaligned features and unstable performance. The new model addresses these issues through a unified vote-based fusion mechanism and a self-attention module tailored to hand-object interactions.
The framework includes dedicated backbones for extracting features from 2D images and 3D point clouds, followed by voting modules that identify keypoints, the reference markers used to infer the position and orientation of both the hand and the object. These keypoints are processed by a vote-based fusion module that uses radius-based neighborhood projection and channel attention to preserve local detail and align features across modalities. A final hand-aware pose estimation module applies self-attention to better interpret hand-object interactions, accounting for variable grips and deformations, as sketched below.
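To make the pipeline concrete, the following is a minimal PyTorch sketch of two of the components described above: channel-attention fusion of per-point 2D and 3D features, and a self-attention block over hand and object keypoint features. All class names, tensor shapes, and layer sizes are illustrative assumptions rather than the authors' implementation; the sketch omits the backbones, the voting step, and the radius-based neighborhood projection, and it assumes the 2D features have already been projected onto the 3D points.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention, used here as a
    stand-in for the paper's channel attention over fused features."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_points, channels)
        weights = self.mlp(x.mean(dim=1))   # pool over points -> (batch, channels)
        return x * weights.unsqueeze(1)     # reweight each channel per sample

class VoteFusion(nn.Module):
    """Fuses per-point 2D (image) and 3D (point cloud) features before
    keypoint voting. Hypothetical layout: 2D features are assumed to be
    pre-aligned with the 3D points."""
    def __init__(self, dim_2d: int, dim_3d: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_2d + dim_3d, dim_out)
        self.attn = ChannelAttention(dim_out)

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
        fused = self.proj(torch.cat([feat_2d, feat_3d], dim=-1))
        return self.attn(fused)

class HandAwareAttention(nn.Module):
    """Self-attention over concatenated hand and object keypoint features,
    letting object keypoints attend to hand keypoints (and vice versa)
    to model grip-dependent occlusion."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hand_kp: torch.Tensor, obj_kp: torch.Tensor):
        tokens = torch.cat([hand_kp, obj_kp], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + out)    # residual + normalization
        n_hand = hand_kp.shape[1]
        return tokens[:, :n_hand], tokens[:, n_hand:]

# Toy shapes: 1 sample, 1024 points, 21 hand / 8 object keypoints (illustrative).
feat_2d = torch.randn(1, 1024, 64)
feat_3d = torch.randn(1, 1024, 128)
fused = VoteFusion(64, 128, 128)(feat_2d, feat_3d)          # (1, 1024, 128)
hand_kp = torch.randn(1, 21, 128)
obj_kp = torch.randn(1, 8, 128)
hand_out, obj_out = HandAwareAttention(128)(hand_kp, obj_kp)
```

The design choice worth noting is that attending over hand and object keypoints jointly lets the network use the visible hand configuration as a cue for the occluded parts of the object, which is the intuition behind the paper's hand-aware module.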
In experimental evaluations on three public datasets, the model achieved an average precision of 76.8%, with accuracy improvements of up to 15% and performance gains of up to 13.9% over existing methods. Inference times ranged from 40 milliseconds without refinement to 200 milliseconds with refinement, indicating potential for real-time applications.
The study will appear in the May 2025 issue of the Alexandria Engineering Journal (Volume 120). According to the researchers, the model could facilitate developments in robotic manipulation, human-assistive technologies, and immersive AR/VR systems by enabling more reliable and efficient hand-object pose estimation.
Photo credit: Dan Ruscoe