HandMvNet: Real-Time 3D Hand Pose Estimation Using Multi-View Cross-Attention Fusion
In: Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. International Conference on Computer Vision Theory and Applications (VISAPP-2025), February 26-28, Porto, Portugal, Vol. 2, ISBN 978-989-758-728-3, SCITEPRESS, Portugal, 2/2025.
- Abstract:
- In this work, we present HandMvNet, one of the first real-time methods designed to estimate 3D hand motion and shape from multi-view camera images. Unlike previous monocular approaches, which suffer from scale-depth ambiguities, our method ensures consistent and accurate absolute hand poses and shapes. This is achieved through a multi-view attention-fusion mechanism that effectively integrates features from multiple viewpoints. In contrast to previous multi-view methods, our approach eliminates the need for camera parameters as input to learn 3D geometry. HandMvNet also achieves a substantial reduction in inference time while delivering competitive results compared to state-of-the-art methods, making it suitable for real-time applications. Evaluated on publicly available datasets, HandMvNet qualitatively and quantitatively outperforms previous methods under identical settings. Code is available at github.com/pyxploiter/handmvnet.
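- To illustrate the general idea of attention-based multi-view feature fusion without camera parameters, the following is a minimal PyTorch sketch. It is not the authors' implementation; the module name `CrossViewFusion`, the token layout, and all hyperparameters are illustrative assumptions. For the actual architecture, see the paper and the linked repository.

```python
# Minimal sketch: fusing per-view feature tokens with self/cross-attention,
# without using camera extrinsics or intrinsics. All names and shapes here
# are assumptions for illustration, not HandMvNet's real design.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    """Let feature tokens from all views attend to one another."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, V, N, D) -- batch, views, tokens per view, channels
        b, v, n, d = feats.shape
        tokens = feats.reshape(b, v * n, d)           # concatenate tokens from all views
        fused, _ = self.attn(tokens, tokens, tokens)  # each token attends across views
        fused = self.norm(tokens + fused)             # residual connection + layer norm
        return fused.reshape(b, v, n, d)


if __name__ == "__main__":
    x = torch.randn(2, 4, 49, 256)        # e.g. 4 camera views, 7x7 feature maps
    print(CrossViewFusion()(x).shape)     # torch.Size([2, 4, 49, 256])
```

Because the fusion operates purely on learned feature tokens, no camera calibration enters the computation; the attention weights implicitly learn which views are informative for each token.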