Learning to Fuse: A Deep Learning Approach to Visual-Inertial Camera Pose Estimation
Jason Raphael Rambach, Aditya Tewari, Alain Pagani, Didier Stricker
IEEE International Symposium on Mixed and Augmented Reality (ISMAR-2016), September 19-23, Merida, Mexico

Abstract:
Camera pose estimation is the cornerstone of Augmented Reality applications. Pose tracking based exclusively on camera images has been shown to be sensitive to motion blur, occlusions, and illumination changes. Consequently, considerable work has been conducted in recent years on visual-inertial pose tracking, which uses acceleration and angular velocity measurements from inertial sensors to improve the visual tracking. Most proposed systems approach the sensor fusion problem with statistical filtering techniques, which require complex system modelling and calibration to perform adequately. In this work we present a novel approach to sensor fusion, using a deep learning method to learn the relation between camera poses and inertial sensor measurements. A long short-term memory (LSTM) model is trained to provide an estimate of the current pose based on previous poses and inertial measurements. This estimate is then combined with the output of a visual tracking system using a linear Kalman filter to provide a robust final pose estimate. Our experimental results confirm the applicability of the proposed sensor fusion system and the tracking performance improvement it provides.
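To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the two components named in the abstract: an LSTM that predicts the current pose from a window of inertial measurements and the previous pose, and a linear Kalman-style update that blends this prediction with a pose from a visual tracker. All class and function names, layer sizes, the 7-dimensional pose parameterization (position plus orientation quaternion), and the constant noise covariances are illustrative assumptions.

import torch
import torch.nn as nn

class InertialPoseLSTM(nn.Module):
    """Hypothetical LSTM pose predictor: IMU window + previous pose -> current pose."""
    def __init__(self, imu_dim=6, pose_dim=7, hidden_dim=64):
        super().__init__()
        # Each time step: 3-axis acceleration + 3-axis angular velocity,
        # concatenated with the previous pose (3D position + quaternion).
        self.lstm = nn.LSTM(imu_dim + pose_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, pose_dim)

    def forward(self, imu_seq, prev_pose):
        # imu_seq: (batch, T, 6), prev_pose: (batch, 7)
        prev = prev_pose.unsqueeze(1).expand(-1, imu_seq.size(1), -1)
        features, _ = self.lstm(torch.cat([imu_seq, prev], dim=-1))
        return self.head(features[:, -1])  # pose estimate at the last IMU sample

def kalman_fuse(x_pred, P_pred, z_visual, R_visual):
    """Linear Kalman update: blend the LSTM prediction with the visual pose estimate."""
    K = P_pred @ torch.linalg.inv(P_pred + R_visual)       # Kalman gain
    x_fused = x_pred + K @ (z_visual - x_pred)              # corrected pose
    P_fused = (torch.eye(P_pred.size(0)) - K) @ P_pred      # corrected covariance
    return x_fused, P_fused

if __name__ == "__main__":
    model = InertialPoseLSTM()
    imu_window = torch.randn(1, 10, 6)   # e.g. 10 IMU samples between camera frames
    prev_pose = torch.randn(1, 7)
    pose_pred = model(imu_window, prev_pose).squeeze(0)

    # Toy uncertainty settings; a real system would estimate these covariances.
    P = 0.05 * torch.eye(7)
    R = 0.02 * torch.eye(7)
    visual_pose = torch.randn(7)
    fused_pose, _ = kalman_fuse(pose_pred, P, visual_pose, R)
    print(fused_pose.shape)

In this sketch the LSTM plays the role of the inertial prediction step and the visual tracker's pose acts as the measurement in the Kalman update; how the paper itself weights the two sources is described in its main text, not here.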
Keywords:
Sensor fusion, tracking, deep learning