Dynamic Cost Volumes with Scalable Transformer Architecture for Optical Flow

Vemburaj Yadav, Alain Pagani, Didier Stricker
In: Irish Pattern Recognition and Classification Society. Irish Machine Vision and Image Processing Conference (IMVIP-2023), August 30 - September 1, Galway, Ireland, Zenodo, 2023.

Abstract:
We introduce DCV-Net, a scalable transformer-based architecture for optical flow with dynamic cost volumes. Recently, FlowFormer [Huang et al., 2022], which applies transformers on the full 4D cost vol- umes instead of the visual feature maps, has shown significant improvements in the flow estimation accuracy. The major drawback of FlowFormer is its scalability for high-resolution input images, since the the com- plexity of the attention mechanism on the 4D cost volumes scales to O(N^4 ) , with N being the number of visual feature tokens. We propose a novel architecture where we obtain the FlowFormer type enrichment of matching cost representations, but using light-weight attention on the visual feature maps with quadratic ( O(N^2 ) ) complexity. Firstly, we generate sequential updates to the visual feature representations and, con- sequently, the cost volumes using lightweight attention layers. Secondly, we interleave this sequence of cost volumes with iterations of flow refinement, thereby modeling the update operator in our refinement stage to handle dynamic cost volumes. Our architecture, with two orders of computational complexity lower than that of FlowFormer, demonstrates strong cross-domain generalization on the Sintel and KITTI datasets. We outperform FlowFormer on the KITTI dataset and achieve highly competitive flow estimation accuracies on the Sintel dataset.