This paper investigates vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that rely on custom architectural designs and task-specific pretraining, we find that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns the temporal and spatial information needed for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone already produces satisfactory results, and that iterative refinement further elevates performance to state-of-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation, with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally sets a new record on the online test benchmarks, with EPE values of 0.79 and 1.88 and an F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.
Our work transfers the encoder of video foundation models to multi-view geometry tasks. Any video foundation model whose encoder follows a transformer architecture can be used. To process video data, the transformer splits the spatio-temporal input into 3D patches, adds spatial and temporal positional encodings, and feeds the resulting visual tokens into the self-attention blocks. To adapt pretrained 3D ViTs for two-frame tasks, we first interpolate the 2D spatial positional encodings to match the desired input size in the fine-tuning stage, as sketched below. See Figure 1(a) for an illustration.
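To make this adaptation step concrete, the following is a minimal sketch of how a learned 2D spatial positional embedding can be resized for a new fine-tuning resolution. The function name, tensor shapes, grid sizes, and the choice of bicubic interpolation are illustrative assumptions, not details taken from our implementation.

```python
import torch
import torch.nn.functional as F

def resize_spatial_pos_embed(pos_embed: torch.Tensor,
                             old_grid: tuple[int, int],
                             new_grid: tuple[int, int]) -> torch.Tensor:
    """Interpolate a learned 2D spatial positional embedding to a new patch grid.

    pos_embed: (1, old_h * old_w, dim) spatial embedding from the pretrained 3D ViT.
    Returns:   (1, new_h * new_w, dim) embedding matching the fine-tuning input size.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = pos_embed.shape[-1]

    # (1, N, C) -> (1, C, old_h, old_w) so image-style interpolation can be applied.
    grid = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w),
                         mode="bicubic", align_corners=False)
    # Back to the token layout expected by the transformer.
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)

# Example: a hypothetical 14x14 pretraining grid resized to 28x56 patches.
pe = torch.randn(1, 14 * 14, 768)
pe_resized = resize_spatial_pos_embed(pe, (14, 14), (28, 56))
print(pe_resized.shape)  # torch.Size([1, 1568, 768])
```

The temporal positional encodings are left untouched in this sketch, since two-frame inputs reuse the pretrained temporal positions directly.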
Then we incorporate an iterative refinement mechanism into the 3D ViT. Given an image pair \(I_1, I_2\), the residual geometric property \(\Delta g_t\) at each iteration \(t\) is predicted as follows:
\[
\Delta g_t = F_{\text{dec}}\big(F_{\text{enc}}\big(I_1,\ \mathcal{W}(I_2,\ g_{t-1})\big),\ g_{t-1}\big),
\]
where \(F_{\text{enc}}\) denotes a spatiotemporal ViT that takes the (warped) image pair as input and returns features corresponding to the source image. The source-image features, together with the current prediction, are fed to the decoder \(F_{\text{dec}}\) to predict the residual. The decoder is instantiated as a ConvGRU unit following RAFT. The warping operation \(\mathcal{W}\) takes an image as input and outputs another image according to the geometric property \(g_{t-1}\). For optical flow estimation and stereo matching, the warping is straightforward; for 3D depth estimation, we first convert the depth representation to pixel displacement using the camera parameters and convert the prediction back to depth. The predicted residual is then aggregated with the prediction of the previous step, i.e., \(g_t = g_{t-1} + \Delta g_t\). See Figure 1(b) for an illustration.
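For illustration, the sketch below instantiates the refinement loop for the optical-flow case under the update rule above. Here `enc`, `dec`, the hidden-state handling, and the backward-warping helper are hypothetical placeholders standing in for \(F_{\text{enc}}\), \(F_{\text{dec}}\), and \(\mathcal{W}\); they are not our actual interfaces.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `image` (B, C, H, W) with `flow` (B, 2, H, W) via grid_sample."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=image.device),
                            torch.arange(w, device=image.device), indexing="ij")
    coords = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # sampling positions
    # Normalize to [-1, 1] as expected by grid_sample (x first, then y).
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(image, torch.stack((grid_x, grid_y), dim=-1),
                         align_corners=True)

def iterative_refinement(enc, dec, img1, img2, init_flow, num_iters=8):
    """Refine flow with a placeholder encoder `enc` and ConvGRU-style decoder `dec`:
    g_t = g_{t-1} + F_dec(F_enc(I1, warp(I2, g_{t-1})), g_{t-1})."""
    flow = init_flow
    hidden = None
    for _ in range(num_iters):
        warped = warp_by_flow(img2, flow)          # align I2 to I1 with the current flow
        feats = enc(img1, warped)                  # features for the source image
        delta, hidden = dec(feats, flow, hidden)   # residual from the current estimate
        flow = flow + delta                        # aggregate with the previous prediction
    return flow
```

For stereo matching the same loop applies with a 1D horizontal displacement; for depth, the displacement is derived from depth and the camera parameters before warping and converted back afterwards, as described above.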
We conduct comprehensive qualitative and quantitative experiments on optical flow estimation, stereo matching, and 3D depth estimation.