Awesome Transformer-based SLAM

This repository contains a curated list of resources on SLAM-related tasks that employ Transformers, including optical flow, view/feature correspondence, stereo matching, depth estimation, 3D reconstruction, and pose estimation.

If you notice any missing papers, feel free to create a pull request or open an issue.

Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, a simple star would be the best affirmation. 😊

Feel free to share this list with others!

Overview

Transformer-based SLAM

Full SLAM, including pose and depth

| Year | Venue | Paper Title | Repository | Note |
| --- | --- | --- | --- | --- |
| 2025 | arXiv | 3R-GS: Best Practice in Optimizing Camera Poses Along with 3DGS | - | website; MASt3R-SfM+3DGS |
| 2025 | CVPR | AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos | Github stars | website |
| 2025 | CVPR | SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos | Github stars | test |
| 2025 | CVPR | MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors | Github stars | Website; Test |
| 2022 | ECCV | Jperceiver: Joint perception network for depth, pose and layout estimation in driving scenes | Github stars | - |

Transformer-based Pose Tracking

or pose/state estimation

| Year | Venue | Paper Title | Repository | Note |
| --- | --- | --- | --- | --- |
| 2025 | arXiv | XIRVIO: Critic-guided Iterative Refinement for Visual-Inertial Odometry with Explainable Adaptive Weighting | - | - |
| 2025 | IEEE Access | Transformer-based model for monocular visual odometry: a video understanding approach | Github stars | - |
| 2025 | arXiv | Light3R-SfM: Towards Feed-forward Structure-from-Motion | - | - |
| 2024 | arXiv | MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion | Github stars | MASt3R SfM version |
| 2024 | CVPR | VGGSfM: Visual Geometry Grounded Deep Structure From Motion | Github stars | website |
| 2024 | MSc Thesis | End-to-End Learned Visual Odometry Based on Vision Transformer | - | - |
| 2024 | arXiv | Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry | Github stars | - |
| 2024 | IJRA | DDETR-SLAM: A Transformer-Based Approach to Pose Optimization in Dynamic Environments | - | - |
| 2023 | ITM Web of Conferences | ViT VO-A Visual Odometry technique Using CNN-Transformer Hybrid Architecture | - | - |
| 2023 | arXiv | TransFusionOdom: interpretable transformer-based LiDAR-inertial fusion odometry estimation | Github stars | - |
| 2023 | CVPR | Modality-invariant Visual Odometry for Embodied Vision | Github stars | Website |
| 2023 | MAV | ViTVO: Vision Transformer based Visual Odometry with Attention Supervision | - | - |
| 2023 | International Conference on Haptics and Virtual Reality | VIOFormer: Advancing Monocular Visual-Inertial Odometry Through Transformer-Based Fusion | - | - |
| 2022 | IEEE Intelligent Vehicles Symposium | Attention guided unsupervised learning of monocular visual-inertial odometry | - | - |
| 2022 | IEEE-SJ | Ema-vio: Deep visual–inertial odometry with external memory attention | - | - |
| 2022 | IROS | AFT-VO: Asynchronous fusion transformers for multi-view visual odometry estimation | - | - |
| 2022 | arXiv | Dense prediction transformer for scale estimation in monocular visual odometry | - | - |
| 2021 | Neural Computing and Applications | Transformer guided geometry model for flow-based unsupervised visual odometry | - | - |

Transformer-based Optical Flow

| Year | Venue | Paper Title | Repository | Note |
| --- | --- | --- | --- | --- |
| 2024 | ECCV | Cotracker: It is better to track together | Github stars | - |
| 2023 | arXiv | Win-win: Training high-resolution vision transformers from two windows | - | - |
| 2023 | arXiv | Flowformer: A transformer architecture and its masked cost volume autoencoding for optical flow | - | - |
| 2023 | CVPR | FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation | Github stars | - |
| 2023 | CVPR | Transflow: Transformer as flow learner | - | - |
| 2023 | ICCV | Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow | Github stars | Cross view Match |
| 2022 | NIPS | Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion | Github stars | Cross view Match |
| 2023 | PAMI | Unifying flow, stereo and depth estimation | Github stars | - |
| 2022 | CVPR | Gmflow: Learning optical flow via global matching | Github stars | - |
| 2022 | CVPR | Craft: Cross-attentional flow transformer for robust optical flow | Github stars | - |
| 2022 | CVPR | Learning optical flow with kernel patch attention | Github stars | - |
| 2022 | CVPR | Global Matching with Overlapping Attention for Optical Flow Estimation | Github stars | - |
| 2022 | CVPR | Flowformer: A transformer architecture for optical flow | Github stars | - |

Transformer-based View Matching

or Data Association, Correspondences

| Year | Venue | Paper Title | Repository | Note |
| --- | --- | --- | --- | --- |
| 2025 | arXiv | TAPIP3D: Tracking Any Point in Persistent 3D Geometry | Github stars | website |
| 2025 | CVPR | DEFOM-Stereo: Depth Foundation Model Based Stereo Matching | Github stars | website; depth anything v2 + RAFT-Stereo |
| 2025 | arXiv | MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training | Github stars | website |
| 2025 | arXiv | POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction | Github stars | - |
| 2025 | arXiv | CoMatcher: Multi-View Collaborative Feature Matching | - | - |
| 2025 | arXiv | CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching | - | - |
| 2025 | CVPR | Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better | Github stars | website |
| 2025 | CVPR | FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching | Github stars | - |
| 2025 | arXiv | ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration | Github stars | - |
| 2025 | arXiv | Loop Closure from Two Views: Revisiting PGO for Scalable Trajectory Estimation through Monocular Priors | - | MASt3R for Loop Closure |
| 2025 | arXiv | Speedy MASt3R | - | - |
| 2025 | CVPR | VGGT: Visual Geometry Grounded Transformer | Github stars | website; Test |
| 2024 | ECCV | Grounding Image Matching in 3D with MASt3R | Github stars | Website; Test |
| 2024 | CVPR | Efficient LoFTR: Semi-dense local feature matching with sparse-like speed | Github stars | ELoFTR |
| 2024 | CVPR | RoMa: Robust dense feature matching | Github stars | - |
| 2023 | CVPR | Rotation-invariant transformer for point cloud matching | Github stars | Point Cloud Matching |
| 2022 | ECCV | Aspanformer: Detector-free image matching with adaptive span transformer | - | - |
| 2022 | ICARM | Tlcd: A transformer based loop closure detection for robotic visual slam | - | - |
| 2021 | ICCV | Cotr: Correspondence transformer for matching across images | Github stars | - |
| 2021 | CVPR | LoFTR: Detector-free local feature matching with transformers | Github stars | - |
| 2020 | CVPR | Superglue: Learning feature matching with graph neural networks | Github stars | borrows the self-attention |

Transformer-based Mapping

or depth estimation or 3D reconstruction

| Year | Venue | Paper Title | Repository | Note |
| --- | --- | --- | --- | --- |
| 2025 | CVPR | MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds | Github stars | website |
| 2025 | CVPR | MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision | Github stars | website |
| 2025 | arXiv | Regist3R: Incremental Registration with Stereo Foundation Model | - | - |
| 2025 | arXiv | St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World | - | website |
| 2025 | CVPR | AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis | Github stars | website |
| 2025 | arXiv | Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction | - | - |
| 2025 | CVPR | MonSter: Marry Monodepth to Stereo Unleashes Power | Github stars | - |
| 2025 | arXiv | D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes | - | website |
| 2025 | arXiv | FlowR: Flowing from Sparse to Dense 3D Reconstructions | - | website |
| 2025 | arXiv | Easi3R: Estimating Disentangled Motion from DUSt3R Without Training | Github stars | website; 4D DUSt3R; test |
| 2025 | arXiv | SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors | - | DUSt3R+Diffusion+3DGS |
| 2025 | ICLR | M3: 3D-Spatial Multimodal Memory | Github stars | website; compression & Gaussian Memory Attention |
| 2025 | CVPR | MVSAnywhere: Zero-Shot Multi-View Stereo | Github stars | website |
| 2025 | CVPR | CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis | - | website |
| 2025 | CVPR | Pow3R: empowering unconstrained 3D reconstruction with camera and scene priors | - | website; DUSt3R+multi information input |
| 2025 | CVPR | Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding | Github stars | TSP3D |
| 2025 | CVPR | UniK3D: Universal Camera Monocular 3D Estimation | Github stars | website |
| 2025 | CVPR | Sonata: Self-Supervised Learning of Reliable Point Representations | Github stars | website |
| 2024 | CVPR | Point transformer v3: Simpler faster stronger | Github stars | - |
| 2022 | NIPS | Point transformer v2: Grouped vector attention and partition-based pooling | Github stars | - |
| 2021 | ICCV | Point transformer | - | unofficial implementation |
| 2025 | arXiv | Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction | - | website; Dynamic DUSt3R, DPM |
| 2025 | ICLR | MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion | Github stars | website; Test |
| 2025 | CVPR | Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos | Github stars | website |
| 2025 | CVPR | Continuous 3D Perception Model with Persistent State | Github stars | website; CUT3R |
| 2025 | CVPR | SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction | Github stars | MASt3R+COLMAP+3DGS |
| 2025 | arXiv | SplatVoxel: History-Aware Novel View Streaming without Temporal Training | - | - |
| 2025 | CVPR | GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding | Github stars | 3DGS+Transformer |
| 2025 | CVPR | DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers | Github stars | website; distillation |
| 2025 | arXiv | MUSt3R: Multi-view Network for Stereo 3D Reconstruction | Github stars | multiple views DUSt3R |
| 2025 | CVPR | Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass | Github stars | Website; Test |
| 2024 | NIPS | Depth anything v2 | Github stars | website |
| 2024 | CVPR | Depth anything: Unleashing the power of large-scale unlabeled data | Github stars | Website |
| 2024 | CVPR | DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions | - | - |
| 2024 | CVPR | Learning to adapt clip for few-shot monocular depth estimation | - | - |
| 2024 | arXiv | 3d reconstruction with spatial memory | Github stars | website; Spann3R |
| 2024 | CVPR | DUSt3R: Geometric 3D Vision Made Easy | Github stars | Website; Test |
| 2024 | ECCV | Gs-lrm: Large reconstruction model for 3d gaussian splatting | - | website; 3DGS+Transformer |
| 2024 | TIP | BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation | Github stars | - |
| 2024 | TIP | GLPanoDepth: Global-to-Local Panoramic Depth Estimation | - | - |
| 2023 | ICCV | Towards zero-shot scale-aware monocular depth estimation | Github stars | website |
| 2023 | ICCV | Egformer: Equirectangular geometry-biased transformer for 360 depth estimation | Github stars | - |
| 2023 | Machine Intelligence Research | Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation | - | - |
| 2023 | CVPR | Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation | - | - |
| 2023 | CVPR | CompletionFormer: Depth Completion with Convolutions and Vision Transformers | Github stars | website |
| 2023 | ICRA | Lightweight monocular depth estimation via token-sharing transformer | - | - |
| 2023 | AAAI | ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation | - | - |
| 2023 | ICRA | TODE-Trans: Transparent Object Depth Estimation with Transformer | Github stars | - |
| 2023 | AAAI | Deep digging into the generalization of self-supervised monocular depth estimation | Github stars | - |
| 2022 | ECCV | PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation | Github stars | - |
| 2022 | AAAI | Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning | - | - |
| 2022 | arXiv | MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth | - | - |
| 2022 | arXiv | Objcavit: improving monocular depth estimation using natural language models and image-object cross-attention | Github stars | - |
| 2022 | arXiv | Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion | Github stars | - |
| 2022 | arXiv | Sidert: A real-time pure transformer architecture for single image depth estimation | - | - |
| 2022 | ECCV | Hybrid transformer based feature fusion for self-supervised monocular depth estimation | - | - |
| 2022 | ECCV | Spike transformer: Monocular depth estimation for spiking camera | Github stars | - |
| 2022 | 3DV | MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer | Github stars | - |
| 2022 | arXiv | DEST: Depth Estimation with Simplified Transformer | - | - |
| 2022 | arXiv | SparseFormer: Attention-based Depth Completion Network | - | - |
| 2022 | CVPR | GuideFormer: Transformers for Image Guided Depth Completion | - | - |
| 2022 | CVPR | Multi-frame self-supervised depth with transformers | - | - |
| 2022 | arXiv | Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics | - | - |
| 2021 | ICCV | Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers | - | STTR; stereo matching |
| 2021 | BMVC | Transformer-based Monocular Depth Estimation with Attention Supervision | Github stars | - |
| 2021 | ICCV | Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction | Github stars | - |
| 2021 | ICCV | Vision transformers for dense prediction | Github stars | DPT |

Other Resources

| Year | Venue | Paper Title | Repository | Note |
| --- | --- | --- | --- | --- |
| 2024 | Transactions on Machine Learning Research Journal | Dinov2: Learning robust visual features without supervision | Github stars | DINOv2 |
| 2021 | ICML | Is space-time attention all you need for video understanding? | Github stars | TimeSformer |
| 2021 | CVPR | Taming transformers for high-resolution image synthesis | Github stars | High resolution CNN+Transformer |
| 2021 | ICCV | Emerging properties in self-supervised vision transformers | Github stars | DINO; SSL |
| 2021 | ICCV | Vivit: A video vision transformer | Github stars | - |
| 2020 | ICLR | An image is worth 16x16 words: Transformers for image recognition at scale | Github stars | ViT |