Awesome Transformer-based SLAM

This repository is a curated list of resources that address SLAM-related tasks using Transformers, including optical flow, view/feature correspondence, stereo matching, depth estimation, 3D reconstruction, and pose estimation.

If you notice papers that are missing, feel free to open a pull request or an issue.

Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, a simple star is the best affirmation. 😊

Feel free to share this list with others!

Overview

Transformer-based SLAM

Full SLAM, including pose and depth

Year Venue Paper Title Repository Note
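2025 arXiv DVGT: Driving Visual Geometry Transformer Github stars website Uses DINOv3 as the backbone to extract visual features, then applies a series of attention mechanisms (Factorized Spatial-Temporal Attention) to model spatio-temporal geometric relations; parallel heads finally output a global 3D point map and the ego pose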
2025 arXiv DriveVGGT: Visual Geometry Transformer for Autonomous Driving Multi-camera VGGT reconstruction for autonomous driving
2025 arXiv Any4D: Unified Feed-Forward Metric 4D Reconstruction Github stars website
2025 arXiv Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM
2025 arXiv SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors
2025 arXiv Multi-Agent Monocular Dense SLAM With 3D Reconstruction Priors
2025 arXiv AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend Github stars website
2025 arXiv LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping Github stars
2025 arXiv SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization Github stars website
2025 arXiv EC3R-SLAM: Efficient and Consistent Monocular Dense SLAM with Feed-Forward 3D Reconstruction Github stars Website
2025 arXiv GRS-SLAM3R: Real-Time Dense SLAM with Gated Recurrent State
2025 arXiv PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization Github stars
2025 arXiv SLAM-Former: Putting SLAM into One Transformer Github stars website
2025 arXiv ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association Github stars website
2025 arXiv VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT’s Limits on Kilometer-scale Long RGB Sequences Github stars
2025 arXiv DINO-SLAM: DINO-informed RGB-D SLAM for Neural Implicit and Explicit Representations
2025 arXiv VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
2025 arXiv 3R-GS: Best Practice in Optimizing Camera Poses Along with 3DGS website MASt3R-SfM+3DGS
2025 CVPR AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos Github stars website
2025 CVPR SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos Github stars test
2025 CVPR MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors Github stars Website Test
2022 ECCV Jperceiver: Joint perception network for depth, pose and layout estimation in driving scenes Github stars
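
The DVGT note in the table above mentions factorized spatial-temporal attention. As a rough illustration of the general idea only (not DVGT's actual architecture), the sketch below first lets tokens attend within their own frame and then lets each spatial location attend across frames. The function names are hypothetical, and the attention uses a single head with identity projections (q = k = v) to keep the example short; a real model would use learned projections, multiple heads, and residual connections.

```python
import math


def attend(tokens):
    """Single-head self-attention with identity projections (q = k = v = tokens)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product logits between this query and every key.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return out


def factorized_attention(x):
    """x[t][n] is the feature vector of spatial token n in frame t."""
    num_frames, num_tokens = len(x), len(x[0])
    # Spatial pass: tokens attend only within their own frame.
    x = [attend(frame) for frame in x]
    # Temporal pass: each spatial location attends across frames.
    columns = [
        attend([x[t][n] for t in range(num_frames)])
        for n in range(num_tokens)
    ]
    # Transpose back to the [frame][token] layout.
    return [[columns[n][t] for n in range(num_tokens)] for t in range(num_frames)]


if __name__ == "__main__":
    frames = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
    result = factorized_attention(frames)
    print(len(result), len(result[0]), len(result[0][0]))
```

For T frames of N tokens each, the two passes compute on the order of T·N² + N·T² pairwise scores, compared with (T·N)² for joint attention over all tokens, which is the usual motivation for factorizing.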

Transformer-based Pose Tracking

or pose/state estimation

Year Venue Paper Title Repository Note
2025 arXiv MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM Github stars
2025 CVPR Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization Github stars
2025 arXiv Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images
2025 RAL DINO-VO: A Feature-Based Visual Odometry Leveraging a Visual Foundation Model
2025 TRO BotVIO: A Lightweight Transformer-Based Visual-Inertial Odometry for Robotics Github stars
2025 IROS SplatPose: Geometry-Aware 6-DoF Pose Estimation from Single RGB Image via 3D Gaussian Splatting
2025 CVPR ZeroVO: Visual Odometry with Minimal Assumptions website
2025 arXiv XIRVIO: Critic-guided Iterative Refinement for Visual-Inertial Odometry with Explainable Adaptive Weighting
2025 IEEE Access Transformer-based model for monocular visual odometry: a video understanding approach Github stars
2025 arXiv Light3R-SfM: Towards Feed-forward Structure-from-Motion
2024 arXiv MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion Github stars SfM version of MASt3R
2024 CVPR VGGSfM: Visual Geometry Grounded Deep Structure From Motion Github stars website
2024 MSc Thesis End-to-End Learned Visual Odometry Based on Vision Transformer
2024 Information Fusion Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance
2024 arXiv Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry Github stars
2024 IJRA DDETR-SLAM: A Transformer-Based Approach to Pose Optimization in Dynamic Environments
2023 International Conference on Neural Computing for Advanced Applications A lightweight sensor fusion for neural visual inertial odometry
2023 ITM Web of Conferences ViT VO-A Visual Odometry technique Using CNN-Transformer Hybrid Architecture
2023 arXiv TransFusionOdom: interpretable transformer-based LiDAR-inertial fusion odometry estimation Github stars
2023 CVPR Modality-invariant Visual Odometry for Embodied Vision Github stars Website
2023 MAV ViTVO: Vision Transformer based Visual Odometry with Attention Supervision
2023 International Conference on Haptics and Virtual Reality VIOFormer: Advancing Monocular Visual-Inertial Odometry Through Transformer-Based Fusion
2022 IEEE Intelligent Vehicles Symposium Attention guided unsupervised learning of monocular visual-inertial odometry
2022 IEEE Sensors Journal EMA-VIO: Deep visual–inertial odometry with external memory attention
2022 IROS AFT-VO: Asynchronous fusion transformers for multi-view visual odometry estimation
2022 arXiv Dense prediction transformer for scale estimation in monocular visual odometry
2021 Neural Computing and Applications Transformer guided geometry model for flow-based unsupervised visual odometry

Transformer-based Optical Flow

Year Venue Paper Title Repository Note
2024 ECCV Cotracker: It is better to track together Github stars
2023 arXiv Win-win: Training high-resolution vision transformers from two windows
2023 arXiv Flowformer: A transformer architecture and its masked cost volume autoencoding for optical flow
2023 CVPR FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation Github stars
2023 CVPR Transflow: Transformer as flow learner
2023 ICCV Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow Github stars Cross view Match
2022 NeurIPS Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion Github stars Cross view Match
2023 PAMI Unifying flow, stereo and depth estimation Github stars
2022 CVPR Gmflow: Learning optical flow via global matching Github stars
2022 CVPR Craft: Cross-attentional flow transformer for robust optical flow Github stars
2022 CVPR Learning optical flow with kernel patch attention Github stars
2022 CVPR Global Matching with Overlapping Attention for Optical Flow Estimation Github stars
2022 CVPR Flowformer: A transformer architecture for optical flow Github stars

Transformer-based View Matching

or Data Association, Correspondences

Year Venue Paper Title Repository Note
2025 arXiv TAPIP3D: Tracking Any Point in Persistent 3D Geometry Github stars website
2025 CVPR DEFOM-Stereo: Depth Foundation Model Based Stereo Matching Github stars website Depth Anything V2 + RAFT-Stereo
2025 arXiv MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training Github stars website
2025 ICCV POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction Github stars
2025 arXiv CoMatcher: Multi-View Collaborative Feature Matching
2025 arXiv CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching
2025 CVPR Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better Github stars website
2025 CVPR FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching Github stars
2025 arXiv ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration Github stars
2025 arXiv Loop Closure from Two Views: Revisiting PGO for Scalable Trajectory Estimation through Monocular Priors MASt3R for Loop Closure
2025 arXiv Speedy MASt3R
2025 CVPR VGGT: Visual Geometry Grounded Transformer Github stars website Test
2024 ECCV Grounding Image Matching in 3D with MASt3R Github stars Website Test
2024 CVPR Efficient LoFTR: Semi-dense local feature matching with sparse-like speed Github stars ELoFTR
2024 CVPR RoMa: Robust dense feature matching Github stars
2023 CVPR Rotation-invariant transformer for point cloud matching Github stars Point Cloud Matching
2022 ECCV Aspanformer: Detector-free image matching with adaptive span transformer
2022 ICARM Tlcd: A transformer based loop closure detection for robotic visual slam
2021 ICCV Cotr: Correspondence transformer for matching across images Github stars
2021 CVPR LoFTR: Detector-free local feature matching with transformers Github stars
2020 CVPR SuperGlue: Learning feature matching with graph neural networks Github stars borrows self-attention from the Transformer

Transformer-based Mapping

or depth estimation or 3D reconstruction

Year Venue Paper Title Repository Note
2025 arXiv TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction Github stars Online incremental mapping
2025 arXiv InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams Github stars A visual geometry reconstruction framework for unbounded sequences, built on a key-diversity-based rolling memory. Each key is scored by its negative cosine similarity to the mean key; a higher score means the token differs more from the mean, carries more distinctive geometric information, and should be retained
2025 arXiv DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass Github stars Dense point tracking and 3D reconstruction of dynamic scenes from multiple images: extracts deep spatio-temporal features and regresses pixel-wise maps through dense prediction heads; VGGT + Dense Prediction Transformer (DPT, used as the Depth Head, Point Head, and Motion Head)
2025 arXiv Depth Anything 3: Recovering the Visual Space from Any Views Github stars website
2025 arXiv HTTM: Head-wise Temporal Token Merging for Faster VGGT
2025 arXiv FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
2025 arXiv VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction Github stars website
2025 arXiv OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Github stars website
2025 arXiv Human3R: Everyone Everywhere All at Once Github stars website
2025 arXiv MapAnything: Universal Feed-Forward Metric 3D Reconstruction Github stars website
2025 ICCV POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction Github stars
2025 ICCV PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction Github stars website
2025 arXiv Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
2025 ICCV LONG3R: Long Sequence Streaming 3D Reconstruction Github stars website
2025 arXiv Dens3R: A Foundation Model for 3D Geometry Prediction
2025 arXiv StreamVGGT: Streaming 4D Visual Geometry Transformer Github stars website
2025 arXiv Test3R: Learning to Reconstruct 3D at Test Time Github stars website
2025 arXiv Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory Github stars website
2025 arXiv 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos website
2025 arXiv Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction
2025 CVPR SAB3R: Semantic-Augmented Backbone in 3D Reconstruction website
2025 arXiv Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles Github stars website
2025 CVPR MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models Github stars
2025 arXiv Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos Github stars
2025 arXiv STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes Github stars website
2025 CVPRW DELTA: Dense Depth from Events and LiDAR using Transformer’s Attention Github stars website
2025 CVPR MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds Github stars website
2025 CVPR MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision Github stars website
2025 arXiv Regist3R: Incremental Registration with Stereo Foundation Model
2025 arXiv St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World website
2025 CVPR AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis Github stars website
2025 arXiv Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction
2025 CVPR MonSter: Marry Monodepth to Stereo Unleashes Power Github stars
2025 arXiv D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes website
2025 arXiv FlowR: Flowing from Sparse to Dense 3D Reconstructions website
2025 arXiv Easi3R: Estimating Disentangled Motion from DUSt3R Without Training Github stars website 4D DUSt3R, Test
2025 arXiv SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors DUSt3R+Diffusion+3DGS
2025 ICLR M3: 3D-Spatial Multimodal Memory Github stars website compression & Gaussian Memory Attention
2025 CVPR MVSAnywhere: Zero-Shot Multi-View Stereo Github stars website
2025 CVPR CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis website
2025 CVPR Pow3R: empowering unconstrained 3D reconstruction with camera and scene priors website DUSt3R + multiple information inputs
2025 CVPR Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding Github stars TSP3D
2025 CVPR UniK3D: Universal Camera Monocular 3D Estimation Github stars website
2025 CVPR Sonata: Self-Supervised Learning of Reliable Point Representations Github stars website
2024 CVPR Point transformer v3: Simpler faster stronger Github stars
2022 NeurIPS Point transformer v2: Grouped vector attention and partition-based pooling Github stars
2021 ICCV Point transformer unofficial implementation
2025 arXiv Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction website Dynamic DUSt3R, DPM
2025 ICLR MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion Github stars website Test
2025 CVPR Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos Github stars website
2025 CVPR Continuous 3D Perception Model with Persistent State Github stars website CUT3R
2025 CVPR SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction Github stars MASt3R+COLMAP+3DGS
2025 arXiv SplatVoxel: History-Aware Novel View Streaming without Temporal Training
2025 CVPR GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding Github stars 3DGS+Transformer
2025 CVPR DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers Github stars website distillation
2025 arXiv MUSt3R: Multi-view Network for Stereo 3D Reconstruction Github stars multi-view DUSt3R
2025 CVPR Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass Github stars Website Test
2024 NeurIPS Depth anything v2 Github stars website
2024 CVPR Depth anything: Unleashing the power of large-scale unlabeled data Github stars Website
2024 CVPR DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions
2024 CVPR Learning to adapt clip for few-shot monocular depth estimation
2025 3DV 3D reconstruction with spatial memory Github stars website Spann3R
2024 CVPR DUSt3R: Geometric 3D Vision Made Easy Github stars Website Test
2024 ECCV GS-LRM: Large reconstruction model for 3D Gaussian splatting website 3DGS+Transformer
2024 TIP BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation Github stars
2024 TIP GLPanoDepth: Global-to-Local Panoramic Depth Estimation
2023 ICCV Towards zero-shot scale-aware monocular depth estimation Github stars website
2023 ICCV Egformer: Equirectangular geometry-biased transformer for 360 depth estimation Github stars
2023 Machine Intelligence Research Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation
2023 CVPR Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation
2023 CVPR CompletionFormer: Depth Completion with Convolutions and Vision Transformers Github stars website
2023 ICRA Lightweight monocular depth estimation via token-sharing transformer
2023 AAAI ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation
2023 ICRA TODE-Trans: Transparent Object Depth Estimation with Transformer Github stars
2023 AAAI Deep digging into the generalization of self-supervised monocular depth estimation Github stars
2022 ECCV PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation Github stars
2022 AAAI Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning
2022 arXiv MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth
2022 arXiv Objcavit: improving monocular depth estimation using natural language models and image-object cross-attention Github stars
2022 arXiv Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion Github stars
2022 arXiv Sidert: A real-time pure transformer architecture for single image depth estimation
2022 ECCV Hybrid transformer based feature fusion for self-supervised monocular depth estimation
2022 ECCV Spike transformer: Monocular depth estimation for spiking camera Github stars
2022 3DV MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer Github stars
2022 arXiv DEST: Depth Estimation with Simplified Transformer
2022 arXiv SparseFormer: Attention-based Depth Completion Network
2022 CVPR GuideFormer: Transformers for Image Guided Depth Completion
2022 CVPR Multi-frame self-supervised depth with transformers
2022 arXiv Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics
2021 ICCV Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers STTR, stereo matching
2021 BMVC Transformer-based Monocular Depth Estimation with Attention Supervision Github stars
2021 ICCV Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction Github stars
2021 ICCV Vision transformers for dense prediction Github stars DPT
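
The InfiniteVGGT note in the table above describes scoring each cached key by its negative cosine similarity to the mean key and keeping the highest-scoring tokens. The sketch below illustrates only that scoring rule as the note states it. The function names, the `budget` parameter, and the plain top-k selection are assumptions made for illustration and are not taken from the InfiniteVGGT code base.

```python
import math


def key_diversity_scores(keys):
    """Score each key by its negative cosine similarity to the mean key."""
    dim = len(keys[0])
    mean = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    mean_norm = math.sqrt(sum(m * m for m in mean))
    scores = []
    for k in keys:
        k_norm = math.sqrt(sum(x * x for x in k))
        denom = k_norm * mean_norm
        # Guard against zero vectors by treating their similarity as 0.
        cos = sum(x * m for x, m in zip(k, mean)) / denom if denom > 0 else 0.0
        scores.append(-cos)
    return scores


def select_tokens_to_keep(keys, budget):
    """Return the indices of the `budget` most diverse keys, in original order."""
    scores = key_diversity_scores(keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Three keys point in nearly the same direction; the last one is an outlier.
    keys = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0]]
    print(select_tokens_to_keep(keys, budget=2))
```

In this toy input the outlier key receives the highest score, so it is the first to be retained once the memory budget forces eviction.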

Other Resources

Year Venue Paper Title Repository Note
2025 arXiv DINOv3 Github stars website
2024 Transactions on Machine Learning Research DINOv2: Learning robust visual features without supervision Github stars DINOv2
2021 ICML Is space-time attention all you need for video understanding? Github stars TimeSformer
2021 CVPR Taming transformers for high-resolution image synthesis Github stars High resolution CNN+Transformer
2021 ICCV Emerging properties in self-supervised vision transformers Github stars DINO, SSL
2021 ICCV Vivit: A video vision transformer Github stars
2020 ICLR An image is worth 16x16 words: Transformers for image recognition at scale Github stars ViT