| 2026 | arXiv | Fast-SAM3D: 3Dfy Anything in Images but Faster |  | Reconstructs 3D object models from input images |
| 2026 | arXiv | Masked Depth Modeling for Spatial Perception |  | website Depth completion for real-world scenes: sensor-failure regions are treated as learnable structural cues rather than mere noise, and masked depth modeling (MDM) exploits the visual context of the RGB image to infer and complete missing depth regions. Architecture: RGB + depth into a standard ViT. Data: synthetic LingBot Depth-S (realistic depth observations with natural defects; 1M samples rendered from 442 indoor scenes) + real LingBot Depth-R (2M scenes) + supplementary open-source datasets, 10M training samples in total |
| 2026 | arXiv | V-DPM: 4D Video Reconstruction with Dynamic Point Maps |  | website Extends dynamic point maps (DPM) from image pairs to multi-frame video, enabling 4D reconstruction. Built on VGGT, with a time-conditioned decoder that splits reconstruction into two steps: VGGT produces viewpoint-invariant but time-varying point maps (all points in one camera coordinate frame), then the time-conditioned decoder produces time-invariant point maps (same coordinate frame and also transported to one timestamp) |
| 2026 | arXiv | MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation |  | Jointly performs scene reconstruction and target segmentation from sparse RGB images and a text description. Dual-branch architecture: a reconstruction branch (3D reconstruction) and a multimodal branch that fuses the reconstruction branch's geometric features with visual image features and a text encoder to attend to the described region, outputting a 2D mask that is projected onto the 3D point cloud |
| 2025 | arXiv | One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control |  | website A single model (the Wan video generation model) jointly outputs an RGB video and a pointmap (XYZ) geometry video |
| 2025 | arXiv | TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction |  | Globally consistent online incremental mapping: fine geometric alignment via global control-point propagation and thin-plate-spline transforms, submap registration via averaging of overlapping camera poses; supports different foundation models (e.g. VGGT, π³, MapAnything) |
| 2025 | arXiv | InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams |  | Rolling-memory mechanism based on key diversity, enabling visual geometry reconstruction over unbounded sequences: each token is scored by the negative cosine similarity between its key and the mean key; a higher score means the token deviates more from the average, carries more distinctive geometric information, and should be kept |
| 2025 | arXiv | DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass |  | Dense point tracking and 3D reconstruction of dynamic scenes from multiple images: extracts deep spatio-temporal features and regresses pixel-wise maps via dense prediction heads; VGGT + Dense Prediction Transformer (DPT) serving as the depth, point, and motion heads |
| 2025 | arXiv | Depth Anything 3: Recovering the Visual Space from Any Views |  | website |
| 2025 | arXiv | HTTM: Head-wise Temporal Token Merging for Faster VGGT | — | — |
| 2025 | arXiv | FastVGGT: Training-Free Acceleration of Visual Geometry Transformer | — | — |
| 2025 | arXiv | VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction |  | website |
| 2025 | arXiv | OmniVGGT: Omni-Modality Driven Visual Geometry Grounded |  | website |
| 2025 | arXiv | Human3R: Everyone Everywhere All at Once |  | website |
| 2025 | arXiv | MapAnything: Universal Feed-Forward Metric 3D Reconstruction |  | website |
| 2025 | arXiv | POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction |  | — |
| 2025 | ICCV | PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction |  | website |
| 2025 | arXiv | Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos | — | — |
| 2025 | ICCV | LONG3R: Long Sequence Streaming 3D Reconstruction |  | website |
| 2025 | arXiv | Dens3R: A Foundation Model for 3D Geometry Prediction | — | — |
| 2025 | arXiv | StreamVGGT: Streaming 4D Visual Geometry Transformer |  | website |
| 2025 | arXiv | Test3R: Learning to Reconstruct 3D at Test Time |  | website |
| 2025 | arXiv | Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory |  | website |
| 2025 | arXiv | 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos | — | website |
| 2025 | arXiv | Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction | — | — |
| 2025 | CVPR | SAB3R: Semantic-Augmented Backbone in 3D Reconstruction | — | website |
| 2025 | arXiv | Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles |  | website |
| 2025 | CVPR | MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models |  | — |
| 2025 | arXiv | Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos |  | — |
| 2025 | arXiv | STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes |  | website |
| 2025 | CVPRW | DELTA: Dense Depth from Events and LiDAR using Transformer’s Attention |  | website |
| 2025 | CVPR | MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds |  | website |
| 2025 | CVPR | MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision |  | website |
| 2025 | arXiv | Regist3R: Incremental Registration with Stereo Foundation Model | — | — |
| 2025 | arXiv | St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World | — | website |
| 2025 | CVPR | AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis |  | website |
| 2025 | arXiv | Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction | — | — |
| 2025 | CVPR | MonSter: Marry Monodepth to Stereo Unleashes Power |  | — |
| 2025 | arXiv | D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes | — | website |
| 2025 | arXiv | FlowR: Flowing from Sparse to Dense 3D Reconstructions | — | website |
| 2025 | arXiv | Easi3R: Estimating Disentangled Motion from DUSt3R Without Training |  | website 4D DUSt3R test |
| 2025 | arXiv | SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors | — | DUSt3R+Diffusion+3DGS |
| 2025 | ICLR | M3: 3D-Spatial Multimodal Memory |  | website compression & Gaussian Memory Attention |
| 2025 | CVPR | MVSAnywhere: Zero-Shot Multi-View Stereo |  | website |
| 2025 | CVPR | CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis | — | website |
| 2025 | CVPR | Pow3R: empowering unconstrained 3D reconstruction with camera and scene priors | — | website DUSt3R+multi information input |
| 2025 | CVPR | Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding |  | TSP3D |
| 2025 | CVPR | UniK3D: Universal Camera Monocular 3D Estimation |  | website |
| 2025 | CVPR | Sonata: Self-Supervised Learning of Reliable Point Representations |  | website |
| 2024 | CVPR | Point transformer v3: Simpler faster stronger |  | — |
| 2022 | NIPS | Point transformer v2: Grouped vector attention and partition-based pooling |  | — |
| 2021 | ICCV | Point transformer | — | unofficial implementation |
| 2025 | arXiv | Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction | — | website Dynamic DUSt3R, DPM |
| 2025 | ICLR | MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion |  | website Test |
| 2025 | CVPR | Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos |  | website |
| 2025 | CVPR | Continuous 3D Perception Model with Persistent State |  | website CUT3R |
| 2025 | CVPR | SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction |  | MASt3R+COLMAP+3DGS |
| 2025 | arXiv | SplatVoxel: History-Aware Novel View Streaming without Temporal Training | — | — |
| 2025 | CVPR | GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding |  | 3DGS+Transformer |
| 2025 | CVPR | DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers |  | website distillation |
| 2025 | arXiv | MUSt3R: Multi-view Network for Stereo 3D Reconstruction |  | multiple views DUSt3R |
| 2025 | CVPR | Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass |  | Website Test |
| 2024 | NIPS | Depth anything v2 |  | website |
| 2024 | CVPR | Depth anything: Unleashing the power of large-scale unlabeled data |  | Website |
| 2024 | CVPR | DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions | — | — |
| 2024 | CVPR | Learning to adapt clip for few-shot monocular depth estimation | — | — |
| 2025 | 3DV | 3d reconstruction with spatial memory |  | website Spann3R |
| 2024 | CVPR | DUSt3R: Geometric 3D Vision Made Easy |  | Website Test |
| 2024 | ECCV | Gs-lrm: Large reconstruction model for 3d gaussian splatting | — | website 3DGS+Transformer |
| 2024 | TIP | BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation |  | — |
| 2024 | TIP | GLPanoDepth: Global-to-Local Panoramic Depth Estimation | — | — |
| 2023 | ICCV | Towards zero-shot scale-aware monocular depth estimation |  | website |
| 2023 | ICCV | Egformer: Equirectangular geometry-biased transformer for 360 depth estimation |  | — |
| 2023 | Machine Intelligence Research | Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation | — | — |
| 2023 | CVPR | Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation | — | — |
| 2023 | CVPR | CompletionFormer: Depth Completion with Convolutions and Vision Transformers |  | website |
| 2023 | ICRA | Lightweight monocular depth estimation via token-sharing transformer | — | — |
| 2023 | AAAI | ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation | — | — |
| 2023 | ICRA | TODE-Trans: Transparent Object Depth Estimation with Transformer |  | — |
| 2023 | AAAI | Deep digging into the generalization of self-supervised monocular depth estimation |  | — |
| 2022 | ECCV | PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation |  | — |
| 2022 | AAAI | Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning | — | — |
| 2022 | arXiv | MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth | — | — |
| 2022 | arXiv | Objcavit: improving monocular depth estimation using natural language models and image-object cross-attention |  | — |
| 2022 | arXiv | Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion |  | — |
| 2022 | arXiv | Sidert: A real-time pure transformer architecture for single image depth estimation | — | — |
| 2022 | ECCV | Hybrid transformer based feature fusion for self-supervised monocular depth estimation | — | — |
| 2022 | ECCV | Spike transformer: Monocular depth estimation for spiking camera |  | — |
| 2022 | 3DV | MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer |  | — |
| 2022 | arXiv | DEST: Depth Estimation with Simplified Transformer | — | — |
| 2022 | arXiv | SparseFormer: Attention-based Depth Completion Network | — | — |
| 2022 | CVPR | GuideFormer: Transformers for Image Guided Depth Completion | — | — |
| 2022 | CVPR | Multi-frame self-supervised depth with transformers | — | — |
| 2022 | arXiv | Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics | — | — |
| 2021 | ICCV | Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers | — | STTR stereo matching |
| 2021 | BMVC | Transformer-based Monocular Depth Estimation with Attention Supervision |  | — |
| 2021 | ICCV | Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction |  | — |
| 2021 | ICCV | Vision transformers for dense prediction |  | DPT |
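
The key-diversity score described for InfiniteVGGT in the table above (the negative cosine similarity between each token's key and the mean key, with the highest-scoring tokens retained in the rolling memory) can be sketched as follows. This is a minimal NumPy sketch with illustrative function and variable names, not the paper's implementation.

```python
import numpy as np

def key_diversity_scores(keys: np.ndarray) -> np.ndarray:
    """Score each token key by its negative cosine similarity to the mean key.

    A higher score means the key deviates more from the average and is
    assumed to carry more distinctive geometric information.
    keys: (num_tokens, dim) array of attention keys.
    """
    mean_key = keys.mean(axis=0)
    keys_unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    mean_unit = mean_key / np.linalg.norm(mean_key)
    return -(keys_unit @ mean_unit)  # negative cosine similarity

def keep_top_k(keys: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most distinctive tokens under the diversity score."""
    scores = key_diversity_scores(keys)
    return np.argsort(scores)[::-1][:k]

# Toy example: three near-duplicate keys and one outlier key.
keys = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [-1.0, 1.0]])
print(keep_top_k(keys, 1))  # the outlier (index 3) is kept
```

Tokens whose keys are nearly parallel to the mean key score low and would be the first candidates for eviction from the bounded memory.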