| 2026 |
ICLR |
TTT3R: 3D Reconstruction as Test-Time Training |
 |
website An improvement on CUT3R; CUT3R is an RNN-style architecture, and this work draws on existing RNN designs to improve it. TTT targets exactly the RNN length-generalization problem: the hidden state "memorizes" the sequence lengths/patterns seen during training; information is continually overwritten or decays over time, so important early information is lost in long sequences; and once sequences grow beyond training lengths, the hidden state drifts into regions barely visited during training, where its behavior is uncontrolled and errors are easily amplified. |
| 2026 |
CVPR |
SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes |
 |
website Substantially reduces inference time while preserving high-quality dense 3D reconstruction. Two main improvements: (1) replacing the iterative IRLS-based Sim(3) alignment with a single-step SVD-based Sim(3) pipeline; (2) performing loop-closure detection directly with VGGT's DINO block tokens, eliminating the redundant computation of an external VPR model. |
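The single-step SVD-based Sim(3) alignment can be sketched with the classic closed-form Umeyama solution (an illustrative assumption; the paper's exact formulation may differ): given matched 3D point sets, one SVD of the cross-covariance yields scale, rotation, and translation without IRLS iterations.

```python
import numpy as np

def sim3_align(src, dst):
    """Closed-form Sim(3) alignment (Umeyama): find scale s, rotation R,
    translation t such that dst ~= s * R @ src + t, from one SVD of the
    cross-covariance -- no IRLS iterations needed."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)          # source-point variance
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Because the solution is one SVD of a 3x3 matrix, its cost is negligible compared with iterative reweighted least squares over the correspondences.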
| 2026 |
arXiv |
Geometric Context Transformer for Streaming 3D Reconstruction |
 |
website LingBot-Map, a purely autoregressive model for streaming 3D reconstruction from a single RGB stream. A DINO backbone extracts image features, which are refined by alternating Frame Attention and GCA layers. In GCA, the current view aggregates information from the Anchor Context, a local Pose-Reference Window, and the Trajectory Memory. Task-specific prediction heads output the camera pose and depth map. |
| 2026 |
CVPR |
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery |
— |
Feed-forward 3D reconstruction from panoramas; introduces a dedicated three-axis SO(3) rotation scheme for panoramic images. The key contribution is likely the dataset. |
| 2026 |
ICLR |
PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception |
 |
website By visualizing features at VGGT's key layers, the authors observe that dynamic regions show weaker activations than static ones, suggesting VGGT tends to ignore dynamic content. They propose a Dynamics-aware Aggregator that integrates spatial and temporal cues via three modules: Frame Attention for intra-frame patch relations, Global Attention for cross-frame patch relations, and Dynamics-Aware Global Attention for disentangling dynamic and static content. The aggregator predicts a dynamics-aware mask that decouples static from dynamic information: motion cues are suppressed for pose estimation, while dynamic representations are strengthened for geometry reconstruction. |
| 2026 |
arXiv |
MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction |
 |
website "Streaming 3D reconstruction" means the model, like the human eye, continuously ingests a stream of video frames and builds a 3D model of its surroundings in real time. To handle unbounded streams, mainstream models compress past observations into a fixed-size recurrent state. Instead of treating the state as a monolith, this paper partitions it into multiple independent memory patches and updates them selectively, mitigating forgetting on long sequences. |
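A minimal sketch of the patch-wise state idea. The selection rule here (cosine similarity between each patch and the incoming feature) is an illustrative assumption, not necessarily the paper's actual write policy:

```python
import numpy as np

def patchwise_update(state, new_info, write_frac=0.25):
    """Split the recurrent state into P independent memory patches and
    overwrite only the patches most relevant to the incoming feature,
    leaving the rest untouched to reduce forgetting on long streams."""
    P = state.shape[0]
    n_write = max(1, int(P * write_frac))
    # relevance of each patch to the new information (cosine similarity)
    sims = (state @ new_info) / (
        np.linalg.norm(state, axis=1) * np.linalg.norm(new_info) + 1e-8
    )
    idx = np.argsort(sims)[-n_write:]   # most relevant patches get written
    out = state.copy()
    out[idx] = new_info
    return out, idx
```

The point is the contrast with a monolithic RNN state, where every update touches (and thus slowly erases) the entire memory.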
| 2025 |
arXiv |
4D-VGGT: A General Foundation Model with SpatioTemporal Awareness for Dynamic Scene Geometry Estimation |
— |
Multi-level representation: spatially, cross-view global fusion enforces globally consistent multi-view geometry at each time step; temporally, cross-time local fusion captures fine-grained local motion between consecutive frames. On top of the resulting spatio-temporal features, multiple prediction heads estimate depth, scene flow, and other quantities. |
| 2026 |
arXiv |
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer |
 |
website Self-Selective Caching (SSC) and Dynamic Anchor Protection (DAP); handles sequences of arbitrary length under a fixed memory budget. |
| 2026 |
arXiv |
VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale |
— |
website Offline large-scale 3D reconstruction: reconstructs 2K images in 48.5 s, 33x faster than VGGT. It compresses VGGT's variable-length scene representation (the KV space) into fixed-size MLP weights, reducing complexity from quadratic to linear O(n), and replaces VGGT's global softmax-attention modules with linear modules based on test-time training (TTT). |
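The idea of replacing softmax attention with a TTT-style linear module, where the sequence "state" is a small weight matrix updated by gradient steps as tokens stream in, can be sketched as follows (a toy reconstruction objective; the paper's actual loss and parameterization may differ):

```python
import numpy as np

def ttt_linear(tokens, lr=0.1):
    """Toy TTT-style linear layer: the sequence 'state' is a fixed-size
    weight matrix W, updated by one gradient step per incoming token on a
    self-supervised reconstruction loss 0.5 * ||W x - x||^2. Cost is O(n)
    in sequence length, unlike quadratic softmax attention."""
    d = tokens.shape[1]
    W = np.zeros((d, d))
    outs = []
    for x in tokens:
        pred = W @ x
        W -= lr * np.outer(pred - x, x)   # gradient of the loss w.r.t. W
        outs.append(W @ x)
    return np.stack(outs), W
```

Since W has a fixed size regardless of how many tokens have been seen, the scene representation stays constant-memory, which is exactly what makes the KV-space compression linear rather than quadratic.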
| 2026 |
arXiv |
Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow |
— |
website For each image, Flow4R predicts point positions (effectively a pointmap), scene flow, pose weights, and confidence. Scene flow captures how points move relative to the camera; in short, a single model simultaneously understands the scene, tracks objects, estimates camera motion, and performs reconstruction, with scene flow, rather than traditional poses or depth, as the unifying representation. |
| 2026 |
CVPR |
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging |
 |
website Proposes a geometry-aware cached token merging strategy to reduce redundancy; combined with fine-tuning and FP8 quantization, it enables vanilla VGGT to process 1000 images with a 10x inference speedup. |
| 2026 |
arXiv |
Fast-SAM3D: 3Dfy Anything in Images but Faster |
 |
Reconstructs 3D object models from input images. |
| 2026 |
arXiv |
Masked Depth Modeling for Spatial Perception |
 |
website A depth-completion model for real-world scenes: sensor-failure regions are treated as learnable structural cues rather than mere noise, and a Masked Depth Modeling (MDM) scheme exploits the visual context of the RGB image to infer and complete missing depth regions. Architecture: RGB + depth fed into a standard ViT. Training data: the synthetic LingBot Depth-S set (realistic depth observations with natural defects; 1M samples rendered from 442 indoor scenes), the real LingBot Depth-R set (2M scenes), and supplementary open-source datasets, for 10M training samples in total. |
| 2026 |
arXiv |
V-DPM: 4D Video Reconstruction with Dynamic Point Maps |
 |
website Extends Dynamic Point Maps (DPM) from image pairs to multi-frame video, enabling 4D reconstruction. Built on VGGT, a time-conditioned decoder splits reconstruction into two steps: VGGT first produces viewpoint-invariant but time-varying point maps (all points in a single camera coordinate frame), then the time-conditioned decoder produces time-invariant point maps (points not only share one coordinate frame but are also transported to a common timestamp). |
| 2026 |
arXiv |
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation |
 |
Jointly reconstructs the scene and segments the referred target from sparse RGB images and a text description. Dual-branch architecture: a reconstruction branch (3D reconstruction) and a multimodal branch (which takes geometric features from the reconstruction branch, fuses them with visual image features and the text encoder's output to attend to the described region, and outputs a 2D mask that is projected onto the 3D point cloud). |
| 2025 |
arXiv |
One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control |
 |
website A single model (the Wan Video generation model) jointly outputs an RGB video and a pointmap (XYZ) geometry video. |
| 2025 |
arXiv |
TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction |
 |
Globally consistent online incremental mapping: fine geometric alignment via global control-point propagation and thin-plate-spline transforms, submap registration via averaging of overlapping camera poses; supports different foundation models (e.g., VGGT, π³, MapAnything). |
| 2025 |
arXiv |
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams |
 |
A rolling-memory mechanism based on key diversity, yielding a visual-geometry reconstruction framework for endless streams. Each key is scored by its negative cosine similarity to the mean key: the higher the score, the more the token deviates from the average and the more distinctive the geometric information it carries, so it is retained. |
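The key-diversity score described above can be sketched directly (the shapes and the top-k retention policy are assumptions for illustration):

```python
import numpy as np

def diversity_prune(keys, tokens, budget):
    """Score each cached key by the NEGATIVE cosine similarity to the mean
    key: a high score means the token deviates from the average and carries
    distinctive geometric information, so it survives in the rolling memory."""
    mean_k = keys.mean(axis=0)
    mean_k = mean_k / np.linalg.norm(mean_k)
    k_norm = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = -(k_norm @ mean_k)             # negative cosine similarity
    keep = np.argsort(scores)[-budget:]     # retain the most diverse tokens
    return keys[keep], tokens[keep], keep
```

Running this once per incoming frame keeps the memory at a fixed token budget while preferentially evicting near-duplicate geometry.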
| 2025 |
arXiv |
DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass |
 |
Dense point tracking and 3D reconstruction of dynamic scenes from multiple images: deep spatio-temporal features are extracted, and dense prediction heads regress pixel-wise maps; VGGT + Dense Prediction Transformer (DPT) heads serve as the Depth Head, Point Head, and Motion Head. |
| 2025 |
arXiv |
Depth Anything 3: Recovering the Visual Space from Any Views |
 |
website |
| 2025 |
arXiv |
HTTM: Head-wise Temporal Token Merging for Faster VGGT |
— |
— |
| 2025 |
arXiv |
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer |
— |
— |
| 2025 |
arXiv |
VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction |
 |
website |
| 2025 |
arXiv |
OmniVGGT: Omni-Modality Driven Visual Geometry Grounded |
 |
website |
| 2025 |
arXiv |
Human3R: Everyone Everywhere All at Once |
 |
website |
| 2025 |
arXiv |
MapAnything: Universal Feed-Forward Metric 3D Reconstruction |
 |
website |
| 2025 |
arXiv |
POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction |
 |
— |
| 2025 |
ICCV |
PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction |
 |
website |
| 2025 |
arXiv |
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos |
— |
— |
| 2025 |
ICCV |
LONG3R: Long Sequence Streaming 3D Reconstruction |
 |
website |
| 2025 |
arXiv |
Dens3R: A Foundation Model for 3D Geometry Prediction |
— |
— |
| 2025 |
arXiv |
StreamVGGT: Streaming 4D Visual Geometry Transformer |
 |
website |
| 2025 |
arXiv |
Test3R: Learning to Reconstruct 3D at Test Time |
 |
website |
| 2025 |
arXiv |
Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory |
 |
website |
| 2025 |
arXiv |
4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos |
— |
website |
| 2025 |
arXiv |
Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction |
— |
— |
| 2025 |
CVPR |
SAB3R: Semantic-Augmented Backbone in 3D Reconstruction |
— |
website |
| 2025 |
arXiv |
Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles |
 |
website |
| 2025 |
CVPR |
MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models |
 |
— |
| 2025 |
arXiv |
Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos |
 |
— |
| 2025 |
arXiv |
STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes |
 |
website |
| 2025 |
CVPRW |
DELTA: Dense Depth from Events and LiDAR using Transformer’s Attention |
 |
website |
| 2025 |
CVPR |
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds |
 |
website |
| 2025 |
CVPR |
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision |
 |
website |
| 2025 |
arXiv |
Regist3R: Incremental Registration with Stereo Foundation Model |
— |
— |
| 2025 |
arXiv |
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World |
— |
website |
| 2025 |
CVPR |
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis |
 |
website |
| 2025 |
arXiv |
Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction |
— |
— |
| 2025 |
CVPR |
MonSter: Marry Monodepth to Stereo Unleashes Power |
 |
— |
| 2025 |
arXiv |
D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes |
— |
website |
| 2025 |
arXiv |
FlowR: Flowing from Sparse to Dense 3D Reconstructions |
— |
website |
| 2025 |
arXiv |
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training |
 |
website 4D DUSt3R test |
| 2025 |
arXiv |
SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors |
— |
DUSt3R+Diffusion+3DGS |
| 2025 |
ICLR |
M3: 3D-Spatial Multimodal Memory |
 |
website compression & Gaussian Memory Attention |
| 2025 |
CVPR |
MVSAnywhere: Zero-Shot Multi-View Stereo |
 |
website |
| 2025 |
CVPR |
CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis |
— |
website |
| 2025 |
CVPR |
Pow3R: empowering unconstrained 3D reconstruction with camera and scene priors |
— |
website DUSt3R + additional camera/scene prior inputs |
| 2025 |
CVPR |
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding |
 |
TSP3D |
| 2025 |
CVPR |
UniK3D: Universal Camera Monocular 3D Estimation |
 |
website |
| 2025 |
CVPR |
Sonata: Self-Supervised Learning of Reliable Point Representations |
 |
website |
| 2024 |
CVPR |
Point Transformer V3: Simpler, Faster, Stronger |
 |
— |
| 2022 |
NeurIPS |
Point Transformer V2: Grouped Vector Attention and Partition-based Pooling |
 |
— |
| 2021 |
ICCV |
Point Transformer |
— |
unofficial implementation |
| 2025 |
arXiv |
Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction |
— |
website Dynamic DUSt3R, DPM |
| 2025 |
ICLR |
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion |
 |
website Test |
| 2025 |
CVPR |
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos |
 |
website |
| 2025 |
CVPR |
Continuous 3D Perception Model with Persistent State |
 |
website CUT3R |
| 2025 |
CVPR |
SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction |
 |
MASt3R+COLMAP+3DGS |
| 2025 |
arXiv |
SplatVoxel: History-Aware Novel View Streaming without Temporal Training |
— |
— |
| 2025 |
CVPR |
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding |
 |
3DGS+Transformer |
| 2025 |
CVPR |
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers |
 |
website distillation |
| 2025 |
arXiv |
MUSt3R: Multi-view Network for Stereo 3D Reconstruction |
 |
multi-view extension of DUSt3R |
| 2025 |
CVPR |
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass |
 |
website Test |
| 2024 |
NeurIPS |
Depth Anything V2 |
 |
website |
| 2024 |
CVPR |
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data |
 |
website |
| 2024 |
CVPR |
DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions |
— |
— |
| 2024 |
CVPR |
Learning to adapt clip for few-shot monocular depth estimation |
— |
— |
| 2025 |
3DV |
3D Reconstruction with Spatial Memory |
 |
website Spann3R |
| 2024 |
CVPR |
DUSt3R: Geometric 3D Vision Made Easy |
 |
website Test |
| 2024 |
ECCV |
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting |
— |
website 3DGS+Transformer |
| 2024 |
TIP |
BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation |
 |
— |
| 2024 |
TIP |
GLPanoDepth: Global-to-Local Panoramic Depth Estimation |
— |
— |
| 2023 |
ICCV |
Towards zero-shot scale-aware monocular depth estimation |
 |
website |
| 2023 |
ICCV |
Egformer: Equirectangular geometry-biased transformer for 360 depth estimation |
 |
— |
| 2023 |
Machine Intelligence Research |
Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation |
— |
— |
| 2023 |
CVPR |
Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation |
— |
— |
| 2023 |
CVPR |
CompletionFormer: Depth Completion with Convolutions and Vision Transformers |
 |
website |
| 2023 |
ICRA |
Lightweight monocular depth estimation via token-sharing transformer |
— |
— |
| 2023 |
AAAI |
ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation |
— |
— |
| 2023 |
ICRA |
TODE-Trans: Transparent Object Depth Estimation with Transformer |
 |
— |
| 2023 |
AAAI |
Deep digging into the generalization of self-supervised monocular depth estimation |
 |
— |
| 2022 |
ECCV |
PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation |
 |
— |
| 2022 |
AAAI |
Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning |
— |
— |
| 2022 |
arXiv |
MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth |
— |
— |
| 2022 |
arXiv |
ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models and Image-Object Cross-Attention |
 |
— |
| 2022 |
arXiv |
Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion |
 |
— |
| 2022 |
arXiv |
SideRT: A Real-Time Pure Transformer Architecture for Single Image Depth Estimation |
— |
— |
| 2022 |
ECCV |
Hybrid transformer based feature fusion for self-supervised monocular depth estimation |
— |
— |
| 2022 |
ECCV |
Spike transformer: Monocular depth estimation for spiking camera |
 |
— |
| 2022 |
3DV |
MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer |
 |
— |
| 2022 |
arXiv |
DEST: Depth Estimation with Simplified Transformer |
— |
— |
| 2022 |
arXiv |
SparseFormer: Attention-based Depth Completion Network |
— |
— |
| 2022 |
CVPR |
GuideFormer: Transformers for Image Guided Depth Completion |
— |
— |
| 2022 |
CVPR |
Multi-frame self-supervised depth with transformers |
— |
— |
| 2022 |
arXiv |
Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics |
— |
— |
| 2021 |
ICCV |
Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers |
— |
STTR stereo matching |
| 2021 |
BMVC |
Transformer-based Monocular Depth Estimation with Attention Supervision |
 |
— |
| 2021 |
ICCV |
Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction |
 |
— |
| 2021 |
ICCV |
Vision transformers for dense prediction |
 |
DPT |