| 2026 |
ICLR |
TTT3R: 3D Reconstruction as Test-Time Training |
 |
website An improvement on CUT3R; CUT3R is an RNN-style architecture, and this work draws on existing RNN designs to improve it. TTT targets exactly the RNN length-generalization problem: the hidden state "memorizes" the sequence lengths/patterns seen during training; information is continually overwritten or decays over time, so important early information is lost in long sequences; and once sequences grow beyond training lengths, the hidden state drifts into regions barely visited during training, where its behavior is uncontrolled and errors are easily amplified. |
| 2026 |
CVPR |
SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes |
 |
website Substantially reduces inference time while preserving high-quality dense 3D reconstruction. Two main improvements: (1) replacing the iterative IRLS-based Sim(3) alignment with a single-step SVD-based Sim(3) pipeline; (2) performing loop-closure detection directly with VGGT's DINO block tokens, eliminating the redundant computation of an external VPR model. |
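The single-step SVD-based Sim(3) alignment can be sketched with the classic closed-form Umeyama solution (an illustrative assumption; the paper's exact formulation may differ): given matched 3D point sets, one SVD of the cross-covariance yields scale, rotation, and translation without IRLS iterations.

```python
import numpy as np

def sim3_align(src, dst):
    """Closed-form Sim(3) alignment (Umeyama): find scale s, rotation R,
    translation t such that dst ~= s * R @ src + t, from one SVD of the
    cross-covariance -- no IRLS iterations needed."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)          # source-point variance
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Because the solution is one SVD of a 3x3 matrix, its cost is negligible compared with iterative reweighted least squares over the correspondences.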
| 2026 |
arXiv |
Geometric Context Transformer for Streaming 3D Reconstruction |
 |
website LingBot-Map, a purely autoregressive model for streaming 3D reconstruction from a single RGB stream. A DINO backbone extracts image features, which are refined by alternating Frame Attention and GCA layers. In GCA, the current view aggregates information from the Anchor Context, a local Pose-Reference Window, and the Trajectory Memory. Task-specific prediction heads output the camera pose and depth map. |
| 2026 |
CVPR |
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery |
— |
Feed-forward 3D reconstruction from panoramas; introduces a dedicated three-axis SO(3) rotation scheme for panoramic images. The key contribution is likely the dataset. |
| 2026 |
ICLR |
PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception |
 |
website By visualizing features at VGGT's key layers, the authors observe that dynamic regions show weaker activations than static ones, suggesting VGGT tends to ignore dynamic content. They propose a Dynamics-aware Aggregator that integrates spatial and temporal cues via three modules: Frame Attention for intra-frame patch relations, Global Attention for cross-frame patch relations, and Dynamics-Aware Global Attention for disentangling dynamic and static content. The aggregator predicts a dynamics-aware mask that decouples static from dynamic information: motion cues are suppressed for pose estimation, while dynamic representations are strengthened for geometry reconstruction. |
| 2026 |
arXiv |
MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction |
 |
website "Streaming 3D reconstruction" means the model, like the human eye, continuously ingests a stream of video frames and builds a 3D model of its surroundings in real time. To handle unbounded streams, mainstream models compress past observations into a fixed-size recurrent state. Instead of treating the state as a monolith, this paper partitions it into multiple independent memory patches and updates them selectively, mitigating forgetting on long sequences. |
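A minimal sketch of the patch-wise state idea. The selection rule here (cosine similarity between each patch and the incoming feature) is an illustrative assumption, not necessarily the paper's actual write policy:

```python
import numpy as np

def patchwise_update(state, new_info, write_frac=0.25):
    """Split the recurrent state into P independent memory patches and
    overwrite only the patches most relevant to the incoming feature,
    leaving the rest untouched to reduce forgetting on long streams."""
    P = state.shape[0]
    n_write = max(1, int(P * write_frac))
    # relevance of each patch to the new information (cosine similarity)
    sims = (state @ new_info) / (
        np.linalg.norm(state, axis=1) * np.linalg.norm(new_info) + 1e-8
    )
    idx = np.argsort(sims)[-n_write:]   # most relevant patches get written
    out = state.copy()
    out[idx] = new_info
    return out, idx
```

The point is the contrast with a monolithic RNN state, where every update touches (and thus slowly erases) the entire memory.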
| 2025 |
arXiv |
4D-VGGT: A General Foundation Model with SpatioTemporal Awareness for Dynamic Scene Geometry Estimation |
— |
Multi-level representation: spatially, cross-view global fusion enforces globally consistent multi-view geometry at each time step; temporally, cross-time local fusion captures fine-grained local motion between consecutive frames. On top of the resulting spatio-temporal features, multiple prediction heads estimate depth, scene flow, and other quantities. |
| 2026 |
arXiv |
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer |
 |
website Self-Selective Caching (SSC) and Dynamic Anchor Protection (DAP); handles sequences of arbitrary length under a fixed memory budget. |
| 2026 |
arXiv |
VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale |
— |
website Offline large-scale 3D reconstruction: reconstructs 2K images in 48.5 s, 33x faster than VGGT. It compresses VGGT's variable-length scene representation (the KV space) into fixed-size MLP weights, reducing complexity from quadratic to linear O(n), and replaces VGGT's global softmax-attention modules with linear modules based on test-time training (TTT). |
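The idea of replacing softmax attention with a TTT-style linear module, where the sequence "state" is a small weight matrix updated by gradient steps as tokens stream in, can be sketched as follows (a toy reconstruction objective; the paper's actual loss and parameterization may differ):

```python
import numpy as np

def ttt_linear(tokens, lr=0.1):
    """Toy TTT-style linear layer: the sequence 'state' is a fixed-size
    weight matrix W, updated by one gradient step per incoming token on a
    self-supervised reconstruction loss 0.5 * ||W x - x||^2. Cost is O(n)
    in sequence length, unlike quadratic softmax attention."""
    d = tokens.shape[1]
    W = np.zeros((d, d))
    outs = []
    for x in tokens:
        pred = W @ x
        W -= lr * np.outer(pred - x, x)   # gradient of the loss w.r.t. W
        outs.append(W @ x)
    return np.stack(outs), W
```

Since W has a fixed size regardless of how many tokens have been seen, the scene representation stays constant-memory, which is exactly what makes the KV-space compression linear rather than quadratic.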
| 2026 |
arXiv |
Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow |
— |
website For each image, Flow4R predicts point positions (effectively a pointmap), scene flow, pose weights, and confidence. Scene flow captures how points move relative to the camera; in short, a single model simultaneously understands the scene, tracks objects, estimates camera motion, and performs reconstruction, with scene flow, rather than traditional poses or depth, as the unifying representation. |
| 2026 |
CVPR |
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging |
 |
website Proposes a geometry-aware cached token merging strategy to reduce redundancy; combined with fine-tuning and FP8 quantization, it enables vanilla VGGT to process 1000 images with a 10x inference speedup. |
| 2026 |
arXiv |
Fast-SAM3D: 3Dfy Anything in Images but Faster |
 |
Reconstructs 3D object models from input images. |
| 2026 |
arXiv |
Masked Depth Modeling for Spatial Perception |
 |
website A depth-completion model for real-world scenes: sensor-failure regions are treated as learnable structural cues rather than mere noise, and a Masked Depth Modeling (MDM) scheme exploits the visual context of the RGB image to infer and complete missing depth regions. Architecture: RGB + depth fed into a standard ViT. Training data: the synthetic LingBot Depth-S set (realistic depth observations with natural defects; 1M samples rendered from 442 indoor scenes), the real LingBot Depth-R set (2M scenes), and supplementary open-source datasets, for 10M training samples in total. |
| 2026 |
arXiv |
V-DPM: 4D Video Reconstruction with Dynamic Point Maps |
 |
website Extends Dynamic Point Maps (DPM) from image pairs to multi-frame video, enabling 4D reconstruction. Built on VGGT, a time-conditioned decoder splits reconstruction into two steps: VGGT first produces viewpoint-invariant but time-varying point maps (all points in a single camera coordinate frame), then the time-conditioned decoder produces time-invariant point maps (points not only share one coordinate frame but are also transported to a common timestamp). |
| 2026 |
arXiv |
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation |
 |
Jointly reconstructs the scene and segments the referred target from sparse RGB images and a text description. Dual-branch architecture: a reconstruction branch (3D reconstruction) and a multimodal branch (which takes geometric features from the reconstruction branch, fuses them with visual image features and the text encoder's output to attend to the described region, and outputs a 2D mask that is projected onto the 3D point cloud). |
| 2025 |
arXiv |
One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control |
 |
website A single model (the Wan Video generation model) jointly outputs an RGB video and a pointmap (XYZ) geometry video. |
| 2025 |
arXiv |
TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction |
 |
Globally consistent online incremental mapping: fine geometric alignment via global control-point propagation and thin-plate-spline transforms, submap registration via averaging of overlapping camera poses; supports different foundation models (e.g., VGGT, π³, MapAnything). |
| 2025 |
arXiv |
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams |
 |
A rolling-memory mechanism based on key diversity, yielding a visual-geometry reconstruction framework for endless streams. Each key is scored by its negative cosine similarity to the mean key: the higher the score, the more the token deviates from the average and the more distinctive the geometric information it carries, so it is retained. |
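The key-diversity score described above can be sketched directly (the shapes and the top-k retention policy are assumptions for illustration):

```python
import numpy as np

def diversity_prune(keys, tokens, budget):
    """Score each cached key by the NEGATIVE cosine similarity to the mean
    key: a high score means the token deviates from the average and carries
    distinctive geometric information, so it survives in the rolling memory."""
    mean_k = keys.mean(axis=0)
    mean_k = mean_k / np.linalg.norm(mean_k)
    k_norm = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = -(k_norm @ mean_k)             # negative cosine similarity
    keep = np.argsort(scores)[-budget:]     # retain the most diverse tokens
    return keys[keep], tokens[keep], keep
```

Running this once per incoming frame keeps the memory at a fixed token budget while preferentially evicting near-duplicate geometry.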
| 2025 |
arXiv |
DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass |
 |
Dense point tracking and 3D reconstruction of dynamic scenes from multiple images: deep spatio-temporal features are extracted, and dense prediction heads regress pixel-wise maps; VGGT + Dense Prediction Transformer (DPT) heads serve as the Depth Head, Point Head, and Motion Head. |
| 2025 |
arXiv |
Depth Anything 3: Recovering the Visual Space from Any Views |
 |
website |
| 2025 |
arXiv |
HTTM: Head-wise Temporal Token Merging for Faster VGGT |
— |
— |
| 2025 |
arXiv |
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer |
— |
— |
| 2025 |
arXiv |
VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction |
 |
website |
| 2025 |
arXiv |
OmniVGGT: Omni-Modality Driven Visual Geometry Grounded |
 |
website |
| 2025 |
arXiv |
Human3R: Everyone Everywhere All at Once |
 |
website |
| 2025 |
arXiv |
MapAnything: Universal Feed-Forward Metric 3D Reconstruction |
 |
website |
| 2025 |
arXiv |
POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction |
 |
— |
| 2025 |
ICCV |
PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction |
 |
website |
| 2025 |
arXiv |
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos |
— |
— |
| 2025 |
ICCV |
LONG3R: Long Sequence Streaming 3D Reconstruction |
 |
website |
| 2025 |
arXiv |
Dens3R: A Foundation Model for 3D Geometry Prediction |
— |
— |
| 2025 |
arXiv |
StreamVGGT: Streaming 4D Visual Geometry Transformer |
 |
website |
| 2025 |
arXiv |
Test3R: Learning to Reconstruct 3D at Test Time |
 |
website |
| 2025 |
arXiv |
Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory |
 |
website |
| 2025 |
arXiv |
4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos |
— |
website |
| 2025 |
arXiv |
Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction |
— |
— |
| 2025 |
CVPR |
SAB3R: Semantic-Augmented Backbone in 3D Reconstruction |
— |
website |
| 2025 |
arXiv |
Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles |
 |
website |
| 2025 |
CVPR |
MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models |
 |
— |
| 2025 |
arXiv |
Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos |
 |
— |
| 2025 |
arXiv |
STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes |
 |
website |
| 2025 |
CVPRW |
DELTA: Dense Depth from Events and LiDAR using Transformer’s Attention |
 |
website |
| 2025 |
CVPR |
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds |
 |
website |
| 2025 |
CVPR |
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision |
 |
website |
| 2025 |
arXiv |
Regist3R: Incremental Registration with Stereo Foundation Model |
— |
— |
| 2025 |
arXiv |
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World |
— |
website |
| 2025 |
CVPR |
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis |
 |
website |
| 2025 |
arXiv |
Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction |
— |
— |
| 2025 |
CVPR |
MonSter: Marry Monodepth to Stereo Unleashes Power |
 |
— |
| 2025 |
arXiv |
D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes |
— |
website |
| 2025 |
arXiv |
FlowR: Flowing from Sparse to Dense 3D Reconstructions |
— |
website |
| 2025 |
arXiv |
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training |
 |
website 4D DUSt3R test |
| 2025 |
arXiv |
SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors |
— |
DUSt3R+Diffusion+3DGS |
| 2025 |
ICLR |
M3: 3D-Spatial Multimodal Memory |
 |
website compression & Gaussian Memory Attention |
| 2025 |
CVPR |
MVSAnywhere: Zero-Shot Multi-View Stereo |
 |
website |
| 2025 |
CVPR |
CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis |
— |
website |
| 2025 |
CVPR |
Pow3R: empowering unconstrained 3D reconstruction with camera and scene priors |
— |
website DUSt3R + additional camera/scene prior inputs |
| 2025 |
CVPR |
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding |
 |
TSP3D |
| 2025 |
CVPR |
UniK3D: Universal Camera Monocular 3D Estimation |
 |
website |
| 2025 |
CVPR |
Sonata: Self-Supervised Learning of Reliable Point Representations |
 |
website |
| 2024 |
CVPR |
Point Transformer V3: Simpler, Faster, Stronger |
 |
— |
| 2022 |
NeurIPS |
Point Transformer V2: Grouped Vector Attention and Partition-based Pooling |
 |
— |
| 2021 |
ICCV |
Point Transformer |
— |
unofficial implementation |
| 2025 |
arXiv |
Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction |
— |
website Dynamic DUSt3R, DPM |
| 2025 |
ICLR |
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion |
 |
website Test |
| 2025 |
CVPR |
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos |
 |
website |
| 2025 |
CVPR |
Continuous 3D Perception Model with Persistent State |
 |
website CUT3R |
| 2025 |
CVPR |
SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction |
 |
MASt3R+COLMAP+3DGS |
| 2025 |
arXiv |
SplatVoxel: History-Aware Novel View Streaming without Temporal Training |
— |
— |
| 2025 |
CVPR |
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding |
 |
3DGS+Transformer |
| 2025 |
CVPR |
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers |
 |
website distillation |
| 2025 |
arXiv |
MUSt3R: Multi-view Network for Stereo 3D Reconstruction |
 |
multi-view extension of DUSt3R |
| 2025 |
CVPR |
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass |
 |
website Test |
| 2024 |
NeurIPS |
Depth Anything V2 |
 |
website |
| 2024 |
CVPR |
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data |
 |
website |
| 2024 |
CVPR |
DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions |
— |
— |
| 2024 |
CVPR |
Learning to adapt clip for few-shot monocular depth estimation |
— |
— |
| 2025 |
3DV |
3D Reconstruction with Spatial Memory |
 |
website Spann3R |
| 2024 |
CVPR |
DUSt3R: Geometric 3D Vision Made Easy |
 |
website Test |
| 2024 |
ECCV |
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting |
— |
website 3DGS+Transformer |
| 2024 |
TIP |
BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation |
 |
— |
| 2024 |
TIP |
GLPanoDepth: Global-to-Local Panoramic Depth Estimation |
— |
— |
| 2023 |
ICCV |
Towards zero-shot scale-aware monocular depth estimation |
 |
website |
| 2023 |
ICCV |
Egformer: Equirectangular geometry-biased transformer for 360 depth estimation |
 |
— |
| 2023 |
Machine Intelligence Research |
Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation |
— |
— |
| 2023 |
CVPR |
Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation |
— |
— |
| 2023 |
CVPR |
CompletionFormer: Depth Completion with Convolutions and Vision Transformers |
 |
website |
| 2023 |
ICRA |
Lightweight monocular depth estimation via token-sharing transformer |
— |
— |
| 2023 |
AAAI |
ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation |
— |
— |
| 2023 |
ICRA |
TODE-Trans: Transparent Object Depth Estimation with Transformer |
 |
— |
| 2023 |
AAAI |
Deep digging into the generalization of self-supervised monocular depth estimation |
 |
— |
| 2022 |
ECCV |
PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation |
 |
— |
| 2022 |
AAAI |
Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning |
— |
— |
| 2022 |
arXiv |
MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth |
— |
— |
| 2022 |
arXiv |
ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models and Image-Object Cross-Attention |
 |
— |
| 2022 |
arXiv |
Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion |
 |
— |
| 2022 |
arXiv |
SideRT: A Real-Time Pure Transformer Architecture for Single Image Depth Estimation |
— |
— |
| 2022 |
ECCV |
Hybrid transformer based feature fusion for self-supervised monocular depth estimation |
— |
— |
| 2022 |
ECCV |
Spike transformer: Monocular depth estimation for spiking camera |
 |
— |
| 2022 |
3DV |
MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer |
 |
— |
| 2022 |
arXiv |
DEST: Depth Estimation with Simplified Transformer |
— |
— |
| 2022 |
arXiv |
SparseFormer: Attention-based Depth Completion Network |
— |
— |
| 2022 |
CVPR |
GuideFormer: Transformers for Image Guided Depth Completion |
— |
— |
| 2022 |
CVPR |
Multi-frame self-supervised depth with transformers |
— |
— |
| 2022 |
arXiv |
Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics |
— |
— |
| 2021 |
ICCV |
Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers |
— |
STTR stereo matching |
| 2021 |
BMVC |
Transformer-based Monocular Depth Estimation with Attention Supervision |
 |
— |
| 2021 |
ICCV |
Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction |
 |
— |
| 2021 |
ICCV |
Vision transformers for dense prediction |
 |
DPT |