| 2026 | arXiv | Fast-SAM3D: 3Dfy Anything in Images but Faster |  | Reconstructs 3D object models from input images |
| 2026 | arXiv | Masked Depth Modeling for Spatial Perception |  | website Depth completion for real-world scenes: sensor-failure regions are treated as learnable structural cues rather than mere noise, and masked depth modeling (MDM) exploits the visual context of the RGB image to infer and complete missing depth regions. Architecture: RGB + depth into a standard ViT. Data: synthetic LingBot Depth-S (realistic depth observations with natural defects; 1M samples rendered from 442 indoor scenes) + real LingBot Depth-R (2M scenes) + supplementary open-source datasets, 10M training samples in total |
| 2026 | arXiv | V-DPM: 4D Video Reconstruction with Dynamic Point Maps |  | website Extends dynamic point maps (DPM) from image pairs to multi-frame video, enabling 4D reconstruction. Built on VGGT, with a time-conditioned decoder that splits reconstruction into two steps: VGGT produces viewpoint-invariant but time-varying point maps (all points in one camera coordinate frame), then the time-conditioned decoder produces time-invariant point maps (same coordinate frame and also transported to one timestamp) |
| 2026 | arXiv | MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation |  | Jointly performs scene reconstruction and target segmentation from sparse RGB images and a text description. Dual-branch architecture: a reconstruction branch (3D reconstruction) and a multimodal branch that fuses the reconstruction branch's geometric features with visual image features and a text encoder to attend to the described region, outputting a 2D mask that is projected onto the 3D point cloud |
| 2025 | arXiv | One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control |  | website A single model (the Wan video generation model) jointly outputs an RGB video and a pointmap (XYZ) geometry video |
| 2025 | arXiv | TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction |  | Globally consistent online incremental mapping: fine geometric alignment via global control-point propagation and thin-plate-spline transforms, submap registration via averaging of overlapping camera poses; supports different foundation models (e.g. VGGT, π³, MapAnything) |
| 2025 | arXiv | InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams |  | Rolling-memory mechanism based on key diversity, enabling visual geometry reconstruction over unbounded sequences: each token is scored by the negative cosine similarity between its key and the mean key; a higher score means the token deviates more from the average, carries more distinctive geometric information, and should be kept |
| 2025 | arXiv | DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass |  | Dense point tracking and 3D reconstruction of dynamic scenes from multiple images: extracts deep spatio-temporal features and regresses pixel-wise maps via dense prediction heads; VGGT + Dense Prediction Transformer (DPT) serving as the depth, point, and motion heads |
| 2025 | arXiv | Depth Anything 3: Recovering the Visual Space from Any Views |  | website |
| 2025 | arXiv | HTTM: Head-wise Temporal Token Merging for Faster VGGT | — | — |
| 2025 | arXiv | FastVGGT: Training-Free Acceleration of Visual Geometry Transformer | — | — |
| 2025 | arXiv | VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction |  | website |
| 2025 | arXiv | OmniVGGT: Omni-Modality Driven Visual Geometry Grounded |  | website |
| 2025 | arXiv | Human3R: Everyone Everywhere All at Once |  | website |
| 2025 | arXiv | MapAnything: Universal Feed-Forward Metric 3D Reconstruction |  | website |
| 2025 | arXiv | POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction |  | — |
| 2025 | ICCV | PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction |  | website |
| 2025 | arXiv | Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos | — | — |
| 2025 | ICCV | LONG3R: Long Sequence Streaming 3D Reconstruction |  | website |
| 2025 | arXiv | Dens3R: A Foundation Model for 3D Geometry Prediction | — | — |
| 2025 | arXiv | StreamVGGT: Streaming 4D Visual Geometry Transformer |  | website |
| 2025 | arXiv | Test3R: Learning to Reconstruct 3D at Test Time |  | website |
| 2025 | arXiv | Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory |  | website |
| 2025 | arXiv | 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos | — | website |
| 2025 | arXiv | Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction | — | — |
| 2025 | CVPR | SAB3R: Semantic-Augmented Backbone in 3D Reconstruction | — | website |
| 2025 | arXiv | Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles |  | website |
| 2025 | CVPR | MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models |  | — |
| 2025 | arXiv | Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos |  | — |
| 2025 | arXiv | STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes |  | website |
| 2025 | CVPRW | DELTA: Dense Depth from Events and LiDAR using Transformer’s Attention |  | website |
| 2025 | CVPR | MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds |  | website |
| 2025 | CVPR | MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision |  | website |
| 2025 | arXiv | Regist3R: Incremental Registration with Stereo Foundation Model | — | — |
| 2025 | arXiv | St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World | — | website |
| 2025 | CVPR | AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis |  | website |
| 2025 | arXiv | Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction | — | — |
| 2025 | CVPR | MonSter: Marry Monodepth to Stereo Unleashes Power |  | — |
| 2025 | arXiv | D2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes | — | website |
| 2025 | arXiv | FlowR: Flowing from Sparse to Dense 3D Reconstructions | — | website |
| 2025 | arXiv | Easi3R: Estimating Disentangled Motion from DUSt3R Without Training |  | website 4D DUSt3R test |
| 2025 | arXiv | SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors | — | DUSt3R+Diffusion+3DGS |
| 2025 | ICLR | M3: 3D-Spatial Multimodal Memory |  | website compression & Gaussian Memory Attention |
| 2025 | CVPR | MVSAnywhere: Zero-Shot Multi-View Stereo |  | website |
| 2025 | CVPR | CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis | — | website |
| 2025 | CVPR | Pow3R: empowering unconstrained 3D reconstruction with camera and scene priors | — | website DUSt3R+multi information input |
| 2025 | CVPR | Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding |  | TSP3D |
| 2025 | CVPR | UniK3D: Universal Camera Monocular 3D Estimation |  | website |
| 2025 | CVPR | Sonata: Self-Supervised Learning of Reliable Point Representations |  | website |
| 2024 | CVPR | Point transformer v3: Simpler faster stronger |  | — |
| 2022 | NIPS | Point transformer v2: Grouped vector attention and partition-based pooling |  | — |
| 2021 | ICCV | Point transformer | — | unofficial implementation |
| 2025 | arXiv | Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction | — | website Dynamic DUSt3R, DPM |
| 2025 | ICLR | MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion |  | website Test |
| 2025 | CVPR | Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos |  | website |
| 2025 | CVPR | Continuous 3D Perception Model with Persistent State |  | website CUT3R |
| 2025 | CVPR | SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction |  | MASt3R+COLMAP+3DGS |
| 2025 | arXiv | SplatVoxel: History-Aware Novel View Streaming without Temporal Training | — | — |
| 2025 | CVPR | GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding |  | 3DGS+Transformer |
| 2025 | CVPR | DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers |  | website distillation |
| 2025 | arXiv | MUSt3R: Multi-view Network for Stereo 3D Reconstruction |  | multiple views DUSt3R |
| 2025 | CVPR | Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass |  | Website Test |
| 2024 | NIPS | Depth anything v2 |  | website |
| 2024 | CVPR | Depth anything: Unleashing the power of large-scale unlabeled data |  | Website |
| 2024 | CVPR | DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions | — | — |
| 2024 | CVPR | Learning to adapt clip for few-shot monocular depth estimation | — | — |
| 2025 | 3DV | 3d reconstruction with spatial memory |  | website Spann3R |
| 2024 | CVPR | DUSt3R: Geometric 3D Vision Made Easy |  | Website Test |
| 2024 | ECCV | Gs-lrm: Large reconstruction model for 3d gaussian splatting | — | website 3DGS+Transformer |
| 2024 | TIP | BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation |  | — |
| 2024 | TIP | GLPanoDepth: Global-to-Local Panoramic Depth Estimation | — | — |
| 2023 | ICCV | Towards zero-shot scale-aware monocular depth estimation |  | website |
| 2023 | ICCV | Egformer: Equirectangular geometry-biased transformer for 360 depth estimation |  | — |
| 2023 | Machine Intelligence Research | Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation | — | — |
| 2023 | CVPR | Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation | — | — |
| 2023 | CVPR | CompletionFormer: Depth Completion with Convolutions and Vision Transformers |  | website |
| 2023 | ICRA | Lightweight monocular depth estimation via token-sharing transformer | — | — |
| 2023 | AAAI | ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation | — | — |
| 2023 | ICRA | TODE-Trans: Transparent Object Depth Estimation with Transformer |  | — |
| 2023 | AAAI | Deep digging into the generalization of self-supervised monocular depth estimation |  | — |
| 2022 | ECCV | PanoFormer: Panorama Transformer for Indoor 360 Depth Estimation |  | — |
| 2022 | AAAI | Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning | — | — |
| 2022 | arXiv | MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth | — | — |
| 2022 | arXiv | Objcavit: improving monocular depth estimation using natural language models and image-object cross-attention |  | — |
| 2022 | arXiv | Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion |  | — |
| 2022 | arXiv | Sidert: A real-time pure transformer architecture for single image depth estimation | — | — |
| 2022 | ECCV | Hybrid transformer based feature fusion for self-supervised monocular depth estimation | — | — |
| 2022 | ECCV | Spike transformer: Monocular depth estimation for spiking camera |  | — |
| 2022 | 3DV | MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer |  | — |
| 2022 | arXiv | DEST: Depth Estimation with Simplified Transformer | — | — |
| 2022 | arXiv | SparseFormer: Attention-based Depth Completion Network | — | — |
| 2022 | CVPR | GuideFormer: Transformers for Image Guided Depth Completion | — | — |
| 2022 | CVPR | Multi-frame self-supervised depth with transformers | — | — |
| 2022 | arXiv | Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics | — | — |
| 2021 | ICCV | Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers | — | STTR stereo matching |
| 2021 | BMVC | Transformer-based Monocular Depth Estimation with Attention Supervision |  | — |
| 2021 | ICCV | Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction |  | — |
| 2021 | ICCV | Vision transformers for dense prediction |  | DPT |
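
The key-diversity score described for InfiniteVGGT in the table above (the negative cosine similarity between each token's key and the mean key, with the highest-scoring tokens retained in the rolling memory) can be sketched as follows. This is a minimal NumPy sketch with illustrative function and variable names, not the paper's implementation.

```python
import numpy as np

def key_diversity_scores(keys: np.ndarray) -> np.ndarray:
    """Score each token key by its negative cosine similarity to the mean key.

    A higher score means the key deviates more from the average and is
    assumed to carry more distinctive geometric information.
    keys: (num_tokens, dim) array of attention keys.
    """
    mean_key = keys.mean(axis=0)
    keys_unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    mean_unit = mean_key / np.linalg.norm(mean_key)
    return -(keys_unit @ mean_unit)  # negative cosine similarity

def keep_top_k(keys: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most distinctive tokens under the diversity score."""
    scores = key_diversity_scores(keys)
    return np.argsort(scores)[::-1][:k]

# Toy example: three near-duplicate keys and one outlier key.
keys = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [-1.0, 1.0]])
print(keep_top_k(keys, 1))  # the outlier (index 3) is kept
```

Tokens whose keys are nearly parallel to the mean key score low and would be the first candidates for eviction from the bounded memory.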