Lab Notes: Reproducing and Testing VGGT-SLAM 2.0

2026-02-28

Introduction

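The code used for this reproduction, with code comments where available: My github repository (PS: the code is configured for the NVIDIA Jetson Thor, and the bugs encountered along the way have already been fixed)
PDF
Original GitHub: Link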

Reproduction

Installation and Setup

git clone git@github.com:R-C-Group/VGGT-SLAM.git
conda create -n vggt-slam python=3.11
conda activate vggt-slam

Download the third-party packages (this step fetches all of them: Perception Encoder, SAM 3, and VGGT):

chmod +x setup.sh
./setup.sh
# Note: before rerunning ./setup.sh after fixing a bug, either delete everything it downloaded or comment out the steps that have already completed

If pip install -e ./salad fails with a timeout such as WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)")': /simple/setuptools/, install it through a mirror instead (PS: the code in this repository already includes this change):

cd third_party
pip install -e ./salad -i https://pypi.tuna.tsinghua.edu.cn/simple
cd ..

Running pip install -e ./perception_models -i https://pypi.tuna.tsinghua.edu.cn/simple fails with:

ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11; 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10; 1.7.2 Requires-Python >=3.7,<3.11; 1.7.3 Requires-Python >=3.7,<3.11; 1.8.0 Requires-Python >=3.8,<3.11; 1.8.0rc1 Requires-Python >=3.8,<3.11; 1.8.0rc2 Requires-Python >=3.8,<3.11; 1.8.0rc3 Requires-Python >=3.8,<3.11; 1.8.0rc4 Requires-Python >=3.8,<3.11; 1.8.1 Requires-Python >=3.8,<3.11
ERROR: Could not find a version that satisfies the requirement decord==0.6.0 (from perception-models) (from versions: none)
ERROR: No matching distribution found for decord==0.6.0

Fix: open third_party/perception_models/requirements.txt and change decord==0.6.0 to decord2. (PS: already fixed in this repository.)

Testing and Verification

The repository root provides the test data office_loop.zip:

unzip office_loop.zip
conda activate vggt-slam
python3 main.py --image_folder office_loop --max_loops 1 --vis_map

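The first run needs the DINO weights from https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth . Download the file and upload it to /home/kwanwaipang/.cache/torch/hub/checkpoints/ on the server. When the command is run again, the program detects the local file and skips the download.
In addition, download dino_salad.ckpt from https://github.com/serizba/salad/releases into /home/kwanwaipang/.cache/torch/hub/checkpoints
Download the model from https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt into the same location (note that the VGGT model is large, several GB).
Run ls -l /home/kwanwaipang/.cache/torch/hub/checkpoints/ to confirm that the models are in place.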
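The same check can be scripted so that it fails loudly before a long run. The sketch below is only a convenience, not part of the repository: the helper name missing_checkpoints is made up for illustration, and the file names are assumed to match the last component of the download URLs above.

```python
from pathlib import Path

# File names assumed from the download URLs above; adjust if your copies differ.
EXPECTED = ("dinov2_vitb14_pretrain.pth", "dino_salad.ckpt", "model.pt")


def missing_checkpoints(directory, names=EXPECTED):
    """Return the names that are absent or zero-sized in `directory` (hypothetical helper)."""
    root = Path(directory).expanduser()
    return [
        name for name in names
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_checkpoints("~/.cache/torch/hub/checkpoints")
    print("all checkpoints present" if not missing else f"missing: {missing}")
```

A zero-sized file is treated as missing because an interrupted download can leave an empty placeholder behind.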
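Check whether the GPU is usable: python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('GPUs detected:', torch.cuda.device_count())"
If this reports that the GPU is unavailable, the most likely cause is that the CUDA version pulled in during installation is incompatible with the Thor GPU driver. On Thor, install: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130 (note: uninstall first with pip uninstall torch torchvision -y)
A CUDA error came up next: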

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`

This does not necessarily mean the GPU is out of memory. Check whether a minimal linear layer runs:

python3 -c "
import torch
print('CUDA available:', torch.cuda.is_available())
x = torch.randn(2, 3, 224, 224).cuda()
linear = torch.nn.Linear(224, 128).cuda()
with torch.no_grad():
    out = linear(x)
print('Basic CUDA linear: OK', out.shape)
"

The cause may be an incompatible nvidia-cublas 13.1.0.3, or leftover dependencies from an earlier installation that were not fully uninstalled.

# Check the NVIDIA library versions in the vggt-slam environment
conda activate vggt-slam
pip list | grep -E "nvidia-cublas|nvidia-cuda-runtime|nvidia-cudnn|triton"

I decided to remove the environment with conda remove --name vggt-slam --all and set it up again:

conda create -n vggt-slam python=3.11.9 
conda activate vggt-slam
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130 
# pip install uv -i https://mirrors.aliyun.com/pypi/simple
# uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

# ./setup.sh
pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple # this code has been modified, so the next setup should not run into this bug again
cd third_party
pip install -e ./salad -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install -e ./vggt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install -e ./perception_models -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install -e ./sam3 -i https://pypi.tuna.tsinghua.edu.cn/simple
cd ..
pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple

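Update: the cause appears to be a library on one of the LD_LIBRARY_PATH entries conflicting with CUDA 13.0. Inspect the variable first with echo $LD_LIBRARY_PATH; after unset LD_LIBRARY_PATH everything ran normally. Comment out the paths that should not be there in ~/.bashrc, then run source ~/.bashrc.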
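To find which LD_LIBRARY_PATH entry contributes the conflicting libraries, one option is to list the CUDA-related shared objects in each entry. This is a minimal sketch under assumptions: the function name cuda_libs_on_path and the prefix list are illustrative choices, not something taken from the repository.

```python
import os
from pathlib import Path

# Library name prefixes worth checking; this list is an assumption, extend as needed.
CUDA_PREFIXES = ("libcublas", "libcudart", "libcudnn")


def cuda_libs_on_path(ld_library_path):
    """Map each LD_LIBRARY_PATH entry to the CUDA-related shared libraries it contains."""
    found = {}
    for entry in filter(None, ld_library_path.split(os.pathsep)):
        directory = Path(entry)
        if not directory.is_dir():
            continue  # stale entries are common in old shell configs
        libs = sorted(
            p.name for p in directory.iterdir()
            if p.name.startswith(CUDA_PREFIXES) and ".so" in p.name
        )
        if libs:
            found[entry] = libs
    return found


if __name__ == "__main__":
    hits = cuda_libs_on_path(os.environ.get("LD_LIBRARY_PATH", ""))
    for entry, libs in hits.items():
        print(entry)
        for name in libs:
            print("   ", name)
```

Any entry that shows a libcublas or libcudart from a different CUDA major version than the installed PyTorch wheel is a candidate for removal.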
Final run script:

conda activate vggt-slam
unset LD_LIBRARY_PATH
python3 main.py --image_folder office_loop --max_loops 1 --vis_map

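Open http://localhost:8080/ in a browser to view the result:

For 3D open-vocabulary object detection, add the --run_os flag. The program then prompts for a text query and draws the detected 3D bounding box on the map in Viser.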

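# https://huggingface.co/facebook/sam3 requires logging in and requesting access to the model first
# Hugging Face is needed here, so a mirror reachable from mainland China is used.
export HF_ENDPOINT=https://hf-mirror.com
export HF_TOKEN="hf_xxxxxx" # paste your Access Token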
python3 main.py --image_folder office_loop --max_loops 1 --vis_map --run_os

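How to get an access token:

On the Hugging Face website, click your avatar in the top-right corner and choose Settings.
Click Access Tokens in the left menu.
Click "Create new token", choose the Read type, and after it is generated copy the long string that starts with hf_.

Alternatively, download the model for offline use and modify /VGGT-SLAM/third_party/sam3/sam3/model_builder.py: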

-     checkpoint_path = hf_hub_download(repo_id=SAM3_MODEL_ID, filename=SAM3_CKPT_NAME)
+    checkpoint_path = "/home/kwanwaipang/VGGT-SLAM/third_party/sam3_model/weight/sam3.pt" # replace with your actual file path

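Download the PE-Core-L14-336.pt weights (presumably the CLIP weights) from https://huggingface.co/facebook/PE-Core-L14-336 and place them in ~/.cache/torch/hub/. Do not place the file in /home/kwanwaipang/.cache/huggingface/hub/models--facebook--PE-Core-L14-336: the Hugging Face cache is not an ordinary folder. Underneath, it relies on symlinks and hashes. A valid cache directory contains blobs (the folder holding the actual data) plus refs and snapshots (folders holding hash-named symlinks). When the code calls hf_hub_download, it looks for a path that carries a specific hash. A .pt file dropped in by hand is not recognized, so the cache is judged to be missing and a network request is forced. The weights are therefore loaded from the local file directly: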

-        clip_model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True)  # Downloads from HF
+        clip_model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=False)
+        clip_model.load_ckpt("/home/kwanwaipang/.cache/torch/hub/PE-Core-L14-336.pt")
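The cache layout described above can be inspected with a few lines of standard-library Python. This is a heuristic sketch only: both helper names are made up for illustration, and the check does not reproduce what hf_hub_download actually does internally.

```python
from pathlib import Path


def looks_like_hf_cache_repo(repo_dir):
    """Heuristic: blobs/, refs/ and at least one snapshots/<revision>/ directory exist."""
    root = Path(repo_dir).expanduser()
    if not all((root / sub).is_dir() for sub in ("blobs", "refs", "snapshots")):
        return False
    return any(p.is_dir() for p in (root / "snapshots").iterdir())


def cached_files(repo_dir):
    """List files reachable under snapshots/, relative to their revision directory."""
    snapshots = Path(repo_dir).expanduser() / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted({
        str(p.relative_to(rev))
        for rev in snapshots.iterdir() if rev.is_dir()
        for p in rev.rglob("*") if p.is_file() or p.is_symlink()
    })
```

A .pt file copied straight into the models--facebook--PE-Core-L14-336 folder does not show up in cached_files, which matches the behaviour observed above.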

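After launch, the map is built first. Once mapping finishes, the program prompts for the object to detect.

Prompt: laptop

3D reconstruction result and open-vocabulary object detection

Next is the mapping result on a video recorded with a phone. Note that portrait phone videos also work, but landscape videos are recommended to avoid the images being cropped.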

# Create the output directory
mkdir <desired_location>/img_folder
ffmpeg -i /path/to/video.MOV -vf "fps=10" <desired_location>/img_folder/frame_%04d.jpg

python3 main.py --image_folder office_loop_test --max_loops 1 --vis_map --run_os

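No output in the terminal

Several of the imported modules are heavyweight on Jetson, in particular the gtsam, open3d, torch/torchvision, salad chain. An occasional hang, a failed initialization, or an unclean GPU/driver state can each make the script look as if it never started. Remedies: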

# Check for leftover Python processes
ps -ef | grep python

# Check whether port 8080 is still held by an old viser instance
ss -ltnp | grep 8080

# Kill the stale processes
pkill -f "python3 main.py"
pkill -f viser
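The port check can also be done from Python before launching, which is handy inside a wrapper script. This is a small sketch, assuming the viewer listens on the loopback interface as in the URL used earlier; port_in_use is an illustrative helper name.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8080 in use:", port_in_use(8080))
```

If this prints True before main.py has been started, an old viewer process is probably still running.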

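Code Walkthrough

See the GitHub repository and the explanatory document: Link
The open-vocabulary semantic detection pipeline is shown in the diagram below: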

flowchart LR
  subgraph Preprocessing["Mapping stage (building the feature database)"]
    CLIP_IMG["CLIP Image Encoder"]
    KF["SLAM keyframes, selected by parallax"]
    DB[("Image token database")]
  end
  subgraph Retrieval["Query stage: semantic retrieval"]
    CLIP_TXT["CLIP Text Encoder"]
    Query["Text instruction / prompt"]
    Vec["Text semantic feature vector"]
    Sim{"Cosine similarity over the token database"}
    BestKF["Pick the highest-scoring past keyframe"]
  end
  subgraph Segmentation["Recovering the object's 3D position"]
    SAM3["SAM3 model"]
    Mask2D["2D image mask"]
    Mapping{"3D projection mapping"}
    SLAM_DATA["Pointmap & pose from VGGT"]
    PC["3D points corresponding to mask pixels"]
    BBox3D["Generate 3D bounding box"]
  end
  KF --> CLIP_IMG
  CLIP_IMG --> DB
  Query --> CLIP_TXT
  CLIP_TXT --> Vec
  Vec --> Sim
  DB -.-> Sim
  Sim --> BestKF
  BestKF --> SAM3
  Query -.-> SAM3
  SAM3 --> Mask2D
  SLAM_DATA --> Mapping
  Mask2D --> Mapping
  Mapping --> PC
  PC --> BBox3D
  KF:::input
  CLIP_IMG:::process
  DB:::storage
  Query:::input
  CLIP_TXT:::process
  Sim:::process
  SAM3:::process
  Mapping:::process
  BBox3D:::output
  classDef storage fill:#f9f,stroke:#333,stroke-width:2px
  classDef process fill:#bbf,stroke:#333,stroke-width:2px
  classDef input fill:#dfd,stroke:#333,stroke-width:2px
  classDef output fill:#fdd,stroke:#333,stroke-width:2px
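The retrieval stage in the diagram boils down to an argmax over cosine similarities. The sketch below uses toy 3-dimensional vectors in place of real CLIP embeddings; the function name best_frame_by_cosine is illustrative and is not taken from the repository.

```python
import numpy as np


def best_frame_by_cosine(text_vec, image_vecs):
    """Return (index, score) of the stored image embedding most similar to text_vec."""
    text = np.asarray(text_vec, dtype=np.float64)
    images = np.asarray(image_vecs, dtype=np.float64)
    # Normalise both sides so the dot product equals the cosine similarity.
    text = text / np.linalg.norm(text)
    images = images / np.linalg.norm(images, axis=1, keepdims=True)
    scores = images @ text
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])


# Toy stand-ins: one row per stored keyframe embedding.
keyframe_db = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.1], [0.5, 0.5, 0.5]]
query = [0.0, 1.0, 0.0]
idx, score = best_frame_by_cosine(query, keyframe_db)
print("best keyframe:", idx, "score:", round(score, 3))
```

Because both sides are normalised, the magnitude of an embedding does not affect the ranking, only its direction.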

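Geometry: keyframes are selected by optical-flow parallax and then stacked by window into submaps. VGGT produces dense depth and poses, which are decoded, back-projected into point clouds, and written into the map. Poses are optimized in GTSAM with variables on the SL(4) projective pose manifold; when a matrix is inserted into the graph, the homogeneous matrix is not re-normalized onto that manifold (the corresponding step is disabled in the implementation). Inter-submap scale is handled alongside the projective pose: an overall scale factor is first estimated from 3D point pairs in the overlap region and then optimized together with the pose edges, so it would be inaccurate to say that scale is determined by the projective pose alone.

Loop closure: SALAD with DINOv2 extracts global descriptors, and the nearest neighbor among past frames is found by Euclidean distance between descriptors (smaller distance means more similar; this is not cosine similarity). Once a candidate exists, a two-image VGGT pass is run on the query frame and the loop frame, the degree of match between the two frames is computed from a global-attention layer deep in the network, and the loop geometry is accepted only if that score passes a fixed threshold.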
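The candidate search by descriptor distance can be sketched as follows. This is an illustration under assumptions: the function name loop_candidate and the max_dist cut-off are hypothetical, and the attention-based verification that follows retrieval is not modelled here.

```python
import numpy as np


def loop_candidate(query_desc, past_descs, max_dist):
    """Return (index, distance) of the nearest past descriptor, or None if too far.

    max_dist is a hypothetical cut-off used only for this illustration.
    """
    past = np.asarray(past_descs, dtype=np.float64)
    if past.size == 0:
        return None  # nothing to compare against yet
    query = np.asarray(query_desc, dtype=np.float64)
    dists = np.linalg.norm(past - query, axis=1)  # Euclidean (L2) distance per frame
    idx = int(np.argmin(dists))
    if dists[idx] > max_dist:
        return None
    return idx, float(dists[idx])


past = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
print(loop_candidate([1.0, 0.5], past, max_dist=0.8))
```

Note the direction of the comparison: with distances the best match is the minimum, whereas the cosine-based text retrieval picks the maximum.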

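Open vocabulary: with semantic mode enabled, CLIP (Perception Encoder) image embeddings are extracted and stored only for the keyframes in each submap that actually take part in mapping. The user's sentence is encoded into a text embedding, and the most relevant frame is retrieved from the store by cosine similarity. SAM3 produces a 2D mask and box from the text; the 3D bounding box is currently fitted only from the points that the mask selects in the dense point cloud, and the 2D box is not yet used for lifting to 3D.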
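Fitting a box to the points selected by a mask can be illustrated with a PCA-based oriented bounding box. PCA is an assumption made here for the sketch; how the repository fits its box may differ, and the names masked_points and pca_obb are illustrative.

```python
import numpy as np


def masked_points(pointmap, mask):
    """Select the 3D points of an (H, W, 3) pointmap where the (H, W) mask is True."""
    return np.asarray(pointmap)[np.asarray(mask, dtype=bool)]


def pca_obb(points):
    """Fit an oriented box to an (N, 3) point set via PCA (needs at least 2 points).

    Returns (center, axes, extents): axes holds the box directions as columns,
    extents holds the full side length along each of those directions.
    """
    pts = np.asarray(points, dtype=np.float64)
    mean = pts.mean(axis=0)
    centered = pts - mean
    # Eigenvectors of the covariance matrix give the principal directions.
    _, axes = np.linalg.eigh(np.cov(centered.T))
    local = centered @ axes  # point coordinates in the box frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    center = mean + axes @ ((lo + hi) / 2.0)
    return center, axes, hi - lo
```

In practice mask edges tend to pick up background points, so some outlier filtering before the fit is usually worthwhile; that step is left out of this sketch.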

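flowchart TB
  subgraph GEO["Geometry pipeline"]
    direction TB
    G0[Input image sequence] --> G1[Keyframe selection by optical flow]
    G1 --> G2[Accumulate into submap window with overlap frames for intra-submap constraints]
    G2 --> G3[SALAD+DINOv2 retrieval descriptors for submap frames]
    G3 --> G4[VGGT dense depth and pose encoding]
    G4 --> G5[Decode intrinsics and extrinsics then back-project point cloud and add graph nodes]
    G5 --> G6[Smallest L2 descriptor distance to past frames gives loop candidate]
    G6 -->|no candidate| G10
    G6 -->|candidate found| G7[Two-image VGGT forward compute_similarity]
    G7 --> G8[target_layer global attention KQ → image_match_ratio gating]
    G8 -->|rejected| G10
    G8 -->|accepted| G9[Loop-closure factor added to pose graph]
    G9 --> G10[Overlapping point pairs estimate_scale_pairwise inter-submap scale]
    G10 --> G11[SL4 variables Between Prior pose graph]
    G11 --> G12[GTSAM Levenberg-Marquardt optimization]
  end

  subgraph OVL["Open-vocabulary detection pipeline"]
    direction TB
    O0[Keyframe paths within each submap batch] --> O1[With run_os CLIP PE encode_image stored in Submap]
    O2[User text] --> O3[CLIP encode_text]
    O3 --> O4[Scan stored image embeddings and pick the max cosine frame]
    O4 --> O5[SAM3 text prompt → 2D mask and bbox]
    O5 --> O6[Current implementation takes dense 3D points inside the mask]
    O6 --> O7[compute_obb_from_points gives 3D OBB and visualization]
  end

  G1 -.->|same batch of keyframe paths| O0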
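The inter-submap scale step in the geometry pipeline can be illustrated with a simple robust estimator. This is only a sketch under stated assumptions: both point sets are taken to be pixel-aligned 3D points of the same overlapping frame, each expressed in that frame's camera coordinates, and the median of distance ratios is an illustrative choice rather than a description of what estimate_scale_pairwise does.

```python
import numpy as np


def estimate_scale(points_ref, points_new):
    """Scale s such that s * points_new roughly matches points_ref.

    Uses the median ratio of point distances from the camera origin, which
    tolerates a minority of bad correspondences.
    """
    ref = np.linalg.norm(np.asarray(points_ref, dtype=np.float64), axis=1)
    new = np.linalg.norm(np.asarray(points_new, dtype=np.float64), axis=1)
    valid = new > 1e-9  # skip degenerate points at the origin
    if not np.any(valid):
        raise ValueError("no valid point pairs")
    return float(np.median(ref[valid] / new[valid]))
```

A median is preferred over a mean here because a few wildly wrong depth predictions would otherwise dominate the estimate.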