
RGB+IMU video — metric scene graphs without RGB-D

Depth-Free Metric Scene Graphs on Consumer Hardware

~14 min read April 2026

Dense 3D surfaces, fused detections, and queryable spatial relations from RGB+IMU alone. No depth camera, no datacenter GPU.

Overview

This project presents a proof-of-concept system that extracts metric-scale semantic scene graphs from RGB+IMU video on consumer hardware. Unlike existing approaches (ConceptGraphs, HOV-SG) that require RGB-D sensors and high-end GPUs, this system achieves spatial relationship understanding from monocular RGB with inertial sensor data on an 8GB laptop GPU.

The system combines 3D Gaussian Splatting (SuGaR) for dense surface reconstruction, open-vocabulary detection (YOLO World + SAM 2) for semantic understanding, and ARKit IMU alignment for metric scale recovery. Replacing depth sensors with monocular RGB and inertial data is what brings scene graph extraction within reach of consumer hardware, at the cost of dimensional precision.

⚠️ Evaluation Scope: This work presents single-scene validation on a representative bedroom environment. All reported metrics are specific to this scene and may not generalize. While single-scene evaluation limits generalization claims, these results establish feasibility and identify key accessibility-precision tradeoffs for future multi-scene investigation.

3D Segmented Reconstruction
3D Segmented Surface

Methodology

The system architecture is a modular, feed-forward pipeline that transforms raw, unstructured RGB video into a structured 3D scene representation. As the flowchart alongside this section shows, the process begins by extracting a sparse camera trajectory with COLMAP while, in parallel, performing zero-shot object detection and temporal tracking with YOLO World and SAM 2. A Semantic Loop Closure step then unifies these parallel tracks, lifting 2D detections onto a metric-aligned SuGaR surface and finally yielding the queryable Topological Scene Graph.

Hardware & VRAM Engineering: Fitting this pipeline onto a consumer GPU required decoupling the geometric and semantic processing stages, both to stay within VRAM limits and to prevent cumulative memory fragmentation during training. Runtime patches to SuGaR's CamerasWrapper module add dynamic resolution rescaling and scaled-down hyperparameters, preserving surface fidelity on an 8 GB GPU.
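One simple way to realize that decoupling, sketched here with placeholder stage scripts (these names are illustrative, not the project's actual entry points), is to run each stage in its own OS process so that all GPU memory, including any allocator fragmentation, is released when the process exits:

```python
import subprocess

# Hypothetical stage commands; the script names and flags are placeholders,
# not the project's real entry points.
STAGES = [
    ["python", "run_colmap_sfm.py", "--frames", "frames/"],
    ["python", "train_sugar.py", "--low_poly"],
    ["python", "run_semantics.py", "--detector", "yolo_world"],
]

def run_stages(stages, runner=subprocess.run):
    """Run each pipeline stage in a separate process. The GPU driver
    reclaims all VRAM at process exit, so fragmentation from one stage
    cannot leak into the next."""
    for cmd in stages:
        result = runner(cmd)
        if getattr(result, "returncode", 0) != 0:
            raise RuntimeError(f"stage failed: {cmd}")
```

The `runner` parameter exists only to make the orchestration testable; in practice the default `subprocess.run` is used.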

Architecture Flowchart

3D Surface Reconstruction

We tested on a 59-second video shot at 30 FPS, down-sampled to 5 FPS (~295 frames). The pipeline feeds these frames through COLMAP Structure-from-Motion (SfM) to estimate initial sparse camera poses.

To reconstruct continuous physical geometry within the 8GB VRAM budget, we aggressively downscaled the input images (`factor=2`) and the training resolution (`factor=4`). After 15 minutes of training, vanilla 3DGS peaked at approximately 1.4 million Gaussians; the SuGaR stage then optimized and pruned this down to a refined surface of 303,738 3D Gaussian splats.

Raw Input Sequence
1. Raw RGB Dataset
COLMAP Point Cloud
2. Sparse COLMAP Cloud
SuGaR 3D Splatting
3. Pruned 303k Dense Splats

Semantic Loop Closure

We use YOLO World and SAM 2 for zero-shot segmentation and temporal mask tracking across the video. Because tracked masks frequently fragment across occlusions (e.g., splitting a bed into bed_1 and bed_2), we apply a Semantic Loop Closure step: after lifting masks into the metric 3D cloud, fragments with the same class label are merged whenever their 3D bounding boxes fall within a tight metric distance threshold.
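A minimal sketch of that merge rule, assuming a hypothetical 10 cm proximity threshold (the project's actual tolerance is not stated):

```python
import numpy as np

def aabb(points):
    """Axis-aligned bounding box of an (N, 3) point array."""
    return points.min(axis=0), points.max(axis=0)

def boxes_close(pts_a, pts_b, tol=0.10):
    """True if the two point clouds' boxes overlap or lie within `tol` metres."""
    (min_a, max_a), (min_b, max_b) = aabb(pts_a), aabb(pts_b)
    gap = np.maximum(min_a - max_b, min_b - max_a)  # per-axis separation
    return bool(np.all(gap <= tol))

def loop_close(objects, tol=0.10):
    """Merge fragments of the same class whose metric boxes nearly touch.
    `objects` maps ids like 'bed_1' to (N, 3) metric point clouds."""
    merged = {}
    for name, pts in objects.items():
        cls = name.rsplit("_", 1)[0]
        for key in list(merged):
            if key.startswith(cls) and boxes_close(merged[key], pts, tol):
                merged[key] = np.vstack([merged[key], pts])  # fuse fragment
                break
        else:
            merged[f"{cls}_{sum(k.startswith(cls) for k in merged) + 1}"] = pts
    return merged
```

On the bed_1/bed_2 example above, the two fragments share a class label and nearly touching boxes, so they collapse into a single node.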

Tracking Masks
SAM 2 Temporal Masks
Segmented 3D
Loop-Closed 3D Fusion

Metric Scale Recovery

Because vanilla COLMAP reconstructions are dimensionless, the pipeline extracts the iOS ARKit IMU camera trajectory recorded on-device alongside the video and rigidly aligns it to the COLMAP trajectory with an optimized Sim(3) transformation.

$$ \mathbf{p}' = s \mathbf{R} \mathbf{p} + \mathbf{t} $$

Here \( \mathbf{p} \) is an unscaled, dimensionless COLMAP coordinate, \( s \) is the isotropic scale factor that recovers metric units, \( \mathbf{R} \) is the \( 3 \times 3 \) rotation aligning the two frames, and \( \mathbf{t} \) is the \( 3 \times 1 \) translation between origins, yielding the metric ARKit-frame coordinate \( \mathbf{p}' \).

Applying this transformation over the entire point cloud scales the reconstruction to metric units. Using COLMAP's model_aligner with robust alignment enabled, the system achieves a trajectory alignment RMSE of 1.83 cm between the reconstructed camera path and ARKit ground truth across 291 frames, confirming accurate metric scale recovery for the camera trajectory.
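The Sim(3) fit has a closed-form least-squares solution (Umeyama, 1991), sketched below as an illustration of the math, not as the COLMAP model_aligner implementation:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src -> dst,
    i.e. dst ~= s * R @ src + t, per Umeyama (1991).
    src, dst: (N, 3) corresponding camera positions (COLMAP vs. ARKit)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)          # cross-covariance of the two clouds
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                  # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Given corresponding trajectory samples, the recovered `(s, R, t)` applies directly as \( \mathbf{p}' = s\mathbf{R}\mathbf{p} + \mathbf{t} \) to the whole point cloud.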

ARKit Alignment
Metric ARKit Trajectory Sim3 Math

Topological Scene Graph

Once the scene is in metric space, the system computes physical Oriented Bounding Boxes (OBBs) for each discrete semantic entity produced by mask lifting and loop closure. Restricting Principal Component Analysis (PCA) to the XZ (floor) plane yields yaw-only boxes, avoiding the skew that full 3D PCA introduces on tall geometry such as doors and wardrobes.
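A minimal sketch of the yaw-only OBB idea, assuming y is the up axis (a convention this sketch adopts, not one stated by the project):

```python
import numpy as np

def floor_plane_obb(points):
    """Yaw-only oriented bounding box: PCA restricted to the XZ (floor)
    plane, so tall objects (doors, wardrobes) cannot tilt the box.
    points: (N, 3) array with y up. Returns (yaw, (W, D, H), center)."""
    xz = points[:, [0, 2]]
    centered = xz - xz.mean(axis=0)
    # 2x2 covariance -> principal floor-plane axes
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    major = eigvecs[:, np.argmax(eigvals)]
    yaw = np.arctan2(major[1], major[0])
    # Rotate by -yaw so the major axis aligns with x, then take extents
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = centered @ np.array([[c, -s], [s, c]]).T
    w, d = rot.max(axis=0) - rot.min(axis=0)
    h = points[:, 1].max() - points[:, 1].min()
    return yaw, (w, d, h), points.mean(axis=0)
```

Because the rotation is confined to the floor plane, height is always measured straight up, which is exactly what keeps wardrobe-sized boxes from tipping.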

We then compute pairwise spatial predicates (e.g., resting_on, directly_above, next_to) based on bounding-box proximity and overlap. The output is a formalized node-edge topology graph that bridges the 3D geometry to natural-language querying.
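A sketch of how such predicates can be assigned from metric boxes; the thresholds here are illustrative guesses, not the project's tuned values:

```python
import numpy as np

def spatial_predicate(a, b):
    """Assign a coarse predicate to an ordered pair of metric AABBs.
    Each box is a dict with 'min'/'max' 3-vectors; y is the up axis.
    Thresholds (5 cm contact, 1 m proximity) are illustrative only."""
    a_min, a_max = np.asarray(a["min"]), np.asarray(a["max"])
    b_min, b_max = np.asarray(b["min"]), np.asarray(b["max"])
    # Do the boxes' footprints overlap on the XZ (floor) plane?
    overlap_xz = bool(np.all(a_min[[0, 2]] <= b_max[[0, 2]]) and
                      np.all(b_min[[0, 2]] <= a_max[[0, 2]]))
    gap_y = a_min[1] - b_max[1]  # vertical gap: bottom of a vs. top of b
    if overlap_xz and 0 <= gap_y < 0.05:
        return "resting_on"       # a sits on b (near-zero vertical gap)
    if overlap_xz and gap_y >= 0.05:
        return "directly_above"   # a floats above b's footprint
    center_dist = np.linalg.norm((a_min + a_max) / 2 - (b_min + b_max) / 2)
    if center_dist < 1.0:
        return "next_to"
    return None
```

Evaluating this over all object pairs yields the edge set of the topology graph.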

Predicate Topology Graph
Predicate Edge Topology Graph

Results

Quantitative Analysis

Summary tiles and tables (including SOTA references where available).

| Metric | This Work | ConceptGraphs |
| --- | --- | --- |
| Camera trajectory | 1.83 cm RMSE vs. ARKit (291 frames) | Not reported |
| Object dimensions | 17.7 cm mean error (depth-sensor-free tradeoff) | Not reported |
| Object detection | 52% recall (13/25 classes) | 71% |
| Scene graph | 156 spatial relationships (9 types) | 88% edge accuracy |

Key Insight: Systems such as ConceptGraphs and HOV-SG do not report dimensional accuracy; like this pipeline, they emphasize relational and navigational semantics over geometric metrology. The object-dimension numbers in the table below contextualize that design choice for depth-sensor-free capture.

✓ SUITABLE FOR
Spatial understanding • Relationship reasoning • Robotics navigation
✗ NOT SUITABLE FOR
Precision metrology • Dimensional measurement • CAD modeling

Geometric Reconstruction Quality

| Metric | COLMAP | SuGaR |
| --- | --- | --- |
| Point density | 18,508 points | 303,738 Gaussian centers |
| Densification factor | 1.0× (baseline) | 16.4× increase |
| Surface continuity | Sparse / edge-centered | Manifold / complete |
| Mask lifting alignment | ~65% pixel-hit rate | ~99% pixel-hit rate |

Semantic Scene Understanding

| Metric | Value |
| --- | --- |
| Target object classes | 25 open-vocabulary categories |
| SAM 2 raw detections | 24 fragmented masks |
| After Semantic Loop Closure | 13 unified objects |
| Fragmentation reduction | 45.8% fewer fragments |
| Segmented 3D points | 131,171 points (43% coverage) |
| Spatial relationships | 156 pairwise predicates (9 types) |

Physical Dimension Accuracy

| Object | Predicted (W × D × H) | Measured (W × D × H) | Mean Error |
| --- | --- | --- | --- |
| Workdesk | 0.92m × 0.54m × 0.43m | 1.22m × 0.70m × 0.70m | 24.4 cm |
| Keyboard | 0.46m × 0.28m × 0.17m | 0.38m × 0.16m × 0.02m | 11.6 cm |
| Dresser | 0.68m × 0.38m × 0.49m | 0.80m × 0.30m × 0.80m | 17.1 cm |
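The headline figure can be reproduced directly from the table's per-object means:

```python
# Per-object mean errors from the physical-dimension table (cm)
per_object_errors = [24.4, 11.6, 17.1]  # workdesk, keyboard, dresser
mae = sum(per_object_errors) / len(per_object_errors)
print(round(mae, 1))  # -> 17.7
```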

Mean Absolute Error: 17.7 cm across 3 objects (9 dimensions), with systematic underprediction from occlusion and segmentation limits. Trajectory alignment stays tight (1.83 cm RMSE); the much larger object-level error reflects lifting masks on RGB+IMU without depth. The densification and loop-closure steps nonetheless keep the geometry stable enough to support the relationship graph.

⚠️ Important distinction: trajectory RMSE ≠ object accuracy

Validation against ground-truth object dimensions revealed that camera trajectory RMSE does not equate to object dimension accuracy. Physical measurements showed a mean object dimension error of 17.7 cm, with systematic underestimation across all measured objects (per-axis breakdown in the physical-dimension validation table).

This bias is consistent with incomplete point clouds from partial occlusion and SAM 2 mask boundary limitations. The system achieves accurate spatial relationships (which objects are near each other, relative positions) but not precise dimensional metrology (exact object sizes).

Many navigation-style and language-grounded uses depend mainly on relationships and coarse relative scale (what is beside what, what reads as larger or farther), not on centimeter-exact extents. That does not make large box errors acceptable for every task: clearances, fit, fabrication, and similar goals still need tighter geometry than this pipeline targets.

Discussion

Sparse vs. surface-aligned geometry

The raw COLMAP point cloud is sparse and concentrated along texture edges, as expected from SIFT-based matching, leaving the central surfaces of objects geometrically empty.

In contrast, SuGaR’s surface regularization pulls Gaussians toward the true scene manifold during the refinement stage. This prevents floating splats common in vanilla 3DGS and provides a continuous substrate for robust 3D semantic mask-lifting.
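To illustrate what mask lifting over that continuous substrate involves, here is a minimal pinhole-projection sketch; the camera convention and function names are assumptions of this sketch, not the project's code:

```python
import numpy as np

def lift_mask(points_world, K, R, t, mask):
    """Project 3D points into one frame and keep those landing on the mask.
    points_world: (N, 3); K: 3x3 intrinsics; R, t: world-to-camera pose;
    mask: (H, W) boolean SAM 2 mask. Returns a boolean hit array."""
    cam = points_world @ R.T + t            # world -> camera frame
    in_front = cam[:, 2] > 0                # discard points behind camera
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]             # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    H, W = mask.shape
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    hit = np.zeros(len(points_world), dtype=bool)
    hit[valid] = mask[v[valid], u[valid]]   # sample the mask at each pixel
    return hit
```

The denser the surface-aligned Gaussian centers, the more of each mask's pixels find a 3D point, which is what the pixel-hit-rate comparison in the results table measures.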

COLMAP Sparse Points
COLMAP: Edge-Based Sparse (18k pts)
SuGaR Semantic Splats
SuGaR: Surface-Regularized Gaussian centers with Semantic Masks (303k pts)

Conclusion

Metric scene graphs from RGB+IMU on consumer hardware are feasible. Camera alignment is tight and relational queries work, but object dimensions drift, an expected cost of lifting masks without depth.

To summarize, this pipeline demonstrates that metric alignment and relationship extraction can run on consumer hardware without a depth sensor, even though object dimensions remain coarse. It achieves strong trajectory alignment with ARKit, outputs useful scene graphs and LLM-grounded queries despite dimensional errors, reduces SAM 2 fragmentation through loop closure, and runs end-to-end within 8GB VRAM. Quantitative results are summarized in the Results section.

Impact and Applications: The graph targets tasks where relative layout matters more than sub-centimeter sizing: spatial understanding, relationship reasoning, and robotics navigation.

Unique Contributions

First Depth-Sensor-Free Scene Graphs

SOTA systems (ConceptGraphs, HOV-SG) require RGB-D sensors. This project achieves scene graph extraction from RGB+IMU (monocular RGB with inertial sensors) on consumer hardware.

Impact: Removes $400-600 depth sensor barrier
Caveat: Single-scene validation

5× Cost Reduction

SOTA: $1,400-2,600 (RGB-D sensor + high-end GPU)
This work: $300 (consumer GPU, phone camera)

Impact: Enables accessible spatial AI
Tradeoff: Coarser than RGB-D / LiDAR metrology (validation tables)

Semantic Loop Closure Algorithm

A novel algorithm reduces SAM 2 temporal fragmentation by 45.8%, merging split detections based on 3D proximity and class identity.

Impact: Addresses foundation model limitations
Result: 24 fragments → 13 objects

Detailed System Comparison

| Capability | ConceptGraphs (ICRA'24) | HOV-SG (RSS'24) | Semantic Gaussians (2024) | This Work |
| --- | --- | --- | --- | --- |
| Core Approach | | | | |
| Input modality | RGB-D | RGB-D | RGB(-D) | RGB+IMU |
| Depth sensor required | ✓ Yes | ✓ Yes | ○ Optional | ✗ No |
| 3D representation | Point clouds | Point clouds | 3D Gaussians | 3D Gaussians (SuGaR) |
| Hardware | High-end GPU | High-end GPU | High-end GPU | 8GB consumer |
| Semantic Capabilities | | | | |
| Object detection | 71% accuracy | SOTA | Yes | 52% recall |
| Scene graph | ✓ Yes | ✓ Hierarchical | ✗ No | ✓ Yes (156 relations) |
| Relationship accuracy | 88% edges | SOTA | N/A | Qualitative |
| Geometric Capabilities | | | | |
| Metric scale recovery | ✗ No | ✗ No | ✗ No | ✓ Yes (1.83 cm) |
| Dimensional accuracy | Not reported | Not reported | Not reported | 17.7 cm (validated) |
| Surface representation | Sparse points | Sparse points | Dense splats | Dense (SuGaR) |
| Accessibility | | | | |
| Evaluation scope | Multiple datasets | Multiple buildings | Multiple scenes | Single scene ⚠️ |
| Total hardware cost | $1.4K-2.4K+ | $1.4K-2.4K+ | $1K-2K+ | $300 |
| Deployment | Research/lab | Research/lab | Research/lab | Consumer laptop |
| Processing time | Not reported | Not reported | Not reported | 45 minutes |

This table summarizes modality, semantics, geometry, hardware cost, and evaluation scope.

⚠️ Caveat: Single-scene figures are not interchangeable with SOTA multi-scene averages or variance reports. Treat any cell-wise comparison as directional until matched evaluation exists.

Limitations & Explainability

The three panels below map each limitation to a simple mitigation. They assume the same relational use case emphasized in the Conclusion and in the downstream LLM example; stricter metrology or detection targets would call for RGB-D, multi-view capture, or both.

Dimensional Accuracy

Issue: 17.7cm mean error (vs. 1.83cm trajectory RMSE)

Root Causes:

  • RGB+IMU instead of RGB-D (LiDAR achieves <1cm)
  • SAM 2 boundary imprecision (documented limitation)
  • Single-pass capture leads to occlusion and incomplete point clouds

Mitigation: Capturing the scene from multiple angles would fill occluded surfaces and improve dimensional accuracy.

Detection Recall

Issue: 52% recall (13/25 target classes)

Root Causes:

  • Zero-shot detection (no training)
  • Occlusion during single-pass capture
  • Small objects (mug, bottle) harder to detect

Mitigation: A stronger detector, or one fine-tuned on the target classes, would improve detection recall.

Geometric Coverage

Issue: Partial object reconstruction

Root Causes:

  • Linear trajectory (not orbital)
  • Occlusion planes along unobserved axes
  • 2D mask lifting to incomplete 3D clouds

Mitigation: An orbital, multi-angle capture path would improve geometric coverage.

Future Directions

Having established that depth-sensor-free semantic scene understanding is viable, future work could: (1) improve detection recall toward ConceptGraphs' 71% through tighter SAM 2 and loop-closure integration, (2) develop quantitative relationship benchmarks comparable to their 88% edge accuracy, (3) expand spatial predicate vocabulary (orientation, containment), (4) implement hierarchical room-level organization, and (5) validate on real robotics tasks to empirically demonstrate that 17.7cm errors don't impede downstream applications.

 python3 query.py "relationship between coffee mug, water bottle, desk"

🤖 Sending query...
==================================================
Based on the provided data:

1. Water Bottle and Workdesk:
- Distance: 1.23 meters
- Relationship: directly above

2. Coffee Mug and Water Bottle:
- Distance: 0.22 meters
- Relationship: next to
note: suggests it is on a shelf overhead, side-by-side.

Final Answer:
Both items are directly above the workdesk placed side by side.
==================================================

Downstream example: LLM spatial reasoning

This section illustrates impact and applications: a simple downstream task built on the exported graph.

Unlike VLM pipelines that reason directly over massive textured splats and often hallucinate scale, this approach compiles the environment into a structured topological JSON graph with explicit spatial predicates.

The intermediate representation reduces a dense point sample to surface-aligned primitives with metric grounding. When that graph is passed to Ollama, the model retrieves relationships and proximity (e.g. "next to," "directly above," distance estimates), illustrating the intended use case: relational queries rather than precise metrology.
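A hypothetical excerpt of such an exported graph, together with the retrieval step behind a query; the field names and values below are illustrative, not the project's actual schema (the distances echo the transcript above):

```python
import json

# Illustrative excerpt of an exported topological graph.
graph = {
    "nodes": [
        {"id": "workdesk_1", "class": "workdesk"},
        {"id": "water_bottle_1", "class": "water bottle"},
        {"id": "coffee_mug_1", "class": "coffee mug"},
    ],
    "edges": [
        {"a": "water_bottle_1", "b": "workdesk_1",
         "predicate": "directly_above", "distance_m": 1.23},
        {"a": "coffee_mug_1", "b": "water_bottle_1",
         "predicate": "next_to", "distance_m": 0.22},
    ],
}

def relations_for(graph, obj_id):
    """Edges touching obj_id: the retrieval step behind the LLM prompt."""
    return [e for e in graph["edges"] if obj_id in (e["a"], e["b"])]

# The serialized relations would be embedded in the prompt sent to Ollama,
# grounding the model's answer in explicit predicates and distances.
context = json.dumps(relations_for(graph, "water_bottle_1"), indent=2)
```

Because the model only ever sees predicates and metric distances that were computed geometrically, it retrieves rather than estimates, which is what suppresses scale hallucination.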


Acknowledgements

We acknowledge the fundamental contributions of the research teams behind COLMAP, 3D Gaussian Splatting, SuGaR, YOLO World, and SAM 2. Furthermore, this work is motivated by the pioneering research in 3D Semantic Scene Graphs, specifically the structural methodologies established by HOV-SG and ConceptGraphs. This pipeline was developed and optimized on consumer-grade hardware to demonstrate the accessibility of high-fidelity semantic scene reconstruction.