
RGB+IMU video — metric scene graphs without RGB-D

Depth-Free Metric Scene Graphs on Consumer Hardware

~14 min read April 2026

Dense 3D surfaces, fused detections, and queryable spatial relations from RGB+IMU alone. No depth camera, no datacenter GPU.

Overview

This project presents a proof-of-concept system that extracts metric-scale semantic scene graphs from RGB+IMU video on consumer hardware. Unlike existing approaches (ConceptGraphs, HOV-SG) that require RGB-D sensors and high-end GPUs, this system achieves spatial relationship understanding from monocular RGB with inertial sensor data on an 8GB laptop GPU.

The system combines 3D Gaussian Splatting (SuGaR) for dense surface reconstruction, open-vocabulary detection (YOLO World + SAM 2) for semantic understanding, and ARKit IMU alignment for metric scale recovery. Replacing depth sensors with monocular RGB and inertial data is what brings scene graph extraction within reach of consumer hardware, at the cost of dimensional precision.

⚠️ Evaluation Scope: This work presents single-scene validation on a representative bedroom environment. All reported metrics are specific to this scene and may not generalize. While single-scene evaluation limits generalization claims, these results establish feasibility and identify key accessibility-precision tradeoffs for future multi-scene investigation.

3D Segmented Reconstruction
3D Segmented Surface

Methodology

The system architecture is a modular, feed-forward pipeline that transforms raw, unstructured RGB video into a structured 3D scene representation. As the flowchart alongside this section shows, the process begins by extracting a sparse camera trajectory with COLMAP while, in parallel, performing zero-shot object detection and temporal tracking with YOLO World and SAM 2. A Semantic Loop Closure step then unifies these parallel tracks, lifting 2D detections onto a metric-aligned SuGaR surface and finally yielding the queryable Topological Scene Graph.

Hardware & VRAM Engineering: Fitting this pipeline onto a consumer GPU required decoupling the geometric and semantic processing stages, both to stay within VRAM limits and to prevent cumulative memory fragmentation during training. Runtime patches to SuGaR's CamerasWrapper module add dynamic resolution rescaling and scaled-down hyperparameters, preserving surface fidelity on an 8 GB GPU.
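One simple way to realize that decoupling, sketched here with placeholder stage scripts (these names are illustrative, not the project's actual entry points), is to run each stage in its own OS process so that all GPU memory, including any allocator fragmentation, is released when the process exits:

```python
import subprocess

# Hypothetical stage commands; the script names and flags are placeholders,
# not the project's real entry points.
STAGES = [
    ["python", "run_colmap_sfm.py", "--frames", "frames/"],
    ["python", "train_sugar.py", "--low_poly"],
    ["python", "run_semantics.py", "--detector", "yolo_world"],
]

def run_stages(stages, runner=subprocess.run):
    """Run each pipeline stage in a separate process. The GPU driver
    reclaims all VRAM at process exit, so fragmentation from one stage
    cannot leak into the next."""
    for cmd in stages:
        result = runner(cmd)
        if getattr(result, "returncode", 0) != 0:
            raise RuntimeError(f"stage failed: {cmd}")
```

The `runner` parameter exists only to make the orchestration testable; in practice the default `subprocess.run` is used.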

Architecture Flowchart

3D Surface Reconstruction

We tested on a 59-second video shot at 30 FPS, down-sampled to 5 FPS (~295 frames). The pipeline feeds these frames through COLMAP Structure-from-Motion (SfM) to estimate initial sparse camera poses.

To reconstruct continuous physical geometry within the 8GB VRAM budget, we aggressively downscaled the input images (`factor=2`) and the training resolution (`factor=4`). After 15 minutes of training, vanilla 3DGS peaked at approximately 1.4 million Gaussians; the SuGaR stage then optimized and pruned this down to a refined surface of 303,738 3D Gaussian splats.

Raw Input Sequence
1. Raw RGB Dataset
COLMAP Point Cloud
2. Sparse COLMAP Cloud
SuGaR 3D Splatting
3. Pruned 303k Dense Splats

Semantic Loop Closure

We use YOLO World and SAM 2 for zero-shot segmentation and temporal mask tracking across the video. Because tracked masks frequently fragment across occlusions (e.g., splitting a bed into bed_1 and bed_2), we apply a Semantic Loop Closure step: after lifting masks into the metric 3D cloud, fragments with the same class label are merged whenever their 3D bounding boxes fall within a tight metric distance threshold.
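A minimal sketch of that merge rule, assuming a hypothetical 10 cm proximity threshold (the project's actual tolerance is not stated):

```python
import numpy as np

def aabb(points):
    """Axis-aligned bounding box of an (N, 3) point array."""
    return points.min(axis=0), points.max(axis=0)

def boxes_close(pts_a, pts_b, tol=0.10):
    """True if the two point clouds' boxes overlap or lie within `tol` metres."""
    (min_a, max_a), (min_b, max_b) = aabb(pts_a), aabb(pts_b)
    gap = np.maximum(min_a - max_b, min_b - max_a)  # per-axis separation
    return bool(np.all(gap <= tol))

def loop_close(objects, tol=0.10):
    """Merge fragments of the same class whose metric boxes nearly touch.
    `objects` maps ids like 'bed_1' to (N, 3) metric point clouds."""
    merged = {}
    for name, pts in objects.items():
        cls = name.rsplit("_", 1)[0]
        for key in list(merged):
            if key.startswith(cls) and boxes_close(merged[key], pts, tol):
                merged[key] = np.vstack([merged[key], pts])  # fuse fragment
                break
        else:
            merged[f"{cls}_{sum(k.startswith(cls) for k in merged) + 1}"] = pts
    return merged
```

On the bed_1/bed_2 example above, the two fragments share a class label and nearly touching boxes, so they collapse into a single node.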

Tracking Masks
SAM 2 Temporal Masks
Segmented 3D
Loop-Closed 3D Fusion

Metric Scale Recovery

Because vanilla COLMAP reconstructions are dimensionless, the pipeline extracts the iOS ARKit IMU camera trajectory recorded on-device alongside the video and rigidly aligns it to the COLMAP trajectory with an optimized Sim(3) transformation.

$$ \mathbf{p}' = s \mathbf{R} \mathbf{p} + \mathbf{t} $$

Here \( \mathbf{p} \) is an unscaled, dimensionless COLMAP coordinate, \( s \) is the isotropic scale factor that recovers metric units, \( \mathbf{R} \) is the \( 3 \times 3 \) rotation aligning the two frames, and \( \mathbf{t} \) is the \( 3 \times 1 \) translation between origins, yielding the metric ARKit-frame coordinate \( \mathbf{p}' \).

Applying this transformation over the entire point cloud scales the reconstruction to metric units. Using COLMAP's model_aligner with robust alignment enabled, the system achieves a trajectory alignment RMSE of 1.83 cm between the reconstructed camera path and ARKit ground truth across 291 frames, confirming accurate metric scale recovery for the camera trajectory.
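The Sim(3) fit has a closed-form least-squares solution (Umeyama, 1991), sketched below as an illustration of the math, not as the COLMAP model_aligner implementation:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src -> dst,
    i.e. dst ~= s * R @ src + t, per Umeyama (1991).
    src, dst: (N, 3) corresponding camera positions (COLMAP vs. ARKit)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)          # cross-covariance of the two clouds
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                  # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Given corresponding trajectory samples, the recovered `(s, R, t)` applies directly as \( \mathbf{p}' = s\mathbf{R}\mathbf{p} + \mathbf{t} \) to the whole point cloud.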

ARKit Alignment
Metric ARKit Trajectory Sim3 Math

Topological Scene Graph

Once the scene is in metric space, the system computes physical Oriented Bounding Boxes (OBBs) for each discrete semantic entity produced by mask lifting and loop closure. Restricting Principal Component Analysis (PCA) to the XZ (floor) plane yields yaw-only boxes, avoiding the skew that full 3D PCA introduces on tall geometry such as doors and wardrobes.
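A minimal sketch of the yaw-only OBB idea, assuming y is the up axis (a convention this sketch adopts, not one stated by the project):

```python
import numpy as np

def floor_plane_obb(points):
    """Yaw-only oriented bounding box: PCA restricted to the XZ (floor)
    plane, so tall objects (doors, wardrobes) cannot tilt the box.
    points: (N, 3) array with y up. Returns (yaw, (W, D, H), center)."""
    xz = points[:, [0, 2]]
    centered = xz - xz.mean(axis=0)
    # 2x2 covariance -> principal floor-plane axes
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    major = eigvecs[:, np.argmax(eigvals)]
    yaw = np.arctan2(major[1], major[0])
    # Rotate by -yaw so the major axis aligns with x, then take extents
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = centered @ np.array([[c, -s], [s, c]]).T
    w, d = rot.max(axis=0) - rot.min(axis=0)
    h = points[:, 1].max() - points[:, 1].min()
    return yaw, (w, d, h), points.mean(axis=0)
```

Because the rotation is confined to the floor plane, height is always measured straight up, which is exactly what keeps wardrobe-sized boxes from tipping.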

We then compute pairwise spatial predicates (e.g., resting_on, directly_above, next_to) based on bounding-box proximity and overlap. The output is a formalized node-edge topology graph that bridges the 3D geometry to natural-language querying.
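A sketch of how such predicates can be assigned from metric boxes; the thresholds here are illustrative guesses, not the project's tuned values:

```python
import numpy as np

def spatial_predicate(a, b):
    """Assign a coarse predicate to an ordered pair of metric AABBs.
    Each box is a dict with 'min'/'max' 3-vectors; y is the up axis.
    Thresholds (5 cm contact, 1 m proximity) are illustrative only."""
    a_min, a_max = np.asarray(a["min"]), np.asarray(a["max"])
    b_min, b_max = np.asarray(b["min"]), np.asarray(b["max"])
    # Do the boxes' footprints overlap on the XZ (floor) plane?
    overlap_xz = bool(np.all(a_min[[0, 2]] <= b_max[[0, 2]]) and
                      np.all(b_min[[0, 2]] <= a_max[[0, 2]]))
    gap_y = a_min[1] - b_max[1]  # vertical gap: bottom of a vs. top of b
    if overlap_xz and 0 <= gap_y < 0.05:
        return "resting_on"       # a sits on b (near-zero vertical gap)
    if overlap_xz and gap_y >= 0.05:
        return "directly_above"   # a floats above b's footprint
    center_dist = np.linalg.norm((a_min + a_max) / 2 - (b_min + b_max) / 2)
    if center_dist < 1.0:
        return "next_to"
    return None
```

Evaluating this over all object pairs yields the edge set of the topology graph.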

Predicate Topology Graph
Predicate Edge Topology Graph

Results

Quantitative Analysis

Summary tiles and tables (including SOTA references where available).

| Metric | This Work | ConceptGraphs |
| --- | --- | --- |
| Camera trajectory | 1.83 cm RMSE vs. ARKit (291 frames) | Not reported |
| Object dimensions | 17.7 cm mean error (depth-sensor-free tradeoff) | Not reported |
| Object detection | 52% recall (13/25 classes) | 71% |
| Scene graph | 156 spatial relationships (9 types) | 88% edge accuracy |

Key Insight: Systems such as ConceptGraphs and HOV-SG do not report dimensional accuracy; like this pipeline, they emphasize relational and navigational semantics over geometric metrology. The object-dimension numbers in the table below contextualize that design choice for depth-sensor-free capture.

✓ SUITABLE FOR
Spatial understanding • Relationship reasoning • Robotics navigation
✗ NOT SUITABLE FOR
Precision metrology • Dimensional measurement • CAD modeling

Geometric Reconstruction Quality

| Metric | COLMAP | SuGaR |
| --- | --- | --- |
| Point density | 18,508 points | 303,738 Gaussian centers |
| Densification factor | 1.0× (baseline) | 16.4× increase |
| Surface continuity | Sparse / edge-centered | Manifold / complete |
| Mask lifting alignment | ~65% pixel-hit rate | ~99% pixel-hit rate |

Semantic Scene Understanding

| Metric | Value |
| --- | --- |
| Target object classes | 25 open-vocabulary categories |
| SAM 2 raw detections | 24 fragmented masks |
| After Semantic Loop Closure | 13 unified objects |
| Fragmentation reduction | 45.8% fewer fragments |
| Segmented 3D points | 131,171 points (43% coverage) |
| Spatial relationships | 156 pairwise predicates (9 types) |

Physical Dimension Accuracy

| Object | Predicted (W × D × H) | Measured (W × D × H) | Mean Error |
| --- | --- | --- | --- |
| Workdesk | 0.92m × 0.54m × 0.43m | 1.22m × 0.70m × 0.70m | 24.4 cm |
| Keyboard | 0.46m × 0.28m × 0.17m | 0.38m × 0.16m × 0.02m | 11.6 cm |
| Dresser | 0.68m × 0.38m × 0.49m | 0.80m × 0.30m × 0.80m | 17.1 cm |
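The headline figure can be reproduced directly from the table's per-object means:

```python
# Per-object mean errors from the physical-dimension table (cm)
per_object_errors = [24.4, 11.6, 17.1]  # workdesk, keyboard, dresser
mae = sum(per_object_errors) / len(per_object_errors)
print(round(mae, 1))  # -> 17.7
```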

Mean Absolute Error: 17.7 cm across 3 objects (9 dimensions), with systematic underprediction from occlusion and segmentation limits. Trajectory alignment stays tight (1.83 cm RMSE); the much larger object-level error reflects lifting masks on RGB+IMU without depth. The densification and loop-closure steps nonetheless keep the geometry stable enough to support the relationship graph.

⚠️ Important distinction: trajectory RMSE ≠ object accuracy

Validation against ground-truth object dimensions revealed that camera trajectory RMSE does not equate to object dimension accuracy. Physical measurements showed a mean object dimension error of 17.7 cm, with systematic underestimation across all measured objects (per-axis breakdown in the physical-dimension validation table).

This bias is consistent with incomplete point clouds from partial occlusion and SAM 2 mask boundary limitations. The system achieves accurate spatial relationships (which objects are near each other, relative positions) but not precise dimensional metrology (exact object sizes).

Many navigation-style and language-grounded uses depend mainly on relationships and coarse relative scale (what is beside what, what reads as larger or farther), not on centimeter-exact extents. That does not make large box errors acceptable for every task: clearances, fit, fabrication, and similar goals still need tighter geometry than this pipeline targets.

Discussion

Sparse vs. surface-aligned geometry

The raw COLMAP point cloud is sparse and concentrated along texture edges, as expected from SIFT-based matching, leaving the central surfaces of objects geometrically empty.

In contrast, SuGaR’s surface regularization pulls Gaussians toward the true scene manifold during the refinement stage. This prevents floating splats common in vanilla 3DGS and provides a continuous substrate for robust 3D semantic mask-lifting.
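To illustrate what mask lifting over that continuous substrate involves, here is a minimal pinhole-projection sketch; the camera convention and function names are assumptions of this sketch, not the project's code:

```python
import numpy as np

def lift_mask(points_world, K, R, t, mask):
    """Project 3D points into one frame and keep those landing on the mask.
    points_world: (N, 3); K: 3x3 intrinsics; R, t: world-to-camera pose;
    mask: (H, W) boolean SAM 2 mask. Returns a boolean hit array."""
    cam = points_world @ R.T + t            # world -> camera frame
    in_front = cam[:, 2] > 0                # discard points behind camera
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]             # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    H, W = mask.shape
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    hit = np.zeros(len(points_world), dtype=bool)
    hit[valid] = mask[v[valid], u[valid]]   # sample the mask at each pixel
    return hit
```

The denser the surface-aligned Gaussian centers, the more of each mask's pixels find a 3D point, which is what the pixel-hit-rate comparison in the results table measures.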

COLMAP Sparse Points
COLMAP: Edge-Based Sparse (18k pts)
SuGaR Semantic Splats
SuGaR: Surface-Regularized Gaussian centers with Semantic Masks (303k pts)

Conclusion

Metric scene graphs from RGB+IMU on consumer hardware are feasible. Camera alignment is tight and relational queries work, but object dimensions drift, an expected cost of lifting masks without depth.

To summarize, this pipeline demonstrates that metric alignment and relationship extraction can run on consumer hardware without a depth sensor, even though object dimensions remain coarse. It achieves strong trajectory alignment with ARKit, outputs useful scene graphs and LLM-grounded queries despite dimensional errors, reduces SAM 2 fragmentation through loop closure, and runs end-to-end within 8GB VRAM. Quantitative results are summarized in the Results section.

Impact and Applications: The graph targets tasks where relative layout matters more than sub-centimeter sizing: spatial understanding, relationship reasoning, and robotics navigation.

Unique Contributions

First Depth-Sensor-Free Scene Graphs

SOTA systems (ConceptGraphs, HOV-SG) require RGB-D sensors. This project achieves scene graph extraction from RGB+IMU (monocular RGB with inertial sensors) on consumer hardware.

Impact: Removes $400-600 depth sensor barrier
Caveat: Single-scene validation

5× Cost Reduction

SOTA: $1,400-2,600 (RGB-D sensor + high-end GPU)
This work: $300 (consumer GPU, phone camera)

Impact: Enables accessible spatial AI
Tradeoff: Coarser than RGB-D / LiDAR metrology (validation tables)

Semantic Loop Closure Algorithm

A novel algorithm reduces SAM 2 temporal fragmentation by 45.8%, merging split detections based on 3D proximity and class identity.

Impact: Addresses foundation model limitations
Result: 24 fragments → 13 objects

Detailed System Comparison

| Capability | ConceptGraphs (ICRA'24) | HOV-SG (RSS'24) | Semantic Gaussians (2024) | This Work |
| --- | --- | --- | --- | --- |
| Core Approach | | | | |
| Input modality | RGB-D | RGB-D | RGB(-D) | RGB+IMU |
| Depth sensor required | ✓ Yes | ✓ Yes | ○ Optional | ✗ No |
| 3D representation | Point clouds | Point clouds | 3D Gaussians | 3D Gaussians (SuGaR) |
| Hardware | High-end GPU | High-end GPU | High-end GPU | 8GB consumer |
| Semantic Capabilities | | | | |
| Object detection | 71% accuracy | SOTA | Yes | 52% recall |
| Scene graph | ✓ Yes | ✓ Hierarchical | ✗ No | ✓ Yes (156 relations) |
| Relationship accuracy | 88% edges | SOTA | N/A | Qualitative |
| Geometric Capabilities | | | | |
| Metric scale recovery | ✗ No | ✗ No | ✗ No | ✓ Yes (1.83 cm) |
| Dimensional accuracy | Not reported | Not reported | Not reported | 17.7 cm (validated) |
| Surface representation | Sparse points | Sparse points | Dense splats | Dense (SuGaR) |
| Accessibility | | | | |
| Evaluation scope | Multiple datasets | Multiple buildings | Multiple scenes | Single scene ⚠️ |
| Total hardware cost | $1.4K-2.4K+ | $1.4K-2.4K+ | $1K-2K+ | $300 |
| Deployment | Research/lab | Research/lab | Research/lab | Consumer laptop |
| Processing time | Not reported | Not reported | Not reported | 45 minutes |

This table summarizes modality, semantics, geometry, hardware cost, and evaluation scope.

⚠️ Caveat: Single-scene figures are not interchangeable with SOTA multi-scene averages or variance reports. Treat any cell-wise comparison as directional until matched evaluation exists.

Limitations & Explainability

The three panels below map each limitation to a simple mitigation. They assume the same relational use case emphasized in the Conclusion and in the downstream LLM example; stricter metrology or detection targets would call for RGB-D, multi-view capture, or both.

Dimensional Accuracy

Issue: 17.7cm mean error (vs. 1.83cm trajectory RMSE)

Root Causes:

  • RGB+IMU instead of RGB-D (LiDAR achieves <1cm)
  • SAM 2 boundary imprecision (documented limitation)
  • Single-pass capture leads to occlusion and incomplete point clouds

Mitigation: Capturing the scene from multiple angles would fill occluded surfaces and improve dimensional accuracy.

Detection Recall

Issue: 52% recall (13/25 target classes)

Root Causes:

  • Zero-shot detection (no training)
  • Occlusion during single-pass capture
  • Small objects (mug, bottle) harder to detect

Mitigation: A stronger detector, or one fine-tuned on the target classes, would improve detection recall.

Geometric Coverage

Issue: Partial object reconstruction

Root Causes:

  • Linear trajectory (not orbital)
  • Occlusion planes along unobserved axes
  • 2D mask lifting to incomplete 3D clouds

Mitigation: An orbital, multi-angle capture path would improve geometric coverage.

Future Directions

Having established that depth-sensor-free semantic scene understanding is viable, future work could: (1) improve detection recall toward ConceptGraphs' 71% through tighter SAM 2 and loop-closure integration, (2) develop quantitative relationship benchmarks comparable to their 88% edge accuracy, (3) expand spatial predicate vocabulary (orientation, containment), (4) implement hierarchical room-level organization, and (5) validate on real robotics tasks to empirically demonstrate that 17.7cm errors don't impede downstream applications.

 python3 query.py "relationship between coffee mug, water bottle, desk"

🤖 Sending query...
==================================================
Based on the provided data:

1. Water Bottle and Workdesk:
- Distance: 1.23 meters
- Relationship: directly above

2. Coffee Mug and Water Bottle:
- Distance: 0.22 meters
- Relationship: next to
note: suggests it is on a shelf overhead, side-by-side.

Final Answer:
Both items are directly above the workdesk placed side by side.
==================================================

Downstream example: LLM spatial reasoning

This section illustrates impact and applications: a simple downstream task built on the exported graph.

Unlike VLM pipelines that reason directly over massive textured splats and often hallucinate scale, this approach compiles the environment into a structured topological JSON graph with explicit spatial predicates.

The intermediate representation reduces a dense point sample to surface-aligned primitives with metric grounding. When that graph is passed to Ollama, the model retrieves relationships and proximity (e.g. "next to," "directly above," distance estimates), illustrating the intended use case: relational queries rather than precise metrology.
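A hypothetical excerpt of such an exported graph, together with the retrieval step behind a query; the field names and values below are illustrative, not the project's actual schema (the distances echo the transcript above):

```python
import json

# Illustrative excerpt of an exported topological graph.
graph = {
    "nodes": [
        {"id": "workdesk_1", "class": "workdesk"},
        {"id": "water_bottle_1", "class": "water bottle"},
        {"id": "coffee_mug_1", "class": "coffee mug"},
    ],
    "edges": [
        {"a": "water_bottle_1", "b": "workdesk_1",
         "predicate": "directly_above", "distance_m": 1.23},
        {"a": "coffee_mug_1", "b": "water_bottle_1",
         "predicate": "next_to", "distance_m": 0.22},
    ],
}

def relations_for(graph, obj_id):
    """Edges touching obj_id: the retrieval step behind the LLM prompt."""
    return [e for e in graph["edges"] if obj_id in (e["a"], e["b"])]

# The serialized relations would be embedded in the prompt sent to Ollama,
# grounding the model's answer in explicit predicates and distances.
context = json.dumps(relations_for(graph, "water_bottle_1"), indent=2)
```

Because the model only ever sees predicates and metric distances that were computed geometrically, it retrieves rather than estimates, which is what suppresses scale hallucination.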


Acknowledgements

We acknowledge the fundamental contributions of the research teams behind COLMAP, 3D Gaussian Splatting, SuGaR, YOLO World, and SAM 2. Furthermore, this work is motivated by the pioneering research in 3D Semantic Scene Graphs, specifically the structural methodologies established by HOV-SG and ConceptGraphs. This pipeline was developed and optimized on consumer-grade hardware to demonstrate the accessibility of high-fidelity semantic scene reconstruction.