Gaussian splatting is a technique for reconstructing a 3D scene by optimizing a large number of tiny 3D Gaussians, which together form a dense, point-cloud-like representation of the spatial distribution of objects. The process begins with capturing the scene from many viewpoints, yielding a set of image-camera pose pairs (anywhere from 50 to over 1000) that serve as the input. The output is a Gaussian splat, from which a point cloud capturing the scene's geometry and structure can be extracted.
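To make the representation concrete, here is a minimal sketch (not the actual splatting code, and with plain RGB in place of the spherical-harmonics color used in practice) of the per-Gaussian parameters that get optimized, and how the covariance is built from a rotation and per-axis scales so that it stays valid during optimization:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Gaussian3D:
    """One splat primitive; a scene is a large set of these (illustrative sketch)."""
    mean: np.ndarray      # (3,) center position
    rotation: np.ndarray  # (3, 3) rotation matrix R
    scale: np.ndarray     # (3,) per-axis standard deviations
    opacity: float        # alpha in [0, 1]
    color: np.ndarray     # (3,) RGB; real implementations use spherical harmonics

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
        S = np.diag(self.scale)
        return self.rotation @ S @ S.T @ self.rotation.T

# Example: an axis-aligned splat stretched along x
g = Gaussian3D(mean=np.zeros(3), rotation=np.eye(3),
               scale=np.array([0.1, 0.01, 0.01]), opacity=0.9,
               color=np.array([1.0, 0.0, 0.0]))
cov = g.covariance()  # diagonal here: variances 0.01, 1e-4, 1e-4
```

Factoring the covariance this way (rather than optimizing its nine entries directly) is what lets gradient descent run unconstrained while every intermediate covariance remains physically meaningful.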
The primary motivation for using Gaussian splatting in robotics is its ability to render complex 3D environments in real time. Compared to ray-marching methods like NeRF (Neural Radiance Fields), Gaussian splatting trains and renders much more efficiently, especially when dynamic scene updates are required (which is not trivial, but certainly possible).
In our use case, we aim to leverage static Gaussian splatting to recreate a tabletop scene for robotic manipulation. The high-level goal is to enable a robot to navigate a crowded environment, identify and interact with a target object, and perform tasks such as picking it up and placing it at a desired location—all with the help of 3D scene reconstruction from Gaussian splatting. To achieve this, we needed to address the following challenges with our pipeline:
Our starting goal for the project was to create a high-quality point cloud to use as a scene representation for robotic control. We were interested in how we could improve on the Structure from Motion (SfM) techniques discussed in class, and after some research we settled on Gaussian splatting as our scene reconstruction technique. The main trade-off was that we had to settle for a static representation (no dynamic scene updates), since updating Gaussian splats was not easy to implement in the limited time we had. On the other hand, point clouds are a very natural extension of splatting, which made integrating the scene reconstruction module with collision avoidance and motion planning easier and the pipeline as a whole more robust. In addition, splats yield impressively high-quality point clouds and are relatively quick to train (about 15-20 minutes per scene), allowing for quick iterative design. Lastly, reproducibility was important when creating a splat for a scene; we ensured it by measuring relative distances between objects and defining a standard boundary for the scene.
Our final pipeline consists of the following steps, along with notes on what we tried during the experimental phase:
For the code in this project, we used the original Gaussian splatting codebase described above and COLMAP, with minimal changes such as hyperparameter tuning. We wrote the code for point cloud noise removal, clustering, and color filtering for object localization ourselves. We also implemented the Octomap integration with MoveIt! ourselves; all of this is provided in our GitHub repository.
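The object localization steps can be sketched roughly as follows. This is not our repository's code, just a minimal numpy/scipy illustration of the three operations named above (statistical outlier removal, Euclidean clustering, and color filtering); the radii, thresholds, and the union-find clustering are illustrative choices:

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_outliers(points, k=8, std_ratio=2.0):
    """Statistical outlier removal: drop points whose mean distance to
    their k nearest neighbors is far above the global average."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)   # nearest "neighbor" is the point itself
    mean_d = dists[:, 1:].mean(axis=1)
    keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()
    return points[keep], keep

def color_mask(colors, target, tol=0.2):
    """Keep points whose RGB color lies within tol of the target color."""
    return np.linalg.norm(colors - target, axis=1) < tol

def largest_cluster(points, radius=0.05):
    """Euclidean clustering via union-find over neighbor pairs; returns
    the indices of the biggest cluster (the target object, in our setting)."""
    parent = np.arange(len(points))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    tree = cKDTree(points)
    for i, j in tree.query_pairs(radius):
        parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(len(points))])
    labels, counts = np.unique(roots, return_counts=True)
    return np.flatnonzero(roots == labels[np.argmax(counts)])

# Demo: a dense blob (the object) plus two stray noise points
rng = np.random.default_rng(0)
obj = rng.normal([0.5, 0.0, 0.2], 0.01, size=(200, 3))
pts = np.vstack([obj, [[2.0, 2.0, 2.0], [-2.0, -2.0, 0.0]]])
clean, keep = remove_outliers(pts)
idx = largest_cluster(clean)
centroid = clean[idx].mean(axis=0)  # approx. the object center [0.5, 0.0, 0.2]
```

In our actual pipeline the centroid of the color-filtered, largest cluster is what gets handed to the motion planner as the grasp target.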
After extensive experimentation—particularly with the Gaussian splatting part of the pipeline—we were able to reliably create high quality splats for our scenes. We provide interactive visualizations for two of our final scenes and target objects below.
We were able to perform pick-and-place tasks with both objects precisely, both in object localization (the reconstruction part) and in obstacle avoidance (the motion planning part). Since our project has no live vision component and relies entirely on the scene reconstruction, it was critical that the point cloud obtained from Gaussian splatting was high quality, correctly scaled, and correctly oriented, and that the object localization component was sufficiently accurate. Our integration of the point cloud with Octomap was also smooth, as collision avoidance worked as expected. We provide a visualization of the Octomap and a video demo for one of our scenes below.
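Conceptually, what Octomap gives the planner is an occupancy grid built from the point cloud: points are quantized into voxels, and collision checks reduce to voxel lookups. The sketch below illustrates that idea only; in the real pipeline we publish the cloud to MoveIt! and let its occupancy map updater maintain the Octomap, so none of these function names come from our code:

```python
import numpy as np

def voxelize(points, resolution=0.02, origin=None):
    """Quantize a point cloud into occupied voxel indices -- conceptually
    what an occupancy map stores at its leaf level."""
    if origin is None:
        origin = points.min(axis=0)
    idx = np.floor((points - origin) / resolution).astype(int)
    return np.unique(idx, axis=0), origin

def in_collision(query, occupied, origin, resolution=0.02):
    """Check whether a 3D query point falls inside an occupied voxel."""
    q = np.floor((query - origin) / resolution).astype(int)
    return bool((occupied == q).all(axis=1).any())

# Demo: two nearby points share a voxel; a distant point gets its own
pts = np.array([[0.0, 0.0, 0.0], [0.005, 0.0, 0.0], [1.0, 1.0, 1.0]])
occ, origin = voxelize(pts)       # 2 occupied voxels at 2 cm resolution
hit = in_collision(np.array([0.005, 0.005, 0.005]), occ, origin)   # True
miss = in_collision(np.array([0.5, 0.5, 0.5]), occ, origin)        # False
```

The voxel resolution is the key knob here: too coarse and the planner refuses feasible paths near the table, too fine and map updates and collision queries slow down.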
In conclusion, we were able to use Gaussian splatting to build a high quality, albeit static, 3D reconstruction of a tabletop scene and use it for pick-and-place tasks. Two problems we faced and the solutions we came up with are described below:
An improvement we would make in a future iteration of this project is a grasp planning module that uses the point cloud of the target object to decide on the best grasping pose. We successfully implemented a version of this: given the smoothed reconstructed mesh of the object of interest, we sample candidate grasps and use the Ferrari-Canny grasp metric to select the best one. With the improved grasp, we can specify a full 6-DoF pose for the end effector to grasp the object more robustly. However, due to time constraints, we were unable to test this module on the Sawyer robot. An example of the generated grasps is shown below.
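For intuition, the Ferrari-Canny (epsilon) metric scores a grasp by the radius of the largest origin-centered ball that fits inside the convex hull of the contact wrenches: the larger the ball, the larger the worst-case disturbance the grasp can resist. The sketch below is a simplified force-space version (3D forces rather than full 6D wrenches, no friction-cone discretization), not our module's code, and it assumes the wrench set is non-degenerate so the hull is full-dimensional:

```python
import numpy as np
from scipy.spatial import ConvexHull

def ferrari_canny(wrenches):
    """Epsilon quality: radius of the largest origin-centered ball inside
    the convex hull of the contact wrenches. Returns 0 if the origin lies
    outside the hull (no force closure)."""
    hull = ConvexHull(wrenches)
    # hull.equations rows are [unit normal, offset] with n.x + offset <= 0
    # inside, so the origin's distance to each facet plane is -offset.
    return max(-hull.equations[:, -1].min() * -1.0
               if False else (-hull.equations[:, -1]).min(), 0.0)

def best_grasp(candidates):
    """Pick the candidate (each a wrench array) with the highest quality."""
    scores = [ferrari_canny(w) for w in candidates]
    return int(np.argmax(scores)), max(scores)

# Demo: forces at the octahedron vertices +-e_i; the inscribed ball of
# that hull has radius 1/sqrt(3)
octa = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                 [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
eps = ferrari_canny(octa)                      # 1/sqrt(3) ~ 0.577
i, score = best_grasp([0.5 * octa, octa])      # the unscaled grasp wins
```

The full metric works in 6D wrench space with friction-cone edge forces at each sampled contact, but the hull-distance computation is the same.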
Sophomore studying CS and math, interested in both robotics and software engineering.
Junior majoring in CS and math, interested in machine learning and robotics.
Senior majoring in CS and Applied Math, interested in artificial intelligence and robotics.
Junior majoring in CS and Astrophysics, interested in robotic applications for space exploration.