Gaussian splatting is a technique for reconstructing a 3D scene by optimizing a large number of tiny 3D Gaussians, which together form a dense, point-cloud-like representation of the spatial distribution of objects. The process begins with capturing the scene from many viewpoints, yielding a set of image-camera pose pairs (anywhere from 50 to over 1000) that serve as the input. The output is a Gaussian splat, from which a point cloud capturing the scene's geometry and structure can be extracted.
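To make the representation concrete, here is a minimal sketch (not the actual splatting code, and with plain RGB in place of the spherical-harmonics color used in practice) of the per-Gaussian parameters that get optimized, and how the covariance is built from a rotation and per-axis scales so that it stays valid during optimization:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Gaussian3D:
    """One splat primitive; a scene is a large set of these (illustrative sketch)."""
    mean: np.ndarray      # (3,) center position
    rotation: np.ndarray  # (3, 3) rotation matrix R
    scale: np.ndarray     # (3,) per-axis standard deviations
    opacity: float        # alpha in [0, 1]
    color: np.ndarray     # (3,) RGB; real implementations use spherical harmonics

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
        S = np.diag(self.scale)
        return self.rotation @ S @ S.T @ self.rotation.T

# Example: an axis-aligned splat stretched along x
g = Gaussian3D(mean=np.zeros(3), rotation=np.eye(3),
               scale=np.array([0.1, 0.01, 0.01]), opacity=0.9,
               color=np.array([1.0, 0.0, 0.0]))
cov = g.covariance()  # diagonal here: variances 0.01, 1e-4, 1e-4
```

Factoring the covariance this way (rather than optimizing its nine entries directly) is what lets gradient descent run unconstrained while every intermediate covariance remains physically meaningful.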
The primary motivation for using Gaussian splatting in robotics is its ability to render complex 3D environments in real time. Compared to ray-marching methods like NeRF (Neural Radiance Fields), Gaussian splatting trains and renders much more efficiently, especially when dynamic scene updates are required (which is not trivial, but certainly possible).
In our use case, we aim to leverage static Gaussian splatting to recreate a tabletop scene for robotic manipulation. The high-level goal is to enable a robot to navigate a crowded environment, identify and interact with a target object, and perform tasks such as picking it up and placing it at a desired location—all with the help of 3D scene reconstruction from Gaussian splatting. To achieve this, we needed to address the following challenges with our pipeline:
Our starting goal for the project was to create a high-quality point cloud to use as a scene representation for robotic control. We were interested in how we could improve on the Structure from Motion (SfM) techniques discussed in class, and after some research we settled on Gaussian splatting as our scene reconstruction technique. The main trade-off was that we had to settle for a static representation (no dynamic scene updates), since updating Gaussian splats was not easy to implement in the limited time we had. On the other hand, point clouds are a very natural extension of splatting, which made integrating the scene reconstruction module with collision avoidance and motion planning easier and the pipeline as a whole more robust. In addition, splats yield impressively high-quality point clouds and are relatively quick to train (about 15-20 minutes per scene), allowing for quick iterative design. Lastly, reproducibility was important when creating a splat for a scene; we ensured it by measuring relative distances between objects and defining a standard boundary for the scene.
Our final pipeline consists of the following steps, along with notes on what we tried during the experimental phase:
For the code in this project, we used the original Gaussian splatting codebase described above and COLMAP, with minimal changes such as hyperparameter tuning. We wrote the code for point cloud noise removal, clustering, and color filtering for object localization ourselves. We also implemented the Octomap integration with MoveIt! ourselves; all of this is provided in our GitHub repository.
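The object localization steps can be sketched roughly as follows. This is not our repository's code, just a minimal numpy/scipy illustration of the three operations named above (statistical outlier removal, Euclidean clustering, and color filtering); the radii, thresholds, and the union-find clustering are illustrative choices:

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_outliers(points, k=8, std_ratio=2.0):
    """Statistical outlier removal: drop points whose mean distance to
    their k nearest neighbors is far above the global average."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)   # nearest "neighbor" is the point itself
    mean_d = dists[:, 1:].mean(axis=1)
    keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()
    return points[keep], keep

def color_mask(colors, target, tol=0.2):
    """Keep points whose RGB color lies within tol of the target color."""
    return np.linalg.norm(colors - target, axis=1) < tol

def largest_cluster(points, radius=0.05):
    """Euclidean clustering via union-find over neighbor pairs; returns
    the indices of the biggest cluster (the target object, in our setting)."""
    parent = np.arange(len(points))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    tree = cKDTree(points)
    for i, j in tree.query_pairs(radius):
        parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(len(points))])
    labels, counts = np.unique(roots, return_counts=True)
    return np.flatnonzero(roots == labels[np.argmax(counts)])

# Demo: a dense blob (the object) plus two stray noise points
rng = np.random.default_rng(0)
obj = rng.normal([0.5, 0.0, 0.2], 0.01, size=(200, 3))
pts = np.vstack([obj, [[2.0, 2.0, 2.0], [-2.0, -2.0, 0.0]]])
clean, keep = remove_outliers(pts)
idx = largest_cluster(clean)
centroid = clean[idx].mean(axis=0)  # approx. the object center [0.5, 0.0, 0.2]
```

In our actual pipeline the centroid of the color-filtered, largest cluster is what gets handed to the motion planner as the grasp target.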
After extensive experimentation—particularly with the Gaussian splatting part of the pipeline—we were able to reliably create high quality splats for our scenes. We provide interactive visualizations for two of our final scenes and target objects below.
We were able to perform pick-and-place tasks with both objects precisely, both in object localization (the reconstruction part) and in obstacle avoidance (the motion planning part). Since our project has no live vision component and relies entirely on the scene reconstruction, it was critical that the point cloud obtained from Gaussian splatting was high quality, correctly scaled, and correctly oriented, and that the object localization component was sufficiently accurate. Our integration of the point cloud with Octomap was also smooth, as collision avoidance worked as expected. We provide a visualization of the Octomap and a video demo for one of our scenes below.
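Conceptually, what Octomap gives the planner is an occupancy grid built from the point cloud: points are quantized into voxels, and collision checks reduce to voxel lookups. The sketch below illustrates that idea only; in the real pipeline we publish the cloud to MoveIt! and let its occupancy map updater maintain the Octomap, so none of these function names come from our code:

```python
import numpy as np

def voxelize(points, resolution=0.02, origin=None):
    """Quantize a point cloud into occupied voxel indices -- conceptually
    what an occupancy map stores at its leaf level."""
    if origin is None:
        origin = points.min(axis=0)
    idx = np.floor((points - origin) / resolution).astype(int)
    return np.unique(idx, axis=0), origin

def in_collision(query, occupied, origin, resolution=0.02):
    """Check whether a 3D query point falls inside an occupied voxel."""
    q = np.floor((query - origin) / resolution).astype(int)
    return bool((occupied == q).all(axis=1).any())

# Demo: two nearby points share a voxel; a distant point gets its own
pts = np.array([[0.0, 0.0, 0.0], [0.005, 0.0, 0.0], [1.0, 1.0, 1.0]])
occ, origin = voxelize(pts)       # 2 occupied voxels at 2 cm resolution
hit = in_collision(np.array([0.005, 0.005, 0.005]), occ, origin)   # True
miss = in_collision(np.array([0.5, 0.5, 0.5]), occ, origin)        # False
```

The voxel resolution is the key knob here: too coarse and the planner refuses feasible paths near the table, too fine and map updates and collision queries slow down.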
In conclusion, we were able to use Gaussian splatting to build a high quality, albeit static, 3D reconstruction of a tabletop scene and use it for pick-and-place tasks. Two problems we faced and the solutions we came up with are described below:
An improvement we would make in a future iteration of this project is a grasp planning module that uses the point cloud of the target object to decide on the best grasping pose. We successfully implemented a version of this: given the smoothed reconstructed mesh of the object of interest, we sample candidate grasps and use the Ferrari-Canny grasp metric to select the best one. With the improved grasp, we can specify a full 6-DoF pose for the end effector to grasp the object more robustly. However, due to time constraints, we were unable to test this module on the Sawyer robot. An example of the generated grasps is shown below.
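For intuition, the Ferrari-Canny (epsilon) metric scores a grasp by the radius of the largest origin-centered ball that fits inside the convex hull of the contact wrenches: the larger the ball, the larger the worst-case disturbance the grasp can resist. The sketch below is a simplified force-space version (3D forces rather than full 6D wrenches, no friction-cone discretization), not our module's code, and it assumes the wrench set is non-degenerate so the hull is full-dimensional:

```python
import numpy as np
from scipy.spatial import ConvexHull

def ferrari_canny(wrenches):
    """Epsilon quality: radius of the largest origin-centered ball inside
    the convex hull of the contact wrenches. Returns 0 if the origin lies
    outside the hull (no force closure)."""
    hull = ConvexHull(wrenches)
    # hull.equations rows are [unit normal, offset] with n.x + offset <= 0
    # inside, so the origin's distance to each facet plane is -offset.
    return max(-hull.equations[:, -1].min() * -1.0
               if False else (-hull.equations[:, -1]).min(), 0.0)

def best_grasp(candidates):
    """Pick the candidate (each a wrench array) with the highest quality."""
    scores = [ferrari_canny(w) for w in candidates]
    return int(np.argmax(scores)), max(scores)

# Demo: forces at the octahedron vertices +-e_i; the inscribed ball of
# that hull has radius 1/sqrt(3)
octa = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                 [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
eps = ferrari_canny(octa)                      # 1/sqrt(3) ~ 0.577
i, score = best_grasp([0.5 * octa, octa])      # the unscaled grasp wins
```

The full metric works in 6D wrench space with friction-cone edge forces at each sampled contact, but the hull-distance computation is the same.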
Sophomore studying CS and math, interested in both robotics and software engineering.
Junior majoring in CS and math, interested in machine learning and robotics.
Senior majoring in CS and Applied Math, interested in artificial intelligence and robotics.
Junior majoring in CS and Astrophysics, interested in robotic applications for space exploration.