DroneSplat

Poster

Abstract

Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.

Framework

Given a few posed drone images of a wild scene, our goal is to identify and eliminate dynamic distractors. We first predict a dense point cloud with a learning-based multi-view stereo method, then sample points based on confidence and geometric features. The sampled point cloud is used to initialize Gaussian primitives, which are then optimized using a voxel-guided strategy. At iteration \(t=n\), we calculate the normalized residual of the rendered image and combine it with the segmentation results to obtain object-wise residuals. We adaptively adjust the threshold based on the object-wise residuals and statistical approaches to obtain the local mask. Meanwhile, we mark objects with high residuals as tracking candidates, deriving the global set at \(t=n\) by combining the global set at \(t=n-1\) with the tracking outcomes at \(t=n\). Merging the local mask with the global mask retrieved from the global set yields the final mask at iteration \(t=n\). The resulting mask set marks the predicted dynamic distractors.
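The masking step at iteration \(t=n\) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the DroneSplat implementation: the function names are hypothetical, the object-wise residual is taken to be the mean residual inside each segment, the "statistical approach" is replaced by a simple mean plus k standard deviations rule, and tracking is reduced to a set union over object ids that are assumed to be consistent across views.

```python
import numpy as np

def object_wise_residuals(residual, labels):
    """Mean normalized residual per segmented object (assumed aggregation).

    residual: (H, W) float array, normalized residual of the rendered image.
    labels:   (H, W) int array of object ids; negative ids mean "no object".
    """
    ids = np.unique(labels[labels >= 0])
    return {int(i): float(residual[labels == i].mean()) for i in ids}

def adaptive_threshold(obj_res, k=1.0):
    """Hypothetical statistical rule: mean + k * std of object residuals."""
    vals = np.array(list(obj_res.values()), dtype=float)
    return float(vals.mean() + k * vals.std())

def local_mask(labels, obj_res, thr):
    """Mask every object whose residual exceeds the adaptive threshold."""
    flagged = {i for i, r in obj_res.items() if r > thr}
    ids = np.array(sorted(flagged), dtype=labels.dtype)
    return np.isin(labels, ids), flagged

def update_global_set(prev_global, tracked):
    """Global set at t=n: global set at t=n-1 plus tracking outcomes at t=n."""
    return set(prev_global) | set(tracked)

def final_mask(local, labels, global_set):
    """Merge the local mask with the mask retrieved from the global set."""
    ids = np.array(sorted(global_set), dtype=labels.dtype)
    return local | np.isin(labels, ids)
```

A possible usage: compute `object_wise_residuals` from the current residual map and segmentation, pass the result to `adaptive_threshold` and `local_mask`, feed the flagged ids into `update_global_set` as tracking candidates, and call `final_mask` to get the per-pixel distractor mask for that iteration. In a real pipeline the tracking outcomes would come from a video segmentation or tracking model rather than from shared ids.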

DroneSplat Dataset

Reconstruction Results

DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery