See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation
Abstract
See, Point, Fly (SPF) is a training-free aerial vision-and-language navigation framework that treats action prediction as a 2D spatial grounding task, outperforming existing methods in both simulation and real-world evaluations.
We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF can navigate to any goal from free-form instructions in any environment. In contrast to existing VLM-based approaches that treat action prediction as text generation, our key insight is to frame action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iteratively annotated 2D waypoints on the input image. Together with a predicted traveling distance, SPF transforms each 2D waypoint into a 3D displacement vector that serves as an action command for the UAV. SPF also adaptively adjusts the traveling distance to enable more efficient navigation. Notably, SPF navigates in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art on the DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choices. Lastly, SPF generalizes remarkably well across different VLMs. Project page: https://spf-web.pages.dev
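As a rough illustration of the waypoint-to-action conversion described above, the sketch below back-projects a VLM-predicted 2D waypoint through a pinhole camera model and scales the resulting ray by the predicted traveling distance to obtain a 3D displacement in the camera frame. The function name and intrinsics matrix `K` are our own illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def waypoint_to_displacement(u, v, distance_m, K):
    """Back-project a pixel waypoint (u, v) into a unit viewing ray
    (pinhole model, x right / y down / z forward), then scale it by the
    predicted traveling distance to obtain a 3D displacement vector."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    ray /= np.linalg.norm(ray)       # unit direction toward the waypoint
    return distance_m * ray          # displacement command in metres

# Example: a waypoint right of image centre, 2 m predicted distance.
K = np.array([[525.0,   0.0, 320.0],   # assumed intrinsics for a 640x480 camera
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])
print(waypoint_to_displacement(400, 240, 2.0, K))  # ~[0.30, 0.00, 1.98]
```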
Community
We are excited to share our newest work, published at CoRL 2025:
See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation
See, Point, Fly (SPF) enables UAVs to navigate to any goal based on free-form natural language instructions in any environment, without task-specific training. The system demonstrates robust performance across diverse scenarios including obstacle avoidance, long-horizon planning, and dynamic target following.
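For readers curious what the closed-loop behavior might look like in code, here is a minimal, hypothetical control-loop sketch: the VLM is re-queried on every new frame, so a moving target simply shifts the next predicted waypoint. The `drone` and `vlm` interfaces below are assumptions for illustration and do not mirror the actual repo API.

```python
import time
import numpy as np

def fly_closed_loop(drone, vlm, instruction, max_steps=200, hz=2.0):
    """Hypothetical closed-loop navigation sketch.

    Assumed interfaces (not the actual SPF API):
      drone.get_frame()                -> latest RGB frame
      drone.intrinsics                 -> 3x3 camera matrix K
      drone.move_relative(dx, dy, dz)  -> relative motion command
      vlm.ground(frame, text)          -> (u, v, distance_m), or None at goal
    """
    K = drone.intrinsics
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    for _ in range(max_steps):
        pred = vlm.ground(drone.get_frame(), instruction)
        if pred is None:                 # VLM signals the goal is reached
            break
        u, v, dist = pred
        ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
        drone.move_relative(*(dist * ray / np.linalg.norm(ray)))
        time.sleep(1.0 / hz)             # let the scene (and target) evolve
```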
Visit our project page and repo to learn more!
Project Page: https://spf-web.pages.dev
GitHub Repo: https://github.com/Hu-chih-yao/see-point-fly
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation (2025)
- FiLM-Nav: Efficient and Generalizable Navigation via VLM Fine-tuning (2025)
- DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-and-Language Navigation (2025)
- VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation (2025)
- FlightDiffusion: Revolutionising Autonomous Drone Training with Diffusion Models Generating FPV Video (2025)
- Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting (2025)
- MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding (2025)