More Works

NavTrust: Benchmarking Trustworthiness
for Embodied Navigation

1University of California, Riverside     2University of Michigan     3Workday     4University of Southern California     5Lehigh University     6Texas A&M University    
* Equal Contribution     Correponding Author

TASL Lab Logo Trustworthy Autonomous Systems Laboratory (TASL)

Are Current Navigation Models Trustworthy Enough?

Visualization of trajectories from VLFM, NaVid, Uni-NaVid and WMNav under different types of corruptions.

Paper Demonstration Video

Abstract

Embodied navigation remains challenging due to cluttered layouts, complex semantics, and language-conditioned instructions. Recent breakthroughs in complex indoor domains require robots to interpret cluttered scenes, reason over long-horizon visual memories, and follow natural language instructions. Broadly, there are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark to expose embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial success-rate degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. As part of this roadmap, we systematically evaluate four distinct mitigation strategies: data augmentation, teacher-student knowledge distillation, safeguard LLM, and lightweight adapter tuning, to enhance robustness. Our experiments offer a practical path for developing more resilient embodied agents.

Potential Issues in Trustworthiness and Reliability

Potential Issues in Trustworthiness and Reliability

State-of-the-art VLN agents are known to fail under minor linguistic perturbations, while top OGN agents break down under small domain shifts like low lighting or motion blur, leading to unreliable behaviors. These vulnerabilities are often ignored by existing benchmarks, which typically report performance on clean, idealized inputs. The existing work typically evaluates perceptual and linguistic robustness in isolation, often ignores depth sensor corruptions, and lacks a unified benchmark for comparing mitigation strategies.

Corruptions Applied

Corruptions Applied

An overall illustration of three types of corruptions supported in our NavTrust benchmark, which highlights robustness challenges in language instructions and onboard sensor measurements.

Mitigation Strategies Applied

Mitigation Strategies Applied

To address the vulnerabilities identified by our NavTrust benchmark, we investigate four strategies for enhancing agent robustness on a subset of R2R dataset. These complementary mechanisms provide a constructive path toward developing more trustworthy and resilient embodied navigation systems.

Corruption Results

Corruption Results

Success Rate (%) ↑ and SPL ↑ across corruption types (left: RGB corruption, middle: depth corruption, right: instruction corruption; L-L: Low-lighting). The first and the second rows show the PRS ↑ (Performance Retention Score, quantifies robustness to corruptions by reporting the fraction of clean performance an agent retains on average) based on SR and SPL.

Mitigation Results
of Data Augmentation, Teacher-Student Distillation and Adapter

Mitigation Results

Corruption mitigation strategies: SR per corruption for ETPNav where (σ) indicates the intensity (Adap.: Adapter, DA: Data Augmentation, PF: Per-frame, PE: Per-episode, SD: Success Rate Distributed, T-S distil.: Teacher-Student distillation, L-L: Low-lighting, results tested in R2R dataset).

Mitigation Results
of Safeguard LLM

Mitigation Results

Instruction-level mitigation strategies on R2R dataset: SR across instruction variants.

BibTeX

@misc{
  chaudhary2026navtrust,
  title={NavTrust: Benchmarking Trustworthiness for Embodied Navigation},
  author={Yash Chaudhary and Huaide Jiang and Yuping Wang and Raghav Sharma and Manan Mehta and Lichao Sun and Zhiwen Fan and Zhengzhong Tu and Jiachen Li},
  year={2026},
  url={https://openreview.net/forum?id=ANbAB0tXv3}
}