PAPER_TITLE

FIRST_AUTHOR_LAST, FIRST_AUTHOR_FIRST; SECOND_AUTHOR_LAST, SECOND_AUTHOR_FIRST

NavTrust: Benchmarking Trustworthiness
for Embodied Navigation

Huaide Jiang¹^*, Yash Chaudhary¹^*, Yuping Wang^1,2, Zehao Wang¹, Raghav Sharma³, Manan Mehta⁴, Yang Zhou⁶, Lichao Sun⁵, Zhiwen Fan⁶, Zhengzhong Tu⁶, Jiachen Li¹^‡

¹University of California, Riverside ²University of Michigan ³Workday ⁴University of Southern California ⁵Lehigh University ⁶Texas A&M University
^* Equal Contribution ^‡ Correponding Author
TASL Lab Logo

Trustworthy Autonomous Systems Laboratory (TASL)

Paper Code (Coming Soon)

Are Current Navigation Models Trustworthy Enough?

Visualization of trajectories from VLFM, NaVid, Uni-NaVid and WMNav under different types of corruptions.

Paper Demonstration Video

Abstract

Embodied navigation remains challenging due to cluttered layouts, complex semantics, and language-conditioned instructions. Recent breakthroughs in complex indoor domains require robots to interpret cluttered scenes, reason over long-horizon visual memories, and follow natural language instructions. Broadly, there are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark to expose embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial success-rate degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. As part of this roadmap, we systematically evaluate four distinct mitigation strategies: data augmentation, teacher-student knowledge distillation, safeguard LLM, and lightweight adapter tuning, to enhance robustness. Our experiments offer a practical path for developing more resilient embodied agents.

Potential Issues in Trustworthiness and Reliability

State-of-the-art VLN agents are known to fail under minor linguistic perturbations, while top OGN agents break down under small domain shifts like low lighting or motion blur, leading to unreliable behaviors. These vulnerabilities are often ignored by existing benchmarks, which typically report performance on clean, idealized inputs. The existing work typically evaluates perceptual and linguistic robustness in isolation, often ignores depth sensor corruptions, and lacks a unified benchmark for comparing mitigation strategies.

Corruptions Applied

An overall illustration of three types of corruptions supported in our NavTrust benchmark, which highlights robustness challenges in language instructions and onboard sensor measurements.

Mitigation Strategies Applied

To address the vulnerabilities identified by our NavTrust benchmark, we investigate four strategies for enhancing agent robustness on a subset of R2R dataset. These complementary mechanisms provide a constructive path toward developing more trustworthy and resilient embodied navigation systems.

Corruption Results

Success Rate (%) ↑ and SPL ↑ across corruption types (left: RGB corruption, middle: depth corruption, right: instruction corruption; L-L: Low-lighting). The first and the second rows show the PRS ↑ (Performance Retention Score, quantifies robustness to corruptions by reporting the fraction of clean performance an agent retains on average) based on SR and SPL.

Mitigation Results
of Data Augmentation, Teacher-Student Distillation and Adapter

Corruption mitigation strategies: SR per corruption for ETPNav where (σ) indicates the intensity (Adap.: Adapter, DA: Data Augmentation, PF: Per-frame, PE: Per-episode, SD: Success Rate Distributed, T-S distil.: Teacher-Student distillation, L-L: Low-lighting, results tested in R2R dataset).

Mitigation Results
of Safeguard LLM

Instruction-level mitigation strategies on R2R dataset: SR across instruction variants.

BibTeX

@misc{
  chaudhary2026navtrust,
  title={NavTrust: Benchmarking Trustworthiness for Embodied Navigation},
  author={Yash Chaudhary and Huaide Jiang and Yuping Wang and Raghav Sharma and Manan Mehta and Lichao Sun and Zhiwen Fan and Zhengzhong Tu and Jiachen Li},
  year={2026},
  url={https://openreview.net/forum?id=ANbAB0tXv3}
}

NavTrust: Benchmarking Trustworthiness for Embodied Navigation