nuScenes (open-loop): L2@3s 1.41 m; Avg L2 0.83 m; Fréchet 2.29; NLL 3.72; JSD 0.36
HUGSIM (closed-loop): HD-Score 0.47; NC 0.76; DAC 0.99; TTC 0.69; COM 1.00; Rc 0.58
Multimodal GT realism (photorealistic sim): 3s L2 0.79 m; Fréchet 1.46; NLL 3.48
Modeling the nuanced, multimodal nature of human driving remains a core challenge for autonomous systems: existing methods often fail to capture the diversity of plausible behaviors in complex real-world scenarios. We introduce an end-to-end planner and a benchmark for modeling realistic multimodality in autonomous driving decisions. Our Gaussian Mixture Model (GMM)-based diffusion planner explicitly captures human-like, multimodal driving decisions in diverse contexts. While the model achieves state-of-the-art performance on current benchmarks, it also exposes weaknesses in standard evaluation practices, which rely on single ground-truth trajectories or coarse closed-loop metrics and thus inadvertently penalize diverse yet plausible alternatives. We further develop a human-in-the-loop simulation benchmark that enables finer-grained evaluation and measures multimodal realism in challenging driving settings.
Real-world driving involves multiple plausible and safe decisions in a given scenario. Yet, many current planners are deterministic or exhibit mode collapse, and most datasets provide only a single ground truth per scene, penalizing other feasible alternatives. BranchOut advances both modeling and evaluation of realistic multimodality in planning.
BranchOut follows a standard end-to-end planning setup: a scene context comprises six multi-view camera images and an HD map. Unlike some prior work, we do not use ego-status or past ego-trajectory inputs by default. A transformer denoiser uses multi-head cross-attention (MHCA) to condition ego-trajectory queries on scene features; a branched GMM head then jointly predicts K mean trajectories and their mixture weights, enabling selection among plausible futures.
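To make the branched head concrete, below is a minimal PyTorch sketch, not the released implementation: the layer sizes, the two-layer mean branches, and names like `BranchedGMMHead` are illustrative assumptions; only the idea of jointly predicting K mean trajectories and mixture weights from a cross-attended ego query follows the text above.

```python
import torch
import torch.nn as nn

class BranchedGMMHead(nn.Module):
    """Hypothetical sketch: predict K mean trajectories and mixture
    weights in a single forward pass from a fused ego query."""

    def __init__(self, d_model: int = 256, num_modes: int = 6,
                 horizon: int = 6, state_dim: int = 2):
        super().__init__()
        self.num_modes, self.horizon, self.state_dim = num_modes, horizon, state_dim
        # One branch per mixture component predicts a full trajectory.
        self.mean_branches = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, horizon * state_dim))
            for _ in range(num_modes)
        ])
        # A shared linear branch scores the K mixture weights.
        self.weight_branch = nn.Linear(d_model, num_modes)

    def forward(self, ego_query: torch.Tensor):
        # ego_query: (B, d_model), the MHCA-conditioned ego-trajectory feature.
        B = ego_query.shape[0]
        means = torch.stack(
            [branch(ego_query).view(B, self.horizon, self.state_dim)
             for branch in self.mean_branches], dim=1)            # (B, K, T, 2)
        weights = self.weight_branch(ego_query).softmax(dim=-1)  # (B, K)
        return means, weights
```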
We train with a diffusion objective plus an additional negative log-likelihood over the predicted multimodal distributions and safety constraints: $\mathcal{L} = \mathcal{L}_{\text{plan}} + \lambda_{\text{NLL}} \mathcal{L}_{\text{NLL}} + \lambda_{c} \mathcal{L}_{\text{constraints}}$, with $\lambda_{\text{NLL}} = \lambda_{c} = 0.1$. At inference, we initialize from Gaussian noise and solve the reverse ODE with a fast single-step DPM-Solver++ sampler. The GMM head enables lightweight yet effective multimodal prediction, explicitly modeling diverse, human-like trajectories.
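A sketch of the combined objective under one loud assumption: the mixture components are isotropic Gaussians with a fixed `sigma`, which the text above does not specify; `l_plan` and `l_constraints` are assumed to be computed elsewhere.

```python
import math
import torch

def gmm_nll(means, weights, gt_traj, sigma=1.0):
    """NLL of the logged trajectory under the predicted GMM.
    means: (B, K, T, 2); weights: (B, K); gt_traj: (B, T, 2).
    Assumes isotropic Gaussian components with fixed sigma."""
    sq_err = ((means - gt_traj.unsqueeze(1)) ** 2).sum(dim=(-1, -2))  # (B, K)
    dim = gt_traj.shape[1] * gt_traj.shape[2]                         # T * 2
    log_comp = -0.5 * sq_err / sigma**2 - 0.5 * dim * math.log(
        2.0 * math.pi * sigma**2)
    # log sum_k w_k N(gt | mu_k, sigma^2 I), computed stably.
    return -torch.logsumexp(weights.log() + log_comp, dim=-1).mean()

def total_loss(l_plan, means, weights, gt_traj, l_constraints,
               lam_nll=0.1, lam_c=0.1):
    # L = L_plan + 0.1 * L_NLL + 0.1 * L_constraints, matching the weights above.
    return l_plan + lam_nll * gmm_nll(means, weights, gt_traj) + lam_c * l_constraints
```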
We augment a photorealistic 3D-reconstruction-based simulator with an interactive re-driving interface, a kinematic bicycle model for ego motion, and collision detection using depth from the reconstruction. Participants re-drive each scene five times, producing a set of realistic, diverse trajectories per scenario. We validate realism by comparing against logged real-world trajectories; a virtual digital-twin environment provides an additional reference.
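For reference, a minimal sketch of one kinematic-bicycle update for ego motion; the `wheelbase` and `dt` defaults and the center-of-gravity slip-angle formulation are illustrative assumptions, not values from the benchmark.

```python
import math
from dataclasses import dataclass

@dataclass
class EgoState:
    x: float    # position [m]
    y: float    # position [m]
    yaw: float  # heading [rad]
    v: float    # speed [m/s]

def bicycle_step(s: EgoState, accel: float, steer: float,
                 wheelbase: float = 2.7, dt: float = 0.1) -> EgoState:
    """One Euler step of a kinematic bicycle model (CoG formulation,
    equal front/rear axle distances -- an assumption)."""
    beta = math.atan(0.5 * math.tan(steer))          # slip angle at the CoG
    x = s.x + s.v * math.cos(s.yaw + beta) * dt
    y = s.y + s.v * math.sin(s.yaw + beta) * dt
    yaw = s.yaw + (s.v / (0.5 * wheelbase)) * math.sin(beta) * dt
    v = max(0.0, s.v + accel * dt)                   # no reversing, for simplicity
    return EgoState(x, y, yaw, v)
```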
Setup & Metrics. We report L2 displacement error against the single GT provided by nuScenes, and introduce a multimodal evaluation that scores each prediction by its minimum Fréchet distance to a set of 16 GT trajectories per scene (15 from our benchmark + 1 from nuScenes). We further assess distributional quality via Negative Log-Likelihood (NLL) and Speed Jensen–Shannon Divergence (JSD). For closed-loop, we follow HUGSIM with No Collision (NC), Drivable Area Compliance (DAC), Time to Collision (TTC), Comfort (COM), Route Completion (Rc), and the composite HD-Score.
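As a sketch of the multimodal scoring rule, the snippet below implements the standard discrete Fréchet distance dynamic program and takes the minimum over the 16-trajectory GT set; the trajectory shapes and helper names are assumptions for illustration.

```python
import numpy as np

def discrete_frechet(p: np.ndarray, q: np.ndarray) -> float:
    """Discrete Fréchet distance between two (T, 2) trajectories via DP."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise dists
    c = np.full((n, m), np.inf)
    c[0, 0] = d[0, 0]
    for i in range(1, n):
        c[i, 0] = max(c[i - 1, 0], d[i, 0])
    for j in range(1, m):
        c[0, j] = max(c[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            c[i, j] = max(min(c[i - 1, j], c[i, j - 1], c[i - 1, j - 1]), d[i, j])
    return float(c[-1, -1])

def min_frechet(pred: np.ndarray, gt_set: list) -> float:
    # Best match over all 16 GTs, so plausible alternatives are not penalized.
    return min(discrete_frechet(pred, gt) for gt in gt_set)
```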
BranchOut achieves L2@3s = 1.41 m (Avg 0.83 m), outperforming prior multimodal planners with compact sampling (one sample per command). Under multimodal metrics, BranchOut yields Fréchet = 2.29, NLL = 3.72, and JSD = 0.36, reflecting better coverage and human-aligned diversity.
BranchOut attains the best overall HD-Score = 0.47 with strong safety and progress: NC 0.76, DAC 0.99, TTC 0.69, COM 1.00, and Rc 0.58; its route completion in particular is substantially higher than that of unimodal baselines.
Qualitative results of BranchOut in open-loop evaluation under multimodal settings, where it achieves state-of-the-art planning accuracy and multimodal distribution coherence.
Qualitative results of BranchOut in HUGSIM, a closed-loop photorealistic simulator, where it drives robustly and outperforms baselines on all metrics.
@inproceedings{kim2025branchout,
title = {BranchOut: Capturing Realistic Multimodality in Autonomous Driving Decisions},
author = {Kim, Hee Jae and Yin, Zekai and Lai, Lei and Lee, Jason and Ohn-Bar, Eshed},
booktitle = {Proceedings of the 9th Conference on Robot Learning (CoRL)},
year = {2025}
}