nuScenes (open-loop): L2@3s 1.41 m; Avg L2 0.83 m; Fréchet 2.29; NLL 3.72; JSD 0.36
HUGSIM (closed-loop): HD-Score 0.47; NC 0.76; DAC 0.99; TTC 0.69; COM 1.00; Rc 0.58
Multimodal GT realism (photorealistic sim): 3s L2 0.79 m; Fréchet 1.46; NLL 3.48
Modeling the nuanced, multimodal nature of human driving remains a core challenge for autonomous systems: existing methods often fail to capture the diversity of plausible behaviors in complex real-world scenarios. We introduce an end-to-end planner and a benchmark for modeling realistic multimodality in autonomous driving decisions. Our Gaussian Mixture Model (GMM)-based diffusion planner explicitly captures human-like, multimodal driving decisions in diverse contexts. While the model achieves state-of-the-art performance on current benchmarks, it also exposes weaknesses in standard evaluation practices, which rely on single ground-truth trajectories or coarse closed-loop metrics and thus inadvertently penalize diverse yet plausible alternatives. We further develop a human-in-the-loop simulation benchmark that enables finer-grained evaluation and measures multimodal realism in challenging driving settings.
Real-world driving involves multiple plausible and safe decisions in a given scenario. Yet, many current planners are deterministic or exhibit mode collapse, and most datasets provide only a single ground truth per scene, penalizing other feasible alternatives. BranchOut advances both modeling and evaluation of realistic multimodality in planning.
BranchOut follows a standard end-to-end planning setup: a scene context comprises six multi-view camera images and an HD map. Unlike some prior work, we do not use ego-status or past ego-trajectory inputs by default. A transformer denoiser uses multi-head cross-attention (MHCA) to condition ego-trajectory queries on scene features; a branched GMM head then jointly predicts K mean trajectories and their mixture weights, enabling selection among plausible futures.
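To make the branched head concrete, below is a minimal PyTorch sketch, not the released implementation: the layer sizes, the two-layer mean branches, and names like `BranchedGMMHead` are illustrative assumptions; only the idea of jointly predicting K mean trajectories and mixture weights from a cross-attended ego query follows the text above.

```python
import torch
import torch.nn as nn

class BranchedGMMHead(nn.Module):
    """Hypothetical sketch: predict K mean trajectories and mixture
    weights in a single forward pass from a fused ego query."""

    def __init__(self, d_model: int = 256, num_modes: int = 6,
                 horizon: int = 6, state_dim: int = 2):
        super().__init__()
        self.num_modes, self.horizon, self.state_dim = num_modes, horizon, state_dim
        # One branch per mixture component predicts a full trajectory.
        self.mean_branches = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, horizon * state_dim))
            for _ in range(num_modes)
        ])
        # A shared linear branch scores the K mixture weights.
        self.weight_branch = nn.Linear(d_model, num_modes)

    def forward(self, ego_query: torch.Tensor):
        # ego_query: (B, d_model), the MHCA-conditioned ego-trajectory feature.
        B = ego_query.shape[0]
        means = torch.stack(
            [branch(ego_query).view(B, self.horizon, self.state_dim)
             for branch in self.mean_branches], dim=1)            # (B, K, T, 2)
        weights = self.weight_branch(ego_query).softmax(dim=-1)  # (B, K)
        return means, weights
```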
We train with a diffusion objective plus an additional negative log-likelihood over the predicted multimodal distributions and safety constraints: $\mathcal{L} = \mathcal{L}_{\text{plan}} + \lambda_{\text{NLL}} \mathcal{L}_{\text{NLL}} + \lambda_{c} \mathcal{L}_{\text{constraints}}$, with $\lambda_{\text{NLL}} = \lambda_{c} = 0.1$. At inference, we initialize from Gaussian noise and solve the reverse ODE with a fast single-step DPM-Solver++ sampler. The GMM head enables lightweight yet effective multimodal prediction, explicitly modeling diverse, human-like trajectories.
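A sketch of the combined objective under one loud assumption: the mixture components are isotropic Gaussians with a fixed `sigma`, which the text above does not specify; `l_plan` and `l_constraints` are assumed to be computed elsewhere.

```python
import math
import torch

def gmm_nll(means, weights, gt_traj, sigma=1.0):
    """NLL of the logged trajectory under the predicted GMM.
    means: (B, K, T, 2); weights: (B, K); gt_traj: (B, T, 2).
    Assumes isotropic Gaussian components with fixed sigma."""
    sq_err = ((means - gt_traj.unsqueeze(1)) ** 2).sum(dim=(-1, -2))  # (B, K)
    dim = gt_traj.shape[1] * gt_traj.shape[2]                         # T * 2
    log_comp = -0.5 * sq_err / sigma**2 - 0.5 * dim * math.log(
        2.0 * math.pi * sigma**2)
    # log sum_k w_k N(gt | mu_k, sigma^2 I), computed stably.
    return -torch.logsumexp(weights.log() + log_comp, dim=-1).mean()

def total_loss(l_plan, means, weights, gt_traj, l_constraints,
               lam_nll=0.1, lam_c=0.1):
    # L = L_plan + 0.1 * L_NLL + 0.1 * L_constraints, matching the weights above.
    return l_plan + lam_nll * gmm_nll(means, weights, gt_traj) + lam_c * l_constraints
```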
We augment a photorealistic 3D-reconstruction-based simulator with an interactive re-driving interface, a kinematic bicycle model for ego motion, and collision detection using depth from the reconstruction. Participants re-drive each scene five times, producing a set of realistic, diverse trajectories per scenario. We validate realism by comparing against logged real-world trajectories; a virtual digital-twin environment provides an additional reference.
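For reference, a minimal sketch of one kinematic-bicycle update for ego motion; the `wheelbase` and `dt` defaults and the center-of-gravity slip-angle formulation are illustrative assumptions, not values from the benchmark.

```python
import math
from dataclasses import dataclass

@dataclass
class EgoState:
    x: float    # position [m]
    y: float    # position [m]
    yaw: float  # heading [rad]
    v: float    # speed [m/s]

def bicycle_step(s: EgoState, accel: float, steer: float,
                 wheelbase: float = 2.7, dt: float = 0.1) -> EgoState:
    """One Euler step of a kinematic bicycle model (CoG formulation,
    equal front/rear axle distances -- an assumption)."""
    beta = math.atan(0.5 * math.tan(steer))          # slip angle at the CoG
    x = s.x + s.v * math.cos(s.yaw + beta) * dt
    y = s.y + s.v * math.sin(s.yaw + beta) * dt
    yaw = s.yaw + (s.v / (0.5 * wheelbase)) * math.sin(beta) * dt
    v = max(0.0, s.v + accel * dt)                   # no reversing, for simplicity
    return EgoState(x, y, yaw, v)
```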
Setup & Metrics. We report L2 displacement error against the single GT provided by nuScenes, and introduce a multimodal evaluation that scores each prediction by its minimum Fréchet distance to a set of 16 GT trajectories per scene (15 from our benchmark + 1 from nuScenes). We further assess distributional quality via Negative Log-Likelihood (NLL) and Speed Jensen–Shannon Divergence (JSD). For closed-loop, we follow HUGSIM with No Collision (NC), Drivable Area Compliance (DAC), Time to Collision (TTC), Comfort (COM), Route Completion (Rc), and the composite HD-Score.
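As a sketch of the multimodal scoring rule, the snippet below implements the standard discrete Fréchet distance dynamic program and takes the minimum over the 16-trajectory GT set; the trajectory shapes and helper names are assumptions for illustration.

```python
import numpy as np

def discrete_frechet(p: np.ndarray, q: np.ndarray) -> float:
    """Discrete Fréchet distance between two (T, 2) trajectories via DP."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise dists
    c = np.full((n, m), np.inf)
    c[0, 0] = d[0, 0]
    for i in range(1, n):
        c[i, 0] = max(c[i - 1, 0], d[i, 0])
    for j in range(1, m):
        c[0, j] = max(c[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            c[i, j] = max(min(c[i - 1, j], c[i, j - 1], c[i - 1, j - 1]), d[i, j])
    return float(c[-1, -1])

def min_frechet(pred: np.ndarray, gt_set: list) -> float:
    # Best match over all 16 GTs, so plausible alternatives are not penalized.
    return min(discrete_frechet(pred, gt) for gt in gt_set)
```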
BranchOut achieves L2@3s = 1.41 m (Avg 0.83 m), outperforming prior multimodal planners with compact sampling (one sample per command). Under multimodal metrics, BranchOut yields Fréchet = 2.29, NLL = 3.72, and JSD = 0.36, reflecting better coverage and human-aligned diversity.
BranchOut attains the best overall HD-Score = 0.47 with strong safety and progress: NC 0.76, DAC 0.99, TTC 0.69, COM 1.00, and Rc 0.58; its route completion in particular is substantially higher than that of unimodal baselines.
Qualitative results of BranchOut in open-loop evaluation under multimodal settings, where it achieves state-of-the-art planning accuracy and multimodal distribution coherence.
Qualitative results of BranchOut in HUGSIM, a closed-loop photorealistic simulator, where it drives robustly and outperforms baselines on all metrics.
@inproceedings{kim2025branchout,
title = {BranchOut: Capturing Realistic Multimodality in Autonomous Driving Decisions},
author = {Kim, Hee Jae and Yin, Zekai and Lai, Lei and Lee, Jason and Ohn-Bar, Eshed},
booktitle = {Proceedings of the 9th Conference on Robot Learning (CoRL)},
year = {2025}
}