Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving

1 Autolab, Westlake University    2 UDEER.AI    3 Zhejiang University
*Indicates Equal Contribution

Project Video for CoRL 2025

Abstract

Recent breakthroughs in large language models (LLMs) have not only advanced natural language processing but also inspired their application in domains with structurally similar problems—most notably, autonomous driving motion generation. Both domains involve autoregressive sequence modeling, token-based representations, and context-aware decision making, making the transfer of LLM components a natural and increasingly common practice. However, despite promising early attempts, a systematic understanding of which LLM modules are truly transferable remains lacking. In this paper, we present a comprehensive evaluation of five key LLM modules—tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation—within the context of motion generation for autonomous driving. Through extensive experiments on the Waymo Sim Agents benchmark, we demonstrate that, when appropriately adapted, these modules can significantly improve performance for autonomous driving motion generation. In addition, we identify which techniques can be effectively transferred, analyze the potential reasons for the failure of others, and discuss the specific adaptations needed for autonomous driving scenarios. Our approach achieves competitive results on the Sim Agents task.

Motivation and Method

LLM to AD Pipeline Comparison

As illustrated in the figure above, the technical pipeline of autonomous driving motion generation bears a notable resemblance to that of large language models (LLMs). This observation naturally raises the question: which modules, proven effective in LLMs, can be transferred directly to motion generation for autonomous driving, and which require domain-specific adaptation? In this work, we systematically investigate five core components: tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation. Our key finding is that, despite the differences in application domains, several technical modules transfer effectively from LLMs to motion generation tasks.

Motion Generator Architecture

We design a GPT-like trajectory generation model that predicts the next motion token autoregressively and iteratively constructs complete trajectories. To maintain awareness of the map and surrounding agents throughout the generation process, our model employs multiple attention mechanisms during inference (see the sketch after this list). Specifically, we perform:
(1) self-attention over each agent's motion tokens across different time steps;
(2) self-attention between different agents at the same time step;
(3) cross-attention over the static map context; and
(4) cross-attention over non-predicted agents that are excluded from the GPT input during rollout.
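
To make this structure concrete, the following is a minimal PyTorch sketch of one decoder block that applies the four attention patterns in sequence. The class name MotionDecoderBlock, the tensor shapes, and the hyperparameters (dim, num_heads) are illustrative assumptions for this sketch, not the implementation used in the paper.

import torch
import torch.nn as nn


class MotionDecoderBlock(nn.Module):
    """One decoder block combining the four attention patterns described above.

    Assumed (hypothetical) tensor shapes:
        x       : [B, A, T, D]  motion-token embeddings of the predicted agents
        map_kv  : [B, M, D]     encoded static map tokens
        ctx_kv  : [B, C, D]     encoded tokens of non-predicted (context) agents
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.agent_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.map_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(5)])
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, map_kv, ctx_kv):
        B, A, T, D = x.shape

        # (1) Causal self-attention over each agent's own tokens across time.
        h = x.reshape(B * A, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        q = self.norm[0](h)
        h = h + self.temporal_attn(q, q, q, attn_mask=causal, need_weights=False)[0]

        # (2) Self-attention between agents that share the same time step.
        h = h.reshape(B, A, T, D).transpose(1, 2).reshape(B * T, A, D)
        q = self.norm[1](h)
        h = h + self.agent_attn(q, q, q, need_weights=False)[0]
        h = h.reshape(B, T, A, D).transpose(1, 2).reshape(B, A * T, D)

        # (3) Cross-attention from the motion tokens to the static map context.
        h = h + self.map_attn(self.norm[2](h), map_kv, map_kv, need_weights=False)[0]

        # (4) Cross-attention to non-predicted agents excluded from the GPT input.
        h = h + self.ctx_attn(self.norm[3](h), ctx_kv, ctx_kv, need_weights=False)[0]

        # Position-wise feed-forward network with a residual connection.
        h = h + self.ffn(self.norm[4](h))
        return h.reshape(B, A, T, D)

During rollout, a stack of such blocks would process the tokens generated so far, and the hidden state at the latest time step would be decoded into the next motion token for each predicted agent, which is then appended to the sequence and fed back into the model.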

More Demonstrations

Scenario 1: Parking Maneuvers

This scenario demonstrates our model's capability to handle complex parking maneuvers, including negotiating tight encounters in parking lots, exiting and entering parking spaces, maintaining precise control in narrow lanes, and avoiding collisions effectively.

Scenario 2: Multi-Agent Interactions

This scenario showcases our model's ability to handle complex interactions among multiple agents, including cooperative driving, negotiating intersections, and maintaining safe distances in dense traffic.

Scenario 3: Lane Change Maneuvers

This scenario showcases the model's ability to accurately represent and predict complex lane-changing behaviors, including merging into traffic, overtaking slower vehicles, and responding to cut-in maneuvers.

Scenario 4: Narrow Road Navigation

This scenario highlights our model's ability to navigate through narrow roads and tight spaces, demonstrating precise control and strong spatial awareness in complex urban environments. The model successfully handles challenging maneuvers such as navigating around parked vehicles, entering confined road segments, and threading between closely spaced cars.