Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance

Wenqian Cui¹, Lei Zhu², Xiaohui Li², Zhihan Guo¹, Haoli Bai², Lu Hou², Irwin King¹
¹The Chinese University of Hong Kong, ²Huawei Technologies
[Paper] [Code] [Demo]

Abstract

Full-Duplex Speech Language Models (FD-SLMs) are specialized speech foundation models designed to enable natural, real-time spoken interactions by capturing complex conversational dynamics, such as interruptions, backchannels, and overlapping speech. While cascaded FD-SLMs rely on external modules to learn discrete, predefined behaviors for duplex communications, end-to-end (e2e) FD-SLMs leverage real-world conversational data, enabling models to capture nuanced dialogue patterns for more human-like interactions, a key advantage that motivates our focus on e2e systems. However, e2e FD-SLMs face a significant challenge: their conversational abilities often degrade compared to text-based Large Language Models due to the prolonged nature of speech sequences and the scarcity of high-quality spoken dialogue data. To address this, we propose a novel planning-inspired methodology, TurnGuide, for integrating turn-level text guidance into double-channel spoken conversational contexts. Our approach dynamically segments the assistant's speech into dialogue turns and trains the assistant to first output text guidance for each turn before generating the corresponding speech. This not only aligns with conversational flow but also addresses the critical issues of timing and length of text guidance. Extensive experiments demonstrate that our method significantly enhances e2e FD-SLMs' ability to generate semantically meaningful and coherent speech, while preserving the natural flow of full-duplex spoken dialogues.
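
To make the turn-level idea concrete, the following is a minimal illustrative sketch, not the implementation used in the paper. It assumes word-level timestamps for the assistant channel are available, and it uses a simple silence-gap threshold to split turns. The names `Word`, `segment_turns`, `build_guided_sequence`, and the `max_gap` parameter are all hypothetical, and the speech segment is represented only by its time span rather than by real speech tokens.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Word:
    """One assistant word with timestamps in seconds (assumed input format)."""
    text: str
    start: float
    end: float


def segment_turns(words: List[Word], max_gap: float = 0.5) -> List[List[Word]]:
    """Group words into turns; a silence longer than max_gap starts a new turn.

    The gap heuristic is an assumption for illustration only.
    """
    turns: List[List[Word]] = []
    for w in words:
        if turns and w.start - turns[-1][-1].end <= max_gap:
            turns[-1].append(w)
        else:
            turns.append([w])
    return turns


Item = Tuple[str, Union[str, Tuple[float, float]]]


def build_guided_sequence(words: List[Word], max_gap: float = 0.5) -> List[Item]:
    """Emit, per turn, the text guidance first and then the speech span."""
    seq: List[Item] = []
    for turn in segment_turns(words, max_gap):
        seq.append(("text", " ".join(w.text for w in turn)))
        # Placeholder for the turn's speech tokens: just its (start, end) span.
        seq.append(("speech", (turn[0].start, turn[-1].end)))
    return seq


if __name__ == "__main__":
    demo = [Word("hi", 0.0, 0.3), Word("there", 0.35, 0.7), Word("okay", 2.0, 2.4)]
    for item in build_guided_sequence(demo):
        print(item)
```

In this sketch each turn contributes a text item followed by its speech item, which mirrors the "text guidance first, then speech" ordering described above; how turns are actually detected and how text and speech tokens are aligned in the model is specified in the paper, not here.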

Audio Demonstrations

Note: The first 30 seconds of audio serve as the prompt to the model, and the model generates a 90-second continuation. SCI and Moshi TS stand for Speech Chunk Interleaving and Moshi Training Strategy, respectively; both are baselines in our paper.
[Audio table, samples 0 and 1. Columns: ID; Original Speech (Prompt); Speech Continuations (dGSLM, SCI, Moshi TS, TurnGuide (Ours), Ground Truth).]
