Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance

Abstract

Full-Duplex Speech Language Models (FD-SLMs) are specialized speech foundation models designed to enable natural, real-time spoken interactions by capturing complex conversational dynamics, such as interruptions, backchannels, and overlapping speech. While cascaded FD-SLMs rely on external modules to learn discrete, predefined behaviors for duplex communication, end-to-end (e2e) FD-SLMs leverage real-world conversational data, enabling models to capture nuanced dialogue patterns for more human-like interactions, a key advantage that motivates our focus on e2e systems. However, e2e FD-SLMs face a significant challenge: their conversational abilities often degrade compared to text-based Large Language Models due to the prolonged nature of speech sequences and the scarcity of high-quality spoken dialogue data. To address this, we propose a novel planning-inspired methodology, TurnGuide, for integrating turn-level text guidance into double-channel spoken conversational contexts. Our approach dynamically segments the assistant's speech into dialogue turns and trains the assistant to first output text guidance for each turn before generating the corresponding speech. This not only aligns with conversational flow but also addresses the critical issues of timing and length of text guidance. Extensive experiments demonstrate that our method significantly enhances e2e FD-SLMs' ability to generate semantically meaningful and coherent speech, while preserving the natural flow of full-duplex spoken dialogues.
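
As a rough illustration of the turn-level idea described in the abstract (text guidance emitted before the speech of each assistant turn), the following is a minimal sketch and not the paper's implementation. The `Turn` class, the `<text>` / `</text>` markers, and the `interleave_turns` function are hypothetical names introduced here for illustration only. The sketch covers the assistant channel alone and leaves out turn segmentation, the user channel, and timing.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Turn:
    """One assistant dialogue turn (hypothetical container, not from the paper)."""
    text: List[str]    # text guidance tokens for this turn
    speech: List[int]  # discrete speech unit ids for this turn


# Hypothetical boundary markers around each turn's text guidance.
TEXT_START = "<text>"
TEXT_END = "</text>"


def interleave_turns(turns: List[Turn]) -> List[Union[str, int]]:
    """Place each turn's text guidance directly before that turn's speech units."""
    sequence: List[Union[str, int]] = []
    for turn in turns:
        sequence.append(TEXT_START)
        sequence.extend(turn.text)    # the "plan" for the turn comes first
        sequence.append(TEXT_END)
        sequence.extend(turn.speech)  # then the speech that realizes it
    return sequence


example_turns = [
    Turn(text=["hi", "there"], speech=[11, 12, 13]),
    Turn(text=["ok"], speech=[21]),
]
example_sequence = interleave_turns(example_turns)
```

In this toy layout, the text for a turn is bounded by that turn's content, which is one simple way to picture how turn-level guidance constrains both when the text appears and how long it is.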

Audio Demonstrations

Note: The first 30 seconds of each audio clip serve as the prompt to the model, and the model generates a 90-second continuation. SCI and Moshi TS stand for Speech Chunk Interleaving and Moshi Training Strategy, both of which are baselines in our paper.

ID	Original Speech	Speech Continuations
ID	Prompt	dGSLM	SCI	Moshi TS	TurnGuide (Ours)	Ground Truth
0	[audio]	[audio]	[audio]	[audio]	[audio]	[audio]
1	[audio]	[audio]	[audio]	[audio]	[audio]	[audio]
