Full-Duplex Speech Language Models (FD-SLMs) are specialized speech foundation models designed to enable natural, real-time spoken interactions by capturing complex conversational dynamics, such as interruptions, backchannels, and overlapping speech. While cascaded FD-SLMs rely on external modules to learn discrete, predefined behaviors for duplex communication, end-to-end (e2e) FD-SLMs learn from real-world conversational data, which lets them capture nuanced dialogue patterns for more human-like interactions. This key advantage motivates our focus on e2e systems. However, e2e FD-SLMs face a significant challenge: their conversational abilities are often weaker than those of text-based Large Language Models, because speech sequences are long and high-quality spoken dialogue data is scarce. To address this, we propose a novel planning-inspired methodology, TurnGuide, for integrating turn-level text guidance into double-channel spoken conversational contexts. Our approach dynamically segments the assistant's speech into dialogue turns and trains the assistant to first output text guidance for each turn before generating the corresponding speech. This aligns the guidance with the conversational flow and addresses two critical issues: when text guidance should be emitted and how long it should be. Extensive experiments demonstrate that our method significantly enhances e2e FD-SLMs' ability to generate semantically meaningful and coherent speech, while preserving the natural flow of full-duplex spoken dialogues.
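The turn-level idea can be illustrated with a minimal sketch. It is not the TurnGuide implementation: it assumes the assistant channel comes with word-level timestamps and that a new turn starts whenever the silence between consecutive words exceeds a threshold. The names `Word`, `segment_turns`, `build_guidance`, and the `max_gap` value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # word onset in seconds
    end: float    # word offset in seconds


def segment_turns(words: List[Word], max_gap: float = 0.5) -> List[List[Word]]:
    """Group time-aligned words into turns.

    Assumption for this sketch: a silence longer than `max_gap` seconds
    between two consecutive words marks a turn boundary.
    """
    turns: List[List[Word]] = []
    for w in words:
        if turns and w.start - turns[-1][-1].end <= max_gap:
            turns[-1].append(w)   # continue the current turn
        else:
            turns.append([w])     # start a new turn
    return turns


def build_guidance(words: List[Word], max_gap: float = 0.5) -> List[Tuple[float, str]]:
    """Return one (turn start time, turn text) pair per turn.

    The text of each pair plays the role of the guidance that would be
    emitted before the speech of that turn.
    """
    return [
        (turn[0].start, " ".join(w.text for w in turn))
        for turn in segment_turns(words, max_gap)
    ]
```

In this sketch, each pair ties a piece of text to the moment its turn begins, so the guidance covers one turn only and is placed right before that turn's speech.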
ID | Original Speech: Prompt | Continuation: dGSLM | Continuation: SCI | Continuation: Moshi TS | Continuation: TurnGuide (Ours) | Continuation: Ground Truth
---|---|---|---|---|---|---
0 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]