DLLM-Searcher: Adapting Diffusion Large Language Models for Search Agents
DLLM-Searcher decodes the tool-call region before the think region, enabling the model to keep thinking while waiting for the tool response.
`<tool_call>` and `</tool_call>` are decoded at step 1, while the `<think>` content appears only after step 32.
Recently, Diffusion Large Language Models (dLLMs) have demonstrated unique efficiency advantages, enabled by their inherently parallel decoding mechanism and flexible generation paradigm. Meanwhile, despite the rapid advancement of Search Agents, their practical deployment is constrained by a fundamental limitation, the Latency Challenge: under the ReAct agent paradigm, multi-round reasoning, tool calling, and waiting for tool responses execute strictly in sequence, inducing severe end-to-end latency.
In this paper, we propose DLLM-Searcher, an optimization framework for dLLM-based Search Agents. To address the Agent Ability Challenge, i.e., the limited agentic capability of off-the-shelf dLLMs, we design a two-stage post-training pipeline encompassing Agentic SFT and Agentic VRPO, which enhances the backbone dLLM's information-seeking and reasoning capabilities.
To mitigate the Latency Challenge, we propose a novel agent paradigm termed P-ReAct (Parallel-Reasoning and Acting). P-ReAct guides the model to prioritize decoding the `tool_call` instruction, allowing it to keep thinking while waiting for the tool response.
Experimental results demonstrate that DLLM-Searcher achieves performance comparable to mainstream LLM-based search agents, while P-ReAct delivers approximately 15% inference acceleration.
Figure: Overview of DLLM-Searcher, covering both training and inference. In training, Agentic SFT and Agentic VRPO both use Block Attention and Agentic Noising to compute the Agentic ELBO. In inference, we employ the P-ReAct agent paradigm with token pre-filling and confidence biasing.
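The training code itself is not reproduced on this page. Purely as a hedged sketch of what Agentic Noising paired with a masked-diffusion ELBO could look like, assuming a LLaDA-style objective in which only model-generated spans (think, tool_call, and answer tokens) are eligible for masking while the prompt and tool responses stay clean; Block Attention is omitted, and `agentic_noising`, `gen_mask`, and the mask-token id are illustrative names, not the released implementation:

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # assumed [MASK] token id; set this to your backbone dLLM's mask token


def agentic_noising(input_ids: torch.Tensor, gen_mask: torch.Tensor, mask_id: int = MASK_ID):
    """Forward masking that only corrupts model-generated spans.

    input_ids: (B, L) full trajectory (prompt, think/tool_call, tool response, answer).
    gen_mask:  (B, L) bool, True where the token was produced by the model;
               prompt and tool-response tokens are never masked.
    """
    b, l = input_ids.shape
    t = torch.rand(b, 1, device=input_ids.device)  # per-sequence masking ratio in (0, 1)
    is_masked = (torch.rand(b, l, device=input_ids.device) < t) & gen_mask
    noisy_ids = torch.where(is_masked, torch.full_like(input_ids, mask_id), input_ids)
    return noisy_ids, is_masked, t


def agentic_elbo_loss(model, input_ids: torch.Tensor, gen_mask: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of a masked-diffusion ELBO restricted to generated spans."""
    noisy_ids, is_masked, t = agentic_noising(input_ids, gen_mask)
    logits = model(noisy_ids).logits                                            # (B, L, V)
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")   # (B, L)
    # Standard 1/t reweighting of the masked positions, averaged over generated tokens.
    return (ce * is_masked / t).sum() / gen_mask.sum().clamp(min=1)
```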
Table: Comparison with traditional RAG methods, LLM agents, and dLLM agents
| Models | HotpotQA† ACC_R | HotpotQA† ACC_L | 2Wiki† ACC_R | 2Wiki† ACC_L | Bamboogle‡ ACC_R | Bamboogle‡ ACC_L | Musique† ACC_R | Musique† ACC_L | Avg ACC_R | Avg ACC_L |
|---|---|---|---|---|---|---|---|---|---|---|
| Traditional RAG | | | | | | | | | | |
| SuRe | 32.4 | 48.4 | 22.2 | 26.8 | 17.6 | 28.0 | 7.2 | 10.0 | 19.9 | 28.3 |
| Selective-Context | 33.2 | 43.4 | 27.4 | 29.6 | 15.2 | 20.8 | 5.8 | 8.8 | 20.4 | 25.7 |
| Adaptive-RAG | 38.0 | 47.4 | 27.8 | 25.8 | 21.6 | 25.0 | 7.2 | 11.6 | 23.7 | 27.5 |
| IRCoT | 48.8 | 55.2 | 41.0 | 38.6 | 32.0 | 39.2 | 11.6 | 15.8 | 33.4 | 37.2 |
| Iter-RetGen | 41.6 | 54.4 | 32.4 | 34.4 | 26.4 | 32.0 | 14.8 | 18.2 | 28.8 | 34.8 |
| CR-Planner | 44.4 | 33.6 | 48.2 | 22.0 | 35.2 | 34.4 | 12.2 | 11.4 | 35.0 | 25.4 |
| ReARTeR | 46.8 | 50.6 | 55.4 | 53.4 | 49.6 | 54.4 | 29.6 | 30.2 | 45.4 | 47.2 |
| ARM-based LLMs Agent | | | | | | | | | | |
| Search-o1 | 40.8 | 53.2 | 47.0 | 51.2 | 49.6 | 52.0 | 15.2 | 19.0 | 38.2 | 43.9 |
| Search-R1 | 49.6 | 62.2 | 46.0 | 50.0 | 47.2 | 56.0 | 28.0 | 26.0 | 42.7 | 48.6 |
| WebSailor* | 50.4 | 52.4 | 59.4 | 61.4 | 57.6 | 65.6 | 22.0 | 28.0 | 47.4 | 51.9 |
| R1Searcher* | 58.0 | 62.2 | 59.6 | 63.4 | 66.4 | 68.8 | 28.2 | 31.4 | 53.1 | 56.5 |
| dLLMs Agent | | | | | | | | | | |
| SDAR | / | / | / | / | / | / | / | / | / | / |
| Dream | 11.0 | 11.6 | 13.6 | 12.0 | 12.0 | 13.6 | 3.8 | 3.2 | 10.1 | 10.1 |
| LLaDA | 36.0 | 32.8 | 42.0 | 38.8 | 46.4 | 42.4 | 15.2 | 15.8 | 34.9 | 32.5 |
| DLLM-Searcher | 60.4 | 62.4 | 69.8 | 64.6 | 68.8 | 69.6 | 29.0 | 29.8 | 57.0 | 56.6 |
The traditional ReAct paradigm suffers from sequential execution: the model must finish reasoning before it can issue a tool call, and then sits idle waiting for the tool response. This creates a significant latency bottleneck in multi-turn agent interactions.
P-ReAct addresses this by prioritizing `<tool_call>` token generation at early decoding steps.

Key Insight: By decoding tool calls first, the model can continue thinking while external APIs execute, effectively overlapping computation with I/O wait time.
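How much of this wait actually gets hidden is an inference-loop question rather than a modeling one. The toy sketch below is not the authors' runtime code; `fake_search`, `decode_turn`, the step count, and the timings are all invented for illustration. It only shows the overlap pattern: once the tool-call region is complete at an early denoising step, the search request is dispatched on a worker thread while the remaining steps keep decoding.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_search(query: str) -> str:
    """Stand-in for the search tool; simulates retrieval / network latency."""
    time.sleep(1.0)
    return f"results for: {query}"


def decode_turn(n_steps: int = 32, tool_ready_at: int = 1) -> str:
    """Toy P-ReAct turn: the tool call is complete at an early step and fired
    immediately, while later steps keep filling in the <think> region."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for step in range(n_steps):
            time.sleep(0.05)  # one parallel-decoding (denoising) step of the dLLM
            if step == tool_ready_at and pending is None:
                # <tool_call>...</tool_call> is already decoded thanks to
                # token pre-filling and confidence biasing
                pending = pool.submit(fake_search, "multi-hop question")
        # by the time the think/answer region is finished, the response is usually back
        return pending.result()


print(decode_turn())
```

In a real P-ReAct loop the dispatch condition would be the tool-call block being fully unmasked rather than a fixed step index.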
Only 11 lines of code implement the complete token pre-filling and confidence biasing:
```python
elif 'toolcall_pre_rl' in seq.remasking_strategy:
    if seq.current_denoising_step == 0:
        # Token pre-filling: unmask the tool-call delimiters at the very first step.
        seq_x0[your_tool_end] = 151658  # </tool_call>
        transfer_index[your_tool_end] = True
        seq_x0[your_tool_start] = 151657  # <tool_call>
        transfer_index[your_tool_start] = True
    else:
        # Confidence biasing: boost still-masked positions inside the tool-call region
        # so they win the top-k transfer and are decoded before the <think> content.
        confidence = torch.where(mask_index, seq_x0_p, -np.inf)
        confidence[your_tool_start:your_tool_end + 1] = \
            confidence[your_tool_start:your_tool_end + 1] + 0.5
        _, top_indices = torch.topk(confidence, num_to_transfer)
        transfer_index[top_indices] = True
```
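The hard-coded ids 151657 and 151658 are the tokenizer's special tokens for `<tool_call>` and `</tool_call>`, and the 0.5 bonus is a fixed heuristic; a different backbone or chat template would presumably require adjusting both.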
@misc{zhao2026dllmsearcheradaptingdiffusionlarge,
title={DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents},
author={Jiahao Zhao and Shaoxuan Xu and Zhongxiang Sun and Fengqi Zhu and Jingyang Ou and Yuling Shi and Chongxuan Li and Xiao Zhang and Jun Xu},
year={2026},
eprint={2602.07035},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.07035},
}