DLLM-Searcher

Adapting Diffusion Large Language Models for Search Agents

  • ~15% Inference Acceleration
  • SFT & VRPO: Agentic Post-Training
  • P-ReAct: New Agent Paradigm

One P-ReAct Iteration

DLLM-Searcher decodes the tool-call region before the think region, enabling the model to keep thinking while waiting for the tool response.

Interactive Decoding Process

💡 Notice: <tool_call> and </tool_call> are decoded at step 1, while <think> content appears after step 32.
Decoding order (first_unmask_times): each cell shows the step at which a token is first unmasked.

📋 Abstract

Recently, Diffusion Large Language Models (dLLMs) have demonstrated unique efficiency advantages, enabled by their inherently parallel decoding mechanism and flexible generation paradigm. Meanwhile, despite the rapid advancement of Search Agents, their practical deployment is constrained by a fundamental limitation, the Latency Challenge: under the ReAct agent paradigm, multi-round reasoning, tool calling, and waiting for tool responses execute serially, inducing severe end-to-end latency.

In this paper, we propose DLLM-Searcher, an optimization framework for dLLM-based Search Agents. To address the Agent Ability Challenge, we design a two-stage post-training pipeline comprising Agentic SFT and Agentic VRPO, which enhances the backbone dLLM's information-seeking and reasoning capabilities.

To mitigate the Latency Challenge, we propose a novel agent paradigm termed P-ReAct (Parallel-Reasoning and Acting). P-ReAct guides the model to prioritize decoding tool_call instructions, thereby allowing the model to keep thinking while waiting for the tool response.

Experimental results demonstrate that DLLM-Searcher achieves performance comparable to mainstream LLM-based search agents, and that P-ReAct delivers approximately 15% inference acceleration.

🏗️ Architecture Overview

DLLM-Searcher Architecture

Figure: DLLM-Searcher comprises a training process and an inference process. In training, both Agentic SFT and Agentic VRPO use Block Attention and Agentic Noising to compute the Agentic ELBO. In inference, we employ the P-ReAct agent paradigm with token pre-filling and confidence biasing.
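For reference, a masked-diffusion ELBO of the kind used to train dLLMs such as LLaDA can be written as below, where $x_t$ is obtained from $x_0$ by independently masking each token with probability $t$. The second form is only a sketch of what an "Agentic ELBO" could look like under the assumption that Agentic Noising masks only agent-generated spans (think, tool_call, answer) while keeping the prompt and tool responses clean; the set $\mathcal{A}$ of agent-generated positions is our notation, not the paper's.

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{t\sim\mathcal{U}(0,1],\;x_0,\;x_t}\!\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\!\big[x_t^{i}=\texttt{[MASK]}\big]\,\log p_\theta\!\big(x_0^{i}\mid x_t\big)\right],
\qquad
\mathcal{L}_{\text{agentic}}(\theta) = -\,\mathbb{E}\!\left[\frac{1}{t}\sum_{i\in\mathcal{A}}\mathbf{1}\!\big[x_t^{i}=\texttt{[MASK]}\big]\,\log p_\theta\!\big(x_0^{i}\mid x_t\big)\right].
$$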

📊 Experimental Results

Performance on Multi-hop QA Benchmarks

Comparison with traditional RAG methods, LLM agents, and dLLM agents

Each cell reports ACCR / ACCL.

| Models | HotpotQA | 2Wiki | Bamboogle | Musique | Avg |
| --- | --- | --- | --- | --- | --- |
| *Traditional RAG* | | | | | |
| SuRe | 32.4 / 48.4 | 22.2 / 26.8 | 17.6 / 28.0 | 7.2 / 10.0 | 19.9 / 28.3 |
| Selective-Context | 33.2 / 43.4 | 27.4 / 29.6 | 15.2 / 20.8 | 5.8 / 8.8 | 20.4 / 25.7 |
| Adaptive-RAG | 38.0 / 47.4 | 27.8 / 25.8 | 21.6 / 25.0 | 7.2 / 11.6 | 23.7 / 27.5 |
| IRCoT | 48.8 / 55.2 | 41.0 / 38.6 | 32.0 / 39.2 | 11.6 / 15.8 | 33.4 / 37.2 |
| Iter-RetGen | 41.6 / 54.4 | 32.4 / 34.4 | 26.4 / 32.0 | 14.8 / 18.2 | 28.8 / 34.8 |
| CR-Planner | 44.4 / 33.6 | 48.2 / 22.0 | 35.2 / 34.4 | 12.2 / 11.4 | 35.0 / 25.4 |
| ReARTeR | 46.8 / 50.6 | 55.4 / 53.4 | 49.6 / 54.4 | 29.6 / 30.2 | 45.4 / 47.2 |
| *ARM-based LLM Agents* | | | | | |
| Search-o1 | 40.8 / 53.2 | 47.0 / 51.2 | 49.6 / 52.0 | 15.2 / 19.0 | 38.2 / 43.9 |
| Search-R1 | 49.6 / 62.2 | 46.0 / 50.0 | 47.2 / 56.0 | 28.0 / 26.0 | 42.7 / 48.6 |
| WebSailor* | 50.4 / 52.4 | 59.4 / 61.4 | 57.6 / 65.6 | 22.0 / 28.0 | 47.4 / 51.9 |
| R1Searcher* | 58.0 / 62.2 | 59.6 / 63.4 | 66.4 / 68.8 | 28.2 / 31.4 | 53.1 / 56.5 |
| *dLLM Agents* | | | | | |
| SDAR | / | / | / | / | / |
| Dream | 11.0 / 11.6 | 13.6 / 12.0 | 12.0 / 13.6 | 3.8 / 3.2 | 10.1 / 10.1 |
| LLaDA | 36.0 / 32.8 | 42.0 / 38.8 | 46.4 / 42.4 | 15.2 / 15.8 | 34.9 / 32.5 |
| DLLM-Searcher | 60.4 / 62.4 | 69.8 / 64.6 | 68.8 / 69.6 | 29.0 / 29.8 | 57.0 / 56.6 |

⚖️ P-ReAct vs ReAct

P-ReAct vs ReAct Comparison

Why P-ReAct is Better

The traditional ReAct paradigm suffers from sequential execution: the model must complete its reasoning before making a tool call, then wait idly for the tool response. This creates significant latency bottlenecks in multi-turn agent interactions.

P-ReAct addresses this by:

  • Prioritizing <tool_call> token generation at early decoding steps
  • Enabling parallel reasoning while waiting for tool API responses

Key Insight: By decoding tool calls first, the model can continue thinking while external APIs execute, effectively overlapping computation with I/O wait time.
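
The overlap can be pictured with a small, hypothetical sketch; the names call_search_tool, denoise_step, and p_react_iteration are illustrative stand-ins, not the released implementation. The tool request is dispatched as soon as the tool-call region is decoded, and the remaining denoising steps for the <think> region run while the request is in flight.

overlap_sketch.py (hypothetical)
from concurrent.futures import ThreadPoolExecutor
import time

def call_search_tool(query: str) -> str:
    # Stand-in for a real search API; this is where a ReAct agent would
    # otherwise sit idle waiting on network I/O.
    time.sleep(1.0)
    return f"results for: {query}"

def denoise_step(state: dict) -> dict:
    # Stand-in for one dLLM denoising step over the <think> region.
    state["steps_done"] += 1
    time.sleep(0.05)  # pretend compute
    return state

def p_react_iteration(num_steps: int = 32) -> dict:
    state = {"steps_done": 0, "tool_response": None}
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Step 1: the tool-call region is unmasked first (token pre-filling
        # plus confidence biasing), so the query is available immediately.
        query = "example query"
        future = pool.submit(call_search_tool, query)  # fire off the I/O

        # Steps 2..N: keep denoising the <think> region while the tool runs.
        while state["steps_done"] < num_steps:
            state = denoise_step(state)

        # Join: by now the tool response has usually already arrived.
        state["tool_response"] = future.result()
    return state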

🔬 Analysis Experiments

dLLMs with P-ReAct vs ReAct

Ablation Study

dLLMs vs LLMs with P-ReAct

Latency Analysis

💻 P-ReAct Implementation

Only 11 lines of code are needed to implement the complete token pre-filling and confidence biasing.

scheduler.py
elif 'toolcall_pre_rl' in seq.remasking_strategy:
    if seq.current_denoising_step == 0:
        # Token pre-filling: unmask the tool-call delimiters at step 0 so the
        # tool-call region is decoded before the think region.
        seq_x0[your_tool_end] = 151658    # </tool_call> token id
        transfer_index[your_tool_end] = True
        seq_x0[your_tool_start] = 151657  # <tool_call> token id
        transfer_index[your_tool_start] = True
    else:
        # Confidence biasing: keep only still-masked positions, then boost the
        # confidence of tokens inside the tool-call region by 0.5 so they are
        # unmasked earlier than the surrounding think tokens.
        confidence = torch.where(mask_index, seq_x0_p, -np.inf)
        confidence[your_tool_start:your_tool_end + 1] = \
            confidence[your_tool_start:your_tool_end + 1] + 0.5
        # Unmask the top-k most confident positions at this denoising step.
        _, top_indices = torch.topk(confidence, num_to_transfer)
        transfer_index[top_indices] = True

📝 Citation

BibTeX

@misc{zhao2026dllmsearcheradaptingdiffusionlarge,
  title={DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents},
  author={Jiahao Zhao and Shaoxuan Xu and Zhongxiang Sun and Fengqi Zhu and Jingyang Ou and Yuling Shi and Chongxuan Li and Xiao Zhang and Jun Xu},
  year={2026},
  eprint={2602.07035},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.07035},
}
      

🙏 Acknowledgements

We sincerely thank the authors of the following open-source repositories for their efforts:

Training Frameworks: TraceRL, ESPO
Evaluation & Serving: WebSailor, R1Searcher
Base Models: LLaDA, Dream7B, SDAR