DLLM-Searcher

Adapting Diffusion Large Language Models for Search Agents

  • ~15% Inference Acceleration
  • SFT & VRPO: Agentic Post-Training
  • P-ReAct: New Agent Paradigm

One P-ReAct Iteration

DLLM-Searcher decodes the tool-call region before the think region, enabling the model to keep thinking while waiting for the tool response.

Interactive Decoding Process

💡 Notice: <tool_call> and </tool_call> are decoded at step 1, while <think> content appears after step 32.
Decoding order (first_unmask_times): each cell shows the step at which a token is first unmasked.

📋 Abstract

Recently, Diffusion Large Language Models (dLLMs) have demonstrated unique efficiency advantages, enabled by their inherently parallel decoding mechanism and flexible generation paradigm. Meanwhile, despite the rapid advancement of Search Agents, their practical deployment is constrained by a fundamental limitation, the Latency Challenge: under the ReAct agent paradigm, multi-round reasoning, tool calling, and waiting for tool responses execute serially, inducing severe end-to-end latency.

In this paper, we propose DLLM-Searcher, an optimization framework for dLLM-based Search Agents. To address the Agent Ability Challenge, we design a two-stage post-training pipeline comprising Agentic SFT and Agentic VRPO, which enhances the backbone dLLM's information-seeking and reasoning capabilities.

To mitigate the Latency Challenge, we propose a novel agent paradigm termed P-ReAct (Parallel-Reasoning and Acting). P-ReAct guides the model to prioritize decoding tool_call instructions, thereby allowing the model to keep thinking while waiting for the tool response.

Experimental results demonstrate that DLLM-Searcher achieves performance comparable to mainstream LLM-based search agents, and that P-ReAct delivers approximately 15% inference acceleration.

🏗️ Architecture Overview

DLLM-Searcher Architecture

Figure: DLLM-Searcher comprises a training process and an inference process. In training, both Agentic SFT and Agentic VRPO use Block Attention and Agentic Noising to compute the Agentic ELBO. In inference, we employ the P-ReAct agent paradigm with token pre-filling and confidence biasing.
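For reference, a masked-diffusion ELBO of the kind used to train dLLMs such as LLaDA can be written as below, where $x_t$ is obtained from $x_0$ by independently masking each token with probability $t$. The second form is only a sketch of what an "Agentic ELBO" could look like under the assumption that Agentic Noising masks only agent-generated spans (think, tool_call, answer) while keeping the prompt and tool responses clean; the set $\mathcal{A}$ of agent-generated positions is our notation, not the paper's.

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{t\sim\mathcal{U}(0,1],\;x_0,\;x_t}\!\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\!\big[x_t^{i}=\texttt{[MASK]}\big]\,\log p_\theta\!\big(x_0^{i}\mid x_t\big)\right],
\qquad
\mathcal{L}_{\text{agentic}}(\theta) = -\,\mathbb{E}\!\left[\frac{1}{t}\sum_{i\in\mathcal{A}}\mathbf{1}\!\big[x_t^{i}=\texttt{[MASK]}\big]\,\log p_\theta\!\big(x_0^{i}\mid x_t\big)\right].
$$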

📊 Experimental Results

Performance on Multi-hop QA Benchmarks

Comparison with traditional RAG methods, LLM agents, and dLLM agents

Each cell reports ACCR / ACCL.

| Models | HotpotQA | 2Wiki | Bamboogle | Musique | Avg |
| --- | --- | --- | --- | --- | --- |
| *Traditional RAG* | | | | | |
| SuRe | 32.4 / 48.4 | 22.2 / 26.8 | 17.6 / 28.0 | 7.2 / 10.0 | 19.9 / 28.3 |
| Selective-Context | 33.2 / 43.4 | 27.4 / 29.6 | 15.2 / 20.8 | 5.8 / 8.8 | 20.4 / 25.7 |
| Adaptive-RAG | 38.0 / 47.4 | 27.8 / 25.8 | 21.6 / 25.0 | 7.2 / 11.6 | 23.7 / 27.5 |
| IRCoT | 48.8 / 55.2 | 41.0 / 38.6 | 32.0 / 39.2 | 11.6 / 15.8 | 33.4 / 37.2 |
| Iter-RetGen | 41.6 / 54.4 | 32.4 / 34.4 | 26.4 / 32.0 | 14.8 / 18.2 | 28.8 / 34.8 |
| CR-Planner | 44.4 / 33.6 | 48.2 / 22.0 | 35.2 / 34.4 | 12.2 / 11.4 | 35.0 / 25.4 |
| ReARTeR | 46.8 / 50.6 | 55.4 / 53.4 | 49.6 / 54.4 | 29.6 / 30.2 | 45.4 / 47.2 |
| *ARM-based LLM Agents* | | | | | |
| Search-o1 | 40.8 / 53.2 | 47.0 / 51.2 | 49.6 / 52.0 | 15.2 / 19.0 | 38.2 / 43.9 |
| Search-R1 | 49.6 / 62.2 | 46.0 / 50.0 | 47.2 / 56.0 | 28.0 / 26.0 | 42.7 / 48.6 |
| WebSailor* | 50.4 / 52.4 | 59.4 / 61.4 | 57.6 / 65.6 | 22.0 / 28.0 | 47.4 / 51.9 |
| R1Searcher* | 58.0 / 62.2 | 59.6 / 63.4 | 66.4 / 68.8 | 28.2 / 31.4 | 53.1 / 56.5 |
| *dLLM Agents* | | | | | |
| SDAR | / | / | / | / | / |
| Dream | 11.0 / 11.6 | 13.6 / 12.0 | 12.0 / 13.6 | 3.8 / 3.2 | 10.1 / 10.1 |
| LLaDA | 36.0 / 32.8 | 42.0 / 38.8 | 46.4 / 42.4 | 15.2 / 15.8 | 34.9 / 32.5 |
| DLLM-Searcher | 60.4 / 62.4 | 69.8 / 64.6 | 68.8 / 69.6 | 29.0 / 29.8 | 57.0 / 56.6 |

⚖️ P-ReAct vs ReAct

P-ReAct vs ReAct Comparison

Why P-ReAct is Better

The traditional ReAct paradigm suffers from sequential execution: the model must complete its reasoning before making a tool call, then wait idly for the tool response. This creates significant latency bottlenecks in multi-turn agent interactions.

P-ReAct addresses this by:

  • Prioritizing <tool_call> token generation at early decoding steps
  • Enabling parallel reasoning while waiting for tool API responses

Key Insight: By decoding tool calls first, the model can continue thinking while external APIs execute, effectively overlapping computation with I/O wait time.
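
The overlap can be pictured with a small, hypothetical sketch; the names call_search_tool, denoise_step, and p_react_iteration are illustrative stand-ins, not the released implementation. The tool request is dispatched as soon as the tool-call region is decoded, and the remaining denoising steps for the <think> region run while the request is in flight.

overlap_sketch.py (hypothetical)
from concurrent.futures import ThreadPoolExecutor
import time

def call_search_tool(query: str) -> str:
    # Stand-in for a real search API; this is where a ReAct agent would
    # otherwise sit idle waiting on network I/O.
    time.sleep(1.0)
    return f"results for: {query}"

def denoise_step(state: dict) -> dict:
    # Stand-in for one dLLM denoising step over the <think> region.
    state["steps_done"] += 1
    time.sleep(0.05)  # pretend compute
    return state

def p_react_iteration(num_steps: int = 32) -> dict:
    state = {"steps_done": 0, "tool_response": None}
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Step 1: the tool-call region is unmasked first (token pre-filling
        # plus confidence biasing), so the query is available immediately.
        query = "example query"
        future = pool.submit(call_search_tool, query)  # fire off the I/O

        # Steps 2..N: keep denoising the <think> region while the tool runs.
        while state["steps_done"] < num_steps:
            state = denoise_step(state)

        # Join: by now the tool response has usually already arrived.
        state["tool_response"] = future.result()
    return state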

🔬 Analysis Experiments

dLLMs with P-ReAct vs ReAct

Ablation Study

dLLMs vs LLMs with P-ReAct

Latency Analysis

💻 P-ReAct Implementation

Only 11 lines of code are needed to implement the complete token pre-filling and confidence biasing.

scheduler.py
elif 'toolcall_pre_rl' in seq.remasking_strategy:
    if seq.current_denoising_step == 0:
        # Token pre-filling: unmask the tool-call delimiters at step 0 so the
        # tool-call region is decoded before the think region.
        seq_x0[your_tool_end] = 151658    # </tool_call> token id
        transfer_index[your_tool_end] = True
        seq_x0[your_tool_start] = 151657  # <tool_call> token id
        transfer_index[your_tool_start] = True
    else:
        # Confidence biasing: keep only still-masked positions, then boost the
        # confidence of tokens inside the tool-call region by 0.5 so they are
        # unmasked earlier than the surrounding think tokens.
        confidence = torch.where(mask_index, seq_x0_p, -np.inf)
        confidence[your_tool_start:your_tool_end + 1] = \
            confidence[your_tool_start:your_tool_end + 1] + 0.5
        # Unmask the top-k most confident positions at this denoising step.
        _, top_indices = torch.topk(confidence, num_to_transfer)
        transfer_index[top_indices] = True

📝 Citation

BibTeX

@misc{zhao2026dllmsearcheradaptingdiffusionlarge,
  title={DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents},
  author={Jiahao Zhao and Shaoxuan Xu and Zhongxiang Sun and Fengqi Zhu and Jingyang Ou and Yuling Shi and Chongxuan Li and Xiao Zhang and Jun Xu},
  year={2026},
  eprint={2602.07035},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.07035},
}
      

🙏 Acknowledgements

We sincerely thank the authors of the following open-source repositories for their efforts:

Training Frameworks: TraceRL, ESPO
Evaluation & Serving: WebSailor, R1Searcher
Base Models: LLaDA, Dream7B, SDAR