DeR² Sandbox

Decoupling Retrieval and Reasoning Capabilities

Frontier scientific reasoning · Capability decoupling · 81 PhD annotators · 2-phase validation

DeR² Framework

Figure 1: The DeR² construction framework. It isolates document-grounded reasoning by decoupling evidence access from logic execution across four controlled regimes, enabling fine-grained error attribution.

Abstract

Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility.

We introduce DeR², a controlled deep-research sandbox that isolates document-grounded reasoning while preserving core difficulties of deep search: multi-step synthesis, denoising, and drawing evidence-based conclusions. DeR² decouples evidence access from reasoning via four regimes—Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors)—yielding interpretable regime gaps that operationalize retrieval loss vs. reasoning loss and enable fine-grained error attribution.

To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023–2025 theoretical papers) with expert-annotated concepts and validated rationales.
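As a concrete illustration, the sketch below encodes the two-phase validation filter in Python. It is a minimal sketch: `ask_model`, `is_correct`, and the `Instance` fields are hypothetical placeholders, not part of the released tooling.

```python
# Hypothetical sketch of the two-phase validation filter. `ask_model`,
# `is_correct`, and the Instance fields are illustrative placeholders,
# not part of the released tooling.

from dataclasses import dataclass

@dataclass
class Instance:
    instruction: str          # the question posed to the model
    gold_concepts: list[str]  # expert-annotated oracle concepts
    answer: str               # validated reference answer

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the model used in the offline difficulty checks."""
    raise NotImplementedError

def is_correct(response: str, answer: str) -> bool:
    """Placeholder for the answer-matching rubric."""
    raise NotImplementedError

def passes_two_phase_validation(inst: Instance, n_trials: int = 3) -> bool:
    # Phase 1: parametric failure. The model must fail the bare instruction on
    # every trial (the regime description below specifies three attempts), so
    # the item cannot be answered from memorized knowledge alone.
    for _ in range(n_trials):
        if is_correct(ask_model(inst.instruction), inst.answer):
            return False

    # Phase 2: oracle-concept solvability. With the gold concepts supplied,
    # the item must become solvable, so its difficulty is evidential.
    oracle_prompt = (inst.instruction
                     + "\n\nRelevant concepts:\n"
                     + "\n".join(inst.gold_concepts))
    return is_correct(ask_model(oracle_prompt), inst.answer)
```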

Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.

Evaluation Regimes

Isolating knowledge selection, extraction, and composition

1. Instruction-Only: Measures parametric knowledge. Models must fail here three times to ensure true novelty.

2. Concepts-Only: Provides oracle concepts. Upper bound for concept-level composition and scheduling reasoning.

3. Related-Only: Instruction + relevant docs. Tests extraction and reasoning under clear evidence.

4. Full-Set: Related + noise docs. Measures knowledge selection, de-noising, and robust reasoning.
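To make the regime differences concrete, the sketch below shows how a single instance could be rendered into the four prompts. The function and field names are illustrative assumptions; only the regime logic follows the descriptions above, and a real harness would likely shuffle related and noise documents together in the Full-Set regime.

```python
# Illustrative sketch of how one DeR² instance is rendered under each regime.
# Function and field names are hypothetical; only the regime logic follows the
# regime descriptions above.

def build_prompt(instruction: str,
                 gold_concepts: list[str],
                 related_docs: list[str],
                 noise_docs: list[str],
                 regime: str) -> str:
    if regime == "instruction_only":
        # Regime 1: the bare question, probing parametric knowledge only.
        return instruction
    if regime == "concepts_only":
        # Regime 2: oracle concepts without documents, the composition upper bound.
        return instruction + "\n\nConcepts:\n" + "\n".join(gold_concepts)
    if regime == "related_only":
        # Regime 3: instruction plus relevant documents, extraction under clear evidence.
        return instruction + "\n\nDocuments:\n" + "\n\n".join(related_docs)
    if regime == "full_set":
        # Regime 4: relevant documents plus topically related distractors.
        return instruction + "\n\nDocuments:\n" + "\n\n".join(related_docs + noise_docs)
    raise ValueError(f"unknown regime: {regime}")
```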

Data Construction Pipeline

Expert-driven curation for deep research reliability

1. Source Selection: Theoretical papers from 2023–2025 across Physics, Math, and IT.

2. Item Construction: PhD annotators extract (Instruction, Concepts, Answer, CoT).

3. Difficulty Control: Offline AI tests ensure failure without concepts and success with them.

4. Doc Set Curation: At least one Related doc per concept, plus topically adjacent Noise docs.
*Each annotated question was compensated at 2,500 RMB to attract top-tier researchers.
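The coverage rule in step 4 can be checked mechanically. Below is a minimal sketch under an assumed data model in which each Related document is tagged with the concepts it grounds; the schema and function name are hypothetical, not the released format.

```python
# Hypothetical coverage check for the Doc Set Curation step: every annotated
# concept must be grounded by at least one Related document, and the frozen
# library must also contain topically adjacent Noise documents.
# The data model (doc_id -> covered concepts) is an illustrative assumption.

def check_doc_set(concepts: list[str],
                  related_docs: dict[str, list[str]],
                  noise_doc_ids: list[str]) -> list[str]:
    """Return a list of curation problems; an empty list means the set is acceptable."""
    problems = []
    covered = {c for doc_concepts in related_docs.values() for c in doc_concepts}
    for concept in concepts:
        if concept not in covered:
            problems.append(f"no Related document covers concept: {concept}")
    if not noise_doc_ids:
        problems.append("no Noise (distractor) documents in the library")
    return problems
```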

Dataset Distribution

Structural and contextual complexity of the DeR² benchmark.

Model Performance Leaderboard

| Model | Instruction-Only | Full-Set | Related-Only | Concepts-Only | RLoss |
|---|---|---|---|---|---|
| OpenAI-GPT-5.2-high | 65.8 | 71.1 | 71.4 | 83.8 | 12.7 |
| Gemini-3-Pro-Preview | 64.2 | 53.7 | 68.3 | 80.9 | 27.2 |
| OpenAI-GPT-5.1-high | 59.8 | 57.0 | 66.9 | 81.4 | 24.4 |
| DeepSeek-V3.2-Thinking | 57.6 | 49.3 | 61.3 | 75.3 | 26.0 |
| Claude-Opus-4.1-thinking | 49.3 | 40.0 | 52.0 | 72.4 | 32.4 |
| Average Score | 55.9 | 51.2 | 62.9 | 75.4 | 24.2 |

Note: RLoss (Retrieval Loss) = Score(Concepts) - Score(Full-Set).

Core Failure Modes

  • Reasoning Mode Switching: Models abandon viable parametric paths but fail to anchor new evidence-driven chains, leading to Score(Instr) > Score(FullSet).
  • Structural Retrieval Errors: Models recognize definitions but fail to execute constructive mechanisms (e.g., algorithmic steps).
  • Concept Coordination Breakdown: Poor scheduling of extracted concepts even when they are present.

Regime Analysis

The sandbox enables fine-grained diagnosis of three recurring gaps, each a simple difference of regime scores (see the sketch after this list):

1. Knowledge Loss: Concepts vs. Instruction
2. Retrieval Loss: Related vs. Concepts
3. Noise-Induced Loss: Full-set vs. Related
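The sketch below computes these gaps together with the leaderboard's RLoss. The sign conventions (higher-access regime minus lower) are an assumption; the example reuses the Gemini-3-Pro-Preview row from the table above.

```python
# Minimal sketch of the three regime gaps plus the leaderboard's RLoss.
# Sign conventions are an assumption; RLoss follows the table's definition
# Score(Concepts) - Score(Full-Set).

def regime_gaps(instruction: float, concepts: float,
                related: float, full_set: float) -> dict[str, float]:
    return {
        "knowledge_loss": round(concepts - instruction, 1),     # Concepts vs. Instruction
        "retrieval_loss": round(concepts - related, 1),         # Related vs. Concepts
        "noise_induced_loss": round(related - full_set, 1),     # Full-set vs. Related
        "rloss": round(concepts - full_set, 1),                 # as reported in the leaderboard
    }

# Gemini-3-Pro-Preview row from the leaderboard above.
print(regime_gaps(instruction=64.2, concepts=80.9, related=68.3, full_set=53.7))
# {'knowledge_loss': 16.7, 'retrieval_loss': 12.6, 'noise_induced_loss': 14.6, 'rloss': 27.2}
```

Note that, under these definitions, the table's RLoss is the sum of the retrieval loss and the noise-induced loss, since Score(Concepts) - Score(Full-set) = (Concepts - Related) + (Related - Full-set).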

Citation

@article{der2_sandbox2026,
  title={Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities},
  author={ByteDance Seed and M-A-P Team},
  year={2026},
  url={https://github.com/Retrieval-Infused-Reasoning-Sandbox}
}