Timon Harz
February 11, 2024
Inference Scaling for Long-Context Retrieval Augmented Generation
Uncover the impact of inference scaling on long-context RAG models and learn how computation allocation can predict optimal configurations. Discover how this approach lays the groundwork for improving RAG efficiency in real-world applications.

Introduction
Advancements in inference computation have unlocked the potential of long-context large language models (LLMs) across various domains. For knowledge-intensive tasks, additional computational resources are often dedicated to integrating more external information. However, simply expanding the context window without effectively utilizing the incorporated knowledge may not always improve performance.
This study examines inference scaling for retrieval-augmented generation (RAG), exploring strategies that go beyond merely increasing the volume of retrieved knowledge. We emphasize two approaches: in-context learning and iterative prompting. These methods introduce flexibility in test-time computation by scaling the number of retrieved documents or iterative generation steps, enabling LLMs to better acquire and apply contextual information.
We address two central questions:
How does RAG performance improve with optimized inference computation scaling?
Can we predict the optimal allocation of test-time computational resources for a given budget by modeling the relationship between RAG performance and inference parameters?
Our findings highlight that, when optimally configured, increased inference computation delivers nearly linear gains in RAG performance—establishing "inference scaling laws" for RAG. Furthermore, we developed a computational allocation model to predict performance across different inference setups. This model accurately estimates optimal inference configurations under various computational constraints, aligning closely with experimental outcomes.
By leveraging these optimized strategies, we demonstrate that scaling inference computation for long-context LLMs yields significant performance improvements, achieving up to 58.9% gains on benchmark datasets compared to traditional RAG setups.
Long-context large language models (LLMs) are designed to process extended input sequences, enabling them to handle significantly longer contexts. With increased inference computation, these models achieve improved performance across a range of tasks. For example, many-shot in-context learning (ICL) allows these models to perform comparably to supervised fine-tuning by utilizing extensive in-context examples. In knowledge-intensive tasks, particularly those involving retrieval-augmented generation (RAG), increasing the number or size of retrieved documents—up to an optimal threshold—consistently enhances performance.

Figure 1: Normalized performance versus effective context lengths on MuSiQue. Each line corresponds to a fixed configuration, scaled by varying the number of documents. Red dots and dashed lines indicate the optimal configurations and their fitted trends. While standard RAG reaches a performance plateau early at \(10^4\) tokens, DRAG and IterDRAG demonstrate nearly linear improvements as the effective context length increases.
Previous approaches to scaling inference in RAG have largely focused on increasing the amount of retrieved knowledge by either expanding the number or length of retrieved documents. However, prioritizing knowledge quantity alone introduces challenges. Long-context LLMs often struggle to effectively locate relevant information within ultra-long sequences for complex tasks. Their optimal performance frequently occurs without fully utilizing the maximum available context length. Additionally, exceeding retrieval thresholds, such as including more than the top-10 documents, can lead to diminishing returns or even performance degradation due to noise in the extended context. This added noise can distract the model, adversely affecting generation quality and overall task performance. As a result, scaling inference for long-context RAG remains a challenging problem.
In this study, we adopt a broader perspective to explore how RAG can benefit from advanced inference scaling strategies. One such approach is demonstration-based RAG (DRAG), which uses multiple RAG examples as demonstrations, enabling LLMs to leverage their long-context capabilities for learning how to locate and apply relevant information through in-context learning. However, single-step retrieval often falls short in providing sufficient information, especially for complex tasks.
To address these limitations, we introduce iterative demonstration-based RAG (IterDRAG). IterDRAG decomposes complex queries into simpler sub-queries, retrieving and generating responses iteratively. This interleaved process allows LLMs to construct reasoning chains, bridging gaps in compositional understanding, particularly for multi-hop queries. Together, DRAG and IterDRAG offer flexible and effective strategies for scaling inference computation, empowering long-context LLMs to handle complex, knowledge-intensive tasks more effectively.
Building on these strategies, we explore various methods to scale inference computation, quantified as the total input tokens across all iterations—termed effective context length. In DRAG, this scaling involves increasing two key parameters: the number of retrieved documents and in-context examples. For IterDRAG, additional scaling is achieved by incorporating more generation steps during test time. Since different parameter combinations result in varied distributions of computational resources, our objective is to model the relationship between RAG performance and the allocation of inference computation at different scales.
Through extensive evaluations on benchmark QA datasets, we demonstrate that combining these strategies produces an almost linear relationship between RAG performance and effective context length. This trend is visualized in Figure 1 (right). Furthermore, our proposed strategies surpass the performance gains achieved by simply increasing the number of retrieved documents. Notably, they enable state-of-the-art results using the compact Gemini 1.5 Flash model, as highlighted in Figure 2.
Based on our observations, we analyze the relationship between RAG performance and inference computation, which we refer to as the "inference scaling laws for RAG." These laws demonstrate that RAG performance consistently improves as the effective context length expands, provided the configurations are optimal. We further delve into modeling RAG performance based on varying inference computation allocations. The goal is to predict the optimal set of inference parameters to maximize performance across different RAG tasks.
To achieve this, we quantitatively model the relationship between RAG performance and inference configurations using a computation allocation model for RAG. This model helps identify optimal configurations that can be empirically determined and generalized across various scenarios, maximizing computation budget utilization. We summarize our contributions as follows:
We systematically investigate inference scaling for long-context RAG, introducing two effective scaling strategies, DRAG and IterDRAG.
We conduct comprehensive evaluations of DRAG and IterDRAG, which not only achieve state-of-the-art performance but also show superior scaling properties compared to simply increasing the number of retrieved documents.
Through extensive experiments on benchmark QA datasets, we demonstrate that when test-time compute is optimally allocated, long-context RAG performance scales almost linearly with the increasing computation budget.
We develop a computation allocation model that quantifies the relationship between RAG performance and various inference parameters. This model aligns closely with experimental results and generalizes well across different scenarios, offering practical guidance for optimal computation allocation in long-context RAG tasks.

Figure 2: Evaluation accuracy of Gemini 1.5 Flash using various methods: zero-shot QA, many-shot QA, RAG (with optimal document count), DRAG, and IterDRAG on benchmark QA datasets. As inference computation is scaled up (up to 5M tokens), DRAG consistently outperforms baseline approaches. IterDRAG further improves upon DRAG by incorporating interleaved retrieval and iterative generation.
Related Work
Long-Context LLMs
Long-context large language models (LLMs) are designed to leverage extended context, enhancing their generative abilities. Early efforts to extend context lengths focused on techniques like sparse or low-rank kernels to reduce memory consumption, enabling the processing of longer sequences (Kitaev et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Choromanski et al., 2020). Additionally, alternative models such as recurrent neural networks and state-space models (SSMs) were proposed as efficient replacements for traditional transformer-based approaches (Gu et al., 2021; Gu and Dao, 2023; Peng et al., 2023a; Beck et al., 2024). For causal LLMs, methods like extrapolation and interpolation have been effective in expanding context window lengths (Press et al., 2021; Chen et al., 2023; Sun et al., 2023; Peng et al., 2023b). More recently, innovations in efficient attention mechanisms have made it possible for LLMs to train and perform inference on input sequences containing millions of tokens (Achiam et al., 2023; Team et al., 2023; Reid et al., 2024).
In-Context Learning
In-context learning (ICL) is an efficient way to improve model performance during inference by conditioning on a few task demonstrations. To further enhance ICL, recent work focuses on pretraining strategies that help language models better learn from in-context examples. Additionally, selectively using few-shot examples has been shown to improve downstream task performance. Optimizing the formatting and order of in-context examples also boosts ICL effectiveness. With the advent of long-context LLMs, scaling the number of in-context examples is now possible. For example, many-shot ICL has been shown to mitigate pretraining biases within LLMs, improving performance across a variety of tasks.
Retrieval Augmented Generation
Retrieval-augmented generation (RAG) enhances language model performance by integrating external knowledge from relevant sources. Optimizing the retrieval process, beyond basic RAG, improves the relevance of the context and leads to better generation outcomes. Various strategies, such as using language models as supervision for dense retrievers, can improve the efficiency and effectiveness of the retrieval stage. Additionally, encoding documents effectively boosts knowledge retrieval, improving the model’s ability to generate more accurate outputs. Some approaches focus on selectively using knowledge from retrieved documents to reduce the impact of irrelevant context, making models more robust. Moreover, methods like training with negative examples help refine the retrieval process, improving generation quality. Recent developments in long-document retrieval and datastore scaling are aimed at further enhancing RAG performance, but inference scaling remains an underexplored area, especially in knowledge-intensive tasks. This work aims to address this gap by examining how adjustments in inference computation can optimize RAG performance, particularly for downstream tasks.
Inference Scaling Strategies for RAG
Inference computation in retrieval-augmented generation (RAG) is quantified by the effective context length, which refers to the total number of input tokens processed across all iterations before the language model (LLM) generates its final output. For most methods that involve a single LLM call, the effective context length is simply the number of tokens in the prompt, constrained by the LLM’s context window. In contrast, methods that involve multiple calls to the LLM allow for the extension of the effective context length through iterative strategies, potentially increasing it indefinitely. Notably, output tokens and retrieval costs are excluded from this analysis, as LLMs typically generate very few tokens (under 10) for knowledge-intensive tasks. Moreover, retrieval operations tend to be less computationally demanding than LLM inference, particularly with scalable retrieval techniques.
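To make this accounting concrete, the following minimal sketch (in Python) sums the prompt tokens of every LLM call issued for a single query; the whitespace tokenizer and the prompt strings are hypothetical placeholders, not the actual pipeline.

```python
# Effective context length: total input tokens across all LLM calls for one query.
# Output tokens and retrieval cost are deliberately not counted.
from typing import Callable, List

def effective_context_length(prompts: List[str],
                             count_tokens: Callable[[str], int]) -> int:
    """Sum the input tokens of every inference call made for a single query."""
    return sum(count_tokens(p) for p in prompts)

# Example: a two-iteration query with a crude whitespace token counter (placeholder).
prompts = [
    "Documents ... Question: ... Follow up: ...",            # iteration 1 prompt
    "Documents ... Intermediate answer: ... Final answer:",  # iteration 2 prompt
]
print(effective_context_length(prompts, count_tokens=lambda p: len(p.split())))
```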
Our primary focus is to investigate how RAG performance evolves when scaling inference computation. In demonstration-based RAG (DRAG), we scale up by incorporating both extensive document retrieval and a variety of in-context examples. To further enhance this, we use iterative generation in IterDRAG, which involves multiple generation steps to refine results. We will explore both strategies in the following sections to demonstrate their impact on performance as we scale inference computation.
Demonstration-Based RAG
Demonstration-based retrieval-augmented generation (DRAG) enhances the performance of long-context large language models (LLMs) by utilizing in-context learning. It does so by generating answers directly from an extended input context, which includes both retrieved documents and in-context examples. DRAG builds upon the basic RAG framework by integrating a combination of documents and in-context examples into the input prompt, enabling the model to generate answers to queries in a single inference step (see Figure 3, left). For both the in-context examples and the query at test time, a retrieval model is used to fetch the top-k most relevant documents from a large corpus (e.g., Wikipedia). The retrieved documents are reordered, with the highest-ranked documents placed closest to the query.
When employing instruction-tuned LLMs, we craft a prompt template with a structured format that includes prefixes for the retrieved documents, the input query, and the output (see Appendix G). Unlike prior approaches, DRAG integrates extensive documents into the demonstrations, enabling the LLM to better utilize the extended context for answering queries by learning to extract and process relevant information efficiently.
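The prompt layout can be pictured with a short sketch: \( m \) demonstrations, each bundling its own \( k \) retrieved documents, followed by the test query. The data structures and field prefixes below are illustrative assumptions, not the exact template from Appendix G.

```python
# Sketch of DRAG prompt assembly. Higher-ranked documents are placed closest to
# the question, as described above; prefixes and types are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class Demonstration:
    documents: List[str]   # top-k documents for this example, best-ranked first
    question: str
    answer: str

def format_block(documents: List[str], question: str, answer: str = "") -> str:
    # Reverse the ranked list so the highest-ranked document sits nearest the question.
    doc_lines = "\n".join(f"Document: {d}" for d in reversed(documents))
    return f"{doc_lines}\nQuestion: {question}\nAnswer: {answer}".rstrip()

def build_drag_prompt(demos: List[Demonstration],
                      test_documents: List[str],
                      test_question: str) -> str:
    blocks = [format_block(d.documents, d.question, d.answer) for d in demos]
    blocks.append(format_block(test_documents, test_question))  # answer left open
    return "\n\n".join(blocks)
```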

Figure 3: DRAG vs. IterDRAG
IterDRAG improves upon DRAG by decomposing the input query into smaller sub-queries, answering each step to refine the accuracy of the final response. During test-time, IterDRAG scales computation by performing multiple inference steps, allowing for the decomposition of complex queries and more effective document retrieval.
Iterative Demonstration-Based RAG
Despite having access to external knowledge, complex multi-hop queries remain a significant challenge due to their compositional nature. To address this, we introduce Iterative Demonstration-Based RAG (IterDRAG), which decomposes complex queries into simpler sub-queries. For each sub-query, retrieval is performed to gather additional context, which is then used to generate intermediate answers. Once all sub-queries are resolved, the context, sub-queries, and their answers are synthesized to generate the final answer (see Figure 3, right).
While many datasets provide training queries and answers, they typically lack sub-queries and intermediate answers. To generate in-context examples with these steps, we employ constrained decoding in LLMs, prompting them to follow the Self-Ask format. In each iteration, the LLM generates either a sub-query, an intermediate answer, or the final answer. If a sub-query is generated, additional documents are retrieved and incorporated into the prompt to help generate the intermediate answer. IterDRAG continues this process until the final answer is produced or a set number of iterations is reached, at which point the LLM is required to generate the final answer. We maintain examples with intermediate steps and correct final answers to build in-context demonstrations, ensuring that each example includes the retrieved documents, sub-query-answer pairs, and the final answer.
During inference, in-context examples are added to the initial set of retrieved documents for the input query. Each inference request generates either a sub-query, an intermediate answer, or the final answer. When sub-queries are generated, additional documents are retrieved and integrated with the initial documents to produce intermediate answers. In our implementation, we allow up to five iterations of query decomposition before the final answer is generated. This iterative process scales test-time computation, with the input tokens from all iterations contributing to the effective context length. IterDRAG offers a more refined approach by learning to: (1) decompose complex queries into simpler sub-queries, and (2) retrieve and integrate relevant information to answer those sub-queries. By iteratively retrieving and generating answers, this strategy helps bridge the compositionality gap, improves knowledge extraction, and ultimately enhances the overall performance of RAG.
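The loop can be summarized schematically as follows. The `retrieve` and `llm` callables and the Self-Ask-style markers are stand-ins for the constrained-decoding setup described above, so this is a sketch of the control flow rather than the exact procedure.

```python
# Schematic IterDRAG loop: each LLM call yields a sub-query, an intermediate
# answer, or the final answer; sub-queries trigger additional retrieval.
from typing import Callable, List

def iterdrag(question: str,
             retrieve: Callable[[str], List[str]],
             llm: Callable[[str], str],
             demonstrations: str,
             max_steps: int = 5) -> str:
    documents = retrieve(question)   # initial retrieval for the input query
    scratchpad = ""                  # accumulated sub-queries and intermediate answers
    for _ in range(max_steps):
        prompt = (demonstrations + "\n\n" + "\n".join(documents)
                  + f"\nQuestion: {question}\n" + scratchpad)
        output = llm(prompt).strip()
        if output.startswith("Follow up:"):
            sub_query = output[len("Follow up:"):].strip()
            documents += retrieve(sub_query)   # merge newly retrieved documents
            scratchpad += output + "\n"
        elif output.startswith("Intermediate answer:"):
            scratchpad += output + "\n"
        elif output.startswith("So the final answer is:"):
            return output[len("So the final answer is:"):].strip()
    # Iteration budget exhausted: force a final answer.
    prompt = (demonstrations + "\n\n" + "\n".join(documents)
              + f"\nQuestion: {question}\n" + scratchpad + "So the final answer is:")
    return llm(prompt).strip()
```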
RAG Performance and Inference Computation Scale
Optimizing Performance Within a Fixed Computation Budget
Given a fixed budget for inference computation, represented by a maximum effective context length \( L_{\max} \), there are several strategies to optimize the use of computational resources through the adjustment of inference parameters. For instance, in the DRAG approach, we can vary both the number of retrieved documents and in-context examples. In IterDRAG, we introduce an additional parameter: the number of iterations for retrieval and generation. These parameters are collectively denoted as \( \theta \).
For each input query \( (x_i, y_i) \in X \), we apply the RAG inference strategy \( f \), which is parameterized by \( \theta \). The effective input context length provided to the LLM is denoted as \( l(x_i; \theta) \), and the prediction generated is \( \hat{y}_i = f(x_i; \theta) \). A performance metric \( P(y_i, \hat{y}_i) \) is then calculated based on the true label \( y_i \) and the predicted answer \( \hat{y}_i \).
To explore the relationship between RAG performance and inference computation, we sample different computation budgets \( L_{\max} \). For each budget, the goal is to identify the optimal average performance metric \( P^{*}(L_{\max}) \), which is achieved by testing various configurations of \( \theta \in \Theta \). This allows us to evaluate how different combinations of inference parameters affect the overall performance within the constraints of the budget.
\[ P^{*}(L_{\max}) := \max_{\theta \in \Theta} \frac{1}{|X|} \sum_{(x_i, y_i) \in X} P\big(y_i, f(x_i; \theta)\big) \quad \text{s.t.} \quad l(x_i; \theta) \le L_{\max} \;\; \forall\, x_i \in X \tag{1} \]
Our goal is to understand the relationship between the inference computation budget \( L_{\max} \) and the optimal performance \( P^{*}(L_{\max}) \) that can be achieved within this budget. This involves leveraging various strategies and parameter configurations to allocate inference computation resources effectively. For simplicity, we will refer to \( P^{*}(L_{\max}) \) as the optimal performance.
We investigate the following key factors within the inference parameter set \( \theta \):
Number of retrieved documents (\( k \)): This refers to the documents retrieved from a large corpus (e.g., Wikipedia) based on the input query.
Number of in-context examples (\( m \)): Each example consists of \( k \) documents, an input query, and its corresponding label.
Number of generation iterations (\( n \)): In DRAG, the answer is generated directly from the input context, so \( n = 1 \). In IterDRAG, multiple retrieval and generation steps are interleaved, increasing both the effective context length and the computational load, without requiring longer context windows.
This exploration helps us optimize RAG strategies for better performance within a fixed budget.
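A minimal sketch of this search is given below: candidate configurations \( \theta = (k, m, n) \) are enumerated, those exceeding the budget are discarded, and the remainder are scored. The `evaluate` and `prompt_length` hooks and the candidate grids are hypothetical placeholders for the evaluation pipeline.

```python
# Budget-constrained search over theta = (k, m, n). `prompt_length` returns the
# effective context length of a configuration; `evaluate` returns its average
# metric P over the evaluation set. Both are hypothetical hooks.
from itertools import product
from typing import Callable, Iterable, Tuple

def best_configuration(budget_tokens: int,
                       evaluate: Callable[[int, int, int], float],
                       prompt_length: Callable[[int, int, int], int],
                       ks: Iterable[int] = (0, 1, 5, 10, 50, 100, 500, 1000),
                       ms: Iterable[int] = (0, 1, 2, 4, 8, 16, 32, 64, 128, 256),
                       ns: Iterable[int] = (1, 5)) -> Tuple[Tuple[int, int, int], float]:
    best_theta, best_score = None, float("-inf")
    for theta in product(ks, ms, ns):
        if prompt_length(*theta) > budget_tokens:
            continue                     # configuration does not fit within L_max
        score = evaluate(*theta)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score
```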
We evaluate the performance of Gemini 1.5 Flash on knowledge-intensive question answering tasks, including multi-hop datasets such as Bamboogle, HotpotQA, MuSiQue, and 2WikiMultiHopQA. Results from these experiments are detailed in Appendix B and C. To manage computational costs, we follow previous work and sample 1,200 examples from each dataset for evaluation. The metrics used in the evaluation are exact match (EM), F1 score (F1), and accuracy (Acc), with accuracy assessing whether the ground truth is included in the prediction.
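For reference, the three metrics can be approximated as below; exact normalization rules (articles, punctuation, casing) vary between benchmarks, so this is a rough implementation rather than the evaluation code used here.

```python
# Approximate QA metrics: exact match, token-level F1, and accuracy, where
# accuracy checks whether the normalized ground truth appears in the prediction.
import re
from collections import Counter

def normalize(text: str) -> str:
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    pred, gold = normalize(prediction).split(), normalize(ground_truth).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def accuracy(prediction: str, ground_truth: str) -> float:
    return float(normalize(ground_truth) in normalize(prediction))
```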
We experiment with various inference computation budgets, including 16k, 32k, 128k, 1M, and 5M tokens. For the DRAG strategy, we test different numbers of retrieved documents (\( k \)) from a set of values ranging from 0 to 1000, and in-context examples (\( m \)) from 0 to 256. In IterDRAG, we explore up to five iterations of retrieval and generation.
We compare the performance of Gemini 1.5 Flash with several baselines: (1) zero-shot QA, where no retrieved documents or demonstrations are used; (2) many-shot QA, which uses varying numbers of demonstrations but no retrieved documents; and (3) retrieval-augmented generation (RAG), which uses retrieved documents but no demonstrations. The optimal performance of each method is reported for various maximum context length budgets, achieved by adjusting inference parameters.

Figure 4: Normalized performance versus effective context lengths across datasets. Each line represents a specific configuration, with variations in the number of documents. The red dots indicate the configurations that achieve optimal performance, while the dashed line represents the fitting results. The data suggest that the optimal performance exhibits a linear relationship with the effective context length.
Overall Performance
Table 1 presents the optimal performance \( P^{*}(L_{\max}) \) for various inference strategies, highlighting the best configurations of inference parameters for different computation budgets \( L_{\max} \). Some configurations are omitted for certain values of \( L_{\max} \) as they do not scale to the corresponding context length. For instance, zero-shot QA's prompt cannot be expanded, and many-shot QA's in-context examples are capped at 28, which limits scalability at \( L_{\max} = 32\text{k} \). Similarly, RAG performance plateaus beyond \( L_{\max} = 128\text{k} \), while DRAG is constrained by the LLM's context window limit of 1M tokens.
In contrast to the QA and RAG baselines, the performance of DRAG and IterDRAG consistently improves as the maximum effective context length increases. Key observations include:
DRAG and IterDRAG outperform baselines: Baseline methods like many-shot QA peak at 16k tokens, and RAG performance improves only until 128k tokens, beyond which it plateaus. In comparison, DRAG and IterDRAG adapt their configurations to better leverage the available computational resources, demonstrating superior performance and scalability. DRAG continues to improve until reaching 1M tokens, and IterDRAG further enhances RAG performance by utilizing up to 5M tokens in iterative steps.
Scaling behavior of DRAG and IterDRAG: DRAG performs best at shorter maximum context lengths (16k and 32k tokens), while IterDRAG scales more effectively with longer context lengths, especially at 128k tokens and beyond. This emphasizes the effectiveness of iterative retrieval and generation for enhancing performance with extended context windows.
These results indicate that increasing \( L_{\max} \) can significantly improve RAG performance, with DRAG and IterDRAG excelling at different scales depending on the effective context length.
Inference Scaling Laws for RAG
To assess how performance changes with varying effective context lengths, we present the performance of all configurations across datasets in Figure 4. As shown in Figure 1, we visualize the performance of DRAG and IterDRAG, highlighting the optimal performance \( P^{*}(L_{\max}) \) for different values of \( L_{\max} \). The fitting results are depicted as grey dashed lines. Additional dataset-specific results are provided in Appendix D.
Our analysis reveals the following key observations about the scaling of RAG performance:
Linear scaling of optimal performance: The optimal performance increases almost linearly with the magnitude of the inference compute. This suggests that RAG performance can be enhanced by expanding the computational budget, offering a more accurate prediction of performance based on available resources.
Effective scaling with IterDRAG: For \( L_{\max} > 10^5 \), IterDRAG continues to scale effectively by interleaving retrieval and iterative generation. This finding aligns with the results in Table 1, where IterDRAG makes better use of computation budgets for context lengths exceeding 128k.
Diminishing returns beyond 1M context length: While the performance continues to improve with larger context lengths, gains become less significant after reaching 1M tokens. The improvement from 1M to 5M tokens is modest or plateaus, potentially due to challenges in long-context modeling.
In conclusion, while the performance gains become smaller after reaching 1M tokens, optimal RAG performance still scales nearly linearly with increasing inference compute, especially with the DRAG and IterDRAG strategies.
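This trend can be summarized by regressing optimal performance on the order of magnitude of the budget; the data points below are made-up placeholders that only illustrate the fitting step.

```python
# Fit P*(L_max) against log10(L_max) to summarize the near-linear scaling trend.
# The numbers are illustrative placeholders, not measured results.
import numpy as np

budgets = np.array([16e3, 32e3, 128e3, 1e6, 5e6])   # L_max in tokens
p_star = np.array([0.42, 0.51, 0.63, 0.78, 0.82])   # hypothetical normalized P*

slope, intercept = np.polyfit(np.log10(budgets), p_star, deg=1)
print(f"P* ~= {slope:.3f} * log10(L_max) + {intercept:.3f}")
```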

Figure 5: Changes in RAG performance with varying numbers of documents and in-context examples. Panel 5(a) presents the average metric values across datasets, while panels 5(b) and 5(c) show the normalized performance for a consistent configuration as the number of documents or shots increases.
Parameter-Specific Scaling
To further understand the dynamics of DRAG and IterDRAG, we perform a grid search across various combinations of parameters (\( \theta \)) and evaluate their performance. The results, shown in Figure 5, feature heatmaps visualizing DRAG performance (see the IterDRAG heatmap in Appendix C). We also include additional results that explore different numbers of documents (\( k \)) and shots (\( m \)).
In general, scaling retrieval, demonstrations, and generation steps tends to improve performance, though the magnitude of these gains depends on the effective context length and the method used. Specifically, we observe the following:
Documents and In-Context Examples Are Not Equally Effective: For a given configuration, increasing the number of retrieved documents (\( k \)) often results in more significant performance gains, as illustrated by the varying slopes in Figure 5.
Increasing Shots (\( m \)) Benefits IterDRAG More: For IterDRAG, increasing the number of shots (\( m \)), such as from 0 to 1, typically provides greater improvements than increasing \( k \). This is likely due to the impact of demonstrations on improving in-context query decomposition and knowledge extraction.
Saturation of Gains Differs for DRAG and IterDRAG: For example, increasing \( m \) from 0 to 1 brings significant improvements for IterDRAG, but only minimal effects for DRAG. After reaching certain thresholds, further increases in \( k \) or \( m \) may yield marginal gains or even reduce performance.
Optimal \( \theta \) Depends on Method, Metric, and Dataset: The optimal combination of parameters varies depending on the method, evaluation metric, and dataset. As shown in Figure 5(a) and Figure 8, the ideal settings for performance are sensitive to these factors, making performance modeling for \( \theta \) challenging.
In conclusion, while scaling the number of documents, demonstrations, and iterations can boost RAG performance, each factor contributes differently to the overall results. Consequently, determining the best hyperparameter combination remains complex.
Inference Computation Allocation for Long-Context RAG
After analyzing the overall performance of different RAG strategies and understanding the varying impacts of inference parameters, we aim to quantify the relationship between performance and the hyperparameter set \( \theta \). For long-context RAG, we propose a computation allocation model, which captures the test-time scaling properties and can guide the selection of \( \theta \) based on the maximum effective context length \( L_{\max} \).
Formulation and Estimation
To simplify notation, we define the average performance metric \( P \) (e.g., accuracy) on dataset \( X \) as a function of \( \theta \). The hyperparameters within \( \theta \) include the number of documents \( k \), the number of demonstrations \( m \), and the maximum iterations \( n \), where \( \theta := (k, m, n)^{\top} \).
To account for variances across methods and tasks, we introduce \( i := (i_{\text{doc}}, i_{\text{shot}}, 0)^{\top} \), where \( i_{\text{doc}} \) and \( i_{\text{shot}} \) represent the informativeness of documents and in-context examples, respectively. While we could define \( i_{\text{iter}} \) to measure the informativeness of additional generation steps, applying it does not yield improved accuracy, so we set it to 0 in our experiments.
We then formulate the computation allocation model as:
\[ \sigma^{-1}\big(P(\theta)\big) \approx (a + b \odot i)^{\top} \log(\theta) + c \tag{2} \]
Here, \( \odot \) represents the element-wise product, and \( a \), \( b \), and \( c \) are parameters to be estimated. The term \( i \) can be computed for the specific task: we define \( i_{\text{doc}} \) as the performance gain from adding one document compared to the zero-shot QA baseline, and \( i_{\text{shot}} \) as the performance gain from adding one in-context example compared to zero-shot QA.
To account for the sub-linearity observed in very long contexts (beyond 1M tokens), we apply an inverse sigmoidal mapping \( \sigma^{-1} \) to scale the metric \( P \). Additional implementation details are provided in Appendix G.
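A small sketch of the model's forward pass is shown below, assuming the form of Equation 2 and treating \( a \) and \( b \) as per-dimension coefficients; all numeric values, including the offset added before taking the logarithm, are illustrative assumptions.

```python
# Forward pass of the computation allocation model, assuming Equation 2:
# sigma^{-1}(P(theta)) ~ (a + b ⊙ i)^T log(theta) + c. All values are placeholders.
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def predict_metric(theta, i, a, b, c) -> float:
    """theta = (k, m, n); i = (i_doc, i_shot, 0); a, b are length-3 arrays, c a scalar."""
    log_theta = np.log(np.asarray(theta, dtype=float) + 1.0)  # +1 guards against log(0)
    z = (a + b * np.asarray(i)) @ log_theta + c               # b * i is element-wise
    return sigmoid(z)                                         # map back to the metric scale

# Task informativeness from the gains of 1-doc RAG and 1-shot QA over zero-shot QA
# (placeholder numbers).
p_zero_shot, p_one_doc, p_one_shot = 0.30, 0.38, 0.33
i = np.array([p_one_doc - p_zero_shot, p_one_shot - p_zero_shot, 0.0])

a = np.array([0.05, 0.03, 0.10])   # hypothetical fitted parameters
b = np.array([0.60, 0.40, 0.00])
c = -0.80

print(predict_metric(theta=(50, 8, 1), i=i, a=a, b=b, c=c))
```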

In Equation 2, the parameters \( a \), \( b \), and \( c \) are specific to each model, capturing how large language models (LLMs) improve with varying numbers of documents and shots (i.e., in-context learning vs. zero-shot capabilities). In contrast, \( i \) models the performance variation within a selected task, reflecting how external knowledge and demonstrations assist in responding to a query. This means the computation allocation model, once estimated, can be applied across various downstream tasks without requiring further calibration.
To estimate the parameters, different combinations of \( \theta \) are tested, and ordinary least squares (OLS) is used to estimate \( a \), \( b \), and \( c \). The parameters for the Gemini 1.5 Flash model are reported in Appendix E.
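Under the same assumed form, the estimation step can be sketched by stacking observations from several tasks into a single least-squares problem; the informativeness vectors, configurations, and metric values below are randomly generated placeholders that only show the mechanics.

```python
# OLS estimation of (a, b, c) for the assumed model
# sigma^{-1}(P) ~ (a + b ⊙ i)^T log(theta) + c, on placeholder observations.
import numpy as np

rng = np.random.default_rng(0)

def inv_sigmoid(p: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

# Hypothetical per-task informativeness vectors i = (i_doc, i_shot, 0).
tasks = {
    "taskA": np.array([0.08, 0.03, 0.0]),
    "taskB": np.array([0.12, 0.01, 0.0]),
    "taskC": np.array([0.05, 0.06, 0.0]),
}

rows, targets = [], []
for i_vec in tasks.values():
    thetas = rng.integers(1, 200, size=(30, 3)).astype(float)  # sampled (k, m, n)
    metrics = rng.uniform(0.2, 0.8, size=30)                   # placeholder P values
    log_theta = np.log(thetas + 1.0)
    # Columns: a-terms (log theta), b-terms (i ⊙ log theta), and a bias column for c.
    rows.append(np.hstack([log_theta, log_theta * i_vec, np.ones((30, 1))]))
    targets.append(inv_sigmoid(metrics))

X, y = np.vstack(rows), np.concatenate(targets)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a_hat, b_hat, c_hat = coef[:3], coef[3:6], coef[6]
# Note: with i_iter fixed at 0, the third component of b is not identifiable.
print(a_hat, b_hat, c_hat)
```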
Validating the Computation Allocation Model for RAG
We evaluate the computation allocation model for RAG by comparing the predicted metrics to the actual values, with normalized results for DRAG visualized in Figure 6. Each subplot represents a different dataset, and each line corresponds to a document setting (\( k \)); we scale the context length by adjusting the number of in-context examples (\( m \)). As illustrated, performance improves as \( k \) and \( m \) increase across datasets, and the predicted and actual metric values follow highly consistent trends, despite some variation. Notably, the level of consistency differs by dataset: Bamboogle exhibits the highest consistency, while HotpotQA produces more variable results. These findings demonstrate how external knowledge and in-context learning can effectively enhance long-context RAG performance, and they suggest that the computation allocation model can be used to predict benchmark results.
Ablation Study
To validate the effectiveness of the computation allocation model, we conduct ablation studies and evaluate the fitting performance of different model variants. Specifically, we investigate: (1) exclusion of \( b \) and \( i \) (Exclude \( b \)); (2) a quadratic formulation of the input \( \log(\theta) \) (Quadratic \( \theta \)); (3) linear scaling of \( P \) (Linear \( \sigma \)); and (4) sigmoidal scaling of \( P \) (Sigmoidal \( \sigma \)). The \( R^2 \) and mean squared error (MSE) values for each variant are summarized in Table 2, where (4) represents the complete design of our computation allocation model. The results show that incorporating the additional parameters \( b \) and \( i \) improves the fit and reduces error across all tasks. Furthermore, applying an inverse sigmoid scaling to \( P \) significantly improves estimation accuracy compared to using quadratic scaling of \( \theta \) or linear scaling.
Domain Generalization
We also assess the generalization capability of the computation allocation model for RAG across unseen domains. Specifically, we test the parameters from Equation 2, trained on data from multiple domains, on a target domain, using only \( i \) derived from the target domain. Table 3 presents the results for a 1M effective context length, comparing the model’s performance to an 8-shot baseline (which scales by increasing the number of retrieved documents) and the optimal results (Oracle). The findings demonstrate that the computation allocation model significantly outperforms the baseline and closely approximates the oracle performance, achieving 96.6% of the optimal results. Notably, domains like Bamboogle and HotpotQA exhibit nearly identical target results, with performance metrics varying by less than 2.5% from the oracle. These outcomes highlight the potential of the computation allocation model to generalize across a broad range of knowledge-intensive tasks.
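The transfer step can be sketched as follows: the fitted \( a \), \( b \), and \( c \) are reused, only the target domain's \( i \) is recomputed, and the configuration predicted to score highest within the budget is selected. The grids, the `prompt_length` hook, and the model form are the same assumptions as in the earlier sketches.

```python
# Computation allocation on an unseen domain: reuse fitted (a, b, c), plug in the
# target domain's informativeness i, and pick the highest-predicted configuration
# that still fits the token budget. All hooks and grids are illustrative.
from itertools import product
from typing import Callable, Iterable, Tuple
import numpy as np

def predict_metric(theta, i, a, b, c) -> float:
    z = (a + b * np.asarray(i)) @ np.log(np.asarray(theta, dtype=float) + 1.0) + c
    return 1.0 / (1.0 + np.exp(-z))

def allocate(budget_tokens: int,
             i_target: np.ndarray,
             a: np.ndarray, b: np.ndarray, c: float,
             prompt_length: Callable[[int, int, int], int],
             ks: Iterable[int] = (1, 5, 10, 50, 100, 500, 1000),
             ms: Iterable[int] = (0, 1, 2, 4, 8, 16, 32),
             ns: Iterable[int] = (1, 5)) -> Tuple[int, int, int]:
    feasible = [theta for theta in product(ks, ms, ns)
                if prompt_length(*theta) <= budget_tokens]
    return max(feasible, key=lambda theta: predict_metric(theta, i_target, a, b, c))
```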
Length Extrapolation
Beyond predicting performance in unseen domains, we also investigate the extrapolation of context length using the computation allocation model. In this approach, we estimate the parameters from Equation 2 using data from experiments with shorter context lengths and assess their ability to predict performance for longer ones. Table 4 presents the predicted metric values under various extrapolation settings. Our key findings include: (1) The predictions are highly accurate and consistently outperform the 8-shot baseline. For example, the average difference between predicted and oracle results from 128k to 1M tokens is just 2.8%. (2) Extrapolation from 32k to 128k proves challenging, as DRAG tends to perform best at 32k, while IterDRAG excels with longer contexts like 128k, as shown in Figure 4. This leads to a mismatch between training and predictive performance distributions. (3) Extrapolation to 5M context length is less reliable, with the average performance difference between predicted and oracle results reaching 5.6%. Overall, while the computation allocation model provides accurate extrapolations, its effectiveness is more pronounced for context lengths below 1M.
Discussion
Our experiments consistently demonstrate the benefits of scaling inference using DRAG and IterDRAG. When paired with the computation allocation model for RAG, this approach allows us to derive a near-optimal solution for long-context RAG within the constraints of available computation. Below, we explore additional factors that can influence the scaling of long-context RAG.
Retrieval
A key determinant of RAG performance is the quality of the retrieved documents. To evaluate the impact of retrieval on accuracy, we analyze retrieval performance across different numbers of retrieved documents, with detailed results provided in Appendix A. Across all datasets, recall improves as the number of retrieved documents increases, reaching near-perfect recall with larger document sets (e.g., ∼1k). Despite these gains, discounted ranking metrics such as NDCG show diminishing returns, suggesting that an excess of retrieved documents introduces irrelevant or distracting content. This trend is also evident in Figure 5(b), where RAG performance peaks with 100 to 500 documents. These findings indicate that further refinement of retrieval methods, such as re-ranking, could improve document relevance, especially for complex, multi-hop queries. However, how inference scaling behaves when combined with such refined retrieval techniques remains an open question. Iterative retrieval, as demonstrated by IterDRAG, improves recall by breaking complex queries into simpler sub-queries, each gathering additional context for intermediate answers. In conclusion, while retrieving more documents enhances recall, it does not always improve generation quality unless the documents are effectively ranked or filtered, which highlights the need for adaptive retrieval methods that dynamically minimize irrelevant content.
Error Analysis
Despite the overall improvements, our error analysis, presented in Appendix F, reveals persistent challenges, particularly in compositional reasoning tasks requiring multiple reasoning steps. The errors observed can be categorized as follows: (1) inaccurate or outdated retrieval, (2) incorrect or missing reasoning, (3) hallucinations or unfaithful reasoning, and (4) evaluation issues or refusal to answer. The first category emphasizes the need to enhance retrieval methods and maintain a reliable, up-to-date knowledge base, especially for complex queries that rely on multiple supporting facts. In addition, errors stemming from incorrect or missing reasoning steps often lead to incomplete or incorrect answers. Both issues (1) and (2) show significant improvement with IterDRAG, highlighting the value of interleaving retrieval and iterative generation for multi-hop queries. Further advancements in creating faithful LLMs and strategies to reduce hallucinations could also improve RAG performance. Finally, we observe that current evaluation metrics are insufficient in some cases (e.g., with abbreviations), emphasizing the need for more robust and reliable evaluation methods.
Long-Context Modeling
We further explore the effects of long-context modeling on RAG performance. Our findings suggest that retrieving additional documents generally benefits RAG, as demonstrated in Section 4. However, simply extending the context length in each generation step does not always yield better results. Specifically, DRAG performance peaks around 100,000 tokens, while IterDRAG performs optimally at approximately 1,000,000 tokens by leveraging multiple rounds of generation. As shown in the performance plateau in Figures 1 and 10, LLMs struggle to effectively utilize very long contexts (≥100,000 tokens) within each iteration, likely due to the intrinsic challenges of long-context modeling. Based on these observations, we suggest: (1) the need to enhance the model's ability to identify relevant information from large contexts, especially when dealing with many similar documents, and (2) refining long-context modeling techniques to improve in-context learning capabilities, particularly when multiple lengthy demonstrations are provided.
Conclusion
In this paper, we investigate the scaling behavior of inference in long-context Retrieval-Augmented Generation (RAG) models. Our study systematically examines the relationship between inference configurations and model performance, focusing particularly on how RAG performance scales with increasing computational resources during inference. Through our experiments, we establish that RAG performance improves nearly linearly with the increasing order of magnitude of test-time compute when optimal inference parameters are applied. This observation underscores the importance of efficient computation allocation in RAG systems, especially for tasks involving long contexts.
Building on these findings, we develop inference scaling laws that provide a theoretical framework for understanding and predicting RAG performance across a wide range of hyperparameters. Our computation allocation model, derived from these scaling laws, is designed to estimate performance under different configurations, helping practitioners select optimal hyperparameters for long-context RAG.
Through extensive empirical validation, we show that our model’s predictions closely align with experimental results, providing strong evidence of its effectiveness. By demonstrating that optimal configurations can be reliably estimated, we lay the groundwork for future research focused on optimizing inference strategies for long-context RAG. These insights offer valuable guidance for advancing the development of RAG systems, especially in knowledge-intensive tasks requiring large-scale context integration.
Looking forward, we anticipate that these scaling laws and the computation allocation model will serve as essential tools for optimizing RAG in diverse real-world applications, contributing to the broader field of natural language processing and machine learning. Our approach not only offers immediate benefits for inference efficiency but also opens up new avenues for exploration in the design of next-generation RAG models, with applications across a wide range of domains, from document retrieval to conversational agents and beyond.
Press contact
Timon Harz
oneboardhq@outlook.com