Experiments and results
We evaluated the RAG agent on FramesQA, which is based on the Frames paper. An example of a multi-hop question is:
“Of the top two most-watched television season finales (as of June 2024), which finale lasted the longest, and by how much?”
The RAG system needs to perform multiple steps to arrive at the correct answer. First, it must determine that the two most-watched finales are those of M*A*S*H and Cheers. Next, it must find their running times and calculate the difference. In many RAG setups (vanilla RAG, or agentic RAG without enough context), the model can end up responding with something like:
“Despite multiple checks, I found no clear running times for M*A*S*H or Cheers. The docs provide viewership data, but not duration in minutes or hours.”
This does not answer the question.
Fortunately, our agentic RAG can solve this problem by first searching for the TV shows, then using the Query Rewriter and the Sufficient Context Agent to perform a targeted search for the running times of M*A*S*H and Cheers. Gemini can then easily determine which finale ran longest and by how much:
“The M*A*S*H finale ran for 150 minutes, making it the longest of the top two. It was 52 minutes longer than the Cheers finale, which ran for approximately 98 minutes.”
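The loop described above (search, check whether the retrieved context is sufficient, rewrite the query, search again) can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the function name answer_with_retries, its parameters, and the toy stand-ins below are all hypothetical, and in a real system the sufficiency check, the query rewriting, and the answer generation would each be LLM calls.

```python
from typing import Callable, List


def answer_with_retries(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rewrite_query: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Retrieve, check sufficiency, and rewrite the query until the context is enough."""
    context: List[str] = []
    query = question
    for _ in range(max_rounds):
        # Accumulate passages across rounds so earlier findings are kept.
        for passage in retrieve(query):
            if passage not in context:
                context.append(passage)
        if is_sufficient(question, context):
            break
        # Something is still missing: ask for a more targeted query.
        query = rewrite_query(question, context)
    return generate(question, context)


# Toy stand-ins (hypothetical) for the retriever and the LLM-backed components.
DOCS = {
    "most-watched finales": ["The two most-watched finales are M*A*S*H and Cheers."],
    "finale runtimes": ["M*A*S*H finale: 150 minutes.", "Cheers finale: 98 minutes."],
}


def toy_retrieve(query: str) -> List[str]:
    return DOCS.get(query, [])


def toy_is_sufficient(question: str, context: List[str]) -> bool:
    # Pretend the context is sufficient once any passage mentions a duration.
    return any("minutes" in passage for passage in context)


def toy_rewrite(question: str, context: List[str]) -> str:
    return "finale runtimes"


def toy_generate(question: str, context: List[str]) -> str:
    return " ".join(context)
```

With these stand-ins, the first search returns only the viewership passage, the sufficiency check fails, and the rewritten query brings in the running times on the second round.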
We ran an experiment to test this capability at scale (FramesQA contains 824 queries and a collection of 2,676 PDF documents). In the “vanilla” RAG setup, we use Google’s RAG Engine (which includes an advanced retrieval engine, an LLM parser, and reranking). We compared this with our agentic RAG in two settings. In the single-group setting, we retrieve only from the FramesQA documents. In the cross-group setting, we also include three other distractor datasets, and the planner must decide where to retrieve from. This cross-group setup simulates use cases where companies have databases managed by separate teams. We calculate accuracy using LLM-as-a-judge to compare system responses with the ground-truth responses in the dataset.
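As a rough illustration of how accuracy is computed with a judge, the sketch below counts the fraction of responses that a judge accepts. The names judged_accuracy and toy_judge are hypothetical; in particular, toy_judge is a simple substring check standing in for the LLM call, not the judge prompt we actually used.

```python
from typing import Callable, Iterable, Tuple


def judged_accuracy(
    examples: Iterable[Tuple[str, str, str]],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Fraction of (question, prediction, ground_truth) triples the judge accepts."""
    total = 0
    correct = 0
    for question, prediction, truth in examples:
        total += 1
        if judge(question, prediction, truth):
            correct += 1
    return correct / total if total else 0.0


def toy_judge(question: str, prediction: str, truth: str) -> bool:
    # Placeholder for an LLM call: accept when the reference answer
    # appears verbatim (case-insensitively) in the prediction.
    return truth.lower() in prediction.lower()
```

A real judge would be prompted with the question, the system response, and the ground-truth answer, and asked for a correct/incorrect verdict, which handles paraphrased answers that a substring check would miss.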
In the cross-group setting, our system nearly matches its single-group accuracy. Even when the Schema Agent has to select the correct data source out of four possibilities, we successfully target search queries and answer 90.1% of questions correctly. Latency is about the same in the single-group and cross-group settings (within 3% on average). This demonstrates that our agentic RAG system can handle multiple, unrelated data sources, opening up possibilities for more flexible retrieval scenarios.
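To make the source-selection step concrete, here is a toy sketch of choosing one data source out of four from short descriptions. It is only an illustration under stated assumptions: it uses plain word overlap instead of an LLM, and the function pick_source, the source names, and their descriptions are all invented for this example (they are not the distractor datasets from the experiment).

```python
import re
from typing import Dict, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_source(question: str, descriptions: Dict[str, str]) -> str:
    """Return the source whose description shares the most words with the question."""
    question_words = tokenize(question)
    return max(
        descriptions,
        key=lambda name: len(question_words & tokenize(descriptions[name])),
    )


# Hypothetical source descriptions, one per team-managed database.
SOURCES = {
    "frames_wiki": "encyclopedia articles about television, film, history, and people",
    "hr_policies": "internal policies on vacation, benefits, and payroll",
    "product_docs": "installation guides and API reference for software products",
    "finance_reports": "quarterly revenue, expenses, and earnings reports",
}
```

Calling pick_source("Which television finale lasted the longest?", SOURCES) selects "frames_wiki" here, because only that description shares a word with the question. An LLM-based agent can make the same decision from richer schema information and is not limited to exact word matches.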







