
How good are AI agents at real research? Inside the Deep Research Bench report

With the rapid development of large language models (LLMs), their promise as powerful research assistants has grown as well. They are no longer just answering simple factual questions; they are tackling "deep research" tasks that involve multi-step reasoning, weighing contradictory information, sourcing data from across the web, and synthesizing it into coherent output.

Now, this emerging capability is being marketed under different brand names by different labs: OpenAI calls it "Deep Research," Anthropic calls it "extended thinking," Google's Gemini offers "Search + Pro" features, and Perplexity markets its version as "Pro Search" or "Deep Research." But how effective are these products in practice? A new report from FutureSearch, titled Deep Research Bench (DRB): Evaluating Web Research Agents, offers the most rigorous evaluation to date, and the results reveal both impressive capabilities and critical shortcomings.

What is Deep Research Bench?

Created by the FutureSearch team, Deep Research Bench is a meticulously constructed benchmark designed to evaluate how well AI agents perform on web-based research tasks. These aren't questions with quick, simple answers; they mirror the messy, open-ended challenges faced by analysts, decision-makers, and researchers in the real world.

The benchmark includes 89 distinct tasks across 8 categories, such as the following (a toy sketch of the task format appears after the list):

  • Find Number: For example, "How many FDA Class II medical device recalls occurred?"
  • Validate Claim: For example, "Is ChatGPT 10x more energy-intensive than Google Search?"
  • Compile Dataset: For example, "Job trends for US software developers from 2019 to 2023"
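To make that concrete, here is a minimal Python sketch of how one of these tasks might be represented. The field names and structure are assumptions for illustration only, not the actual DRB schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a benchmark task record.
# Field names are illustrative; they are not the actual DRB schema.
@dataclass
class ResearchTask:
    category: str          # e.g. "find_number", "validate_claim", "compile_dataset"
    prompt: str            # the open-ended research question given to the agent
    reference_answer: str  # human-verified answer used for scoring

tasks = [
    ResearchTask(
        category="find_number",
        prompt="How many FDA Class II medical device recalls occurred?",
        reference_answer="<human-verified count>",
    ),
    ResearchTask(
        category="validate_claim",
        prompt="Is ChatGPT 10x more energy-intensive than Google Search?",
        reference_answer="<human-verified verdict>",
    ),
]
```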

Each task type is carefully constructed with human-verified answers and evaluated against a frozen dataset of scraped web pages (called RetroSearch). This ensures consistency across model evaluations and avoids the fluctuations of the live web.

Agent architecture: ReAct and RetroSearch

At the core of Deep Research Bench is the ReAct architecture, short for "Reason + Act." This approach mimics how a human researcher tackles a problem: thinking through the task, taking an action such as performing a web search, observing the result, and then deciding whether to iterate or conclude.

While earlier models follow this loop explicitly, newer "thinking" models often streamline the process, folding reasoning more fluidly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch, a custom, static version of the web. Rather than relying on the live internet, agents query a curated archive of web pages scraped with tools such as Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as "Gather Evidence," RetroSearch can provide access to more than 189,000 pages, all frozen in time, ensuring a fair and replicable test environment.
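To illustrate the idea, here is a rough Python sketch of a ReAct-style reason-act-observe loop running against a frozen page archive. The `llm_reason` and `search_archive` helpers and the archive format are hypothetical stand-ins, not FutureSearch's actual implementation.

```python
# Illustrative ReAct-style loop over a frozen page archive.
# `llm_reason`, `search_archive`, and the archive format are hypothetical
# stand-ins, not FutureSearch's actual RetroSearch implementation.

def search_archive(archive: dict[str, str], query: str) -> list[str]:
    """Naive keyword lookup over pre-scraped pages (stands in for RetroSearch)."""
    terms = query.lower().split()
    return [url for url, text in archive.items()
            if any(term in text.lower() for term in terms)]

def run_agent(task: str, archive: dict[str, str], llm_reason, max_steps: int = 10) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        # Reason: the model decides the next action given the task and what it has seen.
        decision = llm_reason(task=task, observations=observations)
        if decision["action"] == "answer":
            return decision["content"]  # the agent decides it has enough evidence
        # Act + observe: query the frozen archive instead of the live web.
        hits = search_archive(archive, decision["content"])
        observations.append(f"search '{decision['content']}' -> {hits[:5]}")
    return "no answer within step budget"
```

Because every run queries the same frozen archive, differences in scores reflect the agents themselves rather than day-to-day changes in the live web.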

Which AI agents perform best?

Among all the contenders, OpenAI's o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on Deep Research Bench. While that may sound modest, it's important to understand the benchmark's difficulty: due to ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8, what the researchers call the "noise ceiling." In other words, even today's best models still fall short of a well-organized human researcher.

Still, the leaderboard offers revealing insights. o3 not only led the pack but did so with speed and consistency, performing strongly across nearly every task type. Anthropic's Claude 3.7 Sonnet followed closely, showing versatility in both its "thinking" and "non-thinking" modes. Google's flagship model, Gemini 2.5 Pro, stood out for its ability to handle tasks requiring structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 delivered a surprise, keeping pace with GPT-4 Turbo and narrowing the performance gap between open and closed models.

Overall, a clear pattern emerged: newer "thinking" models consistently outperformed their earlier counterparts, and closed-source models maintained a meaningful edge over open-weight alternatives.

Where do agents struggle?

Reading through the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating things I've personally encountered is when an AI agent simply forgets what we're doing, especially during long research or content-creation sessions. As the context window stretches, the model often starts to lose the thread: key details fade, goals get muddled, and suddenly the responses feel disjointed or aimless. At some point, I've learned it's usually better to cut my losses and start over, even if that means discarding everything generated so far.

That kind of forgetfulness isn't just anecdotal; it was the single most significant predictor of failure in the Deep Research Bench evaluation. And it isn't the only recurring problem. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others show poor query crafting, lazily matching keywords instead of thinking critically about how to search effectively. And agents frequently fall victim to premature conclusions, delivering half-formed answers that technically check the box but offer no real insight.

Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a notable tendency to forget earlier steps, while DeepSeek-R1 was more likely to hallucinate or invent plausible-sounding (but incorrect) information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. These issues will feel all too familiar to anyone who has relied on AI for serious work, and they underscore how far we still have to go in building agents that can truly think and research like humans.

What about memory-based performance?

Interestingly, Deep Research Bench also evaluates what it calls "toolless" agents: language models that run without access to external tools such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they learned during training. In practice, that means they can't look anything up or verify information; they guess based on what they "remember."
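As a rough illustration of the difference (not DRB's actual harness), a toolless run gives the model only the question, while a tool-enabled run lets it pull evidence from the frozen archive first. The `call_llm` and `search_archive` helpers below are hypothetical stand-ins.

```python
# Illustrative contrast between a "toolless" run and a tool-enabled run.
# `call_llm` and `search_archive` are hypothetical stand-ins for a model API
# and a RetroSearch-style lookup; DRB's actual harness will differ.

def toolless_answer(call_llm, question: str) -> str:
    # Purely memory-based: the model sees only the question, with no retrieval step.
    prompt = f"Answer from what you already know, without searching:\n{question}"
    return call_llm(prompt)

def tool_enabled_answer(call_llm, search_archive, question: str) -> str:
    # Same question, but the model first receives evidence retrieved from the frozen archive.
    evidence = "\n".join(search_archive(question)[:5])
    prompt = f"Question:\n{question}\n\nRetrieved evidence:\n{evidence}\n\nAnswer:"
    return call_llm(prompt)
```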

Surprisingly, these toolless agents performed almost as well as full research agents on certain tasks. For example, on the Validate Claim task, where the goal is to assess a statement's plausibility, they scored 0.61, nearly matching the average of tool-enabled agents. This suggests that models like o3 and Claude have strong internal priors and can often recognize the truth or falsehood of common claims without searching the web.

On more demanding tasks, however, such as Derive Number, which requires piecing together multiple values from different sources, or Gather Evidence, which depends on finding and weighing diverse facts in context, these toolless models fell apart completely. Without fresh information or live search capabilities, they simply lacked the means to produce accurate or comprehensive answers.

This contrast highlights an important nuance: while today's LLMs can simulate "knowing" a great deal, deep research depends not just on recall but on reasoning with up-to-date, verifiable information, something only tool-equipped agents can truly deliver.

Final thoughts

The DRB report makes one thing clear: while today's best AI agents can outperform the average human on narrowly defined tasks, they still lag behind skilled generalist researchers, especially when it comes to planning strategically, adapting mid-process, and reasoning with nuance.

That gap becomes especially noticeable during long or complex sessions, something I've experienced firsthand, where an agent gradually loses sight of the task's purpose and both coherence and usefulness break down.

What makes Deep Research Bench so valuable is that it doesn't just test surface-level knowledge; it probes the interplay of tool use, memory, reasoning, and adaptability, coming far closer to real-world research than benchmarks like MMLU or GSM8K.

As LLMs become more deeply integrated into serious knowledge work, tools like FutureSearch's DRB will be essential for evaluating not just what these systems know, but how well they actually work.
