F. Pacheco-Torgal: The 52.7% Problem: Why the AI Scientist Still Cannot Be Trusted with Evidence

Sunday, 21 June 2026

A new paper, Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio, quietly but painfully punctures one of the most seductive illusions of the AI age: that fluent machines are already competent scientific readers. When asked to perform one of science’s most consequential tasks, selecting the correct studies for a meta-analysis, they remain alarmingly unreliable.

The authors built MetaSyn: 442 expert-curated reviews from Nature Portfolio, paired with eligibility criteria, a corpus of ~140,000 PubMed articles, the verified correct studies, and deliberate look-alikes that seem right but break the rules. The result: even with a retrieval ceiling of 90.9%, no system recovered more than 52.7% of the studies that actually belonged. https://arxiv.org/abs/2606.17041
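To make the two percentages concrete, here is a minimal sketch of how a study-selection score of this kind can be computed. It assumes that "retrieval ceiling" means the share of correct studies present in the candidate pool a system retrieved, and that the headline figure is recall over the verified correct studies; the paper may define both differently. The function name and the study identifiers are hypothetical and the numbers are toy values, not data from MetaSyn.

```python
def selection_metrics(selected, gold, retrieved=None):
    """Score one review's study selection against the verified correct set.

    selected  -- studies the system finally included
    gold      -- studies the human reviewers actually included
    retrieved -- optional candidate pool the system chose from
    """
    selected, gold = set(selected), set(gold)
    hits = selected & gold
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(selected) if selected else 0.0
    ceiling = None
    if retrieved is not None:
        # Best recall achievable from this pool (assumed meaning of "ceiling").
        ceiling = len(set(retrieved) & gold) / len(gold) if gold else 0.0
    return {"recall": recall, "precision": precision, "ceiling": ceiling}


# Hypothetical toy review: four correct studies, two look-alikes (x1, x2).
gold = {"s1", "s2", "s3", "s4"}
retrieved = {"s1", "s2", "s3", "x1", "x2"}
selected = {"s1", "s2", "x1"}

metrics = selection_metrics(selected, gold, retrieved)
print(metrics)
```

In this toy case the pool would have allowed a recall of 0.75, but the selection reaches only 0.5, and one of the three included studies is a look-alike. The gap between ceiling and recall is the part that retrieval cannot be blamed for.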

Science is not about finding relevant-sounding papers; it is about separating the relevant from the merely adjacent. Two studies can share the same disease, the same intervention, the same vocabulary, and only one belongs. AI is excellent at producing the appearance of understanding, and far weaker at the disciplined exclusions on which reliability depends. A missed study is not just a missing citation; it can be a missing warning. A wrongly included one can distort an effect size, manufacture false consensus, or become the wrong guideline, the wrong standard, the wrong public decision, masquerading as reliable synthesis.
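How much can a single wrong inclusion move an effect size? The sketch below uses a standard fixed-effect, inverse-variance pooled estimate, chosen only for illustration; it is not claimed to be the method used in the paper or in any particular review. All effect sizes and standard errors are invented toy numbers, and the function name is hypothetical.

```python
def pooled_effect(studies):
    """Fixed-effect inverse-variance pooled estimate.

    studies -- list of (effect, standard_error) pairs
    """
    weights = [1.0 / (se ** 2) for _, se in studies]
    weighted_sum = sum(w * effect for w, (effect, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


# Hypothetical eligible studies: small effects, similar precision.
eligible = [(0.10, 0.10), (0.20, 0.10), (0.15, 0.10)]

# Hypothetical look-alike: same topic, fails the eligibility criteria,
# but is large and precise, so it carries a heavy weight.
lookalike = (0.90, 0.05)

clean = pooled_effect(eligible)
contaminated = pooled_effect(eligible + [lookalike])
print(f"eligible only:   {clean:.3f}")
print(f"with look-alike: {contaminated:.3f}")
```

With these toy values the pooled estimate moves from 0.15 to roughly 0.58 after one ineligible study is admitted, because inverse-variance weighting rewards precision regardless of whether the study belongs.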

Figure 6 of the paper is the humiliation hidden behind the benchmark. The problem is not only that AI misses studies or includes the wrong ones. The four systems read the same 88 reviews and answer in four incompatible ways: that is scientific pretence, not synthesis. Even GPT-5, with its polished structure and more convincing tone, still collapses when judgement is required and cannot make evidence synthesis trustworthy. ProtoMA retreats into “mixed” conclusions in more than 80% of cases; DeepSeek-R1 fails to produce a clear directional conclusion in 80% of the test papers.

This should worry us, because academia already rewards the appearance of productivity over the slow dignity of judgement. Universities, publishers, and funders all want more output, faster. Into that system walks generative AI, offering to accelerate everything, as if acceleration were automatically a virtue. But accelerate what? Evidence synthesis is not boring clerical work to be automated away; it is the intellectual immune system of science, the process by which we decide what counts and why. The danger is not an AI that knows nothing; it is one that knows enough to be persuasive but not enough to be trustworthy.