Evaluating AI on Biological Reasoning Tasks

The Problem

Most AI benchmarks for biology test fact recall. They ask what the powerhouse of the cell is, or which enzyme unwinds DNA. These are important, but they measure memory — not reasoning.

Biological reasoning requires something different: decomposing a complex question into sub-problems, gathering evidence from multiple sources, assembling that evidence into a coherent hypothesis, and then critically evaluating whether the hypothesis actually holds. A model that can name p53 but cannot explain why its loss leads to uncontrolled proliferation has memorised a fact without understanding the mechanism.

This gap matters because the questions researchers actually ask — “What role does p53 play in cancer?” or “Why does this drug fail in Phase III despite strong preclinical data?” — require multi-step reasoning, not keyword lookup.

What We Built

To explore this gap, we built four open-source tools that each address a different stage of the biological reasoning pipeline. None of them use external LLM APIs — they run entirely on local rule-based engines and small knowledge bases, which makes their behaviour deterministic and inspectable.

BioBench — A 20-question benchmark with dual scoring: multiple-choice accuracy (fact recall) and keyword-based rubric scoring (mechanistic reasoning). The separation reveals models that score well on selection but poorly on explanation.
BioAgent — A 4-agent pipeline where a Planner extracts entities and generates sub-questions, an Evidence agent searches a local knowledge base via TF-IDF, a Hypothesis agent assembles findings, and a Critic evaluates coverage, consistency, and specificity.
AI Experimental Critic — A 10-rule engine that evaluates experiment proposals for missing controls, randomization gaps, underpowered samples, and statistical omissions. Each finding is severity-rated and contributes to a weighted quality score.
Paper-Synth — A synthesis engine that runs five parallel analysis passes over a collection of abstracts: topic extraction, consensus detection, disagreement finding, gap identification, and experiment suggestion.

Evaluation Philosophy

The conventional approach to evaluating AI on scientific tasks is to build a large benchmark and measure accuracy. We took a different approach: build small, specialised tools that each isolate one dimension of reasoning, and evaluate them on whether they produce outputs a domain expert would find useful.

BioBench separates recall from reasoning by design. A model that selects the correct multiple-choice answer but fails to mention the key mechanistic keywords in its free-text explanation gets a high MCQ score but a low rubric score. This dual-score approach exposes a failure mode that single-metric benchmarks miss entirely.
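
A minimal sketch of the dual-score idea, assuming a simple substring match for rubric keywords. The `Question` schema and the function names are illustrative and are not BioBench's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One benchmark item (hypothetical schema, not BioBench's real format)."""
    prompt: str
    correct_choice: str
    rubric_keywords: tuple[str, ...]


def mcq_score(question: Question, selected: str) -> float:
    """Recall score: 1.0 if the selected option matches the key, else 0.0."""
    return 1.0 if selected.strip().upper() == question.correct_choice.upper() else 0.0


def rubric_score(question: Question, explanation: str) -> float:
    """Reasoning score: fraction of rubric keywords found in the explanation."""
    if not question.rubric_keywords:
        return 0.0
    text = explanation.lower()
    hits = sum(1 for kw in question.rubric_keywords if kw.lower() in text)
    return hits / len(question.rubric_keywords)


q = Question(
    prompt="Which organelle produces most of the cell's ATP?",
    correct_choice="B",
    rubric_keywords=("oxidative phosphorylation", "electron transport", "ATP synthesis"),
)
explanation = "Mitochondria are the powerhouse of the cell."
print(mcq_score(q, "B"), rubric_score(q, explanation))  # 1.0 0.0
```

The example answer picks the right option yet scores zero on the rubric, which is the divergence the two scores are meant to expose. Substring matching is also where the keyword-list limitation comes from: a correct explanation phrased in different words would score low.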

BioAgent evaluates not just the final answer but the intermediate steps: Did the planner identify the right entities? Did the evidence search return relevant entries? Did the hypothesis address all sub-questions? The Critic agent provides a structured quality score across three dimensions rather than a single pass/fail.

The Experimental Critic applies a fixed rubric — the same 10 checks a senior PI would apply during grant review. This is deliberately not learned from data; it codifies expert knowledge into deterministic rules that produce the same output every time.

Architecture

Each tool follows the same design principles: offline-first, deterministic, and inspectable. No API keys, no network calls, no non-deterministic sampling. The architecture choices reflect a deliberate trade-off: we sacrifice the fluency of large language models for the reproducibility of rule-based systems.

BioAgent uses a sequential pipeline where each agent passes a typed data structure to the next. The Planner outputs entities and sub-questions, the Evidence agent outputs ranked search results with relevance scores, the Hypothesis agent outputs a template-filled narrative, and the Critic outputs a structured score with per-dimension breakdowns. Every intermediate result is logged and inspectable.
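
The sequential hand-off can be sketched as follows. This is an illustration under stated assumptions, not BioAgent's code: the three-entry knowledge base, the dataclass names, the smoothed TF-IDF weighting, and the coverage-only critic are all hypothetical (the actual Critic also scores consistency and specificity).

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

# Toy knowledge base; entries are illustrative, not BioAgent's actual data.
KB = {
    "p53": "p53 is a tumor suppressor that halts the cell cycle after DNA damage",
    "MDM2": "MDM2 is a ubiquitin ligase that targets p53 for degradation",
    "ATP synthase": "ATP synthase uses the proton gradient to make ATP in mitochondria",
}


@dataclass(frozen=True)
class Plan:
    entities: list[str]
    sub_questions: list[str]


@dataclass(frozen=True)
class Evidence:
    hits: list[tuple[str, float]]  # (KB key, relevance score), best first


@dataclass(frozen=True)
class Critique:
    coverage: float  # fraction of planned entities backed by a retrieved entry


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def plan(question: str) -> Plan:
    """Planner: keep KB keys mentioned in the question, one sub-question each."""
    entities = [k for k in KB if k.lower() in question.lower()]
    return Plan(entities, [f"What is known about {e}?" for e in entities])


def search(p: Plan, top_k: int = 2) -> Evidence:
    """Evidence agent: rank KB entries by summed TF-IDF over the query terms."""
    docs = {k: Counter(tokenize(v)) for k, v in KB.items()}
    query = set(tokenize(" ".join(p.sub_questions)))
    scores = {}
    for key, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for term in query:
            df = sum(1 for c in docs.values() if term in c)
            if df:
                # Term frequency times a smoothed inverse document frequency.
                score += (counts[term] / total) * math.log(1 + len(docs) / df)
        scores[key] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return Evidence([(k, s) for k, s in ranked[:top_k] if s > 0])


def hypothesize(p: Plan, ev: Evidence) -> str:
    """Hypothesis agent: fill a fixed template with the retrieved entries."""
    if not ev.hits:
        return "No supporting evidence found in the knowledge base."
    facts = "; ".join(KB[k] for k, _ in ev.hits)
    return f"Regarding {', '.join(p.entities)}: {facts}."


def critique(p: Plan, ev: Evidence) -> Critique:
    """Critic: score coverage only; zero when nothing was planned or found."""
    found = {k for k, _ in ev.hits}
    covered = [e for e in p.entities if e in found]
    return Critique(len(covered) / len(p.entities) if p.entities else 0.0)


def run(question: str) -> dict:
    """Run the stages in order, keeping every intermediate result for inspection."""
    p = plan(question)
    ev = search(p)
    return {"plan": p, "evidence": ev,
            "hypothesis": hypothesize(p, ev), "critique": critique(p, ev)}
```

Calling `run("What role does p53 play in cancer?")` returns all four stage outputs in one dictionary, so a reviewer can check the planner's entities and the retrieval scores separately from the final narrative. A question with no matching entity yields an empty plan and a coverage of 0.0 rather than an invented answer.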

Paper-Synth runs five analysis passes in parallel rather than sequentially. Each pass operates independently on the same input abstracts, and the results are merged into a structured Markdown report. This architecture naturally supports adding new analysis passes without modifying existing ones.
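
One way to picture the independent-pass design is shown below, with two simplified passes standing in for the five. The pass names, the hedging cue list, and the report layout are assumptions made for illustration, not Paper-Synth's implementation.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

HEDGES = ("may", "might", "however", "unclear")  # illustrative hedging cues


def words(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def topic_pass(abstracts: list[str]) -> list[str]:
    """Words of six or more letters that occur in more than one abstract."""
    counts = Counter(w for a in abstracts for w in set(words(a)) if len(w) >= 6)
    return sorted(w for w, n in counts.items() if n > 1)


def hedging_pass(abstracts: list[str]) -> list[str]:
    """Abstracts that contain at least one hedging cue."""
    return [a for a in abstracts if any(h in words(a) for h in HEDGES)]


# Adding a pass means adding one entry here; existing passes are untouched.
PASSES = {"Topics": topic_pass, "Hedged claims": hedging_pass}


def synthesize(abstracts: list[str]) -> str:
    """Run every pass on the same input and merge results into Markdown."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, abstracts) for name, fn in PASSES.items()}
        results = {name: f.result() for name, f in futures.items()}
    lines = []
    for name, items in results.items():
        lines.append(f"## {name}")
        lines.extend(f"- {item}" for item in items or ["(none)"])
        lines.append("")
    return "\n".join(lines)
```

Because each pass reads the same abstracts and shares no state, the order of execution cannot change the report. The thread pool here mainly illustrates that independence; for pure-Python passes it is unlikely to give a real speed-up.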

The Experimental Critic stacks 10 independent rule functions that each receive the same experiment proposal and return zero or more findings. Rules are pure functions with no shared state, which makes them trivial to test, extend, and reorder.
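
A sketch of the rule-stack pattern with three rules in place of ten. The proposal fields, severity weights, sample-size threshold, and scoring formula are placeholders chosen for illustration and do not reflect the Experimental Critic's actual rules or weights.

```python
from dataclasses import dataclass
from typing import Callable

SEVERITY_WEIGHTS = {"critical": 3, "major": 2, "minor": 1}  # assumed weights


@dataclass(frozen=True)
class Finding:
    rule: str
    severity: str
    message: str


Rule = Callable[[dict], list[Finding]]


def missing_negative_control(proposal: dict) -> list[Finding]:
    controls = [c.lower() for c in proposal.get("controls", [])]
    if "negative" not in controls:
        return [Finding("negative_control", "critical", "No negative control listed.")]
    return []


def no_randomization(proposal: dict) -> list[Finding]:
    if not proposal.get("randomized", False):
        return [Finding("randomization", "critical", "Allocation is not randomized.")]
    return []


def small_sample(proposal: dict, minimum: int = 10) -> list[Finding]:
    # Placeholder threshold; a real power check would need an effect size.
    n = proposal.get("n_per_group", 0)
    if n < minimum:
        return [Finding("sample_size", "major", f"n={n} per group is below {minimum}.")]
    return []


RULES: list[Rule] = [missing_negative_control, no_randomization, small_sample]


def review(proposal: dict, rules: list[Rule] = RULES) -> tuple[list[Finding], int]:
    """Apply every rule to the same proposal and derive a 0-100 quality score."""
    findings = [f for rule in rules for f in rule(proposal)]
    penalty = sum(SEVERITY_WEIGHTS[f.severity] for f in findings)
    return findings, max(0, 100 - 10 * penalty)


findings, score = review({"controls": ["positive"], "randomized": False, "n_per_group": 4})
print(len(findings), score)  # 3 20
```

Each rule sees only the proposal and returns its own findings, so a rule can be unit-tested with a single dictionary, and reordering or extending `RULES` changes nothing about the others.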

Key Findings

Running these tools across their test suites revealed several patterns worth noting:

Recall and reasoning diverge. BioBench’s dual scoring consistently shows a gap between MCQ accuracy and rubric scores. A model can select “mitochondria” from four options while producing an explanation that fails to mention oxidative phosphorylation, electron transport, or ATP synthesis, which together make up the actual mechanistic story.

Structured critique catches what free-form review misses. The Experimental Critic found an average of 6.3 issues per proposal across the demo dataset. The most common critical findings were missing negative controls and absent randomization — issues that are easy to overlook in narrative review but trivial to detect with explicit rules.

Multi-agent decomposition works for well-scoped questions. BioAgent’s pipeline produces coherent answers when the input question maps cleanly to entities in the knowledge base. It degrades gracefully when entities are absent — the Critic scores coverage accordingly rather than hallucinating evidence.

Synthesis is harder than analysis. Paper-Synth can reliably identify topics, consensus terms, and hedging language. But generating actionable experiment suggestions from abstract-level information remains the weakest pass — unsurprising, since experiment design requires domain expertise that keyword-matching cannot replicate.

Limitations

These tools are deliberately constrained. They operate on small, curated knowledge bases (25 entries for BioAgent, 20 questions for BioBench, 3 demo proposals for the Critic). They do not generalise to arbitrary biological questions — they demonstrate an approach, not a production system.

The rule-based approach trades recall for precision. The Experimental Critic will never find a novel flaw that its 10 rules don’t cover. Paper-Synth will miss disagreements expressed without hedging language. BioBench’s rubric scoring depends on keyword lists that may not capture all valid explanations.

There is no learned component in any of these tools. That is the point — they establish baselines for what deterministic, inspectable systems can achieve — but it also means they cannot improve from data without manual rule updates.

All four tools are open-source and installable via pip. The anchor repository links them together with a shared evaluation philosophy and a cross-repo example pipeline.

Related Repositories

bio-ai-systems: Anchor repo covering motivation, project map, and cross-tool pipeline example.
biobench: 20-question benchmark with dual MCQ + rubric scoring for biological reasoning.
bioagent: 4-agent pipeline with planner, evidence search, hypothesis assembly, and critic.
ai-experimental-critic: 10-rule critique engine for experiment proposals (controls, blinding, power).

What’s Next

The immediate next step is expanding BioBench’s question bank beyond 20 items and adding difficulty tiers — introductory, intermediate, and research-level — to better stratify where models succeed and fail.

BioAgent’s knowledge base is the most obvious bottleneck. At 25 entries, it covers major pathways but misses entire subfields. Expanding to 200+ entries with structured metadata (pathway membership, evidence strength, recency) would make the TF-IDF retrieval meaningfully competitive with embedding-based search.

The Experimental Critic could benefit from domain-specific rule packs — one for clinical trials (informed consent, IRB approval, adverse event reporting), one for molecular biology (antibody validation, cell line authentication), one for ecology (spatial autocorrelation, observer bias). The architecture already supports this through its independent rule-function design.

Longer term, we are interested in combining these tools into an end-to-end pipeline: parse a set of papers (Paper-Synth), generate a hypothesis (BioAgent), critique the experiment proposed to test it (Experimental Critic), and evaluate the reasoning quality of the entire chain (BioBench). Each tool was designed to be composable, with outputs that are structured data rather than free text, so the integration surface is already there.
