Aging biology eval
Probing Claude's reasoning about gene effects on lifespan
Evaluating claude-sonnet-4-6 · 1,379 entries · run 2026-05-11
A 1,379-gene evaluation of Claude Sonnet 4.6's ability to predict whether an aging gene promotes or opposes longevity from its functional annotation (gene symbol, functional descriptors, and GO Molecular Function terms, with lifespan language stripped). Each entry is also evaluated with the symbol blinded as a contamination control. The model reaches 45% accuracy on a 3-class task. The more revealing finding is a directional bias: it identifies pro-longevity genes 73% of the time and anti-longevity genes only 30% of the time, often inverting the label even when its own reasoning arrives at the correct mechanism. Each per-entry page shows the prompt, the model's reasoning, the GenAge ground truth, and an LLM-graded judgment.
Main split accuracy
3-class longevity influence prediction across 1,379 aging genes. Cohen's κ = 0.79 vs. expert hand-grading.
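The agreement statistic reported here, Cohen's κ, corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch (the label names and toy data below are illustrative, not taken from the eval):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two raters' label sequences
    (here: LLM judge vs. expert hand-grading)."""
    assert len(a) == len(b)
    n = len(a)
    # observed agreement: fraction of entries where raters match
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement: dot product of the raters' marginal label frequencies
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# toy 3-class example
judge  = ["pro", "anti", "unclear", "pro", "anti", "pro"]
expert = ["pro", "anti", "unclear", "pro", "pro",  "pro"]
print(round(cohens_kappa(judge, expert), 3))  # → 0.714
```

A κ of 0.79, as reported for the main split, indicates substantial agreement between the LLM grader and the expert well above chance.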
Counterfactual split blinds the gene symbol but not the protein's functional description (e.g., “Insulin-like receptor subunit beta” still identifies daf-2). See methodology →
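The two splits can be thought of as successive redaction passes over each entry's annotation. The sketch below is a hypothetical illustration of that pipeline — the actual term list, redaction tokens, and annotation format used by the eval are not specified here:

```python
import re

# Hypothetical lifespan vocabulary; the eval's real stripping rules may differ.
LIFESPAN_TERMS = re.compile(
    r"\b(lifespan|longevity|life[- ]extension|ageing|aging|senescence)\b",
    re.IGNORECASE,
)

def strip_lifespan_language(annotation: str) -> str:
    """Main split: remove lifespan vocabulary from the functional annotation."""
    return LIFESPAN_TERMS.sub("[REDACTED]", annotation)

def blind_symbol(annotation: str, symbol: str) -> str:
    """Counterfactual split: additionally redact the gene symbol itself.
    Note the functional descriptor survives, which is the leak described above."""
    return re.sub(re.escape(symbol), "[GENE]", annotation, flags=re.IGNORECASE)

entry = "daf-2: Insulin-like receptor; mutations extend lifespan in C. elegans"
print(blind_symbol(strip_lifespan_language(entry), "daf-2"))
# → [GENE]: Insulin-like receptor; mutations extend [REDACTED] in C. elegans
```

As the example shows, redacting the symbol alone leaves "Insulin-like receptor" intact, so a model familiar with worm aging genetics can still recover the gene's identity.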
Failure mode breakdown
Accuracy by class
Notable entries
Seven entries spanning GenAge's four model organisms, mixing textbook-correct cases on canonical aging genes (daf-2, age-1, foxo, TOR1) with examples of the failure modes the eval surfaces: label inversions (chico, Ghrhr) and a confident commitment against a curator-assigned unclear ground truth (SIR2).
Significance
Aging biology is a domain where superficially correct text can mask shaky reasoning. By stripping lifespan language from each entry and grading both the answer and its mechanism, this eval distinguishes genuine inference from pattern-matching on familiar gene names. The headline finding is not the 45% accuracy itself but the directional asymmetry: the model identifies pro-longevity genes 73% of the time and anti-longevity genes only 30%, frequently producing reasoning that contradicts its own final label. That gap is a tractable target for prompting work, fine-tuning, and benchmarking future models on a task with curator-grounded ground truth.
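The asymmetry described above is a per-class recall gap: accuracy conditioned on the ground-truth class rather than pooled. A minimal sketch with toy numbers (the real split is 1,379 GenAge entries, and the class names are illustrative):

```python
from collections import defaultdict

def per_class_recall(preds, labels):
    """Recall per ground-truth class. Pooled accuracy can hide a large
    gap between classes, which is exactly the directional bias reported here."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, y in zip(preds, labels):
        totals[y] += 1
        hits[y] += (p == y)
    return {c: hits[c] / totals[c] for c in totals}

# toy data only, chosen to mimic the reported pro/anti asymmetry
labels = ["pro"] * 4 + ["anti"] * 4
preds  = ["pro", "pro", "pro", "anti", "pro", "pro", "anti", "pro"]
print(per_class_recall(preds, labels))  # → {'pro': 0.75, 'anti': 0.25}
```

Tracking this breakdown alongside pooled accuracy is what makes the 73% vs. 30% gap visible as a target for prompting or fine-tuning work.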
Next steps
- Compare across models. Run the same eval against Claude Opus, GPT-5, Gemini, and open-source models (Llama 4, Qwen 3, DeepSeek-R1) to see whether the directional bias is Claude-specific or a property of LLMs trained on overlapping corpora.
- Add a label-inversion failure mode. The current taxonomy buckets inversions into `confident_wrong` or `right_answer_wrong_reasoning`. A dedicated `reasoning_contradicts_prediction` class would surface the ~3% of entries where the reasoning was right but the structured output flipped.
- Tighten counterfactual blinding. Strip the gene's functional descriptor from the redacted input as well, leaving only Gene Ontology Molecular Function terms, to measure how much performance survives stricter blinding.
- Test prompting interventions. Add explicit examples of GenAge label conventions to the system prompt and measure whether the inversion rate drops.
- Expand to human longevity genes. Extend the eval to LongevityMap and human GWAS hits to test whether the asymmetry holds outside model organisms.