Aging biology eval

Probing Claude's reasoning about gene effects on lifespan

by Andrew T. Rodriguez, Ph.D.

Evaluating claude-sonnet-4-6 · 1,379 entries · run 2026-05-11

A 1,379-gene evaluation of Claude Sonnet 4.6's ability to predict whether an aging gene promotes or opposes longevity from its functional annotation (gene symbol, functional descriptors, and GO Molecular Function terms, with lifespan language stripped). Each entry is also evaluated with the symbol blinded as a contamination control. The model reaches 45% accuracy on a 3-class task (pro-longevity, anti-longevity, unclear). The more revealing finding is a directional bias: it identifies pro-longevity genes 73% of the time and anti-longevity genes only 30% of the time, often inverting the label even when its own reasoning arrives at the correct mechanism. Each per-entry page shows the prompt, the model's reasoning, the GenAge ground truth, and an LLM-graded judgment.

Main split accuracy: 45%

3-class longevity influence prediction across 1,379 aging genes. Cohen's κ = 0.79 vs. expert hand-grading.

Counterfactual accuracy: 42% (gene symbol blinded)
Contamination gap: −3 pp (counterfactual minus main)
Mechanism accuracy: 45% (main split, hallmark prediction)
Advisor κ vs. expert: 0.79 (Cohen's kappa, hand-graded sample)
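
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Cohen's kappa can be computed for two graders. It assumes the reported κ compares the LLM grader's judgments with expert hand grades on the same entries; the function name, label values, and toy data are hypothetical and are not taken from this eval.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each label, multiply the two raters' marginal rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Toy, made-up grades (hypothetical; not data from this eval).
llm_grades = ["correct", "correct", "incorrect", "incorrect", "correct", "incorrect"]
expert_grades = ["correct", "correct", "incorrect", "correct", "correct", "incorrect"]
kappa = cohens_kappa(llm_grades, expert_grades)
print(f"kappa = {kappa:.2f}")
```

Kappa discounts the agreement two graders would reach by chance, which is why it is usually preferred over raw percent agreement when the label distribution is skewed.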

Counterfactual split blinds the gene symbol but not the protein's functional description (e.g., “Insulin-like receptor subunit beta” still identifies daf-2).
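
The limitation noted above can be illustrated with a short sketch of symbol blinding. This is not the eval's actual preprocessing: the `blind_symbol` helper, the `GENE_X` placeholder, and the `symbol: description` annotation layout are all assumptions made for illustration.

```python
import re

def blind_symbol(symbol: str, annotation: str, placeholder: str = "GENE_X") -> str:
    """Replace standalone, case-insensitive occurrences of the gene symbol."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(symbol) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, annotation)

# Hypothetical annotation layout; the description text is the example quoted above.
annotation = "daf-2: Insulin-like receptor subunit beta"
blinded = blind_symbol("daf-2", annotation)
print(blinded)
```

The symbol is gone after blinding, but the descriptor is untouched, so a model that associates the description with the gene can still recover its identity.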

Failure mode breakdown

Accuracy by class

Pro-longevity: 73.0% (481)
Anti-longevity: 30.4% (878)
Unclear: 7.7% (26)
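
As a rough consistency check, the per-class figures above can be combined into an overall accuracy by weighting each class by its size. This sketch assumes the numbers in parentheses are per-class entry counts. Those counts sum to 1,385 rather than 1,379, so the result should be read as approximate.

```python
# (accuracy, count) per class, copied from the breakdown above.
# Assumption: the parenthesized values are the number of entries in each class.
per_class = {
    "pro-longevity": (0.730, 481),
    "anti-longevity": (0.304, 878),
    "unclear": (0.077, 26),
}

total = sum(count for _, count in per_class.values())
# Overall accuracy is the count-weighted mean of the per-class accuracies.
overall = sum(acc * count for acc, count in per_class.values()) / total
print(f"{overall:.1%} over {total} entries")  # 44.8% over 1385 entries
```

Because anti-longevity genes make up most of the set, the low accuracy on that class dominates the overall figure.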

Notable entries

Seven entries span GenAge's four model organisms. They mix textbook-correct cases on canonical aging genes (daf-2, age-1, foxo, TOR1) with examples of the failure modes the eval surfaces: label inversions (chico, Ghrhr) and a confident commitment against a curator-assigned "unclear" ground truth (SIR2).


Significance

Aging biology is a domain where superficially correct text can mask shaky reasoning. By stripping lifespan language from each entry and grading both the answer and its mechanism, this eval distinguishes genuine inference from pattern-matching on familiar gene names. The headline finding is not the 45% accuracy itself but the directional asymmetry: the model identifies pro-longevity genes 73% of the time and anti-longevity genes only 30% of the time, frequently producing reasoning that contradicts its own final label. That gap is a tractable target for prompting work, fine-tuning, and benchmarking future models on a task with curator-assigned ground truth.

Next steps