Aging biology eval
Probing Claude's reasoning about gene effects on lifespan
Evaluating claude-sonnet-4-6 · 1,379 entries · run 2026-05-11
A 1,379-gene evaluation of Claude Sonnet 4.6's ability to predict whether an aging gene promotes or opposes longevity from its functional annotation (gene symbol, functional descriptors, and GO Molecular Function terms, with lifespan language stripped). Each entry is also evaluated with the symbol blinded as a contamination control. The model reaches 45% accuracy on a 3-class task. The more revealing finding is a directional bias: it identifies pro-longevity genes 73% of the time and anti-longevity genes only 30% of the time, often inverting the label even when its own reasoning arrives at the correct mechanism. Each per-entry page shows the prompt, the model's reasoning, the GenAge ground truth, and an LLM-graded judgment.
Main split accuracy
3-class longevity influence prediction across 1,379 aging genes. Cohen's κ = 0.79 vs. expert hand-grading.
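The agreement statistic reported here, Cohen's κ, corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch (the label names and toy data below are illustrative, not taken from the eval):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two raters' label sequences
    (here: LLM judge vs. expert hand-grading)."""
    assert len(a) == len(b)
    n = len(a)
    # observed agreement: fraction of entries where raters match
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement: dot product of the raters' marginal label frequencies
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# toy 3-class example
judge  = ["pro", "anti", "unclear", "pro", "anti", "pro"]
expert = ["pro", "anti", "unclear", "pro", "pro",  "pro"]
print(round(cohens_kappa(judge, expert), 3))  # → 0.714
```

A κ of 0.79, as reported for the main split, indicates substantial agreement between the LLM grader and the expert well above chance.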
Counterfactual split blinds the gene symbol but not the protein's functional description (e.g., “Insulin-like receptor subunit beta” still identifies daf-2). See methodology →
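The two splits can be thought of as successive redaction passes over each entry's annotation. The sketch below is a hypothetical illustration of that pipeline — the actual term list, redaction tokens, and annotation format used by the eval are not specified here:

```python
import re

# Hypothetical lifespan vocabulary; the eval's real stripping rules may differ.
LIFESPAN_TERMS = re.compile(
    r"\b(lifespan|longevity|life[- ]extension|ageing|aging|senescence)\b",
    re.IGNORECASE,
)

def strip_lifespan_language(annotation: str) -> str:
    """Main split: remove lifespan vocabulary from the functional annotation."""
    return LIFESPAN_TERMS.sub("[REDACTED]", annotation)

def blind_symbol(annotation: str, symbol: str) -> str:
    """Counterfactual split: additionally redact the gene symbol itself.
    Note the functional descriptor survives, which is the leak described above."""
    return re.sub(re.escape(symbol), "[GENE]", annotation, flags=re.IGNORECASE)

entry = "daf-2: Insulin-like receptor; mutations extend lifespan in C. elegans"
print(blind_symbol(strip_lifespan_language(entry), "daf-2"))
# → [GENE]: Insulin-like receptor; mutations extend [REDACTED] in C. elegans
```

As the example shows, redacting the symbol alone leaves "Insulin-like receptor" intact, so a model familiar with worm aging genetics can still recover the gene's identity.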
Failure mode breakdown
Accuracy by class
Notable entries
Seven entries spanning GenAge's four model organisms, mixing textbook-correct cases on canonical aging genes (daf-2, age-1, foxo, TOR1) with examples of the failure modes the eval surfaces: label inversions (chico, Ghrhr) and a confident commitment against a curator-assigned unclear ground truth (SIR2).
Significance
Aging biology is a domain where superficially correct text can mask shaky reasoning. By stripping lifespan language from each entry and grading both the answer and its mechanism, this eval distinguishes genuine inference from pattern-matching on familiar gene names. The headline finding is not the 45% accuracy itself but the directional asymmetry: the model identifies pro-longevity genes 73% of the time and anti-longevity genes only 30%, frequently producing reasoning that contradicts its own final label. That gap is a tractable target for prompting work, fine-tuning, and benchmarking future models on a task with curator-grounded ground truth.
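The asymmetry described above is a per-class recall gap: accuracy conditioned on the ground-truth class rather than pooled. A minimal sketch with toy numbers (the real split is 1,379 GenAge entries, and the class names are illustrative):

```python
from collections import defaultdict

def per_class_recall(preds, labels):
    """Recall per ground-truth class. Pooled accuracy can hide a large
    gap between classes, which is exactly the directional bias reported here."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, y in zip(preds, labels):
        totals[y] += 1
        hits[y] += (p == y)
    return {c: hits[c] / totals[c] for c in totals}

# toy data only, chosen to mimic the reported pro/anti asymmetry
labels = ["pro"] * 4 + ["anti"] * 4
preds  = ["pro", "pro", "pro", "anti", "pro", "pro", "anti", "pro"]
print(per_class_recall(preds, labels))  # → {'pro': 0.75, 'anti': 0.25}
```

Tracking this breakdown alongside pooled accuracy is what makes the 73% vs. 30% gap visible as a target for prompting or fine-tuning work.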
Next steps
- Compare across models. Run the same eval against Claude Opus, GPT-5, Gemini, and open-source models (Llama 4, Qwen 3, DeepSeek-R1) to see whether the directional bias is Claude-specific or a property of LLMs trained on overlapping corpora.
- Add a label-inversion failure mode. The current taxonomy buckets inversions into `confident_wrong` or `right_answer_wrong_reasoning`. A dedicated `reasoning_contradicts_prediction` class would surface the ~3% of entries where the reasoning was right but the structured output flipped.
- Tighten counterfactual blinding. Strip the gene's functional descriptor from the redacted input as well, leaving only Gene Ontology Molecular Function terms, to measure how much performance survives stricter blinding.
- Test prompting interventions. Add explicit examples of GenAge label conventions to the system prompt and measure whether the inversion rate drops.
- Expand to human longevity genes. Extend the eval to LongevityMap and human GWAS hits to test whether the asymmetry holds outside model organisms.