Methodology
How the eval was designed, run, and validated.
The contamination problem
Large language models are trained on vast amounts of text from the internet, including scientific databases, review articles, and curated resources. When a model is asked about a well-characterized gene like daf-2, it may answer correctly not because it is reasoning from first principles, but because it encountered the answer during training, a form of memorization rather than reasoning. This is the contamination problem: the test data may have leaked into the training data, inflating apparent capability.
In classical machine learning, contamination is addressed by holding out a test set the model has never seen. For a biology eval built on public databases, perfect isolation is impossible, so a different approach is needed. We measure how much performance depends on the gene symbol by running the same eval twice: once with the real symbol visible, and once with it blinded.
The two splits
Every entry is evaluated in two conditions, called splits:
- Main split. The model sees the real gene symbol, the organism, and the redacted functional description. This is the standard eval condition.
- Counterfactual split. The gene symbol is replaced with the placeholder GENE-X. Everything else, including the organism name, the gene's functional description, and Gene Ontology Molecular Function terms, is identical to the main split. See the Limitations section for what this does and doesn't blind.
The accuracy gap between the two splits (main accuracy minus counterfactual accuracy) is the contamination gap. A large gap indicates that the model relies heavily on recognizing gene names. A small gap suggests the model can reason from functional descriptions alone, whether or not the symbol is familiar.
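The gap defined above is a simple difference of two accuracies. The following is a minimal sketch of that arithmetic, assuming each graded result is stored with its split name and the advisor's verdict; the `Result` class, its field names, and the toy data are hypothetical and are not taken from the eval's codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    entry_id: str         # hypothetical identifier shared by both splits
    split: str            # "main" or "counterfactual"
    answer_correct: bool  # the advisor's verdict for this entry


def accuracy(results, split):
    """Fraction of entries in `split` that were graded correct."""
    graded = [r.answer_correct for r in results if r.split == split]
    if not graded:
        raise ValueError(f"no results for split {split!r}")
    return sum(graded) / len(graded)


def contamination_gap(results):
    """Main-split accuracy minus counterfactual-split accuracy."""
    return accuracy(results, "main") - accuracy(results, "counterfactual")


# Toy data for illustration only; these are not results from the eval.
toy = [
    Result("a", "main", True),
    Result("b", "main", True),
    Result("c", "main", False),
    Result("d", "main", True),
    Result("a", "counterfactual", True),
    Result("b", "counterfactual", False),
    Result("c", "counterfactual", False),
    Result("d", "counterfactual", True),
]
gap = contamination_gap(toy)
print(f"contamination gap = {gap:.2f}")
```

Because both splits cover the same entries, the gap can also be broken down per entry to find the genes whose answers flip when the symbol is hidden.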
The same entry in both splits
Main
Gene: daf-2
Organism: Caenorhabditis elegans
Known functions: insulin-like receptor ...
Counterfactual
Gene: GENE-X
Organism: Caenorhabditis elegans
Known functions: insulin-like receptor ...
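One way to produce both views from a single record is sketched below. It assumes an entry is a plain dictionary with `symbol`, `organism`, and `functions` keys (hypothetical names), and that blinding is a whole-word, case-insensitive substitution so that both daf-2 and DAF-2 are masked. The eval's actual rendering code may differ.

```python
import re

PLACEHOLDER = "GENE-X"


def blind(text, symbol):
    """Replace whole-word, case-insensitive occurrences of `symbol`."""
    pattern = r"\b" + re.escape(symbol) + r"\b"
    return re.sub(pattern, PLACEHOLDER, text, flags=re.IGNORECASE)


def render_entry(entry, split):
    """Render one entry as prompt text for the given split."""
    if split == "main":
        symbol, functions = entry["symbol"], entry["functions"]
    elif split == "counterfactual":
        symbol = PLACEHOLDER
        # Assumption: the symbol is also masked inside the description.
        functions = blind(entry["functions"], entry["symbol"])
    else:
        raise ValueError(f"unknown split: {split!r}")
    return "\n".join([
        f"Gene: {symbol}",
        f"Organism: {entry['organism']}",
        f"Known functions: {functions}",
    ])


entry = {
    "symbol": "daf-2",
    "organism": "Caenorhabditis elegans",
    "functions": "insulin-like receptor ...",
}
main_text = render_entry(entry, "main")
blind_text = render_entry(entry, "counterfactual")
print(main_text)
print(blind_text)
```

Only the first line differs between the two renderings here, which is exactly why the descriptor and organism can still identify a well-known gene.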
Solver and advisor
Each entry is processed by two model calls in sequence.
The solver (Claude Sonnet 4.6, temperature 0) receives the redacted entry and uses a forced tool call to submit a structured prediction: the gene's longevity influence (pro_longevity, anti_longevity, or unclear), a confidence score, the mechanism class, a reasoning paragraph, and up to three key pathways.
The advisor (also Claude Sonnet 4.6, temperature 0) receives the original entry, the GenAge ground-truth label, and the solver's full output. It grades the prediction across four dimensions: answer correctness, mechanism accuracy, reasoning quality (1–5), and failure mode. Using a second model call as the grader avoids manual annotation at scale while maintaining structured, auditable output.
Both calls use forced tool use (not the structured-outputs API header) to guarantee JSON that matches the defined schema. The prompts and tool schemas are version-hashed and stored on each runs row, so any run can be reproduced precisely.
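As an illustration, a solver tool definition and a version hash might look like the sketch below. The tool is written in the `name` / `description` / `input_schema` shape that the Anthropic Messages API uses for tools, and `tool_choice` names the tool to force the call. The tool name, the property names, the 0 to 1 confidence range, and the hashing recipe are all assumptions made for this example; the eval's real schema and hash function are not shown in this document.

```python
import hashlib
import json

# Hypothetical tool definition; property names are illustrative only.
SOLVER_TOOL = {
    "name": "submit_prediction",
    "description": "Submit a structured longevity prediction for one gene.",
    "input_schema": {
        "type": "object",
        "properties": {
            "longevity_influence": {
                "type": "string",
                "enum": ["pro_longevity", "anti_longevity", "unclear"],
            },
            # Assumed range; the text only says "a confidence score".
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            "mechanism_class": {"type": "string"},
            "reasoning": {"type": "string"},
            "key_pathways": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 3,
            },
        },
        "required": [
            "longevity_influence",
            "confidence",
            "mechanism_class",
            "reasoning",
            "key_pathways",
        ],
    },
}

# Passing this as tool_choice requires the model to call the named tool.
TOOL_CHOICE = {"type": "tool", "name": SOLVER_TOOL["name"]}


def version_hash(system_prompt, tool):
    """SHA-256 over a canonical JSON encoding of the prompt and tool."""
    canonical = json.dumps(
        {"system_prompt": system_prompt, "tool": tool},
        sort_keys=True,           # key order must not change the hash
        separators=(",", ":"),    # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


h = version_hash("You are a biology expert.", SOLVER_TOOL)
print(h[:12])
```

Serializing with sorted keys means two runs share a hash only when the prompt text and schema are identical, regardless of how the dictionaries were built.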
Mechanism classes: the hallmarks of aging
The solver is asked to assign each gene to a mechanism class, the aging pathway most relevant to that gene's molecular function. The enum is drawn from the López-Otín 2023 framework, which identifies 12 hallmarks of aging plus other (for mechanisms outside the framework) and unclear (when the model cannot confidently classify). This controlled vocabulary makes mechanism predictions comparable across genes and runs. (López-Otín et al. 2023)
Genomic instability
Accumulation of DNA damage and mutations over time, including double-strand breaks, base modifications, and chromosomal rearrangements that erode genome integrity.
Telomere attrition
Progressive shortening of the protective DNA-protein caps at chromosome ends, eventually triggering replicative arrest or apoptosis when telomeres become critically short.
Epigenetic alterations
Age-related drift in DNA methylation patterns, histone modifications, and chromatin organization that reshapes gene expression independently of changes to the underlying DNA sequence.
Loss of proteostasis
Decline in the cellular machinery that folds, refolds, and degrades proteins, leading to accumulation of misfolded and aggregated species.
Disabled macroautophagy
Reduced capacity for the bulk lysosomal degradation pathway that clears damaged organelles and protein aggregates, allowing cellular waste to accumulate.
Deregulated nutrient sensing
Dysfunction of conserved pathways (insulin/IGF-1, mTOR, AMPK, sirtuins) that detect nutrient availability and coordinate cell growth, metabolism, and stress resistance.
Mitochondrial dysfunction
Decline in mitochondrial efficiency that lowers ATP output, raises reactive oxygen species (ROS), and disrupts the metabolic and signaling roles of these organelles.
Cellular senescence
Accumulation of cells that have permanently exited the cell cycle but persist in tissues, often secreting inflammatory factors that affect surrounding cells.
Stem cell exhaustion
Decline in the number and regenerative capacity of tissue-resident stem cells, undermining tissue maintenance and repair across the lifespan.
Altered intercellular communication
Disruption of signaling between cells, tissues, and organs (including neuroendocrine, paracrine, and immune signals) that coordinates organism-wide homeostasis.
Chronic inflammation
Persistent low-grade systemic inflammation ("inflammaging") that accompanies older age and contributes to many age-related diseases.
Dysbiosis
Disruption of the composition and function of resident microbial communities, particularly in the gut, with downstream effects on metabolism, immunity, and aging.
Other
Mechanisms outside the López-Otín 2023 hallmarks framework.
Unclear
The model couldn't confidently assign a primary mechanism.
Data pipeline
The eval dataset starts from the GenAge model organisms database, which catalogs genes with known effects on lifespan across model organisms (C. elegans, D. melanogaster, S. cerevisiae, and M. musculus). The CSV was downloaded on May 9, 2026 and contained 2,202 entries.
For each entry, per-gene functional annotations were fetched from NCBI Gene via E-utilities: the official full name, functional descriptors, and Gene Ontology Molecular Function (GO MF) terms. GO Biological Process terms and RefSeq summaries were deliberately excluded because they frequently contain lifespan-related language that would leak the answer to the model.
An automated redaction pass then stripped remaining lifespan and aging language from the functional description using a forbidden-terms list (longevity, lifespan, aging, life-extension, senescence, and related terms). Four QC filters further pruned entries: no GO MF terms annotated, sparse functional content, high redaction density, and post-redaction leakage of aging language. After redaction and QC, 1,846 entries remained. A 30-entry spot check of the redaction output was performed by hand before running the eval.
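The redaction pass and the four QC filters can be sketched as follows. The term list contains only the terms named above (the "related terms" are not reproduced), and the mask token, the leakage pattern, and both thresholds are hypothetical values chosen for illustration, not the settings used in the eval.

```python
import re

# Terms named in the text; the real list also covers related terms.
FORBIDDEN_TERMS = ["longevity", "lifespan", "aging", "life-extension", "senescence"]
MASK = "[REDACTED]"  # hypothetical mask token

REDACT_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in FORBIDDEN_TERMS) + r")\b",
    flags=re.IGNORECASE,
)
# Hypothetical broader pattern for phrasing the term list does not cover.
LEAK_PATTERN = re.compile(r"long-lived|life span|ageing", flags=re.IGNORECASE)


def redact(text):
    """Return (redacted_text, number_of_terms_masked)."""
    return REDACT_PATTERN.subn(MASK, text)


def passes_qc(go_mf_terms, original, redacted, n_masked,
              min_words=5, max_density=0.2):
    """Apply the four QC checks; thresholds are illustrative assumptions."""
    words = original.split()
    if not go_mf_terms:                              # no GO MF terms annotated
        return False
    if len(words) < min_words:                       # sparse functional content
        return False
    if n_masked / max(len(words), 1) > max_density:  # high redaction density
        return False
    if LEAK_PATTERN.search(redacted):                # post-redaction leakage
        return False
    return True


text = "Kinase that modulates lifespan via insulin receptor signalling in neurons"
redacted, n_masked = redact(text)
print(redacted)
print(passes_qc(["protein kinase activity"], text, redacted, n_masked))
```

The leakage check deliberately uses a different pattern from the redaction step: re-running the same term list on already-redacted text would never find anything.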
Of these, 1,385 entries carry one of the three longevity influence classes used in the eval: pro_longevity, anti_longevity, or unclear. The remaining entries are labeled necessary_for_fitness or unannotated and are excluded because those labels don't map cleanly onto the three-class prediction task.
Ground truth
The ground-truth labels come from GenAge's curators. For each gene, curators synthesize evidence across multiple studies and assign a Longevity Influence label that captures the gene's normal-function role in promoting or opposing longevity.
This reconciled judgment is the eval's prediction target because it provides a single, cross-study signal. It is not the same as predicting the outcome of any particular manipulation (e.g., overexpression vs. loss-of-function in a specific tissue). The model is asked to predict the gene's normal-function effect, consistent with how GenAge defines the label.
Note that the mechanism class filter on the per-entry browse page is based on the model's predicted mechanism (from the solver output), not a curator-assigned mechanism. GenAge does not provide a per-gene mechanism field, so the predicted mechanism is what we are studying.
Validating the advisor
Because the advisor is itself a language model, its judgments could be systematically biased. To quantify this, 30 entries were randomly sampled and hand-graded by Andrew T. Rodriguez, Ph.D. Cohen's kappa was computed between the advisor's grades and the hand grades on the answer_correct field.
The measured κ is 0.79. Values above 0.7 are commonly treated as strong agreement, and on the Landis & Koch (1977) scale 0.79 falls in the “substantial agreement” range (0.61–0.80). This provides a concrete calibration point for interpreting the advisor's grading.
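For readers unfamiliar with the statistic, Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's label frequencies: κ = (p_o − p_e) / (1 − p_e). The sketch below implements that formula directly; the ten toy grades are invented for illustration and are not the 30 hand-graded entries.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of nominal labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy answer_correct grades; not the eval's data.
advisor = [True, True, True, True, True, True, False, False, False, False]
hand = [True, True, True, True, True, False, False, False, False, True]
kappa = cohens_kappa(advisor, hand)
print(f"kappa = {kappa:.2f}")
```

Raw agreement in the toy data is 80%, but kappa is noticeably lower because two raters who both say "correct" most of the time would agree often by chance alone.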
Directional bias and label-inversion failures
The model shows a strong directional bias: pro-longevity recall is 73% but anti-longevity recall is only 30%, despite anti-longevity being the larger class (878 vs. 481 entries). Closer inspection of the misclassifications reveals a specific failure mode the eval's seven-class taxonomy doesn't cleanly capture. The model's reasoning text correctly describes the gene's role, but the final structured prediction inverts the label.
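Per-class recall is the share of entries with a given ground-truth label that the model predicted correctly. A minimal sketch of the calculation, using invented toy labels rather than the eval's results:

```python
from collections import Counter


def per_class_recall(truth, predicted):
    """Recall per ground-truth class: correct predictions / class size."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    totals = Counter(truth)
    hits = Counter(t for t, p in zip(truth, predicted) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels showing a bias toward predicting pro_longevity.
truth = ["anti_longevity"] * 4 + ["pro_longevity"] * 2
predicted = ["anti_longevity"] + ["pro_longevity"] * 5
recall = per_class_recall(truth, predicted)
print(recall)
```

Reporting recall per class rather than overall accuracy is what exposes the asymmetry: a model that over-predicts one label can score well on that class while missing most of the other.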
Approximately 45 entries (~3% of main-split results) show this pattern. Many are textbook anti-longevity genes such as C. elegans atp-2, clk-1, mrpl-1, eat-2, Drosophila chico, and mouse Ghrhr. In these cases the model's reasoning correctly states that the gene's normal activity opposes longevity (because reducing it extends lifespan), but the prediction field outputs pro_longevity. The advisor's grading reflected this ambiguity inconsistently: some inversions were classified as confident_wrong, others as right_answer_wrong_reasoning, though strictly speaking neither fits.
The pattern suggests the model is confused about the GenAge label convention (that “Anti-Longevity” describes the gene's normal-function role, not the direction of any particular manipulation) rather than the underlying biology. A future iteration of this eval should add a dedicated reasoning_contradicts_prediction failure mode and may want to test whether explicit label-convention examples in the system prompt reduce the inversion rate.
Limitations
- Input is GO MF terms and functional descriptors only. The model sees a deliberately narrow slice of each gene's biology. Richer functional descriptions (including GO Biological Process or RefSeq summaries) might produce different accuracy, but those sources are excluded because they often contain lifespan-related language.
- Single eval run on one model version. The primary results are from a single run of Claude Sonnet 4.6. Results may differ across runs (sampling variation at temperature 0 is minimal but non-zero across API calls) and will almost certainly differ across model versions.
- Advisor is itself an LLM. The grading is automated. The 30-entry hand-grading spot check provides a calibration point, but systematic biases in the advisor may not be captured by a 30-entry sample.
- Mechanism classification is fuzzy at boundaries. The hallmarks enum (12 hallmarks plus other and unclear) forces a single primary mechanism on genes with pleiotropic or context-dependent roles. Many aging genes participate in multiple hallmarks; the mechanism accuracy metric reflects this constraint.
- Counterfactual blinding is incomplete. The counterfactual split replaces the gene symbol with GENE-X but preserves the rest of the functional annotation, including the gene's functional descriptor (e.g., “Insulin-like receptor subunit beta”) and its Gene Ontology Molecular Function terms. The gene symbol daf-2 and the protein symbol DAF-2 are both blinded, but the functional descriptor “Insulin-like receptor subunit beta” combined with the organism “Caenorhabditis elegans” uniquely identifies the gene. For well-characterized genes, this means blinding is only partial. The small main-vs-counterfactual accuracy gap reflects this limitation more than the model's underlying reasoning capability. A more rigorous future version of this eval would strip the functional descriptors from the redacted input as well, leaving only Gene Ontology Molecular Function terms. The eval remains useful as a test of the model's biological reasoning even where blinding is imperfect: the model still has to articulate a mechanism, identify the correct pathway, and arrive at the correct longevity influence. Readers can inspect the reasoning on each per-entry page and judge whether it reflects genuine biology or pattern-matching.
- Model organisms only. The dataset covers the model organisms in GenAge. Human longevity genes are not included in this eval.
References
Hallmarks framework
López-Otín, C., Blasco, M. A., Partridge, L., Serrano, M., & Kroemer, G. (2023). Hallmarks of aging: An expanding universe. Cell, 186(2), 243–278. https://doi.org/10.1016/j.cell.2022.11.001
Aging gene dataset (GenAge / HAGR)
Tacutu, R., Thornton, D., Johnson, E., Budovsky, A., Barardo, D., Craig, T., Diana, E., Lehmann, G., Toren, D., Wang, J., Fraifeld, V. E., & de Magalhães, J. P. (2018). Human Ageing Genomic Resources: new and updated databases. Nucleic Acids Research, 46(D1), D1083–D1090. https://doi.org/10.1093/nar/gkx1042
Gene annotations
Sayers, E. W., et al. (2022). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 50(D1), D20–D26. https://doi.org/10.1093/nar/gkab1112
NCBI E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
The Gene Ontology Consortium. (2023). The Gene Ontology knowledgebase in 2023. Genetics, 224(1), iyad031. https://doi.org/10.1093/genetics/iyad031
Project: http://geneontology.org
Inter-rater agreement
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.