Methodology

How the eval was designed, run, and validated.

The contamination problem

Large language models are trained on vast amounts of text from the internet, including scientific databases, review articles, and curated resources. When a model is asked about a well-characterized gene like daf-2, it may answer correctly not because it is reasoning from first principles, but because it encountered the answer during training, a form of memorization rather than reasoning. This is the contamination problem: the test data may have leaked into the training data, inflating apparent capability.

In classical machine learning, contamination is addressed by holding out a test set the model has never seen. For a biology eval built on public databases, perfect isolation is impossible. A different approach is needed: we can measure how much performance depends on the gene symbol by running the same eval twice: once with the real symbol visible, and once with it blinded.

The two splits

Every entry is evaluated in two conditions, called splits:

Main split. The model sees the real gene symbol, the organism, and the redacted functional description. This is the standard eval condition.
Counterfactual split. The gene symbol is replaced with the placeholder GENE-X. Everything else, including the organism name, the gene's functional description, and Gene Ontology Molecular Function terms, is identical to the main split. See the Limitations section for what this does and doesn't blind.

The accuracy gap between the two splits (main accuracy minus counterfactual accuracy) is the contamination gap. A large gap indicates that the model relies heavily on recognizing gene names. A small gap suggests the model can reason from functional descriptions alone, whether or not the symbol is familiar.

The same entry in both splits

Main

Gene: daf-2
Organism: Caenorhabditis elegans
Known functions: insulin-like receptor ...

Counterfactual

Gene: GENE-X
Organism: Caenorhabditis elegans
Known functions: insulin-like receptor ...

Solver and advisor

Each entry is processed by two model calls in sequence.

The solver (Claude Sonnet 4.6, temperature 0) receives the redacted entry and uses a forced tool call to submit a structured prediction: the gene's longevity influence (pro_longevity, anti_longevity, or unclear), a confidence score, the mechanism class, a reasoning paragraph, and up to three key pathways.

The advisor (also Claude Sonnet 4.6, temperature 0) receives the original entry, the GenAge ground-truth label, and the solver's full output. It grades the prediction across four dimensions: answer correctness, mechanism accuracy, reasoning quality (1–5), and failure mode. Using a second model as the grader avoids manual annotation at scale while maintaining structured, auditable output.

Both calls use forced tool use (not the structured-outputs API header) to guarantee JSON that matches the defined schema. The prompts and tool schemas are version-hashed and stored on each runs row, so any run can be reproduced precisely.

Mechanism classes: the hallmarks of aging

The solver is asked to assign each gene to exactly one mechanism class, the aging pathway most relevant to that gene's molecular function. Many aging genes plausibly participate in multiple hallmarks; the limitation this creates is discussed in the Limitations section. The enum is drawn from the 12 hallmarks of aging defined by López-Otín et al. (2023), with two additional classes added for this eval: other (for mechanisms outside the framework) and unclear (when the model cannot confidently classify). This controlled vocabulary makes mechanism predictions comparable across genes and runs.

Genomic instability

Accumulation of DNA damage and mutations over time, including double-strand breaks, base modifications, and chromosomal rearrangements that erode genome integrity.

Telomere attrition

Progressive shortening of the protective DNA-protein caps at chromosome ends, eventually triggering replicative arrest or apoptosis when telomeres become critically short.

Epigenetic alterations

Age-related drift in DNA methylation patterns, histone modifications, and chromatin organization that reshapes gene expression independently of changes to the underlying DNA sequence.

Loss of proteostasis

Decline in the cellular machinery that folds, refolds, and degrades proteins, leading to accumulation of misfolded and aggregated species.

Disabled macroautophagy

Reduced capacity for the bulk lysosomal degradation pathway that clears damaged organelles and protein aggregates, allowing cellular waste to accumulate.

Deregulated nutrient sensing

Dysfunction of conserved pathways (insulin/IGF-1, mTOR, AMPK, sirtuins) that detect nutrient availability and coordinate cell growth, metabolism, and stress resistance.

Mitochondrial dysfunction

Decline in mitochondrial efficiency that lowers ATP output, raises reactive oxygen species (ROS), and disrupts the metabolic and signaling roles of these organelles.

Cellular senescence

Accumulation of cells that have permanently exited the cell cycle but persist in tissues, often secreting inflammatory factors that affect surrounding cells.

Stem cell exhaustion

Decline in the number and regenerative capacity of tissue-resident stem cells, undermining tissue maintenance and repair across the lifespan.

Altered intercellular communication

Disruption of signaling between cells, tissues, and organs (including neuroendocrine, paracrine, and immune signals) that coordinates organism-wide homeostasis.

Chronic inflammation

Persistent low-grade systemic inflammation ("inflammaging") that accompanies older age and contributes to many age-related diseases.

Dysbiosis

Disruption of the composition and function of resident microbial communities, particularly in the gut, with downstream effects on metabolism, immunity, and aging.

Other

Mechanisms outside the López-Otín 2023 hallmarks framework.

Unclear

The model couldn't confidently assign a primary mechanism.

Data pipeline

The eval dataset starts from the GenAge model organisms database, which catalogs genes with known effects on lifespan across model organisms (C. elegans, D. melanogaster, S. cerevisiae, and M. musculus). The CSV was downloaded on May 9, 2026 and contained 2,202 entries.

For each entry, per-gene functional annotations were fetched from NCBI Gene via E-utilities: the official full name, functional descriptors, and Gene Ontology Molecular Function (GO MF) terms. GO Biological Process terms and RefSeq summaries were deliberately excluded because they frequently contain lifespan-related language that would leak the answer to the model.

An automated redaction pass then stripped remaining lifespan and aging language from the functional description using a forbidden-terms list (longevity, lifespan, aging, life-extension, senescence, and related terms). Four QC filters further pruned entries: no GO MF terms annotated, sparse functional content, high redaction density, and post-redaction leakage of aging language. After redaction and QC, 1,846 entries remained. A 30-entry spot check of the redaction output was performed by hand before running the eval.

The three valid longevity influence classes used in the eval are pro_longevity, anti_longevity, and unclear, totaling 1,385 entries. The remaining entries carry labels necessary_for_fitness or unannotated, which are excluded from the eval because they don't map cleanly to the three-class prediction task.

Ground truth

The ground-truth labels come from GenAge's curators. For each gene, curators synthesize evidence across multiple studies and assign a Longevity Influence label that captures the gene's normal-function role in promoting or opposing longevity.

This reconciled judgment is the eval's prediction target because it provides a single, cross-study signal. It is not the same as predicting the outcome of any particular manipulation (e.g., overexpression vs. loss-of-function in a specific tissue). The model is asked to predict the gene's normal-function effect, consistent with how GenAge defines the label.

Note that the mechanism class filter on the per-entry browse page is based on the model's predicted mechanism (from the solver output), not a curator-assigned mechanism. GenAge does not provide a per-gene mechanism field, so the predicted mechanism is what we are studying.

Validating the advisor

Because the advisor is itself a language model, its judgments could be systematically biased. To quantify this, 30 entries were randomly sampled and hand-graded by Andrew T. Rodriguez, Ph.D. Cohen's kappa was computed between the advisor's grades and the hand grades on the answer_correct field.

The measured κ is 0.79. Values above 0.7 are considered strong agreement. On the Landis & Koch (1977) scale, this falls in the “substantial agreement” range (0.61–0.80). This provides a concrete calibration point for interpreting the advisor's grading.

Directional bias and label-inversion failures

The model's predictions are not symmetric across the two main classes. Pro-longevity recall is 73%, but anti-longevity recall is only 30%, despite anti-longevity being the larger class (878 vs. 481 entries in the eligible dataset). Closer inspection of the misclassifications reveals a specific failure mode the eval's seven-class taxonomy doesn't cleanly capture. The model's reasoning text correctly describes the gene's role, but the final structured prediction inverts the label.

Approximately 45 entries (~3% of main-split results) show this pattern. Many are textbook anti-longevity genes such as C. elegans atp-2, clk-1, mrpl-1, eat-2, Drosophila chico, and mouse Ghrhr. In these cases the model's reasoning correctly states that the gene's normal activity opposes longevity (because reducing it extends lifespan), but the prediction field outputs pro_longevity. The advisor's grading reflected this ambiguity inconsistently: some inversions were classified as confident_wrong, others as right_answer_wrong_reasoning, though strictly speaking neither fits.

The pattern suggests the model is confused about the GenAge label convention (that “Anti-Longevity” describes the gene's normal-function role, not the direction of any particular manipulation) rather than the underlying biology. A future iteration of this eval should add a dedicated reasoning_contradicts_prediction failure mode and may want to test whether explicit label-convention examples in the system prompt reduce the inversion rate.

Limitations

Input is GO MF terms and functional descriptors only. The model sees a deliberately narrow slice of each gene's biology. Richer functional descriptions (including GO Biological Process or RefSeq summaries) might produce different accuracy, but those sources are excluded because they often contain lifespan-related language.
Single eval run on one model version. The primary results are from a single run of Claude Sonnet 4.6. Results may differ across runs (sampling variation at temperature 0 is minimal but non-zero across API calls) and will almost certainly differ across model versions.
Advisor is itself an LLM. The grading is automated. The 30-entry hand-grading spot check provides a calibration point, but systematic biases in the advisor may not be captured by a 30-entry sample.
Mechanism classification is fuzzy at boundaries. The 12-class hallmarks enum forces a single primary mechanism on genes with pleiotropic or context-dependent roles. Many aging genes participate in multiple hallmarks; the mechanism accuracy metric reflects this constraint.
Counterfactual blinding is incomplete. The counterfactual split replaces the gene symbol with GENE-X but preserves the rest of the functional annotation, including the gene's functional descriptor (e.g., “Insulin-like receptor subunit beta”) and its Gene Ontology Molecular Function terms. The gene symbol daf-2 and the protein symbol DAF-2 are both blinded, but the functional descriptor “Insulin-like receptor subunit beta” combined with the organism “Caenorhabditis elegans” uniquely identifies the gene. For well-characterized genes, this means blinding is only partial. The small main-vs-counterfactual accuracy gap reflects this limitation more than the model's underlying reasoning capability. A more rigorous future version of this eval would strip the functional descriptors from the redacted input as well, leaving only Gene Ontology Molecular Function terms. The eval remains useful as a test of the model's biological reasoning even where blinding is imperfect: the model still has to articulate a mechanism, identify the correct pathway, and arrive at the correct longevity influence. Readers can inspect the reasoning on each per-entry page and judge whether it reflects genuine biology or pattern-matching.
Model organisms only. The dataset covers the model organisms in GenAge. Human longevity genes are not included in this eval.

References

Hallmarks framework

López-Otín, C., Blasco, M. A., Partridge, L., Serrano, M., & Kroemer, G. (2023). Hallmarks of aging: An expanding universe. Cell, 186(2), 243–278. https://doi.org/10.1016/j.cell.2022.11.001

Aging gene dataset (GenAge / HAGR)

Tacutu, R., Thornton, D., Johnson, E., Budovsky, A., Barardo, D., Craig, T., Diana, E., Lehmann, G., Toren, D., Wang, J., Fraifeld, V. E., & de Magalhães, J. P. (2018). Human Ageing Genomic Resources: new and updated databases. Nucleic Acids Research, 46(D1), D1083–D1090. https://doi.org/10.1093/nar/gkx1042

Project: https://genomics.senescence.info/genes/

Gene annotations

Sayers, E. W., et al. (2022). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 50(D1), D20–D26. https://doi.org/10.1093/nar/gkab1112

NCBI E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/

The Gene Ontology Consortium. (2023). The Gene Ontology knowledgebase in 2023. Genetics, 224(1), iyad031. https://doi.org/10.1093/genetics/iyad031

Project: http://geneontology.org

Inter-rater agreement

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310

Related work in machine learning on aging biology

Wan, C., Freitas, A. A., & de Magalhães, J. P. (2015). Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(2), 262–275. https://doi.org/10.1109/TCBB.2014.2355218

Alsaggaf, I., Freitas, A. A., & Wan, C. (2024). Predicting the pro-longevity or anti-longevity effect of model organism genes with enhanced Gaussian noise augmentation-based contrastive learning on protein-protein interaction networks. NAR Genomics and Bioinformatics, 6(4), lqae153. https://doi.org/10.1093/nargab/lqae153

Kerepesi, C., Daróczy, B., Sturm, Á., Vellai, T., & Benczúr, A. (2018). Prediction and characterization of human ageing-related proteins by using machine learning. Scientific Reports, 8, 4094. https://doi.org/10.1038/s41598-018-22240-w

Fabris, F., de Magalhães, J. P., & Freitas, A. A. (2017). A review of supervised machine learning applied to ageing research. Biogerontology, 18(2), 171–188. https://doi.org/10.1007/s10522-017-9683-y

LLM evaluation methodology

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023 Datasets and Benchmarks Track. https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html