← Dashboard

Methodology

How the eval was designed, run, and validated.

The contamination problem

Large language models are trained on vast amounts of text from the internet, including scientific databases, review articles, and curated resources. When a model is asked about a well-characterized gene like daf-2, it may answer correctly not because it is reasoning from first principles, but because it encountered the answer during training, a form of memorization rather than reasoning. This is the contamination problem: the test data may have leaked into the training data, inflating apparent capability.

In classical machine learning, contamination is addressed by holding out a test set the model has never seen. For a biology eval built on public databases, perfect isolation is impossible. A different approach is needed: we can measure how much performance depends on the gene symbol by running the same eval twice: once with the real symbol visible, and once with it blinded.

The two splits

Every entry is evaluated in two conditions, called splits:

The accuracy gap between the two splits (main accuracy minus counterfactual accuracy) is the contamination gap. A large gap indicates that the model relies heavily on recognizing gene names. A small gap suggests the model can reason from functional descriptions alone, whether or not the symbol is familiar.

The same entry in both splits

Main

Gene: daf-2
Organism: Caenorhabditis elegans
Known functions: insulin-like receptor ...

Counterfactual

Gene: GENE-X
Organism: Caenorhabditis elegans
Known functions: insulin-like receptor ...

Solver and advisor

Each entry is processed by two model calls in sequence.

The solver (Claude Sonnet 4.6, temperature 0) receives the redacted entry and uses a forced tool call to submit a structured prediction: the gene's longevity influence (pro_longevity, anti_longevity, or unclear), a confidence score, the mechanism class, a reasoning paragraph, and up to three key pathways.

The advisor (also Claude Sonnet 4.6, temperature 0) receives the original entry, the GenAge ground-truth label, and the solver's full output. It grades the prediction across four dimensions: answer correctness, mechanism accuracy, reasoning quality (1–5), and failure mode. Using a second model as the grader avoids manual annotation at scale while maintaining structured, auditable output.

EntrySolverPredictionAdvisorGradeGround truth(GenAge)

Both calls use forced tool use (not the structured-outputs API header) to guarantee JSON that matches the defined schema. The prompts and tool schemas are version-hashed and stored on each runs row, so any run can be reproduced precisely.

Mechanism classes: the hallmarks of aging

The solver is asked to assign each gene to a mechanism class, the aging pathway most relevant to that gene's molecular function. The enum is drawn from the López-Otín 2023 framework, which identifies 12 hallmarks of aging plus other (for mechanisms outside the framework) and unclear (when the model cannot confidently classify). This controlled vocabulary makes mechanism predictions comparable across genes and runs. (López-Otín et al. 2023)

Genomic instability

Accumulation of DNA damage and mutations over time, including double-strand breaks, base modifications, and chromosomal rearrangements that erode genome integrity.

Telomere attrition

Progressive shortening of the protective DNA-protein caps at chromosome ends, eventually triggering replicative arrest or apoptosis when telomeres become critically short.

Epigenetic alterations

Age-related drift in DNA methylation patterns, histone modifications, and chromatin organization that reshapes gene expression independently of changes to the underlying DNA sequence.

Loss of proteostasis

Decline in the cellular machinery that folds, refolds, and degrades proteins, leading to accumulation of misfolded and aggregated species.

Disabled macroautophagy

Reduced capacity for the bulk lysosomal degradation pathway that clears damaged organelles and protein aggregates, allowing cellular waste to accumulate.

Deregulated nutrient sensing

Dysfunction of conserved pathways (insulin/IGF-1, mTOR, AMPK, sirtuins) that detect nutrient availability and coordinate cell growth, metabolism, and stress resistance.

Mitochondrial dysfunction

Decline in mitochondrial efficiency that lowers ATP output, raises reactive oxygen species (ROS), and disrupts the metabolic and signaling roles of these organelles.

Cellular senescence

Accumulation of cells that have permanently exited the cell cycle but persist in tissues, often secreting inflammatory factors that affect surrounding cells.

Stem cell exhaustion

Decline in the number and regenerative capacity of tissue-resident stem cells, undermining tissue maintenance and repair across the lifespan.

Altered intercellular communication

Disruption of signaling between cells, tissues, and organs (including neuroendocrine, paracrine, and immune signals) that coordinates organism-wide homeostasis.

Chronic inflammation

Persistent low-grade systemic inflammation ("inflammaging") that accompanies older age and contributes to many age-related diseases.

Dysbiosis

Disruption of the composition and function of resident microbial communities, particularly in the gut, with downstream effects on metabolism, immunity, and aging.

Other

Mechanisms outside the López-Otín 2023 hallmarks framework.

Unclear

The model couldn't confidently assign a primary mechanism.

Data pipeline

The eval dataset starts from the GenAge model organisms database, which catalogs genes with known effects on lifespan across model organisms (C. elegans, D. melanogaster, S. cerevisiae, and M. musculus). The CSV was downloaded on May 9, 2026 and contained 2,202 entries.

For each entry, per-gene functional annotations were fetched from NCBI Gene via E-utilities: the official full name, functional descriptors, and Gene Ontology Molecular Function (GO MF) terms. GO Biological Process terms and RefSeq summaries were deliberately excluded because they frequently contain lifespan-related language that would leak the answer to the model.

An automated redaction pass then stripped remaining lifespan and aging language from the functional description using a forbidden-terms list (longevity, lifespan, aging, life-extension, senescence, and related terms). Four QC filters further pruned entries: no GO MF terms annotated, sparse functional content, high redaction density, and post-redaction leakage of aging language. After redaction and QC, 1,846 entries remained. A 30-entry spot check of the redaction output was performed by hand before running the eval.

The three valid longevity influence classes used in the eval are pro_longevity, anti_longevity, and unclear, totaling 1,385 entries. The remaining entries carry labels necessary_for_fitness or unannotated, which are excluded from the eval because they don't map cleanly to the three-class prediction task.

Ground truth

The ground-truth labels come from GenAge's curators. For each gene, curators synthesize evidence across multiple studies and assign a Longevity Influence label that captures the gene's normal-function role in promoting or opposing longevity.

This reconciled judgment is the eval's prediction target because it provides a single, cross-study signal. It is not the same as predicting the outcome of any particular manipulation (e.g., overexpression vs. loss-of-function in a specific tissue). The model is asked to predict the gene's normal-function effect, consistent with how GenAge defines the label.

Note that the mechanism class filter on the per-entry browse page is based on the model's predicted mechanism (from the solver output), not a curator-assigned mechanism. GenAge does not provide a per-gene mechanism field, so the predicted mechanism is what we are studying.

Validating the advisor

Because the advisor is itself a language model, its judgments could be systematically biased. To quantify this, 30 entries were randomly sampled and hand-graded by Andrew T. Rodriguez, Ph.D. Cohen's kappa was computed between the advisor's grades and the hand grades on the answer_correct field.

The measured κ is 0.79. Values above 0.7 are considered strong agreement. On the Landis & Koch (1977) scale, this falls in the “substantial agreement” range (0.61–0.80). This provides a concrete calibration point for interpreting the advisor's grading.

Directional bias and label-inversion failures

The model shows a strong directional bias: pro-longevity recall is 73% but anti-longevity recall is only 30%, despite anti-longevity being the larger class (878 vs. 481 entries). Closer inspection of the misclassifications reveals a specific failure mode the eval's seven-class taxonomy doesn't cleanly capture. The model's reasoning text correctly describes the gene's role, but the final structured prediction inverts the label.

Approximately 45 entries (~3% of main-split results) show this pattern. Many are textbook anti-longevity genes such as C. elegans atp-2, clk-1, mrpl-1, eat-2, Drosophila chico, and mouse Ghrhr. In these cases the model's reasoning correctly states that the gene's normal activity opposes longevity (because reducing it extends lifespan), but the prediction field outputs pro_longevity. The advisor's grading reflected this ambiguity inconsistently: some inversions were classified as confident_wrong, others as right_answer_wrong_reasoning, though strictly speaking neither fits.

The pattern suggests the model is confused about the GenAge label convention (that “Anti-Longevity” describes the gene's normal-function role, not the direction of any particular manipulation) rather than the underlying biology. A future iteration of this eval should add a dedicated reasoning_contradicts_prediction failure mode and may want to test whether explicit label-convention examples in the system prompt reduce the inversion rate.

Limitations

References

Hallmarks framework

López-Otín, C., Blasco, M. A., Partridge, L., Serrano, M., & Kroemer, G. (2023). Hallmarks of aging: An expanding universe. Cell, 186(2), 243–278. https://doi.org/10.1016/j.cell.2022.11.001

Aging gene dataset (GenAge / HAGR)

Tacutu, R., Thornton, D., Johnson, E., Budovsky, A., Barardo, D., Craig, T., Diana, E., Lehmann, G., Toren, D., Wang, J., Fraifeld, V. E., & de Magalhães, J. P. (2018). Human Ageing Genomic Resources: new and updated databases. Nucleic Acids Research, 46(D1), D1083–D1090. https://doi.org/10.1093/nar/gkx1042

Project: https://genomics.senescence.info/genes/

Gene annotations

Sayers, E. W., et al. (2022). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 50(D1), D20–D26. https://doi.org/10.1093/nar/gkab1112

NCBI E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/

The Gene Ontology Consortium. (2023). The Gene Ontology knowledgebase in 2023. Genetics, 224(1), iyad031. https://doi.org/10.1093/genetics/iyad031

Project: http://geneontology.org

Inter-rater agreement

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.