
Clues to the genetic code’s origin may be hidden in tiny protein fragments, revealing a synchronized and highly structured path to life’s earliest molecular systems.
Genes act as the instruction manual for life, encoding the information that allows cells to build, repair, and reproduce. But scientists still struggle to explain how this system first emerged.
A new study from the University of Illinois Urbana-Champaign takes a different approach, suggesting that the answer may be hidden not in DNA itself, but in the simplest building blocks of proteins.
“We find the origin of the genetic code mysteriously linked to the dipeptide composition of a proteome, the collective of proteins in an organism,” said corresponding author Gustavo Caetano-Anollés, professor in the Department of Crop Sciences, the Carl R. Woese Institute for Genomic Biology, and Biomedical and Translational Sciences of Carle Illinois College of Medicine at U. of I.
Tracing Evolution Through Molecular History
Caetano-Anollés specializes in phylogenomics, the study of how genomes evolve and relate to one another. His team previously created evolutionary maps of protein domains (structural units in proteins) and transfer RNA (tRNA), which carries amino acids to ribosomes during protein production.
In this study, the researchers focused on dipeptides, simple units made of two amino acids linked by a peptide bond. Their analysis showed that the evolutionary patterns of protein domains, tRNA, and dipeptides closely align, suggesting a shared history.
Life on Earth began about 3.8 billion years ago, but the genetic code likely appeared roughly 800 million years later. Scientists still debate how this transition occurred. Some propose that RNA-based enzymes came first, while others argue that proteins initially drove early biological activity.
Caetano-Anollés and his colleagues support the protein-first perspective. Their earlier work indicates that interactions involving ribosomal proteins and tRNA developed later, not at the very beginning of life.
Two Interconnected Biological Codes
According to Caetano-Anollés, life depends on two tightly linked systems. The genetic code stores information in nucleic acids (DNA and RNA), while the protein code determines how molecules carry out cellular functions. The ribosome connects these systems by assembling proteins from amino acids delivered by tRNA.
Aminoacyl-tRNA synthetases are the enzymes responsible for attaching the correct amino acids to tRNA molecules. These enzymes help maintain accuracy during protein production and play a central role in preserving the integrity of the genetic code.
“Why does life rely on two languages – one for genes and one for proteins?” Caetano-Anollés asked. “We still don’t know why this dual system exists or what drives the connection between the two. The drivers couldn’t be in RNA, which is functionally clumsy. Proteins, on the other hand, are experts in operating the sophisticated molecular machinery of the cell.”
Dipeptides and the Earliest Protein Structures
The researchers suggest that the proteome may hold important clues about the earliest stages of genetic code development. Dipeptides appear to have been especially important as early structural building blocks of proteins. There are 400 possible dipeptide combinations, and their frequency varies across organisms.
To investigate this, the team analyzed 4.3 billion dipeptide sequences from 1,561 proteomes spanning the three major domains of life: Archaea, Bacteria, and Eukarya. They used these data to build an evolutionary timeline of dipeptides and compared it with patterns seen in protein structural domains.
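To make the counting step concrete, here is a minimal Python sketch of how dipeptide frequencies could be tallied from protein sequences. It is an illustration only, not the study's pipeline: the toy sequences, the function name `count_dipeptides`, and the choice to count overlapping two-residue windows while skipping non-standard residue codes are all assumptions.

```python
from collections import Counter
from itertools import product

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every ordered pair of two amino acids: 20 * 20 = 400 possible dipeptides.
ALL_DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def count_dipeptides(sequences):
    """Count overlapping dipeptides across an iterable of protein sequences.

    Assumption: windows containing a non-standard code (e.g. 'X') are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair[0] in AMINO_ACIDS and pair[1] in AMINO_ACIDS:
                counts[pair] += 1
    return counts


# Hypothetical toy "proteome" of two short sequences, for illustration only.
toy_proteome = ["MALLA", "ALAX"]
counts = count_dipeptides(toy_proteome)

print(len(ALL_DIPEPTIDES))
for dipeptide, n in sorted(counts.items()):
    print(dipeptide, n)
```

Dividing each count by the total number of dipeptides in a proteome would give the kind of per-organism frequency profile that can then be compared across organisms.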
Previous research from the group had also mapped the evolution of tRNA, providing a timeline for when amino acids became part of the genetic code. These amino acids were grouped based on when they appeared. Group 1 included early amino acids such as tyrosine, serine, and leucine, while Group 2 added eight more.
These early groups were linked to the emergence of error-correcting mechanisms in synthetase enzymes and to an early operational code that ensured each codon matched a specific amino acid. Group 3 consisted of amino acids that appeared later and contributed to more advanced functions in the modern genetic code.
Converging Evidence Across Biological Systems
The researchers had already shown that synthetases and tRNA evolved together as amino acids were incorporated into the genetic code. By adding dipeptides to the analysis, they tested whether this pattern held across another level of biological organization.
“We found the results were congruent,” Caetano-Anollés explained. “Congruence is a key concept in phylogenetic analysis. It means that a statement of evolution obtained with one type of data is confirmed by another. In this case, we examined three sources of information: protein domains, tRNAs, and dipeptide sequences. All three reveal the same progression of amino acids being added to the genetic code in a specific order.”
The study also identified a striking symmetry in dipeptide pairs. Each dipeptide consists of two amino acids, such as alanine-leucine (AL), while its counterpart, called an anti-dipeptide, reverses the order to leucine-alanine (LA). These pairs act as complementary, mirror-like structures.
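The pairing described here can be sketched in a few lines of Python. This is a simple illustration under the assumption that an anti-dipeptide is just the same two residues in reverse order. The function names are hypothetical and do not come from the paper.

```python
from itertools import product

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def anti_dipeptide(dipeptide):
    """Return the dipeptide with its two residues in reverse order (AL -> LA)."""
    return dipeptide[::-1]


def dipeptide_pairs():
    """Group the 400 dipeptides into dipeptide/anti-dipeptide pairs.

    Dipeptides made of two identical residues (e.g. AA) are their own
    reverse, so they are returned separately.
    """
    pairs = set()
    self_paired = []
    for first, second in product(AMINO_ACIDS, repeat=2):
        dipeptide = first + second
        if first == second:
            self_paired.append(dipeptide)
        else:
            # Sort so that (AL, LA) and (LA, AL) collapse to one entry.
            pairs.add(tuple(sorted((dipeptide, anti_dipeptide(dipeptide)))))
    return sorted(pairs), self_paired


pairs, self_paired = dipeptide_pairs()
print(len(pairs), len(self_paired))
```

Under this definition, the 400 dipeptides fall into 190 distinct dipeptide/anti-dipeptide pairs plus 20 dipeptides that read the same in both directions.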
“We found something remarkable in the phylogenetic tree,” Caetano-Anollés said. “Most dipeptide and anti-dipeptide pairs appeared very close to each other on the evolutionary timeline. This synchronicity was unanticipated. The duality reveals something fundamental about the genetic code with potentially transformative implications for biology. It suggests dipeptides were arising encoded in complementary strands of nucleic acid genomes, likely minimalistic tRNAs that interacted with primordial synthetase enzymes.”
Implications for Modern Science
Understanding how the genetic code evolved provides insight into the origins of life and supports advances in fields such as synthetic biology, biomedical research, and genetic engineering.
“Synthetic biology is recognizing the value of an evolutionary perspective. It strengthens genetic engineering by letting nature guide the design. Understanding the antiquity of biological components and processes is important because it highlights their resilience and resistance to change. To make meaningful modifications, it is essential to understand the constraints and underlying logic of the genetic code,” Caetano-Anollés said.
Reference: “Tracing the Origin of the Genetic Code and Thermostability to Dipeptide Sequences in Proteomes” by Minglei Wang, M. Fayez Aziz and Gustavo Caetano-Anollés, 14 August 2025, Journal of Molecular Biology.
DOI: 10.1016/j.jmb.2025.169396
The study was supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791), the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625), and Blue Waters supercomputer allocations from the National Center for Supercomputing Applications to Caetano-Anollés.
2 Comments
thanks
The paper is an eager sequence-pattern search that substitutes for genetic phylogenies. It also uses late-evolving eukaryotes as the dominant data source, contributing ~3/4 of the patterns (dipeptides).
Since protein domains are better conserved than sequences over deep time, which essentially corrupts the sequence patterns, using domains has become the more widely used and phylogenetically better supported method. Speaking of congruences, a paper and a preprint that explore the phylogeny of the evolution of the genetic code [S. Wehbi, A. Wheeler, B. Morel, N. Manepalli, B.Q. Minh, D.S. Lauretta, & J. Masel, Order of amino acid recruitment into the genetic code resolved by last universal common ancestor’s protein domains, Proc. Natl. Acad. Sci. U.S.A. 121 (52) e2410311121] and of core metabolism [Gradual assembly of metabolism at a phosphorylating hydrothermal vent, Mrnjavac et al., q-bio arXiv:2510.08410] show a 2 sigma congruence, well above the usual 80% threshold. They agree on 7 of the first 10 of the 20 standard amino acid codons, which, assuming a binomial distribution, is a 95% likelihood, a significant congruence. Notably, the genetic code evolution paper shows an early, quasispecies-like period of rampant horizontal gene transfer before the robust code was established, which is a positive sign that it is correct. The paper also shows other phylogenetic signals that are congruent with the metabolic paper: small molecules being recruited first, before the cell membrane's import/export properties could be compositionally controlled (by way of incorporating the carboxylation cofactor biotin), and metal-dependent catalysis appearing before cofactor evolution was complete.
The new paper agrees with the code evolution paper on a mere 5 of the first 11 amino acids, or a 50% likelihood, a random result (as I would expect).