
    New Harvard-Developed AI System Unlocks Biology’s Source Code

By Harvard University, Department of Organismic and Evolutionary Biology, April 23, 2024
    A groundbreaking study by Yunha Hwang and team has developed gLM, an AI system that decodes the complex language of genomics from extensive microbial data. This innovation enables a deeper understanding of gene functions and regulations, leading to new discoveries in genomics. gLM exemplifies the potential of AI in advancing life sciences and tackling global challenges. Credit: SciTechDaily.com

Artificial Intelligence (AI) systems, like ChatGPT, have taken the world by storm. There isn’t much they don’t have a hand in, from recommending the next binge-worthy TV show to helping navigate through traffic. But can AI systems learn the language of life and help biologists reveal exciting breakthroughs in science?

In a new study published in Nature Communications, an interdisciplinary team of researchers led by Yunha Hwang, PhD candidate in the Department of Organismic and Evolutionary Biology (OEB) at Harvard, has pioneered an artificial intelligence (AI) system capable of deciphering the intricate language of genomics.

Genomic language is the source code of biology. It describes the biological functions and regulatory grammar encoded in genomes. The researchers asked: can we develop an AI engine that “reads” the genomic language, becomes fluent in it, and understands the meaning (that is, the functions and regulation) of genes? The team fed the microbial metagenomic dataset, the largest and most diverse genomic dataset available, to the machine to create the Genomic Language Model (gLM).

    The Challenge of Genomic Data

    “In biology, we have a dictionary of known words and researchers work within those known words. The problem is that this fraction of known words constitutes less than one percent of biological sequences,” said Hwang, “the quantity and diversity of genomic data is exploding, but humans are incapable of processing such a large amount of complex data.”

Large language models (LLMs), like GPT-4, learn the meanings of words by processing massive amounts of diverse text data, which enables them to understand the relationships between words. The genomic language model (gLM) learns from highly diverse metagenomic data, sourced from microbes inhabiting varied environments including the ocean, soil, and human gut. From this data, gLM learns to understand the functional “semantics” and regulatory “syntax” of each gene by learning the relationship between the gene and its genomic context. Like LLMs, gLM is a self-supervised model: it learns meaningful representations of genes from data alone and does not require human-assigned labels.
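As a rough illustration of this self-supervised idea (our own sketch, not the authors’ code), one can treat each gene as a “word” and each stretch of genome as a “sentence”: mask a gene and predict it from its neighbors. gLM itself is a far more sophisticated neural model, but a toy co-occurrence counter conveys the principle that context alone, with no human labels, carries information about identity:

```python
# Toy sketch of self-supervised masked-gene prediction (hypothetical gene IDs).
# Count which gene appears between each pair of neighbors, then "predict" a
# masked gene from its context alone -- no human-assigned labels involved.
from collections import Counter, defaultdict

# Hypothetical contigs: ordered lists of gene identifiers.
contigs = [
    ["geneA", "geneB", "geneC", "geneD"],
    ["geneA", "geneB", "geneC", "geneE"],
    ["geneX", "geneB", "geneC", "geneD"],
]

# For every position, record the gene observed between its two neighbors.
context_counts = defaultdict(Counter)
for contig in contigs:
    for i, gene in enumerate(contig):
        left = contig[i - 1] if i > 0 else "<start>"
        right = contig[i + 1] if i < len(contig) - 1 else "<end>"
        context_counts[(left, right)][gene] += 1

def predict_masked(left, right):
    """Guess the most likely gene between two neighboring genes."""
    counts = context_counts.get((left, right))
    return counts.most_common(1)[0][0] if counts else None

# Mask the gene between geneA and geneC; context recovers geneB.
print(predict_masked("geneA", "geneC"))  # -> geneB
```

A real model replaces these counts with learned continuous representations, so it can generalize to gene contexts it has never seen verbatim.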

    Unveiling the Unknown in Genomics

Researchers have sequenced some of the most commonly studied organisms, such as humans, E. coli, and fruit flies. However, even for the most studied genomes, the majority of genes remain poorly characterized. “We’ve learned so much in this revolutionary age of ‘omics’, including how much we don’t know,” said senior author Professor Peter Girguis, also in OEB at Harvard. “We asked, how can we glean meaning from something without relying on a proverbial dictionary? How do we better understand the content and context of a genome?”

    The study demonstrates that gLM learns enzymatic functions and co-regulated gene modules (called operons), and provides genomic context that can predict gene function. The model also learns taxonomic information and context-dependencies of gene functions. Strikingly, gLM does not know which enzyme it is seeing, nor what bacteria the sequence comes from. However, because it has seen many sequences and understands the evolutionary relationships between the sequences during training, it is able to derive the functional and evolutionary relationships between sequences.
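One generic way such learned representations can be put to work (a minimal sketch under our own assumptions, not the paper’s actual pipeline, with made-up gene names and embeddings) is annotation transfer: an uncharacterized gene inherits the function of the most similar annotated gene in embedding space.

```python
# Hypothetical annotation transfer via embedding similarity.
# In practice the vectors would come from a trained model such as gLM.
import math

embeddings = {
    "known_hydrolase":   [0.90, 0.10, 0.00],
    "known_transporter": [0.00, 0.20, 0.95],
    "unknown_gene":      [0.85, 0.15, 0.05],
}
annotations = {
    "known_hydrolase": "hydrolase",
    "known_transporter": "transporter",
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def transfer_annotation(query):
    """Label a query gene with the annotation of its nearest known gene."""
    best = max(annotations, key=lambda g: cosine(embeddings[query], embeddings[g]))
    return annotations[best]

print(transfer_annotation("unknown_gene"))  # -> hydrolase
```

The same nearest-neighbor idea extends naturally to the context-dependent case: comparing a gene’s embedding in different genomic neighborhoods can reveal when the “same” gene is playing different roles.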

    The Potential of gLM in Biology

“Like words, genes can have different ‘meanings’ depending on the context they are found in. Conversely, highly differentiated genes can be ‘synonymous’ in function. gLM allows for a much more nuanced framework for understanding gene function. This is in contrast to the existing method of one-to-one mapping from sequence to annotation, which is not representative of the dynamic and context-dependent nature of the genomic language,” said Hwang.

Hwang teamed with co-authors Andre Cornman (an independent researcher in machine learning and biology), Sergey Ovchinnikov (former John Harvard Distinguished Fellow and current Assistant Professor at MIT), and Elizabeth Kellogg (Associate Faculty at St. Jude Children’s Research Hospital) to form an interdisciplinary team with strong backgrounds in microbiology, genomics, bioinformatics, protein science, and machine learning.

“In the lab, we are stuck in a step-by-step process of finding a gene, making a protein, purifying it, characterizing it, etc. and so we kind of discover only what we already know,” Girguis said. gLM, however, lets biologists examine an unknown gene in its genomic context, inferring its role from the groups of genes it is commonly found alongside. The model can tell researchers that these groups of genes work together to achieve something, and it can provide answers that do not appear in the “dictionary.”

    “Genomic context contains critical information for understanding the evolutionary history and evolutionary trajectories of different proteins and genes,” Hwang said. “Ultimately, gLM learns this contextual information to help researchers understand the functions of genes that previously were unannotated.”

    “Traditional functional annotation methods typically focus on one protein at a time, ignoring the interactions across proteins. gLM represents a major advancement by integrating the concept of gene neighborhoods with language models, thereby providing a more comprehensive view of protein interactions,” stated Martin Steinegger (Assistant Professor, Seoul National University), an expert in bioinformatics and machine learning, who was not involved in the study.

    With genomic language modeling, biologists can discover new genomic patterns and uncover novel biology. gLM is a significant milestone in interdisciplinary collaboration driving advancements in the life sciences.

    “With gLM we can gain new insights into poorly annotated genomes,” said Hwang. “gLM can also guide experimental validation of functions and enable discoveries of novel functions and biological mechanisms. We hope gLM can accelerate the discovery of novel biotechnological solutions for climate change and bioeconomy.”

    Reference: “Genomic language model predicts protein co-regulation and function” by Yunha Hwang, Andre L. Cornman, Elizabeth H. Kellogg, Sergey Ovchinnikov and Peter R. Girguis, 3 April 2024, Nature Communications.
    DOI: 10.1038/s41467-024-46947-9
