A natural language model has jumpstarted the process of protein design by creating active enzymes.
Researchers have developed an AI system that can generate artificial enzymes from scratch. In laboratory experiments, some of these enzymes demonstrated efficacy comparable to natural enzymes, even when their artificially created amino acid sequences greatly deviated from any known natural protein.
The experiment shows that natural language processing, although it was developed to read and write language text, can learn at least some of the underlying principles of biology. The AI program, known as ProGen, was developed by Salesforce Research and uses next-token prediction to assemble amino acid sequences into artificial proteins.
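To make "next-token prediction" concrete, here is a minimal sketch of the idea in Python. It is not ProGen, which is a large neural language model. It is a toy bigram model that counts which residue tends to follow which, then generates a sequence one amino acid at a time. The training strings and all function names are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy training set; these short strings are NOT real lysozyme sequences.
TRAINING_SEQUENCES = ["MKALIV", "MKVLAV", "MRALIA"]

START, END = "^", "$"  # assumed marker tokens for sequence start and end


def fit_bigram_counts(sequences):
    """Count how often each token follows the previous token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = [START, *seq, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_sequence(counts, rng, max_len=50):
    """Generate one sequence by repeatedly sampling the next token."""
    residues = []
    prev = START
    while len(residues) < max_len:
        options = counts[prev]
        tokens = list(options)
        weights = [options[t] for t in tokens]
        nxt = rng.choices(tokens, weights=weights, k=1)[0]
        if nxt == END:
            break
        residues.append(nxt)
        prev = nxt
    return "".join(residues)


if __name__ == "__main__":
    model = fit_bigram_counts(TRAINING_SEQUENCES)
    print(sample_sequence(model, random.Random(0)))
```

A real protein language model conditions each prediction on the whole preceding sequence rather than on a single residue, but the generation loop has the same shape: predict a distribution over the next token, sample from it, append, and repeat.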
Scientists said the new technology could become more powerful than directed evolution, the Nobel-prize-winning protein design technology, and it will energize the 50-year-old field of protein engineering by speeding the development of new proteins that can be used for almost anything from therapeutics to degrading plastic.
“The artificial designs perform much better than designs that were inspired by the evolutionary process,” said James Fraser, Ph.D., professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy, and an author of the work, which was recently published in Nature Biotechnology. A previous version of the paper has been available on the preprint server bioRxiv since July 2021, where it garnered several dozen citations before being published in a peer-reviewed journal.
“The language model is learning aspects of evolution, but it’s different than the normal evolutionary process,” Fraser said. “We now have the ability to tune the generation of these properties for specific effects. For example, an enzyme that’s incredibly thermostable or likes acidic environments or won’t interact with other proteins.”
To create the model, scientists simply fed the amino acid sequences of 280 million different proteins of all kinds into the machine learning model and let it digest the information for a couple of weeks. Then, they fine-tuned the model by priming it with 56,000 sequences from five lysozyme families, along with some contextual information about these proteins.
The model quickly generated a million sequences, and the research team selected 100 to test, based on how closely they resembled the sequences of natural proteins, as well as how naturalistic the AI proteins’ underlying amino acid “grammar” and “semantics” were.
Out of this first batch of 100 proteins, which were screened in vitro by Tierra Biosciences, the team made five artificial proteins to test in cells and compared their activity to an enzyme found in the whites of chicken eggs, known as hen egg white lysozyme (HEWL). Similar lysozymes are found in human tears, saliva, and milk, where they defend against bacteria and fungi.
Two of the artificial enzymes were able to break down the cell walls of bacteria with activity comparable to HEWL, yet their sequences were only about 18% identical to one another. The two sequences were about 90% and 70% identical to any known protein.
Just one mutation in a natural protein can make it stop working, but in a different round of screening, the team found that the AI-generated enzymes showed activity even when as little as 31.4% of their sequence resembled any known natural protein.
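The identity percentages quoted here compare an artificial sequence with natural ones, position by position. The sketch below shows one common way such a number can be computed, assuming the two sequences have already been aligned to the same length with "-" marking gaps. The function names are hypothetical, the example strings are made up, and the choice of denominator (every column in which at least one sequence has a residue) is only one of several conventions. The paper's own pipeline may differ.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared alignment columns where both residues match.

    Assumes the inputs are pre-aligned (equal length, '-' for gaps).
    Columns that are gaps in both sequences are skipped; a column with
    a gap in only one sequence counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def max_identity(query: str, references: list[str]) -> float:
    """Highest identity between a query and any reference sequence."""
    return max(percent_identity(query, ref) for ref in references)


if __name__ == "__main__":
    # Made-up four-residue strings, used only to exercise the functions.
    print(percent_identity("MKAL", "MKVL"))        # 3 of 4 columns match
    print(max_identity("MKAL", ["MRVL", "MKVL"]))  # best match among references
```

In practice the alignment step matters as much as the counting step, so figures computed with different aligners or denominators are not directly comparable.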
The AI was even able to learn how the enzymes should be shaped, simply from studying the raw sequence data. Measured with X-ray crystallography, the atomic structures of the artificial proteins looked just as they should, although the sequences were like nothing seen before.
Salesforce Research developed ProGen in 2020, based on a kind of natural language processing their researchers originally developed to generate English language text.
They knew from their previous work that the AI system could teach itself grammar and the meaning of words, along with other underlying rules that make writing well-composed.
“When you train sequence-based models with lots of data, they are really powerful in learning structure and rules,” said Nikhil Naik, Ph.D., Director of AI Research at Salesforce Research, and the senior author of the paper. “They learn what words can co-occur, and also compositionality.”
With proteins, the design choices were almost limitless. Lysozymes are small as proteins go, with up to about 300 amino acids. But with 20 possible amino acids at each position, there are an enormous number (20^300) of possible combinations. That’s greater than taking all the humans who lived throughout time, multiplied by the number of grains of sand on Earth, multiplied by the number of atoms in the universe.
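The size of that design space is easy to check: 20 choices at each of 300 positions gives 20 raised to the power of 300. The short Python sketch below computes the number exactly and compares it with the product described above. The three comparison figures are rough order-of-magnitude assumptions supplied here for illustration, not values from the paper.

```python
import math

NUM_AMINO_ACIDS = 20
SEQUENCE_LENGTH = 300  # upper end of lysozyme length cited in the article

# Python integers are arbitrary precision, so this is exact.
design_space = NUM_AMINO_ACIDS ** SEQUENCE_LENGTH
num_digits = len(str(design_space))
log10_value = SEQUENCE_LENGTH * math.log10(NUM_AMINO_ACIDS)

# Rough, assumed order-of-magnitude estimates (not from the paper).
HUMANS_EVER_LIVED = 10 ** 11
SAND_GRAINS_ON_EARTH = 10 ** 19
ATOMS_IN_UNIVERSE = 10 ** 80
comparison = HUMANS_EVER_LIVED * SAND_GRAINS_ON_EARTH * ATOMS_IN_UNIVERSE

if __name__ == "__main__":
    print(f"20^300 has {num_digits} digits (about 10^{log10_value:.1f})")
    print("Larger than the comparison product:", design_space > comparison)
```

Under these assumptions the comparison product is about 10^110, while 20^300 is about 10^390, so the article's claim holds with a very wide margin.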
Given the limitless possibilities, it’s remarkable that the model can so easily generate working enzymes.
“The capability to generate functional proteins from scratch out-of-the-box demonstrates we are entering into a new era of protein design,” said Ali Madani, Ph.D., founder of Profluent Bio, a former research scientist at Salesforce Research, and the paper’s first author. “This is a versatile new tool available to protein engineers, and we’re looking forward to seeing the therapeutic applications.”
Reference: “Large language models generate functional protein sequences across diverse families” by Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos Jr., Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser and Nikhil Naik, 26 January 2023, Nature Biotechnology.
Please see the paper for a complete author and funding list. A comprehensive codebase for the methods described in the paper is publicly available at https://github.com/salesforce/progen.