Marking a major milestone for the biomolecular sciences, a team of researchers from UC Berkeley, Arc Institute, UCSF, Stanford University and NVIDIA has developed a machine learning model trained on the DNA of over 100,000 species across the entire tree of life. The model, called Evo 2, can identify patterns in gene sequences across disparate organisms that experimental researchers would typically need years to uncover. In addition to identifying disease-causing mutations in human genes, Evo 2 can design new genomes that are as long as the genomes of simple bacteria.
Similar in scale to the most powerful generative AI large language models, Evo 2 is the largest AI model in biology to date. Building on its predecessor Evo 1, which was trained entirely on the genomes of single-celled organisms, Evo 2 was trained on over 9.3 trillion nucleotides (the building blocks that make up DNA or RNA) from over 128,000 whole genomes as well as metagenomic data. In addition to an expanded collection of bacterial, archaeal and phage genomes, Evo 2's training data includes genomes from humans, plants and other single-celled and multicellular species in the eukaryotic domain of life.
“Our development of Evo 1 and Evo 2 represents a key moment in the emerging field of generative biology, as the models have enabled machines to read, write and think in the language of nucleotides,” said Patrick Hsu, UC Berkeley assistant professor of bioengineering, Arc Institute co-founder and co-senior author. “Evo 2 has a generalist understanding of the tree of life that’s useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life. We’re excited to see what the research community builds on top of these foundation models.”
The announcement and preprint are available on the Arc Institute’s website.