AI Catalyzes Gene Activation Research and Uncovers Rare DNA Sequences
Artificial intelligence has exploded across our news feeds, with ChatGPT and related AI technologies becoming the focus of broad public scrutiny. Beyond popular chatbots, biologists are finding ways to leverage AI to probe the core functions of our genes.
Previously, University of California San Diego researchers who investigate DNA sequences that switch genes on used artificial intelligence to identify an enigmatic puzzle piece tied to gene activation, a fundamental process involved in growth, development and disease. Using machine learning, a type of artificial intelligence, School of Biological Sciences Professor James T. Kadonaga and his colleagues discovered the downstream core promoter region (DPR), a “gateway” DNA activation code that’s involved in the operation of up to a third of our genes.
Building on this discovery, Kadonaga and researchers Long Vo ngoc and Torrey E. Rhyne have now used machine learning to identify “synthetic extreme” DNA sequences with specifically designed functions in gene activation. In work published in the journal Genes & Development, the researchers used machine learning models to evaluate millions of different DNA sequences, comparing the DPR gene activation element in humans and fruit flies (Drosophila). By using AI, they were able to find rare, custom-tailored DPR sequences that are active in humans but not in fruit flies, and vice versa. More generally, this approach could now be used to identify synthetic DNA sequences with activities that could be useful in biotechnology and medicine.
“In the future, this strategy could be used to identify synthetic extreme DNA sequences with practical and useful applications. Instead of comparing humans (condition X) versus fruit flies (condition Y) we could test the ability of drug A (condition X) but not drug B (condition Y) to activate a gene,” said Kadonaga, a distinguished professor in the Department of Molecular Biology. “This method could also be used to find custom-tailored DNA sequences that activate a gene in tissue 1 (condition X) but not in tissue 2 (condition Y). There are countless practical applications of this AI-based approach. The synthetic extreme DNA sequences might be very rare, perhaps one-in-a-million—if they exist they could be found by using AI.”
Machine learning is a branch of AI in which computer systems continually improve and learn based on data and experience. In the new research, Kadonaga, Vo ngoc (a former UC San Diego postdoctoral researcher now at Velia Therapeutics) and Rhyne (a staff research associate) used a method known as support vector regression to “train” machine learning models with 200,000 established DNA sequences based on data from real-world laboratory experiments. These sequences served as the training examples for the machine learning system. The researchers then “fed” 50 million test DNA sequences into the machine learning systems for humans and fruit flies and asked them to compare the sequences and identify unique sequences within the two enormous data sets.
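For readers curious what this kind of training and prediction looks like in practice, here is a minimal sketch using the support vector regression class in scikit-learn. It is not the researchers' code: the one-hot encoding, the linear kernel, the 12-base sequence length, the small sample sizes and the toy "activity" score (the fraction of G and C bases) are all assumptions made purely for illustration, standing in for the real laboratory measurements.

```python
import numpy as np
from sklearn.svm import SVR

BASES = "ACGT"
SEQ_LEN = 12  # hypothetical length, chosen only to keep the example small


def one_hot(seq):
    """Encode a DNA string as a flat vector with four 0/1 entries per base."""
    vec = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        vec[i, BASES.index(base)] = 1.0
    return vec.ravel()


def random_seqs(rng, n, length):
    """Generate n random DNA strings of the given length."""
    return ["".join(rng.choice(list(BASES), size=length)) for _ in range(n)]


def toy_activity(seq):
    """Placeholder for a lab-measured activity: the fraction of G/C bases."""
    return sum(b in "GC" for b in seq) / len(seq)


rng = np.random.default_rng(0)

# "Training" step: sequences paired with their (toy) measured activities.
train = random_seqs(rng, 300, SEQ_LEN)
X_train = np.array([one_hot(s) for s in train])
y_train = np.array([toy_activity(s) for s in train])

# Kernel and hyperparameters are illustrative assumptions.
model = SVR(kernel="linear", C=1.0, epsilon=0.05)
model.fit(X_train, y_train)

# "Feeding" step: predict activities for sequences the model has never seen.
candidates = random_seqs(rng, 2000, SEQ_LEN)
predicted = model.predict(np.array([one_hot(s) for s in candidates]))
```

In the study, one such model would be trained per species, so that every candidate sequence receives both a human and a fruit fly prediction.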
While the machine learning systems showed that human and fruit fly sequences largely overlapped, the researchers focused on the core question of whether the AI models could identify rare instances where gene activation is highly active in humans but not in fruit flies. The answer was a resounding “yes.” The machine learning models succeeded in identifying human-specific (and fruit fly-specific) DNA sequences. Importantly, the AI-predicted functions of the extreme sequences were verified in Kadonaga’s laboratory by using conventional (wet lab) testing methods.
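The screening idea described above, keeping only sequences predicted to be highly active under condition X and inactive under condition Y, can be sketched in a few lines of Python. Everything here is hypothetical: the two motifs, the `score_human` and `score_fly` functions (simple stand-ins for trained models) and the thresholds are invented for illustration and do not come from the study.

```python
import random

# Hypothetical motifs: identical at the first eight positions, different at the last four,
# so the two toy scores largely overlap, as the article describes for the real models.
HUMAN_MOTIF = "GGTACGTCAAGT"
FLY_MOTIF = "GGTACGTCTTCA"


def match_fraction(seq, motif):
    """Fraction of positions at which seq agrees with motif."""
    return sum(a == b for a, b in zip(seq, motif)) / len(motif)


def score_human(seq):
    """Stand-in for the human-trained model's predicted activity."""
    return match_fraction(seq, HUMAN_MOTIF)


def score_fly(seq):
    """Stand-in for the fruit fly-trained model's predicted activity."""
    return match_fraction(seq, FLY_MOTIF)


def select_extremes(seqs, score_x, score_y, min_x, max_y):
    """Keep sequences scoring at least min_x under X and at most max_y under Y."""
    hits = []
    for seq in seqs:
        x, y = score_x(seq), score_y(seq)
        if x >= min_x and y <= max_y:
            hits.append((seq, x, y))
    return hits


rng = random.Random(0)
candidates = ["".join(rng.choice("ACGT") for _ in range(12)) for _ in range(20000)]

# Human-specific candidates: high predicted human score, low predicted fly score.
extremes = select_extremes(candidates, score_human, score_fly, min_x=0.5, max_y=0.2)
print(f"{len(extremes)} of {len(candidates)} candidates pass the filter")
```

Swapping the two scoring functions gives the fruit fly-specific direction, and the same filter would apply to any other pair of conditions. As in the study, sequences flagged this way are only predictions until they are confirmed at the bench.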
[Graphic: AI prediction of 50 million DNA sequences]
“Before embarking on this work, we didn’t know if the AI models were ‘intelligent’ enough to predict the activities of 50 million sequences, particularly outlier ‘extreme’ sequences with unusual activities. So, it’s very impressive and quite remarkable that the AI models could predict the activities of the rare one-in-a-million extreme sequences,” said Kadonaga. He added that it would be essentially impossible to run the comparable 100 million wet lab experiments (50 million sequences for each of the two species) that the machine learning technology analyzed, since each wet lab experiment would take nearly three weeks to complete.
The rare sequences identified by the machine learning system serve as a successful demonstration and set the stage for other uses of machine learning and other AI technologies in biology.
“In everyday life, people are finding new applications for AI tools such as ChatGPT. Here, we’ve demonstrated the use of AI for the design of customized DNA elements in gene activation. This method should have practical applications in biotechnology and biomedical research,” said Kadonaga. “More broadly, biologists are probably at the very beginning of tapping into the power of AI technology.”
Funding from the National Institutes of Health (R35 GM118060) supported the research.