Summary: Researchers are using artificial intelligence (AI) to dig deep into the mechanisms of gene activation, a crucial process in growth, development, and disease. Utilizing machine learning, the team identified “synthetic extreme” DNA sequences that play specific roles in gene activation.
These sequences were discovered by testing millions of DNA sequences and comparing gene activation elements in humans and fruit flies. This approach could be employed to identify synthetic DNA sequences with potentially significant applications in biotechnology and medicine.
- The UC San Diego researchers identified custom-tailored DPR sequences that are active in humans but not in fruit flies, and vice versa, using machine learning techniques.
- The team used a method called support vector regression to train machine-learning models with 200,000 established DNA sequences.
- The rare sequences identified by the machine learning system set the stage for broader uses of machine learning and other AI technologies in biology.
Artificial intelligence has exploded across our news feeds, with ChatGPT and related AI technologies becoming the focus of broad public scrutiny. Beyond popular chatbots, biologists are finding ways to leverage AI to probe the core functions of our genes.
Previously, University of California San Diego researchers who investigate DNA sequences that switch genes on used artificial intelligence to identify an enigmatic puzzle piece tied to gene activation, a fundamental process involved in growth, development and disease.
Using machine learning, a type of artificial intelligence, School of Biological Sciences Professor James T. Kadonaga and his colleagues discovered the downstream core promoter region (DPR), a “gateway” DNA activation code that’s involved in the operation of up to a third of our genes.
Building from this discovery, Kadonaga and researchers Long Vo ngoc and Torrey E. Rhyne have now used machine learning to identify “synthetic extreme” DNA sequences with specifically designed functions in gene activation.
Publishing in the journal Genes & Development, the researchers used machine learning to test millions of different DNA sequences, comparing the DPR gene activation element in humans with its counterpart in fruit flies (Drosophila).
By using AI, they were able to find rare, custom-tailored DPR sequences that are active in humans but not fruit flies and vice versa.
More generally, this approach could now be used to identify synthetic DNA sequences with activities that could be useful in biotechnology and medicine.
“In the future, this strategy could be used to identify synthetic extreme DNA sequences with practical and useful applications. Instead of comparing humans (condition X) versus fruit flies (condition Y) we could test the ability of drug A (condition X) but not drug B (condition Y) to activate a gene,” said Kadonaga, a distinguished professor in the Department of Molecular Biology.
“This method could also be used to find custom-tailored DNA sequences that activate a gene in tissue 1 (condition X) but not in tissue 2 (condition Y). There are countless practical applications of this AI-based approach.
“The synthetic extreme DNA sequences might be very rare, perhaps one-in-a-million; if they exist they could be found by using AI.”
Machine learning is a branch of AI in which computer systems continually improve and learn based on data and experience.
In the new research, Kadonaga, Vo ngoc (a former UC San Diego postdoctoral researcher now at Velia Therapeutics) and Rhyne (a staff research associate) used a method known as support vector regression to “train” machine learning models with 200,000 established DNA sequences based on data from real-world laboratory experiments.
These sequences served as the examples from which the machine learning system learned. The researchers then “fed” 50 million test DNA sequences into the machine learning systems for humans and fruit flies and asked them to compare the sequences and identify unique sequences within the two enormous data sets.
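To make the training step concrete, here is a minimal sketch of support vector regression on DNA sequences. It is not the researchers' code. It assumes a linear model fit by subgradient descent on the epsilon-insensitive loss, one-hot encoded sequences, and a made-up six-letter motif (`TOY_MOTIF`) that stands in for measured laboratory activity. The article does not say which kernel, features, or software the team used, so every name and parameter below is hypothetical.

```python
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat 0/1 vector, four entries per position."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def predict(w, b, seq):
    """Predicted activity of one sequence under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, one_hot(seq))) + b

def train_linear_svr(seqs, activities, epsilon=0.05, lam=1e-3,
                     lr=0.005, epochs=50, seed=0):
    """Fit a linear SVR by per-sample subgradient descent.

    Objective (assumed for this sketch):
        lam/2 * ||w||^2 + mean(max(0, |w.x + b - y| - epsilon))
    Errors inside the epsilon "tube" contribute no loss gradient.
    """
    rng = random.Random(seed)
    X = [one_hot(s) for s in seqs]
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - activities[i]
            # Subgradient of the epsilon-insensitive loss: -1, 0, or +1.
            g = 0.0 if abs(err) <= epsilon else (1.0 if err > 0 else -1.0)
            w = [wj - lr * (lam * wj + g * xj) for wj, xj in zip(w, X[i])]
            b -= lr * g
    return w, b

# Toy stand-in for laboratory measurements (hypothetical, not real data):
# activity is the fraction of positions that match a made-up motif.
TOY_MOTIF = "ACGTAC"

def toy_activity(seq):
    return sum(a == m for a, m in zip(seq, TOY_MOTIF)) / len(TOY_MOTIF)

rng = random.Random(42)
train_seqs = ["".join(rng.choice(BASES) for _ in range(6)) for _ in range(200)]
train_y = [toy_activity(s) for s in train_seqs]

w, b = train_linear_svr(train_seqs, train_y)
```

Once trained, `predict(w, b, seq)` can score any new sequence of the same length. In the study's setup, one such model would be trained per species, and the 50 million test sequences would be scored by both.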
While the machine learning systems showed that human and fruit fly sequences largely overlapped, the researchers focused on the core question of whether the AI models could identify rare sequences that are highly active in humans but not in fruit flies.
The answer was a resounding “yes.”
The machine learning models succeeded in identifying human-specific (and fruit fly-specific) DNA sequences. Importantly, the AI-predicted functions of the extreme sequences were verified in Kadonaga’s laboratory by using conventional (wet lab) testing methods.
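The search for species-specific sequences can be sketched as a simple filter over two sets of predictions. The example below is illustrative only. The two scoring functions are toy stand-ins for the trained human and fruit fly models, the preferred motifs are invented, and the thresholds `min_x` and `max_y` are arbitrary assumptions rather than values from the study.

```python
import random

BASES = "ACGT"

def make_toy_model(preferred):
    """Return a stand-in scorer: fraction of positions matching `preferred`."""
    def score(seq):
        return sum(a == b for a, b in zip(seq, preferred)) / len(preferred)
    return score

def find_extremes(candidates, score_x, score_y, min_x, max_y):
    """Keep sequences predicted active under condition X but not under Y."""
    hits = []
    for seq in candidates:
        sx, sy = score_x(seq), score_y(seq)
        if sx >= min_x and sy <= max_y:
            hits.append((seq, sx, sy))
    # Most X-specific sequences first.
    hits.sort(key=lambda h: h[1] - h[2], reverse=True)
    return hits

# Hypothetical models whose preferences overlap in the first four positions
# and differ in the last four.
human_model = make_toy_model("ACGTACGT")
fly_model = make_toy_model("ACGTTTAA")

rng = random.Random(1)
candidates = ["".join(rng.choice(BASES) for _ in range(8))
              for _ in range(20_000)]

extremes = find_extremes(candidates, human_model, fly_model,
                         min_x=0.75, max_y=0.5)
```

Swapping the two scorers would give the fruit fly-specific set instead. The same filter would apply to the other comparisons Kadonaga describes, such as drug A versus drug B, by substituting models trained under those conditions.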
“Before embarking on this work, we didn’t know if the AI models were ‘intelligent’ enough to predict the activities of 50 million sequences, particularly outlier ‘extreme’ sequences with unusual activities.
“So, it’s very impressive and quite remarkable that the AI models could predict the activities of the rare one-in-a-million extreme sequences,” said Kadonaga. He added that it would be essentially impossible to conduct the comparable 100 million wet lab experiments that the machine learning technology analyzed, since each wet lab experiment would take nearly three weeks to complete.
The rare sequences identified by the machine learning system serve as a successful demonstration and set the stage for other uses of machine learning and other AI technologies in biology.
“In everyday life, people are finding new applications for AI tools such as ChatGPT. Here, we’ve demonstrated the use of AI for the design of customized DNA elements in gene activation.
“This method should have practical applications in biotechnology and biomedical research,” said Kadonaga.
“More broadly, biologists are probably at the very beginning of tapping into the power of AI technology.”
About this artificial intelligence and genetics research news
Author: Mario Aguilera
Contact: Mario Aguilera – UCSD
Original Research: Closed access.
“Analysis of the Drosophila and human DPR elements reveals a distinct human variant whose specificity can be enhanced by machine learning” by James T. Kadonaga et al. Genes & Development
Analysis of the Drosophila and human DPR elements reveals a distinct human variant whose specificity can be enhanced by machine learning
The RNA polymerase II core promoter is the site of convergence of the signals that lead to the initiation of transcription. Here, we performed a comparative analysis of the downstream core promoter region (DPR) in Drosophila and humans by using machine learning.
These studies revealed a distinct human-specific version of the DPR and led to the use of machine learning models for the identification of synthetic extreme DPR motifs with specificity for human transcription factors relative to Drosophila factors and vice versa.
More generally, machine learning models could similarly be used to design synthetic DNA elements with customized functional properties.