Novel Algorithm Finds High Mutation Rates In Complex Genomic Regions
In a paper published this week in the journal Nature Methods, Pavel Pevzner, the Ronald R. Taylor Chair and a distinguished professor in the University of California San Diego Department of Computer Science and Engineering, and PhD student Andrey Bzikadze introduced UniAligner, a new algorithm for comparing highly repetitive genomic regions. The algorithm can identify mutations in complex and biomedically important genomic regions, such as centromeres, which play a major role in cell division, and immunoglobulin loci, which harbor antibody-encoding genes.
“Last year, the Telomere-to-Telomere (T2T) consortium generated the first complete sequence of the human genome that included centromeres and other complex regions,” said Pevzner. “Before that, these sequences were just dark matter. But now that we have complete genomes for multiple individuals, we want to compare them to each other to gain insights into disease-causing mutation and human evolution. Surprisingly, the classical sequence alignment strategies, that worked so well in the last half century, failed to do it in the case of these newly assembled regions.”
It’s hard to tell which pieces go where in this puzzle of an all-black cat. UniAligner helps reveal differences in the human genome.
Centromeres make up around 3 percent of the human genome and can contain mutations associated with cancer, infertility and other conditions. However, because they consist largely of repetitive sequences, they have been incredibly challenging to assemble. Investigators have likened the task to piecing together a jigsaw puzzle that shows only an all-black cat. Without much variation, it’s difficult to know which pieces fit where.
In 2020, Pevzner and Bzikadze, a PhD student in the Bioinformatics and Systems Biology program, developed a new computational method to assemble centromere sequences. However, this created a new problem: finding the best way to compare complex and rapidly evolving genomic regions in different humans to determine mutation rates. Unfortunately, existing sequence comparison algorithms, which work for most of the human genome, fail miserably when comparing these elusive regions, making it difficult to determine which mutations are associated with disease.
“I always considered sequence comparison a solved problem and never planned to work in this area,” said Bzikadze. “That changed last year when my colleagues generated the complete sequence of the human genome and were perplexed to find that existing algorithms failed to reveal differences between them.”
Fast and Accurate Algorithm to Help Dissect Human Genome
The problem with standard alignment approaches is that they apply the same scoring parameters to all sequences, so they are not selective enough to produce major insights when the sequences are repetitive. In centromeres and other complex regions, this can produce millions of spurious substring matches between two sequences, making it nearly impossible to decide which of them are correct.
UniAligner solves this problem by prioritizing rare substrings, which have a much higher probability of providing useful information for aligning two sequences.
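To make the idea concrete, here is a minimal Python sketch of why rare substrings are more informative than frequent ones. It is an illustration only, not the UniAligner implementation: the function names, the substring length k and the max_count threshold are all hypothetical choices made for this example. It lists every position pair where two sequences share a length-k substring, then repeats the search while keeping only substrings that occur at most max_count times in each sequence.

```python
def kmer_positions(seq, k):
    """Map each length-k substring of seq to the list of its start positions."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions


def shared_kmer_matches(a, b, k, max_count=None):
    """Return sorted (i, j) pairs with a[i:i+k] == b[j:j+k].

    If max_count is given, skip any substring that occurs more than
    max_count times in either sequence (that is, keep only rare ones).
    """
    pos_a = kmer_positions(a, k)
    pos_b = kmer_positions(b, k)
    matches = []
    for kmer, starts_a in pos_a.items():
        starts_b = pos_b.get(kmer)
        if starts_b is None:
            continue
        if max_count is not None and (
            len(starts_a) > max_count or len(starts_b) > max_count
        ):
            continue
        matches.extend((i, j) for i in starts_a for j in starts_b)
    return sorted(matches)


# Toy sequences: a short unique segment embedded in a tandem repeat.
# The second sequence has one repeat unit moved from the left flank
# to the right flank, so the unique segment is shifted by 4 positions.
a = "ACGT" * 5 + "TTGACCA" + "ACGT" * 5
b = "ACGT" * 4 + "TTGACCA" + "ACGT" * 6

all_matches = shared_kmer_matches(a, b, k=5)
rare_matches = shared_kmer_matches(a, b, k=5, max_count=1)

print(len(all_matches), len(rare_matches))
print(sorted({i - j for i, j in rare_matches}))  # diagonal offsets of rare matches
```

In this toy case the matches from the repeat units land on many different diagonals, while the rare matches all agree on a single offset, which is the kind of unambiguous signal an aligner can anchor on. Real centromeres are far longer and contain imperfect repeats, so a practical tool needs considerably more machinery than this sketch.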
In the study, UniAligner aligned human centromeres and revealed extremely high duplication and deletion rates, suggesting that centromeres may be some of the most rapidly evolving regions in the human genome. In addition to being quite accurate, the new algorithm is blazingly fast, even faster than current state-of-the-art sequence comparison algorithms.
“We feel this is an important step forward in our efforts to analyze the multiple human genomes,” said Pevzner. “The ability to compare the entire human genomes to determine how they are changing is essential for us to investigate evolution and understand disease. This gives us an important new tool to access the most challenging genomic regions.”