The most comprehensive view of the human genome to date will accelerate the diagnosis of rare diseases and cancer.
Spanish researchers from the Centre for Genomic Regulation (CRG) have participated in the creation of the most comprehensive catalog of genetic variation to date, deciphering some of the most overlooked and difficult-to-identify regions of the human genome. This breakthrough, published this Wednesday in two articles in the journal Nature, will accelerate the diagnosis of rare diseases and cancer.
"Each human genome has around 25,000 structural variants, but only one causes disease, so it's necessary to narrow the search space and screen for variants. With current references, we can go from 25,000 to a few thousand, but that's still looking for a needle in a haystack. Thanks to this new reference that we're releasing with this work , we've narrowed the search space to fewer than 200 candidate variants , which greatly facilitates genetic diagnosis in clinical practice," explained Dr. Bernardo Rodríguez-Martín, co-corresponding author of the study, during the press conference presenting the results.
"This work constitutes the most comprehensive reference to structural genetic variation in the human genome to date. It is a step forward toward personalized medicine based on genomic information," he added. The expert stated that hospitals such as Sant Joan de Deu, which collaborates with the CGR, are already using these technologies to diagnose rare diseases in children.
Furthermore, these new technologies can be applied to the study of mutations that cause cancer. "In 15% of patients, a cancer-causing mutation is not found, and this may be because previous technologies have been unable to detect it. Another major challenge is understanding the mutations that accumulate throughout our lives. This technology allows us to understand, with unprecedented resolution, how we accumulate mutations as we age and due to environmental and lifestyle factors," added the researcher, who acknowledged that one of the limitations is the still-high cost of sequencing. "Sequencing a genome costs a thousand euros. In the last five years, it has dropped significantly, about fivefold, so we can imagine a not-so-distant future, in about five years, where the price will have dropped sufficiently and, for a few hundred euros, we can sequence a genome with this technology," he concluded.
In 2003, the human genome was sequenced for the first time. It was discovered then that 60% of the genome is repetitive DNA, and about 8% of the genome remained unresolved due to its complexity. In 2015, the 1000 Genomes Project sequenced more than 1,000 human genomes in 26 populations around the world, but the limitations of the technologies then available, capable of reading DNA only in very short fragments, left large regions of the genome unexplored. Between 2021 and 2023, the entire human genome was resolved thanks to long-read technology, but only as a single reference, and the Pangenome project emerged to expand the number of references, with 47 individuals from 5 continents.
Now, researchers have significantly expanded the catalog of known human genetic variation. The resulting datasets, published this Wednesday in Nature, constitute the most comprehensive overview of the human genome to date. The first article, jointly led by the European Molecular Biology Laboratory (EMBL), the Heinrich Heine University Düsseldorf (HHU), and the Centre for Genomic Regulation (CRG) in Barcelona, analyzed the genomes of 1,019 individuals from 26 populations across five continents.
The researchers specifically looked for structural variants in the human genome. These are large fragments of DNA that have been deleted, duplicated, inserted, inverted, or rearranged. Differences in structural variants between individuals can involve changes in thousands of DNA letters at once, often knocking out genes and leading to the development of many rare diseases and cancers.
The team found and categorized more than 167,000 structural variants across the 1,019 individuals, doubling the known amount of structural variation in the human pangenome, a reference that links DNA from many people rather than relying on a single genome. Each person carried a median of 7.5 million letters of structural changes, highlighting the vast amount of genome editing that nature does on its own.
"We found a treasure trove of hidden genetic variation in these populations, many of which were underrepresented in previous reference sets. For example, 50.9% of the insertions and 14.5% of the deletions we found had not been reported in previous variation catalogs. This is an important step toward mapping blind spots in the human genome and reducing the bias that has long favored genomes of European ancestry, and paves the way for therapies and tests that work equally well in people around the world," says Dr. Bernardo Rodríguez-Martín.
Approximately three in five (59%) of the discovered variants occurred in less than one percent of individuals, a crucial level of rarity for diagnosing genetic diseases, because knowing which variants are rare helps filter out harmless variation more effectively. In testing, the new reference set reduced the list of suspected mutations from tens of thousands to just a few hundred, speeding up the diagnosis of rare genetic syndromes and other diseases such as cancer.
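The filtering idea can be illustrated with a minimal sketch, assuming each structural variant found in a patient has already been annotated with the number of carriers among the 1,019 reference individuals. The `Variant` class, the `filter_rare` function, and the example data are hypothetical illustrations, not part of the published pipeline or of SVAN.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str  # hypothetical identifier
    sv_type: str     # e.g. "DEL", "INS", "DUP", "INV"
    carriers: int    # reference individuals carrying this variant (assumed annotation)


def filter_rare(variants, panel_size, max_frequency=0.01):
    """Keep variants carried by less than max_frequency of the reference panel."""
    return [v for v in variants if v.carriers / panel_size < max_frequency]


PANEL_SIZE = 1019  # individuals in the reference set described in the article

# Made-up patient variants, for illustration only.
patient_variants = [
    Variant("sv1", "DEL", 640),  # common: likely harmless, filtered out
    Variant("sv2", "INS", 9),    # under 1% of the panel: kept as a candidate
    Variant("sv3", "DUP", 0),    # never seen in the panel: kept
    Variant("sv4", "INV", 11),   # just over 1% of the panel: filtered out
]

candidates = filter_rare(patient_variants, PANEL_SIZE)
print([v.variant_id for v in candidates])
```

The larger and more diverse the reference panel, the more common variants can be confidently discarded, which is why a broader catalog shortens the candidate list.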
Bernardo Rodríguez-Martín began working on the project in Jan Korbel's lab at EMBL and completed it after moving to the CRG to found his own group. He developed SVAN, a software program that categorizes each DNA change as either an "extra piece copied" or a "deleted fragment," helping the team analyze the genetic data to identify new patterns.
SVAN revealed that more than half of the newly mapped diversity in the human genome is located in highly repetitive DNA segments, parts of the genome previously considered junk or too difficult to study. "Repetitive elements represent a rich and previously ignored reservoir of genetic diversity. They are key players in human diversity, disease, and evolution," says Emiliano Sotelo-Fonseca, a CRG PhD student and co-author of the first study.
These repetitive DNA segments include mobile elements, also known as "jumping genes" due to their ability to replicate throughout the genome. Researchers discovered that, among the thousands of mobile elements in the human genome, most germline mutagenesis results from the activity of a few dozen highly active elements.
For example, one particularly overactive LINE-1 element was found to hijack a powerful regulatory switch to produce many more copies of itself than usual, scattering extra genetic material into many people's DNA. Researchers observed a similar trick with another class of jumping genes called SVAs.
"Our work shows how mobile elements enhance their activity by hijacking our genomic regulatory controls, an underappreciated strategy that could contribute to the development of diseases like cancer and deserves further investigation," says Dr. Rodríguez-Martín.
The second paper, jointly led by the European Molecular Biology Laboratory (EMBL) and Heinrich Heine University Düsseldorf (HHU), used a much smaller sample set of just 65 individuals, but combined several powerful sequencing methods to reconstruct human genomes in unprecedented detail.
This approach helped the researchers decode the most difficult-to-read sections, including the centromeres. The nearly complete, gap-free assemblies of each chromosome from these individuals allowed the researchers to detect large genetic variants in regions that had not been detected in the first article or in other studies.
The results show that combining the approach of the first paper (many genomes sequenced to a modest depth) with that of the second (a few genomes sequenced in great detail) is the fastest path to a complete and inclusive map of human genetic diversity.
"One study uses less sequencing power, but a much larger cohort. The other uses a smaller cohort, but much more sequencing power per sample. This led to complementary conclusions," notes Dr. Jan Korbel, group leader and interim director of EMBL Heidelberg, and co-senior author of both studies.
Both papers resequenced individuals from the 1000 Genomes Project, the landmark initiative that mapped global genetic diversity in 2015. The project relied on "short-read" sequencing technology, which could read only very small fragments of DNA at a time. These fragments were too short to reveal large chunks of missing or copied DNA, long stretches that change direction, or repeats that appear nearly identical in many places.
The advances in the new studies were made possible by "long-read" sequencing, a new technology that reads thousands or tens of thousands of DNA letters at a time, helping researchers find large amounts of hidden variation undetectable with previous methods.
Both articles also represent important advances in the construction of a reference human pangenome. For the past twenty years, scientists have used an individual's DNA sequence as a standard human genome, but the pangenome would be more useful for personalized medicine, as it would reflect global diversity.
By developing innovative algorithms that can analyze 1,019 diverse genomes in breadth and 65 ultra-complete genomes in depth, the researchers provide a roadmap that makes a true human pangenome practical rather than merely aspirational, especially as long-read sequencing costs decline.
"Thanks to these studies, we have created a comprehensive and medically relevant resource that can now be used by researchers worldwide to better understand the origin of human genomic variation and how it is affected by a wide variety of factors," says Tobias Marschall, professor at Heinrich Heine University Düsseldorf and co-senior author of both studies. "This is an excellent example of collaborative research that opens up new perspectives in genomic science and is a step toward a more complete human pangenome," he concludes.