A Reference Human Pangenome will Allow for a More Thorough and Fair Knowledge of Genetic Diversity

A group of researchers, including scientists from UC Santa Cruz, has published a draft of the first human pangenome, a new, practical reference for genomics that combines the genetic material of 47 people from diverse ancestral backgrounds to enable a more thorough and precise understanding of global genomic diversity.

The pangenome represents human genetic variation in a way that was not conceivable with a single reference genome by supplementing the current genomics reference with 119 million bases, or the “letters” in DNA sequences. As demonstrated in a group of ground-breaking publications published today in the journals Nature, Genome Research, Nature Biotechnology, and Nature Methods, it is extremely accurate, more comprehensive, and significantly boosts the detection of variations in the human genome.

The pangenome was produced by the Human Pangenome Reference Consortium (HPRC), which is co-led by UCSC’s Associate Professor of Biomolecular Engineering Benedict Paten and Assistant Professor of Biomolecular Engineering Karen Miga. It is now available for use in an assembly hub on the UCSC Genome Browser.

This effort, which will last until 2024 when the researchers hope to publish a final pangenome with genomic data from 350 people, involves more than a dozen UCSC researchers and students.

“We are introducing more diversity and equity into the reference by sampling diverse human beings and including them in this structure that everyone can use,” said Paten, who is the senior author on the main marker paper. “One genome isn’t enough to represent everybody. The pangenome will ultimately be something that is inclusive and representative.”

Understanding genomic variation

Any two people’s genomes differ by roughly 0.4% on average, and understanding these differences can help diagnose disease, predict health outcomes, and guide treatment. The pangenome reference will improve scientists’ ability to detect and understand this variation in future research.

When researchers and doctors examine a person’s genome to look for variation, they typically compare that person’s DNA to a standard reference to identify any base pair changes.

Until now, the reference genome has represented each human chromosome with a single sequence, most of it derived from a single person. This reference is approximately two decades old and does not represent the vast genetic diversity seen in the human population, which introduces a problem called reference bias into genome analysis.

The new pangenome, in contrast, is a reference that combines the genomes of 47 people with different ancestries. It looks like a linear reference wherever the sequences share the same bases and branches out to show the regions where they differ. It represents many versions of the human genome sequence at once, giving researchers a more precise point of comparison for variation that is present in some populations but not others.

“One genome can’t possibly represent all of the rich variation we know can be observed and studied around the world,” said Miga, Director of the HPRC Production Center at UCSC. “The No. 1 goal of the human pangenome reference is to try to broaden the representation of a reference resource to be more inclusive and more equitable for studying the human species, as a collection of references and not just one.”

Genomic variation ranges from small variants, only a few bases long, to structural variants, commonly defined as changes of 50 base pairs or more. Structural variants can have significant effects on health. Because of technological limitations and the bias of using a single reference sequence, researchers have been unable to identify more than 70% of the structural variants that exist in human genomes.
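As a rough illustration of the 50-base-pair convention mentioned above, the following Python sketch labels a variant by its size. The names `classify_variant` and `SV_MIN_LENGTH`, and the choice to measure a variant by its longer allele, are assumptions made for this example and are not taken from any HPRC tool.

```python
# Threshold from the article: structural variants are commonly
# defined as variants of 50 base pairs or more.
SV_MIN_LENGTH = 50


def classify_variant(ref: str, alt: str) -> str:
    """Label a variant 'small' or 'structural' by the size of the change.

    Assumption for this sketch: the size of a variant is the length of
    its longer allele, so an inversion (same length on both sides) is
    still measured by how many bases it spans.
    """
    size = max(len(ref), len(alt))
    return "structural" if size >= SV_MIN_LENGTH else "small"


# A single-base change is a small variant.
print(classify_variant("A", "G"))
# A 60-base insertion crosses the 50-base threshold.
print(classify_variant("A", "A" + "C" * 60))
```

Real variant callers use richer definitions, but the threshold itself is this simple.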

About 90 million of the 119 million new bases that the pangenome adds to the reference come from structural variation. Complex structural variants include sequence inversions, insertions, deletions, and tandem repeats, which are segments of two or more bases repeated back to back.
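To make the idea of a tandem repeat concrete, here is a minimal Python sketch that counts how many consecutive copies of a short motif appear at a given position in a sequence. The function name `count_tandem_copies` and the example sequence are hypothetical and chosen only for illustration.

```python
def count_tandem_copies(sequence: str, motif: str, start: int = 0) -> int:
    """Count consecutive copies of `motif` in `sequence` beginning at `start`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    copies = 0
    pos = start
    # Keep stepping forward one motif length while the motif repeats.
    while sequence.startswith(motif, pos):
        copies += 1
        pos += len(motif)
    return copies


# Hypothetical sequence: "CA" repeated four times, flanked by other bases.
example = "GGCACACACATT"
print(count_tandem_copies(example, "CA", start=2))
```

Two people can carry different numbers of copies at the same position, which is one reason a single reference sequence struggles to represent these regions.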

With the use of these new bases, researchers will be able to examine parts of the genome for which there was previously no reference and perhaps link structural variations to disease in the future.

“Now, we can map to more structural variants, so we’re finding features and areas in the genome that just weren’t there before,” Miga said. “That’s exciting because it’s allowing us to look at gene regulation in a unique way that we couldn’t study before, because those areas probably would have been inappropriately mapped or just ignored altogether.”

In genomic analysis with the pangenome reference, detection of structural variants increases by 104% compared with the standard reference. Because the pangenome contains more data, it also improves the accuracy of calling small variants, those only a few bases long, by roughly 34%.

Every human carries a paired set of chromosomes, one from the mother and one from the father. In a significant scientific achievement, the individual genomes in the pangenome reference carry information that clearly distinguishes the two parental sets of chromosomes. This information will help scientists understand the inheritance of different genes and diseases.

This also means the current reference actually includes 94 distinct genome sequences, with the goal of getting to 700 by 2024.
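The arithmetic behind these figures is simple: each person contributes two genome sequences, one per parental set of chromosomes. A short Python sketch, with the hypothetical helper `genome_sequence_count`, shows the calculation.

```python
# Each person carries two sets of chromosomes, one from each parent.
SEQUENCES_PER_PERSON = 2


def genome_sequence_count(people: int) -> int:
    """Number of distinct genome sequences contributed by `people` individuals."""
    return people * SEQUENCES_PER_PERSON


print(genome_sequence_count(47))   # draft pangenome: 94
print(genome_sequence_count(350))  # planned final pangenome: 700
```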

Creating the pangenome

The pangenome was made possible by advanced computational methods that align many genome sequences into a single, usable reference in a structure known as a pangenome graph. Paten and researchers in the UCSC Computational Genomics lab helped lead the HPRC efforts to develop the algorithmic methods needed to create this pangenome graph structure.

Thanks to the techniques used in this effort, every genome in the pangenome reference is of extremely high quality, covering more than 99% of each human genome with more than 99% accuracy.

“In the linear reference, we had only one sequence, one representation of each gene,” said Mobin Asri, a bioinformatics Ph.D. candidate at UCSC and co-first author on the main paper. “But we know that our genes have different variations in the human population. Using the pangenome graph, we want to have all of those variations in a single structure and a graph is a natural way to do this.”

To read DNA from biological samples, the HPRC project mainly uses long-read and ultra-long-read sequencing technology. Thanks to recent advances, these methods can now read thousands to millions of base pairs of the genome at a time.

The long DNA reads are then assembled into more complete genomic sequences using sophisticated algorithms. Ideally, each assembled sequence corresponds to one chromosome.

Because current assembly methods are not flawless and long reads contain errors roughly 1% of the time, the assembled sequences may contain errors in some places. The sequenced and assembled genomes therefore pass through a variety of tools, including a reliability pipeline created by Asri, that check for errors and correct them. Once the assemblies have been processed by these tools, the researchers can verify that they are correct and complete.

After moving through Asri’s pipeline, the genomes are combined into the pangenome graph structure using complex algorithmic methods. In the graph, researchers can see differences between the reference sequences as regions where otherwise similar paths diverge.
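A toy example can show what divergence between paths means. In the Python sketch below, each node holds a short stretch of DNA and each genome is a path through the nodes. The node contents, genome names, and helper functions are all hypothetical, and this is a simplified illustration rather than the data structure or algorithm the HPRC actually uses.

```python
from typing import Dict, List, Set

# Hypothetical toy graph: node id -> DNA segment.
NODES: Dict[int, str] = {1: "ACGT", 2: "G", 3: "T", 4: "CCA"}

# Each genome is a path (an ordered list of node ids) through the graph.
PATHS: Dict[str, List[int]] = {
    "genome_A": [1, 2, 4],
    "genome_B": [1, 3, 4],
    "genome_C": [1, 2, 4],
}


def spell(path: List[int], nodes: Dict[int, str]) -> str:
    """Reconstruct a genome's sequence by walking its path."""
    return "".join(nodes[node_id] for node_id in path)


def shared_nodes(paths: Dict[str, List[int]]) -> Set[int]:
    """Nodes that every genome passes through."""
    return set.intersection(*(set(p) for p in paths.values()))


def divergent_nodes(paths: Dict[str, List[int]]) -> Set[int]:
    """Nodes that only some genomes pass through."""
    all_nodes = set().union(*(set(p) for p in paths.values()))
    return all_nodes - shared_nodes(paths)


for name, path in PATHS.items():
    print(name, spell(path, NODES))
print("shared:", sorted(shared_nodes(PATHS)))
print("divergent:", sorted(divergent_nodes(PATHS)))
```

Here all three genomes share nodes 1 and 4, while nodes 2 and 3 form a small region where the paths split: two genomes carry a G and one carries a T at that position.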

Building an accessible resource

All of the first 47 diploid genomes in the draft pangenome were sourced from individuals who participated in the 1000 Genomes Project (1000G), an influential effort which created a catalog of common human genetic variation from openly consented samples and was completed in 2015.

In order to make the pangenome available to as many researchers as possible, these samples have open consent status, allowing any researcher to utilize the resource without the privacy restrictions that often accompany genome research.

“Becoming a common resource is something that’s fundamental to the success of a human pangenome reference,” Miga said. “It has to have the ability to be accessible and open around the world to all researchers so we can use it as the foundation.”

The HPRC team is concentrating on outreach to guarantee that the pangenome will be used as a valuable resource in clinics all around the world. This entails making it easier for scientists conducting research utilizing the pangenome reference to provide annotations, feedback, and input.

“The draft pangenome is an important proof of principle that we hope is going to influence a lot of people and get them thinking about the pangenome and how it might affect their work,” Paten said. “Looking ahead, we see a lot of engagement with other groups. It takes a lot of different people to build something that is going to become a big community resource.”

Along with a focus on accessibility, the HPRC project has a dedicated ethics team focused on the social and legal implications of this project. The team is collaborating with international and Indigenous communities to include their genome sequences in these larger efforts. It is also working to prioritize the study of diverse samples, explore potential regulatory issues related to clinical adoption, anticipate difficult issues, and help guide informed consent.

Continuing the legacy and future work

The human pangenome is a continuation of decades-long efforts from scientists at UC Santa Cruz to understand the biological code that underlies human life.

In 2000, Jim Kent, then a UCSC graduate student and now a research scientist at the Genomics Institute and director of the UCSC Genome Browser, wrote the code that assembled the first working draft of the human genome. UCSC scientists published it with open access to anyone who wanted to use it. Since then, UCSC has been at the forefront of genomics research.

In April 2022, the Telomere-to-Telomere consortium, co-led by UCSC’s Karen Miga, assembled the first complete sequence of a human genome, filling in missing, complex regions of the reference that had long eluded scientists.

“Since 2000, we’ve had a series of increasingly more accurate representations of one genome,” said David Haussler, Scientific Director of the UCSC Genomics Institute, who led the UCSC team on the original Human Genome Project and advises on the pangenome project. “But no matter how accurately you represent one genome, that’s not going to represent all of humanity. Now is a turning point: no longer genomics of the one standard human genome, but genomics for everybody.”

The researchers are making progress toward the goal of completing the full pangenome by 2024. The team is in the process of recruiting new individuals to represent some populations not included in the 1000 Genomes Project, particularly people of Middle Eastern and African ancestry. Miga, as the director of the Data Production Center at UCSC, will spearhead these efforts going forward.

In addition to finishing the pangenome reference, the researchers are seeking to build an international human pangenome project that would create collaborations with researchers all around the world. These collaborations would involve a two-way exchange of knowledge and skills, with the goal of enabling researchers everywhere to conduct their own research by providing them with the tools and expertise required to produce high-quality reference genomes.