In Minutes, Scientists Can Assemble Whole Genomes on Their Own Computers

Scientists at MIT (Massachusetts Institute of Technology) and the Institut Pasteur in France have developed a method for reconstructing whole genomes, including the human genome, on a desktop computer. The method is a hundred times faster than existing state-of-the-art approaches and requires just a fifth of the resources.

The research, published in the journal Cell Systems on September 14, describes a more compact representation of genomic data, inspired by the way words, rather than letters, serve as condensed building blocks in language models.

“We can quickly assemble entire genomes and metagenomes, including microbial genomes, on a modest laptop computer,” says Bonnie Berger (@lab_berger), the Simons Professor of Mathematics at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and an author of the study. “This ability is essential in assessing changes in the gut microbiome linked to disease and bacterial infections, such as sepsis, so that we can more rapidly treat them and save lives.”

Genome assembly programs have come a long way since the Human Genome Project completed the first full human genome assembly in 2003 at a cost of $2.7 billion and after more than a decade of worldwide collaboration.

However, while human genome assembly projects no longer take years, they still take days and require substantial computing power. Although third-generation sequencing technologies produce terabytes of high-quality reads, each tens of thousands of base pairs long, assembling genomes from such a large volume of data has proven difficult.

Berger and colleagues drew on ideas from language models to approach genome assembly more efficiently than existing strategies, which rely on pairwise comparisons between all possible pairs of reads. Building on the de Bruijn graph, a simple, efficient data structure used for genome assembly, the researchers developed a minimizer-space de Bruijn graph (mdBG), which uses short sequences of nucleotides called minimizers, instead of single nucleotides, as its building blocks.

“Our minimizer-space de Bruijn graphs store only a small fraction of the total nucleotides, while preserving the overall genome structure, enabling them to be orders of magnitude more efficient than classical de Bruijn graphs,” says Berger.
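
To make the idea concrete, here is a minimal Python sketch of a minimizer-space de Bruijn graph. It is an illustration under stated assumptions, not the researchers' software: the function names are hypothetical, minimizers are chosen by keeping any length-l substring whose hash falls below a density threshold (one common selection scheme, assumed here), and reverse complements and sequencing errors are ignored. Each read is rewritten as an ordered list of minimizers, every run of k consecutive minimizers becomes a node, and consecutive runs are linked by an edge.

```python
import hashlib

KMinMer = tuple[str, ...]  # a graph node: k consecutive minimizers


def is_minimizer(lmer: str, density: float) -> bool:
    """Keep an l-mer when its hash lands in the lowest `density` fraction of hash space."""
    # Assumption: blake2b is used only as a convenient, deterministic hash.
    digest = hashlib.blake2b(lmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") < density * 2**64


def to_minimizer_space(read: str, l: int, density: float) -> list[str]:
    """Rewrite a read as the ordered list of its selected l-mers."""
    lmers = (read[i:i + l] for i in range(len(read) - l + 1))
    return [lmer for lmer in lmers if is_minimizer(lmer, density)]


def build_mdbg(reads: list[str], k: int, l: int, density: float) -> dict[KMinMer, set[KMinMer]]:
    """Nodes are runs of k minimizers; edges join consecutive runs sharing k-1 minimizers."""
    graph: dict[KMinMer, set[KMinMer]] = {}
    for read in reads:
        mins = to_minimizer_space(read, l, density)
        kmms = [tuple(mins[i:i + k]) for i in range(len(mins) - k + 1)]
        for kmm in kmms:
            graph.setdefault(kmm, set())  # register the node even if it has no successor
        for a, b in zip(kmms, kmms[1:]):
            graph[a].add(b)  # b is a shifted right by one minimizer
    return graph


# Toy reads (made up for illustration); real reads are tens of thousands of bases long.
reads = ["ACGTTGCATGGCTAAGCTTAGGCTA", "GCATGGCTAAGCTTAGGCTAACGGT"]
graph = build_mdbg(reads, k=3, l=4, density=0.3)
print(len(graph), "nodes built from", sum(len(r) for r in reads), "nucleotides")
```

Because only a fraction of the substrings are kept as minimizers, the graph has far fewer nodes than one built over every nucleotide position, which is the intuition behind the memory savings described above.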

The researchers used their approach to assemble real HiFi data (which has near-perfect single-molecule read accuracy) for Drosophila melanogaster fruit flies, as well as human genome data provided by Pacific Biosciences (PacBio). When they compared the results, Berger and colleagues found that their mdBG-based program required 33 times less time and 8 times less random-access memory (RAM) than other genome assemblers.

Their program assembled the HiFi human genome 81 times faster, with 18 times less memory, than the Peregrine assembler, and 338 times faster, with 19 times less memory, than the hifiasm assembler.

Berger and colleagues then applied their approach to build an index for a collection of 661,406 bacterial genomes, the largest such collection to date. They found that the new approach could search the whole collection for antimicrobial resistance genes in 13 minutes, whereas standard sequence alignment took 7 hours.

“We knew our representation was efficient but did not know it would scale so well on real data, after further optimizations of the code,” says Berger.

“The overall idea just works and does not require some of the usually expensive pre-processing steps, like error correction, done by most other genome assembly methods,” says Rayan Chikhi (@RayanChikhi), a researcher and group leader at Institut Pasteur and an author of the study.

“We can also handle sequencing data with up to 4% error rates,” adds Berger. “With long-read sequencers with differing error rates rapidly dropping in price, this ability opens the door to the democratization of sequencing data analysis.”

While the approach currently works best with PacBio HiFi reads, which have error rates well below 1%, Berger believes it will soon be compatible with Oxford Nanopore’s ultra-long reads, which currently have error rates of 5-12% but may soon reach 4%.

“We envision reaching out to field scientists to help them develop fast genomic testing sites, going beyond PCR and marker arrays which might miss important differences between genomes,” says Berger.

This work was supported by the National Institutes of Health, ANR Inception, PRAIRIE, and PANGAIA.