Tang Fuchou's research group achieves de novo assembly of the human genome from single-cell sequencing data

With the development of third-generation sequencing (TGS, that is, single-molecule sequencing), TGS data from bulk samples containing large numbers of cells are now widely used to assemble complex large genomes. Because TGS reads are hundreds of times longer than second-generation sequencing (NGS) reads, repetitive sequence regions and regions of complex structural variation, such as chromosomal rearrangements, can be assembled much better.

For human genome assembly, the Telomere-to-Telomere (T2T) Consortium took the lead, releasing in March 2022 the first complete telomere-to-telomere human genome reference sequence, CHM13v1.1, built from the homozygous diploid cell line CHM13. Also in March 2022, the Human Pangenome Reference Consortium (HPRC) released on the preprint platform bioRxiv the first high-quality haplotype-resolved assembly of a heterozygous human diploid cell line, HG002.

At present, high-quality genome assembly usually relies on third-generation sequencing data from bulk samples of many cells, and requires a large amount of genomic DNA (usually tens of micrograms, extracted from millions of cells). In practice, however, genome assembly often runs into two difficulties:

1. Genetic heterogeneity exists within cell populations. Genome assembly from bulk third-generation sequencing data requires that every cell in the sequenced sample have a highly consistent genetic background; otherwise the assembly cannot easily separate differences between the haplotypes within one cell from genomic differences between cell subpopulations. Only by reducing or eliminating genetic heterogeneity between cells can the accuracy of haplotype assembly be ensured. However, somatic copy number alterations (CNAs) are widespread even in normal human tissue samples, and normal human cells also continue to accumulate mutations, so a single human tissue is often composed of many cell clones carrying different mutations. In cancer research, the genomic heterogeneity between cancer cell subclones within the same tumor sample is even more pronounced.

2. Cells are scarce. In many cases it is difficult to obtain the millions of cells needed to extract large amounts (microgram quantities) of genomic DNA. For example, in early embryonic development research, forensic testing, and especially cancer genome research (circulating tumor cells, tumor biopsy samples, tumor cells in cerebrospinal fluid, tumor cells in ascites, and so on), the number of cells available is often very small, and these cells are difficult to culture and expand in vitro. Even when they can occasionally be cultured and expanded, there is no guarantee that their genomes will not acquire new genetic variation during in vitro culture.

Single-cell genome sequencing based on second-generation sequencing (NGS) platforms is widely used to assemble simple, small genomes such as those of microorganisms. Many bacteria cannot be cultured in the laboratory, and single-cell genome sequencing can be combined with metagenomics to complete their genome assemblies. The human genome, however, far exceeds those of bacteria and other microorganisms in structure, size, and complexity. Even bulk sequencing data from many cells on second-generation platforms cannot yield a high-quality human reference sequence (the NG50 rarely reaches the Mb, or million base pair, level), so assembling the human genome from a small amount of DNA, or even from single-cell genome sequencing data, is more challenging still. It requires not only single-cell long-read genome sequencing on third-generation platforms, but also appropriate assembly software and sound bioinformatics analysis strategies.

On July 12, 2022, Tang Fuchou's research group at the Biomedical Pioneering Innovation Center (BIOPIC) of Peking University published a research paper entitled "De novo assembly of human genome at single-cell levels" in Nucleic Acids Research. Using an optimized version of SMOOTH-seq, a single-cell genome third-generation sequencing technology, on two third-generation sequencing platforms, Pacific Biosciences (PacBio) HiFi and Oxford Nanopore Technologies (ONT), the study completed Mb-level contiguous human genome assembly at the single-cell level for the first time, and used a variety of evaluation metrics to explore how different sequencing strategies and assembly tools affect the assembly results.

1. SMOOTH-seq, the single-cell genome third-generation sequencing technology, was comprehensively optimized so that it now works on both mainstream single-molecule sequencing platforms, PacBio and ONT. The earlier version of SMOOTH-seq was applicable only to the PacBio platform, which greatly limited its use. The optimized version can be used on either platform, which makes it more flexible and allows sequencing accuracy to be balanced against sequencing cost.

2. Using mainstream assembly tools such as hifiasm, HiCanu and wtdbg2, together with third-generation genome sequencing data from 95 single cells (PacBio HiFi platform), the study produced a high-quality genome assembly of the human chronic myeloid leukemia (CML) cell line K562. The NG50 of the assembled primary contigs reached 2.11 Mb. NG50 is the length of the shortest contig among the longest contigs that together cover 50% of the known genome size; in other words, at least half of the human genome (about 1.5 billion base pairs) is covered by contigs of at least 2.11 Mb in this assembly. The longest contig reached 14.12 Mb, the proportion of complete Benchmarking Universal Single-Copy Orthologs (complete BUSCOs) was close to 95%, and most of the major histocompatibility complex (MHC) locus, a representative complex region of the genome about 6 Mb long, was successfully assembled (as shown in Figure 1).

Figure 1 Genome assembly results of 95 K562 cells (PacBio HiFi)
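The NG50 metric described above can be computed directly from a list of contig lengths. The following is a minimal Python sketch, not the evaluation code used in the study; the function names, the toy contig lengths, and the toy genome size are all assumptions made up for illustration. N50 is included only to show how it differs: it uses the total assembly length instead of the known genome size.

```python
def _length_at_half(contig_lengths, total):
    """Walk contigs from longest to shortest and return the length of the
    contig at which the running sum first reaches half of `total`.
    Returns 0 if the contigs never add up to half of `total`."""
    target = total / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def ng50(contig_lengths, genome_size):
    """NG50: the halfway point is measured against the known genome size."""
    return _length_at_half(contig_lengths, genome_size)


def n50(contig_lengths):
    """N50: the halfway point is measured against the assembly's own length."""
    return _length_at_half(contig_lengths, sum(contig_lengths))


# Toy data (hypothetical, not from the paper): five contigs, genome size 200.
toy_contigs = [40, 30, 20, 10, 5]
print("NG50:", ng50(toy_contigs, 200))  # running sum reaches 100 at the 10-long contig
print("N50:", n50(toy_contigs))         # running sum reaches 52.5 at the 30-long contig
```

Because NG50 is measured against the genome size and not the assembly size, an incomplete assembly is penalized: in the toy data the assembly covers only about half of the genome, so NG50 (10) is much lower than N50 (30).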

3. Using mainstream assembly tools such as hifiasm, HiCanu and wtdbg2, together with third-generation genome sequencing data from 157 single cells of the normal human diploid cell line HG002 (PacBio HiFi platform), the study assembled the human genome with high quality. The NG50 of the assembled primary contigs reached 0.65 Mb, the longest contig reached 6.82 Mb, and the proportion of complete BUSCOs was close to 91%. During haplotype assembly of HG002 from these data, the study found that the k-mer distribution of exponentially amplified genome data is shifted, so haplotype assembly is more accurate in trio-binning mode, which uses parental second-generation sequencing data as an aid. The study therefore used two assembly tools, trio-hifiasm and trio-HiCanu, for haplotype assembly. The NG50 of the parental haplotype contigs reached about 0.3 Mb, and the proportion of complete BUSCOs exceeded 84%. A comparison of the assembled typing results at six classical human leukocyte antigen (HLA) loci with those of the HG002 parents showed that trio-HiCanu correctly assembled most of the loci from both parental haplotypes in the HLA region (as shown in Figure 2).

Figure 2 Genome assembly results of 157 HG002 cells (PacBio HiFi)
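The general idea behind trio binning, as mentioned above, is to use k-mers that occur in only one parent's sequencing data to decide which parental haplotype a read from the child comes from. The Python sketch below is a deliberately simplified illustration of that idea and is not how hifiasm or HiCanu implement their trio modes; the function names, the tiny sequences, and the choice of k = 4 are all assumptions for demonstration only.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_seq, maternal_seq, k):
    """Return (paternal-only, maternal-only) k-mer sets."""
    pat = kmers(paternal_seq, k)
    mat = kmers(maternal_seq, k)
    return pat - mat, mat - pat


def bin_read(read, paternal_only, maternal_only, k):
    """Assign a read to the parent whose private k-mers it matches more often.
    Ties (including no matches at all) are left unassigned."""
    read_kmers = kmers(read, k)
    pat_hits = len(read_kmers & paternal_only)
    mat_hits = len(read_kmers & maternal_only)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    return "unassigned"


# Toy sequences (hypothetical): the parents differ only at the 3' end.
K = 4
pat_only, mat_only = parent_specific_kmers("ACGTACGTTT", "ACGTACGAAA", K)

for read in ["TACGTTT", "TACGAAA", "ACGTAC"]:
    print(read, "->", bin_read(read, pat_only, mat_only, K))
```

A read that falls entirely in sequence shared by both parents carries no parent-specific k-mers and stays unassigned, which is why the parental k-mer sets need to be reliable for this kind of binning to work.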

4. Using mainstream assembly tools such as Flye, NECAT and wtdbg2, together with third-generation genome sequencing data from 192 single cells of the normal human diploid cell line HG002 (ONT platform, low sequencing depth), the study assembled the human genome with high quality. The choice of assembly tool had a large effect on the final result, and Flye proved better suited to single-cell ONT data: the NG50 of its assembled contigs reached 1.38 Mb, the longest contig reached 11.42 Mb, and the proportion of complete BUSCOs exceeded 93%, with many metrics far higher than those of the other two tools. The assembly also filled 39 gap regions that are unassembled in the hg38 version of the human reference genome, 14 of which are annotated in hg38 as longer than 50 kb (as shown in Figure 3).

Figure 3 Genome assembly results of 192 HG002 cells and 30 HG002 cells (ONT)

5. Using assembly tools such as Flye and wtdbg2, together with third-generation genome sequencing data from 30 single cells of the normal human diploid cell line HG002 (ONT platform, high sequencing depth), the study assembled the human genome with high quality. To explore how few single cells are needed for human genome assembly, the study attempted assemblies from 1, 10, 20 and 30 single cells, and found that as few as 30 single cells sequenced at high depth (average genome coverage ~41.7%) were enough to produce an assembly with a contig NG50 of up to 1.34 Mb. The assembly also filled 38 gap regions that are unassembled in the hg38 version of the human reference genome, 15 of which are annotated in hg38 as longer than 50 kb (as shown in Figure 4).

Figure 4 Genome assembly results of 30 HG002 cells with high genome coverage (ONT)

6. Through de novo assembly of the K562 genome, the study identified more genomic insertion events and complex structural variation events, and did so more accurately, than was possible with the raw single-cell third-generation sequencing data. For leukemia cell lines such as K562, whether structural variation (SV) events can be identified better after de novo assembly is an important question in cancer research. The study used the primary contigs and alternate contigs assembled by hifiasm and HiCanu to identify structural variants, and found that the assembled contigs identified insertion events more accurately than the raw single-cell data, with a recall above 70% and a precision above 90%. In addition, three classical fusion gene pairs in K562, CDC25A-GRID1, BCR-ABL1 and NUP214-XKR3, were all accurately identified, whereas the CDC25A-GRID1 fusion could not be found when the raw single-cell data were aligned directly to the reference genome (as shown in Figure 5). To further verify the structural variation events found after de novo assembly, the study selected 20 events (14 insertions and 6 deletions) that were identified in the assembled contigs but not when the raw single-cell sequencing data were aligned directly to the reference genome. The validation accuracy was as high as 80%, showing that structural variation calls from the assembled contigs are accurate and reliable (as shown in Figure 6).

Figure 5 Accuracy of structural variation event detection in post-assembly contigs

Figure 6 PCR verification of genomic structural variation events
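The recall and accuracy (precision) figures quoted for insertion calls follow the standard definitions: recall is the share of true events that were called, and precision is the share of calls that are true events. The Python sketch below shows the arithmetic; the function name and the counts are hypothetical and are not taken from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a set of variant calls.

    precision = TP / (TP + FP): fraction of calls that are real events
    recall    = TP / (TP + FN): fraction of real events that were called
    A ratio with an empty denominator is reported as 0.0.
    """
    called = true_positives + false_positives
    real = true_positives + false_negatives
    precision = true_positives / called if called else 0.0
    recall = true_positives / real if real else 0.0
    return precision, recall


# Hypothetical counts for illustration only: 100 real insertions in the truth
# set, 80 insertions called from the contigs, 72 of which are real.
precision, recall = precision_recall(true_positives=72,
                                     false_positives=8,
                                     false_negatives=28)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

With these made-up counts the call set would have a precision of 0.90 and a recall of 0.72, which is the kind of trade-off the figures above describe: most calls are correct, while some real events are still missed.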

In summary, to address the problems of genetic heterogeneity and cell scarcity encountered in practical de novo genome assembly, this study used the optimized SMOOTH-seq technology with two sequencing strategies (many cells at low sequencing depth, and few cells at high sequencing depth) on two mainstream third-generation sequencing platforms. Using a variety of assembly software (hifiasm, HiCanu, wtdbg2, Flye, NECAT, etc.), multiple evaluation metrics, and different assembly strategies, it explored the feasibility of assembling the human genome de novo from single-cell sequencing data and determined the main factors affecting the assembly results, bringing the resolution of genome assembly down to the single-cell level (as few as 30 single cells). With further development of single-cell sequencing technology and genome assembly strategies, the dream of assembling a human reference genome with Mb-level contiguity from the sequencing data of just one single cell will eventually be realized.

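Xie Haoling, a doctoral student at the School of Life Sciences, Peking University, and Li Wen, a doctoral student at the Academy for Advanced Interdisciplinary Studies, Peking University, are co-first authors of the paper. Professor Tang Fuchou of the Biomedical Pioneering Innovation Center, Peking University, is the corresponding author. The research was supported by the Peking-Tsinghua Center for Life Sciences, the National Natural Science Foundation of China, the Beijing Municipal Science and Technology Commission, and the Beijing Advanced Innovation Center for Genomics.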

Paper link:

https://doi.org/10.1093/nar/gkac586

About researcher Tang Fuchou:

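Tang Fuchou, Ph.D., is a researcher at BIOPIC/ICG, Peking University, and a recipient of the National Science Fund for Excellent Young Scholars (2013) and the National Science Fund for Distinguished Young Scholars (2016). He graduated from Peking University with a bachelor's degree in 1998 and received his Ph.D. in cell biology from Peking University in 2003. From 2004 to 2010 he carried out postdoctoral research at the Gurdon Institute, University of Cambridge, UK. In 2010 he returned to Peking University to set up his laboratory, which focuses on single-cell functional genomics of early human embryonic development. He took the lead internationally in systematically developing a single-cell functional genomics research system, and has used this series of technologies to study early human embryonic development in depth and systematically. This work revealed the heterogeneity of DNA demethylation in early human embryos and other key epigenetic features, uncovered important epigenetic mechanisms regulating gene expression networks in early human embryos, provided a research framework for comprehensively analyzing the epigenetic regulatory network of early human embryos, and deepened the understanding of human primordial germ cell development and epigenetic reprogramming.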