chhylp123/hifiasm
Fork: 87 Star: 547 (updated 2024-11-17 10:54:37)
license: MIT
Language: C++
Hifiasm: a haplotype-resolved assembler for accurate HiFi reads
Latest release: 0.20.0 (2024-10-17 13:34:07)
Getting Started
# Install hifiasm (requiring g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make
# Run on test data (use -f0 for small datasets)
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz
./hifiasm -o test -t4 -f0 chr11-2M.fa.gz 2> test.log
awk '/^S/{print ">"$2;print $3}' test.bp.p_ctg.gfa > test.p_ctg.fa # get primary contigs in FASTA
# Assemble inbred/homozygous genomes (-l0 disables duplication purging)
hifiasm -o CHM13.asm -t32 -l0 CHM13-HiFi.fa.gz 2> CHM13.asm.log
# Assemble heterozygous genomes with built-in duplication purging
hifiasm -o HG002.asm -t32 HG002-file1.fq.gz HG002-file2.fq.gz
# Hi-C phasing with paired-end short reads in two FASTQ files
hifiasm -o HG002.asm --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz
# Trio binning assembly (requiring https://github.com/lh3/yak)
yak count -b37 -t16 -o pat.yak <(cat pat_1.fq.gz pat_2.fq.gz) <(cat pat_1.fq.gz pat_2.fq.gz)
yak count -b37 -t16 -o mat.yak <(cat mat_1.fq.gz mat_2.fq.gz) <(cat mat_1.fq.gz mat_2.fq.gz)
hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz
# Improve contiguity for diploid genome assembly by self-scaffolding (`--dual-scaf`)
hifiasm -o HG002.asm --dual-scaf --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz
# Preserve more telomeres for human genomes (`--telo-m CCCTAA`)
hifiasm -o HG002.asm --telo-m CCCTAA --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz
# Hybrid assembly with HiFi, ultralong and Hi-C reads
hifiasm -o HG002.asm --h1 read1.fq.gz --h2 read2.fq.gz --ul ul.fq.gz HG002-HiFi.fq.gz
# Single-sample telomere-to-telomere assembly for diploid human genomes
hifiasm -o HG002.asm --dual-scaf --telo-m CCCTAA --h1 read1.fq.gz --h2 read2.fq.gz --ul ul.fq.gz HG002-HiFi.fq.gz
See tutorial for more details.
Introduction
Hifiasm is a fast haplotype-resolved de novo assembler initially designed for PacBio HiFi reads. Its latest release supports telomere-to-telomere assembly by utilizing ultralong Oxford Nanopore reads. Hifiasm produces arguably the best single-sample telomere-to-telomere assemblies combining HiFi, ultralong and Hi-C reads, and it is one of the best haplotype-resolved assemblers for trio-binning assembly given parental short reads. For a human genome, hifiasm can produce a telomere-to-telomere assembly in one day.
Why Hifiasm?
- Hifiasm delivers high-quality telomere-to-telomere assemblies. It tends to generate longer contigs and resolve more segmental duplications than other assemblers.
- Given Hi-C reads or short reads from the parents, hifiasm can produce overall the best haplotype-resolved assembly so far. It is the assembler chosen by the Human Pangenome Project for its first batch of samples.
- Hifiasm can purge duplications between haplotigs without relying on third-party tools such as purge_dups. Hifiasm does not need polishing tools like pilon or racon, either. This simplifies the assembly pipeline and saves running time.
- Hifiasm is fast. It can assemble a human genome in half a day and assemble a ~30Gb redwood genome in three days. No genome is too large for hifiasm.
- Hifiasm is trivial to install and easy to use. It does not require Python, R or C++11 compilers, and can be compiled into a single executable. The default setting works well with a variety of genomes.
Usage
Assembling HiFi reads without additional data types
A typical hifiasm command line looks like:
hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
where NA12878.fq.gz provides the input reads, -t sets the number of CPUs in use and -o specifies the prefix of output files. For this example, the primary contigs are written to NA12878.asm.bp.p_ctg.gfa.
Since v0.15, hifiasm also produces two sets of partially phased contigs at NA12878.asm.bp.hap?.p_ctg.gfa. This pair of files can be thought to represent the two haplotypes in a diploid genome, though with occasional switch errors. The frequency of switches is determined by the heterozygosity of the input sample.
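The GFA files named above store contig sequences on segment (S) lines, which is what the awk one-liner in Getting Started extracts. As a minimal sketch, the same conversion can be written in Python. The function names `gfa_to_fasta` and `write_fasta` and the 60-column line width are illustrative choices, not part of hifiasm.

```python
import io


def gfa_to_fasta(gfa_lines):
    """Yield (name, sequence) for every segment (S) line in a GFA stream.

    Assumes tab-separated S-lines of the form: S <name> <sequence> [tags...]
    """
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip headers, links and any other record types
        fields = line.rstrip("\n").split("\t")
        yield fields[1], fields[2]


def write_fasta(records, out, width=60):
    """Write (name, sequence) pairs as FASTA, wrapping sequence lines."""
    for name, seq in records:
        out.write(f">{name}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")
```

For a real run, open the GFA file and an output file and call `write_fasta(gfa_to_fasta(gfa_handle), fasta_handle)`. The same helper applies to the primary contigs and to each of the two hap files.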
At the first run, hifiasm saves corrected reads and overlaps to disk as NA12878.asm.*.bin. It reuses the saved results to avoid the time-consuming all-vs-all overlap calculation next time. You may specify -i to ignore precomputed overlaps and redo overlapping from raw reads.
You can also dump error corrected reads in FASTA and read overlaps in PAF with
hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null
Hifiasm purges haplotig duplications by default. For inbred or homozygous genomes, you may disable purging with option -l0. Old HiFi reads may contain short adapter sequences at the ends of reads. You can specify -z20 to trim both ends of reads by 20bp. For small genomes, use -f0 to disable the initial Bloom filter, which takes 16GB of memory at the beginning. For genomes much larger than human, applying -f38 or even -f39 is preferred to save memory on k-mer counting.
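When hifiasm is driven from a script, the options described in this section can be assembled programmatically. The sketch below only builds an argument list from the flags mentioned above (-o, -t, -l0, -z, -f); the function name `build_hifiasm_cmd` and its parameter names are hypothetical, and nothing here runs hifiasm.

```python
def build_hifiasm_cmd(prefix, reads, threads=32, homozygous=False,
                      trim_bp=0, bloom_bits=None):
    """Return a hifiasm argument list for a HiFi-only assembly.

    prefix      output prefix passed to -o
    reads       list of input read files
    homozygous  add -l0 to disable duplication purging
    trim_bp     add -z<N> to trim N bp from both read ends (0 = no trimming)
    bloom_bits  add -f<N>; use 0 for small genomes, None to keep the default
    """
    cmd = ["hifiasm", "-o", prefix, f"-t{threads}"]
    if homozygous:
        cmd.append("-l0")
    if trim_bp > 0:
        cmd.append(f"-z{trim_bp}")
    if bloom_bits is not None:  # 0 is a meaningful value, so test against None
        cmd.append(f"-f{bloom_bits}")
    cmd.extend(reads)
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Keeping it as a list rather than a single string avoids shell quoting problems with unusual file names.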
Hi-C integration
Hifiasm can generate a pair of haplotype-resolved assemblies with paired-end Hi-C reads:
hifiasm -o NA12878.asm -t32 --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz
In this mode, each contig is supposed to be a haplotig, which by definition comes from one parental haplotype only. Hifiasm often puts all contigs from the same parental chromosome in one assembly. It has cleanly separated chrX and chrY for a human male dataset. Nonetheless, phasing across centromeres is challenging. Hifiasm is often able to phase entire chromosomes but it may fail in rare cases. Also, contigs from different parental chromosomes are randomly mixed as it is just not possible to phase across chromosomes with Hi-C.
Hifiasm does not perform scaffolding for now. You need to run a standalone scaffolder such as SALSA or 3D-DNA to scaffold phased haplotigs.
Trio binning
When parental short reads are available, hifiasm can also generate a pair of haplotype-resolved assemblies with trio binning. To perform such an assembly, you need to count k-mers with yak first and then run the assembly:
yak count -k31 -b37 -t16 -o pat.yak paternal.fq.gz
yak count -k31 -b37 -t16 -o mat.yak maternal.fq.gz
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878.fq.gz
Here NA12878.asm.dip.hap1.p_ctg.gfa and NA12878.asm.dip.hap2.p_ctg.gfa give the two haplotype assemblies. In the binning mode, hifiasm does not purge haplotig duplicates by default. Because hifiasm reuses saved overlaps, you can generate both primary/alternate assemblies and trio binning assemblies with
hifiasm -o NA12878.asm -t 32 NA12878.fq.gz 2> NA12878.asm.pri.log
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak /dev/null 2> NA12878.asm.trio.log
The second command line will run much faster than the first.
Ultra-long ONT integration
Hifiasm can integrate ultra-long ONT reads to produce telomere-to-telomere assemblies:
hifiasm -o NA12878.asm -t32 --ul ul.fq.gz HiFi-reads.fq.gz
For the single-sample telomere-to-telomere assembly with Hi-C reads:
hifiasm -o NA12878.asm -t32 --ul ul.fq.gz --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz
For the trio-binning telomere-to-telomere assembly:
hifiasm -o NA12878.asm -t32 --ul ul.fq.gz -1 pat.yak -2 mat.yak HiFi-reads.fq.gz
Self-scaffolding
For diploid haplotype-resolved genome assembly, hifiasm can further enhance assembly contiguity by introducing scaffolding. It leverages the assemblies of the two haplotypes to scaffold each other. Specifically, if there is a gap within the haplotype 1 assembly, hifiasm will use the corresponding homologous region in haplotype 2 to scaffold haplotype 1. Below is an example using the --dual-scaf option.
hifiasm -o NA12878.asm -t32 --dual-scaf HiFi-reads.fq.gz
Preserve more telomeres for T2T assemblies
Hifiasm can preserve more telomeres by specifying the telomere motif using the --telo-m option. Below is an example applied to human genome assembly.
hifiasm -o NA12878.asm -t32 --telo-m CCCTAA HiFi-reads.fq.gz
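To check afterwards whether a contig carries the motif at its ends, one rough approach is to count exact tandem copies of the motif at the 5' end and of its reverse complement at the 3' end. This is only an illustrative sketch and not how hifiasm detects telomeres internally; the function names are hypothetical, and exact in-phase matching will miss telomeres that start mid-repeat or contain sequencing errors.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def leading_repeats(seq, motif):
    """Count back-to-back exact copies of motif at the very start of seq."""
    if not motif:
        raise ValueError("motif must be non-empty")
    copies, k = 0, len(motif)
    while seq.startswith(motif, copies * k):
        copies += 1
    return copies


def telomere_repeats(seq, motif="CCCTAA"):
    """Return (copies of motif at the 5' end, copies of its reverse
    complement at the 3' end) for one contig sequence."""
    at_start = leading_repeats(seq, motif)
    # A 3' end made of TTAGGG repeats reads as CCCTAA repeats at the start
    # of the reverse-complemented contig.
    at_end = leading_repeats(revcomp(seq), motif)
    return at_start, at_end
```

A contig with a high count at both ends is a candidate for being complete from telomere to telomere under this simple criterion.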
Output files
Hifiasm generates different types of assemblies based on the input data. It also writes error corrected reads to the prefix.ec.bin binary file and writes overlaps to prefix.ovlp.source.bin and prefix.ovlp.reverse.bin. For more details, please see the complete documentation.
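Because a rerun with the same prefix reuses these binary files, it can be useful to check which of them are present before launching a job. The sketch below looks only for the three file names listed above; the helper names are hypothetical, and whether hifiasm itself needs all three to skip overlapping is an assumption made here for illustration.

```python
import os

# Suffixes of the cached binary files named in this section.
BIN_SUFFIXES = (".ec.bin", ".ovlp.source.bin", ".ovlp.reverse.bin")


def cached_bin_files(prefix):
    """Return the cached .bin files that already exist for this output prefix."""
    return [prefix + s for s in BIN_SUFFIXES if os.path.exists(prefix + s)]


def has_full_cache(prefix):
    """True if all three cached files are present (assumed to be sufficient)."""
    return len(cached_bin_files(prefix)) == len(BIN_SUFFIXES)
```

For example, `has_full_cache("NA12878.asm")` returning True suggests that the next run with `-o NA12878.asm` can reuse the saved results rather than recompute overlaps.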
Results
The following table shows the statistics of several hifiasm primary assemblies assembled with v0.12:
Dataset | Size | Cov. | Asm options | CPU time | Wall time | RAM | N50 |
---|---|---|---|---|---|---|---|
Mouse (C57/BL6J) | 2.6Gb | ×25 | -t48 -l0 | 172.9h | 4.8h | 76G | 21.1Mb |
Maize (B73) | 2.2Gb | ×22 | -t48 -l0 | 203.2h | 5.1h | 68G | 36.7Mb |
Strawberry | 0.8Gb | ×36 | -t48 -D10 | 152.7h | 3.7h | 91G | 17.8Mb |
Frog | 9.5Gb | ×29 | -t48 | 2834.3h | 69.0h | 463G | 9.3Mb |
Redwood | 35.6Gb | ×28 | -t80 | 3890.3h | 65.5h | 699G | 5.4Mb |
Human (CHM13) | 3.1Gb | ×32 | -t48 -l0 | 310.7h | 8.2h | 114G | 88.9Mb |
Human (HG00733) | 3.1Gb | ×33 | -t48 | 269.1h | 6.9h | 135G | 69.9Mb |
Human (HG002) | 3.1Gb | ×36 | -t48 | 305.4h | 7.7h | 137G | 98.7Mb |
Hifiasm can assemble a 3.1Gb human genome in several hours or a ~30Gb hexaploid redwood genome in a few days on a single machine. For trio binning assembly:
Dataset | Cov. | CPU time | Elapsed time | RAM | N50 |
---|---|---|---|---|---|
HG00733, [father], [mother] | ×33 | 269.1h | 6.9h | 135G | 35.1Mb (paternal), 34.9Mb (maternal) |
HG002, [father], [mother] | ×36 | 305.4h | 7.7h | 137G | 41.0Mb (paternal), 40.8Mb (maternal) |
Human assemblies above can be acquired from Zenodo and non-human ones are available here.
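The N50 column in the tables above follows the usual definition: the contig length at which contigs of that length or longer cover at least half of the total assembly size. A minimal sketch of the calculation, with a hypothetical function name:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

The contig lengths can be taken from the sequences on the S-lines of a p_ctg.gfa file.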
Getting Help
For a detailed description of options, please see the tutorial or man ./hifiasm.1. The -h option of hifiasm also provides a brief description of options. If you have further questions, please raise an issue at the issue page.
Limitations
- Purging haplotig duplications may introduce misassemblies.
Citing Hifiasm
If you use hifiasm in your work, please cite:
Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li, H. (2021) Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods, 18:170–175. https://doi.org/10.1038/s41592-020-01056-5
Cheng, H., Jarvis, E.D., Fedrigo, O., Koepfli, K.P., Urban, L., Gemmell, N.J., Li, H. (2022) Haplotype-resolved assembly of diploid genomes without parental data. Nat Biotechnol, 40:1332–1335. https://doi.org/10.1038/s41587-022-01261-x
Cheng, H., Asri, M., Lucas, J., Koren, S., Li, H. (2024) Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. Nat Methods, 21:967–970. https://doi.org/10.1038/s41592-024-02269-8
Recent releases (data updated 2024-11-21 08:14:10):
2024-10-17 13:34:07 0.20.0
2024-05-06 22:29:45 0.19.9
2023-11-06 13:13:05 0.19.8
2023-10-11 00:44:09 0.19.7
2023-08-18 08:06:07 0.19.6
2023-04-18 08:16:36 0.19.5
2023-04-08 07:58:37 0.19.4
2023-03-23 00:30:26 0.19.3
2023-03-13 23:26:31 0.19.1
2023-03-11 08:56:32 0.19.0
Topics:
bioinformatics, denovo-assembly, genomics, hifi-read, pacbio